Methodology
Organisation Data
The organisations included in the UK WAIfinder tool come from 2 main sources; organisations researching AI come from Gateway to Research
Gateway to Research
The Gateway to Research
The first step is to search for projects with topic tags that we felt were relevant to AI, e.g. “Image & Vision Computing”, “Robotics & Autonomy” and “Artificial Intelligence”. The organisations where these projects took place are collated and filtered using the following criteria:
- The organisation is in a predefined list of organisations - which is a combination of universities listed by HESA
- The organisation received any amount of funding in the last 5 years
- The organisation has at least 400 projects OR it has had a total of at least £50 million in funding
- The organisation is in the UK
- The organisation has longitude/latitude data
This leaves us with research organisations that are large, relevant, and recent.
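The five criteria above can be expressed as a single predicate. The following is a minimal sketch rather than the pipeline's actual code: the `ResearchOrg` fields, the `"UK"` country value, the use of a "most recent funding date" to represent recent funding, and the approximation of five years as 5 × 365 days are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ResearchOrg:
    # Hypothetical record layout; the real GtR fields may differ.
    name: str
    country: str
    n_projects: int
    total_funding_gbp: float
    last_funded: Optional[date]  # date of the most recent funded project
    lat: Optional[float]
    lon: Optional[float]


def keep_research_org(org, allowed_names, today):
    """Return True if the organisation passes all five filters."""
    # Assumption: "last 5 years" is approximated as 5 * 365 days.
    recently_funded = (
        org.last_funded is not None
        and (today - org.last_funded).days <= 5 * 365
    )
    # Large: at least 400 projects OR at least GBP 50 million in funding.
    large = org.n_projects >= 400 or org.total_funding_gbp >= 50_000_000
    has_coords = org.lat is not None and org.lon is not None
    return (
        org.name in allowed_names
        and recently_funded
        and large
        and org.country == "UK"
        and has_coords
    )
```

An organisation with 450 projects but only £10 million in funding would still pass, because the size criterion is an OR of the two thresholds.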
To supplement this data with URLs and organisation descriptions, we use the Bing search API.
Crunchbase
We query the Crunchbase
We first find the organisations that are tagged with topics we felt were relevant to AI (e.g. “artificial intelligence”, “augmented reality”, “autonomous vehicles”). We then find all investors of these organisations, where each investor may have funded multiple AI organisations, and each AI organisation may have been funded by multiple investors. Thus, for each investor we have:
- The number of AI organisations they have funded
- The total number of organisations they have funded
We get the longitude/latitude data (which Crunchbase doesn’t have) for these investors using the NSPL postcode lookup.
We filter this data to only include key AI investors with the following criteria:
- At least 10% of the organisations they fund are AI organisations
- They have funded at least 10 organisations
- The investor’s address is in the UK
- The “type” field for this investor is “organisation” (not “person”)
- The investor has longitude/latitude data
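As a rough illustration of the counting and filtering described above, the sketch below builds the two per-investor counts from (investor, organisation) pairs and then applies the five criteria. The function names, argument layout and the `"UK"` and `"organisation"` field values are assumptions for this example, not the actual Crunchbase schema.

```python
from collections import defaultdict


def summarise_investors(funding_pairs, ai_orgs):
    """Map each investor to (number of AI orgs funded, total orgs funded).

    funding_pairs: iterable of (investor_name, organisation_name) tuples.
    ai_orgs: set of organisation names tagged with AI-relevant topics.
    """
    funded = defaultdict(set)
    for investor, org in funding_pairs:
        funded[investor].add(org)  # a set avoids double-counting repeat rounds
    return {
        investor: (len(orgs & ai_orgs), len(orgs))
        for investor, orgs in funded.items()
    }


def is_key_ai_investor(n_ai, n_total, country, investor_type, lat, lon):
    """Apply the five criteria for a key AI investor."""
    return (
        n_total >= 10                 # checked first, so the division is safe
        and n_ai / n_total >= 0.10    # at least 10% of funded orgs are AI orgs
        and country == "UK"
        and investor_type == "organisation"
        and lat is not None
        and lon is not None
    )
```

Checking the minimum of 10 funded organisations before the ratio means an investor with no recorded investments never triggers a division by zero.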
We then create our funders dataset by using the remaining funder organisation names to query the Bing search API.
GlassAI
Our data for companies and incubators/accelerators comes from Glass AI
If a company is also an incubator/accelerator, it is tagged as such in an ‘is_incubator’ field.
We get the longitude/latitude data (which Glass AI didn’t provide us with) for these companies using the NSPL postcode lookup.
The only filtering needed for this dataset was:
- The company has longitude/latitude data
Merging datasets
The three filtered datasets are concatenated, and organisation names are then cleaned so that organisations appearing in more than one of the original datasets can be merged. For example the company CodeBase
If an organisation is duplicated, we decide which row to keep using the following order of trust (useful if there are conflicting Links or latitude/longitude values):
- Trust Glass AI first - since several sources were considered to find Links and Lat/Long,
- then trust GtR - since Lat/Long was given in this data,
- lastly trust Crunchbase
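The merge can be sketched as follows, assuming each row is a dictionary with `name` and `source` keys. The `clean_name` rules shown here (lower-casing, stripping punctuation and a trailing legal suffix) are hypothetical stand-ins, since the actual cleaning steps are not listed, and the organisation names in the usage example are invented.

```python
import re

# Lower number = more trusted, following the order given above.
SOURCE_PRIORITY = {"Glass AI": 0, "GtR": 1, "Crunchbase": 2}


def clean_name(name):
    """Hypothetical name cleaning used as the merge key."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())             # drop punctuation
    name = re.sub(r"\b(ltd|limited|plc)$", "", name.strip())   # drop legal suffix
    return " ".join(name.split())                               # collapse whitespace


def deduplicate(rows):
    """Keep one row per cleaned name, preferring the most trusted source."""
    best = {}
    for row in rows:
        key = clean_name(row["name"])
        current = best.get(key)
        if (
            current is None
            or SOURCE_PRIORITY[row["source"]] < SOURCE_PRIORITY[current["source"]]
        ):
            best[key] = row
    return list(best.values())


rows = [
    {"name": "Example Labs Ltd.", "source": "Crunchbase"},
    {"name": "example labs", "source": "Glass AI"},
]
merged = deduplicate(rows)  # one row remains, taken from Glass AI
```

This version keeps the whole row from the most trusted source; a fuller implementation might instead combine fields from several sources.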
Merged dataset outputs:
Category | Number of organisations |
---|---|
Company category | 2785 |
Funder category | 290 |
Incubator / accelerator category | 74 |
University / RTO category | 152 |
Total deduplicated | 3319 |
Adding place information
We add the ‘Place’ field to any data points that don’t have it by using the postcode or longitude/latitude data. We do this using two methods:
- Query the postcode to get the city using the pgeocode Python package
- Query the longitude/latitude coordinates to get the city/town using the geopy Python package
Then we finalise this ‘Place’ field for an organisation using the following method:
- Use the city name found from the original data or the pgeocode package if it’s in a predefined list of 4276 cities from the UK (from Nesta’s “geographic_data” SQL table)
- If this isn’t in the list, use the city from the geopy package (as long as it’s in the list). If this isn’t possible, use the other geopy outputs: check whether the town name is in the predefined list of UK cities, then the suburb name, village name, county name, and finally neighbourhood name. For example, one data point had the city given as ‘Vale of White Horse’, which wasn’t in the predefined list of cities, but the suburb field “Botley” was.
- If no place names from any data sources are found in the predefined city list, then repeat steps 1 and 2 but don’t specify the place name needs to be in the predefined list.
Some cleaning of the place name fields is also included (e.g. convert “London Borough of Camden” to “London”).
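The fallback order for the ‘Place’ field can be illustrated with a small function. This is a sketch under stated assumptions: the geopy output is taken to be an already-extracted dictionary of address fields keyed by `city`, `town`, `suburb`, `village`, `county` and `neighbourhood`, and `choose_place` and its arguments are hypothetical names. The actual calls to pgeocode and geopy are left out.

```python
# Order in which the geopy address fields are tried, as described above.
GEOPY_FIELD_ORDER = ["city", "town", "suburb", "village", "county", "neighbourhood"]


def choose_place(original_city, pgeocode_city, geopy_fields, known_cities):
    """Pick a place name, preferring names in the predefined city list."""
    primary = [original_city, pgeocode_city]
    fallback = [geopy_fields.get(field) for field in GEOPY_FIELD_ORDER]
    candidates = [c for c in primary + fallback if c]  # drop missing values

    # First pass: take the first candidate found in the predefined list.
    for candidate in candidates:
        if candidate in known_cities:
            return candidate

    # Second pass: no candidate is in the list, so accept the first one available.
    return candidates[0] if candidates else None


place = choose_place(
    None,
    None,
    {"city": "Vale of White Horse", "suburb": "Botley"},
    {"Botley", "London"},
)
# place is "Botley", matching the example given above
```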
For each unique place name we find, we add NUTS data using the nuts-finder Python package.
The 3319 unique organisations in the map are located in 422 unique places, with the most common location being London.