Company data using Wikidata
Min
Posted on January 11, 2020
- Company data is useful.
- Company information is hard to find in bulk.
- Wikidata can be a useful starting point.
- Wikidata can be great for collecting other information too.
- Let's contribute to open data, not just open-source code.
Company data is really useful
Whether you are doing competitor analysis, econometric analysis, or policy analysis, it would be great to have information about companies. Working in government, we often ask how a new public policy might affect small businesses. Answering that requires a list of businesses and a way to tell whether each one is large or small.
Collecting information about companies is hard
You may think that there are government bodies that collect this information. After all, corporate taxes are a big part of how governments source their funds. However, just because one part of the government holds a certain set of data, it does not mean that other parts of the government can access it. Many of these restrictions are necessary for privacy - and I am not here to debate data sharing and governance policies - but we can agree that having access to such data can benefit researchers.
Some options for company data
Buy the data
Precisely because company data is hard to get, there are organisations that sell it. Universities and government agencies purchase this data because they have few alternatives.
Get data provided by some governments
Some governments are pretty good at organising and releasing data. Australia, for example, has data.gov.au, which hosts information such as a list of businesses and their ABNs (Australian Business Numbers) so that you can identify them. Also on data.gov.au is a dataset of "large" companies released for tax transparency purposes.
A quick side note: definitions of large companies can vary. You could use the annual turnover (revenue) and/or the number of employees as measures of size. It would depend on what question you are trying to answer.
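To make the side note concrete, here is a minimal Python sketch of a size check that can use either measure or both. The thresholds are simply the cut-offs used in the queries later in this post (200 employees, 100 million USD revenue), not an official definition, and the function name `is_large` is my own.

```python
from typing import Optional

# Illustrative thresholds only: they match the filters used in the
# SPARQL queries in this post, not any official definition of "large".
MIN_EMPLOYEES = 200
MIN_REVENUE_USD = 100_000_000


def is_large(employees: Optional[int] = None,
             revenue_usd: Optional[float] = None,
             require_both: bool = False) -> bool:
    """Classify a company as large by head count and/or revenue.

    Missing values (None) never count as meeting a threshold.
    """
    by_staff = employees is not None and employees >= MIN_EMPLOYEES
    by_revenue = revenue_usd is not None and revenue_usd > MIN_REVENUE_USD
    if require_both:
        return by_staff and by_revenue
    return by_staff or by_revenue
```

Whether to require one measure or both depends on the question being asked, which is why it is left as a parameter.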
Try to get publicly available data
You could try some form of web scraping. Some very nice people at www.peopledatalabs.com have released what they have collected (I suspect from LinkedIn, by scraping or by using the API). This dataset would be limited to those with a LinkedIn presence and has no information about revenue.
Companies publish annual financial reports, and searching for these online could be one way of gathering information. I guess that is what some of the data vendors do. This, as you can imagine, is a tedious and time-consuming task.
Perhaps there should be an open-source community whose mission is to make this data available and accessible. Such an initiative does in fact exist: OpenCorporates. But to me, this data was exposed more as a search tool (as opposed to downloadable bulk data), and there seems to be a lot of inconsistent duplication of data. At least for what I was trying to do, I could not work out how to even select the "closest" match from the search results. I'm sure that there are many use cases for this tool, but it did not suit mine. Also, this data isn't exactly "free".
Wikidata
This is where Wikidata comes in. "Wikidata?" you may ask. Here is what the website says:
Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
There is also an enthusiastic talk about it on YouTube.
It is really amazing. We should really consider supporting the Wikimedia Foundation.
So I have been playing around with SPARQL (it feels like SQL, but with its own dialect).
Specifically, for finding "large companies", I have written the following queries, which you can run yourself at the Wikidata Query Service. Note that I have tried to gather ISNI and GRID ids as identifiers that can be used to join with other sources of information if required.
For largest companies by (latest available information about) number of employees:
SELECT DISTINCT ?business ?isni ?grid_id ?businessLabel ?officialname ?shortname ?countryLabel ?employees
WHERE {
  # Instances of business (Q4830453) or of any of its subclasses
  ?business wdt:P31/wdt:P279* wd:Q4830453 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
  ?business wdt:P17 ?country .
  # Identifiers and names are optional: not every entry has them
  OPTIONAL { ?business wdt:P213 ?isni } .
  OPTIONAL { ?business wdt:P2427 ?grid_id } .
  OPTIONAL { ?business wdt:P1448 ?officialname FILTER( LANG(?officialname) = "en" ) } .
  OPTIONAL { ?business wdt:P1813 ?shortname FILTER( LANG(?shortname) = "en" ) } .
  # Number of employees (P1128), keeping only those with at least 200
  ?business wdt:P1128 ?employees .
  FILTER( ?employees >= 200 )
}
ORDER BY DESC(?employees)
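The query above can also be sent to the Wikidata Query Service from a script instead of the web page. Below is a minimal sketch in Python using only the standard library. It assumes the public endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout. The function names and the User-Agent string are placeholders of my own, so put your own contact details there.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def flatten_bindings(payload):
    """Turn a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = payload["head"]["vars"]
    rows = []
    for binding in payload["results"]["bindings"]:
        # Variables from unmatched OPTIONAL clauses are absent from the
        # binding, so they are filled in as None here.
        rows.append({
            name: binding[name]["value"] if name in binding else None
            for name in variables
        })
    return rows


def run_query(sparql, user_agent="company-data-example/0.1 (placeholder contact)"):
    """Send a query to the endpoint and return flattened rows.

    Needs network access; large queries may hit the service's time limit.
    """
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return flatten_bindings(json.load(response))
```

Calling `run_query(...)` with the text of the query returns one dictionary per result row. All values come back as strings, so numbers such as `employees` need converting before you sort or filter on them.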
For largest companies by (latest available information about) revenue (converted to USD):
SELECT DISTINCT ?business ?isni ?grid_id ?businessLabel ?officialname ?shortname ?countryLabel ?revenue_usd
WHERE
{
  {
    # First subquery: the most recent revenue statement date per business
    SELECT DISTINCT ?business ?isni ?grid_id ?businessLabel ?officialname ?shortname ?countryLabel (MAX(?date) AS ?max_date)
    WHERE {
      ?business wdt:P31/wdt:P279* wd:Q4830453 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
      OPTIONAL { ?business wdt:P213 ?isni } .
      OPTIONAL { ?business wdt:P2427 ?grid_id } .
      OPTIONAL { ?business wdt:P1448 ?officialname FILTER( LANG(?officialname) = "en" ) } .
      OPTIONAL { ?business wdt:P1813 ?shortname FILTER( LANG(?shortname) = "en" ) } .
      ?business wdt:P17 ?country .
      ?business p:P2139 ?statement .
      OPTIONAL { ?statement pq:P585 ?date } .
    }
    GROUP BY ?business ?isni ?grid_id ?businessLabel ?officialname ?shortname ?countryLabel
  } OPTIONAL {
    # Second subquery: revenue per business and date, expressed in USD
    SELECT DISTINCT ?business ?date (MAX(?revenue_usd_recorded) AS ?revenue_usd)
    WHERE {
      ?business wdt:P31/wdt:P279* wd:Q4830453 .
      ?business p:P2139 ?statement .
      OPTIONAL { ?statement pq:P585 ?date } .
      {
        # Revenue already recorded in US dollars (Q4917)
        ?statement psv:P2139 [
          wikibase:quantityAmount ?revenue; wikibase:quantityUnit wd:Q4917
        ] .
        BIND( wd:Q4917 AS ?unit ) .
        BIND( ?revenue AS ?revenue_usd_recorded ) .
        FILTER( ?revenue_usd_recorded > 100000000 )
      } UNION {
        # Revenue in another currency: convert using that currency's price (P2284) in USD
        ?statement psv:P2139 [
          wikibase:quantityAmount ?revenue; wikibase:quantityUnit ?unit
        ] .
        FILTER( ?unit != wd:Q4917 ) .
        ?unit p:P2284 ?unit_statement .
        ?unit_statement
          psv:P2284 [ wikibase:quantityUnit wd:Q4917; wikibase:quantityAmount ?usd ] .
        BIND( ?revenue * ?usd AS ?revenue_usd_recorded ) .
        FILTER( ?revenue_usd_recorded > 100000000 )
      }
    }
    GROUP BY ?business ?date
  } .
  # Keep only the revenue figure from the most recent date
  FILTER( ?date = ?max_date )
}
ORDER BY DESC(?revenue_usd)
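The revenue query does three things: it converts each revenue statement to USD, drops anything at or below 100 million USD, and keeps only the figure from the most recent statement date. Here is a simplified Python sketch of that same logic, in case it is easier to follow than the SPARQL. The input shapes are my own assumptions: statements are `(business, date, amount, unit)` tuples with ISO date strings, units are short placeholder codes rather than Wikidata items, and `usd_price` maps a unit to the price of one unit in USD.

```python
USD = "USD"  # placeholder for the US dollar unit (wd:Q4917 in the query)
MIN_REVENUE_USD = 100_000_000  # same threshold as the FILTER in the query


def latest_revenue_usd(statements, usd_price):
    """Return {business: (date, revenue_usd)} for the latest dated statement.

    A business is only returned if its most recent dated statement can be
    converted to USD and exceeds the threshold, as in the query.
    """
    latest = {}     # business -> most recent statement date (first subquery)
    converted = {}  # (business, date) -> highest USD value (second subquery)
    for business, date, amount, unit in statements:
        if date is None:
            continue  # undated statements cannot pass the final date filter
        if business not in latest or date > latest[business]:
            latest[business] = date
        rate = 1.0 if unit == USD else usd_price.get(unit)
        if rate is None:
            continue  # no USD price known for this unit
        value = amount * rate
        if value > MIN_REVENUE_USD:
            key = (business, date)
            converted[key] = max(value, converted.get(key, value))
    # Equivalent of FILTER( ?date = ?max_date )
    return {
        business: (date, converted[(business, date)])
        for business, date in latest.items()
        if (business, date) in converted
    }
```

One consequence worth knowing: if a company's newest revenue statement is in a currency without a recorded USD price, the company disappears from the result even when an older statement would have qualified.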
Not all companies will have entries for employee numbers or revenue. I have also tried getting publicly traded companies (those listed on a stock exchange), on the assumption that such companies will be large. For these companies, I have also extracted their ISINs.
SELECT DISTINCT ?business ?isni ?grid_id ?businessLabel ?officialname ?shortname ?countryLabel ?stockexchangeLabel ?isin
WHERE {
  ?business wdt:P31/wdt:P279* wd:Q4830453 .
  ?business wdt:P17 ?country .
  # Only businesses listed on a stock exchange (P414)
  ?business wdt:P414 ?stockexchange .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
  OPTIONAL { ?business wdt:P213 ?isni } .
  OPTIONAL { ?business wdt:P2427 ?grid_id } .
  OPTIONAL { ?business wdt:P1448 ?officialname FILTER( LANG(?officialname) = "en" ) } .
  OPTIONAL { ?business wdt:P1813 ?shortname FILTER( LANG(?shortname) = "en" ) } .
  OPTIONAL { ?business wdt:P946 ?isin } .
}
ORDER BY ?business
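The ISNI and GRID ids collected by these queries are there so the results can be joined with other sources. A rough Python sketch of such a join on ISNI follows. It assumes both datasets are lists of dictionaries with an `isni` key, and that the only formatting difference is spacing and letter case. The function names are my own.

```python
def normalise_isni(value):
    """Remove spaces so '0000 0001 2345 6789' and '0000000123456789' match."""
    if not value:
        return None
    return value.replace(" ", "").upper()


def join_on_isni(wikidata_rows, other_rows, key="isni"):
    """Inner-join two lists of dicts on their normalised ISNI.

    Rows without an ISNI, or without a partner in the other list, are
    dropped. On a key clash the Wikidata value wins.
    """
    index = {}
    for row in other_rows:
        isni = normalise_isni(row.get(key))
        if isni:
            index[isni] = row
    joined = []
    for row in wikidata_rows:
        match = index.get(normalise_isni(row.get(key)))
        if match is not None:
            joined.append({**match, **row})
    return joined
```

The same pattern works for GRID ids or ISINs by passing a different `key` and swapping the normalisation for whatever that identifier needs.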
I hope that someone may find this information useful. But I also hope that others can improve on this, because I would really like to access reliable and maintained data about companies - one with (official) identifiers, revenue data, employee numbers, and ultimate global owners. I must admit, I am not great with SPARQL.
Also, I can't thank all the people contributing to Wikipedia enough. Again: We should really consider supporting the Wikimedia Foundation - through contributing data or donations.