Jordan Scrapes Washington’s Marijuana Producers
Jordan Hansen
Posted on October 22, 2020
Hello. The goal with this post is to find the legal names of Washington’s marijuana producers. This would be useful to anyone who wants to market to these producers. With the legal names you can confirm owners with the Washington Secretary of State.
We are using two different sites to do this. The first, 502data.com, has a list of all the producers but not their legal names. The second, TopShelfData, has the legal name of each company. Using this legal name you can easily find the business information from the Washington Secretary of State.
502data.com
After a quick inspection of 502data.com, it was clear that they were using AngularJS as their framework. Knowing this, I fully expected to see XHR requests carrying the data. But loading https://502data.com/allproducerprocessors triggered only two requests, and neither had any relevant information.
This really confused me. The data was clearly not there on page load. Look at what the page was before all of the JavaScript rendered.
My next step was to go through the JavaScript. If the data was getting pulled in via XHR, it must be referenced somewhere in the JavaScript. Looking at these script files, however, nothing jumped out at me as something that would manage the app itself.
Next stop was the root page. Going through the script tags I finally found what I was looking for at the bottom of the page. Jackpot.
See $scope.licenses? That’s what I’m looking for. It’s a huge array of all the marijuana producers in Washington. Checking the length gave me over 1500.
I’d never used cheerio to get script data before but it turned out to be fairly simple.
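import axios from 'axios';
import * as cheerio from 'cheerio';

const url = 'https://502data.com/allproducerprocessors';
const axiosResponse = await axios.get(url);
const $ = cheerio.load(axiosResponse.data);
// The data lives in an inline script tag, so read its raw contents with html().
const script = $('script:nth-of-type(7)').html();
const scriptSplit = script?.split('$scope.licenses = ');
let arrayOfbusinesses: any[] = [];
// If the marker is missing, split() returns a single element, so check the length.
if (scriptSplit && scriptSplit.length > 1) {
    // Keep everything up to the first semicolon after the assignment, then parse it.
    arrayOfbusinesses = JSON.parse(scriptSplit[1].split(';')[0]);
}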
The only difference from the typical selectors is using html() instead of text(). After that I just split the HTML until I had only the part I wanted. Then it was simply a matter of JSON.parse().
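Splitting on the first semicolon works as long as no business name contains one. As a more defensive alternative, here is a minimal sketch that walks the script text and matches brackets instead. The helper name extractArrayLiteral and the sample script string are made up for illustration, and it assumes the embedded array is valid JSON with double-quoted strings.

```typescript
// Hypothetical helper (not from the original scrape): return the JSON array
// that follows `marker` inside a script's source text.
function extractArrayLiteral(source: string, marker: string): unknown[] {
  const markerIndex = source.indexOf(marker);
  if (markerIndex === -1) {
    throw new Error(`Marker not found: ${marker}`);
  }
  const start = source.indexOf('[', markerIndex + marker.length);
  if (start === -1) {
    throw new Error('No array literal after marker');
  }

  let depth = 0;        // how many '[' are currently open
  let inString = false; // true while inside a double-quoted JSON string
  let escaped = false;  // true right after a backslash inside a string
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue; // brackets and semicolons inside strings are ignored
    }
    if (ch === '"') inString = true;
    else if (ch === '[') depth++;
    else if (ch === ']') {
      depth--;
      if (depth === 0) {
        return JSON.parse(source.slice(start, i + 1));
      }
    }
  }
  throw new Error('Unterminated array literal');
}

// Made-up sample: the name contains both a semicolon and brackets.
const sampleScript =
  '$scope.licenses = [{"name":"A; B [TEST]","city":"SEATTLE"}]; $scope.other = 1;';
const licenses = extractArrayLiteral(sampleScript, '$scope.licenses = ');
```

With the real page you would pass in the string returned by html() in place of sampleScript.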
BAM. Just like that I have my producers. Now to get their legal name.
TopShelfData
Off we go to TopShelfData. The registered name is the item for which we are looking.
The data that we have from 502data.com looks like this:
{
"licensenumber": "78256",
"name": "EVERGREEN HERBAL",
"tier": 0,
"city": "SEATTLE",
"county": "KING",
"totalSales": 26827987.182500,
"ytdSales": 2887764.770000,
"lastMonthSales": 588414.440000
}
So we need to convert the above data into the URL from the above picture. At first I thought I could just lowercase everything and replace the spaces with dashes. But that breaks as soon as more than one business has the same name. As you can see in the photo above, there is a 1 at the end of the URL.
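To make the collision problem concrete, here is a minimal sketch of that first idea. The function name naiveSlug is made up, and TopShelfData’s actual slug rules are not known here, so this only illustrates why a name on its own is not enough.

```typescript
// Naive slug builder: trim, lowercase, and turn runs of whitespace into dashes.
// This is an assumption about the general shape of the slug, not the site's rule.
function naiveSlug(businessName: string): string {
  return businessName.trim().toLowerCase().replace(/\s+/g, '-');
}

const slugA = naiveSlug('EVERGREEN HERBAL');
const slugB = naiveSlug('Evergreen  Herbal ');

// Two differently written names collapse to the same slug, so a site with
// duplicate names needs something extra (such as a numeric suffix).
const slugsCollide = slugA === slugB;
```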
So…I tried searching to see how TopShelfData narrowed it down.
Bam. We’re in business. The search returns its results over XHR. So I just submitted the business name as the query and then picked the suggestion whose city matched the producer’s city.
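export async function getSlugFromTopShelfData(businessName: string, city: string): Promise<IBusinessSearchData | undefined> {
    // Encode the name so characters such as "&" or "#" do not break the query string.
    const url = `https://www.topshelfdata.com/search?query=${encodeURIComponent(businessName)}`;
    const convertedCity = city.toLocaleLowerCase().replace(/\s/g, '-');
    const axiosResponse = await axios.get(url);
    // Fall back to an empty list so a response without suggestions does not throw.
    const suggestions: any[] = axiosResponse.data?.suggestions ?? [];
    // Pick the suggestion whose city matches the producer's city.
    const foundBusiness = suggestions.find(suggestion => suggestion?.data?.address_city?.includes(convertedCity));
    return foundBusiness?.data;
}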
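The two fragile spots in that lookup are the raw query string and the city match, and both can be checked without hitting the network. Below is a minimal sketch under some assumptions: the SuggestionData shape is inferred only from the fields used in this post (slug and address_city), the helper names are made up, and the sample suggestions are invented test data, not real responses.

```typescript
// Assumed response shape, inferred from the fields used in this post.
interface SuggestionData {
  slug: string;
  address_city?: string;
}
interface Suggestion {
  data?: SuggestionData;
}

// Build the search URL with the business name safely encoded.
function buildSearchUrl(businessName: string): string {
  return `https://www.topshelfdata.com/search?query=${encodeURIComponent(businessName)}`;
}

// Return the data of the first suggestion whose city contains the wanted city.
function pickByCity(suggestions: Suggestion[] | undefined, city: string): SuggestionData | undefined {
  const wanted = city.trim().toLowerCase().replace(/\s+/g, '-');
  const match = (suggestions ?? []).find(
    s => s.data?.address_city?.toLowerCase().includes(wanted) ?? false
  );
  return match?.data;
}

// Invented sample data: two businesses sharing a name in different cities.
const sampleSuggestions: Suggestion[] = [
  { data: { slug: 'evergreen-herbal', address_city: 'spokane' } },
  { data: { slug: 'evergreen-herbal-1', address_city: 'seattle' } },
];

const encodedUrl = buildSearchUrl('H & H FARMS');
const seattleMatch = pickByCity(sampleSuggestions, 'SEATTLE');
const noSuggestions = pickByCity(undefined, 'SEATTLE');
```

Keeping the matching in a pure function like this makes it easy to try against saved responses before running the whole list of producers.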
With this, it was simply a matter of navigating directly to the URL and getting the legal name of the business.
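export async function checkTopShelfDataDetails(businessSearchData: IBusinessSearchData): Promise<string> {
    const url = `https://www.topshelfdata.com/wa/${businessSearchData.address_city}/${businessSearchData.slug}`;
    let axiosResponse: AxiosResponse;
    try {
        axiosResponse = await axios.get(url);
    }
    catch (e: any) {
        console.log('e', e.response ? e.response.status : e.errno);
        // Throw a real Error so callers get a message and a stack trace.
        throw new Error(`Request failed for ${url}`);
    }
    const $ = cheerio.load(axiosResponse.data);
    // Read the link text inside the third div under .business-info (the registered name).
    const title = $('.business-info div:nth-of-type(3) a').text();
    console.log('title', title);
    // Return the name as well as logging it so callers can use it.
    return title;
}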
Done. Very fun scrape!
Looking for business leads?
Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome web data. Learn more at Cobalt Intelligence!
The post Jordan Scrapes Washington’s Marijuana Producers appeared first on JavaScript Web Scraping Guy.