Wikipedia API Part 4: PyWikiapi

zmbailey

Zander Bailey

Posted on November 8, 2019


The MediaWiki API is useful and gives access to a vast collection of information. But as we have seen, it is not always straightforward, and working out the proper calls and parameters to return the desired information in the right format can get quite complex. Apparently the original author of the MediaWiki API agreed, and wrote PyWikiapi to make some operations a little more user-friendly and easier to code. Where the Wikipedia library focuses on accessing a single page and refining all the content therein, PyWikiapi makes searches involving multiple results or lists easier to process. An example of this would be searching for members of a category. As I discussed in a previous post, the standard way to loop over all members of a category involves querying the category, finding the continue code in the result object, and then querying the category again with the continue code included in the request, repeating until no continue code comes back. PyWikiapi offers a simpler way to handle this operation: a generator.
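For comparison, the manual continuation loop described above might look roughly like the sketch below. It uses only the standard library; the helper names `fetch_json` and `category_titles` and the User-Agent string are my own placeholders, not part of any package, and the `fetch` argument exists only so the loop can be tried without a network connection.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with one that identifies your script.
    req = urllib.request.Request(url, headers={"User-Agent": "continuation-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def category_titles(category, fetch=fetch_json):
    """Collect every member title, re-querying until no 'continue' block remains."""
    base = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",
    }
    request = dict(base)
    titles = []
    while True:
        data = fetch(request)
        titles.extend(member["title"] for member in data["query"]["categorymembers"])
        if "continue" not in data:
            return titles
        # Send the original parameters again, plus the continue values from the reply.
        request = {**base, **data["continue"]}


# Example call (needs network access):
# titles = category_titles("Category:1960 films")
```

Every caller has to repeat this bookkeeping, which is the part PyWikiapi takes off your hands.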

Let’s start by looking at how to set up PyWikiapi. First, you need to create a site object. A site is a reference to the specific version of Wikipedia you’re going to be querying, depending on the language you want. For our purposes, we’ll be looking at the English language version of Wikipedia, so our site will look like this:

from pywikiapi import wikipedia
# Connect to the English-language Wikipedia; the language code is a string
site = wikipedia('en')

Now that we have a site we can use it to make queries. Normally with MediaWiki we would have to make each query individually, but PyWikiapi is written to avoid all that. The site object provides three functions for this: query, query_pages, and iterate. iterate can be used to step through results manually, but query returns a generator that handles the continuation for us. Now all we have to do is write a loop like this:

for r in site.query(list='categorymembers', cmtitle='Category:1960 films'):
    for page in r.categorymembers:
        print(page.title)

This will loop over the entire list of members in the category with only three lines of code! The same pattern works for any query that returns a list of results, such as all pages whose titles start with a given prefix:

for r in site.query(list='allpages', apprefix='Batman'):
    for page in r.allpages:
        print(page.title)

The site also provides a function called query_pages, which returns information about individual pages, such as links and summaries. Again, this returns a generator object which can be iterated over to access each page in turn:

for page in site.query_pages(titles=['Nightwing', 'Batgirl', 'Catwoman'], prop=['links', 'info'], pllimit=10):
    print(page.title)
    print(', '.join([l.title for l in page.links]))
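The keyword arguments in that call line up closely with a raw API request, where multi-value parameters are written as pipe-separated strings. The sketch below shows one plausible mapping; `to_api_params` is a hypothetical helper of my own, and I am assuming (not asserting) that PyWikiapi does something similar internally.

```python
from urllib.parse import urlencode


def to_api_params(**kwargs):
    """Hypothetical helper: flatten Python values into MediaWiki query parameters."""
    params = {"action": "query", "format": "json"}
    for key, value in kwargs.items():
        if isinstance(value, (list, tuple)):
            # The MediaWiki API separates multiple values with a pipe character.
            value = "|".join(str(v) for v in value)
        params[key] = value
    return params


params = to_api_params(
    titles=["Nightwing", "Batgirl", "Catwoman"],
    prop=["links", "info"],
    pllimit=10,
)
print(params["titles"])  # Nightwing|Batgirl|Catwoman
print("https://en.wikipedia.org/w/api.php?" + urlencode(params))
```

The printed URL is the kind of request you would otherwise have to assemble by hand.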

It is worth noting that the parameters passed to these functions are the same as, or very similar to, the parameters used in the base API. This is because PyWikiapi is a wrapper, and all the search functions discussed so far are based on an API call using action=query, so most parameters that can be used in a regular MediaWiki query can be used in a PyWikiapi query. (Note: I have not tested all parameters at the time of writing this article, but based on how PyWikiapi was built this logic would make sense.) Another peculiarity of PyWikiapi is that it does not appear to have any extended functionality for parse actions. In other words, PyWikiapi includes no dedicated functions for retrieving the parsed content of a particular page. query_pages can give you some information, but because it is based on the query action, it does not return the rendered page content that a parse action would. This gap is easily filled by a package like Wikipedia, and by using a combination of libraries, we can start to build a powerful interface for accessing Wikipedia through code.
