Wikipedia API Part 3: Packages
Zander Bailey
Posted on November 2, 2019
There are many reasons to interact with the API directly and build your own functions to get exactly the output and functionality you need, but there are also many packages designed to streamline the process. Packages exist for many different languages, but for now we will focus on those written for Python. A useful package to know about is Wikipedia. Some packages are designed with a certain focus in mind; Wikipedia is designed mainly around querying a single page and returning an object with preprocessed data from that page. You can find the Wikipedia package here.
One of the primary ways to retrieve a page using the Wikipedia package is to call wikipedia.page() and pass the name of the page you want as a parameter. This returns a WikipediaPage object, which has properties that filter useful data from the raw html. This can be a useful way to retrieve a page, because a normal MediaWiki response object only contains the raw html for the page, whereas a WikipediaPage object has the property WikipediaPage.content, which returns a plain text version of the page. You can also use the WikipediaPage.links property to get a simple list of all the links on the page. Other interesting members include the method WikipediaPage.section('section_title'), which returns the plain text version of a specific section of the page, and the property WikipediaPage.sections, which returns all the section titles as a list.
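As a rough illustration of the attributes described above, here is a minimal sketch. The helper names describe_page and fetch_and_describe are hypothetical, invented for this example and not part of the package. The sketch assumes the package is installed under the name wikipedia and that the returned object exposes title, content, links and sections.

```python
def describe_page(page, preview_chars=200):
    """Collect a few WikipediaPage attributes into a plain dict.

    `page` is expected to look like a WikipediaPage: it needs the
    attributes title, content, links and sections.
    """
    return {
        "title": page.title,
        "preview": page.content[:preview_chars],  # plain text, not html
        "link_count": len(page.links),
        "sections": list(page.sections),
    }


def fetch_and_describe(title):
    """Fetch a page by title and describe it (needs network access)."""
    # Imported lazily so describe_page can be used without the
    # third-party package being installed.
    import wikipedia

    return describe_page(wikipedia.page(title))
```

Calling fetch_and_describe with a page title should return a small dictionary you can print or inspect. Keeping the formatting in describe_page separate from the network call also makes it easy to try out with a stand-in object.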
Another interesting feature of the Wikipedia package is its exceptions. If you have spent much time searching Wikipedia (the website), you might know that the search engine requires fairly exact search terms to find a page. If you misspell a name it might find nothing, and if you search for a common term it may turn up multiple results. These things can also happen during a call to MediaWiki, but there it can be difficult to understand exactly what went wrong with your search. Wikipedia (the package) has exceptions designed to be a little more informative about what happened. wikipedia.exceptions boils it down to four basic types of exceptions: DisambiguationError, HTTPTimeoutError, PageError, and RedirectError. A DisambiguationError is what happens when a page title has multiple matches and so leads to a disambiguation page. On the website the user would then be asked to clarify which page with that title they are interested in, but in a program we don’t have that luxury and have to be more specific the first time. An HTTPTimeoutError refers to a simple timeout of the MediaWiki servers. A PageError occurs when there is no page with a matching title. Finally, a RedirectError happens when a title resolves to a redirect, which apparently is difficult for the API to handle automatically. These error types are a little more straightforward than the normal error handling for MediaWiki, and a little easier to understand and handle.
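One way to handle the four exception types might look like the sketch below. The function names failure_message and lookup, the "kind" labels, and the message wording are all hypothetical choices made for this example. It assumes the exception classes can be imported from wikipedia.exceptions and that DisambiguationError carries its candidate titles in an options attribute.

```python
def failure_message(kind, title, options=()):
    """Build a readable message for one of the four failure kinds."""
    if kind == "disambiguation":
        shown = ", ".join(list(options)[:5])  # show a handful of candidates
        return f"'{title}' is ambiguous; candidates include: {shown}"
    if kind == "missing":
        return f"No page matches '{title}'"
    if kind == "redirect":
        return f"'{title}' resolves to a redirect"
    if kind == "timeout":
        return f"The request for '{title}' timed out"
    raise ValueError(f"unknown failure kind: {kind}")


def lookup(title):
    """Return (page, None) on success or (None, message) on failure."""
    # Imported lazily; the third-party package and network access are
    # only needed when lookup is actually called.
    import wikipedia
    from wikipedia.exceptions import (
        DisambiguationError,
        HTTPTimeoutError,
        PageError,
        RedirectError,
    )

    try:
        return wikipedia.page(title), None
    except DisambiguationError as err:
        return None, failure_message("disambiguation", title, err.options)
    except PageError:
        return None, failure_message("missing", title)
    except RedirectError:
        return None, failure_message("redirect", title)
    except HTTPTimeoutError:
        return None, failure_message("timeout", title)
```

Returning a message instead of re-raising is just one design choice. In a larger program you might prefer to retry on a timeout, or to pick one of the disambiguation candidates and call lookup again.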
If you’d like to know more about the various packages for MediaWiki, you can find the main page for them here.