How to manage open data in your project / Releasing a package manager for open data
ryo-ma
Posted on January 19, 2023
Note: Open data is data that is openly accessible, exploitable, editable and shared by anyone for any purpose. Open data is licensed under an open license. (https://en.wikipedia.org/wiki/Open_data)
I think open data should be managed by a package manager, just like software (e.g. npm, apt, pip, gem).
When fetching open data, it would be convenient for users to be able to fetch it with a command like:
npm install xxxxx
After the data is installed, it is recorded in a dim.json file, much like package.json.
Stop chaotic open data management
Package managers (npm, gem, apt, and so on) have established a systematic way to manage software and libraries. However, there is no comparable systematic approach for users of open data.
If you were given the assignment to visualize a map using some kind of open data, how would you prepare the data?
The following flow is a common example.
1. Search for the open data you want on Google
2. When you find the open data you want, download it from your browser
3. Check the open data, and return to 1 if it is incomplete or not what you wanted
4. Process the open data for use (character encoding conversion, file format conversion, and so on)
5. Save the open data in the project directory or a database
This process is sufficient for simple projects.
However, you may want to record the specs (name, URL, last-updated date, etc.) of the open data in cases such as:
Projects developed by multiple people
Projects to be maintained in the medium to long term
Public projects (published on GitHub as OSS, etc.)
List of required open data specifications
If you download open data from various sites and process it, you may forget where you downloaded it from or how you processed it. It is therefore useful to record the following specifications.
URL
Last-updated
Version
Post-processing
Hash value, etc.
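As a rough illustration, the specifications listed above could be modeled as a typed record, with a small helper that reports which fields have not been filled in yet. This is only a sketch: the `DatasetRecord` shape, its field names, and `missingFields` are hypothetical and are not taken from dim's actual file format.

```typescript
// Hypothetical record shape; field names are illustrative, not dim's actual schema.
interface DatasetRecord {
  name: string;
  url: string;
  lastUpdated: string; // date of the download, e.g. "2023-01-19"
  version: string;
  postProcesses: string[]; // ordered processing steps applied after download
  hash: string; // checksum of the downloaded file
}

const REQUIRED_FIELDS: (keyof DatasetRecord)[] = [
  "name",
  "url",
  "lastUpdated",
  "version",
  "postProcesses",
  "hash",
];

// Returns the names of required fields that are absent from a partial record.
function missingFields(record: Partial<DatasetRecord>): string[] {
  return REQUIRED_FIELDS.filter((key) => record[key] === undefined);
}

const draft: Partial<DatasetRecord> = {
  name: "example-dataset",
  url: "https://example.com/data.csv",
  postProcesses: [],
};
console.log(missingFields(draft)); // [ "lastUpdated", "version", "hash" ]
```

A check like this makes it obvious when a dataset was added by hand without noting where it came from or how it was processed.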
Approach
We have released v1.0 of dim (Open Data Package Manager), a CLI tool.
(1) Support for search/download/processing/recording processes
dim supports the search, download, processing, and recording steps. It can also run the whole series of steps through interactive commands.
(2) Support for post-processing commonly used in data processing
dim includes several post-processes commonly used in data processing. Each post-process is recorded along with the data URL. You can also use your own scripts as post-processes.
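To make the idea of recorded post-processes concrete, here is a minimal sketch of how named steps might be looked up and applied in order. The step names (`normalize-newlines`, `csv-to-json`), the registry, and `runPostProcesses` are all invented for this example and are not dim's built-in post-processes or its API.

```typescript
// A post-process takes file content as text and returns the transformed text.
type PostProcess = (input: string) => string;

// Hypothetical registry: these step names are made up for this sketch.
const postProcessRegistry: Record<string, PostProcess> = {
  // Normalize Windows line endings to "\n".
  "normalize-newlines": (input) => input.replace(/\r\n/g, "\n"),
  // Convert a simple comma-separated table (no quoted fields) to a JSON array.
  "csv-to-json": (input) => {
    const lines = input.trim().split("\n").map((line) => line.split(","));
    const header = lines[0] ?? [];
    const objects = lines.slice(1).map((row) => {
      const entry: Record<string, string> = {};
      header.forEach((column, i) => {
        entry[column] = row[i] ?? "";
      });
      return entry;
    });
    return JSON.stringify(objects);
  },
};

// Applies the recorded steps in order, failing loudly on an unknown step name.
function runPostProcesses(input: string, steps: string[]): string {
  return steps.reduce((content, step) => {
    const fn = postProcessRegistry[step];
    if (fn === undefined) {
      throw new Error(`Unknown post-process: ${step}`);
    }
    return fn(content);
  }, input);
}

const raw = "city,population\r\nTokyo,100\r\nOsaka,50\r\n";
console.log(runPostProcesses(raw, ["normalize-newlines", "csv-to-json"]));
// [{"city":"Tokyo","population":"100"},{"city":"Osaka","population":"50"}]
```

Because the steps are stored as an ordered list of names, the same list can be saved next to the data URL and replayed later by someone else.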
(3) Prepare data in one step using the existing data specification file
You can fetch and process all the open data in one step by using a data specification file (dim.json) that has already been recorded.
As a user, you only need to publish the data specification file (dim.json) on GitHub; the open data itself does not have to be included in the repository.
(This is the same as publishing package.json and similar files to GitHub.)
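The one-step restore can be pictured as reading the specification file and turning each entry into a download task. The sketch below only builds that task list and does not perform any network access. The `SpecFile` layout, the `planInstall` function, and the `./data` directory are assumptions made for illustration; dim's real dim.json and its install logic may differ.

```typescript
// Hypothetical spec-file layout for this sketch; dim's real dim.json may differ.
interface SpecEntry {
  name: string;
  url: string;
  postProcesses: string[];
}

interface SpecFile {
  contents: SpecEntry[];
}

interface InstallStep {
  url: string;
  destination: string;
  postProcesses: string[];
}

// Turns the text of a spec file into the ordered list of downloads to perform.
function planInstall(specText: string, dataDir: string): InstallStep[] {
  const spec = JSON.parse(specText) as SpecFile;
  return spec.contents.map((entry) => {
    // Use the last path segment of the URL as the local file name.
    const fileName = entry.url.split("/").pop() || "data";
    return {
      url: entry.url,
      destination: `${dataDir}/${entry.name}/${fileName}`,
      postProcesses: entry.postProcesses,
    };
  });
}

const specText = JSON.stringify({
  contents: [
    {
      name: "population",
      url: "https://example.com/open/population.csv",
      postProcesses: ["csv-to-json"],
    },
  ],
});
for (const step of planInstall(specText, "./data")) {
  console.log(`${step.url} -> ${step.destination}`);
}
// https://example.com/open/population.csv -> ./data/population/population.csv
```

Since only this small text file is committed, a collaborator can rebuild the same data directory from it instead of receiving the data files themselves.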
About the development environment
Language: TypeScript
Execution environment: Deno
CI/CD: GitHub Actions
CI: Test/Lint/Type Check/Coverage
CD: Tagging a commit automatically builds the dim binary, uploads it, and publishes a release
We are using Deno, which some expect to replace Node.js. We chose Deno for the following reasons.
It is simple to set up, and it is easy to start a project
A linter and a formatter are provided as standard features
We have released v1.0 of dim, which manages open data the way a package manager manages software.
There are still a lot of features we want to add. If you share these concerns and would like to work on them with us, you are very welcome.