Open Source Data Catalog Overview: CKAN

Introduction to CKAN

This blog post, the first in a series about Open Source Data Catalogs, looks at an Open Source Data Catalog known as CKAN. We go over the main idea of CKAN, the technologies that make up CKAN, and the ways to install it, then walk through the package installation method along with some hurdles we ran into while doing so. After that, we discuss a few optional features of CKAN such as its FileStore and DataStore, and the ways of adding data to CKAN. Finally, we conclude with some thoughts on CKAN as an Open Source Data Catalog from the perspective of a short dive into it.

Overview of CKAN

CKAN (Website: https://ckan.org/, GitHub Repository: https://github.com/ckan/ckan) is an Open Source Data Catalog (or DMS, Data Management System) that has been around for over 10 years, with incremental upgrades and releases throughout that time.

CKAN is written primarily in Python. For searching through data that a user puts into the catalog, CKAN uses Open Source Apache Solr (https://solr.apache.org/). CKAN can be installed in the following three ways:

  1. Install from OS (Operating System) Package
  2. Install from source (found at their GitHub repository)
  3. Install from Docker Compose

CKAN is used by a number of public entities looking to serve a large amount of easily searchable data to the public. A few organizations that use CKAN include:

  1. Australian Government: https://data.gov.au/
  2. Government of Canada: https://www.canada.ca/en.html
  3. U.S. Government’s Open Data: https://www.data.gov/

Many other public and private entities use it as well. In the following sections, we go over the installation process and some hurdles we encountered when installing CKAN from the OS package, including the various configuration files and services that need to be set up alongside CKAN for everything to work.

Process of Package Install / Look of CKAN

In our setup, we installed CKAN 2.9.0 on the latest version of Linux Mint Ulyssa (20.1), which is being run inside VMWare VirtualBox, using the OS package install instructions located here: https://docs.ckan.org/en/latest/maintaining/installing/install-from-package.html

Prior to installing CKAN, it is necessary to have PostgreSQL running (CKAN uses PostgreSQL as its database for catalog records such as dataset metadata and users; uploaded files are handled by FileStore, covered below). A default user and database (both named ckan_default) are created in PostgreSQL using the commands provided in the link above. PostgreSQL can be on a different server, in which case the field sqlalchemy.url inside the CKAN config file (located at /etc/ckan/default/ckan.ini) needs to be modified to point to the URL of the PostgreSQL database. In any case, this option needs to be filled in with the proper database name, database username, and database password. Additionally, when PostgreSQL runs on a separate machine, two of its configuration files (typically postgresql.conf and pg_hba.conf) need to be changed to allow the server hosting CKAN to communicate with the database.
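As a rough illustration of what goes into sqlalchemy.url, the sketch below splits such a value into its parts so each one can be checked. It assumes the usual postgresql://user:password@host/database layout; the password "pass" and the host "db.example.internal" are placeholders, not values from our setup.

```python
from urllib.parse import urlsplit


def describe_sqlalchemy_url(url):
    """Split a sqlalchemy.url value into the parts that must be correct."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password_set": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,  # None means no port was given in the URL
        "database": parts.path.lstrip("/"),
    }


# Placeholder values: the password and host name are hypothetical.
info = describe_sqlalchemy_url(
    "postgresql://ckan_default:pass@db.example.internal/ckan_default"
)
print(info)
```

If the host printed here is not the machine PostgreSQL actually runs on, CKAN will not be able to reach its database.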

Next, Apache Solr has to be installed and set up. The port listed in the Tomcat configuration file that serves Solr (/etc/tomcat9/server.xml) needs to match the port in the solr_url field in the CKAN configuration file. Finally, the field ckan.site_url inside the CKAN configuration file needs to be set. In this case, we simply set it to http://localhost, but this will be different if the intention is to run CKAN in production. The CKAN PostgreSQL database can then be initialized using:

sudo ckan db init
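Most of the hurdles we hit at this stage came down to one of the three settings discussed above being missing or empty. The following is a minimal sketch of a check for that, assuming the settings live in the [app:main] section of ckan.ini and that the site URL key is spelled ckan.site_url in full. The helper name and the sample values are our own illustration, not part of CKAN.

```python
import configparser

REQUIRED_KEYS = ("sqlalchemy.url", "solr_url", "ckan.site_url")


def missing_settings(ini_text, section="app:main"):
    """Return the required keys that are absent or empty in the ini text."""
    # interpolation=None keeps "%" sequences (common in logging sections) literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]


# Sample values only; the Solr port and the password are placeholders.
sample = """
[app:main]
sqlalchemy.url = postgresql://ckan_default:pass@localhost/ckan_default
solr_url = http://127.0.0.1:8983/solr
ckan.site_url =
"""
print(missing_settings(sample))
```

For a real check, the contents of /etc/ckan/default/ckan.ini would be read from disk and passed in instead of the sample string.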

Now CKAN can be run by starting the web server and restarting nginx:

sudo supervisorctl reload
sudo service nginx restart

At this point, visiting http://localhost should open up CKAN:

Note the search box that appears in both the top right and the bottom right of the page. The two are identical and can be used to search for data by tag, dataset title, data file title, the title of the organization that the data belongs to, and more.
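The same search is also reachable over CKAN's action API through the package_search action. Below is a small sketch, assuming a CKAN instance at http://localhost; the helper names are ours, and the search function needs a running instance, so it is defined but not called here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(site_url, query, rows=10):
    """Build the URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{site_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Pull dataset titles out of a decoded package_search response."""
    if not response.get("success"):
        raise RuntimeError(f"CKAN reported an error: {response.get('error')}")
    return [dataset.get("title") for dataset in response["result"]["results"]]


def search(site_url, query, rows=10):
    """Run the search against a live CKAN instance (needs network access)."""
    with urlopen(package_search_url(site_url, query, rows)) as reply:
        return dataset_titles(json.load(reply))


print(package_search_url("http://localhost", "economy"))
```

Calling search("http://localhost", "economy") on a seeded instance would return the titles of matching datasets, if any exist.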

Now we have to create a CKAN sysadmin user, which has to be done through the CLI (Command Line Interface). We created a sysadmin user by following these instructions: https://docs.ckan.org/en/latest/maintaining/getting-started.html. Note that the commands mentioned at the top of this documentation will not work unless the virtualenv is activated using:

. /usr/lib/ckan/default/bin/activate

When seeding test data, attempting to run all eight of the commands listed on the getting started page will result in some errors about duplicate users. For now, running only one or two of them (e.g. vocabs and user) appears to avoid these errors.

The getting started page suggests some additional changes, e.g. changing ckan.site_title in the CKAN configuration file. For changes to take effect after the configuration file is updated, it is necessary to restart uWSGI through supervisorctl with:

sudo supervisorctl restart ckan-uwsgi:*

At this point, the CKAN page can be refreshed and a few basic datasets and organizations can be browsed (more or less depending on the number of seed commands used) through the Datasets and Organizations buttons at the top of the page. The data can also be searched through with the search box in the top right corner.

Since a sysadmin user was created, it is possible to log into CKAN using that account. Pressing “sign-in” in the top right corner and logging in signs us in as the sysadmin user (a user that has complete control over that instance of CKAN). After logging in, if we press the hammer icon in the top right corner and then “Config”, we get the following configuration page:

From here, various things can be changed about the website (cover image, title of the website, and more). However, note that if any changes are made here, CKAN will no longer read those settings from the configuration file on the machine. Pressing the “Reset” button at the bottom of this page will allow settings from the configuration file to be used again. For more information on the settings configurable inside the CKAN configuration file, see the documentation: https://docs.ckan.org/en/2.9/maintaining/configuration.html

FileStore, DataStore and DataPusher

To allow users to upload data files to CKAN resources, and to upload images for organizations and groups, FileStore needs to be enabled. Documentation on FileStore can be found here: https://docs.ckan.org/en/latest/maintaining/filestore.html. Once FileStore is enabled and uWSGI is restarted using supervisorctl, an upload button should appear when users create or update a resource, group, or organization.

Once logged in as our sysadmin user, we can click on Datasets and then create a dataset. Note that each dataset belongs to one organization. Here, we can fill in some details about the dataset (Title, Description, License, Tags, the Organization it belongs to, and more). Once some of these are filled out, pressing “Next: Add Data” will allow a user to upload a data resource to the dataset.

Effectively, this is how all data lives in CKAN. Datasets are created which are assigned to an organization, and then individual files or links to data resources are uploaded as resources under a dataset. Here is a picture of the upload button which appears when trying to add a new resource to a dataset once FileStore is installed:

FileStore also has an API, which allows users to upload files programmatically instead of through the UI on the CKAN page.
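An upload through the API is a multipart/form-data POST to the resource_create action, with the file in a field named upload. The sketch below prepares such a request using only the standard library. The dataset name "my-dataset", the token value, and the helper names are hypothetical, and the request is built but deliberately not sent.

```python
import uuid
from urllib.request import Request


def multipart_body(fields, file_field, filename, file_bytes):
    """Encode form fields plus one file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    disposition = (
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'
    )
    lines += [
        f"--{boundary}".encode(),
        disposition.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def upload_request(site_url, api_token, package_id, filename, file_bytes):
    """Prepare (but do not send) a resource_create request with an upload."""
    body, content_type = multipart_body(
        {"package_id": package_id, "name": filename},
        "upload",
        filename,
        file_bytes,
    )
    return Request(
        f"{site_url.rstrip('/')}/api/3/action/resource_create",
        data=body,
        headers={"Authorization": api_token, "Content-Type": content_type},
        method="POST",
    )


# Hypothetical dataset name and token; nothing is sent over the network here.
request = upload_request(
    "http://localhost", "my-api-token", "my-dataset", "data.csv", b"a,b\n1,2\n"
)
print(request.full_url)
```

Passing the prepared request to urllib.request.urlopen against a real instance would perform the upload, provided the token belongs to a user allowed to edit that dataset.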

Two other useful features which can be enabled are DataStore and DataPusher.
DataStore documentation: https://docs.ckan.org/en/latest/maintaining/datastore.html
The documentation describes DataStore as follows:

“The CKAN DataStore extension provides an ad hoc database for storage of structured data from CKAN resources. Data can be pulled out of resource files and stored in the DataStore.”

DataStore uses its own PostgreSQL database. Data can also be written to DataStore through the Data API.

One powerful thing that DataStore enables is data previews on a resource’s page using the data explorer extension. Without DataStore, files such as Excel spreadsheets would need to be downloaded in full before they could be used in any way. With DataStore enabled, it is possible to use the Data API to query only portions of the data (without downloading the whole file, which matters if it is very large), and various preview tools can be added to show portions of the data in the CKAN UI. DataStore can be enabled by following the instructions in the documentation, which primarily involve editing a few lines in the CKAN config file and creating a PostgreSQL database for DataStore to use.
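To make the "query only portions of the data" point concrete, here is a small sketch built around the Data API's datastore_search action, which accepts resource_id, limit, and offset parameters. The resource id used below is a placeholder and the helper names are ours; the fetch_rows function needs a running CKAN with DataStore enabled, so it is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def datastore_search_url(site_url, resource_id, limit=5, offset=0):
    """Build a datastore_search URL that asks for one slice of the rows."""
    params = urlencode(
        {"resource_id": resource_id, "limit": limit, "offset": offset}
    )
    return f"{site_url.rstrip('/')}/api/3/action/datastore_search?{params}"


def records_from(response):
    """Return the rows from a decoded datastore_search response."""
    if not response.get("success"):
        raise RuntimeError(f"CKAN reported an error: {response.get('error')}")
    return response["result"]["records"]


def fetch_rows(site_url, resource_id, limit=5, offset=0):
    """Fetch one slice of rows from a live instance (needs network access)."""
    url = datastore_search_url(site_url, resource_id, limit, offset)
    with urlopen(url) as reply:
        return records_from(json.load(reply))


# "hypothetical-resource-id" stands in for a real resource id.
print(datastore_search_url("http://localhost", "hypothetical-resource-id", offset=10))
```

Only the requested rows travel over the network, which is the practical difference from downloading the whole spreadsheet.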

DataPusher documentation and code: https://github.com/ckan/datapusher
The documentation describes DataPusher’s role as follows:

“This task of automatically parsing and then adding data to the DataStore is performed by the DataPusher, a service that runs asynchronously and can be installed alongside CKAN.”

With a package installation of CKAN, DataPusher just needs to be enabled in the CKAN config file. Instructions for this can be found here: https://github.com/ckan/datapusher#configuring. Once DataPusher is enabled, all of the files in the CKAN system can be submitted to be pushed to the DataStore by DataPusher using the command:

ckan -c /etc/ckan/default/ckan.ini datapusher resubmit

After DataPusher and DataStore are enabled, when going to a specific resource (i.e. one specific file) and hitting “Manage”, an “Upload to DataStore” button appears under the “DataStore” tab:

Pressing this button submits the file, which is then picked up by DataPusher and sent to the DataStore. If any errors occur, or when the upload is finished, the status will update on this page. DataPusher also runs as a background process and will automatically push data that is uploaded to CKAN to the DataStore if the data is of a particular file type (configurable in the CKAN config file).

Additionally, here are a few important portions of the CKAN documentation to look over:

  1. Database Management: https://docs.ckan.org/en/latest/maintaining/database-management.html. This section of the documentation deals with the various things that can be done with the CKAN database, e.g. backing up the database and restoring it from a backup.
  2. Command Line Interface (CLI): https://docs.ckan.org/en/latest/maintaining/cli.html. This section of the documentation deals with various command-line functions that can be used with the ckan command. These include cleaning/initializing the CKAN database, seeding the database with basic data, running a development server, creating sysadmin users, and many more.

One important topic not covered in this blog post is CKAN Extensions. A list of some public CKAN extensions can be found here: https://extensions.ckan.org/

Authorization

Authorization in CKAN can be controlled in four ways: Organizations, Dataset collaborators, Configuration file options, and Extensions. The documentation distinguishes organization admins from sysadmins as follows:

“An organization admin in CKAN is an administrator of a particular organization within the site, with control over that organization and its members and datasets. A sysadmin is an administrator of the site itself. Sysadmins can always do everything, including adding, editing and deleting datasets, organizations and groups, regardless of the organization roles and configuration options described below.”

Each dataset belongs to one organization, and each organization controls access to its datasets. When a user gets added to an organization, an org admin can give them one of three roles with varying levels of access: member, editor, and admin. Dataset collaborators can also be enabled in the config file. This allows a user to be added to an individual dataset as either a member or an editor collaborator, so access does not have to be given to every dataset belonging to the organization. It is also possible to enable admin collaborators.

Using the CKAN API (or the DataStore Data API) generally requires an API token, which can be generated on each user’s page. The token gives API access to whatever that user is allowed to access in CKAN.
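In practice the token is sent in the Authorization header of each API request. The sketch below prepares a JSON POST to an action endpoint with that header set. The token value, the dataset name, and the helper name are placeholders of our own, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def action_request(site_url, action, payload, api_token=None):
    """Prepare a JSON POST to a CKAN action, optionally with an API token."""
    headers = {"Content-Type": "application/json"}
    if api_token:
        # CKAN reads the token from the Authorization header.
        headers["Authorization"] = api_token
    return Request(
        f"{site_url.rstrip('/')}/api/3/action/{action}",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder token and dataset name; nothing is sent over the network here.
request = action_request(
    "http://localhost",
    "package_show",
    {"id": "my-private-dataset"},
    api_token="my-api-token",
)
print(request.full_url, request.get_header("Authorization"))
```

Without the header, the same call is treated as anonymous and only sees what an anonymous visitor could see in the web UI.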

Adding Data

As mentioned previously, data can be added to CKAN through the web UI, the CKAN API, or the CKAN Data API (directly to DataStore). This brings up an important drawback of CKAN: databases cannot be directly attached to it, so scripts would have to be written to transfer a large amount of data or entire databases to CKAN using either the CKAN API or CKAN’s Data API (the API for pushing data straight to CKAN’s DataStore). Some extensions or open-source projects that handle this sort of data transfer are likely available with some searching online.
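A transfer script of the kind described above could look roughly like the following sketch. It reads rows from a source database and groups them into payloads for the Data API's datastore_upsert action. An in-memory SQLite table stands in for the real source database, the resource id is a placeholder, and the sketch assumes the DataStore table was already created (for example with datastore_create). It only builds the payloads and leaves sending them as a separate step.

```python
import sqlite3


def rows_as_dicts(connection, query):
    """Run a query and return each row as a column-name -> value dict."""
    cursor = connection.execute(query)
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


def datastore_payloads(resource_id, rows, batch_size=1000):
    """Yield one datastore_upsert payload per batch of rows."""
    for start in range(0, len(rows), batch_size):
        yield {
            "resource_id": resource_id,
            "records": rows[start:start + batch_size],
            "method": "insert",  # plain inserts need no unique key
        }


# An in-memory SQLite table stands in for the source database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (station TEXT, value REAL)")
source.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 2.0), ("c", 3.25)],
)
rows = rows_as_dicts(source, "SELECT station, value FROM readings ORDER BY station")

# "hypothetical-resource-id" stands in for a real DataStore resource id.
payloads = list(datastore_payloads("hypothetical-resource-id", rows, batch_size=2))
print(len(payloads), "payloads ready to send")
```

Each payload would then be sent as a JSON POST to /api/3/action/datastore_upsert with an API token. Batching keeps individual requests small when the source table is large.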

Conclusions

CKAN is a very powerful Open Source Data Catalog with PostgreSQL as a backend database and Apache Solr for search functionality. CKAN is used by many public entities for providing access to hundreds of thousands of data files, along with the ability to easily search through them.
