In this blog post, we will introduce a few open-source notebooks that can be used to do SQL on Cassandra. At the bottom of the blog, we have an accompanying webinar that you can watch to see a live demo of two of the notebooks we discuss. This is Part 3 of our series on “Doing SQL and Reporting on Apache Cassandra with Open Source Tools”; Parts 1 and 2 are linked below. Also, be on the lookout for Part 4, coming soon!
The Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. Jupyter is probably the most commonly used notebook that we will discuss in this blog, and for good reason, given its capabilities and ease of use.
A few use cases of Jupyter Notebooks include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Jupyter supports over 40 programming languages, including Python, R, Julia, and Scala. We will be focusing on Python for our purposes as we will be utilizing PySpark in order to do SQL on Cassandra within Jupyter. To see a live demo using Jupyter and PySpark, check out the video linked below!
With Jupyter Notebook, users can leverage big data tools such as Apache Spark from Python, R, and Scala, as we did with PySpark (see video below). Users can then explore that same data with pandas, scikit-learn, ggplot2, or TensorFlow for data analysis, data visualization, and machine learning.
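As a rough sketch of what this looks like inside a Jupyter cell, the following assumes a local Cassandra node, a PySpark kernel started with the Spark Cassandra Connector on the classpath, and a hypothetical `store.products` table; all names and versions here are illustrative, not taken from the demo itself:

```python
# Sketch: querying a Cassandra table with Spark SQL from a Jupyter cell.
# Assumes PySpark was launched with the connector package, e.g.:
#   pyspark --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-sql-demo")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # your Cassandra node
    .getOrCreate()
)

# Register the Cassandra table as a temporary view so plain SQL works on it.
(
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="store", table="products")  # hypothetical keyspace/table
    .load()
    .createOrReplaceTempView("products")
)

spark.sql("SELECT category, COUNT(*) AS n FROM products GROUP BY category").show()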
The notebooks are an open document format based on JSON. They contain a complete record of the user’s sessions and include code, narrative text, equations, and rich output. The Notebook communicates with computational kernels using the Interactive Computing Protocol, an open network protocol based on JSON data over ZMQ and WebSockets. The kernels themselves are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
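To make the protocol concrete, here is a minimal sketch of the JSON body of an `execute_request` message as described by the Jupyter messaging spec. In practice this dict is serialized and sent over ZMQ by a client library such as `jupyter_client`; the username and version values below are illustrative:

```python
import json
import uuid
from datetime import datetime, timezone

def execute_request(code: str, session: str) -> dict:
    """Build the JSON body of a Jupyter 'execute_request' message."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session,
            "username": "demo",
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",  # messaging protocol version
        },
        "parent_header": {},   # empty: this message starts a new chain
        "metadata": {},
        "content": {
            "code": code,      # the code the kernel should run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": True,
            "stop_on_error": True,
        },
    }

msg = execute_request("1 + 1", session=uuid.uuid4().hex)
print(json.dumps(msg["content"], indent=2))
```

The kernel replies with an `execute_reply` message whose `parent_header` echoes this message's `header`, which is how clients match replies to requests.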
As mentioned a few times above, we explored Jupyter Notebook with a live demo, and the video of that is linked below.
Another prevalent open-source notebook is Apache Zeppelin. Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics, and collaborative documents with SQL, Scala, and more. The most common use cases for Apache Zeppelin are data ingestion, data discovery, data analytics, data visualization, and collaboration.
Zeppelin supports Spark, PySpark, SparkR, and Spark SQL with a dependency loader. Zeppelin also lets users connect seamlessly to any JDBC data source, including PostgreSQL, MySQL, MariaDB, Redshift, and Apache Hive. Apache Zeppelin also supports Python with Matplotlib, Conda, pandas SQL, and PySpark integrations.
The Apache Zeppelin interpreter concept allows any language/data-processing-backend to be plugged into Zeppelin. Currently, Apache Zeppelin supports many interpreters such as Apache Spark, Python, JDBC, Markdown, and Shell.
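To make the interpreter concept concrete, here is a sketch of two Zeppelin paragraphs; each paragraph's `%` prefix selects its interpreter, and the table name is hypothetical:

```
%md
## Top product categories

%spark.sql
-- A Spark SQL paragraph; 'products' is an illustrative table
SELECT category, COUNT(*) AS n
FROM products
GROUP BY category
```

Zeppelin renders the first paragraph as Markdown and runs the second through the Spark SQL interpreter, displaying the result as a table or chart.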
As mentioned above, Apache Zeppelin can be used for data visualization. Zeppelin includes some basic charts out of the box, and visualizations are not limited to Spark SQL queries: output from any language backend can be recognized and visualized.
Zeppelin also aggregates values and displays them in a pivot chart with simple drag and drop. Users can easily create charts with multiple aggregated values, including sum, count, average, min, and max.
Zeppelin notebooks can also be shared for collaboration through URLs. Any changes made to the notebooks are broadcast in real time, so every user can see them immediately.
The notebooks can also be embedded as an iframe wherever iframes are allowed.
Unfortunately, we were not able to demo Apache Zeppelin in the video linked below, but we were able to demo the next notebook (see video below!).
Quix is a multi-user, easy-to-use notebook manager that can act as a shared space for BI insights. Quix supports Presto, Athena, BigQuery, MySQL, PostgreSQL, ClickHouse, and more, either directly or through JDBC drivers. We used Quix’s Presto support to do SQL and joins on Cassandra from Quix itself; a live demo of this is available in the video linked below. If you want to see how to set up Presto and connect it to Cassandra, with a live demo of using Presto to do SQL and joins on Cassandra, check out Part 1 of this series, Presto and Cassandra, here or at the series listing below. Quix can also be run in single-user or multi-user mode via Google OAuth.
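For a sense of what this looks like, here is an illustrative Presto query joining two Cassandra tables through Presto's Cassandra connector. The catalog name `cassandra` comes from the connector's properties file, and the keyspace and table names are hypothetical:

```sql
-- Presto addresses Cassandra tables as catalog.keyspace.table
SELECT o.order_id, c.name, o.total
FROM cassandra.shop.orders AS o
JOIN cassandra.shop.customers AS c
  ON o.customer_id = c.customer_id
LIMIT 10;
```

Because the join is executed by Presto rather than Cassandra, queries like this work even though CQL itself has no JOIN support.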
Quix uses notes to edit and run queries, and those notes live inside notebooks, which can be organized in folders. The database explorer lets users navigate, search, and preview their tables, as seen below.
Quix can also help users write their queries with autocomplete, syntax highlighting, live syntax validation, error line highlight, and dataset explorer with drag and drop support. Additionally, the Quix smart editor supports strong type variables, multiple result statements, simple pivot and pie visualizations, and the ability to export results as CSV. The visualizations can be done with one click as seen below.
Users can share notebooks via a simple link, as well as search the notes of every user in their organization with full-text search.
As mentioned above, we have a live demo of Quix in addition to a live demo of Jupyter linked in the video below.
The final notebook that we will discuss is Polynote. Polynote is a polyglot notebook with first-class Scala support. Polyglot means knowing or using several languages; in that regard, Polynote supports mixing multiple languages in one notebook and sharing data between them seamlessly. It also encourages reproducible notebooks with its immutable data model.
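As a sketch of what polyglot sharing looks like (cell languages are chosen per cell in the Polynote UI, and the variable names here are illustrative), a value defined in a Scala cell is visible from a later Python cell:

```
// Scala cell
val prices = List(9.99, 4.50, 12.00)

# Python cell — `prices` defined in the Scala cell above is visible here
total = sum(prices)
```

Polynote's ordered cell semantics mean each cell sees exactly the state produced by the cells above it, which is what makes this kind of cross-language sharing reproducible.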
A few unique features that Polynote offers are:
- Interactive autocomplete
- Error Highlighting
- Rich text editing
- LaTeX equations
- Individual task tracking
- Symbol table to track defined variables
- Real time running statement highlighting
- Ordered cell semantics
- Apache Spark integration
- Table structured data
- Rich representations of data
- Built-in plot editor
Another cool thing about Polynote is that users can utilize Vega for visualizations.
One warning that Polynote does give is that it currently has no built-in security or authentication of its own, and it relies entirely on users deploying and configuring it in a secure way. Polynote is also currently only tested on Linux and macOS using Chrome as the client. The team hopes to test other platforms and browsers soon, so if you run into issues on Windows, you can either search for potential fixes or open an issue on Polynote’s GitHub repo.
Unfortunately, we were not able to show a live demo of Polynote, just as with Apache Zeppelin. However, as mentioned before, we were able to live demo Jupyter Notebook and Quix; if you want to watch those, the video is linked below. And with that, we conclude our blog on open-source notebooks and Cassandra.
Doing SQL and Reporting on Apache Cassandra with Open Source Tools
- Presto and Cassandra
- Spark and Cassandra
- Open Source Notebooks and Cassandra
- Open Source BI Tools and Cassandra
Cassandra.Link is a knowledge base that we created for all things Apache Cassandra. Our goal with Cassandra.Link was not only to fill the gap left by Planet Cassandra, but to bring the Cassandra community together. Feel free to reach out if you wish to collaborate with us on this project in any capacity.
We are a technology company that specializes in building business platforms. If you have any questions about the tools discussed in this post or about any of our services, feel free to send us an email!