With an increasing number of specialized databases, each having its own query language, data analysts have a hard time combining data from multiple sources. To mitigate this issue, Facebook created Presto, a high-performance, distributed SQL query engine for big data. I will be creating a small simulation using Metabase, a web-based open-source BI solution, to visualize data from MongoDB.
Metabase is an extremely easy-to-use Business Intelligence (BI) tool. It bills itself as the easy, open source way for everyone in your company to ask questions and learn from data. This page shows how Metabase can be set up to integrate with YugabyteDB's PostgreSQL compatible API. Start local cluster. I am using Metabase to ask questions on my DB. I am trying to join two tables where the same info (user ID) has two different names. The code I wrote is as follows: SELECT gamestates.gamemod.
Docker is a container platform. Containers package software applications and all their dependencies, which helps enable reproducible infrastructure. Docker Hub contains many container images so we have a starting point and don’t have to package everything ourselves. This is extremely helpful, especially during development.
(Note that, even though we’ve only selected the Orders table in the data section, Metabase will automatically join the Products and People tables to get the State and Category data.) The resulting table is a regular one, with rows for each combination of state, year, and product category.
Each container should have only one function. For example, a simple scenario in back-end development is to use one container for the application and another for the database. To connect the containers in a local environment, Docker Compose is the easiest solution, but for production the recommended approach is to use an orchestration system like Kubernetes.
Following the rise of NoSQL databases, many specialized query languages exist today. This leads to huge data integration efforts and expensive ETL processes. In the Hadoop world, a number of solutions emerged to enable the usage of SQL to retrieve data, Presto being the most interesting in my opinion.
The main advantage of Presto is that it has many data connectors, such as Kafka, Cassandra, Elasticsearch, MongoDB, Postgres, etc… It is then able to infer the schema automatically and handle semi-structured data. Arrays, nested objects, multi-database joins and the regular SQL operations are all supported.
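To make the idea of a multi-database join more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module instead of Presto. Two independent SQLite databases are attached to one connection and joined in a single SQL statement; Presto does something analogous across its connectors, where tables are qualified as catalog.schema.table. The table names, column names, and sample rows below are all made up for illustration and do not come from the project repository.

```python
import sqlite3


def cross_database_join():
    """Join tables that live in two separate SQLite databases.

    This is only an analogy for a Presto multi-connector join:
    'main' and 'crm' stand in for two different data sources.
    """
    # The default in-memory database plays the role of the first source.
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database under the alias "crm".
    conn.execute("ATTACH DATABASE ':memory:' AS crm")

    # Hypothetical tables: the user identifier is called user_id in one
    # source and id in the other.
    conn.execute("CREATE TABLE main.orders (user_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.people (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    conn.executemany(
        "INSERT INTO crm.people VALUES (?, ?)",
        [(1, "Ada"), (2, "Linus")],
    )

    # One SQL statement spans both databases; the ON clause maps the
    # differently named key columns onto each other.
    rows = conn.execute(
        """
        SELECT p.name, SUM(o.amount) AS total
        FROM main.orders AS o
        JOIN crm.people AS p ON p.id = o.user_id
        GROUP BY p.name
        ORDER BY p.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, total in cross_database_join():
        print(name, total)
```

The same pattern also covers the common case where the join key has a different name in each table: the ON clause simply names both columns.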
Apache Drill is a very similar alternative to Presto; however, it appears to be less popular and doesn’t support as many data sources. Performance comparisons are out of scope for this post, but Presto is used by big players in data-intensive environments. Amazon created Athena, which is based on Presto and heavily integrated into AWS.
There is a lot of commercial Business Intelligence software out there. Metabase stands out for being an open-source BI technology that is easy to use even for people who don’t know SQL, allowing them to explore the data and create web dashboards. Some advanced visualizations still require SQL knowledge.
Other alternatives include Redash and Superset. From my research, these support more chart types but are less user friendly and deployment is a bit more complicated.
MongoDB is a document-oriented database. Documents are stored as JSON objects, making it a good choice for semi-structured data with a flexible schema. It does not support SQL out of the box, which makes it harder for analysts to extract data because they have to learn another query language.
I created 4 containers, 2 for different databases, 1 for Presto and the last for Metabase. Application data should be stored in volumes. Configuration files can either be copied and stored as part of the image or we can follow the volume approach. On Windows, named volumes must be used in some cases because of file permission issues.
I’m including here some commands that I used for testing. JSON Generator is a very nice tool to create datasets. For more details about how to seed the databases and configure Presto, check the repository.
Unlike the other tools, Metabase offers no way to import or export dashboard configurations other than copying the database it uses, which is H2 by default. For that reason I have included the data folder in the project repository.
It is possible to see the queries that are executed by the Presto cluster. A distributed system is required to handle large amounts of data.
The final dashboard can be reached by navigating to localhost:8000 after running docker-compose up. During the first manual setup I chose test[at]example.com as the username and test1234 as the password. Ideally this should be part of the image configuration or set up automatically when the container starts.
I recently had to create a web dashboard that needed data from multiple databases and these technologies allowed me to quickly build a quality prototype. Feel free to check the repository.
In the first part of our Plugin Developer Statistics blog post we described the different data sources and chose Metabase as our stats dashboard. Have a look at the article in case you’ve missed it.
Part 2 described how to pull data from wordpress.org and add it to an SQLite database. Now we use this data to create a few charts on our dashboard.
Connect Metabase to our SQLite Database
Navigate to the settings icon on the top right and click Admin, then:
- Select Databases
- Delete the sample dataset
- Click Add Database
Note: the path to the file is relative to the volume mounted into our docker container in part 1 of this post.
After adding the data source with the values from the screenshot above, Metabase suggests some automatic charts. Just discard them and click Ask question on the top right of the page.
Active installation chart
Select Simple Question and select the Active installation growth data.
Now you should see a table of values. To convert it to a line chart, click Visualization on the bottom left and select Line.
We can already see some kind of chart but I think we should separate the data to show one chart per (relevant) plugin.
Use the Filter button on the top right to limit the chart to one plugin.
Our first chart is ready to be placed on our dashboard. Save the chart and add it to a dashboard when Metabase asks for it.
UPDATE 2020-10-02: Benjamin Intal wrote an excellent article on how to compute your active installation data in detail using the same API we used in the previous step. We have covered these detailed active installations in a separate article.
Plugin metrics table
Our next dashboard widget will be a simple data table with some relevant numbers which, in my opinion, do not need a chart.
Start again with Ask a question but choose Native query now. Then enter the following SQL statement:
We’ve got some plugins with very few installations, so we filter our table to plugins with at least 300 installations (see the second-to-last line). That’s it: save the question and add it to your dashboard.
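As a rough illustration of this kind of filter, the following sketch runs a comparable query against an in-memory SQLite database from Python. It is not the query used for the dashboard: the table name plugins, its columns, and the sample rows are assumptions made up for this example, and only the threshold of 300 installations comes from the text above.

```python
import sqlite3


def plugin_metrics(rows, min_installs=300):
    """Return plugins with at least min_installs active installations.

    rows is a list of (name, active_installs, rating) tuples; the
    schema is hypothetical and only serves to demonstrate the filter.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE plugins (name TEXT, active_installs INTEGER, rating REAL)"
    )
    conn.executemany("INSERT INTO plugins VALUES (?, ?, ?)", rows)

    # The WHERE clause drops plugins below the threshold; the result is
    # sorted so the most widely installed plugin comes first.
    result = conn.execute(
        """
        SELECT name, active_installs, rating
        FROM plugins
        WHERE active_installs >= ?
        ORDER BY active_installs DESC
        """,
        (min_installs,),
    ).fetchall()
    conn.close()
    return result


# Made-up sample data: "beta" falls below the threshold and is filtered out.
sample = [("alpha", 1000, 4.5), ("beta", 200, 5.0), ("gamma", 300, 3.9)]

if __name__ == "__main__":
    for row in plugin_metrics(sample):
        print(row)
```

In a Metabase native query only the SELECT statement would be entered, with the threshold written directly into the WHERE clause.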
Sum up all your active plugin installations as a pie chart
Although this data is not very accurate, it still gives a great overview.
Start again with Ask a question -> Native query, choose WordPress Org data, and enter this query:
Then click Visualization -> Pie Chart, save it, and place it on your dashboard. If you don’t need a legend for your chart, click the settings icon inside the widget to remove it.
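To show what a pie chart of this kind is built from, here is a small sketch that computes the number of installations per plugin and each plugin's share of the total. As before, it uses Python's sqlite3 module with a hypothetical plugins table and made-up numbers; it is not the query behind the dashboard.

```python
import sqlite3


def install_shares(rows):
    """Return (name, installs, share_percent) per plugin, largest first.

    rows is a list of (name, active_installs) tuples in a hypothetical
    schema; share_percent is the slice size a pie chart would draw.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plugins (name TEXT, active_installs INTEGER)")
    conn.executemany("INSERT INTO plugins VALUES (?, ?)", rows)

    # The scalar subquery yields the grand total, so every group can be
    # expressed as a percentage of all active installations.
    result = conn.execute(
        """
        SELECT name,
               SUM(active_installs) AS installs,
               SUM(active_installs) * 100.0
                   / (SELECT SUM(active_installs) FROM plugins) AS share_percent
        FROM plugins
        GROUP BY name
        ORDER BY installs DESC
        """
    ).fetchall()
    conn.close()
    return result


# Made-up sample data for three plugins.
install_sample = [("alpha", 600), ("beta", 300), ("gamma", 100)]

if __name__ == "__main__":
    for name, installs, share in install_shares(install_sample):
        print(f"{name}: {installs} installs ({share:.1f}%)")
```

Metabase calculates the slice percentages itself, so for the widget a query that returns just the name and the summed installations should be enough.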
Your total plugin developer rating on your dashboard
The last widget we are going to create from this data source is an overview of our total ratings. We use a native query again with this SQL statement:
Now you can try different visualizations:
And that’s what our dashboard looks like (for now).