Data Federation with Presto on Docker
Introduction
This post describes how to set up a Presto cluster on top of Docker containers. In the process of setting up the cluster, several useful features of Docker will be presented. However, the main purpose is to show how Presto can connect to different databases and use them as sources to run queries over their data, providing the user with so-called data federation: that is, the ability to query different data sources from a single point of operation, which is Presto.
To query the underlying data, Presto provides the user with a dialect of SQL that expands SQL functionality with very interesting features, such as interactive queries or joins between tables from different database technologies. Suffice it to say that this is possible as long as the data models are consistent across the different database technologies.
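As a taste of what this enables, here is a sketch of such a cross-technology join; the catalog, schema and table names refer to the setup that will be built later in this post:

SELECT m.song_title, g.danceability
FROM mysql.spotifydb.spotifysongs m
JOIN mongodb.spotifydata.spotifysongs g ON m.id = g.id;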
The picture below shows the Presto control panel. For this demonstration, two nodes have been created using different containers: one that runs as both coordinator and worker, and another that acts as an extra worker node. This tutorial explains how to create this system and how to connect it to different database technologies. For that, extra containers will be created, one for each database technology used.
Setting up Docker
Coordinator container
The Dockerfile below shows how to configure the coordinator container. An openjdk image is used as the base to configure Presto on top of it. This file shouldn't be taken to production, as it contains some configurations meant to make it easier to play with the system: for example, a root password is defined so that it is easy to install software while the container is running, and some IPs are hardcoded to avoid introducing other concepts in the tutorial.
Pay attention to the lines that contain the configuration of Presto. Even though the lines are pretty self-explanatory, I recommend looking at the basic configuration section of Presto's website. Basically, the idea here is to indicate that this instance works as coordinator as well as worker, to define a node id that is unique in the cluster, and to define the connectors to the other databases that will live in other containers.
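For reference, these are the two key files that the echo lines in the Dockerfile generate under /presto/etc, rendered here for readability:

node.properties:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

config.properties:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://172.17.0.2:8080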
FROM openjdk:8-jre-slim

ARG MIRROR="https://repo1.maven.org/maven2/com/facebook/presto"
ARG PRESTO_VERSION="0.229"
ARG PRESTO_BIN="${MIRROR}/presto-server/${PRESTO_VERSION}/presto-server-${PRESTO_VERSION}.tar.gz"
ARG PRESTO_CLI_BIN="${MIRROR}/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar"

USER root
RUN echo "root:1234" | chpasswd
RUN apt-get update && apt-get install -y vim
RUN apt-get update && \
    apt-get install -y --allow-unauthenticated curl wget less && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

ENV PRESTO_HOME /presto
ENV PRESTO_USER presto
ENV PRESTO_CONF_DIR ${PRESTO_HOME}/etc
ENV PATH $PATH:$PRESTO_HOME/bin

RUN useradd \
    --create-home \
    --home-dir ${PRESTO_HOME} \
    --shell /bin/bash \
    $PRESTO_USER

# download and unpack the Presto server and CLI
RUN mkdir -p $PRESTO_HOME && \
    wget --quiet $PRESTO_BIN && \
    tar xzf presto-server-${PRESTO_VERSION}.tar.gz && \
    rm -rf presto-server-${PRESTO_VERSION}.tar.gz && \
    mv presto-server-${PRESTO_VERSION}/* $PRESTO_HOME && \
    rm -rf presto-server-${PRESTO_VERSION} && \
    mkdir -p ${PRESTO_CONF_DIR}/catalog/ && \
    mkdir -p ${PRESTO_HOME}/data && \
    cd ${PRESTO_HOME}/bin && \
    wget --quiet ${PRESTO_CLI_BIN} && \
    mv presto-cli-${PRESTO_VERSION}-executable.jar presto && \
    chmod +x presto && \
    chown -R ${PRESTO_USER}:${PRESTO_USER} $PRESTO_HOME

# The Presto launcher needs python 2
# See: https://github.com/prestodb/presto/issues/4678
RUN apt-get update && apt-get install -y --no-install-recommends \
    python \
    && rm -rf /var/lib/apt/lists/* \
    && cd /usr/local/bin \
    && rm -rf idle pydoc python python-config

RUN mkdir -p /var/presto
RUN mkdir -p /var/presto/data
RUN chmod -R 777 /var/presto

USER $PRESTO_USER

# create the Presto configuration files
RUN touch /presto/etc/node.properties
RUN touch /presto/etc/jvm.config
RUN touch /presto/etc/config.properties
RUN touch /presto/etc/log.properties
RUN touch /presto/etc/catalog/jmx.properties
RUN touch /presto/etc/catalog/mysql.properties
RUN touch /presto/etc/catalog/mongodb.properties

RUN echo "node.environment=production\nnode.id=ffffffff-ffff-ffff-ffff-ffffffffffff\nnode.data-dir=/var/presto/data" >> /presto/etc/node.properties
RUN echo "-server\n\n-Xmx16G\n-XX:+UseG1GC\n-XX:G1HeapRegionSize=32M\n-XX:+UseGCOverheadLimit\n-XX:+ExplicitGCInvokesConcurrent\n-XX:+HeapDumpOnOutOfMemoryError\n-XX:+ExitOnOutOfMemoryError" >> /presto/etc/jvm.config
RUN echo "coordinator=true\nnode-scheduler.include-coordinator=true\nhttp-server.http.port=8080\nquery.max-memory=5GB\nquery.max-memory-per-node=1GB\ndiscovery-server.enabled=true\ndiscovery.uri=http://172.17.0.2:8080" >> /presto/etc/config.properties
RUN echo "com.facebook.presto=INFO" >> /presto/etc/log.properties
RUN echo "connector.name=jmx" >> /presto/etc/catalog/jmx.properties
RUN echo "connector.name=mysql\nconnection-url=jdbc:mysql://172.17.0.3:3306\nconnection-user=root\nconnection-password=secret" >> /presto/etc/catalog/mysql.properties
RUN echo "connector.name=mongodb\nmongodb.seeds=172.17.0.4" >> /presto/etc/catalog/mongodb.properties

# run presto
CMD ["/presto/bin/launcher", "run"]
Worker container
The Dockerfile below shows the worker container configuration. It is very similar to the coordinator's, but here the node is set as a worker, the coordinator's address has to be specified, and no database catalogs are defined. Bear in mind that the worker also requires a unique node id; indeed, if more nodes were added to the cluster, each one would have to be configured with a unique id. Note also that Presto generally expects the same catalog files on every node that executes connector work, so you may want to replicate them on the worker as well; see the snippet after the Dockerfile.
FROM openjdk:8-jre-slim

ARG MIRROR="https://repo1.maven.org/maven2/com/facebook/presto"
ARG PRESTO_VERSION="0.229"
ARG PRESTO_BIN="${MIRROR}/presto-server/${PRESTO_VERSION}/presto-server-${PRESTO_VERSION}.tar.gz"
ARG PRESTO_CLI_BIN="${MIRROR}/presto-cli/${PRESTO_VERSION}/presto-cli-${PRESTO_VERSION}-executable.jar"

USER root
RUN echo "root:1234" | chpasswd
RUN apt-get update && apt-get install -y vim
RUN apt-get update && \
    apt-get install -y --allow-unauthenticated curl wget less && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

ENV PRESTO_HOME /presto
ENV PRESTO_USER presto
ENV PRESTO_CONF_DIR ${PRESTO_HOME}/etc
ENV PATH $PATH:$PRESTO_HOME/bin

RUN useradd \
    --create-home \
    --home-dir ${PRESTO_HOME} \
    --shell /bin/bash \
    $PRESTO_USER

# download and unpack the Presto server and CLI
RUN mkdir -p $PRESTO_HOME && \
    wget --quiet $PRESTO_BIN && \
    tar xzf presto-server-${PRESTO_VERSION}.tar.gz && \
    rm -rf presto-server-${PRESTO_VERSION}.tar.gz && \
    mv presto-server-${PRESTO_VERSION}/* $PRESTO_HOME && \
    rm -rf presto-server-${PRESTO_VERSION} && \
    mkdir -p ${PRESTO_CONF_DIR}/catalog/ && \
    mkdir -p ${PRESTO_HOME}/data && \
    cd ${PRESTO_HOME}/bin && \
    wget --quiet ${PRESTO_CLI_BIN} && \
    mv presto-cli-${PRESTO_VERSION}-executable.jar presto && \
    chmod +x presto && \
    chown -R ${PRESTO_USER}:${PRESTO_USER} $PRESTO_HOME

# The Presto launcher needs python 2
# See: https://github.com/prestodb/presto/issues/4678
RUN apt-get update && apt-get install -y --no-install-recommends \
    python \
    && rm -rf /var/lib/apt/lists/* \
    && cd /usr/local/bin \
    && rm -rf idle pydoc python python-config

RUN mkdir -p /var/presto
RUN mkdir -p /var/presto/data
RUN chmod -R 777 /var/presto

USER $PRESTO_USER

# create the Presto configuration files
RUN touch /presto/etc/node.properties
RUN touch /presto/etc/jvm.config
RUN touch /presto/etc/config.properties
RUN touch /presto/etc/log.properties
RUN touch /presto/etc/catalog/jmx.properties

RUN echo "node.environment=production\nnode.id=ffffffff-ffff-ffff-ffff-fffffffffff1\nnode.data-dir=/var/presto/data" >> /presto/etc/node.properties
RUN echo "-server\n\n-Xmx16G\n-XX:+UseG1GC\n-XX:G1HeapRegionSize=32M\n-XX:+UseGCOverheadLimit\n-XX:+ExplicitGCInvokesConcurrent\n-XX:+HeapDumpOnOutOfMemoryError\n-XX:+ExitOnOutOfMemoryError" >> /presto/etc/jvm.config
RUN echo "coordinator=false\nhttp-server.http.port=8080\nquery.max-memory=5GB\nquery.max-memory-per-node=1GB\ndiscovery.uri=http://172.17.0.2:8080" >> /presto/etc/config.properties
RUN echo "com.facebook.presto=INFO" >> /presto/etc/log.properties
RUN echo "connector.name=jmx" >> /presto/etc/catalog/jmx.properties

# run presto
CMD ["/presto/bin/launcher", "run"]
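As mentioned above, if connector work is scheduled on this worker, Presto expects the same catalog files here as on the coordinator. A minimal sketch of the extra lines for the worker Dockerfile, assuming the same database addresses as on the coordinator, would be:

RUN touch /presto/etc/catalog/mysql.properties /presto/etc/catalog/mongodb.properties
RUN echo "connector.name=mysql\nconnection-url=jdbc:mysql://172.17.0.3:3306\nconnection-user=root\nconnection-password=secret" >> /presto/etc/catalog/mysql.properties
RUN echo "connector.name=mongodb\nmongodb.seeds=172.17.0.4" >> /presto/etc/catalog/mongodb.properties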
Now let’s build the images that will be used to run the containers:
docker build -t greattenchu/openjdk-presto:1.0 .
docker build -t greattenchu/openjdk-prestoworker:1.0 . -f prestoWorker.Dockerfile
And here are the commands to run each container and connect to it interactively. Note that passing /bin/sh overrides the image's CMD, so once inside you can start Presto manually with /presto/bin/launcher run:
docker run -p 8080:8080 --rm -ti greattenchu/openjdk-presto:1.0 /bin/sh
docker run -p 8081:8080 --rm -ti greattenchu/openjdk-prestoworker:1.0 /bin/sh
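The configuration files hardcode the bridge-network addresses 172.17.0.2, 172.17.0.3 and 172.17.0.4, which Docker assigns in start order on its default bridge network. If your containers come up with different addresses, you can check the actual ones (the container id comes from docker ps):

docker ps
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-id>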
Next, we set up the database containers. MySQL and MongoDB have been chosen here due to their wide use.
MySQL container
For the MySQL container, first a Docker volume is created. This allows the databases to survive if the container is stopped. Once the volume is created, it is mounted into the container when executing the run command. Last, we connect to the container while invoking the mysql prompt.
docker volume create mysql-db-data
docker volume ls
docker run -d -p 33060:3306 --name mysql-db -e MYSQL_ROOT_PASSWORD=secret --mount src=mysql-db-data,dst=/var/lib/mysql mysql
docker exec -it mysql-db mysql -p
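If you want to confirm where Docker keeps the volume's data on the host, you can inspect it:

docker volume inspect mysql-db-data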
To fill the database with some sample data, a file containing a Spotify song list is used (Spotify list). You can, for example, insert one part of the list here and another part into MongoDB, to play later with federated queries. To load the data into MySQL a database client can be used. In this case I have used TablePlus, a frequently updated client that can connect to several database technologies and supports the latest version of MySQL (8.0.19 at the date of this post).
In case you don’t want to spend time defining the table, below is the code to define it and load the data:
create database spotifyDB;
use spotifyDB;

create table spotifysongs (
  id int not null primary key,
  acousticness double,
  danceability double,
  duration_ms int,
  energy double,
  instrumentalness double,
  key_ int,
  liveness double,
  loudness double,
  mode_ int,
  speechiness double,
  tempo double,
  time_signature double,
  valence double,
  target int,
  song_title varchar(50),
  artist varchar(50)
);

load data infile '/Users/david/Downloads/spotifyclissification.csv'
into table spotifysongs
fields terminated by ',';
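Note that the path in load data infile above is a path on my machine, used from a desktop client. If you instead run the load from inside the container, bear in mind that by default MySQL 8 only reads server-side files from its secure_file_priv directory (/var/lib/mysql-files/), so one way, sketched here, is to copy the CSV there first and adjust the path in the statement accordingly:

docker cp ./spotifyclissification.csv mysql-db:/var/lib/mysql-files/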
MongoDB container
Setting up the MongoDB container is even easier. In this case we don’t mind losing the data, so we just run a default invocation.
docker run --name some-mongo -d -p 27017:27017 mongo:4.2.2-bionic
docker exec -it some-mongo mongo
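If you did want the MongoDB data to survive container restarts, the same volume technique used for MySQL applies; the official image stores its data under /data/db:

docker volume create mongo-db-data
docker run --name some-mongo -d -p 27017:27017 --mount src=mongo-db-data,dst=/data/db mongo:4.2.2-bionic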
To load the data into MongoDB a different approach has been taken, although the same could be done through the database client. We connect to the MongoDB container, launch the mongo shell and create the collection. After that, the data file is copied into the container with the docker cp command. Then, inside the container, we execute the mongoimport command to import the data from the file. You can prepare the data to be loaded into MongoDB yourself as a subset of the data provided above.
# inside the mongo shell, switch to the target database and create the collection
use spotifydata
db.createCollection("spotifysongs")

# from the host, copy the CSV into the container (the id comes from docker ps)
docker cp ./spotifyclissificationmongodb.csv 0615c3575a49:/root

# inside the container, import the CSV
mongoimport -d spotifydata -c spotifysongs --type csv --file ./spotifyclissificationmongodb.csv --headerline
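A quick sanity check inside the mongo shell confirms that the import worked:

use spotifydata
db.spotifysongs.count()
db.spotifysongs.findOne()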
Querying with Presto
Finally, the whole system is set up and ready to be used. Connect to the Presto coordinator container and execute presto. Then go through the catalogs and schemas to check that the links between the Presto coordinator and the databases work. Roughly speaking, each catalog corresponds to a connector to one database technology, and schemas are the databases inside each technology.
show catalogs;
show schemas from mysql;
show tables from mysql.spotifydb;
use mysql.spotifydb;
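It is also worth describing the tables on both sides before attempting cross-database joins, since those require compatible columns:

describe mysql.spotifydb.spotifysongs;
describe mongodb.spotifydata.spotifysongs;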
Once the connections are checked, federated queries can be run. Below there is an example:
SELECT
  (SELECT COUNT(*) FROM mysql.spotifydb.spotifysongs),
  (SELECT COUNT(*) FROM mongodb.spotifydata.spotifysongs);
This returns the number of rows in each database technology. If you instead try to concatenate all rows from both databases by listing a full SELECT per source separated by commas, you'll find that the result is not a concatenation of both tables: each parenthesised expression is a scalar subquery and must return a single value, not a row set (the concatenation is expressed with UNION ALL, as sketched below). In this sense, the comma-separated form is limited. Federated queries are effective to count the rows in each DB technology, or to make joins between DB technologies with compatible table columns, which can be very useful when the data model is scattered across databases. This can easily happen in environments with legal regulations that forbid keeping certain sensitive data together, for example in a clinical context, where donor or patient data are scattered across different databases and even infrastructures.
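As a sketch, the concatenation that the comma-separated form cannot express is written with UNION ALL, assuming the selected columns have compatible names and types on both sides:

SELECT song_title, artist FROM mysql.spotifydb.spotifysongs
UNION ALL
SELECT song_title, artist FROM mongodb.spotifydata.spotifysongs;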