Data Science Cloud Infrastructure
A step-by-step guide to deploy a Data Science platform on AWS with open-source software
In our previous post, we saw how to configure AWS Batch and tested our infrastructure by executing a task that spun up a container, waited for three seconds, and shut down.
In this post, we’ll leverage the existing infrastructure, but this time, we’ll execute a more interesting example.
We’ll ship our code to AWS by building a container and storing it in Amazon ECR, a service that allows us to store Docker images.
If you want to keep up to date with my Data Science content, follow me on Medium or Twitter. Thanks for reading!
We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:
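A quick way to check which identity the CLI is authenticated as (assuming your credentials are already configured) is:

```shell
# print the AWS account, user ID, and ARN the CLI is authenticated as
aws sts get-caller-identity
```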
We’ll be using Docker for this part, so ensure it’s up and running:
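One way to verify the Docker daemon is running:

```shell
# this fails with a clear error message if the daemon isn't running
docker info
```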
We first create a repository, which will host our Docker images:
Create ECR repository:
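A sketch of the command, assuming a repository name of ds-platform (use whatever name you prefer):

```shell
# create an ECR repository to host our Docker images
aws ecr create-repository --repository-name ds-platform
```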
The command above will print the repository URI; assign it to a variable since we’ll need it later:
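For example, assuming the repository is named ds-platform, you can query the URI directly:

```shell
# capture the repository URI in a variable for later use
REPOSITORY=$(aws ecr describe-repositories \
    --repository-names ds-platform \
    --query 'repositories[0].repositoryUri' \
    --output text)
echo $REPOSITORY
```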
We’ll now use two open-source tools (Ploomber and Soopervisor) to write our computational task, generate a Docker image, push it to ECR, and schedule a job in AWS Batch.
Let’s install the packages:
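Both packages are available on PyPI:

```shell
# install the two open-source tools we'll use
pip install ploomber soopervisor
```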
Note: We recommend installing them in a virtual environment.
Let’s download an example that trains and evaluates a Machine Learning model:
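Ploomber ships sample projects you can fetch with the ploomber examples command; the example name below is illustrative (the post’s exact example may differ):

```shell
# download a sample ML project into a local folder named "example"
ploomber examples -n templates/ml-basic -o example
cd example
```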
Let’s see the files:
The structure is a typical Ploomber project. Ploomber allows you to easily organize computational workflows as functions, scripts, or notebooks and execute them locally. To learn more, check out Ploomber’s documentation.
On the other hand, Soopervisor allows you to export a Ploomber project and execute it in the cloud.
The next command will tell Soopervisor to create the necessary files so we can export to AWS Batch:
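Given that the output below refers to an aws-env target, the command is presumably:

```shell
# generate a Dockerfile and soopervisor.yaml for the AWS Batch backend
soopervisor add aws-env --backend aws-batch
```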
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Loading... Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/aws-env/Dockerfile...
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it: $ soopervisor export aws-env
To force execution of all tasks: $ soopervisor export aws-env --mode force
soopervisor add will create a soopervisor.yaml file and an aws-env folder containing a Dockerfile (which we need to build the Docker image).
The soopervisor.yaml file contains the configuration parameters:
There are a few parameters we have to configure here, so we created a small script to generate the configuration file:
job_queue: the name of your job queue
aws_region: the region where your AWS Batch infrastructure is located
repository: the ECR repository URI
Here are the values for my infrastructure (replace them with yours):
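For example (every value below is a placeholder; substitute your own queue name, region, and repository URI):

```shell
# sample infrastructure values -- replace with your own
JOB_QUEUE=ds-platform-queue
AWS_REGION=us-east-1
REPOSITORY=123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform
```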
Note: If you don’t have the job queue name, you can get it from the AWS console (ensure you’re in the right region).
Let’s download a utility script to facilitate creating the configuration files:
Now, run the script to generate the soopervisor.yaml configuration file:
Here’s how the file looks:
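A rough sketch of the generated file, based on Soopervisor’s AWS Batch backend (field values are placeholders; your repository URI, queue name, and region will differ):

```yaml
aws-env:
  backend: aws-batch
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform
  job_queue: ds-platform-queue
  region: us-east-1
```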
Let’s now use soopervisor export to execute our project in AWS Batch. This command will do a few things for us:
- Build the Docker container
- Push it to the Amazon ECR repository
- Submit the jobs to AWS Batch
We need to install boto3 since it’s a dependency for submitting jobs to AWS Batch:
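boto3 is available on PyPI:

```shell
# install the AWS SDK for Python, used to submit jobs to AWS Batch
pip install boto3
```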
Authenticate with Amazon ECR so we can push images:
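A sketch of the login step, assuming the $AWS_REGION and $REPOSITORY variables from earlier are set:

```shell
# obtain an ECR token and pipe it into docker login
aws ecr get-login-password --region $AWS_REGION | \
    docker login --username AWS --password-stdin $REPOSITORY
```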
Let’s now export the project. Bear in mind that this command will take a few minutes:
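Assuming the target added earlier was named aws-env:

```shell
# build the image, push it to ECR, and submit the job to AWS Batch
soopervisor export aws-env
```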
If all goes well, you’ll see something like this:
If you encounter issues with the soopervisor export command, or are unable to push to ECR, join our community and we’ll help you!
Once the command finishes execution, the job will be submitted to AWS Batch. Let’s use the aws CLI to list the jobs submitted to the queue:
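A sketch of the listing command, assuming $JOB_QUEUE holds your queue name (note that list-jobs filters by status, so you may need to query RUNNING as well):

```shell
# list finished jobs in the queue; swap SUCCEEDED for RUNNING to see in-flight jobs
aws batch list-jobs --job-queue $JOB_QUEUE --job-status SUCCEEDED
```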
After a minute, you’ll see the task shows as SUCCEEDED (it’ll appear as RUNNING if it hasn’t finished).
However, there is a catch: AWS Batch ran our code, but it shut down the EC2 instance shortly after, so we no longer have access to the output.
To fix that, we’ll add an S3 client to our project, so all outputs are stored.
Let’s first create a bucket in S3. S3 bucket names must be globally unique; you can run the following snippet in your terminal, or choose a unique name and assign it to the variable:
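One way to generate a (likely) unique name and create the bucket (the ds-experiments prefix is an assumption; pick your own):

```shell
# append a timestamp to make the bucket name (probably) unique
BUCKET_NAME="ds-experiments-$(date +%s)"
aws s3 mb "s3://$BUCKET_NAME"
```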
Ploomber allows us to specify an S3 bucket, and it’ll take care of uploading all outputs for us. We only have to create a short file, and the generate.py script can create one for us:
We need to configure our pipeline.yaml file so it uploads artifacts to S3. Let’s use the generate.py script to do it for us:
Furthermore, let’s add boto3 to our dependencies since we’ll be calling it to upload artifacts to S3:
Let’s add S3 permissions to our AWS Batch tasks. Generate a policy:
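A minimal sketch of the policy, assuming $BUCKET_NAME holds your bucket name and ds-platform-s3 as the (hypothetical) policy name; you’ll then need to attach it to the role your AWS Batch compute environment uses, which depends on how you set it up in Part I:

```shell
# write a policy document granting read/write access to the bucket
cat > s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::$BUCKET_NAME",
        "arn:aws:s3:::$BUCKET_NAME/*"
      ]
    }
  ]
}
EOF

# register the policy with IAM
aws iam create-policy --policy-name ds-platform-s3 \
    --policy-document file://s3-policy.json
```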
We’re now ready to execute our task on AWS Batch!
Let’s ensure we can push to ECR:
Submit the task again:
Note that this time, the soopervisor export command is a lot faster, since it cached our Docker image!
Let’s check the status of the task:
After a minute, you should see it as SUCCEEDED.
Check the contents of the bucket, and we’ll see the task output there.
In this post, we learned how to upload our code and execute it in AWS Batch via a Docker image. We also configured AWS Batch to read and write an S3 bucket. With this configuration, we can start running Data Science experiments in a scalable way without worrying about maintaining infrastructure!
In the next (and final) post of this series, we’ll see how to easily generate hundreds of experiments and retrieve the results.
If you want to be the first to know when the final part comes out, follow us on Twitter, LinkedIn, or subscribe to our newsletter!
If you wish to delete the infrastructure we created in this post, here are the commands.
Delete ECR repository:
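Assuming the repository name used earlier (ds-platform; replace with yours):

```shell
# --force deletes the repository even if it still contains images
aws ecr delete-repository --repository-name ds-platform --force
```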
Delete S3 bucket:
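Assuming $BUCKET_NAME still holds your bucket name:

```shell
# --force empties the bucket before removing it
aws s3 rb "s3://$BUCKET_NAME" --force
```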
Deploying a Data Science Platform on AWS: Running containerized experiments (Part II). Republished from https://towardsdatascience.com/deploying-a-data-science-platform-on-aws-running-containerized-experiments-part-ii-bef0e22bd8ae