Set up Amazon FSx for Lustre

How to build a metagenomic binning pipeline on AWS (Part 1)

Sixing Huang
AWS in Plain English


Bioinformatics is leaping into the cloud

In the webinar “Scaling genomics workloads using HPC on AWS” on July 14, 2021, I learned that heavyweights such as AstraZeneca and Illumina have already moved their genome analyses into the AWS cloud and have been reaping great benefits ever since. The cloud reduced both runtime and costs dramatically. For example, AstraZeneca accelerated its sequence data processing pipeline by 2,400% on AWS, while Illumina is saving close to $400,000 in monthly compute and storage costs.

One statement from the webinar lecturer Dr. Evan Bollig stood out: “start your migration with S3 and focus on S3.” This makes sense. S3 was one of the three “founding” members of AWS in 2006. It is a gateway that connects local, on-premises data with cloud resources: users can upload or synchronize data from their local computers into S3 and then process it with compute services. The processed data can also be transferred back to the local computers. S3 is likely the first service that a new user learns and tests in AWS, and it is perhaps the most frequently used, too.

This is also true for bioinformaticians. In my previous article “Dive into the Sequence Read Archive with AWS Glue and Athena”, I mentioned that the National Center for Biotechnology Information (NCBI) has distributed its Sequence Read Archive (SRA) into AWS. Bioinformaticians can now request SRA to deliver large amounts of sequence data to their personal S3 buckets. My second article “BLAST on the Cloud with NCBI’s ElasticBLAST” shows that ElasticBLAST also requires S3 as the output destination.

Figure 1. FSx for Lustre bridges S3 and compute instances. Image from https://aws.amazon.com/blogs/storage/new-enhancements-for-moving-data-between-amazon-fsx-for-lustre-and-amazon-s3/.

In order to process the S3 data, compute instances such as EC2 first need to access them. Previously I thought that copying the data from S3 to the compute storage such as EBS or EFS was the only way.

The blog post “Using Amazon FSx for Lustre for Genomics Workflows on AWS” and the webinar taught me the second way: FSx for Lustre. This fully managed service can transform an S3 bucket into a scalable file system. This system can be mounted as a folder on a compute instance. Users can run Linux-based applications on the files. And FSx will synchronize the changes between S3 and the compute instance.

In my opinion, FSx is more intuitive than file copying, and its performance is impressive. AWS states that FSx delivers consistent sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and up to millions of IOPS. In this tutorial, I am going to show you how to create an FSx file system and work on it with an EC2 instance.

1. Create an S3 bucket

The journey starts with S3. First, create an S3 bucket in your preferred region. This bucket serves as both our input and output staging area. Upload some files into the bucket as test data.

Figure 2: An S3 bucket with test data. Image by author.
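If you prefer the command line, the same step is a quick sketch with the AWS CLI (the bucket name below is a placeholder and must be globally unique; pick the region where you will later create FSx):

aws s3 mb s3://my-fsx-demo-bucket --region ap-east-1
echo "hello FSx" > test_data.txt
aws s3 cp test_data.txt s3://my-fsx-demo-bucket/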

2. Set up FSx for Lustre

Next, go to the FSx console and create a file system. Choose Amazon FSx for Lustre and give it a name. For this tutorial and most small projects, the minimum capacity of 1.2 TiB is enough. To keep it simple, just use the default VPC and default security group.

To link the S3 bucket, unfold Data Repository Import/Export — optional and check both Import data from and export data to S3 and Update my file and directory listing as objects are added to or changed in my S3 bucket. Fill in the bucket name under Import bucket. Under Export prefix, choose A prefix you specify but clear out the text field (Figure 3).

This configuration gives you a 1:1 mapping between FSx and S3, that is, changes will be written back to the original files on S3. Finally, confirm and create the file system.

Figure 3. Link FSx for Lustre to S3. Image by author.
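For scripted setups, the console clicks above collapse into a single AWS CLI call. Here is a minimal sketch, assuming a scratch deployment type; the subnet ID is a placeholder, and pointing ImportPath and ExportPath at the bucket root is what produces the 1:1 mapping:

# Placeholder IDs; substitute your own subnet, security group, and bucket.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-b09433d3 \
  --lustre-configuration "DeploymentType=SCRATCH_2,ImportPath=s3://my-fsx-demo-bucket,ExportPath=s3://my-fsx-demo-bucket,AutoImportPolicy=NEW_CHANGED"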

It may take some time for AWS to get FSx running. Let’s proceed to the next step during the wait.

3. Set up an EC2 instance

Launch an Amazon Linux EC2 instance. For this tutorial, take the free-tier option. Make sure in steps 3. Configure Instance and 6. Configure Security Group that the EC2 instance is created in the same VPC and security group as the FSx file system. Leave everything else at the default values.

Next, we need to edit the security group to allow FSx traffic inside the network. In the Security tab of the newly created EC2 instance, click the security group's name to open it. Under Edit inbound rules, add an All TCP rule with the Source set to the security group itself (in my case, sg-b09433d3).

Figure 4: Security group settings. Image by author.
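The same inbound rule can also be added from the CLI; a sketch with my security group ID (yours will differ):

aws ec2 authorize-security-group-ingress \
  --group-id sg-b09433d3 \
  --protocol tcp --port 0-65535 \
  --source-group sg-b09433d3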

Afterwards, log in to the EC2 instance via SSH. Install the Lustre client via:

sudo amazon-linux-extras install -y lustre2.10

Now wait until FSx is up and running. Select the FSx file system and click Attach (Figure 5):

Figure 5. FSx attachment. Image by author.

A window will open up. Under Attach instruction — using the default DNS name, copy and run both commands (the second command requires two variables):

sudo mkdir /fsx
sudo mount -t lustre -o noatime,flock [your FSx DNS name]@tcp:/[your FSx Mount name] /fsx

In my case (yours will be different), the second command is:

sudo mount -t lustre -o noatime,flock fs-0cc0beac6723ba4b6.fsx.ap-east-1.amazonaws.com@tcp:/zlhwpbmv /fsx

These commands mount the S3-backed FSx into the /fsx folder. In order to avoid sudo afterwards, we can change the ownership of the /fsx folder to ec2-user via:

sudo chown -R ec2-user:ec2-user /fsx

4. Test the file system on EC2

Amazing! Now we can read and write data in this /fsx folder and the data will be synchronized with the designated S3 bucket. The ls command can confirm that my test data in S3 is indeed present in /fsx!

Figure 6. Show the files from FSx. Image by author.

You can now modify the content of the file, create a new file and so on, as I did here:

Figure 7. Create and modify the contents in FSx. Image by author.
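In shell terms, the steps in Figure 7 amount to something like the following (the file names are illustrative; new_file.txt is the file that gets exported back to S3 in section 5):

cd /fsx
ls -lh                                   # the S3 test data is already here
echo "modified on EC2" >> test_data.txt  # modify an existing file
echo "created on EC2" > new_file.txt     # create a new file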

5. Synchronize the changes back to S3

There is a catch: if you upload new files into your S3 bucket, they will show up right away in the /fsx folder inside your EC2 instance. But if you modify files or create new ones in /fsx, the changes are not automatically synchronized back to S3! To write the changes back, you need to run a data repository task.

In your Amazon FSx console, click open your FSx file system, then click Actions and Export data to repository (Figure 8). In the popup window, click Create data repository task.

Figure 8. Create a data repository task. Image by author.

This kicks off a synchronization process that writes all the changes you made on EC2 back into S3. Once the task is completed, check your S3 bucket. In Figure 9 you can see that my new_file.txt is available in my S3 bucket.

Figure 9. Changes are synchronized back to S3. Image by author.
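For scripted pipelines, the same export can be triggered with the AWS CLI; a sketch using my file system ID from above (yours will differ):

aws fsx create-data-repository-task \
  --file-system-id fs-0cc0beac6723ba4b6 \
  --type EXPORT_TO_REPOSITORY \
  --report Enabled=false

# poll until the task's Lifecycle reaches SUCCEEDED
aws fsx describe-data-repository-tasks \
  --filters Name=file-system-id,Values=fs-0cc0beac6723ba4b6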

Conclusion

Success! In this quick walk-through, you have seen how to “mount” an S3 bucket to an EC2 instance with the help of FSx for Lustre. In fact, not only EC2 but also AWS Batch, ParallelCluster, and SageMaker can work with FSx. It means that we can have the best of both worlds: the high availability of S3 and the high performance of FSx.

Additionally, we can manage the lifecycle, replication, and inventory of our data easily with S3. And we can now run all sorts of computations on these data. Batch data transformation? AWS Batch or ParallelCluster can manage that. Machine learning? SageMaker has got you covered. We can then synchronize the results back and forth between the compute instances and S3.

In my experience, FSx for Lustre does come with a higher price tag than the alternative solutions. I myself just racked up an FSx bill of $6.32 for this two-day experiment.

For example, in the AWS blog post “Using Amazon FSx for Lustre for Genomics Workflows on AWS”, author Lee Pang calculated that his workflow cost $0.12 with FSx while it cost only $0.04 with EBS. But FSx was faster, and he posited that the cost difference would be smaller in production, because real-life projects involve more steps and more data. Be careful: FSx for Lustre is not a serverless service. You will pay the full price for your 1.2 TiB capacity even if you use just a few KB of it.

In addition, FSx for Lustre has no “stop” state, that is, it costs money until you delete the file system. AWS mentions that we can lower the cost by using unreplicated scratch file systems for shorter-term processing of data. We can also use data compression to reduce the storage consumption of both the file system and its backups.
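Compression can be switched on even after creation; a sketch with my file system ID (LZ4 is the compression type FSx for Lustre offers):

aws fsx update-file-system \
  --file-system-id fs-0cc0beac6723ba4b6 \
  --lustre-configuration DataCompressionType=LZ4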

So, is FSx for Lustre worth your time?

I will keep this option in mind, and I will test FSx in my AWS metagenomic binning pipeline, because some I/O-intensive steps such as BLAST can benefit from the high performance of FSx. Previously, in “Parallel BLAST against CAZy with AWS Batch”, I used EFS for the task. So I wonder whether FSx can beat EFS in terms of performance and cost. I hope that a clearer picture of FSx's cost-effectiveness will emerge during the pipeline construction.

Have you used FSx for Lustre? What is your experience with it? Please share your story in the comments!
