How to build a metagenomic binning pipeline on AWS (Part 1)
Set up Amazon FSx for Lustre
Bioinformatics is leaping into the cloud
In the webinar “Scaling genomics workloads using HPC on AWS” on July 14, 2021, I learned that heavyweights such as AstraZeneca and Illumina have already moved their genome analyses into the AWS cloud and have been reaping great benefits ever since. The cloud reduced both runtime and costs dramatically. For example, AstraZeneca sped up its sequence data processing pipeline by 2,400% (a 24-fold speedup) on AWS, while Illumina is saving close to $400,000 in monthly compute and storage costs.
One statement from the webinar lecturer Dr. Evan Bollig stood out: “start your migration with S3 and focus on S3.” This makes sense. S3 was one of the three “founding” members of AWS in 2006. It is a gateway between on-premises data and cloud resources: users can upload or synchronize data from their local computers into S3 and then process them with compute services. The processed data can also be transferred back to the local computers. S3 is likely to be the first service that a new user learns and tests in AWS, and it is perhaps the most frequently used too.
This is also true for bioinformaticians. In my previous article “Dive into the Sequence Read Archive with AWS Glue and Athena”, I mentioned that the National Center for Biotechnology Information (NCBI) has distributed its Sequence Read Archive (SRA) into AWS. Bioinformaticians can now request SRA to deliver large amounts of sequence data to their personal S3 buckets. My second article “BLAST on the Cloud with NCBI’s ElasticBLAST” shows that ElasticBLAST also requires S3 as the output destination.
To process data in S3, compute instances such as EC2 first need to access it. Previously, I thought that copying the data from S3 to compute storage such as EBS or EFS was the only way.
The blog post “Using Amazon FSx for Lustre for Genomics Workflows on AWS” and the webinar taught me the second way: FSx for Lustre. This fully managed service can transform an S3 bucket into a scalable file system. This system can be mounted as a folder on a compute instance. Users can run Linux-based applications on the files. And FSx will synchronize the changes between S3 and the compute instance.
In my opinion, FSx is more intuitive than file copying. Its performance is impressive. AWS states that FSx delivers consistent sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and up to millions of IOPS. In this tutorial, I am going to show you how to create an FSx file system and work on it with an EC2 instance.
1. Create an S3 bucket
The journey starts with S3. First, create an S3 bucket in your preferred region. This bucket serves both as our input and output staging area. Upload a file or two into the bucket as test data.
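If you prefer the command line, this step can be sketched with the AWS CLI. The bucket name and region below are my own placeholders; your bucket name must be globally unique.

```shell
# Create the staging bucket (name and region are placeholders).
aws s3 mb s3://my-fsx-demo-bucket --region us-east-1

# Create a small test file and upload it as input data.
echo "sample sequence data" > test_data.txt
aws s3 cp test_data.txt s3://my-fsx-demo-bucket/
```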
2. Set up FSx for Lustre
Next, go to the FSx console and create a file system. Choose “Amazon FSx for Lustre” and give it a name. For this tutorial and most small projects, a capacity of 1.2 TiB is enough. To keep it simple, just use the default VPC and the default security group.
To link the S3 bucket, unfold “Data Repository Import/Export — optional” and check both “Import data from and export data to S3” and “Update my file and directory listing as objects are added to or changed in my S3 bucket”. Fill in the bucket name under “Import bucket”. Under “Export prefix”, choose “A prefix you specify” but clear out the text field (Figure 3).
This configuration gives you a 1:1 mapping between FSx and S3, that is, changes will be written back to the original files in S3. Finally, confirm and create the file system.
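The same file system can be created from the CLI. This is only a sketch: the subnet, security group, and bucket name are placeholders, and I assume the scratch deployment type used for short-lived workloads. Pointing ImportPath and ExportPath at the same bucket with no prefix produces the 1:1 mapping described above.

```shell
# Create a 1.2 TiB scratch file system linked to the S3 bucket
# (subnet, security group, and bucket name are placeholders).
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids [your subnet id] \
  --security-group-ids [your security group id] \
  --lustre-configuration "DeploymentType=SCRATCH_2,ImportPath=s3://[your bucket name],ExportPath=s3://[your bucket name],AutoImportPolicy=NEW_CHANGED"
```

AutoImportPolicy=NEW_CHANGED corresponds to the “Update my file and directory listing…” checkbox in the console.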
It may take some time for AWS to get FSx running. Let’s proceed to the next step during the wait.
3. Set up an EC2 instance
Launch an Amazon Linux EC2 instance. For this tutorial, take the free-tier option. Make sure in step “3. Configure Instance” and step “6. Configure Security Group” that the EC2 instance is created in the same VPC and security group as the FSx file system. Leave everything else at the default values.
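For reference, the launch can also be done with the CLI. This is a sketch; the AMI id, key pair, subnet, and security group are placeholders, and t2.micro is the free-tier-eligible type I assume here.

```shell
# Launch a free-tier Amazon Linux instance in the same VPC and
# security group as the FSx file system (all ids are placeholders).
aws ec2 run-instances \
  --image-id [your Amazon Linux 2 AMI id] \
  --instance-type t2.micro \
  --key-name [your key pair] \
  --subnet-id [your subnet id] \
  --security-group-ids [your security group id]
```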
Next, we need to edit the security group to allow FSx traffic inside the network. In the “Security” tab of the newly created EC2 instance, click the security group name. Under “Edit inbound rules”, add an “All TCP” rule whose “Source” is the security group itself (in my case, sg-b09433d3).
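The same self-referencing rule can be added from the CLI (the group id below is a placeholder; substitute your own):

```shell
# Allow all TCP traffic from the security group to itself,
# so the FSx and EC2 members of the group can talk to each other.
aws ec2 authorize-security-group-ingress \
  --group-id [your security group id] \
  --protocol tcp \
  --port 0-65535 \
  --source-group [your security group id]
```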
Afterwards, log in to the EC2 instance via SSH and install the Lustre client:
sudo amazon-linux-extras install -y lustre2.10
Now wait until FSx is up and running. Select the FSx file system and click “Attach” (Figure 5). A window will open up. Under “Attach instruction — using the default DNS name”, copy and run both commands (the second command requires two variables):
sudo mkdir /fsx
sudo mount -t lustre -o noatime,flock [your FSx DNS name]@tcp:/[your FSx Mount name] /fsx
In my case (yours will be different), the second command is:
sudo mount -t lustre -o noatime,flock fs-0cc0beac6723ba4b6.fsx.ap-east-1.amazonaws.com@tcp:/zlhwpbmv /fsx
These commands mount the S3-backed FSx file system into the /fsx folder. To avoid sudo afterwards, we can change the ownership of the /fsx folder to ec2-user via:
sudo chown -R ec2-user:ec2-user /fsx
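If you want the mount to survive a reboot, you can add an entry like the following to /etc/fstab (a sketch using the same placeholders as the mount command above; _netdev delays mounting until the network is up):

```
# /etc/fstab entry for the FSx for Lustre mount (placeholders as above)
[your FSx DNS name]@tcp:/[your FSx Mount name] /fsx lustre noatime,flock,_netdev 0 0
```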
4. Test the file system on EC2
Amazing! Now we can read and write data in this /fsx folder and the data will be synchronized with the designated S3 bucket. The ls command confirms that my test data from S3 is indeed present in /fsx! You can now modify the content of a file, create a new file and so on, as I did here:
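The modifications could look like the following sketch. FSX_DIR defaults to a temporary directory so you can dry-run the script anywhere; on the instance, set FSX_DIR=/fsx, and the file names are my own examples.

```shell
# Read/write demo on the FSx mount. FSX_DIR falls back to a scratch
# directory for dry runs; use FSX_DIR=/fsx on the EC2 instance.
FSX_DIR="${FSX_DIR:-$(mktemp -d)}"

echo "appended from EC2" >> "$FSX_DIR/test_data.txt"  # modify an existing file
echo "hello from EC2" > "$FSX_DIR/new_file.txt"       # create a new file
ls -l "$FSX_DIR"                                      # list the folder contents
```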
5. Synchronize the changes back to S3
There is a catch: if you upload new files into your S3 bucket, they will show up right away in the /fsx folder inside your EC2 instance. But if you modify files in /fsx, the changes are not immediately synchronized back to S3! To write back the changes, you need to run a data repository task.
In your Amazon FSx console, open your FSx file system, click “Actions” and then “Export data to repository” (Figure 8). In the popup window, click “Create data repository task”.
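The same export can also be kicked off from the CLI (a sketch; the file system id is a placeholder):

```shell
# Start an export task; the completion report is disabled for simplicity.
aws fsx create-data-repository-task \
  --file-system-id [your file system id] \
  --type EXPORT_TO_REPOSITORY \
  --report Enabled=false

# Poll the task until its Lifecycle reaches SUCCEEDED.
aws fsx describe-data-repository-tasks
```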
This kicks off a synchronization process that writes all the changes you made on EC2 back into S3. Once the task is completed, check your S3 bucket. In Figure 9 you can see that my new_file.txt is available in my S3 bucket.
Conclusion
Success! In this quick walk-through, we have seen how to “mount” an S3 bucket to an EC2 instance with the help of FSx for Lustre. In fact, not only EC2, but also AWS Batch, ParallelCluster and SageMaker can work with FSx. This means that we can have the best of both worlds: the high availability of S3 and the high performance of FSx.
Additionally, we can easily manage the lifecycle, replication and inventory of our data with S3, and we can now run all sorts of computation on the data. Batch data transformation? AWS Batch or ParallelCluster can manage that. Machine learning? SageMaker has you covered. We can then synchronize the results back and forth between compute instances and S3.
In my experience, FSx for Lustre does come with a higher price tag than the alternative solutions. I myself just racked up an FSx bill of $6.32 for this two-day experiment.
For example, in the AWS blog post “Using Amazon FSx for Lustre for Genomics Workflows on AWS”, author Lee Pang calculated that his workflow cost $0.12 with FSx while it cost only $0.04 with EBS. But FSx was faster, and he posited that the cost difference would be smaller in production, because real-life projects involve more steps and more data. Be aware that FSx for Lustre is not a serverless service: you pay the full price for your 1.2 TiB capacity even if you use only a few KB of it.
In addition, FSx for Lustre has no “stopped” state, that is, it costs money until you delete the file system. AWS suggests lowering the cost by using the unreplicated scratch file systems for shorter-term processing of data. We can also enable data compression to reduce the storage consumption of both the file system and its backups.
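Compression can be switched on for an existing file system from the CLI as well (a sketch; the file system id is a placeholder):

```shell
# Enable LZ4 compression; it applies to newly written data.
aws fsx update-file-system \
  --file-system-id [your file system id] \
  --lustre-configuration DataCompressionType=LZ4
```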
So, is FSx for Lustre worth your time?
I will keep this option in mind and test FSx in my AWS metagenomic binning pipeline, because some I/O-intensive steps such as BLAST can benefit from the high performance of FSx. Previously, in “Parallel BLAST against CAZy with AWS Batch”, I used EFS for the task. So I wonder whether FSx can beat EFS in terms of performance and cost. I hope that during the pipeline construction, a clearer picture of the cost-effectiveness of FSx will emerge.
Have you used FSx for Lustre? What is your experience with it? Please share your story in the comments!