One of the great things about learning data science at Lambda School is that after all of the sprint challenges, assessments, and code challenges, you still have to prove your acumen by working on a real-world, cross-functional project. They call this portion of the program Lambda Labs, and in my Labs experience, I got to work on a project called Citrics. The idea for this project was to solve a problem faced by nomads (people who move frequently), which was the cumbersome nature of trying to compare various statistics for cities throughout the US.
Imagine you were going to live in three different cities over the next three years: how would you choose where to go? You might want to know what rental prices looked like, which job industry was most prevalent, or maybe even how “walkable” a city was. The truth is, there are probably lots of things we’d like to know before moving, but we probably don’t have hours and hours to research 10 different websites for the answers. That’s where Citrics comes in.
As a data scientist, the big-picture task for my team was to source and wrangle data for these cities and deploy an API that our front-end team could use to satisfy end-user search requests. While this may sound simple enough, my first concern going into the project was the wrangling piece, because different sources of data may use different naming conventions for cities. Consider examples like:

- Fort Lauderdale vs. Ft. Lauderdale, or
- Saint Paul vs. St. Paul.

We knew intensive data cleaning would be necessary to ensure data integrity and continuity between each of our sources. The other initial concern was the deployment of the API: our stakeholder expected AWS deployment, but each data scientist on our team of four only had experience with Heroku. In this post, I’ll walk through our process of tackling these problems, along with nuggets on all the exciting tools AWS (and FastAPI) gives developers when it comes to creating and deploying ETL pipelines.
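To make the naming-convention problem concrete, here is a minimal sketch of the kind of normalization step involved. The abbreviation map and the `normalize_city` function are illustrative assumptions, not our project’s actual cleaning code:

```python
# Hypothetical abbreviation map; a real pipeline would need a
# larger, source-aware mapping (assumed for illustration only).
ABBREVIATIONS = {
    "Ft.": "Fort",
    "Ft": "Fort",
    "St.": "Saint",
    "St": "Saint",
}

def normalize_city(name: str) -> str:
    """Expand common abbreviations so the same city matches across sources."""
    words = name.strip().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

print(normalize_city("Ft. Lauderdale"))  # Fort Lauderdale
print(normalize_city("St. Paul"))        # Saint Paul
```

Applying a step like this to every source before joining is one way to keep “Ft. Lauderdale” in one dataset from silently failing to match “Fort Lauderdale” in another.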