Recently, I wrote a tool called “universe-lite”: https://github.com/GuandataOSS/universe-lite.
It is a lightweight ELT & ETL tool based on DuckDB and Apache Parquet, with seamless integration with Python & Java plugins. It also describes ETL steps in a plain config file (Typesafe Config format).
Since I already have a hammer, I want to find some nails to practice on.
In my daily work, I use Apache Spark a lot. So my first idea was to find out: for those who starred the Apache Spark GitHub repo, which other repos are they interested in? And are there any great projects that I haven’t learned about yet?
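To make the question concrete, here is a minimal Python sketch of the kind of counting involved. It is only an illustration, not how universe-lite does it: the function name `co_starred_repos`, the `(user, repo)` pair layout, and the sample rows are all assumptions made up for this example.

```python
from collections import Counter


def co_starred_repos(stars, target, top_n=5):
    """Rank repos by how many stargazers of `target` also starred them.

    `stars` is assumed to be an iterable of (user, repo) pairs.
    """
    # Deduplicate rows so a repeated (user, repo) pair is counted once.
    unique_stars = set(stars)
    # Users who starred the target repo.
    fans = {user for user, repo in unique_stars if repo == target}
    # Count the other repos those users starred.
    counts = Counter(
        repo for user, repo in unique_stars if user in fans and repo != target
    )
    return counts.most_common(top_n)


# Made-up sample data, for illustration only.
sample = [
    ("alice", "apache/spark"), ("alice", "apache/flink"),
    ("bob", "apache/spark"), ("bob", "apache/flink"), ("bob", "duckdb/duckdb"),
    ("carol", "duckdb/duckdb"),
]
print(co_starred_repos(sample, "apache/spark"))
```

On the real dataset the same idea would more naturally be a GROUP BY query, but the logic is the same: find the users who starred Spark, then count what else they starred.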
So, let’s start. The goal is clear, but the road is tough. Still, “there are more solutions than problems”: after solving several unexpected issues, I finally got all the data, a dataset with 12,115,030 rows. (I will talk about the “universe-lite” tool and how to use it to get GitHub data later.)
#github #data-visualization #apache-spark