Recently, I wrote a tool called “universe-lite”.

It is a lightweight ELT & ETL tool based on DuckDB and Apache Parquet, with seamless integration of Python & Java plugins. ETL steps are described in a plain config file (Typesafe Config format).

Now that I have a hammer, I want to find some nails to practice on.

In my daily work, I use Apache Spark a lot. So my first idea was to find out: among the people who starred the Apache Spark GitHub repo, which other repos are they interested in? And are there any great projects that I haven’t come across yet?

So, let’s start. The goal was clear, but the road was tough. Still, “there are more solutions than problems”: after solving several unexpected issues, I finally got all the data, a dataset with 12,115,030 rows. (I will talk about the “universe-lite” tool, and how to use it to get GitHub data, later.)

The analysis covers four questions:

1. How Spark’s star count has changed over time

2. Which other projects were starred at the same time

3. Which other “Spark-related” projects were starred

4. Among people who starred Spark, how the total number of projects each person has starred is distributed
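To make the core question concrete, here is a minimal sketch in plain Python of how “which other repos did Spark’s stargazers star?” and the per-person star distribution in item 4 could be computed. It is not how universe-lite does it: the `(user, repo)` row shape, the sample rows, and the function names are all assumptions made for illustration, and timestamps are ignored.

```python
from collections import Counter

# Hypothetical sample rows: (user, repo). The real dataset's columns are not
# described in this post, so this shape is an assumption.
stars = [
    ("alice", "apache/spark"),
    ("alice", "apache/kafka"),
    ("alice", "apache/flink"),
    ("bob", "apache/spark"),
    ("bob", "apache/kafka"),
    ("carol", "apache/kafka"),
]

TARGET = "apache/spark"


def co_starred(rows, target):
    """Count how many stargazers of `target` also starred each other repo."""
    fans = {user for user, repo in rows if repo == target}
    # Use a set of pairs so a duplicated star row is not counted twice.
    pairs = {(user, repo) for user, repo in rows if user in fans and repo != target}
    return Counter(repo for _, repo in pairs)


def total_stars_distribution(rows, target):
    """Map 'total repos starred' -> number of `target` stargazers with that total."""
    fans = {user for user, repo in rows if repo == target}
    per_user = Counter(user for user, _ in set(rows) if user in fans)
    return Counter(per_user.values())


also_starred = co_starred(stars, TARGET)
distribution = total_stars_distribution(stars, TARGET)
print(also_starred.most_common())
print(sorted(distribution.items()))
```

On a dataset of millions of rows, the same grouping would more naturally be written as a SQL aggregation; the Python version is only meant to show the logic.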

#github #data-visualization #apache-spark

Apache Spark : for Those Who Starred Spark in Github, What Else Projects Were Starred?