- Design and test complex data architectures built on big data platforms such as Airflow, Athena, Spark, and Kafka
- Develop streaming and batch ETLs that transform and slice data from multiple data sources, mostly using AWS services (a minimal batch ETL sketch follows this list)
- Write, test, and optimize complex SQL queries and reports
- Administer, configure, run, and build POCs of managed data services and platforms
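
To give a flavor of the pipeline work, here is a minimal batch ETL sketch in PySpark. Everything in it is illustrative: the bucket names, the event schema, and the aggregation are placeholders, not a description of our actual pipelines.

```python
# Illustrative batch ETL: read raw JSON events from S3, slice out one
# event type, aggregate daily, and write partitioned Parquet back out.
# Bucket names and columns are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-daily-etl").getOrCreate()

# On EMR the s3:// scheme resolves natively; elsewhere you may need s3a://.
events = spark.read.json("s3://example-raw-bucket/events/")

daily = (
    events
    .filter(F.col("event_type") == "purchase")
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "country")
    .agg(
        F.count("*").alias("purchases"),
        F.sum("amount").alias("revenue"),
    )
)

daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-curated-bucket/daily_purchases/"
)
```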
You’re a great fit if:
- You have at least 3 years’ experience as a data engineer developing big data pipelines
- Python dominates your stack, and you are familiar with core data libraries, including PySpark and pandas
- You are familiar with AWS services and components, and with cloud infra design
- SQL is your second language, and you understand the pains and gains of OLTP and OLAP databases and how to improve the performance of a nasty query (see the Athena sketch after this list)
- You read documentation, love to learn new technologies, and run POCs
- You are a team player who can take ownership of complicated tasks, become an expert in your area while mentoring others and sharing your knowledge, and lead group discussions
- You have experience with Python and the relevant ML libraries, as well as with SQL and big data technologies and terminology
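
As a hedged illustration of how the SQL and AWS pieces meet, the sketch below kicks off an Athena query from Python with boto3 and polls for completion. The database, table, and result bucket are made-up placeholders.

```python
# Illustrative only: run an Athena query from Python and wait for a
# terminal state. Database, table, and output bucket are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT event_date, COUNT(*) AS purchases
    FROM analytics.daily_purchases
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY event_date
"""

run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

# Poll until the query succeeds, fails, or is cancelled.
query_id = run["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
```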
Although it’s not mandatory, it’s even nicer if you:
- Have experience running and configuring big data services such as Kafka, Airflow, and Presto, and using K8s and Docker as infra in your projects
- Have experience in Scala, Java, or Go
- Know how to configure and spin up an EMR cluster, run Athena queries, write a Lambda function, create a Spark job, monitor EKS, and write a CloudFormation script (a minimal Lambda handler sketch follows this list)
- Have a degree in CS or a quantitative discipline
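
For the Lambda item above, a handler in this stack can be as small as the sketch below, which logs new S3 objects as they land. The trigger wiring and bucket are assumptions made for the example.

```python
# Minimal AWS Lambda handler sketch for an S3 put-event trigger.
# The bucket and what you do with each object are hypothetical.
import json


def handler(event, context):
    # S3 notifications deliver one or more records per invocation.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")
    return {"statusCode": 200, "body": json.dumps("ok")}
```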