Abeer Yosef
3 min read · Jun 9, 2021

In a 3-minute read you will learn how PyFlink works and the concepts behind it.

Don’t worry, I will not explain the main concepts of Apache Flink :) they are already covered in the Flink documentation (https://ci.apache.org/projects/flink/flink-docs-stable/). Instead, I will give you a brief overview of how PyFlink works and why it is worth learning PyFlink rather than learning Flink in Java or Scala.

What is PyFlink?

PyFlink is simply a mixture of Apache Flink and Python, or we can call it Flink on Python. By combining Apache Flink with Python, all of Flink’s features become available to Python users. In addition, the computing capabilities of Python’s extensive ecosystem, which includes many big data and data science libraries, become available on Flink.

PyFlink: combining Apache Flink with Python
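To get a feel for what this looks like in practice, here is a minimal sketch of a PyFlink Table API job. It assumes a recent PyFlink release (1.13 or later) installed via pip install apache-flink; the table and query are only illustrative.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# The TableEnvironment is PyFlink's entry point; under the hood it drives
# Flink's Java engine from Python.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Build a small in-memory table and run a SQL-style word count on it.
words = t_env.from_elements([("hello",), ("world",), ("hello",)], ["word"])
t_env.create_temporary_view("words", words)

result = t_env.sql_query(
    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word"
)

# Execute the job and print the result locally.
result.execute().print()
```

Everything here, the environment, the table, and the SQL query, has a direct counterpart in the Java Table API.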

Why does Flink support Python rather than other data science languages such as R?

Statistics show that Python is the top language for data science and machine learning, and it is the most popular choice according to IEEE Spectrum. Python is also about to take over the first position in the TIOBE index (https://www.tiobe.com/tiobe-index/). Although Java and Scala are Flink’s default languages, it seems crucial for Flink to support Python.

Python ranking on TIOBE

The PyFlink community has two main objectives for PyFlink. The first is to make all Flink features available to Python users, such as event-driven applications, streaming analytics, and ETL. The second is to make Python’s scientific computing functions available on Flink. In this way, the two ecosystems benefit from each other.

PyFlink objectives

How are the two objectives achieved?

Objective 1: make Flink features available to Python users. Attempts were made in Flink 1.8 and earlier to build a separate Python engine on Flink, like the one provided for Java, but that approach did not work well. Fortunately, there is a simpler way to use Flink’s features from Python: provide a layer of Python APIs that reuse, by calling into, the existing Java computing engine.

To reuse the current Flink Java engine, a handshake between the Python Virtual Machine (PyVM) and the Java Virtual Machine (JVM) must be established. Py4J is the dedicated solution for establishing this connection.

Objective 1: to make Flink features available to Python users
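To make the handshake concrete, here is a hedged, minimal Py4J sketch, independent of Flink, showing how a Python process calls into a running JVM. It assumes a Py4J GatewayServer is already running on the Java side; when you use PyFlink you never start one yourself, the framework does it when you create a TableEnvironment.

```python
from py4j.java_gateway import JavaGateway

# Assumes a Py4J GatewayServer is already listening on the Java side.
gateway = JavaGateway()   # connect to the JVM over a local socket
jvm = gateway.jvm         # a view of the Java world

# Plain Java classes can now be used as if they were Python objects.
random = jvm.java.util.Random()
print(random.nextInt(10))
```

This is essentially what PyFlink’s Python APIs do: your Python calls are forwarded through Py4J to the corresponding Java objects, and the actual computation stays in the JVM.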

Objective 2: make Python functions (such as scikit-learn, NumPy, Pandas, Matplotlib, etc.) available on Flink. Python library functions are wrapped as user-defined functions (UDFs) so they can be integrated into Flink.

To execute these functions as UDFs, Apache Beam comes into play as a portability framework, thanks to its ability to set up the Python user-defined function execution environment and manage Python’s dependencies.

Objective 2: to make all Python functions available on Flink
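As an illustration, here is a minimal sketch of a Python UDF in PyFlink (assuming PyFlink 1.13 or later). The function body runs in a Python worker process managed through the Beam portability layer rather than inside the JVM; the function and data are only examples.

```python
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# A scalar Python UDF: Flink ships the data to a Python worker, runs this
# function there, and brings the results back into the Java engine.
@udf(result_type=DataTypes.BIGINT())
def add_one(x):
    return x + 1

t_env.create_temporary_function("add_one", add_one)

nums = t_env.from_elements([(1,), (2,), (3,)], ["x"])
t_env.create_temporary_view("nums", nums)

t_env.sql_query("SELECT x, add_one(x) AS y FROM nums").execute().print()
```

Passing func_type="pandas" to the udf decorator turns this into a vectorized UDF that receives batches as pandas.Series, which is how libraries such as Pandas and NumPy plug in efficiently.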

Conclusion

It is worth learning PyFlink because the Python community is growing rapidly, and PyFlink is expected to overtake the Java and Scala APIs in popularity, especially since Python has many library functions that make life easier for a big data developer. Despite its importance, PyFlink still suffers from a lack of documentation, tutorials, and examples.

Hope this is useful.

If you like this article and would like to connect with me, it would be my pleasure if you checked my GitHub repositories; you will find repositories for different machine learning applications in NLP and Computer Vision.

GitHub: https://github.com/abeermohamed1

Best regards

Abeer Youssef

Reference: https://www.alibabacloud.com/blog/the-flink-ecosystem-a-quick-start-to-pyflink_596150