Calling Python dead for data science may seem a strange thing to say. Today Python is the language of data science, and it has never been more popular: as of November 2020 it beat out perennial favorite Java to take the number two spot. I always defend Python as the language every data scientist should know. Most of the work in data science is implementation and execution, and Python excels here. It has excellent package support and lets you glue together different pieces of code and rapidly get them up and running. There are also many articles and papers about data science that supply Python code, which helps you quickly solve tricky issues when developing. All of this means you can be extremely productive in a way that is difficult to achieve in other languages. So don’t get me wrong, I love Python, but it has shortcomings, and it’s these that have caused Google to move on.
The biggest issue with Python is that it is not a high-performance language. Data science requires processing a lot of data, and because of design decisions made long ago, Python just doesn’t scale. To work around this, developers mix other languages like C and Fortran into their Python code. This is very pragmatic, but it has its downsides. If you have ever tried to understand exactly what happens in a function from NumPy, TensorFlow, or most other data science packages, you eventually hit the “C wall”. It’s the point where the Python code ends and the C code begins, and it’s usually at this point that you give up and accept that you don’t really know what is going on. Sure, you could find the source on GitHub and possibly set up another debugger to walk through the function and any others it calls, but this already sacrifices the biggest advantage Python gives you: productivity.
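You can run into the “C wall” without even installing NumPy; the standard library is enough. A minimal sketch using `inspect.getsource` (the `greet` function here is just an illustrative placeholder): it can show you the source of any pure-Python function, but for a builtin implemented in C there is simply no Python source to show, and it raises `TypeError`.

```python
import inspect

# A pure-Python function: its source is right there to read.
def greet():
    return "hello"

print(inspect.getsource(greet))

# A C-implemented builtin like len(): inspect hits the "C wall".
# There is no Python source to retrieve, so getsource raises TypeError.
try:
    inspect.getsource(len)
except TypeError as err:
    print("No Python source available:", err)
```

The same thing happens, just several call frames deeper, when you try to step into a NumPy or TensorFlow routine with a debugger.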
Trying to parallelize anything substantial will also kill your productivity, because you go up against the GIL. The GIL (Global Interpreter Lock) blocks all but one thread from running Python code at a time, which means you have to do nasty things like forking your processes. Python does provide nice wrappers to achieve this, but at the end of the day forking is not efficient.
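Those “nice wrappers” live in the standard library’s `multiprocessing` module. A minimal sketch (the function names are illustrative) of parallelizing CPU-bound work across forked processes:

```python
from multiprocessing import Pool

def cpu_bound(n):
    # CPU-heavy work: threads would serialize on the GIL here,
    # so we pay the cost of spawning whole worker processes instead.
    return sum(i * i for i in range(n))

def run_in_processes(inputs, workers=4):
    # Each worker is a separate process with its own interpreter
    # and its own GIL, so the work genuinely runs in parallel.
    with Pool(workers) as pool:
        return pool.map(cpu_bound, inputs)

if __name__ == "__main__":
    print(run_in_processes([100_000, 200_000, 300_000, 400_000]))
```

The catch is exactly the inefficiency mentioned above: every worker is a full process, and the inputs and results of `pool.map` must be pickled and copied between them, overhead that a true shared-memory threading model would avoid.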
Performance also poses a challenge as data science problems find their way onto mobile and IoT platforms. If you want to run a predictive model on an IoT sensor you will likely develop and train that model in Python but deploy it in another language due to the sensor’s limited resources. Training and deploying in different languages increases the gap between R&D and production and the probability of something going wrong.
These issues and others make it difficult to push the boundaries of machine learning, and they are why Google is developing Swift for TensorFlow. When I first heard about it, I thought it was just about running TensorFlow models in iOS apps. It seems Google has much bigger ambitions for Swift. Google goes as far as to call it “a next generation system for deep learning and differentiable computing”. The key words there are “next generation”, as in after Python. They even detail why Python is not a good language for TensorFlow.
In comparison to Python, Swift is fast. Fast enough that the TensorFlow team believes that soon “Swift will be a credible replacement for many uses of C++”. Python is the primary language of TensorFlow, yet according to GitHub it is only 26% of the code base, while C++ is 61%. Swift also supports better parallelization through low-level access to pthreads, or you can use Swift’s GCD (Grand Central Dispatch). There has also been a push to add a new concurrency model to further improve scaling. Finally, Swift is designed to run on mobile devices such as iOS, and it runs faster and with less memory than Python on IoT platforms like the Raspberry Pi.
Google has taken several steps to ease the transition for data scientists to the “next-gen platform”. NumPy is a staple of the data science community and the foundation of many packages, and Google has been reimplementing it as TensorFlow NumPy, which offers advantages over classic NumPy such as GPU acceleration. For other packages, you can use PythonKit to call them as if they were made for Swift, so all those articles and papers out there that supply Python code can be reused inside your Swift program. And guess what: PythonKit traces its roots to Google and the TensorFlow team. Google even made sure that you can use Swift in Jupyter notebooks, so you don’t have to change your workflow.
Now I’m not seriously suggesting that Google is out there seeking to kill off Python, but it’s obvious that they have found the limits of the language for data science. Given their tremendous investment in machine learning, it makes sense that they would reach those limits before anyone else. Google and TensorFlow are moving on from Python, and eventually you will too. The only question is when.
For a deeper dive into the technical details, see this article from the Google TensorFlow team.