Six Essential Tools for Your Data Science Environment
Thanks to the incredibly active Data Science community, open source tools and libraries have become more trusted, reliable and scalable. These open source tools play an integral part in the Python Data Science environment, providing robust, cost-effective and accessible technologies to support your data-driven initiatives.
Drawing from our Applied Data Science alumni and industry partners, we've put together a two-part series on widely-used tools and libraries for Data Scientists. The goal of this article is to provide an overview of tools for daily Data Science activities and tools for deploying your model into production. Then we'll cover key Data Science libraries in part two.
The Data Science tools covered in this article are: Anaconda, PyCharm, Keras, AWS, Flask and H2O.ai.
Anaconda is a software distribution that makes package management a lot easier for machine learning. Anaconda comes with an extensive collection of pre-compiled packages for classic data mining (numpy, scipy, matplotlib and pandas) - eliminating the need to pip install each library. Anaconda also includes a package manager called Conda for creating your virtual environment and for installing additional libraries, including those developed outside the Python ecosystem. https://www.anaconda.com/distribution/
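As a sketch of how Conda manages environments, dependencies can be declared in an `environment.yml` file and created with `conda env create -f environment.yml` (the environment name and pinned Python version below are illustrative, not from the article):

```yaml
# environment.yml — illustrative example
name: ds-env
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy
  - scipy
  - pandas
  - matplotlib
```

Activating the environment with `conda activate ds-env` then gives you an isolated workspace with the classic data mining stack pre-installed.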
Deep Learning is popular, obviously. However, in day-to-day activities it is not always the most useful approach. Classic data mining techniques (classification, regression, decision algorithms) have broader applicability, from "standard" business intelligence to the analysis of engineering processes.
Filippo Spiga, Staff Research Engineer, Arm
PyCharm is a popular IDE for Python programming that comes with optimised code assistance and code intention tools. Like many IDEs, it provides code completion, type hints and syntax highlighting that save a lot of time and effort. PyCharm also offers a number of useful features, including a debugger, suggested fixes and improvements to your code aligned with PEP 8 guidelines, and numerous built-in testing frameworks. Additionally, PyCharm integrates easily with Git for source code management and has a useful local history feature that auto-saves your local changes even when they are not committed. https://www.jetbrains.com/pycharm/
If you are exploring Deep Learning, Keras is a great place to start. In the post ‘Standardizing on Keras’ by TensorFlow, Keras is described as “an API standard for defining and training machine learning models.” A core benefit is that Keras is very easy to experiment with, providing you with an intuitive interface for developing models. When using Keras, you simply need to define your configuration (specifying the number of neural layers, the activation functions, the optimisation method to use) and Keras will build the model. So you can quickly prototype ideas, create different configurations to compare how different models perform, and even switch between back ends such as TensorFlow, CNTK or Theano to test which one works better. https://keras.io
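As a minimal sketch of this workflow (the layer sizes, activations and input shape here are arbitrary choices, not from the article), a small binary classifier can be defined and compiled in a few lines:

```python
from tensorflow import keras

# Define the configuration: two dense layers with chosen activations.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Specify the optimisation method and loss; Keras builds the model.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```

From here, calling `model.fit()` trains the model, and swapping layers in the `Sequential` list is all it takes to try a different configuration.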
Cloud platforms such as AWS provide a more scalable development environment. AWS offers several GPU-optimised instance types through its Elastic Compute Cloud (EC2) to support compute-intensive applications and provide faster performance. In turn, this cloud infrastructure will help to reduce cost and accelerate time-to-value. https://aws.amazon.com/getting-started/
TensorFlow, Google Cloud and MS Azure are of high interest. Applications range from linear standalone models to aid with model selection, training and evaluation, to distributed DNN models to scale and deploy, and everything in between. I see the trend and market needs moving towards all commercially successful entities having mandatory Machine Learning capabilities to improve the usability and relevance of their offerings.
Jad Hinawi, CTO
Flask is an application framework that can be easily loaded into your Python code. To take your project into production, you can turn the Python script you have written locally into a simple API using the Flask framework. Then, once you have an API, you can deploy it to your existing infrastructure on any cloud hosting provider. Flask has updated its documentation with its recent release of Flask 1.0, including a tutorial on how to “Deploy to Production” http://flask.pocoo.org/docs/1.0/tutorial/deploy/ and a list of recommended “Deployment Options” to host your API. http://flask.pocoo.org/docs/1.0/deploying/.
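A minimal sketch of wrapping a locally developed prediction function in a Flask API might look like this (the `/predict` route and the doubling "model" are purely illustrative stand-ins for your own code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_value(x):
    # Stand-in for a trained model's predict method.
    return 2 * x

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"value": 3}.
    payload = request.get_json()
    return jsonify({"prediction": predict_value(payload["value"])})
```

Once the app is running (for example via `flask run`), a POST request to `/predict` with a JSON body returns the prediction, and the same app can be deployed using any of the hosting options Flask documents.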
If you are looking to speed up your end-to-end data science processes, from model development and training through to deployment in production, H2O is a good tool to check out. H2O is a semi-open source, distributed machine learning platform for creating models and deploying them into production. Its machine learning environment comes with a suite of supervised and unsupervised learning algorithms for fast model development, and provides a selection of design patterns for training and scoring your model for deployment. http://docs.h2o.ai/steam/latest-stable/index.html
I'm mostly interested these days in end-to-end data science platforms. The rationale is simple: they are an enabler for all of us (Data Science and Tech) to develop and deploy faster, even with complicated models. Look at H2O Steam, for example (http://docs.h2o.ai/steam/latest-stable/index.html): it can deploy Data Scientists' code as a REST service with minimal hassle, to be leveraged right away in a real-time scoring platform.
David Illes, Analytics Technology Lead, Investment Banking
Thanks for reading! This post covered six essential tools for your Data Science environment. Next, in part two we will cover a series of widely-used Data Science libraries - stay tuned!
To go deeper, take a look at our onsite Data Science courses designed to equip your team with the most relevant Data Science and Machine Learning skills for your business needs. If you are looking to upskill your organisation, get in touch using the form below and we'll give you a call to discuss your objectives and how we can support your Data Science initiatives.
Please complete the form to find out more about how we can help with your training requirements.
Our team will be in touch within two working days to follow up.