Most Popular Tools for Data Scientists in 2020

As the world entered the era of Big Data, the first challenge was storing that data, and technologies like Hadoop were developed to address it. Big data is a term for collections of data that are huge in size and still growing exponentially over time. Once storage was solved, the next problem was processing such volumes. This is where Data Science comes into the picture: it processes and analyzes data to derive insights that benefit many aspects of human life. In business it can be used to forecast product sales and predict stock prices. In healthcare it can help diagnose diseases and identify drugs that are effective against them. It can also be used to conserve wildlife, predict natural calamities and save lives, and plan how infrastructure should be built to provide optimum resources to the population.

More recently, Data Science has been used to tackle COVID-19 in many ways: to determine the structure of the virus, to assess the efficacy of different drugs against it, to identify containment zones, to analyze data for vaccine development, and to trace the spread of the virus.

Wikipedia defines data science as follows:

"Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data."

Some of the main aspects of data science are:

  • Predictive causal analytics, which deals with predicting future values such as stock prices and sales.
  • Prescriptive analytics, a relatively new field that deals with recommending actions. An example is a self-driving car, where the model tells the car when to turn, when to brake, and when to accelerate.
  • Machine learning for pattern discovery, which uses unsupervised ML algorithms to find clusters, detect anomalies, and so on.

Many people think that Data Science just means training Machine Learning models, but it is an amalgamation of various fields. A data scientist needs knowledge of statistics, cloud technologies, coding, and databases. As technology changes, knowledge of DevOps for Machine Learning (known as MLOps) and of AutoML is also becoming necessary.

Kaggle, the largest community of data scientists, conducted a survey, and based on that survey I am presenting a list of the tools and technologies widely used in different domains of Data Science. Feel free to add other popular tools that are not mentioned in this list.

Machine Learning Frameworks

Machine Learning is one of the core technologies associated with Data Science. Python and R are the most widely used languages for ML. The most popular frameworks, namely scikit-learn, TensorFlow, and PyTorch, are based on Python. Deep Learning is a subset of Machine Learning, which in turn is an integral part of Artificial Intelligence.

The best-known basic machine learning algorithms include supervised algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Support Vector Machines, k-nearest neighbors, Decision Trees, and Random Forests, and unsupervised algorithms such as k-means clustering. Ensembling techniques, namely bagging, boosting, and stacking, are then used to combine these models.

Scikit-learn provides all of these basic models for supervised and unsupervised machine learning, along with metrics for measuring model performance. For classification tasks these include accuracy, precision, recall, F1-score, area under the ROC curve, and the confusion matrix; for regression tasks they include MAE (mean absolute error), MSE (mean squared error), and the R2 score.
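To make these metrics concrete, here is a minimal sketch that computes several of them by hand in plain Python. The function names and the toy labels are made up for illustration; in practice you would normally use the equivalents in `sklearn.metrics` (for example `accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `mean_absolute_error`, `mean_squared_error`, and `r2_score`) rather than writing your own.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred):
    """MAE, MSE and R2 for a regression task."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "r2": r2}


# Toy labels and predictions, invented for this example.
clf = classification_metrics([1, 0, 1, 1, 0, 1, 0, 0],
                             [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0],
                         [2.5, 5.0, 7.5, 10.0])
print(clf)
print(reg)
```

The sketch covers only the binary case and skips the area under the ROC curve, which needs predicted scores rather than hard labels.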

LightGBM (Light Gradient Boosting Machine) and CatBoost are frameworks for gradient boosting, a popular ensembling technique. TensorFlow, PyTorch, and Keras are frameworks for implementing Deep Learning models. They can be used to build various types of neural networks, such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformers.

The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles. It is written for the R language and is one of R's best-known frameworks.

  • Scikit-learn
  • TensorFlow
  • Keras
  • XGBoost
  • PyTorch
  • LightGBM
  • Caret
  • CatBoost
  • Prophet
  • Fast.ai
  • Tidymodels
  • H2O 3
  • MXNet
  • JAX

Enterprise Machine Learning Tools

Most beginners don't know about the tools for ML on the cloud. As a beginner in machine learning you work with small datasets and relatively simple algorithms, but as you gain experience the datasets become immense and the algorithms more and more complex. GPT-3, a language model that was one of this year's breakthroughs in ML and specifically in Natural Language Processing (NLP), is a good example. At its release it was the largest language model, with 175 billion parameters trained on an enormous dataset. It has been estimated that training GPT-3 on a single Tesla V100, one of the fastest GPUs on the market, would take 355 years. Problems of this kind require a different kind of solution: running the ML algorithms on the cloud and also using the cloud to store the data. Enterprise machine learning tools exist to solve such complex problems and are proprietary to big tech companies.
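The 355-year figure above can be roughly reproduced with back-of-the-envelope arithmetic. The two inputs in this sketch are assumptions that are not stated in this article: a total training compute of about 3.14e23 floating-point operations for GPT-3, and a sustained throughput of about 28 TFLOPS for one V100. Both are commonly quoted estimates, so treat the result as an order-of-magnitude illustration and not a measurement.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_years(total_flop, flop_per_second_per_gpu, n_gpus=1):
    """Estimate wall-clock training time in years.

    Assumes perfect linear scaling across GPUs, which real clusters
    do not achieve, so multi-GPU results are optimistic.
    """
    seconds = total_flop / (flop_per_second_per_gpu * n_gpus)
    return seconds / SECONDS_PER_YEAR


GPT3_TOTAL_FLOP = 3.14e23    # assumed total training compute
V100_FLOP_PER_SECOND = 28e12  # assumed sustained V100 throughput

single_gpu_years = training_years(GPT3_TOTAL_FLOP, V100_FLOP_PER_SECOND)
cluster_years = training_years(GPT3_TOTAL_FLOP, V100_FLOP_PER_SECOND, n_gpus=1000)

print(f"one GPU:    about {single_gpu_years:.1f} years")
print(f"1,000 GPUs: about {cluster_years * 365:.0f} days (ideal scaling)")
```

With these assumed inputs the single-GPU estimate lands between 355 and 356 years, which is why training at this scale is spread over large cloud clusters.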

Amazon and Google have the biggest shares in the field of enterprise ML tools. The most popular tool is Amazon SageMaker, a cloud machine-learning platform launched in November 2017. SageMaker enables developers to create, train, and deploy machine-learning models in the cloud, and also to deploy them on embedded systems and edge devices. On Google Cloud, AI Platform is a managed service for building machine learning models that work on any type of data, of any size, and the gcloud ml-engine command group lets you manage AI Platform jobs and training models.

These are some of the best enterprise ML tools from software giants like Google, Amazon, and Microsoft.

  • Amazon SageMaker
  • Google Cloud ML Engine
  • Azure Machine Learning Studio
  • Google Cloud Vision AI
  • Google Cloud Natural Language
  • Azure Cognitive Services
  • Amazon Rekognition
  • Google Cloud Video AI
  • Amazon Forecast

Business Intelligence Tools

Business Intelligence (BI) means analyzing a company's data to produce reports and to forecast sales and markets. It is a set of processes, architectures, and technologies that convert raw data into meaningful information that drives profitable business actions, delivered as a suite of software and services. It is one of the most popular use cases of data science and deals mainly with statistics and data visualization. It has three integral parts:

  • Data gathering
  • Data storage
  • Knowledge management

Tableau is the best-known business intelligence tool and is widely used in the community. It is owned by Salesforce, which acquired Tableau Software in 2019, and it focuses on interactive data visualization. The tech giants in the world of BI are Microsoft, Google, Amazon, Salesforce, and SAP. Some of the most popular Business Intelligence tools used in the industry are listed below.

  • Tableau
  • Microsoft Power BI
  • Google Data Studio
  • Qlik
  • Amazon QuickSight
  • Salesforce
  • Looker
  • Alteryx
  • SAP Analytics Cloud
  • TIBCO Spotfire
  • Sisense
  • Einstein Analytics
  • Domo

Databases Used

Databases form an important part of Data Science, because without data there is no data science. A database is a collection of information that is organized so that it can be easily accessed, managed, and updated. Computer databases typically contain aggregations of data records or files, such as information about sales transactions or interactions with specific customers. Nowadays data is often stored in the cloud, because such huge volumes are impractical to store on local systems and cloud storage can be cheaper.

Databases are mainly divided into two types: relational and non-relational. Relational databases store data in tabular form with a fixed schema. The tables, known as relations, consist of rows and columns; the rows are called records or tuples, and the columns are called attributes. Relational databases guarantee ACID properties (atomicity, consistency, isolation, durability) and allow joins across tables. Non-relational databases do not require a fixed schema and do not store data in tabular format. Many of them also relax ACID guarantees. Both types have their own advantages and disadvantages.
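The relational ideas above (a fixed schema, rows and columns, joins, and all-or-nothing transactions) can be sketched with SQLite through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Fixed schema: every row in a table has the same typed columns.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL CHECK (amount > 0)
    )
""")

# "with conn" commits on success and rolls back if an exception is raised.
with conn:
    conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
                     [(1, "Asha"), (2, "Ben")])
    conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                     [(1, 20.0), (1, 35.5), (2, 12.0)])

# Atomicity: the second insert violates the CHECK constraint,
# so the first insert in the same transaction is rolled back too.
try:
    with conn:
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (2, 50.0))
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (2, -5.0))
except sqlite3.IntegrityError:
    pass

# Join the two relations and aggregate per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.name
""").fetchall()

for name, order_count, total_amount in totals:
    print(name, order_count, total_amount)
```

Ben still has a single order after the failed transaction, because the valid 50.0 insert was undone together with the invalid one.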

The well-known relational databases are MySQL, Oracle, PostgreSQL, IBM DB2, and Microsoft SQL Server. The well-known non-relational databases are MongoDB, Cassandra, Redis, Memcached, and Amazon DynamoDB. The databases used by data scientists all over the world are as follows.

  • MySQL
  • PostgreSQL
  • Microsoft SQL Server
  • MongoDB
  • SQLite
  • Google Cloud BigQuery
  • Oracle Database
  • Amazon Redshift
  • Microsoft Azure Data Lake Storage
  • Amazon Athena
  • Snowflake
  • Amazon DynamoDB
  • Microsoft Access
  • IBM DB2
  • Google Cloud Firestore

Automated Machine Learning (AutoML)

AutoML is one of the most promising technologies of the modern era and is growing at a rapid pace. The idea of AutoML is to automate and optimize the whole pipeline of a data science project by simplifying stages such as data pre-processing, feature engineering, feature extraction, and feature selection. It is built on the premise that Machine Learning should be accessible to non-experts as well: when the stages of the ML pipeline are simplified, they can be executed easily by people outside the data science domain.

Some of the advantages of AutoML are:

Less expertise needed for data preparation: Cleaning data (filtering noise) and formatting it (for example, encoding categorical values) requires a good background in data preparation. AutoML accelerates this phase by automating how data is formatted and how noise is detected.

Avoiding default model parameters: Searching for the best hyperparameters requires knowledge of Grid Search and Random Search, tuning techniques that try a list of candidate settings and then choose the best ones. This whole process can be time-consuming, which is why AutoML is used to automate it.

Simpler model creation and management: Usually, the data scientist draws up a list of candidate models according to the context and the problem. This requires deep knowledge and business expertise in the field of data. AutoML makes this step easier because its pipeline tries many models that suit most problems.

Deep Learning (DL) optimization: Deep Learning uses neural networks, loosely inspired by the human brain, to process data and find patterns used in decision making. To apply it, we have to look for the best neural network architecture for the specific problem. With Keras, an open source library for Deep Learning, finding the best architecture by hand takes many lines of code. With Auto-Keras, an AutoML library for Deep Learning, we can now often obtain a good result with far fewer lines.
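To show what the Grid Search and Random Search techniques mentioned above do, here is a minimal sketch in plain Python. The `validation_score` function is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score", and the hyperparameter names and ranges are invented for illustration. Libraries such as scikit-learn ship ready-made versions of both searches.

```python
import itertools
import random


def validation_score(learning_rate, max_depth):
    """Toy stand-in for training a model and scoring it (higher is better)."""
    return -100 * (learning_rate - 0.1) ** 2 - 0.01 * (max_depth - 6) ** 2


def grid_search(grid):
    """Try every combination of the listed values and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best of n_trials."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "max_depth": rng.randint(2, 10),
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [2, 4, 6, 8]}
grid_best, grid_score = grid_search(grid)
rand_best, rand_score = random_search(n_trials=20)
print("grid search:  ", grid_best, grid_score)
print("random search:", rand_best, rand_score)
```

Grid search costs one model fit per combination (12 here), so it grows quickly as hyperparameters are added, while random search has a fixed budget of trials. AutoML tools automate this kind of loop, often with smarter search strategies.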

Some well-known AutoML tools are:

  • Google Cloud AutoML
  • H2O Driverless AI
  • DataRobot AutoML
  • Databricks AutoML

The tools above are paid or enterprise AutoML tools. There are open source AutoML tools as well, such as:

  • TPOT (Tree-based Pipeline Optimization Tool)
  • MLBox
  • Auto-Sklearn
  • Auto-Keras
  • Auto-PyTorch

I hope you liked this list of tools used by data scientists across the globe. Feel free to add your favourite tools in the comments, including popular tools that are not mentioned in the current list. If you enjoyed the article, please react to it.

Cover Image Cartoon vector created by vectorjuice - www.freepik.com