Data science tools are evolving. There are 2 classes of tools emerging:
- Self-service tools for those with technical expertise (programming skills and understanding of statistics and computer science)
- Tools for business users that automate commonly used analysis
Learn the most popular data science tools for techies
Becoming a data scientist is hard, and in any hard task, focus is critical. As a data scientist, Python should probably be the first tool you master.
Kaggle, the community for data science competitions, publishes surveys of data scientists, such as its 2017 “The State of Data Science” report. Below are the most popular tools from that survey:
|Data science tool||% of respondents using the tool|
|Amazon Web Services||23.5|
|Unix shell / awk||23.3|
|Spark / MLlib||17.1|
|Microsoft Excel Data Mining||13.7|
|Microsoft Azure Machine Learning||7.4|
|Google Cloud Compute||6.8|
|IBM SPSS Statistics||5.9|
|Microsoft SQL Server Data Mining||5.7|
|Amazon Machine Learning||5.3|
|SAS Enterprise Miner||5.1|
|Microsoft R Server (Formerly Revolution Analytics)||4.8|
|RapidMiner (free version)||4.3|
|KNIME (free version)||3.5|
|IBM SPSS Modeler||3.4|
|IBM Watson / Watson Analytics||3.2|
|Oracle Data Mining / Oracle R Enterprise||2.9|
|SAP BusinessObjects Predictive Analytics||1.2|
|RapidMiner (commercial version)||1.0|
|Statistica (Quest/Dell, formerly StatSoft)||0.6|
|KNIME (commercial version)||0.5|
|Salford Systems CART/MARS/TreeNet/RF/SPM||0.5|
Python and R are the top performers. Other sources, such as KDnuggets’ poll results, also support this.
It would be ironic to talk about trends based on two data points in an article about data science, but Python seems to be growing in popularity. We checked an older source from 2014.
That source counted the skills most often demanded of data scientists in job posts. Below is the list of popular skills compiled by Data Science Weekly. Back then, R was more popular than Python, so I think it is safe to say that Python has been growing in popularity.
|Keyword||Frequency (number of times tool mentioned on a sample data scientist job list)|
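Counting tool mentions in job posts, as that 2014 source did, is simple to reproduce. The sketch below is a minimal illustration in Python, not the method Data Science Weekly actually used: the job-post snippets and the keyword list are made up, and `count_keyword_posts` is a hypothetical helper that counts how many posts mention each keyword at least once.

```python
import re
from collections import Counter

# Hypothetical job-post snippets; real input would be scraped or exported listings.
JOB_POSTS = [
    "Data scientist with strong Python and SQL skills; R is a plus.",
    "Looking for an analyst fluent in R, SQL and Excel.",
    "Machine learning engineer: Python, Spark, Hadoop. Python required.",
]

# Hypothetical keyword list; track whichever tools you care about.
KEYWORDS = ["Python", "R", "SQL", "Spark", "Hadoop", "Excel"]


def count_keyword_posts(posts, keywords):
    """Count how many posts mention each keyword at least once."""
    counts = Counter()
    for post in posts:
        for keyword in keywords:
            # Lookarounds keep short names like "R" from matching inside
            # longer words such as "RapidMiner" or "Spark".
            pattern = r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])"
            if re.search(pattern, post, flags=re.IGNORECASE):
                counts[keyword] += 1
    return counts


if __name__ == "__main__":
    # Print the result in the same pipe-delimited layout as the tables above.
    for keyword, frequency in count_keyword_posts(JOB_POSTS, KEYWORDS).most_common():
        print(f"|{keyword}||{frequency}|")
```

Matching is case-insensitive, so a stray lowercase "r" standing alone would also count as a mention of R. On real listings you would likely want a stricter rule for one-letter names.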
Finally, as of 2017, most data scientists who use both R and Python strongly recommend Python, as shown below. We expect Python’s popularity to continue to grow.
Though it is clear that Python is the most popular tool among data scientists, there is a whole ecosystem of data science tools, which is summarized nicely in the image below.
8 Data Science Tools Everyone Needs to Know
RapidMiner builds software for real data science, fast and simple. It makes data science teams more productive through a lightning-fast platform that unifies data prep, machine learning, and model deployment. More than 300,000 users in over 150 countries use RapidMiner products to drive revenue, reduce costs, and avoid risks. The platform is built on three major components. RapidMiner Studio is the visual workflow designer for data science teams. It is a code-optional platform with guided analytics and more than 1,500 functions, and it lets users automate work through predefined connections, built-in templates, and repeatable workflows. RapidMiner Server lets teams share and collaborate on every step and aspect of the data mining process. Its advanced queuing mechanism helps optimize resource use: RapidMiner Server can slice out resources and dedicate them to teams, use cases, or projects. It also gives visibility into data science teamwork and governance. RapidMiner Radoop removes the complexity of data prep and machine learning on Hadoop and Spark. The platform is used in many industries with different types of solutions.
DataRobot offers a machine learning platform for data scientists of all skill levels to build and deploy accurate predictive models in a fraction of the time it used to take. The technology addresses the critical shortage of data scientists by changing the speed and economics of predictive analytics. The DataRobot platform uses massively parallel processing to train and evaluate thousands of models in R, Python, Spark MLlib, H2O, and other open source libraries. It searches through millions of possible combinations of algorithms, pre-processing steps, features, transformations, and tuning parameters to deliver the best models for your dataset and prediction target. They offer three main products, including DataRobot Cloud and DataRobot Enterprise. DataRobot Cloud, built with the knowledge and experience of some of the world’s top data scientists, is positioned as the easiest way to build world-class prediction models in just minutes. DataRobot has partnered with Amazon Web Services (AWS), the world’s most comprehensive and broadly adopted cloud platform; the flexibility and scale of AWS enable DataRobot to deliver a robust, secure, on-demand platform to its customers. DataRobot Enterprise extends the value of the machine learning platform with enterprise features including flexible deployment, governance, training, and world-class support.
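To make the idea of searching over combinations of algorithms, pre-processing steps, and tuning parameters concrete, here is a minimal sketch in plain Python. It is not DataRobot's implementation or API. The toy data, the two transforms, the three model types, and all function names are assumptions chosen for illustration, and the search is a simple exhaustive loop scored on a held-out validation set.

```python
import itertools
import math
import statistics

# Toy data, made up for illustration: y is roughly 2*x + 1.
TRAIN = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 10.8), (6.0, 13.1)]
VALID = [(1.5, 4.0), (3.5, 8.1), (5.5, 12.0)]

# Candidate pre-processing steps (hypothetical choices).
TRANSFORMS = {"raw": lambda x: x, "log": math.log}


def fit_mean(xs, ys):
    """Baseline: always predict the mean training target."""
    mean_y = statistics.mean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Ordinary least squares with one feature."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k):
    """k-nearest-neighbours regression: average the k closest targets."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return statistics.mean(y for _, y in nearest)
    return predict


# Candidate algorithms with their tuning parameters.
MODELS = [
    ("mean", fit_mean, {}),
    ("linear", fit_linear, {}),
    ("knn", fit_knn, {"k": 1}),
    ("knn", fit_knn, {"k": 2}),
    ("knn", fit_knn, {"k": 3}),
]


def validation_mse(transform, fit, params):
    """Fit on the transformed training set, score on the validation set."""
    train_x = [transform(x) for x, _ in TRAIN]
    train_y = [y for _, y in TRAIN]
    predict = fit(train_x, train_y, **params)
    return statistics.mean((predict(transform(x)) - y) ** 2 for x, y in VALID)


def search():
    """Try every transform/model combination; return results, best first."""
    results = []
    for (t_name, transform), (m_name, fit, params) in itertools.product(
        TRANSFORMS.items(), MODELS
    ):
        score = validation_mse(transform, fit, params)
        results.append((score, t_name, m_name, params))
    return sorted(results, key=lambda result: result[0])


if __name__ == "__main__":
    for score, t_name, m_name, params in search():
        print(f"{score:.4f}  {t_name:>3}  {m_name:<6}  {params}")
```

With two transforms and five model configurations, this loop evaluates only ten pipelines. A search over millions of combinations would have to prune or parallelize the same loop rather than enumerate it serially.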
Alteryx Inc., headquartered in Irvine, CA, offers a quick-to-implement, end-to-end analytics platform that empowers business analysts and data scientists alike to break data barriers and deliver game-changing insights that are solving big business problems. The Alteryx platform is self-serve, click, drag-and-drop for hundreds of thousands of people in leading enterprises all over the world.
Qubole is passionate about making data-driven insights easily accessible to anyone. Qubole customers currently process nearly an exabyte of data every month, making it the leading cloud-agnostic big-data-as-a-service provider. Customers have chosen Qubole because it created the industry’s first autonomous data platform. This cloud-based data platform self-manages, self-optimizes, and learns to improve automatically, and as a result delivers unbeatable agility, flexibility, and TCO. Qubole customers focus on their data, not their data platform. Qubole investors include CRV, Lightspeed Venture Partners, Norwest Venture Partners, and IVP.
Paxata is the pioneer in intelligently empowering all business consumers to transform raw data into ready information, instantly and automatically, with an intelligent, self-service data preparation application built on a scalable, enterprise-grade platform powered by machine learning. Their Adaptive Information Platform weaves data into an Information Fabric from any source, cloud, or environment, so that any enterprise can create trusted information. With Paxata, users click, not code, to achieve results in minutes, not months. The company's stated aim is to empower all business consumers to get smart about information at the speed of thought and to become an “Information Inspired Business.” Paxata partners with industry-leading cloud, big data, and business intelligence solution providers such as Cloudera and Amazon, and connects seamlessly to BI tools, including Salesforce Wave, Tableau, Qlik, and Microsoft Excel, to greatly accelerate the time to actionable business insights.
Trifacta’s mission is to create radical productivity for people who analyze data. They are deeply focused on solving the biggest bottleneck in the data lifecycle, data wrangling, by making it more intuitive and efficient for anyone who works with data. Their main product is Wrangler, which helps data analysts clean and prepare messy, diverse data more quickly and accurately. Simply import your datasets to Wrangler and the application will automatically begin to organize and structure your data. Wrangler’s machine learning algorithms will even help you prepare your data by suggesting common transformations and aggregations. When you’re happy with your wrangled dataset, you can export the file for use in data initiatives like data visualization or machine learning. Wrangler Edge is specifically designed to make this process faster for teams that don’t require the parallel computing power of big data platforms. Powered by a high-performance data wrangling engine, it lets analysts share the process of exploring, structuring, and publishing analysis-ready datasets for faster, more accurate analysis.
LumenData is a leading provider of Enterprise Information Management solutions with deep expertise in implementing Data persistence layers for data mastering, prediction systems, and data lakes as well as Data Strategy, Data Quality, Data Governance, and Predictive Analytics. Through a combination of highly trained consultants, strong partnerships, relentless focus on quality and executive oversight, LumenData has successfully delivered planning, implementation, integration, maintenance, and training services to over 50 blue chip clients in various industries. Its clients include Autodesk, Bayer, Bausch & Lomb, Citibank, Credit Suisse, Cummins, Gilead, HP, Nintendo, PC Connection, Starbucks, University of Colorado, the University of Texas at Dallas, Weight Watchers, Westpac, and many other data-dependent companies.
Feature Labs is a predictive analytics platform created to make data science automation a strategic component of any organization. By using Feature Labs, teams can utilize machine learning and artificial intelligence to deploy new products or services, identify critical insights, and understand what their data says about the future of their business.
As data science researchers at MIT, Feature Labs’ founders experienced first-hand the challenges inherent in the development of predictive models. To address these problems, Max Kanter and Kalyan Veeramachaneni created the “Data Science Machine” to automate this time-intensive, human-driven process, and then created Feature Labs in 2015 to bring cutting-edge data science automation to the world.
Though we covered the whole ecosystem of data science tools, we did not cover every tool in depth. Let us know if you need help finding the right partner in data science.