Data science is an amalgamation of computer science and statistics that seeks to derive insights from data. With the proliferation of data in the information age, demand for data scientists has grown across many industries. According to the United States Bureau of Labor Statistics, “employment of computer and information research scientists is projected to grow 22 percent from 2020 to 2030, much faster than the average for all occupations.”
Most data scientists employ one or more programming languages as part of their toolkit. In recent years, the languages receiving the most attention in data science circles have been Python and R. In the recent Stack Overflow Developer Survey, these languages ranked #3 and #21 respectively in popularity across all programming languages. Some reasons these languages are popular include:
- Relative ease of use: They are high-level, interpreted languages, which makes it possible to test and interact with programs without a complex compilation step.
- Large user communities: Both Python and R are maintained by strong open-source communities with regular release cycles and extensive support, and both enjoy broad academic and commercial backing.
- Broad capabilities and extension support: Data scientists use a common set of add-on packages to enable them to explore/process data and develop/train models.
Instead of programming in a terminal (imagine a dark screen with a flashing cursor), most users take advantage of an Integrated Development Environment (IDE) to make it easier to write, edit, run, and test programs. IDEs are even more powerful for data scientists because they make it possible to interrogate data and programs in real time as the programs execute. This capability has been extended by the rising popularity of Notebooks, which combine inline code, commentary, and outputs in a single cohesive document. This offers a detailed, well-documented way of exploring data within a structured environment.
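As a minimal sketch of the Notebook idea, the Python cell below mixes commentary (as comments), code, and an inline result; the regions and claim figures are invented purely for illustration:

```python
import statistics

# Commentary: compute the average claim size by region.
# (All values here are invented for illustration.)
claims = {"North": [1200, 950, 1800], "South": [700, 1100]}

averages = {region: statistics.mean(values) for region, values in claims.items()}
print(averages)  # in a Notebook, this output appears directly below the cell
```

In a Notebook, the commentary would live in a formatted markdown cell and the output would render inline, giving the "code plus narrative plus results" document described above.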
Organizations have invested significant resources in building data science teams and hiring data scientists. Yet there is a perception that, with data scientists using Python and R, the organization will minimize the significance of Excel to its business. This is a false dichotomy: Excel should instead be used to supplement the data scientist's capabilities and toolbox.
Although programming languages offer many tools to access and work with data, they require code that differs across datasets and structures. Excel offers the ability to work and interact directly with data that is difficult to handle even with a Notebook, IDE, or visualization software. Together with Power BI, Excel can also ingest and work with larger, server-based datasets. Excel is a terrific communication tool as well, providing a standard way to express logic and calculations in a manner that non-programmers can understand and review. This is especially beneficial for documenting and productionizing business logic, while also allowing users to interact with and evaluate calculations. Despite the power of data science programming languages, calculations involving many tables with varying lookups, or many conditional steps, are often more easily developed and explained in Excel. It is common for data science models to be exported to Excel for final processing and adjustments before implementation. The benefits that Python and R offer in ease of use, large user communities, and broad capabilities and extension support apply equally to Excel. In fact, the Notebooks now available to data science programming languages represent an evolution of functionality that has been present in spreadsheets for decades: users can directly mix data, code (via formulas), and other outputs together in one document.
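To illustrate the point about lookups, the pure-Python sketch below mirrors what a single Excel formula such as `=IFERROR(VLOOKUP(class, RateTable, 2, FALSE), 0)` expresses in one reviewable cell; the rate table and figures are invented for illustration:

```python
# Hypothetical rate table, as it might appear on an Excel sheet.
rate_table = {"A": 0.05, "B": 0.10, "C": 0.20}

def lookup_rate(risk_class, default=0.0):
    """Roughly the code equivalent of Excel's
    =IFERROR(VLOOKUP(risk_class, RateTable, 2, FALSE), default)."""
    return rate_table.get(risk_class, default)

# A premium loaded by the looked-up rate: one formula a reviewer can
# inspect in Excel, several lines of code here.
premium = 1000 * (1 + lookup_rate("B"))
print(premium)
```

A non-programmer can audit the Excel version cell by cell, which is part of why multi-table lookup logic is often easier to review in a spreadsheet than in code.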
The perspective that Excel should be replaced with data science programming languages is perhaps based upon the way that organizations have used the flexibility of Excel to address internal needs in a way that has introduced End-User Computing (EUC) risk. Shifting to a more formal programming language will not automatically address these risks. With Coherent Spark we offer the best of both worlds: the ability for business users to manage their Excel logic using our solution while addressing concerns from IT about security, validation, and testing. With Spark we can take any calculations and logic in Excel and convert them into code that IT teams can consume through an Application Programming Interface (API). This means the logic in an Excel file can be securely distributed to different integrating applications in an organization. This is performed nearly instantaneously in a system that logs, versions, and hashes all changes. Spark also includes a comprehensive Testing Center to empower business users, such as data scientists, to validate their logic in production environments as well as to incorporate Excel calculations as part of data science analysis and modelling processes.
For data scientists, great emphasis is placed on being able to assess and deliver insights from data – software is meant to be a tool to facilitate the analysis and delivery of these insights. Although programming languages such as R and Python are very commonly used, they are not a direct replacement for the capabilities of Excel. Excel and Coherent Spark together offer invaluable capabilities to complement a data scientist’s workflow and ability to deliver results.
Coherent Spark Product Director
Simon is the Product Manager for Spark at Coherent, leading a team to develop the platform's features and capabilities. He is a qualified actuary with 15+ years' experience in the Property & Casualty / General Insurance space, with an exclusive focus on pricing, data, and analytics. Having held roles in Canada, in the UK with a Big 4 consultancy across banking and insurance, and with a multinational insurer based in Hong Kong, Simon has significant experience understanding the challenges that analytical and business users face in executing and deploying calculations and logic.