
Exploratory Data Analysis in Data Science Projects 

In its simplest and most common sense, exploratory data analysis (EDA) is the effort to understand, at the earliest stages, the data on which a data science project will be built. This initial analysis is usually performed on an unknown or unfamiliar dataset.

In general, the major steps in EDA are as follows:

  • Identification of variables and data types

  • Analyzing the basic metrics

  • Non-graphical and graphical univariate analysis

  • Bivariate analysis

  • Missing value analysis

  • Outlier analysis
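
Several of the steps above can be sketched in a few lines of code. The following is a minimal illustration using only the Python standard library; the toy dataset, its column names, and the helper functions are hypothetical and not part of any particular project. The 1.5 × IQR rule used for outliers is one common convention, assumed here for the example. Bivariate analysis is left out of this sketch.

```python
import statistics
from collections import Counter

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Ankara"},
    {"age": 29, "income": 48000.0, "city": "Izmir"},
    {"age": 41, "income": None, "city": "Ankara"},
    {"age": 38, "income": 61000.0, "city": "Istanbul"},
    {"age": 33, "income": 250000.0, "city": "Izmir"},
    {"age": 45, "income": 58000.0, "city": None},
]


def column(rows, name):
    """Collect one variable across all records."""
    return [row[name] for row in rows]


def infer_type(values):
    """Identify a variable as numeric or categorical from its non-missing values."""
    present = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        return "numeric"
    return "categorical"


def missing_count(values):
    """Missing value analysis: how many entries are absent."""
    return sum(v is None for v in values)


def basic_metrics(values):
    """Basic metrics for a numeric variable, ignoring missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def iqr_outliers(values):
    """Outlier analysis: values beyond 1.5 * IQR from the quartiles."""
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]


for name in rows[0]:
    values = column(rows, name)
    kind = infer_type(values)
    print(name, "| type:", kind, "| missing:", missing_count(values))
    if kind == "numeric":
        print("  metrics:", basic_metrics(values))
        print("  outliers:", iqr_outliers(values))
    else:
        # Non-graphical univariate analysis of a categorical variable.
        print("  frequencies:", Counter(v for v in values if v is not None))
```

In a real project a data-frame library would typically take the place of these helpers, but the questions asked of each variable stay the same.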

This process overlaps with statistical analysis, but the two are not the same. Statistical analysis is more conventional and usually has a clearly defined purpose, whereas exploratory data analysis is deliberately less rule-bound. Rule-based approaches to exploratory analysis may hinder the discovery of new patterns, unknown facts, and models underlying the data. For this reason, the initial analysis of data relies on visual and graphical presentation, which shares some ground with statistical techniques. In short, statistical analysis is not interchangeable with exploratory data analysis; however, statistics supports it with conventional techniques that add a quantitative dimension.

Figure: A heatmap for the correlations among variables in a dataset.
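
A heatmap like this one is a colored rendering of a correlation matrix. As a rough sketch of the numbers behind such a figure, the snippet below computes pairwise Pearson correlations with the Python standard library. The variable names and values are invented for illustration, and the choice of Pearson correlation is an assumption; the figure may have been produced with a different measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(data):
    """Correlate every variable with every other; returns a nested dict."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}


# Hypothetical variables, invented for this example.
data = {
    "temperature": [20.0, 22.0, 25.0, 27.0, 30.0],
    "ice_cream_sales": [110.0, 125.0, 150.0, 160.0, 190.0],
    "heating_cost": [80.0, 70.0, 55.0, 50.0, 30.0],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    cells = "  ".join(f"{value:+.2f}" for value in row.values())
    print(f"{row_name:>16}  {cells}")
```

Each cell lies between -1 and +1, and a heatmap simply maps that range to a color scale so that strongly related pairs stand out at a glance.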

 

Preliminary data analysis aligns with the essence of data science: both try to help people find benefits in their data that would otherwise be hard to discover, with the help of non-human systems. For example, the outputs of a system whose frequency or values change over time, known as a time series, may occasionally take an abnormal value. Once these outputs are plotted on a timeline, the abnormal values become clearly visible. Moreover, whether the abnormal values themselves follow a pattern can be explored visually on the same graph during this initial analysis.

Figure: A timeline plot for a variable in a dataset.
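
The abnormal values that stand out on a timeline can also be flagged numerically. The sketch below is one possible approach, not the method behind the figure: it assumes a modified z-score based on the median absolute deviation, with the commonly used cutoff of 3.5. The series itself is made up for the example.

```python
import statistics


def flag_abnormal(values, threshold=3.5):
    """Return the positions of points far from the median.

    Uses a modified z-score built on the median absolute deviation (MAD),
    which is less distorted by the abnormal points than mean and
    standard deviation would be.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All points (or most of them) are identical; nothing to scale by.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical readings taken at regular intervals.
series = [10.1, 10.4, 9.8, 10.0, 25.3, 10.2, 9.9, 10.3, 10.1, 2.0, 10.0, 10.2]

abnormal = flag_abnormal(series)
for i, value in enumerate(series):
    marker = "  <-- abnormal" if i in abnormal else ""
    print(f"t={i:2d}  {value:6.1f}{marker}")
```

Looking at the positions of the flagged points, for instance whether they recur at regular intervals, is a first step toward checking if the abnormal values follow a pattern of their own.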

 

In conclusion, the success of a data science project depends not only on the quality of the data but also on an understanding of the domain and of the data's characteristics. Exploring and cleaning the data at the beginning of a project may be time-consuming, but it is one of the best ways to help stakeholders get the maximum benefit from their data assets. EDA is a kind of philosophy and attitude adopted in the opening phase of the data journey.



 
