What is the magic skillset that makes a Data Analyst so valuable to industry? Why all this hype and demand for Data Analysts? To understand this, an appreciation of the fundamental workings of the universe is necessary! Seriously!
Let’s make an outrageous statement: “Everything in the universe is a pattern.” This holds from massive galaxies, the sun, planets and gravity, through humans and all living things, down to atoms. Our task is to identify the pattern and, where possible, find a mathematical way to describe and replicate it. Newton’s equations of gravity describe the pattern of how objects move through space. Likewise, the shape of each human face is a pattern that lets us tell the people we know from those we don’t. In living things, the information for the pattern is encoded in DNA (deoxyribonucleic acid).
Let us use a stunning example to show how complex structures are just patterns that can sometimes be replicated very easily. One of nature’s most wondrous patterns is called the Mandelbrot set. The animated image below shows how it is drawn, from the start to its beautiful final form (press play to see the animation).
The fascinating part of the Mandelbrot set is not just its beautiful shape but the fact that it is endless, i.e. the more you zoom into any part of its edge, the more beautiful patterns you see, repeating without end. Please watch the YouTube video to see it in all its glory (https://www.youtube.com/watch?v=b005iHf8Z3g).
Note: Those of you who are curious to learn the maths behind the fractal pattern can view the YouTube video: https://www.youtube.com/watch?v=2JUAojvFpCo
Now guess how huge the equation must be to draw this complex, infinitely creative, recursive and beautiful pattern. It can’t be simple, right? Wrong. The full equation to draw this beautiful pattern is:
$Z_n = Z_{n-1}^2 + C$. That’s it! The only catch is that both Z and C are complex numbers, i.e. numbers of the form a + bi, where i is the square root of -1. Starting from Z_0 = 0, a point C belongs to the set if the sequence of Z values stays bounded however long you iterate. Still not convinced? Then please see the very simple code to draw the Mandelbrot set in Python (for you Python geeks out there): https://www.geeksforgeeks.org/mandelbrot-fractal-set-visualization-in-python/ . There is also code in the R language if you are more comfortable with R.
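In this Python code, the only function that implements the equation is shown below, and the entire mapping comes down to just one line: c = c * c + c0. Note that the variable c in the code plays the role of Z in the equation, while c0 plays the role of C. The rest of the code (refer to the URL above) just iterates over the pixels and colours the plot.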
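# function defining a mandelbrot
# (rgb_conv is a colouring helper defined in the full listing at the URL above)
def mandelbrot(x, y):
    c0 = complex(x, y)           # the point being tested (C in the equation)
    c = 0                        # the running value (Z in the equation), starting at 0
    for i in range(1, 1000):
        if abs(c) > 2:           # once |Z| exceeds 2 the sequence is certain to diverge
            return rgb_conv(i)   # colour the point by how quickly it escaped
        c = c * c + c0           # the single line that applies Z = Z*Z + C
    return (0, 0, 0)             # never escaped: treat as inside the set and paint it black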
This shows that even very complex shapes can be just patterns. We just have to identify the hidden pattern and the mathematics to describe it.
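If you would like to try the iteration without any plotting library, here is a minimal sketch of the same escape-time idea. It is not the code from the linked article: the names escape_count and render_ascii, the grid size and the iteration limit of 100 are all assumptions made for illustration, and the picture is drawn with plain text characters instead of colours.

```python
def escape_count(c, max_iter=100):
    """Count iterations of z = z*z + c until |z| exceeds 2.

    Returns max_iter if z never escapes within the limit; such points
    are treated as belonging to the set (an approximation).
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter


def render_ascii(width=60, height=24, max_iter=100):
    """Return a list of text rows; '#' marks points that never escaped."""
    rows = []
    for row in range(height):
        # imaginary part runs from 1.2 (top row) down to -1.2 (bottom row)
        y = 1.2 - 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            # real part runs from -2.0 (left) to 1.0 (right)
            x = -2.0 + 3.0 * col / (width - 1)
            inside = escape_count(complex(x, y), max_iter) == max_iter
            line += "#" if inside else "."
        rows.append(line)
    return rows


if __name__ == "__main__":
    print("\n".join(render_ascii()))
```

Running the script prints a rough text silhouette of the set. Raising max_iter or the grid size sharpens the edge, at the cost of more computation.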
A data analyst’s role is therefore to extract knowledge and insights from data, a process also termed “KDD” (Knowledge Discovery in Datasets, more commonly expanded as Knowledge Discovery in Databases). The steps are to start by collating data, then extract information, and finally derive knowledge (pattern recognition) from this information.
She is responsible for all of these steps, i.e. understanding what data is required and how to capture and integrate it. Next she has to handle missing values and cleanse the data, then select the right data for analysis based on the business objectives. Much of the data then needs to be transformed to suit the problem statement and the analytical methods or software being used. Finally comes the key step: identifying the underlying pattern.
Figure 2: Knowledge Discovery in Datasets. (Source: https://www.geeksforgeeks.org/kdd-process-in-data-mining/?ref=rp)
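To make the steps above a little more concrete, here is a toy sketch of cleansing, selection, transformation and pattern identification using only the Python standard library. Everything in it is hypothetical: the sales records, the function names and the very crude "trend" rule are assumptions chosen for illustration, and a real project would normally use dedicated data analysis libraries and far more careful methods.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"region": "north", "month": 1, "sales": 120.0},
    {"region": "north", "month": 2, "sales": None},
    {"region": "north", "month": 3, "sales": 160.0},
    {"region": "south", "month": 1, "sales": 200.0},
    {"region": "south", "month": 2, "sales": 190.0},
    {"region": "south", "month": 3, "sales": 180.0},
    {"region": "test", "month": 1, "sales": 1.0},  # irrelevant to the business question
]


def clean(records):
    """Cleansing: fill a missing sales value with the mean of that region's known values.

    Assumes every region has at least one known value.
    """
    known = {}
    for r in records:
        if r["sales"] is not None:
            known.setdefault(r["region"], []).append(r["sales"])
    return [
        {**r, "sales": r["sales"] if r["sales"] is not None else mean(known[r["region"]])}
        for r in records
    ]


def select(records, regions):
    """Selection: keep only the regions relevant to the business objective."""
    return [r for r in records if r["region"] in regions]


def transform(records):
    """Transformation: reshape into one month-ordered sales series per region."""
    series = {}
    for r in sorted(records, key=lambda rec: rec["month"]):
        series.setdefault(r["region"], []).append(r["sales"])
    return series


def find_trend(series):
    """Pattern identification (deliberately crude): compare first and last values."""
    return {
        region: "rising" if values[-1] > values[0] else "falling or flat"
        for region, values in series.items()
    }


def run_pipeline(records):
    """Chain the steps: clean, select, transform, then look for the pattern."""
    return find_trend(transform(select(clean(records), {"north", "south"})))


if __name__ == "__main__":
    print(run_pipeline(raw_records))
```

Each function maps to one step, so any of them can be swapped for a more serious technique without touching the others.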
A Data Scientist’s role is all about managing the data pipeline so that the right data is available to solve business problems.
The first step after understanding the business objective is to gather the relevant data. This requires integrating data from various sources. At this stage a working knowledge of SQL (or another data query language) is necessary. Over the past decade, however, KDD has increasingly required the use of unstructured data, often discussed under the popular label “Big Data”. One of the most popular industry tools for handling and transforming Big Data is Hadoop, an open source framework for extracting, manipulating and storing big data. In another blog we will discuss Big Data definitions and concepts. A Data Scientist needs to know these concepts well.
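As a small illustration of the SQL-based integration step mentioned above, the sketch below joins two hypothetical tables with Python's built-in sqlite3 module. The table names, columns and rows are invented for the example, and an in-memory SQLite database stands in for whatever source systems a real project would have.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two hypothetical source tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Asha", "north"), (2, "Ben", "south"), (3, "Chen", "north")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0), (13, 3, 50.0)],
    )
    return conn


def sales_by_region(conn):
    """Integrate the two tables with a JOIN and total the order amounts per region."""
    query = """
        SELECT c.region, SUM(o.amount) AS total_amount
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_demo_db()
    for region, total in sales_by_region(connection):
        print(region, total)
    connection.close()
```

The JOIN is the integration step: it links each order to its customer so that the amounts can be summarised by a field that lives in the other table.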
Blogs in this series:
1. Who is a Data Analyst
2. Knowledge Discovery in Datasets.
3. Roadmap from a Novice to a valuable Data Analyst.