Data visualization means converting information into a visual form such as a map, graph, or chart. Its main aim is to make data easier for people to understand and to help identify trends, patterns, and outliers. Data visualization is one of the most important steps in data science: after the data has been collected, processed, and modeled, visualization is the next step and helps in drawing insights from the results.
In machine learning, we also use data visualization before building models to better understand the data. It helps in spotting outliers, identifying important features, and finding correlations between variables. Python and R are two of the most commonly used languages for data visualization. Some of the well-known data visualization libraries are:
- Matplotlib
- Plotly
- Seaborn
- Altair
- Bokeh
- ggplot2
Python
Python is one of the most popular programming languages in the machine learning community, and it is also widely used for data visualization. Some of its well-known libraries are listed below, along with code for a stacked bar chart in each:
Matplotlib
The code below shows how to make a stacked bar chart using Matplotlib. The data is made-up sample data representing the sales of two products, A and B, in each month. The same data is used throughout this post. The function for drawing a bar chart is bar().
import matplotlib.pyplot as plt
labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
product_A = [20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50]
product_B = [25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29]
width = 0.35
fig, ax = plt.subplots()
ax.bar(labels, product_A, width, label='Product A')
ax.bar(labels, product_B, width, bottom=product_A,
       label='Product B')
ax.set_ylabel('Sales')
ax.set_title('Sales of Product A and Product B')
ax.legend()
plt.show()
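The bottom argument is what does the stacking: each series starts where the running total of the series before it ends. With only two products, product_A itself serves as the bottom for product_B, but with three or more series the offsets have to be accumulated. The following is a minimal sketch of that bookkeeping in plain Python. The helper name stack_bottoms and the third series product_C are hypothetical and used only for illustration.

```python
from itertools import accumulate


def stack_bottoms(series):
    """Return, for each series, the offsets at which its bars should start.

    The first series starts at zero; every later series starts at the
    element-wise running total of all the series before it.
    """
    if not series:
        return []
    zeros = [0] * len(series[0])
    # accumulate yields running element-wise sums: s0, s0+s1, s0+s1+s2, ...
    totals = list(
        accumulate(series, lambda acc, s: [a + b for a, b in zip(acc, s)])
    )
    # Zero for the first series, then the totals of everything before each one.
    return [zeros] + totals[:-1]


# First three months of the sample data, plus a hypothetical third product.
product_A = [20, 35, 30]
product_B = [25, 32, 34]
product_C = [10, 10, 10]  # hypothetical, not part of the original data

bottoms = stack_bottoms([product_A, product_B, product_C])
print(bottoms)
```

Each entry of bottoms could then be passed as bottom= to the corresponding ax.bar() call, one call per series.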
Plotly
The same data is used here, but it has been converted into a pandas DataFrame. The bar() function from Plotly Express is used to draw the bar chart.
import plotly.express as px
import pandas as pd
df = pd.DataFrame(dict(
    time=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    product_A=[20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50],
    product_B=[25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29]))
fig = px.bar(df, x="time", y=["product_A","product_B"], title="Sales of Product A and Product B")
fig.show()
Seaborn
Seaborn has no direct function for stacked bar charts, so I have used the pandas plot() function to draw the chart. The set() function is used to apply the seaborn theme. Seaborn does provide barplot() and histplot() for plotting bar charts and histograms respectively.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame(dict(
    time=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    product_A=[20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50],
    product_B=[25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29]))
# Apply the seaborn theme; pandas plots pick up its colour palette.
sns.set(palette='pastel')
df.plot(kind='bar', x='time', stacked=True)
plt.show()
Altair
Here the DataFrame is used again, but because the data is in wide format it is first converted into long format and then plotted. I have also added cornerRadiusTopLeft and cornerRadiusTopRight, which round the top corners of the bars.
import altair as alt
import pandas as pd
df = pd.DataFrame(dict(
    time=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    product_A=[20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50],
    product_B=[25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29]))
# Fold the two wide columns into long format: 'column' holds the
# original column name and 'product' holds its value.
alt.Chart(df).transform_fold(
    ['product_A', 'product_B'],
    as_=['column', 'product']
).mark_bar(
    cornerRadiusTopLeft=3,
    cornerRadiusTopRight=3
).encode(
    # sort=None keeps the months in data order instead of alphabetical order
    x=alt.X('time:N', sort=None),
    y='product:Q',
    color='column:N'
)
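To make the wide-to-long conversion more concrete, here is a small plain-Python sketch of what folding does conceptually: every row is repeated once per folded column, with the column name and its value stored in two new fields. This only illustrates the idea and is not how Altair implements transform_fold; the real transform may differ in details such as whether the original columns are retained. The function name fold_rows is hypothetical.

```python
def fold_rows(rows, fold, as_=("key", "value")):
    """Convert wide records to long records.

    rows : list of dicts, one per wide row
    fold : names of the columns to fold
    as_  : names for the new (column-name, value) fields
    """
    key_field, value_field = as_
    long_rows = []
    for row in rows:
        # Fields that are not folded are copied unchanged.
        kept = {k: v for k, v in row.items() if k not in fold}
        for col in fold:
            long_rows.append({**kept, key_field: col, value_field: row[col]})
    return long_rows


# First two months of the sample data in wide format.
wide = [
    {"time": "Jan", "product_A": 20, "product_B": 25},
    {"time": "Feb", "product_A": 35, "product_B": 32},
]

long_records = fold_rows(wide, ["product_A", "product_B"], as_=("column", "product"))
for record in long_records:
    print(record)
```

Each month now appears once per product, which is the shape the color encoding needs. In pandas, a comparable reshaping is available through DataFrame.melt().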
Bokeh
The Bokeh imports come first. Only output_file, show, and figure are needed for this chart; output_notebook() is used for inline display when running in a Jupyter notebook.
from bokeh.io import output_file, show, output_notebook
from bokeh.plotting import figure
output_notebook()  # render plots inline in a Jupyter notebook
The code for plotting the data is given below. The vbar_stack() method is used to draw the stacked bar chart.
output_file("bar_stacked.html")
time = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
product = ["product_A","product_B"]
colors = ["#718dbf", "#e84d60"]
data = {'time': time,
        'product_A': [20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50],
        'product_B': [25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29]}
# Newer Bokeh releases use 'height'; older releases used 'plot_height'.
p = figure(x_range=time, height=250, title="Sales of Products",
           toolbar_location=None, tools="hover")
p.vbar_stack(product, x='time', width=0.9, color=colors, source=data,
             legend_label=product)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
show(p)
R
R is another popular language in the ML and data science community. It is often said to be somewhat harder to learn than Python, so it tends to be used by more experienced data scientists. Two of the well-known data visualization libraries for R are covered below:
ggplot2
First, the ggplot2 library is loaded and then the data frame is created. The head() function is used to see the first few rows of the data frame.
library(ggplot2)
df <- data.frame(
  time = c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'),
  p1 = c(20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50),
  p2 = c(25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29)
)
head(df)
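The data frame is in wide format, as the output of head() shows: one row per month with a separate column for each product. Next, the order of the time variable is fixed so that the months are not sorted alphabetically, and the data frame is converted into long format with melt().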
df$time <- factor(df$time, levels = df$time)
library(reshape2)
df1 <- melt(df, id.var="time")
The data is plotted, and the plot is stored in the variable p.
p <- ggplot(df1, aes(x = time, y = value, fill = variable)) +
  geom_bar(stat = "identity")
p
Plotly
The code for plotting a stacked bar chart using plotly in R is given below. Setting barmode to 'stack' in layout() is what stacks the two traces.
library(plotly)
time <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
p1 <- c(20, 35, 30, 35, 27, 43, 24, 34, 14, 27, 22, 50)
p2 <- c(25, 32, 34, 20, 25, 34, 54, 23, 43, 33, 27, 29)
data <- data.frame(time, p1, p2)
data$time <- factor(data$time, levels = data$time)
fig <- plot_ly(data, x = ~time, y = ~p1, type = 'bar', name = 'Product A')
fig <- fig %>% add_trace(y = ~p2, name = 'Product B')
fig <- fig %>% layout(yaxis = list(title = 'Sale of Products'), barmode = 'stack')
fig
I hope you enjoyed this short introduction to data visualization and the code for a stacked bar chart in each of these libraries. Comment with your favorite library and like this post.