Data visualization is the graphical display of abstract information for two purposes: sense-making (also called data analysis) and communication. Important stories live in our data and data visualization is a powerful means to discover and understand these stories, and then to present them to others.
Visualizing information can give us a very quick solution to a problem; in the words of a famous quote by the British journalist David McCandless, we can get clarity or the answer to a simple problem very quickly.
In this blog post I am talking about the grammar behind the graphics!
What is Grammar of Graphics?
The ggplot2 package is extremely flexible, and repeating plots for groups is quite easy. The “gg” in ggplot2 stands for the Grammar of Graphics. In his book The Grammar of Graphics, Leland Wilkinson showed how you could describe plots not as discrete types like bar plot or pie chart, but using a “grammar” that would work not only for the plots we commonly use but for almost any conceivable graphic.
From this perspective, a pie chart is just a bar chart with a circular (polar) coordinate system replacing the rectangular Cartesian coordinate system. However, the book is not a light read, and the abstract graphical syntax it presents is meant to clarify Wilkinson’s concepts; it is not a language you can use to recreate these graphs. The ggplot2 package is a simplified implementation of the grammar of graphics, written by Hadley Wickham for R. It is simplified only in that he uses R for data transformation and restructuring, rather than implementing that in his syntax.
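The pie-chart remark above can be tried out directly. The sketch below is only an illustration: it uses R’s built-in mtcars dataset rather than this tutorial’s data, and the names bar and pie are made up for the example.

```r
library(ggplot2)

# A single stacked bar: one segment per number of cylinders.
# mtcars ships with R and is used here only as a stand-in dataset.
bar <- ggplot(mtcars, aes(x = factor(1), fill = factor(cyl))) +
  geom_bar(width = 1)

# Same data, same geom; only the coordinate system changes.
# Mapping the y-axis to the angle turns the stacked bar into a pie chart.
pie <- bar + coord_polar(theta = "y")

bar
pie
```

Nothing about the data or the bars changes between the two plots; the only difference is the coord_polar() layer.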
“ggplot2 is a plotting system for R, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.”
To understand ggplot, you need to ask yourself, what are the fundamental parts of every data graph? They are:
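In ggplot2’s own terms, these parts are commonly listed as the data, the aesthetic mappings, the geometric objects (geoms), the statistical transformations (stats), the scales, the coordinate system and, optionally, the facets. As a rough sketch, each part can be written out explicitly in a single plot. The built-in mtcars dataset and the axis labels below are assumptions made for the illustration, not part of this tutorial’s data.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                           # data
            mapping = aes(x = wt, y = mpg)) +        # aesthetic mappings
  geom_point() +                                     # geom (its stat is "identity")
  scale_x_continuous(name = "Weight (1000 lbs)") +   # scale for the x aesthetic
  scale_y_continuous(name = "Miles per gallon") +    # scale for the y aesthetic
  coord_cartesian() +                                # coordinate system
  facet_wrap(~ cyl)                                  # facets: one panel per cylinder count

p
```

Most of these layers have sensible defaults, which is why the shorter calls in the rest of this post only spell out the data, the mapping and the geom.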
How do you put this concept into practice? Let’s move on to some practical examples.
- Step One. Check That You Have ggplot2 installed
First, go to the “Packages” tab in RStudio (an IDE for working with R efficiently), search for ggplot2 and tick the checkbox. If the package is not listed, you need to install it first: stay in the same tab, click “Install”, enter ggplot2, press ENTER and wait a minute or two for the package to install. You can also install ggplot2 from the console with the install.packages() function:
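A minimal console version of this step might look like the following. The requireNamespace() check is an addition for convenience, so that the package is only downloaded when it is not already installed.

```r
# Install ggplot2 only if it is missing, then attach it for this session.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```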
- Step Two. The Data
Next, make sure that you have a dataset to work with: import the necessary file or use one that is built into R. This tutorial works with the chol dataset. If you’re just tuning in, you can download the dataset from here. You can load the chol dataset by using the url() function inside the read.table() function:
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
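If the URL above is not reachable from your machine, you can still follow along with simulated data. The sketch below is one possible way to do that: load_chol() and chol_demo are hypothetical names introduced here, and the fallback data frame only mimics an “AGE” column with random ages between 20 and 50, so its histograms will not match the real dataset.

```r
# Hypothetical helper: try the tutorial's URL first and fall back to
# simulated ages if the download fails or has no AGE column.
load_chol <- function(src = "http://assets.datacamp.com/blog_assets/chol.txt") {
  old <- options(timeout = 10)   # do not wait long for a dead link
  on.exit(options(old))

  downloaded <- tryCatch(
    suppressWarnings(read.table(url(src), header = TRUE)),
    error = function(e) NULL
  )
  if (!is.null(downloaded) && "AGE" %in% names(downloaded)) {
    return(downloaded)
  }

  # Simulated stand-in, NOT the real chol data.
  set.seed(42)
  data.frame(AGE = sample(20:50, size = 200, replace = TRUE))
}

chol_demo <- load_chol()
str(chol_demo)           # check the column names and types
summary(chol_demo$AGE)   # quick look at the range of ages
```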
- Step Three. Making Your Histogram with ggplot2
You have two options for making a histogram with the ggplot2 package. You can either use the qplot() function, which looks very much like the hist() function, to take the “AGE” column from the “chol” dataset and make a histogram of it.
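The qplot() call described above could look like this. To keep the snippet self-contained it uses a simulated data frame named chol_demo (a made-up stand-in with a random “AGE” column) instead of the downloaded chol data. Note that recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so expect a warning there.

```r
library(ggplot2)

# Simulated stand-in for the chol dataset: 200 random ages between 20 and 50.
set.seed(1)
chol_demo <- data.frame(AGE = sample(20:50, size = 200, replace = TRUE))

# Quick histogram of the AGE column, in the style of hist(chol_demo$AGE).
p_quick <- qplot(chol_demo$AGE, geom = "histogram")
p_quick
```

With the real data, replace chol_demo with chol.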
You can also use the ggplot() function to make the same histogram. Take the dataset “chol” to be plotted, pass its “AGE” column as the values on the x-axis and compute a histogram of it:
ggplot(data = chol, aes(x = AGE)) + geom_histogram()
How do these two options compare? The qplot() function is meant to make the same graph as ggplot(), but with a simpler syntax. While ggplot() allows for maximum features and flexibility, qplot() is a simpler but less customizable wrapper around ggplot().
- Step Four. Taking It One Step Further
Adjusting qplot(): The options for adjusting your histogram through qplot() are not too extensive, but the function does let you adjust the basics to improve the visualization and hence the understanding of the histogram.
The following call draws a histogram for the “AGE” column in the “chol” dataset, with the title “Histogram for Age” and the x-axis label “Age”, using bins of width 5 over the range 20 to 50 on the x-axis, with a transparent blue fill and red borders:
qplot(chol$AGE, geom = "histogram", binwidth = 5, main = "Histogram for Age", xlab = "Age", fill = I("blue"), col = I("red"), alpha = I(.2), xlim = c(20, 50))
To adjust the colours of your histogram, just add the arguments col and fill, together with the desired color:
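As a small sketch of the col and fill arguments on their own, here is a ggplot() histogram with red borders and a blue fill. The chol_demo data frame is a simulated stand-in (random ages), and the bin width of 5 is an arbitrary choice for the example.

```r
library(ggplot2)

# Simulated stand-in for the chol dataset.
set.seed(1)
chol_demo <- data.frame(AGE = sample(20:50, size = 200, replace = TRUE))

# col sets the border colour of the bars, fill sets their interior colour.
p_colours <- ggplot(chol_demo, aes(x = AGE)) +
  geom_histogram(binwidth = 5, col = "red", fill = "blue")
p_colours
```

Because col and fill are given outside aes(), they are fixed values for every bar rather than mappings from the data.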
The alpha argument controls the fill transparency. Remember to pass a value between 0 (transparent) and 1 (opaque):
ggplot(data = chol, aes(x = AGE)) + geom_histogram(breaks = seq(20, 50, by = 2), col = "red", fill = "green", alpha = .2)
You can also fill the bins with colours according to the counts shown on the y-axis, something that is not possible with the qplot() function:
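A sketch of such a count-based fill, again on a simulated chol_demo data frame, is shown below. It assumes ggplot2 3.3.0 or later, where after_stat(count) is the current spelling of the older ..count.. notation.

```r
library(ggplot2)

# Simulated stand-in for the chol dataset.
set.seed(1)
chol_demo <- data.frame(AGE = sample(20:50, size = 200, replace = TRUE))

# fill is mapped inside aes() to the count computed by the histogram stat,
# so taller bars get a different shade than shorter ones.
p_count <- ggplot(chol_demo, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "red",
                 aes(fill = after_stat(count)))
p_count
```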
The default colour scheme is blue. If you want to change this, add something more to your code: scale_fill_gradient(), which lets you specify the colours for the low and high ends of the gradient.
ggplot(data = chol, aes(x = AGE)) + geom_histogram(breaks = seq(20, 50, by = 2), col = "red", aes(fill = after_stat(count))) + scale_fill_gradient("Count", low = "green", high = "red")
Remember that the ultimate purpose of adjusting your histogram should always be to make it easier to understand. Even though the histograms above look very fancy, they might not be exactly what you need, so always keep in mind what you’re trying to achieve!
Note that there are several more options to adjust the color of your histograms. If you want to experiment some more, you can find other arguments in the “Scales” section of the ggplot documentation page.
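As one example of such an alternative, the sketch below swaps the gradient for a viridis colour scale via scale_fill_viridis_c(), which is part of ggplot2 3.0.0 and later. The “magma” palette and the simulated chol_demo data are arbitrary choices for the illustration.

```r
library(ggplot2)

# Simulated stand-in for the chol dataset.
set.seed(1)
chol_demo <- data.frame(AGE = sample(20:50, size = 200, replace = TRUE))

# Same count-based fill as before, with a different continuous colour scale.
p_viridis <- ggplot(chol_demo, aes(x = AGE)) +
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "grey30",
                 aes(fill = after_stat(count))) +
  scale_fill_viridis_c(name = "Count", option = "magma")
p_viridis
```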
To give your histogram a title, add the labs() function:
ggplot(data = chol, aes(x = AGE)) + geom_histogram(breaks = seq(20, 50, by = 2), col = "red", fill = "green", alpha = .2) + labs(title = "Histogram for Age")
X- and Y-Axes
Similar to the arguments that the hist() function uses to adjust the x- and y-axes, you can use the xlim() and ylim() functions. If you add these two functions, you end up with the histogram from the start of this section:
ggplot(data = chol, aes(x = AGE)) + geom_histogram(breaks = seq(20, 50, by = 2), col = "red", fill = "green", alpha = .2) + labs(title = "Histogram for Age") + labs(x = "Age", y = "Count") + xlim(c(18, 52)) + ylim(c(0, 30))
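One caveat worth knowing: xlim() and ylim() set the limits of the scales, so observations outside the limits are dropped (with a warning) before the bins are computed, whereas coord_cartesian() only zooms the view and keeps all the data. The sketch below compares the two on a simulated chol_demo data frame; the limits of 25 and 45 are chosen only to make the difference visible.

```r
library(ggplot2)

# Simulated stand-in for the chol dataset.
set.seed(1)
chol_demo <- data.frame(AGE = sample(20:50, size = 200, replace = TRUE))

p_base <- ggplot(chol_demo, aes(x = AGE)) +
  geom_histogram(binwidth = 2, col = "red", fill = "green", alpha = .2)

# Scale limits: ages outside 25 to 45 are removed before binning.
p_limits <- p_base + xlim(c(25, 45))

# Coordinate limits: all ages are binned, the plot is just zoomed in.
p_zoom <- p_base + coord_cartesian(xlim = c(25, 45))

p_limits
p_zoom
```

In the tutorial’s call the limits (18 to 52) are wider than the breaks (20 to 50), so nothing is lost there; the difference only matters when the limits cut into the data.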
Extra: Trend line
You can easily add a trend line to your histogram by adding geom_density to your code:
ggplot(data = chol, aes(x = AGE)) + geom_histogram(aes(y = after_stat(density)), breaks = seq(20, 50, by = 2), col = "red", fill = "green", alpha = .2) + geom_density(col = 2) + labs(title = "Histogram for Age") + labs(x = "Age", y = "Density")
Remember, just like with the hist() function, your histograms with ggplot2 also need to plot the density for this to work. Remember also that the hist() function required you to make a trend line by entering two separate commands while ggplot2 allows you to do it all in one single command.
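For comparison, the two-command base R version mentioned above could be sketched as follows. The ages vector is simulated here as a stand-in for chol$AGE.

```r
# Simulated stand-in for chol$AGE: 200 random ages between 20 and 50.
set.seed(1)
ages <- sample(20:50, size = 200, replace = TRUE)

# Command 1: a histogram on the density scale (freq = FALSE).
h <- hist(ages, breaks = seq(20, 50, by = 2), freq = FALSE,
          main = "Histogram for Age", xlab = "Age")

# Command 2: overlay the kernel density estimate as a line.
lines(density(ages), col = 2)
```

The freq = FALSE argument plays the same role as mapping y to the density in ggplot2: without it the bars show counts and the density line would sit flat near zero.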
While we know a lot about how to create reasonable visualizations, there is still a lot we do not know or are not yet aware of. Even seemingly basic knowledge like how the layout of a visualization influences our reading of the data still needs more work to be understood and turned into useful recommendations and best practices.
The ggplot2 package offers a nearly endless array of combinations for visualizing your data. Explore and create your own data visuals. It’s well worth diving deep into ggplot2!