Learning R Project and Assignments Reflection

Project Prerequisites

18 August 2022

Our group was assigned the project titled “Reducing Traffic Mortality in the USA” on DataCamp. The project involves several data analysis techniques, including importing and cleaning data, data manipulation, data visualization, and machine learning. To understand how these methods work, we first had to complete some prerequisite courses.

These prerequisite courses on R comprise:

  • Unsupervised learning in R. Unsupervised learning aims to find patterns in data without trying to make predictions. The course also introduces clustering and dimensionality reduction (PCA) techniques.
  • Introduction to the tidyverse. The tidyverse is a collection of R packages that provides techniques for data manipulation and visualization, mostly through dplyr and ggplot2. With data manipulation, we can filter, sort, and summarize a dataset. We can then turn the processed data into line plots, bar plots, histograms, and other charts using the ggplot2 package.
  • Intermediate regression in R. In this course, we learn about statistical models, especially linear regression. Linear regression is a tool for exploring relationships between variables in a dataset. The course also covers how interactions between variables can affect predictions.
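To make the last point about interactions more concrete, here is a minimal sketch in base R. It is only an illustration under assumptions: the built-in mtcars data stands in for a course dataset, and the variable choices (mpg, wt, hp) and the two hypothetical cars are mine, not taken from the course.

```r
# Built-in mtcars data stands in for a course dataset (assumption).
data(mtcars)

# Additive model: the effect of weight on mpg is the same at every horsepower.
fit_additive <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction model: wt * hp expands to wt + hp + wt:hp,
# so the slope for weight is allowed to change with horsepower.
fit_interaction <- lm(mpg ~ wt * hp, data = mtcars)

# The interaction model has one extra coefficient, named "wt:hp".
print(coef(fit_additive))
print(coef(fit_interaction))

# Predictions for two hypothetical cars with the same weight
# but different horsepower (values chosen only for illustration).
new_cars <- data.frame(wt = c(3, 3), hp = c(100, 200))
print(predict(fit_interaction, newdata = new_cars))
```

Comparing the two sets of coefficients shows how adding the interaction term changes the fitted model, and therefore the predictions.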

 

Then, after taking a look at the project’s task instructions, I summarized the material by following the instructions:

  1. The first step is to read the data and get an overview of it. This requires us to be able to import the data and manipulate it in order to generate an overview of the data frame. This task requires basic knowledge of the tidyverse.
  2. The second step is to create a textual and graphical overview of the data. This requires us to be able to visualize the data that was manipulated (structured) in the previous step. To create the visualization, we use the ggplot function from the ggplot2 package, which is part of the tidyverse. This task also requires basic knowledge of the tidyverse.
  3. The third step is to explore the correlation between variables in the data frame. This requires us to be able to compute correlations or fit a linear regression.
  4. The fourth step is to fit a multivariate linear regression. Multivariate here means the model involves two or more explanatory variables, so we fit them all together to see how each one relates to the outcome while accounting for the others, rather than looking only at the correlation between two variables.
  5. The fifth step is to perform PCA on standardized data. PCA finds linear combinations of the existing attributes in the data, called principal components. This technique allows us to visualize the variation present in a dataset. This requires basic knowledge of unsupervised learning in R.
  6. The sixth step is to visualize the first two principal components. The first two principal component scores produced by the PCA are extracted into a data frame, which we can then visualize using ggplot. This requires basic knowledge of the tidyverse and unsupervised learning in R.
  7. The seventh step is to make clusters of similar states in the data. Clustering is a way to partition a dataset into several groups based on their similarities. The kmeans function is used in this step to create the clusters. This requires basic knowledge of unsupervised learning in R.
  8. The eighth step is to use the kmeans function to visualize the clusters in the PCA scatter plot. Visualizing clusters requires basic knowledge of the tidyverse and ggplot.
  9. The ninth step is to visualize feature differences between clusters. This visualization also requires basic knowledge of the tidyverse and ggplot.
  10. The tenth step is to compute numbers within each cluster. To compute them, we need to manipulate the data in order to get the right numbers. Then, we can visualize them with ggplot. This requires basic knowledge of the tidyverse.
  11. The last step is to make a decision based on the results. This is the step that determines which cluster should be the focus of the project's policy intervention.
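The analysis steps in the middle of this list (correlation, regression, PCA, clustering, and per-cluster numbers) can be sketched roughly as follows. This is only a minimal sketch under assumptions, not the project's actual code: it uses R's built-in USArrests data as a stand-in for the project's state-level data, picks Murder as the outcome and three clusters arbitrarily, and sticks to base R functions.

```r
# USArrests (built in, one row per US state) stands in for the
# project's data; this substitution is an assumption for illustration.
data(USArrests)

# Step 3: pairwise correlations between the numeric columns.
cor_matrix <- cor(USArrests)

# Step 4: linear regression with several explanatory variables.
# Using Murder as the outcome is an arbitrary choice.
fit <- lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests)

# Step 5: PCA on standardized features.
# scale() centers each column to mean 0 and scales it to sd 1.
features <- scale(USArrests[, c("Assault", "UrbanPop", "Rape")])
pca <- prcomp(features)

# Step 6: extract the first two principal component scores.
pc_scores <- as.data.frame(pca$x[, 1:2])

# Step 7: k-means clustering. centers = 3 is arbitrary here;
# set.seed() makes the random starting points reproducible.
set.seed(8)
km <- kmeans(features, centers = 3, nstart = 20)
pc_scores$cluster <- factor(km$cluster)

# Step 10: numbers within each cluster, here the count of states
# and the mean of the outcome per cluster.
cluster_sizes <- table(km$cluster)
cluster_means <- tapply(USArrests$Murder, km$cluster, mean)

print(round(cor_matrix, 2))
print(coef(fit))
print(cluster_sizes)
print(cluster_means)
```

With ggplot2 installed, the pc_scores data frame could then be drawn as a scatter plot for steps 6 and 8, for example with ggplot(pc_scores, aes(PC1, PC2, color = cluster)) + geom_point().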

It takes time to fully comprehend the concept of the project and how each prerequisite is applied in it.