Category: Project and Assignments

Learning R Portfolio Project and Assignments

Reducing Incidence of AIDS in Indonesia: A Data Analysis Model

Post author By shaniaanggun
Post date September 7, 2022


Cases of HIV/AIDS in Indonesia have been a concern for the government, healthcare providers, and the public since 1999, and case numbers have continued to grow since then (Hardisman, 2009). Indonesia’s Central Bureau of Statistics recorded 41,987 new cases of HIV across provinces in 2021, a reported 16.5% national decrease from the previous year. While HIV incidence has decreased, AIDS incidence increased by 22.78%, from 7,036 cases in 2019 to 8,639 in 2020 (Badan Pusat Statistik, 2020). This is an emerging concern that needs further investigation and prevention through health policy.
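The percentage increase quoted above can be checked with a short calculation. This is only a sketch of the arithmetic, using the two case counts reported in the paragraph; the variable names are illustrative.

```r
# AIDS case counts as quoted in the text (Badan Pusat Statistik, 2020)
aids_2019 <- 7036
aids_2020 <- 8639

# Relative change, expressed as a percentage of the 2019 value
pct_increase <- (aids_2020 - aids_2019) / aids_2019 * 100
round(pct_increase, 2)  # 22.78
```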

HIV and AIDS differ in their disease course. HIV is the virus that can cause progression to AIDS, whereas AIDS is the state in which the virus has severely damaged the patient’s immune system (the last stage of HIV disease), which takes roughly ten years to develop. Because of this difference, the two conditions have different causes, treatment, and prevention (Vaillant & Gulick, 2022). While the percentages of both HIV and AIDS risk factors have already been identified, the patterns and trends across provinces that correlate with the development of AIDS in Indonesia still need to be investigated and determined.

Investigation can be done through several data analysis techniques, including machine learning, data manipulation, data cleaning, and data visualization. The goal is to find patterns in how the causes or risk factors of AIDS are intertwined, so that the interactions of the factors (variations) that contribute most to the rising incidence of AIDS can be understood. The approach will make use of data wrangling, plotting, dimensionality reduction, and unsupervised clustering. The data will be imported from the Ministry of Health’s 2020 Profil Kesehatan Indonesia and HIV InfoDATIN.

Since most of the data from the Ministry of Health are published in a human-readable, graphical form, database modelling is needed to turn them into structured datasets. Database modelling is important to simplify data, eliminate redundancy, and enable efficient retrieval (Watt, 2018). Since the objective of the research is to find patterns of causes and risk factors of AIDS in Indonesia, the dataset will need to include the number of new AIDS cases by sex, age group, province, and other risk factors. However, numbers or percentages of AIDS cases by risk factor for the year 2020 are difficult to find, a limitation that will require further investigation and research.

The data will consist of 34 rows, one for each of the 34 provinces, and the columns will represent the number of new AIDS cases and each risk factor: sex (female, male), age group (<1 year old, 1-4 years old, 5-14 years old, 15-19 years old, 20-29 years old, 30-39 years old, 40-49 years old, 50-59 years old, and ≥60 years old), sexuality (heterosexual, homosexual, and bisexual), transfusion, IDU, perinatal, and other medical conditions. The data will therefore have approximately 34 rows and 20 columns, which translates into 34 observations (provinces) and 20 variables. The database model can be built with commonly used software such as Microsoft Access or MySQL. Afterwards, the AIDS case database is imported into R and read in to get an overview of the data.

To read the data, the tidyverse package should be installed first, since it facilitates data wrangling, grouping and summarizing, and visualizing the analysis results. Once the data is loaded, the base R function dim() returns the number of rows and columns, and str() generates an overview of the data frame (Wright, Ellis, Hicks, & Peng, 2021). The data frame here can be named aids_ina.
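As a minimal sketch of this step, the block below builds a stand-in aids_ina data frame and inspects it. The counts are randomly generated placeholders, not real AIDS figures, and the province labels and the new_aids_cases column name are assumptions made for illustration; in practice the data frame would come from the imported database.

```r
# Synthetic placeholder data: the counts below are random, not real AIDS figures.
feature_names <- c(
  "num_of_male", "num_of_fml",
  paste0("num_of_age_", c("one", "two", "three", "four", "five",
                          "six", "seven", "eight", "nine")),
  "num_of_hetero", "num_of_homo", "num_of_bi",
  "num_of_trf", "num_of_idu", "num_of_peri", "num_of_medco"
)

set.seed(1)
features <- as.data.frame(
  matrix(rpois(34 * 18, lambda = 50), nrow = 34,
         dimnames = list(NULL, feature_names))
)

aids_ina <- data.frame(
  province = paste0("province_", 1:34),      # hypothetical province labels
  new_aids_cases = rpois(34, lambda = 250),  # hypothetical target column name
  features
)

dim(aids_ina)  # number of rows and columns: 34 20
str(aids_ina)  # column names, types, and the first few values
```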

To become more familiar with the data, summary statistics can be calculated using summary(), which reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each numeric column (and the length of non-numeric columns). Afterwards, to get a sense of the variables’ distributions, a graphical overview can be made using scatter plots with the function ggpairs() from the GGally package. Scatter plots are statistical graphics for exploring pairwise relationships between the columns of the data set. However, the province column should be deselected first so that only the plots of the numeric variables are shown (Wright, Ellis, Hicks, & Peng, 2021).
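A small sketch of the summary and pairwise-plot step follows. It uses a reduced, randomly generated stand-in for aids_ina with only four of the feature columns; the values and the new_aids_cases column name are assumptions. The plot uses ggpairs() when the GGally package is installed and falls back to base R pairs() otherwise.

```r
# Synthetic stand-in for aids_ina (random counts, not real data)
set.seed(1)
aids_ina <- data.frame(
  province = paste0("province_", 1:34),
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
aids_ina$new_aids_cases <- aids_ina$num_of_male + aids_ina$num_of_fml

# Summary statistics for every column
summary(aids_ina)

# Drop the province column so only numeric variables are plotted
plot_data <- aids_ina[, names(aids_ina) != "province"]

if (requireNamespace("GGally", quietly = TRUE)) {
  print(GGally::ggpairs(plot_data))  # pairwise scatter plots and correlations
} else {
  pairs(plot_data)                   # base R fallback
}
```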

The scatter plots show whether there are potential relationships between the target feature (number of new AIDS cases) and the feature variables (risk factors). To quantify these relationships, a Pearson correlation coefficient matrix can be used, with thresholds to interpret how strong each correlation is: 0.2 as weak, 0.5 as medium, 0.8 as strong, and 0.9 as very strong. The function cor() returns a matrix of the correlations between each pair of variables (Caron, Dedecker, & Michel, 2021).
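The correlation step could look roughly like the sketch below. The data are again random placeholders, and new_aids_cases is an assumed name for the target column, so the resulting coefficients carry no real meaning.

```r
# Synthetic stand-in for aids_ina (random counts, not real data)
set.seed(1)
aids_ina <- data.frame(
  province = paste0("province_", 1:34),
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
aids_ina$new_aids_cases <- aids_ina$num_of_male + aids_ina$num_of_fml

# Pearson correlation matrix of all numeric columns
numeric_data <- aids_ina[, names(aids_ina) != "province"]
cor_matrix <- cor(numeric_data, method = "pearson")
round(cor_matrix, 2)

# Correlation of each risk factor with the target, strongest first
target_cor <- cor_matrix["new_aids_cases",
                         colnames(cor_matrix) != "new_aids_cases"]
target_cor[order(abs(target_cor), decreasing = TRUE)]
```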

The correlation matrix shows which feature variables are strongly correlated with the number of new AIDS cases, and therefore which risk factor should be the focus for health policy development and further investigation. To explore further associations among the feature variables, the provinces can be grouped according to the 18 feature variables. To visualize these associations, principal component analysis (PCA), available through the function princomp(), is one way to reduce the dimensional space so that patterns become visible. Scaling the data before running a PCA is important because PCA uses the absolute variance of each feature to compute the overall variance explained by each principal component. Therefore, a call such as scale(data_frame), for example inside mutate(), can be used to standardize each feature to mean 0 and standard deviation 1 (Wright, Ellis, Hicks, & Peng, 2021).
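A minimal sketch of the scaling and PCA step is shown below, assuming a random placeholder data set with four of the feature columns. It uses base R scale() directly rather than mutate(), which is a simplification for illustration.

```r
# Synthetic stand-in for aids_ina (random counts, not real data)
set.seed(1)
aids_ina <- data.frame(
  province = paste0("province_", 1:34),
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
aids_ina$new_aids_cases <- aids_ina$num_of_male + aids_ina$num_of_fml

# Keep only the risk-factor columns and standardize them
feature_cols <- setdiff(names(aids_ina), c("province", "new_aids_cases"))
features_scaled <- scale(aids_ina[, feature_cols])

round(colMeans(features_scaled), 10)  # means are 0 after scaling
apply(features_scaled, 2, sd)         # standard deviations are 1

# PCA on the standardized features
pca_fit <- princomp(features_scaled)
summary(pca_fit)  # proportion of variance per principal component
```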

An example of column abbreviations for the 18 feature columns:

num_of_male = Number of new cases of AIDS based on male sex,
num_of_fml = Number of new cases of AIDS based on female sex,
num_of_age_one = Number of new cases of AIDS based on age group <1 years-old,
num_of_age_two = Number of new cases of AIDS based on age group 1-4 years-old,
num_of_age_three = Number of new cases of AIDS based on age group 5-14 years-old,
num_of_age_four = Number of new cases of AIDS based on age group 15-19 years-old,
num_of_age_five = Number of new cases of AIDS based on age group 20-29 years-old,
num_of_age_six = Number of new cases of AIDS based on age group 30-39 years-old,
num_of_age_seven = Number of new cases of AIDS based on age group 40-49 years-old,
num_of_age_eight = Number of new cases of AIDS based on age group 50-59 years-old,
num_of_age_nine = Number of new cases of AIDS based on age group ≥60 years-old,
num_of_hetero = Number of new cases of AIDS based on hetero sexuality,
num_of_homo = Number of new cases of AIDS based on homo sexuality,
num_of_bi = Number of new cases of AIDS based on bi sexuality,
num_of_trf = Number of new cases of AIDS by transfusion transmission,
num_of_idu = Number of new cases of AIDS of injecting drug users,
num_of_peri = Number of new cases of AIDS by perinatal transmission,
num_of_medco = Number of new cases of AIDS based on other medical conditions.

The PCA results make it possible to look at the scores of principal components 1, 2, and 3. Visualizing only the first two principal components, with PC1 representing the most variation and PC2 the second most, allows a two-dimensional view that still captures a high portion of the variation from all 18 features. A scatter plot of PC1 and PC2 therefore lets analysts recognize patterns in the data, find groups of similar provinces, and explore how provinces cluster together. PC1 and PC2 can be extracted using pca$scores[… , …] and visualized using the ggplot() function, specifically geom_point() (Wright, Ellis, Hicks, & Peng, 2021).
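The extraction and plotting of the first two components might be sketched as follows. The data are random placeholders, and the names pca_fit, pca_scores, pc1, and pc2 are illustrative; the plot is drawn only when ggplot2 is installed.

```r
# Synthetic stand-in for the risk-factor columns (random counts, not real data)
set.seed(1)
features <- data.frame(
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
pca_fit <- princomp(scale(features))

# Scores of the first two principal components
pca_scores <- data.frame(
  pc1 = pca_fit$scores[, 1],
  pc2 = pca_fit$scores[, 2]
)

# Share of total variance captured by each component
prop_var <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
cumsum(prop_var)[2]  # variance captured by PC1 and PC2 together

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(pca_scores, aes(x = pc1, y = pc2)) + geom_point())
}
```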

After the PCA scatter plot is produced, the number of groups into which the provinces cluster may not yet be entirely clear. To identify the number of clusters, KMeans clustering can be run for a range of cluster counts to create a scree plot, which indicates the point at which adding more clusters no longer adds much explanatory power. To run KMeans, we can use the function kmeans(data_frame[ … , … ], centers = …, nstart = …). To visualize the scree plot, we can use ggplot() with points (geom_point()) and a line (geom_line()) (Wright, Ellis, Hicks, & Peng, 2021).
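One way to sketch the scree plot is shown below. It assumes clustering is run on the standardized features for 1 to 10 clusters, with nstart = 20 and iter.max = 50 chosen arbitrarily; the data are random placeholders, so no clear elbow should be expected here.

```r
# Synthetic stand-in for the risk-factor columns (random counts, not real data)
set.seed(1)
features <- data.frame(
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
features_scaled <- scale(features)

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
tot_withinss <- sapply(k_values, function(k) {
  kmeans(features_scaled, centers = k, nstart = 20, iter.max = 50)$tot.withinss
})
scree <- data.frame(k = k_values, tot_withinss = tot_withinss)
scree

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(scree, aes(x = k, y = tot_withinss)) +
          geom_point() +
          geom_line())
}
```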

Afterwards, the within-cluster sum of squares is plotted, and if an elbow can be seen in the scree plot, the provinces can be assigned to the number of clusters at which the ‘elbow’ forms. To see how the clusters of provinces relate to each other, a PCA scatter plot can be drawn with each cluster in a different colour. The cluster assignments for the chosen number of clusters can be obtained with as.factor(kmeans_fit[[n]]$cluster), and the plot can be produced with ggplot() and geom_point(aes(col = cluster)) to colour each cluster (Wright, Ellis, Hicks, & Peng, 2021).
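The cluster assignment and coloured PCA plot could be sketched like this. The choice of three clusters is purely an assumption for illustration (the real number would come from the scree plot), and the data are random placeholders.

```r
# Synthetic stand-in for the risk-factor columns (random counts, not real data)
set.seed(1)
features <- data.frame(
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
features_scaled <- scale(features)
pca_fit <- princomp(features_scaled)

# Fit KMeans with an assumed number of clusters (k = 3 is hypothetical)
k <- 3
km <- kmeans(features_scaled, centers = k, nstart = 20, iter.max = 50)
cluster_id <- as.factor(km$cluster)

# Combine PC scores with the cluster labels
pca_scores <- data.frame(
  pc1 = pca_fit$scores[, 1],
  pc2 = pca_fit$scores[, 2],
  cluster = cluster_id
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(pca_scores, aes(x = pc1, y = pc2)) +
          geom_point(aes(col = cluster)))
}
```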

These results reveal structure in the data, but they are not always easy to interpret, because they show how the provinces group together in terms of the principal components. It is therefore useful to understand how the clusters differ with reference to the 18 features used for clustering. To interpret the differences, the unscaled features can be used. The cluster labels are added to the original data frame with data_frame$cluster <- cluster_id. Since the data has to be in long format, the number of new AIDS cases must be deselected, and the function gather() (superseded in tidyr by pivot_longer()) is used to collapse all columns except province and cluster into one key column and one value column. The result can then be visualized with ggplot() and geom_violin() to picture the probability density of the data at different values (Wright, Ellis, Hicks, & Peng, 2021).
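A rough sketch of the reshaping and violin plot follows. The data are random placeholders, k = 3 is a hypothetical cluster count, and the column names feature and cases are made up for the long format; the reshaping and plot run only when tidyr and ggplot2 are installed.

```r
# Synthetic stand-in for aids_ina (random counts, not real data)
set.seed(1)
aids_ina <- data.frame(
  province = paste0("province_", 1:34),
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
aids_ina$new_aids_cases <- aids_ina$num_of_male + aids_ina$num_of_fml

# Cluster on the scaled features, then attach labels to the unscaled data
feature_cols <- setdiff(names(aids_ina), c("province", "new_aids_cases"))
km <- kmeans(scale(aids_ina[, feature_cols]), centers = 3,
             nstart = 20, iter.max = 50)
aids_ina$cluster <- as.factor(km$cluster)

if (requireNamespace("tidyr", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Long format: one row per province and feature
  long <- tidyr::gather(
    aids_ina[, names(aids_ina) != "new_aids_cases"],
    key = "feature", value = "cases", -province, -cluster
  )
  print(ggplot(long, aes(x = feature, y = cases, fill = cluster)) +
          geom_violin())
}
```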

The plots will show how each cluster of provinces differs across the eighteen features, and that the clusters may need different kinds of interventions. To clarify which cluster of provinces to prioritize, the number of AIDS cases within each cluster can be computed. The results will show which cluster has the highest number of new AIDS cases. From these two results, the violin plots and the computation, it can be determined:

Which cluster has the highest density for each feature compared with the others (violin plots),
Which feature has the highest density within each cluster (violin plots), and
Which cluster has the largest total number of new AIDS cases throughout the year (computation).
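The per-cluster computation could be sketched as below, using base R aggregate() as a stand-in for a grouped summary. The data are random placeholders and k = 3 is a hypothetical cluster count, so the totals are not meaningful.

```r
# Synthetic stand-in for aids_ina (random counts, not real data)
set.seed(1)
aids_ina <- data.frame(
  province = paste0("province_", 1:34),
  num_of_male = rpois(34, 150), num_of_fml = rpois(34, 80),
  num_of_hetero = rpois(34, 120), num_of_idu = rpois(34, 20)
)
aids_ina$new_aids_cases <- aids_ina$num_of_male + aids_ina$num_of_fml

feature_cols <- setdiff(names(aids_ina), c("province", "new_aids_cases"))
km <- kmeans(scale(aids_ina[, feature_cols]), centers = 3,
             nstart = 20, iter.max = 50)
aids_ina$cluster <- as.factor(km$cluster)

# Total number of new AIDS cases within each cluster
cluster_totals <- aggregate(new_aids_cases ~ cluster, data = aids_ina, FUN = sum)

# Clusters ordered from most to fewest cases
cluster_totals[order(-cluster_totals$new_aids_cases), ]
```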

With those three interpretations, a focus for health policy and further investigation can be determined. The goal of finding patterns in the causes or risk factors that contribute most to the rising incidence of AIDS can be reached through the data visualizations above together with further investigation. A limitation of this data analysis is the lack of data on AIDS cases and risk factors published by the Ministry of Health for the year 2020. All in all, the methods above can be applied to analyse the data and provide a basis for creating health policies and conducting further research on this matter.

References

Badan Pusat Statistik. (2020). Jumlah Kasus HIV dan AIDS yang Dilaporkan di Indonesia (2010-2020). Profil Kesehatan 2020.

Caron, E., Dedecker, J., & Michel, B. (2021). Linear Regression. The R Journal, 83-100.

Hardisman. (2009). HIV/AIDS di Indonesia : Fenomena Gunung Es dan Peranan Pelayanan Kesehatan Primer. Jurnal Kesehatan Masyarakat Nasional Vol. 3 No. 5, 237-238.

Vaillant, A. A., & Gulick, P. G. (2022). HIV Disease Current Practice. StatPearls.

Watt, A. (2018). Data Modelling. In A. Watt, & N. Eng, Database Design.

Wright, C., Ellis, S. E., Hicks, S. C., & Peng, R. D. (2021). Tidyverse Skills for Data Science.

Learning R Portfolio Project and Assignments

Group Project Portfolio: Reducing Traffic Mortality in the USA

Post author By shaniaanggun
Post date September 2, 2022

2 September 2022,

We have just finished presenting our group project about reducing traffic mortality. We learned so much about statistics and data analysis over the course of the project. It was not an easy one to do, though; there were lots of misunderstandings and misinterpretations here and there. Nevertheless, I’m pretty curious about how the methods in this project can be applied to an idea that I thought of just a week ago. Let us wait and see.

Access our project report here:

https://tinyurl.com/Group1-DataCampProject

Tags Data Analysis, Data Science, Datacamp, Learning R, Project, R

Learning R Project and Assignments Reflection

Project Prerequisites

Post author By shaniaanggun
Post date August 28, 2022

Notes from Unsupervised Learning in R

18 August 2022,

Our group was assigned the project titled “Reducing Traffic Mortality in The USA” in Datacamp. The project involves several data analysis techniques, including data manipulation, data visualization, machine learning, and importing and cleaning data. To understand how these methods work, we had to take some prerequisite courses.

These prerequisite courses on R consist of:

Unsupervised learning in R. This technique aims to find patterns in data without trying to make predictions. The course also includes introductions to clustering and dimensionality reduction (PCA) techniques.
Introduction to the tidyverse. The tidyverse provides techniques for data manipulation and visualization, mostly through the dplyr and ggplot2 packages. With data manipulation, we are able to filter, sort, and summarize a dataset. We can then turn the processed data into line plots, bar plots, histograms, and other plots using the ggplot2 package.
Intermediate regression in R. In this course, we learn about statistical models, especially linear regression. Linear regression serves as a tool to explore relationships between variables in datasets and to understand how interactions between variables can affect predictions.

Then, when I took a look at the project’s task instructions, I summarized the materials step by step:

The first step is to read the data and get an overview of it. This step requires us to be able to import data and manipulate it, which then lets us generate an overview of the data frame. This task requires basic knowledge of tidyverse.
The second step is to create a textual and graphical overview of the data. This step requires us to be able to visualize the data that has been manipulated (structured) before. To create the visualization, we use the ggplot function of the tidyverse package, so this task also requires basic knowledge of tidyverse.
The third step is to explore the correlation between variables in the data frame. This requires us to be able to compute correlations or fit a linear regression.
The fourth step is to fit a multivariate linear regression. Multivariate means that it involves two or more variables, so we have to fit all the variables together to see not only the correlation between two variables, but also their relationships with the other variables.
The fifth step is to perform PCA on standardized data. PCA is an analysis of linear components of all existing attributes in the data. This technique allows us to visualize the variation that is present in a dataset. This requires basic knowledge of unsupervised learning in R.
The sixth step is to visualize the first two principal components. The first two principal component scores resulting from PCA are extracted into a data frame, which we can then visualize using ggplot. This requires basic knowledge of tidyverse and unsupervised learning in R.
The seventh step is to make clusters out of similar states in the data. Creating clusters is a way to partition a data set into several groups based on their similarities. The KMeans function is used in this step to create clusters. This requires basic knowledge of unsupervised learning in R.
The eighth step is to use the KMeans function to visualize clusters in the PCA scatter plot. Visualizing clusters requires basic knowledge of tidyverse and ggplot.
The ninth step is to visualize feature differences between clusters. This visualization also requires basic knowledge of tidyverse and ggplot.
The tenth step is to compute the numbers within each cluster. To compute them, we need to do data manipulation in order to get the right numbers. Then, we can visualize the result with ggplot. This requires basic knowledge of tidyverse.
And the last step is to make a decision based on the results. This is the step that determines which cluster should be the focus of policy intervention in the project.
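To make the third and fourth steps more concrete for myself, here is a small sketch using R's built-in mtcars data set instead of the project data. The choice of mpg, wt, and hp as variables is only for illustration and has nothing to do with the traffic project itself.

```r
# Step 3 (sketch): correlation between two variables
cor(mtcars$mpg, mtcars$wt)

# Step 4 (sketch): linear regression with more than one explanatory variable
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)     # intercept and one slope per explanatory variable
summary(fit)  # standard errors, p-values, and R-squared
```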

It takes time to fully comprehend the concept of the project and the implementation of each prerequisite in this project.

Tags Data Science, Learn, Learning R, Self-Reflection

Learning R Portfolio Project and Assignments

Group Assignments on Introduction to R Course

Post author By shaniaanggun
Post date August 25, 2022

Homepage of Introduction to R Course on datacamp.com

11 August 2022,

At the last weekly meeting and discussion, we were assigned to groups for group tasks and projects. We were also assigned to finish an Introduction to R course on datacamp.com by the end of the week.

The course consists of six chapters:

1. Intro to Basics

In this chapter, we learned about the basic usage of the R console as a calculator and about variable assignment. As the name suggests, this chapter was not challenging to finish, as each exercise uses simple commands and uncomplicated theory.

Exercise on variable assignment.

2. Vectors

In this chapter, we learned about how to create vectors in R, name vectors, select elements from vectors, and compare different vectors. At this stage, it was quite challenging as there were so many new terms that we had to understand first, but it was still bearable and we still enjoyed the process.

Exercise on naming a vector.
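As a reminder of what this chapter covered, here is a tiny sketch with made-up values (they are not from the course exercises):

```r
# Hypothetical daily values, for illustration only
steps <- c(4200, 8100, 6500)
names(steps) <- c("Monday", "Tuesday", "Wednesday")

steps["Tuesday"]  # select an element by name
steps[c(1, 3)]    # select elements by position
steps > 5000      # compare every element with a threshold
```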

3. Matrices

In this chapter, we learned how to create matrices and do basic computations with them. We also learned how to analyze data with matrices using a case study in R. This stage was already challenging because I had to recall a bit of matrix material from my high school math lessons. Although it wasn’t that complicated, I still had to understand how matrices are used first. I also reflected on how matrices can be applied in our daily lives; for example, the case study in this chapter analyzes the box office numbers of certain movies. It’s awesome to see with my own eyes how mathematics plays a big role in our lives, especially in data science.

Exercise on selecting matrix elements.
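A small sketch of the kind of matrix work described above, with invented box office numbers and movie names (not the ones used in the course):

```r
# Hypothetical box office revenue in millions, one row per movie
box_office <- matrix(
  c(460, 314,
    290, 247),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("movie_a", "movie_b"), c("domestic", "international"))
)

box_office["movie_a", "international"]  # select a single element: 314
rowSums(box_office)                     # total revenue per movie
```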

4. Factors

In this chapter, we learned about how to store categorical variables with factors in R. It was also stated that factors are very important in data analysis. We also learned how to create factors, subset factors, and compare factors. It was challenging to understand the implementation of factors and the usage in R, but after several practices and explanations, we were able to understand the application of factors in data analysis.
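For my own notes, a short sketch of creating and comparing factors; the category values here are made up for illustration.

```r
# Unordered (nominal) factor
blood_type <- factor(c("A", "O", "B", "O", "AB"))
levels(blood_type)   # "A" "AB" "B" "O"
summary(blood_type)  # count per level

# Ordered factor, which allows comparisons between levels
severity <- factor(c("mild", "severe", "moderate"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)
severity[1] < severity[2]  # TRUE, because mild < severe
```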

5. Data frames

In this chapter, we learned about data frames, which store datasets. We also learned how to create them, select parts of them, and order a data frame according to certain variables. This is, by far, my favorite chapter of the course because it covers datasets of the kind I expect to encounter during my years as a medical student. So I took many notes in this chapter, and I personally liked and enjoyed the process. Storing data can’t be much more fun than this.

Exercise on creating a data frame.
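A minimal sketch of creating, selecting from, and ordering a data frame, using an invented patient table:

```r
# Hypothetical patient records, for illustration only
patients <- data.frame(
  id = 1:4,
  age = c(34, 27, 45, 31),
  smoker = c(TRUE, FALSE, FALSE, TRUE)
)

patients$age                        # select one column
patients[patients$smoker, ]         # select rows that meet a condition
patients[order(patients$age), ]     # order rows by age, youngest first
```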

6. Lists

In the last chapter of the course, we learned about lists. Lists hold components of different types, such as matrices, vectors, data frames, or even other lists. We also learned how to create them, name them, and subset them.

Exercise on creating a list.
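A short sketch of a named list holding different kinds of objects (the contents are arbitrary):

```r
# A list can mix vectors, matrices, and data frames
my_list <- list(
  vec = 1:3,
  mat = matrix(1:4, nrow = 2),
  df = data.frame(x = c("a", "b"))
)

names(my_list)          # "vec" "mat" "df"
my_list$vec             # select a component by name
my_list[["mat"]][1, 2]  # element in row 1, column 2 of the matrix: 3
```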

All in all, I enjoyed the process of learning R on datacamp because it is practical, easy to learn, and highly guided. Through this course, I can finally learn how to apply mathematics in learning statistics and data!

Tags Data Science, Datacamp, Learn, Learning R, Self-Reflection

Learning R Portfolio Project and Assignments

Exploring R

Post author By shaniaanggun
Post date August 24, 2022

Exploring R task

5 August 2022,

The task that dr. Budi assigned led me to learn a lot about the basics of R, using a data set of pregnancy estriol levels and birth weight. Here is the list of what I learned:

Basic data analysis in R

Fivenum: a function that returns the minimum, lower hinge (approximately the 1st quartile), median, upper hinge (approximately the 3rd quartile), and maximum value of a data set. These five numbers also serve as the basis for the structure of a boxplot.
Summary: a function that returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum value of a data set.
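To see the difference between the two functions, here is a small sketch on a made-up vector (not the estriol data). With ten values, the hinges reported by fivenum() differ slightly from the quartiles reported by summary().

```r
# Hypothetical values with one large outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

fivenum(x)  # 1.0 3.0 5.5 8.0 100.0 (min, lower hinge, median, upper hinge, max)
summary(x)  # quartiles are 3.25 and 7.75, and the mean (14.5) is included
```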

Exploratory graphs (visualization) in R

Boxplot: a graph which visualizes how the data observations are distributed. The presence of outliers can be a sign that the data are not evenly distributed.
Histogram: a graph which visualizes the frequency of data observations and also shows skewness.
Scatterplot: a graph which visualizes whether the relationship between two vectors/variables is linear or nonlinear. It also shows whether the vectors/variables are related to each other or not.

Normality test and normalization in R

Normality test: a test to check whether the data are normally distributed or not. Normality can be assessed using the Shapiro-Wilk test, skewness, a Q-Q plot, or other methods.
Normalization: a method to rescale variables to a common scale, which changes the range of the data.
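A small sketch of both ideas on randomly generated, skewed values (not the estriol data). Min-max rescaling and z-scores are shown as two common ways to normalize; which one the task intended is my assumption.

```r
# Randomly generated right-skewed values, for illustration only
set.seed(1)
x <- rexp(50)

# Normality checks: Shapiro-Wilk test and a Q-Q plot
shapiro_result <- shapiro.test(x)
shapiro_result$p.value  # a small p-value suggests non-normal data
qqnorm(x)
qqline(x)

# Two common rescaling approaches
x_minmax <- (x - min(x)) / (max(x) - min(x))  # rescale to the range 0 to 1
x_z <- as.numeric(scale(x))                   # mean 0, standard deviation 1
```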

With these materials, I have tried to apply this basic knowledge to analyze the pregnancy estriol level and birth weight data set that was given earlier. The results can be seen in the attachment below:

https://bit.ly/IntrotoR-Anggun
