Learning R Portfolio Project and Assignments

Reducing Incidence of AIDS in Indonesia: A Data Analysis Model

HIV/AIDS has been a concern for the Indonesian government, healthcare providers, and the public since 1999, and case numbers have grown ever since (Hardisman, 2009). Indonesia’s Central Bureau of Statistics recorded 41,987 new cases of HIV across the provinces in 2021, a national decrease of 16.5% from the previous year. While HIV incidence decreased, AIDS incidence increased by 22.78%, from 7,036 cases in 2019 to 8,639 in 2020 (Badan Pusat Statistik, 2020). This is an emerging concern that needs further investigation and prevention through health policy.

HIV and AIDS differ in their disease course. HIV is the virus that can cause progression towards AIDS, while AIDS is the state in which the virus has caused severe damage to the patient’s immune system (the last stage of HIV disease), which typically takes about ten years to develop. Because they occur at different stages, the two conditions have different causes, treatment, and prevention (Vaillant & Gulick, 2022). While the percentages of HIV and AIDS risk factors have already been identified, the patterns and trends across provinces that correlate with the development of AIDS in Indonesia still need to be investigated and determined.

This investigation can be done with several data analysis techniques, including machine learning, data manipulation, data cleaning, and data visualization. The goal is to find patterns in how the causes or risk factors of AIDS are intertwined, so that the interactions of the factors (variations) that contribute most to the rising incidence of AIDS can be understood. The approach makes use of data wrangling, plotting, dimensionality reduction, and unsupervised clustering. The data will be imported from The Ministry of Health’s 2020 Profil Kesehatan Indonesia and HIV InfoDATIN.

Since most of the data from The Ministry of Health are published in a readable, graphical form, database modelling is the step that turns these data into structured datasets. Database modelling is important to simplify data, eliminate redundancy, and enable efficient retrieval (Watt, 2018). Since the objective of the research is to find patterns in the causes and risk factors of AIDS in Indonesia, the dataset will need to include the number of new AIDS cases by sex, age group, province, and other risk factors. However, numbers or percentages of AIDS cases by risk factor for the year 2020 are difficult to find, and this limitation will require further investigation and research.

The data will consist of 34 rows, one for each of the 34 provinces, and the columns will represent the number of new AIDS cases and each risk factor: sex (female, male), age group (<1 years-old, 1-4 years-old, 5-14 years-old, 15-19 years-old, 20-29 years-old, 30-39 years-old, 40-49 years-old, 50-59 years-old, and ≥60 years-old), sexuality (heterosexual, homosexual, and bisexual), transfusion, IDU, perinatal, and other medical conditions. The data will have approximately 34 rows and 20 columns, which translates into 34 observations (provinces) and 20 variables. This database can be built with commonly used software such as Microsoft Access or MySQL. Afterwards, the AIDS case database is imported into R and read in to get an overview of the data before further analysis.

To read the data, the tidyverse package must be installed first; it facilitates data wrangling, grouping and summarizing, and visualizing the analysis results. Once the data are loaded, the base R function dim() returns the number of rows and columns and str() generates an overview of the data frame (Wright, Ellis, Hicks, & Peng, 2021). The data frame here can be given the name aids_ina.
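A minimal sketch of this first look at the data might be as follows. The file name aids_ina.csv, the column name new_cases, and the stand-in values are all assumptions made for illustration; the real data frame would come from the exported database.

```r
library(tidyverse)

# Hypothetical import step; the file name is an assumption, not from the source.
# aids_ina <- read_csv("aids_ina.csv")

# Stand-in data with made-up values so the sketch runs on its own.
set.seed(1)
aids_ina <- tibble(
  province    = paste("Province", 1:34),
  num_of_male = rpois(34, lambda = 170),
  num_of_fml  = rpois(34, lambda = 80),
  new_cases   = num_of_male + num_of_fml
)

dims <- dim(aids_ina)  # integer vector: number of rows, number of columns
dims
str(aids_ina)          # column names, types, and first values
```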

To become more familiar with the data, summary statistics can be calculated with summary(), which reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each numeric column (and the length of non-numeric columns). Afterwards, to get a sense of how the variables are distributed, a graphical overview can be made as a scatter plot matrix with the function ggpairs() from the GGally package. Scatter plots are a statistical graphic for exploring pairwise relationships between the columns of a data set. To show only the relationships between the variables, the province column must be deselected first (Wright, Ellis, Hicks, & Peng, 2021).
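These two steps could be sketched as below. The data frame is a small stand-in with made-up values, and new_cases is a hypothetical name for the column holding the number of new AIDS cases.

```r
library(dplyr)
library(GGally)  # provides ggpairs(); it is not part of the core tidyverse

# Stand-in data with made-up values (assumption, for illustration only).
set.seed(1)
aids_ina <- tibble(
  province    = paste("Province", 1:34),
  num_of_male = rpois(34, lambda = 170),
  num_of_fml  = rpois(34, lambda = 80),
  new_cases   = num_of_male + num_of_fml
)

summary(aids_ina)  # per-column summary statistics

# Drop the province label so only numeric variables are plotted.
numeric_vars <- aids_ina %>% select(-province)
pair_plot <- ggpairs(numeric_vars)
pair_plot
```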

The scatter plot matrix shows whether there are potential relationships between the target feature (the number of new AIDS cases) and the feature variables (risk factors). To quantify the relationships seen in the scatter plots, a Pearson correlation coefficient matrix can be used, with thresholds for interpreting how strong a correlation is. The thresholds, applied to the absolute value of the coefficient, are: 0.2 as weak, 0.5 as medium, 0.8 as strong, and 0.9 as very strong. The function cor() returns a matrix of the correlations between each pair of variables (Caron, Dedecker, & Michel, 2021).
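As an illustration, the correlation matrix and the ranking of risk factors against the target could be computed roughly like this. The data and the column name new_cases are stand-ins, not values from the Ministry of Health.

```r
library(dplyr)

# Stand-in data with made-up values (assumption, for illustration only).
set.seed(1)
aids_ina <- tibble(
  province    = paste("Province", 1:34),
  num_of_male = rpois(34, lambda = 170),
  num_of_fml  = rpois(34, lambda = 80),
  new_cases   = num_of_male + num_of_fml
)

# Pearson correlation between every pair of numeric columns.
cor_matrix <- aids_ina %>%
  select(-province) %>%
  cor(method = "pearson")
round(cor_matrix, 2)

# Correlation of each risk factor with the target, strongest first.
risk_cor <- cor_matrix["new_cases", colnames(cor_matrix) != "new_cases"]
sort(abs(risk_cor), decreasing = TRUE)
```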

The correlation matrix shows which feature variables are strongly correlated with the number of new AIDS cases, and therefore which risk factors should be the focus of health policy development and further investigation. To explore further associations among the feature variables, the provinces can be grouped according to the 18 feature variables. To visualize these associations, principal component analysis (PCA), run with the function princomp(), is one way to reduce the dimensional space so that patterns become visible. Scaling the data before running a PCA is important, because PCA works with the absolute variance of each feature, so features with larger values would otherwise dominate the principal components. Therefore, scale() can be applied to each feature column, for example inside mutate(), to standardize every feature to mean 0 and standard deviation 1 (Wright, Ellis, Hicks, & Peng, 2021).

Example column abbreviations for the 18 feature columns:

  •       num_of_male = Number of new cases of AIDS based on male sex,
  •       num_of_fml = Number of new cases of AIDS based on female sex,
  •       num_of_age_one = Number of new cases of AIDS based on age group <1 years-old,
  •       num_of_age_two = Number of new cases of AIDS based on age group 1-4 years-old,
  •       num_of_age_three = Number of new cases of AIDS based on age group 5-14 years-old,
  •       num_of_age_four = Number of new cases of AIDS based on age group 15-19 years-old,
  •       num_of_age_five = Number of new cases of AIDS based on age group 20-29 years-old,
  •       num_of_age_six = Number of new cases of AIDS based on age group 30-39 years-old,
  •       num_of_age_seven = Number of new cases of AIDS based on age group 40-49 years-old,
  •       num_of_age_eight = Number of new cases of AIDS based on age group 50-59 years-old,
  •       num_of_age_nine = Number of new cases of AIDS based on age group ≥60 years-old,
  •       num_of_hetero = Number of new cases of AIDS based on heterosexual orientation,
  •       num_of_homo = Number of new cases of AIDS based on homosexual orientation,
  •       num_of_bi = Number of new cases of AIDS based on bisexual orientation,
  •       num_of_trf = Number of new cases of AIDS by transfusion transmission,
  •       num_of_idu = Number of new cases of AIDS of injecting drug users,
  •       num_of_peri = Number of new cases of AIDS by perinatal transmission,
  •       num_of_medco = Number of new cases of AIDS based on other medical conditions.
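Using the 18 abbreviations above, scaling and PCA could be sketched as follows. The values are randomly generated placeholders, and applying scale() through mutate(across(...)) is one possible way to standardize the columns, not necessarily the exact call the analysis will use.

```r
library(dplyr)

feature_names <- c(
  "num_of_male", "num_of_fml",
  paste0("num_of_age_", c("one", "two", "three", "four", "five",
                          "six", "seven", "eight", "nine")),
  "num_of_hetero", "num_of_homo", "num_of_bi",
  "num_of_trf", "num_of_idu", "num_of_peri", "num_of_medco"
)

# Stand-in data: 34 provinces x 18 features of made-up counts.
set.seed(1)
aids_ina <- data.frame(
  province = paste("Province", 1:34),
  matrix(rpois(34 * 18, lambda = 30), nrow = 34,
         dimnames = list(NULL, feature_names))
)

# Standardize every feature to mean 0 and standard deviation 1.
aids_scaled <- aids_ina %>%
  mutate(across(all_of(feature_names), ~ as.numeric(scale(.x))))

# PCA on the scaled features only (province is a label, not a feature).
pca <- princomp(aids_scaled[, feature_names])
summary(pca)  # proportion of variance captured by each component
```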

The PCA results make it possible to inspect the scores of principal components 1, 2, and 3. Visualizing only the first two principal components, with PC1 representing the most variation and PC2 the second most, allows a two-dimensional plot that still captures a high portion of the variation in all 18 features. A scatter plot of PC1 against PC2 therefore lets analysts recognize patterns in the data, find groups of similar provinces, and explore how provinces cluster together. PC1 and PC2 can be extracted using pca$scores[… , …] and visualized with the ggplot() function, specifically geom_point() (Wright, Ellis, Hicks, & Peng, 2021).
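A sketch of extracting and plotting the first two components is given below. The feature matrix is randomly generated, and the names feature_1 to feature_18 are placeholders for the real columns.

```r
library(ggplot2)

# Stand-in feature matrix with made-up counts (assumption).
set.seed(1)
features <- matrix(
  rpois(34 * 18, lambda = 30), nrow = 34,
  dimnames = list(paste("Province", 1:34), paste0("feature_", 1:18))
)
pca <- princomp(scale(features))

# Keep the scores of the first two principal components.
pca_scores <- data.frame(
  province = rownames(features),
  pc1 = pca$scores[, 1],
  pc2 = pca$scores[, 2]
)

pca_plot <- ggplot(pca_scores, aes(x = pc1, y = pc2)) +
  geom_point() +
  labs(x = "PC1", y = "PC2")
pca_plot
```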

From the PCA scatter plot alone, the number of groups into which the provinces cluster is not yet entirely clear. To identify the number of clusters, k-means clustering can be run for a range of cluster counts and summarized in a scree plot, which indicates the point at which adding more clusters no longer adds much explanatory power. To fit k-means, we can use the function kmeans(data_frame[ … , … ], centers = …, nstart = …). To draw the scree plot, we can use ggplot() with points (geom_point()) and a line (geom_line()) (Wright, Ellis, Hicks, & Peng, 2021).
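One way to build such a scree plot is sketched here. Clustering on the PC1 and PC2 scores, trying 1 to 10 clusters, and using nstart = 20 are assumptions chosen for illustration, as is the randomly generated data.

```r
library(ggplot2)

# Stand-in feature matrix with made-up counts (assumption).
set.seed(1)
features <- matrix(rpois(34 * 18, lambda = 30), nrow = 34)
pca <- princomp(scale(features))
pc12 <- pca$scores[, 1:2]

# Fit k-means once for each candidate number of clusters.
kmeans_fit <- lapply(1:10, function(k) kmeans(pc12, centers = k, nstart = 20))

# Total within-cluster sum of squares for each fit.
scree <- data.frame(
  clusters = 1:10,
  wss = sapply(kmeans_fit, function(fit) fit$tot.withinss)
)

scree_plot <- ggplot(scree, aes(x = clusters, y = wss)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Number of clusters", y = "Within-cluster sum of squares")
scree_plot
```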

The scree plot shows the within-cluster sum of squares for each number of clusters. If an elbow is visible, the provinces can be assigned to the number of clusters at which the elbow occurs. To see how the province clusters relate to each other, the PCA scatter plot can be redrawn with each cluster in a different colour. The cluster assignments for the chosen number of clusters are obtained with as.factor(kmeans_fit[[n]]$cluster), and the plot is produced with ggplot() and geom_point(aes(col = cluster)) to colour each cluster (Wright, Ellis, Hicks, & Peng, 2021).
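The assignment and colouring step might look like the sketch below. The choice of three clusters is only a placeholder; the real number has to be read off the elbow in the scree plot, and the data are randomly generated.

```r
library(ggplot2)

# Stand-in feature matrix with made-up counts (assumption).
set.seed(1)
features <- matrix(rpois(34 * 18, lambda = 30), nrow = 34)
pca <- princomp(scale(features))
pc12 <- pca$scores[, 1:2]
kmeans_fit <- lapply(1:10, function(k) kmeans(pc12, centers = k, nstart = 20))

n_clusters <- 3  # placeholder; use the value suggested by the elbow
cluster_id <- as.factor(kmeans_fit[[n_clusters]]$cluster)

pca_scores <- data.frame(pc1 = pc12[, 1], pc2 = pc12[, 2], cluster = cluster_id)

cluster_plot <- ggplot(pca_scores, aes(x = pc1, y = pc2)) +
  geom_point(aes(col = cluster)) +
  labs(x = "PC1", y = "PC2", colour = "Cluster")
cluster_plot
```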

These results reveal structure in the data, but they are not always easy to interpret, because the plot shows how the provinces group together in terms of the principal components rather than the original variables. To understand how the clusters differ with respect to the 18 features used for clustering, the unscaled features can be used. The cluster assignments are first added to the original data frame with data_frame$cluster <- cluster_id. Since the data must be in long format, the number of new AIDS cases is deselected and the function gather() is used to combine the feature columns into one key column and one value column, leaving the province and cluster columns as they are. The result is then visualized with ggplot() and geom_violin() to picture the probability density of the data at different values (Wright, Ellis, Hicks, & Peng, 2021).
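A rough sketch of the reshaping and the violin plot follows. The data, the column name new_cases, the placeholder feature names, and the choice of three clusters are all assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in data with made-up counts (assumption).
set.seed(1)
features <- matrix(rpois(34 * 18, lambda = 30), nrow = 34,
                   dimnames = list(NULL, paste0("feature_", 1:18)))
aids_ina <- data.frame(
  province  = paste("Province", 1:34),
  new_cases = rpois(34, lambda = 250),
  features
)
pca <- princomp(scale(features))
cluster_id <- as.factor(kmeans(pca$scores[, 1:2], centers = 3, nstart = 20)$cluster)

# Attach the cluster labels to the unscaled data.
aids_ina$cluster <- cluster_id

# Long format: one row per province and feature.
aids_long <- aids_ina %>%
  select(-new_cases) %>%
  gather(key = "feature", value = "cases", -province, -cluster)

violin_plot <- ggplot(aids_long, aes(x = cluster, y = cases, fill = cluster)) +
  geom_violin() +
  facet_wrap(~ feature)
violin_plot
```

gather() still works but is superseded in current tidyr; pivot_longer() is the recommended replacement and would produce the same long table here.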

The plots will show how each cluster of provinces differs across the eighteen features, which may call for different kinds of interventions. To make clear which cluster of provinces to prioritize, the number of new AIDS cases within each cluster can be computed. The result shows which cluster has the highest number of new AIDS cases. From these two results, the violin plots and the computation, it can be determined:

  1.     Which cluster shows the highest values for each feature (violin plots),
  2.     Which feature is most prominent within each cluster (violin plots), and
  3.     Which cluster has the largest total number of new AIDS cases throughout the year (computation).
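The computation in point 3 could be done roughly as follows. The cluster labels and case counts are randomly generated stand-ins, and new_cases is a hypothetical column name.

```r
library(dplyr)

# Stand-in data with made-up counts and cluster labels (assumption).
set.seed(1)
aids_ina <- tibble(
  province  = paste("Province", 1:34),
  new_cases = rpois(34, lambda = 250),
  cluster   = factor(sample(1:3, 34, replace = TRUE))
)

# Number of provinces and total new AIDS cases per cluster, largest first.
cluster_totals <- aids_ina %>%
  group_by(cluster) %>%
  summarise(
    provinces       = n(),
    total_new_cases = sum(new_cases)
  ) %>%
  arrange(desc(total_new_cases))
cluster_totals
```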

With those three interpretations, a focus for health policy and further investigation can be determined. The goal of finding patterns in the causes or risk factors that contribute most to the rising incidence of AIDS can be met by combining the data visualizations above with further investigation. A limitation of this analysis is the lack of data on AIDS cases and risk factors published by The Ministry of Health for the year 2020. All in all, the methods above can be applied to analyse the data and provide the background needed to create health policies and conduct further research on this matter.


Badan Pusat Statistik. (2020). Jumlah Kasus HIV dan AIDS yang Dilaporkan di Indonesia (2010-2020). Profil Kesehatan 2020.

Caron, E., Dedecker, J., & Michel, B. (2021). Linear Regression. The R Journal, 83-100.

Hardisman. (2009). HIV/AIDS di Indonesia : Fenomena Gunung Es dan Peranan Pelayanan Kesehatan Primer. Jurnal Kesehatan Masyarakat Nasional Vol. 3 No. 5, 237-238.

Vaillant, A. A., & Gulick, P. G. (2022). HIV Disease Current Practice. StatPearls.

Watt, A. (2018). Data Modelling. In A. Watt, & N. Eng, Database Design.

Wright, C., Ellis, S. E., Hicks, S. C., & Peng, R. D. (2021). Tidyverse Skills for Data Science.