How to apply cluster analysis to a database in R

I have a dataset containing applicants' answers to a course entrance test. The original dataset has 2k rows and 105 columns, of which 100 correspond to questions from 4 basic areas: mathematics, language, science, and social studies.
I have created the following short example so that you can see roughly what the table looks like:
sector<-c("Privado" ,"Publico" ,"Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Publico", "Publico",
"Publico" ,"Publico", "Publico", "Publico" ,"Privado" ,"Publico" ,
"Publico" ,"Publico" ,"Publico")
aspirante<-c("337877" ,"339161", "388425" ,"371828" ,"288598" ,"396295" ,"400196",
"370915", "276891" ,"335406" ,"358013", "404406", "356633", "284792", "372549" ,
"271082", "396135" ,"398664" ,"406397", "354609")
claves<-c("10" ,"9" , "10", "4" , "4" , "3" , "3" , "4" , "9" ,"10", "3",
"3" , "3" , "4" , "4" , "4" , "4", "4" ,"9" , "3")
question1<-c(1, 0, 0, 0 ,0, 0, 0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,0, 0, 0 ,1, 0)
question2<-c(0, 1, 1 ,0 ,0, 0 ,0 ,0 ,1, 0, 0,0,1 ,0 ,1, 1, 0 ,0, 0, 0)
question3<-c( 0 ,0, 1, 1, 1 ,1 ,0, 0, 0 ,0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
question4<-c(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
question5<-c(1, 0, 1, 0 ,0, 1, 0, 1, 1, 0, 1, 1, 0 ,0 ,0, 0, 1, 0, 0, 0)
note<-c(4 ,2, 6, 4, 2, 6, 0 ,4, 4 ,0, 4 ,4 ,6, 2, 4 ,4, 4, 2, 4, 2)
example<-data.frame("candidate"=aspirante,"sector"=sector,"p1"=question1,
"p2"=question2,"p3"=question3,"p4"=question4,"p5"=question5,"note"=note)
The note column is equal to the row sum of the question columns multiplied by 2.
I have been asked to do a cluster analysis but I have no idea how to proceed. I had planned to divide the final notes into 4 categories:
failed: notes below 3
considered: between 3 and 5
space availability: between 5 and 7
approved: from 7 to 10
However, in the original dataset the size of each group will vary, and I cannot create a new dataset that splits the notes by group. Do you have any suggestions, or an example where cluster analysis is applied to dichotomous data?
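One common starting point for dichotomous (0/1) data is hierarchical clustering on a binary (Jaccard-type) distance, which base R offers through dist(method = "binary"), hclust() and cutree(). The sketch below uses a small made-up answer matrix; the average linkage and k = 2 are assumptions chosen for illustration, not a recommendation for the real 100-question data.

```r
# Hypothetical toy data: 6 applicants x 5 questions (1 = correct, 0 = wrong)
answers <- matrix(
  c(1, 1, 0, 0, 0,
    1, 1, 1, 0, 0,
    1, 0, 0, 0, 0,
    0, 0, 0, 1, 1,
    0, 0, 1, 1, 1,
    0, 0, 0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, paste0("p", 1:5))
)

# Binary distance: among the questions that at least one of two applicants
# got right, the proportion that only one of them got right
d <- dist(answers, method = "binary")

# Agglomerative clustering (average linkage is an illustrative choice)
hc <- hclust(d, method = "average")

# Cut the tree into k groups (k = 2 is an assumption for this toy data)
groups <- cutree(hc, k = 2)
table(groups)
```

On the real data the same calls would be applied to the question columns only, and plot(hc) draws the dendrogram to help choose k. Applicants with no correct answers at all may need separate handling, since this distance is not well defined between two all-zero rows.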

Related

Create a new variable based on other columns values

I have a panel data frame structured something like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))
I want to generate a new dummy variable that takes the value 1 if the row contains a 1 in any of the three status columns, and 0 otherwise. It should end up like this:
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
"Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
"Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
"Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
"Final_status" = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0))
Can anyone help me achieve this?

We can use if_any on the columns whose names start with 'Status' to check each row for any value equal to 1. It returns TRUE if there is one and FALSE otherwise, and the logical result is coerced to 0/1 with unary + (or as.integer).
library(dplyr)
df %>%
mutate(Final_status = +(if_any(starts_with('Status'), ~ . ==1)))
-output
id Status_2014 Status_2015 Status_2016 Final_status
1 1 1 0 0 1
2 1 1 0 0 1
3 1 1 0 0 1
4 1 1 0 0 1
5 2 0 1 0 1
6 2 0 1 0 1
7 2 0 1 0 1
8 2 0 1 0 1
9 3 0 0 0 0
10 3 0 0 0 0
11 3 0 0 0 0
12 3 0 0 0 0
Or using rowSums from base R
df$Final_status <- +(rowSums(df[-1] > 0) > 0)

You write an if condition to define the variable as 1 or 0; the most straightforward way to do this is inside a dplyr pipe.
I don't have the dplyr syntax in my head (it has been too long since I used it), but dplyr is what you want.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
best greetings
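The if-condition idea described in this answer can be sketched without dplyr using base R's vectorised ifelse(). This is only one possible reading of the suggestion; the data frame is the one posted in the question.

```r
# Data frame from the question
df <- data.frame("id" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
                 "Status_2014" = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
                 "Status_2015" = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0),
                 "Status_2016" = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0))

# 1 if any of the three status columns equals 1 in that row, otherwise 0
df$Final_status <- ifelse(
  df$Status_2014 == 1 | df$Status_2015 == 1 | df$Status_2016 == 1,
  1, 0
)
df
```

Spelling out each column only scales to a handful of them; with many status columns, a column-selection approach is less repetitive.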

R: expand sequence of binary values from a time column

I have a table of time and binary values,
> head(x,10)
Time binary
1 358.214 1
2 359.240 1
3 360.039 0
4 361.163 0
5 361.164 1
6 362.113 1
7 362.114 0
8 365.038 0
9 365.039 0
10 367.488 0
I want to check, one second later, whether the value in the binary column is 1 or 0, and then create a new column of these values. The times here are not evenly spaced. For example, the first time is 358.214 with binary value 1; adding one second gives 359.214, and the value is still 1, carried over from the previous observation, because 359.214 is not in the dataset.
I want to add two new columns, one for the seconds increasing and one for the new binary values.
time2 new_binary
1 358.214 1
2 359.214 1
3 360.214 0
4 361.214 1
5 362.214 0
6 363.214 0
7 364.214 0
8 365.214 0
9 366.214 0
10 367.214 0
How can I do this in R?
The dataset,
Time <- c(358.214, 359.240, 360.039, 361.163, 361.164, 362.113, 362.114, 365.038, 365.039, 367.488, 367.489, 368.763, 368.764, 371.538, 371.539, 384.013, 384.014, 386.088, 386.089, 389.463, 389.464, 392.663, 392.664, 414.588, 414.589, 421.463, 421.464, 427.863, 427.864, 431.488, 431.489, 432.074, 432.075, 437.124, 437.125, 439.024, 439.025, 451.724, 451.725, 456.224, 456.225, 457.301, 457.302, 459.526, 459.527, 470.776, 470.777, 471.951, 471.952, 477.651, 477.652, 479.601, 479.602, 480.426, 480.427, 480.950, 480.951, 494.626, 494.627, 516.551, 516.552, 539.901, 539.902, 545.276, 545.277, 546.536, 546.537, 548.436, 548.437, 551.111, 551.112, 556.086, 556.087, 557.561, 557.562, 567.799, 567.800, 580.049, 580.050, 583.249, 583.250, 587.374, 587.375, 588.599, 588.600, 596.199, 596.200, 597.674, 597.675, 601.249, 601.250, 602.499, 602.500, 620.699, 620.700, 631.099, 631.100, 637.249, 637.250, 638.999, 639.000, 650.574, 650.575, 658.199, 658.200, 658.696, 658.697, 668.396, 668.397, 676.021, 676.022, 678.846, 678.847, 688.121, 688.122, 690.371, 690.372, 701.946, 701.947, 704.921, 704.922, 712.346, 712.347, 719.321, 719.322, 721.146, 721.147, 723.496, 723.497, 725.696, 725.697, 727.121, 727.122, 729.871, 729.872, 733.721, 733.722, 739.054, 758.078, 761.321, 761.322, 764.221, 764.222, 768.679, 768.680, 774.529, 774.530, 776.679, 776.680, 778.129, 778.130, 780.779, 780.780, 837.204, 837.205, 842.079, 842.080, 846.329, 846.330, 847.579)
binary <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0 ,0 ,1 ,1, 0, 0, 1, 1, 0, 0, 1, 1 ,0, 0 ,1 ,1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0 ,0 ,1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1 ,0 ,0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1)
Update with my attempts:
First I built a sequence of the new seconds (which is longer than the original Time):
time2 <- seq(x$Time[1], x$Time[length(x$Time)])
Then I used ifelse to compare Time with time2: if a value in time2 is not equal to any value in Time, use the previous binary value from Time; otherwise use the matching binary value. So I want a function that keeps comparing two columns of different lengths.
What I did is this:
View(vec_new <-data.frame(time2))
vec_new <- vec_new %>%
mutate(new_Binary = ifelse((x$Time != vec_new$time2)&(vec_new$time2 %l% x$Time),lag(x$binary), x$binary))
However, I got this warning because the columns have different lengths:
"longer object length is not a multiple of shorter object length"
Also, the results are not quite what I expected. I don't understand how ifelse steps through the values of the two columns. I did get a binary value for every element of time2, though.
Any idea how to achieve this in R?

If you use mutate from the dplyr package the solution is relatively easy:
library(dplyr)
df <- data.frame(Time, binary) %>%
mutate(Time=Time-Time[1]) %>%
mutate(binary=as.logical(binary))
Output
head(df)
# Time binary
# 1 0.000 TRUE
# 2 1.026 TRUE
# 3 1.825 FALSE
# 4 2.949 FALSE
# 5 2.950 TRUE
# 6 3.899 TRUE
If you want to create new columns you simply have to give them a new name.
df <- data.frame(Time, binary) %>%
mutate(time2=Time-Time[1]) %>%
mutate(new_binary=as.logical(binary))
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 0.000 TRUE
# 2 359.240 1 1.026 TRUE
# 3 360.039 0 1.825 FALSE
# 4 361.163 0 2.949 FALSE
# 5 361.164 1 2.950 TRUE
# 6 362.113 1 3.899 TRUE
And this solution gives you the time according to your desired output (I hope).
df <- data.frame(Time, binary) %>%
mutate(time2=Time[1]+row_number()-1) %>%
mutate(new_binary=as.logical(binary))
Output
head(df)
# Time binary time2 new_binary
# 1 358.214 1 358.214 TRUE
# 2 359.240 1 359.214 TRUE
# 3 360.039 0 360.214 FALSE
# 4 361.163 0 361.214 FALSE
# 5 361.164 1 362.214 TRUE
# 6 362.113 1 363.214 TRUE
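The answer above pairs each row with a new time rather than looking up the last observed value at that time. One way to carry the most recent observation forward onto a one-second grid is base R's findInterval(), sketched below on the first 10 rows of the question's data. It assumes Time is sorted in increasing order, as it is in the posted dataset.

```r
# First 10 rows of the data from the question
Time <- c(358.214, 359.240, 360.039, 361.163, 361.164,
          362.113, 362.114, 365.038, 365.039, 367.488)
binary <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0)

# Regular one-second grid starting at the first observed time
time2 <- seq(Time[1], max(Time), by = 1)

# For each grid point, the index of the last observation at or before it
idx <- findInterval(time2, Time)

# Look up the binary value that was in effect at each grid point
result <- data.frame(time2 = time2, new_binary = binary[idx])
result
```

For these 10 rows, new_binary comes out as 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, which matches the desired output shown in the question.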

Is there an R function similar to foreach loops in Stata for creating new variables based on the name (or root) of existing variables?

I have a list of 60 variables (30 pairs, essentially), and I need to combine the information across all the pairs to create new variables based on the data stored in each pair.
To give some context, I am working on a systematic review of prediction model studies, and I extracted data on which variables were considered for inclusion in the prediction model of each study (the first 30 variables) and which variables were included in the model (the second 30 variables)
All variables are binary.
The first 30 variables are written in the form “p_[varname]”
The second 30 are written in the form “p_[varname]_inc”.
I want to create a new variable that is called [varname] and takes the values “Not considered”, “Considered”, and “Included”.
In Stata, I could easily do this like so:
foreach v of [varname1]-[varname30] {
    gen `v' = "Not considered" if p_`v' == 0
    replace `v' = "Considered" if p_`v' == 1 & p_`v'_inc == 0
    replace `v' = "Included" if p_`v' == 1 & p_`v'_inc == 1
}
In R, the only way I can figure out to do it is by copy and pasting the same ifelse statement for all variables, for example:
predictor_vars %>%
mutate(age = ifelse(p_age==1 & p_age_inc==1, "Included",
ifelse(p_age==1 & p_age_inc==0, "Considered", "Not considered")),
sex = ifelse(p_sex==1 & p_sex_inc==1, "Included",
ifelse(p_sex==1 & p_sex_inc==0, "Considered", "Not considered")),
....
[varname] = ifelse(p_[varname]==1 & p_[varname]_inc==1, "Included",
ifelse(p_[varname]==1 & p_[varname]_inc==0, "Considered", "Not considered"))
)
Is there an easier way to do this in R / dplyr?
Edit: Sorry for not providing enough detail before (new here, but really appreciate the fast responses!). Here is a sample of the data
structure(list(p_age = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0), label = "Age", class = c("labelled",
"numeric")), p_age_inc = structure(c(1, 0, 0, 1, 1, 1, 1, 1,
1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0
), label = "Age", class = c("labelled", "numeric")), p_sex = structure(c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
1, 1, 1, 0, 1, 1, 0), label = "Sex", class = c("labelled", "numeric"
)), p_sex_inc = structure(c(1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), label = "Sex", class = c("labelled",
"numeric")), p_nation = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 0,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Nationality / country", class = c("labelled",
"numeric")), p_nation_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0), label = "Nationality / country", class = c("labelled", "numeric"
)), p_prevtb = structure(c(0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0), label = "Treatment regimen / treatment status (retreatment)", class = c("labelled",
"numeric")), p_prevtb_inc = structure(c(0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), label = "Previous TB / retreated TB", class = c("labelled",
"numeric"))), row.names = c(NA, 50L), class = "data.frame")
The first 6 rows (with 4 sets of selected predictors) look like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc
1 0
2 0
3 0
4 0
5 0
6 0
And I'd like to create the new variables like this:
p_age p_age_inc p_sex p_sex_inc p_nation p_nation_inc p_prevtb
1 1 1 1 1 0 0 0
2 1 0 1 0 0 0 0
3 1 0 1 1 0 0 0
4 1 1 1 1 0 0 0
5 1 1 1 0 1 0 1
6 1 1 1 0 1 0 1
p_prevtb_inc age sex nation prevtb
1 0 Included Included Not considered Not considered
2 0 Considered Considered Not considered Not considered
3 0 Considered Included Not considered Not considered
4 0 Included Included Not considered Not considered
5 0 Included Considered Considered Considered
6 0 Included Considered Considered Considered

This solution could be improved upon, but it works. The function does what the question asks: it creates the new variables in a standard for loop over the p_* variables and then returns the result.
The argument Bind controls what is returned; setting Bind = FALSE returns just the newly created variables.
library(dplyr)  # provides case_when() and %>%

create_var <- function(X, Bind = TRUE){
  xnames <- names(X)
  # names of the form p_[varname], i.e. without the _inc suffix
  p_only <- grep('p_([^_]+$)', xnames, value = TRUE)
  res <- vector('list', length = length(p_only))
  for(i in seq_along(p_only)){
    x <- X[[ p_only[i] ]]                  # considered flag
    y <- X[[paste0(p_only[i], '_inc')]]    # included flag
    res[[i]] <- case_when(
      as.logical(x) & as.logical(y) ~ "Included",
      as.logical(x) & !as.logical(y) ~ "Considered",
      !as.logical(x) ~ "Not considered",
      TRUE ~ "Not considered"
    )
  }
  # name the new columns after the root, dropping the p_ prefix
  names(res) <- sub('^p_', '', p_only)
  res <- do.call(cbind.data.frame, res)
  if(Bind) cbind(X, res) else res
}

# df1 is the data frame posted in the question
create_var(df1)
df1 %>% create_var()
df1 %>% create_var(Bind = FALSE)
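For readers coming from Stata, the same idea can also be written as a plain for loop over the variable roots, which mirrors the structure of foreach quite closely. This is a minimal base R sketch on a made-up three-row data frame; the names predictor_vars and roots are used here only for illustration.

```r
# Hypothetical miniature of the data: two predictor pairs, three studies
predictor_vars <- data.frame(
  p_age = c(1, 1, 0), p_age_inc = c(1, 0, 0),
  p_sex = c(1, 0, 1), p_sex_inc = c(0, 0, 1)
)

# Roots are the column names of the form p_[varname], minus the prefix
roots <- sub("^p_", "", grep("^p_[^_]+$", names(predictor_vars), value = TRUE))

for (v in roots) {
  considered <- predictor_vars[[paste0("p_", v)]] == 1
  included   <- predictor_vars[[paste0("p_", v, "_inc")]] == 1
  # Nested ifelse mirrors the gen/replace sequence in the Stata loop
  predictor_vars[[v]] <- ifelse(considered & included, "Included",
                         ifelse(considered, "Considered", "Not considered"))
}
predictor_vars
```

Using [[ ]] with a pasted name plays the role that the `v' macro substitution plays in Stata.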

How to find bounding boxes of objects in raster?

I have a binary raster consisting of objects (1) and background (0). How can I find the bounding boxes of the objects? Each object should have its own bounding box.
Input:
library("raster")
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
)
ras = raster(mat)
I expect this result:
result = raster(matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 0,
0, 1, 0, 0, 1, 0,
0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE
))

Here is an approach
Example data
library(raster)
mat = matrix(
c(0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 0,
0, 1, 1, 1, 1, 0,
0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0),
ncol = 6, nrow = 8, byrow = TRUE )
ras <- raster(mat)
Solution
f <- function(r) {
  # use the function argument r (not the global ras) throughout
  x <- reclassify(r, cbind(0, NA))
  y <- rasterToPolygons(x, dissolve=TRUE)
  z <- disaggregate(y)
  e <- sapply(1:length(z), function(i) extent(z[i,]))
  p <- spPolygons(e)
  r <- rasterize(p, r)
  d <- boundaries(r)
  reclassify(d, cbind(NA, 0))
}
r <- f(ras)
as.matrix(r)
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 0 0 0 0 0 0
#[2,] 0 1 1 1 1 0
#[3,] 0 1 1 1 1 0
#[4,] 0 0 0 0 0 0
#[5,] 0 1 1 1 1 0
#[6,] 0 1 0 0 1 0
#[7,] 0 1 1 1 1 0
#[8,] 0 0 0 0 0 0
It is of course possible that the bounding boxes of objects overlap, in which case there is no solution, I suppose.

Compute Multiple Means in Dataframe in R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I have a data frame with 2 columns: a quality score and an outcome. Outcomes are either 1 or 0. Quality scores are integers from 1 to 135.
For each quality score, I would like to compute the mean outcome. I can do it for one quality score at a time like this:
test <- subset(deletion_qs, qs == 10)
sum(test$outcomes)/length(test$outcomes)
[1] 0.4
But this is too slow. I was wondering if there is a way using one of the apply functions?
Here is the data:
quality_score <- c(2, 1 ,18 , 1 , 2 , 1 , 1 , 1 , 2 , 1, 1 , 1 , 1 , 1 ,10 , 10 ,10, 10 , 10 , 10 , 10 , 10, 10 , 10 , 1 ,29 ,1 , 29 ,63 , 1 ,25 , 1 , 1 ,52 ,28 , 1 , 1 ,10 , 3, 28 , 1 , 20, 1, 10, 1 , 10 , 3 , 1 , 3 , 10 ,10 , 56 , 1, 1, 2 , 3 , 2 , 1 , 1, 44 , 1 , 1, 10 , 33 , 67 ,67, 19 , 8 , 39, 10 , 2 , 1 , 42 , 22, 7 , 93 , 1 , 12 , 10 ,135 , 1 , 31 , 6 , 16, 15 , 1 , 35 , 1, 10 , 10)
outcome <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1)

You can use group_by and summarise from dplyr.
First combine the two vectors into a data frame called "tot.data". Then
library(dplyr)
group_by(tot.data, quality_score) %>% summarise(Mean1 = mean(outcome))
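Since the question asks about the apply family, the same grouped mean can be sketched in base R with tapply() or aggregate(). The example uses a small made-up subset rather than the full vectors; the column names follow the question.

```r
# Hypothetical small sample in the same shape as the question's data
tot.data <- data.frame(
  quality_score = c(1, 1, 2, 10, 10, 10, 10, 10),
  outcome       = c(0, 1, 0, 1, 0, 0, 1, 0)
)

# tapply: apply mean() to outcome within each quality_score group
means <- tapply(tot.data$outcome, tot.data$quality_score, mean)
means

# aggregate: same computation, returned as a data frame
agg <- aggregate(outcome ~ quality_score, data = tot.data, FUN = mean)
agg
```

tapply() returns a vector named by quality score, while aggregate() returns one row per score, which is easier to merge back into other tables.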
