Calculations for grouped data in R - r

I'm stuck on the following problem in R and was hoping someone had a quick solution.
I have two sets of data, A and B, where A contains data for a control group and B a case group. I have measures for the same variables for each group.
Within A and B are subgroups - and they are in some instances paired between A and B - let's say they are siblings where one or more can be a case and one or more a control.
The data look something like this:
SET A:
Source Area group pch pch2 col col2 group2
R1-1 1983447 1 0 16 1 1 1
R1-3 1400362 1 0 16 1 1 1
R3-4 2834393 2 1 16 2 2 1
R4-2 2232820 3 2 16 3 3 1
R4-5 1713796 3 2 16 3 3 1
R4-6 1525740 3 2 16 3 3 1
R4-7 1182300 3 2 16 3 3 1
SET B:
Source Area group pch pch2 col col2 group2
R1-2 1246124 1 0 16 1 1 2
R3-1 1627610 2 1 16 2 2 2
R3-2 1401600 2 1 16 2 2 2
R4-1 1367146 3 2 16 3 3 2
R4-3 1764125 3 2 16 3 3 2
R4-4 1299864 3 2 16 3 3 2
Source is ID, Area is the variable of interest, group is group, and the rest are additional variables that are not of interest here.
What I'd like to do is calculate relative Area for each of the individuals in set B, i.e., relative to the mean Area of their siblings in Set A. I'd like this value to appear as a separate column in set B (under relArea in the sample below). The output would therefore look like this:
Output (Set B):
Source Area group relArea pch pch2 col col2 group2
R1-2 1246124 1 0.736521476 0 16 1 1 2
R3-1 1627610 2 0.574235824 1 16 2 2 2
R3-2 1401600 2 0.494497411 1 16 2 2 2
R4-1 1367146 3 0.821768097 2 16 3 3 2
R4-3 1764125 3 1.06038539 2 16 3 3 2
R4-4 1299864 3 0.781326037 2 16 3 3 2
Finally, if an individual in set B does not have a sibling in set A, then his relArea value would be the Area relative to average Area of all the controls (i.e., all measurements in set A).
Any help with this would be much appreciated.
thanks,
Bjorn

You could compute the average area per group in Set A with aggregate and then add your new column:
seta = read.table(text="Source Area group pch pch2 col col2 group2
R1-1 1983447 1 0 16 1 1 1
R1-3 1400362 1 0 16 1 1 1
R3-4 2834393 2 1 16 2 2 1
R4-2 2232820 3 2 16 3 3 1
R4-5 1713796 3 2 16 3 3 1
R4-6 1525740 3 2 16 3 3 1
R4-7 1182300 3 2 16 3 3 1 ", header=T)
setb = read.table(text="Source Area group pch pch2 col col2 group2
R1-2 1246124 1 0 16 1 1 2
R3-1 1627610 2 1 16 2 2 2
R3-2 1401600 2 1 16 2 2 2
R4-1 1367146 3 2 16 3 3 2
R4-3 1764125 3 2 16 3 3 2
R4-4 1299864 3 2 16 3 3 2", header=T)
grouped.area = aggregate(seta$Area, by=list(group=seta$group), mean)
setb$relArea = setb$Area / grouped.area$x[match(setb$group, grouped.area$group)]
setb$relArea
# [1] 0.7365215 0.5742358 0.4944974 0.8217681 1.0603854 0.7813260
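The question also asks for a fallback: a case with no sibling in Set A should be scaled by the mean Area of all controls. The code above would give NA for such rows, because match() returns NA for a group that does not occur in Set A. A minimal sketch of one way to handle this, using small made-up tables (the values and group 9 are illustrative only, not from the question):

```r
# Made-up control (seta) and case (setb) tables; group 9 has no controls.
seta <- data.frame(Area = c(10, 20, 30), group = c(1, 1, 2))
setb <- data.frame(Area = c(15, 30, 40), group = c(1, 2, 9))

# Mean control Area per group, as in the aggregate() step above.
grouped.area <- aggregate(Area ~ group, data = seta, FUN = mean)

# Look up each case's control mean; unmatched groups give NA.
ref <- grouped.area$Area[match(setb$group, grouped.area$group)]

# Fall back to the mean of all controls where there is no sibling.
ref[is.na(ref)] <- mean(seta$Area)

setb$relArea <- setb$Area / ref
setb$relArea
# [1] 1 1 2
```

Here the group means are 15 and 30 and the overall control mean is 20, so the unmatched case (Area 40) gets 40 / 20 = 2.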

Related

Replacing values in a data.frame that have lost their order

In my toy data, within each unique study, the numeric variables (sample and group) must be numbered consecutively starting from 1. However:
For example, in study 1, there are two unique sample values (1 and 3), so 3 must be replaced with 2.
In study 2, there is only one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group are fine: their unique values are 1 and 2, so no replacement is needed.
My desired output for this toy data is shown below. I would appreciate a functional solution that can automatically renumber any number of numeric variables in a data.frame that have lost their order, as in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text=m, h=T)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
You can do:
library(dplyr)
data %>%
group_by(study) %>%
mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
function(x) as.numeric(as.factor(x)))) %>%
as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
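The as.numeric(as.factor(x)) step works because factor levels are the sorted unique values, so each value is replaced by its position among them. A base R sketch of the same idea, assuming a made-up four-row table and a hypothetical helper named renumber:

```r
# Hypothetical helper: replace each value by its position among the
# sorted unique values, e.g. c(1, 3, 3) becomes c(1, 2, 2).
renumber <- function(x) match(x, sort(unique(x)))

# Made-up toy data, not the data from the question.
toy <- data.frame(study   = c(1, 1, 2, 2),
                  sample  = c(1, 3, 5, 5),
                  group   = c(2, 2, 1, 4),
                  outcome = c("A", "B", "A", "B"))

# Numeric columns to renumber, excluding the grouping column itself.
num_cols <- setdiff(names(toy)[sapply(toy, is.numeric)], "study")

# ave() applies renumber() separately within each study.
toy[num_cols] <- lapply(toy[num_cols],
                        function(col) ave(col, toy$study, FUN = renumber))
toy
```

In this toy table, sample becomes 1, 2, 1, 1 and group becomes 1, 1, 1, 2.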
Here is an alternative dplyr solution (not as elegant as @Allan Cameron's, +1):
library(dplyr)
data %>%
group_by(study) %>%
mutate(x = n()/length(unique(sample)),
sample = rep(row_number(), each=x, length.out = n()),
y = length(unique(group)),
group = ifelse(y==1, 1, group)) %>%
select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B

r recode by a splitting rule

I have a student dataset containing student information, a question id (5 questions), and the sequence number of each visit to a question. I would like to create a variable that marks exactly where a student starts reviewing questions after having seen all of them.
Here is a sample dataset:
data <- data.frame(
person = c(1,1,1,1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
question = c(1,2,2,3,3,3,4,3,5,1,2, 1,1,1,2,3,4,4,4,5,5,4,3,4,4,5,4,5),
sequence = c(1,1,2,1,2,3,1,4,1,2,3, 1,2,3,1,1,1,2,3,1,2,4,2,5,6,3,7,4))
data
person question sequence
1 1 1 1
2 1 2 1
3 1 2 2
4 1 3 1
5 1 3 2
6 1 3 3
7 1 4 1
8 1 3 4
9 1 5 1
10 1 1 2
11 1 2 3
12 2 1 1
13 2 1 2
14 2 1 3
15 2 2 1
16 2 3 1
17 2 4 1
18 2 4 2
19 2 4 3
20 2 5 1
21 2 5 2
22 2 4 4
23 2 3 2
24 2 4 5
25 2 4 6
26 2 5 3
27 2 4 7
28 2 5 4
The sequence variable records each visit to a question by giving it a sequence number. In general, revisits can happen before the student has seen all questions. However, the attempt variable should only switch to review after the student has seen all 5 questions. With the new variable, the dataset I am aiming for looks like this:
> data
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Any ideas?
Thanks!
What a challenging question. Took almost 2 hours to find the solution.
Try this
library(dplyr)
dist_cum <- function(var)
sapply(seq_along(var), function(x) length(unique(head(var, x))))
data %>%
mutate(var0 = n_distinct(question)) %>%
group_by(person) %>%
mutate(var1 = dist_cum(question),
var2 = cumsum(c(1, diff(question) != 0))) %>%
ungroup() %>%
mutate(var3 = if_else(sequence == 1 | var1 < var0, 0, 1)) %>%
group_by(person, var2) %>%
mutate(var4 = min(var3)) %>%
ungroup() %>%
mutate(attempt = if_else(var4 == 0, "initial", "review")) %>%
select(-starts_with("var")) %>%
as.data.frame
Result
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 initial
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
dist_cum is a function that computes a rolling count of distinct values. var0 to var4 are helper columns that are dropped at the end.
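For reference, the same rolling distinct count can be obtained without sapply. This is a sketch of an alternative, not the answerer's code: duplicated() flags repeats of earlier values, so the cumulative sum of its negation counts how many distinct values have been seen so far. The name dist_cum2 is hypothetical.

```r
# Hypothetical vectorised variant of dist_cum().
dist_cum2 <- function(var) cumsum(!duplicated(var))

# Question ids for person 1 from the sample data.
q <- c(1, 2, 2, 3, 3, 3, 4, 3, 5, 1, 2)
dist_cum2(q)
# [1] 1 2 2 3 3 3 4 4 5 5 5
```

The count reaches 5 at the ninth visit, which is the point where person 1 has seen every question.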
One way to do it is to find where the reviewing starts, i.e. the entry that follows a visit to the fifth question and has sequence equal to 2 (see v1 and v2). Then, by subsetting the data for each person and looping over the subsets, you can fill in the remaining entries of the attempt variable, since it is now known where the reviewing starts. Note that in the output below, row 21 (person 2, question 5, sequence 2) is labelled review, whereas the desired output in the question keeps it as initial.
v1 <- c(FALSE, (data$question == 5)[-(nrow(data))])
v2 <- data$sequence == 2
data$attempt <- ifelse(v1 * v2 == 1, "review", NA)
persons <- unique(data$person)
persons.list <- vector(mode = "list", length = length(persons))
for(i in 1:length(persons)){
person.i <- subset(data, person == persons[i])
n <- which(person.i$attempt == "review")
m <- nrow(person.i)
person.i$attempt[(n+1):m] <- "review"
person.i$attempt[which(is.na(person.i$attempt))] <- "initial"
persons.list[[i]] <- person.i
}
do.call(rbind, persons.list)
person question sequence attempt
1 1 1 1 initial
2 1 2 1 initial
3 1 2 2 initial
4 1 3 1 initial
5 1 3 2 initial
6 1 3 3 initial
7 1 4 1 initial
8 1 3 4 initial
9 1 5 1 initial
10 1 1 2 review
11 1 2 3 review
12 2 1 1 initial
13 2 1 2 initial
14 2 1 3 initial
15 2 2 1 initial
16 2 3 1 initial
17 2 4 1 initial
18 2 4 2 initial
19 2 4 3 initial
20 2 5 1 initial
21 2 5 2 review
22 2 4 4 review
23 2 3 2 review
24 2 4 5 review
25 2 4 6 review
26 2 5 3 review
27 2 4 7 review
28 2 5 4 review
Alternatively, you can also use lapply:
do.call(rbind,
lapply(persons, function(x){
person.x <- subset(data, person == x)
n <- which(person.x$attempt == "review")
m <- nrow(person.x)
person.x$attempt[(n+1):m] <- "review"
person.x$attempt[which(is.na(person.x$attempt))] <- "initial"
person.x
}))

Split columns to few variables and move corresponding value to new column

I have a data frame like this (with many more rows):
id act_l_n pas_l_n act_q_p pas_q_p act_l_p pas_l_p act_q_n pas_q_n
1 14 8 14 10 21 11 21 11
2 19 9 11 17 22 11 20 11
Every column name contains information about 3 variables separated by '_' (each has 2 levels named act/pas, l/q, n/p). Values are scores corresponding to each combination of variables (i.e. 1 of 8 conditions).
I need to move 3 variables to 3 separate columns, mark their levels by digits, and move corresponding value to separate column called "score". So from 1st row of current data frame I'd get something like this:
id score actpas lq pn
1 14 1 1 1
1 8 2 1 1
1 14 1 2 2
1 10 2 2 2
1 21 1 1 2
1 11 2 1 2
1 21 1 2 1
1 11 2 2 1
I've tried wrangling this with dplyr using the gather and separate functions, but I can't really get what I need. Help with dplyr would be most appreciated!
If I understand correctly:
df<-read.table(textConnection(
"id,act_l_n,pas_l_n,act_q_p,pas_q_p,act_l_p,pas_l_p,act_q_n,pas_q_n
1,14,8,14,10,21,11,21,11
2,19,9,11,17,22,11,20,11"),
header=TRUE,sep=",")
library(tidyr)
library(dplyr)
gather(df,k,score,-id) %>% mutate(v1=1+as.integer(substr(k,1,3)=="pas")
,v2=1+as.integer(substr(k,5,5)=="q")
,v3=1+as.integer(substr(k,7,7)=="p")) %>%
select(-2) %>% arrange(id)
# id score v1 v2 v3
#1 1 14 1 1 1
#2 1 8 2 1 1
#3 1 14 1 2 2
#4 1 10 2 2 2
#5 1 21 1 1 2
#6 1 11 2 1 2
#7 1 21 1 2 1
#8 1 11 2 2 1
#9 2 19 1 1 1
#10 2 9 2 1 1
#11 2 11 1 2 2
#12 2 17 2 2 2
#13 2 22 1 1 2
#14 2 11 2 1 2
#15 2 20 1 2 1
#16 2 11 2 2 1
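In current tidyr, gather() is superseded by pivot_longer(), which can split the column names in the same step. A sketch under that assumption (tidyr 1.0 or later), using only the first four score columns of the sample data and the column names from the desired output:

```r
library(tidyr)
library(dplyr)

# First four score columns of the sample data.
df <- data.frame(id = 1:2,
                 act_l_n = c(14, 19), pas_l_n = c(8, 9),
                 act_q_p = c(14, 11), pas_q_p = c(10, 17))

long <- df %>%
  # Split each column name at "_" into three new columns.
  pivot_longer(-id,
               names_to  = c("actpas", "lq", "pn"),
               names_sep = "_",
               values_to = "score") %>%
  # Recode each level as 1 or 2 by its position in the level vector.
  mutate(actpas = match(actpas, c("act", "pas")),
         lq     = match(lq, c("l", "q")),
         pn     = match(pn, c("n", "p")))
long
```

Using match() against an explicit vector of levels keeps the digit coding visible in one place, instead of relying on character positions in the column name.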

Relabel samples in kmean results considering the order of centers

I am using kmeans to cluster my data, and I want to post-process the result.
I want to relabel the samples according to the order of the sorted centres. Consider the following example:
a = c("a","b","c","d","e","F","i","j","k","l","m","n")
b = c(1,2,3,20,21,21,40,41,42,4,23,50)
mydata = data.frame(id=a,amount=b)
result = kmeans(mydata$amount,3,nstart=10)
Here is the result:
result$cluster
2 2 2 3 3 3 1 1 1 2 3 1
result$centers
1 43.25
2 2.50
3 21.25
mydata = data.frame(mydata,label =result$cluster)
mydata
id amount label
1 a 1 2
2 b 2 2
3 c 3 2
4 d 20 3
5 e 21 3
6 F 21 3
7 i 40 1
8 j 41 1
9 k 42 1
10 l 4 2
11 m 23 3
12 n 50 1
What I am looking for is sorting the centres and producing the labels accordingly:
1 2.50
2 21.25
3 43.25
and label the samples going to:
1 1 1 2 2 2 3 3 3 1 2 3
and the result should be :
id amount label
1 a 1 1
2 b 2 1
3 c 3 1
4 d 20 2
5 e 21 2
6 F 21 2
7 i 40 3
8 j 41 3
9 k 42 3
10 l 4 1
11 m 23 2
12 n 50 3
I think it is possible by sorting the centres and, for each sample, taking the index of the nearest sorted centre as its label.
Is there another way for R to do this automatically?
One idea is to create a named vector by matching your centers with the sorted centers. Then match the vector with mydata$label and replace with the names of the vector, i.e.
i1 <- setNames(match(sort(result$centers), result$centers), rownames(result$centers))
as.numeric(names(i1)[match(mydata$label, i1)])
# [1] 1 1 1 2 2 2 3 3 3 1 2 3
You can also use a for loop, if you don't mind loops:
cls <- result$cluster
for (i in seq_len(nrow(result$centers)))
result$cluster[cls == order(result$centers)[i]] <- i
result$cluster
#[1] 1 1 1 2 2 2 3 3 3 1 2 3

Is there a way to change the index on kmeans()$cluster?

I am using kmeans() to create groups based on a score. The goal is to assign star ratings, so that the individuals with the highest scores get four stars, and the individuals with the lowest scores get 1 star. I would like to create the star variable based on the kmeans()$cluster value. However, as it stands, kmeans()$cluster indexes the clusters, but the index does not correspond to the relative position of the group.
Is there a way to manually assign the cluster indexes, or to set the index to be assigned in a certain order? I'm hoping to have kmeans()$cluster=1 for the low score group, kmeans()$cluster=2 for second lowest, etc.
id <- 1:500
set.seed(12); score <- runif(500, 0, 1)
dat <- data.frame(id, score)
km = kmeans(dat$score, 4, nstart=10)
plot(dat$score,
col = c(km$cluster),
main="K-Means result with 4 clusters",
pch=20,
cex=0.8)
dat$star <- km$cluster
plot(dat$score,
dat$star,
main="Score v. cluster number")
Any of these will yield a new cluster assignment vector such that 1 refers to the cluster with the smallest center, 2 the next and so on. The first is expressed solely in terms of fitted(km), the second in terms of km$centers and km$cluster, and the last in terms of fitted(km) and km$centers.
fit <- fitted(km)
factor(fit, labels = 1:nlevels(factor(fit)))
rank(km$centers)[km$cluster]
match(fitted(km), sort(km$centers))
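As a quick check of the rank() variant on the data from the question, the sketch below assigns stars and confirms that the mean score rises with the star number. The name star follows the question; the check itself is an addition, not part of the answer.

```r
set.seed(12)
score <- runif(500, 0, 1)
km <- kmeans(score, 4, nstart = 10)

# rank() gives each center its position in ascending order;
# indexing by km$cluster maps every point to the rank of its cluster.
star <- rank(km$centers)[km$cluster]

# The match() variant should give the same labels.
star2 <- match(fitted(km), sort(km$centers))
all(star == star2)

# Mean score per star; these should come out in increasing order.
tapply(score, star, mean)
```

This relies on the data being one-dimensional, so that the centers have a natural order.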
Yes. You can just use a small table of what you want the values to be and use the original cluster number to look them up. Here is an example.
set.seed(2017)
KM3 = kmeans(iris[,1:4], 3)
KM3$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[77] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3
[115] 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
The clusters are in an awkward order. I want the low numbered points to be in cluster 1, the middle in cluster 2 and the high numbered points in cluster 3. So I want to change all of the 1's to 3, the 2's to 1 and the 3's to 2.
Relabel = c(3,1,2)
KM3$cluster = Relabel[KM3$cluster]
KM3$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2
[115] 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Just a little extra detail. It says Relabel = c(3,1,2) because I want 1 to become 3, so the first location has a 3. I want 2 to become 1, so the second location has a 1. And I want 3 to become 2, so the third location has a 2.
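The lookup can be seen on a short made-up vector of old labels (the vector old is illustrative, not taken from the iris result):

```r
# Position i of Relabel holds the new label for old label i.
Relabel <- c(3, 1, 2)

# Made-up old cluster labels.
old <- c(2, 2, 3, 1, 3)

# Indexing Relabel by the old labels looks up each replacement.
new <- Relabel[old]
new
# [1] 1 1 2 3 2
```

Because the old labels are used as positions, this works for any number of clusters as long as Relabel has one entry per cluster.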
