I am using kmeans() to create groups based on a score. The goal is to assign star ratings, so that the individuals with the highest scores get four stars, and the individuals with the lowest scores get 1 star. I would like to create the star variable based on the kmeans()$cluster value. However, as it stands, kmeans()$cluster indexes the clusters, but the index does not correspond to the relative position of the group.
Is there a way to manually assign the cluster indexes, or to set the index to be assigned in a certain order? I'm hoping to have kmeans()$cluster=1 for the low score group, kmeans()$cluster=2 for second lowest, etc.
id <- 1:500
set.seed(12); score <- runif(500, 0, 1)
dat <- data.frame(id, score)
km = kmeans(dat$score, 4, nstart=10)
plot(dat$score,
col = c(km$cluster),
main="K-Means result with 4 clusters",
pch=20,
cex=0.8)
dat$star <- km$cluster
plot(dat$score,
dat$star,
main="Score v. cluster number")
Any of these will yield a new cluster assignment vector such that 1 refers to the cluster with the smallest center, 2 the next, and so on. The first is expressed solely in terms of fitted(km), the second in terms of km$centers and km$cluster, and the last in terms of fitted(km) and km$centers.
fit <- fitted(km)
factor(fit, labels = 1:nlevels(factor(fit)))
rank(km$centers)[km$cluster]
match(fitted(km), sort(km$centers))
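As a sanity check on the relabelling, here is a minimal sketch using the data from the question. After reordering, the mean score per label should increase with the label. The variable names `star` and `means` are only illustrative.

```r
set.seed(12)
score <- runif(500, 0, 1)
km <- kmeans(score, 4, nstart = 10)

# rank() of the centers gives each original cluster its position in sorted order;
# indexing by km$cluster maps every observation to that position
star <- rank(km$centers)[km$cluster]

# the mean score per new label should be strictly increasing
means <- tapply(score, star, mean)
print(means)
stopifnot(all(diff(as.vector(means)) > 0))
```

If the check passes, `star` can be stored directly as the star rating column.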
Yes. You can just use a small table of what you want the values to be and use the original cluster number to look them up. Here is an example.
set.seed(2017)
KM3 = kmeans(iris[,1:4], 3)
KM3$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[77] 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1 1 1 3
[115] 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1 1 3
The clusters are in an awkward order. I want the low numbered points to be in cluster 1, the middle in cluster 2 and the high numbered points in cluster 3. So I want to change all of the 1's to 3, the 2's to 1 and the 3's to 2.
Relabel = c(3,1,2)
KM3$cluster = Relabel[KM3$cluster]
KM3$cluster
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3 3 2
[115] 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3 3 2
Just a little extra detail. It says Relabel = c(3,1,2) because I want 1 to become 3, so the first location has a 3. I want 2 to become 1, so the second location has a 1. And I want 3 to become 2, so the third location has a 2.
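The lookup idea can be seen on a tiny vector. This is only a toy sketch; `old` is a made-up set of cluster labels, not output from kmeans().

```r
old <- c(1, 2, 3, 2, 1)   # hypothetical original cluster labels
Relabel <- c(3, 1, 2)     # position i holds the new label for old label i

# indexing the lookup vector by the old labels returns the new labels
new <- Relabel[old]
print(new)                # 3 1 2 1 3
```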
I would like to divide certain cells in one dataframe by certain cells in another dataframe.
Dataframe1 -- The column names in the dataset are: Compound_Number, Compound_Concentration, Time, Technical_Replicate, and Colony_Count. In total, there are 12 compounds, 3 concentrations, 3 time points, and 6 technical replicates.
Dataframe2 -- The column names are Compound_Number, Technical_Replicate, Time, and Replicate_Mean. In total, there are 12 compounds, 3 time points, and 1 replicate mean.
I need to divide the Colony_Count in Dataframe1 by the Replicate_Mean in Dataframe2 -- but I need to make sure that the division occurs in a paired-fashion (e.g., the compound number must be the same, as well as the time and technical replicate).
I know that I can do all of this by hand...
#Dataframe1 (corpus)
C1_T1_TR1 <- corpus[ which(corpus$Compound_Number==1 & corpus$Technical_Replicate==1 & corpus$Time==1),]
#Dataframe2 (normalizing_means)
NC1_T1_TR1 <- normalizing_means[ which(normalizing_means$Compound_Number==1 & normalizing_means$Time==1 & normalizing_means$Technical_Replicate==1),]$Replicate_Mean
Then I can do...
C1_T1_TR1$Colony_Count/NC1_T1_TR1
However, that means writing out those lines multiple times in order to catch all the compounds, replicates, and times -- and then merging the results from all those selections and operations back into a single dataframe. This is cumbersome and I am certain there is a better R way of doing it. I did see this: https://stackoverflow.com/questions/33150534/divide-multiple-columns-of-one-data-frame-by-row-names-value-of-another-datafram, but it is not quite what I need -- any assistance is greatly appreciated.
Here is a bit of data.
Dataframe1
Compound_Number Compound_Concentration Time Technical_Replicate Colony_Count
1 0.1 mM 5 4 46000000 #This is the example line
1 0.05 mM 5 4 109000000
1 0.02 mM 5 4 220000000
1 0.1 mM 25 4 30000
1 0.05 mM 25 4 16000000
1 0.02 mM 25 4 340000000
1 0.1 mM 1 1 5000000
1 0.05 mM 1 1 220000000
1 0.02 mM 1 1 210000000
1 0.1 mM 5 1 9000000
1 0.05 mM 5 1 70000000
1 0.02 mM 5 1 57000000
1 0.1 mM 5 2 560000
1 0.05 mM 5 2 34000000
1 0.02 mM 5 2 300000000
1 0.1 mM 25 2 10000
2 0.05 mM 1 3 120000000
2 0.02 mM 1 3 210000000
2 0.1 mM 5 3 280000000
2 0.05 mM 5 3 240000000
2 0.02 mM 5 3 80000000
2 0.1 mM 25 3 110000000
2 0.05 mM 25 3 250000000
2 0.02 mM 25 3 350000000
2 0.1 mM 1 4 290000000
2 0.05 mM 1 4 340000000
2 0.05 mM 1 1 300000000
2 0.02 mM 1 1 110000000
2 0.1 mM 5 1 510000000
2 0.05 mM 5 1 420000000
Dataframe2
Compound_Number Technical_Replicate Time Replicate_Mean
1 1 1 288000000
1 1 5 232000000
1 1 25 230000000
1 2 1 351666666.666667
1 2 5 320000000
1 2 25 291666666.666667
1 3 1 570000000
1 3 5 493333333.333333
1 3 25 701666666.666667
1 4 1 425000000
1 4 5 630000000 #This is the example line
1 4 25 380000000
1 5 1 473333333.333333
1 5 5 463333333.333333
1 5 25 433333333.333333
1 6 1 478333333.333333
1 6 5 453333333.333333
1 6 25 520000000
2 1 1 391666666.666667
2 1 5 356666666.666667
2 1 25 373333333.333333
2 2 1 445000000
2 2 5 423333333.333333
2 2 25 353333333.333333
2 3 1 248333333.333333
2 3 5 281666666.666667
2 3 25 151666666.666667
2 4 1 325000000
2 4 5 360000000
2 4 25 420000000
2 5 1 156666666.666667
2 5 5 298333333.333333
2 5 25 338333333.333333
2 6 1 313333333.333333
2 6 5 318333333.333333
2 6 25 276666666.666667
For clarity this is an example row from Dataframe1:
Compound_Number Compound_Concentration Time Technical_Replicate Colony_Count
1 0.1 mM 5 4 46000000
I need to find the corresponding row in Dataframe2.
Compound_Number Technical_Replicate Time Replicate_Mean
1 4 5 630000000
And I would like to divide 46000000 by 630000000.
Thank you in advance.
I cannot take credit for this (although I hope I would have gotten there eventually).
NEW_DF <- merge(corpus, normalizing_means)
Then I can easily divide one column by another.
Thank you to @sirallen.
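For completeness, a minimal sketch of the merge-then-divide step on three rows taken from the sample data above. The `by` argument is spelled out here for clarity (merge() would pick the shared column names by default), and the column name `Normalized_Count` is my own choice, not something from the question.

```r
corpus <- data.frame(
  Compound_Number        = c(1, 1, 2),
  Compound_Concentration = c("0.1 mM", "0.05 mM", "0.1 mM"),
  Time                   = c(5, 5, 1),
  Technical_Replicate    = c(4, 4, 4),
  Colony_Count           = c(46000000, 109000000, 290000000)
)
normalizing_means <- data.frame(
  Compound_Number     = c(1, 2),
  Technical_Replicate = c(4, 4),
  Time                = c(5, 1),
  Replicate_Mean      = c(630000000, 325000000)
)

# pair each count with the mean that has the same compound, replicate and time
NEW_DF <- merge(corpus, normalizing_means,
                by = c("Compound_Number", "Technical_Replicate", "Time"))

# row-wise division now happens between matched values
NEW_DF$Normalized_Count <- NEW_DF$Colony_Count / NEW_DF$Replicate_Mean
print(NEW_DF)
```

The example row from the question (46000000 and 630000000) ends up on the same line, so the division is 46000000 / 630000000 as requested.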
I have done an experiment in which participants have solved a task in pairs, with another participant. Each participant has then received a score for how well they did the task. Pairs have gone through different amounts of trials.
I have a data frame similar to the one below:
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)
participant pair trial score
1 1 1 2
1 1 2 3
2 1 1 6
2 1 2 3
3 2 1 4
3 2 2 7
3 2 3 3
4 2 1 1
4 2 2 8
4 2 3 5
5 3 1 4
6 3 1 3
I would like to add a new vector to the data frame, where each participant gets the numeric difference between their own score and the other participant's score within each trial.
Does someone have an idea about how one might do that?
It should end up looking something like this:
participant pair trial score difference
1 1 1 2 4
1 1 2 3 0
2 1 1 6 4
2 1 2 3 0
3 2 1 4 3
3 2 2 7 1
3 2 3 3 2
4 2 1 1 3
4 2 2 8 1
4 2 3 5 2
5 3 1 4 1
6 3 1 3 1
Here's a solution that involves first reordering data such that each sequential pair of rows corresponds to a single pair within a single trial. This allows us to make a single call to diff() to extract the differences:
data <- data[order(data$trial,data$pair,data$participant),];
data$diff <- rep(diff(data$score)[c(T,F)],each=2L)*c(-1L,1L);
data;
## participant pair trial score diff
## 1 1 1 1 2 -4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 -3
## 11 5 3 1 4 1
## 12 6 3 1 3 -1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 -1
## 9 4 2 2 8 1
## 7 3 2 3 3 -2
## 10 4 2 3 5 2
I assumed you wanted the sign to capture the direction of the difference. So, for instance, if a participant has a score 4 points below the other participant in the same trial-pair, then I assumed you would want -4. If you want all-positive values, you can remove the multiplication by c(-1L,1L) and add a call to abs():
data$diff <- rep(abs(diff(data$score)[c(T,F)]),each=2L);
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 3
## 11 5 3 1 4 1
## 12 6 3 1 3 1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 1
## 9 4 2 2 8 1
## 7 3 2 3 3 2
## 10 4 2 3 5 2
Here's a solution built around ave() that doesn't require reordering the whole data.frame first:
data$diff <- ave(data$score,data$trial,data$pair,FUN=function(x) abs(diff(x)));
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 2 1 1 2 3 0
## 3 2 1 1 6 4
## 4 2 1 2 3 0
## 5 3 2 1 4 3
## 6 3 2 2 7 1
## 7 3 2 3 3 2
## 8 4 2 1 1 3
## 9 4 2 2 8 1
## 10 4 2 3 5 2
## 11 5 3 1 4 1
## 12 6 3 1 3 1
Here's how you can get the score of the other participant in the same trial-pair:
data$other <- ave(data$score,data$trial,data$pair,FUN=rev);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 2 1 1 2 3 3
## 3 2 1 1 6 2
## 4 2 1 2 3 3
## 5 3 2 1 4 1
## 6 3 2 2 7 8
## 7 3 2 3 3 5
## 8 4 2 1 1 4
## 9 4 2 2 8 7
## 10 4 2 3 5 3
## 11 5 3 1 4 3
## 12 6 3 1 3 4
Or, assuming the data.frame has been reordered as per the initial solution:
data$other <- c(rbind(data$score[c(F,T)],data$score[c(T,F)]));
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Alternatively, using matrix() instead of rbind():
data$other <- c(matrix(data$score,2L)[2:1,]);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
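Combining the pieces above, the signed difference can also be obtained without reordering the data.frame, by subtracting the partner's score found with ave(..., FUN=rev). This is a sketch on the question's data; the column name `signed_diff` is only illustrative.

```r
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair        <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial       <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score       <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)

# within each trial-pair group of two rows, rev() swaps the two scores,
# so each row receives the score of the other participant
other <- ave(data$score, data$trial, data$pair, FUN = rev)

# own score minus partner's score: negative means the participant scored lower
data$signed_diff <- data$score - other
print(data)
```

Taking abs(data$signed_diff) gives the all-positive version shown earlier.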
Here is an option using data.table:
library(data.table)
setDT(data)[,difference := abs(diff(score)), by = .(pair, trial)]
data
# participant pair trial score difference
# 1: 1 1 1 2 4
# 2: 1 1 2 3 0
# 3: 2 1 1 6 4
# 4: 2 1 2 3 0
# 5: 3 2 1 4 3
# 6: 3 2 2 7 1
# 7: 3 2 3 3 2
# 8: 4 2 1 1 3
# 9: 4 2 2 8 1
#10: 4 2 3 5 2
#11: 5 3 1 4 1
#12: 6 3 1 3 1
A slightly faster option would be:
setDT(data)[, difference := abs((score - shift(score))[2]) , by = .(pair, trial)]
If we need the score of the other participant in the pair:
data[, other:= rev(score) , by = .(pair, trial)]
data
# participant pair trial score difference other
# 1: 1 1 1 2 4 6
# 2: 1 1 2 3 0 3
# 3: 2 1 1 6 4 2
# 4: 2 1 2 3 0 3
# 5: 3 2 1 4 3 1
# 6: 3 2 2 7 1 8
# 7: 3 2 3 3 2 5
# 8: 4 2 1 1 3 4
# 9: 4 2 2 8 1 7
#10: 4 2 3 5 2 3
#11: 5 3 1 4 1 3
#12: 6 3 1 3 1 4
Or using dplyr:
library(dplyr)
data %>%
group_by(pair, trial) %>%
mutate(difference = abs(diff(score)))
# participant pair trial score difference
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 2 4
#2 1 1 2 3 0
#3 2 1 1 6 4
#4 2 1 2 3 0
#5 3 2 1 4 3
#6 3 2 2 7 1
#7 3 2 3 3 2
#8 4 2 1 1 3
#9 4 2 2 8 1
#10 4 2 3 5 2
#11 5 3 1 4 1
#12 6 3 1 3 1
I am running a study with two experiments: experiment_1 and experiment_2. Each experiment has 5 different treatments (i.e. 1, 2, 3, 4, 5). We are trying to randomly assign the treatments within groups.
We would like to do this by iteratively sampling without replacement within each group. We want to do this to ensure that we get as balanced a sample as possible across treatments (e.g. we don't want to end up with 4 subjects in group 1 assigned to treatment 2 and no one assigned to treatment 1). So if a group has 23 subjects, we want to split the respondents into 4 subgroups of 5 and 1 subgroup of 3. We then want to randomly sample without replacement across the first subgroup of 5, so everyone gets assigned one of the treatments, do the same for the second, third and fourth subgroups of 5, and randomly sample without replacement for the final subgroup of 3. This guarantees that, within this group, every treatment is assigned to at least 4 subjects and 3 treatments are assigned to 5 subjects. We would like to do this for all the groups in the experiment and for both experiments. The resulting output would look something like this...
group experiment_1 experiment_2
[1,] 1 5 3
[2,] 1 3 2
[3,] 1 4 4
[4,] 1 1 5
[5,] 1 2 1
[6,] 1 2 3
[7,] 1 4 1
[8,] 1 3 2
[9,] 2 5 5
[10,] 2 1 4
[11,] 2 3 4
[12,] 2 1 5
[13,] 2 2 1
. . . .
. . . .
. . . .
I know how to use the sample function, but am unsure how to sample without replacement within each group so that the output corresponds to the procedure described above. Any help would be appreciated.
I think we just need to shuffle sample IDs, see this example:
set.seed(124)
#prepare groups and samples(shuffled)
df <- data.frame(group=sort(rep(1:3,9)),
sampleID=sample(1:27,27))
#treatments repeated nrow of df
df$ex1 <- rep(c(1,2,3,4,5),ceiling(nrow(df)/5))[1:nrow(df)]
df$ex2 <- rep(c(2,3,4,5,1),ceiling(nrow(df)/5))[1:nrow(df)]
df <- df[ order(df$group,df$sampleID),]
#check treatment distribution
with(df,table(group,ex1))
# ex1
# group 1 2 3 4 5
# 1 2 2 2 2 1
# 2 2 2 2 1 2
# 3 2 2 1 2 2
with(df,table(group,ex2))
# ex2
# group 1 2 3 4 5
# 1 1 2 2 2 2
# 2 2 2 2 2 1
# 3 2 2 2 1 2
How about this function:
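# shuffle by index with sample.int(), so a length-1 vector (n = 1) is not expanded to 1:x by sample()
f <- function(n,m) {x <- c( rep(1:m,n%/%m), sample(1:m,n%%m) ); x[sample.int(n)]}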
Here "n" is the group size and "m" the number of treatments.
Each treatment must be contained at least "n %/% m" times in the group. The treatment numbers of the remaining "n %% m" group members are assigned arbitrarily without repetition.
The vector "c( rep(1:m,n%/%m), sample(1:m,n%%m) )" contains these treatment numbers. Finally, this vector is randomly permuted.
> f(8,5)
[1] 5 3 1 5 4 2 2 1
> f(8,5)
[1] 4 5 3 4 2 2 1 1
> f(8,5)
[1] 4 2 1 5 3 5 2 3
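The balance guarantee can be checked by counting treatments for the 23-subject case from the question. This is a sketch: f is restated so the block is self-contained, with the final shuffle done by index via sample.int(), which is my variation on the same idea; the seed is arbitrary.

```r
# each treatment n %/% m times, plus n %% m distinct treatments for the remainder,
# then a random permutation of the whole vector
f <- function(n, m) {
  x <- c(rep(1:m, n %/% m), sample(1:m, n %% m))
  x[sample.int(n)]
}

set.seed(1)
counts <- tabulate(f(23, 5), nbins = 5)
print(counts)

# 23 subjects and 5 treatments: every treatment 4 or 5 times, exactly 3 of them 5 times
stopifnot(sum(counts) == 23,
          all(counts %in% c(4, 5)),
          sum(counts == 5) == 3)
```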
Here is a function that creates a dataframe, using the above function:
Plan <- function( groupSizes, numExp=2, numTreatment=5 )
{
numGroups <- length(groupSizes)
df <- data.frame( group = rep(1:numGroups,groupSizes) )
for ( e in 1:numExp )
{
df <- cbind(df,unlist(lapply(groupSizes,function(n){f(n,numTreatment)})))
colnames(df)[e+1] <- sprintf("Exp_%i", e)
}
return(df)
}
Example:
> P <- Plan(c(8,23,13,19))
> P
group Exp_1 Exp_2
1 1 4 1
2 1 1 4
3 1 2 2
4 1 2 1
5 1 3 5
6 1 5 5
7 1 1 2
8 1 3 3
9 2 5 1
10 2 2 1
11 2 5 2
12 2 1 2
13 2 2 1
14 2 1 4
15 2 3 5
16 2 5 3
17 2 2 4
18 2 5 4
19 2 2 5
20 2 1 1
21 2 4 2
22 2 3 3
23 2 4 3
24 2 2 5
25 2 3 3
26 2 5 2
27 2 1 5
28 2 3 4
29 2 4 4
30 2 4 2
31 2 4 3
32 3 2 5
33 3 5 3
34 3 5 1
35 3 5 1
36 3 2 5
37 3 4 4
38 3 1 4
39 3 3 2
40 3 3 2
41 3 3 3
42 3 1 1
43 3 4 2
44 3 4 4
45 4 5 1
46 4 3 1
47 4 1 2
48 4 1 5
49 4 3 3
50 4 3 1
51 4 4 5
52 4 2 4
53 4 5 3
54 4 2 1
55 4 4 2
56 4 2 5
57 4 4 4
58 4 5 3
59 4 5 4
60 4 1 2
61 4 2 5
62 4 3 2
63 4 4 4
Check the distribution:
> with(P,table(group,Exp_1))
Exp_1
group 1 2 3 4 5
1 2 2 2 1 1
2 4 5 4 5 5
3 2 2 3 3 3
4 3 4 4 4 4
> with(P,table(group,Exp_2))
Exp_2
group 1 2 3 4 5
1 2 2 1 1 2
2 4 5 5 5 4
3 3 3 2 3 2
4 4 4 3 4 4
>
The design of efficient experiments is a science on its own and there are a few R-packages dealing with this issue:
https://cran.r-project.org/web/views/ExperimentalDesign.html
I am afraid your approach is not optimal regarding the resources, no matter how you create the samples...
However this might help:
n <- 23
group <- rep(seq_len(ceiling(n/5)), each = 5)[1:n]  # consecutive subgroups of (up to) 5 subjects, for any n
exp1 <- rep(NA, length(group))
for(i in 1:max(group)) {
exp1[which(group == i)] <- sample(1:5)[1:sum(group == i)]
}
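The loop above can be wrapped in a small function so it can be reused for each group and each experiment. This is only a sketch: the function name `assign_blocks` and its parameter `m` (number of treatments) are hypothetical, and the seed is arbitrary.

```r
assign_blocks <- function(n, m = 5) {
  # consecutive blocks of m subjects; the last block may be shorter
  block <- rep(seq_len(ceiling(n / m)), each = m)[seq_len(n)]
  treatment <- integer(n)
  for (b in unique(block)) {
    idx <- which(block == b)
    # a fresh permutation of 1..m per block, truncated to the block size
    treatment[idx] <- sample.int(m)[seq_along(idx)]
  }
  treatment
}

set.seed(42)
exp1 <- assign_blocks(23)
counts <- tabulate(exp1, nbins = 5)
print(counts)
```

Because every full block contains each treatment exactly once, a group of 23 gets each treatment 4 or 5 times. Calling the function once per group and once per experiment would give the two columns asked for in the question.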
Not exactly sure if this meets all your constraints, but you could use the randomizr package:
library(randomizr)
experiment_1 <- complete_ra(N = 23, num_arms = 5)
experiment_2 <- block_ra(experiment_1, num_arms = 5)
table(experiment_1)
table(experiment_2)
table(experiment_1, experiment_2)
Produces output like this:
> table(experiment_1)
experiment_1
T1 T2 T3 T4 T5
4 5 5 4 5
> table(experiment_2)
experiment_2
T1 T2 T3 T4 T5
6 3 6 4 4
> table(experiment_1, experiment_2)
experiment_2
experiment_1 T1 T2 T3 T4 T5
T1 2 0 1 1 0
T2 1 1 1 1 1
T3 1 1 1 1 1
T4 1 0 2 0 1
T5 1 1 1 1 1
Brain afunctional today: How do I tell acast to return different aggregations?
# the rows and columns have integer names
Rgames> foo
1 2
1 1 1
2 2 2
3 3 3
4 4 4
1 1 4
2 2 8
3 3 2
4 4 1
Rgames> mfoo<-melt(foo)
Rgames> mfoo
Var1 Var2 value
1 1 1 1
2 2 1 2
3 3 1 3
4 4 1 4
5 1 1 1
6 2 1 2
7 3 1 3
8 4 1 4
9 1 2 1
10 2 2 2
11 3 2 3
12 4 2 4
13 1 2 4
14 2 2 8
15 3 2 2
16 4 2 1
Rgames> acast(mfoo,Var1~Var2,function(x)x[1]-x[2])
1 2
1 0 -3
2 0 -6
3 0 1
4 0 3
# what I would like is the casting formula to return
1 2
1 1 -3
2 2 -6
3 3 1
4 4 3
With the caveat that this is a simple example. In the general case, there will be rows with unique names -- but never more than two rows with a given name, so my x[1]-x[2] won't ever fail.
Or should I just use this:
aggregate(foo[,2],by=list((foo[,1])),function(x)x[1]-x[2])
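One possible base R sketch, which sidesteps acast rather than answering how to do it there: aggregate each column separately by row name with its own function, then bind the results. The helper `by_key` is hypothetical, and foo is rebuilt from the printout above.

```r
# rebuild the matrix from the question; row names repeat, columns are "1" and "2"
foo <- matrix(c(1, 2, 3, 4, 1, 2, 3, 4,
                1, 2, 3, 4, 4, 8, 2, 1),
              ncol = 2,
              dimnames = list(as.character(c(1:4, 1:4)), c("1", "2")))
keys <- rownames(foo)

# hypothetical helper: apply FUN to one column, grouped by row name
by_key <- function(col, FUN) sapply(split(unname(foo[, col]), keys), FUN)

# a different aggregation per column: first value for "1", difference for "2"
res <- cbind("1" = by_key("1", function(x) x[1]),
             "2" = by_key("2", function(x) x[1] - x[2]))
print(res)
```

Note that a row name occurring only once would give NA for the difference, since x[2] is then NA.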