Comparing each element in subsets of a large data set - r

I have a large data set of raw responses and want to compare each element for subject 1 in group 1 with its corresponding element for subject 1 in group 2. The same comparison has to be made between subject 2 in group 1 and subject 2 in group 2, between subject 3 in group 1 and subject 3 in group 2, and so on. What makes the problem more complex is that there are 100 groups, which form 50 pairs of groups.
The output needs to keep the original raw response if the two are the same. If they are different, the raw response needs to be replaced with '9'.
I'm pretty sure I could do it with a for loop, but I am wondering whether there is anything better than a for loop in R, such as ifelse or apply.
A simplified version of my data looks like this:
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
Thanks for any help.

#Initialization of data
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 4 4 3 5 3 1 2
5 3 2 1 5 1 2 2
6 2 5 4 4 1 3 2
7 3 2 3 2 2 1 3
8 1 2 3 3 3 2 3
9 2 2 2 2 5 3 3
10 3 3 3 5 4 1 4
11 5 3 5 4 2 2 4
12 5 3 1 1 3 3 4
Processing without for loop
#processing without for loop
# assumption: initial data is sorted by group (can be easily done)
coloumns<-!dimnames(df)[[2]] %in% c('group','subject');
subjects<-df[, 'subject']
tabl<-table(subjects)
rows<-order(subjects)
rows2<-cumsum(tabl)
rows1<-rows2-tabl+1
df[rows[-rows1],coloumns][df[rows[-rows1],coloumns]!=df[rows[-rows2],coloumns]]<-9
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 9 9 3 9 9 1 2
5 9 9 9 9 9 2 2
6 9 9 9 4 9 3 2
7 9 9 3 9 9 1 3
8 9 2 9 9 9 2 3
9 2 9 9 9 9 3 3
10 3 9 3 9 9 1 4
11 9 9 9 9 9 2 4
12 9 9 9 9 9 3 4

Below is what I did to get the output. Again, thanks to Stanislav
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 5 4 1 4 3 1 2
5 5 1 3 2 2 2 2
6 1 2 2 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 4 2 1 4 1 1 4
11 2 3 3 5 5 2 4
12 5 3 3 4 5 3 4
col<-!dimnames(df)[[2]] %in% c('subject','group')
n<-length(df[,1])
temp<-table(df$group)
n.sub<-temp[1]
temp<-seq(1,n,by=2*n.sub)
s1<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
temp<-seq(n.sub+1,n,by=2*n.sub)
s2<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
df[s2,col][df[s1,col]!=df[s2,col]]<-9
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 9 4 9 9 9 1 2
5 9 1 9 9 9 2 2
6 1 2 9 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 9 9 9 9 1 1 4
11 2 3 9 9 5 2 4
12 9 9 3 9 9 3 4
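The index arithmetic above relies on the rows being sorted by group and on every group having the same subjects in the same order. A minimal sketch that drops those assumptions by looking up each partner row with match() is shown below. It assumes, as in the example data, that groups 1 and 2 form a pair, groups 3 and 4 form a pair, and so on, and that only the second group of each pair is overwritten. The names resp, even_rows, partner and out are made up for illustration.

```r
set.seed(1)
df <- as.data.frame(matrix(sample(1:5, 60, replace = TRUE), nrow = 12))
df$subject <- rep(1:3, times = 4)
df$group <- rep(1:4, each = 3)

# response columns = everything except the identifiers
resp <- !names(df) %in% c("subject", "group")

# rows that belong to the second (even-numbered) group of each pair
even_rows <- which(df$group %% 2 == 0)

# for each of those rows, find the row with the same subject in the preceding group
partner <- match(paste(df$group[even_rows] - 1, df$subject[even_rows]),
                 paste(df$group, df$subject))

# TRUE wherever the two responses differ
mismatch <- df[even_rows, resp] != df[partner, resp]

out <- df
out[even_rows, resp][mismatch] <- 9L
out
```

Because the partner is found by key rather than by position, the same code should also work when the rows are shuffled, as long as every even-group row has a partner.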

Related

filtering scores from one variable and placing them in a new variable

I have a variable, test scores, coded on a scale from 1-9.
I have to classify those who score 1-3 as low, 4-6 as good and 7-9 as high in new variables.
Then I have to make a new variable that compares low and high, and a variable that compares low and good.
test_scores<- c(sample(1:10, 122, replace = TRUE))
test_scores<-as.data.frame(test_scores)
low<- filter(test_scores,test_scores1 > 3)
high<- filter(test_scores, test_scores< 7)
good<-filter(test_scores,test_scores== 4:6)
## But the Ns of the new variables do not add up to 122.
## I thought of using ifelse():
low<- ifelse(test_scores$test_scores == 1:3 , 1:3 , 0)
mods<- ifelse(test_scores$test_scores == 4:6, 4:6, 0)
high<- ifelse(test_scores$test_scores == 7:9, 7:9, 0)
## But some scores are not picked up; they become 0 even though the score matches. Any ideas?
You can use "cut" to generate the new bins:
set.seed(123)
test_scores <- sample(1:9, 122, T)
test_scores
#> [1] 3 3 2 6 5 4 6 9 5 3 9 9 9 3 8 7 9 3 4 1 7 5 7 9 9 7 5 7 5 6 9 2 5 8 2 1 9
#> [38] 9 6 5 9 4 6 8 6 6 7 1 6 2 1 2 4 5 6 3 9 4 6 9 9 7 3 8 9 3 7 3 7 6 5 5 8 3
#> [75] 2 2 6 4 1 6 3 8 3 8 1 7 7 7 6 7 5 6 8 5 7 4 3 9 7 6 9 7 2 3 8 4 7 4 1 8 4
#> [112] 9 8 6 4 8 3 4 4 6 1 4
cuts <- cut(test_scores, c(0,3,6,9), labels = F)
cuts
#> [1] 1 1 1 2 2 2 2 3 2 1 3 3 3 1 3 3 3 1 2 1 3 2 3 3 3 3 2 3 2 2 3 1 2 3 1 1 3
#> [38] 3 2 2 3 2 2 3 2 2 3 1 2 1 1 1 2 2 2 1 3 2 2 3 3 3 1 3 3 1 3 1 3 2 2 2 3 1
#> [75] 1 1 2 2 1 2 1 3 1 3 1 3 3 3 2 3 2 2 3 2 3 2 1 3 3 2 3 3 1 1 3 2 3 2 1 3 2
#> [112] 3 3 2 2 3 1 2 2 2 1 2
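If named categories are easier to work with than the integer codes, cut() also accepts a vector of labels. A small sketch, where the label names "low", "good" and "high" come from the question and the variable names bins and counts are made up:

```r
set.seed(123)
test_scores <- sample(1:9, 122, replace = TRUE)

# intervals are (0,3], (3,6], (6,9], so every score from 1 to 9 falls in exactly one bin
bins <- cut(test_scores, breaks = c(0, 3, 6, 9),
            labels = c("low", "good", "high"))

counts <- table(bins)
counts
sum(counts)  # every observation is counted once, so this equals length(test_scores)
```

Since each score lands in exactly one bin, the three counts add up to the full sample size, which is the check the question was missing.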
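If you want a variable for each bin, and zero otherwise, you must use %in%, not ==. The == operator compares element by element and recycles the shorter vector, so it does not test membership in a set: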
low<- ifelse(test_scores$test_scores %in% 1:3 , test_scores$test_scores , 0)
mods<- ifelse(test_scores$test_scores %in% 4:6, test_scores$test_scores, 0)
high<- ifelse(test_scores$test_scores %in% 7:9, test_scores$test_scores, 0)
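To see why == goes wrong here, a tiny example helps. The vector scores below is made up for illustration; its length is a multiple of 3 so that recycling happens without a warning.

```r
scores <- c(2, 5, 8, 1, 3, 9)

# == recycles 1:3 along scores: 2==1, 5==2, 8==3, 1==1, 3==2, 9==3
by_eq <- scores == 1:3
by_eq
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE

# %in% asks, for each score, whether it occurs anywhere in 1:3
by_in <- scores %in% 1:3
by_in
# [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

low <- ifelse(by_in, scores, 0)
low
# [1] 2 0 0 1 3 0
```

With ==, a score of 2 or 3 is only kept if it happens to line up with the same value in the recycled 1:3, which is why scores that "match" still turned into 0.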

Appending into data frame with a for loop [duplicate]

I want to read some files, remove the NA values from each one, and then report the number of observations left after removing the NAs.
I have written this script, but the result was something weird:
complete <- function(directory, id){
fileList <- list.files(directory, full.names = TRUE)[id]
datafamelist <- data.frame(id = numeric(), nobs = numeric())
for(Rfile in fileList){
cleandata <- na.omit(read.csv(file = Rfile))
datafamelist <- rbind(datafamelist, c(cleandata$ID, nrow(cleandata)))
}
datafamelist
}
and the result was something like this:
complete("~/Desktop/DataSets/specdata", 1:5)
X1L X1L.1 X1L.2 X1L.3 X1L.4 X1L.5 X1L.6 X1L.7 X1L.8 X1L.9 X1L.10 X1L.11 X1L.12 X1L.13 X1L.14 X1L.15 X1L.16 X1L.17 X1L.18 X1L.19
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.20 X1L.21 X1L.22 X1L.23 X1L.24 X1L.25 X1L.26 X1L.27 X1L.28 X1L.29 X1L.30 X1L.31 X1L.32 X1L.33 X1L.34 X1L.35 X1L.36 X1L.37
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.38 X1L.39 X1L.40 X1L.41 X1L.42 X1L.43 X1L.44 X1L.45 X1L.46 X1L.47 X1L.48 X1L.49 X1L.50 X1L.51 X1L.52 X1L.53 X1L.54 X1L.55
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.56 X1L.57 X1L.58 X1L.59 X1L.60 X1L.61 X1L.62 X1L.63 X1L.64 X1L.65 X1L.66 X1L.67 X1L.68 X1L.69 X1L.70 X1L.71 X1L.72 X1L.73
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.74 X1L.75 X1L.76 X1L.77 X1L.78 X1L.79 X1L.80 X1L.81 X1L.82 X1L.83 X1L.84 X1L.85 X1L.86 X1L.87 X1L.88 X1L.89 X1L.90 X1L.91
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.92 X1L.93 X1L.94 X1L.95 X1L.96 X1L.97 X1L.98 X1L.99 X1L.100 X1L.101 X1L.102 X1L.103 X1L.104 X1L.105 X1L.106 X1L.107 X1L.108
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.109 X1L.110 X1L.111 X1L.112 X1L.113 X1L.114 X1L.115 X1L.116 X117L
1 1 1 1 1 1 1 1 1 117
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
instead of being like this:
## id nobs
## 1 1 117
## 2 2 000
## 3 3 000
## 4 4 000
## 5 5 000
where 000 stands for the number of complete observations that should appear there.
Try reading the files and building your data frame like this:
setwd("<Your Directory>")
file_list <- list.files()
for (file in file_list){
# if the merged dataset doesn't exist, create it
if (!exists("rawdata")){
rawdata <- read.csv(file)
} else {
# if the merged dataset does exist, append to it
# (else avoids reading the first file twice)
temp_dataset <- read.csv(file)
rawdata<-rbind(rawdata, temp_dataset)
rm(temp_dataset)
}
}
For NAs, you can check which columns contain NA and handle them accordingly.
To check for NAs, use summary().
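For the id/nobs table the question asks for, one possible sketch is to count the complete rows per file and build the data frame once at the end, instead of growing it with rbind() inside the loop. It assumes that the position of a file in the sorted listing is its id (the question's script used the ID column of each file instead). The demo directory and the two small CSV files are made up so that the example runs anywhere.

```r
complete <- function(directory, id) {
  files <- list.files(directory, full.names = TRUE)[id]
  # one number per file: rows left after dropping any row that contains an NA
  nobs <- vapply(files, function(f) nrow(na.omit(read.csv(f))), integer(1))
  data.frame(id = id, nobs = unname(nobs))
}

# hypothetical demo data written to a temporary directory
demo_dir <- file.path(tempdir(), "complete_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ID = 1, value = c(1, NA, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, value = c(NA, NA, 5)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

res <- complete(demo_dir, 1:2)
res
#   id nobs
# 1  1    2
# 2  2    1
```

The original output got so wide because c(cleandata$ID, nrow(cleandata)) puts the whole ID column, one value per remaining row, in front of the count, so each rbind() call added a row with more than a hundred columns.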

Mean and SD in a table

In R, when you tabulate two variables with table(), you get a frequency table:
> table(data$Var1, data$Var2)
1 2 3 4 5
0 0 1 5 6 12
1 1 10 6 7 0
2 2 6 7 6 3
3 2 9 8 3 2
4 4 9 5 3 3
5 3 4 9 4 4
6 2 7 7 4 4
7 2 7 7 6 2
8 5 7 5 5 2
9 5 4 5 6 4
Is there a way to include the mean and SD in each row, something like this?
1 2 3 4 5 mean SD
0 0 1 5 6 12 4.20833 0.93153
1 1 10 6 7 0 .. ..
2 2 6 7 6 3
3 2 9 8 3 2
4 4 9 5 3 3
5 3 4 9 4 4
6 2 7 7 4 4
7 2 7 7 6 2
8 5 7 5 5 2
9 5 4 5 6 4
Save the table in something called T, and then:
For the mean and sd:
> cbind(T,
mean=apply(T,1,function(x){
(sum(x*(1:5)))/sum(x)}),
sd=apply(T,1,function(x){sd(rep(1:5,x))}))
1 2 3 4 5 mean sd
0 4 3 1 1 1 2.200000 1.3984118
1 1 2 3 3 3 3.416667 1.3113722
2 2 2 1 2 1 2.750000 1.4880476
3 0 1 2 4 1 3.625000 0.9161254
So 2.2 and 1.3984 are the mean and sd of c(1,1,1,1,2,2,2,3,4,5).
It's probably inefficient to compute the sd by reconstructing the original vector with rep, but it's late, and working out all the sums of squares and squares of sums for the sd is not something my brain can do at 1am.
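For reference, the sd can also be computed straight from the counts using the usual shortcut: the sample variance is (sum of squares - n * mean^2) / (n - 1). A sketch, assuming the column names of the table are the numeric values 1 to 5. The matrix tab below just retypes the first two rows of the example output, and the names values, n, m, ss and s are made up.

```r
tab <- matrix(c(4, 3, 1, 1, 1,
                1, 2, 3, 3, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("0", "1"), as.character(1:5)))

values <- as.numeric(colnames(tab))   # the scores the columns stand for
n  <- rowSums(tab)                    # observations per row
m  <- drop(tab %*% values) / n        # weighted mean per row
ss <- drop(tab %*% values^2)          # sum of squared scores per row
s  <- sqrt((ss - n * m^2) / (n - 1))  # sample sd, same definition as sd()

res <- cbind(tab, mean = m, sd = s)
res
```

For the first row this gives a mean of 2.2 and an sd of about 1.3984, the same as the rep()-based version.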

How do I add a vector where I collapse scores from individuals within pairs?

I have done an experiment in which participants have solved a task in pairs, with another participant. Each participant has then received a score for how well they did the task. Pairs have gone through different amounts of trials.
I have a data frame similar to the one below:
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)
participant pair trial score
1 1 1 2
1 1 2 3
2 1 1 6
2 1 2 3
3 2 1 4
3 2 2 7
3 2 3 3
4 2 1 1
4 2 2 8
4 2 3 5
5 3 1 4
6 3 1 3
I would like to add a new vector to the data frame, where each participant gets the numeric difference between their own score and the other participant's score within each trial.
Does someone have an idea about how one might do that?
It should end up looking something like this:
participant pair trial score difference
1 1 1 2 4
1 1 2 3 0
2 1 1 6 4
2 1 2 3 0
3 2 1 4 3
3 2 2 7 1
3 2 3 3 2
4 2 1 1 3
4 2 2 8 1
4 2 3 5 2
5 3 1 4 1
6 3 1 3 1
Here's a solution that involves first reordering data such that each sequential pair of rows corresponds to a single pair within a single trial. This allows us to make a single call to diff() to extract the differences:
data <- data[order(data$trial,data$pair,data$participant),];
data$diff <- rep(diff(data$score)[c(T,F)],each=2L)*c(-1L,1L);
data;
## participant pair trial score diff
## 1 1 1 1 2 -4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 -3
## 11 5 3 1 4 1
## 12 6 3 1 3 -1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 -1
## 9 4 2 2 8 1
## 7 3 2 3 3 -2
## 10 4 2 3 5 2
I assumed you wanted the sign to capture the direction of the difference. So, for instance, if a participant has a score 4 points below the other participant in the same trial-pair, then I assumed you would want -4. If you want all-positive values, you can remove the multiplication by c(-1L,1L) and add a call to abs():
data$diff <- rep(abs(diff(data$score)[c(T,F)]),each=2L);
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 3
## 11 5 3 1 4 1
## 12 6 3 1 3 1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 1
## 9 4 2 2 8 1
## 7 3 2 3 3 2
## 10 4 2 3 5 2
Here's a solution built around ave() that doesn't require reordering the whole data.frame first:
data$diff <- ave(data$score,data$trial,data$pair,FUN=function(x) abs(diff(x)));
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 2 1 1 2 3 0
## 3 2 1 1 6 4
## 4 2 1 2 3 0
## 5 3 2 1 4 3
## 6 3 2 2 7 1
## 7 3 2 3 3 2
## 8 4 2 1 1 3
## 9 4 2 2 8 1
## 10 4 2 3 5 2
## 11 5 3 1 4 1
## 12 6 3 1 3 1
Here's how you can get the score of the other participant in the same trial-pair:
data$other <- ave(data$score,data$trial,data$pair,FUN=rev);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 2 1 1 2 3 3
## 3 2 1 1 6 2
## 4 2 1 2 3 3
## 5 3 2 1 4 1
## 6 3 2 2 7 8
## 7 3 2 3 3 5
## 8 4 2 1 1 4
## 9 4 2 2 8 7
## 10 4 2 3 5 3
## 11 5 3 1 4 3
## 12 6 3 1 3 4
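Once the other participant's score is available, the signed difference can be had without reordering the data.frame at all. A small sketch using the data from the question; the column name signed_diff is made up, and it assumes every trial-pair has exactly two rows.

```r
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)

# score of the other participant in the same trial-pair
other <- ave(data$score, data$trial, data$pair, FUN = rev)

# own score minus the other's: negative means this participant scored lower
data$signed_diff <- data$score - other
data$signed_diff
# [1] -4  0  4  0  3 -1 -2 -3  1  2  1 -1
```

Wrapping the subtraction in abs() gives the all-positive version.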
Or, assuming the data.frame has been reordered as per the initial solution:
data$other <- c(rbind(data$score[c(F,T)],data$score[c(T,F)]));
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Alternative, using matrix() instead of rbind():
data$other <- c(matrix(data$score,2L)[2:1,]);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Here is an option using data.table:
library(data.table)
setDT(data)[,difference := abs(diff(score)), by = .(pair, trial)]
data
# participant pair trial score difference
# 1: 1 1 1 2 4
# 2: 1 1 2 3 0
# 3: 2 1 1 6 4
# 4: 2 1 2 3 0
# 5: 3 2 1 4 3
# 6: 3 2 2 7 1
# 7: 3 2 3 3 2
# 8: 4 2 1 1 3
# 9: 4 2 2 8 1
#10: 4 2 3 5 2
#11: 5 3 1 4 1
#12: 6 3 1 3 1
A slightly faster option would be:
setDT(data)[, difference := abs((score - shift(score))[2]) , by = .(pair, trial)]
If we need the score of the other participant in the pair:
data[, other:= rev(score) , by = .(pair, trial)]
data
# participant pair trial score difference other
# 1: 1 1 1 2 4 6
# 2: 1 1 2 3 0 3
# 3: 2 1 1 6 4 2
# 4: 2 1 2 3 0 3
# 5: 3 2 1 4 3 1
# 6: 3 2 2 7 1 8
# 7: 3 2 3 3 2 5
# 8: 4 2 1 1 3 4
# 9: 4 2 2 8 1 7
#10: 4 2 3 5 2 3
#11: 5 3 1 4 1 3
#12: 6 3 1 3 1 4
Or using dplyr:
library(dplyr)
data %>%
group_by(pair, trial) %>%
mutate(difference = abs(diff(score)))
# participant pair trial score difference
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 2 4
#2 1 1 2 3 0
#3 2 1 1 6 4
#4 2 1 2 3 0
#5 3 2 1 4 3
#6 3 2 2 7 1
#7 3 2 3 3 2
#8 4 2 1 1 3
#9 4 2 2 8 1
#10 4 2 3 5 2
#11 5 3 1 4 1
#12 6 3 1 3 1

From table to data.frame

I have a table that looks like:
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))
tabl
y
z 1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Now how do I transform it into a data.frame that looks like
1 2 3 4 5 6 7 8 9 10
A 5 3 1 1 3 6 3 7 2 4
B 4 5 3 6 5 1 3 1 4 4
C 1 2 6 3 2 3 4 2 4 2
Here are a couple of options.
The reason as.data.frame(tabl) doesn't work is that it dispatches to the S3 method as.data.frame.table() which does something useful but different from what you want.
as.data.frame.matrix(tabl)
# 1 2 3 4 5 6 7 8 9 10
# A 5 4 3 1 1 3 3 2 6 2
# B 1 4 3 4 5 3 4 4 3 3
# C 4 2 4 5 4 4 3 4 1 5
## This will also work
as.data.frame(unclass(tabl))
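To make the difference between the two conversions concrete, here is a small sketch that runs both on the same table. The seed and the names long and wide are made up for illustration.

```r
set.seed(42)
dat <- data.frame(expand.grid(x = 1:10, y = 1:10),
                  z = sample(LETTERS[1:3], size = 100, replace = TRUE))
tabl <- with(dat, table(z, y))

# as.data.frame() on a table gives the long format:
# one row per (z, y) cell, with the count in a column called Freq
long <- as.data.frame(tabl)
head(long, 3)

# as.data.frame.matrix() keeps the layout of the printed table:
# one row per level of z, one column per level of y
wide <- as.data.frame.matrix(tabl)
dim(wide)
```

Both hold the same counts, so which one to use depends on whether the next step wants one row per cell or one row per category.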

Resources