Appending into data frame with a for loop [duplicate] - r

I want to read some files, remove the NA values from each one, and then report the number of observations left after removing the NAs.
I wrote this script, but the result was very strange:
complete <- function(directory, id){
  fileList <- list.files(directory, full.names = TRUE)[id]
  datafamelist <- data.frame(id = numeric(), nobs = numeric())
  for(Rfile in fileList){
    cleandata <- na.omit(read.csv(file = Rfile))
    datafamelist <- rbind(datafamelist, c(cleandata$ID, nrow(cleandata)))
  }
  datafamelist
}
The result looked like this:
complete("~/Desktop/DataSets/specdata", 1:5)
X1L X1L.1 X1L.2 X1L.3 X1L.4 X1L.5 X1L.6 X1L.7 X1L.8 X1L.9 X1L.10 X1L.11 X1L.12 X1L.13 X1L.14 X1L.15 X1L.16 X1L.17 X1L.18 X1L.19
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.20 X1L.21 X1L.22 X1L.23 X1L.24 X1L.25 X1L.26 X1L.27 X1L.28 X1L.29 X1L.30 X1L.31 X1L.32 X1L.33 X1L.34 X1L.35 X1L.36 X1L.37
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.38 X1L.39 X1L.40 X1L.41 X1L.42 X1L.43 X1L.44 X1L.45 X1L.46 X1L.47 X1L.48 X1L.49 X1L.50 X1L.51 X1L.52 X1L.53 X1L.54 X1L.55
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.56 X1L.57 X1L.58 X1L.59 X1L.60 X1L.61 X1L.62 X1L.63 X1L.64 X1L.65 X1L.66 X1L.67 X1L.68 X1L.69 X1L.70 X1L.71 X1L.72 X1L.73
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.74 X1L.75 X1L.76 X1L.77 X1L.78 X1L.79 X1L.80 X1L.81 X1L.82 X1L.83 X1L.84 X1L.85 X1L.86 X1L.87 X1L.88 X1L.89 X1L.90 X1L.91
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.92 X1L.93 X1L.94 X1L.95 X1L.96 X1L.97 X1L.98 X1L.99 X1L.100 X1L.101 X1L.102 X1L.103 X1L.104 X1L.105 X1L.106 X1L.107 X1L.108
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
X1L.109 X1L.110 X1L.111 X1L.112 X1L.113 X1L.114 X1L.115 X1L.116 X117L
1 1 1 1 1 1 1 1 1 117
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
instead of looking like this:
## id nobs
## 1 1 117
## 2 2 000
## 3 3 000
## 4 4 000
## 5 5 000
where 000 stands for the number of complete observations that should appear there.

Try reading the files and building your data frame like this:
setwd("<Your Directory>")
file_list <- list.files()
for (file in file_list) {
  if (!exists("rawdata")) {
    # if the merged dataset doesn't exist yet, create it from the first file
    rawdata <- read.csv(file)
  } else {
    # if the merged dataset does exist, append the next file to it
    # (using else avoids appending the first file twice)
    temp_dataset <- read.csv(file)
    rawdata <- rbind(rawdata, temp_dataset)
    rm(temp_dataset)
  }
}
To handle the NAs, check which columns contain NA values and treat them accordingly.
summary() reports the number of NAs in each column.
Related

How to extract or predict latent class membership in gmnl?

Let's say you run the example for a latent class model from ?gmnl:
library(mlogit)
library(gmnl)
## Examples using the Electricity data set from the mlogit package
data("Electricity", package = "mlogit")
Electr <- mlogit.data(Electricity, id.var = "id", choice = "choice",
                      varying = 3:26, shape = "wide", sep = "")
## Estimate a LC model with 2 classes
Elec.lc <- gmnl(choice ~ pf + cl + loc + wk + tod + seas | 0 | 0 | 0 | 1,
                data = Electr,
                subset = 1:3000,
                model = 'lc',
                panel = TRUE,
                Q = 2)
summary(Elec.lc)
You get a fitted model with coefficient estimates for two classes (class 1 & 2). Is there a way to extract (or predict) for each observation, what the most likely class is that this observation belongs to?
After several helpful comments and lots of digging, it seems that there is an undocumented feature that allows you to get predicted class probabilities, which are stored in Wnq. You get one entry per observation and the number of columns matches the number of latent classes (Q = 2 from above), and entries sum to 1.
## Get class probabilities
head(Elec.lc$Wnq)
init
[1,] 0.5547805 0.4452195
[2,] 0.5547805 0.4452195
[3,] 0.5547805 0.4452195
[4,] 0.5547805 0.4452195
[5,] 0.5547805 0.4452195
[6,] 0.5547805 0.4452195
The fitted model contains a matrix called prob.alt which gives the probability of each choice, so you can do:
predictions <- apply(Elec.cor$prob.alt,1, which.max)
predictions
#> [1] 1 1 2 3 1 4 4 3 3 3 2 1 2 2 3 1 1 1 2 3 4 4 4 1 1 4 1 1 4 4 4 2 4 3 1 2 4
#> [38] 4 4 1 1 4 1 1 4 4 4 2 1 1 2 3 4 4 4 2 4 3 4 2 1 4 2 2 2 2 4 2 1 3 4 3 4 4
#> [75] 4 1 4 2 3 2 2 1 3 3 4 3 4 1 1 4 2 1 4 4 2 2 2 2 2 2 1 4 2 2 2 2 1 2 2 4 3
#> [112] 1 1 1 2 3 4 4 4 2 4 3 4 1 1 4 2 1 4 4 2 2 1 4 2 2 2 2 1 2 1 2 4 3 2 2 2 2
#> [149] 1 4 2 2 2 1 2 1 4 3 2 2 2 1 2 1 1 4 2 1 4 2 2 2 2 1 2 1 1 4 3 2 2 2 2 1 4
#> [186] 2 2 2 2 4 2 1 4 3 2 2 2 2 2 1 1 4 2 1 4 4 3 2 2 4 4 1 3 4 1 2 4 3 1 1 1 2
#> [223] 3 4 4 4 1 2 4 2 3 4 4 1 3 4 2 3 3 2 4 1 1 4 4 4 2 1 3 1 2 1 1 2 3 1 4 4 2
#> [260] 4 3 2 1 2 4 2 3 3 4 1 3 4 2 3 3 4 4 4 4 4 1 3 2 3 1 3 3 1 4 2 1 4 4 2 2 1
#> [297] 3 1 1 4 2 4 1 2 4 1 1 4 4 4 2 1 1 2 3 4 4 4 2 4 3 4 1 1 1 2 3 1 4 4 3 4 3
#> [334] 2 1 1 4 1 1 4 4 2 2 1 3 1 3 1 4 2 2 2 2 1 2 1 3 4 3 2 2 2 2 1 4 3 2 2 2 1
#> [371] 2 4 4 1 3 4 2 3 3 2 1 3 3 3 3 4 1 1 4 1 1 4 4 2 2 2 4 2 3 4 4 4 1 4 2 3 2
#> [408] 1 4 3 2 2 2 1 2 1 1 4 3 1 1 2 3 4 4 4 3 3 3 2 1 2 4 3 4 4 4 3 4 3 4 3 4 1
#> [445] 1 4 1 1 4 4 4 2 1 4 2 2 2 2 1 2 1 3 4 3 1 4 2 2 2 2 1 2 4 2 4 3 3 3 4 1 1
#> [482] 4 2 1 4 4 2 2 2 2 3 1 1 1 2 3 4 4 4 2 2 4 2 3 4 4 4 3 4 2 3 2 2 4 2 3 4 4
#> [519] 1 1 4 2 3 2 2 4 1 1 4 4 4 2 2 3 1 3 2 1 2 2 1 4 4 2 2 2 4 2 1 4 3 2 2 2 4
#> [556] 2 1 1 4 2 1 4 2 2 2 2 1 2 1 2 4 3 1 1 2 3 4 4 4 2 4 3 4 2 4 4 4 3 4 2 3 3
#> [593] 3 1 3 3 1 1 2 3 1 4 4 3 4 3 2 1 2 2 2 2 1 4 3 2 2 2 2 2 2 4 2 3 3 4 1 3 4
#> [630] 2 3 3 2 3 1 1 4 4 4 2 2 3 1 3 1 1 2 3 1 4 4 3 3 3 4 1 4 4 4 3 4 1 4 3 1 1
#> [667] 3 3 2 2 3 1 1 1 2 3 1 4 4 2 1 4 2 2 2 2 1 2 1 1 4 2 1 1 2 3 4 4 4 2 4 3 4
#> [704] 1 2 2 2 2 1 4 2 2 2 2 4 2 2 2 2 2 1 4 3 2 2 2 4 2 1 4 2 2 2 2 4 2 1 3 4 3
#> [741] 1 4 3 2 2 2 2 2 1 1
If we compare these predictions to the actual choice, we see that the prediction is correct about 50% of the time (the values in the diagonal are correct):
table(predictions, Electricity$choice[1:750])
#>
#> predictions 1 2 3 4
#> 1 78 35 28 32
#> 2 40 129 40 33
#> 3 16 27 57 24
#> 4 27 36 38 110
Created on 2022-08-06 by the reprex package (v2.0.1)
I suspect that Wnq does not hold individual class membership probabilities, though.
Even in your example above, Elec.lc$Wnq returns what look like class membership probabilities, but critically they are identical across individuals.
I ran into the same problem when looking for this. I think Elec.lc$Wnq is just the mean of the class membership probabilities.
I have not looked thoroughly through the gmnl code, but I think the object Qir may be what you are looking for.
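Whichever component turns out to hold per-individual class membership probabilities, turning such a matrix into a most likely class is short. The sketch below uses a made-up matrix class_prob (hypothetical values, not gmnl output) with one row per individual and one column per class.

```r
# Hypothetical class membership probabilities: 3 individuals, 2 latent classes.
class_prob <- matrix(c(0.80, 0.20,
                       0.35, 0.65,
                       0.10, 0.90),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(NULL, c("class1", "class2")))

# For each row, take the column index of the largest probability.
most_likely <- max.col(class_prob, ties.method = "first")
print(most_likely)
```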

how to solve error : Error in storage.mode(x) <- "double" : 'list' object cannot be coerced to type 'double'

Hello, I'm trying to run a SOM and k-means analysis,
but I can't get it to work because of this error:
Error in storage.mode(x) <- "double" : 'list' object cannot be coerced to type 'double'
How can I solve this problem?
library(kohonen)  # provides supersom() and somgrid()
cdata <- read.delim("Cluster.txt", stringsAsFactors=FALSE)
cdata.n <- scale(subset(cdata, select=-c(ID)))
som_model2 <- supersom(data = cdata.n, grid = somgrid(10, 10, "rectangular"))
k <- 6
somClusters <- kmeans(som_model2$codes, centers = k)
I want to cluster the data into 6 clusters.
Please help me.
I am using this data:
https://github.com/woosa7/R_DataAnalytics/blob/08ea98289f4def3c4f72d4c10d3767784b42619b/R_DataMining/data/Cluster.txt
Try unlist:
somClusters <- kmeans(unlist(som_model2$codes), centers = 6)
somClusters
Cluster means:
[,1]
1 -0.6702128
2 5.2157179
3 1.2555768
4 -0.2632253
5 2.6067733
6 0.3503127
Clustering vector:
[1] 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 4 4 4 4 4 4 4
[50] 4 6 6 4 6 4 4 4 4 4 4 6 3 3 6 6 4 4 4 4 4 3 3 3 3 6 6 4 4 4 4 5 5 3 3 6 6 4 4 4 4 2 5 3 6 6 6 4 6
[99] 6 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 4 1 1 1 1 1 1 4 1 1 1 1 1 6 6 4 4 4 4 1 4 1 1 3 3 6 6 4 4 4
[148] 1 4 4 3 3 6 6 6 4 4 4 4 4 5 5 3 6 4 6 4 4 4 4 5 5 3 6 6 6 6 6 4 4 2 5 3 3 6 6 6 6 4 4 2 5 3 6 3 6
[197] 6 6 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 1 1 4 4 4 4 4 3 3 4 4 4
[246] 4 4 4 4 4 3 3 6 4 6 4 6 6 4 4 3 3 6 6 6 6 6 6 6 6 5 3 3 3 3 6 6 6 6 6 5 5 3 3 3 3 3 6 6 6 5 5 5 5
[295] 3 3 3 3 3 6 2 5 3 3 6 6 4 4 4 1 5 5 3 3 6 6 6 4 4 1 5 3 3 6 6 6 4 4 4 1 3 6 6 6 4 6 6 4 4 1 1 1 4
[344] 4 4 4 6 4 4 1 1 1 1 1 4 4 4 4 4 1 1 1 1 1 1 1 4 4 4 1 1 1 1 1 1 1 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1
[393] 1 1 1 1 1 1 1 4
Within cluster sum of squares by cluster:
[1] 1.939971 9.714721 4.939015 2.981251 3.051715 3.374086
(between_SS / total_SS = 93.6 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
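Note that unlist() flattens the codebook into one long vector, so the kmeans() call above clusters 400 individual numbers (hence the one-column cluster means) rather than the map units. Assuming som_model2$codes is a list holding one codebook matrix per data layer, as in recent versions of the kohonen package, indexing the first element keeps the rows intact. The sketch below uses a random stand-in matrix of 100 rows and 4 columns (hypothetical data, not the Cluster.txt values) so that it runs without kohonen installed.

```r
set.seed(42)

# Stand-in for som_model2$codes: a list with one codebook matrix
# (100 map units x 4 variables). The dimensions are an assumption.
codes <- list(matrix(rnorm(100 * 4), nrow = 100, ncol = 4))

# Take the matrix out of the list instead of flattening it.
codebook <- codes[[1]]
som_clusters <- kmeans(codebook, centers = 6)

# One cluster label per map unit.
print(table(som_clusters$cluster))
```

With the real model, the equivalent call would be kmeans(som_model2$codes[[1]], centers = 6).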

Count non-zero values of column in R [duplicate]

Suppose I have a data frame like this one:
DF
Id X Y Z
1 1 5 0
1 2 0 0
1 3 0 5
1 4 9 0
1 5 2 3
1 6 5 0
2 1 5 0
2 2 4 0
2 3 0 6
2 4 9 6
2 5 2 0
2 6 5 2
3 1 5 6
3 2 4 0
3 3 6 5
3 4 9 0
3 5 2 0
3 6 5 0
I want to count the number of non-zero entries of variable Z within each Id and record that value in a new column Count, so the new data frame will look like this:
DF1
Id X Y Z Count
1 1 5 0 2
1 2 4 0 2
1 3 6 5 2
1 4 9 0 2
1 5 2 3 2
1 6 5 0 2
2 1 5 0 3
2 2 4 0 3
2 3 6 6 3
2 4 9 6 3
2 5 2 0 3
2 6 5 2 3
3 1 5 6 2
3 2 4 0 2
3 3 6 5 2
3 4 9 0 2
3 5 2 0 2
3 6 5 0 2
We can use ave() from base R to count the number of non-zero values in column Z, grouped by Id:
df$Count <- ave(df$Z, df$Id, FUN = function(x) sum(x!=0))
df$Count
#[1] 2 2 2 2 2 2 3 3 3 3 3 3 2 2 2 2 2 2
You can try this, it gives you exactly what you want:
library(data.table)
dt <- data.table(df)
dt[, Count := sum(Z != 0), by = Id]
dt
# Id X Y Z Count
# 1: 1 1 5 0 2
# 2: 1 2 0 0 2
# 3: 1 3 0 5 2
# 4: 1 4 9 0 2
# 5: 1 5 2 3 2
# 6: 1 6 5 0 2
# 7: 2 1 5 0 3
# 8: 2 2 4 0 3
# 9: 2 3 0 6 3
# 10: 2 4 9 6 3
# 11: 2 5 2 0 3
# 12: 2 6 5 2 3
# 13: 3 1 5 6 2
# 14: 3 2 4 0 2
# 15: 3 3 6 5 2
# 16: 3 4 9 0 2
# 17: 3 5 2 0 2
# 18: 3 6 5 0 2
This will also work, provided the rows are sorted by Id and every Id has at least one non-zero Z:
df$Count <- rep(aggregate(Z~Id, df[df$Z != 0,], length)$Z, table(df$Id))
Id X Y Z Count
1 1 1 5 0 2
2 1 2 0 0 2
3 1 3 0 5 2
4 1 4 9 0 2
5 1 5 2 3 2
6 1 6 5 0 2
7 2 1 5 0 3
8 2 2 4 0 3
9 2 3 0 6 3
10 2 4 9 6 3
11 2 5 2 0 3
12 2 6 5 2 3
13 3 1 5 6 2
14 3 2 4 0 2
15 3 3 6 5 2
16 3 4 9 0 2
17 3 5 2 0 2
18 3 6 5 0 2
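The rep()/aggregate() variant relies on the row order and on every Id having a non-zero Z. A minimal sketch of a lookup-based alternative that avoids both assumptions is shown below. It rebuilds the example data frame from the question; the name counts is illustrative.

```r
# Example data from the question
df <- data.frame(
  Id = rep(1:3, each = 6),
  X  = rep(1:6, times = 3),
  Y  = c(5, 0, 0, 9, 2, 5,  5, 4, 0, 9, 2, 5,  5, 4, 6, 9, 2, 5),
  Z  = c(0, 0, 5, 0, 3, 0,  0, 0, 6, 6, 0, 2,  6, 0, 5, 0, 0, 0)
)

# Number of non-zero Z values per Id, as a named vector
counts <- tapply(df$Z != 0, df$Id, sum)

# Look up each row's count by its Id; this works for any row order
df$Count <- as.vector(counts[as.character(df$Id)])
print(df)
```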

How do I add a vector where I collapse scores from individuals within pairs?

I have done an experiment in which participants have solved a task in pairs, with another participant. Each participant has then received a score for how well they did the task. Pairs have gone through different amounts of trials.
I have a data frame similar to the one below:
participant <- c(1,1,2,2,3,3,3,4,4,4,5,6)
pair <- c(1,1,1,1,2,2,2,2,2,2,3,3)
trial <- c(1,2,1,2,1,2,3,1,2,3,1,1)
score <- c(2,3,6,3,4,7,3,1,8,5,4,3)
data <- data.frame(participant, pair, trial, score)
participant pair trial score
1 1 1 2
1 1 2 3
2 1 1 6
2 1 2 3
3 2 1 4
3 2 2 7
3 2 3 3
4 2 1 1
4 2 2 8
4 2 3 5
5 3 1 4
6 3 1 3
I would like to add a new vector to the data frame, where each participant gets the numeric difference between their own score and the other participant's score within each trial.
Does someone have an idea about how one might do that?
It should end up looking something like this:
participant pair trial score difference
1 1 1 2 4
1 1 2 3 0
2 1 1 6 4
2 1 2 3 0
3 2 1 4 3
3 2 2 7 1
3 2 3 3 2
4 2 1 1 3
4 2 2 8 1
4 2 3 5 2
5 3 1 4 1
6 3 1 3 1
Here's a solution that involves first reordering data such that each sequential pair of rows corresponds to a single pair within a single trial. This allows us to make a single call to diff() to extract the differences:
data <- data[order(data$trial,data$pair,data$participant),];
data$diff <- rep(diff(data$score)[c(T,F)],each=2L)*c(-1L,1L);
data;
## participant pair trial score diff
## 1 1 1 1 2 -4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 -3
## 11 5 3 1 4 1
## 12 6 3 1 3 -1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 -1
## 9 4 2 2 8 1
## 7 3 2 3 3 -2
## 10 4 2 3 5 2
I assumed you wanted the sign to capture the direction of the difference. So, for instance, if a participant has a score 4 points below the other participant in the same trial-pair, then I assumed you would want -4. If you want all-positive values, you can remove the multiplication by c(-1L,1L) and add a call to abs():
data$diff <- rep(abs(diff(data$score)[c(T,F)]),each=2L);
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 3 2 1 1 6 4
## 5 3 2 1 4 3
## 8 4 2 1 1 3
## 11 5 3 1 4 1
## 12 6 3 1 3 1
## 2 1 1 2 3 0
## 4 2 1 2 3 0
## 6 3 2 2 7 1
## 9 4 2 2 8 1
## 7 3 2 3 3 2
## 10 4 2 3 5 2
Here's a solution built around ave() that doesn't require reordering the whole data.frame first:
data$diff <- ave(data$score,data$trial,data$pair,FUN=function(x) abs(diff(x)));
data;
## participant pair trial score diff
## 1 1 1 1 2 4
## 2 1 1 2 3 0
## 3 2 1 1 6 4
## 4 2 1 2 3 0
## 5 3 2 1 4 3
## 6 3 2 2 7 1
## 7 3 2 3 3 2
## 8 4 2 1 1 3
## 9 4 2 2 8 1
## 10 4 2 3 5 2
## 11 5 3 1 4 1
## 12 6 3 1 3 1
Here's how you can get the score of the other participant in the same trial-pair:
data$other <- ave(data$score,data$trial,data$pair,FUN=rev);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 2 1 1 2 3 3
## 3 2 1 1 6 2
## 4 2 1 2 3 3
## 5 3 2 1 4 1
## 6 3 2 2 7 8
## 7 3 2 3 3 5
## 8 4 2 1 1 4
## 9 4 2 2 8 7
## 10 4 2 3 5 3
## 11 5 3 1 4 3
## 12 6 3 1 3 4
Or, assuming the data.frame has been reordered as per the initial solution:
data$other <- c(rbind(data$score[c(F,T)],data$score[c(T,F)]));
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Alternative, using matrix() instead of rbind():
data$other <- c(matrix(data$score,2L)[2:1,]);
data;
## participant pair trial score other
## 1 1 1 1 2 6
## 3 2 1 1 6 2
## 5 3 2 1 4 1
## 8 4 2 1 1 4
## 11 5 3 1 4 3
## 12 6 3 1 3 4
## 2 1 1 2 3 3
## 4 2 1 2 3 3
## 6 3 2 2 7 8
## 9 4 2 2 8 7
## 7 3 2 3 3 5
## 10 4 2 3 5 3
Here is an option using data.table:
library(data.table)
setDT(data)[,difference := abs(diff(score)), by = .(pair, trial)]
data
# participant pair trial score difference
# 1: 1 1 1 2 4
# 2: 1 1 2 3 0
# 3: 2 1 1 6 4
# 4: 2 1 2 3 0
# 5: 3 2 1 4 3
# 6: 3 2 2 7 1
# 7: 3 2 3 3 2
# 8: 4 2 1 1 3
# 9: 4 2 2 8 1
#10: 4 2 3 5 2
#11: 5 3 1 4 1
#12: 6 3 1 3 1
A slightly faster option would be:
setDT(data)[, difference := abs((score - shift(score))[2]) , by = .(pair, trial)]
If we need the score of the other participant in the pair:
data[, other:= rev(score) , by = .(pair, trial)]
data
# participant pair trial score difference other
# 1: 1 1 1 2 4 6
# 2: 1 1 2 3 0 3
# 3: 2 1 1 6 4 2
# 4: 2 1 2 3 0 3
# 5: 3 2 1 4 3 1
# 6: 3 2 2 7 1 8
# 7: 3 2 3 3 2 5
# 8: 4 2 1 1 3 4
# 9: 4 2 2 8 1 7
#10: 4 2 3 5 2 3
#11: 5 3 1 4 1 3
#12: 6 3 1 3 1 4
Or using dplyr:
library(dplyr)
data %>%
  group_by(pair, trial) %>%
  mutate(difference = abs(diff(score)))
# participant pair trial score difference
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 2 4
#2 1 1 2 3 0
#3 2 1 1 6 4
#4 2 1 2 3 0
#5 3 2 1 4 3
#6 3 2 2 7 1
#7 3 2 3 3 2
#8 4 2 1 1 3
#9 4 2 2 8 1
#10 4 2 3 5 2
#11 5 3 1 4 1
#12 6 3 1 3 1
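The grouped abs(diff(score)) solutions assume that every pair-trial group has exactly two rows, and they may fail or recycle unexpectedly otherwise. A minimal sketch of a more defensive variant is shown below. The helper pair_diff and the extra unpaired row (participant 7) are hypothetical additions for illustration.

```r
participant <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 7)
pair        <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4)
trial       <- c(1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 1, 1, 1)
score       <- c(2, 3, 6, 3, 4, 7, 3, 1, 8, 5, 4, 3, 9)
data <- data.frame(participant, pair, trial, score)

# Absolute difference for complete pairs, NA for groups of any other size
pair_diff <- function(x) {
  if (length(x) == 2L) rep(abs(diff(x)), 2L) else rep(NA_real_, length(x))
}

data$difference <- ave(data$score, data$pair, data$trial, FUN = pair_diff)
print(data)
```

The first twelve rows reproduce the differences requested in the question, and the unpaired participant gets NA instead of an error.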

Comparing each element in subsets of a large data

I have a large data set of raw responses and want to compare each element for subject 1 in group 1 with the corresponding element for subject 1 in group 2. The same comparison is needed between subject 2 in group 1 and subject 2 in group 2, between subject 3 in group 1 and subject 3 in group 2, and so on. What makes the problem more complex is that there are 100 groups, which form 50 paired groups.
The output needs to keep the original raw response if the two are the same. If they differ, the raw response needs to be replaced with 9.
I'm pretty sure I could do it with a for loop, but I wonder whether R offers anything better, such as ifelse or apply.
A simplified version of my data looks like this:
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
Thanks for any help.
#Initialization of data
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 4 4 3 5 3 1 2
5 3 2 1 5 1 2 2
6 2 5 4 4 1 3 2
7 3 2 3 2 2 1 3
8 1 2 3 3 3 2 3
9 2 2 2 2 5 3 3
10 3 3 3 5 4 1 4
11 5 3 5 4 2 2 4
12 5 3 1 1 3 3 4
Processing without a for loop:
# assumption: initial data is sorted by group (can be easily done)
# each group is compared with the preceding group for the same subject
columns <- !dimnames(df)[[2]] %in% c('group', 'subject')
subjects <- df[, 'subject']
tabl <- table(subjects)
rows <- order(subjects)
rows2 <- cumsum(tabl)
rows1 <- rows2 - tabl + 1
df[rows[-rows1], columns][df[rows[-rows1], columns] != df[rows[-rows2], columns]] <- 9
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 9 9 3 9 9 1 2
5 9 9 9 9 9 2 2
6 9 9 9 4 9 3 2
7 9 9 3 9 9 1 3
8 9 2 9 9 9 2 3
9 2 9 9 9 9 3 3
10 3 9 3 9 9 1 4
11 9 9 9 9 9 2 4
12 9 9 9 9 9 3 4
Below is what I did to get the output. Again, thanks to Stanislav
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 5 4 1 4 3 1 2
5 5 1 3 2 2 2 2
6 1 2 2 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 4 2 1 4 1 1 4
11 2 3 3 5 5 2 4
12 5 3 3 4 5 3 4
col<-!dimnames(df)[[2]] %in% c('subject','group')
n<-length(df[,1])
temp<-table(df$group)
n.sub<-temp[1]
temp<-seq(1,n,by=2*n.sub)
s1<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
temp<-seq(n.sub+1,n,by=2*n.sub)
s2<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
df[s2,col][df[s1,col]!=df[s2,col]]<-9
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 9 4 9 9 9 1 2
5 9 1 9 9 9 2 2
6 1 2 9 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 9 9 9 9 1 1 4
11 2 3 9 9 5 2 4
12 9 9 3 9 9 3 4
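A sketch of the same paired comparison that does not depend on the row order is shown below. It assumes groups are paired as (1, 2), (3, 4), and so on, and that only the even-numbered group of each pair is overwritten, as in the output above. The names value_cols, second, first and mismatch are illustrative. The data are the values printed in the last example.

```r
df <- data.frame(
  V1 = c(1, 2, 1, 5, 5, 1, 5, 2, 2, 4, 2, 5),
  V2 = c(4, 1, 2, 4, 1, 2, 4, 3, 5, 2, 3, 3),
  V3 = c(3, 4, 5, 1, 3, 2, 2, 4, 3, 1, 3, 3),
  V4 = c(1, 1, 4, 4, 2, 4, 3, 3, 5, 4, 5, 4),
  V5 = c(5, 5, 5, 3, 2, 5, 1, 5, 3, 1, 5, 5),
  subject = rep(1:3, times = 4),
  group   = rep(1:4, each = 3)
)

value_cols <- setdiff(names(df), c("subject", "group"))

# Rows in the even-numbered group of each pair
second <- which(df$group %% 2 == 0)

# Matching rows: same subject, preceding (odd-numbered) group
first <- match(paste(df$subject[second], df$group[second] - 1),
               paste(df$subject, df$group))

# Replace responses that differ from the partner row with 9
mismatch <- as.matrix(df[second, value_cols]) != as.matrix(df[first, value_cols])
df[second, value_cols][mismatch] <- 9
print(df)
```

Because the partner row is found with match() on the subject and group keys, the rows can appear in any order.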

Resources