Formatting exam results to perform a t-test in R

Question Overview: I have a dataset containing the results of a 15-question pre-instructional and post-instructional exam. I am looking to run a t-test on the results to compare the overall means, but I am having difficulty formatting the dataset properly. An example portion of the dataset is given below:
1Pre 1Post 2Pre 2Post 3Pre 3Post 4Pre 4Post
Correct B B A A B B C C
1 B B C D C B C C
2 C B B D C B C A
3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 B B B A B B C C
5 B B B A B B C C
6 C B D A A D C B
7 C C D D E E C C
8 C A B B A A <NA> <NA>
Objective: I would like to match the "Correct" value to the values in the rows below for the test takers, such that a value of 1 is correct, and a value of 0 is incorrect. I have accomplished this using the following code:
for(j in 1:ncol(qDat)){
  for(i in 1:nrow(qDat)){
    if(qDat[i,j] == correctAns[1]){
      qDat[i,j] = 1
    }else{
      qDat[i,j] = 0
    }
  }
}
I would then like to run a t-test comparing the pre and post means, and also to compare the difference between the pre and post scores for each question; however, I need to omit any data points with NA. Currently, my method does not handle NA values and instead replaces them with zero. Is there any way of running these tests while simply omitting the NA values? Thank you!
The Desired Output:
1Pre 1Post 2Pre 2Post 3Pre 3Post
Correct B B A A B B
1 1 1 0 0 0 1
2 0 1 0 0 0 1
3 <NA> <NA> <NA> <NA> <NA> <NA>
4 1 1 0 0 1 1
5 1 1 0 0 1 1
6 0 1 0 1 0 0
7 0 0 0 0 0 0
8 0 0 0 0 0 0

You can try passing the following argument to the t.test call:
na.action = na.omit
Something like:
with(qDat, t.test(`1Pre`, `1Post`, na.action = na.omit))
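Note that `1Pre` and `1Post` are still letter columns at that point, so they need to be scored first. A minimal sketch of the whole flow (mine, not from the answer), assuming the answer key is stored as the first row of qDat and the column names are as shown in the question; adjust if correctAns is kept as a separate vector:
# Sketch: score against the "Correct" row, keeping NAs as NA
key    <- unlist(qDat[1, ])
scored <- qDat[-1, ]
scored[] <- Map(function(col, k) as.integer(col == k), scored, key)  # 1/0, NAs stay NA
# paired pre/post comparison for question 1, dropping incomplete pairs:
ok <- complete.cases(scored$`1Pre`, scored$`1Post`)
t.test(scored$`1Pre`[ok], scored$`1Post`[ok], paired = TRUE)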

What about this? I rewrote your loop. There is no need to worry too much about NAs, since you treat them as 0: we can simply test the results and afterwards set the NAs to FALSE:
test <- qDat == correctAns # or correctAns[1] depending on your needs
test[is.na(test)] <- FALSE
storage.mode(test) <- "integer"
test
# X1 X2 X3 X4 X5 X6 X7 X8
# [1,] 0 1 0 0 1 0 1 0
# [2,] 0 0 1 0 0 0 0 0
# [3,] 0 1 0 0 1 0 0 0
# [4,] 0 0 1 0 0 0 0 0
# [5,] 1 0 0 0 0 0 1 0
# [6,] 0 0 1 1 1 1 1 0
# [7,] 0 0 0 1 0 0 1 0
# [8,] 0 0 0 0 0 0 0 1
with the data
set.seed(123)
correctAns <- sample(LETTERS[1:3], 8, replace = TRUE)
correctAns
# [1] "A" "C" "B" "C" "C" "A" "B" "C"
qDat <- sample(c(LETTERS[1:3], NA_character_), 8*2*4, replace = TRUE)
qDat <- data.frame(matrix(qDat, 8, 4*2), stringsAsFactors = FALSE)
qDat
# X1 X2 X3 X4 X5 X6 X7 X8
# 1 C A C C A B A <NA>
# 2 B A C <NA> B <NA> <NA> B
# 3 <NA> B C A B A <NA> <NA>
# 4 B <NA> C B B B B <NA>
# 5 C <NA> B <NA> A <NA> C <NA>
# 6 C C A A A A A B
# 7 A C <NA> B A C B <NA>
# 8 <NA> <NA> <NA> A B A B C
Edit
set.seed(123)
# correctAns is a vector of length 30
correctAns <- sample(LETTERS[1:3], 30, replace = TRUE)
length(correctAns)
# [1] 30
# qDat is a dataframe of dimensions 106x30
qDat <- sample(c(LETTERS[1:3], NA_character_), 106*30, replace = TRUE)
qDat <- data.frame(matrix(qDat, 106, 30), stringsAsFactors = FALSE)
dim(qDat)
# [1] 106 30
# still works
test <- qDat == correctAns
test[is.na(test)] <- FALSE
storage.mode(test) <- "integer"
str(test)
# int [1:106, 1:30] 0 0 0 0 0 0 0 0 1 0 ...
# - attr(*, "dimnames")=List of 2
# ..$ : NULL
# ..$ : chr [1:30] "X1" "X2" "X3" "X4" ...
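To get back to the t-test from the question, one option (my suggestion, not part of the original answer) is to skip the test[is.na(test)] <- FALSE step so the NAs stay NA, and then compute per-student pre and post means with na.rm = TRUE. A sketch, assuming the columns alternate Pre/Post as in the question:
score <- qDat == correctAns          # logical matrix as above, NAs kept as NA
storage.mode(score) <- "integer"     # 1 = correct, 0 = incorrect, NA preserved
pre  <- rowMeans(score[, seq(1, ncol(score), by = 2), drop = FALSE], na.rm = TRUE)
post <- rowMeans(score[, seq(2, ncol(score), by = 2), drop = FALSE], na.rm = TRUE)
t.test(pre, post, paired = TRUE)     # overall pre vs post comparison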

Related

How to add columns iteratively for recoding with semi-modified names

I would use this dataset as an example
BEZ <- c("A","A","A","A","B","B","B")
var <- c("B","B","B","B","B","B","B")
bar <- c("B","B","B","B","B","B","B")
Bez1 <- c("A","A","A","A","B","B","B")
var1 <- c("B","B","B","B","B","B","B")
bar1 <- c("B","B","B","B","B","B","B")
dat <- data.frame(BEZ, var, bar, Bez1, var1, bar1)
The tricky thing I would like to do is to use some method (loops, map(), apply(), dplyr functions, and so on) to create, alongside each existing column, a new column in which the respective row value is converted into a number.
Expected result:
BEZ BEZ_num var var_num bar bar_num Bez1 BEZ1_num var1 var1_num bar1 bar1_num
A 0 B 1 B 1 A 0 B 1 B 1
A 0 B 1 B 1 A 0 B 1 B 1
A 0 B 1 C 2 A 0 B 1 A 0
A 0 B 1 B 1 A 0 C 2 B 1
B 1 B 1 B 1 B 1 C 2 C 2
B 1 B 1 B 1 A 0 B 1 B 1
B 1 B 1 B 1 A 0 B 1 B 1
This is more or less what I would like to achieve. Any suggestions?
Thanks
Using factor
library(dplyr)
dat %>%
  mutate(across(everything(), ~ as.integer(factor(.x)) - 1, .names = "{.col}_num"))
Output:
BEZ var bar Bez1 var1 bar1 BEZ_num var_num bar_num Bez1_num var1_num bar1_num
1 A B B A B B 0 0 0 0 0 0
2 A B B A B B 0 0 0 0 0 0
3 A B B A B B 0 0 0 0 0 0
4 A B B A B B 0 0 0 0 0 0
5 B B B B B B 1 0 0 1 0 0
6 B B B B B B 1 0 0 1 0 0
7 B B B B B B 1 0 0 1 0 0
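One thing to watch (my note, not part of the answer above): factor(.x) uses only the levels present in each column, so a column that contains only "B" gets coded 0, which is why the *_num columns above differ from the expected output. Pinning the levels gives a fixed A -> 0, B -> 1, C -> 2 mapping:
dat %>%
  mutate(across(everything(),
                ~ as.integer(factor(.x, levels = c("A", "B", "C"))) - 1,
                .names = "{.col}_num"))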
As noted in the comments, the provided data frame and the expected output do not match. But I think we could use mutate(across(...)) with the .names argument combined with case_when:
library(dplyr)
dat %>%
  mutate(across(everything(), ~ case_when(
    . == "A" ~ "0",
    . == "B" ~ "1",
    . == "C" ~ "2"), .names = "{col}_num"))
BEZ var bar Bez1 var1 bar1 BEZ_num var_num bar_num Bez1_num var1_num bar1_num
1 A B B A B B 0 1 1 0 1 1
2 A B B A B B 0 1 1 0 1 1
3 A B B A B B 0 1 1 0 1 1
4 A B B A B B 0 1 1 0 1 1
5 B B B B B B 1 1 1 1 1 1
6 B B B B B B 1 1 1 1 1 1
7 B B B B B B 1 1 1 1 1 1
Using a for loop in base R:
dat2 <- dat[, 1, drop = FALSE]
for (col in names(dat)) {
  dat2[[col]] <- dat[[col]]
  dat2[[paste0(col, "_num")]] <- match(dat[[col]], LETTERS) - 1
}
dat2
# BEZ BEZ_num var var_num bar bar_num Bez1 Bez1_num var1 var1_num bar1 bar1_num
# 1 A 0 B 1 B 1 A 0 B 1 B 1
# 2 A 0 B 1 B 1 A 0 B 1 B 1
# 3 A 0 B 1 B 1 A 0 B 1 B 1
# 4 A 0 B 1 B 1 A 0 B 1 B 1
# 5 B 1 B 1 B 1 B 1 B 1 B 1
# 6 B 1 B 1 B 1 B 1 B 1 B 1
# 7 B 1 B 1 B 1 B 1 B 1 B 1
Or a (slightly convoluted) approach using dplyr::across():
library(dplyr)
dat %>%
  mutate(
    across(BEZ:bar1, list(TMP = identity, num = \(x) match(x, LETTERS) - 1)),
    .keep = "unused"
  ) %>%
  rename_with(\(x) gsub("_TMP$", "", x))
# same output as above
Or finally, if you don't care about the order of the output columns, you could also use dplyr::across() with the .names argument:
dat %>%
  mutate(across(
    BEZ:bar1,
    \(x) match(x, LETTERS) - 1,
    .names = "{.col}_num"
  ))
# BEZ var bar Bez1 var1 bar1 BEZ_num var_num bar_num Bez1_num var1_num bar1_num
# 1 A B B A B B 0 1 1 0 1 1
# 2 A B B A B B 0 1 1 0 1 1
# 3 A B B A B B 0 1 1 0 1 1
# 4 A B B A B B 0 1 1 0 1 1
# 5 B B B B B B 1 1 1 1 1 1
# 6 B B B B B B 1 1 1 1 1 1
# 7 B B B B B B 1 1 1 1 1 1
To add two further options:
With dplyr v.1.1.0 we can use consecutive_id():
library(dplyr) # v.1.1.0
dat %>%
  mutate(across(everything(),
                ~ consecutive_id(.x) - 1,
                .names = "{.col}_num"))
#> BEZ var bar Bez1 var1 bar1 BEZ_num var_num bar_num Bez1_num var1_num bar1_num
#> 1 A B B A B B 0 0 0 0 0 0
#> 2 A B B A B B 0 0 0 0 0 0
#> 3 A B B A B B 0 0 0 0 0 0
#> 4 A B B A B B 0 0 0 0 0 0
#> 5 B B B B B B 1 0 0 1 0 0
#> 6 B B B B B B 1 0 0 1 0 0
#> 7 B B B B B B 1 0 0 1 0 0
Similarly, we could use data.table::rleid():
dat %>%
  mutate(across(everything(),
                ~ data.table::rleid(.x) - 1,
                .names = "{.col}_num"))
#> BEZ var bar Bez1 var1 bar1 BEZ_num var_num bar_num Bez1_num var1_num bar1_num
#> 1 A B B A B B 0 0 0 0 0 0
#> 2 A B B A B B 0 0 0 0 0 0
#> 3 A B B A B B 0 0 0 0 0 0
#> 4 A B B A B B 0 0 0 0 0 0
#> 5 B B B B B B 1 0 0 1 0 0
#> 6 B B B B B B 1 0 0 1 0 0
#> 7 B B B B B B 1 0 0 1 0 0
Created on 2023-02-03 with reprex v2.0.2
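A caveat on the two approaches above (my note, not from the answer): consecutive_id() and rleid() number runs in order of appearance, so they reproduce the A -> 0, B -> 1 coding here only because each column happens to be sorted; on an unsorted column they return run indices rather than a fixed letter code:
dplyr::consecutive_id(c("B", "A", "B")) - 1   # 0 1 2
data.table::rleid(c("B", "A", "B")) - 1       # 0 1 2, whereas match(x, LETTERS) - 1 would give 1 0 1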

R data.table - remove rows corresponding to a given marginal

I have the following problem: I have a data.table and a subset of its columns M, plus a vector x defined on M.
library(data.table)
data <- matrix(c(0,0,NA,1,0,1,NA,1,0,0,1,0,1,1,NA,NA,1,0,0,1,0,0,1,1,1,0,0,1,NA,0,1,1,0,1,1,1),
               byrow = TRUE, ncol = 6, dimnames = list(NULL, LETTERS[1:6]))
dt <- data.table(data)
dt
% A B C D E F
% 1: 0 0 NA 1 0 1
% 2: NA 1 0 0 1 0
% 3: 1 1 NA NA 1 0
% 4: 0 1 0 0 1 1
% 5: 1 0 0 1 NA 0
% 6: 1 1 0 1 1 1
M = LETTERS[2:5]
x <- dt[2,..M]
x
% B C D E
% 1: 1 0 0 1
I would like to remove all rows from dt whose marginal on M equals x, i.e. rows 2 and 4. Both M and x change during the program. The result for the given M and x would be:
A B C D E F
1: 0 0 NA 1 0 1
2: 1 1 NA NA 1 0
3: 1 0 0 1 NA 0
4: 1 1 0 1 1 1
data.table anti-join
dt[!x, on = M] # also works: dt[!dt[2], on = M]
# A B C D E F
# 1: 0 0 NA 1 0 1
# 2: 1 1 NA NA 1 0
# 3: 1 0 0 1 NA 0
# 4: 1 1 0 1 1 1
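Since on = accepts a character vector of column names, this keeps working as M and x change; a small helper (the function name is mine) makes that explicit:
drop_marginal <- function(DT, key_row, cols) DT[!key_row, on = cols]
drop_marginal(dt, dt[2, ..M], M)   # same result as above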
Base R
eq2 <- Reduce('&', lapply(dt[, ..M], function(x) x == x[2]))
dt[-which(eq2),]
# A B C D E F
# 1: 0 0 NA 1 0 1
# 2: 1 1 NA NA 1 0
# 3: 1 0 0 1 NA 0
# 4: 1 1 0 1 1 1
Not really a data.table option, but with base R you can do:
data[rowSums(sweep(data[, M], 2, FUN = `==`, x), na.rm = TRUE) != length(x), ]
A B C D E F
[1,] 0 0 NA 1 0 1
[2,] 1 1 NA NA 1 0
[3,] 1 0 0 1 NA 0
[4,] 1 1 0 1 1 1
Another base R solution. This works because %in% (via match()) treats a data frame as a list of its columns, so after transposing, each original row becomes one list element that can be matched as a whole:
> subset(dt,!data.frame(t(dt[,..M])) %in% data.frame(t(x)))
A B C D E F
1: 0 0 NA 1 0 1
2: 1 1 NA NA 1 0
3: 1 0 0 1 NA 0
4: 1 1 0 1 1 1

How to replace certain values with their column name

I have the following table in R
df <- data.frame('a' = c(1,0,0,1,0),
                 'b' = c(1,0,0,1,0),
                 'c' = c(1,1,0,1,1))
df
a b c
1 1 1 1
2 0 0 1
3 0 0 0
4 1 1 1
5 0 0 1
What I want is to replace each value with its column name whenever that value is equal to 1. The output would be this:
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
How can I do this in R? Thanks.
I would use Map and replace:
df[] <- Map(function(n, x) replace(x, x == 1, n), names(df), df)
df
# a b c
# 1 a b c
# 2 0 0 c
# 3 0 0 0
# 4 a b c
# 5 0 0 c
We can use the fact that NA^TRUE is NA while NA^FALSE is 1, so (NA^!df) * col(df) gives the column index wherever df is 1 and NA wherever it is 0:
df[] <- names(df)[(NA^!df) * col(df)]
df[is.na(df)] <- 0
df
# a b c
#1 a b c
#2 0 0 c
#3 0 0 0
#4 a b c
#5 0 0 c
You can try stack and unstack
a=stack(df)
a
values ind
1 1 a
2 0 a
3 0 a
4 1 a
5 0 a
6 1 b
7 0 b
8 0 b
9 1 b
10 0 b
11 1 c
12 1 c
13 0 c
14 1 c
15 1 c
a$values[a$values==1]=as.character(a$ind)[a$values==1]
unstack(a)
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
We can try iterating over the names of the data frame, and then handling each column, for a base R option:
df <- data.frame(a=c(1,0,0,1,0), b=c(1,0,0,1,0), c=c(1,1,0,1,1))
df <- data.frame(sapply(names(df), function(x) {
  y <- df[[x]]
  y[y == 1] <- x
  return(y)
}))
df
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c
You can do it with ifelse, but you have to do some intermediate transposing to account for R's column-major recycling: names(df) is recycled down the columns of t(df), whose rows correspond to the original columns, so each original column picks up its own name.
data.frame(t(ifelse(t(df)==1,names(df),0)))
a b c
1 a b c
2 0 0 c
3 0 0 0
4 a b c
5 0 0 c

Find frequency of an element in a matrix in R

I have a dataset "data" with 7 rows and 4 columns, as follows:
var1 var2 var3 var4
A C
A C B
B A C D
D B
B
D B
B C
I want to create the following table "Mat" based on the data I have:
A B C D
1 1
1 1 1
1 1 1 1
1 1
1
1 1
1 1 1
Basically, I have taken the unique elements from the original data and created a matrix "Mat" where the number of rows in "Mat" equals the number of rows in the data and the number of columns in "Mat" equals the number of unique elements in the data (that is, A, B, C, D).
I wrote the following code in R:
rule <- c("A","B","C","D")
mat <- matrix(, nrow = dim(data)[1], ncol = dim(rule)[1])
mat <- data.frame(mat)
x <- rule[,1]
nm <- as.character(x)
names(mat) <- nm
n_data <- dim(data)[1]
for(i in 1:n_data)
{
  for(j in 2:dim(data)[2])
  {
    for(k in 1:dim(mat)[2])
    {
      ifelse(data[i,j] == names(mat)[k], mat[i,k] == 1, 0)
    }
  }
}
I am getting all NA in "mat". Also, the running time is far too long, because my original dataset has 20,000 rows and "Mat" has 100 columns.
Any advice will be highly appreciated. Thanks!
This should be faster than the nested for loops:
> sapply(c("A", "B", "C", "D"), function(x) { rowSums(df == x, na.rm = T) })
# A B C D
# [1,] 1 0 1 0
# [2,] 1 1 1 0
# [3,] 1 1 1 1
# [4,] 0 1 0 1
# [5,] 0 1 0 0
# [6,] 0 1 0 1
# [7,] 0 1 1 0
Data
df <- read.table(text = "var1 var2 var3 var4
A C NA NA
A C B NA
B A C D
D B NA NA
NA B NA NA
D B NA NA
B C NA NA", header = T, stringsAsFactors = F)
By using table and rep
table(rep(1:nrow(df),dim(df)[2]),unlist(df))
A B C D
1 1 0 1 0
2 1 1 1 0
3 1 1 1 1
4 0 1 0 1
5 0 1 0 0
6 0 1 0 1
7 0 1 1 0
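One caveat (mine, not from the answer): table() drops NA entries, so a row consisting entirely of NAs would disappear from this result. Turning the row index into a factor with explicit levels keeps such rows as all-zero rows:
table(factor(rep(1:nrow(df), dim(df)[2]), levels = 1:nrow(df)), unlist(df))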

Summing labels line-section by line in R

I have a large dataframe of 34,000 rows x 24 columns, each cell of which contains a category label. I would like to efficiently go through the dataframe and count how many times each label appears within each section of a row, including counts of 0.
(I've used a for loop driving a length(which) statement that wasn't terribly efficient)
Example:
df.test <- as.data.frame(rbind(c("A", "B", "C", "B", "A", "A"),
                               c("C", "C", "C", "C", "C", "C"),
                               c("A", "B", "B", "A", "A", "A")))
df.res <- as.data.frame(matrix(ncol = 6, nrow = 3))
Let's say columns 1:3 in df.test are from one dataset, 4:6 from the other. What is the most efficient way to generate df.res to show this:
A B C A B C
1 1 1 2 1 0
0 0 3 0 0 3
1 2 0 3 0 0
One way, using a lot of *apply functions, is the following:
# list with the different data frames
df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[, x:(x+2)], simplify = F)
# count each category
df.res <- do.call(cbind,
                  lapply(df_ls, function(df.) {
                    t(apply(df., 1, function(x) {
                      table(factor(unlist(x), levels = c("A", "B", "C")))
                    }))
                  }))
#> df.res
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
Simulating a dataframe like the one you described:
DF <- data.frame(replicate(24, sample(LETTERS[1:3], 34000, T)), stringsAsFactors = F)
#> head(DF)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24
#1 B C C C B A C B B A C C B C B B B C B C C B B C
#2 C B C A B C B C A B A C B B A A C A B B B C A B
#3 B C C A A A C A C A A A B B A A A C B B A C C C
#4 C C A B A B B B A A A C C A B A C C A C C C B A
#5 B B A A A A C A B B A B B A C A A A C A A C B C
#6 C A C C A B B C C C B C A B B B B B A C A A B A
#> dim(DF)
#[1] 34000 24
DF_ls <- sapply(seq(1, ncol(DF), 3), function(x) DF[,x:(x+2)], simplify = F)
system.time(
  DF.res <- do.call(cbind,
                    lapply(DF_ls, function(df.) {
                      t(apply(df., 1, function(x) {
                        table(factor(unlist(x), levels = c("A", "B", "C")))
                      }))
                    })))
#user system elapsed
#59.84 0.07 60.73
#> head(DF.res)
# A B C A B C A B C A B C A B C A B C A B C A B C
#[1,] 0 1 2 1 1 1 0 2 1 1 0 2 0 2 1 0 2 1 0 1 2 0 2 1
#[2,] 0 1 2 1 1 1 1 1 1 1 1 1 1 2 0 2 0 1 0 3 0 1 1 1
#[3,] 0 1 2 3 0 0 1 0 2 3 0 0 1 2 0 2 0 1 1 2 0 0 0 3
#[4,] 1 0 2 1 2 0 1 2 0 2 0 1 1 1 1 1 0 2 1 0 2 1 1 1
#[5,] 1 2 0 3 0 0 1 1 1 1 2 0 1 1 1 3 0 0 2 0 1 0 1 2
#[6,] 1 0 2 1 1 1 0 1 2 0 1 2 1 2 0 0 3 0 2 0 1 2 1 0
EDIT: Some more comments on the approach.
I'll do the above step by step.
The first step is to subset the different dataframes that are bound together; each of those dataframes is put in a list. The function function(x) df.test[, x:(x+2)] subsets the whole dataframe at the values of x produced by seq(1, ncol(df.test), 3), and simplify = F tells sapply to return a list rather than simplify the result. Extending this, if your dataframes were 4 columns wide instead of 3, the 3 in that sequence would be changed to 4.
#> df_ls <- sapply(seq(1, ncol(df.test), 3), function(x) df.test[,x:(x+2)], simplify = F)
#> df_ls
#[[1]]
# V1 V2 V3
#1 A B C
#2 C C C
#3 A B B
#[[2]]
# V4 V5 V6
#1 B A A
#2 C C C
#3 A A A
The next step is to lapply to the previously made list a function that counts each category in each row of one dataframe (i.e. one element of the list). The function is this: t(apply(df., 1, function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })). The inner function (function(x)) turns one row into a factor whose levels are all of the categories and counts (table) how many times each category occurred in that row. apply applies this function to each row (MARGIN = 1) of the dataframe. So now we have counted the frequency of each category in each row of one dataframe.
#> table(factor(unlist(df_ls[[1]][3,]), levels = c("A", "B", "C")))
#df_ls[[1]][3,] is the third row of the first dataframe of df_ls
#(i.e. _one_ row of _one_ dataframe)
#A B C
#1 2 0
#> apply(df_ls[[1]], 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })
# [,1] [,2] [,3] #df_ls[[1]] is the first dataframe of df_ls (i.e. _one_ dataframe)
#A 1 0 1
#B 1 0 2
#C 1 3 0
Because the return of apply is not in the wanted form, we use t to swap rows with columns.
The next step is to lapply all of the above over each dataframe (i.e. each element of the list).
#> lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) })
#[[1]]
# A B C
#[1,] 1 1 1
#[2,] 0 0 3
#[3,] 1 2 0
#[[2]]
# A B C
#[1,] 2 1 0
#[2,] 0 0 3
#[3,] 3 0 0
The last step is to cbind all those elements together. The way to bind all the elements of a list by column is to do.call cbind on that list.
#NOT the expected, using only cbind
#> cbind(lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# [,1]
#[1,] Integer,9
#[2,] Integer,9
#Correct!
#> do.call(cbind, lapply(df_ls, function(df.) { t(apply(df., 1,
#+ function(x) { table(factor(unlist(x), levels = c("A", "B", "C"))) })) }))
# A B C A B C
#[1,] 1 1 1 2 1 0
#[2,] 0 0 3 0 0 3
#[3,] 1 2 0 3 0 0
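As a side note (mine, not part of the original answer), the row-by-row apply/table call is what dominates the 60 seconds above; tabulating each 3-column block in a single table() call gives the same counts and should be much faster, since table is called once per block instead of once per row. A sketch, reusing the DF_ls list of blocks from above:
DF.res2 <- do.call(cbind, lapply(DF_ls, function(d)
  table(factor(rep(seq_len(nrow(d)), ncol(d)), levels = seq_len(nrow(d))),  # row index, all rows kept
        factor(unlist(d), levels = c("A", "B", "C")))))                     # counts per category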
