I am trying to simulate the answers to a multiple-choice question (MCQ) test. Currently, I am using the following code to simulate the answers to an MCQ test with only two questions:
answers <- data.frame(
Q1 = sample(LETTERS[1:5],10,replace = T, prob=c(0.1,0.6,0.1,0.1,0.1)),
Q2 = sample(LETTERS[1:5],10,replace = T, prob=c(0.5,0.1,0.1,0.2,0.1)))
The answers B and A are, respectively, the correct answers to Q1 and Q2.
My difficulty is to introduce correlation among the answers to the questions, in the sense that, for instance, a good student tends to select the correct answer to all questions. How can I accomplish that?
You could fill up the data with completely correct answers, assign a level of proficiency to each individual student and then randomly change values in their exams, depending on their proficiency:
correct = c(2,1,3) ## Index of the correct option (A-E) for each question
nstudents = 20
exam = matrix(LETTERS[rep(correct,nstudents)],ncol=length(correct),byrow=TRUE)
colnames(exam)=paste("Q",1:length(correct),sep="")
proficiency = runif(nstudents,1,5)/5 ## Each student has a level of expertise
for(question in 1:length(correct)){
  difficulty = runif(nstudents,1,10)/10 ## Random difficulty for each question and student (may be made more or less difficult)
  nmistakes = sum(proficiency<difficulty)
  ## The replacement is drawn from all five options, so it can still be the correct answer by chance
  exam[,question][proficiency<difficulty] = sample(LETTERS[1:5],nmistakes,replace=TRUE)
}
exam = as.data.frame(exam)
The result would be a data frame in which some students hardly ever make mistakes while others hardly ever get something right.
EDIT: The proficiency, in this case, follows a uniform distribution. If you need it normally distributed, just generate the proficiency vector with rnorm() instead.
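A minimal sketch of the rnorm() variant follows. The mean and standard deviation are assumptions chosen only for illustration, and the clamping step is an addition so that the values stay in the range that gets compared against the difficulty:

```r
set.seed(1)  # assumption: seed used only to make the sketch reproducible
nstudents <- 20
# Assumed mean and sd; tune these to the population you want to simulate
proficiency <- rnorm(nstudents, mean = 0.6, sd = 0.15)
# Clamp to [0, 1] so the comparison with difficulty (0.1 to 1) stays meaningful
proficiency <- pmin(pmax(proficiency, 0), 1)
range(proficiency)
```

The rest of the loop can stay as it is, since it only needs a numeric proficiency vector of length nstudents.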
Here is a method that applies a covariance matrix (the Sigma= argument) using MASS::mvrnorm.
n <- 15
r <- .9
set.seed(42)
library('MASS')
M <- abs(mvrnorm(n=n, mu=c(1, 500), Sigma=matrix(c(1, r, r, 1), nrow=2),
empirical=TRUE)) |>
as.data.frame() |>
setNames(c('Q1', 'Q2'))
We get the correlated levels A, ..., E by cutting the random numbers along custom quantiles (taken from OP),
f <- \(x, q) cut(x, breaks=c(0, quantile(x, cumsum(q))), include.lowest=TRUE,
                 labels=LETTERS[1:5])
p1 <- c(0.1, 0.6, 0.1, 0.1, 0.1)
p2 <- c(0.5, 0.1, 0.1, 0.2, 0.1)
in a Map() call.
dat <- Map(f, M, list(p1, p2)) |>
as.data.frame()
dat
# Q1 Q2
# 1 A A
# 2 B A
# 3 E E
# 4 D E
# 5 A A
# 6 B A
# 7 C D
# 8 B A
# 9 B A
# 10 B A
# 11 B C
# 12 B B
# 13 E D
# 14 B A
# 15 C D
Check
dat_check <- lapply(dat, as.integer) |> as.data.frame()
cor(dat_check) ## correlation
# Q1 Q2
# Q1 1.00000 0.85426
# Q2 0.85426 1.00000
lapply(dat, table) ## students' answers
# $Q1
#
# A B C D E
# 2 8 2 1 2
#
# $Q2
#
# A B C D E
# 8 1 1 3 2
Related
I'm trying to produce a compact letter display for displaying all pairwise comparisons for overlapping intervals. I've tried searching for a package that would already do this, but I can't seem to find one not rooted in doing statistical pairwise comparisons.
trt <- paste("", letters[1:10], sep = "")
upper <- seq(from = 2, to = 10, length.out = 10)
lower <- seq(from = 0, to = 8, length.out = 10)
new.df <- data.frame(trt, upper, lower)
Thus, I end up with this data frame:
> new.df
trt upper lower
1 a 2.000000 0.0000000
2 b 2.888889 0.8888889
3 c 3.777778 1.7777778
4 d 4.666667 2.6666667
5 e 5.555556 3.5555556
6 f 6.444444 4.4444444
7 g 7.333333 5.3333333
8 h 8.222222 6.2222222
9 i 9.111111 7.1111111
10 j 10.000000 8.0000000
Then to make all the unique combinations this is what I used:
comparisons <- combn(unique(new.df$trt), 2)
Which I then convert into a data frame.
comparisons <- as.data.frame(cbind(comparisons[1,], comparisons[2,]))
> head(comparisons)
V1 V2
1 a b
2 a c
3 a d
4 a e
5 a f
6 a g
Once I get to my list of comparisons I'm not entirely sure where to proceed.
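One possible next step, sketched under the assumption that two intervals count as overlapping when each one's lower bound does not exceed the other's upper bound, is to add a logical overlap column to the comparisons. The column name overlap and the helper indices i and j are hypothetical, introduced only for this illustration:

```r
trt <- letters[1:10]
upper <- seq(from = 2, to = 10, length.out = 10)
lower <- seq(from = 0, to = 8, length.out = 10)
new.df <- data.frame(trt, upper, lower, stringsAsFactors = FALSE)

# All unique pairs of treatments, one pair per row
pairs <- t(combn(as.character(new.df$trt), 2))
i <- match(pairs[, 1], new.df$trt)  # row of the first treatment in each pair
j <- match(pairs[, 2], new.df$trt)  # row of the second treatment in each pair

comparisons <- data.frame(
  V1 = pairs[, 1],
  V2 = pairs[, 2],
  # Assumed overlap rule: closed intervals that share at least one point
  overlap = new.df$lower[i] <= new.df$upper[j] & new.df$lower[j] <= new.df$upper[i],
  stringsAsFactors = FALSE
)
head(comparisons)
```

Turning these pairwise results into letters is a separate step that this sketch does not attempt.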
I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be sorted in descending order of Y. I want to group together cases where the Y values lie within a certain range of each other, while updating the X column to reflect the grouping too. For example, if the threshold is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C i.e. 12. From this second data frame, B C are within 2 from D so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distances to the previous and to the following Y value are the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
group_by(group_id = group_id) %>%
summarise(
X = paste(X, collapse = ' '),
Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
This gives the average of the previous iteration, though, rather than the average of the original values. In any case, I hope it helps.
library(data.table)
df <- data.table(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
differences <- c(diff(df$Y),NA) # NA for the last element
df$difference <- abs(differences) # get the differences of the consecutive elements (since Y is sorted it works)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
while (minimum <= 2){ # inclusive, because the question groups values exactly 2 apart (F and G)
index <- which(df$difference==minimum)[1] # first place the minimum occurs; on ties this favours the larger Y
check = FALSE
# because the last row cannot have a number since there is not an element after that
# we need to see if this element has the minimum difference with its previous
# if it does not have the minimum difference then we exclude it and paste it later
if(df[nrow(df)-1,difference]!=minimum){
last_row <- df[nrow(df)]
df <- df[-nrow(df)]
check = TRUE
}
tmp <- df[(index:(index+1))]
df <- df[-(index:(index+1))]
to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
to_bind$Y <- mean(tmp$Y)
df <- rbind(df[,.(X,Y)],to_bind)
if(check){
df <- rbind(df,last_row[,.(X,Y)])
}
setorder(df,-Y)
differences <- c(diff(df$Y),NA) # NA for the last element
df$difference <- abs(differences) # get the differences of the consecutive elements (since Y is sorted it works)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
}
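For comparison, here is a minimal base R sketch of the procedure described in the question that averages the original Y values instead of the averages from the previous iteration. The function name merge_close and its arguments are hypothetical, and it assumes the data frame is already sorted by Y in descending order:

```r
merge_close <- function(df, threshold) {
  # Each group stores the row indices of its original members
  groups <- as.list(seq_len(nrow(df)))
  repeat {
    if (length(groups) < 2) break
    means <- vapply(groups, function(g) mean(df$Y[g]), numeric(1))
    gaps <- abs(diff(means))       # distance between neighbouring groups
    k <- which.min(gaps)           # first minimum, so ties go to the larger Y
    if (gaps[k] > threshold) break # stop once the closest pair is too far apart
    groups[[k]] <- c(groups[[k]], groups[[k + 1]])
    groups[[k + 1]] <- NULL
  }
  data.frame(
    X = vapply(groups, function(g) paste(df$X[g], collapse = " "), character(1)),
    new_Y = vapply(groups, function(g) mean(df$Y[g]), numeric(1)),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
res <- merge_close(df, threshold = 2)
res
```

Because every group mean is recomputed from the original rows, B C D comes out as the mean of 12, 12 and 11 rather than the mean of 12 and 11.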
I have a data frame that follows the following format.
match team1 team2 winningTeam
1 A D A
2 B E E
3 C F C
4 D C C
5 E B B
6 F A A
7 A D D
8 D A A
What I want to do is create variables that calculate the form of both team 1 and team 2 over the last x matches. For example, I would want to create a variable called team1_form_last3_matches, which for match 8 would be 0.33 (as they won 1 of their last 3 matches), and a variable called team2_form_last3_matches, which would be 0.66 in match 8 (as they won 2 of their last 3 matches). Ideally, I would like to be able to specify the number of previous matches to consider when calculating the teamx_form_lasty variable, and have those variables created automatically. I have tried a bunch of approaches using dplyr, zoo rolling mean functions and a load of nested for / if statements. However, I have not quite cracked it, and certainly not in an elegant way. I feel like I am missing a simple solution to this generic problem. Any help would be much appreciated!
Cheers,
Jack
This works for t1l3 (team1's form over the last 3 matches); you will need to replicate it for team2.
dat <- data.frame(match = c(1:8), team1 = c("A","B","C","D","E","F","A","D"), team2 = c("D","E","F","C","B","A","D","A"), winningTeam = c("A","E","C","C","B","A","D","A"),stringsAsFactors = FALSE)
dat$t1l3 <- c(NA,sapply(2:nrow(dat),function(i) {
df <- dat[1:(i-1),] #just previous games, i.e. excludes current game
df <- df[df$team1==dat$team1[i] | df$team2==dat$team1[i],] #just those containing T1
df <- tail(df,3) #just the last three (or fewer if there aren't three previous games)
return(sum(df$winningTeam==dat$team1[i])/nrow(df)) #total wins/total games (up to three)
}))
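The same idea can be wrapped in a function so that the team column and the window size become arguments. This is only a sketch: the name team_form and its parameters are hypothetical, and it returns NA when a team has no earlier matches:

```r
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"),
                  stringsAsFactors = FALSE)

team_form <- function(dat, team_col, n) {
  sapply(seq_len(nrow(dat)), function(i) {
    team <- dat[[team_col]][i]
    prev <- dat[seq_len(i - 1), ]                            # earlier matches only
    prev <- prev[prev$team1 == team | prev$team2 == team, ]  # those involving the team
    prev <- tail(prev, n)                                    # last n of them, or fewer
    if (nrow(prev) == 0) return(NA_real_)                    # no history yet
    mean(prev$winningTeam == team)                           # share of wins
  })
}

dat$team1_form_last3 <- team_form(dat, "team1", 3)
dat$team2_form_last3 <- team_form(dat, "team2", 3)
dat
```

For match 8 this reproduces the values from the question: one win in three for team1 (D) and two wins in three for team2 (A).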
How about something like:
dat <- data.frame(match = c(1:8), team1 = c("A","B","C","D","E","F","A","D"), team2 = c("D","E","F","C","B","A","D","A"), winningTeam = c("A","E","C","C","B","A","D","A"))
match team1 team2 winningTeam
1 1 A D A
2 2 B E E
3 3 C F C
4 4 D C C
5 5 E B B
6 6 F A A
7 7 A D D
8 8 D A A
Allteams <- c("A","B","C","D","E","F")
# A vectorized function for you to use to do as you ask:
teamX_form_lastY <- function(teams, games, dat){
sapply(teams, function(x) {
games_info <- rowSums(dat[,c("team1","team2")] == x) + (dat[,"winningTeam"] == x)
lookup <- ifelse(rev(games_info[games_info != 0])==2,1,0)
games_won <- sum(lookup[1:games])
if(length(lookup) < games) warning(paste("maximum games for team",x,"should be",length(lookup)))
games_won/games
})
}
teamX_form_lastY("A", 4, dat)
A
0.75
# Has a warning for the number of games you should be using
teamX_form_lastY("A", 5, dat)
A
NA
Warning message:
In FUN(X[[i]], ...) : maximum games for team A should be 4
# vectorized input
teamX_form_lastY(teams = c("A","B"), games = 2, dat = dat)
A B
0.5 0.5
# so you can do all teams
teamX_form_lastY(teams = Allteams, 2, dat)
A B C D E F
0.5 0.5 1.0 0.5 0.5 0.0
I have an issue with a function I have that calculates the cumulative mean with a lag of one over groups on a field:
cumroll <- function(x) {
  x <- head(x, -1)
  c(head(x, 1), cumsum(x) / seq_along(x))
}
Everything works fine as long as I am performing this function over groups that are larger than one:
Player <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
However when my dataset has a grouping of size one:
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
I get the error:
Error in `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
replacement has length zero
I know there is a way to modify the function to account for this. I want to give the observed value when group size is 1 in these cases. Any help is appreciated!!
The simplest way to change the function's behavior conditional on the length of the input is, happily, to condition on the length of the input. E.g., you can use
cumroll <- function(x) {
if(length(x)<=1) {
x
} else {
x <- head(x, -1)
c(head(x,1), cumsum(x) / seq_along(x))
}
}
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
test <- ave(Score, Player, Team, FUN = cumroll)
> data.frame(Player, Team, Score, test)
Player Team Score test
1 A A 5 5.000000
2 B B 2 2.000000
3 B B 7 2.000000
4 C C 3 3.000000
5 C C 9 3.000000
6 C C 6 6.000000
7 D D 3 3.000000
8 D D 7 3.000000
9 D D 1 5.000000
10 D D 7 3.666667
11 E E 3 3.000000
12 E E 8 3.000000
13 E E 3 5.500000
14 E E 4 4.666667
15 E E 1 4.500000
But I'm a little wary about your approach...how is cumulative mean with a lag of one defined precisely? You might look at shift in data.table and rollapply in zoo to get better performance and robustness.
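To make one possible definition explicit, here is a sketch that treats "cumulative mean with a lag of one" as the mean of all earlier values in the group, falling back to the observation itself in the first position. The name cumroll2 is hypothetical, and this reading is an assumption rather than something stated in the question:

```r
cumroll2 <- function(x) {
  if (length(x) == 0) return(x)
  running <- cumsum(x) / seq_along(x)  # mean of the first k values
  c(x[1], head(running, -1))           # shift by one; first slot uses its own value
}

Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score  <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
test2 <- ave(Score, Player, FUN = cumroll2)
data.frame(Player, Score, test2)
```

Under that reading no special case for groups of size one is needed, because the first element is always taken from the input.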
Is there an easy way to map values within a certain range to a letter? In the following example, how would I convert all values in df so that:
Values less than or equal to 1 = A.
Values greater than 1 and less than or equal to 5 = B.
Values greater than 5 = C.
A small example dataset:
df1 <- rnorm (100, mean = 1, sd = 0.3)
df2 <- rnorm (100, mean = 5, sd = 1.6)
df <- cbind(df1,df2)
as.data.frame(apply(df,2, function(x) cut(x, c(-Inf,1,5,Inf), labels=c('A','B','C'))))
# df1 df2
# 1 A C
# 2 A C
# 3 B B
# 4 A C
# 5 A C
# 6 A B
# 7 A C
# 8 B B
# 9 B C
# 10 A C
Remember to use -Inf and Inf when creating cut points for your outer boundary. It's wrapped in an apply function to repeat the process over each column.
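A small check of how the boundaries behave may help. With the default right = TRUE, each interval is closed on the right, so values exactly equal to 1 or 5 fall into the lower category. The vector x below is made up for illustration:

```r
x <- c(0.5, 1, 1.2, 5, 5.1)
grades <- cut(x, breaks = c(-Inf, 1, 5, Inf), labels = c("A", "B", "C"))
as.character(grades)
# "A" "A" "B" "B" "C"
```

If the boundaries should instead belong to the upper category, right = FALSE switches cut() to intervals that are closed on the left.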