I am trying to simulate the answers to a multiple-choice question (MCQ) test. Currently, I am using the following code to simulate the answers to an MCQ test with only two questions:
answers <- data.frame(
Q1 = sample(LETTERS[1:5],10,replace = T, prob=c(0.1,0.6,0.1,0.1,0.1)),
Q2 = sample(LETTERS[1:5],10,replace = T, prob=c(0.5,0.1,0.1,0.2,0.1)))
The answers B and A are, respectively, the correct answers to Q1 and Q2.
My difficulty lies in introducing correlation among the answers to the questions, in the sense that, for instance, a good student tends to select the correct answer to every question. How can I accomplish that?
You could fill up the data with completely correct answers, assign a level of proficiency to each individual student and then randomly change values in their exams, depending on their proficiency:
correct = c(2, 1, 3)
nstudents = 20
exam = matrix(LETTERS[rep(correct, nstudents)], ncol = length(correct), byrow = TRUE)
colnames(exam) = paste("Q", 1:length(correct), sep = "")
proficiency = runif(nstudents, 1, 5)/5 ## Each student has a level of expertise
for (question in 1:length(correct)) {
  ## Random difficulty for each question and student (may be made more or less difficult)
  difficulty = runif(nstudents, 1, 10)/10
  nmistakes = sum(proficiency < difficulty)
  exam[, question][proficiency < difficulty] = sample(LETTERS[1:5], nmistakes, replace = TRUE)
}
exam = as.data.frame(exam)
The result would be a data frame in which some students hardly ever make mistakes while others hardly ever get something right.
EDIT: The proficiency, in this case, follows a uniform distribution. If you need it normally distributed, just change the proficiency vector to use rnorm().
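For example, a minimal sketch with a normally distributed proficiency (the mean and sd below are arbitrary choices), clipped to [0, 1] so the comparison with difficulty still works:
proficiency = rnorm(nstudents, mean = 0.6, sd = 0.15) ## arbitrary mean/sd, adjust as needed
proficiency = pmin(pmax(proficiency, 0), 1)           ## clip to [0, 1]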
Here is a method that applies a covariance matrix (the Sigma= argument) using MASS::mvrnorm.
n <- 15
r <- .9
set.seed(42)
library('MASS')
M <- abs(mvrnorm(n=n, mu=c(1, 500), Sigma=matrix(c(1, r, r, 1), nrow=2),
empirical=TRUE)) |>
as.data.frame() |>
setNames(c('Q1', 'Q2'))
We get the correlated levels A, ..., E by cutting the random numbers at custom quantiles (probabilities taken from the OP),
f <- \(x, q) cut(x, breaks=c(0, quantile(x, cumsum(q))), include.lowest=T,
labels=LETTERS[1:5])
p1 <- c(0.1, 0.6, 0.1, 0.1, 0.1)
p2 <- c(0.5, 0.1, 0.1, 0.2, 0.1)
in a Map() call.
dat <- Map(f, M, list(p1, p2)) |>
as.data.frame()
dat
# Q1 Q2
# 1 A A
# 2 B A
# 3 E E
# 4 D E
# 5 A A
# 6 B A
# 7 C D
# 8 B A
# 9 B A
# 10 B A
# 11 B C
# 12 B B
# 13 E D
# 14 B A
# 15 C D
Check
dat_check <- lapply(dat, as.integer) |> as.data.frame()
cor(dat_check) ## correlation
# Q1 Q2
# Q1 1.00000 0.85426
# Q2 0.85426 1.00000
lapply(dat, table) ## students' answers
# $Q1
#
# A B C D E
# 2 8 2 1 2
#
# $Q2
#
# A B C D E
# 8 1 1 3 2
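As a further sanity check, one could compare the empirical answer shares with the target probabilities p1 and p2 (with only n = 15 draws they will match the targets only roughly):
sapply(dat, function(x) prop.table(table(x))) ## empirical shares per question, to compare against p1 and p2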
I'm looking to create a hybrid of cumsum() and TTR::runSum() where cumsum() runs up until a pre-specified number of data points, at which point it acts more like runSum().
For example:
library(TTR)
data <- rep(1:3,2)
cumsum <- cumsum(data)
runSum <- runSum(data, n = 3)
DesiredResult <- ifelse(is.na(runSum),cumsum,runSum)
Is there a way to get to DesiredResult that doesn't require getting fiddly with the NAs?
That is what the partial=TRUE argument to rollapplyr does. Here we show this with sum and also with sd and IQR. (Note that the sd of a single value is NA; we chose IQR because it is a measure of spread that can be computed for a single value, although it is always 0 in that case.)
library(zoo)
rollapplyr(data, 3, sum, partial = TRUE)
## [1] 1 3 6 6 6 6
rollapplyr(data, 3, sd, partial = TRUE)
## [1] NA 0.7071068 1.0000000 1.0000000 1.0000000 1.0000000
rollapplyr(data, 3, IQR, partial = TRUE)
## [1] 0.0 0.5 1.0 1.0 1.0 1.0
Here are three alternatives.
n <- 3
rowSums(embed(c(rep(0, n - 1), data), n)) # base R
# [1] 1 3 6 6 6 6
library(TTR)
runSum(c(rep(0, n - 1), data), n = n)
# [1] NA NA 1 3 6 6 6 6 # na.omit fixes the beginning
library(zoo)
rollsum(c(rep(0, n - 1), data), k = 3, align = "right")
# [1] 1 3 6 6 6 6
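All of these reproduce DesiredResult from the question; a quick check, reusing data and DesiredResult as defined there:
all(rowSums(embed(c(rep(0, n - 1), data), n)) == DesiredResult)
# [1] TRUE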
I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be ordered in descending order based on Y. I want to group together cases whose Y values lie within a certain range of each other, while also updating the X column to reflect the grouping. For example, if the threshold value is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of their original values, i.e. 12. From this second data frame, B C is within 2 of D, so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distances to the previous and the following Y values are the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
group_by(group_id = group_id) %>%
summarise(
X = paste(X, collapse = ' '),
Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
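If you'd rather avoid the dplyr dependency, the same munging can also be sketched in base R, using the same df and group_id as above:
data.frame(X = tapply(df$X, group_id, paste, collapse = " "),
           Y = tapply(df$Y, group_id, mean))
#       X        Y
# 1     A 14.00000
# 2 B C D 11.66667
# 3     E  9.00000
# 4   F G  5.00000
# 5     H  1.00000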
This gives the average from the previous iteration, though (e.g. 11.5 instead of 11.67 for B C D). In any case, I hope it helps.
library(data.table)
df <- data.table(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
differences <- c(diff(df$Y), NA)  # NA for the last element
df$difference <- abs(differences) # differences of consecutive elements (works because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference) - 1)]) # smallest gap

while (minimum <= 2) { # <= so that a gap of exactly 2 (e.g. F and G) is still grouped
  index <- which(df$difference == minimum)[1] # where the minimum occurs; the first hit is the larger Y in case of ties
  check <- FALSE
  # The last row cannot have a difference, since there is no element after it.
  # If it is not involved in the minimal gap, set it aside and bind it back later.
  if (df[nrow(df) - 1, difference] != minimum) {
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check <- TRUE
  }
  tmp <- df[index:(index + 1)]
  df <- df[-(index:(index + 1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[, .(X, Y)], to_bind)
  if (check) {
    df <- rbind(df, last_row[, .(X, Y)])
  }
  setorder(df, -Y)
  differences <- c(diff(df$Y), NA)
  df$difference <- abs(differences)
  minimum <- min(df$difference[1:(length(df$difference) - 1)])
}
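With the example data, and treating a gap of exactly 2 as "within range" (hence the <= in the while condition above), the loop should end with something like the following; note the 11.5 for B C D mentioned above.
df[, .(X, Y)]
#        X    Y
# 1:     A 14.0
# 2: B C D 11.5
# 3:     E  9.0
# 4:   F G  5.0
# 5:     H  1.0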
I have an issue with a function that calculates the cumulative mean with a lag of one over groups of a field:
cumroll <- function(x) {
  x <- head(x, -1)
  c(head(x, 1), cumsum(x) / seq_along(x))
}
Everything works fine as long as I am performing this function over groups that are larger than one:
Player <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
However, when my dataset contains a group of size one:
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
data.frame(Player, Team, Score)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
I get the error:
Error in `split<-.default`(`*tmp*`, g, value = lapply(split(x, g), FUN)) :
replacement has length zero
I know there is a way to modify the function to account for this. In these cases, I want it to return the observed value when the group size is 1. Any help is appreciated!
The simplest way to change the function's behavior conditional on the length of the input is, happily, to condition on the length of the input. E.g., you can use
cumroll <- function(x) {
  if (length(x) <= 1) {
    x
  } else {
    x <- head(x, -1)
    c(head(x, 1), cumsum(x) / seq_along(x))
  }
}
Player <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Team <- c('A','B','B','C','C','C','D','D','D','D','E','E','E','E','E')
Score <- c(5,2,7,3,9,6,3,7,1,7,3,8,3,4,1)
test <- ave(Score, Player, Team, FUN = cumroll)
data.frame(Player, Team, Score, test)
Player Team Score test
1 A A 5 5.000000
2 B B 2 2.000000
3 B B 7 2.000000
4 C C 3 3.000000
5 C C 9 3.000000
6 C C 6 6.000000
7 D D 3 3.000000
8 D D 7 3.000000
9 D D 1 5.000000
10 D D 7 3.666667
11 E E 3 3.000000
12 E E 8 3.000000
13 E E 3 5.500000
14 E E 4 4.666667
15 E E 1 4.500000
But I'm a little wary about your approach... how exactly is the cumulative mean with a lag of one defined? You might look at shift() in data.table and rollapply() in zoo for better performance and robustness.
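For instance, here is a minimal data.table sketch of one reading of that lagged cumulative mean (the first row of each group keeps the observed score, as cumroll does); it handles single-row groups without any special-casing:
library(data.table)
dt <- data.table(Player, Team, Score)
dt[, test := shift(cumsum(Score) / seq_len(.N)), by = .(Player, Team)] # running mean, lagged by one
dt[is.na(test), test := Score] # first row of each group keeps the observed score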
Is there an easy way to map values within a certain range to a letter? So in the following example, how would I convert all values in df so that:
Values less than or equal to 1 = A.
Values less than or equal to 5 = B.
Values greater than 5 = C.
A small example dataset:
df1 <- rnorm (100, mean = 1, sd = 0.3)
df2 <- rnorm (100, mean = 5, sd = 1.6)
df <- cbind(df1,df2)
as.data.frame(apply(df,2, function(x) cut(x, c(-Inf,1,5,Inf), labels=c('A','B','C'))))
# df1 df2
# 1 A C
# 2 A C
# 3 B B
# 4 A C
# 5 A C
# 6 A B
# 7 A C
# 8 B B
# 9 B C
# 10 A C
Remember to use -Inf and Inf as the outer cut points. The call is wrapped in apply() so the process is repeated over each column.
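Note that cbind() on two vectors actually returns a matrix here, which is why apply() works column-wise directly. If you start from a true data frame instead (df_real below is just an illustrative name), the same cut() call can be applied with lapply():
df_real <- data.frame(df1, df2)
as.data.frame(lapply(df_real, function(x) cut(x, c(-Inf, 1, 5, Inf), labels = c('A', 'B', 'C'))))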