R: Grouping data within a certain range

I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be ordered in descending order of Y. I want to group together cases whose Y values lie within a certain range of each other, while updating the X column to reflect the grouping too. For example, if the threshold value is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C, i.e. 12. In this second data frame, B C is within 2 of D, so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distances to the previous and the following Y value are the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.

This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
  group_by(group_id = group_id) %>%
  summarise(
    X = paste(X, collapse = ' '),
    Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
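If you prefer base R to dplyr, an equivalent munge (a minimal sketch using the same df and group_id objects from above) can be written with tapply:
# Paste the X labels and average the original Y values per cluster
data.frame(X = tapply(df$X, group_id, paste, collapse = " "),
           new_Y = tapply(df$Y, group_id, mean))
#>       X    new_Y
#> 1     A 14.00000
#> 2 B C D 11.66667
#> 3     E  9.00000
#> 4   F G  5.00000
#> 5     H  1.00000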

Note that this approach averages the values from the previous iteration (giving 11.5 rather than 11.667 for B C D in your example). In any case, I hope it helps.
library(data.table)
df <- data.table(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1))
differences <- c(diff(df$Y), NA)             # NA for the last element
df$difference <- abs(differences)            # gaps between consecutive elements (works because Y is sorted)
minimum <- min(df$difference, na.rm = TRUE)  # get the minimum gap
while (minimum <= 2) {  # note <=, so a gap of exactly 2 (F and G) is still merged
  # see where the minimum occurs; taking the first match resolves ties
  # towards the larger Y value, as required
  index <- which(df$difference == minimum)[1]
  check <- FALSE
  # the last row has no following element and hence no difference of its own;
  # if it is not involved in the minimum gap, set it aside and re-append it later
  if (df[nrow(df) - 1, difference] != minimum) {
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check <- TRUE
  }
  tmp <- df[index:(index + 1)]
  df <- df[-(index:(index + 1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[, .(X, Y)], to_bind)
  if (check) {
    df <- rbind(df, last_row[, .(X, Y)])
  }
  setorder(df, -Y)
  differences <- c(diff(df$Y), NA)
  df$difference <- abs(differences)
  minimum <- min(df$difference, na.rm = TRUE)
}
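If the averages must always be computed from the original Y values (so that B C D gives 11.66667 rather than 11.5), here is a minimal base R sketch of the algorithm described in the question; it is not part of the answer above. It keeps each group's raw members in a list and re-averages after every merge:
X <- as.list(LETTERS[1:8])
Y <- as.list(c(14, 12, 12, 11, 9, 6, 4, 1))
threshold <- 2
repeat {
  means <- sapply(Y, mean)  # current group averages (over original values)
  gaps <- abs(diff(means))  # distances between adjacent groups
  if (length(gaps) == 0 || min(gaps) > threshold) break
  i <- which.min(gaps)      # first minimum, so ties go towards the larger Y
  X[[i]] <- c(X[[i]], X[[i + 1]]); X <- X[-(i + 1)]  # merge group i+1 into group i
  Y[[i]] <- c(Y[[i]], Y[[i + 1]]); Y <- Y[-(i + 1)]
}
data.frame(X = sapply(X, paste, collapse = " "),
           new_Y = sapply(Y, mean))
#>       X    new_Y
#> 1     A 14.00000
#> 2 B C D 11.66667
#> 3     E  9.00000
#> 4   F G  5.00000
#> 5     H  1.00000
Because every group keeps its raw members, the mean is always taken over the original observations, which reproduces the 11.66667 asked for in the question.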

Related

Simulating correlated answers to a multi-choice test

I am trying to simulate the answers to a multi-choice question test (MCQ). Currently, I am using the following code to simulate the answers to an MCQ with only two questions:
answers <- data.frame(
  Q1 = sample(LETTERS[1:5], 10, replace = TRUE, prob = c(0.1, 0.6, 0.1, 0.1, 0.1)),
  Q2 = sample(LETTERS[1:5], 10, replace = TRUE, prob = c(0.5, 0.1, 0.1, 0.2, 0.1)))
The answers B and A are, respectively, the correct answers to Q1 and Q2.
My difficulty is in introducing correlation among the answers to the questions, in the sense that, for instance, a good student tends to select the correct answer to all questions. How can I accomplish that?
You could fill up the data with completely correct answers, assign a level of proficiency to each individual student and then randomly change values in their exams, depending on their proficiency:
correct = c(2, 1, 3)
nstudents = 20
exam = matrix(LETTERS[rep(correct, nstudents)], ncol = length(correct), byrow = TRUE)
colnames(exam) = paste("Q", 1:length(correct), sep = "")
proficiency = runif(nstudents, 1, 5)/5 ## Each student has a level of expertise
for (question in 1:length(correct)) {
  difficulty = runif(nstudents, 1, 10)/10 ## Random difficulty for each question and student (may be made more or less difficult)
  nmistakes = sum(proficiency < difficulty)
  exam[, question][proficiency < difficulty] = sample(LETTERS[1:5], nmistakes, replace = TRUE)
}
exam = as.data.frame(exam)
The result would be a data frame in which some students hardly ever make mistakes while others hardly ever get something right.
EDIT: The proficiency, in this case, follows a uniform distribution. If you need it normally distributed, just change the proficiency vector to use rnorm().
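As a quick sanity check of the simulated data (a sketch that assumes the exam, correct, nstudents and proficiency objects from the code above), you can count each student's correct answers and correlate the score with proficiency; the correlation should come out clearly positive:
# Number of correct answers per student
scores <- rowSums(sapply(seq_along(correct),
                         function(q) exam[[q]] == LETTERS[correct[q]]))
cor(proficiency, scores)  # positive: proficient students make fewer mistakes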
Here is a method that draws correlated responses using a covariance matrix (the Sigma= argument) with MASS::mvrnorm.
n <- 15
r <- .9
set.seed(42)
library('MASS')
M <- abs(mvrnorm(n = n, mu = c(1, 500), Sigma = matrix(c(1, r, r, 1), nrow = 2),
                 empirical = TRUE)) |>
  as.data.frame() |>
  setNames(c('Q1', 'Q2'))
We get the correlated levels A, ..., E by cutting the random numbers along custom quantiles (taken from the OP),
f <- \(x, q) cut(x, breaks = c(0, quantile(x, cumsum(q))), include.lowest = TRUE,
                 labels = LETTERS[1:5])
p1 <- c(0.1, 0.6, 0.1, 0.1, 0.1)
p2 <- c(0.5, 0.1, 0.1, 0.2, 0.1)
in a Map() call.
dat <- Map(f, M, list(p1, p2)) |>
as.data.frame()
dat
# Q1 Q2
# 1 A A
# 2 B A
# 3 E E
# 4 D E
# 5 A A
# 6 B A
# 7 C D
# 8 B A
# 9 B A
# 10 B A
# 11 B C
# 12 B B
# 13 E D
# 14 B A
# 15 C D
Check
dat_check <- lapply(dat, as.integer) |> as.data.frame()
cor(dat_check) ## correlation
# Q1 Q2
# Q1 1.00000 0.85426
# Q2 0.85426 1.00000
lapply(dat, table) ## students' answers
# $Q1
#
# A B C D E
# 2 8 2 1 2
#
# $Q2
#
# A B C D E
# 8 1 1 3 2

Error in isoMDS(d): zero or negative distance between objects

I'm trying to do a nonmetric MDS (R version 3.3.3) using the isoMDS function in the MASS package and I get this error:
Error in isoMDS(d): zero or negative distance between objects 1 and 2
Here's an example of what I'm doing:
# LOAD LIBRARY
library(MASS)
# CREATE FAKE DATA
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
x
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 4 7
4 1 2 5 9
# EUCLIDEAN DISTANCE BETWEEN ROWS 1, 2, 3 and 4
d <- dist(x)
d
1 2 3
2 0.000000
3 3.162278 3.162278
4 5.385165 5.385165 2.236068
# NMDS
fit <- isoMDS(d)
Error in isoMDS(d) : zero or negative distance between objects 1 and 2
I don't know if there's a way of getting around this issue or if I'm doing something wrong. I understand that objects 1 and 2 are identical and that that's probably why the distance is negative or equal to zero. I found out that my question was a "FAQ", but one of the only answers I found is this:
Short answer: you cannot compare distances including NAs, so there is no way to find a monotone mapping of distances.
If the data really are identical for two rows, you can easily drop one of them whilst doing MDS, and then assign the position found for one to the other.
So, my next questions are: how do you drop rows whilst doing MDS, and is there any other way to perform an NMDS?
Any help would be greatly appreciated!
The dist function computes the distances between the rows of a data matrix.
Your a, b, c, and d vectors are the columns of x, not the rows.
A simple solution is to transpose x:
library(MASS)
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
# Calculate distance between the columns
d <- dist(t(x))
# NMDS
fit <- isoMDS(d)
# initial value 0.000000
# final value 0.000000
# converged
fit
# $points
# [,1] [,2]
# a -4.594429 0.4509513
# b -2.770312 -0.3638885
# c 1.098884 -0.3114594
# d 6.265857 0.2243966
#
# $stress
# [1] 7.976932e-15
I hope it can help you.
As you noted, you have identical rows.
You can omit identical rows when you first create the distance matrix:
d <- dist(x[-1,])
Then continue as normal
fit <- isoMDS(d)
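To complete the quoted FAQ's suggestion (a small sketch, not part of the original answer), you can then give the dropped duplicate the position found for its identical twin:
# fit$points holds coordinates for rows 2-4 of x; row 1 reuses row 2's position
pts <- rbind(fit$points[1, , drop = FALSE], fit$points)
rownames(pts) <- rownames(x)
pts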
Alternatively, you could try the vegan::metaMDS function:
library(vegan)
#> This is vegan 2.5-3
x <- data.frame(a = c(1, 1, 1, 1),
                b = c(2, 2, 2, 2),
                c = c(3, 3, 4, 5),
                d = c(4, 4, 7, 9))
# The warnings are expected for such a small dataset
fit <- vegan::metaMDS(comm = dist(x))
#> ... Procrustes: rmse 0.09543314 max resid 0.108719
#> *** No convergence -- monoMDS stopping criteria:
#> 17: stress < smin
#> 3: scale factor of the gradient < sfgrmin
#> Warning in vegan::metaMDS(comm = dist(x)): stress is (nearly) zero: you may
#> have insufficient data
ordiplot(fit, type = "text")
Variables/columns "a" and "b" (1 and 2) get the same coordinates.
Similarly, using the smacof::mds function:
library(smacof)
fit2 <- smacof::mds(delta = dist(x), type = "ordinal")
fit2$conf
#> D1 D2
#> 1 0.5742535 0.007220978 # 1 & 2 get the same coordinates
#> 2 0.5742535 0.007220978
#> 3 -0.2749314 -0.034928060
#> 4 -0.8735757 0.020486105

Aggregated rolling average with a conditional statement in R

I have a data frame in the following format.
match team1 team2 winningTeam
1 A D A
2 B E E
3 C F C
4 D C C
5 E B B
6 F A A
7 A D D
8 D A A
What I want to do is to create variables that calculate the form of both team 1 and team 2 over the last x matches. For example, I would want a variable called team1_form_last3_matches, which for match 8 would be 0.33 (as they won 1 of their last 3 matches), and a variable called team2_form_last3_matches, which for match 8 would be 0.66 (as they won 2 of their last 3 matches). Ideally I would like to be able to specify the number of previous matches to be considered when calculating the teamx_form_lasty variable, and for those variables to be created automatically. I have tried a bunch of approaches using dplyr, zoo's rolling-mean functions and a load of nested for / if statements, but I have not quite cracked it, and certainly not in an elegant way. I feel like I am missing a simple solution to this generic problem. Any help would be much appreciated!
Cheers,
Jack
This works for t1l3; you will need to replicate it for t2.
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"),
                  stringsAsFactors = FALSE)
dat$t1l3 <- c(NA, sapply(2:nrow(dat), function(i) {
  df <- dat[1:(i - 1), ]  # just previous games, i.e. excludes the current game
  df <- df[df$team1 == dat$team1[i] | df$team2 == dat$team1[i], ]  # just those containing team 1
  df <- tail(df, 3)  # just the last three (or fewer if there aren't three previous games)
  return(sum(df$winningTeam == dat$team1[i]) / nrow(df))  # total wins / total games (up to three)
}))
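To avoid duplicating this for team 2, one possible generalisation (a sketch; form_last_n is a hypothetical helper name, not from the answer) wraps the same logic in a function parameterised by the team column and the window size:
form_last_n <- function(dat, team_col, n) {
  c(NA, sapply(2:nrow(dat), function(i) {
    team <- dat[[team_col]][i]
    prev <- dat[1:(i - 1), ]                                 # previous games only
    prev <- prev[prev$team1 == team | prev$team2 == team, ]  # games involving this team
    prev <- tail(prev, n)                                    # last n of them
    if (nrow(prev) == 0) NA_real_                            # no history yet
    else sum(prev$winningTeam == team) / nrow(prev)          # wins / games
  }))
}
dat$t1l3 <- form_last_n(dat, "team1", 3)
dat$t2l3 <- form_last_n(dat, "team2", 3)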
How about something like:
dat <- data.frame(match = 1:8,
                  team1 = c("A","B","C","D","E","F","A","D"),
                  team2 = c("D","E","F","C","B","A","D","A"),
                  winningTeam = c("A","E","C","C","B","A","D","A"))
match team1 team2 winningTeam
1 1 A D A
2 2 B E E
3 3 C F C
4 4 D C C
5 5 E B B
6 6 F A A
7 7 A D D
8 8 D A A
Allteams <- c("A","B","C","D","E","F")
# A vectorized function for you to use to do as you ask:
teamX_form_lastY <- function(teams, games, dat) {
  sapply(teams, function(x) {
    # 2 = played and won, 1 = played and lost, 0 = did not play
    games_info <- rowSums(dat[, c("team1","team2")] == x) + (dat[, "winningTeam"] == x)
    # most recent game first; 1 = win, 0 = loss
    lookup <- ifelse(rev(games_info[games_info != 0]) == 2, 1, 0)
    games_won <- sum(lookup[1:games])
    if (length(lookup) < games) warning(paste("maximum games for team", x, "should be", length(lookup)))
    games_won / games
  })
}
teamX_form_lastY("A", 4, dat)
A
0.75
# Has a warning for the number of games you should be using
teamX_form_lastY("A", 5, dat)
A
NA
Warning message:
In FUN(X[[i]], ...) : maximum games for team A should be 4
# vectorized input
teamX_form_lastY(teams = c("A","B"), games = 2, dat = dat)
A B
0.5 0.5
# so you can do all teams
teamX_form_lastY(teams = Allteams, 2, dat)
A B C D E F
0.5 0.5 1.0 0.5 0.5 0.0

Converting values between a certain range to a letter

Is there an easy way to make values within a certain range equal a letter? So in the following example, how would I convert all values in df so that:
Values less than or equal to 1 = A.
Values greater than 1 and less than or equal to 5 = B.
Values greater than 5 = C.
A small example dataset:
df1 <- rnorm(100, mean = 1, sd = 0.3)
df2 <- rnorm(100, mean = 5, sd = 1.6)
df <- cbind(df1, df2)
as.data.frame(apply(df, 2, function(x) cut(x, c(-Inf, 1, 5, Inf), labels = c('A', 'B', 'C'))))
# df1 df2
# 1 A C
# 2 A C
# 3 B B
# 4 A C
# 5 A C
# 6 A B
# 7 A C
# 8 B B
# 9 B C
# 10 A C
Remember to use -Inf and Inf when creating cut points for your outer boundary. It's wrapped in an apply function to repeat the process over each column.
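For a single vector, the same mapping could also be written with nested ifelse() calls; a quick sketch for comparison (cut() scales much better as the number of ranges grows):
x <- c(0.5, 3, 7)
ifelse(x <= 1, "A", ifelse(x <= 5, "B", "C"))
#> [1] "A" "B" "C"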

Groupby bins and aggregate in R

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts and an aggregate function computes a summary of the b values (say the max), also grouped by 'c'.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
The result is M x N, where M is the number of 'a' bins and N is the number of unique 'c' values (or the full range of 'c').
How do I approach this? Can any R package help me with this?
A combination of aggregate, cut and reshape seems to work:
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN = max,
                by = list(a = cut(df$a, breaks, include.lowest = TRUE), c = df$c))
# Reshape data
res <- reshape(ag, idvar = "a", timevar = "c", direction = "wide")
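For reference, printing res on the example data should give something like the following; the empty (3,6] bin is dropped because aggregate only returns observed combinations:
res
#        a x.1 x.2
# 1  (0,3]   3   6
# 3  (6,9]  NA   2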
There may be easier ways.
If your dataset is dat:
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = TRUE,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
                 c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2
I would use a combination of data.table and reshape2, which are both optimized for speed (avoiding the implicit loops of the apply family).
Note that the output won't contain the unused bins.
v <- c(1, 4, 7, 10) # creating bins
# temp is assumed to hold the example data from the question:
temp <- data.frame(a = c(1, 2, 9, 1),
                   b = c(2, 3, 2, 6),
                   c = c(1, 1, 2, 2))
temp$int <- findInterval(temp$a, v)
library(data.table)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
library(reshape2)
temp <- dcast.data.table(temp, int ~ c, value.var = "b_m")
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin <- c("1-3", "7-9") # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2
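If the unused 4-6 bin is needed as well, one way (a sketch building on the result above, not part of the original answer) is to outer-join the cast table onto the full set of interval indices:
# v has 4 boundaries, hence length(v) - 1 = 3 bins; bin 2 (4-6) gets NAs
all_bins <- data.table(int = seq_len(length(v) - 1))
temp <- merge(all_bins, temp, by = "int", all.x = TRUE)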
