According to the following table, I have many different teachers (coded 10, 11, 12, ...), each with an opinion (1, 2, ...; for example 1: Very Good, 2: Good, ...) about each class (1, 2, 3, ...).
Some teachers don't have any opinion about some classes.
class  Teacher_code  Opinion
    1            12        1
    1            13        1
    1            14        1
    2            11        3
    2            13        1
    3            10        1
    3            11        2
    3            12        1
    3            13        1
This is a sample of my table, but I have many records. I want a symmetric matrix of teachers holding the counts of their matching opinions about classes. For example, teachers 12 and 13 have the same opinion for classes 1 and 3, so their intersection element is 2; teachers 14 and 13 share an opinion only for the first class. I want to get the following matrix:
     [10] [11] [12] [13] [14]
[10]    0    0    1    1    0
[11]    0    0    0    0    0
[12]    1    0    0    2    1
[13]    1    0    2    0    1
[14]    0    0    1    1    0
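For reproducibility, here is the same sample as a data frame (the answers below refer to it as df):

df <- data.frame(
  class        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Teacher_code = c(12, 13, 14, 11, 13, 10, 11, 12, 13),
  Opinion      = c(1, 1, 1, 3, 1, 1, 2, 1, 1)
)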
This is a base R solution based on a general approach to finding common rows between data frames. Maybe this could be helpful.
Create a function that finds the overlap between two teachers' rows sharing common values in certain columns (in this case, class and Opinion). With merge you can identify the overlap, and with nrow you can count the overlapping rows.
Using outer you can generate a matrix over all teachers. The function passed to outer needs to be vectorized.
the_teachers <- sort(unique(df$Teacher_code))
get_num_classes <- function(x, y) {
  nrow(
    merge(
      df[df$Teacher_code == x, c("class", "Opinion")],
      df[df$Teacher_code == y, c("class", "Opinion")]
    )
  )
}
mat <- outer(the_teachers, the_teachers, Vectorize(get_num_classes))
diag(mat) <- 0
dimnames(mat) <- list(the_teachers, the_teachers)
mat
Output
   10 11 12 13 14
10  0  0  1  1  0
11  0  0  0  0  0
12  1  0  0  2  1
13  1  0  2  0  1
14  0  0  1  1  0
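Since this is essentially a co-occurrence count, the same matrix can also be obtained without a pairwise loop, via table and crossprod (a vectorized sketch on the same df; worth trying if the pairwise merge gets slow on many records):

tab <- table(paste(df$class, df$Opinion), df$Teacher_code)
mat2 <- crossprod(tab)  # counts of shared (class, Opinion) pairs per teacher pair
diag(mat2) <- 0
mat2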
Edit: based on a comment, there is interest in the fraction (opinions shared in the same class) / (classes shared) for each teacher pair. Building off the same logic, you could modify the function as below. A separate merge determines the number of classes the two teachers share. If this number is not zero, the function determines the number of opinions shared between the teacher pair; if no classes are shared, it just returns zero. Depending on the size of the data and the concordance between teachers, this may be optimized further.
get_num_classes <- function(x, y) {
  same_class <- nrow(
    merge(
      df[df$Teacher_code == x, "class", drop = FALSE],
      df[df$Teacher_code == y, "class", drop = FALSE]
    )
  )
  if (same_class != 0) {
    same_opinion <- nrow(
      merge(
        df[df$Teacher_code == x, c("class", "Opinion")],
        df[df$Teacher_code == y, c("class", "Opinion")]
      )
    )
    return(same_opinion / same_class)
  } else {
    return(0)
  }
}
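The matrix itself is then built exactly as before (a usage sketch):

frac_mat <- outer(the_teachers, the_teachers, Vectorize(get_num_classes))
diag(frac_mat) <- 0
dimnames(frac_mat) <- list(the_teachers, the_teachers)
frac_mat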
Here is a base R option defining a user function f, where aggregate + pmin + vecsets::vintersect are applied:
library(vecsets)
f <- function(df) {
  u <- aggregate(. ~ Teacher_code, df, I)
  res <- do.call(
    pmin,
    lapply(
      u[c("class", "Opinion")],
      function(x) outer(x, x, FUN = function(...) lengths(Vectorize(vintersect)(...)))
    )
  )
  `dimnames<-`(`diag<-`(res, 0), rep(list(u[["Teacher_code"]]), 2))
}
and you will see
> f(df)
   10 11 12 13 14
10  0  0  1  1  0
11  0  0  0  0  0
12  1  0  0  2  1
13  1  0  2  0  1
14  0  0  1  1  0
I mean, it's some pretty hideous code (I'm sure someone can do something better), but I think it gets you the result you need (this isn't homework, is it? -.-). Also, in the future it makes life easier if you provide data using dput().
library(dplyr)
library(tidyr)
dat <- tibble(
  class        = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  Teacher_code = c(12, 13, 14, 11, 13, 10, 11, 12, 13),
  Opinion      = c(1, 1, 1, 3, 1, 1, 2, 1, 1)
)
dat2 <- complete(dat, class, Teacher_code)
classes <- unique(dat2$class)
teachers <- unique(dat2$Teacher_code)
len_teachers <- length(teachers)
mat <- matrix(nrow = len_teachers, ncol = len_teachers)
for (i in seq_along(teachers)) {
  for (j in seq_along(teachers)) {
    same_opinion <- 0
    for (k in classes) {
      opinion_i <- dat2 %>% filter(Teacher_code == teachers[[i]], class == k) %>% pull(Opinion)
      opinion_j <- dat2 %>% filter(Teacher_code == teachers[[j]], class == k) %>% pull(Opinion)
      same_opinion <- same_opinion + (opinion_i == opinion_j & !(is.na(opinion_i) | is.na(opinion_j)))
    }
    mat[i, j] <- same_opinion
  }
}
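To get the labelled, zero-diagonal matrix from the question, you can finish with (a small addition to the above):

dimnames(mat) <- list(teachers, teachers)
diag(mat) <- 0
mat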
Related
I have a dataframe like this:
dat <- data.frame(
  Ones  = c(0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0),
  Thats = c(0,5,3,6,8,4,5,6,8,3,1,3,4,5,6,7,4,3,4,5)
)
I have to create a function (gap1) that detects each 1 in Ones and then sums rows n-1, n, and n+1 of Thats, where n is the row containing the 1.
For example, in this dataset I have two 1s.
This should be the output:
Ones Thats gap1
   1     4   17  # (8+4+5)
   1     1    7  # (3+1+3)
I would like to extend this gap at will, for example:
Ones Thats gap1 gap2 gap3 ...
   1     4   17   29  # (6+8+4+5+6)
   1     1    7   19  # (8+3+1+3+4)
There is another problem I have to consider:
Suppose we have this data frame:
dat <- data.frame(
  Ones  = c(1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0),
  Thats = c(0,5,3,6,8,4,5,6,8,3,1,NA,4,5,6,7,4,3,4,5)
)
In case there is a 1 at the beginning (or at the end), or if there is an NA, the function should use whatever data is available.
In this case, for example:
Ones Thats gap1           gap2
   1     0    5 # (0+5)      8 # (0+5+3)
   1     4   17 # (8+4+5)   29 # (6+8+4+5+6)
   1     1    4 # (3+1+NA)  16 # (8+3+1+NA+4)
Do you have any advice?
Using tidyverse / collapse
For an arbitrary number of leads and lags, the collapse package offers a nice function, flag, which has further arguments to specify columns (cols) or grouping variables (g).
library(dplyr)
f <- function(df, n){
  df %>%
    collapse::flag(-n:n) %>%
    transmute(Ones, Thats, gap = rowSums(., na.rm = TRUE) - 1) %>%
    filter(Ones == 1)
}
x <- data.frame(
  Ones  = c(1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0),
  Thats = c(0,5,3,6,8,4,5,6,8,3,1,NA,4,5,6,7,4,3,4,5)
)
# we can now specify how many lags to count:
f(x, 1)
  Ones Thats gap
1    1     0   5
2    1     4  17
3    1     1   4

f(x, 2)

  Ones Thats gap
1    1     0   8
2    1     4  29
3    1     1  16
Or if you want to specify the number of gaps to compute, we can simplify the function to
f <- function(df, n){
  df %>%
    collapse::flag(-n:n) %>%
    rowSums(na.rm = TRUE) - 1
}
x %>%
  mutate(gap1 = f(., 1),
         gap2 = f(., 2)) %>%
  filter(Ones == 1)

  Ones Thats gap1 gap2
1    1     0    5    8
2    1     4   17   29
3    1     1    4   16
Base R
If you like terse functions:
f <- Vectorize(\(df, n) rowSums(collapse::flag(df, -n:n), na.rm = TRUE) - 1, "n")
x[paste0("gap", 1:2)] <- f(x, 1:2); subset(x, Ones == 1)

   Ones Thats gap1 gap2
1     1     0    5    8
6     1     4   17   29
11    1     1    4   16
With base R,
myfun <- function(data, gap = 1) {
  points <- which(data["Ones"] == 1)
  sapply(points, function(x) {
    bottom <- ifelse(x - gap <= 0, 1, x - gap)
    top    <- ifelse(x + gap > nrow(data), nrow(data), x + gap)
    sum(data[bottom:top, "Thats"], na.rm = TRUE)
  })
}
myfun(dat, 1)
# [1]  5 17  4
myfun(dat, 2)
# [1]  8 29 16
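If you also want the gaps attached to the matching rows, one way (a sketch) is:

res <- subset(dat, Ones == 1)
res$gap1 <- myfun(dat, 1)
res$gap2 <- myfun(dat, 2)
res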
Another base R solution
f <- function(dat, width = 1)
{
  dat$gaps <- sapply(seq(nrow(dat)), function(x) {
    if (dat$Ones[x] == 0) return(0)
    i <- x + seq(2 * width + 1) - (width + 1)
    i <- i[i > 0]
    i <- i[i <= nrow(dat)]  # use <= so the last row is not dropped
    sum(dat$Thats[i])
  })
  dat[dat$Ones == 1, ]
}
f(dat, 1)
#>    Ones Thats gaps
#> 6     1     4   17
#> 11    1     1    7
f(dat, 2)
#>    Ones Thats gaps
#> 6     1     4   29
#> 11    1     1   19
With a list of unique objects, where the identities matter (and thus the order, but only for the purpose of tracking identity):
fakeDataList <- list(one   = 1,
                     two   = 2,
                     three = 3,
                     four  = 4)
There's a function that performs pairwise calculations...
fakeInnerFxn <- function(l){
  x <- l[[1]]
  y <- l[[2]]
  low  <- x + y - 1
  mid  <- x + y
  high <- x + y + 1
  out  <- c(low, mid, high)
  return(out)
}
... and returns three values per pair of IDs:
fakeInnerFxn(fakeDataList[c(1,2)])
#> [1] 2 3 4
The inner function is nested within the outer function, which performs each pairwise operation on the full list...
fakeOuterFxn <- function(d){
  n <- length(d)
  out <- array(0, dim = c(n, n, 3))
  colnames(out) <- names(d)
  rownames(out) <- names(d)
  for (i in 1:n) {
    for (j in (i + 1):n) {
      if (j <= n) {
        out[i, j, ] <- fakeInnerFxn(d[c(i, j)])
      }
    }
  }
  diag(out[, , 1]) <- 0 # not sure how to do this succinctly
  diag(out[, , 2]) <- 0
  diag(out[, , 3]) <- 0
  return(out)
}
... and returns an array of three matrices representing the 'low', 'mid' and 'high'
fakeOuterFxn(fakeDataList)
#> , , 1
#>
#> one two three four
#> one 0 2 3 4
#> two 0 0 4 5
#> three 0 0 0 6
#> four 0 0 0 0
#>
#> , , 2
#>
#> one two three four
#> one 0 3 4 5
#> two 0 0 5 6
#> three 0 0 0 7
#> four 0 0 0 0
#>
#> , , 3
#>
#> one two three four
#> one 0 4 5 6
#> two 0 0 6 7
#> three 0 0 0 8
#> four 0 0 0 0
The actual data is a very long list and the calculations are slow.
How can I parallelize this code with foreach and doParallel such that the array structure and the row/column order are preserved (or at least can be tracked and reordered at the end)?
library(foreach)
library(doParallel)
#> Loading required package: iterators
#> Loading required package: parallel
registerDoParallel(detectCores()-2)
The for loop doesn't need to be inside a function, but it'd be neat if it was.
d <- fakeDataList
n <- length(d)
This is really as far as I've been able to get with it:
out <- foreach(i = 1:n, .combine = 'c') %:%
  foreach(j = (i + 1):n, .combine = 'c') %dopar% {
    if (j <= n) {
      fakeInnerFxn(d[c(i, j)])
    }
  }
The answers are all here, but how do I get an array back?
out
#> [1] 2 3 4 3 4 5 4 5 6 4 5 6 5 6 7 6 7 8 7 8 9
Created on 2021-06-22 by the reprex package (v1.0.0)
You can always return the indices with the results and reconstruct your array later.
res <- foreach(i = 1:(n - 1), .combine = 'c') %:%
  foreach(j = (i + 1):n) %dopar% {
    list(i, j, fakeInnerFxn(d[c(i, j)]))
  }

n <- length(d)
out <- array(0, dim = c(n, n, 3))
for (res_k in res) out[res_k[[1]], res_k[[2]], ] <- res_k[[3]]
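To keep track of identities, you can also restore the dimnames from the original list once the array is rebuilt (a small sketch):

dimnames(out) <- list(names(d), names(d), NULL)
out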
Here's my problem; I couldn't solve it at all.
Suppose that we have the following code:
## A data frame named a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))
## The 1st function calculates all combinations of the column names of a; the output is a character vector named items2
items2 <- c()
countI <- 1
while (countI <= ncol(a)) {
  for (i in countI) {
    countJ <- countI + 1
    while (countJ <= ncol(a)) {
      for (j in countJ) {
        items2 <- c(items2, paste(colnames(a[i]), colnames(a[j]), collapse = '', sep = ""))
      }
      countJ <- countJ + 1
    }
    countI <- countI + 1
  }
}
And here's my code I'm trying to solve (the output is a numeric vector called count_1):
## 2nd function
colnames(a) <- NULL ## just for facilitating the calculation
count_1 <- numeric(ncol(a) * 2)
countI <- 1
while (countI <= ncol(a)) {
  for (i in countI) {
    countJ <- countI + 1
    while (countJ <= ncol(a)) {
      for (j in countJ) {
        s <- a[, i]
        p <- a[, j]
        count_1[i * 2] <- as.integer(s[i] == p[j] & s[i] == 1)
      }
      countJ <- countJ + 1
    }
    countI <- countI + 1
  }
}
But when I execute this code in the RStudio console, an unexpected result is returned:
count_1
[1] 0 0 0 0 0 1 0 1 0 0
However, I am expecting the following result:
count_1
[1] 1 2 2 2 1 1 1 1 2 1
You can visit the following URL, where you can find an image on Dropbox with a detailed explanation:
https://www.dropbox.com/s/5ylt8h8wx3zrvy7/IMAG1074.jpg?dl=0
I'll try to explain a little more.
I posted the 1st function (code) just to show you an example of what I'm looking for, that's all.
What I'm trying to get from the second function (code) is the number of rows in which both columns of a pair equal 1: starting with counter = 0, a row counts when both columns of the pair (AB, for example) equal 1, in which case counter = counter + 1. I continue by combining each column with all the other columns (AB, AC, AD, AE, BC, BD, BE, CD, CE, and then DE); the number of combinations is n!/2!(n-2)!. That means, for example, if I have the following data frame:
a =
  A B C D E
  0 1 0 0 0
  0 0 0 0 1
  1 1 1 1 1
  1 0 0 1 0
  1 0 1 0 1
then the number of occurrences of the number 1 in each row, combining the first two columns, is as follows (note that I put colnames(a) <- NULL just to facilitate the work and be clearer):
0 1 0 0 0
0 0 0 0 1
1 1 1 1 1
1 0 0 1 0
1 0 1 0 1
### Example 1: #####################################################
So from here I put (for columns A and B (AB)):

s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 1 0 1 0 0
Then I look for the positions where the number 1 occurs in both vectors, i.e. a[, i] == 1 & a[, j] == 1, and for this example the resulting count is [1] 1.
### Example 2: #####################################################
From here I put (for columns A and D (AD)):

s <- a[, i]
## s is equal to
## [1] 0 0 1 1 1
p <- a[, j]
## p is equal to
## [1] 0 0 1 1 0
Then I look for the positions where the number 1 occurs in both vectors, i.e. a[, i] == 1 & a[, j] == 1, and for this example the resulting count is [1] 2.
And so on. In the end I'll have a numeric vector named count_1 equal to:
[1] 1 2 2 2 1 1 1 1 2 1
where each index of count_1 corresponds to a combination of two columns (without the names of the data frame):
AB AC AD AE BC BD BE CD CE DE
 1  2  2  2  1  1  1  1  2  1
It's not clear what you're after at all.
As to the first code chunk, that is some ugly R coding involving a whole bunch of unnecessary while/for loops.
You can get the same result, items2, in one single line.
items2 <- sort(toupper(unlist(sapply(1:4, function(i)
    sapply(5:(i + 1), function(j)
        paste(letters[i], letters[j], sep = ""))))));
items2;
# [1] "AB" "AC" "AD" "AE" "BC" "BD" "BE" "CD" "CE" "DE"
As to the second code chunk, please explain what you're trying to calculate. It's likely that these while/for loops are as unnecessary as in the first case.
Update
Note that this is based on a as defined at the beginning of your post. Your expected output is based on a different a, that you changed further down the post.
There is no need for a for/while loop; both "functions" can be written as one-liners.
# Your sample dataframe a
a <- data.frame(A = c(0,0,1,1,1), B = c(1,0,1,0,0), C = c(0,0,1,1,0), D = c(0,0,1,1,0), E = c(0,1,1,0,1))

# Function 1
items2 <- toupper(unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i + 1), function(j)
    paste(letters[i], letters[j], sep = "")))));

# Function 2
count_1 <- unlist(sapply(1:(ncol(a) - 1), function(i) sapply(ncol(a):(i + 1), function(j)
    sum(a[, i] + a[, j] == 2))));

# Add names and sort
names(count_1) <- items2;
count_1 <- count_1[order(names(count_1))];

# Output
count_1;
#AB AC AD AE BC BD BE CD CE DE
# 1  2  2  2  1  1  1  2  1  1
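For what it's worth, combn makes the pair enumeration explicit and keeps the one-liners readable (an equivalent sketch; the pairs come out in alphabetical order directly):

# count rows where both columns of each pair are 1
count_1 <- combn(ncol(a), 2, function(ij) sum(a[, ij[1]] & a[, ij[2]]))
names(count_1) <- combn(colnames(a), 2, paste, collapse = "")
count_1
#AB AC AD AE BC BD BE CD CE DE
# 1  2  2  2  1  1  1  2  1  1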
I'm trying to create a random data set in R that has metric, binomial, and categorical variables. However, when I check the class of my categorical variables at the end, R says they are numeric, but I need them to be factors for my further analysis. Does anybody have an idea what I'm doing wrong here?
That's my code:
set.seed(3456)
R.dat <- function(n = 5000, metr = 1, bin = 1, cat = 3) {
  j <- metr
  X <- matrix(0, n, j)
  for (i in 1:n) {
    X[i, ] <- rnorm(j, mean = 0, sd = 1)
  }
  BIN <- matrix(0, n, bin)
  for (i in 1:bin) {
    BIN[, i] <- rbinom(n, 1, 0.5)
  }
  CAT <- matrix(0, n, cat)
  for (i in 1:cat) {
    CAT[, i] <- factor(sample(1:4, n, TRUE))
  }
  X <- as.data.frame(cbind(X, BIN, CAT))
  return(X)
}
Dat <- R.dat(n=5000,metr=1,bin=1, cat=3)
summary(Dat)
If I just sample like this:
x <- factor(sample(1:4, n, TRUE))
class(x)
it says x is a factor, so I don't get why it doesn't do the same when I use it in the function and loop... Any help is much appreciated, thanks in advance!
When you do this:
CAT <- matrix(0, n, cat)
for (i in 1:cat) {
  CAT[, i] <- factor(sample(1:4, n, TRUE))
}
you create a numeric matrix CAT, and then you assign a new value to a subset of that matrix. When you do that assignment, the new value is coerced to the type of CAT, which is numeric.
Also, when you cbind the matrices X, BIN and CAT at the end, you coerce all of them to a common type. This would again mess up your variable types, even assuming everything was working correctly up to this point.
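You can see the effect in a two-line demonstration:

x <- factor(c("a", "b"))
m <- cbind(1:2, x)  # the factor column is silently reduced to its integer codes
class(m[, 2])       # "integer", not "factor"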
The rest of your code can also be simplified considerably. In particular, you don't need looping to reassign values to matrices; you can call the matrix constructor function directly on a vector of values.
Try this instead:
R.dat <- function(n = 5000, metr = 1, bin = 1, cat = 3)
{
  X <- matrix(rnorm(n * metr), nrow = n)
  B <- matrix(rbinom(n * bin, 1, 0.5), nrow = n)
  F <- matrix(as.character(sample(1:4, n * cat, TRUE)), nrow = n)
  # since R 4.0, data.frame() no longer converts strings to factors by default,
  # so request the conversion explicitly
  data.frame(X = X, B = B, F = F, stringsAsFactors = TRUE)
}
You don't need a loop. If you switch to data.table, you can generate them by reference.
library(data.table)

n   <- 10
bin <- 1
DT <- data.table(X   = replicate(n, rnorm(bin, mean = 0, sd = 1)),
                 BIN = rbinom(n, 1, 0.5),
                 CAT = factor(sample(1:4, n, TRUE)))

## If you need to, you can add more columns
cols <- paste0("BIN", 1:3)
DT[, (cols) := lapply(rep(n, 3), rbinom, 1, .5)]

cols <- paste0("CAT", 1:3)
DT[, (cols) := lapply(rep(n, 3), function(x) { factor(sample(1:4, n, TRUE)) })]

DT
lapply(DT, class)
              X BIN CAT BIN1 BIN2 BIN3 CAT1 CAT2 CAT3
 1:  1.2934720   1   2    0    0    0    1    1    2
 2: -0.1183180   1   2    0    0    1    3    3    1
 3:  0.3648810   1   2    1    1    1    3    2    3
 4: -0.2149963   1   2    1    1    0    2    3    2
 5:  0.3204577   1   1    0    1    1    2    2    4
 6: -0.5941640   0   4    1    0    0    2    3    1
 7: -1.8852835   1   4    1    0    0    2    1    1
 8: -0.8329852   0   2    0    0    1    1    1    2
 9: -0.1353628   0   4    0    1    1    1    4    1
10: -0.2943969   1   4    0    1    0    4    3    3
> lapply(DT, class)
$X
[1] "numeric"

$BIN
[1] "integer"

$CAT
[1] "factor"

$BIN1
[1] "integer"

$BIN2
[1] "integer"

$BIN3
[1] "integer"

$CAT1
[1] "factor"

$CAT2
[1] "factor"

$CAT3
[1] "factor"
Because a matrix cannot hold factors, the factor vector is coerced into numbers.
Just change CAT into a data frame:
CAT <- matrix(0,n,cat)
CAT <- as.data.frame(CAT)
This will do the trick.
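A quick check that the factor columns now survive (a sketch with example sizes n = 10 and cat = 3):

n <- 10; cat <- 3
CAT <- as.data.frame(matrix(0, n, cat))
for (i in 1:cat) {
  CAT[, i] <- factor(sample(1:4, n, TRUE))
}
sapply(CAT, class)
# all columns now report "factor"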
Let's say I have something like this:
set.seed(0)
the.df <- data.frame(x   = rep(letters[1:3], each = 4),
                     n   = rep(0:3, 3),
                     val = round(runif(12)))
the.df
   x n val
1  a 0   1
2  a 1   0
3  a 2   0
4  a 3   1
5  b 0   1
6  b 1   0
7  b 2   1
8  b 3   1
9  c 0   1
10 c 1   1
11 c 2   0
12 c 3   0
Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.
For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).
Imagine that the data.frame is not sorted. Therefore procedures that depend on the order would require a sort. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing).
The trickiest part seems to be evaluating val in sequence. I can do this using a loop but I imagine that it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?
EDIT: wanted output
the.df
   x n val wanted
1  a 0   1      1
2  a 1   0      0
3  a 2   0      0
4  a 3   1      0
5  b 0   1      1
6  b 1   0      0
7  b 2   1      0
8  b 3   1      0
9  c 0   1      1
10 c 1   1      1
11 c 2   0      0
12 c 3   0      0
Also, I don't mind making new columns (e.g., putting the wanted values there).
Using data.table I would try the following
library(data.table)
setDT(the.df)[order(n),
              val := if (length(indx <- which(val[2:.N] == 0L)))
                       c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
              by = x]
the.df
#     x n val
#  1: a 0   1
#  2: a 1   0
#  3: a 2   0
#  4: a 3   0
#  5: b 0   1
#  6: b 1   0
#  7: b 2   0
#  8: b 3   0
#  9: c 0   1
# 10: c 1   1
# 11: c 2   0
# 12: c 3   0
This will simultaneously order the data by n (as you said it's not ordered in real life) and recreate val by condition (meaning that if the condition is not satisfied, val is left untouched).
Hopefully in the near future this will be implemented and then the code could potentially be
setDT(the.df)[order(n), val[n > 2] := if(val[2L] == 0) 0L, by = x]
Which could be a great improvement both performance and syntax wise
A base R approach might be
df <- the.df[order(the.df$x, the.df$n),]
df$val <- ave(df$val, df$x, FUN=fun)
As for fun, #DavidArenburg's answer in plain R and written a bit more poetically might be
fun0 <- function(v) {
  idx <- which.max(v[2:length(v)] == 0L) + 1L
  if (length(idx))
    v[idx:length(v)] <- 0L
  v
}
It seems like a good idea to formulate the solution as an independent function first, because then it is easy to test. fun0 fails for some edge cases, e.g.,
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
A better version is
fun1 <- function(v) {
  tst <- tail(v, -1) == 0L
  if (any(tst)) {
    idx <- which.max(tst) + 1L
    v[idx:length(v)] <- 0L
  }
  v
}
And even better, following #Arun
fun <- function(v)
  if (length(v) > 2) c(v[1], cummin(v[-1])) else v
This is competitive (same order of magnitude) with the data.table solution, with ordering and return occurring in less than 1 s for the ~10m-row data.frame of #m-dz's timings. At a second for millions of rows, it doesn't seem worthwhile to pursue further optimization.
Nonetheless, when there are a very large number of small groups (e.g., 2M each of size 5) an improvement is to avoid the tapply() function call by using group identity to offset the minimum. For instance,
df <- df[order(df$x, df$n), ]
grp <- match(df$x, unique(df$x))  # strictly sequential groups
keep <- duplicated(grp)           # ignore the first of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
Hmmm, should be pretty efficient if you switch to data.table...
library(data.table)
# Define the.df as a data.table (or use data.table::setDT() function)
set.seed(0)
the.df <- data.table(
  x   = rep(letters[1:3], each = 4),
  n   = rep(0:3, 3),
  val = round(runif(12))
)

m_dz <- function() {
  setorder(the.df, x, n)
  repeat {
    # Get IDs of rows to change
    # ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
    ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    # If no IDs, break
    if (length(ids) == 0) {
      break
    }
    # Set val to 0
    # for (i in ids) set(the.df, i = i, j = "val", value = 0)
    set(the.df, i = ids, j = "val", value = 0)
  }
  return(the.df)
}
Edit: Above function is slightly modified thanks to #jangorecki's, i.e. uses which = TRUE and set(the.df, i = ids, j = "val", value = 0), which made the timings much more stable (no very high max timings).
Edit: timing comparison with #David Arenburgs's answer on a slightly bigger table, m-dz() updated (#FoldedChromatin's answer skipped because of diffrent results).
My function is slightly faster in terms of median and upper quantile, but there is quite a big spread in timings (see max...), I cannot figure out why. Hopefully the timing methodology is correct (returning the result to different object etc.).
Anything bigger will kill my PC :(
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))

size1 <- length(unique(groups_ids))
size2 <- round(1e7 / size1)

df1 <- data.table(
  x   = rep(groups_ids, each = size2),
  n   = rep(0:(size2 - 1), size1),
  val = round(runif(size1 * size2))
)
df2 <- copy(df1)
# m-dz
m_dz <- function() {
  setorder(df1, x, n)
  repeat {
    ids <- df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    if (length(ids) == 0) {
      break
    }
    set(df1, i = ids, j = "val", value = 0)
  }
  return(df1)
}

# David Arenburg
DavidArenburg <- function() {
  setorder(df2, x, n)
  df2[, val := if (length(indx <- which.max(val[2:.N] == 0) + 1L)) c(val[1:indx], rep(0L, .N - indx)), by = x]
  return(df2)
}
library(microbenchmark)
microbenchmark(
  res1 <- m_dz(),
  res2 <- DavidArenburg(),
  times = 100
)
# Unit: milliseconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960   100   a
#  res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210  525.8095   100   a

identical(res1, res2)
# [1] TRUE
Edit: (old) results for an even bigger table:

set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace = TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8 / size1)

# Unit: seconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107   100   a
#          res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280   100   a
#  res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377  10.28965   100   a
Why not just use by?
> set.seed(0)
> the.df <- data.frame(x   = rep(letters[1:3], each = 4),
                       n   = rep(0:3, 3),
                       val = round(runif(12)))
> the.df
   x n val
1  a 0   1
2  a 1   0
3  a 2   0
4  a 3   1
5  b 0   1
6  b 1   0
7  b 2   1
8  b 3   1
9  c 0   1
10 c 1   1
11 c 2   0
12 c 3   0
> Mod.df <- by(the.df, INDICES = the.df$x, function(x) {
    x$val[x$n == 2] = 0
    Which = which(x$n == 2 & x$val == 0) + 1
    x$val[Which] = 0
    x
  })
> do.call(rbind, Mod.df)
     x n val
a.1  a 0   1
a.2  a 1   0
a.3  a 2   0
a.4  a 3   0
b.5  b 0   1
b.6  b 1   0
b.7  b 2   0
b.8  b 3   0
c.9  c 0   1
c.10 c 1   1
c.11 c 2   0
c.12 c 3   0