I am trying to find the linear regression between all available groups of the following dataset.
library(data.table)
dt <- data.table(time = c(rep(rep(1:100, times = 1), 4), rep(1:30, times = 1)),
group = c(rep(c("a","b","c","d"), each = 100), rep("e", 30)),
value = rnorm(430))
dt[]
time group value
1: 1 a 0.1625954
2: 2 a -1.2288462
3: 3 a -0.1628570
4: 4 a 1.0597886
5: 5 a -1.1828334
---
426: 26 e -1.3762654
427: 27 e 0.3761436
428: 28 e -1.6982330
429: 29 e 0.1940263
430: 30 e -0.4631258
The output should be something like
group1 group2 regression
a b 1.2
a c 0.3
b c 0.5
d a 4.3
...
I am looking for a solution using data.table library only.
Linear regression of all the combinations of groups should be found. That includes cases a~b and b~a as the regression for each of these cases will be different.
Since the size of some groups is different, the time variables should be used to find the common rows between any set of groups.
The solution will require finding all combinations of groups.
With the new data, we could split the data by 'group' into a list. Then use combn on the names of the list to get the pairwise combinations, extract the two list elements (s1, s2), and check whether they share any 'time' values (intersect). If they do, join them on 'time', apply lm to the corresponding 'value' columns, and build a data.table with the slope coefficient along with the group names; otherwise return NA for that pair. Finally, rbind the list elements with rbindlist.
library(data.table)
lst1 <- split(dt, dt$group)
rbindlist(combn(names(lst1), 2, FUN = function(x) {
s1 <- lst1[[x[1]]]
s2 <- lst1[[x[2]]]
i1 <- intersect(s1$time, s2$time)
if(length(i1) > 0) na.omit(s1[s2, on = .(time)][,
.(group1 = first(s1$group), group2 = first(s2$group),
regression = lm(i.value ~ value)$coef[2])])
else
data.table(group1 = first(s1$group), group2 = first(s2$group),
regression = NA_real_)}, simplify = FALSE))
-output
group1 group2 regression
1: a b 0.03033996
2: a c 0.06391242
3: a d -0.09138112
4: a e -0.27738183
5: b c 0.05663270
6: b d 0.05481604
7: b e 0.27789495
8: c d -0.13987978
9: c e 0.16388299
10: d e 0.12380720
If we want all ordered combinations (both a~b and b~a), use either expand.grid or CJ (from data.table):
dt2 <- CJ(group1 = names(lst1), group2 = names(lst1))[group1 != group2]
dt2[, rbindlist(Map(function(x, y) {
s1 <- lst1[[x]]
s2 <- lst1[[y]]
i1 <- intersect(s1$time, s2$time)
if(length(i1) > 0) na.omit(s1[s2, on = .(time)][,
data.table(group1 = x, group2 = y,
regression = lm(i.value ~ value)$coef[2])]) else
data.table(group1 = x, group2 = y, regression = NA_real_)
}, group1, group2))]
-output
group1 group2 regression
1: a b 0.03033996
2: a c 0.06391242
3: a d -0.09138112
4: a e -0.27738183
5: b a 0.03247826
6: b c 0.05663270
7: b d 0.05481604
8: b e 0.27789495
9: c a 0.07488082
10: c b 0.06198333
11: c d -0.13987978
12: c e 0.16388299
13: d a -0.09295215
14: d b 0.05208743
15: d c -0.12144302
16: d e 0.12380720
17: e a -0.25136439
18: e b 0.34052322
19: e c 0.28677255
20: e d 0.21435666
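The same ordered-pair slopes can also be sketched with a single self-join on time instead of splitting into a list. This is only an illustrative alternative, not the method used above: the seed is an assumption (the original data were generated without one), so the numbers will differ from the output shown, and `pairs` and `res` are hypothetical names.

```r
library(data.table)

set.seed(42)  # assumed seed, only for reproducibility of this sketch
dt <- data.table(time  = c(rep(1:100, 4), 1:30),
                 group = c(rep(c("a", "b", "c", "d"), each = 100), rep("e", 30)),
                 value = rnorm(430))

# Self-join on time: one row per ordered pair of observations sharing a time.
# Clashing columns from the table inside [ ] get an "i." prefix.
pairs <- dt[dt, on = .(time), allow.cartesian = TRUE][group != i.group]

# Slope of group2's value regressed on group1's value, for every ordered pair.
res <- pairs[, .(regression = coef(lm(i.value ~ value))[[2]]),
             keyby = .(group1 = group, group2 = i.group)]
res
```

Because the join only keeps rows with a matching time, groups of different sizes (such as e) are automatically restricted to their common time points.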
I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).
In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
> var
A B C D E
1 a c e 1.1601720731 -0.57092435
2 a c f -0.0120178626 1.05003748
3 a d e 0.5311032778 1.67867806
4 a d f -0.3399901000 0.01459940
5 a c e -0.2887561691 -0.03847519
6 a c f 0.0004299922 -0.36695879
7 a d e 0.8124655890 0.05444033
8 a d f -0.3777058654 1.34074427
9 b c e 0.7380720821 0.37708543
10 b c f -0.3163496271 0.10921373
11 b d e -0.5543252191 0.35020193
12 b d f -0.5753686426 0.54642790
13 b c e -1.9973216646 0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729
sapply(4:ncol(var), function(i){
sapply(1:length(levels(var$A)), function(j){
sapply(1:length(levels(var$B)), function(t){
sapply(1:length(levels(var$C)), function(z){
mean(var[var$A == levels(var$A)[j] &
var$B == levels(var$B)[t] &
var$C == levels(var$C)[z],i])
})
})
})
})
[,1] [,2]
[1,] 0.435707952 -0.3046998
[2,] -0.005793935 0.3415393
[3,] 0.671784433 0.8665592
[4,] -0.358847983 0.6776718
[5,] -0.629624791 0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947
Is there a way to do this without so many nested sapply calls? Maybe with mapply or outer?
Maybe just,
var <- data.frame(A = c(rep('a',8),rep('b',8)), B =
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
D = rnorm(16), E = rnorm(16))
library(dplyr)
var %>%
group_by(A,B,C) %>%
summarise_if(is.numeric,mean)
(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)
For inline aggregation (keeping same number of rows of data frame), consider ave:
var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))
For full aggregation (collapsed to factor groups), consider aggregate:
aggregate(. ~ A + B + C, var, mean)
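To make the difference between the two base approaches concrete, here is a small sketch on the same kind of data. The seed and the names `inline` and `collapsed` are assumptions for illustration only.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
var <- data.frame(A = rep(c("a", "b"), each = 8),
                  B = rep(rep(c("c", "d"), each = 2), 4),
                  C = rep(c("e", "f"), 8),
                  D = rnorm(16), E = rnorm(16))

# ave() returns one value per original row (the group mean repeated).
inline <- with(var, ave(D, A, B, C, FUN = mean))

# aggregate() returns one row per A/B/C combination.
collapsed <- aggregate(cbind(D, E) ~ A + B + C, data = var, FUN = mean)

length(inline)   # same length as nrow(var)
nrow(collapsed)  # one row per group
```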
I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.
library(data.table)
setDT(var)
var[, lapply(.SD, mean), by = .(A, B, C)]
# A B C D E
# 1: a c e 0.07465822 0.032976115
# 2: a c f 0.40789460 -0.944631574
# 3: a d e 0.72054938 0.039781185
# 4: a d f -0.12463910 0.003363382
# 5: b c e -1.64343115 0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e 0.03937829 0.048136471
# 8: b d f -0.43447899 0.028266455
Suppose that we have the following dataframe:
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[1:5], 10, replace = TRUE)))
x R1 R2
1 1 B B
2 2 B A
3 3 C D
4 4 E B
5 5 B D
6 6 E C
7 7 E D
8 8 D E
9 9 D B
10 10 A D
I want to do the following: if the difference between the level index
of factor R1 and that of factor R2 is an odd number, the levels of the
two factors need to be switched between them, which can be performed
through the following code:
for(ii in 1:dim(tmp)[1]) {
kk <- which(levels(tmp$R2) %in% tmp[ii,'R2'], arr.ind = TRUE) -
which(levels(tmp$R1) %in% tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the values between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
More concise and efficient ways to achieve this?
P.S. A slightly different situation is the following.
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[2:6], 10, replace = TRUE)))
x R1 R2
1 C B
2 B B
3 C E
4 E C
5 E B
6 D E
7 E E
8 D F
9 C D
10 A E
Notice that the factor levels between the two factors, R1 and R2, slide by one level; that is, factor R1 does not have level F while factor R2 does not have level A. I want to swap the factor levels based on the combined levels of the two factors as shown below:
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
for(ii in 1:dim(tmp)[1]) {
kk <- which(tl %in% tmp[ii,'R2'], arr.ind = TRUE) - which(tl %in%
tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the values between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
How to go about this case? Thanks!
#Find out the indices where difference is odd
inds = abs(as.numeric(tmp$R1) - as.numeric(tmp$R2)) %% 2 != 0
#create new columns where values for the appropriate inds are from relevant columns
tmp$R1_new = replace(tmp$R1, inds, tmp$R2[inds])
tmp$R2_new = replace(tmp$R2, inds, tmp$R1[inds])
tmp
# x R1 R2 R1_new R2_new
#1 1 B B B B
#2 2 B A A B
#3 3 C D D C
#4 4 E B B E
#5 5 B D B D
#6 6 E C E C
#7 7 E D D E
#8 8 D E E D
#9 9 D B D B
#10 10 A D D A
Delete the old R1 and R2 columns if necessary.
A solution using dplyr. dt is the final output. Notice that we need to use if_else from dplyr here, not the common ifelse from base R, because ifelse drops the factor attributes and returns the underlying integer codes.
library(dplyr)
dt <- tmp %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)
Update
For the updated case, add one mutate call to redefine the factor level of R1 and R2. The rest is the same.
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
dt <- tmp %>%
mutate(R1 = factor(R1, levels = tl), R2 = factor(R2, levels = tl)) %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)
Here is an option using data.table
library(data.table)
setDT(tmp)[(as.integer(R1) - as.integer(R2))%%2 != 0, c('R2', 'R1') := .(R1, R2)]
tmp
# x R1 R2
#1: 1 B B
#2: 2 A B
#3: 3 D C
#4: 4 B E
#5: 5 B D
#6: 6 E C
#7: 7 D E
#8: 8 E D
#9: 9 D B
#10:10 D A
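One caveat: the answers above rely on R1 and R2 being factors. Since R 4.0.0, data.frame() keeps strings as character by default, so as.numeric() on those columns would return NA. A minimal sketch that sidesteps this by matching against an explicit level vector is shown below; `tl` is assumed here to be the full combined level set A to F, and `odd` and `swapped` are hypothetical names.

```r
set.seed(1)
tmp <- data.frame(x = 1:10,
                  R1 = sample(LETTERS[1:5], 10, replace = TRUE),
                  R2 = sample(LETTERS[2:6], 10, replace = TRUE),
                  stringsAsFactors = FALSE)

tl <- LETTERS[1:6]  # assumed combined level set of R1 and R2

# Rows where the difference between the level positions is odd.
odd <- (match(tmp$R2, tl) - match(tmp$R1, tl)) %% 2 != 0

# Swap R1 and R2 on those rows only.
swapped <- tmp
swapped$R1[odd] <- tmp$R2[odd]
swapped$R2[odd] <- tmp$R1[odd]
swapped
```

The same idea covers the first case as well by setting `tl` to the shared levels A to E.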
I'm afraid I can't find an answer to my problem.
I am looking to create a randomized assignment of cards to people:
1) There are 4 sets of cards (A, B, C, D) and 16 cards in total.
2) Each card is numbered within a set (A from 1 to 4, B from 5 to 8, and so on).
3) We want to randomize the assignment such that each person is randomly assigned a set of cards, for example A.
4) In addition, the order of the cards within the set has to be randomized.
So what we want is the following:
Person 1: Set A, cards 1-2-3-4
Person 2: Set A, cards 4-2-3-1
Person 3: Set D, cards 16-15-12-13
and so on.
I would also like each number to be in a separate column.
Thanks for your help!
S.
If each person gets one set of cards (ordering by the bare column name card1 below relies on data.table):
> library(data.table)
> df=NULL
> a=rep(LETTERS[1:4],4)
> df$card1=sample(a,16,F)
> df=as.data.table(df)
> df=df[order(card1),]
> df
card1
1: A
2: A
3: A
4: A
5: B
6: B
7: B
8: B
9: C
10: C
11: C
12: C
13: D
14: D
15: D
16: D
> df$card2=rep((1:4),4)
> df
card1 card2
1: A 1
2: A 2
3: A 3
4: A 4
5: B 1
6: B 2
7: B 3
8: B 4
9: C 1
10: C 2
11: C 3
12: C 4
13: D 1
14: D 2
15: D 3
16: D 4
> df1=df[sample(nrow(df)),]
> df1
card1 card2
1: A 2
2: D 4
3: C 3
4: D 3
5: B 3
6: D 1
7: C 2
8: A 3
9: B 2
10: D 2
11: B 1
12: A 1
13: C 4
14: C 1
15: B 4
16: A 4
Here's one way of approaching this.
person <- c("Person1", "Person2", "Person3", "Person4")
cardset <- LETTERS[1:4]
set.seed(357) # this is for reproducibility
xy <- data.frame(
person = sample(person), # pick out persons in a random order
set = sample(cardset)) # assign a random card set to a person
vx <- rep(xy$set, each = 4) # for each set, create repeats
vy <- split(paste(vx, rep(1:4, times = 4), sep = ""), f = vx) # append numbers to it
vz <- do.call(rbind, sapply(vy, FUN = sample, simplify = FALSE)) # shuffle using sapply and stitch together with do.call
cbind(xy, vz[as.character(xy$set), ]) # reorder the shuffled rows to match each person's set, then add them to the original data
person set 1 2 3 4
C Person1 C C2 C3 C4 C1
B Person4 B B2 B1 B4 B3
D Person3 D D1 D2 D4 D3
A Person2 A A4 A3 A2 A1
Here's another option:
# create data frame of decks and their numbered cards
cards <- data.frame(deck = rep(LETTERS[1:4], each = 4),
numbers = c(1:16),
stringsAsFactors = FALSE)
# create list of people
people <- c("Person1", "Person2", "Person3")
# loop through each person and randomly select a deck
# based on deck selected, subset the cards that can be used
# randomize the numbered cards
# add the deck, order of cards, and person to a
# growing data frame of assignments
assignment <- NULL
for(i in unique(people)) {
set <- sample(cards$deck, size = 1)
setCards <- cards[cards$deck == set, ]
orderCards <- sample(setCards$numbers)
assignment <- rbind(assignment, data.frame(Person = i,
Deck = set,
Card1 = orderCards[1],
Card2 = orderCards[2],
Card3 = orderCards[3],
Card4 = orderCards[4],
stringsAsFactors = FALSE))
}
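For comparison, a loop-free sketch of the same idea in base R is given below: each person draws a set at random, the four cards of that set are shuffled, and each card lands in its own column. The number of people, the seed, and the names `sets`, `assigned`, `shuffled`, and `result` are assumptions made for illustration.

```r
set.seed(1)    # assumed seed, only for reproducibility
n_people <- 5  # assumed number of people

# Cards numbered within each set: A holds 1-4, B holds 5-8, and so on.
sets <- list(A = 1:4, B = 5:8, C = 9:12, D = 13:16)

# Randomly assign one set per person (sets may repeat across people).
assigned <- sample(names(sets), n_people, replace = TRUE)

# Shuffle the four cards of each assigned set; one row per person.
shuffled <- t(vapply(assigned, function(s) sample(sets[[s]]),
                     integer(4), USE.NAMES = FALSE))
colnames(shuffled) <- paste0("card", 1:4)

result <- data.frame(person = paste0("Person", seq_len(n_people)),
                     set = assigned, shuffled)
result
```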
I have some large data sets and am trying out data.table to combine them while summing the shared column over matching rows. I know how to join with [ to match rows of the LHS data.table, as shown below with tables a2 (LHS) and a (RHS).
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to "Merge data frames whilst summing common columns in R", I saw how columns could be summed over matching rows, like this:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
However, I am trying to retain the rows that don't match as well.
Alternatively, I could use merge as shown below, but I would like to avoid making a new table with 4 columns before reducing it to 2 columns.
c <- merge(a, a2, by = "b", all=T)
c <- transform(c, value = rowSums(c[,2:3], na.rm=T))
c <- c[,c(1,4)]
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This last table is what I would like to achieve. Thanks in advance.
merge is likely to not be very efficient for the end result you are after. Since both of your data.tables have the same structure, I would suggest rbinding them together and taking the sum by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you have to first put your data.tables in a list).
Compare some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
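On the small tables from the question, the same approach can be checked by eye. This is a minimal sketch that rebuilds `a` and `a2` from scratch (so a2 has not been modified by the earlier `:=` update); the column name `value` and the use of keyby for sorted output are choices made here for illustration.

```r
library(data.table)

a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Stack both tables, then sum c within each key; unmatched keys keep their own value.
res <- rbindlist(list(a, a2))[, .(value = sum(c)), keyby = b]
res
```

Keys present in both tables (a to e) are summed, while keys present in only one table (f to j, k to o) pass through unchanged.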
I have a dataframe df with three categorical variables (cat1, cat2, cat3) and two continuous variables (con1, con2). I would like to apply a list of functions (sd, mean) to a list of columns (con1, con2) for the different combinations of the grouping columns cat1, cat2, cat3. So far I have done this by explicitly subsetting every combination.
# Random generation of values for categorical data
set.seed(33)
df <- data.frame(cat1 = sample( LETTERS[1:2], 100, replace=TRUE ),
cat2 = sample( LETTERS[3:5], 100, replace=TRUE ),
cat3 = sample( LETTERS[2:4], 100, replace=TRUE ),
con1 = runif(100,0,100),
con2 = runif(100,23,45))
# Introducing null values
df$con1[c(23,53,92)] <- NA
df$con2[c(33,46)] <- NA
results <- data.frame()
funs <- list(sd=sd, mean=mean)
# calculation of mean and sd on total observations
sapply(funs, function(x) sapply(df[,c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1
sapply(funs, function(x) sapply(df[df$cat1=='A',c(4,5)], x, na.rm=T))
sapply(funs, function(x) sapply(df[df$cat1=='B',c(4,5)], x, na.rm=T))
# calculation of mean and sd on different levels of cat1 and cat2
sapply(funs, function(x) sapply(df[df$cat1=='A' & df$cat2=='C' ,c(4,5)], x, na.rm=T))
.
.
.
sapply(funs, function(x) sapply(df[df$cat1=='B' & df$cat2=='E' ,c(4,5)], x, na.rm=T))
# Similarly for the combinations of three cat variables cat1, cat2, cat3
I would like to write a function that dynamically computes the list of functions on the list of columns for the different combinations. Could you please give some suggestions? Thanks!
Edit:
I have already received some smart suggestions using dplyr. It would be great if someone could provide suggestions using the apply family of functions, as that would help me use the resulting data frames in further requirements.
This is a simple one-line base solution:
> do.call(cbind, lapply(funs, function(x) aggregate(cbind(con1, con2) ~ cat1 + cat2 + cat3, data = df, FUN = x, na.rm = TRUE)))
sd.cat1 sd.cat2 sd.cat3 sd.con1 sd.con2 mean.cat1 mean.cat2 mean.cat3 mean.con1 mean.con2
1 A C B NA NA A C B 25.52641 37.40603
2 B C B 32.67192 6.966547 B C B 46.70387 34.85437
3 A D B 31.05224 6.530313 A D B 37.91553 37.13142
4 B D B 23.80335 6.001468 B D B 59.75107 30.29681
5 A E B 22.79285 1.526472 A E B 38.54742 25.23007
6 B E B 32.92139 2.621067 B E B 51.56253 29.52367
7 A C C 26.98661 5.710335 A C C 36.32045 36.42465
8 B C C 20.22217 8.117184 B C C 60.60036 34.98460
9 A D C 33.39273 7.367412 A D C 40.77786 35.03747
10 B D C 12.95351 8.829061 B D C 49.77160 33.21836
11 A E C 33.73433 4.689548 A E C 55.53135 32.38279
12 B E C 25.38637 9.172137 B E C 46.69063 31.56733
13 A C D 36.12545 6.323929 A C D 48.34187 32.36789
14 B C D 30.01992 7.130869 B C D 53.87571 33.12760
15 A D D 15.94151 11.756115 A D D 35.89909 31.76871
16 B D D 10.89030 6.829829 B D D 22.86577 32.53725
17 A E D 24.88410 6.108631 A E D 47.32549 35.22782
18 B E D 12.73711 8.151424 B E D 33.95569 36.70167
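The one-liner above covers the three-way grouping only. A possible extension to every combination of the categorical columns (cat1 alone, cat1 + cat2, and so on) is sketched below with lapply and combn. The names `cats`, `cons`, `combos`, and `results` are hypothetical. Note that this sketch uses the `by =` interface of aggregate, which applies na.rm per column, whereas the formula interface by default drops any row with an NA in either column, so the numbers can differ slightly from the output above.

```r
set.seed(33)
df <- data.frame(cat1 = sample(LETTERS[1:2], 100, replace = TRUE),
                 cat2 = sample(LETTERS[3:5], 100, replace = TRUE),
                 cat3 = sample(LETTERS[2:4], 100, replace = TRUE),
                 con1 = runif(100, 0, 100),
                 con2 = runif(100, 23, 45))
df$con1[c(23, 53, 92)] <- NA
df$con2[c(33, 46)] <- NA

funs <- list(sd = sd, mean = mean)
cats <- c("cat1", "cat2", "cat3")
cons <- c("con1", "con2")

# All non-empty subsets of the grouping columns (7 in total).
combos <- unlist(lapply(seq_along(cats),
                        function(k) combn(cats, k, simplify = FALSE)),
                 recursive = FALSE)

results <- lapply(combos, function(by_cols) {
  # One aggregated table per function, with columns renamed like sd.con1.
  per_fun <- lapply(names(funs), function(fn) {
    out <- aggregate(df[cons], by = df[by_cols], FUN = funs[[fn]], na.rm = TRUE)
    names(out)[names(out) %in% cons] <- paste(fn, cons, sep = ".")
    out
  })
  # Join the per-function tables on the grouping columns.
  Reduce(function(x, y) merge(x, y, by = by_cols), per_fun)
})
names(results) <- vapply(combos, paste, character(1), collapse = "+")

results[["cat1"]]
```

Each element of `results` is a data frame for one grouping, so `results[["cat1+cat2"]]` holds the sd and mean of con1 and con2 for every observed cat1/cat2 pair.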