I'm trying to create a dataset for each combination of rows from separate groups. Ideally, one row from each group would be selected and there would be a dataset for every combination. I have a dataset that looks similar in structure to the sample below:
Name Group Stat1 Stat2
1 1 a 63 38
2 2 a 33 62
3 3 b 3 66
4 4 b 57 67
5 5 c 42 69
6 6 c 47 14
7 7 c 16 10
8 8 d 21 46
9 9 d 72 1
I'd like the first resulting dataset to look like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 8 d 21 46
With the second dataset looking like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 9 d 72 1
And so on, until every combination has been exhausted. I've tried strategies using apply functions and combn but cannot seem to get the result I want. This does not seem too challenging to me conceptually, so I'm not sure what I'm missing.
Any help would be greatly appreciated! Thanks in advance!
Lots of ways to approach this. A simple solution is to just generate all 4-row combos, then subset to those with all distinct Group values. I named your data df and assumed Name is a unique row id that matches the row number. If that's not true, you could replace df$Name with 1:nrow(df).
# All 4 row combos of row ids
combs <- combn(df$Name, 4)
# Match group labels to row ids
g <- matrix(df$Group[combs], nrow = 4)
# 4 row combs filtered to all distinct group vals
combs <- combs[,apply(g, 2, function(i) all(!duplicated(i)))]
# For each 4 row combo, extract rows from the dataframe
final_list <- apply(combs, 2, function(i) df[i,])
final_list[1:3]
[[1]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
8 8 d 21 46
[[2]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
9 9 d 72 1
[[3]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
6 6 c 47 14
8 8 d 21 46
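If the number of rows grows, filtering all combn() results can get expensive. A possible alternative (just a sketch, not part of the answer above) is to build only the valid combinations with expand.grid() over the row indices of each group. The names rows_by_group, picks and combo_list are made up for this example, and the order of the resulting list differs from the combn() version.

```r
# Sample data from the question (Name doubles as the row id)
df <- data.frame(
  Name  = 1:9,
  Group = c("a", "a", "b", "b", "c", "c", "c", "d", "d"),
  Stat1 = c(63, 33, 3, 57, 42, 47, 16, 21, 72),
  Stat2 = c(38, 62, 66, 67, 69, 14, 10, 46, 1)
)

# Row indices split by group: one vector of candidate rows per group
rows_by_group <- split(seq_len(nrow(df)), df$Group)

# Every way of picking exactly one row index from each group
picks <- expand.grid(rows_by_group)

# One data frame per combination
combo_list <- lapply(seq_len(nrow(picks)), function(k) df[unlist(picks[k, ]), ])

length(combo_list)  # 2 * 2 * 3 * 2 = 24 combinations
```

Each element of combo_list holds exactly one row per group, so no duplicate check is needed afterwards.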
I have several separate data frames that I would like to keep separated because merging them together would create a very large object.
However, there are variables from another data frame that I would like to merge with all of them now.
Here is an example of what I would like to do:
df1 <- data.frame(ID1 = c(1:10), Var1 = rep(c(1,0),5))
df2 <- data.frame(ID1 = c(1:10), Var2 = c(21:30))
dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
mergewith <- data.frame(ID1 = c(1:10), ID2 = c(41:50))
My goal is that df1 and df2 will look like this:
df1
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
df2
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
What I have tried so far is:
dat = lapply(dfs,function(x){
merge(names(x), mergewith, by = "ID1");x})
list2env(dat,.GlobalEnv)
However, then I get the following message:
"'by' must specify a uniquely valid column"
Is it possible to do this without using a loop?
You can try Map
> Map(function(x, y) merge(x, y, by = "ID1"), dfs, list(mergewith))
[[1]]
ID1 Var1 ID2
1 1 1 41
2 2 0 42
3 3 1 43
4 4 0 44
5 5 1 45
6 6 0 46
7 7 1 47
8 8 0 48
9 9 1 49
10 10 0 50
[[2]]
ID1 Var2 ID2
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
You can use lapply to merge all the dataframes in dfs with mergewith. Use list2env to get the changed dataframes in the global environment.
list2env(lapply(dfs, function(x) merge(x, mergewith, by = 'ID1')), .GlobalEnv)
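For what it's worth, the attempt in the question fails because merge() is handed names(x), a character vector with no ID1 column, and the trailing ;x then returns the unmerged data frame anyway. Here is a minimal self-contained sketch that keeps the results in a list instead of writing to the global environment. The explicit list() and all.x = TRUE (keep rows of x that have no match) are my own assumptions, not something the question asked for.

```r
df1 <- data.frame(ID1 = 1:10, Var1 = rep(c(1, 0), 5))
df2 <- data.frame(ID1 = 1:10, Var2 = 21:30)
mergewith <- data.frame(ID1 = 1:10, ID2 = 41:50)

# Build the list explicitly so mergewith itself is not picked up by ls()
dfs <- list(df1 = df1, df2 = df2)

# merge() needs the data frame itself (x), not names(x), as its first argument
merged <- lapply(dfs, function(x) merge(x, mergewith, by = "ID1", all.x = TRUE))

names(merged)      # "df1" "df2"
names(merged$df1)  # "ID1" "Var1" "ID2"
```

If you do want the originals overwritten, passing merged to list2env() as shown above works because the list is named.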
I'm trying to combine data frames (hundreds of them), but they have different numbers of rows.
df1 <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2 <- data.frame(c(7,5,3,4,5,8,5), c(43,56,23,78,89,45,78))
df3 <- data.frame(c(7,5,3,4,5,8,5,6,7), c(43,56,23,78,89,45,78,56,67))
colnames(df1) <- c("xVar1","xVar2")
colnames(df2) <- c("yVar1","yVar2")
colnames(df3) <- c("zVar1","zVar2")
a1 <- list(df1,df2,df3)
a1 is what my initial data actually looks like when I get it.
Now if I do:
b1 <- as.data.frame(a1)
I get an error, because the # of rows is not the same in the data (this would work fine if the # of rows was the same).
How do I make the # of rows equal or work around this issue?
I would like to be able to merge the data in this way (here is a working example with the same # of rows):
df1b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
df2b <- data.frame(c(7,5,3,4,6), c(43,56,24,48,89))
df3b <- data.frame(c(7,5,3,4,5), c(43,56,23,78,89))
colnames(df1b) <- c("xVar1","xVar2")
colnames(df2b) <- c("yVar1","yVar2")
colnames(df3b) <- c("zVar1","zVar2")
a2 <- list(df1b,df2b,df3b)
b2 <- as.data.frame(a2)
Thanks!
cbind.fill from rowr provides functionality for this and fills missing elements with NA:
library(purrr)
library(rowr)
b1 <- purrr::reduce(a1,cbind.fill,fill=NA)
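If you would rather not depend on an extra package, the same padding idea can be sketched in base R: extend every data frame to the length of the longest one, then cbind. This is only an illustration of the technique, not the rowr implementation, and n_max and padded are names I made up.

```r
df1 <- data.frame(xVar1 = c(7, 5, 3, 4, 5), xVar2 = c(43, 56, 23, 78, 89))
df2 <- data.frame(yVar1 = c(7, 5, 3, 4, 5, 8, 5), yVar2 = c(43, 56, 23, 78, 89, 45, 78))
df3 <- data.frame(zVar1 = c(7, 5, 3, 4, 5, 8, 5, 6, 7), zVar2 = c(43, 56, 23, 78, 89, 45, 78, 56, 67))
a1 <- list(df1, df2, df3)

# Target height: the row count of the longest data frame
n_max <- max(vapply(a1, nrow, integer(1)))

# Indexing past the last row yields rows filled with NA
padded <- lapply(a1, function(d) {
  out <- d[seq_len(n_max), , drop = FALSE]
  rownames(out) <- NULL
  out
})

b1 <- do.call(cbind, padded)
dim(b1)  # 9 rows, 6 columns
```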
One can add a key (row count as variable value in this case) to each dataframe then merge by the key.
# get list of dfs (should prob import data into a list of dfs instead)
list_df<-mget(ls(pattern = "df[0-9]"))
#add newcolumn -- "key"
list_df<-lapply(list_df, function(df, newcol) {
df[[newcol]]<-seq(nrow(df))
return(df)
}, "key")
#merge function: full outer join on "key"
MergeAllf <- function(x, y){
merge(x, y, by = "key", all = TRUE)
}
#pass list to merge funct
library(tidyverse)
data <- Reduce(MergeAllf, list_df) %>%
select(key, everything()) #reorder or can drop "key"
data
key xVar1 xVar2 yVar1 yVar2 zVar1 zVar2
1 1 7 43 7 43 7 43
2 2 5 56 5 56 5 56
3 3 3 23 3 23 3 23
4 4 4 78 4 78 4 78
5 5 5 89 5 89 5 89
6 6 NA NA 8 45 8 45
7 7 NA NA 5 78 5 78
8 8 NA NA NA NA 6 56
9 9 NA NA NA NA 7 67
Solution 1
You can achieve this with rbindlist(). Note that the column names will be the column names of the first data frame in the list:
library(data.table)
b1 = data.frame(rbindlist(a1))
> b1
xVar1 xVar2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
Solution 2
Alternatively, you can make all the columns have the same name, then bind by row:
b1 = lapply(a1, setNames, c("Var1","Var2"))
Now you can bind by rows:
b1 = do.call(dplyr::bind_rows, b1)
> b1
Var1 Var2
1 7 43
2 5 56
3 3 23
4 4 78
5 5 89
6 7 43
7 5 56
8 3 23
9 4 78
10 5 89
11 8 45
12 5 78
13 7 43
14 5 56
15 3 23
16 4 78
17 5 89
18 8 45
19 5 78
20 6 56
21 7 67
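One thing to keep in mind with both row-binding solutions is that the stacked result no longer tells you which data frame a row came from. A small base R sketch that adds such a marker follows. The column name source and the shortened toy data are assumptions for illustration only.

```r
a1 <- list(
  data.frame(xVar1 = c(7, 5), xVar2 = c(43, 56)),
  data.frame(yVar1 = c(7, 5, 3), yVar2 = c(43, 56, 23))
)

# Rename columns to a common scheme and tag each row with its list position
tagged <- Map(function(d, i) {
  d <- setNames(d, c("Var1", "Var2"))
  d$source <- i
  d
}, a1, seq_along(a1))

# Stack the tagged data frames by row
stacked <- do.call(rbind, tagged)
```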
Let's say that we have the following data frame:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x)<- c("ID","Visit", "Age")
The first column represents the subject ID, the second the visit number and the third the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is inconsistent with the age at the previous visit? (For example, in row 13, subject C is 66 years old although he was already 84 at the previous visit, and in row 16, subject D is 32 years old although he was already 38 at the previous visit.)
How could I highlight the potential errors and remove rows 13 and 16?
I have tried to aggregate by ID and look at the difference between ages across visits, but I find it hard since the error could occur at any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]));
df;
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
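Note that cbind() on a mix of character and numeric vectors produces a character matrix, so Age in the question's x is not numeric. Here is a hedged sketch of the same per-ID diff idea on numeric columns that flags the suspicious rows rather than dropping them straight away. It assumes rows are already sorted by Visit within each ID, and the column name suspect is made up.

```r
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)

# Within each ID, compare every age with the one from the previous visit;
# the first visit of each subject has no predecessor and is never flagged
x$suspect <- ave(x$Age, x$ID, FUN = function(a) c(0, diff(a)) < 0) == 1

which(x$suspect)            # rows 13 and 16
x_clean <- x[!x$suspect, ]  # same data without the flagged rows
```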
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that it keeps only rows where a, b, and c are all greater than the values in the second data frame for the same class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank this can be done with non-equi joins.
library(data.table)
# coerce to data.table
tmp <- setDT(df1)[
# non-equi join to find which rows of df1 fulfill conditions in df2
setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
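For comparison, the same filter can be sketched in base R without a join: look up each row's thresholds with match() and compare column by column. This is an alternative illustration, not the data.table approach above. The four df1 rows are a hand-picked subset of the question's data, and idx and keep are made-up names.

```r
df1 <- data.frame(a = c(31, 19, 59, 2), b = c(73, 35, 14, 54),
                  c = c(28, 53, 91, 39), class = c(3, 0, 5, 1))
df2 <- data.frame(a = c(8, 0, 14, 7, 18, 17), b = c(15, 3, 4, 10, 18, 17),
                  c = c(13, 6, 0, 6, 16, 11), class = 0:5)

# For every row of df1, find the df2 row holding the thresholds of its class
idx <- match(df1$class, df2$class)

# A row is kept only if all three values exceed their thresholds
keep <- df1$a > df2$a[idx] & df1$b > df2$b[idx] & df1$c > df2$c[idx]

# which() drops rows whose class is missing from df2 (keep would be NA there)
df1[which(keep), ]
```

Here the class 5 row is dropped because b = 14 does not exceed the threshold of 17.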
I am trying to get all combinations of values per group. I want to prevent combinations of values between different groups.
To create all combinations of values (no matter which group the value belongs to), I can use:
expand.grid(value, value)
The expected result should be a subset of the result of the previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#expected result is a subset of allComb.
#Note: the first column shows the row number from allComb.
#Empty rows separate the combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5

34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9

78 2 2
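One possible way to get there (a sketch only, not an accepted answer from this thread) is to run expand.grid() separately inside each group and stack the pieces, so values from different groups never meet. The names per_group and withinComb are made up, and the row names of the result will not match the allComb row numbers shown above.

```r
value <- c(1, 3, 5, 1, 5, 7, 9, 2)
group <- c("a", "a", "a", "b", "b", "b", "b", "c")
base <- data.frame(value, group)

# Build the grid separately inside each group, so values never cross groups
per_group <- lapply(split(base$value, base$group),
                    function(v) expand.grid(Var1 = v, Var2 = v))

# Stack the per-group grids and record which group each pair came from
withinComb <- do.call(rbind, per_group)
withinComb$group <- rep(names(per_group), times = vapply(per_group, nrow, integer(1)))

nrow(withinComb)  # 3^2 + 4^2 + 1^2 = 26
```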