I have two lists that contain the same number of dataframes, and the order of the dataframes in both lists indicates which dataframes belong together. In other words, the first dataframe in the first list goes together with the first dataframe in the second list, the second with the second, etc. I want to merge the dataframes in both lists with each other, but only the dataframes that belong together. Let's say the first list has these three dataframes:
df1:
id var1
1 0.2
2 0.1
3 0.4
4 0.3
df2:
id var1
1 0.2
6 0.5
df3:
id var1
1 0.2
3 0.1
6 0.4
And the second list has the following dataframes:
df1:
id var2
1 A
2 B
3 C
4 C
df2:
id var2
1 B
6 B
df3:
id var2
1 A
3 D
6 D
I would like to merge them based on the variable "id", and the end result then to be the following:
df1:
id var1 var2
1 0.2 A
2 0.1 B
3 0.4 C
4 0.3 C
df2:
id var1 var2
1 0.2 B
6 0.5 B
df3:
id var1 var2
1 0.2 A
3 0.1 D
6 0.4 D
How do I do this?
First list of data-sets:
list1<-list(df1,df2,df3)
Second list of data sets:
list2<-list(df1,df2,df3)
result:
lapply(seq_along(list1), function(x) merge(list1[[x]], list2[[x]], by = "id"))
Using either base R or tidyverse:
Map(merge,l1,l2)
library(tidyverse)
map2(l1,l2,inner_join)
# [[1]]
# id a b
# 1 1 0.1 A
# 2 2 0.2 B
#
# [[2]]
# id a b
# 1 1 0.1 A
# 2 2 0.2 B
#
# [[3]]
# id a b
# 1 1 0.1 A
# 2 2 0.2 B
#
data
l1 <- replicate(3, data.frame(id = 1:2, a = c(0.1, 0.2)), simplify = FALSE)
l1
# [[1]]
# id a
# 1 1 0.1
# 2 2 0.2
#
# [[2]]
# id a
# 1 1 0.1
# 2 2 0.2
#
# [[3]]
# id a
# 1 1 0.1
# 2 2 0.2
l2 <- replicate(3, data.frame(id = 1:2, b = c("A", "B")), simplify = FALSE)
l2
# [[1]]
# id b
# 1 1 A
# 2 2 B
#
# [[2]]
# id b
# 1 1 A
# 2 2 B
#
# [[3]]
# id b
# 1 1 A
# 2 2 B
#
Use Map like this:
Map(merge, L1, L2)
giving:
$`df1`
id var1 var2
1 1 0.2 A
2 2 0.1 B
3 3 0.4 C
4 4 0.3 C
$df2
id var1 var2
1 1 0.2 B
2 6 0.5 B
$df3
id var1 var2
1 1 0.2 A
2 3 0.1 D
3 6 0.4 D
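Map(merge, L1, L2) relies on merge() joining on whatever column names the two data frames have in common. If the paired data frames might share columns other than id, it may be safer to name the key explicitly. A minimal sketch, assuming two small made-up lists (first and second are hypothetical names, not the L1 and L2 used above):

```r
# Hypothetical toy lists with the same structure as in the question
first  <- list(df1 = data.frame(id = c(1, 2), var1 = c(0.2, 0.1)),
               df2 = data.frame(id = c(1, 6), var1 = c(0.2, 0.5)))
second <- list(df1 = data.frame(id = c(1, 2), var2 = c("A", "B")),
               df2 = data.frame(id = c(1, 6), var2 = c("B", "B")))

# Pairing by position only makes sense if both lists have the same length
stopifnot(length(first) == length(second))

# Merge the i-th element of each list, joining on "id" only
merged <- Map(function(x, y) merge(x, y, by = "id"), first, second)
merged$df2
```

Map() takes the names of the result from its first list argument, so the merged data frames keep the names df1 and df2.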
Note
The input lists in reproducible form are:
Lines1 <- "df1:
id var1
1 0.2
2 0.1
3 0.4
4 0.3
df2:
id var1
1 0.2
6 0.5
df3:
id var1
1 0.2
3 0.1
6 0.4"
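Read <- function(Lines) {
  L <- readLines(textConnection(Lines))
  ix <- grep(":", L)            # positions of the "dfN:" header lines
  nms <- sub(":", "", L[ix])    # data frame names without the colon
  # assign every remaining line to the most recent header above it
  g <- nms[cumsum(seq_along(L) %in% ix)[-ix]]
  lapply(split(L[-ix], g), function(x) read.table(text = x, header = TRUE))
}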
L1 <- Read(Lines1)
and
Lines2 <- "df1:
id var2
1 A
2 B
3 C
4 C
df2:
id var2
1 B
6 B
df3:
id var2
1 A
3 D
6 D"
L2 <- Read(Lines2)
I would like to stack my dataset so all observations relate to all other observations but itself.
Suppose I have the following dataset:
df <- data.frame(id = c("a", "b", "c", "d" ),
x1 = c(1,2,3,4))
df
id x1
1 a 1
2 b 2
3 c 3
4 d 4
I would like observation a to be related to b, c, and d. And the same for every other observation. The result should look like something like this:
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
So observation a is related to b,c,d. Observation b is related to a, c,d. And so on. Any ideas?
Another option:
library(dplyr)
left_join(df, df, by = character()) %>%
filter(id.x != id.y)
Or
output <- merge(df, df, by = NULL)
output = output[output$id.x != output$id.y,]
Thanks @ritchie-sacramento, I didn't know about the by = NULL option for merge before, and thanks @zephryl for the by = character() option for dplyr joins.
tidyr::expand_grid() accepts data frames, which can then be filtered to remove rows that share the id:
library(tidyr)
library(dplyr)
expand_grid(df, df, .name_repair = make.unique) %>%
filter(id != id.1)
# A tibble: 12 × 4
id x1 id.1 x1.1
<chr> <dbl> <chr> <dbl>
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 a 1
5 b 2 c 3
6 b 2 d 4
7 c 3 a 1
8 c 3 b 2
9 c 3 d 4
10 d 4 a 1
11 d 4 b 2
12 d 4 c 3
You can use combn() to get all combinations of row indices, then assemble your dataframe from those:
rws <- cbind(combn(nrow(df), 2), combn(nrow(df), 2, rev))
df2 <- cbind(df[rws[1, ], ], df[rws[2, ], ])
# clean up row and column names
rownames(df2) <- 1:nrow(df2)
colnames(df2) <- c("id", "x1", "id2", "x2")
df2
id x1 id2 x2
1 a 1 b 2
2 a 1 c 3
3 a 1 d 4
4 b 2 c 3
5 b 2 d 4
6 c 3 d 4
7 b 2 a 1
8 c 3 a 1
9 d 4 a 1
10 c 3 b 2
11 d 4 b 2
12 d 4 c 3
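The same idea can also be sketched in base R with expand.grid() over row indices, which keeps the row order asked for in the question (all partners of a first, then b, and so on). This is only an illustration; idx and all_pairs are made-up names:

```r
df <- data.frame(id = c("a", "b", "c", "d"),
                 x1 = c(1, 2, 3, 4))

# All (i, j) index pairs; j is listed first so that it varies fastest
idx <- expand.grid(j = seq_len(nrow(df)), i = seq_len(nrow(df)))

# Drop the pairs where a row would be matched with itself
idx <- idx[idx$i != idx$j, ]

# Assemble the result from the two sets of row indices
all_pairs <- data.frame(id  = df$id[idx$i], x1 = df$x1[idx$i],
                        id2 = df$id[idx$j], x2 = df$x1[idx$j])
all_pairs
```

Because the column names are set when the result is built, no clean-up of row or column names is needed afterwards.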
I did some searches but could not find the best keywords to phrase my question, so I will ask it here.
I am dealing with a data frame in R that has two variables representing the identity of the data points. In the following example, A and 1 represent the same individual, B and 2 are the same, and so are C and 3, but they are mixed in the original data.
ID1 ID2 Value
A 1 0.5
B 2 0.8
C C 0.7
A A 0.6
B 2 0.3
3 C 0.4
2 2 0.3
1 A 0.4
3 3 0.6
What I want to achieve is to unify the identity by using only one of the identifiers so it can be either:
ID1 ID2 Value ID
A 1 0.5 A
B 2 0.8 B
C C 0.7 C
A A 0.6 A
B 2 0.3 B
3 C 0.4 C
2 2 0.3 B
1 A 0.4 A
3 3 0.6 C
or:
ID1 ID2 Value ID
A 1 0.5 1
B 2 0.8 2
C C 0.7 3
A A 0.6 1
B 2 0.3 2
3 C 0.4 3
2 2 0.3 2
1 A 0.4 1
3 3 0.6 3
I can probably achieve it by using the ifelse function, but that means I have to write two ifelse statements for each condition, which does not seem efficient, so I was wondering if there is a better way to do it. Here is the example data set.
df=data.frame(ID1=c("A","B","C","A","B","3","2","1","3"),
ID2=c("1","2","C","A","2","C","2","A","3"),
Value=c(0.5,0.8,0.7,0.6,0.3,0.4,0.3,0.4,0.6))
Thank you so much for the help!
Edit:
To clarify, the two identifiers I have in my real data are longer strings of text instead of just ABC and 123. Sorry I did not make that clear.
An option is to detect the elements that consist only of digits, convert them to integer, then get the corresponding LETTERS in case_when
library(dplyr)
library(stringr)
df %>%
  mutate(ID = case_when(str_detect(ID1, '^\\d+$') ~
                          LETTERS[as.integer(ID1)], TRUE ~ ID1))
# ID1 ID2 Value ID
#1 A 1 0.5 A
#2 B 2 0.8 B
#3 C C 0.7 C
#4 A A 0.6 A
#5 B 2 0.3 B
#6 3 C 0.4 C
#7 2 2 0.3 B
#8 1 A 0.4 A
#9 3 3 0.6 C
Or more compactly
df %>%
mutate(ID = coalesce(LETTERS[as.integer(ID1)], ID1))
If we have different sets of values, then create a key/value dataset and do a join
keyval <- data.frame(ID1 = c('1', '2', '3'), ID = c('A', 'B', 'C'))
left_join(df, keyval, by = "ID1") %>% mutate(ID = coalesce(ID, ID1))
A base R option using replace
within(
df,
ID <- replace(
ID1,
!ID1 %in% LETTERS,
LETTERS[as.numeric(ID1[!ID1 %in% LETTERS])]
)
)
or ifelse
within(
df,
ID <- suppressWarnings(ifelse(ID1 %in% LETTERS,
ID1,
LETTERS[as.integer(ID1)]
))
)
which gives
ID1 ID2 Value ID
1 A 1 0.5 A
2 B 2 0.8 B
3 C C 0.7 C
4 A A 0.6 A
5 B 2 0.3 B
6 3 C 0.4 C
7 2 2 0.3 B
8 1 A 0.4 A
9 3 3 0.6 C
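Since the real identifiers are longer strings, the LETTERS trick will not carry over directly. A base R sketch of the key/value idea, using a named character vector as the lookup; the lookup contents here are assumptions made up for the example data, and in practice they would hold the real identifier pairs:

```r
df <- data.frame(ID1 = c("A", "B", "C", "A", "B", "3", "2", "1", "3"),
                 ID2 = c("1", "2", "C", "A", "2", "C", "2", "A", "3"),
                 Value = c(0.5, 0.8, 0.7, 0.6, 0.3, 0.4, 0.3, 0.4, 0.6))

# Hypothetical lookup: names are the alternate identifiers,
# values are the identifiers to keep
lookup <- c("1" = "A", "2" = "B", "3" = "C")

# Index by name; identifiers that are not in the lookup come back as NA
id1 <- as.character(df$ID1)
mapped <- unname(lookup[id1])

# Keep the original identifier wherever no translation was found
df$ID <- ifelse(is.na(mapped), id1, mapped)
df
```

The lookup can have any number of entries, so the same code works when the identifiers are arbitrary strings rather than digits and letters.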
df
var1 var2 var3
1 a 1 0.5
2 b 2 5
3 a 3 12
4 c 6 0
5 d 88 0
6 b 0 0
df2
var1 var2 var3
1 k 1 0.5
2 l 0.6 5
3 k 3 12
4 k 6 0
5 v 12 0
list <- list(df,df2)
for(i in list){
i %>%
group_by(var1) %>%
summarise(sum = sum(var1))
}
Whenever var1 is equal, I want those rows to be summed up into a single new row. The list of data.frames should then contain only data.frames in which every var1 value appears once, with the other columns added up. I took the loop from "sum of rows when condition is met- data.frame in R", but I was not satisfied with the answers.
The result should look like this
df
var1 var2 var3
1 a 4 12.5
2 b 2 5
4 c 6 0
5 d 88 0
df2
var1 var2 var3
1 k 10 12.5
2 l 0.6 5
3 v 12 0
My real list contains a lot of data.frames with a lot of rows and columns.
Thank you
Tidy-version:
df <- read.table(text = "var1 var2 var3
1 a 1 0.5
2 b 2 5
3 a 3 12
4 c 6 0
5 d 88 0
6 b 0 0", stringsAsFactors = F, header = T)
df2 <- read.table(text = "var1 var2 var3
1 k 1 0.5
2 l 0.6 5
3 k 3 12
4 k 6 0
5 v 12 0", stringsAsFactors = FALSE, header = TRUE)
l <- list(df = df, df2 = df2) # please use other name than "list"
library(tidyverse)
l <- map(l, ~.x %>%
group_by(var1) %>%
summarise_all(list(sum)) %>%
ungroup())
l
# $df
# # A tibble: 4 x 3
# var1 var2 var3
# <chr> <int> <dbl>
# 1 a 4 12.5
# 2 b 2 5
# 3 c 6 0
# 4 d 88 0
#
# $df2
# # A tibble: 3 x 3
# var1 var2 var3
# <chr> <dbl> <dbl>
# 1 k 10 12.5
# 2 l 0.6 5
# 3 v 12 0
In base you can use aggregate in lapply to sum up per group.
lapply(list, function(x) aggregate(.~var1, x, sum))
#lapply(list, function(x) aggregate(x[,-1], as.list(x[1]), sum)) #Alternative
#[[1]]
# var1 var2 var3
#1 a 4 12.5
#2 b 2 5.0
#3 c 6 0.0
#4 d 88 0.0
#
#[[2]]
# var1 var2 var3
#1 k 10.0 12.5
#2 l 0.6 5.0
#3 v 12.0 0.0
or using rowsum with groups in the rownames:
lapply(list, function(x) rowsum(x[,-1], x[,1]))
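With rowsum() the group labels end up in the row names rather than in a var1 column. If the original layout is wanted, they can be moved back; a small sketch for a single data frame, assuming the grouping column comes first (sums and out are made-up names):

```r
df <- data.frame(var1 = c("a", "b", "a", "c", "d", "b"),
                 var2 = c(1, 2, 3, 6, 88, 0),
                 var3 = c(0.5, 5, 12, 0, 0, 0))

# Sum all remaining columns within each value of the first column
sums <- rowsum(df[, -1], df[, 1])

# Move the group labels from the row names back into a var1 column
out <- data.frame(var1 = rownames(sums), sums, row.names = NULL)
out
```

Wrapping these two steps in a function and passing it to lapply() would apply them to every data frame in the list.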
The code is fine, you just need to put it in a function and use lapply. Here I've used data.table, but you could also put that dplyr code in a function and use that as the second argument to lapply.
library(data.table)
lapply(mylist, function(df) setDT(df)[, lapply(.SD, sum), var1])
I have a simple for-loop which works as I would like on vectors, I would like to use my for-loop on a column of a dataframe grouped by another column in the dataframe e.g.:
# here is my for-loop working as expected on a simple vector:
vect <- c(0.5, 0.7, 0.1)
res <- vector(mode = "numeric", length = 3)
for (i in 1:length(vect)) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
[1] 1.9411537 0.9715143 5.5456579
And here is pseudo-code trying to do it on a column of a dataframe:
#Example data
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
num1 = 1:3, vect = c(0.5, 0.7, 0.1), num3 = NA)
my.df
let num1 vect num3
1 A 1 0.5 NA
2 A 2 0.7 NA
3 A 3 0.1 NA
4 B 1 0.5 NA
5 B 2 0.7 NA
6 B 3 0.1 NA
7 C 1 0.5 NA
8 C 2 0.7 NA
9 C 3 0.1 NA
# My attempt:
require(tidyverse)
my.df <- my.df %>%
group_by(let) %>%
mutate(for (i in 1:length(vect)) {
num3[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
})
What the result should look like (but my pseudo-code above doesn't work):
let num1 vect num3
1 A 1 0.5 1.9411537
2 A 2 0.7 0.9715143
3 A 3 0.1 5.5456579
4 B 1 0.5 1.9411537
5 B 2 0.7 0.9715143
6 B 3 0.1 5.5456579
7 C 1 0.5 1.9411537
8 C 2 0.7 0.9715143
9 C 3 0.1 5.5456579
I feel like I am not using tidyverse logic by trying to have a for-loop inside mutate; any suggestions are much appreciated.
The simple solution is to create a custom function and pass that to mutate. A working solution:
custom_func <- function(vec) {
  res <- numeric(length(vec))   # one result per element of the group
  for (i in seq_along(vec)) {
    # compare element i with every other element of the same vector
    res[i] <- sum(exp(-2 * (vec[i] - vec[-i])))
  }
  res
}
library(tidyverse)
my.df %>%
group_by(let) %>%
mutate(num3 = custom_func(vect))
#> # A tibble: 9 x 4
#> # Groups: let [3]
#> let num1 vect num3
#> <fct> <int> <dbl> <dbl>
#> 1 A 1 0.5 1.94
#> 2 A 2 0.7 0.972
#> 3 A 3 0.1 5.55
#> 4 B 1 0.5 1.94
#> 5 B 2 0.7 0.972
#> 6 B 3 0.1 5.55
#> 7 C 1 0.5 1.94
#> 8 C 2 0.7 0.972
#> 9 C 3 0.1 5.55
I'm wondering whether a more elegant version of the custom function is possible - perhaps someone smarter than me can tell you whether purrr::map, for example, could provide an alternative.
We can use map_dbl from purrr and apply the formula for calculation.
library(dplyr)
library(purrr)
my.df %>%
group_by(let) %>%
mutate(num3 = map_dbl(seq_along(vect), ~ sum(exp(-2 * (vect[.] - vect[-.])))))
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
#1 A 1 0.5 1.94
#2 A 2 0.7 0.972
#3 A 3 0.1 5.55
#4 B 1 0.5 1.94
#5 B 2 0.7 0.972
#6 B 3 0.1 5.55
#7 C 1 0.5 1.94
#8 C 2 0.7 0.972
#9 C 3 0.1 5.55
You can turn your for-loop into a sapply-call and then use it in mutate.
sapply takes a function and applies it to each element of a vector or list. In this case I'm looping over the indices of the elements in each group (1:n()).
my.df %>%
group_by(let) %>%
mutate(num3 = sapply(1:n(), function(i) sum(exp(-2 * (vect[i] - vect[-i])))))
# A tibble: 9 x 4
# Groups: let [3]
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
# 1 A 1 0.5 1.94
# 2 A 2 0.7 0.972
# 3 A 3 0.1 5.55
# 4 B 1 0.5 1.94
# 5 B 2 0.7 0.972
# 6 B 3 0.1 5.55
# 7 C 1 0.5 1.94
# 8 C 2 0.7 0.972
# 9 C 3 0.1 5.55
This is essentially equivalent to the very wrong-looking for-loop inside a mutate call shown next. In this case, however, I'd prefer the custom function provided by A. Stam.
my.df %>%
group_by(let) %>%
mutate(num3 = {
res <- numeric(length = n())
for (i in 1:n()) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
})
You can also replace sapply with purrr's map_dbl.
Or using data.table
library(data.table)
setDT(my.df)[, num3 := unlist(lapply(seq_len(.N),
function(i) sum(exp(-2 * (vect[i] - vect[-i]))))), let]
my.df
# let num1 vect num3
#1: A 1 0.5 1.9411537
#2: A 2 0.7 0.9715143
#3: A 3 0.1 5.5456579
#4: B 1 0.5 1.9411537
#5: B 2 0.7 0.9715143
#6: B 3 0.1 5.5456579
#7: C 1 0.5 1.9411537
#8: C 2 0.7 0.9715143
#9: C 3 0.1 5.5456579
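The loop can also be avoided altogether. For each element, the quantity is the sum of exp(-2 * (v[i] - v[j])) over all j other than i; outer() builds every pairwise difference at once, and since the diagonal term is exp(0) = 1, subtracting 1 from each row sum removes the self-comparison. A base R sketch, where pair_sum is a made-up helper name and ave() does the grouping:

```r
# Vectorised version of the loop: row sums of the pairwise matrix,
# minus the diagonal entry exp(0) = 1
pair_sum <- function(v) rowSums(exp(-2 * outer(v, v, "-"))) - 1

my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
                    num1 = 1:3, vect = c(0.5, 0.7, 0.1))

# ave() applies pair_sum within each level of let and keeps the row order
my.df$num3 <- ave(my.df$vect, my.df$let, FUN = pair_sum)
my.df
```

The same pair_sum() could be passed to mutate() after group_by(let) instead of using ave(). Note that outer() builds an n-by-n matrix per group, so for very large groups the loop may use less memory.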
How do I split a dataframe like
a <- c("a","a","a","b","b","c","c","d","d","d")
b <- c(1,2,3,1,2,1,2,1,2,3)
df <- data.frame(a,b)
into single dataframes that contain only cases of equal length, i.e., all cases with three occurrences in a dataframe and all cases with two occurrences into a separate one?
The output should be:
dfa
a 1
a 2
a 3
d 1
d 2
d 3
dfb
b 1
b 2
c 1
c 2
Have a look at ?split and ?ave:
split(df, ave(df$b, df$a, FUN = length))
#$`2`
# a b
#4 b 1
#5 b 2
#6 c 1
#7 c 2
#
#$`3`
# a b
#1 a 1
#2 a 2
#3 a 3
#8 d 1
#9 d 2
#10 d 3
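To see why this works, it can help to look at the intermediate vector: ave() returns, for every row, the size of the group that row belongs to, and split() then uses those sizes as the grouping key. A sketch of the two steps separately; it uses seq_along() as the input to ave() so that it does not depend on b being numeric, and sizes and parts are made-up names:

```r
df <- data.frame(a = c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d"),
                 b = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3))

# For each row, the number of rows that share its value of a
sizes <- ave(seq_along(df$a), df$a, FUN = length)
sizes

# One data frame per distinct group size, named "2" and "3"
parts <- split(df, sizes)
parts[["3"]]
```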
It's a bit more involved, but you could use droplevels with table
tab <- table(df$a)
lapply(3:2, function(x){
  droplevels(df[df$a %in% names(tab)[tab == x], , drop = FALSE])
})
## [[1]]
## a b
## 1 a 1
## 2 a 2
## 3 a 3
## 8 d 1
## 9 d 2
## 10 d 3
## [[2]]
## a b
## 4 b 1
## 5 b 2
## 6 c 1
## 7 c 2