Loop over list of data frames using data.table - r

I have 3 data frames that I'd like to run the same data.table function on. I could do this manually for each data.frame but I'd like to learn how to do it more efficiently.
Using the data.table package, I want to replace the contents of col1 with the contents of col2 only if col1 contains "a". And I want to run this code over three different dataframes. On a single data.frame, this works fine:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
library(data.table)
dt = data.table(df1)
dt[grepl(pattern = "a", x = df1$col1), col1 :=col2]
but I am lost trying to get this to run over multiple dataframes:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"))
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"))
library(data.table)
listdfs = list(df1, df2, df3)
for (i in dt[[]]) {
dt[[i]][grepl(pattern = "a", x = df[[i]]$col1), col1 := col2] }
But this obviously doesn't work because I have no clue what I'm doing with the for loop. Any guidance/teaching would be appreciated. Thanks!

To loop through the list, iterate over the sequence of list indices and do the assignment on each element:
listdfs = list(df1, df2, df3)
lapply(listdfs, setDT) # change the `data.frame` to `data.table`
for (i in seq_along(listdfs)) { # loop over sequence
listdfs[[i]][grepl(pattern = "a", x = col1), col1 := col2]
}
This changes the data.tables inside listdfs as well as the original objects df1, df2 and df3 themselves, because setDT() and := work by reference and no copy was made:
df1
# col1 col2
#1: AA AA # change
#2: AA AA # change
#3: b AA
df2
# col1 col2
#1: b AA
#2: b BB
#3: BB BB # change
df3
# col1 col2
#1: b AA
#2: b AA
#3: b BB
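If modifying df1, df2 and df3 by reference is not wanted, one possible variation is to work on copies. The following is a minimal sketch, assuming the same column names as in the question; the name listdts and the use of copy() before setDT() are illustrative choices, not part of the original answer.

```r
library(data.table)

df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"),
                  stringsAsFactors = FALSE)
listdfs <- list(df1 = df1, df2 = df2)

# copy() makes a deep copy first, so setDT() and := only touch the copy
listdts <- lapply(listdfs, function(d) {
  dt <- setDT(copy(d))
  dt[grepl("a", col1), col1 := col2]  # update matching rows by reference
  dt
})
```

Here listdts holds the updated data.tables while the original data frames keep their values.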

Related

How to join same named sublist elements and add an ID variable

Suppose we have a list like this:
l <- list()
l[[1]] <- list()
l[[2]] <- list()
l[[3]] <- list()
names(l) <- c("A", "B", "C")
l[[1]][[1]] <- data.frame(6)
l[[1]][[2]] <- data.frame(3)
l[[1]][[3]] <- data.frame(8)
l[[1]][[4]] <- data.frame(7)
l[[2]][[1]] <- data.frame(5)
l[[2]][[2]] <- data.frame(4)
l[[2]][[3]] <- data.frame(7)
l[[2]][[4]] <- data.frame(9)
l[[3]][[1]] <- data.frame(1)
l[[3]][[2]] <- data.frame(6)
l[[3]][[3]] <- data.frame(2)
l[[3]][[4]] <- data.frame(8)
names(l[[1]]) <- c("aa", "bb", "cc", "dd")
names(l[[2]]) <- c("aa", "bb", "cc", "dd")
names(l[[3]]) <- c("aa", "bb", "cc", "dd")
I want to create a list l2 which contains 4 elements: aa, bb, cc and dd. Each of these elements would be the dataframe which would contain the values of aa, bb, cc and dd from list l and also an ID variable which would indicate if the element came from the A, B or C element of list l. So if we built the end result from scratch, it would look like this:
l2 <- list()
l2[[1]] <- data.frame(Value = c(6, 5, 1), ID = c("A", "B", "C"))
l2[[2]] <- data.frame(Value = c(3, 4, 6), ID = c("A", "B", "C"))
l2[[3]] <- data.frame(Value = c(8, 7, 2), ID = c("A", "B", "C"))
l2[[4]] <- data.frame(Value = c(7, 9, 8), ID = c("A", "B", "C"))
names(l2) <- c("aa", "bb", "cc", "dd")
However, I cannot build it from scratch, but instead I must "reshape" l to l2. What is the best way to do this? Preferred solution is in purrr.
The key is transpose(), which turns the list of A/B/C sub-lists into a list of aa/bb/cc/dd sub-lists. Setting .id = "ID" in the inner map_dfr() then creates a new column ID that records which sub-list each value came from when the elements are row-bound together.
library(purrr)
l %>%
transpose() %>%
map(~ map_dfr(.x, set_names, "Value", .id = "ID"))
Output
$aa
ID Value
1 A 6
2 B 5
3 C 1
$bb
ID Value
1 A 3
2 B 4
3 C 6
$cc
ID Value
1 A 8
2 B 7
3 C 2
$dd
ID Value
1 A 7
2 B 9
3 C 8
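For comparison, the same "flip the nesting, then bind" idea can be sketched in base R. This is only an illustration on a shortened version of l (two inner elements instead of four); the helper name inner_names is made up for the example.

```r
# Shortened version of the list from the question
l <- list(
  A = list(aa = data.frame(6), bb = data.frame(3)),
  B = list(aa = data.frame(5), bb = data.frame(4)),
  C = list(aa = data.frame(1), bb = data.frame(6))
)

inner_names <- names(l[[1]])

# For each inner name, pull that element out of A, B and C and bind the values
l2 <- lapply(setNames(inner_names, inner_names), function(nm) {
  data.frame(
    ID = names(l),
    Value = unname(vapply(l, function(sub) sub[[nm]][[1]], numeric(1))),
    stringsAsFactors = FALSE
  )
})
```

The result is a list named aa and bb, each a data frame with an ID and a Value column.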

Updating old dataframe with new dataframe in R

I am working to update an old dataframe with data from a new dataframe.
I found this option; it works for some of the fields, but not all. I am not sure how to alter it, as it is beyond my skill set. I tried removing the is.na(x) portion of the ifelse code, and that did not work.
df_old <- data.frame(
bb = as.character(c("A", "A", "A", "B", "B", "B")),
y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
z = 1:6,
aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
bb = as.character(c("A", "A", "A", "B", "A", "A")),
z = 1:6,
aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x), df_old[,cols], df_new[,cols])
The code also changes my bb variable from a character vector to a numeric one. Do I need another call to mapply that handles bb specifically?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final <- merge(df_old, df_new, by = c("z"))
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final <- df_final[, c("bb", "z", "y", "aa")]
df_final
bb z y aa
1 A 1 i NA
2 A 2 ii NA
3 A 3 ii 123
4 B 4 i 1234
5 A 5 iii NA
6 A 6 i 12
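An alternative that avoids the intermediate .x/.y columns is to look up matching rows with match() and update df_old in place. This is a sketch under the same assumption that z is the index; the variable names idx, matched and fill are illustrative.

```r
df_old <- data.frame(
  bb = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "iii", "i"),
  z = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12),
  stringsAsFactors = FALSE)
df_new <- data.frame(
  bb = c("A", "A", "A", "B", "A", "A"),
  z = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12),
  stringsAsFactors = FALSE)

# Row of df_new that corresponds to each row of df_old (NA if no match)
idx <- match(df_old$z, df_new$z)
matched <- !is.na(idx)

# Always take bb from df_new where a match exists
df_old$bb[matched] <- df_new$bb[idx[matched]]

# Only fill aa where the old value is missing
fill <- matched & is.na(df_old$aa)
df_old$aa[fill] <- df_new$aa[idx[fill]]
```

Because each column is assigned separately, bb stays a character vector and aa stays numeric.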

R two tables merge and create new column for repeat values

Good day everyone, I am trying to merge two dataframes into a new dataframe that keeps the unique columns and creates new columns for repeated values.
For example, two dataframes are:
df1
col1 col2
A B
C D
df2
col1 col2 col3
A B E
A B F
C D G
C D H
C D I
Target output is
col1 col2 col3 col4 col5
A B E F
C D G H I
Hope you can help me. Thanks!
I'm not sure whether the final format you are after is actually helpful. However, the first step is a simple left or full join:
df1 <- data.frame(col1 = c("A", "C"),
col2 = c("B", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
col2 = c("B", "B", "D", "D", "D"),
col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = FALSE)
library(tidyverse)
res <- left_join(df1, df2, by = c("col1", "col2"))
res
col1 col2 col3
1 A B E
2 A B F
3 C D G
4 C D H
5 C D I
Getting a result in the desired form is a bit trickier.
First we do the same left join as above, then unite the two columns (col1 and col2) so that we can group and spread by them easily.
Grouping by the united column (fuse), we number each col3 value within its group and paste "col" as a prefix, so that the counter becomes a column name when spreading.
We then spread by the counter column n, filling it with the values of col3.
Finally, we reverse the unite we did earlier.
left_join(df1, df2, by = c("col1", "col2")) %>%
unite(fuse, col1, col2) %>%
group_by(fuse) %>%
mutate(n = paste0("col", 2 + 1:n())) %>%
spread(n, col3) %>%
separate(fuse, c("col1", "col2"))
# A tibble: 2 x 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
1 A B E F NA
2 C D G H I
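The same reshaping can also be sketched with pivot_wider() from tidyr, which takes the place of spread() and removes the need for the unite()/separate() round trip. This is a variation on the answer above rather than the author's code, and it assumes a tidyr version that provides pivot_wider(); the name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(col1 = c("A", "C"),
                  col2 = c("B", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
                  col2 = c("B", "B", "D", "D", "D"),
                  col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = FALSE)

wide <- left_join(df1, df2, by = c("col1", "col2")) %>%
  group_by(col1, col2) %>%
  # number the col3 values within each group: col3, col4, col5, ...
  mutate(n = paste0("col", 2 + row_number())) %>%
  ungroup() %>%
  # one new column per counter value, filled with the col3 values
  pivot_wider(names_from = n, values_from = col3)
```

Groups with fewer repeated values than the largest group get NA in the extra columns.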

R grep search patterns in multiple columns

I have a data frame like as follows:
Col1 Col2 Col3
A B C
D E F
G H I
I am trying to keep lines matching 'B' in 'Col2' OR F in 'Col3', in order to get:
Col1 Col2 Col3
A B C
D E F
I tried:
data[(grep("B",data$Col2) || grep("F",data$Col3)), ]
but it returns the entire data frame.
NOTE: it works when calling the 2 grep one at a time.
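Or use a single grepl after pasting the columns together. Note that this pattern matches "B" or "F" in either column, so it is only equivalent to the column-specific test when those values cannot occur in the other column: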
df1[with(df1, grepl("B|F", paste(Col2, Col3))),]
# Col1 Col2 Col3
#1 A B C
#2 D E F
with(df1, df1[ Col2 == 'B' | Col3 == 'F',])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Using grepl
with(df1, df1[ grepl( 'B', Col2) | grepl( 'F', Col3), ])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Data:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
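As for why the attempt in the question returned every row: grep() returns the integer positions of the matches, and || combines only single values, so the index collapses to one TRUE that is recycled over all rows. The sketch below contrasts this with grepl() and the vectorised |; the names pos_b, pos_f and keep are illustrative.

```r
df1 <- data.frame(Col1 = c("A", "D", "G"),
                  Col2 = c("B", "E", "H"),
                  Col3 = c("C", "F", "I"),
                  stringsAsFactors = FALSE)

pos_b <- grep("B", df1$Col2)   # positions of matches, here a single integer
pos_f <- grep("F", df1$Col3)   # likewise a single integer

# Non-zero numbers count as TRUE, so this is one TRUE for the whole data frame
collapsed <- pos_b || pos_f

# grepl() gives one logical per row, and | combines them element by element
keep <- grepl("B", df1$Col2) | grepl("F", df1$Col3)
result <- df1[keep, ]
```

Indexing with collapsed selects all rows, while indexing with keep selects only the rows that match.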
The data.table package makes this type of operation trivial due to its compact and readable syntax. Here is how you would perform the above using data.table:
> df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
+ ), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
+ ), row.names = c(NA, -3L), class = "data.frame")
> library(data.table)
> DT <- data.table(df1)
> DT
Col1 Col2 Col3
1: A B C
2: D E F
3: G H I
> DT[Col2 == 'B' | Col3 == 'F']
Col1 Col2 Col3
1: A B C
2: D E F
>
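data.table evaluates the i expression within the scope of the table, so columns can be referenced by name without the DT$ prefix (the with argument, TRUE by default, controls the same behaviour for j). Note that lookups can be much faster if you set keys on the data, but that is a topic of its own.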
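To give a rough idea of what a keyed lookup looks like, here is a small sketch on the same data. It only covers an equality lookup on a single key column; an OR across two columns, as in the question, still uses the ordinary expression form. The names by_key and by_scan are illustrative.

```r
library(data.table)

DT <- data.table(Col1 = c("A", "D", "G"),
                 Col2 = c("B", "E", "H"),
                 Col3 = c("C", "F", "I"))

setkey(DT, Col2)       # sorts DT by Col2 and marks it as the key

by_key <- DT[.("B")]   # lookup of "B" on the key column

# Ordinary logical expression, with columns used directly as variables
by_scan <- DT[Col2 == "B" | Col3 == "F"]
```

On three rows the difference is not measurable; keys tend to matter only on large tables with repeated lookups.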

How to join multiple data frames using dplyr?

I want to left_join multiple data frames:
dfs <- list(
df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Reduce(left_join, dfs)
# a b c d
# 1 1 a 4 NA
# 2 2 b NA 7
# 3 3 c 5 8
This works because they all have the same b column, but Reduce doesn't let me specify additional arguments that I can pass to left_join. Is there a work around for something like this?
dfs <- list(
df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
df2 = data.frame(c = 4:6, d = c("a", "c", "d")),
df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Update
This kind of works: Reduce(function(...) left_join(..., by = c("b" = "d")), dfs) but when by is more than one element it gives this error: Error: cannot join on columns 'b' x 'd': index out of bounds
I know this is late; I only came across the unanswered questions section today.
Using left_join()
dfs <- list(
df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
)
func <- function(...){
df1 = list(...)[[1]]
df2 = list(...)[[2]]
col1 = colnames(df1)[1]
col2 = colnames(df2)[1]
xxx = left_join(..., by = setNames(col2,col1))
return(xxx)
}
Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
Using merge() :
func <- function(...){
df1 = list(...)[[1]]
df2 = list(...)[[2]]
col1 = colnames(df1)[1]
col2 = colnames(df2)[1]
xxx = merge(..., by.x = col1, by.y = col2, all.x = TRUE)
return(xxx)
}
Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
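Would this work for you? With the second dfs from the question, df2 has no b column, so the first join has to map b to d:
jnd.tbl <- dfs$df1 %>%
left_join(dfs$df2, by = c("b" = "d")) %>%
left_join(dfs$df3, by = "b")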
Yet another solution:
library(purrr)
library(dplyr)
dfs = list(
df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
purrr::reduce(dfs, dplyr::left_join, by = 'b')
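When the join columns differ from step to step, as in the second dfs of the question, one option is to keep the by specifications in their own list and walk over both together. The following is a minimal sketch using base Reduce() with an index; the list bys and the name result are made up for the example.

```r
library(dplyr)

dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c"), stringsAsFactors = FALSE),
  df2 = data.frame(c = 4:6, d = c("a", "c", "d"), stringsAsFactors = FALSE),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"), stringsAsFactors = FALSE)
)

# One `by` specification per join: df1-df2 on b = d, then the result with df3 on b
bys <- list(c("b" = "d"), "b")

# Start from the first data frame and join the (k + 1)-th one at step k
result <- Reduce(
  function(acc, k) left_join(acc, dfs[[k + 1]], by = bys[[k]]),
  seq_along(bys),
  dfs[[1]]
)
```

The list bys must have one entry fewer than dfs, since each entry describes the join between the accumulated result and the next data frame.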
