I have three data frames that I'd like to run the same data.table operation on. I could do this manually for each data frame, but I'd like to learn how to do it more efficiently.
Using the data.table package, I want to replace the contents of col1 with the contents of col2 only where col1 contains "a", and I want to run this code over three different data frames. On a single data frame, this works fine:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
library(data.table)
dt = data.table(df1)
dt[grepl(pattern = "a", x = df1$col1), col1 := col2]
but I am lost trying to get this to run over multiple data frames:
df1 <- data.frame(col1 = c("a", "a", "b"), col2 = c("AA", "AA", "AA"))
df2 <- data.frame(col1 = c("b", "b", "a"), col2 = c("AA", "BB", "BB"))
df3 <- data.frame(col1 = c("b", "b", "b"), col2 = c("AA", "AA", "BB"))
library(data.table)
listdfs = list(df1, df2, df3)
for (i in dt[[]]) {
dt[[i]][grepl(pattern = "a", x = df[[i]]$col1), col1 := col2] }
But this obviously doesn't work because I have no clue what I'm doing with the for loop. Any guidance/teaching would be appreciated. Thanks!
If we are looping through the list, loop over the sequence of the list and then do the assignment:
listdfs <- list(df1, df2, df3)
lapply(listdfs, setDT)   # convert each `data.frame` to a `data.table` by reference
for (i in seq_along(listdfs)) {   # loop over the sequence
  listdfs[[i]][grepl(pattern = "a", x = col1), col1 := col2]
}
This changes the elements (i.e. the data.tables) within 'listdfs' as well as the objects 'df1', 'df2' and 'df3' themselves, because setDT converts by reference and we never created a copy:
df1
# col1 col2
#1: AA AA # change
#2: AA AA # change
#3: b AA
df2
# col1 col2
#1: b AA
#2: b BB
#3: BB BB # change
df3
# col1 col2
#1: b AA
#2: b AA
#3: b BB
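If you prefer not to modify df1/df2/df3 in place, a minimal alternative sketch (the helper name fix_col1 is hypothetical, and as.data.table() makes a copy, so the originals are left untouched) is:
library(data.table)

# hypothetical helper: copy to a data.table, then do the same conditional update
fix_col1 <- function(d) {
  dt <- as.data.table(d)                 # copy, so the original data.frame is unchanged
  dt[grepl("a", col1), col1 := col2]     # replace col1 with col2 where col1 contains "a"
  dt
}

listdfs <- lapply(list(df1, df2, df3), fix_col1)
listdfs[[1]]   # same result as df1 in the by-reference approach above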
Suppose we have a list like this:
l <- list()
l[[1]] <- list()
l[[2]] <- list()
l[[3]] <- list()
names(l) <- c("A", "B", "C")
l[[1]][[1]] <- data.frame(6)
l[[1]][[2]] <- data.frame(3)
l[[1]][[3]] <- data.frame(8)
l[[1]][[4]] <- data.frame(7)
l[[2]][[1]] <- data.frame(5)
l[[2]][[2]] <- data.frame(4)
l[[2]][[3]] <- data.frame(7)
l[[2]][[4]] <- data.frame(9)
l[[3]][[1]] <- data.frame(1)
l[[3]][[2]] <- data.frame(6)
l[[3]][[3]] <- data.frame(2)
l[[3]][[4]] <- data.frame(8)
names(l[[1]]) <- c("aa", "bb", "cc", "dd")
names(l[[2]]) <- c("aa", "bb", "cc", "dd")
names(l[[3]]) <- c("aa", "bb", "cc", "dd")
I want to create a list l2 which contains 4 elements: aa, bb, cc and dd. Each of these elements would be a data frame containing the values of aa, bb, cc and dd from list l, plus an ID variable indicating whether the element came from the A, B or C element of list l. So if we built the end result from scratch, it would look like this:
l2 <- list()
l2[[1]] <- data.frame(Value = c(6, 5, 1), ID = c("A", "B", "C"))
l2[[2]] <- data.frame(Value = c(3, 4, 6), ID = c("A", "B", "C"))
l2[[3]] <- data.frame(Value = c(8, 7, 2), ID = c("A", "B", "C"))
l2[[4]] <- data.frame(Value = c(7, 9, 8), ID = c("A", "B", "C"))
names(l2) <- c("aa", "bb", "cc", "dd")
However, I cannot build it from scratch, but instead I must "reshape" l to l2. What is the best way to do this? Preferred solution is in purrr.
The key is transpose(). Setting .id = "ID" in the inner map_dfr() creates a new column ID that records the name of the sub-list each value came from when row-binding the elements together:
library(purrr)
l %>%
  transpose() %>%
  map(~ map_dfr(.x, set_names, "Value", .id = "ID"))
Output
$aa
ID Value
1 A 6
2 B 5
3 C 1
$bb
ID Value
1 A 3
2 B 4
3 C 6
$cc
ID Value
1 A 8
2 B 7
3 C 2
$dd
ID Value
1 A 7
2 B 9
3 C 8
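If you prefer the row-binding step to be explicit, a roughly equivalent sketch (assuming dplyr is also loaded for bind_rows()) is:
library(purrr)
library(dplyr)

# transpose, rename each one-column data frame to "Value", then bind rows,
# using the sub-list names ("A", "B", "C") as the ID column
l %>%
  transpose() %>%
  map(~ bind_rows(map(.x, set_names, "Value"), .id = "ID"))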
I am working to update an old data frame with data from a new data frame.
I found the option below; it works for some of the fields but not all, and I am not sure how to adapt it, as that is beyond my skill set. I tried removing the is.na(x) portion of the ifelse() code and that did not work.
df_old <- data.frame(
  bb = as.character(c("A", "A", "A", "B", "B", "B")),
  y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
  z = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
  bb = as.character(c("A", "A", "A", "B", "A", "A")),
  z = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[, cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x),
                         df_old[, cols], df_new[, cols])
The code also changes my bb variable from a character vector to a numeric one. Do I need another mapply() call focusing specifically on the bb variable?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final<- merge(df_old, df_new, by = c("z"))
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final<- df_final[,c("bb", "z", "y", "aa")]
df_final
bb z y aa
1 A 1 i NA
2 A 2 ii NA
3 A 3 ii 123
4 B 4 i 1234
5 A 5 iii NA
6 A 6 i 12
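A roughly equivalent tidyverse sketch (assuming dplyr is available; it mirrors the merge() logic above, keeping the old aa unless it is NA and always taking bb from df_new) would be:
library(dplyr)

df_final <- df_old %>%
  left_join(df_new, by = "z", suffix = c(".old", ".new")) %>%
  mutate(aa = coalesce(aa.old, aa.new),   # keep the old aa unless it is NA
         bb = bb.new) %>%                 # always take bb from df_new
  select(bb, z, y, aa)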
Good day everyone, I am trying to merge two data frames and create a new data frame that contains the unique key columns and spreads the repeated values across new columns.
For example, the two data frames are:
df1
col1 col2
A    B
C    D
df2
col1 col2 col3
A    B    E
A    B    F
C    D    G
C    D    H
C    D    I
The target output is:
col1 col2 col3 col4 col5
A    B    E    F
C    D    G    H    I
Hope you can help me. Thanks!
I'm not sure whether the final format you are after is the most helpful one, but the first step is a simple left or full join:
df1 <- data.frame(col1 = c("A", "C"),
                  col2 = c("B", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("A", "A", "C", "C", "C"),
                  col2 = c("B", "B", "D", "D", "D"),
                  col3 = c("E", "F", "G", "H", "I"), stringsAsFactors = FALSE)
library(tidyverse)
res <- left_join(df1, df2, by = c("col1", "col2"))
res
col1 col2 col3
1 A B E
2 A B F
3 C D G
4 C D H
5 C D I
Getting a result in the desired form is a bit trickier.
First we do the same left join as above, then unite the two columns (col1 & col2) so that we can group and spread by them easily.
Grouping by the united column (fuse), we number each col3 value within the group, pasting "col" as a prefix so that when spreading it appears as a column name.
We then spread by the counter column n and fill it with the values of col3.
Finally, we reverse the unite we did earlier.
left_join(df1, df2, by = c("col1", "col2")) %>%
  unite(fuse, col1, col2) %>%
  group_by(fuse) %>%
  mutate(n = paste0("col", 2 + 1:n())) %>%
  spread(n, col3) %>%
  separate(fuse, c("col1", "col2"))
# A tibble: 2 x 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
1 A B E F NA
2 C D G H I
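spread() still works but has been superseded; a roughly equivalent sketch with tidyr's pivot_wider() (assuming tidyr >= 1.0.0 is available) avoids the unite/separate round trip:
library(dplyr)
library(tidyr)

# number the matches within each (col1, col2) group, then widen the
# counter into col3, col4, col5
left_join(df1, df2, by = c("col1", "col2")) %>%
  group_by(col1, col2) %>%
  mutate(n = paste0("col", 2 + row_number())) %>%
  ungroup() %>%
  pivot_wider(names_from = n, values_from = col3)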
I have a data frame like as follows:
Col1 Col2 Col3
A B C
D E F
G H I
I am trying to keep lines matching 'B' in 'Col2' OR F in 'Col3', in order to get:
Col1 Col2 Col3
A B C
D E F
I tried:
data[(grep("B",data$Col2) || grep("F",data$Col3)), ]
but it returns the entire data frame.
NOTE: it works when calling the two grep() calls one at a time.
Or use a single grepl() after pasting the columns together (grep() returns integer positions, `||` collapses them to a single TRUE, and that one TRUE gets recycled to every row, which is why the original attempt returns the entire data frame):
df1[with(df1, grepl("B|F", paste(Col2, Col3))),]
# Col1 Col2 Col3
#1 A B C
#2 D E F
with(df1, df1[ Col2 == 'B' | Col3 == 'F',])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Using grepl
with(df1, df1[ grepl( 'B', Col2) | grepl( 'F', Col3), ])
# Col1 Col2 Col3
# 1 A B C
# 2 D E F
Data:
df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
), row.names = c(NA, -3L), class = "data.frame")
The data.table package makes this type of operation trivial due to its compact and readable syntax. Here is how you would perform the above using data.table:
> df1 <- structure(list(Col1 = c("A", "D", "G"), Col2 = c("B", "E", "H"
+ ), Col3 = c("C", "F", "I")), .Names = c("Col1", "Col2", "Col3"
+ ), row.names = c(NA, -3L), class = "data.frame")
> library(data.table)
> DT <- data.table(df1)
> DT
Col1 Col2 Col3
1: A B C
2: D E F
3: G H I
> DT[Col2 == 'B' | Col3 == 'F']
Col1 Col2 Col3
1: A B C
2: D E F
data.table evaluates its i and j expressions within the scope of the data (the equivalent of with = TRUE) by default. Note that the matching is much faster if you set keys on the data, but that is a topic for another time.
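As a hedged illustration of the keyed lookup mentioned above (a sketch only; the OR condition across two columns still needs the vector-scan form shown earlier):
# set a key on Col2 so equality lookups use binary search
setkey(DT, Col2)
DT[.("B")]   # rows where Col2 == "B", found via the key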
I want to left_join multiple data frames:
dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Reduce(left_join, dfs)
# a b c d
# 1 1 a 4 NA
# 2 2 b NA 7
# 3 3 c 5 8
This works because they all have the same b column, but Reduce() doesn't let me specify additional arguments to pass on to left_join(). Is there a workaround for something like this?
dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, d = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
Update
This kind of works: Reduce(function(...) left_join(..., by = c("b" = "d")), dfs), but when by has more than one element it gives this error: Error: cannot join on columns 'b' x 'd': index out of bounds
I know this is very late; I only discovered the unanswered questions section today. Sorry to bother you.
Using left_join()
dfs <- list(
  df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
  df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
  df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
)
func <- function(...) {
  df1 <- list(...)[[1]]
  df2 <- list(...)[[2]]
  col1 <- colnames(df1)[1]
  col2 <- colnames(df2)[1]
  xxx <- left_join(..., by = setNames(col2, col1))
  return(xxx)
}
Reduce(func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
Using merge():
func <- function(...) {
  df1 <- list(...)[[1]]
  df2 <- list(...)[[2]]
  col1 <- colnames(df1)[1]
  col2 <- colnames(df2)[1]
  xxx <- merge(..., by.x = col1, by.y = col2, all.x = TRUE)
  return(xxx)
}
Reduce(func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
Would this work for you?
jnd.tbl <- df1 %>%
  left_join(df2, by = 'b') %>%
  left_join(df3, by = 'd')
Yet another solution:
library(purrr)
library(dplyr)
dfs <- list(
  df1 = data.frame(a = 1:3, b = c("a", "b", "c")),
  df2 = data.frame(c = 4:6, b = c("a", "c", "d")),
  df3 = data.frame(d = 7:9, b = c("b", "c", "e"))
)
purrr::reduce(dfs, dplyr::left_join, by = 'b')
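For the updated case in the question, where each join may need its own by specification, a hedged sketch with purrr::reduce2() (the keys in by_list below are illustrative assumptions, one entry per join after the first data frame) could look like this:
library(purrr)
library(dplyr)

# one `by` per join, using the updated `dfs` from the question
by_list <- list(c("b" = "d"), "b")   # join df1 to df2 on b = d, then to df3 on b
purrr::reduce2(dfs, by_list, function(acc, df, by) left_join(acc, df, by = by))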