Partial match string between columns for multiple dataframes - r

I have a list of dataframes (df1, df2, df3) whose columns I would like to match against another dataframe (df), substituting values only where there is a match. The match should be based on a string passed to the function and treated as a partial match: here it should apply only to fields containing the string "TEXT", and should work on cases like TEXT123 and TEXTabc. I did not get very far myself...
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
list<-c(df1, df2, df3)
Example for df1:
partial_match <- function(column_A$df1, column_B, TEXT, df) {
df1_new <-df1
df1_new[, column_B] <- ifelse(grepl("TEXT.*", df1[, column_A]),
df[, column_B] - nchar(TEXT),
df[, column_B])
df1_new
}
Outcome for df1:
name     column_A  column_B
TEXT333  1         11
b        2         b
c        3         c

Here's one approach using a for loop. You were close! Note that I renamed your list of dataframes to dfs so it doesn't mask the base function list(), and built it with list() rather than c(), since c() on dataframes flattens them into a single list of columns.
Could a single dataframe contain more than one match? If so, the first loop below won't work without a couple more lines.
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
# loop over all dataframes in the list
for(i in seq_along(dfs)){
  # get the name that contains "TEXT" (grep takes a regex, so no glob-style * is needed)
  val <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  # use that name to look up the replacement value in the reference df
  dfs[[i]][dfs[[i]]$name == val,"column_A"] <- df[df$name == val,"column_B"]
}
Updated answer that accounts for multiple matches in the same df:
for(i in seq_along(dfs)){
  # all names in this dataframe that contain "TEXT"
  vals <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  # update each matching name in turn
  for(val in vals){
    dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val,"column_B"]
  }
}
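Since the question asks for the search string to be supplied when the function is called, the loop can also be wrapped in a function. The following is only a sketch: replace_matched and its arguments are hypothetical names, and it assumes the pattern should be treated as a literal substring and that names are unique in the reference dataframe.

```r
# Hypothetical helper (not from the original answer): for every row whose
# `name` contains `pattern`, overwrite `target_col` with the value found
# under the same name in the reference dataframe `ref`.
replace_matched <- function(d, ref, pattern,
                            target_col = "column_A", ref_col = "column_B") {
  hit   <- grepl(pattern, d$name, fixed = TRUE)  # rows containing the substring
  idx   <- match(d$name[hit], ref$name)          # position of each hit in ref
  found <- !is.na(idx)                           # skip names absent from ref
  d[[target_col]][which(hit)[found]] <- ref[[ref_col]][idx[found]]
  d
}

df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"),
                 column_B = 11:19, stringsAsFactors=FALSE)

# apply the helper to every dataframe in the list
dfs_new <- lapply(dfs, replace_matched, ref = df, pattern = "TEXT")
dfs_new[[1]]
```

Because match() is vectorised, this handles zero, one, or several matches per dataframe without an inner loop.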

Related

cbind with do.call stores dataframe name in column variable

Why do test1 and test2 have different column names here? If I reduce each data frame to a single column, the names are the same. I would like the names to match while still using the do.call function.
a <- data.frame(col1 = c(1,2,3), col2 = c(1,2,3))
b <- data.frame(col3 = c(1,2,3), col4 = c(1,2,3))
test1 = cbind(a, b)
dataframe_name = c("a", "b")
test2 <- do.call(cbind, mget(dataframe_name, envir = .GlobalEnv))
colnames(test1)
colnames(test2)
# only one column
a=a[1]
b=b[1]
test1 = cbind(a, b)
test2 <- do.call(cbind, mget(dataframe_name, envir = .GlobalEnv))
colnames(test1)
colnames(test2)
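A likely explanation, offered as a sketch rather than a confirmed answer from this thread: mget() returns a named list, so do.call() passes the data frames to cbind() as named arguments, and for multi-column data frames those argument names are used as column prefixes. Dropping the list names with unname() before the call appears to avoid this. The variable names frames, named and unnamed are illustrative only.

```r
a <- data.frame(col1 = c(1, 2, 3), col2 = c(1, 2, 3))
b <- data.frame(col3 = c(1, 2, 3), col4 = c(1, 2, 3))

# same shape as the result of mget(c("a", "b")): a list named after the variables
frames <- list(a = a, b = b)

named   <- do.call(cbind, frames)          # equivalent to cbind(a = a, b = b)
unnamed <- do.call(cbind, unname(frames))  # equivalent to cbind(a, b)

colnames(named)    # column names carry the list names as prefixes
colnames(unnamed)  # plain column names, as with cbind(a, b)
```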

Return a changed list in R via lapply(), but objects in list not changed

I'm trying to loop through a list of data frames, dropping columns that don't match some condition. I want to change the data frames such that they're missing 1 column essentially. After executing the function, I'm able to change the LIST of data frames, but not the original data frames themselves.
df1 <- data.frame(
a = c("John","Peter","Dylan"),
b = c(1, 2, 3),
c = c("yipee", "ki", "yay"))
df2 <- data.frame(
a = c("Ray","Bob","Derek"),
b = c(4, 5, 6),
c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
a = c("Bill","Sam","Nate"),
b = c(7, 8, 9),
c = c("I", "eat", "cake"))
l <- list(df1, df2, df3)
drop_col <- function(x) {
x <- x[, !names(x) %in% c("e", "b", "f")]
return(x)
}
l <- lapply(l, drop_col)
When I print the list l, I get a list of data frames with the changes I want. When I print df1, df2, or df3 directly, the column has not been dropped.
I've looked at this solution and many others; I'm obviously missing something.
The list l and the dataframes df1, df2, etc. are independent objects: changing one does not affect the other. One way to get the changed dataframes back is to name the list elements and write them into the global environment, which overwrites the originals.
l <- lapply(l, drop_col)
names(l) <- paste0("df", 1:3)
list2env(l, .GlobalEnv)
The problem is that when you create l, you fill it with copies of your data frames df1, df2, df3.
In R, it is generally not possible to pass references to variables. One workaround is to write the list elements back into an environment, as @Ronak Shah does.
Another is to use get() and <<- to change the variable from within the function.
drop_cols <- function(x) {
for(iter in x)
do.call("<<-", list(iter, drop_col(get(iter))))
}
drop_cols(c("df1","df2","df3"))
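The same idea can be written with assign() and an explicit environment, which makes it clearer where the variables are being overwritten. This is a minimal sketch: update_in_place is a hypothetical helper name, and drop = FALSE is added to drop_col here as an assumption so that a single remaining column would stay a data frame.

```r
drop_col <- function(x) {
  x[, !names(x) %in% c("e", "b", "f"), drop = FALSE]
}

# Hypothetical helper: apply `f` to each variable named in `var_names`
# and write the result back into `envir` (by default, the caller's environment).
update_in_place <- function(var_names, f, envir = parent.frame()) {
  for (nm in var_names) {
    assign(nm, f(get(nm, envir = envir)), envir = envir)
  }
  invisible(var_names)
}

df1 <- data.frame(a = c("John", "Peter", "Dylan"), b = c(1, 2, 3), c = c("yipee", "ki", "yay"))
df2 <- data.frame(a = c("Ray", "Bob", "Derek"), b = c(4, 5, 6), c = c("yum", "yummy", "donuts"))

update_in_place(c("df1", "df2"), drop_col)
names(df1)  # column b has been dropped from the original variable
```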
df1 <- data.frame(
a = c("John","Peter","Dylan"),
b = c(1, 2, 3),
c = c("yipee", "ki", "yay"))
df2 <- data.frame(
a = c("Ray","Bob","Derek"),
b = c(4, 5, 6),
c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
a = c("Bill","Sam","Nate"),
b = c(7, 8, 9),
c = c("I", "eat", "cake"))
# Name the list elements:
l <- list(df1 = df1, df2 = df2, df3 = df3)
drop_col <- function(x) {
x <- x[, !names(x) %in% c("e", "b", "f")]
return(x)
}
l <- lapply(l, drop_col)
# View an altered df (use [[ ]] to extract the data frame itself rather than a one-element list):
View(l[["df1"]])

set names with magrittr where both name and value are variable of data.frame?

Let's say I have the following data:
> data.frame(value = 1:2, name = c("a", "b"))
  value name
1     1    a
2     2    b
Goal:
Can I pass it to the pipe operator and "send" it to setNames (or magrittr::set_names)?
What I have tried:
library(magrittr)
data.frame(value = 1:2, name = c("a", "b")) %>%
setNames(object = .$value, nm = .$name)
That doesn't work, I guess because the pipe hands over the whole data.frame as the first argument. That got me interested in whether I can bypass this behaviour and pass two subsets instead.
(So that data.frame(value = 1:2, name = c("a", "b")) %>% stays fixed and is not replaced by a variable.)
Desired Output:
How it would look without the pipe operator:
> a <- data.frame(value = 1:2, name = c("a", "b"))
> setNames(object = a$value, nm = a$name)
a b
1 2
For this case, we can simply wrap the call inside {}, which stops the pipe from inserting the data.frame as the first argument:
library(dplyr)
data.frame(value = 1:2, name = c("a", "b")) %>%
{ setNames(object = .$value, nm = .$name)}
With the tidyverse, there is also deframe, which turns a two-column data frame (names first, values second) into a named vector:
library(tibble)
data.frame(value = 1:2, name = c("a", "b")) %>%
select(2:1) %>%
deframe
#a b
#1 2
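A base R alternative, shown here as a sketch that is not part of the original answer, is with(), which evaluates an expression with the columns of the data frame in scope. The variable names d and named are illustrative.

```r
d <- data.frame(value = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)

# inside with(), `value` and `name` refer to the columns of d
named <- with(d, setNames(value, name))
named
```

Because with() takes the data frame as its first argument, it can also sit on the right-hand side of a pipe. magrittr additionally exports an exposition pipe, %$%, that exposes the columns of the left-hand side in a similar way.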

r - Split dataframe into multiple dataframes and save in environment

This is a follow-up to this question:
split into multiple subset of dataframes with dplyr:group_by?
Reproducible example:
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
I'm interested in how to save the dataframes from the following output:
test %>%
group_by(a) %>%
nest() %>%
select(data) %>%
unlist(recursive = F)
as separate dataframes in the environment. The desired output is the following:
data1 <- data.frame(a = c(1,1,1), b = c(1:3))
data2 <- data.frame(a = c(2,2,2), b = c(4:6))
data3 <- data.frame(a = c(3,3,3), b = c(7:9))
There are many groups, so automation is required, giving data1, data2, data3, ..., data(n) dataframes.
If you want the dataframe names to be created automatically as well, you could try something like this.
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = c(1:9))
test
n <- length(unique(test$a))
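# build and evaluate one assignment per group; each piece is deparsed as a list
eval(parse(text = paste0("data", seq_len(n), " <- ", split(test, test$a))))
# convert each list back to a data frame (seq_len(n) replaces the hard-coded 1:3)
eval(parse(text = paste0("data", seq_len(n), " <- as.data.frame(data", seq_len(n), ")")))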
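The same result can be reached without eval(parse()) by naming the pieces returned by split() and writing them into an environment with list2env(). This is a sketch under the assumption that names of the form data1, data2, ... are wanted; pieces and target are illustrative names, and a fresh environment is used here so the example does not touch the workspace.

```r
test <- data.frame(a = c(1,1,1,2,2,2,3,3,3), b = 1:9)

# one data frame per distinct value of a
pieces <- split(test, test$a)
names(pieces) <- paste0("data", seq_along(pieces))

# write each element into an environment as its own variable;
# pass globalenv() instead of target to create them in the workspace
target <- new.env()
list2env(pieces, envir = target)

ls(target)
target$data2
```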

R ddply vector version

I am looking for a vector version of ddply.
I would like to do the following:
vector_ddply(frame1, frame2, ..., frameN, c("column1", "column2"), processingFunction);
Here all frames have both "column1" and "column2" and processingFunction takes N parameters.
Note that in my specific case it doesn't make sense to merge the N data frames into one.
The resulting frame would be made up of the union of all the keys of the N frames.
Is there a way to achieve this?
Thanks
Let's start with some sample data:
ll <- list(
f1 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), p = 1:4 ),
f2 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), q = 1:4 ),
f3 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), r = 1:4 )
)
1. Solution: apply data.frame-wise
You want to apply processingFunction with ddply to each data.frame individually and combine the results into one data.frame:
library(plyr)  # provides ldply, ddply, summarise and .()
ldply( ll, ddply, .(x, y), summarise, z = processingFunction(z) )
2. Solution: apply to one rbind-ed data.frame
You want to apply processingFunction over all rows of the data.frames at once. In that case, rbind all the data.frames into one large data.frame. If that is not directly possible because the individual frames do not share all columns, rbind on the subset of common columns:
commonCols <- Reduce( "intersect", lapply(ll, colnames) )
oneDf <- do.call( "rbind", lapply( ll, "[", commonCols ) )
ddply( oneDf, .(x,y), summarise, z = processingFunction(z) )
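For readers without plyr, the second solution can be approximated in base R. This sketch swaps ddply for aggregate() and uses mean as a stand-in for processingFunction, which is an assumption since the original function is not shown; set.seed() is added only to make the random sample data reproducible.

```r
set.seed(1)
ll <- list(
  f1 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), p = 1:4),
  f2 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), q = 1:4),
  f3 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), r = 1:4)
)

# stand-in for the real processing function (assumption)
processingFunction <- mean

# keep only the columns shared by every frame, then stack the frames
commonCols <- Reduce(intersect, lapply(ll, colnames))
oneDf <- do.call(rbind, lapply(ll, "[", commonCols))

# base R counterpart of ddply(oneDf, .(x, y), summarise, z = ...)
result <- aggregate(z ~ x + y, data = oneDf, FUN = processingFunction)
result
```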
