I have been trying to replace a for loop in my code with an apply-family function. I have tried sapply, lapply, apply and mapply, but none of them worked. The original loop looks like this:
ds1 <- data.frame(col1 = c(NA, 2), col2 = c("A", "B"))
ds2 <- data.frame(colA = c("A", "B"), colB = c(90, 110))
for(i in 1:nrow(ds1)){
if(is.na(ds1$col1[i])){
ds1$col1[i] <- ds2[ds2[,"colA"] == ds1$col2[i], "colB"]
}
}
My latest attempt with the apply family looks like this:
ds1 <- data.frame(col1 = c(NA, 2), col2 = c("A", "B"))
ds2 <- data.frame(colA = c("A", "B"), colB = c(90, 110))
sFunc <- function(x, y, z){
if(is.na(x)){
return(z[z[,"colA"] == y, "colB"])
} else {
return(x)
}
}
ds1$col1 <- sapply(ds1$col1, sFunc, ds1$col2, ds2)
This returns ds2$colB for each row. Can someone explain what I got wrong?
sapply only iterates over the first vector you pass. Any additional arguments are passed whole to every call, so inside sFunc, y is the entire ds1$col2 vector each time. To iterate over several vectors in parallel you need the multivariate version, mapply.
sFunc <- function(x, y){
if(is.na(x)){
return(ds2[ds2[,"colA"] == y, "colB"])
} else {
return(x)
}
}
mapply(sFunc, ds1$col1, ds1$col2)
#> [1] 90 2
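To make the difference concrete, here is a minimal sketch. count_y is a hypothetical helper, not part of the original code; it only reports how many elements of y each call receives.

```r
# Hypothetical helper: returns the number of elements of y seen by one call.
count_y <- function(x, y) length(y)

# sapply iterates over the first vector only; y is passed whole to each call.
sapply_lens <- sapply(c(NA, 2), count_y, c("A", "B"))

# mapply iterates over both vectors in parallel; y is one element per call.
mapply_lens <- mapply(count_y, c(NA, 2), c("A", "B"))

print(sapply_lens)
print(mapply_lens)
```

With sapply every call sees both elements of y, while with mapply each call sees exactly one.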
A join would be useful here. You can do it in base R:
transform(merge(ds1, ds2, by.x = "col2", by.y = "colA"),
col1 = ifelse(is.na(col1), colB, col1))[names(ds1)]
# col1 col2
#1 90 A
#2 2 B
Or with dplyr:
library(dplyr)
inner_join(ds1, ds2, by = c("col2" = "colA")) %>%
mutate(col1 = coalesce(col1, colB)) %>%
select(names(ds1))
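If a full join feels heavy for a single lookup column, a vectorised lookup with match() may also work. This is only a sketch on the question's sample data, and it assumes each value of colA appears at most once in ds2.

```r
ds1 <- data.frame(col1 = c(NA, 2), col2 = c("A", "B"))
ds2 <- data.frame(colA = c("A", "B"), colB = c(90, 110))

# Rows of ds1 that need filling.
missing <- is.na(ds1$col1)

# match() gives, for each key in ds1, the position of that key in ds2$colA.
ds1$col1[missing] <- ds2$colB[match(ds1$col2[missing], ds2$colA)]

print(ds1)
```

Unlike merge(), this keeps the original row order of ds1 and needs no explicit loop.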
I have a list of dataframes (df1, df2, df3) for which I would like to match columns against another dataframe (df) and substitute strings only if there is a match. The match should be a partial match on a string supplied when calling the function. Here it should apply only to fields containing the string "TEXT", and it should work on cases like TEXT123 and TEXTabc. I did not get very far myself...
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
list<-c(df1, df2, df3)
Example for df1:
partial_match <- function(column_A$df1, column_B, TEXT, df) {
df1_new <-df1
df1_new[, column_B] <- ifelse(grepl("TEXT.*", df1[, column_A]),
df[, column_B] - nchar(TEXT),
df[, column_B])
df1_new
}
Outcome for df1:
name column_A column_B
TEXT333 1 11
b 2 b
c 3 c
Here's one approach using a for loop. You were close! Note that I renamed your list to dfs to avoid confusion with list(), and built it with list() rather than c(), because c() flattens the data frames into a single list of columns.
Do you think you might encounter a situation where you might match multiple times in the same dataframe? If so, what I show below won't work without a couple more lines.
df1 <- data.frame(name = c("TEXT333","b","c"), column_A = 1:3, stringsAsFactors=FALSE)
df2 <- data.frame(name = c("b","TEXT345","d"), column_A = 4:6, stringsAsFactors=FALSE)
df3 <- data.frame(name = c("c","TEXT123","a"), column_A = 7:9, stringsAsFactors=FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333","TEXT123","a", "TEXT345", "k", "l", "b","c", "f"), column_B = 11:19, stringsAsFactors=FALSE)
# loop over all dataframes in your list
for(i in seq_along(dfs)){
  # get the name that contains "TEXT" (a plain pattern; the surrounding "*" is not needed in a regex)
  val <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  # use name to update value from reference df
  dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
}
Updated answer that can account for multiple matches in the same df:
for(i in seq_along(dfs)){
  # all names in this data frame that contain "TEXT"
  vals <- grep(pattern = "TEXT", x = dfs[[i]]$name, value = TRUE)
  for(val in vals){
    dfs[[i]][dfs[[i]]$name == val, "column_A"] <- df[df$name == val, "column_B"]
  }
}
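The same update can also be written without explicit loops. The sketch below is one possible variant, not the answer's method: update_from_ref is a hypothetical helper name, the pattern is treated as a fixed string, and rows whose name has no counterpart in the reference data frame are left unchanged.

```r
df1 <- data.frame(name = c("TEXT333", "b", "c"), column_A = 1:3, stringsAsFactors = FALSE)
df2 <- data.frame(name = c("b", "TEXT345", "d"), column_A = 4:6, stringsAsFactors = FALSE)
df3 <- data.frame(name = c("c", "TEXT123", "a"), column_A = 7:9, stringsAsFactors = FALSE)
dfs <- list(df1, df2, df3)
df <- data.frame(name = c("TEXT333", "TEXT123", "a", "TEXT345", "k", "l", "b", "c", "f"),
                 column_B = 11:19, stringsAsFactors = FALSE)

# Hypothetical helper: update column_A from ref$column_B for names containing `pattern`.
update_from_ref <- function(d, ref, pattern = "TEXT") {
  idx <- match(d$name, ref$name)                        # position of each name in ref
  hit <- grepl(pattern, d$name, fixed = TRUE) & !is.na(idx)
  d$column_A[hit] <- ref$column_B[idx[hit]]
  d
}

dfs_new <- lapply(dfs, update_from_ref, ref = df)
print(dfs_new)
```

Because match() and grepl() are vectorised, several matching names in the same data frame are handled in one step.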
I'm trying to loop through a list of data frames, dropping columns that don't match some condition. I want to change the data frames such that they're missing 1 column essentially. After executing the function, I'm able to change the LIST of data frames, but not the original data frames themselves.
df1 <- data.frame(
a = c("John","Peter","Dylan"),
b = c(1, 2, 3),
c = c("yipee", "ki", "yay"))
df2 <- data.frame(
a = c("Ray","Bob","Derek"),
b = c(4, 5, 6),
c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
a = c("Bill","Sam","Nate"),
b = c(7, 8, 9),
c = c("I", "eat", "cake"))
l <- list(df1, df2, df3)
drop_col <- function(x) {
x <- x[, !names(x) %in% c("e", "b", "f")]
return(x)
}
l <- lapply(l, drop_col)
When I call the list l, I get a list of data frames with the changes I want. When I call an element in the list, df1 or df2 or df3, they do not have a dropped column.
I've looked at this solution and many others, I'm obviously missing something.
The list l and the dataframes df1, df2, etc. are independent copies, so changing one does not change the other. One way to get the changed dataframes back is to name the list elements and write them into the global environment:
l <- lapply(l, drop_col)
names(l) <- paste0("df", 1:3)
list2env(l, .GlobalEnv)
The problem is that when you are creating l, you are filling it with copies of your data frames df1, df2, df3.
In R, it is not generally possible to pass references to variables. One workaround is to write the modified data frames back into an environment with list2env(), as @Ronak Shah does.
Another is to use get() and <<- to change the variable within the function.
drop_cols <- function(x) {
for(iter in x)
do.call("<<-", list(iter, drop_col(get(iter))))
}
drop_cols(c("df1","df2","df3"))
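The copy behaviour described above can be checked directly. This is a small sketch with made-up data; env is a hypothetical scratch environment used here so that nothing in the global environment is overwritten.

```r
df1 <- data.frame(a = 1:3, b = 4:6)

# Putting df1 into a list copies it; the list element is a separate object.
l <- list(df1 = df1)
l$df1$b <- NULL

original_names <- names(df1)     # the original still has both columns
list_names <- names(l$df1)       # only the list element lost column b

# list2env() writes each named list element into an environment as a variable.
env <- new.env()
list2env(l, envir = env)
env_names <- names(env$df1)

print(original_names)
print(list_names)
print(env_names)
```

Passing envir = .GlobalEnv instead of env would overwrite df1 itself, which is what the list2env answer does.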
df1 <- data.frame(
a = c("John","Peter","Dylan"),
b = c(1, 2, 3),
c = c("yipee", "ki", "yay"))
df2 <- data.frame(
a = c("Ray","Bob","Derek"),
b = c(4, 5, 6),
c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
a = c("Bill","Sam","Nate"),
b = c(7, 8, 9),
c = c("I", "eat", "cake"))
# Name the list elements:
l <- list(df1 = df1, df2 = df2, df3 = df3)
drop_col <- function(x) {
x <- x[, !names(x) %in% c("e", "b", "f")]
return(x)
}
l <- lapply(l, drop_col)
# View an altered df ([[ ]] extracts the data frame itself rather than a one-element list):
View(l[["df1"]])
I have a list of dataframes that I want to manipulate individually. It looks like this:
df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20),
A2 = data.frame(v1 = 21:30,
v2 = 31:40))
df_list
Using lapply allows me to run a function over the list of dataframes like this:
library(tidyverse)
some_func <- function(lizt, comp = 2){
lizt <- lapply(lizt, function(x){
x <- x %>%
mutate(IMPORTANT_v3 = v2 + comp)
return(x)
})
}
df_list_1 <- some_func(df_list)
df_list_1
So far so good, but I need to run the function multiple times with different arguments. Using mapply returns:
df_list_2 <- mapply(some_func,
comp = c(2, 3, 4),
MoreArgs = list(
lizt = df_list
),
SIMPLIFY = FALSE
)
df_list_2
This creates a new list of dataframes for each argument fed to the function in mapply giving me 3 lists of 2 dataframes. This is good but the output I'm looking for is to append a new column to each original dataframe for each argument in the mapply that would look like this:
desired_df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20,
IMPORTANT_v3 = 13:22,
IMPORTANT_v4 = 14:23,
IMPORTANT_v5 = 15:24),
A2 = data.frame(v1 = 21:30,
v2 = 31:40,
IMPORTANT_v3 = 33:42,
IMPORTANT_v4 = 34:43,
IMPORTANT_v5 = 35:44))
desired_df_list
How can I wrangle the output of lists of lists of dataframes to isolate and append only the desired new columns (IMPORTANT_v3) to the original dataframe? Also open to other options such as mutating multiple columns inside the lapply using mapply but I haven't figured out how to code that as yet.
Thanks!
Solved like this:
library(pracma) # provides movavg()
main_func <- function(lizt, comp = c(2:4)){
lizt <- lapply(lizt, function(x){
df <- mapply(movavg,
n = comp,
type = "w",
MoreArgs = list(x$v2),
SIMPLIFY = TRUE
)
colnames(df) <- paste0("IMPORTANT_v", 1:ncol(df))
print(df)
print(x)
x <- cbind(x, df)
return(x)
})
}
desired_df_list_complete <- main_func(df_list)
desired_df_list_complete
This example uses movavg from the pracma package.
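For the simpler v2 + comp column from the question, the desired output can be sketched in base R without pracma. add_cols is a hypothetical helper name, and the column suffixes (3, 4, 5) are chosen only to match the names in desired_df_list.

```r
df_list <- list(A1 = data.frame(v1 = 1:10, v2 = 11:20),
                A2 = data.frame(v1 = 21:30, v2 = 31:40))

# Hypothetical helper: append one new column per value in comps.
add_cols <- function(d, comps = 2:4) {
  # sapply returns a matrix with one column per element of comps
  new_cols <- sapply(comps, function(k) d$v2 + k)
  colnames(new_cols) <- paste0("IMPORTANT_v", seq_along(comps) + 2)
  cbind(d, new_cols)
}

result <- lapply(df_list, add_cols)
print(result)
```

The inner sapply builds all new columns for one data frame at once, so there is no list of lists to untangle afterwards.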
I'm trying to merge multiple data frames by row names.
I know how to do it with two:
x = data.frame(a = c(1,2,3), row.names = letters[1:3])
y = data.frame(b = c(1,2,3), row.names = letters[1:3])
merge(x,y, by = "row.names")
But when I try using the reshape package's merge_all() I'm getting an error.
z = data.frame(c = c(1,2,3), row.names = letters[1:3])
l = list(x,y,z)
merge_all(l, by = "row.names")
Error in -ncol(df) : invalid argument to unary operator
What's the best way to do this?
Merging by row.names does weird things - it creates a column called Row.names, which makes subsequent merges hard.
To avoid that issue you can instead create a column holding the row names, which is generally a better idea anyway, since row names are very limited and hard to manipulate. One way of doing that with the data as given in the OP is shown below. It is not the most efficient way; for easier and faster handling of rectangular data I recommend getting to know data.table.
Reduce(merge, lapply(l, function(x) data.frame(x, rn = row.names(x))))
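Spelled out on the question's data, the Reduce approach might look like the sketch below. The column name rn comes from the one-liner above; restoring the row names at the end is an optional extra step that is not part of the original answer.

```r
x <- data.frame(a = c(1, 2, 3), row.names = letters[1:3])
y <- data.frame(b = c(1, 2, 3), row.names = letters[1:3])
z <- data.frame(c = c(1, 2, 3), row.names = letters[1:3])
l <- list(x, y, z)

# Copy the row names into an ordinary column so merge() can use it as a key.
with_rn <- lapply(l, function(d) data.frame(rn = row.names(d), d))

# Reduce applies merge pairwise: merge(merge(x, y), z).
merged <- Reduce(function(left, right) merge(left, right, by = "rn"), with_rn)

# Optional: turn the key column back into row names.
rownames(merged) <- merged$rn
merged$rn <- NULL
print(merged)
```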
Maybe there is a faster version using do.call or *apply, but this works in your case:
x = data.frame(X = c(1,2,3), row.names = letters[1:3])
y = data.frame(Y = c(1,2,3), row.names = letters[1:3])
z = data.frame(Z = c(1,2,3), row.names = letters[1:3])
merge.all <- function(x, ..., by = "row.names") {
L <- list(...)
for (i in seq_along(L)) {
x <- merge(x, L[[i]], by = by)
rownames(x) <- x$Row.names
x$Row.names <- NULL
}
return(x)
}
merge.all(x,y,z)
It may be important to define every parameter you want to forward to merge (like by) explicitly in merge.all, since everything passed through ... is treated as an object to merge.
As an alternative to Reduce and merge:
If you put all the data frames into a list, you can then use grep and cbind to get the data frames with the desired row names.
## set up the data
> x <- data.frame(x1 = c(2,4,6), row.names = letters[1:3])
> y <- data.frame(x2 = c(3,6,9), row.names = letters[1:3])
> z <- data.frame(x3 = c(1,2,3), row.names = letters[1:3])
> a <- data.frame(x4 = c(4,6,8), row.names = letters[4:6])
> lst <- list(a, x, y, z)
## combine all the data frames with row names = letters[1:3]
> gg <- grep(paste(letters[1:3], collapse = ""),
sapply(lapply(lst, rownames), paste, collapse = ""))
> do.call(cbind, lst[gg])
## x1 x2 x3
## a 2 3 1
## b 4 6 2
## c 6 9 3
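Pasting the row names into one string and using grep can also match frames whose row names merely contain "abc". A stricter variant, sketched here under the assumption that an exact match of row names is wanted, selects the frames with Filter() and identical().

```r
x <- data.frame(x1 = c(2, 4, 6), row.names = letters[1:3])
y <- data.frame(x2 = c(3, 6, 9), row.names = letters[1:3])
z <- data.frame(x3 = c(1, 2, 3), row.names = letters[1:3])
a <- data.frame(x4 = c(4, 6, 8), row.names = letters[4:6])
lst <- list(a, x, y, z)

# Keep only the data frames whose row names are exactly a, b, c.
same_rows <- Filter(function(d) identical(rownames(d), letters[1:3]), lst)

# cbind relies on the rows being in the same order in every frame.
combined <- do.call(cbind, same_rows)
print(combined)
```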
I am looking for a vector version of ddply.
I would like to do the following:
vector_ddply(frame1, frame2, ..., frameN, c("column1", "column2"), processingFunction);
Here all frames have both "column1" and "column2" and processingFunction takes N parameters.
Note that in my specific case it doesn't make sense to merge the N data frames into one.
The resulting frame would be made of the union of all the keys of the N frames.
Is there a way to achieve this?
Thanks
Let's start with some sample data (the solutions below use the plyr package):
library(plyr)
ll <- list(
f1 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), p = 1:4 ),
f2 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), q = 1:4 ),
f3 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), r = 1:4 )
)
1. Solution: apply data.frame-wise
You want to ddply processingFunction on each data.frame individually, and combine the results to one resulting data.frame:
ldply( ll, ddply, .(x, y), summarise, z = processingFunction(z) )
2. Solution: apply on one rbinded data.frame
You want to apply processingFunction over all rows of the data.frames at once. In that case, rbind all data.frames together into one large one. If this is not directly possible because the individual frames do not have all columns in common, rbind on the common column subset:
commonCols <- Reduce( "intersect", lapply(ll, colnames) )
oneDf <- do.call( "rbind", lapply( ll, "[", commonCols ) )
ddply( oneDf, .(x,y), summarise, z = processingFunction(z) )
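For readers without plyr, the second solution can be approximated in base R with aggregate(). This is only a sketch: the z values are fixed numbers instead of rnorm() so the result is reproducible, and mean stands in for the unspecified processingFunction.

```r
ll <- list(
  f1 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(1, 2, 3, 4), p = 1:4),
  f2 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(5, 6, 7, 8), q = 1:4),
  f3 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(9, 10, 11, 12), r = 1:4)
)

# Columns present in every data frame (here x, y and z).
commonCols <- Reduce(intersect, lapply(ll, colnames))

# Stack the common columns of all frames into one data frame.
oneDf <- do.call(rbind, lapply(ll, "[", commonCols))

# Apply the stand-in function (mean) to z within each x/y group.
res <- aggregate(z ~ x + y, data = oneDf, FUN = mean)
print(res)
```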