Could you help me convert all of this code into a single function? I want to avoid rewriting it for every single data frame.
data <- merge(x = data_2021, y = corr_df, by = "XCode", all.x = TRUE)
#Drop column
data = subset(data, select = -c(XCode))
# Rename columns
names(data)[names(data) == "Zvar"] <- "XCode"
# Reorder column by name
col_order <- c("XCode", "x2" , "x3")
data <- data[,col_order]
Maybe something like this:
fn <- function(x,y) {
data <- merge(x = x, y = y, by = "XCode", all.x = TRUE)
## Drop column
data = subset(data, select = -c(XCode))
## Rename columns
names(data)[names(data) == "Zvar"] <- "XCode"
## Reorder column by name
col_order <- c("XCode", "x2" , "x3")
data <- data[,col_order]
data
}
data <- fn(data_2021, corr_df)
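To avoid calling the function once per data frame, the data frames can be collected in a list and passed through lapply. The following is only a sketch: data_2022 and the toy values are made up for illustration, and it assumes every data frame has the columns XCode, x2 and x3.

```r
# Toy inputs (hypothetical values, only to make the sketch runnable)
corr_df   <- data.frame(XCode = c("a", "b", "c"), Zvar = c("A", "B", "C"),
                        stringsAsFactors = FALSE)
data_2021 <- data.frame(XCode = c("a", "b"), x2 = 1:2, x3 = 3:4,
                        stringsAsFactors = FALSE)
data_2022 <- data.frame(XCode = c("c", "a"), x2 = 5:6, x3 = 7:8,
                        stringsAsFactors = FALSE)

fn <- function(x, y) {
  data <- merge(x = x, y = y, by = "XCode", all.x = TRUE)
  data <- subset(data, select = -c(XCode))        # drop the old code
  names(data)[names(data) == "Zvar"] <- "XCode"   # the new code takes its name
  data[, c("XCode", "x2", "x3")]                  # reorder columns
}

# Apply the same conversion to every data frame in a named list
yearly    <- list(data_2021 = data_2021, data_2022 = data_2022)
converted <- lapply(yearly, fn, y = corr_df)
```

Each converted data frame is then available by name, for example converted$data_2021.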
I have a list containing multiple data frames that all have the same column names. I would like to combine them into one data frame and drop the duplicates. How can I do that?
Sample data can be built with this code (mpg is the example dataset from ggplot2):
lst1 <- list(data1 = mpg, data2 = mpg, data3 = mpg)
The result should be mpg.
Many thanks.
Maybe this. You can add a row id to each data frame in the list and, after binding them, drop the rows with a duplicated id:
library(dplyr)
library(ggplot2) # provides the mpg dataset
#Data
lst1 <- list(data1 = mpg, data2 = mpg, data3 = mpg)
lst1 <- lapply(lst1, function(x) {x$id<-1:nrow(x);return(x)})
#Bind
df <- do.call(bind_rows,lst1)
df2 <- df[!duplicated(df$id),]
df2$id <- NULL
Update: To keep data source:
library(dplyr)
library(ggplot2) # provides the mpg dataset
#Data
lst1 <- list(data1 = mpg, data2 = mpg, data3 = mpg)
lst1 <- lapply(lst1, function(x) {x$id<-1:nrow(x);return(x)})
#Bind
df <- bind_rows(lst1,.id = 'data')
df2 <- df[!duplicated(df$id),]
df2$id <- NULL
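Note that the id trick above identifies duplicates by row position, so it assumes every data frame in the list holds the same rows in the same order. If duplicates should instead be identified by row content, a base R sketch could look like the following. The toy data frames are made up for illustration, and this variant also collapses rows that repeat within a single data frame.

```r
# Toy list (hypothetical data): two data frames that overlap in two rows
lst <- list(
  data1 = data.frame(k = 1:3, v = c("a", "b", "c"), stringsAsFactors = FALSE),
  data2 = data.frame(k = 2:4, v = c("b", "c", "d"), stringsAsFactors = FALSE)
)

# Stack all data frames, then keep only the first occurrence of each row
combined <- unique(do.call(rbind, lst))
rownames(combined) <- NULL
```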
I have a list of dataframes that I need to manipulate individually. It looks like this:
df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20),
A2 = data.frame(v1 = 21:30,
v2 = 31:40))
df_list
Using lapply allows me to run a function over the list of dataframes like this:
library(tidyverse)
some_func <- function(lizt, comp = 2){
lizt <- lapply(lizt, function(x){
x <- x %>%
mutate(IMPORTANT_v3 = v2 + comp)
return(x)
})
}
df_list_1 <- some_func(df_list)
df_list_1
So far so good, but I need to run the function multiple times with different arguments, so I tried mapply, which returns:
df_list_2 <- mapply(some_func,
comp = c(2, 3, 4),
MoreArgs = list(
lizt = df_list
),
SIMPLIFY = F
)
df_list_2
This creates a new list of dataframes for each argument fed to the function in mapply giving me 3 lists of 2 dataframes. This is good but the output I'm looking for is to append a new column to each original dataframe for each argument in the mapply that would look like this:
desired_df_list <- list(A1 = data.frame(v1 = 1:10,
v2 = 11:20,
IMPORTANT_v3 = 13:22,
IMPORTANT_v4 = 14:23,
IMPORTANT_v5 = 15:24),
A2 = data.frame(v1 = 21:30,
v2 = 31:40,
IMPORTANT_v3 = 33:42,
IMPORTANT_v4 = 34:43,
IMPORTANT_v5 = 35:44))
desired_df_list
How can I wrangle the output of lists of lists of dataframes to isolate and append only the desired new columns (IMPORTANT_v3) to the original dataframe? Also open to other options such as mutating multiple columns inside the lapply using mapply but I haven't figured out how to code that as yet.
Thanks!
Solved like this:
main_func <- function(lizt, comp = c(2:4)){
lizt <- lapply(lizt, function(x){
df <- mapply(movavg,
n = comp,
type = "w",
MoreArgs = list(x$v2),
SIMPLIFY = T
)
colnames(df) <- paste0("IMPORTANT_v", 1:ncol(df))
print(df)
print(x)
x <- cbind(x, df)
return(x)
})
}
desired_df_list_complete <- main_func(df_list)
desired_df_list_complete
This example uses movavg from the pracma package, so library(pracma) must be loaded before running it.
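For the simpler additive case from the question (v2 + comp), the same idea can be sketched in base R without pracma. The function name add_comp_cols is made up for this illustration, and the column suffix comp + 1 is chosen only so that the names match desired_df_list.

```r
df_list <- list(A1 = data.frame(v1 = 1:10,  v2 = 11:20),
                A2 = data.frame(v1 = 21:30, v2 = 31:40))

add_comp_cols <- function(lizt, comp = 2:4) {
  lapply(lizt, function(x) {
    # Build one new column per value of comp (a matrix with length(comp) columns)
    new_cols <- sapply(comp, function(k) x$v2 + k)
    colnames(new_cols) <- paste0("IMPORTANT_v", comp + 1)
    # Append the new columns to the original data frame
    cbind(x, new_cols)
  })
}

result <- add_comp_cols(df_list)
```

This avoids building a list of lists in the first place, because the loop over comp happens inside the lapply over data frames.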
I have two dataframes like below:
df1 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"))
df2 <- data.frame(Category = c("Construction","Construction","Construction",
"Industry","Industry","Industry",
"Size","Size","Size","Size"),
Type = c("Frame","Masonry","Fire Resistive",
"Apartments","Restaurant","Condos",
"[0-3)","[3-6)","[6-9)","9+"),
Score1 = rnorm(10),
Score2 = rnorm(10),
Score3 = rnorm(10))
I want to join df2 to df1 so that Construction, Industry, and Size each have their respective Score.
I can do it manually by making a key equal to Category concatenated with Type and then doing a left-join for each column, but I want a way to automate it so I can add/remove variables easily.
Here's the format I want it to look like: (note: Score numbers don't match.)
df3 <- data.frame(Construction = c("Frame","Frame","Masonry","Fire Resistive","Masonry"),
Construction_Score1 = rnorm(5),
Construction_Score2 = rnorm(5),
Construction_Score3 = rnorm(5),
Industry = c("Apartments","Restaurant","Condos","Condos","Condos"),
Industry_Score1 = rnorm(5),
Industry_Score2 = rnorm(5),
Industry_Score3 = rnorm(5),
Size = c("[0-3)","[6-9)","[3-6)","[3-6)","9+"),
Size_Score1 = rnorm(5),
Size_Score2 = rnorm(5),
Size_Score3 = rnorm(5))
The idea is to join df1 and df2 once per column (Construction, Industry and Size), matching each column against Type, then stack the merged results into one long data frame and finally reshape it to wide to get the format you want.
mylist <- lapply(names(df1), function(col){
merge(x = df1, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)})
mydf <- do.call(rbind, mylist)
df3 <- reshape(mydf, idvar = c("Construction","Industry","Size"),
timevar = "Category",
direction = "wide")
One thing to note is that you have Score as the value of your Category column in df2 which I think should be Size instead to match what you have in df3 and also what has been hinted in df1.
Update: Answering OP's follow-up question;
What if there are other columns that are in df1, but not df2?
Let's make df11 which has another column and apply the same approach on that:
df11 <- cbind(df1, a=1:5)
mydf <- do.call(rbind,
lapply(names(df11[1:3]), function(col){
merge(x = df11, y = df2,
by.x = col, by.y = "Type",
all.x = TRUE)}))
df33 <- reshape(mydf, idvar = names(df11),
timevar = "Category",
direction = "wide")
So, you just need to specify in lapply which columns of df11 you are using to merge with df2 and in the reshape you include all the columns from df11 whether they match with df2 or not.
Another possibility using the tidyverse packages (thanks to @akrun for reminding me about map_df):
map_df(names(df11)[1:3], ~ left_join(df11, df2, by = set_names("Type", .x))) %>%
gather(mvar, mval, Score1:Score3) %>%
unite(var, mvar, Category) %>%
spread(var, mval)
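As a further alternative to merging and reshaping, the scores can be looked up per column with match. This is a sketch of a different technique from the answers above, not a replacement for them. The names key_cols, score_cols and pieces are introduced here for illustration, and set.seed is used only so the random scores are reproducible.

```r
set.seed(42)  # only so the random scores are reproducible
df1 <- data.frame(
  Construction = c("Frame", "Frame", "Masonry", "Fire Resistive", "Masonry"),
  Industry     = c("Apartments", "Restaurant", "Condos", "Condos", "Condos"),
  Size         = c("[0-3)", "[6-9)", "[3-6)", "[3-6)", "9+"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  Category = rep(c("Construction", "Industry", "Size"), times = c(3, 3, 4)),
  Type     = c("Frame", "Masonry", "Fire Resistive",
               "Apartments", "Restaurant", "Condos",
               "[0-3)", "[3-6)", "[6-9)", "9+"),
  Score1 = rnorm(10), Score2 = rnorm(10), Score3 = rnorm(10),
  stringsAsFactors = FALSE)

key_cols   <- c("Construction", "Industry", "Size")  # columns of df1 to score
score_cols <- c("Score1", "Score2", "Score3")

pieces <- lapply(key_cols, function(key) {
  lookup <- df2[df2$Category == key, ]
  # For each row of df1, pick the matching score row for this category
  scores <- lookup[match(df1[[key]], lookup$Type), score_cols, drop = FALSE]
  names(scores) <- paste(key, score_cols, sep = "_")
  rownames(scores) <- NULL
  cbind(df1[key], scores)
})
df3 <- do.call(cbind, pieces)
```

Adding or removing a variable then only means editing key_cols.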
I am using R. I need to create a new column in a data frame that is the sum of the three variables. The sum should only take place if there are numeric values for each of the three variables. In other words, if there are any NAs or blanks the sum should not take place.
I have written the code below which works, but would like to simplify it. I am interested in using vectors to avoid repetition in my code.
data.x <- data.frame('time' = c(1:11),
'x' = c(5,3,"",'ND',2,'ND',7,8,'ND',1," "))
data.x[data.x == ''] <- NA
data.x[data.x == ' '] <- NA
data.x[data.x == 'ND'] <- NA
data.x.na.omit <- na.omit(data.x)
data.y <- data.frame('time' = c(1:8),
'y' = c(5,2,3,1,2,NA,NA,8))
data.y[data.y == ''] <- NA
data.y[data.y == ' '] <- NA
data.y[data.y == 'ND'] <- NA
data.y.na.omit <- na.omit(data.y)
data.z <- data.frame('time' = c(1:5),
'z' = c(1:5))
data.z[data.z == ''] <- NA
data.z[data.z == ' '] <- NA
data.z[data.z == 'ND'] <- NA
data.z.na.omit <- na.omit(data.z)
data.x.y <- merge.data.frame(data.x.na.omit, data.y.na.omit, by.x = "time", by.y = "time")
data.x.y.z <- merge.data.frame(data.x.y, data.z.na.omit, by.x = "time", by.y = "time" )
data.x.y.z$x <- as.numeric(data.x.y.z$x)
data.x.y.z$y <- as.numeric(data.x.y.z$y)
data.x.y.z$z <- as.numeric(data.x.y.z$z)
data.x.y.z$result <- data.x.y.z$x + data.x.y.z$y + data.x.y.z$z
I don't see particularly good ways to use vectors to avoid repetition. I would suggest the following, though:
Removing NA rows by evaluating the result column once, so you don't have to do this for each of x, y and z.
Setting stringsAsFactors to FALSE, so that a single line like data.x$x <- as.numeric(data.x$x) coerces non-numeric strings such as "ND" to NA (with a warning), and you don't have to replace them separately.
Bringing in the data as a single dataframe (by adding NA to the bottom of columns y and z), rather than creating data.x, data.y and data.z then merging.
For example, code with these suggestions might look like this:
# Create merged data
data <- data.frame('time' = c(1:11),
'x' = c(5,3,"",'ND',2,'ND',7,8,'ND',1," "),
'y' = c(5,2,3,1,2,NA,NA,8, rep(NA, 3)),
'z' = c(1:5, rep(NA, 6)),
stringsAsFactors=F)
# Convert x, y and z to numeric; non-numeric strings such as "ND" become NA (with a warning)
for (col in c("x", "y", "z"))
  data[[col]] <- as.numeric(data[[col]])
# Add x, y and z together
data$result <- data$x + data$y + data$z
# Remove NAs at the end
data <- na.omit(data)
If your data sources are such that you can't bring them in as a single dataframe, but you have to merge them, then you could replace the "Create merged data" section with something like this:
# Create separate data
data.x <- data.frame('time' = c(1:11),
'x' = c(5,3,"",'ND',2,'ND',7,8,'ND',1," "),
stringsAsFactors=F)
data.y <- data.frame('time' = c(1:8),
'y' = c(5,2,3,1,2,NA,NA,8),
stringsAsFactors=F)
data.z <- data.frame('time' = c(1:5),
'z' = c(1:5),
stringsAsFactors=F)
# Merge data
data.xy <- merge(data.x, data.y)
data <- merge(data.xy, data.z)
# Now continue main code suggestion from the 'Convert x, y and z to numeric' section
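If the three sources have to stay separate, the repetition can also be reduced by putting them in a list and applying one cleaning function to each. This is only a sketch: the helper name clean_numeric is made up for this illustration, and it assumes that every non-numeric entry ("", " ", "ND") should be treated as missing.

```r
data.x <- data.frame(time = 1:11,
                     x = c(5, 3, "", "ND", 2, "ND", 7, 8, "ND", 1, " "),
                     stringsAsFactors = FALSE)
data.y <- data.frame(time = 1:8, y = c(5, 2, 3, 1, 2, NA, NA, 8))
data.z <- data.frame(time = 1:5, z = 1:5)

# Convert one value column to numeric and drop rows that are not numeric
clean_numeric <- function(df, col) {
  df[[col]] <- suppressWarnings(as.numeric(as.character(df[[col]])))
  na.omit(df)
}

frames  <- list(x = data.x, y = data.y, z = data.z)
cleaned <- Map(clean_numeric, frames, names(frames))

# Inner-join all cleaned data frames on time, then sum the three variables
merged <- Reduce(function(a, b) merge(a, b, by = "time"), cleaned)
merged$result <- rowSums(merged[, c("x", "y", "z")])
```

Because rows with missing values are removed before the inner join, only time points with a numeric value in all three sources reach the sum.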
I'm trying to merge multiple data frames by row names.
I know how to do it with two:
x = data.frame(a = c(1,2,3), row.names = letters[1:3])
y = data.frame(b = c(1,2,3), row.names = letters[1:3])
merge(x,y, by = "row.names")
But when I try using the reshape package's merge_all() I'm getting an error.
z = data.frame(c = c(1,2,3), row.names = letters[1:3])
l = list(x,y,z)
merge_all(l, by = "row.names")
Error in -ncol(df) : invalid argument to unary operator
What's the best way to do this?
Merging by row.names does weird things - it creates a column called Row.names, which makes subsequent merges hard.
To avoid that issue you can instead create a column with the row names (which is generally a better idea anyway, since row names are very limited and hard to manipulate). One way of doing that with the data as given in the OP (this is not the most efficient way; for faster and easier handling of rectangular data I recommend getting to know data.table):
Reduce(merge, lapply(l, function(x) data.frame(x, rn = row.names(x))))
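Spelled out on the data from the question, the one-liner above could look like the following sketch. It names the key column explicitly in merge and, as an optional last step that is an addition here, turns the rn column back into row names.

```r
x <- data.frame(a = c(1, 2, 3), row.names = letters[1:3])
y <- data.frame(b = c(1, 2, 3), row.names = letters[1:3])
z <- data.frame(c = c(1, 2, 3), row.names = letters[1:3])
l <- list(x, y, z)

# Copy the row names into an ordinary column so merge can join on it
with_rn <- lapply(l, function(df) {
  data.frame(rn = row.names(df), df, stringsAsFactors = FALSE)
})

# Merge the list pairwise, always joining on the rn column
merged <- Reduce(function(a, b) merge(a, b, by = "rn"), with_rn)

# Optional: restore the row names and drop the helper column
rownames(merged) <- merged$rn
merged$rn <- NULL
```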
Maybe there is a faster version using do.call or *apply, but this works in your case:
x = data.frame(X = c(1,2,3), row.names = letters[1:3])
y = data.frame(Y = c(1,2,3), row.names = letters[1:3])
z = data.frame(Z = c(1,2,3), row.names = letters[1:3])
merge.all <- function(x, ..., by = "row.names") {
L <- list(...)
for (i in seq_along(L)) {
x <- merge(x, L[[i]], by = by)
rownames(x) <- x$Row.names
x$Row.names <- NULL
}
return(x)
}
merge.all(x,y,z)
It may be important to define explicitly, in merge.all, every parameter (like by) that you want to forward to merge, because everything passed through ... is treated as an object to merge.
As an alternative to Reduce and merge:
If you put all the data frames into a list, you can then use grep and cbind to get the data frames with the desired row names.
## set up the data
x <- data.frame(x1 = c(2,4,6), row.names = letters[1:3])
y <- data.frame(x2 = c(3,6,9), row.names = letters[1:3])
z <- data.frame(x3 = c(1,2,3), row.names = letters[1:3])
a <- data.frame(x4 = c(4,6,8), row.names = letters[4:6])
lst <- list(a, x, y, z)
## combine all the data frames with row names = letters[1:3]
gg <- grep(paste(letters[1:3], collapse = ""),
           sapply(lapply(lst, rownames), paste, collapse = ""))
do.call(cbind, lst[gg])
##   x1 x2 x3
## a  2  3  1
## b  4  6  2
## c  6  9  3