I have a list of 59 data frames that I want to merge together. Unfortunately, because I have scraped many of them, the columns in the data frames have different classes. They all have the column "Name", some in factor form and some in character form. I want to change all of them to character form. I tried the following
dts <- c("Alabama","Alaska","Arizona","Arkansas","California","Colorado","Connecticut","Delaware","Florida",
"Georgia","Hawaii","Idaho","Illinois","Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine",
"Maryland","Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana","Nebraska",
"Nevada","New_Hampshire","New_Jersey","New_Mexico","New_York","North_Carolina","North_Dakota",
"Ohio","Oklahoma","Oregon","Pennsylvania","Rhode_Island","South_Carolina","South_Dakota","Tennessee",
"Texas","Utah","Vermont","Virginia","Washington","West_Virginia","Wisconsin","Wyoming","Federal",
"CCJail","DC","LAJail","NOLA","NYCJail","OCJail","PhilJail","TXJail")
for(i in 1:length(dts)){
dts[i]$Name <- as.character(dts[i]$Name)
}
but it only gave me the error "Error: $ operator is invalid for atomic vectors".
Does anyone know of a good work-around? Thanks in advance for the help!
My ultimate goal is to run
dta <-dplyr::bind_rows(Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,Florida,
Georgia,Hawaii,Idaho,Illinois,Indiana,Iowa,Kansas,Kentucky,Louisiana,Maine,
Maryland,Massachusetts,Michigan,Minnesota,Mississippi,Missouri,Montana,Nebraska,
Nevada,New_Hampshire,New_Jersey,New_Mexico,New_York,North_Carolina,North_Dakota,
Ohio,Oklahoma,Oregon,Pennsylvania,Rhode_Island,South_Carolina,South_Dakota,Tennessee,
Texas,Utah,Vermont,Virginia,Washington,West_Virginia,Wisconsin,Wyoming,Federal,CCJail,
DC,LAJail,NOLA,NYCJail,OCJail,PhilJail,TXJail)
But I get the error "Error: Can't combine ..1$Residents.Confirmed and ..2$Residents.Confirmed". There are a ton of columns in each data frame, and their classes often differ. If anyone has a more elegant solution, I would also be open to that instead! Thanks!
We can load the datasets into a list with mget (assuming the dataset objects already exist in the global environment), loop over the list with map, and convert the 'Name' column to character inside mutate. Using the _dfr variant of map (map_dfr) row-binds the results into a single data frame.
library(dplyr)
library(purrr)
out <- map_dfr(mget(dts), ~ .x %>%
mutate(Name = as.character(Name)))
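The error in the question arises because dts[i] is only a length-one character vector such as "Alabama", not the data frame of that name, so $ has nothing to index. A minimal sketch of the mget approach on toy data follows; the two small data frames are hypothetical stand-ins for the scraped ones:

```r
library(dplyr)
library(purrr)

# Hypothetical stand-ins for two of the scraped data frames
Alabama <- data.frame(Name = factor(c("a1", "a2")),
                      Residents.Confirmed = c(0, 1))
Alaska <- data.frame(Name = c("b1", "b2"),
                     Residents.Confirmed = c(2, 3),
                     stringsAsFactors = FALSE)

dts <- c("Alabama", "Alaska")

# mget() looks up the objects whose names are stored in dts
# and returns them as a named list of data frames
dfs <- mget(dts)

# Convert Name to character in every element, then row-bind
out <- map_dfr(dfs, ~ .x %>% mutate(Name = as.character(Name)))
```

As far as I know, map_dfr still works in current purrr but is marked as superseded from purrr 1.0.0 onward, where map followed by list_rbind is the suggested replacement.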
If many columns differ in class, it may be better to convert all columns to a single class (character), bind, and then let type.convert restore suitable types
out <- map_dfr(mget(dts), ~ .x %>%
mutate(across(everything(), as.character)))
out <- type.convert(out, as.is = TRUE)
If the dplyr version is < 1.0.0, use mutate_all
out <- map_dfr(mget(dts), ~ .x %>%
mutate_all(as.character))
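As a small illustration of the convert-everything approach, the sketch below uses two made-up data frames in which Residents.Confirmed is numeric in one and character in the other. It assumes dplyr 1.0.0 or later for across:

```r
library(dplyr)
library(purrr)

# Made-up data: the same column has a different class in each data frame
d1 <- data.frame(Name = "n1", Residents.Confirmed = 5,
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = "n2", Residents.Confirmed = "7",
                 stringsAsFactors = FALSE)

# Binding directly fails because double and character cannot be combined
direct <- tryCatch(bind_rows(d1, d2), error = function(e) "error")

# Convert every column to character first, then bind
out <- map_dfr(list(d1, d2),
               ~ mutate(.x, across(everything(), as.character)))

# Let R guess the column types again; as.is = TRUE keeps strings as character
out <- type.convert(out, as.is = TRUE)
```

One thing to keep in mind with this design is that type.convert guesses types from the text, so a column that contains a stray non-numeric entry will stay character.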
d1 <- data.frame(
Name = as.factor(c("name1", "name2")),
Residents.Confirmed = c(0,1)
)
d2 <- data.frame(
Name = c("name3", "name4"),
Residents.Confirmed = c(2,3)
)
dataframes_list <- list(d1, d2)
library(dplyr)
# Index into the list with [[i]] so that each element is a data frame
for (i in seq_along(dataframes_list)) {
  dataframes_list[[i]]$Name <- as.character(dataframes_list[[i]]$Name)
}
bind_rows(dataframes_list)
Base R solution:
type.convert(do.call("rbind",
Map(function(x){data.frame(lapply(x, as.character))}, dataframes_list)), as.is = TRUE)
Data (thanks to @chase171):
d1 <- data.frame(
Name = as.factor(c("name1", "name2")),
Residents.Confirmed = c(0,1)
)
d2 <- data.frame(
Name = c("name3", "name4"),
Residents.Confirmed = c(2,3)
)
dataframes_list <- list(d1, d2)
Related
I have a list with 5 data.frames. Now I want to change the name of the last column of each data.frame.
And I don't know exactly how many columns are in the df.
Example-data:
library(tidyverse)
data(mtcars)
df1 <- tail(mtcars)
df2 <- mtcars[1:5, 2:10]
df3 <- mtcars
df4 <- head(mtcars)
list <- list(df1, df2, df3, df4)
Doing it one by one, this would be the command:
colnames(list$df1)[length(list$df1)] <- "rank"
Within a for loop, I would think that the command would then be:
for (i in seq_along(list)) {
colnames(i)[length(i)] <- "rank"
}
But here I get the error:
Error in `colnames<-`(`*tmp*`, value = `*vtmp*`) :
attempt to set 'colnames' on an object with less than two dimensions
Any idea how to solve this problem? Maybe by the map-command?
Here I don't know how to include the index/length(df) to assign the colnames-command to the last column of the dataframe.
Thank you for your help :)
Kathrin
You can use last_col() from dplyr within map:
library(tidyverse)
list <- map(list,~{
.x %>%
rename(rank = last_col())
})
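The loop in the question fails because i is an integer index, not a data frame, so colnames(i) has nothing with two dimensions to work on. A base R sketch of the same loop that indexes back into the list is shown below; the list name dfs is an assumption, chosen to avoid masking the list function:

```r
# Example list built from mtcars, as in the question (unnamed elements)
dfs <- list(tail(mtcars), mtcars[1:5, 2:10], mtcars, head(mtcars))

# seq_along() yields 1, 2, ...; use dfs[[i]] to reach each data frame
for (i in seq_along(dfs)) {
  names(dfs[[i]])[ncol(dfs[[i]])] <- "rank"
}

# Name of the last column in every element after renaming
last_names <- vapply(dfs, function(d) names(d)[ncol(d)], character(1))
```

Note that list$df1 in the question would only work if the list had been created with names, for example list(df1 = df1, df2 = df2).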
I have multiple .csv files (mydata_1, mydata_2, ...) with the same number of columns and the same column names (but different numbers of rows, if that helps in finding an answer). After reading them into my environment they have the class data.frame. I put them all in a list and now want to select specific columns by name from all of them, so that each ends up under the same variable name with just the chosen columns.
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = c(df1, df2)
class(all_data)
class(df1)
for (i in all_data){
i = select(i,"X3":"X5")
}
My for loop should output the data.frames df1 and df2 with just three columns (instead of the original seven), but when I run the code, an error message from the select command appears.
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "c('integer', 'numeric')"
How can I get a working output of my new dfs?
The first issue here is that you are trying to create a list using c(df1, df2), whereas you have to use list(df1, df2)
Data
library(dplyr)
library(purrr)
mydata_1 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
mydata_2 = matrix(c(1:21), nrow=3, ncol=7,byrow = TRUE)
colnames(mydata_1) = c(paste0("X","1":"7"))
colnames(mydata_2) = c(paste0("X","1":"7"))
df1 = as.data.frame(mydata_1)
df2 = as.data.frame(mydata_2)
all_data = list(df1 = df1, df2 = df2)
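To see why c(df1, df2) does not give a list of data frames, it can help to compare the two results directly. In this sketch the names flat and nested are only illustrative:

```r
df1 <- as.data.frame(matrix(1:21, nrow = 3, ncol = 7, byrow = TRUE))
df2 <- df1

# c() treats each data frame as a list of columns and concatenates them
flat <- c(df1, df2)

# list() keeps each data frame intact as one element
nested <- list(df1, df2)

length(flat)        # number of column vectors
class(flat[[1]])    # class of the first element of flat
length(nested)      # number of data frames
class(nested[[1]])  # class of the first element of nested
```

This is why the loop in the question ends up calling select on an integer vector: each element of c(df1, df2) is a single column.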
The second problem is within your loop. With this approach you have to create an empty list before running the loop and then assign each result to it on every iteration.
all_data2 <- list()
for(i in 1:length(all_data)) {
all_data2[[i]] <- all_data[[i]] %>% select(X3, X4, X5)
}
Alternatively, try map from purrr, which is part of the tidyverse and leads to cleaner code with the same result.
# Here `.x` stands for each element of the list all_data in turn,
# so the result is a list of two data frames
all_data2 = map(all_data, ~.x %>%
select(X3, X4, X5))
Consider base R's subset with the select argument for contiguous column selection, wrapped in an lapply call. Unlike a for loop, lapply does not require the bookkeeping to reassign each element back into a list:
all_data <- list(df1 = df1, df2 = df2)
all_data_sub <- lapply(all_data, function(df) subset(df, select=X3:X5))
In R, I currently have 100 dataframes, named df.1, ...,df.100. I would like to be able to rbind them but it is costly to write out:
rbind(df.1, df.2, etc)
So, I have tried:
rbind(eval(as.symbol(paste0("df.",1:84, collapse = ", "))))
However, this returns errors. Does anyone know how I can make the dataframes usable? thanks.
You can rbind them one at a time in a loop.
df.1 = iris
df.2 = iris
df.3 = iris
DF = df.1
for (i in 2:3) {
  DF = rbind(DF, eval(as.symbol(paste("df", i, sep = "."))))
}
Using mget and then do.call or dplyr's bind_rows should work.
df.1 = iris[1:20,]
df.2 = iris[21:50,]
do.call("rbind",mget(paste0("df.",1:2)))
library(dplyr)
bind_rows(mget(paste0("df.",1:2)))
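The attempt in the question fails because paste0 with collapse builds one long string, which as.symbol then turns into a single symbol with that whole string as its name, and no such object exists. If the number of data frames is not known in advance, one possible sketch is to collect the names with ls and a pattern, assuming that every object whose name matches df.<number> should be included:

```r
df.1 <- iris[1:20, ]
df.2 <- iris[21:50, ]
df.3 <- iris[51:60, ]

# Collect the names of all objects that look like df.<number>
nms <- ls(pattern = "^df\\.[0-9]+$")

# ls() sorts alphabetically (df.10 would come before df.2),
# so reorder by the numeric suffix
nms <- nms[order(as.integer(sub("^df\\.", "", nms)))]

# unname() stops rbind from prefixing the row names with the list names
combined <- do.call(rbind, unname(mget(nms)))
```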
I know that there are many related questions here on SO, but I am looking for a purrr solution, please, not one from the apply list of functions or cbind/rbind (I want to take this opportunity to get to know purrr better).
I have a list of dataframes and I would like to add a new column to each dataframe in the list. The value of the column will be the name of the dataframe, i.e. the name of each element in the list.
There is something similar here, but it involves the use of a function and mutate_each(), whereas I need just mutate().
To give you an idea of the list (called comentarios), here is the first line of str() on the first element:
> str(comentarios[1])
List of 1
$ 166860353356903_661400323902901:'data.frame': 13 obs. of 7 variables:
So I would like my new variable to contain 166860353356903_661400323902901 for 13 lines in the result, as an ID for each dataframe.
What I am trying is:
dff <- map_df(comentarios,
~ mutate(ID = names(comentarios)),
.id = "Group"
)
However, mutate() needs the name of the dataframe in order to work:
Error in mutate_(.data, .dots = lazyeval::lazy_dots(...)) :
argument ".data" is missing, with no default
It doesn't make sense to put in each name, I'd be straying into loop territory and losing the advantages of purrr (and R, more generally). If the list was smaller, I'd use reshape::merge_all(), but it has over 2000 elements. Thanks in advance for any help.
edit: some data to make the problem reproducible, as per alistaire's comments
# install.packages("tidyverse")
library(tidyverse)
df <- data_frame(one = rep("hey", 10), two = seq(1:10), etc = "etc")
list_df <- list(df, df, df, df, df)
names(list_df) <- c("first", "second", "third", "fourth", "fifth")
dfs <- map_df(list_df,
~ mutate(id = names(list_df)),
.id = "Group"
)
Your issue is that you have to explicitly provide reference to the data when you're not using mutate with piping. To do this, I'd suggest using map2_df
dff <- map2_df(comentarios, names(comentarios), ~ mutate(.x, ID = .y))
Using the OP's data, the answer would be
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df <- tibble(one = rep("hey", 10), two = 1:10, etc = "etc")
list_df <- list(df, df, df, df, df)
dfnames <- c("first", "second", "third", "fourth", "fifth")
dfs <- list_df %>% map2_df(dfnames, ~ mutate(.x, name = .y))
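When the list is already named, two shorter variants may also be worth considering. The sketch below assumes a small named list like the one in the question and compares purrr's imap_dfr, which passes each element's name as .y, with the .id argument of dplyr's bind_rows:

```r
library(dplyr)
library(purrr)

df <- tibble(one = rep("hey", 3), two = 1:3)
list_df <- list(first = df, second = df)

# imap_dfr() passes each element as .x and its name as .y
via_imap <- imap_dfr(list_df, ~ mutate(.x, ID = .y))

# bind_rows() can add the element names itself through .id
via_bind <- bind_rows(list_df, .id = "ID")
```

Both results carry the same ID values; the only difference is that bind_rows places the ID column first, while mutate appends it at the end.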
On many occasions, after grouping a data frame by some variables, I want to apply a function that uses data from another data frame that is grouped by the same variables. The best solution I found is to use semi_join inside the function as follows:
d1 <- data.frame(model = c(1,1,2,2), x = runif(4) )
d2 <- data.frame(model=c(1,1,1,2,2,2), y = runif(6) )
myfun <- function(df1, df2) {
subsetdf2 <- semi_join(df2, df1)
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
d1 %>% group_by(model) %>% do(myfun(., d2))
The problem is that semi_join returns 'Joining by...' messages and, as I am using the function to do bootstrap, I get many messages that collapse the console. So, is there any way to reduce the verbosity of joins? Do you know a more elegant way to do something like this?
P.S. I asked a similar question a few years ago for plyr: subset inside a function by the variables specified in ddply
If all you want to do is stop the 'Joining by: ' statement, you just need to specify what column you are joining on with the by argument.
For example:
semi_join(d2, d1, by="model")
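A short sketch of the effect, using the question's data with a fixed seed and one group of d1 picked by hand (the names g1 and sub are only illustrative):

```r
library(dplyr)

set.seed(1)
d1 <- data.frame(model = c(1, 1, 2, 2), x = runif(4))
d2 <- data.frame(model = c(1, 1, 1, 2, 2, 2), y = runif(6))

# One group of d1, as do() would pass it to the function
g1 <- d1[d1$model == 1, ]

# With by given explicitly, dplyr has no join columns to announce
sub <- semi_join(d2, g1, by = "model")
```

If the join columns cannot be named up front, wrapping the call in suppressMessages is another option, at the cost of hiding any other messages raised inside it.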
EDIT - As an alternative to using semi_join you can use a base solution. As the group_by function is passing the data by groups, you can filter using a simple indexing statement. This will avoid the need for an additional parameter. This also currently assumes that the column of interest is the first column.
myfun <- function(df1, df2) {
subsetdf2 <- df2[df2[,1] %in% unique(df1[,1]),]
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
I adapted the solution of @cdeterman. It is a bit redundant though.
d1 <- data.frame(model = c(1,1,2,2), x = runif(4) )
d2 <- data.frame(model=c(1,1,1,2,2,2), y = runif(6) )
myfun <- function(df1, df2, gv) {
subsetdf2 <- semi_join(df2, df1, by = gv)
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
group_var <- 'model'
d1 %>% group_by_(group_var) %>% do(myfun(., d2,group_var))