I'm trying to rbind multiple loaded datasets (all of them have the same number of columns, named "num", "source" and "target"). In this case I have ten data frames, named "test1", "test2", "test3" and so on...
I thought that the solution below (creating an empty data frame and looping through the others) would solve my problem, but I guess that I'm missing something in the second argument of the rbind function. I don't know whether using paste0("test", i) to increment the variable (changing the name of the data frame) is correct... I'm afraid that I'm just trying to rbind a data frame with a string object (and getting an error), is that right?
test = as.data.frame(matrix(ncol = 3, nrow = 0)) %>%
setNames(c("num", "source", "target"))
i=1
while (i < 11) {
test = rbind(test, paste0("test", i))
i = i + 1
}
We need replicate to return a list, so we set simplify = FALSE
out <- setNames(replicate(10, test, simplify = FALSE),
paste0("test", seq_len(10)))
If the datasets already exist in the global environment, collect them into a list with mget and rbind them with do.call
out <- do.call(rbind, mget(paste0("test", 1:10)))
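To address the question's suspicion directly: paste0("test", i) only builds a character string, so the original loop binds a string rather than a data frame. The sketch below uses get() to look the object up by name; the ten small data frames are hypothetical stand-ins for the loaded datasets.

```r
# Hypothetical stand-ins for the ten loaded data frames test1 ... test10.
for (i in 1:10) {
  assign(paste0("test", i),
         data.frame(num = i, source = "a", target = "b",
                    stringsAsFactors = FALSE))
}

# paste0() returns a name (a string); get() returns the object with that name.
name_is_string  <- is.character(paste0("test", 1))
object_is_frame <- is.data.frame(get(paste0("test", 1)))

# The loop from the question, with get() supplying the actual data frame.
combined <- data.frame()
for (i in 1:10) {
  combined <- rbind(combined, get(paste0("test", i)))
}
```

Growing a data frame inside a loop copies it on every iteration, so the do.call(rbind, mget(...)) form above is usually the better choice once the lookup by name is understood.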
We could bind test1:test10 using the common pattern in the name:
library(dplyr)
result <- mget(ls(pattern="^test\\d+")) %>%
bind_rows()
If I understood correctly, this might help you
Libraries
library(dplyr)
Example data
list_of_df <-
list(
df1 = data.frame(a = "1"),
df2 = data.frame(a = "2"),
df3 = data.frame(a = "1"),
df4 = data.frame(a = "2")
)
Code
bind_rows(list_of_df,.id = "dataset")
Result
  dataset a
1     df1 1
2     df2 2
3     df3 1
4     df4 2
I have two data.frames: name and searches
name <- data.frame(
A = c("example", "firstly", "second.com"))
searches <- data.frame(
A = c("example.com","secondly","first"),
B = c("test", "test.com", "test1"))
I want to search in data.frame "searches" for the values in data.frame "name". If there is a similar value (not exactly the same) I want R to return the value from name and from searches in a new row in a new table.
So a new data.frame could be
result <- data.frame(
A = c("example", "firstly", "second.com"),
B = c("example.com","first","secondly"),
C = c("test", "test1", "test.com"))
Is that possible?
You can use the stringr package in R to do this. For example, if you have
name <- data.frame(
A = c("example", "firstly", "second.com"))
searches <- data.frame(
A = c("example.com","secondly","first"),
B = c("test", "test.com", "test1"))
then you can use
library(stringr)
str_extract(searches$A, '.*example.*')
This gives an output of
> str_extract(searches$A, '.*example.*')
[1] "example.com" NA NA
If you set this up with an appropriate for loop to iterate over elements in your name dataframe and cells of your searches dataframe then you could pick up all matches and extract them as desired.
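The loop described above could look roughly like the following base R sketch. It assumes that "similar" means one string contains the other, which is an assumption and not something the question defines; the helper find_matches is hypothetical. Under this rule "second.com" and "secondly" do not match, so a fuzzier criterion would be needed to recover that pair.

```r
name <- data.frame(A = c("example", "firstly", "second.com"),
                   stringsAsFactors = FALSE)
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one name, return the rows of `searches` whose A
# contains the name or is contained in it (assumed meaning of "similar").
find_matches <- function(n) {
  hit <- vapply(searches$A, function(s) {
    grepl(n, s, fixed = TRUE) || grepl(s, n, fixed = TRUE)
  }, logical(1), USE.NAMES = FALSE)
  if (!any(hit)) return(NULL)
  data.frame(A = n, B = searches$A[hit], C = searches$B[hit],
             stringsAsFactors = FALSE)
}

# Apply the helper to every name, drop names without a match, stack the rest.
matches <- Filter(Negate(is.null), lapply(name$A, find_matches))
result <- do.call(rbind, matches)
```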
Use the function stringdist_join from the fuzzyjoin package.
library(fuzzyjoin)
name <- data.frame(
A = c("example", "firstly", "second.com")
)
searches <- data.frame(
A = c("example.com","secondly","first"),
B = c("test", "test.com", "test1")
)
result <- stringdist_join(name, searches, by = "A", max_dist = 5)
Which results in:
> print(result)
         A.x         A.y        B
1    example example.com     test
2    firstly       first    test1
3 second.com    secondly test.com
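To see why a distance threshold pairs these strings, the edit distances can be inspected with base R's adist(). This is only a sketch of the idea: adist() computes Levenshtein distance, which may differ for some pairs from the metric stringdist_join uses.

```r
name <- data.frame(A = c("example", "firstly", "second.com"),
                   stringsAsFactors = FALSE)
searches <- data.frame(
  A = c("example.com", "secondly", "first"),
  B = c("test", "test.com", "test1"),
  stringsAsFactors = FALSE
)

# Rows correspond to name$A, columns to searches$A.
d <- adist(name$A, searches$A)
dimnames(d) <- list(name$A, searches$A)

# Distances of the three pairs shown in the result above.
d_example <- d["example", "example.com"]   # ".com" appended
d_firstly <- d["firstly", "first"]         # "ly" removed
d_second  <- d["second.com", "secondly"]   # ".com" replaced by "ly"
```

All three pairs fall within max_dist = 5, which is consistent with them appearing in the joined result.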
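I have a master data frame that holds the name of another data frame (df2, df3) together with a column name and a row index. I want to use these to look up a value in the named data frame and write it into the x column of the master data frame.
I think a for loop could do this, but I don't know how to start and I haven't used R for a while.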
master <- data.frame(df = c("df2","df2","df3"), column =c("A","C","B"),row = c(1,2,3), x = c(1,1,1))
df2 <- data.frame(A = c(2,4,6), B = c(1,3,5),C = c(4,8,5))
df3 <- data.frame(A = c(12,14,16), B = c(11,13,15),C = c(24,28,25))
Thanks
If you are going to use for-loop, I guess the following could help you
for (k in 1:nrow(master)) {
master$x[k] <- eval(parse(text = sprintf("%s$%s[%s]",master$df[k],master$column[k],master$row[k])))
}
where sprintf builds the query as a string (for example "df2$A[1]"), parse turns that string into an expression, and eval evaluates it
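An alternative that avoids eval(parse()) is to keep the data frames in a named list and index into it. This is a sketch under the assumption that all referenced data frames can be collected up front; the list name frames is hypothetical.

```r
master <- data.frame(df = c("df2", "df2", "df3"),
                     column = c("A", "C", "B"),
                     row = c(1, 2, 3),
                     x = c(1, 1, 1),
                     stringsAsFactors = FALSE)
df2 <- data.frame(A = c(2, 4, 6), B = c(1, 3, 5), C = c(4, 8, 5))
df3 <- data.frame(A = c(12, 14, 16), B = c(11, 13, 15), C = c(24, 28, 25))

# Hypothetical lookup table: data frame name -> data frame.
frames <- list(df2 = df2, df3 = df3)

# For each master row, pick the frame by name, then the cell by row and column.
master$x <- mapply(
  function(d, col, r) frames[[d]][r, col],
  master$df, master$column, master$row,
  USE.NAMES = FALSE
)
```

With the example data this fills x with df2$A[1], df2$C[2] and df3$B[3].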
I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument to a function that reads out the information I need from the individual rows. However, when I try to use it as an argument I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this I get a list with the column names, but all values are NULL.
Expected output :
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and, later, other values from the same row of the data frame).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row){
aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
row["mean.aa"] <- mean(aa)
as.list(row)
})
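For context, Map() over a data frame iterates over its columns, not its rows, which is why the original attempt returned one element per column name. Note also that apply() converts the data frame to a character matrix first, so SN and mean.aa come back as strings. The sketch below is one possible row-wise variant that keeps the original column types; it is not the only way to do this.

```r
DF <- data.frame(
  Name = c("A", "B"),
  SN = 1:2,
  Age = c("21,34,456,567,23,123,34",
          "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
  stringsAsFactors = FALSE
)

# Iterate over row indices so each row keeps its column types.
rows <- lapply(seq_len(nrow(DF)), function(i) {
  row <- as.list(DF[i, ])                        # one row as a named list
  aa <- as.numeric(strsplit(row$Age, ",")[[1]])  # split the comma-separated ages
  row$mean.aa <- mean(aa)
  row
})
```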
I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get renamed with the .x and .y suffixes:
  id       val.x       val1     val.y      val2
1  G -0.05612874  0.2914462  2.087167 0.7876396
2  G -0.05612874  0.2914462 -0.255027 1.4411577
3  J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one they exist in for the joined id.
I don't know these shared column names in advance, but they are not difficult to find:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select_(.dots = c("id",repeating.colnames)))) %>%
unique()
I can then drop these columns from each data.frame in the list:
for(r in 1:length(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select_(.dots = paste0("-",repeating.colnames))
And then join them as above:
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding the repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
df1_names <- names(df1)
df2_names <- names(df2)
dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
# drop = FALSE keeps a data frame even if only one column of df2 remains
out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
You can just get rid of the duplicate columns from one of the data frames if you say you don't really care about them and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
# match by name rather than by position, so column order does not matter
duplicates = names(df2) %in% names(df1) & names(df2) != "id"
df2 = df2[, !duplicates, drop = FALSE]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
So I have three data sets that I need to merge. These contain school data and read/math scores for grades 4 and 5. One of them is a long form data set that has a lot of missingness in some variables (yes, I do need the data in long form) and the other two hold the complete data, including the values that are missing from the long form set, in wide form. All of these data frames contain a column that has a unique ID number for each individual in the database.
Here is a full reproducible example that generates a small example of the types of data.frames I am working with... The three data frames that I need to use are the following: school_lf, school4 and school5. school_lf has the long form data with NAs and school4 and school5 are the dfs I need to use to populate the NA's in this long form data (by id and grade)
library(reshape2) # provides melt()
set.seed(890)
school <- NULL
school$id <-sample(102938:999999, 100)
school$selected <-sample(0:1, 100, replace = T)
school$math4 <- sample(400:500, 100)
school$math5 <- sample(400:500, 100)
school$read4 <- sample(400:500, 100)
school$read5 <- sample(400:500, 100)
school <- as.data.frame(school)
# Delete observations at random from the school df
indm4 <- which(school$math4 %in% sample(school$math4, 25))
school$math4[indm4] <- NA
indm5 <- which(school$math5 %in% sample(school$math5, 50))
school$math5[indm5] <- NA
indr4 <- which(school$read4 %in% sample(school$read4, 70))
school$read4[indr4] <- NA
indr5 <- which(school$read5 %in% sample(school$read5, 81))
school$read5[indr5] <- NA
# Separate Read and Math
read <- as.data.frame(subset(school, select = -c(math4, math5)))
math <- as.data.frame(subset(school, select = -c(read4, read5)))
# Now turn this into long form data...
clr <- melt(read, id.vars = c("id", "selected"), variable.name = "variable", value.name = "readscore")
clm <- melt(math, id.vars = c("id", "selected"), value.name = "mathscore")
# Clean up the grades for each of these...
clr$grade <- ifelse(clr$variable == "read4", 4,
ifelse(clr$variable == "read5", 5, NA))
clm$grade <- ifelse(clm$variable == "math4", 4,
ifelse(clm$variable == "math5", 5, NA))
# Put all these in one df
school_lf <-cbind(clm, clr$readscore)
school_lf$readscore <- school_lf$`clr$readscore` # renames
school_lf$`clr$readscore` <- NULL # deletes
school_lf$variable <- NULL # deletes
###############
# Generate the 2 data frames with IDs that have the full data
set.seed(890)
school4 <- NULL
school4$id <-sample(102938:999999, 100)
school4$selected <-sample(0:1, 100, replace = T)
school4$math4 <- sample(400:500, 100)
school4$read4 <- sample(400:500, 100)
school4$grade <- 4
school4 <- as.data.frame(school4)
set.seed(890)
school5 <- NULL
school5$id <-sample(102938:999999, 100)
school5$selected <-sample(0:1, 100, replace = T)
school5$math5 <- sample(400:500, 100)
school5$read5 <- sample(400:500, 100)
school5$grade <- 5
school5 <- as.data.frame(school5)
I need to merge the wide-form data into the long-form data to replace the NAs with the actual values. I have tried the code below, but it introduces several columns instead of merging the read scores and the math scores where there's NA's. I simply need one column with the read scores and one with the math scores, instead of six separate columns (read.x, read.y, math.x, math.y, mathscore and readscore).
sch <- merge(school_lf, school4, by = c("id", "grade", "selected"), all = T)
sch <- merge(sch, school5, by = c("id", "grade", "selected"), all = T)
Any help is highly appreciated! I've been trying to solve this for hours now and haven't made any progress (so figured I'd ask here)
You can use the coalesce function from dplyr. If a value in the first vector is NA, it will see if the value at the same position in the second vector is not NA and select it. If again NA, it goes to the third.
library(dplyr)
sch %>% mutate(mathscore = coalesce(mathscore, math4, math5)) %>%
mutate(readscore = coalesce(readscore, read4, read5)) %>%
select(id:readscore)
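A tiny illustration of how coalesce fills gaps position by position may help here. The vectors below are made-up values that only borrow the column names from the example above.

```r
library(dplyr)

# Made-up scores: each position has at most one candidate before the last.
mathscore <- c(410, NA, NA, 455)
math4     <- c(NA, 420, NA, NA)
math5     <- c(NA, NA, 430, 499)

# For each position, take the first non-NA value across the three vectors.
filled <- coalesce(mathscore, math4, math5)
```

In the last position mathscore already holds a value, so 455 is kept and the 499 in math5 is ignored.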
EDIT: I just tried to do this approach on my actual data and it does not work because the replacement data also has some NAs and, as a result, the dfs I try to do coalesce with have differing number of rows... Back to square one.
I was able to figure this out with the following code (albeit it's not the most elegant or straightforward), and @Edwin's response helped point me in the right direction. Any suggestions on how to make this code more elegant and efficient are more than welcome!
# Idea: put both in long form and stack on top of one another... then merge like that!
# school4 and school5 name their score columns math4/read4 and math5/read5
sch4r <- as.data.frame(subset(school4, select = -c(math4)))
sch4m <- as.data.frame(subset(school4, select = -c(read4)))
sch5r <- as.data.frame(subset(school5, select = -c(math5)))
sch5m <- as.data.frame(subset(school5, select = -c(read5)))
# Put these in LF
sch4r_lf <- melt(sch4r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch4m_lf <- melt(sch4m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
sch5r_lf <- melt(sch5r, id.vars = c("id", "selected", "grade"), value.name = "readscore")
sch5m_lf <- melt(sch5m, id.vars = c("id", "selected", "grade"), value.name = "mathscore")
# Combine in one DF
sch_full_4 <-cbind(sch4r_lf, sch4m_lf$mathscore)
sch_full_4$mathscore <- sch_full_4$`sch4m_lf$mathscore`
sch_full_4$`sch4m_lf$mathscore` <- NULL # deletes
sch_full_4$variable <- NULL
sch_full_5 <- cbind(sch5r_lf, sch5m_lf$mathscore)
sch_full_5$mathscore <- sch_full_5$`sch5m_lf$mathscore`
sch_full_5$`sch5m_lf$mathscore` <- NULL
sch_full_5$variable <- NULL
# Stack together
sch_full <- rbind(sch_full_4,sch_full_5)
sch_full$selected <- NULL # delete this column...
# MERGE together
final_school_math <- mutate(school_lf, mathscore = coalesce(school_lf$mathscore, sch_full$mathscore))
final_school_read <- mutate(school_lf, readscore = coalesce(school_lf$readscore, sch_full$readscore))
final_df <- cbind(final_school_math, final_school_read$readscore)
final_df$readscore <- final_df$`final_school_read$readscore`
final_df$`final_school_read$readscore` <- NULL