I need to arrange hierarchical data.
Suppose a data frame df is given (here I generate df with some fake data):
df1 <-data.frame("Col1" = rep("a",8), "Col2"= c(rep("M",3),rep("N",2), rep("O", 2), rep("P",1)), "Col3" = LETTERS[1:8])
df2 <-data.frame("Col1" = rep("b",13), "Col2"= c(rep("p",4),rep("q",5),rep("r",3),rep("s",1)), "Col3" = LETTERS[1:13])
df <- rbind(df1,df2)
For each value of Col1, I need the number of rows per Col2 value, sorted in ascending order.
The result I am looking for is a list of lists:
list a : (1,2,2,3)
list b : (1,3,4,5)
ll <- split(df, df$Col1)
lapply(ll, function(dat){
v <- Filter(function(v) !is.na(v), with(dat, tapply(Col1, Col2, length)))
v[order(v)]
})
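The same counts can also be sketched with table instead of tapply. This is a minimal illustration, assuming the df built above and that only the group sizes are wanted (the Col2 labels are dropped); the name counts is introduced here for the example.

```r
# Rebuild the example data from the question
df1 <- data.frame(Col1 = rep("a", 8),
                  Col2 = c(rep("M", 3), rep("N", 2), rep("O", 2), "P"),
                  Col3 = LETTERS[1:8])
df2 <- data.frame(Col1 = rep("b", 13),
                  Col2 = c(rep("p", 4), rep("q", 5), rep("r", 3), "s"),
                  Col3 = LETTERS[1:13])
df <- rbind(df1, df2)

# Split Col2 by Col1, count each Col2 value, and sort the counts.
# as.character() avoids counting unused factor levels on R < 4.0.
counts <- lapply(split(df$Col2, df$Col1), function(col2) {
  sort(as.vector(table(as.character(col2))))
})

counts$a  # 1 2 2 3
counts$b  # 1 3 4 5
```

Dropping as.vector() would keep the Col2 labels as names on each count, which is closer to what the tapply version returns.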
Let's say I have a list of dataframes
myList <- list(df1 = data.frame(A = as.character(sample(10)), B =
rep(1:2, 10)), df2 = data.frame(A = as.character(sample(10)), B = rep(1:2, 10)) )
I want to coerce column A in each dataframe to double.
I'm trying:
myList = sapply(myList,simplify = FALSE, function(x){
x$A <- as.double(x$A) })
But this returns only the coerced values, not the data frames with their column names.
I also tried dplyr's mutate_if, but without success.
We can use lapply with transform in base R
myList2 <- lapply(myList, transform, A = as.double(A))
Or use map with mutate from tidyverse
library(dplyr)
library(purrr)
myList2 <- map(myList, ~ .x %>%
mutate(A = as.double(A)))
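The issue in the OP's code is that the anonymous function does not return the data frame 'x'. Its last expression is the assignment, so each list element becomes the coerced vector. Returning x explicitly fixes it: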
myList2 <- sapply(myList, simplify = FALSE,
function(x){
x$A <- as.double(x$A)
x
})
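To make the difference visible, here is a small sketch comparing the two versions of the anonymous function. The data is a simplified stand-in for myList (10 rows per data frame), and the names wrong and right are introduced only for this illustration.

```r
# Simplified stand-in for the list in the question
myList <- list(
  df1 = data.frame(A = as.character(1:10), B = rep(1:2, 5)),
  df2 = data.frame(A = as.character(10:1), B = rep(1:2, 5))
)

# The value of an assignment is the assigned value, so this function
# returns the coerced vector rather than the data frame
wrong <- lapply(myList, function(x) {
  x$A <- as.double(x$A)
})

# Ending the function with x returns the modified data frame
right <- lapply(myList, function(x) {
  x$A <- as.double(as.character(x$A))  # as.character() guards against factors
  x
})

is.data.frame(wrong$df1)  # FALSE
is.data.frame(right$df1)  # TRUE
is.double(right$df1$A)    # TRUE
```

The extra as.character() in the second version only matters when column A is a factor (the default for strings before R 4.0), where as.double alone would return the level codes.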
I have the following list, which contains several dataframes that all have the same column names:
my_list <- list(df1 = data.frame(A = c(1:3), B = c(4:6), C = c(7:9)),
df2 = data.frame(A = c(1:4), B = c(5:8), C = c(9:12)),
df3 = data.frame(A = c(1:5), B = c(6:10), C = c(11:15)))
Is there an efficient way to rename column A in every data frame in the list at once using base R functions?
I was thinking that something like
names(lapply(my_list, `[[`, "A")) <- "new_name"
may work, but I think I'm off track - the lapply function returns an object that might not work for what I'm trying to do.
Thanks!
A few more base options:
# rename first column name
lapply(my_list, function(x) setNames(x, replace(names(x), 1, "new_name_for_A")))
# rename column named "A"
lapply(my_list, function(x) setNames(x, replace(names(x), names(x) == "A", "new_name_for_A")))
# lowly for loop
for (i in seq_along(my_list)) {
names(my_list[[i]])[names(my_list[[i]]) == "A"] = "new_name_for_A"
}
We can use map to loop over the list and rename the column named 'A' to 'new_name' with rename_at
library(purrr)
library(dplyr)
map(my_list, ~ .x %>%
rename_at(vars("A"), ~ "new_name"))
Or in base R, using an anonymous function
lapply(my_list, function(x) {names(x)[names(x) == "A"] <- "new_name"; x})
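If the rename is needed in several places, the anonymous function can be wrapped in a small helper. This is only a sketch: rename_col is a hypothetical name, not a base R function, and it assumes every data frame in the list may or may not contain the old column.

```r
my_list <- list(df1 = data.frame(A = 1:3, B = 4:6, C = 7:9),
                df2 = data.frame(A = 1:4, B = 5:8, C = 9:12),
                df3 = data.frame(A = 1:5, B = 6:10, C = 11:15))

# Hypothetical helper: rename column `old` to `new` in every data frame.
# Data frames without a column called `old` are returned unchanged.
rename_col <- function(lst, old, new) {
  lapply(lst, function(x) {
    names(x)[names(x) == old] <- new
    x
  })
}

renamed <- rename_col(my_list, "A", "new_name")
names(renamed$df1)  # "new_name" "B" "C"
names(my_list$df1)  # "A" "B" "C" (the original list is not modified)
```

Because lapply returns a new list, the result has to be assigned; the for loop shown earlier is the variant that modifies my_list in place.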
How about
new.names = c('New', 'B', 'C')
lapply(my_list, `names<-`, new.names)
For the added example in your edit, you would simply change this to
new.names = sub('B', 'New', names(my_list[[1]]))
I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get renamed with the .x and .y suffixes:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one they exist in for the joined id.
I don't know these shared column names in advance, but they're not difficult to find out:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select(dplyr::all_of(c("id",repeating.colnames))))) %>%
unique()
I can then drop these columns from each data.frame in the list and join them as above:
for(r in seq_along(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select(-dplyr::all_of(repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
Now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
df1_names <- names(df1)
df2_names <- names(df2)
dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
# drop = FALSE keeps df2 a data frame even if only one column remains
out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols), drop = FALSE], by = by_col)
return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same result:
> all.equal(df_chase, df_orig)
[1] TRUE
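The "accumulating" behaviour is easier to see on tiny data with three tables. The sketch below uses base Reduce and merge instead of purrr and dplyr, so it is a swapped-in variant of the idea rather than the code above; join_keep_left and the data frames a, b and c3 are made up for this illustration.

```r
# Hypothetical helper: join on by_col, keeping shared non-key columns
# from the left table only
join_keep_left <- function(left, right, by_col = "id") {
  dup_cols <- setdiff(intersect(names(left), names(right)), by_col)
  right <- right[, !(names(right) %in% dup_cols), drop = FALSE]
  merge(left, right, by = by_col)
}

a  <- data.frame(id = c("x", "y"), val = c(1, 2),   val1 = c(10, 20))
b  <- data.frame(id = c("x", "y"), val = c(-1, -2), val2 = c(100, 200))
c3 <- data.frame(id = c("x", "y"), val1 = c(0, 0),  val3 = c(7, 8))

# Reduce calls join_keep_left(accumulated, next_table) from left to right
joined <- Reduce(join_keep_left, list(a, b, c3))

names(joined)  # "id" "val" "val1" "val2" "val3"
joined$val     # 1 2  (taken from a, the left-most table)
joined$val1    # 10 20 (also from a; the val1 in c3 is dropped)
```

Note that merge sorts the result by the key by default, whereas dplyr::inner_join keeps the row order of the left table, so the two variants can differ in row order on larger data.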
You can just get rid of the duplicate columns from one of the data frames if you say you don't really care about them and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
# compare by name rather than by position, so column order and count do not matter
duplicates = names(df2) %in% names(df1) & names(df2) != "id"
df2 = df2[,!duplicates]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
I have a list of multiple data frames which are built the same way. I would like to change the name of the first column of each data frame to the name of the data frame itself, with some text appended. From several different answers I figured that lapply and working on lists would be the best way to go.
Example data:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(A = 1, B = 2, C = 3)
dfList <- list(df1,df2)
col1 <- names(dfList)
df<-lapply(dfList, function(x) {
names(x)[1:2] <- c(col1[1:length(col1)]"appended text","Col2","Col3");x
})
The problem seems to be selecting the correct entry of the "col1" variable for each data frame within my code.
Any ideas on how to express this correctly? Thanks a lot!
df1<-data.frame(A = 1, B = 2, C = 3)
df2<-data.frame(A = 1, B = 2, C = 3)
dfList <- list(df1=df1,df2=df2)
names(dfList)
col1 <- names(dfList)
for(i in 1:length(dfList))
names(dfList[[names(dfList[i])]])[1]<-names(dfList)[i]
dfList
Here is one option with tidyverse
library(tidyverse)
map(dfList, ~ .x %>%
rename(Aappended_text = A))
If this is based on the column index, create a function
fName <- function(lst, new_name, index){
map(lst, ~
.x %>%
rename_at(index, function(nm) paste0(nm, new_name)))
}
fName(dfList, "appended_text", 1)
I'm not sure if I'm understanding your question completely, but is this what you're after:
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(A = 1, B = 2, C = 3)
dfList <- list(df1,df2)
df <- lapply(dfList, function(x) {
colnames(x) <- c(paste0(colnames(x)[1], "appended text"), colnames(x)[2:length(colnames(x))])
return(x)
})
Output:
> df
[[1]]
Aappended text B C
1 1 2 3
[[2]]
Aappended text B C
1 1 2 3
You can simply use lapply
lapply(dfList, function(x) {
names(x)[1L] <- "some text"
x
})
But if you want to rename using the names of the data frames in your list, you first need to name the elements, e.g. dfList <- list(df1 = df1, df2 = df2). The function passed to lapply(dfList, ...) only receives each element, not its name, so you need to lapply over the indices of the list instead, for example:
lapply(seq_along(dfList), function(i) {
names(dfList[[i]])[1L] <- names(dfList[i])
dfList[[i]]
})
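One caveat of looping over indices is that the result loses the list names. A possible alternative, sketched here, is Map, which passes each data frame together with its name and keeps the names of the list. The suffix "_appended" and the variable names are assumptions made for the example.

```r
df1 <- data.frame(A = 1, B = 2, C = 3)
df2 <- data.frame(A = 1, B = 2, C = 3)
dfList <- list(df1 = df1, df2 = df2)

suffix <- "_appended"  # hypothetical text to append

# Map walks over the data frames and their names in parallel and
# keeps the names of the first list in the result
renamed <- Map(function(dat, nm) {
  names(dat)[1L] <- paste0(nm, suffix)
  dat
}, dfList, names(dfList))

names(renamed)      # "df1" "df2"
names(renamed$df1)  # "df1_appended" "B" "C"
names(renamed$df2)  # "df2_appended" "B" "C"
```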
I have over 1000 objects (z) in R, each containing three dataframes (df1, df2, df3) with different structures.
z1$df1 … z1000$df1
z1$df2 … z1000$df2
z1$df3 … z1000$df3
I created a list of these objects (list1 thus contains z1 thru z1000) and tried to use lapply to extract one type of dataframe (df2) for all objects, and then merge them to one single dataframe.
Extraction:
For a single object it would look like this:
df15<- z15$df2 # I transferred the index of z to the extracted df
I tried some code with lapply, ignoring the transfer of the index (I can create another list for that). However I don’t know what function I should use.
List2 <- lapply(list1, function(x))
I am trying to avoid a loop because there are so many objects and vectorization is so much quicker. I suspect I'm looking at this from the wrong angle.
Subsequent merging can be done as follows:
merged <- do.call(rbind, list2)
Thanks for any suggestions.
It sounds like you want to pull out all the df1s and rbind them together, then do the same for the other dataframes. You can use purrr::map_dfr to extract a named element from each element of the list and row-bind the results together.
library('tidyverse')
dummy_df <- list(
df1 = iris,
df2 = cars,
df3 = CO2)
list1 <- list(
z1 = dummy_df,
z2 = dummy_df,
z3 = dummy_df)
df1 <- map_dfr(list1, 'df1')
df2 <- map_dfr(list1, 'df2')
df3 <- map_dfr(list1, 'df3')
If you wanted to do it in base R, you can use lapply.
df1 <- lapply(list1, function(x) x$df1)
df1_merged <- do.call(rbind, df1)
One option could be using lapply to extract data.frame and then use bind_rows from dplyr.
library(dplyr)
## The data
df1 <- data.frame(id = c(1:10), name = c(LETTERS[1:10]), stringsAsFactors = FALSE)
df2 <- data.frame(id = 11:20, name = LETTERS[11:20], stringsAsFactors = FALSE)
df3 <- data.frame(id = 21:30, name = LETTERS[15:24], stringsAsFactors = FALSE)
df4 <- data.frame(id = 121:130, name = LETTERS[15:24], stringsAsFactors = FALSE)
z1 <- list(df1 = df1, df2 = df2, df3 = df3)
z2 <- list(df1 = df1, df2 = df2, df3 = df3)
z3 <- list(df1 = df1, df2 = df2, df3 = df3)
z4 <- list(df1 = df1, df2 = df2, df3 = df4) #DFs can contain different data
# z <- list(z1, z2, z3, z4)
# Dynamically populate list z with many list object
z <- as.list(mget(paste("z",1:4,sep="")))
df1_all <- bind_rows(lapply(z, function(x) x$df1))
df2_all <- bind_rows(lapply(z, function(x) x$df2))
df3_all <- bind_rows(lapply(z, function(x) x$df3))
## Result for df3_all
> tail(df3_all)
## id name
## 35 125 S
## 36 126 T
## 37 127 U
## 38 128 V
## 39 129 W
## 40 130 X
Try this:
lapply(list1, "[[", "df2")
or if you want to rbind them together:
do.call("rbind", lapply(list1, "[[", "df2"))
The row names in the resulting data frame will identify the origin of each row.
No packages are used.
Note
We can use this input to test the code above. BOD is a built-in data frame:
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)
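If the origin of each row should end up in a proper column rather than in the row names (the "transfer of the index" mentioned in the question), one option is to attach the list name before binding. This is a minimal base R sketch on the test input above; the column name source and the variable pieces are assumptions made for the illustration.

```r
# Test input from the note above; BOD is a built-in data frame
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)

# Extract df2 from every object and prepend the object's name as a column
pieces <- Map(function(obj, nm) {
  data.frame(source = nm, obj[["df2"]])
}, list1, names(list1))

merged <- do.call(rbind, pieces)
rownames(merged) <- NULL  # the origin now lives in the `source` column

names(merged)  # "source" "Time" "demand"
nrow(merged)   # 12
```

dplyr::bind_rows has an .id argument and data.table::rbindlist has an idcol argument that serve the same purpose when those packages are in use.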
There's also data.table::rbindlist, which is likely faster than do.call(rbind, lapply(...)) or dplyr::bind_rows
library(data.table)
rbindlist(lapply(list1, "[[", "df2"))