Retrieving IDs of a split list in R

Say I have a list, like so:
my.list <- list()
for (i in 1:100)
{
my.list[[i]] <- list(location = sample(paste0("Location", 1:5), 1, replace=T),
val1 = runif(100),
val2 = runif(30))
}
Now I split it by location
loc <- sapply(my.list, function(x){x$location})
my.list.split <- split(my.list, loc)
Is there a way to associate each element of my.list.split to the original my.list, i.e., finding its ID in my.list?

Here's one way to find the IDs:
IDs <- seq_along(my.list) # generate a vector of IDs
IDs.split <- split(IDs, loc) # split the IDs along loc
This returns a list which includes vectors of IDs for each location.
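As a minimal sketch of how those ID vectors can be used (the small list below is made up for illustration, and `first.group` and `same` are hypothetical names), indexing the original list with the IDs of one location should reproduce the corresponding element of the split list:

```r
# Small reproducible stand-in for my.list (assumption: 10 items, 3 locations)
set.seed(1)
my.list <- lapply(1:10, function(i) {
  list(location = sample(paste0("Location", 1:3), 1), val = runif(2))
})
loc <- sapply(my.list, function(x) x$location)

my.list.split <- split(my.list, loc)          # the split list
IDs.split <- split(seq_along(my.list), loc)   # the matching original positions

# Indexing the original list with one group's IDs gives back that group
first.group <- names(my.list.split)[1]
same <- identical(my.list[IDs.split[[first.group]]],
                  my.list.split[[first.group]])
```

Because both calls to split() use the same `loc`, the groups and the order within each group line up.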

If you give my.list some names, then your my.list.split will also have names which you can use to refer back, if necessary.
# Syntactically different, but functionally equivalent way of creating the list.
my.list<- lapply(1:100,function(x) list(location = sample(paste0("Location", 1:5), 1, replace=T),
val1 = runif(100),
val2 = runif(30)))
names(my.list)<-paste0('id_',seq_along(my.list)) # Added
loc <- sapply(my.list, function(x){x$location})
my.list.split <- split(my.list, loc)
So now everything has a unique ID:
my.list.split[[1]]
# $id_11
# $id_11$location
# [1] "Location1"
#
# $id_11$val1
# [1] 0.997154684 0.348063634 0.373797808 0.569167679 0.417461443 0.799423830 0.147882721
# [8] 0.489438012 0.292867337 0.072622654 0.583932815 0.060452664 0.083562011 0.613114462
# ....
# $id_11$val2
# [1] 0.68983774 0.41056046 0.18620312 0.61078253 0.85947881 0.50736945 0.01362270 0.70022800

Another way, if for some reason you don't want to set IDs first:
match(unlist(my.list.split, FALSE), my.list)
You can then set the names with names() or whatever if that's what you're trying to do.
split() divides your list into a nested list according to loc. unlist() with recursive = FALSE flattens only the top level of my.list.split, so the result has the same shape as my.list (a flat list of items, now ordered by location). Then all you have to do is match() them to see which items refer to which indices in the original object.
Proof that the match is correct (should return TRUE unless I've made a horrible mistake):
ul <- unlist(my.list.split, FALSE)
m <- match(ul, my.list)
identical(my.list[m], unname(ul))
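If it helps to see which location each matched index belongs to, one possible sketch (the small list and the `lookup` data frame are assumptions for illustration) pairs the result of match() with the group names repeated once per group member:

```r
# Small reproducible stand-in for my.list (assumption: 8 items, 2 locations)
set.seed(2)
my.list <- lapply(1:8, function(i) {
  list(location = sample(c("Location1", "Location2"), 1), val = runif(3))
})
loc <- sapply(my.list, function(x) x$location)
my.list.split <- split(my.list, loc)

ul <- unlist(my.list.split, recursive = FALSE)  # flatten one level only
m <- match(ul, my.list)                         # position of each item in my.list

# One row per item, in the order of the flattened split list
lookup <- data.frame(
  id = m,
  group = rep(names(my.list.split), lengths(my.list.split)),
  stringsAsFactors = FALSE
)
```

Note that match() returns the first hit, so this relies on the list items being distinct.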

Related

Search unnamed lists of lists for string

I have a list of unnamed lists of named dataframes of varying lengths. I'm looking for a way to grep or search through the indices of the list elements to find specific named dfs.
Here is the current method:
library(tibble) # for tibbles
## list of lists of dataframes
abc_list <- list(list(dfAAA = tibble(names = state.abb[1:10]),
dfBBB = tibble(junk = state.area[5:15]),
dfAAA2 = tibble(names = state.abb[8:20])),
list(dfAAA2 = tibble(names = state.abb[10:15]),
dfCCC = tibble(junk2 = state.area[4:8]),
dfGGG = tibble(junk3 = state.area[12:14])))
# Open list, manually ID list index which has "AAA" dfs
# extract from list of lists into separate list
desired_dfs_list <- abc_list[[1]][grepl("AAA", names(abc_list[[1]]))]
# unlist that list into a combined df
desired_rbinded_list <- as.data.frame(data.table::rbindlist(desired_dfs_list, use.names = F))
I know there's a better way than this.
What I've attempted so far:
## attempt:
## find pattern in df names
aaa_indices <- sapply(abc_list, function(x) grep(pattern = "AAA", names(x)))
## apply that to rbind ???
desired_aaa_rbinded_list <- purrr::map_df(aaa_indices, data.table::rbindlist(abc_list))
the steps from the manual example would be:
pull identified list items (dfs) into a separate list
rbind the list of dfs into one df
I'm just not sure how to do that in a way that allows me more flexibility, instead of manually opening the lists and ID-ing the indices to pull.
thanks for any help or ideas!
If your tibbles (or data frames) are always exactly one level deep (that is, a top-level list containing lists of data frames), you can use unlist() to remove that first level:
all_dfs_list <- unlist(abc_list,
recursive = FALSE # will stop unlisting after the first level
)
This will result in a list of tibbles:
> all_dfs_list
$dfAAA
# A tibble: 10 x 1
names
<chr>
1 AL
2 AK
...
Then you can filter by name and use rbindlist on the desired elements, as you already did in your question:
desired_dfs_list <- all_dfs_list[grepl("AAA", names(all_dfs_list))]
# rbindlist() lives in the data.table package, so call it with its namespace
desired_rbinded_list <- as.data.frame(
data.table::rbindlist(desired_dfs_list, use.names = FALSE)
)
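For more flexibility, the two steps could be wrapped in a small function that takes the pattern as an argument. This is only a sketch: `bind_matching()` is a hypothetical helper, it uses base rbind() instead of rbindlist(), and the example uses plain data frames rather than tibbles so that it runs without extra packages.

```r
# Hypothetical helper (not from any package): flatten, filter by name, stack
bind_matching <- function(nested, pattern) {
  flat <- unlist(nested, recursive = FALSE)   # drop the outer list level
  hits <- flat[grepl(pattern, names(flat))]   # keep data frames whose name matches
  do.call(rbind, unname(hits))                # stack them row-wise
}

abc_list <- list(
  list(dfAAA  = data.frame(names = state.abb[1:10]),
       dfBBB  = data.frame(junk = state.area[5:15]),
       dfAAA2 = data.frame(names = state.abb[8:20])),
  list(dfAAA2 = data.frame(names = state.abb[10:15]),
       dfCCC  = data.frame(junk2 = state.area[4:8]))
)

aaa <- bind_matching(abc_list, "AAA")   # all rows from the three "AAA" data frames
```

The matching data frames need compatible columns for rbind() to succeed, which is the case here because every "AAA" data frame has a single names column.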

Is there an easy way to tell if many data frames stored in one list contain the same columns?

I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparison by checking whether each column name occurs exactly length(my_list) times across all data frames. This checks the number of columns and their names at once, but not their order -
lapply(my_list, names) %>%
unlist() %>%
table() %>%
all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates as it goes through the list. First, identical(names_df1, names_df2) is evaluated. If it is TRUE, the function returns that same names vector, so it can be compared with the next member of the list; if it is FALSE, the FALSE is carried through every remaining comparison.
Finally, if everything evaluates as true, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that character vector into a boolean.
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
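We can use dplyr::bind_rows, which fills columns missing from a data frame with NA (this assumes the data themselves contain no NA values, and it ignores column order):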
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)
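Since equality is transitive, comparing every data frame with the first one is enough, which avoids the nested loop. A minimal sketch, where `same_columns()` is a hypothetical helper name; identical() is sensitive to order, so this also covers the "same order" requirement from the question:

```r
# Hypothetical helper: TRUE if all data frames share the same column names,
# in the same order
same_columns <- function(dfs) {
  cols <- lapply(dfs, names)
  # compare every name vector with the first one
  all(vapply(cols[-1], identical, logical(1), cols[[1]]))
}

df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])

same_columns(list(df1, df2))        # TRUE
same_columns(list(df1, df2, df3))   # FALSE, df3 lacks column B
```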

R return unique values of specific sub-lists

I would like to process information for nested lists. For example, a list has 3 1st level lists with each of those lists having 10 sub-lists. I would like to find the unique values of all 1st level lists' [[i]] sub-list.
## Design of list
list1 = replicate(10, list(sample(10, 5, replace = FALSE)))
list2 = replicate(10, list(sample(10, 5, replace = FALSE)))
list3 = replicate(10, list(sample(10, 5, replace = FALSE)))
myList = list(list1, list2, list3)
## return unique values of each list's i-th sub-list
## example
> k = unique(myList[[1:3]][[1]])
> k
[1] 10
This returns a single value instead of all unique values. I am trying to get all unique values though.
How can I properly address specific lists within lists?
Try this code and let me know if this is what you wanted...
res <- list()
for(i in 1:10){
res[[i]] <- unique(as.vector(sapply(myList, function(x) x[[i]])))
}
To get the unique elements of each list at that level, this will work:
# set seed so that "random" number generation is reproducible
set.seed(123)
# set replace to TRUE so we can see if we're getting unique elements
# when replace is FALSE, all elements are already unique :)
list1 <- replicate(10, list(sample(10, 5, replace = TRUE)))
list2 <- replicate(10, list(sample(10, 5, replace = TRUE)))
list3 <- replicate(10, list(sample(10, 5, replace = TRUE)))
myList <- list(list1, list2, list3)
# use lapply to apply anonymous function to the top levels of the list
# unlist results and then call unique to get unique values
k <- unique(unlist(lapply(myList, function(x) x[[1]])))
Output (the unique values of each top-level list's first sub-list, shown per list):
[[1]]
[1] 3 8 5 9 10
[[2]]
[1] 1 5 8 2 6
[[3]]
[1] 6 4 5 10
The issue you were having comes from using double brackets with a vector index (myList[[1:3]]) at the first level. With a vector, [[ indexes recursively, so myList[[1:3]] is equivalent to myList[[1]][[2]][[3]], which is a single number; that is why k held only one value. To work across multiple elements of a list, use single brackets. In this case, however, that wouldn't have gotten the job done either, since myList[1:3][[1]] would first grab all three top-level lists and then take the first of them ([[1]]), so you'd end up calling unique on a list of ten sub-lists (which, most likely, are all distinct already).
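To illustrate how the two kinds of brackets treat a vector index, here is a small sketch on a made-up nested list (`x`, `recursive`, `stepwise` and `subset` are illustrative names only):

```r
x <- list(
  list("a", list(10, 20, 30)),
  list("b", list(40, 50, 60))
)

# Double brackets with a vector walk down one level per index
recursive <- x[[c(1, 2, 3)]]
stepwise  <- x[[1]][[2]][[3]]   # the same element, written out step by step

# Single brackets with a vector select several elements and return a list
subset <- x[1:2]
```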
lapply is a useful solution here, since it runs over the first level of lists you give it, and applies the function provided to each of them individually. To make the solution above more portable, we could wrap it in a function that takes an integer as an argument, so that we can dynamically select the ith element from the lower-level lists:
get.i.elem <- function(i) {
unique(unlist(lapply(myList, function(x) x[[i]])))
}
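A possible variation passes the list in as an argument instead of relying on a global myList, and then runs over every sub-list position. This is a sketch with hypothetical names (`unique_at`, `res`) and a tiny hand-made list, assuming all top-level lists have the same number of sub-lists:

```r
# Hypothetical helper: unique values of the i-th sub-list across all top-level lists
unique_at <- function(lists, i) {
  unique(unlist(lapply(lists, function(x) x[[i]])))
}

myList <- list(
  list(c(1, 2, 2), c(7, 8)),
  list(c(2, 3),    c(8, 9)),
  list(c(3, 4),    c(9, 9))
)

# One result per sub-list position
res <- lapply(seq_along(myList[[1]]), function(i) unique_at(myList, i))
# res[[1]] is 1 2 3 4 and res[[2]] is 7 8 9
```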

Loop through rows in list of dataframes and extract data. (Nested "apply" functions)

I am new to R and trying to do things the "R" way, which means no for loops. I would like to loop through a list of dataframes, loop through each row in the dataframe, and extract data based on criteria and store in a master dataframe.
Some issues I am having are with accessing the "global" dataframe. I am unsure of the best approach (global variable, pass by reference).
I have created an abstract example to try to show what needs to be done:
rm(list=ls())## CLEAR WORKSPACE
assign("last.warning", NULL, envir = baseenv())## CLEAR WARNINGS
# Generate a descriptive name with name and size
generateDescriptiveName <- function(animal.row, animalList.vector){
name <- animal.row["animal"]
size <- animal.row["size"]
# if in list of interest prepare name for master dataframe
if (any(grepl(name, animalList.vector))){
return (paste0(name, "Sz-", size))
}
}
# Animals of interest
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
jungleAnimals <- c("ants", "parrot", "cheetah")
jungleSizes <- c(0.1, 1, 50)
jungle.df <- data.frame(jungleAnimals, jungleSizes)
fieldAnimals <- c("elephant", "lion", "hyena")
fieldSizes <- c(1000, 100, 80)
field.df <- data.frame(fieldAnimals, fieldSizes)
forestAnimals <- c("squirrel", "deer", "lizard")
forestSizes <- c(1, 40, 0.2)
forest.df <- data.frame(forestAnimals, forestSizes)
ecosystems.list <- list(jungle.df, field.df, forest.df)
# Final master list
descriptiveAnimal.df <- data.frame(name = character(), descriptive.name = character())
# apply to all dataframes in list
lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
# apply to each row in dataframe
output <- apply(ecosystem.df, 1, function(row){generateDescriptiveName(row, animalList.vector)})
if(!is.null(output)){
# Add generated names to unique master list (no duplicates)
}
})
The end result would be:
name descriptive.name
1 "parrot" "parrot Sz-1"
2 "cheetah" "cheetah Sz-50"
3 "elephant" "elephant Sz-1000"
4 "deer" "deer Sz-40"
5 "lizard" "lizard Sz-0.2"
I did not use your function generateDescriptiveName() because I think it is a bit too laborious. I also do not see a reason to use apply() within lapply(). Here is my attempt to generate the desired output. It is not perfect but I hope it helps.
df_list <- lapply(ecosystems.list, function(ecosystem.df){
names(ecosystem.df) <- c("animal", "size")
temp <- ecosystem.df[ecosystem.df$animal %in% animalList.vector, ]
if(nrow(temp) > 0){
data.frame(name = temp$animal, descriptive.name = paste0(temp$animal, " Sz-", temp$size))
}
})
do.call("rbind",df_list)
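An alternative sketch, under the same assumptions about the data, is to give every data frame the same column names, stack them once, and only then filter and build the descriptive names. `all.df` and `keep` are illustrative names; unique() takes care of the "no duplicates" requirement.

```r
animalList.vector <- c("parrot", "cheetah", "elephant", "deer", "lizard")
ecosystems.list <- list(
  data.frame(jungleAnimals = c("ants", "parrot", "cheetah"), jungleSizes = c(0.1, 1, 50)),
  data.frame(fieldAnimals = c("elephant", "lion", "hyena"), fieldSizes = c(1000, 100, 80)),
  data.frame(forestAnimals = c("squirrel", "deer", "lizard"), forestSizes = c(1, 40, 0.2))
)

# Rename the columns of each data frame, then stack everything
all.df <- do.call(rbind, lapply(ecosystems.list, setNames, c("animal", "size")))

# Keep only the animals of interest and build the master data frame
keep <- all.df[all.df$animal %in% animalList.vector, ]
descriptiveAnimal.df <- unique(data.frame(
  name = as.character(keep$animal),
  descriptive.name = paste0(keep$animal, " Sz-", keep$size),
  stringsAsFactors = FALSE
))
```

Filtering with %in% on the whole column replaces the row-by-row apply(), so no global data frame has to be modified from inside a function.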

cbind equally named vectors in multiple data.frames in a list to a single data.frame

I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that this doesn't make much sense within R, but for the sake of having a look at the output, the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b)) # b holds the NA-padded columns
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
It is not really correct to do this with the given example, because the number of rows is not the same in each of the list's data frames. But if you don't care, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
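for(i in seq_len(ncol(newDf))){
colLength = nrow(l[[i]])
# pad only columns shorter than the longest one; otherwise
# (colLength+1):nrow(newDf) counts backwards and overwrites the last real row
if(colLength < nrow(newDf)){
newDf[(colLength+1):nrow(newDf),i] = NA
}
}
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows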
Note that I edited my previous code to remove NULL entries from l to simplify things
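Recycling can also be avoided from the start by padding each column with NA before combining. The following is a sketch on a shortened version of the example list; `dfs`, `cols`, `padded` and `result` are illustrative names, and the padding uses the replacement function `length<-`, which extends a vector with NA.

```r
set.seed(1602)
l <- list(
  data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
  data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
  NULL,
  data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9))
)

dfs <- Filter(Negate(is.null), l)                            # drop NULL entries
cols <- lapply(dfs, function(d) as.character(d$subst_name))  # one vector per data frame
names(cols) <- vapply(dfs, function(d) as.character(d$crop[1]), character(1))

n <- max(lengths(cols))
padded <- lapply(cols, `length<-`, n)   # extend every vector to length n with NA
result <- as.data.frame(padded, stringsAsFactors = FALSE)
```

Each column keeps its original order, shorter columns end in NA, and the columns are named after the crop type.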
