I have a list of data frames which I need to obtain the last row of the 2nd column from. All the data frames have differing number of rows. I've already written code using lapply which can extract any row by variable "num" (returning NA for numbers which exceed the row length of the data frames) , however I want to include a variable num="worst" which will return the last row, 2nd column of available data. This is the code to retrive the "nth" row (xyz is the list of data frames):
if(num=="best"){num=as.integer(1)} else
(num=as.integer())
rownumber<-lapply(xyz, "[", num, 2, drop=FALSE)
Been cracking my head all day trying to find a solution to declare num=="worst". I want to avoid loops hence my use of lapply, but perhaps there is no other way?
How about...
lapply(xyz, tail, 1)
My understanding of the question is that you want a function that returns the second column of a data.frame from a list of dataframes, with an optional argument worst that allows you to restrict it to the last observation.
I think the siimplest way to do this is to write a helper function, and then apply it to your list using lapply.
I have written a selector function that takes a row and column argument, as well as a worst argument. I think this does everything you need.
df1 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10))
df2 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10))
ldf <- list(df1, df2)
selector <- function(DF, col, row=NULL, worst=FALSE){
if(!is.null(row)) return(DF[row, col])
if(!missing("col")) if(col > ncol(DF)) return(NA)
if(!is.null(row)) if(row > nrow(DF)) return(NA)
if(worst) {
tail(DF[,col, drop=F],1)
} else {
DF[row, col, drop=FALSE]
}
}
lapply(ldf, selector, worst=T)
Related
I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seem'd to actually do the job (it removed the one data frame from the list, which didn't have the column Z), BUT however I still got a message "Error in df[[i]] : subscript out of bounds". Why is that, and how could I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
As you can see it removed the first dataframe which had column Z.
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))
I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thankyou!
You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE the first iteration in the loop converts the string variables into factors (with the factor levels being the values).
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparison by simply checking if the count of each column name is == length(my_list). This will simultaneously check for dim and names of you dataframe -
lapply(my_list, names) %>%
unlist() %>%
table() %>%
all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or sightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates as it goes through the list. At first, identical(names_df1, names_df2) are evaluated. If it's true, we want to have it return the same vector evaluated! Then we can keep using it to compare to other members of the list.
Finally, if everything evaluates as true, we get a character vector returned. Since you probably want a logical output, !is.logical(...) is used to turn that character vector into a boolean.
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
We can use dplyr::bind_rows:
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
for(j in (i + 1):length(my_list)) {
output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
k <- k + 1
}
}
all(output)
I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different number of columns) with the goal of extracting all the columns that have the same name (just phone_number, subregion, and phonetype) and putting them together into a single data frame.
I can get the columns I want out of one list element with this;
var<-data[[1]] %>% select("phone_number","Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time.
I then tried a for loop that looks like this:
new.function <- function(a) {
for(i in 1:a) {
tst<-datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
But when I try:
new.function(5)
I'll only get the columns from the 5th element.
I know this might seem like a noob question for most, but I am struggling to learn lists and loops and R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data.frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
select(x,"phone_number","Subregion", "PhoneType")
#or x[,c("phone_number","Subregion","PhoneType")]
}
final_df = lapply(data,extractColumns) %>% bind_rows()
The way you have your loop set up currently is only saving the last iteration of the loop because tst is not set up to store more than a single value and is overwritten with each step of the loop.
You can establish tst as a list first with:
tst <- list()
Then in your code be explicit that each step is saved as a seperate element in the list by adding brackets and an index to tst. Here is a full example the way you were doing it.
#Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
"phone_number" = rep("1-800", 5),
"Subregion" = rep("earth", 5),
"PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
"phone_number" = rep("8675309", 5),
"Subregion" = rep("mars", 5),
"PhoneType" = rep("razr", 5))
# Datas is a list of data.frames, we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
#create list to store new data.frames in once columns are selected
tst <- list()
#Function for looping through 'a' elements
new.function <- function(a) {
for(i in 1:a) {
tst[[i]] <- datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
#Proof of concept for 2 elements
new.function(2)
I have searched extensively but not found an answer to this question on Stack Overflow.
Lets say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame as so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)
The row dimension is not fixed, but data.frames are stored as list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) and a has zero, thereby breaking the constraint. R does not by default auto-vivify values when you attempt to extend the dimension of a data.frame by adding a column that is longer than the data.frame. The reason that the second example works is that col1 is the only vector in the data.frame so the data.frame is initialized with three rows.
If you want to automatically have the data.frame expand, you can use the following function:
cbind.all <- function (...)
{
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n -
nrow(x), ncol(x)))))
}
This will fill missing values with NA. And you would use it like: cbind.all( df, a )
You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
d = read.table(files[f], header=T, as.is=T)
d2 = round(data.frame(d$NumReads))
colnames(d2) = f
if(ncol(readCounts) == 0){
readCounts = d2
rownames(readCounts) = d$Name
} else{
readCounts = cbind(readCounts, d2)
}
}
if you have an empty dataframe, called for example df, in my opinion another quite simple solution is the following:
df[1,]=NA # ad a temporary new row of NA values
df[,'new_column'] = NA # adding new column, called for example 'new_column'
df = df[0,] # delete row with NAs
I hope this may help.