Iteratively adding a row containing characters and numbers to a dataframe - r

I have a list of named elements. I am iterating over the list names, performing a computation on each corresponding element, "encapsulating" the results together with the name in a vector, and finally adding that vector as a row to a table. The row produced by each iteration therefore contains a mix of characters and numbers.
The first row is added correctly, but from the second row onwards there is a problem.
In this example, the first column is supposed to contain the alphanumeric names, yet all rows after the first contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
  tmp <- x[[name]]
  m <- mean(tmp)
  s <- sum(tmp)
  df <- rbind(df, c(name, m, s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways, but for the moment this is more intuitive for me because it ensures that each computation is associated with a particular name. There can be several columns and rows, and the names are extremely helpful for joining tables, querying, comparing, etc. They make it easier to trace results back to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!

You have to set stringsAsFactors = FALSE in rbind. With the default stringsAsFactors = TRUE, the first iteration of the loop converts the character values into factors (with the factor levels being those first values); in later iterations the new values do not match any existing level, so they are turned into NA.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
  tmp <- x[[name]]
  m <- mean(tmp)
  s <- sum(tmp)
  df <- rbind(df, c(name, m, s), stringsAsFactors = FALSE)
}
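Note that even with stringsAsFactors = FALSE all three columns are still character, because c(name, m, s) coerces the numbers to character. A minimal follow-up sketch (the column names here are my own choice, not from the question):
names(df) <- c("name", "m", "s")
df$m <- as.numeric(df$m)   # convert the mean column back to numeric
df$s <- as.numeric(df$s)   # convert the sum column back to numeric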
An easier solution would be to use sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
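As for the follow-up about always retaining element names: one possible sketch (my own, not part of the original answer) builds one small data frame per element with Map(), so each name travels with its own row and also survives as a row name after binding:
rows <- Map(function(nm, v) data.frame(name = nm, m = mean(v), s = sum(v)),
            names(x), x)
do.call(rbind, rows)   # row names a_1, b_2, c_3 are kept as well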

Related

Obtaining a vector with sapply and using it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8),]           # select the rows from each df above a threshold
  a <- rows_above_th$id                       # obtain the ids of the rows above the threshold
  ids_to_remove <- append(ids_to_remove, a)   # append each id to the vector
},
simplify = T)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because in the end I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
  mylist[[1]] <- mylist[[1]] %>%
    filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
       function(df) {
         for(i in 1:length(ids_to_remove)){
           df <- df %>%
             filter(!id == ids_to_remove[i])
           i = i + 1
         }
       })
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, as a list:
ids_to_remove <- lapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8),]   # select the rows from each df above a threshold
  rows_above_th$id                    # obtain the ids of the rows above the threshold
})
Then you can use that list together with your data list and mapply to iterate over the two lists in parallel:
mapply(function(data, ids) {
  data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY = FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
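A small usage note (my own observation, not from the original answers): the mapply/Map versions above only remove each data frame's own over-threshold rows, so the intermediate list of IDs can also be skipped entirely; and if you really do want the single flat vector from the question, unlist() the per-data-frame result first:
# filter each data frame directly (threshold 8, as in the question)
lapply(mylist, function(df) df[df$id <= 8, , drop = FALSE])
# or flatten the per-data-frame IDs into one vector and remove them everywhere
ids_to_remove <- unlist(ids_to_remove, use.names = FALSE)
lapply(mylist, function(df) df[!df$id %in% ids_to_remove, , drop = FALSE])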

Is there an easy way to tell if many data frames stored in one list contain the same columns?

I have a list containing many data frames:
df1 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df2 <- data.frame(A = 1:5, B = 2:6, C = LETTERS[1:5])
df3 <- data.frame(A = 1:5, C = LETTERS[1:5])
my_list <- list(df1, df2, df3)
I want to know if every data frame in this list contains the same columns (i.e., the same number of columns, all having the same names and in the same order).
I know that you can easily find column names of data frames in a list using lapply:
lapply(my_list, colnames)
Is there a way to determine if any differences in column names occur? I realize this is a complicated question involving pairwise comparisons.
You can avoid pairwise comparisons by simply checking whether the count of each column name is == length(my_list). This simultaneously checks the dimensions and the names of your data frames -
lapply(my_list, names) %>%
  unlist() %>%
  table() %>%
  all(. == length(my_list))
[1] FALSE
In base R i.e. without %>% -
all(table(unlist(lapply(my_list, names))) == length(my_list))
[1] FALSE
or slightly more optimized -
!any(table(unlist(lapply(my_list, names))) != length(my_list))
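To see why both versions return FALSE here, it helps to look at the intermediate table of name counts: df3 is missing column B, so B is only counted twice:
table(unlist(lapply(my_list, names)))
# A B C
# 3 2 3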
Here's another base solution with Reduce:
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, names)
)
)
You could also account for same columns in a different order with
!is.logical(
Reduce(function(x,y) if(identical(x,y)) x else FALSE
, lapply(my_list, function(z) sort(names(z)))
)
)
As for what's going on, Reduce() accumulates a result as it works through the list. First, identical(names_df1, names_df2) is evaluated. If it is TRUE, the function returns that same vector, so it can keep being compared with the remaining members of the list.
Finally, if every comparison is TRUE, a character vector is returned. Since you probably want a logical output, !is.logical(...) turns that character vector into a boolean.
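A quick way to see the accumulation with the example list (where df3's names differ): the Reduce() call itself returns FALSE, which is logical, whereas on identically named data frames it returns the shared character vector:
Reduce(function(x, y) if(identical(x, y)) x else FALSE, lapply(my_list, names))
# [1] FALSE
Reduce(function(x, y) if(identical(x, y)) x else FALSE, lapply(list(df1, df2), names))
# [1] "A" "B" "C"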
See also here as I was very inspired by another post:
check whether all elements of a list are in equal in R
And a similar one that I saw after my edit:
Test for equality between all members of list
We can use dplyr::bind_rows:
!any(is.na(dplyr::bind_rows(my_list)))
# [1] FALSE
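The reason this works (my note, not part of the original answer): bind_rows() fills columns that are missing from a data frame with NA, so any NA in the combined result signals a column mismatch. The flip side is that the check also reports FALSE if the data itself already contains NAs:
combined <- dplyr::bind_rows(my_list)
anyNA(combined$B)   # TRUE: the rows coming from df3 have NA in column B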
Here is my answer:
k <- 1
output <- NULL
for(i in 1:(length(my_list) - 1)) {
  for(j in (i + 1):length(my_list)) {
    output[k] <- identical(colnames(my_list[[i]]), colnames(my_list[[j]]))
    k <- k + 1
  }
}
all(output)
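If you do prefer the explicit pairwise comparison, a more compact sketch of the same idea uses combn() to generate the index pairs:
all(combn(seq_along(my_list), 2,
          function(i) identical(colnames(my_list[[i[1]]]), colnames(my_list[[i[2]]]))))
# [1] FALSE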

Most efficient way of avoiding loops to create data.frame

I have a data.frame which includes the runs scored in each innings of baseball games as a character vector.
I want to create a new data.frame which lists the number of runs in each innings for each game. I can do this with a loop but appreciate that this is too slow for any reasonable number of observations and that the rbind method shown is also not ideal.
The number of innings may vary, and an x indicates that the team did not need to bat in the 9th inning because the game was already won.
library(stringr)
data <- data.frame(gameID=c("a","b","c"),innings=c("002100000","30000000x","10101010101"))
for(i in 1:nrow(data)) {
  box <- as.integer(str_split(data$innings[i], "")[[1]])
  tempdf <- data.frame(box, id = data$gameID[i])
  if(i != 1) {
    df <- rbind(df, tempdf)
  } else {
    df <- tempdf
  }
}
This helps a bit (30%):
res <- vector("list", nrow(data))
for(i in seq_along(res))
  res[[i]] <- data.frame(box = as.integer(str_split(data$innings[i], "")[[1]]),
                         id = data$gameID[i])
do.call(rbind, res)
Not sure if this is faster,
library(splitstackshape)
data$innings <- gsub('', ' ', data$innings)
cSplit(data, 'innings', ' ', 'long')
Here's a way using lists with lapply:
library(dplyr) # for bind_rows -- you can also use do.call(rbind, list)
innings <- str_split(data$innings, "")
names(innings) <- data$gameID
innings <- lapply(innings, function(x) data.frame(box = x))
bind_rows(innings, .id = "id")
This should be pretty fast:
## Defined these separately just for readability
innings <- as.character(data$innings) # or use 'stringsAsFactors=FALSE' when defining the data frame
box <- unlist(strsplit(innings, ""))
id <- rep(data$gameID, nchar(innings))
## To get a character matrix back
cbind(box, id)
## To get a data frame back
data.frame(box=box, id=id, stringsAsFactors=FALSE)
Using a matrix is faster, but if you want to have mixed classes use a data frame. Also, for a data frame, it's faster to use characters than factors (thus the stringsAsFactors=FALSE argument). If you want box to be numeric, you can wrap it in as.integer (but then the matrix option won't work, of course).
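If box really should be numeric, a hedged variant of the data-frame version above would be (note that the 'x' innings become NA, with a coercion warning):
data.frame(box = as.integer(unlist(strsplit(innings, ""))),
           id = rep(data$gameID, nchar(innings)),
           stringsAsFactors = FALSE)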

Nested named list to data frame

I have the following named list output from an analysis. The reproducible code is as follows:
list(structure(c(-213.555409754509, -212.033637890131, -212.029474755074,
-211.320398316741, -211.158815833294, -210.470525157849), .Names = c("wasn",
"chappal", "mummyji", "kmph", "flung", "movie")), structure(c(-220.119433774144,
-219.186901747536, -218.743319709963, -218.088361753899, -217.338920075687,
-217.186050877079), .Names = c("crazy", "wired", "skanndtyagi",
"andr", "unveiled", "contraption")))
I want to convert this to a data frame. I have tried unlist-to-data-frame approaches using reshape2, dplyr, and other solutions for converting a list to a data frame, but without much success. The output that I am looking for is something like this:
Col1 Val1 Col2 Val2
1 wasn -213.55 crazy -220.11
2 chappal -212.03 wired -219.18
3 mummyji -212.02 skanndtyagi -218.74
so on and so forth. The actual output has multiple columns with paired values and runs into many rows. I have tried the following codes already:
do.call(rbind, lapply(df, data.frame, stringsAsFactors = TRUE))
works partially - it provides all the character values in one column and the numeric values in a second.
data.frame(Reduce(rbind, df))
didn't work - it provides the names from the first list and the numbers from both lists as two different rows
colNames <- unique(unlist(lapply(df, names)))
M <- matrix(0, nrow = length(df), ncol = length(colNames),
dimnames = list(names(df), colNames))
matches <- lapply(df, function(x) match(names(x), colNames))
M[cbind(rep(sequence(nrow(M)), sapply(matches, length)),
unlist(matches))] <- unlist(df)
M
didn't work correctly.
Can someone help?
Since the list elements are all of the same length, you should be able to stack them and then combine them by columns.
Try:
do.call(cbind, lapply(myList, stack))
Here's another way:
as.data.frame( c(col = lapply(x, names), val = lapply(x,unname)) )
How it works. lapply returns a list; two lists combined with c make another list; and a list is easily coerced to a data.frame, since the latter is just a list of vectors having the same length.
Better than coercing to a data.frame is just modifying its class, effectively telling the list "you're a data.frame now":
L = c(col = lapply(x, names), val = lapply(x,unname))
library(data.table)
setDF(L)
The result doesn't need to be assigned anywhere with = or <- because L is modified "in place."
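A small usage note, assuming the two-element example list from the question: because the input list itself has no names, c() numbers the entries, so the columns come out as col1, col2, val1, val2; reordering them reproduces the Col1 Val1 Col2 Val2 layout asked for:
L = c(col = lapply(x, names), val = lapply(x, unname))
setDF(L)
L[, c("col1", "val1", "col2", "val2")]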

Add Columns to an empty data frame in R

I have searched extensively but not found an answer to this question on Stack Overflow.
Let's say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame as so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)
The row dimension is not fixed, but a data.frame is stored as a list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) while a has zero, which would break that constraint; R does not by default auto-vivify rows when you extend a data.frame with a column that is longer than the data.frame. The second example works because col1 is the only vector in the data.frame, so the data.frame is initialized with three rows.
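A minimal sketch of the difference, restating the question's two snippets:
a <- data.frame()         # zero rows, zero columns
nrow(a)                   # 0
## a$col1 <- c(1, 2, 3)   # would fail: replacement has 3 rows, data has 0
b <- list()               # a plain list has no equal-length constraint
b$col1 <- c(1, 2, 3)
as.data.frame(b)          # a 3-row, 1-column data frame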
If you want to automatically have the data.frame expand, you can use the following function:
cbind.all <- function(...)
{
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}
This will fill missing values with NA. And you would use it like: cbind.all( df, a )
You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
  d = read.table(files[f], header = T, as.is = T)
  d2 = round(data.frame(d$NumReads))
  colnames(d2) = f
  if(ncol(readCounts) == 0){
    readCounts = d2
    rownames(readCounts) = d$Name
  } else {
    readCounts = cbind(readCounts, d2)
  }
}
If you have an empty dataframe, called for example df, in my opinion another quite simple solution is the following:
df[1,] = NA              # add a temporary new row of NA values
df[,'new_column'] = NA   # add the new column, called for example 'new_column'
df = df[0,]              # delete the row with NAs
I hope this may help.
