I was modifying a data frame in R with lapply() and observed that my data frame was converted to a list object when I didn't use brackets to assign it.
For example, the following returns a list
junk <- data.frame(col1 = 1:3,
col2 = c("a", "b", "c"),
col3 = c(T,T,F))
junk <- lapply(junk, function(x) {
if (is.numeric(x)) return(x*2)
else return(x)})
str(junk)
whereas the following returns a data frame.
junk <- data.frame(col1 = 1:3,
col2 = c("a", "b", "c"),
col3 = c(T,T,F))
junk[] <- lapply(junk, function(x) {
if (is.numeric(x)) return(x*2)
else return(x)})
str(junk)
I'd like to know why [] preserves the data frame structure and what [] is doing in this case. I understand why the first chunk converts junk to a list, but not why the second chunk preserves the data frame. Thanks.
It is natural for lapply() to return a list, because there is no guarantee that FUN returns a result of the same length for every element.
dat <- data.frame(a = c(1,1,2), b = c(1,1,1))
lapply(dat, unique)
The second does not preserve the structure by modifying the original data frame in place; rather, `junk[] <- value` replaces the contents of junk while keeping its attributes. It is effectively doing this:
tmp <- lapply(...); junk[] <- tmp; rm(tmp)
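To see what [] is doing here: junk[] <- value calls the data-frame method of `[<-`, which swaps in the new contents but keeps the class, names, and row names of junk. A minimal self-contained sketch:

```r
# `x[] <- value` calls `[<-.data.frame`, which replaces the contents
# while keeping the class and other attributes of x
junk <- data.frame(col1 = 1:3, col2 = c("a", "b", "c"), col3 = c(TRUE, TRUE, FALSE))
out <- lapply(junk, function(x) if (is.numeric(x)) x * 2 else x)
class(out)     # "list": plain assignment would discard the data.frame class
junk[] <- out  # same columns, but junk keeps its attributes
class(junk)    # "data.frame"
junk$col1      # 2 4 6
```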
So, I have a list of strings named control_for, a data frame sampleTable with some columns named after strings in control_for, and a third object dge_obj (a DGEList object) to which I want to append those columns. What I wanted to do: use lapply to loop through control_for and, for each string, find the column in sampleTable with the same name, then add that column (as a factor) to the DGEList object. Done manually for a single string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R doesn't recognize this way of addressing columns. Can someone help?
Here are two base R ways of doing it. The data set is the example from help("DGEList") together with a mock-up data.frame sampleTable.
Define a vector common_vars of the table's names in control_for. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
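As a sketch of that one-liner (using a plain data frame as a stand-in for dge_obj$samples, so the example runs without edgeR installed):

```r
# Stand-in for dge_obj$samples so this sketch does not need edgeR
samples <- data.frame(group = factor(rep(1:2, each = 2)))
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
common_vars <- intersect(c("a", "b"), names(sampleTable))

# The *apply version collapsed into a single line
samples <- cbind(samples, lapply(sampleTable[common_vars], factor))
str(samples)  # group, a, b -- the two new columns are factors
```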
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
When trying to convert a data frame to a list resembling a nested dictionary, I tried the following command:
df = data.frame(col1 = c('a', 'b'), col2 = c(1, 2))
df[,1] = as.character(df[,1])
ls1 = apply(df, 1, as.list)
print(ls1)
However, the values of col2 in ls1 now seem to be converted to character:
class(ls1[[2]]$col2)
# [1] "character"
The workaround below works, but I am curious why its result is not the same as with the previous code:
ls2 = as.list(df[1,])
for(i in 2:nrow(df)){
ls2 = list(ls2, as.list(df[i,]))
}
print(ls2)
class(ls2[[2]]$col2)
# [1] "numeric"
Instead of apply, which converts the data frame to a matrix (and a matrix can hold only a single class), use split:
lst1 <- unname(split(df, seq_len(nrow(df))))
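Each element of the result is still a one-row data frame, so the column classes survive; a quick check:

```r
df <- data.frame(col1 = c("a", "b"), col2 = c(1, 2))
# split keeps each row as a one-row data frame instead of coercing to matrix
lst1 <- unname(split(df, seq_len(nrow(df))))
class(lst1[[2]])       # "data.frame"
class(lst1[[2]]$col2)  # "numeric" -- not coerced to character
```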
If we need a JSON output, the dataset can be directly converted to JSON with toJSON
jsonlite::toJSON(df)
#[{"col1":"a","col2":1},{"col1":"b","col2":2}]
Based on the conversation with the OP, the dataset is passed as a named list that needs to be converted to JSON format:
toJSON(list(listName = df))
#{"listName":[{"col1":"a","col2":1},{"col1":"b","col2":2}]}
My problem is that I can't merge a large list of data frames before doing some data cleaning, but my data cleaning seems to be missing from the list.
I have 43 xlsx-files, which I've put in a list.
Here's my code for that part:
file.list <- list.files(recursive=T,pattern='*.xlsx')
dat = lapply(file.list, function(i){
x = read.xlsx(i, sheet=1, startRow=2, colNames = T,
skipEmptyCols = T, skipEmptyRows = T)
# Create column with file name
x$file = i
# Return data
x
})
I then did some data cleaning. Some of the data frames had empty columns that weren't skipped during loading, and some columns I just didn't need.
Example of how I removed one column (X1) from all dataframes in the list:
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
I also applied column names:
colnames <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "AFVBOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS", "SPECIFIK1",
"SPECIFIK2", "SPECIFIK3", "PERIODE","FILE")
dat <- lapply(dat, setNames, colnames)
My problem is, when I open the list or look at the elements in the list, my data cleaning is missing.
And I can't bind the data frames before the data cleaning since they don't look the same.
What am I doing wrong here?
EDIT: Sample data
# Sample data
a <- c("a","b","c")
b <- c(1,2,3)
X1 <- c("", "","")
c <- c("a","b","c")
X2 <- c(1,2,3)
df1 <- data.frame(a,b,c,X1)
df2 <- data.frame(a,b,c,X1,X2)
# Putting in list
dat <- list(df1,df2)
# Removing unwanted columns
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
dat <- lapply(dat, function(x) { x["X2"] <- NULL; x })
# Setting column names
colnames <- c("Alpha", "Beta", "Gamma")
dat <- lapply(dat, setNames, colnames)
# Merging dataframes
df <- do.call(rbind,dat)
So I've just found that with my sample data this goes smoothly.
I had to reopen the list in View mode to see the changes I made. That doesn't change the fact that when writing to csv and reopening, all the data cleaning is missing (I haven't tried this with my sample data).
I am wondering if it's because I've changed the merge?
# My merge when I wrote this question:
df <- do.call("rbindlist", dat)
# My merge now:
df <- do.call(rbind,dat)
When I use my real data it doesn't go as smoothly, so I guess the sample data is bad. I don't know what I'm doing wrong, so I can't give better sample data.
The message I get when merging with rbind:
Error in rbind(deparse.level, ...) : numbers of columns of arguments do not match
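A quick way to locate the offending element (a diagnostic sketch with mock frames, not code from the original post) is to compare column names across the list and drop whatever isn't shared by all elements before merging:

```r
# Two frames that deliberately disagree in one column
dat <- list(data.frame(a = 1, b = 2), data.frame(a = 3, b = 4, X2 = 5))

# Columns shared by every element of the list
all_names <- Reduce(intersect, lapply(dat, names))

# Show, per element, which columns are extra (element 2 still has "X2")
lapply(dat, function(x) setdiff(names(x), all_names))

# Keep only the shared columns, after which rbind succeeds
dat <- lapply(dat, function(x) x[all_names])
df <- do.call(rbind, dat)
```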
I have several data frames and I would like to run the head function over all of them. I tried the following but it doesn’t work, as it returns the name of the data frame but not the head of the data frame itself.
df.a <- data.frame(col1 = "a", col2 = 1)
df.b <- data.frame(col1 = "b", col2 = 2)
df.c <- data.frame(col1 = "c", col2 = 3)
list <- ls()
for (i in 1:length(list)){
head(list[i])
}
lapply(ls(),head)
Any idea on how to do it or why it is not working?
Put your data frames into a list, and add print to your loop.
my.list <- list(df.a, df.b, df.c)
for (i in seq_along(my.list)){
print(head(my.list[[i]]))
}
ls() returns only the names of the objects as a character vector; we need their values. If the object names share a pattern, specify it in ls(), wrap the call in mget() to get the values as a list, then loop over the list with lapply() and take the head of each:
lapply(mget(ls(pattern="df\\.")), head)
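A quick check that mget() hands lapply() the data frames themselves, with their object names preserved, rather than just their names as strings:

```r
df.a <- data.frame(col1 = "a", col2 = 1)
df.b <- data.frame(col1 = "b", col2 = 2)

# mget() turns the matching names into a named list of the objects
heads <- lapply(mget(ls(pattern = "^df\\.")), head)
names(heads)  # includes "df.a" and "df.b"
```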
If I am working with dataframes in a loop, how can I use a variable data frame name (and additionally, variable column names) to access data frame contents?
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10),Y = sample(c("yes", "no"), 10, replace = TRUE))
for (i in seq_along(dfnames)){
curr.dfname <- dfnames[i]
#how can I do this:
curr.dfname$X <- 42:51
#...this
dfnames[i]$X <- 42:51
#or even this doubly variable call
for (j in seq_along(colnames(curr.dfname))) {
curr.dfname[, colnames(curr.dfname)[j]] <- 42:51
}
}
You can use get() to return a variable reference based on a string of its name:
> x <- 1:10
> get("x")
[1] 1 2 3 4 5 6 7 8 9 10
So, yes, you could iterate through dfnames like:
dfnames <- c("df1","df2")
df1 <- df2 <- data.frame(X = sample(1:10), Y = sample(c("yes", "no"), 10, replace = TRUE))
for (cur.dfname in dfnames)
{
cur.df <- get(cur.dfname)
# for a fixed column name
cur.df$X <- 42:51
# iterating through column names as well
for (j in colnames(cur.df))
{
cur.df[, j] <- 42:51
}
}
I really think that this is gonna be a painful approach, though. As the commenters say, if you can get the data frames into a list and then iterate through that, it'll probably perform better and be more readable. Unfortunately, get() isn't vectorised as far as I'm aware, so if you only have a string list of data frame names, you'll have to iterate through that to get a data frame list:
# build data frame list
df.list <- list()
for (i in 1:length(dfnames))
{
df.list[[i]] <- get(dfnames[i])
}
# iterate through data frames
for (cur.df in df.list)
{
cur.df$X <- 42:51
}
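One caveat: get() returns a copy, so assignments to cur.df inside these loops do not change the original df1 and df2. To write the result back under the variable's name, pair get() with assign() (a sketch, with the vector length matched to the number of rows):

```r
df1 <- df2 <- data.frame(X = 1:10)
for (cur.dfname in c("df1", "df2")) {
  cur.df <- get(cur.dfname)      # fetch a copy by name
  cur.df$X <- 42:51              # ten values for ten rows
  assign(cur.dfname, cur.df)     # write the modified copy back
}
df1$X[1]  # 42: the original object was updated
```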
Hope that helps!
2018 Update: I probably wouldn't do something like this anymore. Instead, I'd put the data frames in a list and then use purrr:map(), or, the base equivalent, lapply():
library(tidyverse)
stuff_to_do = function(mydata) {
mydata$somecol = 42:51
# … anything else I want to do to the current data frame
mydata # return it
}
df_list = list(df1, df2)
map(df_list, stuff_to_do)
This brings back a list of modified data frames. You can also use the variants map_dfr() and map_dfc() to automatically bind the processed data frames row-wise or column-wise, respectively. The former matches columns by name rather than by position, and it can add an ID column built from the names of the input list via the .id argument, so it comes with some nice added functionality over lapply()!
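For instance, a base-R analogue of map_dfr(..., .id = "source") (the column name "source" and the mock frames are my choices for illustration), binding the processed frames row-wise while keeping the list names as an ID column:

```r
df_list <- list(first = data.frame(X = 1:2), second = data.frame(X = 3:4))

# Process each frame, then bind rows, carrying the list names along as an ID
out <- lapply(df_list, function(d) { d$Y <- d$X * 2; d })
bound <- do.call(rbind, Map(function(d, nm) cbind(source = nm, d), out, names(out)))
bound  # four rows; source is "first" for the first two, "second" for the rest
```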