Define dimensions of an empty dataframe - r

I am trying to collect some data from multiple subsets of a data set and need to create a data frame to collect the results. My problem is that I don't know how to create an empty data frame with a defined number of columns without actually having data to put into it.
collect1 <- c() ## i'd like to create empty df w/ 3 columns: `id`, `max1` and `min1`
for(i in 1:10){
collect1$id <- i
ss1 <- subset(df1, df1$id == i)
collect1$max1 <- max(ss1$value)
collect1$min1 <- min(ss1$value)
}
I feel very dumb asking this question (I almost feel like I've asked it on SO before but can't find it) but would greatly appreciate any help.

Would a dataframe of NAs work?
something like:
data.frame(matrix(NA, nrow = 2, ncol = 3))
If you need to be more specific about the data type, you may prefer NA_integer_, NA_real_, NA_complex_, or NA_character_ instead of plain NA, which is logical.
Something else that may be more specific than the NAs is:
data.frame(matrix(vector(mode = 'numeric',length = 6), nrow = 2, ncol = 3))
where the mode can be of any type. See ?vector
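To make the point about typed NA values concrete, here is a minimal sketch. The column names id, max1 and min1 come from the question; the choice of two rows is arbitrary and only for illustration.

```r
# Pre-fill each column with a typed NA so it has the intended class from the start.
n_rows <- 2  # arbitrary size, assumed for illustration
collect1 <- data.frame(
  id   = rep(NA_integer_, n_rows),  # integer column
  max1 = rep(NA_real_, n_rows),     # double column
  min1 = rep(NA_real_, n_rows)      # double column
)
str(collect1)
```

With plain NA, all three columns would start out as logical and would only change class on the first assignment.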

Just create a data frame of empty vectors:
collect1 <- data.frame(id = character(0), max1 = numeric(0), min1 = numeric(0))
But if you know how many rows you're going to have in advance, you should just create the data frame with that many rows to start with.

You can do something like:
N <- 10
collect1 <- data.frame(id = integer(N),
max1 = numeric(N),
min1 = numeric(N))
Be careful: in the rest of your code you forgot to use the row index when filling the data.frame row by row. It should be:
for(i in seq_len(N)){
collect1$id[i] <- i
ss1 <- subset(df1, df1$id == i)
collect1$max1[i] <- max(ss1$value)
collect1$min1[i] <- min(ss1$value)
}
Finally, I would say that there are many alternatives for doing what you are trying to accomplish, some would be much more efficient and use a lot less typing. You could for example look at the aggregate function, or ddply from the plyr package.
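As a rough sketch of the aggregate route mentioned above: the question does not show df1, so the data frame below is a made-up stand-in with the id and value columns the loop relies on.

```r
# Hypothetical input; the real df1 is not shown in the question.
df1 <- data.frame(
  id    = rep(1:3, each = 4),
  value = c(5, 2, 9, 1, 7, 7, 3, 8, 4, 6, 0, 2)
)

# One row per id, computed without an explicit loop.
max1 <- aggregate(value ~ id, data = df1, FUN = max)
min1 <- aggregate(value ~ id, data = df1, FUN = min)

# Both results are ordered by id, so the columns line up.
collect1 <- data.frame(id = max1$id, max1 = max1$value, min1 = min1$value)
collect1
```

No pre-allocated data frame is needed here, because the result is built in one step.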

You may use NULL instead of NA. This creates a truly empty data frame.

Here is a solution if you want an empty data frame with a defined number of rows and NO columns:
df = data.frame(matrix(NA, ncol=1, nrow=10))[-1]

df = data.frame(matrix("", ncol = 3, nrow = 10))

The solution given on the R-help mailing list might help. Basically:
Cols <- paste("A", 1:5, sep="")
DF <- read.table(textConnection(""), col.names = Cols,colClasses = "character")
> str(DF)
'data.frame': 0 obs. of 5 variables:
$ A1: chr
$ A2: chr
$ A3: chr
$ A4: chr
$ A5: chr
You can change the colClasses to fit your needs.
Original link is
https://stat.ethz.ch/pipermail/r-help/2008-August/169966.html

A more general way to create an empty data frame with an arbitrary number of columns is to build a 1-by-n data frame from a matrix of the same dimensions and then immediately drop its only row:
> v <- data.frame(matrix(NA, nrow=1, ncol=10))
> v <- v[-1, , drop=FALSE]
> v
[1] X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<0 rows> (or 0-length row.names)

If only the column names are available, like:
cnms <- c("Nam1","Nam2","Nam3")
To create an empty data frame with the above variable names, first create a data.frame object:
emptydf <- data.frame()
Now assign a zero-length vector to every column name, thus creating an empty data frame with the given variable names:
for(nm in cnms){
emptydf[[nm]] <- logical(0)
}

seq_len(nrow(df)) may help: it counts the rows in your data and lets you create a data.frame with the desired number of rows (seq_along(df) would count the columns of a data frame instead):
listdf <- data.frame(ID=seq_len(nrow(df)),
var1=seq_len(nrow(df)), var2=seq_len(nrow(df)))

I have come across the same problem and have a cleaner solution. Instead of creating an empty data.frame you can instead save your data as a named list. Once you have added all results to this list you convert it to a data.frame after.
For the case of adding features one at a time this works best.
mylist = list()
for(column in 1:10) mylist[[paste0("V", column)]] = rnorm(10) # [[ ]] uses the loop value; $column would overwrite one element named "column"
mydf = data.frame(mylist)
For the case of adding rows one at a time this becomes tricky due to mixed types. If all types are the same it is easy.
mylist = list()
for(row in 1:10) mylist[[row]] = rnorm(10) # one list element per row
mydf = data.frame(do.call(rbind, mylist))
I haven't found a simple way to add rows of mixed types. In this case, if you must do it this way, the empty data.frame is probably the best solution.
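One possible way to handle rows of mixed types, offered as a sketch rather than a definitive answer, is to store each row as a one-row data frame in the list and bind them once at the end. The column names id, label and score are invented for illustration.

```r
# Collect one-row data frames in a pre-sized list, then bind once at the end.
rows <- vector("list", 3)
for (i in seq_along(rows)) {
  # Each row mixes integer, character and double values (hypothetical columns).
  rows[[i]] <- data.frame(id = i, label = letters[i], score = i / 2)
}
mydf <- do.call(rbind, rows)
str(mydf)
```

Binding once after the loop avoids growing a data frame on every iteration, which gets slow for many rows.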

Related

What is the easiest way to add "n" empty colums to a R dataframe? [duplicate]


How can lapply work with addressing columns as unknown variables?

So, I have a list of strings named control_for. I have a data frame sampleTable with some of the columns named as strings from control_for list. And I have a third object dge_obj (DGElist object) where I want to append those columns. What I wanted to do - use lapply to loop through control_for list, and for each string, find a column in sampleTable with the same name, and then add that column (as a factor) to a DGElist object. For example, for doing it manually with just one string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example of help("DGEList") and a mock up data.frame sampleTable.
Define a vector common_vars of the table's names in control_for. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply keeps every column a factor; sapply would simplify the result to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
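The underlying issue can be shown without edgeR. In the sketch below, obj is a plain list standing in for dge_obj (an assumption made so the example runs anywhere), and lib.size is just a placeholder column. Writing obj$samples$x would create a column literally named "x", whereas [[x]] uses the string stored in x. Note also that an assignment inside the function passed to lapply only changes a local copy, which is why the for loop is used here.

```r
# Stand-in for dge_obj: a plain list holding a 'samples' data frame (assumed).
obj <- list(samples = data.frame(lib.size = c(10, 20, 30, 40)))
sampleTable <- data.frame(a = 1:4, b = 5:8)
control_for <- c("a", "b")

# [[x]] looks up the column named by the value of x; $x would not evaluate x.
for (x in control_for) {
  obj$samples[[x]] <- factor(sampleTable[[x]])
}
names(obj$samples)
```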

cbind equally named vectors in multiple data.frames in a list to a single data.frame

I have a list similar to this one:
set.seed(1602)
l <- list(data.frame(subst_name = sample(LETTERS[1:10]), perc = runif(10), crop = rep("type1", 10)),
data.frame(subst_name = sample(LETTERS[1:7]), perc = runif(7), crop = rep("type2", 7)),
data.frame(subst_name = sample(LETTERS[1:4]), perc = runif(4), crop = rep("type3", 4)),
NULL,
data.frame(subst_name = sample(LETTERS[1:9]), perc = runif(9), crop = rep("type5", 9)))
Question: How can I extract the subst_name-column of each data.frame and combine them with cbind() (or similar functions) to a new data.frame without messing up the order of each column? Additionally the columns should be named after the corresponding crop type (this is possible 'cause the crop types are unique for each data.frame)
EDIT: The output should look as follows:
Having read the comments, I'm aware that this doesn't make much sense within R, but for the sake of having a look at the output, the data.frame's View option is quite handy.
With the help of this SO question I came up with the following solution. (There's probably room for improvement.)
a <- lapply(l, '[[', 1) # extract the first element of the dfs in the list
a <- Filter(function(x) !is.null(unlist(x)), a) # remove NULLs
a <- lapply(a, as.character)
max.length <- max(sapply(a, length))
## Add NA values to list elements
b <- lapply(a, function(v) { c(v, rep(NA, max.length-length(v)))})
e <- as.data.frame(do.call(cbind, b))
names(e) <- unlist(lapply(lapply(lapply(l, '[[', "crop"), '[[', 2), as.character))
Strictly speaking, this is not a clean operation for the given example, because the data frames in the list do not all have the same number of rows. But if you don't mind that, you can do:
nullElements = unlist(sapply(l,is.null))
l = l[!nullElements] #delete useless null elements in list
columns=lapply(l,function(x) return(as.character(x$subst_name)))
newDf = as.data.frame(Reduce(cbind,columns))
If you don't want recycled elements in the columns you can do
for(i in 1:ncol(newDf)){
colLength = nrow(l[[i]])
newDf[(colLength+1):nrow(newDf),i] = NA
}
newDf = newDf[1:max(unlist(sapply(l,nrow))),] #remove possible extra NA rows
Note that I edited my previous code to remove NULL entries from l to simplify things
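An alternative that avoids recycling altogether is to pad each column with NA before combining. This is only a sketch on a smaller, made-up stand-in for l, and it relies on `length<-` extending a vector with NA.

```r
# Hypothetical, smaller stand-in for the list l from the question.
l <- list(
  data.frame(subst_name = c("C", "A", "B"), crop = "type1"),
  NULL,
  data.frame(subst_name = c("B", "A"), crop = "type3")
)

l <- Filter(Negate(is.null), l)  # drop NULL entries
cols <- lapply(l, function(d) as.character(d$subst_name))
names(cols) <- vapply(l, function(d) as.character(d$crop[1]), character(1))

n <- max(lengths(cols))
padded <- lapply(cols, `length<-`, n)  # extend shorter vectors with NA
newDf <- as.data.frame(padded, stringsAsFactors = FALSE)
newDf
```

The original order within each column is preserved, and the columns are named after the crop type.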

Nested named list to data frame

I have the following named list output from a analysis. The reproducible code is as follows:
list(structure(c(-213.555409754509, -212.033637890131, -212.029474755074,
-211.320398316741, -211.158815833294, -210.470525157849), .Names = c("wasn",
"chappal", "mummyji", "kmph", "flung", "movie")), structure(c(-220.119433774144,
-219.186901747536, -218.743319709963, -218.088361753899, -217.338920075687,
-217.186050877079), .Names = c("crazy", "wired", "skanndtyagi",
"andr", "unveiled", "contraption")))
I want to convert this to a data frame. I have tried unlist to data frame options using reshape2, dplyr and other solutions given for converting a list to a data frame but without much success. The output that I am looking for is something like this:
Col1 Val1 Col2 Val2
1 wasn -213.55 crazy -220.11
2 chappal -212.03 wired -219.18
3 mummyji -212.02 skanndtyagi -218.74
and so on. The actual output has multiple columns with paired values and runs to many rows. I have already tried the following code:
do.call(rbind, lapply(df, data.frame, stringsAsFactors = TRUE))
works partially provides all the character values in a column and numeric values in the second.
data.frame(Reduce(rbind, df))
didn't work - provides the names in the first list and numbers from both the lists as two different rows
colNames <- unique(unlist(lapply(df, names)))
M <- matrix(0, nrow = length(df), ncol = length(colNames),
dimnames = list(names(df), colNames))
matches <- lapply(df, function(x) match(names(x), colNames))
M[cbind(rep(sequence(nrow(M)), sapply(matches, length)),
unlist(matches))] <- unlist(df)
M
didn't work correctly.
Can someone help?
Since the list elements are all of the same length, you should be able to stack them and then combine them by columns.
Try:
do.call(cbind, lapply(myList, stack))
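As a small illustration, here is that call applied to a shortened version of the list, using the rounded values from the desired output in the question. The name myList and the final column names are assumptions.

```r
# Shortened list with the rounded values shown in the question; myList is an assumed name.
myList <- list(
  c(wasn = -213.55, chappal = -212.03, mummyji = -212.02),
  c(crazy = -220.11, wired = -219.18, skanndtyagi = -218.74)
)

# stack() turns each named vector into a two-column data frame (values, ind).
res <- do.call(cbind, lapply(myList, stack))

# stack() puts the values first and the names second, so label the columns accordingly.
names(res) <- c("Val1", "Col1", "Val2", "Col2")
res
```

Reordering the columns afterwards, for example with res[c("Col1", "Val1", "Col2", "Val2")], gives the layout asked for.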
Here's another way:
as.data.frame( c(col = lapply(x, names), val = lapply(x,unname)) )
How it works. lapply returns a list; two lists combined with c make another list; and a list is easily coerced to a data.frame, since the latter is just a list of vectors having the same length.
Better than coercing to a data.frame is just modifying its class, effectively telling the list "you're a data.frame now":
L = c(col = lapply(x, names), val = lapply(x,unname))
library(data.table)
setDF(L)
The result doesn't need to be assigned anywhere with = or <- because L is modified "in place."

Adding data frames into a list within a forloop

I have a for loop that generates a dataframe every time it loops through. I am trying to create a list of data frames but I cannot seem to figure out a good way to do this.
For example, with vectors I usually do something like this:
my_numbers <- c()
for (i in 1:4){
my_numbers <- c(my_numbers,i)
}
This will result in a vector c(1,2,3,4). I want to do something similar with dataframes, but accessing the list of data frames is quite difficult when i use:
my_dataframes <- list(my_dataframes,DATAFRAME).
Help please. The main goal is just to create a list of dataframes that I can later on access dataframe by dataframe. Thank you.
I'm sure you've noticed that list does not do what you want it to do, nor should it. c also doesn't work in this case because it flattens data frames, even when recursive=FALSE.
You can use append. As in,
data_frame_list = list()
for( i in 1:5 ){
d = create_data_frame(i)
data_frame_list = append(data_frame_list, list(d)) # wrap d in list() so it is added as one element
}
Better still, you can assign directly to indexed elements, even if those elements don't exist yet:
data_frame_list = list()
for( i in 1:5 ){
data_frame_list[[i]] = create_data_frame(i)
}
This applies to vectors, too. But if you want to create a vector c(1,2,3,4) just use 1:4, or its underlying function seq.
Of course, lapply or the *lply functions from plyr are often better than looping depending on your application.
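A minimal sketch of the lapply variant follows. Here create_data_frame() is a hypothetical stand-in, since the question does not show how each data frame is produced.

```r
# Hypothetical generator; replace with whatever builds each data frame.
create_data_frame <- function(i) data.frame(run = i, x = seq_len(i))

# lapply returns a list with one data frame per input value.
data_frame_list <- lapply(1:4, create_data_frame)
names(data_frame_list) <- paste0("run", 1:4)  # optional: allows access by name

nrow(data_frame_list[["run3"]])  # access a single data frame

# If a single table is needed later, bind all elements at once.
combined <- do.call(rbind, data_frame_list)
```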
Continuing with your for loop method, here's a little example of creating and accessing.
> my_numbers <- vector('list', 4)
> for (i in 1:4) my_numbers[[i]] <- data.frame(x = seq(i))
And we can access the first column of each data frame with,
> sapply(my_numbers, "[", 1)
# $x
# [1] 1
#
# $x
# [1] 1 2
#
# $x
# [1] 1 2 3
#
# $x
# [1] 1 2 3 4
Other ways of accessing the data are my_numbers[[1]] for the first data set,
lapply(my_numbers, "[", 1,) to access the first row of each data frame, etc.
You can use operator [[ ]] for this purpose.
l <- list()
df1 <- data.frame(name = 'df1', a = 1:5 , b = letters[1:5])
df2 <- data.frame(name = 'df2', a = 6:10 , b = letters[6:10])
df3 <- data.frame(name = 'df3', a = 11:20 , b = letters[11:20])
df <- rbind(df1,df2,df3)
for(df_name in unique(df$name)){
l[[df_name]] <- df[df$name == df_name,]
}
In this example, there are three separate data frames and, in order to store them in a list using a for loop, we place them in one. Using the operator [[ we can even name the data frame in the list as we want and store it in the list normally.
