Add columns to an empty data frame in R

I have searched extensively but not found an answer to this question on Stack Overflow.
Let's say I have a data frame a.
I define:
a <- NULL
a <- as.data.frame(a)
If I wanted to add a column to this data frame like so:
a$col1 <- c(1,2,3)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "a", value = c(1, 2, 3)) :
replacement has 3 rows, data has 0
Why is the row dimension fixed but the column is not?
How do I change the number of rows in a data frame?
If I do this (inputting the data into a list first and then converting to a df), it works fine:
a <- NULL
a$col1 <- c(1,2,3)
a <- as.data.frame(a)

The row dimension is not fixed, but a data.frame is stored as a list of vectors that are constrained to have the same length. You cannot add col1 to a because col1 has three values (rows) and a has zero, which breaks the constraint. By default, R does not auto-vivify values (that is, pad the existing columns with NA) when you try to extend a data.frame by adding a column that is longer than the data.frame. The second example works because col1 is the only vector in the list when it is converted, so the data.frame is initialized with three rows.
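A minimal sketch of that constraint, using only base R (the names b and res are made up for illustration): assigning a length-3 column to a zero-row data.frame fails, while building a list first and converting it at the end works.

```r
# An empty data.frame has zero rows and zero columns
a <- data.frame()
dim(a)

# Adding a length-3 column breaks the equal-length constraint, so R raises an error;
# tryCatch captures the message instead of stopping the script
res <- tryCatch({
    a$col1 <- c(1, 2, 3)
    "ok"
}, error = function(e) conditionMessage(e))
res

# A plain list has no such constraint; convert once all columns are in place
b <- list()
b$col1 <- c(1, 2, 3)
b$col2 <- c("x", "y", "z")
b <- as.data.frame(b)
dim(b)
```

The conversion step is where the equal-length check happens, so every element of the list must have the same length (or a length that recycles cleanly) by then.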
If you want to automatically have the data.frame expand, you can use the following function:
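cbind.all <- function (...)
{
    nm <- list(...)
    nm <- lapply(nm, as.matrix)      # coerce every input to a matrix
    n <- max(sapply(nm, nrow))       # row count of the longest input
    # pad each shorter matrix with NA rows, then bind the columns
    do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n -
        nrow(x), ncol(x)))))
}
This pads the shorter inputs with NA. Note that the result is a matrix, not a data.frame. You would use it like: cbind.all(df, a)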
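A short usage sketch, with the function repeated so the block runs on its own. The inputs df and a are hypothetical: a two-row data frame and a three-element vector.

```r
# Same logic as cbind.all above, repeated so this block is self-contained
cbind.all <- function(...) {
    nm <- lapply(list(...), as.matrix)
    n <- max(sapply(nm, nrow))
    do.call(cbind, lapply(nm, function(x) rbind(x, matrix(, n - nrow(x), ncol(x)))))
}

df <- data.frame(x = 1:2)   # hypothetical two-row data frame
a <- c(10, 20, 30)          # hypothetical column that is one row longer

res <- cbind.all(df, a)
dim(res)           # three rows, two columns
is.na(res[3, 1])   # the cell that was padded for the shorter input
```

Because everything passes through as.matrix, mixing numeric and character inputs would coerce the whole result to character; wrap the result in as.data.frame if a data frame is needed afterwards.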

You could also do something like this where I read in data from multiple files, grab the column I want, and store it in the dataframe. I check whether the dataframe has anything in it, and if it doesn't, create a new one rather than getting the error about mismatched number of rows:
readCounts = data.frame()
for(f in names(files)){
    d = read.table(files[f], header=TRUE, as.is=TRUE)
    d2 = round(data.frame(d$NumReads))
    colnames(d2) = f
    if(ncol(readCounts) == 0){
        # first file: start the data frame from this column
        readCounts = d2
        rownames(readCounts) = d$Name
    } else{
        readCounts = cbind(readCounts, d2)
    }
}
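An alternative sketch that sidesteps the empty-data-frame check altogether: collect the columns in a list and convert once at the end. The in-memory tables below are hypothetical stand-ins for what read.table would return, and this assumes every file lists the same names in the same order.

```r
# Hypothetical stand-ins for the tables that read.table() would return
tables <- list(
    s1 = data.frame(Name = c("g1", "g2"), NumReads = c(1.4, 2.6)),
    s2 = data.frame(Name = c("g1", "g2"), NumReads = c(10.2, 0.7))
)

# One rounded column per table, named after the list element
cols <- lapply(tables, function(d) round(d$NumReads))

# All columns have the same length, so the list converts directly
readCounts <- as.data.frame(cols)
rownames(readCounts) <- tables[[1]]$Name
readCounts
```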

If you have an empty data frame, called for example df, another quite simple solution in my opinion is the following:
df[1,]=NA # add a temporary new row of NA values
df[,'new_column'] = NA # add the new column, called for example 'new_column'
df = df[0, , drop=FALSE] # delete the row of NAs; drop=FALSE keeps a data frame even with a single column
I hope this helps.
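If the goal is only to give an empty data frame a column, a shorter route is to assign a zero-length vector, since its length matches the zero rows. A minimal sketch (the column name new_column and the numeric type are just examples):

```r
df <- data.frame()             # empty: 0 rows, 0 columns
df$new_column <- numeric(0)    # zero-length vector, so the lengths agree
str(df)                        # still 0 rows, now with one numeric column
```

Using character(0) or logical(0) instead sets the column type accordingly.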

Related

Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, from which 5/6 data frames have a column "Z". To proceed with my script, I need to remove the data frame which doesn't have column Z, so I tried the following code:
for(i in 1:length(df)){
if(!("Z" %in% colnames(df[[i]])))
{
df[[i]] = NULL
}
}
This seemed to do the job (it removed the one data frame from the list that didn't have column Z), but I still got the message "Error in df[[i]] : subscript out of bounds". Why is that, and how can I get around the error?
The base Filter function works well here:
df <- Filter(\(x) "Z" %in% names(x), df)
As to why your method doesn't work, for(i in 1:length(df)) iterates over each item in the original length(df). As soon as df[[i]] = NULL happens once, then df is shorter than it was when the loop started, so the last iteration will be out of bounds. And you'll also skip some items: if df[[2]] is removed then the original df[[3]] is now df[[2]], and the current df[[3]] was originally df[[4]], so you hop over the original df[[3]] without checking it. Lesson: don't change the length of objects in the midst of iterating over them.
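If a loop is still preferred, one common workaround (not part of the answer above) is to iterate from the end of the list, so that removing an element never shifts the positions still to be visited. A sketch with a made-up list dfs:

```r
# Hypothetical list: elements 2 and 4 lack column Z
dfs <- list(
    data.frame(A = 1, Z = 2),
    data.frame(A = 1, B = 2),
    data.frame(Z = 3),
    data.frame(B = 1)
)

# Walk backwards: deleting element i only affects positions after i,
# and those have already been checked
for (i in rev(seq_along(dfs))) {
    if (!("Z" %in% colnames(dfs[[i]]))) {
        dfs[[i]] <- NULL
    }
}

length(dfs)
```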
If df is your list of 6 dataframes, you can do this:
df <- df[sapply(df, \(i) "Z" %in% colnames(i))]
The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
Using discard:
list_df <- list(df1, df2)
purrr::discard(list_df, ~any(colnames(.x) == "Z"))
Output:
[[1]]
A B
1 1 3
2 3 4
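As you can see, discard removed the first data frame, which had column Z, so this keeps the opposite set from what the question asks for. To keep only the data frames that have column Z, use purrr::keep with the same predicate.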
data
df1 <- data.frame(A = c(1,2),
Z = c(1,4))
df2 <- data.frame(A = c(1,3),
B = c(3,4))

How can lapply work with addressing columns as unknown variables?

So, I have a list of strings named control_for. I have a data frame sampleTable with some of the columns named as strings from control_for list. And I have a third object dge_obj (DGElist object) where I want to append those columns. What I wanted to do - use lapply to loop through control_for list, and for each string, find a column in sampleTable with the same name, and then add that column (as a factor) to a DGElist object. For example, for doing it manually with just one string, it looks like this, and it works:
group <- as.factor(sampleTable[,3])
dge_obj$samples$group <- group
And I tried something like this:
lapply(control_for, function(x) {
x <- as.factor(sampleTable[, x])
dge_obj$samples$x <- x
})
Which doesn't work. I guess the problem is that R can't recognize addressing columns like this. Can someone help?
Here are two base R ways of doing it. The data set is the example from help("DGEList"), plus a mock-up data.frame sampleTable.
Define a vector common_vars holding the names in control_for that are also columns of sampleTable. Then create the new columns.
library(edgeR)
sampleTable <- data.frame(a = 1:4, b = 5:8, no = letters[21:24])
control_for <- c("a", "b")
common_vars <- intersect(control_for, names(sampleTable))
1. for loop
for(x in common_vars){
y <- sampleTable[[x]]
dge_obj$samples[[x]] <- factor(y)
}
2. *apply loop.
# lapply keeps each column a factor; sapply would simplify the result to a character matrix
tmp <- lapply(sampleTable[common_vars], factor)
dge_obj$samples <- cbind(dge_obj$samples, tmp)
This code can be rewritten as a one-liner.
Data
set.seed(2021)
y <- matrix(rnbinom(10000,mu=5,size=2),ncol=4)
dge_obj <- DGEList(counts=y, group=rep(1:2,each=2))
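The answer does not spell out why the original lapply attempt fails. Two things go wrong: `$x` refers to a column literally named x rather than the name stored in x, and an assignment inside the anonymous function only changes a local copy. A minimal sketch, with a plain list obj standing in for the DGEList object (an assumption, so that edgeR is not needed to run it):

```r
sampleTable <- data.frame(a = 1:4, b = 5:8)
obj <- list(samples = data.frame(id = 1:4))   # hypothetical stand-in for dge_obj

x <- "a"
obj$samples$x <- factor(sampleTable[[x]])     # creates a column literally named "x"
obj$samples[[x]] <- factor(sampleTable[[x]])  # [[ ]] uses the value of x: column "a"
names(obj$samples)

# Assigning inside a function modifies a local copy of obj only
invisible(lapply("b", function(v) {
    obj$samples[[v]] <- factor(sampleTable[[v]])
}))
"b" %in% names(obj$samples)                   # the outer obj is unchanged
```

This is why the for loop in option 1 works: it runs in the calling environment and indexes with [[x]].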

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways, but for the moment this is more intuitive for me as it ensures that each computation is associated with a particular name. There can be several columns and rows, and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace results back to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!
You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the strings into factors whose only levels are the values in that first row, so values from later rows are not valid levels and become NA.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
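One more detail about the loop version: c(name, m, s) mixes a string with numbers, so c() coerces everything to character and the m and s columns end up as text. If a row-by-row structure is preferred, a sketch that keeps the numbers numeric is to build one single-row data frame per name and bind them once at the end (the name rows is made up for illustration):

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# c() has to pick one type, so numbers become strings
class(c("a_1", 2, 6))

# One single-row data frame per element keeps each column's own type
rows <- lapply(names(x), function(name) {
    data.frame(name = name, m = mean(x[[name]]), s = sum(x[[name]]),
               stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)   # bind all rows in one call
df
```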

How to convert single column data into two-column matrix using conditional/for loop in R

I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would normally stick with packages, here is a solution:
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = FALSE)
process
ind <- grep(">", mydf$x) # rows holding the ">" header lines
# each record runs from the line after its header to the line before the next header
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in seq_along(ind)) {
    seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
                              name                     sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2         PROKKA_00003 lipoprotein  MTESSITERGAPELMRTIIVIASLLLT
Try creating an index of the rows that contain the target symbol, i.e. the header lines, then use it to pair each remaining row with its header. cumsum(ind1) coerces the logical vector to numeric and gives every row a running group id, so mydf$x[ind1][cumsum(ind1)] repeats the header for each row in its group; [!ind1,] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
#            Name               Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002      MTESSITERGAPEL
# 5 >PROKKA_00003         lipoprotein
# 6 >PROKKA_00003       MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
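The same cumsum grouping can also concatenate the sequence lines directly, which is what the question asks for. A base R sketch on shortened, made-up sequences; it assumes every ">" header is followed by at least one sequence line, otherwise names and sequences would misalign:

```r
# Shortened example input (sequences are abbreviated for illustration)
lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
           "MTESS",
           "QLLQT",
           ">PROKKA_00003 lipoprotein",
           "MRTII")

is_header <- grepl("^>", lines)   # TRUE for lines starting with ">"
group <- cumsum(is_header)        # running record id: 1 1 1 2 2

# Paste together the non-header lines of each record
sequences <- tapply(lines[!is_header], group[!is_header], paste, collapse = "")

fasta <- data.frame(name = lines[is_header],
                    sequence = as.vector(sequences),
                    stringsAsFactors = FALSE)
fasta
```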
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe this is what you were looking for in your original post.

Print the last row from a list of data frames

I have a list of data frames from which I need to obtain the last row of the 2nd column. The data frames have differing numbers of rows. I've already written code using lapply that can extract any row given by the variable num (returning NA for numbers that exceed the number of rows in a data frame), but I also want to support num = "worst", which should return the last row of the 2nd column of the available data. This is the code to retrieve the nth row (xyz is the list of data frames):
if(num=="best"){num=as.integer(1)} else
(num=as.integer())
rownumber<-lapply(xyz, "[", num, 2, drop=FALSE)
I have been racking my brain all day trying to find a way to handle num == "worst". I want to avoid loops, hence my use of lapply, but perhaps there is no other way?
How about...
lapply(xyz, tail, 1)
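Note that tail(d, 1) returns the whole last row, not just the 2nd column. A sketch of a small helper that maps "best" and "worst" to row numbers and keeps the question's `[` indexing otherwise; the name pick and the example data are made up for illustration:

```r
# Hypothetical list of data frames with different numbers of rows
xyz <- list(
    data.frame(id = 1:3, score = c(0.1, 0.2, 0.3)),
    data.frame(id = 1:2, score = c(0.5, 0.9))
)

pick <- function(d, num) {
    # translate the keywords into row numbers; anything else is used as given
    row <- if (identical(num, "best")) {
        1L
    } else if (identical(num, "worst")) {
        nrow(d)
    } else {
        as.integer(num)
    }
    d[row, 2, drop = FALSE]   # rows beyond nrow(d) come back as NA
}

res <- lapply(xyz, pick, num = "worst")
res
```

Since nrow(d) is evaluated per data frame, each element gets its own last row without an explicit loop.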
My understanding of the question is that you want a function that returns the second column of a data.frame from a list of dataframes, with an optional argument worst that allows you to restrict it to the last observation.
I think the simplest way to do this is to write a helper function, and then apply it to your list using lapply.
I have written a selector function that takes a row and column argument, as well as a worst argument. I think this does everything you need.
df1 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10))
df2 <- data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10))
ldf <- list(df1, df2)
selector <- function(DF, col, row=NULL, worst=FALSE){
    # bounds checks come first so out-of-range requests return NA
    if(!missing(col)) if(col > ncol(DF)) return(NA)
    if(!is.null(row)) if(row > nrow(DF)) return(NA)
    if(worst) {
        tail(DF[,col, drop=FALSE],1) # last row of the requested column
    } else {
        DF[row, col, drop=FALSE]
    }
}
lapply(ldf, selector, col=2, worst=TRUE)

Resources