Filling empty dataframe in R - r

I have used the search box for this and have found similar questions, but not identical ones. It seems that this is an easy problem though (I'm an R newbie).
I am simply trying to create a new data frame and add values to it. Not surprisingly, R throws an error saying that the rows don't match.
Here's the code
d <- data.frame()
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d$fileName <- lapply(files, basename)
d$node <- gsub("([^.]+)\.[^\.lst]+\.lst", "$1", d$fileName, perl=TRUE)
And here's the error
Error in $<-.data.frame(*tmp*, "fileName", value =
list("A-bom.WR-P-E-A.lst", : replacement has 337 rows, data has 0
How would I go about this problem? I thought about filling d with the same amount of rows as there are files, but I don't think that that's the best way?

Simply create your data frame when it's used the first time, so you don't "add" rows to a data frame with zero rows. You can also use sapply to return a (named) vector instead of a list.
files <- list.files(pattern="*.lst", full.names=T, recursive=FALSE)
d <- data.frame(fileName = unname(sapply(files, basename)))
# R writes backreferences in the replacement as \\1, not $1
d$node <- gsub("([^.]+)\\.[^\\.lst]+\\.lst", "\\1", d$fileName, perl=TRUE)
Your regular expression caused an error. However, I'm not that familiar with regex, so you may have to fix my fixes ;-)
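A minimal sketch of the same idea, assuming the node is everything before the first dot of the file name. The file names below are hypothetical stand-ins for the output of list.files(). basename is already vectorized, so no lapply or sapply is needed here.

```r
# Hypothetical file names standing in for list.files(pattern = "\\.lst$", full.names = TRUE)
files <- c("./A-bom.WR-P-E-A.lst", "./B-xyz.WR-P-E-B.lst")

# Build the data frame in one step, so no column is ever added to a zero-row frame
d <- data.frame(fileName = basename(files), stringsAsFactors = FALSE)

# Keep only the part before the first dot; \\1 refers to the captured group
d$node <- sub("^([^.]+)\\..*$", "\\1", d$fileName)

print(d)
```

Creating the data frame from a vector of the right length avoids the "replacement has 337 rows, data has 0" error entirely.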

Related

RMSE on dataframe of multiple files in R

My goal is to read many files into R, and ultimately, run a Root Mean Square Error (rmse) function on each pair of columns within each file.
I have this code:
#This calls all the files into a dataframe
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
dat <- read_excel(z, skip = 0, )
return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1 , sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an rmse [library(Metrics)] analysis on each of the "splits" created above. I don't know what to call the "splits". Here I have used names but that is an R reserved word and won't work. I tried the bigdata object but that didn't work either. I also tried to use splitByHUCs, and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
Since splitByHUCs is a list, we can loop over it with sapply/lapply as in the OP's code, but names$ is incorrect: the argument of the anonymous function is x, which stands for each element of the list in turn (i.e. a data.frame). Therefore, use x$ instead of names$.
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
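As a rough illustration of how the pieces fit together, here is a self-contained sketch on made-up data. The rmse function below is a hand-written stand-in so the example runs without the Metrics package, and the column name HUC and the values are assumptions for illustration only.

```r
# Stand-in for Metrics::rmse, defined here so the sketch has no dependencies
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up data: two groups with Predicted and Actual columns
bigdata <- data.frame(
  HUC       = c("a", "a", "b", "b"),
  Predicted = c(1, 2, 3, 5),
  Actual    = c(1, 4, 3, 5)
)

# split() returns a named list of data frames, one per HUC value
splits <- split(bigdata, bigdata$HUC)

# x is each data frame in turn; the result is a named numeric vector
rMSEs <- sapply(splits, function(x) rmse(x$Actual, x$Predicted))
print(rMSEs)
```

The names of the result come from the list names, so they can be written out with write.csv without any extra bookkeeping.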

How to set filter.fn (param to read.matrix in R) so that it cleans row and column names to only contain alphanum plus dash and dot characters

I'm a lightweight when it comes to R and I'm trying to use a package called Seurat. I learned that non-10x sparse Matrix Market data (.mtx files) can be read in using read.matrix and that the row and column names can be loaded from the associated csv files. However, the row and column names need to have specific characters (anything that's not alphanumeric, dot, or dash) removed. I'd like to replace "bad characters" with dashes. And I'd like to do this in R so that I can keep the disk space my deliverables take up small.
I was looking at the read.matrix help doc, and it looks like you can set a param called filter.fn (which I infer is a function, although that's not explicitly stated) in order to "clean" row and column names.
I learned how to create a function, and I got it to take row.ids and col.ids as arguments. I learned how to use character classes to make substitutions in the strings contained therein (and my tests show that it does what I want). But since functions only return one value or set, I'm not sure what to return. I tried returning c(row.ids, col.ids), but that creates a one-dimensional set. Even if I figure out how to return a 2D set, I'm not sure that's what's needed. I tried to see if changes to the variables submitted persist, but they do not. And I don't know what to search for to solve this.
Here's what I've got so far:
coldata <- read.csv(file="cell_metadata.csv", header=TRUE, sep=",")
colnames <- paste(coldata$cell_barcode)
rowdata <- read.csv(file="genes.csv", header=TRUE, sep=",")
rownames <- paste(rowdata$genome, rowdata$gene_id, rowdata$gene_name, sep = ".")
cleanrowscols <- function(row.ids, col.ids) {
row.ids <- gsub("[^[:alnum:].-]", "-", row.ids)
col.ids <- gsub("[^[:alnum:].-]", "-", col.ids)
return(1)
}
read.matrix("DGE.mtx", header = TRUE, skip = 1, row.ids = rownames,
col.ids = colnames, colClasses = c("numeric", "numeric", "numeric"),
assign.fn = assign_matrix_dense, filter.fn = cleanrowscols)
But what does cleanrowscols have to return to get it to clean the row and column names supplied to read.matrix?
UPDATE: Ugh, R doesn't even know what read.matrix is and I don't know how to import it. So perhaps a different tack. I discovered that library("Matrix") has readMM(file), so I tried readMM("DGE.mtx") and it seems to work. How do I set the row and column names?
OK. I think I figured it out.
coldata <- read.csv(file="cell_metadata.csv", header=TRUE, sep=",")
colnames <- gsub("[^[:alnum:].-]", "-", paste(coldata$cell_barcode))
rowdata <- read.csv(file="genes.csv", header=TRUE, sep=",")
rownames <- gsub("[^[:alnum:].-]", "-", paste(rowdata$genome, rowdata$gene_id, rowdata$gene_name, sep = "."))
library("Matrix")
mtx <- t(as.matrix(readMM("DGE.mtx")))
rownames(mtx) <- rownames
colnames(mtx) <- colnames
Although I'm sure there's a better, more efficient answer. Note that I had to transpose, because Seurat expects genes as rows and cells as columns. I concatenated some data for the row names. In fact, I need to figure out one more thing, but it's outside the scope of this question. (I need to add the chromosome to the row names, but it's not in the genes.csv file...)
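The name-cleaning step can be checked in isolation. The sketch below uses base R only; clean_ids is a hypothetical helper name, and the ids and the small dense matrix are made up to stand in for the real gene and cell names and the matrix returned by readMM.

```r
# Hypothetical helper: replace anything that is not alphanumeric, dot, or dash
# with a dash. Inside the bracket expression, "." is literal and the trailing
# "-" is literal because it comes last.
clean_ids <- function(ids) gsub("[^[:alnum:].-]", "-", ids)

# Made-up ids containing characters that need cleaning
row_ids <- c("hg38.ENSG0001.ABC/1", "hg38.ENSG0002.DEF_2")
col_ids <- c("cell 01", "cell#02", "ok-name.3")

# Small placeholder matrix with genes as rows and cells as columns
mtx <- matrix(0, nrow = length(row_ids), ncol = length(col_ids))
rownames(mtx) <- clean_ids(row_ids)
colnames(mtx) <- clean_ids(col_ids)

print(dimnames(mtx))
```

Wrapping the substitution in one function keeps the row and column cleaning identical, which is easy to get wrong when the pattern is typed twice.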

Errors in finding column mean of .csv file with NA cells in R

I have a folder with several .csv files containing raw data with multiple rows and 39 columns (x obs. of 39 variables), which have been read into R as follows:
# Name path containing .csv files as folder
folder = ("/users/.../");
# Find the number of files in the folder
file_list = list.files(path=folder, pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list))
{
assign(file_list[i],
read.csv(paste(folder, file_list[i], sep='')))
}
I want to find the mean of a specific column in each of these .csv files and save it in a vector as follows:
for (i in 1:length(file_list))
{
clean = na.omit(file_list[i])
ColumnNameMean[i] = mean(clean["ColumnName"])
}
When I run the above fragment of code, I get the error "argument is not numeric or logical: returning NA". This happens in spite of attempting to remove the NA values using na.omit. Using complete.cases,
clean = file_list[i][complete.cases(file_list[i]), ]
I get the error "incorrect number of dimensions", even though the number of columns hasn't been explicitly stated.
How do I fix this?
Edit: corrected clean[i] to clean (and vice versa). Ran code, same error.
Sample .csv file
There are several things wrong with your code.
folder = ("/users/.../"); You don't need the parentheses and you definitely do not need the semicolon. The semicolon separates instructions; it does not end them. So this line is in fact two instructions: the assignment of a string to folder and, between the ; and the newline, the NULL instruction.
You are creating many objects in the global environment in the for loop where you assign the return value of read.csv. It is much better to read in the files into a list of data.frames.
na.omit can remove all rows from the data.frames. And there is no need to use it since mean has a na.rm argument.
You compute the mean values of each column of each data.frame. Though the data.frames are processed in a loop, the columns are not and R has a fast colMeans function.
You mistake [ for [[. The correct ways would be either clean[, "ColumnName"] or clean[["ColumnName"]].
Now the code, revised. I present several alternatives to compute the columns' means.
First, read all files in one go. I set the working directory before reading them and reset after.
folder <- "/users/.../"
file_list <- list.files(path = folder, pattern = "^muse.*\\.csv$")
old_dir <- setwd(folder)
df_list <- lapply(file_list, read.csv)
setwd(old_dir)
Now compute the means of three columns.
cols <- c("Delta_TP9", "Delta_AF7", "Theta_TP9")
All_Means <- lapply(df_list, function(DF) colMeans(DF[cols], na.rm = TRUE))
names(All_Means) <- file_list
Compute the means of all columns starting with Delta or Theta. Get those columns names with grep.
df_names <- names(df_list[[1]])
cols2 <- grep("^Delta", df_names, value = TRUE)
cols2 <- c(cols2, grep("^Theta", df_names, value = TRUE))
All_Means_2 <- lapply(df_list, function(DF) colMeans(DF[cols2], na.rm = TRUE))
names(All_Means_2) <- file_list
Finally, compute the means of all numeric columns. Note that this time the index vector cols3 is a logical vector.
cols3 <- sapply(df_list[[1]], is.numeric)
All_Means_3 <- lapply(df_list, function(DF) colMeans(DF[cols3], na.rm = TRUE))
names(All_Means_3) <- file_list
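If only a single column is of interest, as in the question, the same pattern reduces to one mean per file. The following is a minimal sketch under assumptions: it writes two small made-up CSV files to a temporary directory so that it runs anywhere, and the column name ColumnName is taken from the question as a placeholder.

```r
# Create a temporary folder with two small made-up CSV files
folder <- tempfile("csvdemo")
dir.create(folder)
write.csv(data.frame(ColumnName = c(1, NA, 3), Other = c(NA, NA, NA)),
          file.path(folder, "a.csv"), row.names = FALSE)
write.csv(data.frame(ColumnName = c(10, 20, NA), Other = c(1, 2, 3)),
          file.path(folder, "b.csv"), row.names = FALSE)

# Read all files into a named list of data frames
file_list <- list.files(folder, pattern = "\\.csv$")
df_list <- lapply(file.path(folder, file_list), read.csv)
names(df_list) <- file_list

# [[ extracts the column as a vector; na.rm = TRUE skips missing values,
# so na.omit() (which would drop every row of a.csv) is not needed
ColumnNameMean <- vapply(df_list,
                         function(DF) mean(DF[["ColumnName"]], na.rm = TRUE),
                         numeric(1))
print(ColumnNameMean)
```

vapply is used instead of sapply so that the result is guaranteed to be a numeric vector with one value per file.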
Try it like this:
setwd("U:/Playground/StackO/")
# Find the number of files in the folder
file_list = list.files(path=getwd(), pattern="*.csv")
# Read files in the folder
for (i in 1:length(file_list)){
assign(file_list[i],
read.csv(file_list[i]))
}
ColumnNameMean <- numeric(length(file_list))
for (i in 1:length(file_list)){
clean = get(file_list[i])
ColumnNameMean[i] = mean(clean[,"Delta_TP10"])
}
ColumnNameMean
#> [1] 1.286201
I used get to retrieve the data.frame; otherwise file_list[i] just returns a string. I think this is an idiom used in other languages like Python. I tried to stay true to the way you were doing it, but there are easier ways than indexing like this.
Maybe this:
lapply(list.files(path=getwd(), pattern="*.csv"), function(f){ dt <- read.csv(f); mean(dt[,"Delta_TP10"]) })
PS: Be careful with na.omit(): it removes ALL the rows containing an NA, which in your case is your whole data.frame, since Elements is only NA.

How to delete specific rows from multiple columns

I am importing some columns from multiple csv files into R. I want to delete all the data after row 1472.
temp = list.files(pattern="*.csv") #Importing csv files
Normalyears<-c(temp[1],temp[2],temp[3],temp[5],temp[6],temp[7],temp[9],temp[10],temp[11],temp[13],temp[14],temp[15],temp[17],temp[18],temp[19],temp[21],temp[22],temp[23])
leapyears<-c(temp[4],temp[8],temp[12],temp[16],temp[20]) #separating csv files with based on leap years and normal years.
Importing only the second column of each csv file.
myfiles_Normalyears = lapply(Normalyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
myfiles_leapyears = lapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
new.data.leapyears <- NULL
for(i in 1:length(myfiles_leapyears)) {
in.data <- read.table(if(is.null(myfiles_leapyears[i])),skip=c(1472:4399),sep=",")
new.data.leapyears <- rbind(new.data.leapyears, in.data)}
The loop is supposed to delete all the rows from 1472 to 4399.
Error: Error in read.table(myfiles_leapyears[i], skip = c(1472:4399), sep = ",") :
'file' must be a character string or connection
There is a nrows parameter to read.table, so why not try
read.table(myfiles_leapyears[i], nrows = 1471,sep=",")
Your myfiles_leapyears is a list. When subsetting a list, you need double brackets to access a single element, otherwise you just get a sublist of length 1.
So replace
myfiles_leapyears[i]
with
myfiles_leapyears[[i]]
That will at least take care of invalid subscript type 'list' errors. I'd second Josh W. that the nrows argument seems smarter than the skip argument.
Alternatively, if you define it using sapply ("s" for simplify) instead of lapply ("l" for list), you'll probably be fine using [i]:
myfiles_leapyears = sapply(leapyears, read.delim,colClasses=c('NULL','numeric'),sep =",")
It is fine. I just turned the data from a list into a dataframe.
df <- as.data.frame(myfiles_leapyears,byrow=T)
leap_df<-head(df,-2928)
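Since the files have already been read into a list of data frames, another option is to trim each data frame in the list rather than re-reading the files. This is only a sketch on made-up data: the list, the column name v, and the cutoff of 4 rows are placeholders for myfiles_leapyears and the 1471 rows to keep.

```r
# Made-up stand-in for a list of one-column data frames read with lapply
myfiles <- list(data.frame(v = 1:10), data.frame(v = 101:112))

# Number of leading rows to keep (the question would use 1471)
keep <- 4

# head() is applied to each data frame; extra arguments to lapply are passed on
trimmed <- lapply(myfiles, head, n = keep)

# Stack the trimmed pieces into one data frame
combined <- do.call(rbind, trimmed)
print(combined)
```

Using do.call(rbind, ...) once at the end avoids growing a data frame inside a loop.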

lapply r to one column of a csv file

I have a folder with several hundred csv files. I want to use lapply to calculate the mean of one column within each csv file and save that value into a new csv file that would have two columns: Column 1 would be the name of the original file. Column 2 would be the mean value for the chosen field from the original file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have 2(at least) problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
## indices of the files to process; modify as needed
i <- 1:2
## start with an empty result; one row is appended per file
df <- NULL
## directory from which all files are read
directory <- "C:/mydir/"
## read all csv file names from the directory
x <- list.files(directory, pattern = "\\.csv$")
xpath <- paste0(directory, x)
## loop over the files, saving the mean and the file name for each
for (k in i)
{
file <- read.csv(xpath[k], header = TRUE, sep = ",")
first_col <- file[, 1]
## one-row data frame holding the result for this file
d <- data.frame(mean = mean(first_col), filename = x[k])
df <- rbind(df, d)
}
## write all output to csv
write.csv(df, file = "C:/mydir/final.csv")
CSV file looks like below
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
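For completeness, the same result can be obtained with the apply family, which is what the question asked about. The warning in the question arises because mean was applied to the file name string rather than to a column of the data that was read. The sketch below is an assumption-laden illustration: it creates two small made-up files in a temporary directory (the name 2pt.csv is borrowed from the warning message, 3pt.csv and the values are invented) and takes the mean of the first column of each.

```r
# Temporary folder with two small made-up CSV files
dir_path <- tempfile("shots")
dir.create(dir_path)
write.csv(data.frame(points = c(2, 0, 2)), file.path(dir_path, "2pt.csv"), row.names = FALSE)
write.csv(data.frame(points = c(3, 0, 0)), file.path(dir_path, "3pt.csv"), row.names = FALSE)

filenames <- list.files(dir_path, pattern = "\\.csv$")

# Read each file and take the mean of its first column;
# [[1]] extracts the column as a numeric vector
means <- sapply(filenames, function(f) mean(read.csv(file.path(dir_path, f))[[1]]))

# Two columns: the original file name and the mean for that file
result <- data.frame(file = filenames, mean = unname(means))
write.csv(result, file = file.path(dir_path, "Expected_Value.csv"), row.names = FALSE)
print(result)
```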
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files that I had were originally in one file. I split them into multiple files by location. At the time, I thought this was necessary to calculate mean on each type. Clearly, that was a mistake. I went to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.
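To see what aggregate produces, here is a small sketch on made-up data using the formula interface, which keeps the original column names in the output. The column names Loc and points follow the code above; the values are invented.

```r
# Made-up shot data: one row per shot, with its location and points scored
allshots <- data.frame(
  Loc    = c("corner", "corner", "paint", "paint"),
  points = c(3, 0, 2, 2)
)

# Mean of points within each location; the result has columns Loc and points
EV <- aggregate(points ~ Loc, data = allshots, FUN = mean)
print(EV)
```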
