I have more than 300 csv files in a directory.
The csv files have the following structure:
id               Date        Nitrate     Sulfate
id of csv file   Some date   Some Value  Some Value
id of csv file   Some date   Some Value  Some Value
id of csv file   Some date   Some Value  Some Value
I want to count the number of rows in each csv file, excluding the rows that contain NA, and store the result in a data frame with two columns: (1) id and (2) nobs.
Here is my code for that:
complete <- function(directory, id){
  filenames <- sprintf("%03d.csv", id)
  filenames <- paste(directory, filenames, sep = '/')
  dataframe <- data.frame(id = numeric(0), nobs = numeric(0))
  for(i in filenames){
    data <- read.csv(i)
    dataframe[i, dataframe$id] <- data[data$id]
    dataframe[i, dataframe$nobs] <- nrow(data[!is.na(data$sulfate & data$nitrate),])
  }
  dataframe
}
The problem arises when I try to populate the data frame inside the loop: it does not seem to populate the data frame and returns NULL. I know I am doing something stupid.
I usually prefer to add the rows into a pre-allocated list and then bind them together. Here's a working example:
##### fake read.csv function returning a random data.frame
# (just to reproduce your case, remove this from your code...)
read.csv <- function(fileName){
  stupidHash <- sum(as.integer(charToRaw(fileName)))
  if(stupidHash %% 2 == 0){
    return(data.frame(id = stupidHash, date = '2016-02-28',
                      nitrate = c(NA,2,3,NA,5), sulfate = c(10,20,NA,NA,40)))
  }else{
    return(data.frame(id = stupidHash, date = '2016-02-28',
                      nitrate = c(4,2,3,NA,5,9), sulfate = c(10,20,NA,NA,40,50)))
  }
}
#####
complete <- function(directory, id){
  filenames <- sprintf("%03d.csv", id)
  filenames <- paste(directory, filenames, sep = '/')
  # here we pre-allocate a list of length = length(filenames)
  # where we will put the rows of our future data.frame
  rowsList <- vector(mode = 'list', length = length(filenames))
  for(i in 1:length(filenames)){
    filename <- filenames[i]
    data <- read.csv(filename)
    rowsList[[i]] <- data.frame(id = data$id[1],
                                nobs = sum(!is.na(data$sulfate) & !is.na(data$nitrate)))
  }
  # here we bind all the previously created rows together into one data.frame
  DF <- do.call(rbind.data.frame, rowsList)
  return(DF)
}
Usage example:
res <- complete(directory='dir',id=1:3)
> res
id nobs
1 889 4
2 890 2
3 891 4
The problem is in these 2 lines:
dataframe[i, dataframe$id] <- data[data$id]
dataframe[i, dataframe$nobs] <- nrow(data[!is.na(data$sulfate & data$nitrate),])
If you want to extend a data frame, use the rbind function. But be aware that this is not an efficient approach, because every call allocates new memory, copies all the existing data, and then adds one new row. The efficient way is to allocate a data frame that is big enough in this line:
dataframe <- data.frame(id = numeric(0), nobs = numeric(0))
Instead of 0, use the expected number of rows.
So the easiest way is:
dataframe <- rbind(dataframe, data.frame(id = data$id[1], nobs = nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])))
A more efficient way is something like this:
dataframe <- data.frame(id = numeric(numberOfRows), nobs = numeric(numberOfRows))
and after that, in the loop:
dataframe[i,]$id <- data$id[1]
dataframe[i,]$nobs <- nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
UPDATE: I changed the values you used to populate the data frame to data$id[1] and nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),]).
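Putting the pieces together, a minimal sketch of the pre-allocated version of the whole function could look like this (assembled from the snippets above, assuming the filenames are built exactly as in the question):

complete <- function(directory, id){
  filenames <- paste(directory, sprintf("%03d.csv", id), sep = '/')
  numberOfRows <- length(filenames)
  # allocate the full-size data frame once, then fill it row by row
  dataframe <- data.frame(id = numeric(numberOfRows), nobs = numeric(numberOfRows))
  for(i in seq_along(filenames)){
    data <- read.csv(filenames[i])
    dataframe[i,]$id <- data$id[1]
    dataframe[i,]$nobs <- nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
  }
  dataframe
}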
I have a set of 270 RNA-seq samples, and I have already subsetted out their expected counts using the following code:
for (i in 1:length(sample_ID_list)) {
  assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
}
Where sample_ID_list is a character list of each sample ID (e.g., 313312664) and sample_file_list is a character list of the file names for each sample already in my environment to be subsetted (e.g., s313312664).
Now, the head of one of those subsetted samples looks like this:
> head(`308087571`)
# A tibble: 6 x 1
expected_count
<dbl>
1 129
2 8
3 137
4 6230.
5 1165.
6 0
The problem is that I want to paste all of these together to make a counts data frame, but I will not be able to differentiate between the columns unless each one has its sample ID as the column name instead of expected_count.
Does anyone know of a good way to go about this? Please let me know if you need any more details!
You can use:
dplyr::bind_rows(mget(sample_ID_list), .id = "name")
If we want to name the list, loop over it, extract the first element of 'expected_count' ('nm1'), and use that to assign the names of the list:
nm1 <- sapply(sample_file_list, function(x) x$expected_count[1])
names(sample_file_list) <- nm1
Or from sample_ID_list
do.call(rbind, Map(cbind, mget(sample_ID_list), name = sample_ID_list))
Update
Based on the comments, we can loop over 'sample_file_list' and 'sample_ID_list' with Map and rename the 'expected_count' column with the corresponding value from 'sample_ID_list':
sample_file_list2 <- Map(function(dat, nm) {
  names(dat)[match('expected_count', names(dat))] <- nm
  dat
}, sample_file_list, sample_ID_list)
Or if we need a package solution,
library(data.table)
rbindlist(mget(sample_ID_list), idcol = "name")
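Once the columns have been renamed (sample_file_list2 above), the per-sample columns can also be combined side by side into a single counts data frame; a minimal sketch, assuming every sample has its genes in the same row order:

# one column per sample, named by sample ID
counts <- do.call(cbind, sample_file_list2)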
Update:
Thank you all so much for your help. I had to update my for loop as follows:
for (i in 1:length(sample_ID_list)) {
  assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
  data <- get(sample_ID_list[i])
  colnames(data) <- sample_ID_list[i]
  assign(sample_ID_list[i], data)
}
and was able to successfully reassign the names!
I have the following .csv file:
https://drive.google.com/open?id=0Bydt25g6hdY-RDJ4WG41VFpyX1k
And I would like to be able to take the date and agent name (pasting its constituent parts) and append them as columns to the right of the table, up until a different name and date is found, doing the same for the remaining name and date items, to get the following result:
The only thing I have been able to do with the dplyr package is the following:
library(dplyr)
library(stringr)
report <- read.csv(file ="test15.csv", head=TRUE, sep=",")
date_pattern <- "(\\d+/\\d+/\\d+)"
date <- str_extract(report[,2], date_pattern)
report <- mutate(report, date = date)
Which gives me the following result:
The difficulty I am finding is probably in using conditionals in order to make the script get the appropriate string and append it as a column at the end of the table.
This might be crude, but I think it illustrates several things: a) setting stringsAsFactors=F; b) "pre-allocating" the columns in the data frame; and c) using the column name instead of column number to set the value.
report <- read.csv('test15.csv', header=T, stringsAsFactors=F)
# first, allocate the two additional columns (with NAs)
report$date <- rep(NA, nrow(report))
report$agent <- rep(NA, nrow(report))
# step through the rows
for (i in 1:nrow(report)) {
  # grab current name and date if "Agent:"
  if (report[i,1] == 'Agent:') {
    currDate <- report[i+1,2]
    currName <- paste(report[i,2:5], collapse=' ')
  # otherwise append the name/date
  } else {
    report[i,'date'] <- currDate
    report[i,'agent'] <- currName
  }
}
write.csv(report, 'test15a.csv')
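One small caveat (an assumption about the file layout): the loop above relies on the very first data row being an 'Agent:' marker. If that is not guaranteed, initialising the two helper variables before the loop avoids an "object 'currDate' not found" error:

currDate <- NA
currName <- NA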
I am working with 5 data frames that I want to filter (eliminating some rows if they match a regex). Because all the data frames are similar, with the same variable names, I stored them in a list and I'm iterating over it. However, when I want to save the filtered data for each of the original data frames, I find that the loop creates an object called i_filtered (instead of dfName_filtered), so every iteration overwrites the previous one.
Here's what I have in the loop:
for (i in list_all){
  i_filtered1 <- i[i$chr != filter1,]
  i_filtered2 <- i[i$chr != filter2,]
  # Write the result filtered table in a csv file
  # Change output directory if needed
  write.csv(i_filtered2, file="/home/tama/Desktop/i_filtered.csv")
}
As I said, filter1 and filter2 are just regex that I'm using to filter the data in the chr column.
What's the correct way to assign the original name + "_filtered" to the new dataframe?
Thanks in advance
Edited to add info:
Each dataframe has these variables (but values can change)
chr start end length
chr1 10400 10669 270
chr10 237646 237836 191
chrX 713884 714414 531
chrUn 713884 714414 531
chr1 762664 763174 511
chr4 805008 805571 564
And I have stored all them in a list:
list_all <- list(heep, oe, st20_n, st20_t,all)
list_all <- lapply(list_all, na.omit)
The filters:
#Get rid of random chromosomes
filter1=".*random"
#Get rid of undefined chromosomes
filter2 = "ĉhrUn.*
The output I'm looking for is:
heep_filtered1
heep_filtered2
oe_filtered1
oe_filtered2
etc
One possibility is to iterate over a sequence of indices (or names), rather than over the list of data-frames itself, and access the data-frames using the indices.
Another problem is that the != operator doesn't support regular expressions. It only does exact literal matches. You need to use grepl() instead.
names(list_all) <- c("heep", "oe", "st20_n", "st20_t", "all")
filtered <- NULL
for (i in names(list_all)){
  df <- list_all[[i]]
  df.1 <- df[!grepl(filter1, df$chr), ]
  df.2 <- df[!grepl(filter2, df$chr), ]
  # Write the result filtered table in a csv file
  # Change output directory if needed
  write.csv(df.2, file=paste0("/home/tama/Desktop/", i, "_filtered.csv"))
  filtered[[paste0(i, "_filtered", 1)]] <- df.1
  filtered[[paste0(i, "_filtered", 2)]] <- df.2
}
The result is a list called filtered that contains the filtered data-frames.
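The individual filtered data frames can then be pulled out of filtered by name, for example:

head(filtered[["heep_filtered1"]])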
The issue is that i is only interpreted as the loop variable when it stands alone. In your current version you are using it as part of other object names and as a plain character inside a string, so it is never substituted.
I would suggest naming the list, then using lapply instead of a for loop (note that I also changed the filter to occur in one step, since right now it is unclear if you are trying to take both things out or not -- this also makes it easier to add more filters).
filters <- c(".*random", "chrUn.*")
list_all <- list(heep = heep
                 , oe = oe
                 , st20_n = st20_n
                 , st20_t = st20_t
                 , all = all)
toLoop <- names(list_all)
names(toLoop) <- toLoop # renames them in the output list
filtered <- lapply(toLoop, function(thisSet){
  # drop rows whose chr matches any of the filter patterns
  # (grepl, because the filters are regexes and exact matching would miss them)
  tempFiltered <- list_all[[thisSet]][!grepl(paste(filters, collapse = "|"),
                                             list_all[[thisSet]]$chr), ]
  # Write the result filtered table in a csv file
  # Change output directory if needed
  write.csv(tempFiltered, file=paste0("/home/tama/Desktop/", thisSet, "_filtered.csv"))
  # Return the part you care about
  return(tempFiltered)
})
I have many data frames id.1, id.2, ..., id.21, and from each of them I want to extract 2 data points: id.1[5,4] and id.1[10,6], id.2[5,4] and id.2[10,6], etc. The first data point is a date and the second is an integer.
I want to export this list to obtain something like this in a .csv file:
V1 V2
1 5/10/2016 1654395291
2 5/11/2016 1645024703
3 5/12/2016 1763825219
I have tried
x=c(for (i in 1:21) {
file1 = paste("id.", i, "[5,4]", sep="")}, for (i in 1:21) {
file1 = paste("id.", i, "[10,6]", sep="")})
write.csv(x, "x.csv")
But this yields x being NULL. How can I go about getting this vector?
Your problem is that a for loop doesn't return anything in R, so you can't use it inside a c() call as you did. Use an [sl]apply construct instead.
I would first make a list containing all the data frames:
dfs <- list(id.1, id.2, id.3, ...)
And iterate over it, something like:
x <- sapply(dfs, function(df) {
  return(c(df[5,4], df[10,6]))
})
Finally you need to transpose the result and convert it into a data.frame if you want to:
x <- as.data.frame(t(x))
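If the data frames really are named id.1 through id.21 in the global environment, the list can also be built programmatically with mget rather than typed out; a minimal sketch of the whole pipeline under that assumption:

# collect id.1 ... id.21 into a named list
dfs <- mget(paste0("id.", 1:21))

# pull the two cells of interest from each data frame
x <- sapply(dfs, function(df) c(df[5, 4], df[10, 6]))

# transpose, convert to a data frame and write it out
write.csv(as.data.frame(t(x)), "x.csv", row.names = FALSE)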
I am very new to programming, both in R and in general.
Here is my goal for writing this script:
I have 332 csv files. I want to, “Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.”
The outline of the function is as follows:
complete <- function(directory, id = 1:332) {
## 'directory' is a character vector of length 1 indicating
## the location of the CSV files
## 'id' is an integer vector indicating the monitor ID numbers
## to be used
## Return a data frame of the form:
## id nobs
## 1 117
## 2 1041
## ...
## where 'id' is the monitor ID number and 'nobs' is the
## number of complete cases
}
Example output would look like this:
source("complete.R")
complete("specdata", 1)
## id nobs
## 1 1 117
complete("specdata", c(2, 4, 8, 10, 12))
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
## 4 10 148
## 5 12 96
My script so far looks like this:
setwd("C:/users/beachlb/Desktop/R_Programming/specdata") #this is the local directory on my computer where all 332 csv files are stored
complete <- function(directory, id = 1:332) {
  files_list <- list.files(directory, full.names=TRUE) # creates a list of files from within the specified directory
  dat <- data.frame() # creates an empty data frame that we can use to add data to
  for (i in id) {
    dat <- rbind(dat, read.csv(files_list[i])) # loops through the 332 csv files, rbinding them together into one data frame called dat
  }
  dat$nobs <- sum(complete.cases(dat)) # add the column nobs to dat, populated with the number of rows of complete cases in the data frame
  dat_subset <- dat[which(dat[, "ID"] %in% id),] # subsets dat so that only the desired cases are included in the output
  dat_subset[, "ID", "nobs"] # prints all rows of the desired data frame for the named columns
}
When I run my function as is, I get this error: "Error in drop && !has.j : invalid 'x' type in 'x && y'". I am not sure what is throwing that error. I would appreciate any advice on what could be causing it and how I can resolve it. Pointers to literature and/or tutorials that would help me strengthen the coding skills needed to avoid this error would also be appreciated.
Preface: I am not sure if I should ask this question on a separate thread. Right now, my function is written to populate the total number of complete cases for all rows (for all 332 files), instead of specifically calculating the number of complete cases for a given monitor id and putting that into the column nobs for that ID only. (Note that each file is named after the monitor id and contains only cases from that monitor, such that 001.csv = output from monitor 1, 002.csv = output from monitor 2). Therefore, I am hoping for someone to help point me to a resource for how to subset dat so that when the nobs column populates, each row in the nobs column gives the number of complete cases for each id number.
complete <- function(directory, id = 1:332) {
  files_list <- list.files(directory, full.names=TRUE)
  nobs <- c()
  for (i in id) {
    dat <- read.csv(files_list[i])
    nobs <- c(nobs, sum(complete.cases(dat)))
  }
  data.frame(id, nobs)
}
You were close. But you shouldn't read in all of the files at once and then find the complete cases. It will not separate the results by id for you. Instead I just edited your code a little bit.
Test
complete("specdata", c(2,4,8,10,12))
id nobs
1 2 1041
2 4 474
3 8 192
4 10 148
5 12 96
I have no idea what is throwing the error, but I would recommend avoiding the process that is leading up to it. Your situation would benefit greatly from vectorization. I don't think this code will work out of the box, but should be on the right path:
#* Get the file names of the CSV files to read
files <- list.files(getwd(), pattern = "\\d{3}[.]csv$")
#* Read in all of the CSV files into a list of data frames
DataFrames <- lapply(files, read.csv)
#* Calculate the number of complete cases in each file
CompleteCases <- vapply(DataFrames,
                        function(df) sum(complete.cases(df)),
                        numeric(1))
#* Produce a data frame with the file name, and the number of complete cases in the file.
data.frame(file = basename(files),
nobs = CompleteCases)
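Wrapped into the complete() signature from the question, the same vectorised approach might look like this (a sketch, assuming the files are named 001.csv ... 332.csv inside directory):

complete <- function(directory, id = 1:332) {
  files <- file.path(directory, sprintf("%03d.csv", id))
  DataFrames <- lapply(files, read.csv)
  nobs <- vapply(DataFrames, function(df) sum(complete.cases(df)), numeric(1))
  data.frame(id = id, nobs = nobs)
}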
You are making a silly mistake in the last line:
dat_subset[, "ID", "nobs"] # incorrect code and will generate the error
#Error in drop && length(x) == 1L : invalid 'x' type in 'x && y'
Base R does not allow subsetting inside [ ] with a comma-separated list of column names. You should combine the names into a character vector and pass them as a single argument, as follows:
dat_subset[, c("ID", "nobs")]
The above is the correct way of subsetting on multiple columns.
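A quick toy illustration of the difference (hypothetical data, not the question's files):

df <- data.frame(ID = 1:3, nobs = c(117, 1041, 474))
df[, "ID", "nobs"]     # error: invalid 'x' type in 'x && y'
df[, c("ID", "nobs")]  # returns the two requested columns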