I am trying to read the CSV files in a folder on my computer using R.
First, I create a vector with the names of the files that match the pattern CSV. Its elements are of type chr.
Second, I loop over the files and create a data frame in R for each one. This is the code I am using:
files = list.files(pattern="*.csv")
> dput(files)
c("USC00020098.csv", "USC00020104.csv", "USC00020170.csv", "USC00020307.csv",
"USC00020406.csv", "USC00020482.csv", "USC00020487.csv", "USC00020490.csv",
"USC00020492.csv", "USC00020494.csv", "USC00020625.csv", "USC00020632.csv",
"USC00020670.csv", "USC00020675.csv", "USC00020678.csv", "USC00020758.csv",
"USC00020808.csv", "USC00020810.csv", "USC00021161.csv", "USC00021193.csv",
"USC00021419.csv", "USC00021614.csv", "USC00021654.csv", "USC00021749.csv",
"USC00022193.csv", "USC00022705.csv", "USC00022927.csv", "USC00023082.csv",
"USC00023185.csv", "USC00023190.csv", "USC00023448.csv", "USC00023498.csv",
"USC00023500.csv", "USC00023501.csv", "USC00023505.csv", "USC00023573.csv",
"USC00023621.csv", "USC00023643.csv", "USC00023828.csv", "USC00023926.csv",
"USC00024069.csv", "USC00024182.csv", "USC00024345.csv", "USC00024391.csv",
"USC00024453.csv", "USC00024508.csv", "USC00025312.csv", "USC00025467.csv",
"USC00025512.csv", "USC00025560.csv", "USC00025635.csv", "USC00025700.csv",
"USC00025765.csv", "USC00025780.csv", "USC00025825.csv", "USC00026037.csv",
"USC00026244.csv", "USC00026246.csv", "USC00026315.csv", "USC00026320.csv",
"USC00026321.csv", "USC00026424.csv", "USC00026476.csv", "USC00026571.csv",
"USC00026603.csv", "USC00026653.csv", "USC00026796.csv", "USC00026840.csv",
"USC00027081.csv", "USC00027131.csv", "USC00027143.csv", "USC00027281.csv",
"USC00027466.csv", "USC00027661.csv", "USC00027708.csv", "USC00027716.csv",
"USC00027720.csv", "USC00027741.csv", "USC00027876.csv", "USC00028112.csv",
"USC00028214.csv", "USC00028273.csv", "USC00028326.csv", "USC00028489.csv",
"USC00028494.csv", "USC00028499.csv", "USC00028500.csv", "USC00028647.csv",
"USC00028649.csv", "USC00028650.csv", "USC00028653.csv", "USC00028904.csv",
"USC00028940.csv", "USC00029015.csv", "USC00029158.csv", "USC00029271.csv",
"USC00029367.csv", "USC00029534.csv", "USC00029622.csv", "USC00029626.csv",
"USW00003192.csv", "USW00023183.csv", "USW00023184.csv", "USW00053019.csv",
"USW00053156.csv", "USW00053160.csv", "USW00093139.csv", "USW00093140.csv"
)
for(i in seq_along(files)){
name <- files[[i]]
y <- read.csv(file=name,header=TRUE)
assign(name,y)
}
However, I am getting the following error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
first five rows are empty: giving up
When I read a single file, for example t <- read.csv(files[[94]],header=TRUE,sep=","), it works properly.
Any idea why my loop is not working?
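One way to narrow this down is to find out which file triggers the error. The following is a minimal sketch, not a diagnosis of the actual data: it reads every file inside tryCatch() so that a failing file is reported by name instead of stopping the loop, and it stores the results in a named list rather than using assign(). The temporary folder, the two demo files, and the helper name read_one are assumptions made only for illustration. Note also that the pattern argument of list.files() is a regular expression, so "\\.csv$" is used here in place of the glob-style "*.csv".

```r
# Illustrative setup (hypothetical files): one readable CSV and one empty file
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)
writeLines(c("a,b", "1,2", "3,4"), file.path(demo_dir, "good.csv"))
file.create(file.path(demo_dir, "empty.csv"))

# 'pattern' is a regular expression: match names that end in ".csv"
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: return the data frame, or NULL if the file cannot be read
read_one <- function(path) {
  tryCatch(
    read.csv(path, header = TRUE),
    error = function(e) {
      message("Could not read ", basename(path), ": ", conditionMessage(e))
      NULL
    }
  )
}

# Named list of results, one element per file
tables <- setNames(lapply(files, read_one), basename(files))

# Names of the files that failed, then keep only the successful reads
failed <- names(tables)[vapply(tables, is.null, logical(1))]
tables <- tables[!names(tables) %in% failed]
```

With the real folder, failed would list the files worth opening in a text editor, and a single table is then available as tables[["USC00020098.csv"]].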
I've figured out how to read and name multiple H5 files from my directory, but I'm running into trouble actually graphing them. My problem is twofold: with this type of file, I do not know how to make the columns have the same number of rows, and I do not know how to refer to specific files.
My initial setup is as follows:
library("rhdf5")
library("ggplot2")
library("fs")
library("tidyverse")
wd <- "D:/Data/1282-1329/"
setwd(wd)
testh5 <- H5Fopen("1282.h5")
H5Fclose(testh5)
y <- h5read(file = "1282.h5",
name = "/Signal")
x <- h5read(file = "1282.h5",
name = "/Scan")
The / refers to the H5 file's 'Group' and Signal or Scan refers to the 'Name', thus "/Signal" creates a numerical list with a length of 48 (number of files within 1282-1329). I make multiple lists from each of these by doing
file_paths <- fs::dir_ls("D:/Data/1282-1329/H5")
file_paths
file_Scan <- list()
for (i in seq_along(file_paths)) {
file_Scan[[i]] <- h5read(
file = file_paths[[i]],
name = "/Scan"
)
}
file_Signal <- list()
for (i in seq_along(file_paths)) {
file_Signal[[i]] <- h5read(
file = file_paths[[i]],
name = "/Signal"
)
}
file_Scan <- setNames(file_Scan, file_paths)
file_Signal <- setNames(file_Signal, file_paths)
Thus str(file_Signal) gives me something like..
List of 48
$ D:/Data/1282-1329/H5/1282.h5: num [1:8044(1d)] 11569527 11576106 10848312 11007212 11074822 ...
$ D:/Data/1282-1329/H5/1283.h5: num [1:8045(1d)] 9746633 9886735 10000637 9617273 ...
So my first problem here is [1:8044(1d)] and [1:8045(1d)] - they're one row off. But I'm unable to add in NAs or make the lengths the same as I would a normal list. Is it because I'm thinking about this wrong? I feel like the solution is simple.
My ultimate goal will be to create multiple single plots for each of these files in the directory using something like
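plots <- list()
for (i in seq_along(file_paths)) {
# One data frame per file; Signal and Scan must have the same length within a file
df <- data.frame(Signal = as.vector(file_Signal[[i]]), Scan = as.vector(file_Scan[[i]]))
plots[[i]] <- ggplot(df, aes(x = Signal, y = Scan)) +
geom_point(size = 1)
}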
Then use these to create a rolling gif of the files with Even numbers (1282, 1284, 1286, etc) and Odd numbers (1283, 1285, 1287, etc.)
Thank you for any help or resources you might have to offer.
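On the unequal lengths, a minimal base R sketch follows. It assumes the list elements are 1-d numeric arrays, as the str() output above suggests, and uses small made-up vectors in place of the real data; the helper name pad_to is hypothetical. Assigning to length() extends a vector with NA, so every element can be padded to the longest length and then bound into a matrix. Padding is only needed if the files must sit side by side in one table; for one plot per file, each element can keep its own length.

```r
# Hypothetical stand-ins for two entries of file_Signal: 1-d arrays of unequal length
file_Signal <- list(
  "1282.h5" = array(c(11, 12, 13, 14), dim = 4),
  "1283.h5" = array(c(21, 22, 23, 24, 25), dim = 5)
)

# Drop the 1-d array attribute, then extend to n elements; the new slots become NA
pad_to <- function(x, n) {
  x <- as.vector(x)
  length(x) <- n
  x
}

n_max  <- max(lengths(file_Signal))
padded <- lapply(file_Signal, pad_to, n = n_max)

# All elements now have equal length, so they can be bound column-wise
signal_mat <- do.call(cbind, padded)

# A specific file is addressed by its list name
one_file <- padded[["1282.h5"]]
```

In the real lists the names are the full paths set by setNames(), so a single file would be addressed as file_Signal[["D:/Data/1282-1329/H5/1282.h5"]], or by position with file_Signal[[1]].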
I get the error:
Error in file(fn, "rb") : cannot open the connection
In addition: Warning message:
In file(fn, "rb") :
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-13_170601_12152_33882.gri': No such file or directory
When I run the following code in RStudio (1.2.5042):
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
raster.binair[[i]] <- clamped
}
"aggregated.rasters" is a list of 96 rasters and when I separately run it, I get the correct list. I recently cleaned my temporary directory (accessed by tempdir()) and deleted the files in there. I suppose the part:
cannot open file 'C:\Users\***\AppData\Local\Temp\Rtmpwh6Zih\raster\r_tmp_2020-05-13_170601_12152_33882.gri': No such file or directory
is referring to this. I don't know what I did wrong here. Can I get these files back or work around this error?
Files in the temp folder are deleted when an R session ends, so you should never count on them. You can run the code again, but if you want to keep the results permanently you need to write them elsewhere. Here are two options:
Write many files
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
f <- paste0("raster_", i)
clamped <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE, filename=f)
raster.binair[[i]] <- clamped
}
Write a single file
raster.binair <- vector(mode = "list", length = length(aggregated.rasters))
for (i in 1:NROW(aggregated.rasters)) {
raster.binair[[i]] <- clamp(aggregated.rasters[[i]], upper=12, useValues=FALSE)
}
s <- stack(raster.binair)
s <- writeRaster(s, filename="mydata.tif")
I have implemented an R script that performs batch correction on a gene expression dataset. To do the batch correction, I first need to normalize the data in each CEL file through the Affy rma() function of Bioconductor.
If I run it on the GSE59867 dataset obtained from GEO, everything works.
I define a batch as the data collection date: I put all the CEL files having the same date into a specific folder, and then consider that date/folder as a specific batch.
On the GSE59867 dataset, a batch/folder contains only 1 CEL file. Nonetheless, the rma() function works on it perfectly.
However, if I run my script on another dataset (GSE36809), I run into trouble: when I apply the rma() function to a batch/folder containing only 1 file, I get the following error:
Error in `colnames<-`(`*tmp*`, value = "GSM901376_c23583161.CEL.gz") :
attempt to set 'colnames' on an object with less than two dimensions
Here's my specific R code, so that you can reproduce the problem.
You first have to download the file GSM901376_c23583161.CEL.gz:
setwd(".")
options(stringsAsFactors = FALSE)
fileURL <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM901nnn/GSM901376/suppl/GSM901376%5Fc23583161%2ECEL%2Egz"
fileDownloadCommand <- paste("wget ", fileURL, " ", sep="")
system(fileDownloadCommand)
Library installation:
source("https://bioconductor.org/biocLite.R")
list.of.packages <- c("easypackages")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
listOfBiocPackages <- c("oligo", "affyio","BiocParallel")
bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")
if( length(bioCpackagesNotInstalled) ) {
biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}
library("easypackages")
libraries(list.of.packages)
libraries(listOfBiocPackages)
Application of rma()
# The file name must be stored in the variable that read.celfiles() is given
thisDateCelFiles <- "GSM901376_c23583161.CEL.gz"
thisDateRawData <- read.celfiles(thisDateCelFiles)
thisDateNormData <- rma(thisDateRawData)
After the call to rma(), I get the error.
How can I solve this problem?
I also tried to skip this normalization by saving the thisDateRawData object directly. But then I cannot combine thisDateRawData (which is an ExpressionFeatureSet) with the outputs of rma() (which are ExpressionSet objects).
(EDIT: I extensively edited the question, and added a piece of R code you should be able to run on your pc.)
Hmm. This is a puzzling problem. The oligo::rma() function might be buggy for class GeneFeatureSet with single samples. I got it to work with a single sample by using lower-level functions, but that meant I also had to create the expression set from scratch by specifying the slots:
# source("https://bioconductor.org/biocLite.R")
# biocLite("GEOquery")
# biocLite("pd.hg.u133.plus.2")
# biocLite("pd.hugene.1.0.st.v1")
library(GEOquery)
library(oligo)
# # Instead of using .gz files, I extracted the actual CELs.
# # This is just to illustrate how I read in the files; your usage will differ.
# projectDir <- "" # Path to .tar files here
# setwd(projectDir)
# untar("GSE36809_RAW.tar", exdir = "GSE36809")
# untar("GSE59867_RAW.tar", exdir = "GSE59867")
# setwd("GSE36809"); gse3_cels <- dir()
# sapply(paste(gse3_cels, sep = "/"), gunzip); setwd(projectDir)
# setwd("GSE59867"); gse5_cels <- dir()
# sapply(paste(gse5_cels, sep = "/"), gunzip); setwd(projectDir)
#
# Read in CEL
#
# setwd("GSE36809"); gse3_cels <- dir()
# gse3_efs <- read.celfiles(gse3_cels[1])
# # Assuming you've read in the CEL files as a GeneFeatureSet or
# # ExpressionFeatureSet object (i.e. gse3_efs in this example),
# # you can now fit the RMA and create an ExpressionSet object with it:
exprsData <- basicRMA(exprs(gse3_efs), pnVec = featureNames(gse3_efs))
gse3_expset <- new("ExpressionSet")
slot(gse3_expset, "assayData") <- assayDataNew(exprs = exprsData)
slot(gse3_expset, "phenoData") <- phenoData(gse3_efs)
slot(gse3_expset, "featureData") <- annotatedDataFrameFrom(attr(gse3_expset,
'assayData'), byrow = TRUE)
slot(gse3_expset, "protocolData") <- protocolData(gse3_efs)
slot(gse3_expset, "annotation") <- slot(gse3_efs, "annotation")
Hopefully the above approach will work in your code.
I'm working with limited RAM (AWS free tier EC2 server - 1GB).
I have a relatively large text file, "vector.txt" (800 MB), that I'm trying to read into R. I have tried various methods, but none of them managed to read the file into memory.
So, I was researching ways of reading it in in chunks. I know that the dim of the resulting data frame should be 300K * 300. If I was able to read in the file e.g. 10K lines at a time and then save each chunk as an RDS file I would be able to loop over the results and get what I need, albeit just a little slower with less convenience than having the whole thing in memory.
To reproduce:
# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
So far so good. Here's where I struggle:
word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))
Returns "cannot allocate a vector of size [size]" error message.
Tried alternatives:
word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)
Same, not enough memory
word_vectors <- readr::read_tsv_chunked("vector.txt",
callback = function(x, i) saveRDS(x, i),
chunk_size = 10000)
Resulted in:
Parsed with column specification:
cols(
`299567 300` = col_character()
)
|=========================================================================================| 100% 817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs, :
Evaluation error: bad 'file' argument.
Is there any other way to turn vector.txt into a data frame? Maybe by breaking it into pieces and reading in each piece, saving it as a data frame and then to RDS? Or are there any other alternatives?
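The chunk-and-save idea described above can be sketched in base R by keeping a file connection open and pulling a fixed number of lines at a time, so that only one chunk is held in memory. This is a sketch under assumptions, not a tested solution for the 800 MB file: the function name read_in_chunks, the output file naming, and the tiny demo file are all made up for illustration, and the fields are assumed to be whitespace-separated with a one-line header to skip. Quote and comment handling are switched off because word tokens may contain characters such as ' or #.

```r
# Hypothetical helper: read 'path' in chunks of 'chunk_size' lines, save each as .rds
read_in_chunks <- function(path, chunk_size, out_dir, skip = 0) {
  con <- file(path, open = "r")
  on.exit(close(con))
  if (skip > 0) invisible(readLines(con, n = skip))  # discard header line(s)
  out_files <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break                    # end of file reached
    chunk <- read.table(text = lines, quote = "", comment.char = "",
                        stringsAsFactors = FALSE)
    out <- file.path(out_dir, sprintf("chunk_%04d.rds", length(out_files) + 1))
    saveRDS(chunk, out)
    out_files <- c(out_files, out)
  }
  out_files
}

# Demo on a tiny made-up file with the same layout: a header line, then word + values
demo_dir <- tempfile("chunks_")
dir.create(demo_dir)
demo_txt <- file.path(demo_dir, "vector_demo.txt")
writeLines(c("5 3",
             "a 0.1 0.2 0.3", "b 0.4 0.5 0.6", "c 0.7 0.8 0.9",
             "d 1.0 1.1 1.2", "e 1.3 1.4 1.5"), demo_txt)

chunk_files <- read_in_chunks(demo_txt, chunk_size = 2, out_dir = demo_dir, skip = 1)
```

For the real file the call would look like read_in_chunks("vector.txt", chunk_size = 10000, out_dir = "chunks", skip = 1), after which each chunk can be loaded on its own with readRDS(). Column types are guessed per chunk, so passing colClasses to read.table() may be needed to keep them consistent.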
EDIT:
From Jonathan's answer below, tried:
library(rword2vec)
library(RSQLite)
# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")
# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
every_nlines,
table_name,
dbname = sub("\\.txt$", ".sqlite", tsv),
...) {
# Prepare reading
con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
init <- TRUE
fill_sqlite <- function(df) {
if (init) {
RSQLite::dbCreateTable(con, table_name, df)
init <<- FALSE
}
RSQLite::dbAppendTable(con, table_name, df)
NULL
}
# Read and fill by parts
bigreadr::big_fread1(tsv, every_nlines,
.transform = fill_sqlite,
.combine = unlist,
... = ...)
# Returns
con
}
vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")
Resulted in:
Splitting: 12.4 seconds.
Error: nThread >= 1L is not TRUE
Another option would be to do the processing on-disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
To get the CSV into SQLite you can also use the bigreadr package which has an article on doing just this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html
I need to import a list of 36 csv files, but after running the code I get only 26 of them. Probably, 10 files have format problems. Is there a way in R to detect the 10 files that cannot be imported?
If you have the file names in a vector, you can use the following code:
all <- c("16048.txt", "16062.txt", "16066.txt", "16093.txt", "16095.txt", "16122.txt", "16241.txt", "16360.txt", "16380.txt", "16389.txt", "16510.txt", "16511.txt", "16701.txt", "16729.txt", "16735.txt", "16737.txt", "16761.txt", "16816.txt", "16867.txt", "16876.txt", "16880.txt", "16883.txt", "16884.txt", "16885.txt", "16893.txt", "16904.txt", "16906.txt", "16908.txt", "16929.txt", "16931.txt", "16938.txt", "16943.txt", "16959.txt", "16967.txt", "16968.txt", "16969.txt")
imp <- c("16761.txt", "16959.txt", "16884.txt", "16093.txt", "16883.txt", "16122.txt", "16906.txt", "16737.txt", "16968.txt", "16095.txt", "16062.txt", "16816.txt", "16360.txt", "16893.txt", "16885.txt", "16938.txt", "16048.txt", "16931.txt", "16876.txt", "16511.txt", "16969.txt", "16241.txt", "16967.txt", "16701.txt", "16380.txt", "16510.txt")
Here all is the vector of file names you need and imp is the incomplete set that was actually imported. You can get the missing files with:
missing <- all[!all %in% imp]
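The same comparison can be written with setdiff(), which also avoids reusing the names all and missing, since both are base R functions. A short sketch, using only the first four names from the vector above and illustrative variable names:

```r
# First four expected files, and the three of them that appear in 'imp' above
expected_files <- c("16048.txt", "16062.txt", "16066.txt", "16093.txt")
imported_files <- c("16093.txt", "16062.txt", "16048.txt")

# Elements of the first vector that are absent from the second
not_imported <- setdiff(expected_files, imported_files)
```

Here not_imported contains "16066.txt", the one file of the four that is not in the imported set.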