Undefined columns selected error in R while reading multiple files - r

I have lots of files in my directory. I want to read all of them, select the second column of each, and put those columns as rows of a matrix, but I get a strange error.
Could anybody help me figure out what is going wrong with my code?
Here is my attempt:
#read all files in one directory into R and select desired column
nm <- list.files(path="April/mRNA_expression/")
Gene_exp<-do.call(rbind, lapply(nm, function(x) read.table(file=x,header=TRUE, sep= ",")[, 2]))
save(Gene_exp, file="Path to folder")
The error I get is:
## Error in `[.data.frame`(read.table(file = x, header = TRUE, sep = ""), :
## undefined columns selected
To check that my files really have 2 columns, I did this:
b <- read.table("A.genes.normalized_results", sep="")
dim(b)
## [1] 20532 2
My text file looks like this:
gene_id normalized_count
?|100130426 0.0000
?|100133144 10.6340
?|100134869 5.6790
?|10357 106.4628
?|10431 710.8902
?|136542 0.0000
?|155060 132.2883
?|26823 0.5098
?|280660 0.0000
?|317712 0.0000
?|340602 0.0000
?|388795 1.2745
?|390284 5.3527
?|391343 2.5489
?|391714 0.7647
?|404770 0.0000
?|441362 0.0000

The better solution would be to only import the second column when reading it. Use the colClasses argument to completely skip the first:
Gene_exp<-do.call(rbind, lapply(nm, function(x) read.delim(file=x,header=TRUE, colClasses=c('NULL', 'character'))))
I am assuming the second column is character. Change it to the appropriate class if you need to.
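For reference, here is a minimal sketch of the whole workflow in base R, under stated assumptions: the files are tab-delimited with a header row (as read.delim() expects), the second column is numeric, and every file has the same number of rows. The directory demo_dir and the two demo files are made up for illustration. One likely cause of the original error is that sep="," on such a file yields a single column, so [, 2] does not exist. In addition, list.files() without full.names = TRUE returns bare file names, which only resolve if the working directory is that folder.

```r
# Build two small demo files in a temporary directory (hypothetical data).
demo_dir <- file.path(tempdir(), "mRNA_expression_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (s in c("A", "B")) {
  writeLines(
    c("gene_id\tnormalized_count", "?|100130426\t0.0000", "?|100133144\t10.6340"),
    file.path(demo_dir, paste0(s, ".genes.normalized_results"))
  )
}

# full.names = TRUE keeps the directory in each path, so the files are found.
nm <- list.files(path = demo_dir, full.names = TRUE)

# Skip the first column ("NULL") and keep the second as a numeric vector.
second_cols <- lapply(nm, function(x) {
  read.delim(x, header = TRUE, colClasses = c("NULL", "numeric"))[[1]]
})

# rbind() on plain vectors gives one matrix row per file.
Gene_exp <- do.call(rbind, second_cols)
rownames(Gene_exp) <- basename(nm)
dim(Gene_exp)
```

Extracting the column with [[1]] before rbind() matters: binding one-column data frames would stack them into one long column instead of producing one row per file.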


R: read_csv reads numeric entries as logical - parsing col_logical instead of col_double

I am new to R.
I wrote code for an assignment that reads several csv files, binds them into a data frame and then, according to the id, calculates the mean of either nitrate or sulfate.
Data sample:
Date sulfate nitrate ID
<date> <dbl> <dbl> <dbl>
1 2003-10-06 7.21 0.651 1
2 2003-10-12 5.99 0.428 1
3 2003-10-18 4.68 1.04 1
4 2003-10-24 3.47 0.363 1
5 2003-10-30 2.42 0.507 1
6 2003-11-11 1.43 0.474 1
...
To read the files and create a data.frame, I wrote this function:
pollutantmean <- function (pollutant, id = 1:332) {
#creating a data frame from several files
file_m <- list.files(path = "specdata", pattern = "*.csv", full.names = TRUE)
read_file_m <- lapply(file_m, read_csv)
df_1 <- bind_rows(read_file_m)
# delete NAs
df_clean <- df_1[complete.cases(df_1),]
#select rows according to id
df_asid_clean <- filter(df_clean, ID %in% id)
#count the mean of the column
mean_result <- mean(df_asid_clean[, pollutant])
mean_result
}
However, when read_csv is applied, the nitrate column of some files is parsed as col_logical, although the entries are numeric. It seems that the parser "expects" to receive a logical value, although the real value is not logical.
Throughout the reading I get this message:
<...>
Parsed with column specification:
cols(
Date = col_date(format = ""),
sulfate = col_double(),
nitrate = col_logical(),
ID = col_double()
)
Warning: 41 parsing failures.
row col expected actual file
2055 nitrate 1/0/T/F/TRUE/FALSE 0.383 'specdata/288.csv'
2067 nitrate 1/0/T/F/TRUE/FALSE 0.355 'specdata/288.csv'
2073 nitrate 1/0/T/F/TRUE/FALSE 0.469 'specdata/288.csv'
2085 nitrate 1/0/T/F/TRUE/FALSE 0.144 'specdata/288.csv'
2091 nitrate 1/0/T/F/TRUE/FALSE 0.0984 'specdata/288.csv'
.... ....... .................. ...... ..................
See problems(...) for more details.
I tried to change the column class after binding the rows by writing
df_1[,nitrate] <- as.numeric(as.character(df_1[, nitrate]))
but NAs are still introduced in the step that calculates the mean.
What is wrong here, and how could I solve it?
Would appreciate your help!
UPDATE: I tried to insert read_csv(col_types = list...), but I get an error that the "files" argument is not defined. As I understand it, R evaluates the read_csv() call first and only then lapply(), and because no file is given at that point, it raises an error.
The problem with readr::read_csv() failure in parsing the column types can be overcome by passing a col_types= argument in lapply(). We do this as follows:
pollutantmean <- function (directory,pollutant,id=1:332){
require(readr)
require(dplyr)
file_m <- list.files(path = directory, pattern = "*.csv", full.names = TRUE)[id]
read_file_m <- lapply(file_m, read_csv,col_types=list(col_date(),col_double(),
col_double(),col_integer()))
# rest of code goes here. Since I am a Community Mentor in the
# JHU Data Science Specialization, I am not allowed to post
# a complete solution to the programming assignment
}
Note that I use the [ form of the extract operator to subset the list of file names with the id vector that is an argument to the function, which avoids reading a lot of data that isn't necessary. This eliminates the need for the filter() statement in the code posted in the question.
With some additional programming statements to complete the assignment, the code in my answer produces the correct results for the three examples posted with the assignment, as listed below.
> pollutantmean("specdata","sulfate",1:10)
[1] 4.064128
> pollutantmean("specdata", "nitrate", 70:72)
[1] 1.706047
> pollutantmean("specdata", "nitrate", 23)
[1] 1.280833
Alternately we could implement lapply() with an anonymous function that also uses read_csv() as follows:
read_file_m <- lapply(file_m, function(x) {read_csv(x,col_types=list(col_date(),col_double(),
col_double(),col_integer()))})
NOTE: while it is completely understandable that students who have been exposed to the tidyverse would like to use it for the programming assignment, the fact that dplyr isn't introduced until the next course in the sequence (and readr isn't covered at all) makes it much more difficult to use for assignments in R Programming, especially the first assignment, where dplyr non-standard evaluation gives people fits. An example of this situation is yet another Stack Overflow question on pollutantmean().
With the read_csv update you don't need lapply; you can simply pass the vector of file paths you have already defined directly to read_csv.
The column types can then be set manually in the col_types argument:
col_types=cols(Date=col_date(),sulfate=...)
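To make the cause concrete, here is a small sketch that assumes the readr package is installed. As far as I know, readr guesses column types from the leading rows only (the guess_max argument, 1000 by default), so a column that is entirely NA in those rows tends to be guessed as logical. The demo file below is made up to reproduce that situation, and an explicit column specification avoids the guessing altogether.

```r
library(readr)

# Hypothetical file: nitrate is missing in the first 1200 rows,
# so a guess based on the leading rows would see only NA values.
n <- 1500
demo <- data.frame(
  Date    = format(as.Date("2003-01-01") + seq_len(n) - 1),
  sulfate = round(runif(n, 1, 10), 2),
  nitrate = c(rep(NA, 1200), round(runif(300, 0, 2), 3)),
  ID      = 288L
)
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

# Name every column and state its type instead of relying on guessing.
spec <- cols(
  Date    = col_date(),
  sulfate = col_double(),
  nitrate = col_double(),
  ID      = col_integer()
)
df <- read_csv(path, col_types = spec)
class(df$nitrate)
```

Another option would be to raise guess_max instead, but a full specification also documents what the file is expected to contain.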

How to handle "write.xlsx" error: arguments imply differing number of rows

I'm trying to write an xlsx file from a list of data frames that I created, but I'm getting an error due to missing data (I couldn't download it). I just want to write the xlsx file despite the missing data. Any help is appreciated.
For replication of the problem:
library(quantmod)
name_of_symbols <- c("AKER","YECO","SNOA")
research_dates <- c("2018-11-19","2018-11-19","2018-11-14")
my_symbols_df <- lapply(name_of_symbols, function(x) tryCatch(getSymbols(x, auto.assign = FALSE),error = function(e) { }))
my_stocks_OHLCV <- list()
for (i in 1:3) {
trade_date <- paste(as.Date(research_dates[i]))
OHLCV_data <- my_symbols_df[[i]][trade_date]
my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
}
And you can see the missing data down here in my_stocks_OHLCV[[2]] and the write.xlsx error I'm getting:
print(my_stocks_OHLCV)
[[1]]
AKER.Open AKER.High AKER.Low AKER.Close AKER.Volume AKER.Adjusted
2018-11-19 2.67 3.2 1.56 1.75 15385800 1.75
[[2]]
data frame with 0 columns and 0 rows
[[3]]
SNOA.Open SNOA.High SNOA.Low SNOA.Close SNOA.Volume SNOA.Adjusted
2018-11-14 1.1 1.14 1.01 1.1 107900 1.1
write.xlsx(my_stocks_OHLCV, "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,:arguments imply differing number of rows: 1, 0
How do I run write.xlsx even though I have this missing data?
The main question you need to ask is: what do you want instead?
As you are working with stock data, the simplest approach is to remove any stock for which you have no data. Something like this should work:
my_stocks_OHLCV <- my_stocks_OHLCV[sapply(my_stocks_OHLCV, nrow) > 0]
If you want a row full of NA or 0 instead, find the elements of the list that have no rows and replace each of them with a one-row data frame of NAs (or of zeros, e.g. c(0,0,0,0,0,0)).
Something like this:
condition <- sapply(my_stocks_OHLCV, nrow) == 0
my_stocks_OHLCV[condition] <- list(data.frame(t(rep(NA, 6))))
Here the condition variable marks the elements of the list where you don't have any data. We can then replace those with NA, or swap the NA for 0. However, I can't think of a reason to do this.
A variation on your question, and one you could handle inside your for loop, is to check whether you have data and, if you don't, fill in NAs there; you could also give the data frame the correct headers, as you know which stock it relates to.
Hope this helps.
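Both options can be tried without downloading anything. The sketch below uses a made-up list called stocks with shortened column names as a stand-in for my_stocks_OHLCV; it only prepares the list and leaves the actual writing step out.

```r
# Hypothetical stand-in for my_stocks_OHLCV: the second element is empty.
stocks <- list(
  AKER = data.frame(Open = 2.67, Close = 1.75),
  YECO = data.frame(),
  SNOA = data.frame(Open = 1.10, Close = 1.10)
)

# TRUE for every element that actually contains rows.
has_rows <- vapply(stocks, nrow, integer(1)) > 0

# Option 1: drop the empty elements.
stocks_nonempty <- stocks[has_rows]

# Option 2: replace each empty element with a one-row data frame of NAs
# that carries the same headers as the other elements.
na_row <- data.frame(Open = NA_real_, Close = NA_real_)
stocks_filled <- stocks
stocks_filled[!has_rows] <- list(na_row)

names(stocks_nonempty)
stocks_filled$YECO
```

Wrapping the replacement in list() is what keeps it a data frame; assigning a bare data frame into a list subset would insert its columns one by one instead.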

Writing and reading a zoo object - errors

I have a zoo object, prices; when I type class(prices), it returns "zoo". I then create a file using:
write.zoo(prices, file = "foo", index.name = "time")
The resulting file looks like this:
"time" "AAPL.Adjusted" "SHY.Adjusted"
2013-05-01 60.31 84.12
2013-05-02 61.16 84.11
2013-05-03 61.77 84.08
I then try and read this file with this statement:
myData <- read.zoo("foo")
and I get this error:
Error in read.zoo("foo") :
index has bad entries at data rows: 1 2 3 4
I’ve tried a number of parameter settings and nothing seems to work. Help much appreciated.
Newbie
The file has a header line so try:
z <- read.zoo("foo", header = TRUE, check.names = FALSE)
The check.names part gives nicer looking column names but you could leave it out if that were not important.

Reading large fixed format text file in r

I am trying to input a large (> 70 MB) fixed format text file into R. For a smaller file (< 1MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:
> head(testcsv3)
Varlen Varname Varclass Varsep Varforfmt
1 2 "V1" "character" 2 "A2.0"
2 15 "V2" "character" 17 "A15.0"
3 28 "V3" "character" 45 "A28.0"
4 3 "V4" "character" 48 "F3.0"
5 1 "V5" "character" 49 "A1.0"
6 3 "V6" "character" 52 "A3.0"
At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is format the data when they come into R as something other than factors. I am hoping this will limit the amount of memory I am using and allow me to actually input the file. I would appreciate any suggestions about how I can do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren
Maybe this code works for you. You have to fill varlen with the field sizes and add the corresponding type strings (e.g. numeric, character, integer) to colclasses
my.readfwf <- function(filename, varlen, colclasses) {
  # start and end character positions of each field
  sidx <- cumsum(c(1, varlen[-length(varlen)]))
  eidx <- sidx + varlen - 1
  # read the file as one string per line
  filecontent <- scan(filename, character(0), sep = "\n")
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  res <- list()
  for (i in seq_along(varlen)) {
    # substring() is vectorised over the lines, so sapply() is not needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    mode(res[[i]]) <- colclasses[i]
  }
  attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                          row.names = seq_along(res[[1]]),
                          class = "data.frame")
  return(res)
}
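As an alternative to a hand-written reader, read.fwf() itself passes extra arguments on to read.table(), so colClasses can be used to stop the columns from becoming factors. The sketch below is only an illustration: the three-field layout, the variable names and the tiny demo file are made up, and whether this is enough for a 70 MB file would have to be tested. A smaller buffersize (the number of lines processed at a time) may also help with memory.

```r
# Hypothetical three-field layout and a tiny fixed-width demo file.
widths    <- c(2, 5, 3)
var_names <- c("V1", "V2", "V3")
fwf_path  <- tempfile(fileext = ".txt")
writeLines(c("NJapple 12", "NYpear   7"), fwf_path)

dat <- read.fwf(
  fwf_path,
  widths      = widths,
  col.names   = var_names,
  # passed through to read.table(): fixes the type of every column
  colClasses  = c("character", "character", "integer"),
  # passed through to read.table(): trims the padding of text fields
  strip.white = TRUE,
  # read.fwf() argument: number of lines processed at a time
  buffersize  = 1000
)
str(dat)
```

With the question's layout table, widths would be testcsv3$Varlen and colClasses would be testcsv3$Varclass.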

Troubleshooting "undefined columns selected" in R

I was trying to remove all columns with a zero variance from my data, using this command
file <- file[,sapply(file, function(v) var(v, na.rm=TRUE)!=0)]
This command worked perfectly for my previous datasets. Now I am trying to use it on a new dataset and it gives me the following error:
Error in `[.data.frame`(file, , sapply(file, function(v) var(v, na.rm = TRUE) != :
undefined columns selected
In addition: Warning message:
In var(v, na.rm = TRUE) : NAs introduced by coercion
The problem is I did not select any columns, I just applied the function to all columns! How come I get an error telling me undefined columns selected!
Any idea what could have gone wrong??
The data looks exactly this way
col1 col2 col3 col4
1 FIA 3.5 2.4 NA
2 DWF 2.1 NA 3.7
3 LIK 0.25 2.3 1.38
4 JUW 2.1 4.0 3.2
The input file was a CSV file read via the read.csv command. It had an extra empty column at the end of the table, and that column was causing the problem. Removing this last column with the following command solved the issue:
lastcol <- ncol(file)
file[,lastcol] <- NULL
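A plausible explanation of the error message: for the empty column, var() returns NA, so the logical index built by sapply() contains an NA, and a data frame cannot be indexed by an NA column selector. The sketch below, on made-up data, builds an index that is never NA, so empty columns are dropped together with the constant ones. Keeping text columns untouched is my own assumption about what is wanted.

```r
# Hypothetical data: col1 is text, col3 is constant, X is an empty trailing column.
dat <- data.frame(
  col1 = c("FIA", "DWF", "LIK", "JUW"),
  col2 = c(3.5, 2.1, 0.25, 2.1),
  col3 = c(1, 1, 1, 1),
  X    = c(NA, NA, NA, NA)
)

keep <- vapply(dat, function(v) {
  if (all(is.na(v))) return(FALSE)  # empty column, e.g. from a trailing comma
  if (!is.numeric(v)) return(TRUE)  # leave text columns alone
  s <- var(v, na.rm = TRUE)
  !is.na(s) && s != 0               # never NA, so indexing is always defined
}, logical(1))

dat_var <- dat[, keep, drop = FALSE]
names(dat_var)
```

Using vapply() with logical(1) also guarantees that the index is a plain logical vector with one entry per column.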
