Reading large fixed format text file in R

I am trying to read a large (> 70 MB) fixed format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:
> head(testcsv3)
  Varlen Varname    Varclass Varsep Varforfmt
1      2    "V1" "character"      2    "A2.0"
2     15    "V2" "character"     17   "A15.0"
3     28    "V3" "character"     45   "A28.0"
4      3    "V4" "character"     48    "F3.0"
5      1    "V5" "character"     49    "A1.0"
6      3    "V6" "character"     52    "A3.0"
At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is format the data when they come into R as something other than factors. I am hoping this will limit the amount of memory I am using and allow me to actually read the file. I would appreciate any suggestions about how I can do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren

Maybe this code works for you. You have to fill varlen with the field sizes and add the corresponding type strings (e.g. numeric, character, integer) to colclasses.
my.readfwf <- function(filename, varlen, colclasses) {
  # start and end character position of each field
  sidx <- cumsum(c(1, head(varlen, -1)))
  eidx <- sidx + varlen - 1
  # read every line as a single string, so nothing becomes a factor
  filecontent <- scan(filename, character(0), sep = "\n", quiet = TRUE)
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  res <- vector("list", length(varlen))
  for (i in seq_along(varlen)) {
    # substring() is vectorised over the lines, so sapply() is not needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    mode(res[[i]]) <- colclasses[i]
  }
  names(res) <- paste0("V", seq_along(res))
  as.data.frame(res, stringsAsFactors = FALSE)
}
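An alternative that stays within base R, offered as a sketch rather than a tested fix for the 70 MB file: read.fwf() forwards extra arguments to read.table(), so colClasses can be passed directly and no column is turned into a factor. read.fortran() expects format strings such as "A2" or "F3.0"; a character format written as "A2.0" is not parsed, which would explain the "NAs introduced by coercion" warnings in the question. The sample file, the spec data frame and the widths below are made up for illustration.

```r
# Made-up two-line fixed-width file: 2 characters, a 5-wide number, 1 character
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB  123X", "CD   45Y"), tmp)

# Hypothetical layout table, shaped like testcsv3 in the question
spec <- data.frame(
  Varlen    = c(2, 5, 1),
  Varname   = c("V1", "V2", "V3"),
  Varclass  = c("character", "numeric", "character"),
  Varforfmt = c("A2", "F5.0", "A1"),
  stringsAsFactors = FALSE
)

# colClasses is passed through to read.table(); buffersize caps how many
# lines are processed at one time while the file is converted
dat <- read.fwf(tmp, widths = spec$Varlen, col.names = spec$Varname,
                colClasses = spec$Varclass, buffersize = 1000)
str(dat)

# Same file via read.fortran(); the widths come from the format strings
dat2 <- read.fortran(tmp, format = spec$Varforfmt, col.names = spec$Varname)
str(dat2)
```

Whether this is enough for the large file depends on the available memory; lowering buffersize trades speed for a smaller working set.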

Related

How to handle "write.xlsx" error: arguments imply differing number of rows

I'm trying to write an xlsx file from a list of data frames that I created, but I'm getting an error because data for one symbol is missing (I couldn't download it). I just want to write the xlsx file despite the missing data. Any help is appreciated.
For replication of the problem:
library(quantmod)
name_of_symbols <- c("AKER","YECO","SNOA")
research_dates <- c("2018-11-19","2018-11-19","2018-11-14")
my_symbols_df <- lapply(name_of_symbols, function(x) tryCatch(getSymbols(x, auto.assign = FALSE),error = function(e) { }))
my_stocks_OHLCV <- list()
for (i in 1:3) {
trade_date <- paste(as.Date(research_dates[i]))
OHLCV_data <- my_symbols_df[[i]][trade_date]
my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
}
And you can see the missing data down here in my_stocks_OHLCV[[2]] and the write.xlsx error I'm getting:
print(my_stocks_OHLCV)
[[1]]
AKER.Open AKER.High AKER.Low AKER.Close AKER.Volume AKER.Adjusted
2018-11-19 2.67 3.2 1.56 1.75 15385800 1.75
[[2]]
data frame with 0 columns and 0 rows
[[3]]
SNOA.Open SNOA.High SNOA.Low SNOA.Close SNOA.Volume SNOA.Adjusted
2018-11-14 1.1 1.14 1.01 1.1 107900 1.1
write.xlsx(my_stocks_OHLCV, "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,:arguments imply differing number of rows: 1, 0
How do I run write.xlsx even though I have this missing data?
The main question you need to ask is, what do you want instead?
As you are working with stock data, the best idea is to remove a stock if you don't have data for it. Something like this should work,
my_stocks_OHLCV[sapply(my_stocks_OHLCV, nrow) > 0]
If you want a row full of NA or 0 instead, replace every empty element of the list with a one-row data frame of NAs (or of zeros).
Something like this,
condition <- sapply(my_stocks_OHLCV, nrow) == 0
my_stocks_OHLCV[condition] <- list(data.frame(matrix(NA, nrow = 1, ncol = 6)))
Here the condition variable marks the elements of the list for which you don't have any data. We can then replace those with NA, or swap the NA for 0. However, I can't think of a reason to do this.
A variation, and one you could handle inside your for loop, is to check whether you have data and, if you don't, fill in NAs there. You could also give the placeholder the correct headers, since you know which stock it relates to.
Hope this helps.
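A minimal sketch of both options, using toy data frames instead of the quantmod download (the values are copied from the printed output in the question; the names symbols, cols, kept and filled are only for illustration):

```r
# Toy stand-ins for the per-symbol results; the second one is empty
symbols <- c("AKER", "YECO", "SNOA")
cols <- c("Open", "High", "Low", "Close", "Volume", "Adjusted")
stocks <- list(
  data.frame(AKER.Open = 2.67, AKER.High = 3.2, AKER.Low = 1.56,
             AKER.Close = 1.75, AKER.Volume = 15385800, AKER.Adjusted = 1.75,
             row.names = "2018-11-19"),
  data.frame(),
  data.frame(SNOA.Open = 1.1, SNOA.High = 1.14, SNOA.Low = 1.01,
             SNOA.Close = 1.1, SNOA.Volume = 107900, SNOA.Adjusted = 1.1,
             row.names = "2018-11-14")
)

# TRUE for every element that has no rows
is_empty <- vapply(stocks, function(d) nrow(d) == 0, logical(1))

# Option 1: drop the empty elements before writing
kept <- stocks[!is_empty]

# Option 2: replace each empty element with one row of NAs that carries
# the headers of the stock it belongs to
filled <- stocks
for (i in which(is_empty)) {
  placeholder <- as.data.frame(matrix(NA_real_, nrow = 1, ncol = length(cols)))
  names(placeholder) <- paste(symbols[i], cols, sep = ".")
  filled[[i]] <- placeholder
}
```

Either kept or filled can then be passed to write.xlsx() in place of the original list.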

Create data for portfolio in R

I have data in Excel. Suppose I read it like this (only one series is shown below):
ccl<-ts(mysheets$CCL$`Adj Close`,start=c(2000, 1), end=c(2012, 12), frequency=12)
ccl.r<-diff(log(ccl), lag=1)
Then, I combine all the series into one matrix:
data<-cbind(aal.r, adm.r, aht.r, anto.r, arm.r, av.r, azn.r, ba.r, bab.r, barc.r, bats.r,bdev.r, bkg.r, blnd.r, blt.r, bnzl.r, bta.r, bznl.r, ccl.r)
Then, I try to put the data into the format that fPortfolio expects, by using:
ewSpec<-portfolioSpec()
nAssets<-ncol(data)
setWeights(ewSpec)<-rep(1/nAssets, time=nAssets)
mydata<-portfolioData(data=data, spec=portfolioSpec())
However, I get this error:
Error in portfolioData(data = data, spec = portfolioSpec()) :
object 'assetsNames' not found
In addition: Warning messages:
1: In if (class(data) == "timeSeries") { :
the condition has length > 1 and only the first element will be used
2: In if (class(data) == "list") { :
the condition has length > 1 and only the first element will be used
This was solved by making the matrix a "timeSeries" object. Thanks for reading the question...
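For anyone hitting the same warnings, here is a sketch of what is probably going on, with simulated prices in place of the Excel data. cbind() on ts objects returns a multivariate ts whose class vector has more than one element, which is what the "condition has length > 1" warnings point at. The conversion uses as.timeSeries() from the timeSeries package and only runs if that package is installed; the series names a and b are made up.

```r
set.seed(1)
# Simulated monthly price series (placeholders for the Excel columns)
a <- ts(100 * exp(cumsum(rnorm(24, sd = 0.05))), start = c(2000, 1), frequency = 12)
b <- ts(100 * exp(cumsum(rnorm(24, sd = 0.05))), start = c(2000, 1), frequency = 12)

# Log returns, combined column-wise as in the question
data <- cbind(a.r = diff(log(a)), b.r = diff(log(b)))

# The class vector has several elements, so class(data) == "timeSeries"
# is not a single TRUE/FALSE value
print(class(data))

# Convert to a timeSeries object before passing it to portfolioData()
if (requireNamespace("timeSeries", quietly = TRUE)) {
  tsdata <- timeSeries::as.timeSeries(data)
  print(class(tsdata))
}
```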

Incorrect number of dimensions: extracting elements from multiple rdata files

PROBLEM
I have many .RData files in one folder and I want to extract the coordinates contained in each .RData file. I'd also like to link the corresponding file name (use_hab) and datetime (dt) to each row of their respective coordinates.
CODE
file.namez<-list.files("C:/fitting/fitdata/7 27 2015") #name of files
#file.namez.rev<-file.namez[grep(".RData",file.namez)]
datastor<-data.frame(matrix(NA,length(file.namez),4))
names(datastor)<-c("use_hab",paste("B",1:3,sep=""))
allresults<-NULL
for(i in 1:length(file.namez))
{
  datastor<-NULL
  print(file.namez[i])
  load(paste("C:/fitting/fitdata/7 27 2015/",file.namez[i], sep=""))
  use_hab <- as.character(as.data.frame(strsplit(file.namez[i],"_an"))[2,])# this line is used to remove unwanted parts of the file name
  use_hab <- gsub(".RData","", use_hab)
  datastor <- fitdata$coords
  datastor$use_hab <- use_hab
  datastor$dt <- fitdata$dt
  allresults <- rbind(allresults, datastor[,c(3,4,1,2)])
}
This is the only output before the error message:
[1] "fitdata_anw514_yr2008.RData"
ERROR
Error in datastor[, c(3, 4, 1, 2)] : incorrect number of dimensions
In addition: Warning message:
In datastor$use_hab <- use_hab : Coercing LHS to a list
QUESTION
How am I getting the incorrect number of dimensions? Each file name should have 1098 coordinates and date time. In total, 63 files x 1098 rows with 4 columns(filename, datetime, x, y).
The desired result is to have the file name as the first column, the date time as the second column, and the x and y coordinates as the third and fourth columns.
Replace
datastor <- fitdata$coords
with
datastor$coords <- fitdata$coords
The warning Coercing LHS to a list is issued when you assign with $ to an object that does not support it, such as a matrix or an atomic vector. datastor <- fitdata$coords changes datastor to the data type of fitdata$coords.
Also, I would change
allresults<-NULL
datastor<-NULL
to
allresults <- data.frame()
datastor <- data.frame()
but this may just be my personal preference.
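Another way to get the four-column layout described in the question is to build one data frame per file and bind them at the end. This is only a sketch: it assumes fitdata$coords is a two-column matrix (x, y) and fitdata$dt is a vector of the same length, which the question does not confirm. The temporary folder and the fake files exist only to make the example runnable, and read_one is a made-up helper name.

```r
# Create two fake .RData files shaped like the assumed fitdata object
tmpdir <- tempfile("fitdata")
dir.create(tmpdir)
for (tag in c("anw514_yr2008", "anw515_yr2008")) {
  fitdata <- list(
    coords = matrix(runif(6), ncol = 2),
    dt = as.POSIXct("2008-01-01", tz = "UTC") + 0:2 * 3600
  )
  save(fitdata, file = file.path(tmpdir, paste0("fitdata_", tag, ".RData")))
}

files <- list.files(tmpdir, pattern = "\\.RData$", full.names = TRUE)

# Load one file into its own environment and return a 4-column data frame
read_one <- function(f) {
  env <- new.env()
  load(f, envir = env)
  fd <- env$fitdata
  data.frame(
    # keep the part of the file name after "_an", without the extension
    use_hab = sub("\\.RData$", "", sub("^.*_an", "", basename(f))),
    dt = fd$dt,
    x = fd$coords[, 1],
    y = fd$coords[, 2],
    stringsAsFactors = FALSE
  )
}

allresults <- do.call(rbind, lapply(files, read_one))
```

Building a data frame explicitly avoids calling $<- on a matrix, which is what triggers the coercion warning.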

Unable to convert & assign character value to a numeric field

I have been struggling with the following issue:
I have the following variables:
class(HARdata)
[1] "data.frame"
dim(HARdata)
[1] 10299 88
class(activity_labels)
[1] "character"
length(activity_labels)
[1] 6
I have been trying to run the following loop:
for (i in 1:nrow(HARdata)) {
  for (j in 1:length(activity_labels)){
    if (as.numeric(HARdata[i, "traintype"]) == extract_numeric(activity_labels[j])) {
      HARdata[i, "traintype"] <- activity_labels[j]
    }
  }
}
However, I get the following error:
Error in if (as.numeric(HARdata[i, "traintype"]) == extract_numeric(activity_labels[j])) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
NAs introduced by coercion
If I replace HARdata[i, "traintype"] <- activity_labels[j] with HARdata[i, "traintype"] <- 10 , the code runs fine. So I suppose the problem is in this line. The left side is numeric while the right side is supposed to be character. I tried running as.character(HARdata[i, "traintype"]) <- "test" but this doesn't work. Can anyone see what could be the issue?
test <- scan()
0.27513126 0.39694439
0.54228045 0.82751195
0.18600784 0.96602747
0.55259276 0.52368149
0.28976503 0.74500213
0.17534195 0.04931733
0.08077429 0.82169260
0.72602526 0.94921645
0.65077605 0.06989442
0.81399236 0.1379080
test <- as.data.frame(matrix(test, ncol=2))
names(test) <- c('cartype', 'traintype')
library(tidyr)
activity_labels <- c("$0.08077429", "$0.65077605")
test[,"traintype"][match(extract_numeric(activity_labels), test[,"traintype"])] <- activity_labels
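A loop-free sketch in base R, for comparison. The loop in the question fails because the first assignment of a label turns the whole traintype column into character; on the next pass as.numeric() is applied to a label, returns NA, and if() stops. Looking up all rows at once with match() avoids that. The label format ("1 WALKING" and so on) and the small HARdata frame are assumptions made for the example.

```r
# Assumed shape of the inputs: numeric codes in the data, "code NAME" labels
HARdata <- data.frame(traintype = c(1, 3, 2, 3))
activity_labels <- c("1 WALKING", "2 WALKING_UPSTAIRS", "3 WALKING_DOWNSTAIRS")

# Pull the leading number out of each label
codes <- as.numeric(sub("^\\s*([0-9]+).*$", "\\1", activity_labels))

# Replace every code by its label in one step; unmatched codes become NA
HARdata$traintype <- activity_labels[match(HARdata$traintype, codes)]
print(HARdata)
```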

Troubleshooting "undefined columns selected" in R

I was trying to remove all columns with a zero variance from my data, using this command
file <- file[,sapply(file, function(v) var(v, na.rm=TRUE)!=0)]
This command was working perfectly for my previous datasets, now I am trying to use on a new dataset and it gives me the following error:
Error in `[.data.frame`(file, , sapply(file, function(v) var(v, na.rm = TRUE) != :
undefined columns selected
In addition: Warning message:
In var(v, na.rm = TRUE) : NAs introduced by coercion
The problem is I did not select any columns, I just applied the function to all columns! How come I get an error telling me undefined columns selected!
Any idea what could have gone wrong??
The data looks exactly this way
  col1 col2 col3 col4
1  FIA  3.5  2.4   NA
2  DWF  2.1   NA  3.7
3  LIK 0.25  2.3 1.38
4  JUW  2.1  4.0  3.2
The input file was a CSV file read with the read.csv command. It had an extra empty column at the end of the table, and that column was causing the problem. Removing this last column with the commands below solved the issue.
lastcol <- ncol(file)
file[,lastcol] <- NULL
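A more defensive version of the filter, as a sketch: an empty column makes var() return NA, and an NA in a logical column index is what produces "undefined columns selected". The helper below (drop_zero_variance is a made-up name) drops all-NA columns and numeric columns without variance, and leaves other non-numeric columns alone. The sample data frame is invented.

```r
drop_zero_variance <- function(df) {
  keep <- vapply(df, function(v) {
    if (all(is.na(v))) return(FALSE)   # empty column, e.g. a trailing comma in the CSV
    if (!is.numeric(v)) return(TRUE)   # leave character/factor columns alone
    s <- var(v, na.rm = TRUE)          # NA when fewer than two non-missing values
    !is.na(s) && s != 0
  }, logical(1))
  df[, keep, drop = FALSE]
}

# Invented example: 'b' is constant and 'X' is an empty trailing column
df <- data.frame(
  id = c("FIA", "DWF", "LIK", "JUW"),
  a  = c(3.5, 2.1, 0.25, 2.1),
  b  = c(1, 1, 1, 1),
  X  = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
cleaned <- drop_zero_variance(df)
print(names(cleaned))
```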
