Julia problem reading relatively large csv file - julia

Using Julia 1.0.0
I have a csv file with 75 columns and about 700,000 rows. Up until yesterday my code was reading it in seconds and converting to DataFrame.
RawDat = CSV.read("filename.csv", header=true, rows_for_type_detect=500,
missingstring="", categorical=false, types=dictm)
A couple days ago I installed JLD.jl which triggered several packages to update. I had probably not updated my packages, including CSV and DataFrames, for a couple months. Since the updating, I can no longer read the same CSV file. The code hangs for more than 20 minutes and nothing is happening.
I tried using CSV.File since it seems like CSV.read was deprecated. This reads the file but I still cannot convert it to DataFrame.
RawDat = CSV.File("filename.csv", header=1, missingstring="",
categorical=false, types=dictm)
works but then if I try
RawDat1 = DataFrame(RawDat)
it hangs and nothing is happening. Similarly, if I try
RawDat = CSV.File("filename.csv", header=1, missingstring="",
categorical=false, types=dictm) |> DataFrame
the file is not read.
Can someone help me understand why this is happening and how I can read this csv file into a DataTable? I have a lot of downstream code that uses DataTable features to process this file.
EDIT
I believe I figured it out and posting in case others have a similar issue. I was able to convert the file to a DataFrame one column at a time. It was pretty quick but overall I think this should be done automatically without the need for extra lines of code. This is what worked so far:
datcols = Tables.columns(RawDat)
MyDF = DataFrame()
kcs = keys(datcols)
for ci in kcs
MyDF[ci] = datcols[ci]
end

Related

Remove everything after an empty row on a list in R

I have a quick question that I cannot figure out. I am reading some results from an output file using the code below and stored as a list in R that can be seen in the picture. I want to delete all of the information after an empty row, in other words, it would be everything after line 42:
Does anybody know anything that I could use? I tried using gsub was I was not very successful.
Thanks for all of the help I am new to programming in R. Again any help is very much appreciated.
LoadFFA <- function(filename, folder.out, TYPE = "PeakFQ_17C",
colStandard = TRUE){ # standardize column output names
require(data.table)
if(grepl("PEAKFQSA",TYPE)){ # PeakfqSA Bulleting 17C analysis
text.list<-lapply(fileinput,readLines)
skip.rows<-sapply(text.list, grep, pattern = '^Ann. Exc. Prob.\\s+EMA Est.')-1
PFA <- lapply(seq_along(text.list),function(i) read.delim(fileinput[i],skip=skip.rows[i],sep="\n",stringsAsFactors = TRUE,blank.lines.skip = FALSE))
}
EDIT
I don't know if I could upload directly so here is the google drive link.
Also, here is the command to run the function LoadFFA("03606500peaks.out","D:/Documents/hydraulic.failures","PEAKFQSA"). The screenshot is the result using print(PFA).
The reason why I am using a loop is because I am reading multiple files (output files) and they have a lot of data, multiple lenghts, and I am reading the data beginning Ann.Exc.Prob. and as per the screenshot provided I would like to end after line 42 (after a full empty row). I hope that clears some confusion.
Basically read the output files, start reading on "Ann.Exc.Prob" and end until the end of that data (line 42 for this particular file). I am using a function because I am running several times.
Again, sorry for the trouble. Thank you for your time and I appreciate your patience.
https://drive.google.com/file/d/1PGbGWIHFj7IQRevTAEfqqA9Okg4fz7Mg/view?usp=sharing

R save() not producing any output but no error

I am brand new to R and I am trying to run some existing code that should clean up an input .csv then save the cleaned data to a different location as a .RData file. This code has run fine for the previous owner.
The code seems to be pulling the .csv and cleaning it just fine. It also looks like the save is running (there are no errors) but there is no output in the specified location. I thought maybe R was having a difficult time finding the location, but it's pulling the input data okay and the destination is just a sub folder.
After a full day of extensive Googling, I can't find anything related to a save just not working.
Example code below:
save(data, file = "C:\\Users\\my_name\\Documents\\Project\\Data.RData", sep="")
Hard to believe you don't see any errors - unless something has switched errors off:
> data = 1:10
> save(data, file="output.RData", sep="")
Error in FUN(X[[i]], ...) : invalid first argument
Its a misleading error, the problem is the third argument, which doesn't do anything. Remove and it works:
> save(data, file="output.RData")
>
sep is used as an argument in writing CSV files to separate columns. save writes binary data which doesn't have rows and columns.

Reading zipped folder containing non-traditional spreadsheet

I'm trying to read a zipped folder called etfreit.zip contained in Purchases from April 2016 onward.
Inside the zipped folder is a file called 2016.xls which is difficult to read as it contains empty rows along with Japanese text.
I have tried various ways of reading the xls from R, but I keep getting errors. This is the code I tried:
download.file("http://www3.boj.or.jp/market/jp/etfreit.zip", destfile="etfreit.zip")
unzip("etfreit.zip")
data <- read.csv(text=readLines("2016.xls")[-(1:10)])
I'm trying to skip the first 10 rows as I simply wish to read the data in the xls file. The code works only to the extent that it runs, but the data looks truly bizarre.
Would greatly appreciate any help on reading the spreadsheet properly in R for purposes of performing analysis.
There is more than one bizzare thing going on here I think, but I had some success with (somewhat older) gdata package:
data = gdata::read.xls("2016.xls")
By the way, treating xls file as csv seldom works. Actually it shouldn't work at all :) Find out a proper import function for your type of data and then use it, don't assume that read.csv is going to take care about anything else than csv (properly).
As per your comment: I'm not sure what you mean by "not properly aligned", but here is some code that cleans the data a bit, and gives you numeric variables instead of factors (note I'm using tidyr for that):
data2 = data[-c(1:7), -c(1, 6)]
names(data2) = c("date", "var1", "var2", "var3")
data2[, c(2:4)] = sapply(data2[, c(2:4)], tidyr::extract_numeric)
# Optionally convert the column with factor dates to Posixct
data2$date = as.POSIXct(data2$date)
Also, note that I am removing only 7 upper rows - this seems to be the portion of the data that contains the header with Japanese.
"Odd" unusual excel tables cab be read with the jailbreakr package. It is still in development, but looks pretty ace:
https://github.com/rsheets/jailbreakr

Issues in R with saving split dataframes as new files

I've searched a bit for answered questions related to this, but I still keep running into issues.
I have a 1.4 million dataframe loaded into R, containing gps route data for ~56 vehicles. I used the split() function to parse my data into smaller chunks by bus name (Bus name example: '1367/E0007489'). I used the following line of code:
dfs <- split(sater001_paired, f=sater001_paired[, "vehicleName"])
Where sater001_pairedis my dataframe, and vehicleName is the variable I split with. The # of rows for each chunk is uneven, given that this data was captured real-time.
The problem I'm facing now is attempting to save each of these chunks into their own .csv files. I tried using lapply as such:
lapply(names(dfs), function(x){write.table(dfs[[x]], file = paste("bus", x, sep = ""))})
But R returns en error message "cannot open the connection". It's likely I'm missing something, as I'm very rusty on using the lapply function.
Any suggestions based off this?
MrFlick has helped me realize the issue I was having here.
So just to close this, the Vehicle Names column I had contained a forward slash halfway in each identification code. As Rstudio on windows does not take kindly to these characters, I did not realize this, as I have only recently switched over from primarily Mac OS use.
By using gsub in the following code:
sater001_paired$vehicleName <- gsub('/', '-', sater001_paired$vehicleName)
This issue has now resolved. Thanks again to MrFlick for the help.

Writing results from multiple runs of an R script to a single output csv file

I have written an R script that is to be used as part of a shell script based pipeline which will feed dozens of files containing genetic sequence data to the R script one after the other (using args[]).
I am having trouble finding a way to write the results of each run of this script to a single results file. I thought that the easiest way to do this might be to create an empty results.csv table and then ask the script to write to the next row of this file each time it is run (saves the problem of the script writing straight over the file on each run). In this vein a friend helped me out with the following code:
x<-readLines("results.csv")
if(x[[1]]==""){x[[1]]<-paste("meancoscore", "meanboot", "CIres", "RIres", "RC", "nodecount", sep= ",")}
x[[length(x)+1]]<-paste(meancoscore, meanboot, CIres, RIres, RC, nodecount, sep = ",")
x<-data.frame(x)
write.table(x,"results.csv", row.names = F, col.names = F, sep = ",")
In the above code "meancoscore", "meanboot", "CIres", "RIres", "RC", and "nodecount" are first used as a header if the data frame has nothing on the first row.
Following this the results (objects: meancoscore, meanboot, CIres, RIres, RC and nodecount are written in the columns corresponding with their headers. The idea here is that if you run the R script again with different source files it should simply write the results to the next line in the results.csv file.
However, the following is seen in the results.csv file after three runs of this code with different input files:
"\""\\""meancoscore,meanboot,CIres,RIres,RC,nodecount\\""\""
""\""\\""0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117\\""\""
""\""0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117\""
""0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117"
Where my desired result would be the following:
meancoscore,meanboot,CIres,RIres,RC,nodecount
0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117
0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117
0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117
It is worth noting that each successive fun seems to be adding more backslashes and more quotation marks to the results.csv file.
Ideally I would like to be able to simply read in the results.csv file when it is done and analyse the data by accessing the columns with results$meanboot, or summary(results$meanboot) for example.
Could anyone offer some advice on how to modify the above code or offer an alternative solution?
I should add here that I purposefully did not go for the option of writing into the R script a loop that will run through the input files of interest and simply assemble a full table of results as an object (I am aware that this would be very simple to write out). This was because the work being done by this script will be farmed out to multiple machines in a cluster.
Thank you for your time and any help you might be able to offer.
The problem was solved by adding quote = FALSE to the write.table() call as per voidHead's suspicion.

Resources