Read specific lines of a CSV file in R

I am trying to write code that manipulates data from a particular .csv and writes the data to another one.
I want to read each line one by one and perform the operation.
Also, when I try to read a particular line from the .csv, what I get is that line plus all the lines before it.
I am a beginner in R, so I find the syntax a bit confusing.
testconn<=file("<path>")
num<-(length(readLines(testconn)))
for(i in 1:num){
num1=i-1
los<=read.table(file="<path>",sep=",",head=FALSE,skip=num1,nrows=1)[,c(col1,col2)]
write.table(los,"<path>",row.names=FALSE,quote=FALSE,sep=",",col.names=FALSE,append=TRUE)
}
This is the code I am currently using. Though it gives the desired output, it is extremely slow; my .csv data file has 43,200 lines.

Your code doesn't work: you confuse the comparison operator <= with the assignment operator <-.
Your code is also extremely inefficient: you call both read.table and write.table 43,200 times to read/write a single file.
You can simply do this:
los <- read.table(file="<path>", sep=",")[, c(col1, col2)]
res <- apply(los, 1, function(x){ ## you treat your line here })
write.table(res, "<path_write>", row.names=FALSE,
            quote=FALSE, sep=",", col.names=FALSE)
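For example, a minimal sketch where the per-row operation is just summing the two selected columns (col1 and col2 are placeholders for your real column indices; swap in whatever manipulation you actually need):
los <- read.table(file = "<path>", sep = ",", header = FALSE)[, c(col1, col2)]
res <- apply(los, 1, function(x) sum(as.numeric(x)))  # placeholder operation: row sums
write.table(res, "<path_write>", row.names = FALSE,
            quote = FALSE, sep = ",", col.names = FALSE)
Reading the file once and writing once replaces 86,400 separate read/write calls with two, which is where the speed-up comes from.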

Related

Looping through nc files in R

Good morning everyone,
I am currently using the code written by Antonio Olinto Avila-da-Silva on this link: https://oceancolor.gsfc.nasa.gov/forum/oceancolor/topic_show.pl?tid=5954
It allows me to extract sst/chlor_a data from .nc files. It uses a loop to create an Excel file with all the data. Unfortunately, I noticed that the function only reads the first data file in the loop, so I end up with the same data repeated 20 times in a row in my Excel file.
Does anyone have a solution to make this loop work properly?
I would first check out that these two lines contain all the files you are expecting:
(f <- list.files(".", pattern="*.L3m_MO_SST_sst_9km.nc",full.names=F))
(lf<-length(f))
And then there's a bug in the for-loop. This line:
data<-nc_open(f)
Needs to reference the iterator i, so change it to something like this:
data<-nc_open(f[[i]])
It appears both scripts have this same bug.
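For reference, a minimal sketch of the corrected loop (this assumes the ncdf4 package and a variable named "sst" in the files; adapt the variable name and the processing to your own script):
library(ncdf4)
(f <- list.files(".", pattern="*.L3m_MO_SST_sst_9km.nc", full.names=F))
(lf <- length(f))
for (i in 1:lf) {
  data <- nc_open(f[[i]])        # reference the iterator, not the whole vector
  sst <- ncvar_get(data, "sst")  # assumed variable name; check names(data$var)
  nc_close(data)
  # ... accumulate or process sst here ...
}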

Why can I only read one .json file at a time?

I have 500+ .json files that I am trying to get a specific element out of. I cannot figure out why I cannot read more than one at a time.
This works:
library(jsonlite)
files <- list.files('~/JSON')
file1 <- fromJSON(readLines('~/JSON/file1.json'), flatten=TRUE)
result <- as.data.frame(file1$element$subdata$data)
However, regardless of using different json packages (eg RJSONIO), I cannot apply this to the entire contents of files. The error I continue to get is...
attempt to run same code as function over all contents in file list
for (i in files) {
fromJSON(readLines(i),flatten = TRUE)
as.data.frame(i)$element$subdata$data}
My goal is to loop through all 500+ files and extract the data and its contents. Specifically, if the file has the element 'subdata$data', I want to extract the list and put them all in a dataframe.
Note: files are being read as ASCII (Windows OS). This does not have a negative effect on single extractions, but for the loop I get an 'invalid character bytes' error.
Update 1/25/2019
Ran the following but returned errors...
files<-list.files('~/JSON')
out<-lapply(files,function (fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in file(i): object 'i' not found
Also updated the function, this time with UTF-8 errors...
files<-list.files('~/JSON')
out<-lapply(files,function (i,fn) {
o<-fromJSON(file(i),flatten=TRUE)
as.data.frame(i)$element$subdata$data
})
Error in parse_con(txt,bigint_as_char):
lexical error: invalid bytes in UTF8 string. (right here)------^
Latest Update
Think I found a solution to the crazy 'bytes' problem. When I run readLines on the .json file, I can then apply fromJSON(),
e.g.
json<-readLines('~/JSON')
jsonread<-fromJSON(json)
jsondf<-as.data.frame(jsonread$element$subdata$data)
#returns a dataframe with the correct information
Problem is, I cannot apply readLines to all the files within the JSON folder (PATH). If I can get help with that, I think I can run...
files<-list.files('~/JSON')
for (i in files){
a<-readLines(i)
o<-fromJSON(file(a),flatten=TRUE)
as.data.frame(i)$element$subdata}
Needed Steps
1. Apply readLines to all 500 .json files in the JSON folder.
2. Apply fromJSON to the files from step 1.
3. Create a data.frame that returns entries if the list (fromJSON) contains $element$subdata$data.
Thoughts?
Solution (Workaround?)
Unfortunately, fromJSON still runs into trouble with the .json files. My guess is that my GET method (httr) does not wait/delay and load the 'pretty print', and thus grabs the raw .json, which in turn contains odd characters and produces the ubiquitous '------^' error. Nevertheless, I was able to put together a solution; please see below. I am posting it for future folks who may have the same problem with .json files not working nicely with any R json package.
#keeping the same 'files' variable as earlier
raw_data<-lapply(files,readLines)
dat<-do.call(rbind,raw_data)
dat2<-as.data.frame(dat,stringsAsFactors=FALSE)
#check to see json contents were read-in
dat2[1,1]
library(tidyr)
dat3<-separate_rows(dat2,sep='')
x<-unlist(raw_data)
x<-gsub('[[:punct:]]', ' ',x)
#Identify elements wanted in original .json and apply regex
y<-regmatches(x,regexec('.*SubElement2 *(.*?) *Text.*',x))
for loops never return anything, so you must save all valuable data yourself.
You call as.data.frame(i), which creates a frame with exactly one element, the filename; that is probably not what you want to keep.
(Minor) Use fromJSON(file(i),...).
Since you want to capture these into one frame, I suggest something along the lines of:
out <- lapply(files, function(fn) {
o <- fromJSON(file(fn), flatten = TRUE)
as.data.frame(o)$element$subdata$data
})
allout <- do.call(rbind.data.frame, out)
### alternatives:
allout <- dplyr::bind_rows(out)
allout <- data.table::rbindlist(out)
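If some of the 500+ files turn out not to contain element$subdata$data, a hedged variant of the same idea (same files list, same assumed structure) is to drop the empty results before binding:
out <- lapply(files, function(fn) {
  o <- fromJSON(file(fn), flatten = TRUE)
  d <- o$element$subdata$data
  if (is.null(d)) NULL else as.data.frame(d)
})
out <- Filter(Negate(is.null), out)   # keep only files that actually had the element
allout <- do.call(rbind.data.frame, out)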

Remove everything after an empty row in a list in R

I have a quick question that I cannot figure out. I am reading some results from an output file using the code below and storing them as a list in R, as can be seen in the picture. I want to delete all of the information after an empty row; in other words, everything after line 42:
Does anybody know anything that I could use? I tried using gsub but I was not very successful.
Thanks for all of the help I am new to programming in R. Again any help is very much appreciated.
LoadFFA <- function(filename, folder.out, TYPE = "PeakFQ_17C",
                    colStandard = TRUE){ # standardize column output names
  require(data.table)
  if(grepl("PEAKFQSA", TYPE)){ # PeakfqSA Bulletin 17C analysis
    text.list <- lapply(fileinput, readLines)
    skip.rows <- sapply(text.list, grep, pattern = '^Ann. Exc. Prob.\\s+EMA Est.') - 1
    PFA <- lapply(seq_along(text.list), function(i)
      read.delim(fileinput[i], skip = skip.rows[i], sep = "\n",
                 stringsAsFactors = TRUE, blank.lines.skip = FALSE))
  }
EDIT
I don't know if I can upload the file directly, so here is the Google Drive link.
Also, here is the command to run the function: LoadFFA("03606500peaks.out", "D:/Documents/hydraulic.failures", "PEAKFQSA"). The screenshot is the result of print(PFA).
The reason why I am using a loop is that I am reading multiple output files with a lot of data and different lengths. I start reading the data at Ann. Exc. Prob. and, as per the screenshot provided, I would like to end after line 42 (after a full empty row). I hope that clears some confusion.
Basically: read the output files, start reading at "Ann. Exc. Prob." and stop at the end of that data block (line 42 for this particular file). I am using a function because I am running it several times.
Again, sorry for the trouble. Thank you for your time and I appreciate your patience.
https://drive.google.com/file/d/1PGbGWIHFj7IQRevTAEfqqA9Okg4fz7Mg/view?usp=sharing
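One possible approach, sketched under the assumption that text.list holds the raw lines from readLines as in the function above: find the header row, find the first empty row after it, and only parse the block in between.
PFA <- lapply(seq_along(text.list), function(i) {
  lines <- text.list[[i]]
  start <- grep('^Ann. Exc. Prob.\\s+EMA Est.', lines)[1]   # header row
  blanks <- which(grepl('^\\s*$', lines))                   # empty rows
  end <- blanks[blanks > start][1] - 1                      # last line before the empty row
  if (is.na(end)) end <- length(lines)                      # no empty row found
  read.delim(text = paste(lines[start:end], collapse = "\n"),
             sep = "\n", stringsAsFactors = TRUE, blank.lines.skip = FALSE)
})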

Writing results from multiple runs of an R script to a single output csv file

I have written an R script that is to be used as part of a shell script based pipeline which will feed dozens of files containing genetic sequence data to the R script one after the other (using args[]).
I am having trouble finding a way to write the results of each run of this script to a single results file. I thought that the easiest way to do this might be to create an empty results.csv table and then ask the script to write to the next row of this file each time it is run (saves the problem of the script writing straight over the file on each run). In this vein a friend helped me out with the following code:
x<-readLines("results.csv")
if(x[[1]]==""){x[[1]]<-paste("meancoscore", "meanboot", "CIres", "RIres", "RC", "nodecount", sep= ",")}
x[[length(x)+1]]<-paste(meancoscore, meanboot, CIres, RIres, RC, nodecount, sep = ",")
x<-data.frame(x)
write.table(x,"results.csv", row.names = F, col.names = F, sep = ",")
In the above code "meancoscore", "meanboot", "CIres", "RIres", "RC", and "nodecount" are first used as a header if the data frame has nothing on the first row.
Following this, the results (the objects meancoscore, meanboot, CIres, RIres, RC and nodecount) are written in the columns corresponding to their headers. The idea here is that if you run the R script again with different source files, it should simply write the results to the next line in the results.csv file.
However, the following is seen in the results.csv file after three runs of this code with different input files:
"\""\\""meancoscore,meanboot,CIres,RIres,RC,nodecount\\""\""
""\""\\""0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117\\""\""
""\""0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117\""
""0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117"
Where my desired result would be the following:
meancoscore,meanboot,CIres,RIres,RC,nodecount
0.000,76.3247863247863,0.721002252252252,0.983235214508053,0.708914804154032,117
0.845,77.6923076923077,0.723259762308998,0.983410513459875,0.711261254217159,117
0.85,77.4358974358974,0.728886344116805,0.983878381369061,0.717135516451654,117
It is worth noting that each successive run seems to be adding more backslashes and more quotation marks to the results.csv file.
Ideally I would like to be able to simply read in the results.csv file when it is done and analyse the data by accessing the columns with results$meanboot, or summary(results$meanboot) for example.
Could anyone offer some advice on how to modify the above code or offer an alternative solution?
I should add here that I purposefully did not go for the option of writing into the R script a loop that will run through the input files of interest and simply assemble a full table of results as an object (I am aware that this would be very simple to write out). This was because the work being done by this script will be farmed out to multiple machines in a cluster.
Thank you for your time and any help you might be able to offer.
The problem was solved by adding quote = FALSE to the write.table() call as per voidHead's suspicion.
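For completeness, a sketch of the corrected version (same logic as above, assuming the result objects exist in the script's environment; quote = FALSE is the fix, and a guard for an empty file is added):
x <- readLines("results.csv")
if (length(x) == 0 || x[[1]] == "") {
  x <- paste("meancoscore", "meanboot", "CIres", "RIres", "RC", "nodecount", sep = ",")
}
x[length(x) + 1] <- paste(meancoscore, meanboot, CIres, RIres, RC, nodecount, sep = ",")
# quote = FALSE stops write.table from adding the quotes and backslashes
# that accumulate on each run
write.table(data.frame(x), "results.csv",
            row.names = FALSE, col.names = FALSE, sep = ",", quote = FALSE)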

Append new data to an existing dataframe (RDS) in R

I have an Rscript that is reading in a constant stream of data in the form of a flat file. Another script picks up this flat file, does some parsing and processing, then saves the result as a data.frame in RDS format. It then sleeps, and repeats the process.
saveRDS(tmp.df, file="H:/Documents/tweet.df.rds") #saving the data.frame
On the second... nth iteration, I have the code only process the new lines added to the flat file since the previous iteration. However, in order to append the delta lines to the permanent data frame, I have to read it in, append, and then save it back out, overwriting the original.
df2 <- readRDS("H:/Documents/tweet.df.rds") #read in permanent
tmp.df2 <- rbind(df2, tmp.df) #append new to existing
saveRDS(tmp.df2, file="H:/Documents/tweet.df.rds") #save it
rm(df2) #housecleaning
rm(tmp.df2) #housecleaning
This approach is risky because whenever the RDS is open for reading/writing, another process wanting to touch that file has to wait. As the base file gets bigger, the risk increases.
Is there something like an appendRDS (I know literally there isn't) that can achieve what I want- iterative updating of a single data frame- saved to a file- that uses appending rather than complete replacement?
I think you can safeguard your process by using connections, opening and closing it before the next process takes over.
con <- file("tmp.rds")
open(con)
df <- readRDS(con)
df.new <- rbind(df,df)
saveRDS(df.new, con)
close(con)
Update:
You can test if a connection to the file is open and tell it to wait for a bit if you're having problems with concurrency.
while (isOpen(con)) { # untested but something of this nature should work
  Sys.sleep(2)
}
Is there anything wrong with using a series of numbered RDS files in a directory instead of a single RDS file? I don't think it is possible to append to a data frame in an RDS file without rewriting the entire file, since data frames are simply lists of columns; presumably they are serialized one column at a time, so only the last column ends near the end of the file.
If you want to stick with a single file but minimize the risk of reading inconsistent data from a RDS file, you can read it in, do the append operation, and then write it out to a temp file and rename the temp file to the original name once it is finished. Then at least your period of risk is not dependent on the size of the file. I'm not familiar with what kind of atomicity is guaranteed by various filesystems when renaming a file to an existing name, but it's probably better than the time taken by saveRDS.
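A sketch of that temp-file-and-rename idea, reusing the paths from the question (the .tmp suffix is just an illustrative choice):
rds_path <- "H:/Documents/tweet.df.rds"
tmp_path <- paste0(rds_path, ".tmp")
df <- readRDS(rds_path)             # read the existing frame
df.new <- rbind(df, tmp.df)         # append the new rows from this iteration
saveRDS(df.new, file = tmp_path)    # write to a temporary file first
file.rename(tmp_path, rds_path)     # swap it in; the risk window no longer grows with file size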
