Import Large Unusual File To R

First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years; each day has its own file. The files are .txt, consist of approximately 20-30 million rows each, and average, I'd guess, around 360 MB apiece. I am working with one file at a time for now. I don't need all the data these files contain, and I was hoping I could use R to trim the files down a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see each line begins with a letter. Each letter denotes what the line means. For instance R means order book directory message, M means milliseconds after last second, H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. This however seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of if function that says: if the first letter is R, then offsets 1 to 4 mean Market Segment Identifier, and so on, and have R add columns accordingly so I can work with the data in a more structured fashion.
What is the best way to import such data and create some form of structure, e.g. using the unique ID information in each line so that I can analyze one stock at a time?

You can try something like this:
options(stringsAsFactors = FALSE)
# builds one row from an "A" line and appends it to the running table
f_A <- function(line, tab_A) {
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]),
                    name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# empty table with the right column types, to be filled from the "A" lines
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(),
                    name_4 = numeric(), stringsAsFactors = FALSE)
for (i in readLines(con = "/home/data.txt")) {
  # dispatch on the first character of each line
  switch(substr(i, 1, 1),
         M = cat("1\n"), R = cat("2\n"), D = cat("3\n"),
         A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with functions that add values to each type of data frame. Use the pattern of f_A() to build the other functions, and do the same for each table structure.
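For example, a possible handler for the "D" lines shown above could follow the same pattern (the column name order_id and the field position are assumptions, since the message layout isn't specified here):
tab_D <- data.frame(order_id = numeric(), stringsAsFactors = FALSE)
f_D <- function(line, tab_D) {
  values <- unlist(strsplit(line, " "))
  values <- values[values != ""]                     # drop empties caused by repeated spaces
  rbind(tab_D, list(order_id = as.numeric(values[2])))
}
# then, in the switch(): D = (tab_D <- f_D(i, tab_D))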

You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help page for grep():
> ?grep
So you can go through all the lines, check what each line means, and then handle or store the content of the line however you like. (Regular expressions are also useful for splitting the data within one line...)
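For instance, a minimal sketch of that route, assuming whitespace-separated fields (the column names below are placeholders, since the exact field layout isn't specified here):
lines <- readLines("/home/data.txt")
r_lines <- grep("^R ", lines, value = TRUE)          # keep only the "R" (order book directory) lines
fields <- strsplit(r_lines, "\\s+")                  # split each kept line on runs of whitespace
tab_R <- do.call(rbind, lapply(fields, function(x)
  data.frame(field_1 = x[2], field_2 = x[3], field_3 = x[4], stringsAsFactors = FALSE)))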

Related

summary of row of numbers in R

I just hope to learn how to make a simple statistical summary of the random numbers from rows 1 to 5 in R (as shown in the picture).
And then assign these rows to a single variable.
Hope you can help!
When you type something like 3 on a single line and ask R to "run" it, it doesn't store that anywhere -- it just evaluates it, meaning that it tries to make sense out of whatever you've typed (such as 3, or 2+1, or sqrt(9), all of which would return the same value) and then it more or less evaporates. You can think of your lines 1 through 5 as behaving like you've used a handheld scientific calculator; once you type something like 300 / 100 into such a calculator, it just shows you a 3, and then after you have executed another computation, that 3 is more or less permanently gone.
To do something with your data, you need to do one of two things: either store it into your environment somehow, or to "pipe" your data directly into a useful function.
In your question, you used this script:
1
3
2
7
6
summary()
I don't think it's possible to repair this strategy in the way that you're hoping -- and if it is possible, it's not quite the "right" approach. By typing the numbers on individual lines, you've structured them so that they'll evaluate individually and then evaporate. In order to run the summary() function on those numbers, you will need to bind them together inside a single vector somehow, then feed that vector into summary(). The "store it" approach would be
my_vector <- c(1, 3, 7, 2, 6)
summary(my_vector)
The important part isn't actually the parentheses; it's the function c(), which stands for concatenate and instructs R to treat those 5 numbers as a single collective object (i.e. a vector). We then pass that single object into summary().
To use the "piping" approach and avoid having to store something in the environment, you can do this instead (requires R 4.1.0+):
c(1, 3, 7, 2, 6) |> summary()
Note again that the use of c() is required, because we need to bind the five numbers together first. If you have an older version of R, you can get a slightly different pipe operator from the magrittr library instead that will work the same way. The point is that this "binding" part of the process is an essential part that can't be skipped.
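For example, the magrittr version would look like this:
library(magrittr)
c(1, 3, 7, 2, 6) %>% summary()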
Now, the crux of your question: presumably, your data doesn't really look like the example you used. Most likely, it's in some separate .csv file or something like that; if not, hopefully it is easy to get it into that format. Assuming this is true, this means that R will actually be able to do the heavy lifting for you in terms of formatting your data.
As a very simple example, let's say I have a plain text file, my_example.txt, whose contents are
1
3
7
2
6
In this case, I can ask R to parse this file for me. Assuming you're using RStudio, the simplest way to do this is to use the File -> Import Dataset part of the GUI. There are various options dealing with things such as headers, separators, and so forth, but I can't say much meaningful about what you'd need to do there without seeing your actual dataset.
When I import that file, I notice that it does two things in my R console:
my_example <- read.table(...)
View(my_example)
The first line stores an object (called a "data frame" in this case) in my environment; the second shows a nice view of how it's rendered. To get the summary I wanted, I just need to extract the vector of numbers I want, which I see from the view is called V1, which I can do with summary(my_example$V1).
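Written out as code, a minimal sketch of the same thing (assuming my_example.txt sits in your working directory) is:
my_example <- read.table("my_example.txt")   # one unnamed column, so it gets the default name V1
summary(my_example$V1)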
This example is probably not helpful for your actual data set, because there are so many variations on the theme here, but the theme itself is important: point R at a file, ask it to render an object, then work with that object. That's the approach I'd recommend instead of typing data as lines within an R script, as it's much faster and less error-prone.
Hopefully this will get you pointed in the right direction in terms of getting your data into R and working with it.

DADA2 - Calculating percent of reads that merged

I have been following the tutorial for DADA2 in R for a 16S data-set, and everything runs smoothly; however, I do have a question on how to calculate the total percent of merged reads. After the step to track reads through the pipeline with the following code:
merger <- mergePairs(dadaF1, derepF1, dadaR1, derepR1, verbose=TRUE)
and then tracking the reads through each step:
getN <- function(x) sum(getUniques(x))
track <- cbind(out_2, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
I get a table that looks like this (here I am viewing the resulting track data frame made with the above code), where input is the total number of sequences I put in (after demuxing) and filtered is the total number of sequences left after filtering on parameters of my choosing. The denoisedF and denoisedR columns hold the sequences that have been denoised (one for forward reads, one for reverse reads), merged is the total number of merged reads (from the mergePairs command above), and nonchim is the total number of sequences that are not chimeras.
My question is this: to calculate the percent of merged reads, is it a simple division? Say, taking the first row: (417/908) * 100 = 46%? Or should I somehow incorporate the denoisedF and denoisedR columns in this calculation?
Thank you very much in advance!
The track object is a matrix (see class(track)), thus you can run operations accordingly. In your case:
track[, "merged"]/track[, "input"] * 100
Or, you could convert the track object into a data frame for a "table" output.
However, I usually export the track output as an Excel file and then do my modifications there. It is easier to share and comment on with non-R users.
I find the write_xlsx function from the writexl package particularly convenient.
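A small sketch of that workflow, assuming the track columns are named as in the standard DADA2 tutorial (the output file name is just an example):
library(writexl)
track_df <- as.data.frame(track)
track_df$percent_merged <- track_df$merged / track_df$input * 100   # optional extra column
write_xlsx(track_df, "track_summary.xlsx")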
Cheers

R - Writing data to CSV in a loop

I have a loop that is going through a list of variables (in a csv), accessing a database and extracting the relevant data. It does this for 4 different time periods (which depend on the variables).
I am trying to get R to write this data to a csv, but currently I can only get it to store the data for the last variable, in 4 different csv files, as it overwrites the previous variable each time.
I'd like it to have all of the data for these variables for one time period all in the same file/sheet. (So either 4 sheets or 4 csv files with all of the data on them) This is because I need to do some data manipulation on the variables before I feed them into the next loop of the script.
I'd like it to be something like this, but need 4 separate sheets/files so I can cover each time period.
date/time | var1 | var2 | ... | varn
I would post the code, but even only posting the relevant loop and none of the surrounding code would be ~150 lines. I am not familiar with R (I can follow the script but struggle writing my own), I inherited this project and don't have long to work on it.
Note: each variable is recorded at a different frequency - some will only have one data point an hour, others one every minute, so will need to match these up based on time recorded (to the nearest minute).
EDIT: I hope I've explained this clearly enough
Four different .csv files would be easiest, because you could do something like the following in your loop:
outfile.name <- paste('Sales', year.of.data, '.csv', sep='')                         # e.g. "Sales2001.csv"
write.csv(period.data, file=file.path(out.filepath, outfile.name), row.names=FALSE)  # period.data stands in for the data frame your loop builds for the current period
You could also append the data into one data frame inside the loop and then export it all at once into a single file, as sketched below. You won't be able to export to multiple sheets with a .csv, though, because the CSV format has no concept of sheets.
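A hedged sketch of that append-then-write pattern (all.data and period.data are placeholder names standing in for whatever your loop actually builds):
all.data <- data.frame()
for (period in 1:4) {
  # period.data stands in for the data frame your existing loop extracts for this
  # time period; it's faked here so the sketch runs on its own
  period.data <- data.frame(period = period, value = rnorm(3))
  all.data <- rbind(all.data, period.data)
}
write.csv(all.data, "all_periods.csv", row.names = FALSE)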

What format for external file so that R can read a list of lists?

I have a haskell program that produces a text file, which is then read by R. My current solution is working, but I am asking if there is a better solution and whether it is worth changing the current approach.
Currently my haskell program produces the following output (simplified example):
mylist <- list(
list(c("b"),c("b","E"),c("b","E","P"),c("b","T"),c("b","P","T"),c("b","E","T"),c("b","E","P","T"))
, list(c("b"),c("b","T"),c("b","N"),c("b","E"),c("b","E","T"),c("b","N","T"),c("b","N","E"),c("b","N","E","T"))
, list(c("b","N"),c("b","E","N"),c("b","N","T"),c("b","E","N","T"))
)
myListNames <- c("Name1","Name2","Name3")
This output is saved to a text file that is simply sourced from within R. I then access the two variables mylist and myListNames.
The data: I am generating 9 text files. Each list entry represents a feature; there are at most 120 different features, and a name can be 20 characters long. Please note that features here have nothing to do with statistics. In the dummy example, b would be 20 characters long in the real-world data. Each sublist is about 5 to 45 elements long, but an outlier might have 500,000 list entries.
The current approach works reasonably well. But is there another way to store a list of lists as a text file that might be better suited?
I used the approach that was suggested by Ricardo Saporta. It worked like a charm and I used the R library RJSONIO for JSON parsing in R.
Many thanks to Ricardo Saporta!
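For reference, a minimal sketch of what the R side of that JSON route can look like (the file name and the names vector are just examples):
library(RJSONIO)
mylist <- fromJSON("mylist.json")             # nested JSON arrays come back as nested R lists
names(mylist) <- c("Name1", "Name2", "Name3")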

csv file with multiple time-series

I've imported a csv file with lots of columns and sections of data.
v <- read.csv2("200109.csv", header=TRUE, sep=",", skip=6, na.strings=c(""))
The layout of the file is something like this:
Dataset1
time, data, .....
0 0
0 <NA>
0 0
Dataset2
time, data, .....
00:00 0
0 <NA>
0 0
(The headers of the different datasets are exactly the same.)
Now, I can plot the first dataset with:
plot(as.numeric(as.character(v$Calls.served.by.agent[1:30])), type="l")
I am curious if there is a better way to:
Get all the numbers read as numbers, without having to convert.
Address the different datasets in the file, in some meaningfull way.
Any hints would be appreciated. Thank you.
Status update:
I haven't really found a good solution in R yet, but I've started writing a script in Lua to separate each individual time series into a separate file. I'm leaving this open for now, because I'm curious how well R will deal with all these files. I'll get 8 files per day.
What I personally would do is to make a script in some scripting language to separate the different data sets before the file is read into R, and possibly do some of the necessary data conversions, too.
If you want to do the splitting in R, look up readLines and scan – read.csv2 is too high-level and is meant for reading a single data frame. You could write the different data sets into different files, or if you are ambitious, cook up file-like R objects that are usable with read.csv2 and read from the correct parts of the underlying big file.
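A hedged sketch of the splitting idea, assuming each block is introduced by a line that starts with "Dataset":
lines <- readLines("200109.csv")
starts <- grep("^Dataset", lines)                    # lines that begin a new block
ends <- c(starts[-1] - 1, length(lines))
for (k in seq_along(starts)) {
  block <- lines[(starts[k] + 1):ends[k]]            # everything after the "DatasetN" title line
  writeLines(block, sprintf("dataset_%02d.csv", k))
  # alternatively, read.csv2(textConnection(block), ...) avoids writing temporary files
}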
Once you have dealt with separating the data sets into different files, use read.csv2 on those (or whichever read.table variant is best – if those are not tabs but fixed-width fields, see read.fwf). If <NA> indicates "not available" in your file, be sure to specify it as part of na.strings. If you don't do that, R thinks you have non-numeric data in that field, but with the right na.strings, you automatically get the field converted into numbers. It seems that one of your fields can include time stamps like 00:00, so you need to use colClasses and specify a class to which your time stamp format can be converted. If the built-in Date class doesn't work, just define your own timestamp class and an as.timestamp function that does the conversion.
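And a sketch of the colClasses idea, assuming the split files have columns named time and data and that time stamps look like 00:00 (the class name myTime is made up; the setClass/setAs pattern is the one shown in ?read.table):
setClass("myTime")
setAs("character", "myTime",
      function(from) as.difftime(from, format = "%H:%M", units = "mins"))
d2 <- read.csv("dataset_02.csv",
               na.strings = c("", "<NA>"),
               colClasses = c(time = "myTime", data = "numeric"))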
