How to handle and normalize a dataframe with billions of rows? - r

I need to analyze a dataframe in R (or bash, or even Python if you have suggestions, although I don't know Python well). The dataframe has approximately 6 billion rows and 8 columns (control1, treaty1, control2, treaty2, control3, treaty3, control4, treaty4).
Since the file is almost 300 GB and 6 billion lines, I cannot open it in R.
I need to read the file line by line and remove every line that contains even a single 0.
How can I do this?
If I also needed to divide each value in a column by a number, and put the result in a new dataframe with the same structure as the original, how could I do that?
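One way to approach this without loading everything into memory is to stream the file in fixed-size chunks, filter each chunk, and write the result to a new file. Below is a minimal base-R sketch; the file names, the tab delimiter, the presence of a header row, the assumption that all 8 columns are numeric, and the divisor of 10 are all placeholders for illustration.
con_in  <- file("big_file.tsv", open = "r")
con_out <- file("filtered_file.tsv", open = "w")
header <- readLines(con_in, n = 1)          # copy the header line through unchanged
writeLines(header, con_out)
repeat {
  lines <- readLines(con_in, n = 1e6)       # read one million lines at a time
  if (length(lines) == 0) break
  chunk <- read.table(text = lines, sep = "\t")
  keep  <- rowSums(chunk == 0) == 0         # drop rows containing any 0
  chunk <- chunk[keep, , drop = FALSE]
  chunk <- chunk / 10                       # e.g. divide every value by 10 (placeholder divisor)
  write.table(chunk, con_out, sep = "\t", row.names = FALSE, col.names = FALSE)
}
close(con_in)
close(con_out)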

Related

Querying out of memory 60gb tsv's in R on the first column, which database/method?

I have 6 large TSV matrices of 60 GB each (uncompressed), containing 20 million rows x 501 columns: the first is an index/integer column that is basically the row number (so not strictly necessary), and the other 500 columns are numerical (float, 4 decimals, e.g. 1.0301). All the TSVs have the same number of rows, and the rows correspond to each other.
I need to extract rows by row number.
I only need to extract up to 5,000 contiguous rows or up to 500 non-contiguous rows at a time, so not millions. Ideally there would also be some kind of compression to reduce the 60 GB size, so maybe not SQL? What would be the best way to do this?
One method I tried is to separate them into 100 gzipped files, index them using tabix, and then query them, but this is too slow for my needs (500 random rows took 90 seconds).
I read about the ff package, but I have not found out how to index by the first column.
Are there other ways?
Thanks so much.
I would use fread() from the data.table package.
With the parameters skip and nrows you can control the starting line to read (skip) and the number of rows to read (nrows).
If you want to explore a tidyverse approach, I recommend this solution: R: Read in random rows from file using fread or equivalent?
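For example, to pull a contiguous block of 5,000 rows (a sketch; the file name and starting position are placeholders):
library(data.table)
block <- fread("matrix1.tsv",
               skip   = 1000000,   # number of lines to skip from the top of the file
               nrows  = 5000,      # number of rows to read
               header = FALSE)     # no header at this point in the file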

How do I export a custom list of numbers and letters to Excel from R?

To help with some regular label-making and printing I need to do, I am looking to write a script that lets me enter a range of sequential numbers (some with string identifiers) that I can export to Excel in a specific format. For example, if I entered the range '1:16', I am looking for output in Excel exactly like this:
Example Excel Output
For each unique sequential number (i.e., 1 to 16), the first five rows must be labeled with a 'U', the next three rows with an 'F', and the last two rows must be the number alone. The final exported matrix will be n columns x 21 rows, where n will vary depending on the number range I enter.
My main problem is in writing to Excel. I can't figure out how to customize this output and write to specific rows and columns as in the example above. I am limited to 'openxlsx' since I work on a corporate secure workstation. Here is what I have so far:
Example Code
Any help you may have would be much appreciated; thanks in advance!
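As a starting point for the writing-to-Excel part, here is a minimal openxlsx sketch that builds one column of labels per number and places it at an explicit column position with writeData(); the 10-row label layout (five 'U' rows, three 'F' rows, two plain rows) and the output file name are assumptions for illustration:
library(openxlsx)
make_labels <- function(n) {
  c(rep(paste0("U", n), 5),    # first five rows labeled with 'U'
    rep(paste0("F", n), 3),    # next three rows labeled with 'F'
    rep(as.character(n), 2))   # last two rows are the number alone
}
wb <- createWorkbook()
addWorksheet(wb, "labels")
for (n in 1:16) {
  # startCol/startRow let you write each block to a specific position on the sheet
  writeData(wb, "labels", make_labels(n), startCol = n, startRow = 1, colNames = FALSE)
}
saveWorkbook(wb, "labels.xlsx", overwrite = TRUE)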

Creating a histogram in R with random numbers [1-5] from a .csv file

I'm new to R, but doing my best.
I'm trying to create a histogram from data in a .csv file. Just imagine one column with 10,000 random numbers ranging from 1 to 5. I want to create a histogram that shows how many times 1 occurs, how many times 2 occurs, how many times 3 occurs, and so on up to 5.
Is this possible in any way? Or should I do this in Excel and then bring the results into R to create the histogram? I don't seem to get any wiser from the video tutorials so far or from the other questions asked on here.
Import data from csv into R first:
dat = read.csv("c:\\documents\\file.csv")
Assuming you have a column called "col" in your csv file that has your data, run this:
hist(dat$col)
If you need to know how many times each value occurs, a more precise way is to make a table:
table(dat$col)
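If you then want a plot of those exact counts (one bar per distinct value rather than hist()'s automatic bins), a small sketch:
counts <- table(dat$col)                        # how often each value 1-5 occurs
barplot(counts, xlab = "value", ylab = "count")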

Optimizing File reading in R

My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the names of the genes (3 or 4 at a time), and based on the user input the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info such as the gene name), and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row, then read the 35,000 columns of data. Then I do this again for the 2nd and 3rd gene (up to 4 genes max) and then process the numerical results.
The file reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will accelerate reading operations in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
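Used in context, it might look like this (a sketch; the file name and skip value are placeholders, and the counts in colClasses should be adjusted to match the actual number of columns in your file):
gene_row <- read.table("genes.txt",
                       header = FALSE,
                       skip = 10000,    # jump to the row for the gene of interest
                       nrows = 1,       # read just that one row
                       colClasses = c(rep("character", 2), rep("numeric", 34998)))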
This would be more efficient if you used a database interface. There are several available via the RODBC package, but a particularly well-integrated-with-R option would be the sqldf package, which by default uses SQLite. You would then be able to use the indexing capability of the database to look up the correct rows and read all the columns in one operation.

Import Large Unusual File To R

First time poster here, so I'll try and make myself as clear as possible on the help I need. I'm fairly new to R, and this is my first real independent programming experience.
I have stock tick data for about 2.5 years, and each day has its own file. The files are .txt and consist of approximately 20-30 million rows, averaging I guess 360 MB each. I am working on one file at a time for now. I don't need all the data these files contain, and I was hoping I could use the programming to trim my files down a bit.
Now my problem is that I am having some difficulties with writing the proper code so R understands what I need it to do.
Let me first show you some of the data so you can get an idea of the formatting.
M977
R 64266NRE1VEW107 FI0009653869 2EURXHEL 630 1
R 64516SSA0B 80SHB SE0002798108 8SEKXSTO 40 1
R 645730BBREEW750 FR0010734145 8EURXHEL 640 1
R 64655OXS1C 900SWE SE0002800136 8SEKXSTO 40 1
R 64663OXS1P 450SWE SE0002800219 8SEKXSTO 40 1
R 64801SSIEGV LU0362355355 11EURXCSE 160 1
M978
Another snip of data:
M732
D 3547742
A 3551497B 200000 67110 02800
D 3550806
D 3547743
A 3551498S 250000 69228 09900
So as you can see, each line begins with a letter, and each letter denotes what the line means. For instance, R means order book directory message, M means milliseconds after the last second, and H means stock trading action message. There are 14 different letters used in total.
I have used the readLines function to import the data into R. However, this seems to take a very long time for R to process when I want to work with the data.
Now I would like to write some sort of if function that says: if the first letter is R, then offsets 1 to 4 mean Market Segment Identifier, and so on, and have R add columns for these so I can work with the data in a more structured fashion.
What is the best way of importing such data and creating some form of structure, i.e. using the unique ID information in each line to analyze one stock at a time, for instance?
You can try something like this:
options(stringsAsFactors = FALSE)
# Parse an "A" line: split on spaces, keep fields 2 to 5, and append them to the accumulating data.frame
f_A <- function(line, tab_A){
  values <- unlist(strsplit(line, " "))[2:5]
  rbind(tab_A, list(name_1 = as.character(values[1]), name_2 = as.numeric(values[2]), name_3 = as.numeric(values[3]), name_4 = as.numeric(values[4])))
}
# Empty data.frame that will collect the parsed "A" lines
tab_A <- data.frame(name_1 = character(), name_2 = numeric(), name_3 = numeric(), name_4 = numeric(), stringsAsFactors = FALSE)
# Dispatch each line on its first character
for(i in readLines(con = "/home/data.txt")){
  switch(strsplit(x = i, split = "")[[1]][1], M = cat("1\n"), R = cat("2\n"), D = cat("3\n"), A = (tab_A <- f_A(i, tab_A)))
}
Then replace the cat() calls with different functions that add values to each type of data.frame. Use the pattern of the function f_A() to construct the other functions, and do the same for the other table structures.
You can combine your readLines() command with regular expressions. To get more information about regular expressions, look at the R help page for grep():
> ?grep
So you can go through all the lines, check what each line means, and then handle or store the content of the line however you like. (Regular expressions are also useful for splitting the data within one line.)
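For instance, a small sketch that keeps only the order book directory messages ("R" lines) and splits them into columns (the file path is a placeholder):
lines <- readLines("/home/data.txt")
r_lines <- grep("^R ", lines, value = TRUE)                 # keep only lines starting with "R "
r_tab <- read.table(text = r_lines, header = FALSE,
                    fill = TRUE, stringsAsFactors = FALSE)  # split each line on whitespace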
