R looping through 20 million rows

I have a .txt file called Sales_2015 which holds almost 1 GB of data. The file has the following columns:
AREA|WEEKNUMBER|ITEM|STORE_NO|SALES|UNITS_SOLD
10GUD| W01_2015 |0345| 023234 |1200 | 12
The file's colClasses is: c(rep("character",4), rep("numeric",2))
What I want to do is split the 1 GB file into pieces so it becomes faster to read. The number of .txt files I end up with should be defined by the number of AREAs I have (which is the first column).
So I have the following variables:
Sales <- read.table(paste(RUTAC, "/Sales_2015.txt", sep=""), sep="|", header=TRUE, quote="", comment.char="", colClasses=c(rep("character",4), rep("numeric",2)))
Areas <- c("10GUD","10CLJ","10DZV",..................) # There are 52 elements
I want to end up with 52 .txt files whose names are, for instance:
2015_10GUD.txt (which will only include the entire rows from the 1 GB file whose AREA column contains 10GUD)
2015_10CLJ.txt (which will only include the entire rows whose AREA column contains 10CLJ)
I know this question is very similar to others, but the difference is that I am working with up to 20 million rows... Can anybody help me get this done with some sort of loop, such as repeat, or something else?

No need to use a loop. The simplest and fastest way to do this is probably with data.table. I strongly recommend you use the development version, data.table 1.9.7, so you can use the very fast fwrite function to write .csv files. See the data.table project page for install instructions.
library(data.table)
setDT(Sales_2015)[, fwrite(.SD, paste0("Sales_2015_", AREA, ".csv")),
                  by = AREA, .SDcols = names(Sales_2015)]
Also, I would recommend you read your data using data.table::fread(), which is much faster than read.table:
Sales_2015 <- fread("C:/address to your file/Sales_2015.txt")
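If you cannot install the development version, a plain base-R split-and-write loop also works. This is a minimal sketch, assuming Sales_2015 has been read in as above; it uses the file names asked for in the question:
pieces <- split(Sales_2015, Sales_2015$AREA)
for (a in names(pieces)) {
  write.table(pieces[[a]],
              file = paste0("2015_", a, ".txt"),
              sep = "|", row.names = FALSE, quote = FALSE)
}
It will be noticeably slower than fwrite on 20 million rows, but it relies only on base functions.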

Related

Picking the last row only from 2000 CSVs in the same directory and making a single dataframe in R

Using R, I want to pick only the last row from each of over 2000 CSVs in the same directory
and make a single dataframe.
Directory = "C:\data"
File names are, for example, '123456_p' (a 6-digit number).
Each CSV has a different number of rows, but the same number of columns (10 columns).
I know the tail and list functions, but with over 2000 dataframes, doing this manually would waste time.
Is there any way to do this with a loop in R?
As always, I really appreciate your help and support.
There are four things you need to do here:
Get all the filenames we want to read in
Read each in and get the last row
Loop through them
Bind them all together
There are many options for each of these steps, but let's use purrr for the looping and binding, and base-R for the rest.
Get all the filenames we want to read in
You can do this with the list.files() function.
filelist = list.files(pattern = '.csv')
will generate a vector of filenames for all CSV files in the working directory. Edit as appropriate to specify the pattern further or target a different directory.
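For example, to target the C:\data directory from the question and match only names ending in .csv (a small variation on the line above; full.names = TRUE returns full paths so the files can be read from any working directory):
filelist = list.files(path = "C:/data", pattern = '\\.csv$', full.names = TRUE)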
Read each in and get the last row
The read.csv() function can read in each file (if you want it to go faster, use data.table::fread() instead), and, as you mentioned, tail() can get the last row. If you build a function out of this it will be easier to loop over, or to change the process if it turns out you need another cleaning step.
read_one_file = function(x) {
  tail(read.csv(x), 1)
}
Loop through them
Bind them all together
You can do both of these steps at once with map_df() in the purrr package.
library(purrr)
final_data = map_df(filelist, read_one_file)
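If you also want to record which file each row came from, one option (a small sketch using purrr's set_names() and the .id argument of map_df()) is:
final_data = map_df(set_names(filelist), read_one_file, .id = "source_file")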

Pulling out specific files in a directory based on unique file names and reading them in with read_wav in R

I have >1000 audio files in one directory that I have been tagging for bird calls, entering the data into Excel.
Once I read in this data, I filtered out covariates of interest and made separate data frames. Basically, I have about 45 files I want to analyse separately with read_wav.
I'm not sure how to create a for loop to look into a certain directory '/all_SMP' and pull out these 45 files from a list of >1000.
I've created a list, "list.clean"; however, my current for loop only lists every file in that folder (>1000 files):
for(i in c(list.clean)){
  raw.path <- paste0("../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP")
  wav.list <- list.files(path="../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP",
                         pattern="*.wav",
                         recursive=TRUE)
}
I'm quite a novice with R as I'm sure you can tell.
I want to use read_wav on the 45 files and then run the 'analyze' function on each file:
audio1 <- analyze(sample1,samplingRate=24000, cutFreq=c(800,8000))
Hope this makes sense!
Cheers
In list.files the parameter pattern is a regular expression.
So, in your case it should be pattern=".*\\.wav$" (and note that it is case-sensitive).
Alternatively and easier, you can use fs::dir_ls(glob = "*.wav").
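To then restrict the listing to the 45 files and run the analysis on each, one approach is sketched below. It assumes list.clean holds the 45 file names exactly as they appear on disk, and it reuses the analyze() call from the question (assuming analyze() comes from the soundgen package, whose analyze() accepts a file path directly):
raw.path <- "../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP"
wav.list <- list.files(path = raw.path, pattern = ".*\\.wav$",
                       recursive = TRUE, full.names = TRUE)

# keep only the 45 files whose base names appear in list.clean
wav.keep <- wav.list[basename(wav.list) %in% list.clean]

# run the analyze() call from the question on each matched file
results <- lapply(wav.keep, function(f) {
  analyze(f, samplingRate = 24000, cutFreq = c(800, 8000))
})
names(results) <- basename(wav.keep)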

Splitting a csv file into multiple txt. files

I have a large csv dataset that I want to split into multiple txt files. I want the name of each file to come from the ID column and the content of each file to come from the Text column. My data looks something like this.
ID Text
1 I like dogs
2 My name is
3 It is sunny
Would anyone be able to help advise? I don't mind using excel or R.
Thank you!
In R, you can split the data by ID and use writeLines to write the text files.
If your dataframe is called df, try :
temp <- split(df$Text, df$ID)
Map(function(x, y) writeLines(x, paste0(y, '.txt')), temp, names(temp))
If you have a lot of rows, this is a good task for parallel computing. (Here's the general premise: R spends a lot of time formatting the file. Writing to the disk can't be done in parallel, but formatting the file can.) So let's do it in parallel!
The furrr package is one of my favorites: In short, it adds parallel processing capabilities to the purrr package, whose map functions are quite useful. In this case, we want to use the future_pmap function, which lets us apply a function to each row of a dataframe. This should be all the code you need:
library(furrr)
plan(multiprocess)
future_pmap(df, function(ID, Text) { write(Text, paste0(ID, ".txt")) })  # argument names must match the column names
I tested the parallel and normal versions of this function on a dataframe with 31,496 rows, and the parallel version took only 60 percent as long. This method is also about 20 percent faster than Ronak Shah's writeLines method.

How to batch read 2.8 GB gzipped (40 GB TSVs) files into R?

I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of one column, and combine them into one data frame.
I've read through several answers here, but none seems to work; I suspect they are not meant to handle that much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried was read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
# Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))

# Define a function to filter data as it comes in.
call_back <- function(x, pos){
  unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}

raw_data <- files %>%
  map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
                         chunk_size = 5000)) %>%
  reduce(rbind) %>%
  as_tibble() # %>%
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
# Generate the path to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))

bind_rows(map(str_c("gunzip -c ", files), fread))
That looked like it started working, but then locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criteria for the one column.
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your RStudio instance and do a lot there. If you need more computing capacity, have a look at a hosted option such as AWS.
Have a read of this https://spark.rstudio.com/
One technical point, there is a sparklyr function spark_read_text which will read delimited text files directly into the Spark instance. It's very useful.
From there you can use dplyr to manipulate your data. Good luck!
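For concreteness, here is a minimal sketch of that workflow. It assumes the gzipped TSVs sit in import/, that they have headers, and that the filter column is called type as in the chunked attempt above; it also uses spark_read_csv with a tab delimiter rather than spark_read_text, which keeps the parsing step simple:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Spark reads .gz files transparently; the "*.tsv.gz" naming pattern is an assumption
events <- spark_read_csv(sc, name = "events",
                         path = "import/*.tsv.gz",
                         delimiter = "\t", header = TRUE)

purchases <- events %>%
  filter(type == "purchase") %>%
  distinct() %>%
  collect()   # bring only the filtered rows back into R

spark_disconnect(sc)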
First, if base read.table is used, you don't need to gunzip anything, as it uses zlib to read gzipped files directly. read.table also works much faster if the colClasses parameter is specified.
You might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still, it will help to have a machine with lots of fast virtual memory. I often work with datasets of this order, and I sometimes find an Ubuntu system short on memory, even with 32 cores. I have an alternative system where I have convinced the OS to use an SSD as part of its memory, giving me an effective 64 GB of RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read using read.table, it's pretty compressed, approaching what gzip delivers. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint and save.image at spots in between.
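A minimal sketch of that route is below. It assumes the files live in import/ and have headers, and that the filter column is called type as in the question; reading every column as character via colClasses is a simple placeholder you can refine once you know the real column types:
files <- list.files("import", pattern = "\\.gz$", full.names = TRUE)

read_one <- function(f) {
  # read.table handles the gzfile() connection without unzipping to disk
  d <- read.table(gzfile(f), sep = "\t", header = TRUE, colClasses = "character")
  d[d$type == "purchase", ]
}

# accumulate the filtered pieces by rbinding them together
raw_data <- do.call(rbind, lapply(files, read_one))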

Reading Large Files into Data Frames in R - Issues with sqldf

I am trying to read in and manipulate data that I have stored in large data sets. Each file is about 5 GB. I mostly need to be able to grab chunks of specific data out of these data sets. I also have a similar 38 MB file that I use for testing. I initially used read.table to read in chunks of the file using 'nrows' and 'skip'. However, this process takes a huge amount of time because skipping an increasing number of rows is time consuming. Here is the code I had:
numskip = 0  # how many lines in the file to skip
cur_list = read.table("file.txt", header = TRUE, sep = ',', nrows = 200000,
                      skip = numskip, col.names = col)  # col is a vector of column names
I set this up in a while loop, increasing numskip to grab the next chunk of data, but as numskip increased, the process slowed significantly.
I briefly tried using readLines to read the data line by line, but a few threads pointed me towards the sqldf package. I wrote the following bit of code:
library(sqldf)
f = file("bigfile.txt")
dataset = sqldf("select * from f where CustomerID = 7127382") # example of what I would like to be able to grab
From what I understand, sqldf will allow me to use SQL queries to return sets of the data from the database without R doing anything, provided that the subset isn't then too big for R to handle.
The problem is that my 4 GB machine runs out of memory when I run the large files (though not the smaller test file). I found this odd because I know that SQLite can handle files much larger than 5 GB, and R shouldn't be doing any of the processing. Would using PostgreSQL help? Do I just need a better machine with more RAM? Should I give up on sqldf and find a different way to do this?
To wrap this up, here's an example of the data I am working with:
"Project" "CustomerID" "Stamp" "UsagePoint" "UsagePointType" "Energy"
21 110981 YY 40 Red 0.17
21 110431 YY 40 Blue 0.19
22 120392 YY 40 Blue 0.20
22 210325 YY 40 Red 0.12
Thanks
Have you tried:
dat <- read.csv.sql(file = "file.txt", "select * from file where CustomerID = 7127382")
You're right about sqldf, and there are a ton of other great big-data tools in R, including bigmemory.
Conversions to CSV or JSON can help (use RJSONIO), and you can also first load your data into a relational, NoSQL, Hadoop, or Hive database and read it in via RODBC, which is what I'd highly recommend in your case.
Also see data.table::fread and the CRAN High-Performance Computing Task View.
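As a concrete example of the fread route (a small sketch; the column names are taken from the sample data above, and select limits which columns are read, which keeps memory use down):
library(data.table)

# read only the columns you need
dat <- fread("file.txt", select = c("Project", "CustomerID", "UsagePointType", "Energy"))

# then subset with data.table syntax
dat_sub <- dat[CustomerID == 7127382]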
