I have a large csv dataset that I want to split into multiple txt files. I want the name of each file to come from the ID column and the content of each file to come from the Text column. My data looks something like this.
ID Text
1 I like dogs
2 My name is
3 It is sunny
Would anyone be able to advise? I don't mind using Excel or R.
Thank you!
In R, you can split the data by ID and use writeLines to write each piece to its own text file.
If your dataframe is called df, try:
temp <- split(df$Text, df$ID)  # one vector of text lines per ID
Map(function(x, y) writeLines(x, paste0(y, '.txt')), temp, names(temp))
If you have a lot of rows, this is a good task for parallel computing. (Here's the general premise: R spends a lot of time formatting the file. Writing to the disk can't be done in parallel, but formatting the file can.) So let's do it in parallel!
The furrr package is one of my favorites: In short, it adds parallel processing capabilities to the purrr package, whose map functions are quite useful. In this case, we want to use the future_pmap function, which lets us apply a function to each row of a dataframe. This should be all the code you need:
library(furrr)
plan(multisession)  # start parallel workers ("multiprocess" is deprecated in recent releases of future)
# Argument names must match the column names (here ID and Text).
future_pmap(df, function(ID, Text) write(Text, paste0(ID, ".txt")))
I tested the parallel and normal versions of this function on a dataframe with 31,496 rows, and the parallel version took only 60 percent as long. This method is also about 20 percent faster than Ronak Shah's writeLines method.
I have a directory with 31 gzipped TSVs (2.8 GB compressed / 40 GB uncompressed). I would like to conditionally import all matching rows based on the value of one column, and combine them into one data frame.
I've read through several answers here, but none seem to work; I suspect they are not meant to handle that much data.
In short, how can I:
Read 3 GB of gzipped files
Import only rows whose column matches a certain value
Combine matching rows into one data frame.
The data is tidy, with only 4 columns of interest: date, ip, type (str), category (str).
The first thing I tried was read_tsv_chunked():
library(purrr)
library(IPtoCountry)
library(lubridate)
library(scales)
library(plotly)
library(tidyquant)
library(tidyverse)
library(R.utils)
library(data.table)
# Generate the paths to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))
# Define a function to filter data as it comes in.
call_back <- function(x, pos) {
  unique(dplyr::filter(x, .data[["type"]] == "purchase"))
}
raw_data <- files %>%
  map(~ read_tsv_chunked(., DataFrameCallback$new(call_back),
                         chunk_size = 5000)) %>%
  reduce(rbind) %>%
  as_tibble()
This first approach worked with 9 GB of uncompressed data, but not with 40 GB.
The second approach using fread() (same loaded packages):
# Generate the paths to all the files.
import_path <- "import/"
files <- import_path %>%
  str_c(dir(import_path))
bind_rows(map(str_c("gunzip -c ", files), fread))
That looked like it started working, but then locked up. I couldn't figure out how to pass the select = c(colnames) argument to fread() inside the map()/str_c() calls, let alone the filter criteria for the one column.
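For what it's worth, here is an untested sketch of how those arguments might be passed through (the column names and the purchase filter come from the callback above; everything else is assumption; fread's cmd argument runs the decompression command):
library(data.table)
library(purrr)
# Read only the four columns of interest, filter each file, then stack.
filtered <- map(files, function(f) {
  dt <- fread(cmd = paste("gunzip -c", f),
              select = c("date", "ip", "type", "category"))
  unique(dt[type == "purchase"])
})
raw_data <- rbindlist(filtered)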
This is more of a strategy answer.
R loads all data into memory for processing, so you'll run into issues with the amount of data you're looking at.
What I suggest you do, which is what I do, is to use Apache Spark for the data processing, and use the R package sparklyr to interface to it. You can then load your data into Spark, process it there, then retrieve the summarised set of data back into R for further visualisation and analysis.
You can install Spark locally in your RStudio instance and do a lot there. If you need further computing capacity, have a look at a hosted option such as AWS.
Have a read of this: https://spark.rstudio.com/
One technical point: there is a sparklyr function, spark_read_csv, which will read delimited text files (such as these TSVs) directly into the Spark instance. It's very useful.
From there you can use dplyr to manipulate your data. Good luck!
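For concreteness, a minimal sketch of that workflow (the path, column names, and purchase filter come from the question; the local master, the header, and the tab delimiter are assumptions):
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")  # local Spark instance
# Spark reads the gzipped TSVs directly; header and tab delimiter assumed.
logs <- spark_read_csv(sc, name = "logs", path = "import/*.gz",
                       delimiter = "\t")
result <- logs %>%
  filter(type == "purchase") %>%
  distinct() %>%
  collect()  # bring only the reduced result back into R
spark_disconnect(sc)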
First, if base read.table is used, you don't need to gunzip anything, as it uses zlib to read the gzipped files directly. read.table also works much faster if the colClasses parameter is specified.
You might need to write some custom R code to produce a melted data frame directly from each of the 31 TSVs, and then accumulate them by rbinding.
Still, it will help to have a machine with lots of fast virtual memory. I often work with datasets of this order, and I sometimes find an Ubuntu system wanting for memory, even one with 32 cores. I have an alternative system where I have convinced the OS to treat an SSD as an extension of its memory, giving me an effective 64 GB of RAM. I find this very useful for some of these problems. It's Windows, so I need to set memory.limit(size=...) appropriately.
Note that once a TSV is read with read.table, its in-memory representation is fairly compact, approaching what gzip delivers on disk. You may not need a big system if you do it this way.
If it turns out to take a long time (I doubt it), be sure to checkpoint with save.image at spots in between.
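A short sketch of this route, under the question's assumptions (the four columns of interest, tab-delimited, with a header; files is the vector built in the question's own code):
# read.table opens the .gz files directly; colClasses speeds up parsing.
cc <- c(date = "character", ip = "character",
        type = "character", category = "character")
pieces <- lapply(files, function(f) {
  d <- read.table(f, sep = "\t", header = TRUE, colClasses = cc,
                  quote = "", comment.char = "")
  d[d$type == "purchase", ]
})
raw_data <- unique(do.call(rbind, pieces))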
I have a .txt file called Sales_2015 which holds almost 1 GB of info. The file has the following columns:
AREA|WEEKNUMBER|ITEM|STORE_NO|SALES|UNITS_SOLD
10GUD| W01_2015 |0345| 023234 |1200 | 12
The file's colClasses is: c(rep("character",4), rep("numeric",2))
What I want to do is separate the 1 GB file into pieces so it becomes faster to read. The number of .txt files I want to end up with will be defined by the number of AREAs I have (which is the first column).
So I have the following variables:
Sales <- read.table(paste(RUTAC, "/Sales_2015.txt", sep=""), sep="|", header=TRUE,
                    quote="", comment.char="",
                    colClasses=c(rep("character",4), rep("numeric",2)))
Areas <- c("10GUD","10CLJ","10DZV", ...)  # There are 52 elements
I want to end up with 52 .txt files whose names are, for instance:
2015_10GUD.txt (which will only include the entire rows from the 1 GB file that contain 10GUD in the AREA column)
2015_10CLJ.txt (which will only include the entire rows that contain 10CLJ)
I know this question is very similar to others, but the difference is that I am working with up to 20 million rows. Can anybody help me get this done with some sort of loop, such as repeat, or something else?
No need to use a loop. The simplest and fastest way to do this is probably with data.table. I strongly recommend you use the development version of data.table, 1.9.7, so you can use the super-fast fwrite function to write .csv files. Go here for install instructions.
library(data.table)
setDT(Sales_2015)[, fwrite(.SD, paste0("Sales_2015_", AREA, ".csv")),
                  by = AREA, .SDcols = names(Sales_2015)]
Also, I would recommend you read your data using fread from data.table, which is way faster than read.table:
Sales_2015 <- fread("C:/address to your file/Sales_2015.txt")
I am working with very large data layers for an SDM class, and because of this I ended up breaking some of my layers into a bunch of blocks to avoid memory constraints. These blocks were written out as .grd files, and now I need to read them back into R and merge them together. I am extremely new to R and programming in general, so any help would be appreciated. What I have been trying so far looks like this:
library(raster)
merge.coarse <- raster("coarseBlock1.grd")
for (i in 2:nBlocks) {
  merge.coarse <- merge(merge.coarse, raster(paste0("coarseBlock", i, ".grd")))
}
where my files are named coarseBlock1.grd through coarseBlock259.grd, numbered sequentially from 1 to nBlocks (259).
Any feedback would be greatly appreciated.
Using for loops is generally slow in R. Also, growing an object with merge or rbind inside a for loop eats up a lot of memory because of the way R copies the values passed to these functions.
A more efficient way to do this task would be to call lapply (see this tutorial on apply functions for details) to load the files from your folder (GRDFolder below) into R. This will result in a list, which can then be collapsed with a single merge call (rbind is for data frames and will not combine raster layers):
library(raster)
rasters <- lapply(list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE),
                  FUN = raster)
merge.coarse <- do.call(merge, rasters)
I'm not too familiar with .grd files, but this overall process should at least get you going in the right direction. Assuming all your .grd files (1 through 259) are stored in the same folder (which I will refer to as GRDFolder), you can try this:
merge.coarse <- raster("coarseBlock1.grd")
for (filename in list.files(GRDFolder, pattern = "\\.grd$", full.names = TRUE))
{
  temp <- raster(filename)
  merge.coarse <- merge(merge.coarse, temp)  # rbind does not combine rasters; merge does
}
I have a .csv file, example.csv, with 8000 columns x 40000 rows. The CSV file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv it turns out to be extremely slow. It is also very slow when I add the parameter nrows=100. I wonder if there is a way to accelerate read.csv, or use some other function instead of read.csv, to load the file into memory as a matrix or data.frame?
Thanks in advance.
If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:
‘read.table’ is not the right tool for reading large matrices,
especially those with many columns: it is designed to read _data
frames_ which may have columns of very different classes. Use
‘scan’ instead for matrices.
Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
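A minimal sketch under the question's stated shape (8000 integer columns, comma-separated, one header row):
m <- matrix(scan("example.csv", what = integer(), sep = ",", skip = 1),
            ncol = 8000, byrow = TRUE)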
Try using data.table::fread(). This is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.
library(data.table)
data <- fread("c:/data.csv")
If you want to make it even faster, you can also read only the subset of columns you want to use:
data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
Also try Hadley Wickham's readr package:
library(readr)
data <- read_csv("file.csv")
If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.
You can then load it back in with the (surprise!) load function.
d <- as.data.frame(matrix(1:1e6,ncol=1000))
write.csv(d, "c:/foo.csv", row.names=FALSE)
# Load file with read.csv
system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec
# Load file using scan
system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','),
ncol=1000, byrow=TRUE) ) # 0.55 sec
# Load (binary) file using load
save(d, file="c:/foo.bin", compress=FALSE)
system.time( load("c:/foo.bin") ) # 0.09 sec
It might be worth trying the new vroom package:
vroom is a new approach to reading delimited and fixed width data into R.
It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)-allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.
Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.
This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once it can be efficiently queried and subset.
#install.packages("vroom",
# dependencies = TRUE, repos = "https://cran.rstudio.com")
library(vroom)
df <- vroom('example.csv')
Benchmark: readr vs data.table vs vroom for a 1.57GB file
I have a few large data files I'd like to sample when loading into R. I can load the entire data set, but it's really too large to work with; I'd like to take random samples of the input while reading it.
I can imagine how to build that with a loop and readLines and what-not, but surely this has been done hundreds of times.
Is there something on CRAN or even in base that can do this?
You can do that in one line of code using sqldf. See part 6e of example 6 on the sqldf home page.
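For reference, a hedged sketch along those lines (file name hypothetical): read.csv.sql stages the file in a temporary SQLite database, so the full data set never has to fit in R's memory, and the file is referred to as file in the query.
library(sqldf)
# SQLite's random() shuffles the rows; limit keeps a 1000-row sample.
samp <- read.csv.sql("big.csv",
                     sql = "select * from file order by random() limit 1000")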
No pre-built facilities. Best approach would be to use a database management program. (Seems as though this was addressed in either SO or Rhelp in the last week.)
Take a look at: Read csv from specific row, and especially note Grothendieck's comments. I consider him a "class A wizaRd". He's got first-hand experience with sqldf. (The author, IIRC.)
And another "huge files" problem with a Grothendieck solution that succeeded:
R: how to rbind two huge data-frames without running out of memory
I wrote the following function that does close to what I want:
readBigBz2 <- function(fn, sample_size = 1000) {
  f <- bzfile(fn, "r")
  rv <- c()
  repeat {
    # Read the next chunk of up to sample_size lines; stop at end of file.
    lines <- readLines(f, sample_size)
    if (length(lines) == 0) break
    # Keep one randomly chosen line from each chunk.
    rv <- append(rv, sample(lines, 1))
  }
  close(f)
  rv
}
I may want to go with sqldf in the long-term, but this is a pretty efficient way of sampling the file itself. I just don't quite know how to wrap that around a connection for read.csv or similar.
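One possible way to close that gap (a sketch, assuming a header row and comma-separated fields; the file name is hypothetical): re-read the header separately, then hand the sampled lines to read.csv via its text argument.
lines  <- readBigBz2("data.csv.bz2")                 # sampled data lines
header <- readLines(bzfile("data.csv.bz2"), n = 1)   # column names
df     <- read.csv(text = c(header, lines))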