Let's take the following simplified version of a dataset that I import using read.table:
a<-as.data.frame(c("M","M","F","F","F"))
b<-as.data.frame(c(25,22,33,17,18))
df<-cbind(a,b)
colnames(df)<-c("Sex","Age")
In reality my dataset is extremely large and I'm only interested in a small proportion of it, i.e. the data concerning females aged 18 or under. In the example above this would be just the last 2 observations.
My question is: can I import just these observations directly, without importing the rest of the data and then using subset to refine it? My computer's capacity is limited, so I have been using scan to import my data in chunks, but this is extremely time consuming.
Is there a better solution?
Some approaches that might work:
1 - Use a package like ff that can help you with RAM issues.
2 - Use other tools/languages to clean your data before loading it into R.
3 - If your file is not too big (i.e., you can load it without crashing), then you could save it to a .RData file and read from this file (instead of calling read.table):
# save each txt file once...
save.rdata = function(filepath, filebin) {
  dataset = read.table(filepath)
  save(dataset, file = paste(filebin, ".RData", sep = ""))
}
# then read from the .RData file
get.dataset = function(filebin) {
  load(filebin)  # loads the object that was saved as 'dataset'
  return(dataset)
}
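For example, a quick usage sketch (the file names here are just placeholders):
# one-time conversion of the raw text file
save.rdata("mydata.txt", "mydata")
# later sessions load the binary file instead of re-parsing the txt
df <- get.dataset("mydata.RData")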
This is much faster than reading from a txt file, but I'm not sure if it applies to your case.
There should be several ways to do this. Here is one using SQL.
library(sqldf)
result = sqldf("select * from df where Sex='F' AND Age<=18")
> result
Sex Age
1 F 17
2 F 18
There is also a read.csv.sql function that lets you apply the same kind of filter while the file is being read, so you avoid pulling the whole text file into R!
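For example, something along these lines should work for your case (the file name and separator are placeholders for your real data):
library(sqldf)
# the filtering happens inside SQLite, so only the matching rows ever reach R;
# if strings in the file are quoted, the literal must include the quotes (see below)
females_u18 <- read.csv.sql("mydata.txt",
                            sql = "select * from file where Sex = 'F' and Age <= 18",
                            header = TRUE, sep = " ")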
This is almost the same as #Drew75's answer but I'm including it to illustrate some gotchas with SQLite:
# example: large-ish data.frame
df <- data.frame(Sex=sample(c("M","F"),1e6,replace=T),
                 Age=sample(18:75,1e6,replace=T))
write.csv(df, "myData.csv", quote=F, row.names=F) # note: non-quoted strings
library(sqldf)
myData <- read.csv.sql(file="myData.csv", # looks for char M (no qoutes)
sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 500127
write.csv(df, "myData.csv", row.names=F) # quoted strings...
myData <- read.csv.sql(file="myData.csv", # this fails
sql="select * from file where Sex='M'", eol = "\n")
nrow(myData)
# [1] 0
myData <- read.csv.sql(file="myData.csv", # need quotes in the char literal
sql="select * from file where Sex='\"M\"'", eol = "\n")
nrow(myData)
# [1] 500127
I see the chunk_size argument in arrow::write_parquet(), but it doesn't seem to behave as expected. I would expect the code below to generate 3 separate parquet files, but only one is created, and nrow > chunk_size.
library(arrow)
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write with chunk_size 1e6, and view directory
write_parquet(df, tf, chunk_size = 1e6)
list.files(td)
Returns one file instead of 3:
[1] "25ff74854ba6.parquet"
# read parquet and show all rows are there
nrow(read_parquet(tf))
Returns:
[1] 3000000
We can't pass multiple file name arguments to write_parquet(), and I don't want to partition, so write_dataset() also seems inapplicable.
The chunk_size parameter refers to how much data to write to disk at once, rather than the number of files produced. The write_parquet() function is designed to write individual files, whereas, as you said, write_dataset() allows partitioned file writing. I don't believe that splitting files on any other basis is supported at the moment, though it is a possibility in future releases. If you had a specific reason for wanting 3 separate files, I'd recommend separating the data into multiple datasets first and then writing each of those via write_parquet().
(Also, I am one of the devs on the R package, and can see that this isn't entirely clear from the docs, so I'm going to open a ticket to update those - thanks for flagging this up)
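For example, a rough sketch of that approach, reusing df and td from the question (the output file names here are just illustrative):
library(arrow)
# split the data frame into ~1e6-row pieces, then write each piece separately
n_per_file <- 1e6
pieces <- split(df, ceiling(seq_len(nrow(df)) / n_per_file))
for (i in seq_along(pieces)) {
  write_parquet(pieces[[i]], file.path(td, sprintf("part-%d.parquet", i)))
}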
Short of an argument to write_parquet() like max_row that defaults to a reasonable number (like 1e6), we can do something like this:
library(arrow)
library(uuid)
library(glue)
library(dplyr)
write_parquet_multi <- function(df, dir_out, max_row = 1e6){
  # Only one parquet file is needed
  if(nrow(df) <= max_row){
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to 1 parquet file...")
    write_parquet(
      df,
      glue("{dir_out}/{UUIDgenerate(use.time = FALSE)}.parquet"))
    cat("done.\n")
  }
  # Multiple parquet files are needed
  if(nrow(df) > max_row){
    count = ceiling(nrow(df)/max_row)
    start = seq(1, count*max_row, max_row)
    end = c(seq(max_row, nrow(df), max_row), nrow(df))
    uuids = UUIDgenerate(n = count, use.time = FALSE)
    cat("Saving", formatC(nrow(df), big.mark = ","),
        "rows to", count, "parquet files...")
    for(j in 1:count){
      write_parquet(
        dplyr::slice(df, start[j]:end[j]),
        glue("{dir_out}/{uuids[j]}.parquet"))
    }
    cat("done.\n")
  }
}
# .parquet dir and file path
td <- tempdir()
tf <- tempfile("", td, ".parquet")
on.exit(unlink(tf))
# dataframe with 3e6 rows
n <- 3e6
df <- data.frame(x = rnorm(n))
# write parquet multi
write_parquet_multi(df, td)
list.files(td)
This returns:
[1] "7a1292f0-cf1e-4cae-b3c1-fe29dc4a1949.parquet"
[2] "a61ac509-34bb-4aac-97fd-07f9f6b374f3.parquet"
[3] "eb5a3f95-77bf-4606-bf36-c8de4843f44a.parquet"
I have a batch of text files that I need to read into R to do text mining.
So far, I have tried read.table, readLines, lapply, and mcsv_r from the qdap package, to no avail. I have tried to write a loop to read the files, but I have to specify the name of the file, which changes in every iteration.
Here is what I have tried:
# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"
# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")
for(i in 1:length(speeches))
{
  text_df <- do.call(rbind, lapply(speeches[i], read.csv))
}
Moreover, I have tried the following:
library(data.table)
files <- list.files(path = folder.path,pattern = ".csv")
temp <- lapply(files, fread, sep=",")
data <- rbindlist( temp )
And it is giving me this error when inaugAbrahamLincoln-1.csv clearly exists in the folder:
files <- list.files(path = folder.path,pattern = ".csv")
> temp <- lapply(files, fread, sep=",")
Error in FUN(X[[i]], ...) :
File 'inaugAbrahamLincoln-1.csv' does not exist. Include one or more spaces to consider the input a system command.
> data <- rbindlist( temp )
Error in rbindlist(temp) : object 'temp' not found
>
But it only works on .csv files, not on .txt files.
Is there a simpler way to do text mining from multiple sources files? If so how?
Thanks
I often have this same problem. The textreadr package that I maintain is designed to make reading .csv, .pdf, .doc, and .docx documents and directories of these documents easy. It would reduce what you're doing to:
textreadr::read_dir("../data/InauguralSpeeches/")
Your example is not reproducible, so I create one below (please make your example reproducible in the future).
library(textreadr)
## Minimal working example
dir.create('delete_me')
file.copy(dir(system.file("docs/Maas2011/pos", package = "textreadr"), full.names=TRUE), 'delete_me', recursive=TRUE)
write.csv(mtcars, 'delete_me/mtcars.csv')
write.csv(CO2, 'delete_me/CO2.csv')
cat('test\n\ntesting\n\ntester', file='delete_me/00_00.txt')
## the read in of a directory
read_dir('delete_me')
output
The output below shows the tibble output with each document registered in the document column. For every line in the document there is one row for that document. Depending on what's in the csv files this may not be fine grained enough.
## document content
## 1 0_9 Bromwell High is a cartoon comedy. It ra
## 2 00_00 test
## 3 00_00
## 4 00_00 testing
## 5 00_00
## 6 00_00 tester
## 7 1_7 If you like adult comedy cartoons, like
## 8 10_9 I'm a male, not given to women's movies,
## 9 11_9 Liked Stanley & Iris very much. Acting w
## 10 12_9 Liked Stanley & Iris very much. Acting w
## .. ... ...
## 141 mtcars "Ferrari Dino",19.7,6,145,175,3.62,2.77,
## 142 mtcars "Maserati Bora",15,8,301,335,3.54,3.57,1
## 143 mtcars "Volvo 142E",21.4,4,121,109,4.11,2.78,18
Here is code that will read all the *.csv files in a directory to a single data.frame:
dir <- '~/Desktop/testcsv/'
files <- list.files(dir,pattern = '*.csv', full.names = TRUE)
data <- lapply(files, read.csv)
df <- do.call(rbind, data)
Notice that I added the argument full.names = TRUE. Without it, list.files() returns only the bare file names, which is why you were getting an error for "inaugAbrahamLincoln-1.csv" even though it exists: the read functions were looking for it in your working directory rather than in the folder.
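For the original .txt speeches, a similar sketch should work (the path comes from your question; readLines() is just one reasonable way to pull in raw text for mining):
folder.path <- "../data/InauguralSpeeches/"
speeches <- list.files(path = folder.path, pattern = "\\.txt$", full.names = TRUE)
# read each file into one string per document, named by file path
texts <- vapply(speeches,
                function(f) paste(readLines(f), collapse = "\n"),
                character(1))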
Here is one way to do it.
library(data.table)
setwd("C:/Users/Excel/Desktop/CSV Files/")
WD="C:/Users/Excel/Desktop/CSV Files/"
# read headers
data<-data.table(read.csv(text="CashFlow,Cusip,Period"))
csv.list<- list.files(WD)
k=1
for (i in csv.list){
  temp.data <- read.csv(i)
  data <- data.table(rbind(data, temp.data))
  if (k %% 100 == 0)
    print(k/length(csv.list))
  k <- k + 1
}
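A more concise variant of the same idea (same folder as above) that avoids growing the data object inside the loop:
library(data.table)
# full.names = TRUE gives fread complete paths; rbindlist binds everything once at the end
WD <- "C:/Users/Excel/Desktop/CSV Files/"
csv.list <- list.files(WD, pattern = "\\.csv$", full.names = TRUE)
data <- rbindlist(lapply(csv.list, fread))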
I have a huge .txt file named SDN_1 with more than 1 million rows. I would like to split this file into smaller .txt files (10,000 rows each) using R.
I used this code to load the file into R:
SDN_1 <- read.csv("C:/Users/JHU/Desktop/rfiles/SDN_1.csv", header=FALSE)
Then I used this code to split the table:
chunk <- 10000
n <- nrow(SDN_1)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(SDN_1,r)
Next I would like to save the output of the split function into separate .txt files, encoded as UTF-8. The files need to be named in the following format: test_YYYYMMDD_HHMMSS.txt
I'm new to R and any help would be appreciated.
UPDATE:
Hack-R suggested the code below to create the .csv files. It worked once, then started giving me the error message below:
Code Hack-R suggested:
n <- 1
for(i in d){
  con <- file(paste0("file",n,"_", gsub("-","",gsub(":","",gsub(" ","_",Sys.time()))), "_",".csv"), encoding="UTF-8")
  write.csv(tmp, file = con)
  n <- n + 1
}
The error message I'm getting:
Error in is.data.frame(x) : object 'tmp' not found
Using the code you already have:
SDN_1 <- mtcars # this represents your csv, to make it reproducible
chunk <- 10 # scaled it down for the example
n <- nrow(SDN_1)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(SDN_1,r)
n <- 1 # this part is optional
for(i in d){
  con <- file(paste0("file",n,"_", gsub("-","",gsub(":","",gsub(" ","_",Sys.time()))), "_",".csv"), encoding="UTF-8")
  write.csv(i, file = con)  # write the current chunk 'i', not 'tmp' (which caused the error above)
  n <- n + 1
}
More generally, let's say a and b represent the splits of a larger object or any collection of objects in the environment you want to write out programmatically:
a <- "a"
b <- "b"
You can get a vector containing their names:
files <- ls()
Then loop through and programmatically write each one to a UTF-8 encoded csv file as follows, appending the date and time in the format you requested:
for(i in files){
  tmp <- get(i)
  con <- file(paste0(i, "_", gsub("-","",gsub(":","",gsub(" ","_",Sys.time()))), "_",".csv"), encoding="UTF-8")
  write.csv(tmp, file = con)  # the file name is built from the object's name, the content from get(i)
}
I used Sys.time() for the timestamp with nested gsub()s to format the way you wanted. I encoded the file to UTF-8 as explained in this post.
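If you want the exact test_YYYYMMDD_HHMMSS.txt naming from your question, here is a sketch along the same lines; the chunk index suffix is my addition so that chunks written within the same second don't overwrite each other:
# d is the list of chunks produced by split() above
stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
for (n in seq_along(d)) {
  out <- paste0("test_", stamp, "_", n, ".txt")
  write.table(d[[n]], file = out, sep = "\t", quote = FALSE,
              row.names = FALSE, fileEncoding = "UTF-8")
}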
I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I was turning to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so if you can come up with anything, that would be great.
If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow=100, header=TRUE)
dbWriteTable(con, name = "Chr1test", value = data)
> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A" "BP_A" "SNP_A" "CHR_B" "BP_B" "SNP_B" "R2"
> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
row_names CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 1 1 1579 SNP-1.578. 1 2097 SNP-1.1096. 0.07223050
2 2 1 1579 SNP-1.578. 1 2553 SNP-1.1552. 0.00763724
If I read in all of my data directly to a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same incorrect column separation:
dbWriteTable(con, name = "Chr1", value ="chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R
I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
*Edit, update:
It seems as though this works:
n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n){
  dbWriteTable(con_data, data, name = "ch1", append = TRUE, header = TRUE)
  data <- read.table(f, nrow = n, header = TRUE)
}
close(f)
if (nrow(data) != 0){
  dbWriteTable(con_data, data, name = "ch1", append = TRUE)
}
Though I can't quite figure out why just writing the table through SQLite is a problem. Possibly a memory issue.
I am guessing that your big file is causing a free memory issue (see Memory Usage under docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you just say "head -n 5 chr1.ld" to get the first five lines).
If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole.
Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld").
Let's say your file has 100,000 lines.
sip.size <- 100
col.names <- names(read.table("chr1.ld", nrow = 1, header = TRUE))  # grab the header once
for (i in seq(0, 100000, sip.size)) {
  data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1, header = FALSE,
                     col.names = col.names)
  dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}
You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)
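For instance, one sketch of the "don't factor" suggestion, assuming the columns shown in the dbListFields() output above (the classes are guesses and would need adjusting), is to swap the read.table() call inside the loop for:
# fixed column classes stop read.table from guessing factor levels chunk by chunk;
# the column names come from the question, the classes are assumptions
classes <- c(CHR_A = "integer", BP_A = "integer", SNP_A = "character",
             CHR_B = "integer", BP_B = "integer", SNP_B = "character",
             R2 = "numeric")
data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1, header = FALSE,
                   col.names = names(classes), colClasses = classes)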
Hello experts,
I am trying to read in a large file in consecutive blocks of 10000 lines, because the file is too large to read in at once. The "skip" argument of read.csv comes in handy to accomplish this task (see below). However, I noticed that the program starts slowing down towards the end of the file (for large values of i).
I suspect this is because each call to read.csv(file,skip=nskip,nrows=block) always
starts reading the file from the beginning until the required starting line "skip" is
reached. This becomes increasingly time-consuming as i increases.
Question: Is there a way to continue reading a file starting from the last location that
was reached in the previous block?
numberOfBlocksInFile <- 800
block <- 10000
for (i in 1:(numberOfBlocksInFile-1))
{
  print(i)
  nskip <- i*block
  out <- read.csv(file, skip=nskip, nrows=block)
  colnames(out) <- names
  .....
  print("keep going")
}
many thanks (:-
One way is to use readLines with a file connection. For example, you could do something like this:
temp.fpath <- tempfile() # create a temp file for this demo
d <- data.frame(a=letters[1:10], b=1:10) # sample data, 10 rows. we'll read 5 at a time
write.csv(d, temp.fpath, row.names=FALSE) # write the sample data
f.cnxn <- file(temp.fpath, 'r') # open a new connection
fields <- readLines(f.cnxn, n=1) # read the header, which we'll reuse for each block
block.size <- 5
repeat { # keep reading and printing 5-row chunks until you reach the end of the cnxn
  block.text <- readLines(f.cnxn, n=block.size) # read chunk
  if (length(block.text) == 0) # if there's nothing left, leave the loop
    break
  block <- read.csv(text=c(fields, block.text)) # parse chunk, reusing the header line
  print(block)
}
close(f.cnxn)
file.remove(temp.fpath)
Another option is to use fread from the data.table package.
library(data.table)
N <- 1e6 ## ~1 second to read 1e6 rows / 10 cols
DT <- fread("test.csv", nrows=N)
skip <- N + 1 ## lines consumed so far: the header plus N data rows
repeat {
  ## here use DT for your process
  if (nrow(DT) < N) break
  ## with skip > 0 the header is not re-read, so keep the first chunk's column names
  DT <- fread("test.csv", nrows=N, skip=skip, header=FALSE, col.names=names(DT))
  skip <- skip + N
}