Reading in chunks at a time using fread in package data.table

I'm trying to input a large tab-delimited file (around 2GB) using the fread function in package data.table. However, because it's so large, it doesn't fit completely in memory. I tried to input it in chunks by using the skip and nrow arguments such as:
chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
#do something to temp
chunk = chunk + 1
if(nrow(temp)<2) done = TRUE
}
In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, etc. The problem with this code is that after every chunk is retrieved, fread needs to start scanning the file from the very beginning since after every loop iteration, skip increases by a million. As a result, after every chunk, fread takes longer and longer to actually get to the next chunk making this very inefficient.
Is there a way to tell fread to pause every, say, 1 million lines, and then continue reading from that point on without having to restart at the beginning? Are there any solutions, or should this be a new feature request?

You should use the LaF package. It keeps a sort of pointer into your data, which avoids having to read the whole file, a behaviour that is annoying for very large data. As far as I understand it, fread() in the data.table package needs to know the total number of rows, which takes time for multi-GB data.
Using the pointer in LaF you can go to any line you want, read a chunk of data, apply your function to it, and then move on to the next chunk. On my small PC I ran through a 25 GB CSV file in steps of 10e6 lines and extracted the ~5e6 observations needed in total; each 10e6 chunk took 30 seconds.
UPDATE:
library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'
#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)
Then create a connection to your file using the model:
df.laf <- laf_open(model)
Once done, you can do all sorts of things without needing to know the size of the file, as you would with the data.table package. For instance, place the pointer at line 100e6 and read 1e6 lines of data from there:
goto(df.laf, 100e6)
data <- next_block(df.laf,nrows=1e6)
Now data contains 1e6 lines of your CSV file (starting from line 100e6).
You can read in chunks of data (size depending on your memory) and keep only what you need. For example, the huge_file in my example points to a file with all known protein sequences and has a size of >27 GB, which is way too big for my PC. To get only the human sequences I filtered on the organism id, which is 9606 for human and should appear at the start of the variable protein1. A dirty way is to put it into a simple for-loop and read one data chunk at a time:
library('dplyr')
library('stringr')
res <- df.laf[1,][0,]
for(i in 1:10){
raw <-
next_block(df.laf,nrows=100e6) %>%
filter(str_detect(protein1,"^9606\\."))
res <- rbind(res, raw)
}
Now res contains the filtered human data. A better option, especially for more complex operations such as on-the-fly calculations, is the function process_blocks(), which takes a function as an argument. Inside that function you can do whatever you want with each piece of data. Read the documentation.

You can use readr's read_*_chunked to read in data and e.g. filter it chunkwise. See here and here for an example:
library(readr)
# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)

A related option is the chunked package. Here is an example with a 3.5 GB text file:
library(chunked)
library(tidyverse)
# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get bz file, unzip to get this:
my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'
# How big is my file?
print(paste(round(file.info(my_file)$size / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor
# look at the top of the file
readLines(my_file, n = 100)
# to find where the content starts, vary the skip value,
read.table(my_file, nrows = 10, skip = 25)
This is where we start working in chunks of the file, we can use most dplyr verbs in the usual way:
# Let the chunked pkg work its magic! We only want the lines containing
# "Gun_control". The main challenge here was identifying the column
# header
df <-
read_chunkwise(my_file,
chunk_size=5000,
skip = 30,
format = "table",
header = TRUE) %>%
filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))
# this line does the evaluation,
# and takes a few moments...
system.time(out <- collect(df))
And here we can work on the output as usual, since it's much smaller than the input file:
# clean up the output to separate into cols,
# and get the number of page views as a numeric
out_df <-
out %>%
separate(De.mw.De.5.J3M1O1,
into = str_glue("V{1:4}"),
sep = " ") %>%
mutate(V3 = as.numeric(V3))
head(out_df)
V1 V2 V3
1 en.z Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z Gun_control_policy_of_the_Clinton_Administration 223
4 en.z Category:Gun_control_advocates 80
5 en.z Gun_control_in_the_United_Kingdom 68
6 en.z Gun_control_in_america 59
V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2 B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3 A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4 D2H1M1S4T8U22V10W18X14
5 B1C1S1T11U12V13W16X13
6 B1H1M1N2P1S1T6U5V17W12X12
#--------------------

fread() can definitely read the data in chunks.
The mistake in your code is that you should keep nrow constant and change only the skip parameter on each pass through the loop.
Something like this is what I wrote for my data:
data=NULL
for (i in 0:20){
data[[i+1]]=fread("my_data.csv",nrow=10000,select=c(1,2:100),skip =10000*i)
}
You can insert the following code in your loop:
start_time <- Sys.time()
#####something!!!!
end_time <- Sys.time()
end_time - start_time
to check the time and confirm that each iteration takes a similar time on average.
Then you can use another loop to combine your data by rows with R's rbind function.
The sample code could be something like this:
new_data = data[[1]]
for (i in 1:20){
new_data=rbind(new_data,data[[i+1]],use.names=FALSE)
}
to unify everything into one large dataset.
I hope this answer helps with your question.
I loaded an 18 GB dataset with 2k+ columns and 200k rows in about 8 minutes using this method.
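As a side note on the combining step, the row-by-row rbind loop can usually be replaced by a single call over the whole list. Below is a minimal base R sketch of that idea; the `chunks` list and its columns are made-up stand-ins for the list built in the reading loop, not the real data. (When the chunks are data.tables, data.table's rbindlist() does the same job.)

```r
# Stand-in for the list of chunks built in the reading loop
# (hypothetical columns, just for illustration)
chunks <- lapply(0:2, function(i) {
  data.frame(id = i * 2 + 1:2, value = c(10, 20) + i)
})

# Bind every chunk in one call instead of growing the result in a loop;
# growing with rbind inside a loop re-copies the accumulated rows each time
combined <- do.call(rbind, chunks)

nrow(combined)
```

Binding once at the end avoids repeatedly copying the rows that were already accumulated, which matters when there are many chunks.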

Related

Problem with file size using fread (to load) and write.csv. From 20GB to > 60GB in writing phase

My problem is that write.csv, or something else in my workflow, has increased my file size from 19 GB to more than 60 GB. I say more than 60 GB because the save was interrupted by memory problems. It happened when I added 250,000 rows to my database of nearly 3 million rows. Now I have two problems: 1) how to overcome the issue above, and 2) how to read this huge dataset and save it at a reasonable size. For those who are not familiar with storing tweets: my file is larger than it should be, given that I have been storing only 29 of the 91 columns (the standard columns when downloading tweets).
Here is my process:
I'm downloading tweets using rtweet these days, in chunks of 250,000 per day. Each day, I load my old database using fread from data.table. Then I bind both data frames with rbind. Finally, I use write.csv to save the object. I repeat this process each time. Here is my code:
base <- fread("tweets.csv")
base$user_id <- as.character(base$user_id)
base$status_id <- as.character(base$status_id)
base$retweet_status_id <- as.character(base$retweet_status_id)
base$retweet_user_id <- as.character(base$mentions_user_id)
datos <- search_tweets2("keywords", n = 250000, include_rts = T, lang = "es", retryonratelimit = T)
datos <- datos[,c(1,2,3,4,5,6,12,14,30,31,32,48,49,50,51,54,55,56,57,61,62,73,74,75,78,79,83,84)]
datos$mentions_user_id <- as.character(datos$mentions_user_id)
datos$mentions_screen_name <- as.character(datos$mentions_screen_name)
datos$created_at <- as.character(datos$created_at)
datos$retweet_created_at <- as.character(datos$retweet_created_at)
datos$account_created_at <- as.character(datos$account_created_at)
base <- rbind(base,datos)
write.csv(base, "tweets.csv")
I should say that when I write the new file, I overwrite the old one. The main problem is probably in this load-and-overwrite cycle, but I don't know. Otherwise, from what I've been reading, I think my solution may be read.csv.sql: loading my database in small parts and saving it correctly. But read.csv.sql has problems with my number of columns. It says: "Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) :
RS_sqlite_import: tweets.csv line 2 expected 29 columns of data but found 6".
I loaded 100 rows using `read.csv(file.csv, nrows = 100)` to check whether everything is still OK in my file, and it is.
Thank you in advance.

Parsing large XML to dataframe in R

I have large XML files that I want to turn into dataframes for further processing within R and other programs. This is all being done in macOS.
Each monthly XML is around 1 GB, with 150k records and 191 different variables. In the end I might not need all 191 variables, but I'd like to keep them and decide later.
The XML files can be accessed here (scroll to the bottom for the monthly zips, when uncompressed one should look at "dming" XMLs)
I've made some progress but processing for larger files takes too long (see below)
The XML looks like this:
<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
...
<ROW_DUASDIA NUM="150236">
<variable1>value</variable1>
...
<variable191>value</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>
I hope that's clear enough. This is my first time working with an XML.
I've looked at many answers here and in fact managed to get the data into a dataframe using a smaller sample (using a daily XML instead of the monthly ones) and xml2. Here's what I did
library(xml2)
raw <- read_xml(filename)
# Find all records
dua <- xml_find_all(raw,"//ROW_DUASDIA")
# Create empty dataframe
dualen <- length(dua)
varlen <- length(xml_children(dua[[1]]))
df <- data.frame(matrix(NA,nrow=dualen,ncol=varlen))
# For loop to enter the data for each record in each row
for (j in 1:dualen) {
df[j, ] <- xml_text(xml_children(dua[[j]]),trim=TRUE)
}
# Name columns
colnames(df) <- c(names(as_list(dua[[1]])))
I imagine that's fairly rudimentary but I'm also pretty new to R.
Anyway, this works fine with daily data (4-5k records), but it's probably too inefficient for 150k records, and in fact I waited a couple hours and it hadn't finished. Granted, I would only need to run this code once a month but I would like to improve it nonetheless.
I tried to turn the elements for all records into a list using the as_list function within xml2 so I could continue with plyr, but this also took too long.
Thanks in advance.

While there is no guarantee of better performance on larger XML files, the ("old school") XML package maintains a compact data frame handler, xmlToDataFrame, for flat XML files like yours. Any missing nodes available in other siblings result in NA for corresponding fields.
library(XML)
doc <- xmlParse("/path/to/file.xml")
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA"))
You could even download the daily zips, unzip the needed XML, and parse it into a data frame should the large monthly XMLs pose memory challenges. As an example, the code below extracts the December 2018 daily data into a list of data frames to be row-bound at the end. The process also adds a DDate field. The method is wrapped in tryCatch because of missing days in the sequence or other URL or zip issues.
dec_urls <- paste0(1201:1231)
temp_zip <- "/path/to/temp.zip"
xml_folder <- "/path/to/xml/folder"
xml_process <- function(dt) {
tryCatch({
# DOWNLOAD ZIP FROM URL
url <- paste0("ftp://ftp.aduanas.gub.uy/DUA%20Diarios%20XML/2018/dd2018", dt,".zip")
file <- paste0(xml_folder, "/dding2018", dt, ".xml")
download.file(url, temp_zip)
unzip(temp_zip, files=paste0("dding2018", dt, ".xml"), exdir=xml_folder)
unlink(temp_zip) # DESTROY TEMP ZIP
# PARSE XML TO DATA FRAME
doc <- xmlParse(file)
df <- transform(xmlToDataFrame(doc, nodes=getNodeSet(doc, "//ROW_DUASDIA")),
DDate = as.Date(paste0("2018", dt), format="%Y%m%d", origin="1970-01-01"))
unlink(file) # DESTROY TEMP XML
# RETURN XML DF
return(df)
}, error = function(e) NA)
}
# BUILD LIST OF DATA FRAMES
dec_df_list <- lapply(dec_urls, xml_process)
# FILTER OUT "NAs" CAUGHT IN tryCatch
# (NROW(NA) IS 1, SO TEST FOR DATA FRAMES EXPLICITLY)
dec_df_list <- Filter(is.data.frame, dec_df_list)
# ROW BIND TO FINAL SINGLE DATA FRAME
dec_final_df <- do.call(rbind, dec_df_list)

Here is a solution that processes the entire document at once as opposed to reading each of the 150,000 records in the loop. This should provide a significant performance boost.
This version can also handle cases where the number of variables per record is different.
library(xml2)
doc<-read_xml('<ROOT>
<ROWSET_DUASDIA>
<ROW_DUASDIA NUM="1">
<variable1>value1</variable1>
<variable191>value2</variable191>
</ROW_DUASDIA>
<ROW_DUASDIA NUM="150236">
<variable1>value3</variable1>
<variable2>value_new</variable2>
<variable191>value4</variable191>
</ROW_DUASDIA>
</ROWSET_DUASDIA>
</ROOT>')
#find all of the nodes/records
nodes<-xml_find_all(doc, ".//ROW_DUASDIA")
#find the record NUM and the number of variables under each record
nodenum<-xml_attr(nodes, "NUM")
nodeslength<-xml_length(nodes)
#find the variable names and values
nodenames<-xml_name(xml_children(nodes))
nodevalues<-trimws(xml_text(xml_children(nodes)))
#create dataframe
df<-data.frame(NUM=rep(nodenum, times=nodeslength),
variable=nodenames, values=nodevalues, stringsAsFactors = FALSE)
#dataframe is in a long format.
#Use the function cast, or spread from tidyr, to convert to wide format
# NUM variable values
# 1 1 variable1 value1
# 2 1 variable191 value2
# 3 150236 variable1 value3
# 4 150236 variable2 value_new
# 5 150236 variable191 value4
#Convert to wide format
library(tidyr)
spread(df, variable, values)
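If tidyr is not available, the same long-to-wide step can be sketched in base R with reshape(). This is only an illustration on the toy records from the example; `long_df` and `wide_df` are names made up here, and the `values.` prefix on the resulting columns comes from reshape() itself. (In current tidyr, pivot_wider() is the successor to spread().)

```r
# Same long-format layout as the df built from the XML nodes
long_df <- data.frame(
  NUM      = c("1", "1", "150236", "150236", "150236"),
  variable = c("variable1", "variable191", "variable1", "variable2", "variable191"),
  values   = c("value1", "value2", "value3", "value_new", "value4"),
  stringsAsFactors = FALSE
)

# One row per NUM, one column per variable; records that lack a
# variable get NA in that column
wide_df <- reshape(long_df, idvar = "NUM", timevar = "variable",
                   direction = "wide")

# Optional: drop the "values." prefix that reshape() adds
names(wide_df) <- sub("^values\\.", "", names(wide_df))
wide_df
```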

Subset large .csv file at reading in R

I have a very large .csv file (~4GB) which I'd like to read, then subset.
The problem comes at reading (memory allocation error). Because the file is that large, reading crashes, so what I'd like is a way to subset the file before or while reading it, so that I only get the rows for one city (Cambridge).
f:
id City Value
1 London 17
2 Coventry 21
3 Cambridge 14
......
I've already tried the usual approaches:
f <- read.csv(f, stringsAsFactors=FALSE, header=T, nrows=100)
f.colclass <- sapply(f,class)
f <- read.csv(f,sep = ",",nrows = 3000000, stringsAsFactors=FALSE,
header=T,colClasses=f.colclass)
which seem to work for up to 1-2M rows, but not for the whole file.
I've also tried subsetting at the reading itself using pipe:
f<- read.table(file = f,sep = ",",colClasses=f.colclass,stringsAsFactors = F,pipe('grep "Cambridge" f ') )
and this also seems to crash.
I thought packages sqldf or data.table would have something, but no success yet !!
Thanks in advance, p.

I think this was alluded to already, but in case it wasn't completely clear: the sqldf package creates a temporary SQLite database on your machine from the CSV file and lets you write SQL queries to subset the data before saving the results to a data.frame.
library(sqldf)
query_string <- "select * from file where City=='Cambridge' "
f <- read.csv.sql(file = "f.csv", sql = query_string)
#or rather than saving all of the raw data in f, you may want to perform a sum
f_sum <- read.csv.sql(file = "f.csv",
sql = "select sum(Value) from file where City=='Cambridge' " )

One solution to this type of error:
First convert your CSV file to an Excel file.
Then map the Excel file into a MySQL table using Toad for MySQL, which is easy. Just check the data types of the variables.
Then, using the RODBC package, you can access a dataset this large.
I work with datasets of more than 20 GB this way.

Although there's nothing wrong with the existing answers, they miss the most conventional/common way of dealing with this: chunks (Here's an example from one of the multitude of similar questions/answers).
The only difference is, unlike for most of the answers that load the whole file, you would read it chunk by chunk and only keep the subset you need at each iteration
# open connection to file (mostly convenience)
file_location = "C:/users/[insert here]/..."
file_name = 'name_of_file_i_wish_to_read.csv'
con <- file(paste(file_location, file_name,sep='/'), "r")
# set chunk size - basically want to make sure its small enough that
# your RAM can handle it
chunk_size = 1000 # the larger the chunk the more RAM it'll take but the faster it'll go
i = 0 # set i to 0 as it'll increase as we loop through the chunks
# loop through the chunks and select rows that contain cambridge
repeat {
# things to do only on the first read-through
if(i==0){
# read in columns only on the first go
grab_header=TRUE
# load the chunk
tmp_chunk = read.csv(con, nrows = chunk_size,header=grab_header)
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# initiate container for desired data
df = tmp_chunk[cond,] # save desired subset in initial container
cols = colnames(df) # save column names to re-use on next chunks
}
# things to do on all subsequent non-first chunks
else if(i>0){
grab_header=FALSE
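# read.csv raises an error (rather than returning 0 rows) once the
# connection has no lines left, so treat that error as end of input
tmp_chunk = tryCatch(read.csv(con, nrows = chunk_size,header=grab_header,col.names = cols),
error = function(e) NULL)
# set stopping criteria for the loop
# when nothing more can be read, exit loop
if(is.null(tmp_chunk) || nrow(tmp_chunk)==0){break}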
# subset only to desired criteria
cond = tmp_chunk[,'City'] == "Cambridge"
# append to existing dataframe
df = rbind(df, tmp_chunk[cond,])
}
# add 1 to i to avoid the things needed to do on the first read-in
i=i+1
}
close(con) # close connection
# check out the results
head(df)

How to convert rows

I have uploaded a dataset called "Obtained Dataset". It usually has 16 rows of numeric and character variables (some other files of a similar nature have fewer than 16). Each of these variables is a header for the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data, the 1st column is the x-axis, the 2nd column is the y-axis, and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so on, as given in the first 16 rows of the data.
Now I want R code that can convert it into the format shown in the required dataset. Also, if a different dataset has fewer than 16 lines of names, say GR 1 LIN and RHOB 1 LIN are missing, then I want it to still create those columns with NA entries for rows 1:nrow.
Currently I export the file to Excel, manually clean the data, rename the columns accordingly, save it as CSV, and then use read.csv("filename"), but it is simply not possible to do this for 400 files.
Any advice on how to proceed would be a great help.

I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life simpler for others, and you are asked to put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in first 6 lines, and followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check whether the solution is working. If it is, then move to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (to explain also) is copy the 400 files in a directory, and then set that as working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
temp <- read.table(file=fileName, sep ="\t", skip = 6)
names(temp) <- namesVec
# name each output after its input so files do not overwrite one another
write.table(temp,file=paste("clean",fileName,sep="-"),row.names = FALSE,sep = "\t",quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
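The question also asks for columns that are absent from a file to be created and filled with NA. A minimal sketch of that step follows, assuming the full set of expected column names is known up front; `add_missing_cols`, `expected_cols`, and the sample column names are hypothetical and only for illustration.

```r
# Hypothetical helper: pad a data frame so it has every expected column,
# filling the absent ones with NA, and return the columns in a fixed order
add_missing_cols <- function(df, expected) {
  for (nm in setdiff(expected, names(df))) {
    df[[nm]] <- rep(NA, nrow(df))
  }
  df[, expected, drop = FALSE]
}

# Assumed column names; a real file would supply its own
expected_cols <- c("x", "y", "depth", "GR", "CAL", "RHOB")

# A file that is missing the GR and RHOB columns
partial <- data.frame(x = 1:3, y = 4:6, depth = c(10.5, 11, 11.5), CAL = 7:9)

padded <- add_missing_cols(partial, expected_cols)
padded
```

Such a helper could be called inside convertFiles just before write.table, so that every cleaned file ends up with the same columns.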
Hope this helps!

How to allocate/append a large column of Date objects to a data-frame

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL does not allow natively representing dates as an R Date object, this is important for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, for now this is how I want to do it - I might use native SQL eventually but please don't quibble that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
Then the workaround: copy tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not run out of memory, with a garbage-collect after each chunk:
for (i in 0:7) {
from <- floor(i*N/8)
to <- floor((i+1)*N/8) -1
if (i==7)
to <- N
print(c("Copying tr.sql$visit_date",from,to," ..."))
tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
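For reference, the same chunked conversion can be sketched on a toy vector with 1-based chunk boundaries, which avoids the zero index in the loop above. The names `visit_chr`, `n_chunks`, and `bounds` are made up for this sketch, and the five dates stand in for the real 12 million rows.

```r
# Toy stand-in for tr.sql$visit_date (character dates in YYYY-MM-DD form)
visit_chr <- c("2010-04-01", "2010-04-06", "2010-04-07",
               "2010-04-12", "2010-04-14")
N <- length(visit_chr)
n_chunks <- 2

# Preallocate a Date vector of length N
visit_date <- rep(as.Date(NA), N)

# Chunk boundaries: chunk k covers (bounds[k] + 1):bounds[k + 1]
bounds <- floor(seq(0, N, length.out = n_chunks + 1))

for (k in seq_len(n_chunks)) {
  if (bounds[k + 1] > bounds[k]) {          # skip empty chunks
    idx <- (bounds[k] + 1):bounds[k + 1]
    # only this chunk's temporary exists during the conversion
    visit_date[idx] <- as.Date(visit_chr[idx])
  }
}

class(visit_date)
```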
The problem is allocating then copying the visit_date column.
Here is the dataset and code, I am having multiple separate problems with this, explanation below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The second line that tries to declare data.frame(visit_date = Date(N)) fails, I don't know the correct qualified name with namespace for Date object (tried chron::Date , Dates::Date? don't work)
Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...

Try this, making sure that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large in relation to available memory that it's unlikely you will be able to do much analysis with it anyway.)
# create test data file
Lines <-
"customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")
# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))

It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.

Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
a
1 1974-01-10
2 1974-01-15
3 1974-01-20
