Memory problems using bigmemory to load large dataset in R

Memory problems using bigmemory to load large dataset in R - r

I have a large text file (>10 million rows, > 1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5){
print(c(address(x), refs(x)))
y <- readLines(con, n = 1, warn = FALSE)
x[i] <- 2L*as.integer(y)
}
close(con)
where x.csv contains
4
18
2
14
16
Following the advice here http://adv-r.had.co.nz/memory.html I have printed the memory address of my big.matrix object and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place?
is there a better way to load, process and then save these data? The current method is slow!

is there a better way to load, process and then save these data? The current method is slow!
The slowest part of your method appearts to be making the call to read each line individually. We can 'chunk' the data, or read in several lines at a time, in order to not hit the memory limit while possibly speeding things up.
Here's the plan:
Figure out how many lines we have in a file
Read in a chunk of those lines
Perform some operation on that chunk
Push that chunk back into a new file to save for later
library(readr)
# Make a file
x <- data.frame(matrix(rnorm(10000),100000,10))
write_csv(x,"./test_set2.csv")
# Create a function to read a variable in file and double it
calcDouble <- function(calc.file,outputFile = "./outPut_File.csv",
read.size=500000,variable="X1"){
# Set up variables
num.lines <- 0
lines.per <- NULL
var.top <- NULL
i=0L
# Gather column names and position of objective column
connection.names <- file(calc.file,open="r+")
data.names <- read.table(connection.names,sep=",",header=TRUE,nrows=1)
close(connection.names)
col.name <- which(colnames(data.names)==variable)
#Find length of file by line
connection.len <- file(calc.file,open="r+")
while((linesread <- length(readLines(connection.len,read.size)))>0){
lines.per[i] <- linesread
num.lines <- num.lines + linesread
i=i+1L
}
close(connection.len)
# Make connection for doubling function
# Loop through file and double the set variables
connection.double <- file(calc.file,open="r+")
for (j in 1:length(lines.per)){
# if stops read.table from breaking
# Read in a chunk of the file
if (j == 1) {
data <- read.table(connection.double,sep=",",header=FALSE,skip=1,nrows=lines.per[j],comment.char="")
} else {
data <- read.table(connection.double,sep=",",header=FALSE,nrows=lines.per[j],comment.char="")
}
# Grab the columns we need and double them
double <- data[,I(col.name)] * 2
if (j != 1) {
write_csv(data.frame(double),outputFile,append = TRUE)
} else {
write_csv(data.frame(double),outputFile)
}
message(paste0("Reading from Chunk: ",j, " of ",length(lines.per)))
}
close(connection.double)
}
calcDouble("./test_set2.csv",read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change double <- data[,I(col.name)] * 2 to whatever thing you need to do to each chunk.

Related

R: How to read in file and skip lines that have errors with fread function?

I'm trying to read in a CSV file with fread function but while reading it breaks cause of extra characters in the row. So I was wondering if there is a way to read the file, skip the rows with errors, and continue reading it? Thank you for any advice.
Below, you can see the error I get
In fread("data.csv", :
Stopped early on line 617854. Expected 52 fields but found 54. Consider fill=TRUE and comment.char=. First discarded non-empty line:

I think you could use the nrows and skip arguments of fread to kind of do this yourself. I haven't got an appropriately broken csv to hand to test this on so no promises that this will work, but maybe something like the stuff below. This is basically an attempt to automate taking that row number flagged in the warning, and resuming reading the csv in for all rows after that row.
Essentially I'm reading in 100,000 rows at a time, and if that's successful, I write that data to a list called data_chunks. If it throws a warning, I pick up the warning message, use some regex to figure out what the line number is, and read up to that line. I then return that data.table and write to the data_chunks list. I then update the rows_to_skip value by the number of rows across all my data.tables in data_chunks, plus the number of problem rows (I return a bad_row boolean along with the data.table to indicate this, and add it to bad_rows at each iteration).
It is all in a while loop so will keep executing until the number of rows_to_skip exceeds the number of rows to be read, in which case, an error is thrown and the if statement triggers the break, and you exit the loop. Finally, use rbindlist to bind all the rows together across your list. This feels pretty hacky and probably isn't all that reliable but for the sake of getting your data loaded into R it may be a start at least:
data_chunks <- list()
i <- 1
rows_to_skip <- 0
rows_to_read <- 100000
bad_rows <- 0
file_name <- "my.csv"
while (TRUE) {
out <- tryCatch(
list(
data = data.table::fread(file_name, nrows = rows_to_read, skip = rows_to_skip, header = FALSE),
bad_row = FALSE
),
error = function(e) {
e
},
warning = function(w) {
warn_msg <- conditionMessage(w)
warn_matches <- regexec("line (\\d+)", warn_msg)
rows_to_read <- as.numeric(regmatches(warn_msg, warn_matches)[[1]][2]) - 1
if (!is.na(rows_to_read)) {
list(
data = data.table::fread(file_name, nrows = rows_to_read, skip = rows_to_skip, header = FALSE),
bad_row = TRUE
)
} else {
NULL
}
})
if ("error" %in% class(out) || is.null(out)) {
break
} else {
data_chunks[[i]] <- out[["data"]]
}
bad_rows <- bad_rows + out[["bad_row"]]
rows_to_skip <- sum(sapply(data_chunks, nrow)) + bad_rows
i <- i + 1
}
mydata <- data.table::rbindlist(data_chunks, use.names = FALSE)

how to Append Multiples Outputs from multiple files analysis Using for loop in R

I am reaching out to see if anyone can direct me to a resource where I can fine tune my R codes. I am doing analysis on many files that have the same variables using R. At the end, I would like to append all the outputs tables-I have 12 tables- and have only one output table. Any idea on how to proeceed.
To recap. If you have multiples outputs using for loop, how to append the files to make a unique file.
Please see the codes below:
setwd("C:/Data/Key/Spring2017/era");
subjects=c("ALGEBRA1","BIOLOGY","LITERATURE");
modules=c("module1","module2");
`modes=c("O","P");`
#loopcount<-1;
for(i in 1:length(subjects)){
for(k in 1:length(modes)){
for(m in 1:length(modules)){
#i=1;
#k=1;
#m=1
subj <- subjects[i];
mode <- modes[k];
module <- modules[m];
cat("Subj:",subj,",mode: ",mode,",module: ",module,"\n");
###Reading the analyses output;
outfile<-
paste("output/erasure_Mean_",mode,"_",module,"_",subj,".csv",sep="");
###Reading in the datafiles;
infile<-paste("data/erasure_",mode,"_",module,"_",subj,".csv",sep="");
dat<-read.csv(infile,header=TRUE,as.is=T);
newid<-paste(dat[,"dis"],dat[,"sch"]);
dat<-data.frame(dat,newid);
head(dat);
count<-function(dat){
length(na.omit(dat))
}
##computing students count per school for Algebra1;
temp<-aggregate(dat[,"subj"],by=list(dat[,"newid"]),FUN="count")
colnames(temp)=c("newid","N")
#removing duplicate id;
#y<-dat[!duplicated(dat[,"newid"]),c("newid","dis","disname","schname","drcid","subj","mode","module")]
y<-dat[!duplicated(dat[,"newid"]),c(1,2,3,4,7:9,15)]
head(y)
MinN<-10;
out<-merge(y,temp,by="newid")
head(out)
#count of the students with FivePlus above;
temp<-aggregate(dat[,"FivePlus"],list(dat[,"newid"]),FUN=sum)
colnames(temp)<-c("newid","FivePlusN");
out<-merge(out,temp,by="newid")
temp<-aggregate(dat[,"FivePlus"],list(dat[,"newid"]),FUN=mean);
colnames(temp)<-c("newid","FivePlusPer");
temp[,2]=temp[,2]*100
out<-merge(out,temp,by="newid");
#state erasure mean
gmean<-mean(dat[,"tot_wr"])
gsd<-sd(dat[,"tot_wr"])
gn<-nrow(dat)
varused<-c("tot_wr");
x<-dat[!duplicated(dat[,"subj"]),c(7:9)]
pre.x<-data.frame(x,varused,gmean,gsd,gn)
##Statistics for wr;
#mean of wr by school
temp<-aggregate(dat[,"tot_wr"],list(dat[,"newid"]),FUN=mean)
colnames(temp)<-c("newid",paste("tot_wr","mean",sep="."))
head(temp)
#mean of wr per test
pertest<-temp[,2]
pertest<-as.matrix(pertest,ncol=1)
colnames(pertest)<-c(paste("tot_wr","pertest",sep="."))
temp<-data.frame(temp,pertest)
out<-merge(out,temp,by="newid")
#variance of WR
temp<-aggregate(dat[,"tot_wr"], list(dat[,"newid"]),sd)
#The standard deviation used is across item types by school
colnames(temp)<-c("newid",paste("tot_wr", "sd",sep="."))
out<-merge(out,temp,by="newid")
#z score of wr
Z<-(out[,"tot_wr.pertest"]-gmean)/(out[,"tot_wr.sd"]/sqrt(out[,"N"]))
out<-data.frame(out,Z)
##ncol(out) determines the column to rename using paste function.
colnames(out)[ncol(out)]<-paste("tot_wr","Z",sep=".")
##p value of wr
tdf<-out[,"N"]-1
##lower.tail logical if True, prob are P[X<=x],otherwise P[X>x]
##log.p if true, probabilities p are given as log(p)
pval<-pt(Z,tdf,lower.tail=F,log.p=FALSE)
out<-data.frame(out,pval)
colnames(out)[ncol(out)]<-paste("tot_wr","pval",sep=".")
#threat for wr
threat<-matrix(0,ncol=1,nrow=nrow(out))
prethreat<-as.matrix(round(abs(1.086*log(pval/(1-pval))),digits=4),ncol=1, nrow=nrow(out))
for(threatloop in 1:nrow(out)){
if (out[threatloop, paste("tot_wr","pval",sep=".")]< 0.5 &
is.na(out[threatloop,paste("tot_wr","pval",sep=".")])==F){
threat[threatloop,]<-prethreat[threatloop,]
}
}
threat <- as.matrix(threat,ncol=1)
if(length(threat[which(threat[,1]> 49.9),1])>0){
threat[which(threat[,1]> 49.9),1]<-49.9
}
colnames(threat)<-paste("tot_wr","threat",sep=".")
out<-cbind(out,threat)
if (length(which(out[,paste("tot_wr","threat",sep=".")] > 9.9
& out[,paste("tot_wr","threat",sep=".")] < 10))> 0){
out[which(out[,paste("tot_wr","threat",sep=".")]< 10),paste("tot_wr","threat",sep=".")] < -9.9
}
if (length(which(out[,"N"]< MinN))> 0){
out[which(out[,"N"]< MinN),paste("tot_wr","threat",sep=".")]<- NA
}
if(length(which(out[,paste("tot_wr","Z",sep=".")]=="-Inf")) > 0){
out[which(out[,paste("tot_wr","Z",sep=".")]=="-Inf"),paste("tot_wr","Z",sep=".")]<- -999999
}
if (length(which(out[,paste("tot_wr","Z",sep=".")]=="Inf")) > 0){
out[which(out[,paste("tot_wr","Z",sep=".")]=="Inf"),paste("tot_wr","Z",sep=".")] <- 999999
}
eras<-rbind(pre.x,pre.x1,pre.x2,pre.x3)
write.table(out,outfile,quote=F,append=F,row.names=F,col.name=T,na= "",sep=",")
}
}
}

Consider generalizing your process with a function and replace nested for loops with single iteration through an expand.grid dataframe of all possibly combinations. Then use mapply or its wrapper Map to pass in parameters to your generalized function.
# DATAFRAME OF ALL POSSIBLE COMBINATIONS OF NESTED for LOOP
loopdf <- expand.grid(mode = c("O","P"),
module = c("module1","module2"),
subj = c("ALGEBRA1","BIOLOGY","LITERATURE"))
# USER-DEFINED FUNCTION nearly same code but two changes at beginning and end:
# 1. Remove assignment of mode, module, subj since they are passed in as parameters
# 2. Replace write.table with a return() since you will output file outside of function
table_process <- function(mode, module, subj){
cat("Subj:",subj,",mode: ",mode,",module: ",module,"\n")
infile<-paste("data/erasure_",mode,"_",module,"_",subj,".csv",sep="")
# ... EXACT SAME CODE EXCEPT LAST LINE
return(out)
}
# LIST OF DATAFRAMES
dfList <- Map(table_process, loopdf$mode, loopdf$module, loopdf$subj)
# EQUIVALENTLY
# dfList <- mapply(table_process, loopdf$mode, loopdf$module, loopdf$subj, SIMPLIFY = FALSE)
# ROW BIND ALL DF ELEMENTS INTO ONE DATAFRAME (ASSUMED SAME COLUMN LENGTH AND NAMES)
finaldf <- do.call(rbind, dfList)
# OUTPUT SINGLE FILE
write.table(finaldf, "final.csv", quote=F, append=F, row.names=F, col.name=T, na= "", sep=",")
# EQUIVALENTLY WITHOUT sep ARG
# write.csv(finaldf, "final.csv", quote=F, append=F, row.names=F, col.name=T, na= "")

How to run a script for multiple inputs and save result objects based on the name of input?

To run the same set of commands and save the result objects for each time series, I wrote the script in the following manner :
# Specify time series to be used
dat <- tsname
# Run a set of commands and fit models with different parameters
dat.1 <- model1(dat)
dat.2 <- model2(dat)
dat.3 <- model3(dat)
# Save objects for further analysis
tsname.1 <- dat.1
tsname.2 <- dat.2
save(tsname.1, tsname.2, tsname.3, file = paste0("tsname", ".rda")
In this way, we just need to change the script in the beginning and end, save the script for each time series and run each of them individually or in a main script.
The main reason for this method was because I could not find a way to rename the objects created and some search suggested that the above is the only way to do it.
Now as the number of series has increased, it is preferable to either use a for loop, foreach, batch script or commandArgs() to run one script and specify all time series as arguments.
To make that work though, the script must find a way to assign these objects with name of series itself so that they can be loaded later for further analysis.
How can we make such a script work or is there a better approach ? Which method of looping will work in that case ?
A MWE
set.seed(1)
tsdata <- ts(rnorm(250), start = c(1980,1), frequency = 12)
dat <- tsdata
dat.11 <- arima(dat, order = c(1, 1, 1))
dat.21 <- arima(dat, order = c(2, 1, 0))
tsname.11 <- dat.11 # problem is to specify this step in each script
tsname.21 <- dat.21
save(tsname.11, , file = "tsname.rda")
REVISED the code
How can we execute this script for multiple time series and store the results and result objects for further analysis ? If Batch command can be used, what is the best way to input set of multiple time series?
How can we run the script for one series, over a set of time series of same or mixed length?

I show a couple ways to create and retrieve individual objects using assign and get, but also provide an alternative where all model runs are stored as different elements of a list. Similarly, I show how you can save each model run in separate files (soi.1.rda, etc), but that you can also save everything together, in one step :)
# ===========================================
# = Set up model params, generate test data =
# ===========================================
mod.param <- 1:5 # orders of AR to try ...
test.soi <- arima.sim(model=list(ar=c(0.5, -0.2)), n=20)
# ===========================================================
# = Create empty vectors/ list to store data and data names =
# ===========================================================
dat.names <- c() # a place to store the names of the individual objects that we'll create
save.names <- c() # the names of the files to save, e.g., "soi.1"
dat.all <- list() # as an altnerative, you can save each analysis in different elements of a list
# ===================================================
# = Loop through each type of model, saving results =
# ===================================================
for(i in 1:length(mod.param)){ # loop through each model you want to run
temp.dat <- arima(test.soi, order=c(mod.param[i], 0, 0)) # temp.dat is the current model result
dat.names[i] <- paste("dat", i, sep=".") # dat.names stores the names of all the dat.x objects
assign(dat.names[i], temp.dat) # use assign() to create an object with name of temp.dat.name
# dat.all[[dat.names[i]]] <- temp.dat # store the object in a list
dat.all[[dat.names[i]]] <- get(dat.names[i]) # same as above, but using get(), which complements assign() nicely
save.name <- paste("soi", i, "rda", sep=".") # I'm assuming the file should be named soi.1.rda, not soi.rda
save(list=dat.names[i], file=save.name) # save soi.1.rda, soi.2.rda ... etc.
}
# But we don't have to save each file individually!
# We can save a file that contains our list of models (dat.all), as well as each model object (dat.1, dat.2 ... etc.)
all.objs <- ls() # what are all of the object names in our working memory?
dat.objs <- all.objs[all.objs%in%c(dat.names, "dat.all")] # subset to the names of objects we want to save
save(list=dat.objs, file="everything.rda") # save all relevant objects in 1 .rda file
print(dat.1)
print(dat.all$dat.1)
Edit: A different approach that applies each of several models to several time series
Note that the approach might change slightly depending on which models you want to apply to which time series. I've assumed that several models should be applied to each time series, and that the models differ only the the ARIMA order.
The results can be saved as 1 nested list (different model results grouped under different time series), or with model results for each time series being saved as a separate file.
# ============================================================
# = Generate many time series, many sets of model parameters =
# ============================================================
# Model parameters
n.Params <- 5
ar.orders <- 1:n.Params # orders of AR to try ...
i.orders <- rep(0, n.Params)
ma.orders <- rep(0,n.Params)
arima.params <- as.list(as.data.frame(rbind(ar.orders, i.orders, ma.orders)))
# Time Series Data
n.ts <- 10 # number of time series
test.soi <- quote(as.numeric(arima.sim(model=list(ar=c(0.2, 0.4)), n=sample(20:30, 1))))
all.soi.ts <- replicate(n.ts, eval(test.soi))
names(all.soi.ts) <- paste("soi", 1:n.ts, sep=".")
# ==============================================
# = Function to be applied to each time series =
# ==============================================
# Analyze time series
ats <- function(TS, arimaParams){
dat.all <- list() # as an altnerative, you can save each analysis in different elements of a list
for(i in 1:length(arimaParams)){ # loop through each model you want to run
temp.dat <- arima(TS, order=arimaParams[[i]]) # temp.dat is the current model result
dat.all[[i]] <- temp.dat # store the object in a list
}
dat.all
}
# =========================
# = All Results in 1 List =
# =========================
AllResults <- lapply(all.soi.ts, ats, arima.params) # multilevel list – top level is each TS, within each TS group are the results of all models applied to that time series
save(AllResults, file="everything.rda") # save this big list as 1 file
# ========================================================================
# = Each time series gets its own file and its own list of model results =
# ========================================================================
for(i in 1:length(all.soi.ts)){ # if you want many files, 1 file per time series, use this for loop
temp.ts <- all.soi.ts[[i]]
soi.name <- paste("soi", i, sep=".")
assign(soi.name, ats(temp.ts, arima.params))
save(list=soi.name, file=paste(soi.name, "rda", sep=".")) # each file will have a name like "soi.1.rda", containing the results of all models applied to the first time series
}

The function sets datname to the name of the input variable. Then define a list L of model outputs and add names. Finally use with(L, ...) to regard the list component names as variable names in ... and use save(list = ..., ...) which allows specification of the variables as a character string of names. Now we only have to set up the data and call the function to run it. If you have several data sets call the function for each one.
run <- function(dat, datname = deparse(subset(dat))) {
L <- list(
arima(dat, order = c(1, 1, 1)),
arima(dat, order = c(2, 1, 0))
)
names(L) <- paste(datname, seq_along(L), sep = ".")
with(L, save(list = names(L), file = paste0(datname, ".rda")))
}
set.seed(1)
soi <- ts(rnorm(250), start = c(1980,1), frequency = 12)
run(soi)
Another possibility might be to save the entire list rather than its components separately. That is, replace the with statement with
listname <- paste0(datname, ".models")
assign(listname, L)
save(list = listname, file = paste0(datname, ".rda"))
REVISED Some corrections and added alternative at end.

When you want to manipulate objects whose names are themselves stored inside a variable, just use assign() and its reverse get(). And use ls() to see which objects exist in a particular scope.
The objects don't need to be stored separately as tsname.1/2/3, model1/2/3??
You can make it real simple if you just store a vector dat[1:3].
Indeed you can have a vector of model[1:3] too. Use vectorization. It's your friend.
Use the assign("tsname.21", object,...) command and its reverse get("tsname.21") to manipulate objects by string name. Just be consistent about whether you prefer to refer to objnames or objects.
set.seed(1)
tsdata <- ts(rnorm(250), start = c(1980,1), frequency = 12)
dat <- tsdata
set.seed(1)
tsdata <- ts(rnorm(250), start = c(1980,1), frequency = 12)
dat <- tsdata
create_model <- function(data, params, objname.prefix='tsname.', envir=.GlobalEnv) {
objname = paste(objname.prefix, params[1], params[2], sep='') # both assigns and prints it
the.model <- arima(dat, order = params)
assign(objname, the.model, envir) # create the var in the global env
# If you want, you can return the varname
return(objname)
}
# dat.11 <- arima(dat, order = c(1, 1, 1))
create_model(dat, c(1, 1, 1))
# dat.21 <- arima(dat, order = c(2, 1, 0))
create_model(dat, c(2, 1, 0))
#tsname.11 <- dat.11 # problem is to specify this step in each script
#tsname.21 <- dat.21
save(tsname.11, , file = "tsname.rda")
# Use `ls(pattern=...)` to find object-names, with wildcard matching.
all.models <- ls(pattern='tsname.*')
#[1] "tsname.11" "tsname.21"
#############
# Refactor your original code similarly.
dat <- tsname
# Run a set of commands and fit models with different parameters
dat[1] <- model1(dat)
dat[2] <- model2(dat)
dat[3] <- model3(dat)
# or maybe figure out how to use sapply here
# Save objects for further analysis
tsname <- dat[1:2] # instead of tsname.1 <- dat.1, tsname.2 <- dat.2
#
save(tsname, file = paste0("tsname", ".rda")

Process a large list of objects in R

I have a large list of objects (say 100k elements). Each element will have to be processed by a function "process" BUT I would like to do the processing in chunks... say 20 passes for example as I want to save processing results into a hard drive file and keep operating memory free.
I'm new to R and I know that it should involve some apply magic but I don't know how to do it (yet).
Any guidance would be much appreciated.
A small example:
objects <- list();
for (i in 1:100){
objects <- append(objects, 500);
}
objects;
processOneElement <- function(x){
x/20 + 23;
}
I would like to process first 20 elements in one go and save results then process second 20 elements in second go and save results... and so on
objects <- list();
for (i in 1:100){
objects <- append(objects, 500);
}
objects;
process <- function(x){
x/20 + 23;
}
results <- lapply(objects, FUN=process)
index <- seq(1, length(objects), by=20);
lapply(index, function(idx1) {
idx2 <- min(idx1+20-1, length(objects));
batch <- lapply(idx:idx2, function(x) {
process(objects[[x]]);
})
write.table(batch, paste("batch", idx1, sep=""));
})

With what you have given, this is the answer I could suggest. Assuming your list is stored in list.object,
lapply(seq(1, length(list.object), by=20), function(idx) {
# here idx will be 1, 21, 41 etc...
idx2 <- min(idx+20-1, length(list.object))
# do what you want here..
batch.20.processed <- lapply(idx:idx2, function(x) {
process(list.object[[x]]) # passes idx:idx2 indices one at a time
})
# here you have processed list with 20 elements
# finally write to file
lapply(1:20, function(x) {
write.table(batch.20.processed[[x]], ...)
# where "..." is all other allowed arguments to write.table
# such as row.names, col.names, quote etc.
# don't literally pass "..." to write.table
})
}

How to write map reduce in R?

I am new to R. I know how to write map reduce in Java. I want to try the same in R. So can any one help in giving any samle codes and is there any fixed format there for MapReduce in R.
Please send any link other than this: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
Any sample codes will be more helpful.

When you want to implement a map reduce (with Hadoop) in a language other than Java, then you use a feature called streaming. Then the data is fed to the mapper via STDIN (readLines()), back to Hadoop via STDOUT(cat()), then to the reducer again through STDIN (readLines()) and blurted finally via STDOUT (cat()).
The following code is taken from an article I wrote on writing a map reduce job with R for Hadoop. The code is supposed to count 2-grams but I'd say simple enough to see what is going on MapReduce-wise.
# map.R
library(stringdist, quietly=TRUE)
input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
# in case of empty lines
# more sophisticated defensive code makes sense here
if(nchar(line) == 0) break
fields <- unlist(strsplit(line, "\t"))
# extract 2-grams
d <- qgrams(tolower(fields[4]), q=2)
for(i in 1:ncol(d)) {
# language / 2-gram / count
cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n")
}
}
close(input)
-
# reduce.R
input <- file("stdin", "r")
# initialize variables that keep
# track of the state
is_first_line <- TRUE
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
line <- unlist(strsplit(line, "\t"))
# current line belongs to previous
# line's key pair
if(!is_first_line &&
prev_lang == line[1] &&
prev_2gram == line[2]) {
sum <- sum + as.integer(line[3])
}
# current line belongs either to a
# new key pair or is first line
else {
# new key pair - so output the last
# key pair's result
if(!is_first_line) {
# language / 2-gram / count
cat(prev_lang,"\t",prev_2gram,"\t",sum,"\n")
}
# initialize state trackers
prev_lang <- line[1]
prev_2gram <- line[2]
sum <- as.integer(line[3])
is_first_line <- FALSE
}
}
# the final record
cat(prev_lang,"\t",prev_2gram, "\t", sum, "\n")
close(input)
http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Memory problems using bigmemory to load large dataset in R - r

Related

R: How to read in file and skip lines that have errors with fread function?

how to Append Multiples Outputs from multiple files analysis Using for loop in R

How to run a script for multiple inputs and save result objects based on the name of input?

Process a large list of objects in R

How to write map reduce in R?

Categories

Resources