I am new to R. I know how to write MapReduce jobs in Java and I want to try the same in R. Can anyone help by providing sample code, and is there a fixed format for MapReduce in R?
Please send any link other than this: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
Any sample code would be very helpful.
When you want to implement MapReduce (with Hadoop) in a language other than Java, you use a feature called streaming. The data is fed to the mapper via STDIN (readLines()), passed back to Hadoop via STDOUT (cat()), then fed to the reducer again through STDIN (readLines()), and finally written back via STDOUT (cat()).
The following code is taken from an article I wrote on writing a MapReduce job with R for Hadoop. The code counts 2-grams, but it is simple enough to see what is going on MapReduce-wise.
# map.R
library(stringdist, quietly=TRUE)

input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    # in case of empty lines
    # more sophisticated defensive code makes sense here
    if(nchar(line) == 0) break
    fields <- unlist(strsplit(line, "\t"))
    # extract 2-grams
    d <- qgrams(tolower(fields[4]), q=2)
    for(i in 1:ncol(d)) {
        # language / 2-gram / count
        cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n")
    }
}
close(input)
# reduce.R
input <- file("stdin", "r")

# initialize variables that keep
# track of the state
is_first_line <- TRUE

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
    line <- unlist(strsplit(line, "\t"))
    # current line belongs to previous
    # line's key pair
    if(!is_first_line &&
       prev_lang == line[1] &&
       prev_2gram == line[2]) {
        sum <- sum + as.integer(line[3])
    }
    # current line belongs either to a
    # new key pair or is first line
    else {
        # new key pair - so output the last
        # key pair's result
        if(!is_first_line) {
            # language / 2-gram / count
            cat(prev_lang, "\t", prev_2gram, "\t", sum, "\n")
        }
        # initialize state trackers
        prev_lang <- line[1]
        prev_2gram <- line[2]
        sum <- as.integer(line[3])
        is_first_line <- FALSE
    }
}
# the final record
cat(prev_lang, "\t", prev_2gram, "\t", sum, "\n")
close(input)
http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/
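Because the streaming contract is just text flowing through pipes, you can dry-run the two scripts locally before involving Hadoop at all. A minimal sketch (the input file name sample.tsv is a placeholder; the shell's sort stands in for Hadoop's sort-and-shuffle phase):

# Local dry run of the streaming pipeline, no Hadoop required.
# sort emulates the shuffle that groups equal keys for the reducer.
system("cat sample.tsv | Rscript map.R | sort | Rscript reduce.R")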
I am writing a function that will go through a list of files in a directory, count the number of complete cases in each, and, if the number of complete cases is above a given threshold, calculate a correlation. The output must be a numeric vector of correlations for all files that meet the threshold requirement. This is what I have so far, and it gives me Error: unexpected '}' in "}". Full disclosure: I am a complete newbie, as in I wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
    all_files <- list.files(path = getwd())
    correlations_list <- numeric()
    for (i in seq_along(all_files)) {
        dataFR2 <- read.csv(all_files[i])
        c <- c(sum(complete.cases(dataFR2)))
        if c >= threshhold {
            d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
            correlations_list <- c(correlations_list, d)
        }
    }
    correlations_list
}
"Unexpected *" errors are a syntax error. Often a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. if() is a function and it requires parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is the most commonly used R function, and giving an object the same name will make your code look very strange to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
    all_files <- list.files(path = directory)        ## use the directory argument rather than getwd()
    correlations_list <- numeric(length(all_files))  ## initialize to full length
    for (i in seq_along(all_files)) {
        dataFR2 <- read.csv(file.path(directory, all_files[i]))
        n_complete <- sum(complete.cases(dataFR2))
        if (n_complete >= threshhold) {
            d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = "pearson")
        } else {
            d <- NA
        }
        correlations_list[i] <- d
    }
    names(correlations_list) <- all_files
    correlations_list
}
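For example, assuming the CSV files live in a folder named "specdata" (a hypothetical name), usage would look something like this:

# Hypothetical usage: "specdata" and the threshold value are placeholders.
res <- correlation("specdata", threshhold = 150)
res[!is.na(res)]  # correlations for only the files that met the threshold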
I want to create a program in R that takes integer user input and adds it to the previous user input. For example, the user enters 10 one day, then enters 15 the next day, and the output is 25. Ideally this would accept a nearly unlimited amount of input. Here is what I have so far:
amount_spent <- function(){
    i <- 1
    while (i < 10) {
        n <- readline(prompt="How much did you spend?: ")
        i <- i + 1
    }
    print(c(as.integer(n)))
}
amount_spent()
The problems I have with this code are that it only saves the last input value, and that it is difficult to control when the user is allowed to input. Is there any way to save input read with readline() into an object that can be manipulated later?
# 1.R
fname <- "s.data"
if (file.exists(fname)) {
    load(fname)
}
if (!exists("s")) {
    s <- 0
}
n <- 0
while (TRUE) {
    cat("Enter a number: ")
    n <- scan("stdin", double(), n=1, quiet = TRUE)
    if (length(n) != 1) {
        print("exiting")
        break
    }
    s <- s + as.numeric(n)
    cat("Sum=", s, "\n")
    save(list=c("s"), file=fname)
}
You should run the script like this: Rscript 1.R
To exit the loop press Ctrl-D in Unix, or Ctrl-Z in Windows.
An R-ish way to do it would be through closures. Here is an example for interactive use (i.e. within an R session).
balance_setup <- function() {
    balance <- 0
    change_balance <- function() {
        n <- readline(prompt = "How much did you spend?: ")
        n <- as.numeric(n)
        if (!is.na(n))
            balance <<- balance + n
        balance
    }
    print_balance <- function() {
        balance
    }
    list(change_balance = change_balance,
         print_balance = print_balance)
}

funs <- balance_setup()
change_balance <- funs$change_balance
print_balance <- funs$print_balance
Calling balance_setup creates a variable balance and two functions that can access it: one for changing the balance, one for printing it. In R, functions can only return a single value, so I bundle both functions together in a list.
change_balance()
## How much did you spend? 5
## [1] 5
change_balance()
## How much did you spend? 5
## [1] 10
print_balance()
## [1] 10
If you want many inputs, use a loop:
repeat {
    change_balance()
}
Break the loop with Ctrl-C, Escape or whatever is used on your platform.
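If the balance should also survive between R sessions (the "next day" part of your question), one possible sketch, not part of the original answer, is to have the closure read and write an .rds file; the file name balance.rds is an arbitrary placeholder:

# Sketch: a closure whose state persists across sessions via an .rds file.
# "balance.rds" is an assumed, arbitrary file name.
balance_setup <- function(file = "balance.rds") {
    balance <- if (file.exists(file)) readRDS(file) else 0
    change_balance <- function() {
        n <- as.numeric(readline(prompt = "How much did you spend?: "))
        if (!is.na(n)) {
            balance <<- balance + n
            saveRDS(balance, file)  # persist after every update
        }
        balance
    }
    print_balance <- function() balance
    list(change_balance = change_balance, print_balance = print_balance)
}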
I am trying to optimize my R code with parSapply.
I have xmlfile and X as global variables.
When I didn't use clusterExport(cl,"X") and clusterExport(cl,"xmlfile"), I got the error "xmlfile object was not found".
When I did use these two clusterExport calls, I got the error "object of type 'externalptr' is not subsettable".
With a regular sapply it works fine.
Can someone see the problem?
I have this R code:
require("XML")
library(parallel)
setwd("C:/PcapParser")
# A helper function that enables the dynamic additon of new rows and unseen variables to a data.frame
# field is an R XML leaf-node (capturing a field of a protocol)
# X is the current data.frame to which the feature in field should be added
# rowNum is the row (packet) to which the feature should be added. [must be that rowNum <= dim(X)[1]+1]
addFeature <- function(field, X, rowNum)
{
# extract xml name and value
featureName = xmlAttrs(field)['name']
if (featureName == "")
featureName = xmlAttrs(field)['show']
value = xmlAttrs(field)['value']
if (is.na(value) | value=="")
value = xmlAttrs(field)['show']
# attempt to add feature (add rows/cols if neccessary)
if (!(featureName %in% colnames(X))) #we are adding a new feature
{
#Special cases
#Bad column names: anything that has the prefix...
badCols = list("<","Content-encoded entity body"," ","\\?")
for(prefix in badCols)
if(grepl(paste("^",prefix,sep=""),featureName))
return(X) #don't include this new feature
X[[featureName]]=array(dim=dim(X)[1]) #add this new feature column with NAs
}
if (rowNum > dim(X)[1]) #we are trying to add a new row
{X = rbind(X,array(dim=dim(X)[2]))} #add row of NA
X[[featureName]][rowNum] = value
return(X)
}
firstLoop<-function(x)
{
packet = xmlfile[[x]]
# Iterate over all protocols in this packet
for (prot in 1:xmlSize(packet))
{
protocol = packet[[prot]]
numFields = xmlSize(protocol)
# Iterate over all fields in this protocol (recursion is not used since the passed dataset is large)
if(numFields>0)
for (f in 1:numFields)
{
field = protocol[[f]]
if (xmlSize(field) == 0) # leaf
X<<-addFeature(field,X,x)
else #not leaf xml element (assumption: there are at most three more steps down)
{
# Iterate over all sub-fields in this field
for (ff in 1:xmlSize(field))
{ #extract sub-field data for this packet
subField = field[[ff]]
if (xmlSize(subField) == 0) # leaf
X<<-addFeature(subField,X,x)
else #not leaf xml element (assumption: there are at most two more steps down)
{
# Iterate over all subsub-fields in this field
for (fff in 1:xmlSize(subField))
{ #extract sub-field data for this packet
subsubField = subField[[fff]]
if (xmlSize(subsubField) == 0) # leaf
X<<-addFeature(subsubField,X,x)
else #not leaf xml element (assumption: there is at most one more step down)
{
# Iterate over all subsubsub-fields in this field
for (ffff in 1:xmlSize(subsubField))
{ #extract sub-field data for this packet
subsubsubField = subsubField[[ffff]]
X<<-addFeature(subsubsubField,X,x) #must be leaf
}
}
}
}
}
}
}
}
}
# Given the path to a pcap file, this function returns a dataframe 'X'
# with m rows that contain data fields extractable from each of the m packets in XMLcap.
# Wireshark must be intalled to work
raw_feature_extractor <- function(pcapPath){
## Step 1: convert pcap into PDML XML file with wireshark
#to run this line, wireshark must be installed in the location referenced in the pdmlconv.bat file
print("Converting pcap file with Wireshark.")
system(paste("pdmlconv",pcapPath,"tmp.xml"))
## Step 2: load XML file into R
print("Parsing XML.")
xmlfile<<-xmlRoot(xmlParse("tmp.xml"))
## Step 3: Extract all feature into data.frame
print("Extracting raw features.")
X <<- data.frame(num=NA) #first feature is packet number
# Iterate over all packets
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(3)
parSapply (cl,seq(from=1,to=xmlSize(xmlfile),by=1),firstLoop)
print("Done.")
return(X)
}
What am I doing wrong with parSapply (perhaps with regard to the global variables)?
Thank you
So I see a couple of obvious problems with this code. Global variables and functions are not accessible on the workers of a parallel cluster unless you explicitly export them or define them there. You need to make your addFeature and raw_feature_extractor functions available inside firstLoop, either by defining them there or by exporting them to the workers. When calling functions from a pre-existing package, you should either load the package as part of firstLoop (bad coding!) or call them explicitly using the package::function notation (good coding!). I suggest looking at the R documentation here on StackOverflow to help you work through creating an appropriately parallelized function.
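As a rough illustration (a sketch, not your exact setup): the 'externalptr' error arises because a parsed XML document is an external pointer into C memory, which cannot be serialized to the workers with clusterExport, so each worker has to re-parse the file itself:

# Sketch: set up workers so firstLoop can run under parSapply.
# Assumes tmp.xml has already been written, as in raw_feature_extractor.
library(parallel)
library(XML)

cl <- makeCluster(3)
# Plain R objects (functions, data.frames) can be shipped to the workers:
clusterExport(cl, c("addFeature", "X"))
# A parsed XML tree is an external pointer and cannot be exported,
# so every worker re-parses the file locally instead:
clusterEvalQ(cl, {
    library(XML)
    xmlfile <- xmlRoot(xmlParse("tmp.xml"))
})
xmlfile <- xmlRoot(xmlParse("tmp.xml"))  # master copy, for xmlSize()
res <- parSapply(cl, seq_len(xmlSize(xmlfile)), firstLoop)
stopCluster(cl)

Note also that the X <<- assignments inside firstLoop would only modify each worker's own copy of X; to get results back to the master, firstLoop would have to return its results as a value and let parSapply combine them.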
I have a large text file (>10 million rows, > 1 GB) that I wish to process one line at a time to avoid loading the entire thing into memory. After processing each line I wish to save some variables into a big.matrix object. Here is a simplified example:
library(bigmemory)
library(pryr)

con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5){
    print(c(address(x), refs(x)))
    y <- readLines(con, n = 1, warn = FALSE)
    x[i] <- 2L * as.integer(y)
}
close(con)
where x.csv contains
4
18
2
14
16
Following the advice here http://adv-r.had.co.nz/memory.html I have printed the memory address of my big.matrix object and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place?
Is there a better way to load, process and then save these data? The current method is slow!
Is there a better way to load, process and then save these data? The current method is slow!
The slowest part of your method appears to be the call to read each line individually. We can 'chunk' the data, i.e. read in several lines at a time, in order to stay under the memory limit while possibly speeding things up.
Here's the plan:
Figure out how many lines we have in a file
Read in a chunk of those lines
Perform some operation on that chunk
Push that chunk back into a new file to save for later
library(readr)

# Make a file
x <- data.frame(matrix(rnorm(10000), 100000, 10))
write_csv(x, "./test_set2.csv")

# Create a function to read a variable in file and double it
calcDouble <- function(calc.file, outputFile = "./outPut_File.csv",
                       read.size = 500000, variable = "X1") {
    # Set up variables
    num.lines <- 0
    lines.per <- NULL
    var.top <- NULL
    i <- 0L

    # Gather column names and position of objective column
    connection.names <- file(calc.file, open = "r+")
    data.names <- read.table(connection.names, sep = ",", header = TRUE, nrows = 1)
    close(connection.names)
    col.name <- which(colnames(data.names) == variable)

    # Find length of file by line
    connection.len <- file(calc.file, open = "r+")
    while ((linesread <- length(readLines(connection.len, read.size))) > 0) {
        i <- i + 1L  # increment first: indexing from 0 would silently drop the first chunk
        lines.per[i] <- linesread
        num.lines <- num.lines + linesread
    }
    close(connection.len)

    # Make connection for doubling function
    # Loop through file and double the set variables
    connection.double <- file(calc.file, open = "r+")
    for (j in 1:length(lines.per)) {
        # Read in a chunk of the file
        # (the if stops read.table from breaking on the header line)
        if (j == 1) {
            data <- read.table(connection.double, sep = ",", header = FALSE,
                               skip = 1, nrows = lines.per[j], comment.char = "")
        } else {
            data <- read.table(connection.double, sep = ",", header = FALSE,
                               nrows = lines.per[j], comment.char = "")
        }
        # Grab the columns we need and double them
        double <- data[, I(col.name)] * 2
        if (j != 1) {
            write_csv(data.frame(double), outputFile, append = TRUE)
        } else {
            write_csv(data.frame(double), outputFile)
        }
        message(paste0("Reading from Chunk: ", j, " of ", length(lines.per)))
    }
    close(connection.double)
}

calcDouble("./test_set2.csv", read.size = 50000, variable = "X1")
So we get back a .csv file with the manipulated data. You can change double <- data[,I(col.name)] * 2 to whatever operation you need to apply to each chunk.
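For what it's worth, newer versions of readr (1.0 and later, if I recall correctly) ship a chunked reader that handles the line-counting bookkeeping for you. A rough sketch with the same doubling operation assumed:

# Sketch using readr's chunked reader (assumes readr >= 1.0).
library(readr)

# Called once per chunk; x is a data frame, pos is the starting row.
double_chunk <- function(x, pos) {
    write_csv(data.frame(double = x$X1 * 2),
              "./outPut_File.csv", append = pos > 1)
}
read_csv_chunked("./test_set2.csv",
                 callback = SideEffectChunkCallback$new(double_chunk),
                 chunk_size = 50000)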
I want to combine the results from a for loop into one txt file, and I have written my code based on a suggestion from this link:
combine results from a loop in one file
There is one problem. I am supposed to get 8 results (rows) but I only ended up with 5. Somehow the other results did not make it into the file. I think the problem is with the if statement, but I don't know how to fix it.
Here is my code:
prob <- c(0.10, 0.20)
for (j in seq(prob)) {
    range <- c(2,3)
    for (i in seq(range)) {
        sample <- c(10,20)
        for (k in seq(sample)) {
            data <- Simulation(X = 1, Y = range[i], Z = sample[k], p = prob[j])
            filename <- paste('file', i, 'txt')
            if (j == 1) {
                write.table(data, "Desktop/file2.txt", col.names = TRUE)
            } else {
                write.table(data, "Desktop/file2.txt", append = TRUE, col.names = FALSE)
            }
        }
    }
}
That's because the if (j == 1) bit is meant to check whether this is the first time you've written to the file.
If it is the first time, it writes the column names (i.e. X, Y, Z, p) into the file (see the col.names = TRUE?).
If it isn't the first time, it won't write the column names, but will just append the data.
Since you have multiple nested loops, that condition won't work for you: while j == 1 (i.e. for prob = 0.1), the two inner loops run four times, and each of those four writes still has j == 1, so the file is overwritten each time instead of appended to.
I'd recommend initialising a variable count that counts how many times you've run Simulation, and then changing that line to if (count == 1):
count <- 1
prob <- c(0.10, 0.20)
# .... code as before
data <- Simulation(X = 1, Y = range[i], Z = sample[k], p = prob[j])
if (count == 1) {
    write.table(data, "Desktop/file2.txt", col.names = TRUE)
} else {
    write.table(data, "Desktop/file2.txt", append = TRUE, col.names = FALSE)
}
# increment count
count <- count + 1
# ... followed by the three closing braces of the loops, as before
}}}
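An alternative that avoids the counter altogether (a sketch, not from the original answer): delete any stale output first, then derive both append and col.names from whether the file already exists. The path "Desktop/file2.txt" is just the one from the question.

# Sketch: let the file system track whether this is the first write.
outfile <- "Desktop/file2.txt"
if (file.exists(outfile)) file.remove(outfile)  # start fresh each run

# ... inside the innermost loop, replacing the if/else:
write.table(data, outfile,
            append    = file.exists(outfile),   # FALSE only on the first write
            col.names = !file.exists(outfile))  # header only on the first write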