I am trying to run a simple rmr job using the RHadoop package, but it is not working. Here is my R script:
print("Initializing variable.....")
Sys.setenv(HADOOP_HOME="/usr/hdp/2.2.4.2-2/hadoop")
Sys.setenv(HADOOP_CMD="/usr/hdp/2.2.4.2-2/hadoop/bin/hadoop")
print("Invoking functions.......")
#Referece taken from Revolution Analytics
wordcount = function( input, output = NULL, pattern = " ")
{
mapreduce(
input = input ,
output = output,
input.format = "text",
map = wc.map,
reduce = wc.reduce,
combine = T)
}
wc.map =
function(., lines) {
keyval(
unlist(
strsplit(
x = lines,
split = pattern)),
1)}
wc.reduce =
function(word, counts ) {
keyval(word, sum(counts))}
#Function Invoke
wordcount('/user/hduser/rmr/wcinput.txt')
I am running the above script as
Rscript wordcount.r
and I am getting the error below:
[1] "Initializing variable....."
[1] "Invoking functions......."
Error in wordcount("/user/hduser/rmr/wcinput.txt") :
could not find function "mapreduce"
Execution halted
Kindly let me know what the issue is.
First, you'll have to set the HADOOP_STREAMING environment variable in your code.
Try the code below, and note that it assumes you have copied your text file to the HDFS folder example/wordcount/data.
R Code:
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
# load librarys
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
map <- function(k,lines) {
words.list <- strsplit(lines, '\\s')
words <- unlist(words.list)
return( keyval(words, 1) )
}
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)
Output:
word count
AS 16
As 5
B. 1
BE 13
BY 23
By 7
For your reference, here is another example of running an R word-count MapReduce program.
Hope this helps.
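If you want to keep your original HDP layout instead of /usr/local/hadoop, a minimal sketch of the same fix applied to your script could look like the following. Note that the streaming-jar path below is an assumption about a typical HDP 2.2 install, not something taken from your setup; verify it first with find /usr/hdp -name 'hadoop-streaming*.jar'.
# assumed HDP 2.2 paths -- check them on your cluster before relying on this
Sys.setenv(HADOOP_CMD = "/usr/hdp/2.2.4.2-2/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar")
# loading rmr2 is what makes mapreduce() visible; the "could not find function
# mapreduce" error in your run means the package was never attached
library(rmr2)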
Related
I have the following function as an example:
myFunc <- function(x) {
  while (x < 100) {
    x <- x + 10
    cat(x)
    cat("\n")
  }
}
In the new R version 3.4.1 on Windows, I want to source this function from the file myFunc.R as below:
filepath <- "D:/"
l <- list.files(filepath, pattern = "my", full.names = TRUE)
source(l)
But I am getting the following error:
> source(l)
Error in source(l) : could not find symbol "exprs" in environment of the generic function
I hope someone can help. Thanks a lot.
I have checked the question Rhadoop - wordcount using rmr and have tried the answer on my side, but it is giving a lot of issues.
Here is the code:
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
# load librarys
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
map <- function(k,lines) {
words.list <- strsplit(lines, '\\s')
words <- unlist(words.list)
return( keyval(words, 1) )
}
reduce <- function(word, counts) {
keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)
Here are the issues:
https://justpaste.it/143a0
I don't understand the problem or what its solution should be. Kindly help me and let me know how to resolve it.
I am using RStudio Server and the latest version of R.
Similarly to .Last.value, is there any way to access the last call? Below are the expected results of a potential .Last.call.
sum(1, 2)
# [1] 3
str(.Last.call)
# language sum(1, 2)
It would be best if it did not require parsing a file from the file system.
The last.call package is no longer on CRAN, but you can still get the code:
# -----------------------------------------------------------------------
# FUNCTION: last.call
#   Retrieves a CALL from the history and returns an unevaluated
#   call.
#
#   There are two uses for such abilities.
#    - To be able to recall the previous commands, like pressing the up key
#      on the terminal.
#    - The ability to get the line that called the function.
#
#   TODO:
#    - does not handle commands separated by ';'
#
# -----------------------------------------------------------------------
last.call <- function(n = 1) {
  f1 <- tempfile()
  try(savehistory(f1), silent = TRUE)
  try(rawhist <- readLines(f1), silent = TRUE)
  unlink(f1)
  if (exists('rawhist')) {
    # LOOK BACK max(n)+ LINES UNTIL YOU HAVE n COMMANDS
    cmds <- expression()
    n.lines <- max(abs(n))
    while (length(cmds) < max(abs(n))) {
      lines <- tail(rawhist, n = n.lines)
      try(cmds <- parse(text = lines), silent = TRUE)
      n.lines <- n.lines + 1
      if (n.lines > length(rawhist)) break
    }
    ret <- rev(cmds)[n]
    if (length(ret) == 1) return(ret[[1]]) else return(ret)
  }
  return(NULL)
}
Now, to use it:
sum(1, 2)
# [1] 3
last.call(2)
# sum(1, 2)
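If you want to re-run the captured call rather than just print it, you can store it and pass it to eval(). This is just a quick illustration assuming last.call() as defined above and an interactive session (savehistory() is not available under Rscript); depending on how your front end records history, you may need to adjust the index.
sum(1, 2)
# [1] 3
prev <- last.call(2)  # index 2 skips this assignment line itself
prev
# sum(1, 2)
eval(prev)            # re-evaluate the captured call
# [1] 3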
I've modified this code to output text strings of the preceding commands/calls in a manner that preserves how they were formatted across lines in the original call, so that I can use cat() to output the calls (for a function that emails me when the preceding function is done running). Here's the code:
library(magrittr)  # provides the %>% pipe used below

lastCall <- function(num.call = 1) {
  history.file <- tempfile()
  try(savehistory(history.file), silent = TRUE)
  try(raw.history <- readLines(history.file), silent = TRUE)
  unlink(history.file)
  if (exists('raw.history')) {
    # LOOK BACK max(n)+ LINES UNTIL YOU HAVE n COMMANDS
    commands <- expression()
    num.line <- max(abs(num.call) + 1)
    while (length(commands) < max(abs(num.call) + 1)) {
      lines <- tail(raw.history, n = num.line)
      try(commands <- parse(text = lines), silent = TRUE)
      num.line <- num.line + 1
      if (num.line > length(raw.history)) break
    }
    ret <- rev(commands)[num.call + 1]
    if (length(ret) == 1) {
      a <- ret[1]
    } else {
      a <- ret
    }
    # a <- rev(commands)[num.call + 1]
    out <- lapply(a, deparse) %>%
      sapply(paste, sep = "\n", collapse = "\n")
  }
  out
}
Enjoy!
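As a quick usage sketch (assuming lastCall() as defined above, run interactively so savehistory() works), the string it returns can be passed straight to cat(), which is what the email-notification use case needs. The lm() call here is just an illustrative stand-in for a long-running command, and history indexing may vary by front end:
fit <- lm(mpg ~ wt, data = mtcars)  # any long-running call
cat(lastCall(), "\n")
# fit <- lm(mpg ~ wt, data = mtcars)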
I am trying to understand the reducer.R code taken from the following website.
http://www.thecloudavenue.com/2013/10/mapreduce-programming-in-r-using-hadoop.html
This code is used for Hadoop Streaming with R.
I have given the code below:
#! /usr/bin/env Rscript
# reducer.R - Wordcount program in R
# script for Reducer (R-Hadoop integration)

trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)

splitLine <- function(line) {
  val <- unlist(strsplit(line, "\t"))
  list(word = val[1], count = as.integer(val[2]))
}

env <- new.env(hash = TRUE)

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  split <- splitLine(line)
  word <- split$word
  count <- split$count
  if (exists(word, envir = env, inherits = FALSE)) {
    oldcount <- get(word, envir = env)
    assign(word, oldcount + count, envir = env)
  } else {
    assign(word, count, envir = env)
  }
}
close(con)

for (w in ls(env, all = TRUE))
  cat(w, "\t", get(w, envir = env), "\n", sep = "")
Could someone explain the significance of the following new.env command and the subsequent use of env in the code:
env <- new.env(hash = TRUE)
Why is this required? What happens if this is not included in the code?
Update 06/05/2014
I tried writing another version of this code without defining a new environment; the code is as follows:
#! /usr/bin/env Rscript

current_word <- ""
current_count <- 0
word <- ""

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line1 <- gsub("(^ +)|( +$)", "", line)
  word <- unlist(strsplit(line1, "[[:space:]]+"))[[1]]
  count <- as.numeric(unlist(strsplit(line1, "[[:space:]]+"))[[2]])
  if (current_word == word) {
    current_count = current_count + count
  } else {
    if (current_word != "") {
      cat(current_word, '\t', current_count, '\n')
    }
    current_count = count
    current_word = word
  }
}
if (current_word == word) {
  cat(current_word, '\t', current_count, '\n')
}
close(con)
This code gives the same output as the one with a new environment defined.
Question: Does using a new environment provide any advantages from a Hadoop standpoint? Is there a reason for using it in this specific case?
Thank you.
Your question is related to environments in R. Here is example code for making a new environment in R:
> my.env <- new.env()
> my.env
<environment: 0x114a9d940>
> ls(my.env)
character(0)
> assign("a", 999, envir=my.env)
> my.env$foo = "This is the variable foo."
> ls(my.env)
[1] "a" "foo"
I think this article can help you: http://www.r-bloggers.com/environments-in-r/
Or run
?environment
for more help.
In the code that you give, the author makes a new environment:
env <- new.env(hash = TRUE)
When they want to assign a value, they specify the environment:
assign(word, oldcount + count, envir = env)
As for the question "What happens if this is not included in the code?", I think you can find the answer in the link that I already provided.
The advantages of using a new environment in R are already answered in this link.
The reason is that in this case you are working with a large dataset. When you pass your dataset to a function, R makes a copy of it, and the returned data then overwrites the old dataset. But if you pass an environment, R processes that environment directly without copying the large dataset.
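A small sketch (not from the original post) that makes the copy-versus-reference point concrete: updating a list inside a function only changes the function's local copy, while updating an environment changes the original object in place.
counts.list <- list(hello = 1)
counts.env <- new.env(hash = TRUE)
assign("hello", 1, envir = counts.env)

bump <- function(x) {
  if (is.environment(x)) {
    # environments are passed by reference, so this updates the caller's object
    assign("hello", get("hello", envir = x) + 1, envir = x)
  } else {
    # lists are copied on modification, so only the local copy changes
    x$hello <- x$hello + 1
  }
  invisible(x)
}

bump(counts.list); counts.list$hello                 # still 1
bump(counts.env); get("hello", envir = counts.env)   # 2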
I'm trying to get the user to input a few keywords for a query, and in my script I used either scan or readline. I tried it using the script editor embedded in R (Windows), but when I execute the code, it uses my next lines of script as the standard input.
Here is (part of) my script:
keywords <- scan(what=character(), nlines=1)
keywords <- paste(keywords, collapse=",")
keywords
And here is the output when executed from the editor
> keywords <- scan(what=character(), nlines=1)
1: keywords <- paste(keywords, collapse=",")
Read 4 items
> keywords
[1] "keywords" "<-" "paste(keywords," "collapse=\",\")"
Meanwhile, when I use the source() command, my user input is respected.
So is there any way to input something while executing the code directly from the R GUI?
This is how I use readLines:
FUN <- function(x) {
  if (missing(x)) {
    message("Uhh, you forgot to enter x...\nPlease enter it now.")
    x <- readLines(n = 1)
  }
  x
}
FUN()
Or maybe something along these lines:
FUN2 <- function() {
  message("How many fruits will you buy?")
  x <- as.integer(readLines(n = 1))
  message(sprintf("Good, you want to buy %s fruits.\nEnter them now.", x))
  y <- readLines(n = x)
  paste(y, collapse = ", ")
}
FUN2()
EDIT: With your approach in Rgui...
FUN3 <- function(n = 2) {
  keywords <- scan(what = character(), nlines = n)
  paste(keywords, collapse = ",")
}
## > FUN3 <- function(n=2) {
## + keywords <- scan(what=character(), nlines=n)
## + paste(keywords, collapse=",")
## + }
## > FUN3()
## 1: apple
## 2: chicken
## Read 2 items
## [1] "apple,chicken"