I have checked the question Rhadoop - wordcount using rmr and tried the answer on my side, but it is giving me a lot of issues.
Here is the code:
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
# load libraries
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)
Here are the issues:
https://justpaste.it/143a0
I don't understand the problem or what the solution for it should be. Kindly help me out. I am using RStudio Server with the latest versions of R and the packages.
I want to save dynamic data or a dataframe (it depends on parameters) inside a function and then use this saved data in another function. I thought of using saveRDS(), but since I work locally, I don't know whether it will still work once I make it public. I don't get any errors, warnings or notes when I check, but I'm looking for a better solution.
get_data <- function(startDate, endDate) {
  df <- ... # dynamic
  saveRDS(df, "df.rds")
}
funcx <- function() {
  df2 <- readRDS("df.rds")
}
Thanks in advance.
1) As mentioned in the comments, just pass the data using the return value of the first function and the arguments of the second.
get_data <- function(startDate, endDate) {
  data.frame(startDate, endDate)
}
funcx <- function(df) {
  print(df)
}
get_data(1, 2) |> funcx()
or store the data in your workspace and then get it from there:
Data <- get_data(1, 2)
funcx(Data)
or put it in an option
options(Data = get_data(1, 2))
funcx(getOption("Data"))
2) Use a local environment:
e <- local({
  df <- NULL
  get_data <- function(startDate, endDate) {
    df <<- data.frame(startDate, endDate)
  }
  funcx <- function() {
    print(df)
  }
  environment()
})
e$get_data(1, 2)
e$funcx()
or put the environment right in the package. Look at the lattice.options and lattice.getOption source code in the lattice package for an example of this; they use the .LatticeEnv environment, which is stored right in the package itself.
library(lattice)
ls(lattice:::.LatticeEnv)
lattice.options # displays source code of lattice.options
lattice.getOption # displays source code
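A minimal sketch of the same pattern for your own package (the names .MyEnv, set_data and get_stored are hypothetical):
# in your package's R/ code: an environment stored in the package namespace
.MyEnv <- new.env(parent = emptyenv())

set_data <- function(startDate, endDate) {
  assign("df", data.frame(startDate, endDate), envir = .MyEnv)
}

get_stored <- function() {
  get("df", envir = .MyEnv)
}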
3) The proto package can be used to clean up (2) a bit. proto objects are environments that have their own methods. See the package vignette and help files.
library(proto)
p <- proto(
  get_data = function(., startDate, endDate) {
    .$df <- data.frame(startDate, endDate)
  },
  funcx = function(.) {
    print(.$df)
  }
)
p$get_data(1, 2)
p$funcx()
I am working with raster data, trying to crop and mask a variety of buffers to each raster in a raster stack for various locations; the result is a list of lists of rasters. I got the code to work for a small subset of the data, but now that I am running it over the whole dataset it is very slow. See the example code:
# Example data ------------------------------------------------------------
library(raster)
library(sf)
# create example raster stack
r1 = raster(nrows=1000, ncol=1000, xmn=60, xmx=90, ymn=0, ymx=25)
rr = lapply(1:100, function(i) setValues(r1, runif(ncell(r1))))
rrstack = stack()
for (i in 1:length(rr)) {
  stacknext = rr[[i]]
  rrstack = stack(rrstack, stacknext)
}
#create example shapefile list
lats=runif(26,min=0,max=25)
lons=runif(26,min=60,max=90)
exnames=paste0("city_",letters)
coords=data.frame(names=exnames,lats=lats,lons=lons)
coords_sf = st_as_sf(coords,coords=c("lons","lats"),crs=4326,dim ="XY")
circle1=st_buffer(coords_sf, 1E3)
circle100=st_buffer(coords_sf,1E5)
circle500=st_buffer(coords_sf,5E5)
circlist=list(circle1=circle1,circle100=circle100,circle500=circle500)
circlist_reproj=lapply(circlist,function(x) st_transform(x,crs(rrstack[[1]])))
start <- proc.time()
citlist <- vector(mode='list', length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list', length=length(circlist_reproj))
mystack <- stack()
for (k in 1:nrow(circlist_reproj[[1]])) {
  for (j in 1:length(circlist_reproj)) {
    for (i in 1:nlayers(rrstack)) {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
    }
    dellist[[j]] <- mystack
    mystack <- stack()
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
basetime <- proc.time() - start
# time taken for the computation
basetime
user system elapsed
940.173 84.366 1029.688
As you can see, even for a dataset smaller than mine the computation takes a while, so I want to parallelize the processing, but I am having trouble figuring out how. I have two issues right now. First, because of the nature of the nested for loops, I am not sure which loop I should change to foreach. According to this post it looks like it should be the outermost one, though I am not sure that holds for all nested for loops. When I make the first for loop a foreach, I get the error Error in { : task 1 failed - "could not find function "nlayers"". I then try to add the package argument in the foreach call, resulting in a nested loop that looks like this:
foreach(k = 1:nrow(circlist_reproj[[1]], .packages='raster')) %dopar% {
  for (j in 1:length(circlist_reproj)) {
    for (i in 1:nlayers(rrstack)) {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
    }
    dellist[[j]] <- mystack
    mystack <- stack()
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
Which then gives the error
unused argument (.packages = "raster")
So I am not sure how to properly pass the .packages argument to the foreach function. What am I doing wrong here?
EDIT
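As it turns out, the closing parenthesis in the foreach call above was misplaced: .packages was being passed as an argument to nrow() rather than to foreach(), which is what produced the unused argument error. The opening line needs to read:
# corrected: .packages is now an argument to foreach(), not nrow()
foreach(k = 1:nrow(circlist_reproj[[1]]), .packages = 'raster') %dopar% {
  # ... same nested loops as above ...
}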
Taking @HenrikB's comment into account, I have looked at my code and reworked it. I now have the following foreach loops. The code now completes, but it results in all NULL values.
cores <- detectCores()
cl <- makeCluster(cores[1]-2) #not to overload your computer
registerDoParallel(cl)
start <- proc.time()
citlist <- vector(mode='list',length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list',length=length(circlist_reproj))
mystack <- stack()
foreach(k = 1:nrow(circlist_reproj[[1]])) %:%
  foreach(j = 1:length(circlist_reproj)) %:%
    foreach(i = 1:nlayers(rrstack), .packages=c('raster','sf')) %dopar% {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      maskraster <- raster::crop(maskraster, circlist_reproj[[j]][k,])
      mystack <- stack(mystack, maskraster)
      dellist[[j]] <- mystack
      mystack <- stack()
      citlist[[k]] <- dellist
      dellist <- vector(mode='list', length=length(circlist_reproj))
    }
partime <- proc.time()-start
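To see why the nested foreach above comes back as all NULLs: foreach builds its result from the value of the last expression in the loop body (assignments to outside variables like citlist are lost on the workers), and the last expression above is the assignment dellist <- vector(...), whose value is a list of NULLs. A minimal sketch of how foreach collects results:
library(foreach)
# the result is whatever the body evaluates to, collected into a list
res <- foreach(i = 1:3) %do% {
  i^2
}
str(res)  # a list containing 1, 4, 9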
After taking @Henrik's comments and reworking my code a bit, I was able to come up with a solution that parallelizes the problem; however, it is slower than the base version, but that is for another post. Here is the solution:
cores <- detectCores()
cl <- makeCluster(cores[1]-2) #not to overload your computer
registerDoParallel(cl)
citlist <- vector(mode='list',length=nrow(circlist_reproj[[1]]))
dellist <- vector(mode='list',length=length(circlist_reproj))
for (k in 1:nrow(circlist_reproj[[1]])) {
  for (j in 1:length(circlist_reproj)) {
    parrasterstack <- foreach(i = 1:nlayers(rrstack), .packages=c('raster','sf')) %dopar% {
      maskraster <- raster::mask(rrstack[[i]], circlist_reproj[[j]][k,])
      raster::crop(maskraster, circlist_reproj[[j]][k,])
    }
    parrasterstack <- stack(parrasterstack)
    dellist[[j]] <- parrasterstack
    parrasterstack <- NULL
  }
  citlist[[k]] <- dellist
  dellist <- vector(mode='list', length=length(circlist_reproj))
}
stopCluster(cl)
I am trying to change the working directory in a future worker, carry out some operations, and exit. The problem is that I am not able to set the working directory.
The following toy example works fine:
library(future)
dirNames <- as.character(c(1:4))
sapply(dirNames, function(x) if(!dir.exists(x)) dir.create(x))
plan(multiprocess, workers=2)
b <- list()
for (i in seq_along(dirNames)) {
  sleeptime <- 10
  if (i > 3) sleeptime <- 50
  a <- future({
    # setwd(dirNames[i])
    Sys.sleep(sleeptime)
    return(2)
  })
  print(i)
  b[[dirNames[i]]] <- a
}
lapply(b, resolved)
lapply(b[1:2], value)
lapply(b, value)
but if I uncomment the setwd(dirNames[i]) line, I get the following error when running the code:
Error in setwd(dirNames[i]) : cannot change working directory
How can I change working directory successfully?
I figured out a solution while playing around with the script.
library(future)
dirNames <- as.character(c(1:4))
sapply(dirNames, function(x) if(!dir.exists(x)) dir.create(x))
plan(multiprocess, workers=2)
b <- list()
for (i in seq_along(dirNames)) {
  sleeptime <- 10
  if (i > 3) sleeptime <- 50
  a <- future({
    currDir <- getwd()
    on.exit(setwd(currDir))
    setwd(dirNames[i])
    Sys.sleep(sleeptime)
    return(2)
  })
  print(i)
  b[[dirNames[i]]] <- a
}
lapply(b, resolved)
lapply(b[1:2], value)
lapply(b, value)
I believe that once a worker's working directory has been set in the first few iterations, it stays set to the new directory for the remaining iterations, so later relative paths (which refer to the old directory) no longer resolve.
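A quick sketch that makes this worker-state persistence visible (assuming a multisession backend, where each worker is a persistent background R session):
library(future)
plan(multisession, workers = 1)  # one persistent background R session
f1 <- future(setwd(tempdir()))   # change the worker's working directory
value(f1)                        # returns the worker's previous directory
f2 <- future(getwd())            # ask the same worker where it is now
value(f2)                        # the directory set in f1: setwd() persisted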
I am trying to incorporate a tryCatch call into my R code to prevent the loop from breaking whenever I get an error.
I've looked through other examples but can't get tryCatch to work here.
Does anyone know how to add tryCatch to the following loop so that an error doesn't stop the loop from continuing?
library(httr)        # GET(), http_error()
library(data.table)  # %like%
for (i in 1:nrow(pagedata)) {
  u <- pagedata[i, "id"]
  url <- paste0("https://www.google.com/", u)
  r <- GET(url)
  print(url)
  if (!http_error(r)) {
    web_page_read_follows <- read.csv(url)
    colnames(web_page_read_follows) <- "follows"
    web_page_collect_follows <- web_page_read_follows[web_page_read_follows$follows %like% "Followers", ]
    web_page_collect_follows <- as.data.frame(web_page_collect_follows)
    colnames(web_page_collect_follows) <- "follows"
    web_page_collect_follows$follows <- gsub("Followers.*", "", web_page_collect_follows$follows)
    web_page_collect_follows$follows <- gsub(".*=", "", web_page_collect_follows$follows)
    web_page_collect_follows <- tail(web_page_collect_follows, -(nrow(web_page_collect_follows) - 1))
    if (length(web_page_collect_follows$follows) > 0) {
      pagedata[i, "followers"] <- web_page_collect_follows$follows
      print(i)
      Sys.sleep(1)
    }
  }
}
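For what it's worth, a minimal sketch of one way to do this (assuming the same pagedata, httr and data.table setup as above) is to wrap the whole body of the loop in tryCatch, so a failing iteration is logged and skipped:
for (i in 1:nrow(pagedata)) {
  tryCatch({
    u <- pagedata[i, "id"]
    url <- paste0("https://www.google.com/", u)
    r <- GET(url)
    print(url)
    # ... rest of the loop body exactly as above ...
  }, error = function(e) {
    # report the failure and carry on with the next iteration
    message("Error on row ", i, ": ", conditionMessage(e))
  })
}
When an error occurs, tryCatch returns the value of the error handler instead of propagating the error, so the for loop simply moves on to the next i.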
I am trying to run a simple rmr job using the RHadoop packages, but it is not working. Here is my R script:
print("Initializing variable.....")
Sys.setenv(HADOOP_HOME="/usr/hdp/2.2.4.2-2/hadoop")
Sys.setenv(HADOOP_CMD="/usr/hdp/2.2.4.2-2/hadoop/bin/hadoop")
print("Invoking functions.......")
#Referece taken from Revolution Analytics
wordcount = function( input, output = NULL, pattern = " ")
{
mapreduce(
input = input ,
output = output,
input.format = "text",
map = wc.map,
reduce = wc.reduce,
combine = T)
}
wc.map =
function(., lines) {
keyval(
unlist(
strsplit(
x = lines,
split = pattern)),
1)}
wc.reduce =
function(word, counts ) {
keyval(word, sum(counts))}
#Function Invoke
wordcount('/user/hduser/rmr/wcinput.txt')
I am running the above script as
Rscript wordcount.r
and I am getting the error below:
[1] "Initializing variable....."
[1] "Invoking functions......."
Error in wordcount("/user/hduser/rmr/wcinput.txt") :
could not find function "mapreduce"
Execution halted
Kindly let me know what the issue is.
Firstly, the "could not find function "mapreduce"" error means that the rmr2 package, which provides mapreduce, is never loaded in your script; you'll also have to set the HADOOP_STREAMING environment variable in your code.
Try the code below, and note that it assumes you have copied your text file to the HDFS folder example/wordcount/data.
R Code:
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
# load libraries
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
map <- function(k, lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text", map = map, reduce = reduce)
}
## read text files from folder example/wordcount/data
hdfs.root <- 'example/wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
## save result in folder example/wordcount/out
hdfs.out <- file.path(hdfs.root, 'out')
## Submit job
out <- wordcount(hdfs.data, hdfs.out)
## Fetch results from HDFS
results <- from.dfs(out)
results.df <- as.data.frame(results, stringsAsFactors=F)
colnames(results.df) <- c('word', 'count')
head(results.df)
Output:
word count
AS 16
As 5
B. 1
BE 13
BY 23
By 7
For your reference, here is another example of running an R word-count MapReduce program.
Hope this helps.