unable to save the r output in hdfs - r

I am running a SparkR program and want to save the output in HDFS. The output saves to the local filesystem perfectly, but if I give an HDFS path it throws an error.
I am executing it from a shell script. This is my shell script:
/SparkR-pkg/lib/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
This is my R code:
library(SparkR)

getwd()
setwd('hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/output/')

args <- commandArgs(trailing = TRUE)
if (length(args) < 1) {
  print("Usage: pi <master> [<slices>]")
  q("no")
}

sc <- sparkR.init(args[[1]], "PiR")
slices <- ifelse(length(args) > 1, as.integer(args[[2]]), 2)
n <- 100000 * slices

piFunc <- function(elem) {
  rands <- runif(n = 2, min = -1, max = 1)
  val <- ifelse((rands[1]^2 + rands[2]^2) < 1, 1.0, 0.0)
  val
}

piFuncVec <- function(elems) {
  message(length(elems))
  rands1 <- runif(n = length(elems), min = -1, max = 1)
  rands2 <- runif(n = length(elems), min = -1, max = 1)
  val <- ifelse((rands1^2 + rands2^2) < 1, 1.0, 0.0)
  sum(val)
}

rdd <- parallelize(sc, 1:n, slices)
count <- reduce(lapplyPartition(rdd, piFuncVec), sum)

output <- paste("Pi is roughly", 4.0 * count / n, "\n")
output <- paste(output, "Num elements in RDD ", count(rdd), "\n")
writeLines(output, con = "file.txt", sep = "\n", useBytes = FALSE)
cat("Num elements in RDD ", count(rdd), "\n")
I have tried many methods to save the output in HDFS, like sink, write.data, writetype, etc. I am also trying to change the working directory with setwd(); this does not work either and throws an error:
Error in setwd("hdfs://ip-172-31-41-199.us-west- 2.compute.internal:8020/user/karun/output/") :
cannot change working directory
Execution halted
I have been troubleshooting for two days; any help will be appreciated.
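For what it is worth, base R file functions such as setwd() and writeLines() only understand local filesystem paths, not hdfs:// URIs. A minimal sketch of one possible workaround, assuming the hadoop command-line client is available on the driver node (untested on this cluster), is to write the file locally as above and then push it into HDFS:
# sketch: write locally, then copy the local file into HDFS with the hadoop CLI
writeLines(output, con = "file.txt")
system(paste("hadoop fs -put file.txt",
             "hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/output/file.txt"))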

Related

Writing to database in parallel in R

I am trying to write a table that is a processed subset of a global data variable. In a normal for loop this piece of code works fine, but when I try to do it in parallel it raises an error.
Here is my piece of code:
library(doParallel)
library(foreach)
library(odbc)
library(data.table)

nc <- detectCores() - 1
cs <- makeCluster(nc)
registerDoParallel(cs)

con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                 database = 'mydb', encoding = 'utf-8', timeout = 20)

range_to <- 1e6
set.seed(1)
random_df <- data.table(a = rnorm(n = range_to, mean = 2, sd = 1),
                        b = runif(n = range_to, min = 1, max = 300))

foreach(i = 1:1000, .packages = c('odbc', 'data.table')) %dopar% {
  subk <- random_df[i, ]
  subk <- subk**2
  odbc::dbWriteTable(conn = con, name = 'parallel_test', value = subk,
                     row.names = FALSE, append = TRUE)
}
This code raises this error:
Error in {: task 1 failed - "unable to find an inherited method for function 'dbWriteTable' for signature '"Microsoft SQL Server", "character", "data.table"'"
Like I said before, in a normal for loop it works fine.
Thanks in advance.
I solved the issue by changing only the way the connection object is created, so that each worker builds its own connection:
parallel::clusterEvalQ(cs, {
  library(odbc)
  con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                   database = 'mydb', encoding = 'utf-8', timeout = 20)
})
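A minimal sketch of how the corrected pattern fits together (same table and connection details as in the question; .noexport is a precaution so no master-side connection object gets serialized to the workers):
library(doParallel)
library(foreach)
library(data.table)

nc <- detectCores() - 1
cs <- makeCluster(nc)
registerDoParallel(cs)

# create one connection per worker, on the worker itself
parallel::clusterEvalQ(cs, {
  library(odbc)
  con <- dbConnect(odbc(), driver = 'SQL Server', server = 'localserver',
                   database = 'mydb', encoding = 'utf-8', timeout = 20)
})

range_to <- 1e6
set.seed(1)
random_df <- data.table(a = rnorm(n = range_to, mean = 2, sd = 1),
                        b = runif(n = range_to, min = 1, max = 300))

# inside the loop, 'con' resolves to the worker-local connection created above
foreach(i = 1:1000, .packages = c('odbc', 'data.table'), .noexport = "con") %dopar% {
  subk <- random_df[i, ]^2
  odbc::dbWriteTable(conn = con, name = 'parallel_test', value = subk,
                     row.names = FALSE, append = TRUE)
}

stopCluster(cs)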

Why is the input to the lapply function in R not being created?

I am trying to use the R package found at: https://github.com/bluegreen-labs/phenor
It pulls data from the European phenological network (pep725) and climate data from https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php#datafiles and then applies processing to the data before fitting various phenological (timing of natural events in organisms like leaf unfolding) models.
I am very new to R (I usually use Matlab) and am having difficulties identifying the source of an error.
Some example code was provided with the package (shown below). I should say that this code is not pulling down the eobs data as expected (I have just downloaded it manually). The pep725 data also threw up a funny quirk in one instance, where a letter in the name of a csv file was pulled down as superscript, which also caused the code to error; I manually changed this letter too. These points may be relevant to the error, hence why I am including them.
# This script shows how you download
# and format data from various data sources
# some data is included in the phenor manuscript repository

# download and optimize data
tmp_path = "~/phenor_files"

# create temporary directory for some of the data
# if it doesn't exist
if(!dir.exists(tmp_path)){
  dir.create(tmp_path)
}

# load or download necessary data
# [create a proper pep725login.txt file first]
# CHANGE: I changed the download function below
pr_dl_pep725(credentials = "~/phenor_files/pep725login.txt",
             species = "Fagus",
             internal = FALSE,
             path = tmp_path)

# download eobs data, please register before using this function
# [might take a while]
# CHANGE: I changed the url here
server_path = "https://surfobs.climate.copernicus.eu/dataaccess/access_eobs.php"

# CHANGE: I altered the filenames below
products = c("tg_ens_mean_0.25deg_reg_v20.0e.nc",
             "tn_ens_mean_0.25deg_reg_v20.0e.nc",
             "tx_ens_mean_0.25deg_reg_v20.0e.nc",
             "rr_ens_mean_0.25deg_reg_v20.0e.nc",
             "elev_ens_0.25deg_reg_v20.0e.nc")

lapply(products, function(product){
  httr::GET(sprintf("%s/%s", server_path, product),
            httr::write_disk(sprintf("%s/%s", tmp_path, product),
                             overwrite = TRUE),
            httr::progress())
})

# format pep725 data (Fagus sylvatica) and save
pep725_data = pr_fm_pep725(pep_path = tmp_path,
                           eobs_path = tmp_path)
saveRDS(pep725_data, "~/phenor_files/pep725_demo_data.rds")
My problem is occurring in the last 4 lines above. When I run the code I am getting the error:
Error in FUN(X[[i]], ...) : No E-OBS files found in the referred path !
> saveRDS(pep725_data,"~/phenor_files/pep725_demo_data.rds")
Error in saveRDS(pep725_data, "~/phenor_files/pep725_demo_data.rds") :
object 'pep725_data' not found
The error is coming from the function pr_fm_pep725. I opened this function to view the source code and have identified the lines the error comes from:
eobs_data <- lapply(c("tg", "rr", "elev",
"tn", "tx"), function(x) {
filename <- list.files(eobs_path,
sprintf("%s_ens_mean_%sdeg_reg[^/]*\\.nc",x, resolution))
if (length(filename) > 0) {
r <- raster::brick(sprintf("%s/%s", eobs_path,
filename))
return(r)
}
else {
stop("No E-OBS files found in the referred path !")
}
})
I have looked at what lapply() does, so I know it applies a function to each element of the list/input. In this case, I am interpreting the input as five strings that respectively identify the five eobs files in the temporary folder, but the reason it is erroring out is that these strings are not being created correctly (and, as a result, the files are not being found in the folder). Is that correct?
If I am correct in my appraisal of what is going on, I have two questions:
1. Why is the string not being created correctly? What does the [^/]*\\. do in sprintf("%s_ens_mean_%sdeg_reg[^/]*\\.nc", x, resolution)?
2. Why is "function" explicitly included in the command below? My understanding of lapply() is that a function is the second input, but that the function is predefined; I don't see 'function' defined anywhere. Or is this a generic way of using the term?
The error comes from
eobs_data <- lapply(c("tg", "rr", "elev",
"tn", "tx"), function(x) {
filename <- list.files(eobs_path, sprintf("%s_ens_mean_%sdeg_reg[^/]*\\.nc",
x, resolution))
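For reference, a small standalone illustration of both points asked above, using one of the filenames from the products list:
# "[^/]*" matches any run of characters that does not contain "/", and "\\." is a
# literal dot, so the pattern matches names like "tg_ens_mean_0.25deg_reg_v20.0e.nc"
grepl("tg_ens_mean_0.25deg_reg[^/]*\\.nc", "tg_ens_mean_0.25deg_reg_v20.0e.nc")  # TRUE

# lapply() accepts an anonymous function written inline with function(x) { ... };
# it does not need to be a predefined, named function
lapply(c("tg", "rr"), function(x) sprintf("%s_ens_mean_%sdeg_reg", x, 0.25))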
This is a screenshot of the files I have in the 'phenor_files' directory:
This is the source code for the function ('pr_fm_pep725'):
function (pep_path = tempdir(), eobs_path = tempdir(), bbch = "11",
species = NULL, offset = 264, count = 60, resolution = 0.25,
pep_data)
{
format_data <- function(site = site, offset = offset) {
pep_subset <- pep_data[which(pep_data$pep_id == site),
]
points <- sp::SpatialPoints(cbind(pep_subset$lon[1],
pep_subset$lat[1]), proj4string = sp::CRS(raster::projection(eobs_data[[1]])))
tmean <- raster::extract(eobs_data[[1]], points)
tmin <- raster::extract(eobs_data[[4]], points)
tmax <- raster::extract(eobs_data[[5]], points)
if (all(is.na(tmean))) {
return(NULL)
}
else {
precipitation <- raster::extract(eobs_data[[2]],
points)
tmean <- matrix(rep(tmean, nrow(pep_subset)), length(tmean),
nrow(pep_subset))
tmin <- matrix(rep(tmin, nrow(pep_subset)), length(tmean),
nrow(pep_subset))
tmax <- matrix(rep(tmax, nrow(pep_subset)), length(tmean),
nrow(pep_subset))
precipitation <- matrix(rep(precipitation, nrow(pep_subset)),
length(precipitation), nrow(pep_subset))
}
ltm <- as.vector(unlist(by(tmean[which(years >= 1980),
1], INDICES = yday[which(years >= 1980)], mean, na.rm = TRUE)))[1:365]
lapse_rate <- as.vector((raster::extract(eobs_data[[3]],
points[1]) - pep_subset$alt[1]) * 0.005)
tmean <- rbind(tmean + lapse_rate, pep_subset$year)
tmin <- rbind(tmin + lapse_rate, pep_subset$year)
tmax <- rbind(tmax + lapse_rate, pep_subset$year)
ltm <- ltm + lapse_rate
Ti <- apply(tmean, 2, function(x) {
layers <- which((years == (x[length(x)] - 1) & yday >=
offset) | (years == x[length(x)] & yday < offset))[1:365]
return(x[layers])
})
Tmini <- apply(tmin, 2, function(x) {
layers <- which((years == (x[length(x)] - 1) & yday >=
offset) | (years == x[length(x)] & yday < offset))[1:365]
return(x[layers])
})
Tmaxi <- apply(tmax, 2, function(x) {
layers <- which((years == (x[length(x)] - 1) & yday >=
offset) | (years == x[length(x)] & yday < offset))[1:365]
return(x[layers])
})
Pi <- apply(rbind(precipitation, pep_subset$year), 2,
function(x) {
layers <- which((years == (x[length(x)] - 1) &
yday >= offset) | (years == x[length(x)] &
yday < offset))[1:365]
return(x[layers])
})
if (offset < 365) {
doy_neg <- c((offset - 366):-1, 1:(offset - 1))
doy <- c(offset:365, 1:(offset - 1))
}
else {
doy <- doy_neg <- 1:365
}
l <- ncol(Ti)
Li <- daylength(doy = doy, latitude = pep_subset$lat[1])
Li <- matrix(rep(Li, l), length(Li), l)
data <- list(site = site, location = c(pep_subset$lat[1],
pep_subset$lon[1]), doy = doy_neg, ltm = ltm, transition_dates = pep_subset$day,
year = pep_subset$year, Ti = Ti, Tmini = Tmini, Tmaxi = Tmaxi,
Li = Li, Pi = Pi, VPDi = NULL, georeferencing = NULL)
return(data)
}
message("* Merging and cleaning PEP725 data files in: \n")
message(sprintf(" %s\n", pep_path))
if (missing(pep_data)) {
pep_data <- pr_merge_pep725(path = pep_path)
}
message(" |_ removing data out of range of the E-OBS climate data \n")
pep_data <- pep_data[which(pep_data$year >= 1950 & pep_data$year <=
max(pep_data$year)), ]
if (is.null(species)) {
message(" |_ including all species \n")
}
else {
message(sprintf(" |_ selecting species: %s \n",
species))
pep_data <- pep_data[which(pep_data$species == species),
]
}
message(sprintf(" |_ selecting phenophase: %s \n",
bbch))
pep_data <- pep_data[which(pep_data$bbch == bbch), ]
message(sprintf(" |_ excluding sites with: < %s site years of observations \n",
count))
sites <- unique(pep_data$pep_id)
counts <- unlist(lapply(sites, function(x) {
length(which(pep_data$pep_id == x))
}))
selection <- sites[which(counts >= count)]
pep_data <- pep_data[pep_data$pep_id %in% selection, ]
if (nrow(pep_data) == 0) {
stop("no data remaining after screening for the requested criteria\n check your species name and observation count restrictions!")
}
sites <- unique(pep_data$pep_id)
years <- unique(pep_data$year)
message(sprintf("* Extracting E-OBS climatology for %s sites\n ",
length(sites)))
eobs_data <- lapply(c("tg", "rr", "elev",
"tn", "tx"), function(x) {
filename <- list.files(eobs_path, sprintf("%s_ens_mean_%sdeg_reg[^/]*\\.nc",
x, resolution))
if (length(filename) > 0) {
r <- raster::brick(sprintf("%s/%s", eobs_path,
filename))
return(r)
}
else {
stop("No E-OBS files found in the referred path !")
}
})
yday <- as.numeric(format(as.Date(eobs_data[[1]]@z$Date),
"%j"))
years <- as.numeric(format(as.Date(eobs_data[[1]]@z$Date),
"%Y"))
pb <- utils::txtProgressBar(min = 0, max = length(sites),
style = 3)
env <- environment()
i <- 0
validation_data <- lapply(sites, function(x) {
utils::setTxtProgressBar(pb, i + 1)
assign("i", i + 1, envir = env)
format_data(site = x, offset = offset)
})
close(pb)
names(validation_data) <- sites
class(validation_data) <- "phenor_time_series_data"
validation_data <- validation_data[!unlist(lapply(validation_data,
is.null))]
return(validation_data)
}

%dopar% safe way of writing to csv inside a foreach loop

[EDITED]
It is a general question: I have seen some posts saying that it is not a good idea to use write.csv inside a foreach loop, because different cores may try to write to the file at the same time, resulting in missing results. Still, I need to write to an external file inside the parallel loop to get my output (500000+ rows and 10+ columns); otherwise it crashes due to memory issues. So I would like to know if there is a safer way to write a results file within a foreach loop.
I appreciate any help on this
I am adding some more info, plus much simpler code and data than what I actually have.
Description: I have two different polygon layers (sf, polygon), each with 500000+ features. I need to calculate the area of different raster classes (1 raster layer with 3 classes) within each one of the polygons. This is the most time-consuming part of the script, specifically because I need to use sf::st_intersection multiple times. Then I use many different combinations of if-else statements and rules to populate a df with values.
This is the original code, which gives me memory issues with the original data:
require(sf)
require(raster)
require(rgdal)
require(rgeos)
require(dplyr)
require(stars)
## Sample data
set.seed(131)
sample_raster = raster(nrows = 1, ncols = 1, res = 0.5, xmn = 0, xmx = 11, ymn = 0, ymx = 11)
values(sample_raster) = rep(1:3, length.out = ncell(sample_raster))
crs(sample_raster) = CRS('+init=EPSG:4326')
plot(sample_raster, axes=T)
sample_raster
##
m = rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0))
p = st_polygon(list(m))
n = 100
l = vector("list", n)
for (i in 1:n)
l[[i]] = p + 10 * runif(2)
sample_poly = st_sfc(l)
data = data.frame(PR_ID = seq(1:100),
COND1 = rep(1:10, length.out = 100))
sample_poly = st_sf(cbind(data, sample_poly))
plot(sample_poly, col = sf.colors(categorical = TRUE, alpha = .5), add=T)
sample_poly = sample_poly %>% st_set_crs(4326)
sample_poly
##
## Code
require(parallel)
require(foreach)
require(doParallel)
idall = as.character(sample_poly$PR_ID)
area = as.numeric(st_area(sample_poly))/10000
# i=1
# listID = idall
# mainpoly = sample_poly
# mainras = sample_raster
# mainpolyarea = area
per.imovel.paralallel = function (listID, mainpoly, mainras, mainpolyarea) { # Starting the function
## Setting the parallel work up into your computer
UseCores = detectCores()-1
cl = parallel::makeCluster(UseCores, output="")
doParallel::registerDoParallel(cl)
writeLines(c(""), "log.txt") # Creates a LOG FILE in the folder to follow processing
FOREACH.RESULT = foreach(i = 1:length(listID), .packages=c('raster', 'rgdal', 'rgeos', 'dplyr', 'parallel',
'doParallel', 'sf', 'stars'), .inorder = T , .combine ='rbind') %dopar%
{ # Stating the paral-loop
sink("log.txt", append=TRUE) # LOG FILE in the home folder
cat(paste(i, "of", length(listID), as.character(Sys.time()),"\n")) # Write to LOG FILE
sink() # end diversion of output
########################
### Pick one poly
px = sf::st_buffer(mainpoly[mainpoly$PR_ID == listID[i],], # Conditional to select the geometry PR_ID in position i
dist = 0.1) # buffer = 0 w/ byid, selects the geometry
########################
### Intersect with raster and get area
px2 = sf::st_buffer(px, dist = 0.1) # Buffer because raster::mask() masks out partially covered cells since it call rasterize() first
desm_prop = raster::crop(mainras, as_Spatial(px2))
desm_prop_shp = if(all(is.na(values(desm_prop)))){NULL
} else {sf::st_intersection(st_cast(sf::st_as_sf(stars::st_as_stars(desm_prop)), "POLYGON"), px)}
names(desm_prop_shp)[1] = if(any(names(desm_prop_shp) == "layer")){"values"
} else {NULL}
desm_prop_bet0108 = if(is.null(desm_prop_shp)){NULL
} else {desm_prop_shp[desm_prop_shp$values == 1, ]}
desm_prop_bet0108 = if(is.null(desm_prop_bet0108) | length(desm_prop_bet0108) == 0){NULL
} else if(length(desm_prop_bet0108$values) == 0){NULL
} else {desm_prop_bet0108}
desm_prop_after08 = if(is.null(desm_prop_shp)){NULL
} else {desm_prop_shp[desm_prop_shp$values == 2, ]}
desm_prop_after08 = if(is.null(desm_prop_after08) | length(desm_prop_after08) == 0){NULL
} else if(length(desm_prop_after08$values) == 0){NULL
} else {desm_prop_after08}
desm_prop_upto00 = if(is.null(desm_prop_shp)){NULL
} else {desm_prop_shp[desm_prop_shp$values == 3, ]}
desm_prop_upto00 = if(is.null(desm_prop_upto00) | length(desm_prop_upto00) == 0){NULL
} else if(length(desm_prop_upto00$values) == 0){NULL
} else {desm_prop_upto00}
area_desm_prop_bet0108 <- if(is.null(desm_prop_bet0108)){0
} else { sum(as.numeric(sf::st_area(desm_prop_bet0108)/10000))} # Deforestation area in PX 2001 - 2008
area_desm_prop_after08 <- if(is.null(desm_prop_after08)){0
} else { sum(as.numeric(sf::st_area(desm_prop_after08)/10000))} # Deforestation area in PX after 2008
area_desm_prop_upto00 <- if(is.null(desm_prop_upto00)){0
} else { sum(as.numeric(sf::st_area(desm_prop_upto00)/10000))} # Deforestation area in PX upto 2000
########################
# RESULTS
TEMP.RESULTS = data.frame(PR_ID = as.character(listID[i]),
PR_AREA_HA = mainpolyarea[i],
PR_D09 = area_desm_prop_after08,
PR_D0108 = area_desm_prop_bet0108,
PR_D00 = area_desm_prop_upto00)
return (TEMP.RESULTS)
} # Ending the loop
parallel::stopCluster(cl) # stop cluster
stopImplicitCluster() # stop cluster
gc()
return (FOREACH.RESULT)
} # Ending the function
#####################################################################################################
results_feach = per.imovel.paralallel (listID = idall, mainpoly = sample_poly, mainras = sample_raster, mainpolyarea = area)
warnings()
I have also tried @mischva11's (modified) suggestion by adding this:
length_of_chunk = round(length(idall)/(length(idall)/10)) # generate chunks of 10 lines
lchunks = split(idall, sort(rep_len(1:length_of_chunk, length(idall))))
for (z in 1:length_of_chunk){
# split up the data in chunks
idall_chunk = as.vector(unlist(lchunks[z]))
results_chunk = per.imovel.paralallel (listID = idall_chunk, mainpoly = sample_poly, mainras = sample_raster, mainpolyarea = area)
# save your foreach results for each chunk, append after the first one
if (z == 1) {write.table(results_chunk, file = "TESTDATAresults1.csv")
}else {write.table(results_chunk, file = "TESTDATAresults1.csv", append = TRUE, col.names = FALSE)}
print(NULL) # print(results_chunk)
}
It works like a charm for this example.
BUT I hit a setback when running it with the real script/data: it takes ages for the foreach to close. Watching my machine's performance and the log file, after all rows of my sf object are processed the CPU load drops as expected, but it still takes more than 30 minutes (I did not wait for it to finish completely) for the foreach call to return.
Because of this, I thought about writing the output on the fly inside the foreach work, but clearly that is not a good idea, as explained here. I have seen some posts about the package 'flock', which locks the output file while writing to it. I have not tested it, but it sounds promising.
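For reference, a rough sketch of what that could look like, assuming flock exposes a lock()/unlock() pair (untested, and the file names are placeholders):
library(foreach)
library(doParallel)
library(flock)

registerDoParallel(4)
lock_path <- "results.lock"               # placeholder lock file guarding the csv

foreach(i = 1:100, .packages = "flock") %dopar% {
  out <- data.frame(id = i, value = i^2)  # stand-in for the real per-polygon result
  lck <- flock::lock(lock_path)           # block until this worker holds the lock
  write.table(out, file = "results.csv", append = TRUE,
              row.names = FALSE, col.names = FALSE)
  flock::unlock(lck)                      # release so other workers can write
  NULL                                    # return nothing to keep memory use low
}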
The problem here is that you need communication between the cores: one core has to wait for the next one until it has finished writing to the csv. That is not easily done, and not possible with foreach as far as I know; foreach only lets you control the ordering of the combined results via .inorder (TRUE by default). You say you have memory issues, so one solution is to chunk up your output, if that is possible. I do not have a good dataset for this example, so I use mtcars:
library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(4)
# split your output here, I use 5 chunks; my data is mtcars
n_chunks <- 5
chunk_size <- ceiling(nrow(mtcars) / n_chunks)
for (z in 1:n_chunks) {
  # here the data gets split up into one chunk of rows
  rows <- ((z - 1) * chunk_size + 1):min(z * chunk_size, nrow(mtcars))
  data <- mtcars[rows, ]
  # foreach over the rows of this chunk
  results <- foreach(i = seq_len(nrow(data)), .combine = rbind) %dopar% {
    # ***your code***
    y <- data[i, ]
    return(y)
  }
  print(results)
  # save this chunk's foreach results, then begin again with the next chunk
  if (z == 1) {
    write.table(results, file = "test.csv")
  } else {
    write.table(results, file = "test.csv", append = TRUE, col.names = FALSE)
  }
}

How to input HDFS file into R mapreduce for processing and get the result into HDFS file

I have a question similar to the one in the StackOverflow link below:
R+Hadoop: How to read CSV file from HDFS and execute mapreduce?
I am trying to read a file from the HDFS location "/somnath/logreg_data/ds1.10.csv", reduce its number of columns from 10 to 5, and then write it to another HDFS location, "/somnath/logreg_data/reduced/ds1.10.reduced.csv", using the transfer.csvfile.hdfs.to.hdfs.reduced function below.
transfer.csvfile.hdfs.to.hdfs.reduced("hdfs://10.5.5.82:8020/somnath/logreg_data/ds1.10.csv", "hdfs://10.5.5.82:8020/somnath/logreg_data/reduced/ds1.10.reduced.csv", 5)
The function definition is
transfer.csvfile.hdfs.to.hdfs.reduced =
function(hdfsFilePath, hdfsWritePath, reducedCols=1) {
#local.df = data.frame()
#hdfs.get(hdfsFilePath, local.df)
#to.dfs(local.df)
#r.file <- hdfs.file(hdfsFilePath,"r")
transfer.reduced.map =
function(.,M) {
label <- M[,dim(M)[2]]
reduced.predictors <- M[,1:reducedCols]
reduced.M <- cbind(reduced.predictors, label)
keyval(
1,
as.numeric(reduced.M))
}
reduced.values =
values(
from.dfs(
mapreduce(
input = from.dfs(hdfsFilePath),
input.format = "native",
map = function(.,M) {
label <- M[,dim(M)[2]]
print(label)
reduced.predictors <- M[,1:reducedCols]
reduced.M <- cbind(reduced.predictors, label)
keyval(
1,
as.numeric(reduced.M))}
)))
write.table(reduced.values, file="/root/somnath/reduced.values.csv")
w.file <- hdfs.file(hdfsWritePath,"w")
hdfs.write(reduced.values,w.file)
#to.dfs(reduced.values)
}
But I am receiving an error
Error in file(fname, paste(if (is.read) "r" else "w", if (format$mode == :
cannot open the connection
Calls: transfer.csvfile.hdfs.to.hdfs.reduced ... make.keyval.reader -> do.call -> <Anonymous> -> file
In addition: Warning message:
In file(fname, paste(if (is.read) "r" else "w", if (format$mode == :
cannot open file 'hdfs://10.5.5.82:8020/somnath/logreg_data/ds1.10.csv': No such file or directory
Execution halted
OR
When I try to load a file from HDFS using the command below, I get the following error:
> x <- hdfs.file(path="hdfs://10.5.5.82:8020/somnath/logreg_data/ds1.10.csv",mode="r")
Error in hdfs.file(path = "hdfs://10.5.5.82:8020/somnath/logreg_data/ds1.10.csv", :
attempt to apply non-function
Any help will be highly appreciated
Thanks
I basically found a solution to the problem I stated above.
r.file <- hdfs.file(hdfsFilePath,"r")
from.dfs(
mapreduce(
input = as.matrix(hdfs.read.text.file(r.file)),
input.format = "csv",
map = ...
))
Below is the entire modified function:
transfer.csvfile.hdfs.to.hdfs.reduced =
function(hdfsFilePath, hdfsWritePath, reducedCols=1) {
hdfs.init()
#local.df = data.frame()
#hdfs.get(hdfsFilePath, local.df)
#to.dfs(local.df)
r.file <- hdfs.file(hdfsFilePath,"r")
transfer.reduced.map =
function(.,M) {
numRows <- length(M)
M.vec.elems <-unlist(lapply(M,
function(x) strsplit(x, ",")))
M.matrix <- matrix(M.vec.elems, nrow=numRows, byrow=TRUE)
label <- M.matrix[,dim(M.matrix)[2]]
reduced.predictors <- M.matrix[,1:reducedCols]
reduced.M <- cbind(reduced.predictors, label)
keyval(
1,
as.numeric(reduced.M))
}
reduced.values =
values(
from.dfs(
mapreduce(
input = as.matrix(hdfs.read.text.file(r.file)),
input.format = "csv",
map = function(.,M) {
numRows <- length(M)
M.vec.elems <-unlist(lapply(M,
function(x) strsplit(x, ",")))
M.matrix <- matrix(M.vec.elems, nrow=numRows, byrow=TRUE)
label <- M.matrix[,dim(M.matrix)[2]]
reduced.predictors <- M.matrix[,1:reducedCols]
reduced.M <- cbind(reduced.predictors, label)
keyval(
1,
as.numeric(reduced.M)) }
)))
write.table(reduced.values, file="/root/somnath/reduced.values.csv")
w.file <- hdfs.file(hdfsWritePath,"w")
hdfs.write(reduced.values,w.file)
hdfs.close(r.file)
hdfs.close(w.file)
#to.dfs(reduced.values)
}
Hope this helps and don't forget to give points if you find it useful. Thanks ahead

Unused arguments in R error

I am new to R. I am trying to run an example given in the "rebmix-help" pdf. It uses the galaxy dataset, and here is the code:
library(rebmix)
devAskNewPage(ask = TRUE)
data("galaxy")
write.table(galaxy, file = "galaxy.txt", sep = "\t",eol = "\n", row.names = FALSE, col.names = FALSE)
REBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
for (j in 1:3) {
for (k in 1:3) {
REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
Preprocessing = Preprocessing[k], D = 0.0025,
cmax = 12, InformationCriterion = InformationCriterion[j],
pdf = pdf[i], K = K[[k]])
if (is.null(Table))
Table <- REBMIX[[i, j, k]]$summary
else Table <- merge(Table, REBMIX[[i, j,k]]$summary, all = TRUE, sort = FALSE)
}
}
}
It is giving me this error:
unused argument (InformationCriterion = InformationCriterion[j])
Please help.
I'm running R 3.0.2 (Windows), and the rebmix library defines a function REBMIX where InformationCriterion is not listed as a named argument; the argument is called Criterion instead.
In brief, invoke REBMIX as:
REBMIX[[i, j, k]] <- REBMIX(Dataset = "galaxy.txt",
Preprocessing = Preprocessing[k], D = 0.0025,
cmax = 12, Criterion = InformationCriterion[j],
pdf = pdf[i], K = K[[k]])
It looks as though there have been substantial changes to the rebmix package since the example mentioned in the OP was created. Among the most noticeable changes is the use of S4 classes.
There's also an updated demo in the rebmix package using the galaxy data (see demo("rebmix.galaxy"))
To get the above example to produce results (Note: I am not familiar with this package or the rebmix algorithm!!!):
Change the argument to Criterion, as mentioned by @Giupo
Use the S4 slot access operator @ instead of $
Don't name the results object REBMIX, because that is already the function name
library(rebmix)
data("galaxy")
## Don't re-name the REBMIX object!
myREBMIX <- array(list(NULL), c(3, 3, 3))
Table <- NULL
Preprocessing <- c("histogram", "Parzen window", "k-nearest neighbour")
InformationCriterion <- c("AIC", "BIC", "CLC")
pdf <- c("normal", "lognormal", "Weibull")
K <- list(7:20, 7:20, 2:10)
for (i in 1:3) {
for (j in 1:3) {
for (k in 1:3) {
myREBMIX[[i, j, k]] <- REBMIX(Dataset = list(galaxy),
Preprocessing = Preprocessing[k], D = 0.0025,
cmax = 12, Criterion = InformationCriterion[j],
pdf = pdf[i], K = K[[k]])
if (is.null(Table)) {
Table <- myREBMIX[[i, j, k]]@summary
} else {
Table <- merge(Table, myREBMIX[[i, j, k]]@summary, all = TRUE, sort = FALSE)
}
}
}
}
I guess this is late, but I encountered a similar problem just a few minutes ago and realized the real scenario you may face when you get this kind of error message: it is simply a version conflict.
You may be using a different version of the R package than the tutorial, so the argument names can differ between what you are running and what the example code uses.
So please check the version first before you try to manually edit the file. It can also happen that an old version of the package is still on your library path and overrides the new one. That was exactly my case, since I had manually installed the old and new versions separately.
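For example, a quick way to check which version and which copy of the package is actually being picked up (base R/utils functions only):
packageVersion("rebmix")  # version of the installed package that library() will attach
find.package("rebmix")    # which library directory it is being loaded from
.libPaths()               # all library paths, searched in order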
