I try to optimize my R code with parSapply.
I have xmlfile and X as global variables.
When I didn't use the clusterExport(cl,"X") and clusterExport(cl,"xmlfile") I got "xmlfile object was not found".
When I used these two clusterExport I got an error "object of type 'externalptr' is not subsettable".
With regular sapply it works ok.
can someone see the problem?
I have this R code:
require("XML")
library(parallel)
setwd("C:/PcapParser")
# A helper function that enables the dynamic additon of new rows and unseen variables to a data.frame
# field is an R XML leaf-node (capturing a field of a protocol)
# X is the current data.frame to which the feature in field should be added
# rowNum is the row (packet) to which the feature should be added. [must be that rowNum <= dim(X)[1]+1]
addFeature <- function(field, X, rowNum)
{
# extract xml name and value
featureName = xmlAttrs(field)['name']
if (featureName == "")
featureName = xmlAttrs(field)['show']
value = xmlAttrs(field)['value']
if (is.na(value) | value=="")
value = xmlAttrs(field)['show']
# attempt to add feature (add rows/cols if neccessary)
if (!(featureName %in% colnames(X))) #we are adding a new feature
{
#Special cases
#Bad column names: anything that has the prefix...
badCols = list("<","Content-encoded entity body"," ","\\?")
for(prefix in badCols)
if(grepl(paste("^",prefix,sep=""),featureName))
return(X) #don't include this new feature
X[[featureName]]=array(dim=dim(X)[1]) #add this new feature column with NAs
}
if (rowNum > dim(X)[1]) #we are trying to add a new row
{X = rbind(X,array(dim=dim(X)[2]))} #add row of NA
X[[featureName]][rowNum] = value
return(X)
}
firstLoop<-function(x)
{
packet = xmlfile[[x]]
# Iterate over all protocols in this packet
for (prot in 1:xmlSize(packet))
{
protocol = packet[[prot]]
numFields = xmlSize(protocol)
# Iterate over all fields in this protocol (recursion is not used since the passed dataset is large)
if(numFields>0)
for (f in 1:numFields)
{
field = protocol[[f]]
if (xmlSize(field) == 0) # leaf
X<<-addFeature(field,X,x)
else #not leaf xml element (assumption: there are at most three more steps down)
{
# Iterate over all sub-fields in this field
for (ff in 1:xmlSize(field))
{ #extract sub-field data for this packet
subField = field[[ff]]
if (xmlSize(subField) == 0) # leaf
X<<-addFeature(subField,X,x)
else #not leaf xml element (assumption: there are at most two more steps down)
{
# Iterate over all subsub-fields in this field
for (fff in 1:xmlSize(subField))
{ #extract sub-field data for this packet
subsubField = subField[[fff]]
if (xmlSize(subsubField) == 0) # leaf
X<<-addFeature(subsubField,X,x)
else #not leaf xml element (assumption: there is at most one more step down)
{
# Iterate over all subsubsub-fields in this field
for (ffff in 1:xmlSize(subsubField))
{ #extract sub-field data for this packet
subsubsubField = subsubField[[ffff]]
X<<-addFeature(subsubsubField,X,x) #must be leaf
}
}
}
}
}
}
}
}
}
# Given the path to a pcap file, this function returns a dataframe 'X'
# with m rows that contain data fields extractable from each of the m packets in XMLcap.
# Wireshark must be intalled to work
raw_feature_extractor <- function(pcapPath){
## Step 1: convert pcap into PDML XML file with wireshark
#to run this line, wireshark must be installed in the location referenced in the pdmlconv.bat file
print("Converting pcap file with Wireshark.")
system(paste("pdmlconv",pcapPath,"tmp.xml"))
## Step 2: load XML file into R
print("Parsing XML.")
xmlfile<<-xmlRoot(xmlParse("tmp.xml"))
## Step 3: Extract all feature into data.frame
print("Extracting raw features.")
X <<- data.frame(num=NA) #first feature is packet number
# Iterate over all packets
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(3)
parSapply (cl,seq(from=1,to=xmlSize(xmlfile),by=1),firstLoop)
print("Done.")
return(X)
}
What do I do wrong with parSapply? (maybe considering the global variables)
Thank you
So I see a couple of obvious problems with this code. Global variables and functions are not accessible in a paralleled environment, unless you explicitly force them in or call them. You need to define your addFunction and raw_feature_extractor functions inside firstLoop. When calling functions from a pre-existing package, you should either load the package as part of firstLoop (bad coding!) or call them explicitly using the package::function notation (good coding!). I suggesting looking at the R documentation here on StackOverflow to help you work through creating an appropriately parallelized function.
Related
I have a list of URLs (more than 4000) from a specific domain (pixilink.com) and what I want to do is to figure out if the provided domain is a picture or a video. To do this, I used the solutions provided here: How to write trycatch in R and Check whether a website provides photo or video based on a pattern in its URL and wrote the code shown below:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
mycontent <- readLines(x)
mypos <- grep("initial_mode = ", mycontent)
if(grepl("0", mycontent[mypos])){
return("picture")
} else if(grepl("tour", mycontent[mypos])){
return("video")
} else{
return(NA)
}
}
Also, in order to prevent having error for URLs that don't exist, I used the code below:
readUrl <- function(url) {
out <- tryCatch(
{
readLines(con=url, warn=FALSE)
return(1)
},
error=function(cond) {
return(NA)
},
warning=function(cond) {
return(NA)
},
finally={
message( url)
}
)
return(out)
}
Finally, I separated the list of URLs and pass it into the functions (here for instance, I used 1000 values from URL list) described above:
a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows
tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
tt[i] <- readUrl(vec[i])
print(i)
}
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url
dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2,dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of URLs with 100, URLs and it worked; however, after I ran it on whole list of URLs, it returned an error. Here is the error : Error in textConnection("rval", "w", local = TRUE) : all connections are in use Error in textConnection("rval", "w", local = TRUE) : all connections are in use
And after this even for sample URLs (100 samples that I tested before) I ran the code and got this error message : Error in file(con, "r") : all connections are in use
I also tried closeAllConnection after each recalling each function in the loop, but it didn't work.
Can anyone explain what this error is about? is it related to the number of requests we can have from the website? what's the solution?
So, my guess as to why this is happening is because you're not closing the connections that you're opening via tryCatch() and via urlmode() through the use of readLines(). I was unsure of how urlmode() was going to be used in your previous post so it had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try and make it a little bit more robust for what appears to be a more expansive task at hand.
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
#Check if URL is good to go
if(!httr::http_error(x)){
#Test cases
#x <- "www.pixilink.com/3"
#x <- "https://www.pixilink.com/93320"
#x <- "https://www.pixilink.com/93313"
#Then since there are redirect shenanigans
#Get the actual URL the input points to
#It should just be the input URL if there is
#no redirection
#This is important as this also takes care of
#checking whether http or https need to be prefixed
#in case the input URL is supplied without those
#(this can cause problems for url() below)
myx <- httr::HEAD(x)$url
#Then check for what the default mode is
mycon <- url(myx)
open(mycon, "r")
mycontent <- readLines(mycon)
mypos <- grep("initial_mode = ", mycontent)
#Close the connection since it's no longer
#necessary
close(mycon)
#Some URLs with weird formats can return
#empty on this one since they don't
#follow the expected format.
#See for example: "https://www.pixilink.com/clients/899/#3"
#which is actually
#redirected from "https://www.pixilink.com/3"
#After that, evaluate what's at mypos, and always
#return the actual URL
#along with the result
if(!purrr::is_empty(mypos)){
#mystr<- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
return(c(myx, mystr))
#return(mystr)
#So once all that is done, check if the line at mypos
#contains a 0 (picture), tour (video)
#if(grepl("0", mycontent[mypos])){
# return(c(myx, "picture"))
#return("picture")
#} else if(grepl("tour", mycontent[mypos])){
# return(c(myx, "video"))
#return("video")
#}
} else{
#Valid URL but not interpretable
return(c(myx, "uninterpretable"))
#return("uninterpretable")
}
} else{
#Straight up invalid URL
#No myx variable to return here
#Just x
return(c(x, "invalid"))
#return("invalid")
}
}
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to recursively apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
myprog <- progressr::progressor(along = range)
sitetype <- do.call(rbind, future_lapply(range, function(b, x){
myprog() ##Progress bar signaller
myurl <- paste0(b, "/", x)
cat("\n", myurl, " ")
myret <- urlmode(myurl)
cat(myret, "\n")
return(c(myurl, myret))
}, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity, URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future.lapply(), and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented out 1:10000 range in the code in this context: I let this code run the last couple of hours over this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired/would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet below and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress(do.call(...)) step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).
When working with R I frequently get the error message "subscript out of bounds". For example:
# Load necessary libraries and data
library(igraph)
library(NetData)
data(kracknets, package = "NetData")
# Reduce dataset to nonzero edges
krack_full_nonzero_edges <- subset(krack_full_data_frame, (advice_tie > 0 | friendship_tie > 0 | reports_to_tie > 0))
# convert to graph data farme
krack_full <- graph.data.frame(krack_full_nonzero_edges)
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
# Calculate reachability for each vertix
reachability <- function(g, m) {
reach_mat = matrix(nrow = vcount(g),
ncol = vcount(g))
for (i in 1:vcount(g)) {
reach_mat[i,] = 0
this_node_reach <- subcomponent(g, (i - 1), mode = m)
for (j in 1:(length(this_node_reach))) {
alter = this_node_reach[j] + 1
reach_mat[i, alter] = 1
}
}
return(reach_mat)
}
reach_full_in <- reachability(krack_full, 'in')
reach_full_in
This generates the following error Error in reach_mat[i, alter] = 1 : subscript out of bounds.
However, my question is not about this particular piece of code (even though it would be helpful to solve that too), but my question is more general:
What is the definition of a subscript-out-of-bounds error? What causes it?
Are there any generic ways of approaching this kind of error?
This is because you try to access an array out of its boundary.
I will show you how you can debug such errors.
I set options(error=recover)
I run reach_full_in <- reachability(krack_full, 'in')
I get :
reach_full_in <- reachability(krack_full, 'in')
Error in reach_mat[i, alter] = 1 : subscript out of bounds
Enter a frame number, or 0 to exit
1: reachability(krack_full, "in")
I enter 1 and I get
Called from: top level
I type ls() to see my current variables
1] "*tmp*" "alter" "g"
"i" "j" "m"
"reach_mat" "this_node_reach"
Now, I will see the dimensions of my variables :
Browse[1]> i
[1] 1
Browse[1]> j
[1] 21
Browse[1]> alter
[1] 22
Browse[1]> dim(reach_mat)
[1] 21 21
You see that alter is out of bounds. 22 > 21 . in the line :
reach_mat[i, alter] = 1
To avoid such error, personally I do this :
Try to use applyxx function. They are safer than for
I use seq_along and not 1:n (1:0)
Try to think in a vectorized solution if you can to avoid mat[i,j] index access.
EDIT vectorize the solution
For example, here I see that you don't use the fact that set.vertex.attribute is vectorized.
You can replace:
# Set vertex attributes
for (i in V(krack_full)) {
for (j in names(attributes)) {
krack_full <- set.vertex.attribute(krack_full, j, index=i, attributes[i+1,j])
}
}
by this:
## set.vertex.attribute is vectorized!
## no need to loop over vertex!
for (attr in names(attributes))
krack_full <<- set.vertex.attribute(krack_full,
attr, value = attributes[,attr])
It just means that either alter > ncol( reach_mat ) or i > nrow( reach_mat ), in other words, your indices exceed the array boundary (i is greater than the number of rows, or alter is greater than the number of columns).
Just run the above tests to see what and when is happening.
Only an addition to the above responses: A possibility in such cases is that you are calling an object, that for some reason is not available to your query. For example you may subset by row names or column names, and you will receive this error message when your requested row or column is not part of the data matrix or data frame anymore.
Solution: As a short version of the responses above: you need to find the last working row name or column name, and the next called object should be the one that could not be found.
If you run parallel codes like "foreach", then you need to convert your code to a for loop to be able to troubleshoot it.
If this helps anybody, I encountered this while using purr::map() with a function I wrote which was something like this:
find_nearby_shops <- function(base_account) {
states_table %>%
filter(state == base_account$state) %>%
left_join(target_locations, by = c('border_states' = 'state')) %>%
mutate(x_latitude = base_account$latitude,
x_longitude = base_account$longitude) %>%
mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
p2 = cbind(x_longitude, x_latitude))/1609.344)
}
nearby_shop_numbers <- base_locations %>%
split(f = base_locations$id) %>%
purrr::map_df(find_nearby_shops)
I would get this error sometimes with samples, but most times I wouldn't. The root of the problem is that some of the states in the base_locations table (PR) did not exist in the states_table, so essentially I had filtered out everything, and passed an empty table on to mutate. The moral of the story is that you may have a data issue and not (just) a code problem (so you may need to clean your data.)
Thanks for agstudy and zx8754's answers above for helping with the debug.
I sometimes encounter the same issue. I can only answer your second bullet, because I am not as expert in R as I am with other languages. I have found that the standard for loop has some unexpected results. Say x = 0
for (i in 1:x) {
print(i)
}
The output is
[1] 1
[1] 0
Whereas with python, for example
for i in range(x):
print i
does nothing. The loop is not entered.
I expected that if x = 0 that in R, the loop would not be entered. However, 1:0 is a valid range of numbers. I have not yet found a good workaround besides having an if statement wrapping the for loop
This came from standford's sna free tutorial
and it states that ...
# Reachability can only be computed on one vertex at a time. To
# get graph-wide statistics, change the value of "vertex"
# manually or write a for loop. (Remember that, unlike R objects,
# igraph objects are numbered from 0.)
ok, so when ever using igraph, the first roll/column is 0 other than 1, but matrix starts at 1, thus for any calculation under igraph, you would need x-1, shown at
this_node_reach <- subcomponent(g, (i - 1), mode = m)
but for the alter calculation, there is a typo here
alter = this_node_reach[j] + 1
delete +1 and it will work alright
What did it for me was going back in the code and check for errors or uncertain changes and focus on need-to-have over nice-to-have.
Here is my R Script that works just fine:
perc.rank <- function(x) trunc(rank(x)) / length(x) * 100.0
library(dplyr)
setwd("~/R/xyz")
datFm <- read.csv("yellow_point_02.csv")
datFm <- filter(datFm, HRA_ClassHRA_Final != -9999)
quant_cols <- c("CL_GammaRay_Despiked_Spline_MLR", "CT_Density_Despiked_Spline_FinalMerged",
"HRA_PC_1HRA_Final", "HRA_PC_2HRA_Final","HRA_PC_3HRA_Final",
"SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT", "Ultrasonic_DT_Despiked_Spline_MLR")
# add an extra column to datFm to store the quantile value
for (column_name in quant_cols) {
datFm[paste(column_name, "quantile", sep = "_")] <- NA
}
# initialize an empty dataframe with the new column names appended
newDatFm <- datFm[0,]
# get the unique values for the hra classes
hraClassNumV <- sort(unique(datFm$HRA_ClassHRA_Final))
# loop through the vector and create currDatFm and append it to newDatFm
for (i in hraClassNumV) {
currDatFm <- filter(datFm, HRA_ClassHRA_Final == i)
for (column_name in quant_cols) {
currDatFm <- within(currDatFm,
{
CL_GammaRay_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$CL_GammaRay_Despiked_Spline_MLR)
CT_Density_Despiked_Spline_FinalMerged_quantile <- perc.rank(currDatFm$CT_Density_Despiked_Spline_FinalMerged)
HRA_PC_1HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_1HRA_Final)
HRA_PC_2HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_2HRA_Final)
HRA_PC_3HRA_Final_quantile <- perc.rank(currDatFm$HRA_PC_3HRA_Final)
SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT_quantile <- perc.rank(currDatFm$SRES_IMGCAL_SHIFT2VL_Slab_SHIFT2CL_DT)
Ultrasonic_DT_Despiked_Spline_MLR_quantile <- perc.rank(currDatFm$Ultrasonic_DT_Despiked_Spline_MLR)
}
)
}
newDatFm <- rbind(newDatFm, currDatFm)
}
newDatFm <- newDatFm[order(newDatFm$Core_Depth),]
# head(newDatFm, 10)
write.csv(newDatFm, file = "Ricardo_quantiles.csv")
I have a few questions though. Every R book or video that I have read or watched, recommends using the 'apply' family of language constructs over the classic 'for' loop stating that apply is much faster.
So the first question is: how would you write it using apply (or tapply or some other apply)?
Second, is this really true though that apply is much faster than for? The csv file 'yellow_point_02.csv' has approx. 2500 rows. This script runs almost instantly on my Macbook Pro which has 16 Gig of memory.
Third, See the 'quant_cols' vector? I created it so that I could write a generic loop (for columm_name in quant_cols) ....But I could not make it to work. So I hard-coded the column names post-fixed with '_quantile' and called the 'perc.rank' many times. Is there a way this could be made dynamic? I tried the 'paste' stuff that I have in my script, but that did not work.
On the positive side though, R seems awesome in its ability to cut through the 'Data Wrangling' tasks with very few statements.
Thanks for your time.
How can I generate a tree with an unknown number of nodes, each of which have an unknown and varying number of children, with the condition that a list of the child nodes for a given parent node is generated by some fun(parent)? Note that I'm using library(data.tree) from cran to make my tree hierarchy.
The tree will always begin with a node defined by a given parent vector. There will always be a finite amount of nodes. Every node will have the same length as the root node.
I've tried to create the question in a general sense out of context, but it has just been too general to provide definitive feedback. Accordingly, here is the script that is presently not quite there:
require(data.tree)
#also requires Generating Scripts (link at bottom) to run
# Helper function to insert nodes as children of parents with unique names
i=1
assn <- function(child,parentvarname){
child<-paste(child,collapse=" ")
nam <- paste("v", i, sep = "")
# assign node to variable called vi
# and make the tree global so it can be seen outside the function
assign(nam, parentvarname$AddChild(child),envir = .GlobalEnv)
noquote(nam)->a
i+1
a #output the child variable name vi for the sake of recursion
}
cdrtree<- function(root){
#assign root
v0 <- Node$new(root) #assign root to the root of the tree
node<-root #rename variable for clarity in next step
kidparentname<-v0 #recursion starts at v0
have.kids<-function(node){ #this is unfortunately asexual reproduction...
for(pointer in cdrpointers(node)){ #A variable number of pointers are
#used to determine the next node(s) if any with function cdrmove
cdrmove(node,pointer)->newkid #make a child
assn(newkid,kidparentname) #enter this node in the tree hierarchy
#get the name of newkid for next iteration and write name to tree
kidparentname<-assn(newkid,kidparentname)
node<-newkid #rename node variable for the next iteration
have.kids(newkid) #recurse, likely the problem is here
}
return(v0) #return the tree (if the code works...)
}
}
Running the script on a possible root node node gives a strange result:
> cdrtree(c(1,-2,3))
> cdrtree(c(1,-2,3))->a
> a
function(node){ #this is unfortunately asexual reproduction...
for(pointer in cdrpointers(node)){ #A variable number of pointers are
... #all code as written above ...
}
<environment: 0x00000000330ee348>
If you want a true working example, you can grab and source "Generating Scripts.R" from here and run it with any signed permutation of 1:n with n>2 as an argument similar to my example.
To be extra clear, the tree with root node c(1,-2,3) would hypothetically look something like this:
I don't think your function are working as expected. For example, using your starting value,
lapply(cdrpointers(c(1,-2,3)), function(i) cdrmove(c(1,-2,3), i))
[[1]]
[1] 1 2 3
[[2]]
[1] 1 2 3
But, assuming those work. you could try the following and determine if they are being used incorrectly.
## Name nodes uniquely, dont be assigning to the .Globalenv like
## you are in `assn`, which wont work becuse `i` isn't being incremented.
## You could invcrement `i` in the global, but, instead,
## I would encapsulate `i` in the function's parent.frame, avoiding possible conflicts
nodeNamer <- function() {
i <- 0
## Note: `i` is incremented outside of the scope of this function using `<<-`
function(node) sprintf("v%g", (i <<- i+1))
}
## Load your functions, havent looked at these too closely,
## so just gonna assume they work
source(file="https://raw.githubusercontent.com/zediiiii/CDS/master/Generating%20Scripts.r")
cdrtree <- function(root.value) {
root <- Node$new('root') # assign root
root$value <- root.value # There seems to be a separation of value from name
name_node <- nodeNamer() # initialize the node counter to name the nodes
## Define your recursive helper function
## Note: you could do without this and have `cdrtree` have an additional
## parameter, say tree=NULL. But, I think the separation is nice.
have.kids <- function(node) {
## this function (`cdrpointers`) needs work, it should return a 0 length list, not print
## something and then error if there are no values
## (or throw and error with the message if that is what you want)
pointers <- tryCatch({cdrpointers(node$value)}, error=function(e) return( list() ))
if (!length(pointers)) return()
for (pointer in pointers) {
child_val <- cdrmove(node$value, pointer) # does this always work?
child <- Node$new(name_node()) # give the node a name
child$value <- child_val
child <- node$AddChildNode(child)
Recall(child) # recurse with child
}
}
have.kids(root)
return( root )
}
library(data.tree)
res <- cdrtree(root.value=c(1,-2,3))
After much help from #TheTime I have a solid solution to this question.
Though it's a lot of code, I would like to post it because there are a few tricky issues with duplicate values:
####################
# function: cdrtree()
# purpose: Generates a CDR tree with uniquely named nodes (uniqueness is required for igraph export)
# parameters: root.value: the value of the seed to generate the tree. Values of length>6 are not recommended.
# Author: Joshua Watson Nov 2015, help from TheTime #stackoverflow
# Dependancies: sort.listss.r ; gen.bincomb.r
require(combinat)
require(data.tree)
#Two helper functions for keeping names distinct.
nodeNamer <- function() {
i <- 0
function(node) sprintf("v%g", (i <<- i+1))
}
nodeNamer2 <- function() {
j <- 0
function(node) sprintf("%g", (j <<- j+1))
}
cdrtree <- function(root.value, make.igraph=FALSE) {
templist<- list()
root <- Node$new('v0')
root$value <- root.value
root$name <- paste(unlist(root$value),collapse=' ') #name this the same as the value collapsed in type char
name.node <- nodeNamer() # initialize the node counters to name the nodes
name.node2 <- nodeNamer2()
#recursive function that produces chidlren and names them appropriately
have.kids <- function(node) {
pointers <- tryCatch({cdrpointers(node$value)}, error=function(e) return( list() ))
if (!length(pointers)) return()
for (pointer in pointers) {
child.val <- cdrmove(node$value, pointer) #make the cdr move on the first pointer
child <- Node$new(name.node())
child$value <- child.val
#child$name <- paste(" ",unlist(child$value),collapse=' ') # Name it for text
child$name <- paste(unlist(child$value),collapse=' ') # Name it For Graphics
child <- node$AddChildNode(child)
#identical ending name handling catches duplicates. Names WIN+, WIN-, and DRAW outcomes
endname<-paste(unlist(tail(gen.cdrpile(length(root.value)), n=1)[[1]]),collapse=' ')
startname<-paste(unlist(root$value),collapse=' ')
if(child$name==endname){
child$name <- paste(name.node2(),"-WIN",child$name,sep='')
} else {
if(child$name==startname){
child$name <- paste(name.node2(),"+WIN",child$name,sep='')
} else {
#if all negative or all postitive then it is terminal and could be a duplicate, rename it
if((sum(child$value < 0) == length(root.value)) || (sum(child$value < 0 ) == 0 )){
child$name <- paste(name.node2(),"DRAW",child$name,sep='')
} else {
#catch the other duplicate cases that aren't listed above
if((child$name %in% templist == TRUE) || (child$name == root$name)){
child$name <- paste(name.node2(),"DUP",child$name,sep='')
#templist[[length(pointerlist)+1]] <-
}
}
}
}
#make a list of names for the last duplicate catcher
append(child$name,templist)->>templist
Recall(child) # recurse with child
}
}
have.kids(root)
return( root )
}
I am new to R. I know how to write map reduce in Java. I want to try the same in R. So can any one help in giving any samle codes and is there any fixed format there for MapReduce in R.
Please send any link other than this: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
Any sample codes will be more helpful.
When you want to implement a map reduce (with Hadoop) in a language other than Java, then you use a feature called streaming. Then the data is fed to the mapper via STDIN (readLines()), back to Hadoop via STDOUT(cat()), then to the reducer again through STDIN (readLines()) and blurted finally via STDOUT (cat()).
The following code is taken from an article I wrote on writing a map reduce job with R for Hadoop. The code is supposed to count 2-grams but I'd say simple enough to see what is going on MapReduce-wise.
# map.R
library(stringdist, quietly=TRUE)
input <- file("stdin", "r")
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
# in case of empty lines
# more sophisticated defensive code makes sense here
if(nchar(line) == 0) break
fields <- unlist(strsplit(line, "\t"))
# extract 2-grams
d <- qgrams(tolower(fields[4]), q=2)
for(i in 1:ncol(d)) {
# language / 2-gram / count
cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n")
}
}
close(input)
-
# reduce.R
input <- file("stdin", "r")
# initialize variables that keep
# track of the state
is_first_line <- TRUE
while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
line <- unlist(strsplit(line, "\t"))
# current line belongs to previous
# line's key pair
if(!is_first_line &&
prev_lang == line[1] &&
prev_2gram == line[2]) {
sum <- sum + as.integer(line[3])
}
# current line belongs either to a
# new key pair or is first line
else {
# new key pair - so output the last
# key pair's result
if(!is_first_line) {
# language / 2-gram / count
cat(prev_lang,"\t",prev_2gram,"\t",sum,"\n")
}
# initialize state trackers
prev_lang <- line[1]
prev_2gram <- line[2]
sum <- as.integer(line[3])
is_first_line <- FALSE
}
}
# the final record
cat(prev_lang,"\t",prev_2gram, "\t", sum, "\n")
close(input)
http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/