Using greater than in a function call in R

I am trying to do a subset within a function and it is not working out the way I was hoping. Here is the code I am actually trying to run:
plot.by = function(output, plot.warming, plot.baseline, apply.dim, attribute, selection){
  par(mfrow = c(3, 3))
  for(site in 1:length(output)){
    zero = apply(get(output[site])[,,attribute], c(1), sum)/32
    zero = zero[selection]
    zero = as.data.frame(zero)
    zero = rownames(zero)
    two = get(plot.warming[site])[zero,,,]
    zero = get(plot.baseline[site])[zero,,,]
    boxplot(apply(two, apply.dim, mean) - apply(zero, apply.dim, mean), ylim = c(-200, 300))
  }
}
plot.by(output = data.zero, plot.warming = aet.two, plot.baseline = aet.zero,
        apply.dim = c(1, 3), attribute = "snowpack", selection = zero > 2000)
and here is my example to play with:
select.by.attribute = function(data, selection){
  tmp = data[selection]
  plot(tmp)
}
select.by.attribute(data = data, selection = data > 100)
I know that the example function works, but I believe it only works because the selection is evaluated before the function is even run. If I run my actual code with a clean workspace it says "zero" is not found. If at all possible I would like to pass something like selection = >1000 rather than having the object in there.
In addition, any suggestions on how to search for this kind of thing in the future, or a pointer to an information source, would be great. For example, I don't even know what the line that "calls" the function is named, or what its different arguments are called, which made searching for this question quite difficult.
To add more information about what I am trying to do: my data are from a hydrological model whose outputs are daily measurements of things like snowfall, precipitation, and evaporation. In the end I am trying to plot data from these sites based upon certain attributes, such as precip > 2000. To do this I first apply over some attribute, then I subset the matching row names (which are the site names), and then I plot those sites at the bottom (the plot(apply(...)) is there to collapse four dimensions into two for plotting).
Essentially I need to do this many times, so I want to be able to do it quickly for whatever attribute I want, whether that be precipitation or snowfall, and for whatever selection, be it > 2000 or any other number. Hence why I tried to make the function.
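One common way to get this behaviour (a sketch of the general technique, not code from the original post; the threshold is illustrative) is to pass the selection as a predicate function, so nothing is evaluated until zero has been computed inside the function:
# Hedged sketch: selection is a function, evaluated on the freshly computed
# values rather than on an object in the caller's workspace.
plot.by = function(output, attribute, selection){
  for(site in 1:length(output)){
    zero = apply(get(output[site])[,,attribute], c(1), sum)/32
    zero = zero[selection(zero)]   # apply the predicate here
    # ... subsetting and plotting as in the original ...
  }
}
plot.by(output = c("data.zero"), attribute = "snowpack",
        selection = function(x) x > 2000)
Note that output is passed as a character vector here, since the body looks objects up with get().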

Related

How do I use prodlim function with a non-binary variable in formula?

I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these get the error as below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
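One likely culprit (an editorial note, not from the original thread): prodlim treats numeric covariates as continuous and kernel-smooths over them, which is where KernSmooth::dpik comes in; with only four distinct integer values the bandwidth estimate degenerates. Declaring the variable a factor makes prodlim stratify instead:
# Hedged sketch: treat strength as categorical so prodlim stratifies
# rather than attempting kernel smoothing over an integer covariate.
medpop$strength <- factor(medpop$strength)
N <- prodlim(Hist(dz_time, dz_status) ~ strength, data = medpop)
plot(N)   # one curve per strength level, analogous to plot(M)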

The function doesn't return a list and Sage doesn't plot it

I've got the function f(x,y) and I want to iterate it a number of times and plot the resulting points. What I've done is:
def orbita(p,n):
    a = [p]
    for i in range(n-1):
        p = f(p[0],p[1])
        a.append(p)
    print a
When applied to the point p = (1,2) and asked for n = 5 iterations, this function returns the following:
[(1, 2), (2, 5/2), (5/2, 29/20), (29/20, 1241/1450), (1241/1450, 7285162/5218405)]
Which is correct. However, when I try to plot this list of points by point(orbita((1,2),5)), I simply get an empty plot.
At first I thought that this was an issue with the fact that point() plots "either a single point (as a tuple), a list of points, a single complex number, or a list of complex numbers", quoted from here, because I checked type(orbita((1,2),5)) and got NoneType. However, when I tried to assign the list to a variable with L = orbita((1,2),5), my variable was empty, since typing L returned nothing, so I'm no longer sure that this is the problem. If I copy the list and write:
point([(1, 2), (2, 5/2), (5/2, 29/20), (29/20, 1241/1450), (1241/1450, 7285162/5218405)])
The plot works perfectly, but I would like to use this function for plotting at least a couple of hundred points, so I wouldn't like to copy them every time. I'm new with both Sage and Python, so I really don't know what I'm doing wrong at any level.
How can I print the outcome of the function orbita(p,n) or how can I modify it so that by typing point(orbita(p,n)) I obtain a plot?
Here is your problem.
this function returns the following:
No, it doesn't. It prints the following. As you correctly identify, it returns None. So you have to have it return the list of tuples. It looks like you are plenty experienced in Python to figure out how to do that, so I won't include the details, but please do follow up if you still have trouble.

(R) Burrow into function to change it with Trace()

I've built a large function which calls numerous gbm functions in a big loop. All I'm trying to do is increase the thickness of the tickmarks in rug() which is called by gbm.plot.
I was hoping to use (e.g.)
body(gbm.plot)[[24]][[4]][[3]][[3]][[3]][[3]][[2]]$ylab <- "change value"
From this page's examples, which I've used successfully elsewhere. But the section in question in gbm.plot is inside an if statement, so as.list doesn't nicely recurse the lines (because, arguably, it's all one huge long line). You can get to them by just manually [[trying]][[successive]][[combinations]] until you get to the right place, but since I'm trying to insert a piece of code, ", lwd=6", into a bracketed statement, rather than assigning a value to a named subobject, I'm not sure how to get trace to do this.
?trace says:
When the at argument is supplied, it can be a vector of integers referring to the substeps of the body of the function (this only works if the body of the function is enclosed in { ...}. In this case tracer is not called on entry, but instead just before evaluating each of the steps listed in at. (Hint: you don't want to try to count the steps in the printed version of a function; instead, look at as.list(body(f)) to get the numbers associated with the steps in function f.)
The at argument can also be a list of integer vectors. In this case, each vector refers to a step nested within another step of the function. For example, at = list(c(3,4)) will call the tracer just before the fourth step of the third step of the function
So I tried pasting the whole line with the lwd bit added in, hoping that it would overwrite it with the small addition:
trace (gbm.plot, quote(rug(quantile(data[, gbm.call$gbm.x[variable.no]], probs = seq(0, 1, 0.1), na.rm = TRUE),lwd=6)), at=c(22,4,7,3,3,3,2))
...as well as putting objects in and out of {brackets}, all to no avail. Does anyone know the correct way of using trace for this, or can anyone suggest a better way? Thanks.
P.S. It needs to be done automatically in code, so users can load the function, which will load the vanilla gbm functions from CRAN and then make the tweaks as required.
EDIT: found a workaround. But a generalisable question remains: how can one insert elements into the part of a function body that sits inside an if statement? e.g. from
rug(quantile(data[, gbm.call$gbm.x[variable.no]], probs=seq(0, 1, 0.1), na.rm=TRUE))
to
rug(quantile(data[, gbm.call$gbm.x[variable.no]], probs=seq(0, 1, 0.1), na.rm=TRUE),lwd=6)
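One general way to do this (an editorial sketch, not the poster's actual workaround) is to skip trace entirely: pull the call out of body(), splice the extra argument in by name, and write the modified body back. On a toy function:
# Hedged sketch; the [[...]] indices below match this toy body, not gbm.plot's
# (find those with as.list(body(...)) as the ?trace hint suggests).
f <- function(x) {
  if (TRUE) {
    rug(quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE))
  }
}
bod <- body(f)
cl  <- bod[[2]][[3]][[2]]   # body -> if statement -> its { block } -> the rug() call
cl[["lwd"]] <- 6            # splice a new named argument into the call
bod[[2]][[3]][[2]] <- cl
body(f) <- bod
# body(f) now contains rug(quantile(...), lwd = 6)
Because calls behave like lists, [[<- with a name appends a named argument, which avoids pasting and re-parsing the whole line.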

How to use a value that is specified in a function call as a "variable"

I am wondering if it is possible in R to use a value that is declared in a function call as a "variable" part of the function itself, similar to the functionality that is available in SAS IML.
Given something like this:
put.together <- function(suffix, numbers) {
  new.suffix <<- as.data.frame(numbers)
  return(new.suffix)
}
x <- c(seq(1000,1012, 1))
put.together(part.a, x)
new.part.a ##### does not exist!!
new.suffix ##### does exist
As it is written, the function returns a dataframe called new.suffix, as it should because that is what I'm asking it to do.
I would like to get a dataframe returned that is called new.part.a.
EDIT: Additional information was requested regarding the purpose of the analysis
The purpose of the question is to produce dataframes that will be sent to another function for analysis.
There exists a data bank where elements are organized into groups by number, and other people organize the groups into a meaningful set.
Each group has an id number. I use the information supplied by others to put the groups together as they are specified.
For example, I would be given a set of id numbers like: part-1 = 102263, 102338, 202236, 302342, 902273, 102337, 402233.
So, part-1 has seven groups, each group having several elements.
I use the id numbers in a merge so that only the groups of interest are extracted from the large data bank.
The following is what I have for one set:
### all.possible.elements.bank <- .csv file from large database ###
id.part.1 <- as.data.frame(c(102263, 102338, 202236, 302342, 902273, 102337, 402233))
bank.names <- c("bank.id")
colnames(id.part.1) <- bank.names
part.sort <- matrix(seq(1,nrow(id.part.1),1))
sort.part.1 <- cbind(id.part.1, part.sort)
final.part.1 <- as.data.frame(merge(sort.part.1, all.possible.elements.bank,
by="bank.id", all.x=TRUE))
The process above is repeated many, many times.
I know that I could do this for all of the collections that I would pull together, but I thought I would be able to wrap the selection process into a function. The only things that would change would be the part numbers (part-1, part-2, etc..) and the groups that are selected out.
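For what it's worth, the selection step above wraps up fairly directly (a sketch using the post's own object and column names; extract.part is an illustrative name):
# Hedged sketch: one function call per part, returning the merged data frame
# instead of building sort.part.N and final.part.N by hand each time.
extract.part <- function(ids, bank) {
  id.df <- data.frame(bank.id = ids, part.sort = seq_along(ids))
  as.data.frame(merge(id.df, bank, by = "bank.id", all.x = TRUE))
}
final.part.1 <- extract.part(c(102263, 102338, 202236, 302342,
                               902273, 102337, 402233),
                             all.possible.elements.bank)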
It is possible using the assign function (and possibly deparse and substitute), but doing things like this is strongly discouraged. Why can't you just return the data frame and call the function like this:
new.part.a <- put.together(x)
That is generally the better approach.
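If many parts are involved, a named list keeps that same pattern without touching the global environment (a sketch; the part names and values are illustrative):
# Hedged sketch: collect results in a named list instead of creating
# new.part.a, new.part.b, ... in the workspace.
put.together <- function(numbers) as.data.frame(numbers)
parts <- list(part.a = 1000:1012, part.b = 2000:2012)
results <- lapply(parts, put.together)
results$part.a   # the data frame for part.a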
If you really want to change things in the global environment then you may want a macro; see the defmacro function in the gtools package and, most importantly, read the document in the references section of its help page.
This is rarely something you should want to do... assigning to things out of the function environment can get you into all sorts of trouble.
However, you can do it using assign:
put.together <- function(suffix, numbers) {
  assign(paste('new',
               deparse(substitute(suffix)),   # turn the unevaluated argument into text
               sep = '.'),
         as.data.frame(numbers),
         envir = parent.env(environment()))
}
put.together(part.a, 1:20)
But like Greg said, it's usually not necessary, and always dangerous if used incorrectly.

How to rewrite my R code for multicore processing?

I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that has to chug through individuals one at a time and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian bridge); I just have to do this for more than 300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned, or parallelized, or whatever the word is, so that the 48 CPUs I now have access to can help reduce the 58 days it would take to compute this on my little laptop. In my head I would just send out one individual to one processor, have it run through the script, and then send out another one, if that makes sense.
Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.
for (n in 1:(length(IDNames))){ # THIS PROCESSES THROUGH EACH INDIVIDUAL
  # THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.
  # I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.
  # EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH
  # INDIVIDUAL'S DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.
  IndivData = MovData[MovData$ID==IDNames[n],]
  IndivData = IndivData[1:(nrow(IndivData)-1),]
  if (UseTimeWindow==T){
    IndivDates = dates[dates$ID==IDNames[n],]
    IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
  }
  IndivData$TimeDif[nrow(IndivData)]=NA
  ########################
  # THIS IS THE PROCESS THAT I THINK HAS TO HAVE EACH INDIVIDUAL RUN THROUGH IT
  BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
                          area.grid = Grid, time.step = 0.1)
  #############################
  # BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.
  # I DO NOT UNDERSTAND HOW THE MULTICORE-PROCESSED CODE WOULD JOIN THE DATA BACK,
  # WHICH IS WHY I'VE INCLUDED THIS PART OF THE CODE.
  if(n==1){ # creating a data frame with the x, y, and probabilities for the first individual
    BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
    BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
    colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
  } else { # For every other individual just add the new information to the data frame
    BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
    colnames(BBMMProbGrid)[n*2+2]=paste(IDNames[n],"_Prob", sep="")
  } # end if
} # end loop through individuals
Not sure why this has been voted down either. I think the foreach package is what you're after. Those first few PDFs have very clear, useful information in them. Basically, write what you want done for each person as a function. Then use foreach to send the data for one person out to a node to run the function (while sending another person's to another node, etc.), and it then compiles all the results using something like rbind. I've used this a few times with great results.
Edit: I didn't look to rework your code since, having got this far, you'll easily have the skills to wrap it into a function and then use the one-liner foreach.
Edit 2: This was too long for a comment to reply to you.
I thought that since you had got that far with the code you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop that loops over your subjects and does the calculations required for each subject. Then the body of that for loop is what you want in your function. I think in your code that is everything down to area.grid. Then you can get rid of most of your [n]'s, since the data is only subset once per iteration.
Perhaps:
pernode <- function(i) {   # i indexes one individual in IDNames
  IndivData = MovData[MovData$ID==IDNames[i],]
  IndivData = IndivData[1:(nrow(IndivData)-1),]
  if (UseTimeWindow==T){
    IndivDates = dates[dates$ID==IDNames[i],]
    IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]
                          & IndivData$DateTime<IndivDates$End[1],]
  }
  IndivData$TimeDif[nrow(IndivData)]=NA
  BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
                          area.grid = Grid, time.step = 0.1)
  return(BBMM)
}
Then something like:
library(doMC)
library(foreach)
registerDoMC(cores = 48) # or perhaps a few less than all you have
system.time(
  output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine = TRUE,
                    .inorder = FALSE) %dopar% { pernode(i) }
)
Hard to say whether that is it without some test data; let me know how you get on.
This is a general example, since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors would be to use the multicore library and mclapply (a parallelized version of lapply) to push a list through a function; in your case, the individual items on the list would be the data frames for each of the 300+ individuals.
Example:
library(multicore)
result = mclapply(data_list, your_function, mc.preschedule = FALSE, mc.set.seed = FALSE)
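Building that list is one line with split (a sketch using the post's object names; the worker here is assumed to accept one individual's data frame rather than an index):
# Hedged sketch: split MovData into one data frame per individual, keyed by ID.
data_list <- split(MovData, MovData$ID)
result <- mclapply(data_list, your_function,
                   mc.preschedule = FALSE, mc.set.seed = FALSE)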
As I understand your description, you have access to a distributed computing cluster, so the multicore package will not work; you have to use Rmpi, snow, or foreach. Based on your existing loop structure I would advise using the foreach and doSNOW packages. But it looks like you have a lot of data, so you should probably check whether you can cut down what gets sent to the nodes to only what is required.
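A minimal sketch of that setup (the node count and exported names are illustrative, and it assumes brownian.bridge comes from the BBMM package; snow workers do not see your globals, hence .export):
library(foreach)
library(doSNOW)   # loads snow, which provides makeCluster/stopCluster
cl <- makeCluster(48, type = "SOCK")
registerDoSNOW(cl)
output <- foreach(i = 1:length(IDNames), .combine = "rbind", .multicombine = TRUE,
                  .inorder = FALSE, .packages = "BBMM",
                  .export = c("MovData", "IDNames", "dates",
                              "UseTimeWindow", "Grid")) %dopar% {
  pernode(i)
}
stopCluster(cl)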
