I am writing a package for folks who want to predict values based on AADTMAJ, L, and Base_Past. The function provides two options: 1) allow the user to enter their own regression coefficients, or 2) provide the user with predefined coefficients. However, I have not been able to use return() correctly.
input data
data <- data.frame(
  Base_Past = c("HSM-RUR2U-KABCO",
                "HSM-RUR2U-KABCO",
                "HSM-RUR4-KABC",
                "HSM-RUR4-KABCO"),
  AADTMAJ = c(100, 100, 100, 100),
  L = c(1, 1, 1, 1)
)
input custom regression coefficients
custom.spf=data.frame(Base_Past=c("HSM-RUR2U-KABCO","HSM-RUR2U-KABC"), a=c(-0.312,-0.19))
define helper function
helper_function <- function(data, Base_Past = FALSE, override = custom.spf) {
  if (is.data.frame(override)) {
    for (j in 1:nrow(override)) {
      for (i in 1:nrow(data)) {
        if (data[i, ]$Base_Past == override[j, ]$Base_Past) {
          output <- as.numeric(data[i, ]$AADTMAJ * data[i, ]$L * 365 * 10^(-6) * exp(override[j, ]$a))
          return(output)
        } else {
          if (data[i, ]$Base_Past == "HSM-RUR4-KABCO") {a <- -0.101}
          if (data[i, ]$Base_Past == "HSM-RUR4-KABC") {a <- -0.143}
          output <- as.numeric(data[i, ]$AADTMAJ * data[i, ]$L * 365 * 10^(-6) * exp(a))
          return(output)
        }
      }
    }
  } else if (!is.data.frame(override)) {
    if (Base_Past == "HSM-RUR4-KABCO") {a <- -0.101}
    if (Base_Past == "HSM-RUR4-KABC") {a <- -0.143}
    output <- as.numeric(data[i, ]$AADTMAJ * data[i, ]$L * 365 * 10^(-6) * exp(a))
    return(output)
  }
}
run
(data %>% dplyr::rowwise() %>% dplyr::mutate(predicted_value = helper_function(data = data, override=custom.spf)))[,4]
Output
# A tibble: 4 x 1
# Rowwise:
predicted_value
<dbl>
1 0.0267
2 0.0267
3 0.0267
4 0.0267
alternative
data %>% dplyr::mutate(predicted_value = dplyr::case_when(
  Base_Past == "HSM-RUR4-KABCO"  ~ AADTMAJ*L*365*10^(-6)*exp(-0.101),
  Base_Past == "HSM-RUR4-KABC"   ~ AADTMAJ*L*365*10^(-6)*exp(-0.143),
  Base_Past == "HSM-RUR2U-KABCO" ~ AADTMAJ*L*365*10^(-6)*exp(-0.312),
  Base_Past == "HSM-RUR2U-KABC"  ~ AADTMAJ*L*365*10^(-6)*exp(-0.190),
  TRUE ~ NA_real_))
desired output
Base_Past AADTMAJ L predicted_value
1 HSM-RUR2U-KABCO 100 1 0.02671733
2 HSM-RUR2U-KABCO 100 1 0.02671733
3 HSM-RUR4-KABC 100 1 0.03163652
4 HSM-RUR4-KABCO 100 1 0.03299356
The function and your use of it have several problems. Notable among them, since my first batch of comments:
You call it within a rowwise pipe but then pass data=data, which means that it ignores the data coming in through the pipe and instead looks at the whole frame. You might instead use data=cur_data() (since it is inside of a mutate, this works; cur_data() is defined by dplyr for situations like this).
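A minimal sketch of that change, reusing the question's helper_function and custom.spf as-is:
library(dplyr)
data %>%
  rowwise() %>%
  mutate(predicted_value = helper_function(data = cur_data(), override = custom.spf))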
Your helper_function is ill-defined in that it assumes custom.spf is defined and available. Having a function rely on the presence of external variables not explicitly passed to it makes it fragile and can be rather difficult to troubleshoot. If, for instance, custom.spf were not defined in the calling environment, then this function would fail with object 'custom.spf' not found. Instead, I think you could use:
helper_function <- function(..., override=NA) {
if (isTRUE(is.na(override)) && exists("custom.spf")) {
message("found 'custom.spf', using it as 'override'")
override <- custom.spf
}
...
}
I'm still not totally thrilled with this, but at least it won't fail too quickly, and it is verbose about what it is doing.
Using 1:nrow(.) can be a little risky if used programmatically. That is, if for some reason one of the inputs has 0 rows (perhaps custom.spf has nothing to override), then 1:nrow(.) should logically do nothing but instead will iterate twice over rows that do not exist. That is, if nrow(.) is 0, then note that 1:0 returns c(1, 0), which is clearly not "do nothing". Instead, use seq_len(nrow(.)), as seq_len(0) returns integer(0), which is what we would want.
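A quick demonstration of the difference:
1:0         # [1] 1 0  -- two iterations over rows that do not exist
seq_len(0)  # integer(0) -- zero iterations, as intended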
There is no reason to use rowwise() here, and its use should be avoided whenever possible. (It does what it does very well, and when it is truly necessary, it works great. But the performance penalty for iterating one row at a time can be significant, especially for larger data.)
Some of what you are trying to do can be simplified by learning about merge/join methods. Two really good references for merge/join are: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.
Further, it seems as if a significant portion of your effort is to assign a reasonable value to a for your equation. Your inner code (looking for "-KABCO" and "-KABC") looks like it really should be yet another frame of default values.
Here's a suggested helper_function that changes things slightly. It takes as mandatory arguments Base_Past, AADTMAJ, and L, and then zero or more frames to merge/join in order to find an appropriate value for a in the equation.
helper_function <- function(Base_Past, AADTMAJ, L, ...) {
stopifnot(
length(Base_Past) == length(AADTMAJ),
length(Base_Past) == length(L)
)
defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"), a = c(-0.101, -0.143))
frames <- c(list(defaults), list(...))
a <- rep(NA, length(Base_Past))
tmpdat <- data.frame(row = seq_along(Base_Past), Base_Past = Base_Past, a = a)
for (frame in frames) {
tmpdat <- merge(tmpdat, frame, by = "Base_Past", suffixes = c("", ".y"),
all.x = TRUE, sort = FALSE)
tmpdat$a <- ifelse(is.na(tmpdat$a), tmpdat$a.y, tmpdat$a)
tmpdat$a.y <- NULL
}
tmpdat <- tmpdat[order(tmpdat$row),]
AADTMAJ * L * 365 * 10^(-6) * exp(tmpdat$a)
}
The premise is that looking for "default" values of a in your function is really the same as looking them up in your override variable. I could have given you an override= argument for a single lookup dictionary, but it is sometimes useful to have a "one or more" type of argument: perhaps you have more than one frame with other values for a, and you may want to use them all at once. This will work as you desired for a single frame, but if you have multiple, perhaps custom.spf and a second custom.spf2, this would also work (by adding all of them after the L argument when called), as sketched below.
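As a hypothetical illustration (custom.spf2 and its category/coefficient are invented here), a later frame only fills in categories the earlier frames left as NA:
custom.spf2 <- data.frame(Base_Past = "HSM-URB2U-KABCO", a = -0.05)
data %>%
  mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf, custom.spf2))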
I chose to keep the internals of the function in simple base R for a few reasons, none of which stands out as critical. The portion that could be dplyr-ized is within the for (frame in frames) loop.
data %>%
mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf))
# Base_Past AADTMAJ L a
# 1 HSM-RUR2U-KABCO 100 1 0.02671733
# 2 HSM-RUR2U-KABCO 100 1 0.02671733
# 3 HSM-RUR4-KABC 100 1 0.03163652
# 4 HSM-RUR4-KABCO 100 1 0.03299356
The function should operate cleanly within grouping (group_by or rowwise) if you desire, but it is certainly not necessary to do what you asked originally.
Related
I want to concatenate iris$SepalLength so I can use it in a function to get the Sepal Length column from the iris data frame. But when I use the paste function, paste("iris$", colnames(iris[3])), the result is a character string (with quotes), "iris$SepalLength". I need the result not to be a character. I have tried noquote(), as.data.frame(), etc., but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for (count in 1:length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R entirely; there's a host of function families available for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that the rows are not entirely independent of one another (the closer two rows are, the more dependent they are), and so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.
One efficient approach would be to use the 'boot' package, whose functions include parallel processing capabilities.
In particular, tsboot, the time series bootstrap function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments indicate that you want fixed blocks of length (l) 1000. However, you could do blocks of any size, 5 or 10 if you're trying to capture nearest-neighbor dynamics moving over time. The parallel = "multicore" argument speaks for itself, but it may need to be "snow" if you are using Windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR),function(x){
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
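Assuming res as built above, one bootstrap replicate of the statistic could then be computed by combining the per-chromosome samples:
sampled <- do.call(rbind, res)
cor(sampled$var_A, sampled$var_B)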
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach over each chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: before you work on parallelising your code, you should be sure that it's as efficient as possible. That means you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.
So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
l = 1000
teste = freq %>%
mutate(w = ceiling(POS/l)) %>%
group_by(CHR, w) %>%
sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.
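For instance, a sketch of that repetition with 1000 replicates (matching the l and iteration count from the question):
library(dplyr)
l <- 1000
cors <- replicate(1000, {
  s <- freq %>%
    mutate(w = ceiling(POS/l)) %>%
    group_by(CHR, w) %>%
    sample_n(1)
  cor(s$var_A, s$var_B)
})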
I have a data frame containing many time columns. I want to add columns for each time for year, month, date, etc.
Here is what I have so far:
library(dplyr)
library(lubridate)
times <- c(133456789, 143456789, 144456789 )
train2 <- data.frame(sent_time = times, open_time = times)
time_col_names <- c("sent_time", "open_time")
dt_part_names <- c("year", "month", "hour", "wday", "day")
train3 <- as.data.frame(train2)
dummy <- lapply(time_col_names, function(col_name) {
pct_times <- as.POSIXct(train3[,col_name], origin = "1970-01-01", tz = "GMT")
lapply(dt_part_names, function(part_name) {
part_col_name <- paste(col_name, part_name, sep = "_")
train3[, part_col_name] <- rep(NA, nrow(train3))
train3[, part_col_name] <- factor(get(part_name)(pct_times))
})
})
Everything seems to work, except the columns never get created or assigned. The components do get extracted, and the assignment succeeds without error, but train3 does not have any new columns.
I have checked that the assignment works when I call it outside the nested lapply context:
train3[, "x"] <- rep(NA, nrow(train3))
In this case, column x does get created.
It is often believed that the apply family provides an advantage in terms of performance compared to a for loop. But the most important difference between a for loop and a loop from the *apply() family is that the latter is designed to have no side effects.
The absence of side effects favors the development of clean, well-structured, and concise code. A problem occurs if one wishes to have side effects, which is usually a symptom of a flawed code design.
Here is a simple example to illustrate this:
myvector <- 10:1
sapply(myvector,prod,2)
# [1] 20 18 16 14 12 10 8 6 4 2
It looks correct, right? The sapply() loop has seemingly multiplied the entries of myvector by two (granted, this result could have been achieved more easily, but this is just a simple example to discuss the functioning of *apply()).
Upon inspection, however, one realizes that this operation has not changed myvector at all:
> myvector
# [1] 10 9 8 7 6 5 4 3 2 1
That is because sapply() did not have the side effect of modifying myvector. In this example the sapply() loop is equivalent to the command print(myvector*2), and not to myvector <- myvector * 2. The *apply() loops return an object, but they don't modify the original one.
If one really wants to change the object within the loop, the superassignment operator <<- is necessary to modify the object outside the scope of the loop. This should almost never be done, and things become quite ugly in this case. For example, the following loop does change my myvector:
sapply(seq_along(myvector), function(x) myvector[x] <<- myvector[x]*2)
> myvector
# [1] 20 18 16 14 12 10 8 6 4 2
Coding in R should not look like this. Note that also in this more convoluted case, if the normal assignment operator <- is used instead of <<-, then myvector remains unchanged. The correct approach is to assign the object returned by *apply instead of modifying it within the loop, as shown below.
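For the toy example above, the idiomatic version is simply:
myvector <- 10:1
myvector <- sapply(myvector, prod, 2) # assign the returned object
myvector
# [1] 20 18 16 14 12 10  8  6  4  2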
In the specific case described by the OP, the variable dummy may contain the desired output if the commands in the loop are correct. But one cannot expect that the object train3 is modified within the loop. For this the <<- operator would be necessary.
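A sketch of what the assignment-based approach could look like for the OP's code (reusing train3, time_col_names, and dt_part_names from the question):
library(lubridate) # for year(), month(), hour(), wday(), day()
# build the new columns as returned objects, then bind them on in one step
new_cols <- do.call(cbind, lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train3[, col_name], origin = "1970-01-01", tz = "GMT")
  parts <- lapply(dt_part_names, function(part_name)
    factor(get(part_name)(pct_times)))
  names(parts) <- paste(col_name, dt_part_names, sep = "_")
  as.data.frame(parts)
}))
train3 <- cbind(train3, new_cols) # assign the result; no side effects needed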
A quote mentioned in fortunes::fortune(212) possibly summarizes the problem:
Basically R is reluctant to let you shoot yourself in the foot unless
you are really determined to do so. -- Bill Venables
Reading this excellent post, I came across within and transform.
Reading both help files, I unfortunately did not fully understand what the difference is...
I tried something like:
df <- data.frame(A = runif(5), B = rnorm(5))
A=1:5
within(df, C<-A+B)
transform(df,C=A+B)
Both times the output was:
A B C
1 0.2326266 1.3237210 1.5563476
2 0.4581693 -0.2605674 0.1976018
3 0.6431078 0.5920021 1.2351099
4 0.9682578 1.1964012 2.1646590
5 0.9889942 0.5468008 1.5357950
So both seem to create a new environment, as they are ignoring A=1:5 during the evaluation.
Thanks in advance!
within lets you use an earlier defined variable later in the same call, but transform does not:
within(BOD, { a <- demand; b <- a }) # ok
transform(BOD, a = demand, b = a) # error
Note that a number of years ago I defined a variation of transform that acts more like within, here, where it is called my.transform. Using that we could write the above like this:
my.transform(BOD, a = demand, b = a) # ok
In the above examples within (or my.transform) would be better but in the following transform is better:
transform(BOD, Time = demand, demand = Time) # swap columns
within(BOD, { Time <- demand; demand <- Time }) # oops
(To perform the swap with within would require that we define a temporary.)
EDIT
my.transform is now in the gsubfn CRAN package, where it is called transform2. mutate in dplyr works from left to right.
Note that transform, transform2, and mutate each work slightly differently. The RHS of each transform argument refers to the original values. The RHS of each mutate argument refers to the most recent left-to-right value, as illustrated below. transform2 figures out the dependencies and uses those, so that a dependent can come before or after the argument in which it is used.
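A small illustration of mutate's left-to-right behaviour, assuming dplyr is loaded:
library(dplyr)
mutate(BOD, a = demand, b = a) # ok: b refers to the a created just before it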
I would like to speed up the function below (fndf), which calls another function (fn1) based on a character array.
fndf - the new function
list_s - character array, chr [1:400]
rdata_i - empty data frame (for initialization)
fn1 - another custom function
rdata2 - data frame with 3000 obs. of 40 variables
mdata - data frame
nm - character
fndf = function(list_s, rdata2){
  rdata_i = data.frame(Date = as.Date(character()),
                       File = character(),
                       User = character(),
                       stringsAsFactors = FALSE)
  for (i in 1:length(list_s)) {
    rdata = fn1(list_s[i], rdata2)
    rdata_i = rbind(rdata, rdata_i)  # repeated rbind: grows the frame on every pass
  }
  return(unique(rdata_i))
}
Can we also improve performance of the function below?
fn1 = function(nm, mdata){
  n0 = mdata[mdata$Sign == nm, ]
  cn0 = unique(c(n0$Name))
  repeat {
    n1c = mdata[mdata$Mgr %in% cn0, ]
    n0 = unique(rbind(n0, n1c))
    if (nrow(n1c) == 0) {
      return(n0)  # the break that followed this return was unreachable
    }
    cn0 = unique(c(n1c$Name))
  }
}
It’s indeed hard to say how best to transform your loop into an *apply statement, and even harder to say whether this will speed it up. But fundamentally, the following transformation is what you’re after, and it definitely makes the function simpler and more readable. It also quite possibly corresponds to a substantial performance gain due to the loss of the repeated rbind, as noted by baptiste:
fndf = function (list_s, rdata2)
as.data.frame(do.call(rbind, unique(lapply(list_s, fn1, rdata2))))
(Yes. That’s a single statement.)
Also note that I’m now applying the unique directly to the list rather than the data.frame. This changes the semantics – unique is specialised for data.frames – but is probably the right thing for your purposes, and it will be more efficient because it means that we don’t construct a needlessly big data.frame with redundant rows.
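If you instead want the original row-level semantics, a sketch (with the hypothetical name fndf_rows) that applies unique to the combined frame:
fndf_rows <- function(list_s, rdata2)
  unique(do.call(rbind, lapply(list_s, fn1, rdata2)))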
It's hard to say without your data/functions, but here is a solution with plyr and some placeholder data:
list_s <- LETTERS
rdata2 <- data.frame(a = rep(LETTERS, 2), b = runif(52), c = runif(52)*10)
fn1 <- function(a, b = rdata2) b[b$a == a, ]  # subset b itself, not the global rdata2
fn1("A")
require(plyr) # for the ldply function, which takes a list and returns a data frame
result <- ldply(1:length(list_s), function(x) fn1(list_s[x], rdata2))
head(result)
a b c
1 A 0.281940237 2.7774933
2 A 0.023611392 0.6067029
3 B 0.456547803 9.4219258
4 B 0.645783746 5.3094864
5 C 0.475949523 4.8580622
6 C 0.006063407 2.5851738