Reading this excellent post, I came across within and transform.
Reading both help files, I unfortunately did not fully understand what the difference is.
I tried something like:
df <- data.frame(A = runif(5), B = rnorm(5))
A=1:5
within(df, C<-A+B)
transform(df,C=A+B)
Both times the output was:
A B C
1 0.2326266 1.3237210 1.5563476
2 0.4581693 -0.2605674 0.1976018
3 0.6431078 0.5920021 1.2351099
4 0.9682578 1.1964012 2.1646590
5 0.9889942 0.5468008 1.5357950
So both seem to create a new environment, as they ignore the global A = 1:5 during the evaluation.
Thanks in advance!
within lets you use a variable defined earlier in the same call in a later expression, but transform does not:
within(BOD, { a <- demand; b <- a }) # ok
transform(BOD, a = demand, b = a) # error
Note that a number of years ago I defined a variation of transform that acts more like within, here, where it is called my.transform. Using that we could write the above like this:
my.transform(BOD, a = demand, b = a) # ok
In the above examples within (or my.transform) would be better, but in the following transform is better:
transform(BOD, Time = demand, demand = Time) # swap columns
within(BOD, { Time <- demand; demand <- Time }) # oops
(To perform the swap with within would require that we define a temporary.)
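For completeness, here is one way the swap could look with within, using a temporary variable (the name tmp is purely illustrative):
within(BOD, { tmp <- Time; Time <- demand; demand <- tmp; rm(tmp) }) # rm drops the temporary so it is not added as a column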
EDIT
my.transform is now in the gsubfn CRAN package, where it is called transform2. mutate in dplyr works from left to right.
Note that transform, transform2 and mutate each work slightly differently. The RHS of transform arguments all refer to the original values. The RHS of mutate arguments refer to the most recent left-to-right value. transform2 figures out the dependencies and uses those, so that a dependent argument can come before or after the argument in which it is used.
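A quick sketch of mutate's left-to-right behaviour, again with the builtin BOD data set:
library(dplyr)
mutate(BOD, a = demand, b = a)            # ok: b refers to the freshly created a
mutate(BOD, Time = demand, demand = Time) # not a swap: demand sees the new Time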
I am writing a package for folks who want to predict values based on AADTMAJ, L, and Base_Past. The function provides two options: 1) allow the user to enter their own regression coefficients, or 2) provide the user with predefined coefficients. However, I have not been able to use return() correctly.
input data
data=data.frame(Base_Past=c("HSM-RUR2U-KABCO",
"HSM-RUR2U-KABCO",
"HSM-RUR4-KABC",
"HSM-RUR4-KABCO"),
AADTMAJ=c(100,100,100,100),
L=c(1,1,1,1)
)
input custom regression coefficients
custom.spf=data.frame(Base_Past=c("HSM-RUR2U-KABCO","HSM-RUR2U-KABC"), a=c(-0.312,-0.19))
define helper function
helper_function = function (data, Base_Past=FALSE, override=custom.spf){
if (is.data.frame(override)){
for (j in 1:nrow(override)){
for (i in 1:nrow(data)){
if(data[i, ]$Base_Past==override[j, ]$Base_Past){
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(override[j, ]$a))
return(output)} else{
if(data[i, ]$Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
if(data[i, ]$Base_Past=="HSM-RUR4-KABC") {a=-0.143}
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
return(output)
}
}
}
}
else if (!is.data.frame(override)){
if(Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
if(Base_Past=="HSM-RUR4-KABC") {a=-0.143}
output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
return(output)
}
}
run
(data %>% dplyr::rowwise() %>% dplyr::mutate(predicted_value = helper_function(data = data, override=custom.spf)))[,4]
Output
# A tibble: 4 x 1
# Rowwise:
predicted_value
<dbl>
1 0.0267
2 0.0267
3 0.0267
4 0.0267
alternative
data %>%
  dplyr::mutate(predicted_value = dplyr::case_when(
    Base_Past == "HSM-RUR4-KABCO"  ~ AADTMAJ * L * 365 * 10^(-6) * exp(-0.101),
    Base_Past == "HSM-RUR4-KABC"   ~ AADTMAJ * L * 365 * 10^(-6) * exp(-0.143),
    Base_Past == "HSM-RUR2U-KABCO" ~ AADTMAJ * L * 365 * 10^(-6) * exp(-0.312),
    Base_Past == "HSM-RUR2U-KABC"  ~ AADTMAJ * L * 365 * 10^(-6) * exp(-0.190),
    TRUE ~ NA_real_))
desired output
Base_Past AADTMAJ L predicted_value
1 HSM-RUR2U-KABCO 100 1 0.02671733
2 HSM-RUR2U-KABCO 100 1 0.02671733
3 HSM-RUR4-KABC 100 1 0.03163652
4 HSM-RUR4-KABCO 100 1 0.03299356
The function and your use of it have several problems. Notable additions to the list of problems since my first batch of comments:
You call it within a rowwise pipe but then pass data=data, which means that it ignores the data coming through the pipe and instead looks at the whole original frame. You might instead use data=cur_data() (since it is inside of a mutate, this works, as cur_data() is defined by dplyr for exactly this kind of situation); see the sketch below.
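A minimal sketch of that change, keeping the rest of the question's pipe intact (note that newer dplyr versions supersede cur_data() with pick()):
data %>%
  dplyr::rowwise() %>%
  dplyr::mutate(predicted_value = helper_function(data = dplyr::cur_data(), override = custom.spf))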
Your helper_function is ill-defined in that it assumes custom.spf is defined and available. Having a function rely on the presence of external variables not explicitly passed to it makes it fragile and can be rather difficult to troubleshoot. If, for instance, custom.spf were not defined in the calling environment, this function would fail with object 'custom.spf' not found. Instead, I think you could use:
helper_function <- function(..., override=NA) {
if (isTRUE(is.na(override)) && exists("custom.spf")) {
message("found 'custom.spf', using it as 'override'")
override <- custom.spf
}
...
}
I'm still not totally thrilled with this, but at least it won't fail as readily, and it is explicit about what it is doing.
Using 1:nrow(.) can be a little risky if used programmatically. That is, if for some reason one of the inputs has 0 rows (perhaps custom.spf has nothing to override), then 1:nrow(.) should logically do nothing, but it will instead iterate twice over rows that do not exist: when nrow(.) is 0, 1:0 returns c(1, 0), which is clearly not "do nothing". Instead, use seq_len(nrow(.)), as seq_len(0) returns integer(0), which is what we would want.
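A quick illustration of the difference:
n <- 0
1:n         # [1] 1 0    -- iterates twice over rows that do not exist
seq_len(n)  # integer(0) -- iterates zero times, as intended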
There is no reason to use rowwise() here, and its use should be avoided whenever possible. (It does what it does very well, and when it is truly necessary, it works great. But the performance penalty for iterating one row at a time can be significant, especially for larger data.)
Some of what you are trying to do can be simplified by learning about merge/join methods. Two really good references for merge/join are: How to join (merge) data frames (inner, outer, left, right), What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.
Further, it seems as if a significant portion of your effort is to assign a reasonable value to a for your equation. Your inner code (looking for "-KABCO" and "-KABC") looks like it really should be yet another frame of default values.
Here's a suggested helper_function that changes things slightly. It takes as mandatory arguments Base_Past, AADTMAJ, and L, and then zero or more frames to merge/join in order to find an appropriate value for a in the equation.
helper_function <- function(Base_Past, AADTMAJ, L, ...) {
stopifnot(
length(Base_Past) == length(AADTMAJ),
length(Base_Past) == length(L)
)
defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"), a = c(-0.101, -0.143))
frames <- c(list(defaults), list(...))
a <- rep(NA, length(Base_Past))
tmpdat <- data.frame(row = seq_along(Base_Past), Base_Past = Base_Past, a = a)
for (frame in frames) {
tmpdat <- merge(tmpdat, frame, by = "Base_Past", suffixes = c("", ".y"),
all.x = TRUE, sort = FALSE)
tmpdat$a <- ifelse(is.na(tmpdat$a), tmpdat$a.y, tmpdat$a)
tmpdat$a.y <- NULL
}
tmpdat <- tmpdat[order(tmpdat$row),]
AADTMAJ * L * 365 * 10^(-6) * exp(tmpdat$a)
}
The premise is that looking up "default" values of a inside your function is really the same operation as looking them up in your override variable. I could have given you an override= argument for a single lookup dictionary, but it is sometimes useful to have a "one or more" type of argument: perhaps you have more than one frame with other values for a, and you may want to use them all at once. This works as you desired for a single frame, and if you have several override frames it still works (by adding all of them after the L argument when called).
I chose to keep the internals of the function in simple base R for a few reasons, none of which stands out as critical. The portion that could be dplyr-ized is the for (frame in frames) loop; see the sketch below.
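For what it's worth, a dplyr-flavoured sketch of that inner loop could use left_join() plus coalesce(); this is the same logic, offered as a sketch rather than a tested drop-in replacement:
for (frame in frames) {
  tmpdat <- dplyr::left_join(tmpdat, frame, by = "Base_Past", suffix = c("", ".y"))
  tmpdat <- dplyr::mutate(tmpdat, a = dplyr::coalesce(a, a.y), .keep = "unused")
}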
data %>%
mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf))
# Base_Past AADTMAJ L a
# 1 HSM-RUR2U-KABCO 100 1 0.02671733
# 2 HSM-RUR2U-KABCO 100 1 0.02671733
# 3 HSM-RUR4-KABC 100 1 0.03163652
# 4 HSM-RUR4-KABCO 100 1 0.03299356
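And if you had a second table of coefficients, it could simply be passed as another argument; custom.spf2 below is hypothetical, only there to show the calling pattern:
custom.spf2 <- data.frame(Base_Past = "HSM-RUR2U-KABC", a = -0.190)  # made-up extra lookup table
data %>%
  mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf, custom.spf2))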
The function should operate cleanly within grouping (group_by or rowwise) if you desire, but it is certainly not necessary to do what you asked originally.
I have a tibble called 'Volume' in which I store some data (10 columns, 30 rows; the first 2 columns are characters).
Now I want to calculate the relative volume for every column, i.e. each column divided by column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually get it to run:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide any data, I've followed your description to create some mockup data:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volumn_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably they correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problems I renamed the columns (somewhat brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to apply a function over multiple columns. In this case I want to perform a division by Volume_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volume_OD is equivalent to function(x) x / Volume_OD.
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
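Written with an explicit anonymous function instead of the ~ shorthand, the same step would be:
rel.Volume_unmod <- Volume %>%
  mutate(across(all_of(cols), function(x) x / Volume_OD))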
More info
Check out this book to learn more about data manipulation with the tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by columns 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
in this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
the function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are passing rel.Volume_unmod[[3]] (the third column, extracted as a vector) as an additional parameter.
lapply always returns a list. But since we are assigning the result of lapply to a "fraction of a dataframe", we just edit the columns of the dataframe and, as a result, we get a dataframe instead of a list. Let me rephrase that in a more technical way. When you assign rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using the assignment function [<-. This is a function that allows you to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
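A tiny demonstration of that last point:
df <- data.frame(a = 1:3, b = 4:6)
df[1:2] <- lapply(df[1:2], `*`, 10)  # this assignment goes through `[<-.data.frame`
class(df)                            # still "data.frame": the attributes are untouched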
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, double brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL will result in an error when you try to assign to it this way, since you haven't told R what rel.Volume should be.
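To see the first point in action, compare what the loop variable holds in these two versions (toy data, not your Volume):
d <- data.frame(x = 1:2, y = 3:4)
for (i in d) print(i)             # iterates over the columns: 1 2, then 3 4
for (i in seq_along(d)) print(i)  # iterates over the indices: 1, then 2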
Try this, if you like (thanks #Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you are new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. In my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).
Questions like R variable names in loop, get, etc are common among people coming to R from other languages. The standard answer is usually, as I gave in that example, that it's not possible to iterate through a list of variables in the global environment and modify the underlying variable, only a copy. This fits in with the R semantics of not passing values by reference and creating copies of objects as necessary.
However, this type of construct works:
xData <- data.frame(a = 1:2, b = 3:4)
yData <- data.frame(a = 4:5, b = 6:7)
varList <- ls(pattern = "Data$")
for(var in varList) {
.GlobalEnv[[var]]$c <- with(.GlobalEnv[[var]], a + b)
}
xData
# a b c
# 1 1 3 4
# 2 2 4 6
Other than being poor programming style, and not really fitting with the 'R' way of doing things, are there any specific issues with this style of coding?
Please note I'm not advocating this as a good practice, just curious as to whether this is likely to have unintended side effects.
I would like to speed up the function below (fndf), which calls another function (fn1) for each element of a character vector.
fndf - new function
list_s - character vector - chr [1:400]
rdata_i - empty data frame (for initialization)
fn1 - another custom function
rdata2 - data frame with 3000 obs. of 40 variables
mdata - data frame
nm - character
fndf = function(list_s, rdata2){
  # initialise an empty result data frame
  rdata_i = data.frame(Date = as.Date(character()),
                       File = character(),
                       User = character(),
                       stringsAsFactors = FALSE)
  for(i in 1:length(list_s))
  {
    rdata = fn1(list_s[i], rdata2)
    rdata_i = rbind(rdata, rdata_i)
  }
  return(unique(rdata_i))
}
Can we also improve performance of the function below?
fn1 = function(nm, mdata){
n0 = mdata[mdata$Sign==nm,]
cn0 = unique(c(n0$Name))
repeat{
n1c = mdata[mdata$Mgr %in% cn0,]
n0 = unique(rbind(n0,n1c))
if(nrow(n1c)==0){
return(n0)
break
}
cn0= unique(c(n1c$Name))
}
}
It’s indeed hard to say how to best transform your loop into an *apply statement, and even harder to say whether this will speed it up. But fundamentally, the following transformation is what you’re after, and it definitely makes the function simpler and more readable. It also quite possibly corresponds to a substantial performance gain due to the loss of the repeated rbind, as noted by baptiste:
fndf = function (list_s, rdata2)
as.data.frame(do.call(rbind, unique(lapply(list_s, fn1, rdata2))))
(Yes. That’s a single statement.)
Also note that I’m now applying the unique directly to the list rather than the data.frame. This changes the semantics – unique is specialised for data.frames – but is probably the right thing for your purposes, and it will be more efficient because it means that we don’t construct a needlessly big data.frame with redundant rows.
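To make that semantic difference concrete, here is a toy illustration (made-up pieces, not your data):
pieces <- list(data.frame(x = 1:2), data.frame(x = 1:2), data.frame(x = 2:3))
length(unique(pieces))                # 2: whole duplicated pieces are dropped
nrow(unique(do.call(rbind, pieces)))  # 3: duplicated rows are dropped instead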
It's hard to say without your data/functions, but here is a solution with plyr and some placeholder data:
list_s <- LETTERS
rdata2 <- data.frame(a = rep(LETTERS, 2), b = runif(52), c = runif(52) * 10)
fn1 <- function(a, b = rdata2) b[b$a == a, ]
fn1("A")
require(plyr) # for the ldply function, which takes a list and returns a data frame
result <- ldply(1:length(list_s), function(x) fn1(list_s[x], rdata2))
head(result)
a b c
1 A 0.281940237 2.7774933
2 A 0.023611392 0.6067029
3 B 0.456547803 9.4219258
4 B 0.645783746 5.3094864
5 C 0.475949523 4.8580622
6 C 0.006063407 2.5851738
I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine, but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other people's examples on how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you create temp, temp2, and temp3 (or overwrite the old ones), and thus you'll be left with only the output of the last time through the loop.
However, this isn't something that needs a loop. Instead, you can make use of the vectorized nature of R:
x <- 1:10
c(NA, x[-length(x)])
# [1] NA  1  2  3  4  5  6  7  8  9
So if you combine that notion with a library like plyr that splits data nicely, you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
  c(NA, x[-length(x)])
}
ddply(part2test, .(uniquerid), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
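For example, a data.table sketch of the same operation might look like this (assuming part2test really has the RID and EXAMDATE columns from your question):
library(data.table)
dt <- as.data.table(part2test)
dt[, EXAMDATE_LAG := shift(EXAMDATE, 1, type = "lag"), by = RID]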