I have a data frame containing many time columns. For each time column I want to add columns for year, month, day, and so on.
Here is what I have so far:
library(dplyr)
library(lubridate)
times <- c(133456789, 143456789, 144456789)
train2 <- data.frame(sent_time = times, open_time = times)
time_col_names <- c("sent_time", "open_time")
dt_part_names <- c("year", "month", "hour", "wday", "day")
train3 <- as.data.frame(train2)
dummy <- lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train3[, col_name], origin = "1970-01-01", tz = "GMT")
  lapply(dt_part_names, function(part_name) {
    part_col_name <- paste(col_name, part_name, sep = "_")
    train3[, part_col_name] <- rep(NA, nrow(train3))
    train3[, part_col_name] <- factor(get(part_name)(pct_times))
  })
})
Everything seems to work, except the columns never get created or assigned. The components do get extracted, and the assignment succeeds without error, but train3 does not have any new columns.
I have checked that the assignment works when I call it outside the nested lapply context:
train3[, "x"] <- rep(NA, nrow(train3))
In this case, column x does get created.
It is often believed that the apply family provides an advantage in terms of performance compared to a for loop. But the most important difference between a for loop and a loop from the *apply() family is that the latter is designed to have no side effects.
The absence of side effects favors the development of clean, well-structured, and concise code. A problem occurs if one wishes to have side effects, which is usually a symptom of a flawed code design.
Here is a simple example to illustrate this:
myvector <- 10:1
sapply(myvector, prod, 2)
# [1] 20 18 16 14 12 10 8 6 4 2
It looks correct, right? The sapply() loop has seemingly multiplied the entries of myvector by two (granted, this result could have been achieved more easily, but this is just a simple example to discuss the functioning of *apply()).
Upon inspection, however, one realizes that this operation has not changed myvector at all:
> myvector
# [1] 10 9 8 7 6 5 4 3 2 1
That is because sapply() did not have the side effect to modify myvector. In this example the sapply() loop is equivalent to the command print(myvector*2), and not to myvector <- myvector * 2. The *apply() loops return an object, but they don't modify the original one.
If one really wants to change the object within the loop, the superassignment operator <<- is necessary to modify the object outside the scope of the loop. This should almost never be done, and things become quite ugly in this case. For example, the following loop does change myvector:
sapply(seq_along(myvector), function(x) myvector[x] <<- myvector[x]*2)
> myvector
# [1] 20 18 16 14 12 10 8 6 4 2
Coding in R should not look like this. Note that even in this more convoluted case, if the normal assignment operator <- is used instead of <<-, then myvector remains unchanged. The correct approach is to assign the object returned by *apply() instead of modifying it within the loop.
In the specific case described by the OP, the variable dummy may contain the desired output if the commands in the loop are correct. But one cannot expect the object train3 to be modified within the loop; for that, the <<- operator would be necessary.
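For completeness, here is one way the OP's nested loop could be restructured so that the new columns are captured by assignment rather than attempted as a side effect (a sketch built only from the question's own objects; it assumes lubridate is attached so that get() finds year(), month(), and friends):

# build all derived columns as one data frame, then bind it on once
new_cols <- do.call(cbind, lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train3[, col_name], origin = "1970-01-01", tz = "GMT")
  parts <- lapply(dt_part_names, function(part_name) factor(get(part_name)(pct_times)))
  names(parts) <- paste(col_name, dt_part_names, sep = "_")
  as.data.frame(parts)
}))
train3 <- cbind(train3, new_cols)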
A quote mentioned in fortunes::fortune(212) possibly summarizes the problem:
Basically R is reluctant to let you shoot yourself in the foot unless
you are really determined to do so. -- Bill Venables
I am writing a package for folks who want to predict values based on AADTMAJ, L, and Base_Past. The function provides two options: 1) allow the user to enter their own regression coefficients, or 2) provide the user with predefined coefficients. However, I have not been able to use return() correctly.
input data
data=data.frame(Base_Past=c("HSM-RUR2U-KABCO",
"HSM-RUR2U-KABCO",
"HSM-RUR4-KABC",
"HSM-RUR4-KABCO"),
AADTMAJ=c(100,100,100,100),
L=c(1,1,1,1)
)
input custom regression coefficients
custom.spf=data.frame(Base_Past=c("HSM-RUR2U-KABCO","HSM-RUR2U-KABC"), a=c(-0.312,-0.19))
define helper function
helper_function = function (data, Base_Past=FALSE, override=custom.spf){
  if (is.data.frame(override)){
    for (j in 1:nrow(override)){
      for (i in 1:nrow(data)){
        if(data[i, ]$Base_Past==override[j, ]$Base_Past){
          output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(override[j, ]$a))
          return(output)
        } else{
          if(data[i, ]$Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
          if(data[i, ]$Base_Past=="HSM-RUR4-KABC") {a=-0.143}
          output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
          return(output)
        }
      }
    }
  } else if (!is.data.frame(override)){
    if(Base_Past=="HSM-RUR4-KABCO") {a=-0.101}
    if(Base_Past=="HSM-RUR4-KABC") {a=-0.143}
    output=as.numeric(data[i, ]$AADTMAJ*data[i, ]$L*365*10^(-6)*exp(a))
    return(output)
  }
}
run
(data %>% dplyr::rowwise() %>% dplyr::mutate(predicted_value = helper_function(data = data, override=custom.spf)))[,4]
Output
# A tibble: 4 x 1
# Rowwise:
predicted_value
<dbl>
1 0.0267
2 0.0267
3 0.0267
4 0.0267
alternative
data %>% dplyr::mutate(predicted_value=dplyr::case_when(Base_Past =="HSM-RUR4-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.101),
Base_Past=="HSM-RUR4-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.143),
Base_Past=="HSM-RUR2U-KABCO" ~AADTMAJ*L*365*10^(-6)*exp(-0.312),
Base_Past=="HSM-RUR2U-KABC" ~AADTMAJ*L*365*10^(-6)*exp(-0.190),
TRUE ~ NA_real_))
desired output
Base_Past AADTMAJ L predicted_value
1 HSM-RUR2U-KABCO 100 1 0.02671733
2 HSM-RUR2U-KABCO 100 1 0.02671733
3 HSM-RUR4-KABC 100 1 0.03163652
4 HSM-RUR4-KABCO 100 1 0.03299356
The function and your use of it have several problems. Notable additions to the list of problems since my first batch of comments:
You call it within a rowwise pipe but then pass data=data, which means that it ignores the data coming in through the pipe and instead looks at the whole frame. You might instead use data=cur_data() (since it is inside of a mutate, this works, as cur_data() is defined by dplyr for situations like this).
Your helper_function is ill-defined by assuming that custom.spf is defined and available. Having a function rely on the presence of external variables not explicitly passed to it makes it fragile and can be rather difficult to troubleshoot. If, for instance, custom.spf were not defined in the calling environment, then this function would fail with object 'custom.spf' not found. Instead, I think you could use:
helper_function <- function(..., override=NA) {
if (isTRUE(is.na(override)) && exists("custom.spf")) {
message("found 'custom.spf', using it as 'override'")
override <- custom.spf
}
...
}
I'm not totally thrilled with this still, but at least it won't fail too quickly, and is verbose in what it is doing.
Using 1:nrow(.) can be a little risky if used programmatically. That is, if for some reason one of the inputs has 0 rows (perhaps custom.spf has nothing to override), then 1:nrow(.) should logically do nothing but instead will iterate twice over rows that do not exist. That is, if nrow(.) is 0, then note that 1:0 returns c(1, 0), which is clearly not "do nothing". Instead, use seq_len(nrow(.)), as seq_len(0) returns integer(0), which is what we would want.
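A quick illustration of the difference:

nr <- 0        # e.g., nrow() of an empty override frame
1:nr           # [1] 1 0     -- iterates twice over rows that don't exist
seq_len(nr)    # integer(0)  -- iterates zero times, as intended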
There is no reason to use rowwise() here, and its use should be avoided whenever possible. (It does what it does very well, and when it is truly necessary, it works great. But the performance penalty for iterating one row at a time can be significant, especially for larger data.)
Some of what you are trying to do can be simplified by learning about merge/join methods. Two really good references for merge/join are How to join (merge) data frames (inner, outer, left, right) and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
Further, it seems as if a significant portion of your effort is to assign a reasonable value to a for your equation. Your inner code (looking for "-KABCO" and "-KABC") looks like it really should be yet another frame of default values.
Here's a suggested helper_function that changes things slightly. It takes as mandatory arguments Base_Past, AADTMAJ, and L, and then zero or more frames to merge/join in order to find an appropriate value for a in the equation.
helper_function <- function(Base_Past, AADTMAJ, L, ...) {
stopifnot(
length(Base_Past) == length(AADTMAJ),
length(Base_Past) == length(L)
)
defaults <- data.frame(Base_Past = c("HSM-RUR4-KABCO", "HSM-RUR4-KABC"), a = c(-0.101, -0.143))
frames <- c(list(defaults), list(...))
a <- rep(NA, length(Base_Past))
tmpdat <- data.frame(row = seq_along(Base_Past), Base_Past = Base_Past, a = a)
for (frame in frames) {
tmpdat <- merge(tmpdat, frame, by = "Base_Past", suffixes = c("", ".y"),
all.x = TRUE, sort = FALSE)
tmpdat$a <- ifelse(is.na(tmpdat$a), tmpdat$a.y, tmpdat$a)
tmpdat$a.y <- NULL
}
tmpdat <- tmpdat[order(tmpdat$row),]
AADTMAJ * L * 365 * 10^(-6) * exp(tmpdat$a)
}
The premise is that looking up "default" values of a in your function is really the same as looking them up in your override variable. I could have given you an override= argument for a single lookup dictionary, but it is sometimes useful to have a "one or more" type of argument: perhaps you have more than one frame with other values for a, and you may want to use them all at once. This will work as you desired for a single lookup frame, but it also works with multiple, perhaps custom.spf plus a second frame of your own, by adding all of them after the L argument when the function is called.
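For example, with a hypothetical second lookup frame (the signature and coefficient in custom.spf2 below are invented, purely to show the call shape):

# invented second lookup frame, only to illustrate passing several frames
custom.spf2 <- data.frame(Base_Past = "HSM-URB2U-KABCO", a = -0.222)
data %>%
  mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf, custom.spf2))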
I chose to keep the internals of the function in simple base R for a few reasons, none of which stands out as critical. The portion that could be dplyr-ized is within the for (frame in frames) loop.
data %>%
mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf))
# Base_Past AADTMAJ L a
# 1 HSM-RUR2U-KABCO 100 1 0.02671733
# 2 HSM-RUR2U-KABCO 100 1 0.02671733
# 3 HSM-RUR4-KABC 100 1 0.03163652
# 4 HSM-RUR4-KABCO 100 1 0.03299356
The function should operate cleanly within grouping (group_by or rowwise) if you desire, but it is certainly not necessary to do what you asked originally.
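For instance, a grouped call (not needed here, but it behaves the same) could look like:

data %>%
  group_by(Base_Past) %>%
  mutate(a = helper_function(Base_Past, AADTMAJ, L, custom.spf)) %>%
  ungroup()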
I would like to speed up the function below (fndf), which calls another function (fn1) based on a character array.
fndf - the new function
list_s - character array - chr [1:400]
rdata_i - empty data frame (for initialization)
fn1 - another custom function
rdata2 - data frame with 3000 obs of 40 variables
mdata - data.frame
nm - character
fndf = function(list_s, rdata2){
  rdata_i = df <- data.frame(Date=as.Date(character()),
                             File=character(),
                             User=character(),
                             stringsAsFactors=FALSE)
  for(i in 1:length(list_s)) {
    rdata = fn1(list_s[i], rdata2)
    rdata_i = rbind(rdata, rdata_i)
  }
  return(unique(rdata_i))
}
Can we also improve performance of the function below?
fn1 = function(nm, mdata){
  n0 = mdata[mdata$Sign==nm,]
  cn0 = unique(c(n0$Name))
  repeat{
    n1c = mdata[mdata$Mgr %in% cn0,]
    n0 = unique(rbind(n0, n1c))
    if(nrow(n1c)==0){
      return(n0)
    }
    cn0 = unique(c(n1c$Name))
  }
}
It’s indeed hard to say how to best transform your loop into an *apply statement, and even harder to say whether this will speed it up. But fundamentally, the following transformation is what you’re after, and it definitely makes the function simpler and more readable. It also quite possibly corresponds to a substantial performance gain due to the loss of the repeated rbind, as noted by baptiste:
fndf = function (list_s, rdata2)
  as.data.frame(do.call(rbind, unique(lapply(list_s, fn1, rdata2))))
(Yes. That’s a single statement.)
Also note that I’m now applying the unique directly to the list rather than the data.frame. This changes the semantics – unique is specialised for data.frames – but is probably the right thing for your purposes, and it will be more efficient because it means that we don’t construct a needlessly big data.frame with redundant rows.
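A tiny sketch with toy frames to illustrate the distinction: unique() on the list removes whole duplicated elements before the single rbind, whereas unique() on the final data.frame would compare individual rows:

lst <- list(data.frame(x = 1), data.frame(x = 1), data.frame(x = 2))
length(unique(lst))                # 2 -- the duplicated data.frame is dropped
nrow(do.call(rbind, unique(lst)))  # 2 rows after binding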
It's hard to say without your data/functions, but here is a solution with plyr and some placeholder data:
list_s <- LETTERS
rdata2 <- data.frame(a=rep(LETTERS,2), b=runif(52), c=runif(52)*10)
fn1 <- function(a, b=rdata2) b[rdata2$a==a,]
fn1("A")
require(plyr) # for the ldply function, which takes a list and returns a data frame
result <- ldply(1:length(list_s), function(x) fn1(list_s[x], rdata2))
head(result)
a b c
1 A 0.281940237 2.7774933
2 A 0.023611392 0.6067029
3 B 0.456547803 9.4219258
4 B 0.645783746 5.3094864
5 C 0.475949523 4.8580622
6 C 0.006063407 2.5851738
I am scoring a psychometric instrument at work and want to recode a few variables. Basically, each question has five possible responses, worth 0 to 4 respectively. That is how they were coded into our database, so I don't need to do anything except sum those. However, there are three questions that have reversed scores (so, when someone answers 0, we score that as 4). Thus, I am "reversing" those ones.
The data frame basically looks like this:
studyid timepoint date inst_q01 inst_q02 ... inst_q20
1 2 1995-03-13 0 2 ... 4
2 2 1995-06-15 1 3 ... 4
Here's what I've done so far.
# Survey Processing
# Find missing values (-9) and confusions (-1), and sum them
project_f03$inst_nmiss <- rowSums(project_f03[,4:23]==-9)
project_f03$inst_nconfuse <- rowSums(project_f03[,4:23]==-1)
project_f03$inst_nmisstot <- project_f03$inst_nmiss + project_f03$inst_nconfuse
# Recode any missing values into NAs
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
rm(x)
Now, everything so far is fine; I am about to recode the three reversed ones. My initial thought was to do a simple loop through the three variables and do a series of assignment statements, something like below:
# Questions 3, 11, and 16 are reversed
# Questions 3, 11, and 16 are reversed
for(x in c(3,11,16)+3) {
  project_f03[project_f03[,x]==4,x] <- 5
  project_f03[project_f03[,x]==3,x] <- 6
  project_f03[project_f03[,x]==2,x] <- 7
  project_f03[project_f03[,x]==1,x] <- 8
  project_f03[project_f03[,x]==0,x] <- 9
  project_f03[,x] <- project_f03[,x]-5
}
rm(x)
So, the five assignment statements just reassign new values, and the loop just takes it through all three of the variables in question. Since I was reversing the scale, I thought it was easiest to offset everything by 5 and then just subtract five after all recodes were done. The main issue, though, is that there are NAs and those NAs result in errors in the loop (naturally, NA==4 returns an NA in R). Duh - forgot a basic rule!
I've come up with three alternatives, but I'm not sure which is the best.
First, I could obviously just move the NA-creating code after the loop, and it should work fine. Pros: easiest to implement. Cons: Only works if I am receiving data with no innate (versus created) NAs.
Second, I could change the logic statement to be something like:
project_f03[!is.na(project_f03[,x]) & project_f03[,x]==4,x] (note the vectorized &, not &&), which should eliminate the logic conflict. Pros: not too hard, I know it works. Cons: a lot of extra code; seems like a kludge.
Finally, I could change the logic from
project_f03[project_f03[,x]==4,x] <- 5 to
project_f03[project_f03[,x] %in% 4,x] <- 5. This seems to work fine, but I'm not sure if it's a good practice, and wanted to get thoughts. Pros: quick fix for this issue and seems to work; preserves general syntactic flow of "blah blah LOGIC blah <- bleh". Cons: Might create black hole? Not sure what the potential implications of using %in% like this might be.
EDITED TO MAKE CLEAR
This question has one primary component: Is it safe to utilize %in% as described in the third point above when doing logical operations, or are there reasons not to do so?
The second component is: What are recommended ways of reversing the values, like some have described in answers and comments?
The straightforward answer is that there is no black hole to using %in%. But in instances where I want to just discard the NA values, I'd use which: project_f03[which(project_f03[,x]==4),x] <- 5
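A minimal sketch of why == misbehaves with NA while %in% and which() do not:

v <- c(0, NA, 4)
v == 4         # [1] FALSE    NA  TRUE  -- the NA propagates into subsetting
v %in% 4       # [1] FALSE FALSE  TRUE  -- NA simply fails to match
which(v == 4)  # [1] 3                  -- which() drops the NA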
%in% could shorten that earlier bit of code you had:
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
#could be
for(x in 4:23) {project_f03[project_f03[,x] %in% c(-9,-1), x] <- NA}
Like #flodel suggested, you can replace that whole block of code in your for-loop with project_f03[,x] <- rev(0:4)[match(project_f03[,x], 0:4, nomatch=10)]. It should preserve NA. And there are probably more opportunities to simplify code.
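For example, the match() approach reverses the scale in one step and passes NA through:

x <- c(0, 1, NA, 4)
rev(0:4)[match(x, 0:4, nomatch = 10)]
# [1]  4  3 NA  0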
It doesn't answer your question, but should fix your problem:
cols <- c(3,11,16)+3
project_f03[, cols] <- abs(project_f03[, cols]-4)
## or a lot easier (as #TylerRinker suggested); na.rm is needed since NAs are present:
project_f03[, cols] <- max(project_f03[, cols], na.rm=TRUE) - project_f03[, cols]
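As a quick check of the arithmetic, abs(x - 4) reverses the 0:4 scale and passes NA through untouched:

x <- c(0, 1, 2, 3, 4, NA)
abs(x - 4)
# [1]  4  3  2  1  0 NA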
I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid)) {
  temp <- subset(part2test, RID==uniquerid[i])
  temp$EXAMDATE_LAG <- temp$EXAMDATE
  temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
  temp3 <- data.frame(cbind(temp, temp2))
}
It seems that I am creating the new variable just fine but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other peoples' examples on how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code, and Lucaino has pointed one out. Each time through your loop you are going to create temp, temp2, and temp3 (or overwrite the old ones), and thus you'll be left with only the output of the last time through the loop.
However, this isn't something that needs a loop. Instead you can make use of the vectorized nature of R:
x <- 1:10
c(x[-1], NA)
# [1]  2  3  4  5  6  7  8  9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
c(x[-1], NA)
}
ddply(part2test, .(uniquerid), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
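A minimal base-R sketch of that split() idea, under the same assumption that rows are sorted by date within each RID (unsplit() restores the original row order):

myLag <- function(x) c(x[-1], NA)
part2test$EXAMDATE_LAG <- unsplit(lapply(split(part2test$EXAMDATE, part2test$RID), myLag),
                                  part2test$RID)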
I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). The thing is that I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have 1 or 0 observations from each year (although the month itself can vary).
I've already done this, one year at a time, using a loop with a nested if-clause:
for(i in 1:nrow(input_data)) {
  temp_row <- input_data[i,c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  loc2011 <- which(!is.na(temp_row))
  if(length(loc2011) > 0) {
    temp_row_2011[i,] <- temp_row[loc2011[1]] # pick the first observation if there are several
  } else {
    temp_row_2011[i,] <- NA
  }
}
Since my data set is quite big, and I need to perform the above loop 10 times (one for each year), this is taking way too much time. I know one is much better of using apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?
Are you after something like this?:
temp_row_2011 <- apply(input_data, 1, function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, then it's not necessarily due only to the fact of using an apply(), but also because it assigns less stuff and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above.
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than the former (but only valid if input_data is all the same class of data).
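If input_data happens to arrive as an all-numeric data.frame, converting it once up front is cheap (a one-line sketch):

input_data <- as.matrix(input_data)  # only valid if every column has the same class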
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol=24,nrow=10)
# years and months:
colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
# some ids
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
# make an all-NA case:
input_data[2,1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x, input_data){
  input_data_yr <- input_data[, grep(x, colnames(input_data))]
  apply(input_data_yr, 1, function(id){
    id[!is.na(id)][1]
  })
}, input_data)
The all-NA case works. The grep() column-selection idea is lifted from DWin. As in the above example, you could actually define the anonymous interior function and compile it to potentially make the thing run faster.
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case, such as 4 quarters for 2 years, including pathological cases such as all NAs in one row of one year. I would think that instead of requiring you to write out all the year columns by name, you ought to cycle through them with a grep() strategy:
# funyear <- function to work on one year's data and return a single vector
# my efforts keep failing on the all-NA row-by-year combos
sapply(as.character(2011:2001), function(pat) funyear(input_data[grep(pat, names(input_data))]))