R- Please help. Having trouble writing for loop to lag date - r

I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine, but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other people's examples of how to use the lag function?

To answer this fully: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you create (or overwrite) temp, temp2, and temp3, so you are left with only the output of the last iteration.
However, this isn't something that needs a loop. Instead, you can make use of the vectorized nature of R:
x <- 1:10
> c(x[-1], NA)
[1] 2 3 4 5 6 7 8 9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
c(x[-1], NA)
}
# group by the RID column of part2test (uniquerid is a separate vector, not a column)
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
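As a rough sketch of the base R route with split, assuming a small made-up stand-in for part2test (the values and the helper name lag_one are hypothetical): note that c(x[-1], NA) pulls in the next row's value, so if you want the previous row's date, as the question describes, the shift goes the other way.

```r
# Hypothetical stand-in for part2test
part2test <- data.frame(
  RID = c(1, 1, 1, 2, 2),
  EXAMDATE = as.Date(c("2010-01-05", "2010-07-10", "2011-01-15",
                       "2010-03-01", "2010-09-01"))
)

# Previous-row value; indexing with NA (rather than c(NA, ...)) keeps the Date class
lag_one <- function(x) x[c(NA, seq_along(x))[seq_along(x)]]

# Split by person, add the lagged column to each piece, then reassemble
pieces <- split(part2test, part2test$RID)
pieces <- lapply(pieces, function(d) {
  d$EXAMDATE_LAG <- lag_one(d$EXAMDATE)
  d
})
result <- unsplit(pieces, part2test$RID)
result
```

The first row of each RID gets NA because there is no earlier exam to compare against.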

Related

Repeat same action with function and apply family

I'm new to R (I use RStudio) and I have to analyze a big data frame (60 variables, 10,000 observations). My data frame has a species column containing many different animal species. The goal of my work is to get results for 8 different species, so I have to work on each of them separately.
I started by building separate subsets (as I learned in school) with some awesome packages (special thanks to dplyr and tidyr). But now I have to repeat many identical (or nearly identical) actions on each of the 8 species, so I spend a lot of time copying and pasting, and when I make a mistake I have to check and correct thousands of lines.
So I tried to learn about loops and the apply family of functions, but I can't get anything to work.
Here is an example of an action I perform on one species the traditional way (organizing the data):
espece_td_a <- subset(BDD, BDD$espece == "espece A" & BDD$placette =="TOTAL")%>%
select(code_site,passage,adulte)%>%
spread(passage, adulte)
espece_td_a <- full_join(espece_td_a, BDD_P3_TOT_site)
espece_td_a <- replace(espece_td_a, is.na(espece_td_a),0)
espece_td_a$P1[espece_td_a$P1>0]<-1
espece_td_a$P2[espece_td_a$P2>0]<-1
espece_td_a$P3[espece_td_a$P3>0]<-1
write.csv(espece_td_a, file = "espece_td_a.csv")
BDD is my data frame.
BDD_P3_TOT_site is a vector (or a data frame with one column and many rows?) built from BDD.
This "traditional way" works for me, but I have to do something like this so many times, and it takes a lot of time.
So I tried to wrap it in a function:
f <- function(x)
{
select(code_site, passage, adulte)%>%
spread(x, x$passage, x$adulte)%>%
full_join(x, BDD_P3_TOT_site) -> x
x <- replace(x, is.na(x),0)
x$P1[x$P1>0]<-1
x$P2[x$P2>0]<-1
x$P3[x$P3>0]<-1
}
I would like to apply this function to my dataset with lapply (with my 8 species in a list):
l <- c("espece_a","espece_b","espece_c")
lapply(l,f(x))
Problems:
I know this is the wrong way to call lapply if I want to pull my species out of BDD.
The function itself doesn't work either:
I had already made 8 subsets (one for each species of interest).
They are in my global environment: espece_a, espece_b, ...
Then I tried to pass my subsets one by one to my function:
> f(espece_a)
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'code_site' not found
I would also like each resulting table to appear in my global environment under a name I can recognize (e.g. "espece_td_a").
You have 3 issues relating to your use of lapply:
You need to return the object x at the end of the f function.
l should be a list of dataframes not just a vector of dataframe names, i.e. l <- list(espece_a,espece_b,espece_c)
When using lapply with an existing function, you only need to pass the name of the function, i.e. lapply(l,f)
Hopefully this should solve your problem.
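To make those three points concrete, here is a minimal sketch in base R. The two small data frames and the body of f are hypothetical placeholders (your real f would do the select/spread/join steps); the list names are just one way to get recognizable names on the results.

```r
# Hypothetical stand-ins for the per-species subsets
espece_a <- data.frame(code_site = c("s1", "s2"), adulte = c(0, 3))
espece_b <- data.frame(code_site = c("s1", "s2"), adulte = c(2, 0))

# Placeholder transformation: turn counts into presence/absence
f <- function(x) {
  x$adulte[x$adulte > 0] <- 1
  x  # the last evaluated expression is what the function returns
}

# A named list of data frames, not a character vector of their names
l <- list(espece_td_a = espece_a, espece_td_b = espece_b)

# Pass the function itself (f), not a call to it (f(x))
results <- lapply(l, f)

# One CSV per species, named after the list element (written to a temp dir here)
for (nm in names(results)) {
  write.csv(results[[nm]], file = file.path(tempdir(), paste0(nm, ".csv")))
}
```

Keeping the results in a named list (results$espece_td_a and so on) is usually easier to work with than creating eight separate objects in the global environment.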
I solved the function problem:
f <- function(X){
X <- select(X, code_site, passage, adulte)%>%
spread(passage, adulte)
X <- full_join(X, BDD_P3_TOT_site)
X <- replace(X, is.na(X),0)
X$P1[X$P1>0]<-1
X$P2[X$P2>0]<-1
X$P3[X$P3>0]<-1
return(X)
}
test <- f(espece_a)

How to split epochs into year, month, etc

I have a data frame containing many time columns. I want to add columns for each time for year, month, date, etc.
Here is what I have so far:
library(dplyr)
library(lubridate)
times <- c(133456789, 143456789, 144456789 )
train2 <- data.frame(sent_time = times, open_time = times)
time_col_names <- c("sent_time", "open_time")
dt_part_names <- c("year", "month", "hour", "wday", "day")
train3 <- as.data.frame(train2)
dummy <- lapply(time_col_names, function(col_name) {
pct_times <- as.POSIXct(train3[,col_name], origin = "1970-01-01", tz = "GMT")
lapply(dt_part_names, function(part_name) {
part_col_name <- paste(col_name, part_name, sep = "_")
train3[, part_col_name] <- rep(NA, nrow(train3))
train3[, part_col_name] <- factor(get(part_name)(pct_times))
})
})
Everything seems to work, except the columns never get created or assigned. The components do get extracted, and the assignment succeeds without error, but train3 does not have any new columns.
I have checked that the assignment works when I call it outside the nested lapply context:
train3[, "x"] <- rep(NA, nrow(train3))
In this case, column x does get created.
It is often believed that the apply family provides an advantage in terms of performance compared to a for loop. But the most important difference between a for loop and a loop from the *apply() family is that the latter is designed to have no side effects.
The absence of side effects favors the development of clean, well-structured, and concise code. A problem occurs if one wishes to have side effects, which is usually a symptom of a flawed code design.
Here is a simple example to illustrate this:
myvector <- 10:1
sapply(myvector,prod,2)
# [1] 20 18 16 14 12 10 8 6 4 2
It looks correct, right? The sapply() loop has seemingly multiplied the entries of myvector by two (granted, this result could have been achieved more easily, but this is just a simple example to discuss the functioning of *apply()).
Upon inspection, however, one realizes that this operation has not changed myvector at all:
> myvector
# [1] 10 9 8 7 6 5 4 3 2 1
That is because sapply() does not have the side effect of modifying myvector. In this example the sapply() loop is equivalent to the command print(myvector*2), and not to myvector <- myvector * 2. The *apply() loops return an object, but they don't modify the original one.
If one really wants to change the object within the loop, the superassignment operator <<- is necessary to modify the object outside the scope of the loop. This should almost never be done, and things become quite ugly in this case. For example, the following loop does change myvector:
sapply(seq_along(myvector), function(x) myvector[x] <<- myvector[x]*2)
> myvector
# [1] 20 18 16 14 12 10 8 6 4 2
Coding in R should not look like this. Note that also in this more convoluted case, if the normal assignment operator <- is used instead of <<- then myvector remains unchanged. The correct approach is to assign the object returned by *apply instead of modifying it within the loop.
In the specific case described by the OP, the variable dummy may contain the desired output if the commands in the loop are correct. But one cannot expect that the object train3 is modified within the loop. For this the <<- operator would be necessary.
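For the OP's case, a minimal sketch of the "assign what *apply returns" approach: let the inner lapply return the new columns and bind them on afterwards. To keep it dependency-free I've swapped lubridate's helpers for base R stand-ins built on as.POSIXlt; that swap, and the names parts and new_cols, are my assumptions, not the OP's code.

```r
times <- c(133456789, 143456789, 144456789)
train2 <- data.frame(sent_time = times, open_time = times)
time_col_names <- c("sent_time", "open_time")

# Base R stand-ins for a few lubridate helpers (assumption, to avoid the dependency)
parts <- list(
  year  = function(t) as.POSIXlt(t)$year + 1900,
  month = function(t) as.POSIXlt(t)$mon + 1,
  day   = function(t) as.POSIXlt(t)$mday,
  hour  = function(t) as.POSIXlt(t)$hour
)

# Each iteration RETURNS a data frame of new columns instead of modifying train2
new_cols <- lapply(time_col_names, function(col_name) {
  pct_times <- as.POSIXct(train2[[col_name]], origin = "1970-01-01", tz = "GMT")
  out <- lapply(parts, function(fn) factor(fn(pct_times)))
  names(out) <- paste(col_name, names(parts), sep = "_")
  as.data.frame(out)
})

# The only assignment to the result happens out here, at the top level
train3 <- cbind(train2, do.call(cbind, new_cols))
names(train3)
```

No <<- is needed, because nothing inside the anonymous functions tries to reach out and change an object in the calling environment.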
A quote mentioned in fortunes::fortune(212) possibly summarizes the problem:
Basically R is reluctant to let you shoot yourself in the foot unless
you are really determined to do so. -- Bill Venables

Recoding variables in R using the %in% operator to avoid NAs

I am scoring a psychometric instrument at work and want to recode a few variables. Basically, each question has five possible responses, worth 0 to 4 respectively. That is how they were coded into our database, so I don't need to do anything except sum those. However, there are three questions that have reversed scores (so, when someone answers 0, we score that as 4). Thus, I am "reversing" those ones.
The data frame basically looks like this:
studyid timepoint date inst_q01 inst_q02 ... inst_q20
1 2 1995-03-13 0 2 ... 4
2 2 1995-06-15 1 3 ... 4
Here's what I've done so far.
# Survey Processing
# Find missing values (-9) and confusions (-1), and sum them
project_f03$inst_nmiss <- rowSums(project_f03[,4:23]==-9)
project_f03$inst_nconfuse <- rowSums(project_f03[,4:23]==-1)
project_f03$inst_nmisstot <- project_f03$inst_nmiss + project_f03$inst_nconfuse
# Recode any missing values into NAs
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
rm(x)
Now, everything so far is pretty fine, I am about to recode the three reversed ones. Now, my initial thought was to do a simple loop through the three variables, and do a series of assignment statements something like below:
# Questions 3, 11, and 16 are reversed
for(x in c(3,11,16)+3) {
project_f03[project_f03[,x]==4,x] <- 5
project_f03[project_f03[,x]==3,x] <- 6
project_f03[project_f03[,x]==2,x] <- 7
project_f03[project_f03[,x]==1,x] <- 8
project_f03[project_f03[,x]==0,x] <- 9
project_f03[,x] <- project_f03[,x]-5
}
rm(x)
So, the five assignment statements just reassign new values, and the loop just takes it through all three of the variables in question. Since I was reversing the scale, I thought it was easiest to offset everything by 5 and then just subtract five after all recodes were done. The main issue, though, is that there are NAs and those NAs result in errors in the loop (naturally, NA==4 returns an NA in R). Duh - forgot a basic rule!
I've come up with three alternatives, but I'm not sure which is the best.
First, I could obviously just move the NA-creating code after the loop, and it should work fine. Pros: easiest to implement. Cons: Only works if I am receiving data with no innate (versus created) NAs.
Second, I could change the logic statement to be something like:
project_f03[!is.na(project_f03[,x]) & project_f03[,x]==4,x] which should eliminate the logic conflict (note the vectorized &, not &&). Pros: not too hard, I know it works. Cons: A lot of extra code, seems like a kludge.
Finally, I could change the logic from
project_f03[project_f03[,x]==4,x] <- 5 to
project_f03[project_f03[,x] %in% 4,x] <- 5. This seems to work fine, but I'm not sure if it's a good practice, and wanted to get thoughts. Pros: quick fix for this issue and seems to work; preserves general syntactic flow of "blah blah LOGIC blah <- bleh". Cons: Might create black hole? Not sure what the potential implications of using %in% like this might be.
EDITED TO MAKE CLEAR
This question has one primary component: Is it safe to utilize %in% as described in the third point above when doing logical operations, or are there reasons not to do so?
The second component is: What are recommended ways of reversing the values, like some have described in answers and comments?
The straightforward answer is that there is no black hole to using %in%. But in instances where I want to just discard the NA values, I'd use which: project_f03[which(project_f03[,x]==4),x] <- 5
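A small sketch of why both forms are safe, using a made-up vector x and throwaway data frames d1 and d2 (all hypothetical): == propagates NA into the logical index, while %in% and which() never produce NA.

```r
x <- c(4, NA, 2, 4)

eq_idx    <- x == 4         # contains an NA where x is NA
in_idx    <- x %in% 4       # NA is simply "not in" the set, so FALSE
which_idx <- which(x == 4)  # which() drops NA and returns positions

# Both styles recode the 4s and leave the NA untouched
d1 <- data.frame(q = x)
d1[d1$q %in% 4, "q"] <- 5

d2 <- data.frame(q = x)
d2[which(d2$q == 4), "q"] <- 5
```

One difference worth keeping in mind: %in% returns a logical vector the same length as x, whereas which() returns integer positions, so only the former can be combined with other conditions using & and |.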
%in% could shorten that earlier bit of code you had:
for(x in 4:23) {project_f03[project_f03[,x]==-9 | project_f03[,x]==-1,x] <- NA}
#could be
for(x in 4:23) {project_f03[project_f03[,x] %in% c(-9,-1), x] <- NA}
Like @flodel suggested, you can replace that whole block of code in your for-loop with project_f03[,x] <- rev(0:4)[match(project_f03[,x], 0:4, nomatch=10)]. It should preserve NA. And there are probably more opportunities to simplify code.
It doesn't answer your question, but should fix your problem:
cols <- c(3,11,16)+3
project_f03[, cols] <- abs(project_f03[, cols]-4)
## or, even simpler (as @TylerRinker suggested); na.rm=TRUE is needed because the columns contain NAs:
project_f03[, cols] <- max(project_f03[, cols], na.rm=TRUE) - project_f03[, cols]
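A quick check on made-up data (the scores data frame and its column names are hypothetical) that the lookup approach and the arithmetic approach agree and both keep NAs in place:

```r
# Hypothetical reversed-score items with some missing answers
scores <- data.frame(q03 = c(0, 4, NA, 1), q11 = c(2, NA, 3, 4))

# Lookup: position of each answer in 0:4, then read off the reversed scale.
# nomatch = 10 points past the end of rev(0:4), which yields NA.
by_match <- scores
by_match[] <- lapply(scores, function(v) rev(0:4)[match(v, 0:4, nomatch = 10)])

# Arithmetic: on a 0-4 scale, reversing is just 4 - x, and NA stays NA
by_arith <- 4 - scores

by_match
by_arith
```

Using the fixed constant 4 rather than the observed maximum avoids a wrong result when nobody in the sample happened to give the top answer.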

R: rewrite loop with apply

I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column holds a trait for that customer over the past 10 years (each month has its own column). I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have 0 or 1 observations per year (although the month itself can vary).
I've already done this, one year at a time, using a loop with a nested if-clause:
for(i in 1:nrow(input_data)) {
temp_row <- input_data[i,c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
loc2011 <- which(!is.na(temp_row))
if(length(loc2011 ) > 0) {
temp_row_2011[i,] <- temp_row[loc2011[1]] #pick the first observation if there are several
} else {
temp_row_2011[i,] <- NA
}
}
Since my data set is quite big, and I need to perform the above loop 10 times (once for each year), this is taking way too much time. I know one is much better off using the apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?
Are you after something like this?:
temp_row_2011 <- apply(input_data, 1, function(x){
temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, then it's not necessarily due only to the fact of using an apply(), but also because it assigns less stuff and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x){
temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above.
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than the former (but only valid if input_data is all the same class of data).
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol=24,nrow=10)
# years and months:
colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
# some ids
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
# make an all-NA case:
input_data[2,1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x,input_data){
input_data_yr <- input_data[, grep(x, colnames(input_data) )]
apply(input_data_yr, 1, function(id){
id[!is.na(id)][1]
}
)
}, input_data)
All NA case works. grep() column selection idea lifted from DWin. As in the above example, you could actually define the anonymous interior function and compile it to potentially make the thing run faster.
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case such as 4 quarters for 2 years and including pathological cases such as all NA's in one row of one year. I would think that instead of requiring you to write out all the year columns by name, that you ought to cycle through them with a grep() strategy:
# funyear <- function to work on one year's data and return a single vector
# my efforts keep failing on the all(NA) row by year combos
sapply(2011:2001, function(pat) funyear(input_data[grep(pat, names(input_data))]))
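For what it's worth, here is one way the funyear idea could be filled in so that the all-NA row/year combinations come back as NA. The tiny input_data below (two years, two columns each) is made up for illustration, and this funyear is only a sketch of the helper described above, not anyone's tested production code.

```r
# Hypothetical data: two years, two columns per year, including an all-NA row
input_data <- data.frame(
  id = c("id01", "id02", "id03"),
  `2010_01` = c(NA, 5, NA), `2010_02` = c(7, NA, NA),
  `2011_01` = c(NA, NA, NA), `2011_02` = c(1, 2, NA),
  check.names = FALSE
)

# One year's columns in, one value per row out (first non-NA, else NA)
funyear <- function(year_cols) {
  apply(as.matrix(year_cols), 1, function(v) v[!is.na(v)][1])
}

years <- c("2010", "2011")
condensed <- sapply(years, function(pat) {
  # anchor the pattern so "2010" only matches column names starting with "2010_"
  funyear(input_data[grep(paste0("^", pat, "_"), names(input_data))])
})
condensed
```

Subsetting an empty vector with [1] returns NA, which is what makes the all-NA case fall out without an explicit if/else.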

Return multiple data frames from function R

I am trying to put together a function that will loop thru a given data frame in blocks and return a new data frame containing stuff calculated from the original. The length of x will be different each time and the actual problem will have more loops in the function. New-ish to R and have not been able to find anything helpful (I don't think using a list will help)
func<-function(x){
tmp # need to declare this here?
for (i in 1:dim(x)[1]){
tmp[i]<-ave(x[i,]) # add things to it
}
return(tmp)
}
df<-cbind(rnorm(10),rnorm(10))
means<-func(df)
This code does not work but I hope it gets across what I want to do. thanks!
Do you mean you want to loop through each row of df and return a data frame with the calculated values?
You may want to look in to the apply function:
df <- cbind(rnorm(10),rnorm(10))
# apply(df,1,FUN) does FUN(df[i,])
# e.g. mean of each row:
apply(df,1,mean)
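If you do want to keep the explicit loop, here is a sketch of how the function from the question could be made to run, assuming the goal is one mean per row: the result vector has to exist before you index into it, and mean() (rather than ave()) gives a single number per row.

```r
func <- function(x) {
  tmp <- numeric(nrow(x))        # create the result vector up front
  for (i in seq_len(nrow(x))) {  # seq_len() is safe even when x has zero rows
    tmp[i] <- mean(x[i, ])
  }
  tmp
}

set.seed(1)
df <- cbind(rnorm(10), rnorm(10))
means <- func(df)

# Same numbers as the apply() call, and as the built-in rowMeans()
all.equal(means, apply(df, 1, mean))
all.equal(means, rowMeans(df))
```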
For more complicated looping like performing some operation on a per-factor basis, I strongly recommend package plyr, and function ddply within. Quick example:
df <- data.frame( gender=c('M','M','F','F'), height=c(183,176,157,168) )
# find mean height *per gender*
ddply(df,.(gender), function(x) c(height=mean(x$height)))
# returns:
#   gender height
# 1      F  162.5
# 2      M  179.5
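If you would rather not add a package dependency, the same per-group summary can be sketched in base R with aggregate(); this is offered as an alternative, not as what plyr does internally.

```r
df <- data.frame(gender = c("M", "M", "F", "F"), height = c(183, 176, 157, 168))

# Mean height per gender; the result is itself a data frame
per_gender <- aggregate(height ~ gender, data = df, FUN = mean)
per_gender
```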

Resources