Conditional column creation (horizontal and vertical conditions) in R

My starting point is a data frame like df below:
df <- data.frame(id = c(rep(2, 3), rep(4, 2)), year = c(2005:2007, 2005:2006), event = c(1, 0, 0, 0, 1))
  id year event
1  2 2005     1
2  2 2006     0
3  2 2007     0
4  4 2005     0
5  4 2006     1
I have a series of actors (identified through an id) who happen to experience an event in a certain year.
What I am trying to build is a series of additional columns that describe (a) the distance from events and (b) whether such a distance is observable.
This is what I would like to obtain.
  id year event evm2 evm1 evp1 evp2 ndm2 ndm1 ndp1 ndp2
1  2 2005     1    0    0    0    0    1    1    0    0
2  2 2006     0    0    1    0    0    1    0    0    1
3  2 2007     0    1    0    0    0    0    0    1    1
4  4 2005     0    0    0    1    0    1    1    0    1
5  4 2006     1    0    0    0    0    1    0    1    1
event equals 1 when there is an event in a certain year. evm1 equals 1 when an event is observable in the year before. Similarly, evp1 is 1 when the event is in the following year - the letters p or m stand for 'plus' and 'minus' and the numbers represent the distance in years from the event.
For some of these observations the distance is not observable because the available time window is too short. This is the case for df[1,], for which we don't know whether an event took place in the previous years. In such a case, ndm1 and ndm2 are coded 1. If we consider df[5,], it is ndp1 (and ndp2) that are coded 1.
The ev and nd variables work in exactly the same way, but the former says whether there is an event at a certain distance, while the latter says whether that distance is observable at all.
I tried to accomplish this using the following nested for loops, but I didn't succeed.
lag <- c(-2, -1, 1, 2)
df2 <- df
df2[, 4:11] <- 0
colnames(df2) <- c("id", "year", "event", "evm2", "evm1", "evp1", "evp2", "ndm2", "ndm1", "ndp1", "ndp2")
for (i in length(df2$id)) {
  id <- df2[i, 1]
  yr <- df2[i, 2]
  sta <- 3
  sta2 <- 7
  for (j in lag) {
    sta <- sta + 1
    sta2 <- sta2 + 1
    if !is.null(df2[df2$id == id & df2$year == yr + j]) == TRUE {
      rw <- which(df2[df2$id == id & df2$year == yr + j])
      if (df2[rw, 3] == 1) df2[i, sta] == 1
    } else {
      df2[i, sta2] == 1
    }
  }
}
Do you see anything that may be responsible for the errors? I have been going mad for two days trying to make it work and I would be really thankful if you could help.

Following my comment, here is what I had in mind as a potential rewrite:
lag.it <- function(x, n = 0L) {
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),
    tail(x, -pos.lag),
    rep(NA, +pos.lag))
}
library(plyr)
ddply(df, "id", transform,
      evm2 = lag.it(event, -2),
      evm1 = lag.it(event, -1),
      evp1 = lag.it(event, +1),
      evp2 = lag.it(event, +2))
#   id year event evm2 evm1 evp1 evp2
# 1  2 2005     1   NA   NA    0    0
# 2  2 2006     0   NA    1    0   NA
# 3  2 2007     0    1    0   NA   NA
# 4  4 2005     0   NA   NA    1   NA
# 5  4 2006     1   NA    0   NA   NA
Notice how I use NAs instead of two sets of variables. While I'd recommend you keep it this way, you can easily get what you asked for by defining e.g. ndm2 as is.na(evm2) and then replacing the NAs with zeroes.
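If you do want the nd* columns, a minimal sketch of that post-processing (assuming the ddply result above is stored in df2) could look like this:
df2 <- ddply(df, "id", transform,
             evm2 = lag.it(event, -2), evm1 = lag.it(event, -1),
             evp1 = lag.it(event, +1), evp2 = lag.it(event, +2))
# derive each nd* indicator from the NAs, then replace the NAs with zeroes
for (v in c("evm2", "evm1", "evp1", "evp2")) {
  df2[[sub("ev", "nd", v)]] <- as.integer(is.na(df2[[v]]))
  df2[[v]][is.na(df2[[v]])] <- 0
}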

Related

Transforming longitudinal data for time-to-event analysis in R

I am trying to reformat longitudinal data for a time-to-event analysis. In the example data below, I simply want to find the earliest week in which the result was "0" for each ID.
The specific issue I am having is how to handle patients that never convert to 0 and had all 1's or 2's. In the example data, patient J has all 1's.
# Sample data
have <- data.frame(patient = rep(LETTERS[1:10], each = 9),
                   week = rep(0:8, times = 10),
                   result = c(1,0,2,rep(0,6), 1,1,2,1,rep(0,5), 1,1,rep(0,7),
                              1,rep(0,8), 1,1,1,1,2,1,0,0,0, 1,1,1,rep(0,6),
                              1,2,1,rep(0,6), 1,2,rep(0,7), 1,rep(0,8),
                              rep(1,9)))
patient week result
      A    0      1
      A    1      0
      A    2      2
      A    3      0
      A    4      0
      A    5      0
      A    6      0
      A    7      0
      A    8      0
      B    0      1
      B    1      0
    ...  ...    ...
      J    6      1
      J    7      1
      J    8      1
I am able to do this relatively straightforward process with the following code:
want <- aggregate(have$week, by = list(have$patient, have$result), min)
want <- want[which(want[2] == 0), ]
but this excludes anyone who does not convert to 0 (in this example, patient J). Instead, J should be present with a 1 in the second column and an 8 in the third column; as it stands, J is simply omitted:
print(want)
Group.1 Group.2 x
      A       0 1
      B       0 4
      C       0 2
      D       0 1
      E       0 6
      F       0 3
      G       0 3
      H       0 2
      I       0 1
# But also need
      J       1 8
Pursuant to the guidelines on posting here, I did work on solving this myself and can get what I need, albeit very inelegantly:
mins <- aggregate(have$week, by = list(have$patient, have$result), min)
maxs <- aggregate(have$week, by = list(have$patient, have$result), max)
want <- rbind(mins[which(mins[2] == 0), ], maxs[which(maxs[2] == 1 & maxs[3] == 8), ])
This returns the correct desired dataset, but the code is terrible and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to manually put in maxs[3] == 8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select the required row from each group.
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present, and otherwise the index of the maximum value of week. We then apply it to each group:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
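As a quick standalone illustration of how select_row behaves (made-up inputs):
select_row(c(1, 0, 2, 0), 0:3)  # 2: the first 0 sits at index 2
select_row(c(1, 1, 1), 6:8)     # 3: no 0 present, so the index of the largest week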

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second has the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data into the following format:
Cluster No.  org1  org2  org3  org4
0            0     0     0     1
1            1     0     0     0
I could not figure out how to do it in R.
Thanks
We could use table:
out <- cbind(ClusterNo = seq_len(nrow(df1)),
             as.data.frame.matrix(table(seq_len(nrow(df1)),
               factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
#   ClusterNo org1 org2 org3 org4
# 1         1    0    0    0    1
# 2         2    1    0    0    0
It is also possible that we need to use the first column (the cluster number) to get the frequency:
out1 <- as.data.frame.matrix(table(df1[[1]],
          factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
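For the example data this keeps one row per distinct cluster number (so the two rows for cluster 10 collapse into a single row), and the first rows should look like:
head(out1, 2)
#   org1 org2 org3 org4
# 0    0    0    0    1
# 1    1    0    0    0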
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- as.numeric(gsub('^org(.+)\\|.*', '\\1', input[, 2]))  # numeric, so max() below works
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function(x)
  possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames(result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
   org1 org2 org3 org4
0     0    0    0    1
1     1    0    0    0
2     1    0    0    0
3     0    0    0    1
4     1    0    0    0
5     1    0    0    0
6     1    0    0    0
7     1    0    0    0
9     1    0    0    0
10    1    0    0    1
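If you would rather have the cluster number as an actual column (as in your desired output) instead of rownames, one small hypothetical finishing step is:
result <- data.frame(ClusterNo = as.integer(rownames(result)), result,
                     row.names = NULL)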

R - Finance back test function for entry / exit signal

OK, so I am back-testing trading signals using R. Here is a snippet of my code showing the z-score creation, the close-to-close returns (using the TTR package), the long signal, and the Lag() used to place the signal on the next day.
require(quantmod)
require(TTR)
require(zoo)
# Calculate n period close price z-scores indicator using TTR package
new.df$roll.mean.n3 <- runMean(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.sd.n3 <- runSD(new.df$Close, n=3, cumulative = FALSE)
new.df$roll.z.score.n3 <- apply(new.df[,c('Close','roll.mean.n3', 'roll.sd.n3')], 1, function(x) { (x[1]-x[2])/x[3] } )
# Calculate Close-to-Close returns
new.df$clret <- ROC(new.df$Close,1)
new.df$clret[1] <- 0
# Create the long (up) signal
new.df$sigup <- ifelse(new.df$roll.z.score.n3 < -1, 1, 0)
# Lag signals to align with days in market not days when signals were generated
new.df$sigup <- Lag(new.df$sigup,1) # Note k=1 implies a move *forward*
The current setup above produces output like this:
   roll.z.score.n3 sigup
1               NA     0
2               NA     0
3     -1.135974424     0
4      0.193311168     1
5      0.714285714     0
6     -1.148753543     0
7     -0.942160394     1
8     -0.695763683     0
9      1.140646864     0
10     0.985196899     0
11    -0.768766574     0
12    -1.011293858     0
13    -0.516703612     1
14    -1.120897077     0
15     1.091089451     1
The entry signal is to go long when the z-score is < -1, as shown in row 3. We have a +1 in row 4 because we used Lag() to step the entry signal forward to the next day. Each time the z-score is below -1, there is a +1 the next day.
This setup is perfectly fine if I'm only trading with a one-day holding period: I can then multiply sigup (1 or 0) by the daily percentage returns to obtain an equity curve.
I want to elaborate further on the entry/exit signals. I wish to go long (sig_long) when the z-score is < -1 and exit when the z-score is > 1.
The output would look something like this:
   roll.z.score.n3 sig_long
1               NA        0
2               NA        0
3     -1.135974424        0
4      0.193311168        1
5      0.714285714        1
6     -1.148753543        1
7     -0.942160394        1
8     -0.695763683        1
9      1.140646864        1
10     0.985196899        0
11    -0.768766574        0
12    -1.011293858        0
13    -0.516703612        1
14    -1.120897077        1
15     1.091089451        1
16     0.968364052        0
17     0.872871561        0
18     1.099524999        0
19     0.918397948        0
Row 3 shows a z-score below -1; the lag makes it +1 the next day (row 4). It stays +1 all the way until row 9, when the z-score rises above 1.0, so the next day, at row 10, the signal is 0.
To give some background on the current code: it's an attempt to build on a post at the FOSS Trading blog.
Thanks for taking a look at this.
See if the following works:
zz = '
roll.z.score.n3 sig_long
1 NA 0
2 NA 0
3 -1.135974424 0
4 0.193311168 1
5 0.714285714 1
6 -1.148753543 1
7 -0.942160394 1
8 -0.695763683 1
9 1.140646864 1
10 0.985196899 0
11 -0.768766574 0
12 -1.011293858 0
13 -0.516703612 1
14 -1.120897077 1
15 1.091089451 1
16 0.968364052 0
17 0.872871561 0
18 1.099524999 0
19 0.918397948 0
'
df <- read.table(text = zz, header = TRUE)
df <- na.omit(df)
df$sig_long[[1]] <- ifelse(df$roll.z.score.n3[[1]] < -1, 1, 0)
for (i in 2:nrow(df)) {
  df$sig_long[i] <- ifelse(df$roll.z.score.n3[i] < -1, 1,
                           ifelse(df$roll.z.score.n3[i] > 1, 0,
                                  df$sig_long[i - 1]))
}
Not sure about this part:
df$sig_long <- Lag(df$sig_long, 1)
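For what it's worth, here is a vectorized sketch of the same state logic (it relies on na.locf from zoo, which the question already loads): mark entries with 1 and exits with 0, carry the last signal forward, then lag as before.
sig <- ifelse(df$roll.z.score.n3 < -1, 1,
              ifelse(df$roll.z.score.n3 > 1, 0, NA))
sig <- zoo::na.locf(sig, na.rm = FALSE)  # hold the position until an exit fires
sig[is.na(sig)] <- 0                     # flat before the first signal
df$sig_long <- sig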

Increment call vector on itself in R - Finding counts in between values

I have a time series (or simply a vector) that is binary, returning 0's and 1's depending on some condition (generated with ifelse). I would like to be able to return the counts (here the series is daily, so the counts are days) in between the 1's.
I can do this very easily in Excel, by simply referencing the column I am trying to calculate and adding the row above (when working with ascending data; the row below when working with descending data). See below.
I tried doing something similar in R but I am getting an error.
DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))
Is there an easier way to do this besides writing a function?
Row# Date Condition1 DaysBetweenCondition1
1 5/2/2007 NA NA
2 5/3/2007 NA NA
3 5/4/2007 NA NA
4 5/5/2007 NA NA
5 5/6/2007 0 NA
6 5/7/2007 0 NA
7 5/8/2007 0 NA
8 5/9/2007 0 NA
9 5/10/2007 0 NA
10 5/11/2007 0 NA
11 5/12/2007 0 NA
12 5/13/2007 0 NA
13 5/14/2007 1 0
14 5/15/2007 0 1
15 5/16/2007 0 2
16 5/17/2007 0 3
17 5/18/2007 0 4
18 5/19/2007 0 5
19 5/20/2007 0 6
20 5/21/2007 0 7
21 5/22/2007 1 0
22 5/23/2007 0 1
23 5/24/2007 0 2
24 5/25/2007 0 3
25 5/26/2007 0 4
26 5/27/2007 1 0
27 5/28/2007 0 1
28 5/29/2007 0 2
29 5/30/2007 1 0
(FWIW, the dates in this example are made up; in the real data I am using business days, so it's a bit different. I don't want to reference the dates, they're just included for clarity.)
This gets the counting done in one line. Borrowing PhiSeu's code and a line from How to reset cumsum at end of consecutive string and modifying it to count zeros:
# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- rep(0, nrow(df_date))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)
df_date$DaysBetweenCondition1 <- sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
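To see what the one-liner is doing, here is a small sketch on a toy vector (0 = no event, 1 = event):
x <- c(0, 1, 0, 0, 1, 0)
rle(!x)$lengths                 # run lengths of the "no event" mask: 1 1 2 1 1
sequence(rle(!x)$lengths)       # count up again inside each run: 1 1 1 2 1 1
sequence(rle(!x)$lengths) * !x  # zero out the event rows: 1 0 1 2 0 1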
R is very good when working with rows that don't depend on each other. Therefore a lot of functions are vectorized. When working with functions that depend on the value of other rows it is not so easy.
At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.
# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")
# add the new column with 0 as default value
DaysBetweenCondition1 <- rep(0, nrow(df_date))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)
# loop over rows
for (i in 1:nrow(df_date)) {
  if (is.na(df_date$Condition1[i])) {
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0 & is.na(df_date$Condition1[i - 1])) {
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0) {
    df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i - 1] + 1
  } else {
    df_date$DaysBetweenCondition1[i] <- 0
  }
}
Here's a solution that should be relatively fast
f0 = function(x) {
  y = x                    # template for return value
  isna = is.na(x)          # used a couple of times
  grp = cumsum(x[!isna])   # use '1' to mark start of each group
  lag = lapply(tabulate(grp + 1), function(len) {
    seq(0, length.out = len)  # sequence from 0 to len-1
  })
  split(y[!isna], grp) <- lag  # split y, set to lag element, unsplit
  data.frame(x, y)
}
A faster version avoids the lapply() loop; it creates a vector along x (seq_along(x)) and an offset vector describing how the vector along x should be corrected, based on the start value of the original vector:
f1 = function(x0) {
  y0 = x0
  x = x0[!is.na(x0)]
  y = seq_along(x)
  offset = rep(c(1, y[x == 1]), tabulate(cumsum(x) + 1))
  y0[!is.na(y0)] = y - offset
  data.frame(x0, y0)  # return the NA-padded result y0, not the stripped y
}
Walking through the first solution, here's some data
> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
[1] NA NA NA NA NA 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1
[26] 1 0 0 1 0 0 0 0 0 0
use cumsum() to figure out the group the non-NA data belong to
> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
[1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8
use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences
> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
finally, create a vector to hold the result, and use split<- to update it with the lag
> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
x y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 0 0
7 0 1
8 0 2
9 1 0
10 1 0
11 0 1
12 0 2
13 1 0
14 0 1
15 0 2
16 1 0
17 0 1
...
The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
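As a quick consistency check on the simulated x from above (recall that f0 names its count column y, while f1 names it y0):
res0 <- f0(x)
res1 <- f1(x)
all(res0$y == res1$y0, na.rm = TRUE)  # TRUE: both implementations agree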

Cumulative sum, with lag, by group, on date-stamped observations

I have this data set of batting data from the GameDay servers:
eliasID teamID gameID gameDate h hr bb so rbi ab runs t d lob sb cs sf hbp
1 430203 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 2 0 0 0 0 0 0 0 0
2 459714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 1 0 0 1 0 3 1 0 0 1 0 0 0 0
3 325392 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 1 0 0 0 0 0 0 0 0
4 429801 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 0 3 0 0 0 2 0 0 0 0
5 456714 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 1 0 0 4 0 0 0 2 0 0 0 0
6 150449 kca 2010/04/01/arimlb-kcamlb-1 4/1/2010 0 0 0 1 1 4 0 0 0 2 0 0 0 0
ba ID gameDateFormat year Year
1 0.345 1 2010-04-01 2010 NA
2 0.250 2 2010-04-01 2010 NA
3 0.319 3 2010-04-01 2010 NA
4 0.327 4 2010-04-01 2010 NA
5 0.333 5 2010-04-01 2010 NA
6 0.217 6 2010-04-01 2010 NA
My issue is that I'd like to build a running total of at bats (ab) for each game, but total only those at bats from games with gameDate lower than the row's gameDate, and with games in the same gameYear.
I've looked at for loops, and dplyr has been suggested, but the examples all sum the entire ab column for one player, when what I need is a running sum attached to each game, showing the player's ab total for the year so far as of that game.
I'm attempting to build the equivalent of the kind of statistics you see on baseball-ref.com.
In English, I'm looking for:
For each EliasID, gameID in Batting:
sum(ab) for the EliasID where the gameDate < this row's gameDate and the gameYear = this row's gameYear
What do you think?
Welcome to R programming. Because you did not provide complete sample data (i.e. a dput() rather than just a print()), this answer makes a couple of assumptions:
your data.frame is called df. You can replace this name with the actual name.
your gameDate is an actual date vector, not just a string vector. If it is a string, change it to a date with df$gameDate <- as.Date(df$gameDate, format = "%m/%d/%Y")
It appears that what you want is a "cumulative sum with lag, by group." (I recommend that you make this your title to make it clear that this is what you want.) Let's look at both of those parts.
Cumulative sum, with lag
As suggested in this answer, an easy way to introduce a lag of 1 into cumsum() is to replace the vector x1, x2, ..., xn with 0, x1, x2, ..., x(n-1). Thus:
cumsumLag1 <- function(x) {
  cumsum(c(0, head(x, n = -1)))  # see ?cumsum and ?head, particularly the note on negative n
}
# test it out on first 5 counting numbers
cumsumLag1(1:5) # returns: 0 1 3 6 10
Your dataset should be in the right chronological order for the cumulative function, so you could do something with ?order like:
df <- df[order(df$gameDate), ]
but we will use arrange() in dplyr (see below) to keep things simple.
By group
There are many ways to do sum (and similar functions) by group. Perhaps the simplest syntax is %>% group_by(thing) in dplyr. You want to group by year, and perhaps other variables (maybe teamId or playerId). One really unclear part of your question is what you're trying to group by, so please just focus on the concept here. The first challenge is that you don't have a year variable, and there are lots of ways to do this. Let's just do something like this:
df$gameYear <- as.POSIXlt(df$gameDate)$year + 1900 # see ?POSIXlt for more details
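For example, on a single date (POSIXlt stores years as an offset from 1900):
as.POSIXlt(as.Date("2010-04-01"))$year + 1900  # 2010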
Putting it together
Using the chain operator %>%, we just sequence what we've already reviewed.
library(dplyr)
cumsumLag1 <- function(x) cumsum(c(0, head(x, n = -1)))
df %>%
mutate(gameYear = as.POSIXlt(gameDate)$year + 1900) %>%
arrange(gameDate) %>%
group_by(gameYear) %>%
mutate(priorAtBats = cumsumLag1(ab))
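Since your pseudo-code says "for each EliasID", the grouping should probably also include the player; in that case, a guessed variant is simply:
df %>%
  mutate(gameYear = as.POSIXlt(gameDate)$year + 1900) %>%
  arrange(gameDate) %>%
  group_by(eliasID, gameYear) %>%   # running total per player and year
  mutate(priorAtBats = cumsumLag1(ab))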
