dataframe math to calculate profit based on indicator - r

Simple explanation: My goal is to figure out how to get the profit column shown below. I am trying to calculate the difference in val for every pair of changing indicator values (-1 to 1 and 1 to -1):
If the starting indicator is 1 or -1, store the value.
Find the next indicator that is opposite (so -1 on row 3). Save this val and subtract the first value from it (.85 - .84). Store that in the profit column.
Repeat.
Specific to this case:
Go until the next opposite indicator is found (on row 4). Save this value, subtract, and save in the profit column.
Go until the next opposite indicator is found (on row 8). Save this value, subtract, and save in the profit column.
Financial explanation (if it is useful)
I am trying to write a function to calculate profit given a column of values and a column of indicators (buy/sell/hold). I know this is implemented in a few of the big packages (quantmod, quantstrat), but I can't seem to find a simple way to do it.
df <- data.frame(val = c(.84, .83, .85, .83, .83, .84, .85, .81),
                 indicator = c(1, 0, -1, 1, 0, 1, 1, -1))
df
  val indicator profit
1 0.84        1     NA
2 0.83        0     NA
3 0.85       -1    .01   based on (.85 - .84), from 1 to -1
4 0.83        1    .02   based on (.85 - .83), from -1 to 1
5 0.83        0     NA
6 0.84        1     NA
7 0.85        1     NA
8 0.81       -1   -.02   based on (.81 - .83), from the last change (row 4) to now
Notes
Multiple identical indicators should be ignored: a run like (1 1 1 1) means nothing beyond the first one, which should be stored (row 4 is stored; rows 5, 6, 7 are not).
Ignore 0; holds do not change the profit calculation.
I am happy to provide more info if needed.

It was easier for me to work it out in my head after splitting the problem into two parts, which correspond to the two loops shown below. The first part flags the rows where there was a change in the indicator value, while the second part subtracts the vals of the relevant rows (i.e., those selected in part 1). FYI, I assume you meant to put -.02 for row 4 in your example? If not, then please clarify which rows get subtracted from which when calculating profit.
data.frame(val = c(.84, .83, .85, .83, .83, .84, .85, .81),
           indicator = c(1, 0, -1, 1, 0, 1, 1, -1)) -> x
x$num <- seq_along(x$val)
x$rollingProf <- NA
# Part 1: flag every row where the indicator flips sign.
# Flag row 1 as well, so the starting value (.84) enters the calculation.
indicator <- x[1, "indicator"]
1 -> x[1, "rollingProf"]
for (i in 1:(nrow(x) - 1)) {
  x[i + 1, "indicator"] -> next_
  if (indicator * -1 == next_) {
    1 -> x[i + 1, "rollingProf"]
    indicator <- next_
  }
}
# Part 2: for each flagged row, subtract the previously flagged val.
x[!is.na(x$rollingProf), c("val", "num")] -> q
for (i in 2:nrow(q)) {
  q[i, "val"] - q[i - 1, "val"] -> q[i, "change"]
}
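For comparison, here is a more compact, vectorized sketch of the same pairing rule (not from any package; profit_changes is just an illustrative name): drop the zeros, keep only the rows where the sign actually flips, then diff the flagged vals.
profit_changes <- function(val, indicator) {
  idx <- which(indicator != 0)                      # ignore holds (0)
  keep <- idx[c(TRUE, diff(indicator[idx]) != 0)]   # keep only sign flips
  profit <- rep(NA_real_, length(val))
  if (length(keep) > 1) {
    profit[keep[-1]] <- diff(val[keep])             # subtract previous flagged val
  }
  profit
}
df$profit <- profit_changes(df$val, df$indicator)
# rows 3, 4, 8 get .01, -.02, -.02 (using the -.02 reading for row 4);
# every other row stays NA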

Related

How to find a total of row values in R

I am trying to find the total of rows that have a column value of 3 or 4. For example, the first row has only one value of 3, so if I create a new column
currentdx_count1$TotalDiagnoses
that new column, TotalDiagnoses, should have a value of only 1 for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need because, as expected, it literally sums the values across the whole row. Is there an existing function that does what I want to do, or will I have to make one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
This gets me closer, but it is finding the total count for each column separately, which is not what I need. I want a single column encompassing counts for each SID.
You can compare the dataframe with the values 3 or 4 and then use rowSums to count:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
                                           currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
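Since currentdx_count1 itself is not shown, here is the same comparison on a small made-up frame (the SID/dx column names are just illustrative):
dat <- data.frame(SID = 1:3,
                  dx1 = c(3, 1, 4),
                  dx2 = c(2, 3, 4))
# The element-wise comparison yields a logical matrix; rowSums counts the TRUEs.
dat$TotalDiagnoses <- rowSums(dat[-1] == 3 | dat[-1] == 4)
dat$TotalDiagnoses
#> [1] 1 1 2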

Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It is basically at a row level, and I am not sure how to tackle it.
If you can help me with the approach or any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at a monthly level.
I would like to capture the two things below.
1. The count of consecutive zeros for each row, extending in both directions from lag0(reference)
The cases of interest (highlighted in yellow in my original sheet) are the zeros that run consecutively from lag0(reference) until the first 1 is reached. I want to capture the count of those zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for from part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify the consecutive rows (rows 1, 2, and 3, and similarly rows 5 and 6) where an overlap of any lag or lead happens for any 0 within the lag0(reference) range, and capture their Sales and Month values.
For example, for rows 1, 2, and 3, the overlap happens at least at lag3, lag2, lag1 and lead1, lead2; this needs to be captured and tagged as case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, so this needs to be captured and tagged as case 2 (or 2), along with the Sales and Month values.
Row 7 does not overlap with the previous or following consecutive row, hence it will not be captured.
Below is the result I am looking for from part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will incorporate either dplyr or a loop to get the result. Currently, I am simply looking for the approach.
I am not sure how to solve this problem; it is the first time I am trying to capture things at row level in R. I am not necessarily looking for a full solution, simply a first step to attack this problem. I would appreciate any leads.
An option using rle for the 1st part of the calculation:
df$count <- apply(df[, -c(1:4)], 1, function(x) {
  first <- rle(x[1:7])    # runs over lag7..lag1
  second <- rle(x[9:15])  # runs over lead1..lead7
  count <- 0
  # zeros running up to lag0(reference) on the lag side
  if (first$values[length(first$values)] == 0) {
    count <- first$lengths[length(first$values)]
  }
  # zeros running away from lag0(reference) on the lead side
  if (second$values[1] == 0) {
    count <- count + second$lengths[1]
  }
  count
})
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
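If rle is unfamiliar, this quick illustration shows what it returns for row 1's lag columns, and why the answer inspects the last run on the lag side:
rle(c(1, 1, 0, 0, 0, 0, 0))
#> Run Length Encoding
#>   lengths: int [1:2] 2 5
#>   values : num [1:2] 1 0
# The final run is five zeros ending at lag1, so 5 is counted on the lag side;
# row 1's leads contribute a leading run of four zeros, giving 9 in total.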

R For Loop anomaly when expanding the range

Assume the following dataframe:
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
DF
#Application Rating
#1 A 0
#2 A 0.6
#3 B 0.6
#4 B 2.0
#5 B 2.0
#6 C 3.8
#7 C 3.8
#8 D 3.9
I want to create an empty results table to be populated through a loop:
1st column - to show the rating being counted (e.g. 0.6)
2nd column - to show the number of times that rating occurs in DF
3rd column - to list total number of ratings in DF (i.e. 8)
4th column - to calculate the proportion of the applications with that rating relative to the overall
#create empty results table
results_rating_bins <- as.data.frame(matrix(nrow = 1, ncol = 4))
#initiate row count
rownr <- 1
#Loop:
for (rating in seq(from = 0, to = 4.0, by = 0.1)) {
  this_rating <- subset(DF, DF$Rating == rating)
  results_rating_bins[rownr, 1] <- rating
  results_rating_bins[rownr, 2] <- nrow(this_rating)
  results_rating_bins[rownr, 3] <- nrow(DF)
  results_rating_bins[rownr, 4] <- nrow(this_rating) / nrow(DF)
  rownr <- rownr + 1
}
The final result is what I expect, except for rating 2.0, where the count is 0 even though it should be 2.
This illustrates at small scale what I see at larger scale with a 30k-line dataset. I have a list of apps with ratings going from 0 to 4.9, so the range in my loop would be set to 0 to 4.9 instead of 0 to 4.0 as in my example. However, when I run the loop on the large dataset, I end up with a number of instances where the rating count is 0 even though it shouldn't be. What's even more odd is that by playing around with the ranges, the ratings where the anomaly (i.e., count = 0) happens vary completely randomly.
Any idea what may justify this type of behaviour?
Typically I answer the questions as asked, trying to work through the logic a question poster is already using. However, in this case, it is so much easier to use dplyr to aggregate into the new table that I am breaking with tradition.
require(dplyr)
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application, Rating)

df2 <- DF %>%
  group_by(Application, Rating) %>%
  summarize(ratio = n() / nrow(DF))
The first part is the same as yours, but with the library call added.
Where df2 starts, you are setting df2 equal to a grouped version of your initial data frame, based on the combinations of Application and Rating. In the summarize statement, for each possible combination we count the rows with n() and divide by the total number of rows in the original data frame, nrow(DF). This creates the new column: the percent of the total that each pair represents.
It looks like this, and you could add a column with the total number of rows if you need it, but it is not necessary for this calculation.
Application Rating ratio
1 A 0 0.125
2 A 0.6 0.125
3 B 0.6 0.125
4 B 2.0 0.250
5 C 3.8 0.250
6 D 3.9 0.125
This will absolutely catch every combination of Application and Rating and calculate the ratio relative to the whole data frame.
EDIT: If you do not care about the Application letter, you can simply remove it from the group_by call and still get what you want.
And if you want the total number of rows of the frame on each row, add rows = nrow(DF) as a second argument to summarize.
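As for what justifies the anomaly in the original loop (my reading of the posted code, not part of the approach above): Rating was built as a character vector, so == coerces the numeric loop value to a string, and with truly numeric data the seq() values themselves carry floating-point error. A short demonstration:
# Character column: the numeric 2.0 becomes the string "2", which != "2.0".
"2.0" == 2.0   #> FALSE
"0.6" == 0.6   #> TRUE  ("0.6" happens to round-trip to the same string)
# Numeric column: values built by seq() accumulate floating-point error,
# so == can fail at seemingly random ratings.
seq(0, 1, by = 0.1)[4] == 0.3   #> FALSE on most platforms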

Mutate Cumsum with Previous Row Value

I am trying to run a cumsum on a data frame on two separate columns. They are essentially tabulations of events for two different variables, and only one variable can have an event recorded per row. The way I attacked the problem was to create a new variable holding the value 1, and to create two new columns to sum the variables' totals. This works fine, and I get the correct total number of occurrences, but the problem is that in my current ifelse statement, if the event recorded is for variable "A", then variable "B" is assigned 0. Instead, for every row, I want the previous variable's value carried into the current row, so that I don't end up with gaps where it goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!
You can use the property of booleans that they can be summed as ones and zeroes. Therefore, you can use the cumsum function:
DF$Total.A <- cumsum(DF$Variable == "A")
Or, as a more general approach provided by @Frank, you can do:
uv <- unique(as.character(DF$Variable))
DF[, paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If you have many levels to your factor, you can get this in one line by dummy coding and then cumsuming the matrix.
X <- model.matrix(~Variable+0, DF)
apply(X, 2, cumsum)
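A quick check against the example frame from the question (reconstructed here) shows the cumsum approach reproducing the desired result:
DF <- data.frame(Event = 1:4, Value = 1,
                 Variable = c("A", "A", "B", "A"))
DF$Total.A <- cumsum(DF$Variable == "A")
DF$Total.B <- cumsum(DF$Variable == "B")
DF
#>   Event Value Variable Total.A Total.B
#> 1     1     1        A       1       0
#> 2     2     1        A       2       0
#> 3     3     1        B       2       1
#> 4     4     1        A       3       1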

compare row values over multiple rows (R)

I don't think this question has been asked yet (most similar questions are about extracting data or returning a count). I am new to R, so any help would be appreciated!
I have a dataset of multiple runs of an experiment in one file, and the data look like this, with all the time steps for each run in rows:
time [info] id (unique per run)
I am attempting to calculate when the system reaches equilibrium, which I define as stable values in 3 interdependent parameters. I would like to have the contents of rows compared and, if they are within 5% of each other over 20 timesteps, to return the timestep at which the stability begins, and the id.
So far, I'm thinking it will be something like the following, or maybe a while loop (sorry for the rough pseudocode):
y = 1
z = 0   # variables to control the loop
x = 0
for (ID) {
  if (CC at time=x == CC at time=y ± 0.05) {
    if (z <= 20) {   # catalogs the number of periods that match
      y++
      z++
    } else {
      [save value in column]
    }
  } else {           # no match for a sustained period, so start over again
    x++
    y = x + 1
    z = 0
  }
}
ETA: CC is one of my parameters of interest; it ranges between 0 and 1, although the endpoints are unlikely.
Here's a simple example that might help; this is something like how my data looks:
zz <- textConnection("time CC ID
1 0.99 1
2 0.80 1
3 0.90 1
4 0.91 1
5 0.92 1
6 0.91 1
1 0.99 2
2 0.90 2
3 0.90 2
4 0.91 2
5 0.92 2
6 0.91 2")
Data <- read.table(zz, header = TRUE)
close(zz)
My question is: how can I run through the lines to find out when the value of CC becomes 'stable' (meaning it doesn't change by more than 0.05 over X (here, 3) time steps), so that it would create the following results:
ID timeToEQ
1 1 3
2 2 2
Does this help? The only way I can think of to do this is with a for-loop, and I think there must be an easier way!
Here is my code. I will post the explanation in some time.
require(plyr)
ddply(Data, .(ID), summarize, timeToEQ = Position(isTRUE, abs(diff(CC)) < 0.05 ))
ID timeToEQ
1 1 3
2 2 2
EDIT. Here is how it works.
ddply breaks Data into subsets based on ID.
diff(CC) computes the difference between the CC of successive rows.
abs(diff(CC)) < 0.05 returns TRUE where the difference has stabilized.
Position locates the first instance of an element which satisfies isTRUE.
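For anyone avoiding the plyr dependency, the same computation fits in base R's aggregate (a sketch; note the result column keeps the name CC):
aggregate(CC ~ ID, data = Data,
          FUN = function(cc) Position(isTRUE, abs(diff(cc)) < 0.05))
#>   ID CC
#> 1  1  3
#> 2  2  2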
