How to create a complex running calculation on an R data table - r

I want to create a running calculation that includes logic to restart the running sum when the value is negative. Initially I have a data table or frame like below :
df <- data.frame(value1 = c(0,0,10,0,1,0,2,0)
, value2 = c(5,1,2,6,8,3,7,2))
value1 value2
0 5
0 1
10 2
0 6
1 8
0 3
2 7
0 2
I would like to take the cumulative sum of value2 subtracted by value1. However, if the new value is less than 0, then start the running calculation over.
i.e. end up with
value1 value2 newvalue
0 5 5
0 1 6
10 2 2
0 6 8
1 8 15
0 3 18
2 7 23
0 2 25
I tried multiple attempts with data.table and dplyr packages with no luck.
EDIT: Updated df to match the actual table shown.

I am sure there are other simpler ways to do this by tweaking cumsum or other such functions, but I came up with this basic loop to produce the desired output. Hope it helps !!
> df
GroupID value1 value2
1 1 0 5
2 1 0 1
3 1 10 2
4 2 0 6
5 2 1 8
6 3 0 3
7 3 2 7
8 3 0 2
for(i in 1:nrow(df)) {
if(i == 1) {
df$newvalue[i] <- df$value2[i]
} else {
df$newvalue[i] <- (df$newvalue[i-1] + df$value2[i]) - df$value1[i]
if(df$newvalue[i] < 0 | df$GroupID[i] != df$GroupID[i-1]) {
df$newvalue[i] <- df$value2[i]
}
}
}
> df
GroupID value1 value2 newvalue
1 1 0 5 5
2 1 0 1 6
3 1 10 2 2
4 2 0 6 6
5 2 1 8 13
6 3 0 3 3
7 3 2 7 8
8 3 0 2 10

I believe that explicitly looping through the data frame is the only solution for calculating this type of conditional cumulative sum. Sagar's solution was very helpful to me (I up-voted but do not have enough reputation points for it to count).
In my experience, new value needs to be initialized prior to starting the loop in order to work properly. Below is how I would approach this:
df$newvalue <- df$value2
for(i in 2:nrow(df)) {
if(df$GroupID[i] == df$GroupID[i-1]) {
df$newvalue[i] <- max(df$newvalue[i-1] + df$value2[i]) - df$value1[i], df$value2[i])
}
}

Related

Data Frame- Add number of occurrences with a condition in R

I'm having a bit of a struggle trying to figure out how to do the following. I want to map how many days of high sales I have previously a change of price. For example, I have a price change on day 10 and the high sales indicator will tell me any sale greater than or equal to 10. Need my algorithm to count the number of consecutive high sales.
In this case it should return 5 (day 5 to 9)
For example purposes, the dataframe is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day
price
price_change
sales
High_sales_ind
1
5
0
12
1
2
5
0
6
0
3
5
0
5
0
4
5
0
4
0
5
5
0
10
1
6
5
0
10
1
7
5
0
10
1
8
5
0
12
1
9
5
0
14
1
10
7
2
3
0
11
7
0
2
0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, though hopefully someone better than me can provide a more elegant answer - but to get the desired dataset, you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
# assign consecutive instances of value
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# Find how many instance of consecutive days occurred before price change
df <- df %>% mutate(lseq = lag(seq))
# define rows you want to keep and when to end
keepz <- df[df$price_change != 0, "lseq"]
end <- as.numeric(rownames(df[df$price_change != 0,]))-1
df_want <- df[keepz:end,-c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1

Update value in a subset dataframe using index of for loop [duplicate]

This question already has an answer here:
Bulk update in subset obtained from dataframe filtering [duplicate]
(1 answer)
Closed 3 years ago.
My usecase involve me to filter a dataframe with some condition. Once I get the subset dataframe, I want to traverse through the subset one row at a time and checking for certain condition and updating a value in that particular row.
Here is my implementation:
> sales_data[sales_data$month == 1 & sales_data$dept_name == 1,]
emp_name month dept_name revenue status n_points x_partition y_partition x y
1 Sam 1 1 100 Low 9 3 3 0 0
7 Kenneth 1 1 500 Very High 9 3 3 0 0
11 Jonathan 1 1 500 Low 9 3 3 0 0
12 Sam 1 1 100 Low 9 3 3 0 0
18 Kenneth 1 1 500 Very High 9 3 3 0 0
22 Jonathan 1 1 500 Low 9 3 3 0 0
23 Sam 1 1 100 Low 9 3 3 0 0
29 Kenneth 1 1 500 Very High 9 3 3 0 0
33 Jonathan 1 1 500 Low 9 3 3 0 0
Now, my subset dataframe has 9 rows. So, a for loop:
for(i in 1:nrow(sales_data[sales_data$month == 1 & sales_data$dept_name == 1, ] )) {
#Here I want to update the value of column named x with i
sales_data[sales_data$month == month_item & sales_data$dept_name == dept_item, ][i]$x <- x_vector_data[i] ##NOT CORRECT APPROACH
}
Why loop, maybe:
sales_data[sales_data$month == 1 & sales_data$dept_name == 1, "x"] <- x_vector_data

Data Cleaning for Survival Analysis

I’m in the process of cleaning some data for a survival analysis and I am trying to make it so that an individual only has a single, sustained, transition from symptom present (ss=1) to symptom remitted (ss=0). An individual must have a complete sustained remission in order for it to count as a remission. Statistical problems/issues aside, I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects, however, the solutions I keep coming to force me to use conditional formatting based on rows immediately above and below the a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you think you know of a good technique I can use, experiment with, or if you know of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,1,1,1,NA,0,0,1,1,0,NA,0,0,0,1,1,1,1,1,1,NA,1,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold and underlined characters represent changes from the dataset above
The goal here is to find a way to get the NA values for ID #1 (variable ss) to look like this: 1,1,1,1,1,0,0
ID# 2 (variable ss) to look like this: 1,1,0,0,0,0,0
ID #3 (variable ss) to look like this: 1,1,1,1,1,1,NA (no change because the row with NA will be deleted eventually)
ID #4 (variable ss) to look like this: 1,1,1,1,1,0,0 (this one requires multiple changes and I expect it is the most challenging to tackle).
I don't really think you have considered all the "edge case". What to do with two NA's in a row at the end of a period or 4 or 5 NA's in a row. This will give you the requested solution in your tiny test case, however, using the na.locf-function:
require(zoo)
fillNA <- function(vec) { if ( is.na(tail(vec, 1)) ){ vec } else { vec <- na.locf(vec) }
}
> mydat$locf <- with(mydat, ave(ss, id, FUN=fillNA))
> mydat
id time ss locf
1 1 0 1 1
2 1 1 1 1
3 1 2 1 1
4 1 3 1 1
5 1 4 NA 1
6 1 5 0 0
7 1 6 0 0
8 2 0 1 1
9 2 1 1 1
10 2 2 0 0
11 2 3 NA 0
12 2 4 0 0
13 2 5 0 0
14 2 6 0 0
15 3 0 1 1
16 3 1 1 1
17 3 2 1 1
18 3 3 1 1
19 3 4 1 1
20 3 5 1 1
21 3 6 NA NA
22 4 0 1 1
23 4 1 1 1
24 4 2 0 0
25 4 3 NA 0
26 4 4 NA 0
27 4 5 0 0
28 4 6 0 0

R - Conditional replacement of column values in a data frame

I have a data frame which has 2 columns - A & B. I want to replace the values of column B in such a way that, when the VALUE>=5 replace with 1, else replace with 0.
Note - There are 2 conditions to be checked.
X=read.csv("Y:/impdat.csv")
A B
3 16
12 3
1 2
12 9
4 4
5 6
21 1
4 14
3 10
12 1
So after replacing, the data should be
A B
3 1
12 0
1 0
12 1
4 0
5 1
21 0
4 1
3 1
12 0
Sounds simple. But I am unable to implement it.
I tried
ifelse(X$B>=5,1,0)
This only prints the new values, but the original data remains the same.
X$B <- as.integer(X$B >= 5)
will do the trick.
transform(X, B=ifelse(B>=5,1,0))
Got it.
Just had to assign the object.
X$B=ifelse(X$B>=5,1,0)

interchanging values after comparing two columns in R

i want to write a code that checks two columns in a dataframe and compares them. one is supposed to have lower limit and the other upper limits. if values on the upper limit columns are less than on the lower limit, them it should interchange the values. if both lower and upper limits are zero, it should replace the upper limit column with a value say 2. a sample data is as below:
lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2
after running the code, it should produce something like
lower_limit upper_limit
0 3
0 4
2 5
0 15
0 2
0 2
4 7
2 8
dfrm <- read.table(text="lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2", header=TRUE)
dfrm2 <- dfrm
dfrm2[,2] <- pmax(dfrm[,1], dfrm[,2] )
dfrm2[,1] <- pmin(dfrm[,1], dfrm[,2] );
dfrm2[abs(pmax(dfrm[,1],dfrm[,2]))==0 , 2] <- 2
> dfrm2
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
Assuming dat is the name of your data frame/matrix:
setNames(as.data.frame(t(apply(dat, 1, function(x) {
tmp <- sort(x);
tmp[2] <- tmp[2] + (!any(x)) * 2;
return(tmp) }))), colnames(dat))
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
How it works?
The function apply is used to apply a function to each line (argument 1). In this function, x represents a line of dat. Firstly, the values are ordered (with sort) and stored in the object tmp. Then, the second value of tmp is replaced with 2 if both values are 0. Finally, tmp is returned. The function apply returns the results as matrix, which needs to be transposed (with t). This matrix is transformed to a data frame (as.data.frame) with the same column names as the original object dat (with setNames).

Resources