Data Frame - Add number of occurrences with a condition in R

I'm having a bit of a struggle trying to figure out how to do the following. I want to find how many days of high sales I have immediately before a change of price. For example, there is a price change on day 10, and the high-sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days before the change.
In this case it should return 5 (days 5 to 9).
For example purposes, the dataframe is called df. Code:
# trying to create a while loop that will check if lag(high_sales) is 1;
# if yes it will count until there's a lag(high_sales) == 0
# loop is just my dummy variable that will take me out of the while loop
count_sales <- 0
loop <- 0
df <- df %>% mutate(consec_high_days = ifelse(price_change > 0,
                                              while (loop == 0) {
                                                if (lag(High_sales_ind) == 1) {
                                                  count_sales <- count_sales + 1
                                                } else {
                                                  loop <- 0
                                                }
                                                count_sales
                                              }, 0))
day price price_change sales High_sales_ind
  1     5            0    12              1
  2     5            0     6              0
  3     5            0     5              0
  4     5            0     4              0
  5     5            0    10              1
  6     5            0    10              1
  7     5            0    10              1
  8     5            0    12              1
  9     5            0    14              1
 10     7            2     3              0
 11     7            0     2              0
These are my error messages:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.

This is a very inelegant brute-force answer, and hopefully someone can provide a more elegant one, but to get the desired dataset you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr)

# number each row within its run of consecutive (sales >= 10) values
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# find how many consecutive days occurred before each row
df <- df %>% mutate(lseq = lag(seq))
# lseq at the price-change row is the length of the preceding run;
# end is the last row of that run
keepz <- df[df$price_change != 0, "lseq"]
end <- as.numeric(rownames(df[df$price_change != 0, ])) - 1
# keep the run of rows leading up to the change, dropping the two helper columns
df_want <- df[(end - keepz + 1):end, -c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
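If you only need the count itself rather than the subset of rows, a plain base-R sketch (assuming the df read in above) is to walk backwards from each price-change row:
# for each price change, count the consecutive high-sales days just before it
df$consec_high_days <- 0
for (i in which(df$price_change > 0)) {
  j <- i - 1
  while (j >= 1 && df$High_sales_ind[j] == 1) {
    df$consec_high_days[i] <- df$consec_high_days[i] + 1
    j <- j - 1
  }
}
df$consec_high_days[10]
# [1] 5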

Related

Update value in a subset dataframe using index of for loop [duplicate]

My use case involves filtering a dataframe on some condition. Once I get the subset dataframe, I want to traverse the subset one row at a time, checking a certain condition and updating a value in that particular row.
Here is my implementation:
> sales_data[sales_data$month == 1 & sales_data$dept_name == 1,]
emp_name month dept_name revenue status n_points x_partition y_partition x y
1 Sam 1 1 100 Low 9 3 3 0 0
7 Kenneth 1 1 500 Very High 9 3 3 0 0
11 Jonathan 1 1 500 Low 9 3 3 0 0
12 Sam 1 1 100 Low 9 3 3 0 0
18 Kenneth 1 1 500 Very High 9 3 3 0 0
22 Jonathan 1 1 500 Low 9 3 3 0 0
23 Sam 1 1 100 Low 9 3 3 0 0
29 Kenneth 1 1 500 Very High 9 3 3 0 0
33 Jonathan 1 1 500 Low 9 3 3 0 0
Now, my subset dataframe has 9 rows. So, a for loop:
for(i in 1:nrow(sales_data[sales_data$month == 1 & sales_data$dept_name == 1, ])) {
  # Here I want to update the value of column named x with i
  sales_data[sales_data$month == month_item & sales_data$dept_name == dept_item, ][i]$x <- x_vector_data[i]  ## NOT CORRECT APPROACH
}
Why loop? Maybe:
sales_data[sales_data$month == 1 & sales_data$dept_name == 1, "x"] <- x_vector_data
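If the row positions are needed elsewhere, the same in-place update can go through which() indices. A sketch reusing the question's sales_data and x_vector_data (the two lengths must match):
# indices of the filtered rows, then update x in place
idx <- which(sales_data$month == 1 & sales_data$dept_name == 1)
stopifnot(length(idx) == length(x_vector_data))
sales_data$x[idx] <- x_vector_data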

How to create a complex running calculation on an R data table

I want to create a running calculation that includes logic to restart the running sum when the value is negative. Initially I have a data table or frame like below :
df <- data.frame(value1 = c(0, 0, 10, 0, 1, 0, 2, 0),
                 value2 = c(5, 1, 2, 6, 8, 3, 7, 2))
value1 value2
0 5
0 1
10 2
0 6
1 8
0 3
2 7
0 2
I would like a running total of value2 minus value1. However, if the new value would be less than 0, the running calculation should start over.
i.e. end up with
value1 value2 newvalue
0 5 5
0 1 6
10 2 2
0 6 8
1 8 15
0 3 18
2 7 23
0 2 25
I tried multiple attempts with data.table and dplyr packages with no luck.
EDIT: Updated df to match the actual table shown.
I am sure there are simpler ways to do this by tweaking cumsum or other such functions, but I came up with this basic loop to produce the desired output. Hope it helps!
> df
GroupID value1 value2
1 1 0 5
2 1 0 1
3 1 10 2
4 2 0 6
5 2 1 8
6 3 0 3
7 3 2 7
8 3 0 2
for(i in 1:nrow(df)) {
  if(i == 1) {
    df$newvalue[i] <- df$value2[i]
  } else {
    df$newvalue[i] <- (df$newvalue[i-1] + df$value2[i]) - df$value1[i]
    if(df$newvalue[i] < 0 | df$GroupID[i] != df$GroupID[i-1]) {
      df$newvalue[i] <- df$value2[i]
    }
  }
}
> df
GroupID value1 value2 newvalue
1 1 0 5 5
2 1 0 1 6
3 1 10 2 2
4 2 0 6 6
5 2 1 8 13
6 3 0 3 3
7 3 2 7 8
8 3 0 2 10
I believe that explicitly looping through the data frame is the only solution for calculating this type of conditional cumulative sum. Sagar's solution was very helpful to me (I up-voted, but do not have enough reputation points for it to count).
In my experience, newvalue needs to be initialized before starting the loop for this to work properly. Below is how I would approach it:
df$newvalue <- df$value2
for(i in 2:nrow(df)) {
  if(df$GroupID[i] == df$GroupID[i-1]) {
    df$newvalue[i] <- max(df$newvalue[i-1] + df$value2[i] - df$value1[i], df$value2[i])
  }
}
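For the single-group case in the original question (no GroupID), the same restart logic can also be written without an explicit index loop using base R's Reduce(). A sketch on the question's df:
df <- data.frame(value1 = c(0, 0, 10, 0, 1, 0, 2, 0),
                 value2 = c(5, 1, 2, 6, 8, 3, 7, 2))

# carry the running total forward, restarting at value2[i] when it would go negative
df$newvalue <- Reduce(function(prev, i) {
  cand <- prev + df$value2[i] - df$value1[i]
  if (cand < 0) df$value2[i] else cand
}, seq_len(nrow(df))[-1], init = df$value2[1], accumulate = TRUE)

df$newvalue
# [1]  5  6  2  8 15 18 23 25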

How to find and remove columns containing more than k consecutive zeros in R data.frame?

I have a huge data.frame with around 200 variables, each represented by a column. Unfortunately, the data is sourced from a poorly formatted data dump (and hence can't be modified) which represents both missing values and zeroes as 0.
The data has been observed every 5 minutes for a month, and a day-long period of only 0s can be reasonably thought of as a day where the counter was not functioning, thereby leading to the conclusion that those 0s are actually NAs.
I want to find (and remove) columns that have at least 288 consecutive 0s at any point. Or, more generally, how can we remove columns from a data.frame containing >=k consecutive 0s?
I'm relatively new to R, and any help would be greatly appreciated. Thanks!
EDIT: Here is a reproducible example. Considering k=4, I would like to remove columns A and B (but not C, since the 0s are not consecutive).
df<-data.frame(A=c(4,5,8,2,0,0,0,0,6,3), B=c(3,0,0,0,0,6,8,2,1,0), C=c(4,5,6,0,3,0,2,1,0,0), D=c(1:10))
df
A B C D
1 4 3 4 1
2 5 0 5 2
3 8 0 6 3
4 2 0 0 4
5 0 0 3 5
6 0 6 0 6
7 0 8 2 7
8 0 2 1 8
9 6 1 0 9
10 3 0 0 10
You can use this function on your data:
cons.Zeros <- function(x, n) {
  x <- x[!is.na(x)] == 0
  r <- rle(x)
  any(r$lengths[r$values] >= n)
}
This function returns TRUE for the columns that need to be dropped. n is the number of consecutive zeros that you want the column to be dropped for.
For your sample dataset, let's use n = 3:
df.dropped <- df[, !sapply(df, cons.Zeros, n=3)]
#output:
# > df.dropped
# C D
# 1 4 1
# 2 5 2
# 3 6 3
# 4 0 4
# 5 3 5
# 6 0 6
# 7 2 7
# 8 1 8
# 9 0 9
# 10 0 10
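For reference, the same helper can also just report which columns would be dropped; a quick usage sketch on the df above, with k = 4 as in the question:
names(df)[sapply(df, cons.Zeros, n = 4)]
# [1] "A" "B"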

How to get total count of repeated values in R

I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count of persons (unique by person_id) with 1 study, 2 studies, and so on, rather than the count for a particular study_id value:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc.
How can I do this? I could count with a loop, but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this:
dd <- data.frame(
  person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
  study_id = 1:11
)
Now I can count the number of people with a given number of studies with:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
Where the top line is the number of studies, and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
You can use the aggregate function to get counts per user, then use it again to get counts of counts.
For example, assume your data is called "test":
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set your NAs to a number such as zero so they are not ignored:
test$study_id[is.na(test$study_id)] = 0
Then you can run the same aggregation, but with a condition that the study_id has to be greater than zero:
stg <- setNames(
  aggregate(study_id ~ person_id,
            data = test, function(x) { sum(x > 0) }),
  c("person_id", "num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts
setNames(
  aggregate(person_id ~ num_studies,
            data = stg, length),
  c("num_studies", "num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr
library(dplyr)
tmp <- df %>%
  group_by(person_id) %>%
  summarise(num.studies = n()) %>%
  group_by(num.studies) %>%
  summarise(num.persons = n())
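For completeness, the two grouping steps can be collapsed with dplyr's count(), which is shorthand for group_by() plus summarise(n()). A self-contained sketch on the question's sample data:
library(dplyr)

dat <- data.frame(person_id = c(10, 11, 10, 10, 11),
                  study_id = c(1, 2, 3, 4, 5))

dat %>%
  count(person_id, name = "num_studies") %>%  # studies per person
  count(num_studies, name = "num_persons")    # persons per number of studies
#   num_studies num_persons
# 1           2           1
# 2           3           1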
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> table(xtabs(dat))
# 10 11 12
# 3 2 1
df <- data.frame(
  person_id = c(10, 11, 10, 10, 11, 11, 11),
  study_id = c(1, 2, 3, 4, 5, 5, 1))
# remove replicated rows
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
#10 11
# 3 4
# number of people in each study
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 2

Interchanging values after comparing two columns in R

I want to write code that checks two columns in a dataframe and compares them: one is supposed to hold lower limits and the other upper limits. If a value in the upper-limit column is less than the corresponding lower limit, the code should interchange the values. If both the lower and upper limits are zero, it should replace the upper limit with a value, say 2. A sample of the data is below:
lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2
After running the code, it should produce something like:
lower_limit upper_limit
0 3
0 4
2 5
0 15
0 2
0 2
4 7
2 8
dfrm <- read.table(text="lower_limit upper_limit
0 3
0 4
5 2
0 15
0 0
0 0
7 4
8 2", header=TRUE)
dfrm2 <- dfrm
dfrm2[, 2] <- pmax(dfrm[, 1], dfrm[, 2])
dfrm2[, 1] <- pmin(dfrm[, 1], dfrm[, 2])
dfrm2[abs(pmax(dfrm[, 1], dfrm[, 2])) == 0, 2] <- 2
> dfrm2
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
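The same swap can also be done in a single transform() call, since transform() evaluates all of its arguments against the original columns rather than the partially updated ones. A sketch on the dfrm read in above:
dfrm2 <- transform(dfrm,
                   lower_limit = pmin(lower_limit, upper_limit),
                   upper_limit = ifelse(lower_limit == 0 & upper_limit == 0,
                                        2, pmax(lower_limit, upper_limit)))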
Assuming dat is the name of your data frame/matrix:
setNames(as.data.frame(t(apply(dat, 1, function(x) {
  tmp <- sort(x)
  tmp[2] <- tmp[2] + (!any(x)) * 2
  return(tmp)
}))), colnames(dat))
lower_limit upper_limit
1 0 3
2 0 4
3 2 5
4 0 15
5 0 2
6 0 2
7 4 7
8 2 8
How does it work?
The function apply applies a function to each row (MARGIN = 1). In that function, x represents one row of dat. First, the values are sorted (with sort) and stored in the object tmp. Then, the second value of tmp is increased by 2 if both values are 0. Finally, tmp is returned. apply returns the results as a matrix, which needs to be transposed (with t). That matrix is converted to a data frame (as.data.frame) with the same column names as the original object dat (with setNames).
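As a quick illustration of the (!any(x)) * 2 trick on a single all-zero row:
x <- c(0, 0)
tmp <- sort(x)          # c(0, 0)
tmp[2] + (!any(x)) * 2  # any(c(0, 0)) is FALSE, so this is 0 + TRUE * 2
# [1] 2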
