R: Conditional subtraction from the next unequal value

Given a large data frame with 300k+ rows and 14 columns in the following form:
df <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0,0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1,0,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,0,0),
size = c(rep(60,8),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,rep(80,9),66) )
The aim is to create, for each team, a new vector containing the change in size once all members have left (members changes from 2 or 1 to 0). The size in the year of the last departure is subtracted from the next different size.
The direction of change should be shown, so absolute values are not wanted.
What I have achieved so far is:
df2 <- df %>% arrange(team_id,year) %>%
group_by(team_id) %>%
mutate(sizediff = if_else(members == 1 & lead(members) == 0 | members == 2 & lead(members) == 0,1,0, missing = 0) )
However, instead of the value 1 in the sizediff vector, I want the difference to the future size. Maybe changing from long to wide format or a conditional re-arrangement of the year vector could help, but I am stuck. What I want to achieve looks like:
aim <- data.frame(team_id = c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10)),
year = rep(c(1954:1963), 5), members= c(0,0,0,1,1,1,2,0,0,0, 0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,0,0 ) ,
size = c(57,rep(60,7),50,50,rep(40,7),50,50,70,rep(30,10),rep(99,6),110,101,101,101,88,rep(80,8),66),
sizediff = c(rep(0,6),-10,rep(0,3),rep(0,5),10,rep(0,4),rep(0,10),rep(0,3),11,rep(0,6),rep(0,7),-14,rep(0,2)) )

Is this something you are looking for?
df %>%
arrange(team_id, year) %>%
mutate(diff = if_else((members> 0 & dplyr::lead(members, n=1)==0), size, 0)) %>%
group_by(team_id) %>%
mutate(diff = ifelse(diff>0, dplyr::last(size)-size, NA))

Try this custom approach:
library(dplyr)
df %>%
group_by(team_id) %>%
mutate(sizediff = {
sizediff = rep(0, n())
inds <- which(members %in% c(1, 2) & lead(members) == 0)[1]
sizediff[inds] <- size[which(row_number() > inds & size != size[inds])[1]] - size[inds]
sizediff
}) -> result
result
# team_id year members size sizediff
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 1 1954 0 60 0
# 2 1 1955 0 60 0
# 3 1 1956 0 60 0
# 4 1 1957 1 60 0
# 5 1 1958 1 60 0
# 6 1 1959 1 60 0
# 7 1 1960 2 60 -10
# 8 1 1961 0 60 0
# 9 1 1962 0 50 0
#10 1 1963 0 50 0
# … with 40 more rows
We first initialise sizediff to 0. inds holds the first position where all members left, i.e. where members is 1 or 2 and the next value is 0. We then look up the next size that differs from size[inds], subtract size[inds] from it, and store the result at position inds.
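The same steps can be pulled out into a plain function, which makes them easier to check on one team at a time. This is a minimal sketch in base R; the helper name size_change_after_exit is hypothetical, and like the answer above it only handles the first time a team drops to zero members.

```r
# Hypothetical helper (not from the answer): the same steps for one team's vectors.
size_change_after_exit <- function(members, size) {
  out <- numeric(length(members))
  nxt <- c(members[-1], NA)                 # next year's members, like dplyr::lead()
  idx <- which(members > 0 & nxt == 0)[1]   # first year after which no members remain
  if (is.na(idx)) return(out)               # nobody ever left: all zeros
  later <- which(seq_along(size) > idx & size != size[idx])[1]  # next different size
  if (!is.na(later)) out[idx] <- size[later] - size[idx]
  out
}

# Team 1 and team 3 from the question's df
team1 <- size_change_after_exit(c(0, 0, 0, 1, 1, 1, 2, 0, 0, 0), c(rep(60, 8), 50, 50))
team3 <- size_change_after_exit(rep(1, 10), rep(30, 10))
```

With dplyr loaded, it could be applied per team as df %>% group_by(team_id) %>% mutate(sizediff = size_change_after_exit(members, size)).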

Related

Condition for multiplying columns

I am trying to multiply the first column by every second subsequent column, subject to a condition.
The condition is that the first column contains 10. Below you can see my data.
df<-data.frame(
Stores=c(10,30,10,0,10),
Value1=c(10,10,0,100,0),
Value2=c(10,10,0,100,0),
Value3=c(10,0,0,0,0),
Value4=c(10,10,0,0,0)
)
df
Multiplying the values works well with this command, but without any condition:
df[,1] * df[seq(3,ncol(df), by = 2)]
Now I want to add a condition on the first column. I tried the command below, but it does not work as intended:
ifelse(df[,1]==10,1,0) * df[seq(3,ncol(df), by = 2)]
Can anybody help me multiply the values only where the first column, Stores, equals 10?
You can extend your selection with df$Stores == 10, e.g.:
df[df$Stores == 10, 1] * df[df$Stores == 10, seq(3, ncol(df), by = 2)]
Value2 Value4
1 100 100
3 0 0
5 0 0
If you want to keep the structure of the data frame, try this approach using dplyr and tidyr. Because the data are pivoted to long format, the column-wise condition can be expressed through the variable name plus the variable number, i.e. paste0("Value", seq(2, 4, 2)), which equals "Value2" and "Value4".
library(dplyr)
library(tidyr)
df %>%
mutate(n = row_number()) %>%
pivot_longer(Value1:Value4) %>%
mutate(is = Stores == 10,
value = if_else(is & name %in% paste0("Value", seq(2, 4, 2)),
value * Stores[is][1], value)) %>%
pivot_wider(id_cols = c(Stores, n)) %>%
select(-n)
# A tibble: 5 × 5
Stores Value1 Value2 Value3 Value4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 10 10 100 10 100
2 30 10 10 0 10
3 10 0 0 0 0
4 0 100 100 0 0
5 10 0 0 0 0
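For comparison, the same structure-preserving result can be reached in base R by assigning into the selected rows and columns. This is only a sketch under the question's setup; the names rows, cols and out are introduced here for illustration.

```r
df <- data.frame(
  Stores = c(10, 30, 10, 0, 10),
  Value1 = c(10, 10, 0, 100, 0),
  Value2 = c(10, 10, 0, 100, 0),
  Value3 = c(10, 0, 0, 0, 0),
  Value4 = c(10, 10, 0, 0, 0)
)

rows <- df$Stores == 10            # rows that satisfy the condition
cols <- seq(3, ncol(df), by = 2)   # every second column from the third: Value2, Value4

out <- df                          # work on a copy so df stays unchanged
# Multiply only the selected cells; all other cells keep their original values.
out[rows, cols] <- df$Stores[rows] * df[rows, cols]
out
```

Rows where Stores is not 10 and the columns outside cols are left as they were, so the shape of the data frame is unchanged.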

How to count observations between two rows based on condition in R?

I am trying to create a variable for a data frame in which I count the number of observations between two observations that meet a criterion. Here, it counts the number of games since the last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),win = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized in a vector, the result should be (setting a Not Applicable for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily and create it as a variable in the data.frame using dplyr (or any other suitable approach)
I am not quite sure why the first value should be NA, because the elapsed time since the last "win" is 0, not NA.
For purely logical reasons, I would take the following approach:
seq = with(df, ave(win, cumsum(win == 1), FUN = seq_along)-1)
This gives the running count of games since the last win:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still want the result you described, a little data handling gets you there (shift the vector by one position and put NA in front):
c(NA, head(seq, -1))
It is not nice, but it works ;)
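To store the shifted counter as a column, and to keep players separate if there are several, the two steps can be wrapped in ave() again. This is a sketch only; the names since_win and games_since_win are made up for illustration.

```r
df <- data.frame(player = rep(10, 11),
                 win = c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1))

# Step 1: position within each run that starts at a win, per player, minus 1.
since_win <- with(df, ave(win, player, cumsum(win == 1), FUN = seq_along) - 1)

# Step 2: shift by one game within each player; the first game has no history, so NA.
df$games_since_win <- ave(since_win, df$player,
                          FUN = function(x) c(NA, head(x, -1)))
df
```

The counter for a game therefore describes the games played before it, which is why the first row is NA.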
With {tidyverse}, try:
library(tidyverse)
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
win = c(1,0,0,0,1,1,0,1,0,0,1))
df %>%
group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
mutate(counter = row_number(),
counter = if_else(win == 1, true = 0L, false = counter)) %>%
ungroup() %>%
group_by(player) %>%
mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
ungroup() %>%
select(-group)
player win counter
<dbl> <dbl> <int>
1 10 1 NA
2 10 0 1
3 10 0 2
4 10 0 3
5 10 1 0
6 10 1 0
7 10 0 1
8 10 1 0
9 10 0 1
10 10 0 2
11 10 1 0

Selecting all columns that have some specific values

I have a data.frame with more than 50 columns and 10,000 rows. I want to select those columns that have 0 or 1 in them, excluding the other values in those columns.
A sample data.frame is below:
dummy_df <- data.frame(
id=1:4,
gender=c(4,1,0,1),
height=seq(150, 180,by = 10),
smoking=c(3,0,1,0)
)
I want to select all the columns containing 0 or 1 and exclude other values, like 4 in gender and 3 in smoking, as below:
gender smoking
1 0
0 1
1 0
However, I have 50 columns in the actual data frame and I don't know which of them contain 0 or 1.
What I'm trying is:
dummy_df %>% select_if(~ all( . %in% 0:1))
Is this useful for you?
dummy_df %>%
select(- c(id, height)) %>%
rowwise() %>%
filter(any(c_across(everything()) == 0) | any(c_across(everything()) == 1))
# A tibble: 3 x 2
# Rowwise:
gender smoking
<dbl> <dbl>
1 1 0
2 0 1
3 1 0
EDIT:
If you don't know in advance which cols contain 0 and/or 1, you can determine that in base R:
temp <- dummy_df[sapply(dummy_df, function(x) any(x == 0|x == 1))]
Now you can filter for rows with 0 and/or 1:
temp %>%
rowwise() %>%
filter(any(c_across(everything()) == 0) | any(c_across(everything()) == 1))
I think it's more like a case of filter than select:
library(dplyr)
dummy_df %>%
filter(if_all(c(gender, smoking), ~ .x %in% c(0, 1)))
id gender height smoking
1 2 1 160 0
2 3 0 170 1
3 4 1 180 0
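Column detection and row filtering can also be combined in base R. Note that a test such as any(x == 0 | x == 1) also picks up id in the sample data, because id contains the value 1. The sketch below therefore assumes a stricter, hypothetical rule: a column is a candidate only if it contains both 0 and 1. The names has_both, cand and keep are illustrative.

```r
dummy_df <- data.frame(
  id = 1:4,
  gender = c(4, 1, 0, 1),
  height = seq(150, 180, by = 10),
  smoking = c(3, 0, 1, 0)
)

# Assumed rule: a candidate column must contain both 0 and 1 somewhere.
has_both <- vapply(dummy_df, function(x) all(c(0, 1) %in% x), logical(1))
cand <- dummy_df[has_both]

# Keep only the rows where every candidate column holds 0 or 1.
keep <- Reduce(`&`, lapply(cand, function(x) x %in% c(0, 1)))
result <- cand[keep, , drop = FALSE]
result
```

Whether "contains both 0 and 1" is the right rule depends on the real data; with a column that legitimately holds only 0s, the criterion would need to be relaxed.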

Mark row before count starts again

shift = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count =c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift,count)
So I am trying to mark the last row of every shift (the rows with count = c(8,10,7)) with a binary 1 and every other row with 0. Right now I am thinking this might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
or with dplyr use mutate:
test %>%
as.data.frame %>%
group_by(shift) %>%
mutate(last = +(1:n() == n())) %>%
ungroup
test %>%
as.data.frame %>%
mutate(last = +!duplicated(shift, fromLast = TRUE))
Try this one:
library(dplyr)
test %>%
as_tibble() %>%
group_by(shift) %>%
mutate(is_last = ifelse( row_number() == max(row_number()), 1, 0)) %>%
ungroup()
# A tibble: 25 x 3
shift count is_last
<dbl> <dbl> <dbl>
1 1 1 0
2 1 2 0
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
7 1 7 0
8 1 8 1
9 2 1 0
10 2 2 0
# … with 15 more rows
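Under the contiguity assumption stated above, another way to see the problem is that a row is the last of its shift exactly when the next row belongs to a different shift, or when there is no next row. A minimal base R sketch of that idea, rebuilding the question's vectors with rep() and storing them in a data frame:

```r
shift <- c(rep(1, 8), rep(2, 10), rep(3, 7))
count <- c(1:8, 1:10, 1:7)
test <- data.frame(shift, count)

n <- nrow(test)
# Compare each shift value with the one that follows it; the final row is always a last row.
test$last <- as.integer(c(test$shift[-1] != test$shift[-n], TRUE))

which(test$last == 1)
```

This needs no grouping at all, but it relies on the rows of each shift being stored next to each other.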

Consecutive wins/losses R

I am still new to R and learning methods for conducting analysis. I have a df in which I want to count the consecutive wins/losses based on column "x9", which shows the gain/loss (positive or negative value) for the trade entered. I found some code that helped with assigning a sign, a sign lag and a change flag. However, I am looking for a counter that counts the consecutive wins until a loss occurs, then resets and counts the consecutive losses until a win occurs. Overall, I am looking for help adjusting the counter so that it resets when a run of consecutive wins/losses is interrupted. I have some sample code below and an attached .png to explain my thoughts.
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specific order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2
I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
library(dplyr)
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12,-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>% # run id: changes whenever the sign flips
group_by(grp) %>%
mutate(row = row_number()) %>% # counter that restarts within each run
ungroup()
# # A tibble: 18 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -39.1 1 15
# 16 -39.1 1 16
# 17 783. 2 1
# 18 -41.0 3 1
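If only the streak counter is needed, rle() can be combined with sequence() in base R, without any grouping step. The sketch below uses a short, made-up x9 vector so that both winning and losing runs appear; it treats a value of exactly 0 as a loss, which is an assumption.

```r
# Hypothetical gains/losses, chosen only to show several runs
x9 <- c(-40.57, -40.08, 782.58, 12.5, -41.04, -39.6, -39.6, 15)

runs <- rle(x9 > 0)                  # lengths of consecutive win / loss runs
streak <- sequence(runs$lengths)     # 1, 2, ... within each run, restarting at every flip
outcome <- ifelse(x9 > 0, "Win", "Loss")

data.frame(x9, outcome, streak)
```

Here streak counts consecutive wins while outcome is "Win" and consecutive losses while it is "Loss", resetting to 1 whenever the outcome changes.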
