Condition for multiplying columns in R

I am trying to multiply the first column by every second subsequent column, subject to a condition. The main condition is that the first column must contain 10 in that row. Below you can see my data.
df <- data.frame(
  Stores = c(10, 30, 10, 0, 10),
  Value1 = c(10, 10, 0, 100, 0),
  Value2 = c(10, 10, 0, 100, 0),
  Value3 = c(10, 0, 0, 0, 0),
  Value4 = c(10, 10, 0, 0, 0)
)
df
Multiplying the values works well with this command, but without any condition.
df[, 1] * df[seq(3, ncol(df), by = 2)]
Now I want to add a condition on the first column of the data. I tried the command below, but it does not work well.
ifelse(df[, 1] == 10, 1, 0) * df[seq(3, ncol(df), by = 2)]
Can anybody help me solve this and multiply the values only where the column titled Stores equals 10?

You can extend your selection with df$Stores == 10, e.g.
df[df$Stores == 10, 1] * df[df$Stores == 10, seq(3, ncol(df), by = 2)]
  Value2 Value4
1    100    100
3      0      0
5      0      0
If you want to keep the structure of the data frame, try this approach using dplyr. Since you can select column ranges very specifically with dplyr, the column-wise condition can be expressed as the variable name plus the variable number, i.e. paste0("Value", seq(2, 4, 2)), which equals "Value2" and "Value4".
library(dplyr)
library(tidyr)
df %>%
  mutate(n = row_number()) %>%
  pivot_longer(Value1:Value4) %>%
  mutate(is = Stores == 10,
         value = if_else(is & name %in% paste0("Value", seq(2, 4, 2)),
                         value * Stores[is][1], value)) %>%
  pivot_wider(id_cols = c(Stores, n)) %>%
  select(-n)
# A tibble: 5 × 5
  Stores Value1 Value2 Value3 Value4
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     10     10    100     10    100
2     30     10     10      0     10
3     10      0      0      0      0
4      0    100    100      0      0
5     10      0      0      0      0
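If you just want the multiplication applied in place, without dropping rows or reshaping, here is a minimal base-R sketch (an assumption on my part: rows with Stores != 10 keep their original values):
cols <- seq(3, ncol(df), by = 2)  # every second column: Value2, Value4
hit <- df$Stores == 10            # condition on the Stores column
df[hit, cols] <- df$Stores[hit] * df[hit, cols]
This modifies df directly and yields the same values as the dplyr pipe above.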


How to count observations between two rows based on condition in R?

I am trying to create a variable in a data frame that counts the number of observations between two observations which meet a criterion. Here it is counting the number of games since the last win.
Say I have a dataframe like this:
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
                 win = c(1,0,0,0,1,1,0,1,0,0,1))
I want to create a new variable that counts the number of games it has been since the player has won.
Summarized in a vector, the result should be (with NA for the first observation):
c(NA,0,1,2,3,0,0,1,0,1,2)
I want to be able to do this easily and create it as a variable in the data frame using dplyr (or any other suitable approach).
I am not quite sure why the first value should be NA, because the elapsed time since the last "win" is 0, not NA.
For purely logical reasons, I would take the following approach:
seq = with(df, ave(win, cumsum(win == 1), FUN = seq_along) - 1)
So you get the count of past games since the last win as follows:
c(0,1,2,3,0,0,1,0,1,2,0)
But if you still aim for your described result, a little extra data handling achieves it:
c(NA, seq[-length(seq)])
It is not nice, but it works ;)
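The same idea written as a single dplyr pipe, for reference (a sketch; grp and since_win are hypothetical names; it assumes a single player as in the example, otherwise group by player as well):
library(dplyr)
df %>%
  group_by(grp = cumsum(win == 1)) %>%      # a new group starts at each win
  mutate(since_win = row_number() - 1) %>%  # games since the win that opened the group
  ungroup() %>%
  mutate(since_win = lag(since_win)) %>%    # shift down one row so the first value is NA
  select(-grp)
This reproduces the requested vector c(NA,0,1,2,3,0,0,1,0,1,2) as a column of df.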
With {tidyverse}, try:
library(tidyverse)
df <- data.frame(player = c(10,10,10,10,10,10,10,10,10,10,10),
                 win = c(1,0,0,0,1,1,0,1,0,0,1))
df %>%
  group_by(player, group = cumsum(win != lag(win, default = first(win)))) %>%
  mutate(counter = row_number(),
         counter = if_else(win == 1, true = 0L, false = counter)) %>%
  ungroup() %>%
  group_by(player) %>%
  mutate(counter = if_else(row_number() == 1, true = NA_integer_, false = counter)) %>%
  ungroup() %>%
  select(-group)
   player   win counter
    <dbl> <dbl>   <int>
 1     10     1      NA
 2     10     0       1
 3     10     0       2
 4     10     0       3
 5     10     1       0
 6     10     1       0
 7     10     0       1
 8     10     1       0
 9     10     0       1
10     10     0       2
11     10     1       0

R Conditional subtraction from the next unequal value

Given a larger data frame with around 300k rows and 14 columns in the following form:
df <- data.frame(team_id = c(rep(1,10), rep(2,10), rep(3,10), rep(4,10), rep(5,10)),
                 year = rep(c(1954:1963), 5),
                 members = c(0,0,0,1,1,1,2,0,0,0, 0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,0,0),
                 size = c(rep(60,8),50,50, rep(40,7),50,50,70, rep(30,10), rep(99,6),110,101,101,101, rep(80,9),66))
The aim is to create a new vector containing, for each team, the difference in size once all members have left (members changes from 2 or 1 to 0): the size in the year of the last departure is subtracted from the next different size. The direction of change should be shown, so absolute values are not necessary.
What I achieved so far is:
df2 <- df %>%
  arrange(team_id, year) %>%
  group_by(team_id) %>%
  mutate(sizediff = if_else(members == 1 & lead(members) == 0 |
                              members == 2 & lead(members) == 0, 1, 0, missing = 0))
However, instead of the value 1 in the sizediff vector, I want the difference to the future size. Maybe a change from long to wide format or a conditional re-arrangement of the year vector could help, but I am stuck. What I want to achieve looks like:
aim <- data.frame(team_id = c(rep(1,10), rep(2,10), rep(3,10), rep(4,10), rep(5,10)),
                  year = rep(c(1954:1963), 5),
                  members = c(0,0,0,1,1,1,2,0,0,0, 0,0,2,1,1,1,0,0,0,0, 1,1,1,1,1,1,1,1,1,1, 0,1,1,1,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,0,0),
                  size = c(57,rep(60,7),50,50, rep(40,7),50,50,70, rep(30,10), rep(99,6),110,101,101,101, 88,rep(80,8),66),
                  sizediff = c(rep(0,6),-10,rep(0,3), rep(0,5),10,rep(0,4), rep(0,10), rep(0,3),11,rep(0,6), rep(0,7),-14,rep(0,2)))
Is this something you are looking for?
df %>%
  arrange(team_id, year) %>%
  mutate(diff = if_else(members > 0 & dplyr::lead(members, n = 1) == 0, size, 0)) %>%
  group_by(team_id) %>%
  mutate(diff = ifelse(diff > 0, dplyr::last(size) - size, NA))
Try this custom approach:
library(dplyr)
df %>%
  group_by(team_id) %>%
  mutate(sizediff = {
    sizediff <- rep(0, n())
    inds <- which(members %in% c(1, 2) & lead(members) == 0)[1]
    sizediff[inds] <- size[which(row_number() > inds & size != size[inds])[1]] - size[inds]
    sizediff
  }) -> result
result
#    team_id  year members  size sizediff
#      <dbl> <int>   <dbl> <dbl>    <dbl>
# 1         1  1954      0    60        0
# 2         1  1955      0    60        0
# 3         1  1956      0    60        0
# 4         1  1957      1    60        0
# 5         1  1958      1    60        0
# 6         1  1959      1    60        0
# 7         1  1960      2    60      -10
# 8         1  1961      0    60        0
# 9         1  1962      0    50        0
#10         1  1963      0    50        0
# … with 40 more rows
We first initialise sizediff to 0. inds finds the row where the last members left (members is 1 or 2 and the next value is 0). We then calculate the difference between the next size value that changes and the size at the inds position.
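For comparison, the same logic as a standalone base-R helper (a sketch; departure_diff is a hypothetical name, and it assumes rows are ordered by year within each team):
departure_diff <- function(members, size) {
  out <- rep(0, length(members))
  # row of the last departure: members is 1 or 2 and the next value is 0
  inds <- which(members %in% c(1, 2) & c(members[-1], NA) == 0)[1]
  # first later row whose size differs from the size at departure
  nxt <- which(seq_along(size) > inds & size != size[inds])[1]
  if (!is.na(inds) && !is.na(nxt)) out[inds] <- size[nxt] - size[inds]
  out
}
df$sizediff <- unlist(lapply(split(df, df$team_id),
                             function(g) departure_diff(g$members, g$size)))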

Mark row before count starts again

shift <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3)
count <- c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7)
test <- cbind(shift, count)
So I am trying to mark the last row of every shift (the rows with count = c(8, 10, 7)) with a binary 1 and every other row with 0. Right now I am thinking that might be possible with a left join, but I am not quite sure. I would prefer not to work with loops but rather use some techniques from dplyr. Thanks guys!
Assuming that you want to add a new 0/1 column last that contains a 1 in the last row of each shift and that the shifts are contiguous, here are two base R approaches:
transform(test, last = ave(count, shift, FUN = function(x) x == max(x)))
transform(test, last = +!duplicated(shift, fromLast = TRUE))
or with dplyr use mutate:
test %>%
  as.data.frame %>%
  group_by(shift) %>%
  mutate(last = +(1:n() == n())) %>%
  ungroup
test %>%
  as.data.frame %>%
  mutate(last = +!duplicated(shift, fromLast = TRUE))
Try this one:
library(dplyr)
test %>%
  as_tibble() %>%
  group_by(shift) %>%
  mutate(is_last = ifelse(row_number() == max(row_number()), 1, 0)) %>%
  ungroup()
# A tibble: 25 x 3
   shift count is_last
   <dbl> <dbl>   <dbl>
 1     1     1       0
 2     1     2       0
 3     1     3       0
 4     1     4       0
 5     1     5       0
 6     1     6       0
 7     1     7       0
 8     1     8       1
 9     2     1       0
10     2     2       0
# … with 15 more rows

Recode matrix values with dplyr based on look-up table

I have a matrix with numerous cases and corresponding answers from a questionnaire. The strongly reduced example below (raw_responses) contains the answers of 10 persons to 5 items. Let us assume that these are multiple-choice items with 4 possible answers each. If the item was not processed, the person received the code 9.
raw_responses <- data.frame('id' = 1:10,
                            'item_1' = sample(c(1:4, 9), 10, replace = TRUE),
                            'item_2' = sample(c(1:4, 9), 10, replace = TRUE),
                            'item_3' = sample(c(1:4, 9), 10, replace = TRUE),
                            'item_4' = sample(c(1:4, 9), 10, replace = TRUE),
                            'item_5' = sample(c(1:4, 9), 10, replace = TRUE))
The correct answers are stored in a separate table that reflects the entire test design. Below again a strongly reduced variant (design) with only item names and the corresponding correct answers.
design <- data.frame('item' = c('item_1','item_2','item_3','item_4','item_5'),
                     'key' = sample(1:4, 5, replace = TRUE))
Finally, the goal is a table with scored answers. A correct answer is coded with 1, a wrong one with 0 and an "empty" answer with 99. This works, for example, with the for loop below.
scored_responses <- raw_responses
for (item in colnames(raw_responses)[2:6]) {
  scored_responses[, item] <- ifelse(scored_responses[, item] == design[design$item == item, 'key'], 1,
                                     ifelse(scored_responses[, item] == 9, 99, 0))
}
However, I was wondering whether this could be done with a more efficient variant using dplyr (including case_when) and possibly purrr. Especially because the very extensive answer table is cleaned up with the help of a longer dplyr pipe, it would be an advantage if the scoring could be built in there.
I thank you in advance for all ideas and hints.
Get the data in long format, join, recode the values and get the data back in wide format.
library(dplyr)
library(tidyr)
raw_responses %>%
  pivot_longer(cols = -id, names_to = 'item') %>%
  left_join(design, by = 'item') %>%
  mutate(value = case_when(value == 9 ~ 99,
                           value == key ~ 1,
                           TRUE ~ 0)) %>%
  select(-key) %>%
  pivot_wider(names_from = 'item')
# A tibble: 10 x 6
#       id item_1 item_2 item_3 item_4 item_5
#    <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1      1     99     99      1      0      0
# 2      2     99     99     99      0      0
# 3      3      1     99      0     99     99
# 4      4      0      1      1     99      1
# 5      5     99      0      1      0      1
# 6      6      0      1      0      0      1
# 7      7      0      0      0      1     99
# 8      8      1     99      0      0      0
# 9      9      0     99     99      0      1
#10     10     99      1     99      1      0
Another approach, which avoids reshaping the data entirely, is to use map2_dfc from purrr:
library(purrr)
map2_dfc(raw_responses[-1], design$key, ~ case_when(.x == 9 ~ 99,
                                                    .x == .y ~ 1,
                                                    TRUE ~ 0))
However, for this answer to work we need to ensure that the column names in raw_responses and design$item are in the same order. In this example they already are; if they are not in the real data, we can reorder the columns by item name:
raw_responses[-1] <- raw_responses[-1][as.character(design$item)]
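Alternatively, a sketch that looks the keys up by item name so that column order never matters (key_by_item is a hypothetical helper object, not part of the answer above):
key_by_item <- setNames(design$key, design$item)  # named key vector: item name -> correct answer
map2_dfc(raw_responses[-1], key_by_item[names(raw_responses)[-1]],
         ~ case_when(.x == 9 ~ 99,
                     .x == .y ~ 1,
                     TRUE ~ 0))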

Filter rows by last maximal value ordering by a time value

I have a data frame with an id, an ordering time value and a value. For each group of ids, I would like to remove the rows whose value is smaller than (or equal to) the value of a row with a smaller time.
data <- data.frame(id = c(rep(c("a", "b"), each = 3L), "b"),
                   time = c(0, 1, 2, 0, 1, 2, 3),
                   value = c(1, 1, 2, 3, 1, 2, 4))
> data
  id time value
1  a    0     1
2  a    1     1
3  a    2     2
4  b    0     3
5  b    1     1
6  b    2     2
7  b    3     4
So the result would be :
> data
  id time value
1  a    0     1
2  a    2     2
3  b    0     3
4  b    3     4
(For id == "b", the rows where time %in% c(1, 2) are removed because their value is smaller than a value at a smaller time.)
I was thinking about lag:
data %>%
  group_by(id) %>%
  filter(time == 0 | lag(value, order_by = time) < value)
Source: local data frame [5 x 3]
Groups: id [2]
      id  time value
  <fctr> <dbl> <dbl>
1      a     0     1
2      a     2     2
3      b     0     3
4      b     2     2
5      b     3     4
But it doesn't work as expected since lag is a vectorized function, so instead the idea would be to use a "recursive lag" or to check against the last maximal value. I can do it recursively with a loop, but I'm sure there is a more straightforward, higher-level way to do it.
Any help would be appreciated, thank you !
Here is a data.table solution:
library(data.table)
setDT(data)
data[, myVal := cummax(c(0, shift(value)[-1])), by = id][value > myVal][, myVal := NULL][]
   id time value
1:  a    0     1
2:  a    2     2
3:  b    0     3
4:  b    3     4
The first part of the chain uses shift and cummax to create the cumulative maximum of the lagged value variable. In c(0, shift(value)[-1]), the 0 is added to supply a value lower than any in the variable; more generally, you could use min(value) - 1. The [-1] subsetting removes the first element of shift, which is NA. The second part of the chain selects observations where value is greater than the cumulative maximum. The final two chains remove the cumulative maximum variable and print the result.
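As noted, a sketch of the more general variant, replacing the hard-coded 0 so the filter also works when value can be zero or negative (an assumption extrapolated from the remark above):
data[, myVal := cummax(c(min(value) - 1, shift(value)[-1])), by = id][value > myVal][, myVal := NULL][]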
Another option is to perform a self anti/non-equi join using data.table
library(data.table) # v1.10.0
setDT(data)[!data, on = .(id, time > time, value <= value)]
#    id time value
# 1:  a    0     1
# 2:  a    2     2
# 3:  b    0     3
# 4:  b    3     4
Which is basically saying: "if time is larger but value is smaller or equal, then I don't want these rows" (hence the ! sign).
Here is an option with dplyr. After grouping by 'id', we filter the rows where the 'value' is greater than the cumulative maximum of the 'lag' of the 'value' column.
library(dplyr)
data %>%
  group_by(id) %>%
  filter(value > cummax(lag(value, default = 0)))
#      id  time value
#  <fctr> <dbl> <dbl>
#1      a     0     1
#2      a     2     2
#3      b     0     3
#4      b     3     4
Or another option is slice after arranging by 'id' and 'time' (as the OP mentioned the ordering):
data %>%
  group_by(id) %>%
  arrange(id, time) %>%
  slice(which(value > cummax(lag(value, default = 0))))
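For comparison, the same running-maximum filter in base R (a sketch; it assumes the rows are already ordered by time within each id, as in the example data):
keep <- with(data, ave(value, id, FUN = function(x) x > cummax(c(-Inf, head(x, -1)))))
data[keep == 1, ]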
