Creating a new column based on multiple conditions - r

I have a dataset with two columns, and I need to create a third column whose value depends on conditions involving the first two.
library(dplyr)

set.seed(1)
x1 <- sample(1:10, 100, replace = TRUE)
y1 <- sample(seq(1, 10, 0.1), 100, replace = TRUE)
z <- cbind(x1, y1)
unique(as.data.frame(z)$x1)
z %>% as.data.frame() %>% dplyr::filter(x1 == 3)
table(x1)
x1
 1  2  3  4  5  6  7  8  9 10
 7  6 11 14 14  5 11 15 11  6
> z %>% as.data.frame() %>% dplyr::filter(x1 == 3)
   x1   y1
1   3  6.9
2   3  9.5
3   3 10.0
4   3  5.6
5   3  4.1
6   3  2.5
7   3  5.3
8   3  9.5
9   3  5.5
10  3  8.9
11  3  1.2
For example, when I filter on x1 == 3, the y1 values above appear. I need to write 1 in the row holding the smallest y1 (the 11th row here, where y1 is 1.2) and 0 in all the others. My original dataset has 43,545 rows but only 638 unique x1 values. table(x1) shows that 1 is repeated 7 times, but in my dataset some values have a frequency of 1 and some a frequency of 100. I thought of case_when, but how can I check every y1 within a group to find the smallest one and put a 1 there?

If I understand correctly, you are looking for the row with the minimal y1 value for each value of x1:
library(tidyverse)

z %>%
  as.data.frame() %>%
  group_by(x1) %>%
  arrange(y1) %>%                                     # sort values in increasing order within each group
  mutate(flag = ifelse(row_number() == 1, 1, 0)) %>%  # flag the first (smallest) row in each group
  ungroup()
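If re-sorting the rows is undesirable, a more direct variant (a sketch, not part of the original answer; note it flags all rows tied for the group minimum) is to compare each y1 against the minimum of its group:

library(dplyr)

z %>%
  as.data.frame() %>%
  group_by(x1) %>%
  mutate(flag = as.integer(y1 == min(y1))) %>%  # 1 where y1 equals the group minimum, 0 otherwise
  ungroup()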

Related

How do I mutate numbers to a different scale in R?

x <- read.csv(text="Group,X,Y
1,1,2
1,2,5
1,3,8
1,4,9
2,1,1
2,2,4
2,3,8
2,4,10
2,5,16
3,1,6
3,2,7
3,3,8
3,4,12
3,5,13")
I want to shift the y-scale from what it currently is to one that starts from 0. Right now, I have grouped this dataset by the "Group" variable, and I am trying to find the minimum "Y" value in each group and subtract it from every value in that group's "Y" column. However, this isn't working as intended. Any ideas on how I can do this? Thank you.
This is a pretty simple use case for dplyr::group_by. Once you group the data.frame by your grouping variable, functions like min or max inside mutate are called on (and applied to) each group individually. So if we say Y = Y - min(Y), it will find the minimum value of Y in each group and subtract it from the values of Y in that group:
library(dplyr)

x %>%
  group_by(Group) %>%
  mutate(Y = Y - min(Y))
   Group     X     Y
   <int> <int> <int>
 1     1     1     0
 2     1     2     3
 3     1     3     6
 4     1     4     7
 5     2     1     0
 6     2     2     3
 7     2     3     7
 8     2     4     9
 9     2     5    15
10     3     1     0
11     3     2     1
12     3     3     2
13     3     4     6
14     3     5     7
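For reference, the same per-group rescaling works in base R with ave(), which applies a function within each group (a sketch, not part of the original answer):

# base R equivalent: subtract each group's minimum from its Y values
x$Y <- ave(x$Y, x$Group, FUN = function(y) y - min(y))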

gather() per grouped variables in R for specific columns

I have a long data frame with the decisions of players who worked in groups.
I need to convert the data in such a way that each row (individual observation) contains all of the group members' decisions (so we can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- rep(seq(1, 3), 2)
player_decision <- seq(10, 60, 10)
player_contribution <- seq(6, 1, -1)
df <- data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
  group_id player_id player_decision player_contribution
1        1         1              10                   6
2        1         2              20                   5
3        1         3              30                   4
4        2         1              40                   3
5        2         2              50                   2
6        2         3              60                   1
But I need to convert it to wide format per group, yet only for some of the variables (in this example, specifically player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id = c(1, 1),
           player_id = c(1, 2),
           player_decision = c(10, 20),
           player_1_contribution = c(6, 6),
           player_2_contribution = c(5, 5),
           player_3_contribution = c(4, 4))
  group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
Make a data frame with the columns for the players' contributions, then join this data frame back onto the columns of interest from the original data frame:
library(tidyr)
library(dplyr)

wide <- pivot_wider(df,
                    id_cols = -player_decision,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide)
answer
  group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1        1         1              10                     6                     5                     4
2        1         2              20                     6                     5                     4
3        1         3              30                     6                     5                     4
4        2         1              40                     3                     2                     1
5        2         2              50                     3                     2                     1
6        2         3              60                     3                     2                     1
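If the number of players per group is small and fixed, a join-free alternative (a sketch, not part of the original answer) is to look up each player's contribution inside a grouped mutate, relying on dplyr recycling the length-one result across the group:

library(dplyr)

df %>%
  group_by(group_id) %>%
  mutate(player_1_contribution = player_contribution[player_id == 1],
         player_2_contribution = player_contribution[player_id == 2],
         player_3_contribution = player_contribution[player_id == 3]) %>%
  ungroup() %>%
  select(-player_contribution)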

Find historical maximums in time series

I am trying to identify historical maximum records in time-series data. I need to identify maximum records only as they pertain to the data up to that point, not to the whole vector.
An example:
set.seed(431)
df <- data.frame(time = c(1:10), value = runif(10, 1, 10))
df
time value
1 1 7.758703
2 2 6.262849
3 3 8.281712
4 4 8.243617
5 5 6.781752
6 6 2.078103
7 7 4.455353
8 8 1.339119
9 9 3.635554
10 10 9.084619
What I want to do is produce the vector that identifies the following record high numbers moving forward in time:
time value record
1 1 7.758703 yes
2 2 6.262849 no
3 3 8.281712 yes
4 4 8.243617 no
5 5 6.781752 no
6 6 2.078103 no
7 7 4.455353 no
8 8 1.339119 no
9 9 3.635554 no
10 10 9.084619 yes
The value at time 1 is a record because no values exist prior to it, therefore it is the maximum. The value at time 3 is a record because it's higher than the one at time 1. The value at time 10 is a record because it's higher than the one at time 3.
All I have been able to do is find the maximum of the whole vector (i.e. identify the value at time 10), rather than of the vector up to the time point being considered. I tried mutate from dplyr, but couldn't make it work. Then I looked at writing a for loop that would append values to a vector and look for the maximum within that growing vector, but that led me to posts suggesting this was a more Pythonic than R-like way of doing things.
Can anyone help? I imagine this is easy.
An option is to take the cummax of 'value' and check whether it is equal to 'value':
library(dplyr)

df %>%
  mutate(record = c('no', 'yes')[(value == cummax(value)) + 1])
# A tibble: 10 x 3
# time value record
# <int> <dbl> <chr>
# 1 1 7.76 yes
# 2 2 6.26 no
# 3 3 8.28 yes
# 4 4 8.24 no
# 5 5 6.78 no
# 6 6 2.08 no
# 7 7 4.46 no
# 8 8 1.34 no
# 9 9 3.64 no
#10 10 9.08 yes
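Since cummax() is a base R function, the same logic also works without dplyr (a sketch, not part of the original answer):

# base R: a row is a record when its value equals the running maximum so far
df$record <- ifelse(df$value == cummax(df$value), "yes", "no")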

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to start a new group whenever the diff is larger than a certain value. I tried the code below. However, it does not work, since the dialog variable is not available inside the mutate() call that creates it.
library(tidyverse)
df <- data.frame(time = c(1, 2, 3, 4, 5, 510, 511, 512, 513), id = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
  mutate(t_diff = c(NA, diff(time))) %>%
  # This generates an error, as dialog is not yet available as a variable at this point
  mutate(dialog = ifelse(is.na(t_diff), id,
                         ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group, where the groups are split at points where the diff to the previous element is 500 or more.
Unfortunately, I have not found a clever way to achieve this efficiently in dplyr. Obviously, iterating over the data.frame with a loop would work, but it would be very inefficient.
Is there a way to achieve this in dplyr?
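One approach that avoids row-by-row recursion (a sketch of a common idiom, not an answer from the original thread) is to turn the large gaps into group boundaries with cumsum() and then take the first id of each group:

library(dplyr)

df %>%
  mutate(t_diff = c(NA, diff(time)),
         # a new group starts at the first row, or wherever the gap is >= 500
         grp = cumsum(is.na(t_diff) | t_diff >= 500)) %>%
  group_by(grp) %>%
  mutate(dialog = first(id)) %>%  # point every row at the first id of its group
  ungroup() %>%
  select(-grp)

This reproduces the desired dialog column above: cumsum() increments at every TRUE, so each run of rows between large gaps gets its own group index.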

How to create a function which loops through column index numbers in R?

Consider the following data frame (df):
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1"
a 2 6 0 x 2 19 12
a 4 2 7 x 3.5 7 11
b 1 9 4 x 7 1.5 4
b 7 5 11 x 8 12 5
I would like to create a new column showing the sum of the products of corresponding column pairs. To write less code, I address the columns by their index numbers. Unfortunately, I have no experience writing functions, so I ended up doing this manually, which is extremely tedious and not very elegant.
Here is a reproducible example of the data frame and what I have tried so far:
id <- c("a", "a", "b", "b")
df <- data.frame(id)
df$a1 <- as.numeric(c(2, 4, 1, 7))
df$b1 <- as.numeric(c(6, 2, 9, 5))
df$c1 <- as.numeric(c(0, 7, 4, 11))
df$not_relevant <- c("x", "x", "x", "x")
df$p_a1 <- as.numeric(c(2, 3.5, 7, 8))
df$p_b1 <- as.numeric(c(19, 7, 1.5, 12))
df$p_c1 <- as.numeric(c(12, 11, 4, 5))

library(dplyr)
df %>% mutate(total = .[[2]] * .[[6]] + .[[3]] * .[[7]] + .[[4]] * .[[8]])
This leads to the desired result but, as I mentioned, is not very efficient:
"id" "a1" "b1" "c1" "not_relevant" "p_a1" "p_b1" "p_c1" "total"
a 2 6 0 x 2 19 12 118.0
a 4 2 7 x 3.5 7 11 105.0
b 1 9 4 x 7 1.5 4 36.5
b 7 5 11 x 8 12 5 171.0
The real data I am working with has many more columns, so I would be glad if someone could show me a way to pack this operation into a function that loops through the column index numbers and matches the correct columns to each other.
Column indices are not a good way to do this. (Not a good way in general...)
Here's a simple dplyr method that assumes the columns are in the correct corresponding order (that is, it will give the wrong result if the "a1", "b1", "c1" columns are in a different order than the "p_a1", "p_b1", "p_c1" columns). You may also need to refine the selection criteria for your real data:
df$total <- rowSums(select(df, a1:c1) * select(df, starts_with("p_")))
df
#   id a1 b1 c1 not_relevant p_a1 p_b1 p_c1 total
# 1  a  2  6  0            x  2.0 19.0   12 118.0
# 2  a  4  2  7            x  3.5  7.0   11 105.0
# 3  b  1  9  4            x  7.0  1.5    4  36.5
# 4  b  7  5 11            x  8.0 12.0    5 171.0
The other good option would be to convert your data to a long format, where you have a single x column and a single p column, with an "index" column indicating the 1, 2, 3. Then the operation could be done by group, finally moving back to a wide format.
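As a sketch of that long-format route (my own illustration, not code from the original answer; rename_with() and the ".value" pivot assume dplyr and tidyr version 1.0 or later):

library(dplyr)
library(tidyr)

totals <- df %>%
  mutate(row = row_number()) %>%
  rename_with(~ paste0("v_", .x), a1:c1) %>%  # tag the value columns with a common prefix
  pivot_longer(matches("^(v|p)_"),
               names_to = c(".value", "index"),
               names_sep = "_") %>%           # one v column, one p column, plus an index column
  group_by(row) %>%
  summarise(total = sum(v * p))               # sum of products per original row

df$total <- totals$total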
