I have a data frame like this:
id day time state
<chr> <dbl> <dbl> <dbl>
1 A 1 1 0
2 A 1 2 0
3 A 1 3 1
4 A 2 1 0
5 A 2 2 1
6 A 2 3 1
7 A 3 1 1
8 A 3 2 1
9 A 3 3 0
In the original dataframe, there are 30 ids and every id has 5 days with 1440 timepoints (so 216000 rows in total).
Now I want to create a new variable called "delta" that records whether the state (1 or 0) is equal between two different timepoints (1 = equal, 0 = unequal).
For example:
Compare whether the state at day 1, time 1 equals the state at day 2, time 1; whether day 1, time 2 equals day 2, time 2; and so on,
and then whether day 2, time 1 equals day 3, time 1, and so on.
In the end it should look like this:
id day time state delta
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 1 0 1
2 A 1 2 0 0
3 A 1 3 1 1
4 A 2 1 0 0
5 A 2 2 1 1
6 A 2 3 1 0
7 A 3 1 1 1
8 A 3 2 1 0
9 A 3 3 0 1
I already tried some code with ifelse(), but I couldn't work it out yet.
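Assuming I have understood correctly: for the row at day i, time j, delta records whether the state at day i + 1, time j (next day, same time) is equal. The last day of each id has no following day, so its delta will be NA.
Here's a dplyr method:
library(dplyr)
your_data %>%
  group_by(id, time) %>%   # compare within each id, at the same time of day
  arrange(day) %>%         # make sure the days are in order
  mutate(delta = as.integer(lead(state) == state)) %>%
  ungroup()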
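As a rough check of the next-day comparison, here is a minimal sketch on made-up data with two ids. The `toy` data frame and its values are hypothetical, and grouping by both id and time is an assumption about the full data (30 ids), not something the question states outright.

```r
library(dplyr)

# Hypothetical toy data: two ids, three days, two timepoints per day
toy <- data.frame(
  id    = rep(c("A", "B"), each = 6),
  day   = rep(rep(1:3, each = 2), times = 2),
  time  = rep(1:2, times = 6),
  state = c(0, 0, 0, 1, 1, 1,   # id A
            1, 0, 1, 1, 0, 1)   # id B
)

res <- toy %>%
  group_by(id, time) %>%                 # same id, same time of day
  arrange(day, .by_group = TRUE) %>%     # days in order within each group
  mutate(delta = as.integer(lead(state) == state)) %>%
  ungroup() %>%
  arrange(id, day, time)                 # back to the original row order

res$delta
# 1 0 0 1 NA NA 1 0 0 1 NA NA
```

The last day of each id comes out as NA because lead() has nothing to compare against; whether that should be filled with some value is a separate decision.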
I am trying to create a similar_player_selected column. I already have the first 4 columns.
For row 1, player_id = 1 and the most similar player to player 1 is player 3. But player 3 (row 3) isn't selected for campaign 1 (player_selected = 0), so I assign a value of 0 to similar_player_selected for row 1. For row 2, player_id = 2 and the most similar player to player 2 is player 4. Player 4 is selected for campaign 1 (row 4), so I assign a value of 1 to similar_player_selected for row 2. Please note there are more than 1000 campaigns overall.
campaign_id player_id most_similar_player player_selected similar_player_selected
1 1 3 1 0
1 2 4 0 1
1 3 4 0 ?
1 4 1 1 ?
2 1 3 1 ?
2 2 4 1 ?
2 3 4 0 ?
2 4 1 0 ?
Using match(), we can subset player_selected at the matched locations within each campaign:
library(dplyr)
df |>
group_by(campaign_id) |>
mutate(
similar_player_selected = player_selected[match(most_similar_player, player_id)]
) |>
ungroup()
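Faster base R alternative (this assumes df is already sorted by campaign_id, because split() returns the groups in campaign order)
df$similar_player_selected <- lapply(split(df, df$campaign_id), \(x)
  with(x, player_selected[match(most_similar_player, player_id)])) |>
  unlist()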
campaign_id player_id most_similar_player player_selected similar_player_selected
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 3 1 0
2 1 2 4 0 1
3 1 3 4 0 1
4 1 4 1 1 1
5 2 1 3 1 0
6 2 2 4 1 0
7 2 3 4 0 0
8 2 4 1 0 1
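To see what match() is doing in both versions, here is a small sketch using campaign 1 from the example as plain vectors. The player 9 edge case is hypothetical and only illustrates what happens when a similar player does not appear in the campaign.

```r
# Campaign 1 from the example, as plain vectors
player_id           <- c(1, 2, 3, 4)
most_similar_player <- c(3, 4, 4, 1)
player_selected     <- c(1, 0, 0, 1)

# match() gives, for each most_similar_player, the position of that player_id
pos <- match(most_similar_player, player_id)
pos
# 3 4 4 1

# Indexing by those positions looks up the similar player's selection flag
similar_player_selected <- player_selected[pos]
similar_player_selected
# 0 1 1 1

# Hypothetical edge case: player 9 is not in the campaign, so the lookup is NA
missing_case <- player_selected[match(c(3, 9), player_id)]
missing_case
# 0 NA
```

Grouping by campaign_id matters because the positions returned by match() are only meaningful within one campaign's rows.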
Given the following data I would like to create a new column new_sequence based on the condition:
If an id is present only once, the new value should be 0. If an id is present several times, the new value should be numbered according to the values in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried the code below, but it does not work because the first occurrence of every id is coded as 0:
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))
Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
You can also try the following. After grouping by id, check whether the number of rows in the group, n(), is 1. Use a plain if and else instead of ifelse(), because the condition has length 1 while sequence has one value per row in the group.
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
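The point about lengths can be checked directly. The sketch below reuses `dat` from the question; `wrong` and `right` are names introduced here for illustration. It shows that ifelse() with a length-1 test returns a single value, which mutate() then recycles over the whole group.

```r
library(dplyr)

dat <- tibble(id = c(1, 2, 3, 3, 3, 4, 4),
              sequence = c(1, 1, 1, 2, 3, 1, 2))

# ifelse() returns a result as long as its test, so only the first value survives
ifelse(FALSE, 0, c(1, 2, 3))
# 1

# Inside a group, n() == 1 has length 1, so each group gets one recycled value
wrong <- dat %>%
  group_by(id) %>%
  mutate(new_sequence = ifelse(n() == 1, 0, sequence)) %>%
  ungroup()
wrong$new_sequence
# 0 0 1 1 1 1 1

# A plain if/else returns the whole sequence vector for multi-row groups
right <- dat %>%
  group_by(id) %>%
  mutate(new_sequence = if (n() == 1) 0 else sequence) %>%
  ungroup()
right$new_sequence
# 0 0 1 2 3 1 2
```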
Thank you in advance for your time reading this post. I have a data.frame that looks like this
time offspring
1 1
1 2
2 1
2 5
3 1
3 4
and I would like to check if the offspring of every time point match the offspring of the last time point. To be more explicit I would like to see if the offspring of the time point 1 and time point 2 are present in the timepoint 3.
When this is the case, I would like to assign the offspring the value 1 in a new column, and otherwise the value 0.4.
For example
time offspring alpha
1 1 1
1 2 0.4
2 1 1
2 5 0.4
3 1 1
3 4 1
Any help and comment are highly appreciated.
One dplyr option could be:
df %>%
group_by(offspring) %>%
mutate(alpha = pmax(0.4, all(1:3 %in% time)))
time offspring alpha
<int> <int> <dbl>
1 1 1 1
2 1 2 0.4
3 2 1 1
4 2 5 0.4
5 3 1 1
6 3 4 0.4
If offspring that are present only at time point three should also be given the value 1:
df %>%
group_by(offspring) %>%
mutate(alpha = pmax(0.4, all(1:3 %in% time) | unique(time) == 3))
time offspring alpha
<int> <int> <dbl>
1 1 1 1
2 1 2 0.4
3 2 1 1
4 2 5 0.4
5 3 1 1
6 3 4 1
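If the number of time points is not fixed at three, one possible variation is to compare every offspring against those present at the last time point. This is a sketch under the assumption that "present at max(time)" is the criterion the question describes; `last_offspring` and `res` are names introduced here.

```r
library(dplyr)

df <- data.frame(time      = c(1, 1, 2, 2, 3, 3),
                 offspring = c(1, 2, 1, 5, 1, 4))

# Offspring observed at the last time point (assumed to be max(time))
last_offspring <- df$offspring[df$time == max(df$time)]

# 1 if the offspring also appears at the last time point, 0.4 otherwise
res <- df %>%
  mutate(alpha = ifelse(offspring %in% last_offspring, 1, 0.4))

res$alpha
# 1.0 0.4 1.0 0.4 1.0 1.0
```

This reproduces the alpha column requested in the question without hard-coding 1:3.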
For example, if I have records like:
A B
1 2
2 3
3 1
1 2
2 1
Let's say one cycle runs from 1 (to 2 to 3) back to 1, so I need my data frame to look like
No. A B
cycle1 1 2
cycle1 2 3
cycle1 3 1
cycle2 1 2
cycle2 2 1
Or, even better for me, I just need to record how many times the same record has appeared so far, like
Time A B
Time1 1 2
Time1 2 3
Time1 3 1
Time2 1 2
Time1 2 1
I need to do this because I have to use the summarize function in dplyr for a calculation, but I cannot group the data by A and B directly. The order of the data is also important.
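Is this what you want?
library(zoo)
T1 <- which(df$A == 1)                # rows where a new cycle starts
T2 <- paste('cycle', seq_along(T1))   # one label per cycle
df$No <- NA
df$No[T1] <- T2
df$No <- na.locf(df$No)               # carry each label forward to the following rows
df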
A B No
1 1 2 cycle 1
2 2 3 cycle 1
3 3 1 cycle 1
4 1 2 cycle 2
5 2 1 cycle 2
# For the second format: number each repeat of an (A, B) pair while keeping the row order
library(dplyr)
df %>% group_by(A, B) %>% mutate(Time = paste('Time', row_number()))
A B Time
<int> <int> <chr>
1 1 2 Time 1
2 2 3 Time 1
3 3 1 Time 1
4 1 2 Time 2
5 2 1 Time 1
Create an augmented 'diff' variable, c(-1, diff(your_var)). Within a sequence group the difference is 1 (positive), so start a new group wherever it is negative; the leading -1 makes the first row start group 1. (My first iteration on the algorithm wasn't quite correct, so I modified it slightly.)
dat %>% as_tibble() %>% mutate(G = cumsum( c(-1, diff(A)) < 0 ) )
# A tibble: 5 x 3
A B G
<int> <int> <int>
1 1 2 1
2 2 3 1
3 3 1 1
4 1 2 2
5 2 1 2
dat %>% as_tibble() %>% mutate(G = paste0( "time", cumsum( c(-1, diff(A)) < 0 ) ))
# A tibble: 5 x 3
A B G
<int> <int> <chr>
1 1 2 time1
2 2 3 time1
3 3 1 time1
4 1 2 time2
5 2 1 time2
One could also test for A == 1, but then sequences like 1,2,3,2,3,4 would not get properly split.
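The difference between the two tests can be seen on that very sequence. A minimal base R sketch (the names `by_drop` and `by_one` are introduced here for illustration):

```r
A <- c(1, 2, 3, 2, 3, 4)

# Augmented differences: the leading -1 marks the first row as a group start
d <- c(-1, diff(A))
d
# -1 1 1 -1 1 1

# A new group begins at every negative difference
by_drop <- cumsum(d < 0)
by_drop
# 1 1 1 2 2 2

# Testing for A == 1 only finds the first start, so the second run is missed
by_one <- cumsum(A == 1)
by_one
# 1 1 1 1 1 1
```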
I work in the healthcare industry and I'm using machine learning algorithms to develop a model to predict when patients will not show up for their appointments. I'm trying to create a new feature that will be the sum of each patient's most recent consecutive no-shows. I've looked around a lot on stackoverflow and other resources, but cannot find exactly what I'm looking for. As an example, if a patient has no-showed her past two most recent appointments, then every row of the new feature's column with her ID will be filled in with 2's. If she no-showed three times, but showed up for her most recent appointment, then the new column will be filled in with 0's.
I tried using plyr's ddply with cumsum, but it did not give me the results I'm looking for. I used:
ddply(a, .(ID), transform, ConsecutiveNoshows = cumsum(Noshow))
Here is an example data set ('1' signifies a no-show):
ID Noshow
1 1
1 1
1 0
1 0
1 1
2 0
2 1
2 1
3 1
3 0
3 1
3 1
3 1
This is my desired outcome:
ID Noshow ConsecutiveNoshows
1 1 2
1 1 2
1 0 2
1 0 2
1 1 2
2 0 0
2 1 0
2 1 0
3 1 1
3 0 1
3 1 1
3 1 1
3 1 1
I'll be very grateful for any help. Thank you.
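The idea is to count, for each ID, the number of no-shows before the first 0 appears. This relies on the rows of each ID being ordered from most recent to oldest, as they are in the example.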
library(dplyr)
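df %>%
  group_by(ID) %>%
  # cumsum(Noshow == 0) stays at 0 until the first attended appointment,
  # so counting the rows where it is still below 1 gives the leading no-show streak
  mutate(ConsecutiveNoshows = sum(!(cumsum(Noshow == 0) >= 1)))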
Which gives:
#Source: local data frame [13 x 3]
#Groups: ID [3]
#
# ID Noshow ConsecutiveNoshows
# <int> <int> <int>
#1 1 1 2
#2 1 1 2
#3 1 0 2
#4 1 0 2
#5 1 1 2
#6 2 0 0
#7 2 1 0
#8 2 1 0
#9 3 1 1
#10 3 0 1
#11 3 1 1
#12 3 1 1
#13 3 1 1
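To make the cumsum() step easier to follow, here is the same logic for a single patient in base R. The helper name `recent_noshows` is hypothetical, and the sketch assumes the vector is ordered from most recent to oldest appointment, as in the example data.

```r
# Hypothetical helper: length of the leading run of no-shows (1s)
recent_noshows <- function(noshow) {
  # Number of attended appointments (0s) seen so far, row by row
  shows_so_far <- cumsum(noshow == 0)
  # Rows before the first attended appointment still have a count of 0
  sum(shows_so_far == 0)
}

cumsum(c(1, 1, 0, 0, 1) == 0)
# 0 0 1 2 2

recent_noshows(c(1, 1, 0, 0, 1))   # ID 1
# 2
recent_noshows(c(0, 1, 1))         # ID 2
# 0
recent_noshows(c(1, 0, 1, 1, 1))   # ID 3
# 1
```

If the data were instead stored oldest first, applying the helper to rev(noshow) would be one way to get the same count.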