I am working with a dataset of animal behaviors, and am trying to create a new column ("environment") based on conditions fulfilled in other rows. Specifically, I want the new column to return "water" if the behavior falls between the start/stop times of the behavior "o_water", and "land" if it falls outside these bounds. If this is unclear, here is a minimal example:
library(dplyr)
library(magrittr)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
# This works, but the bounds are typed in by hand. I need to find them automatically across a very large dataset.
otters <- otters %>%
group_by(subject, observation_id, behavior) %>%
mutate(environment = ifelse((start_time >= 1 & stop_time <= 5) |
(start_time >= 11 & stop_time <= 15), "water", "land"))
This is the output desired.
Groups: subject, observation_id, behavior [5]
observation_id subject behavior start_time stop_time environment
<dbl> <dbl> <fct> <dbl> <dbl> <chr>
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
The second set of commands is sort of what I want, but I need this to search out and apply it to an entire dataset rather than typing out each parameter. The grouping is so the functions are performed over the applicable rows; in the full dataset, there are multiple subjects and observation_id's.
I've tried using when() and case_when() to no avail, but I am very novice level at R so would appreciate any help!
Apologies for any missteps I've done. I haven't been able to find a problem quite like this elsewhere on stackoverflow.
Here is another approach with dplyr that also uses fuzzyjoin package.
You can separate your o_water behavior rows from otters and designate the environment as water.
Then, with fuzzy_left_join, merge the o_water rows with the rest of your data, where the start_time and stop_time fall within the o_water range.
The remaining NA in environment will be non-merged rows, which can be land or other designation.
library(dplyr)
library(tidyr) # provides replace_na()
library(fuzzyjoin)
otters_water <- otters %>%
filter(behavior == "o_water") %>%
mutate(environment = "water") %>%
select(-behavior)
otters %>%
fuzzy_left_join(otters_water,
by = c("subject", "observation_id", "start_time", "stop_time"),
match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
replace_na(list(environment = "land")) %>%
select(c(observation_id.x:stop_time.x, environment))
Output
observation_id.x subject.x behavior start_time.x stop_time.x environment
1 1 1 o_water 1 5 water
2 1 1 swim 1 3 water
3 1 1 float 2 4 water
4 1 1 o_land 6 10 land
5 1 1 walk 6 9 land
6 1 1 o_water 11 15 water
7 1 1 float 11 14 water
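For comparison, the same interval check can be sketched in base R without a join. This is only a minimal sketch: it assumes the otters data frame from the question and treats a behavior as "water" only when it lies entirely inside an o_water bout for the same subject and observation_id. The helper names water and in_water are illustrative, not part of the original answer.

```r
# Data from the question
otters <- data.frame(
  observation_id = 1,
  subject = 1,
  behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
  start_time = c(1, 1, 2, 6, 6, 11, 11),
  stop_time = c(5, 3, 4, 10, 9, 15, 14)
)

# All o_water bouts (hypothetical helper table)
water <- otters[otters$behavior == "o_water", ]

# For each row, is it fully contained in any o_water bout
# of the same subject and observation?
in_water <- vapply(seq_len(nrow(otters)), function(i) {
  any(water$subject == otters$subject[i] &
      water$observation_id == otters$observation_id[i] &
      water$start_time <= otters$start_time[i] &
      water$stop_time >= otters$stop_time[i])
}, logical(1))

otters$environment <- ifelse(in_water, "water", "land")
otters
```

This loops over rows, so on a very large dataset the join-based version above is likely to scale better.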
Great job on your question.
I think rearranging your dataset will help a lot here. I'd suggest rearranging it so that each time point has only one record (per individual otter, perhaps), and individual behaviors each have their own column, with binary data indicating whether or not that behavior is occurring at each time point.
There's a lot of rearranging that happens in the first few lines; I'd suggest stepping through the code one line at a time just to see how each line moves the data around.
Using the data you provided:
library(tidyverse)
otters_wide <- otters %>%
# first pivot to a longer form, so the time values are all in one column
pivot_longer(cols = c("start_time", "stop_time"), names_to = "start_stop", values_to = "time", names_pattern = "(.*)_time") %>%
# then pivot to a wider format, so each behavior has its own column.
pivot_wider(names_from = "behavior", values_from = "start_stop") %>%
#Then arrange everything in order of time.
arrange(time) %>%
#Fill behavior columns downward
fill(o_water, swim, float, o_land, walk) %>%
#change all "start"s and the first "stop" in each series to "yes", and all other "stop"s to "no"
mutate_at(.vars = c("o_water", "swim", "float", "o_land", "walk"), ~ if_else(. == "start" | lag(.) == "start", "yes", "no")) %>%
# this column is a little redundant now, but here's the water/land column, at last.
mutate(environment = if_else(o_water == "yes", "water", "land"))
otters_wide
# A tibble: 11 x 9
observation_id subject time o_water swim float o_land walk environment
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 1 yes yes NA NA NA water
2 1 1 2 yes yes yes NA NA water
3 1 1 3 yes yes yes NA NA water
4 1 1 4 yes no yes NA NA water
5 1 1 5 yes no no NA NA water
6 1 1 6 no no no yes yes land
7 1 1 9 no no no yes yes land
8 1 1 10 no no no yes no land
9 1 1 11 yes no yes no no water
10 1 1 14 yes no yes no no water
11 1 1 15 yes no no no no water
In this alternate format you could go even further and include time points that weren't explicitly recorded (e.g. times 7, 8, 12 & 13, in this example) which, once filled in, would make summarizing things like total time spent on each behavior much more straightforward.
Since the behavioral data is binary, those columns could certainly contain logical data rather than character data, but because they started as character data, keeping them that way seemed simplest.
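Filling in the unrecorded time points could be sketched with tidyr's complete() and full_seq(). The example below is a minimal sketch on a made-up three-row table (obs is hypothetical, not the otters_wide result), and it assumes time advances in steps of 1.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format observations with gaps at times 3 and 4
obs <- tibble(
  time = c(1, 2, 5),
  o_water = c("yes", "yes", "no")
)

filled <- obs %>%
  # add a row for every missing time point between min(time) and max(time)
  complete(time = full_seq(time, period = 1)) %>%
  # carry the last known state forward into the new rows
  fill(o_water)

filled
```

With one row per time unit, counting rows per state gives a rough measure of time spent in each state.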
Hope this helps.
Here is another approach, in which I extract all the start_ and stop_times of each behavior of o_water into a list. As there are 2 entries for o_water these lists have two elements.
With the help of purrr I create a list pu of vectors which span the start_ and stop_time for each entry.
purrr also helps to find which rows in otters fall within the sequences defined in pu. l_true is the row-wise sum of the resulting TRUE/FALSE columns: it is 1 if the row matches one of the time sequences and 0 if it matches none.
This vector is bound to otters, and ifelse then turns it into the new "water"/"land" column.
library(tidyverse)
otters <- data.frame(
observation_id = 1,
subject = 1,
behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
start_time = c(1,1,2,6,6,11,11),
stop_time = c(5,3,4,10,9,15,14)
)
# otters
# find all start_times of 'o_water'
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(1) -> start
start
#> [1] 1 11
# find all stop_times
otters %>%
dplyr::filter(grepl('water', behavior)) %>%
select(ends_with('time')) %>%
`[[`(2) -> stop
stop
#> [1] 5 15
# bring start and stop_times together in one
# list per 'o_water'
pu <- purrr::map2(start, stop, ~ .x : .y)
pu
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 11 12 13 14 15
# check with pu, if start_ and stop_time of each row is
# in pu, and combine the row of FALSE/TRUE into a list
l_true <- map_dfc(pu, ~ otters$start_time %in% . & otters$stop_time %in% .) %>%
mutate(l = rowSums(.[1:ncol(.)])) %>% `[[`(ncol(.))
otters %>%
cbind(., l_true) %>%
mutate(ev = ifelse(l_true == 1, 'water', 'land'))
#> observation_id subject behavior start_time stop_time l_true ev
#> 1 1 1 o_water 1 5 1 water
#> 2 1 1 swim 1 3 1 water
#> 3 1 1 float 2 4 1 water
#> 4 1 1 o_land 6 10 0 land
#> 5 1 1 walk 6 9 0 land
#> 6 1 1 o_water 11 15 1 water
#> 7 1 1 float 11 14 1 water
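One caveat with the approach above is that start:stop combined with %in% only matches whole-number times. A variation, sketched here under the assumption that times might be fractional, compares each row against the bounds directly; inside and in_any are illustrative names.

```r
library(purrr)

otters <- data.frame(
  behavior = c("o_water", "swim", "float", "o_land", "walk", "o_water", "float"),
  start_time = c(1, 1, 2, 6, 6, 11, 11),
  stop_time = c(5, 3, 4, 10, 9, 15, 14)
)

# Bounds of every o_water bout
start <- otters$start_time[otters$behavior == "o_water"]
stop <- otters$stop_time[otters$behavior == "o_water"]

# One logical vector per bout: is each row inside that bout?
inside <- map2(start, stop,
               ~ otters$start_time >= .x & otters$stop_time <= .y)

# A row is in water if it is inside any bout
in_any <- reduce(inside, `|`)

otters$ev <- ifelse(in_any, "water", "land")
otters
```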
I've got some time series data where both the steps of the sequence (ranging from 1 to 8) as well as its topic (>100) are encoded as character factor levels within a single variable. Here is a minimal example (I omitted timestamps which would be increasing within each id):
id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward", "resolved", "call", "agent", "beach", "resolved")
(df <- data.frame(id,step))
id step
1 1 call
2 2 call
3 2 agent
4 2 forest
5 2 forward
6 2 resolved
7 3 call
8 3 agent
9 3 beach
10 3 resolved
I now want to split this information into two dedicated variables (step and topic), thereby shrinking the dataframe in rows and making it wider, while also repeating the topic for each row of the time series and adding an "NA" when there is no topic. Using base R to split this into two dataframes and merging them back together gets the job done:
step <- subset(df, step %in% c("call", "agent", "forward", "resolved"))
topic <- subset(df, step %in% c("forest", "beach"))
topic$topic <- topic$step
topic$step <- NULL
(newdf <- merge(step,topic, all=TRUE))
id step topic
1 1 call <NA>
2 2 call forest
3 2 agent forest
4 2 forward forest
5 2 resolved forest
6 3 call beach
7 3 agent beach
8 3 resolved beach
This is somewhat clunky though and I'm looking for a more elegant dplyr/tidyverse approach to this. pivot_wider() doesn't seem to be able to do this. Any ideas?
This isn't particularly elegant, but it works:
steps <- c("call", "agent", "forward", "resolved")
df %>%
mutate(type = ifelse(step %in% steps, "step", "topic"),
row = cumsum(type == "step")) %>%
pivot_wider(names_from = type, values_from = step) %>%
group_by(id) %>%
fill(topic, .direction = "updown") %>%
ungroup()
# A tibble: 8 x 4
id row step topic
<dbl> <int> <chr> <chr>
1 1 1 call NA
2 2 2 call forest
3 2 3 agent forest
4 2 4 forward forest
5 2 5 resolved forest
6 3 6 call beach
7 3 7 agent beach
8 3 8 resolved beach
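If pivoting feels heavy for this, a grouped lookup may be enough. The following is a minimal sketch, assuming each id has at most one topic row and that the topic values are known in advance (the topics vector here is an assumption taken from the example data).

```r
library(dplyr)

id <- c(1, rep(2, 5), rep(3, 4))
step <- c("call", "call", "agent", "forest", "forward",
          "resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id, step, stringsAsFactors = FALSE)

topics <- c("forest", "beach")

result <- df %>%
  group_by(id) %>%
  # first topic value within the id, or NA when the id has none
  mutate(topic = step[step %in% topics][1]) %>%
  ungroup() %>%
  # drop the rows that only carried the topic
  filter(!step %in% topics)

result
```

Indexing an empty vector with [1] returns NA, which is what gives id 1 its missing topic.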
Thanks for providing a minimal example of your problem
id <- c(1,rep(2,5),rep(3,4))
step <- c("call", "call", "agent", "forest", "forward",
"resolved", "call", "agent", "beach", "resolved")
df <- data.frame(id,step)
df
#> id step
#> 1 1 call
#> 2 2 call
#> 3 2 agent
#> 4 2 forest
#> 5 2 forward
#> 6 2 resolved
#> 7 3 call
#> 8 3 agent
#> 9 3 beach
#> 10 3 resolved
This is a possible solution using tidyverse
library(dplyr)
library(tidyr)
df %>%
# define in column type_c whether step is a step or a topic
# you need a unique id for each row to use pivot_wider in this case
mutate(
type_c = if_else(step %in% c("forest", "beach"), "topic", "step"),
unique_id = 1:nrow(df)) %>%
pivot_wider(names_from = type_c, values_from = c(id, step)) %>%
mutate(id = coalesce(id_step, id_topic)) %>%
select(id, step = step_step, topic = step_topic) %>%
# Need group_by to apply the function fill
group_by(id) %>%
# fill replaces NA, in each id, with a value found in any direction "downup"
fill(topic, .direction = "downup") %>%
# get rid of the NA in column step that pivot_wider created for each topic
filter(!is.na(step))
#> # A tibble: 8 x 3
#> # Groups: id [3]
#> id step topic
#> <dbl> <chr> <chr>
#> 1 1 call <NA>
#> 2 2 call forest
#> 3 2 agent forest
#> 4 2 forward forest
#> 5 2 resolved forest
#> 6 3 call beach
#> 7 3 agent beach
#> 8 3 resolved beach
Created on 2021-06-08 by the reprex package (v0.3.0)
I am still new to R and learning methods for conducting analysis. I have a df in which I want to count consecutive wins/losses based on column "x9", which holds the gain/loss (positive or negative value) for each trade entered. I found some code that assigns a sign, a lagged sign, and a change flag; however, I am looking for a counter that counts consecutive wins until a loss occurs, resets, and then counts consecutive losses until a win occurs. Overall, I am looking for help adjusting the counter so that it resets whenever a run of consecutive wins/losses is interrupted. I have some sample code below and an attached .png to explain my thoughts.
#Read in df
df=vroom::vroom(file = "analysis.csv")
#Filter df for specific order types
df1 = filter(df, (x3=="s/l") |(x3=="t/p"))
#Create additional column to tag wins/losses in df1
index <- c("s/l","t/p")
values <- c("Loss", "Win")
df1$col2 <- values[match(df1$x3, index)]
df1
#Mutate df to review changes, attempt to review consecutive wins and losses & reset when a
#positive / negative value is encountered
df2=df1 %>%
mutate(sign = ifelse(x9 > 0, "pos", ifelse(x9 < 0, "neg", "zero")), # get the sign of the value
sign_lag = lag(sign, default = sign[9]), # get previous value (exception in the first place)
change = ifelse(sign == sign_lag, 1 , 0), # check if there's a change
series_id = cumsum(change)+1) %>% # create the series id
print() -> dt2
I think you can use rle for this. By itself, it doesn't immediately provide a grouping-like functionality, but we can either use data.table::rleid or construct our own function:
# borrowed from https://stackoverflow.com/a/62007567/3358272
myrleid <- function(x) {
rl <- rle(x)$lengths
rep(seq_along(rl), times = rl)
}
library(dplyr)
x9 <- c(-40.57,-40.57,-40.08,-40.08,-40.09,-40.08,-40.09,-40.09,-39.6,-39.6,-49.6,-39.6,-39.61,-39.12,-39.12,-39.13,782.58,-41.04)
tibble(x9) %>%
mutate(grp = myrleid(x9 > 0)) %>%
group_by(grp) %>%
mutate(row = row_number()) %>%
ungroup()
# # A tibble: 18 x 3
# x9 grp row
# <dbl> <int> <int>
# 1 -40.6 1 1
# 2 -40.6 1 2
# 3 -40.1 1 3
# 4 -40.1 1 4
# 5 -40.1 1 5
# 6 -40.1 1 6
# 7 -40.1 1 7
# 8 -40.1 1 8
# 9 -39.6 1 9
# 10 -39.6 1 10
# 11 -49.6 1 11
# 12 -39.6 1 12
# 13 -39.6 1 13
# 14 -39.1 1 14
# 15 -39.1 1 15
# 16 -39.1 1 16
# 17 783. 2 1
# 18 -41.0 3 1
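If only the within-streak counter is needed, base R's sequence() applied to the run lengths gives it directly, without a grouping step. This is a minimal sketch on made-up values (not the asker's file); it assumes a win is any x9 > 0 and counts everything else as a loss.

```r
# Hypothetical gains/losses
x9 <- c(-40, -39, 12, 8, 5, -3, 20)

is_win <- x9 > 0
runs <- rle(is_win)

# Counter that restarts at 1 whenever the win/loss state flips
streak <- sequence(runs$lengths)

out <- data.frame(
  x9 = x9,
  result = ifelse(is_win, "Win", "Loss"),
  streak = streak
)
out

# Longest run of consecutive wins
longest_win_streak <- max(runs$lengths[runs$values])
```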
https://www.kaggle.com/nowke9/ipldata ----- Contains the IPL Data.
This is an exploratory study of the IPL data set (the link to the data is above). After merging both files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into one single data frame. When I combine them into a single variable named batsman_aggregate and select only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is displayed as:-
Error: `select()` doesn't handle lists.
The required output is a data set containing the newly constructed variables.
You'll have to join those four tables, not combine using c.
The join type is left_join, so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've omitted the by argument, since dplyr will default to c("id", "inning", "batsman"), the only three columns common to all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of them apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
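The join-then-fill pattern from this answer can be tried on a toy example before running it on the full IPL data. The sketch below uses made-up tables (runs, fours and sixes are hypothetical stand-ins for the per-batsman summaries) and spells out the by columns explicitly rather than relying on the default.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical per-batsman summaries; batsman "B" hit no boundaries
runs <- tibble(id = c(1, 1), batsman = c("A", "B"), total_runs = c(14, 0))
fours <- tibble(id = 1, batsman = "A", fours_hit = 2L)
sixes <- tibble(id = 1, batsman = "A", sixes_hit = 1L)

agg <- list(runs, fours, sixes) %>%
  # left-join the tables one after another on the shared keys
  reduce(function(x, y) left_join(x, y, by = c("id", "batsman"))) %>%
  # batsmen missing from a table get 0 instead of NA
  replace_na(list(fours_hit = 0L, sixes_hit = 0L))

agg
```

Passing a list to reduce() is what c() was presumably meant to achieve in the question: the tables stay separate data frames until they are joined.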
I am working on a dataset in which I need to calculate how long it takes a retail store to replenish some products after a shortage. Here is a quick view of the dataset in its simplest form:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
When the Net_Available_Qty becomes negative, it means there is a shortage. When it turns back to 0 or positive qty, it means the supply has been recovered. What I need to calculate is the days between when we first see shortage and when it is recovered. In this case, for the 1st shortage, it took 2 days to recover and for the second shortage, it took 3 days to recover.
A tidyverse solution would be most welcome.
I hope someone else finds a cleaner solution. But this produces diffDate which assigns the date difference from when a negative turns positive/zero.
sample_df %>%
mutate(sign = ifelse(Net_Available_Qty > 0, "pos", ifelse(Net_Available_Qty < 0, "neg", "zero")),
sign_lag = lag(sign, default = sign[1]), # get previous value (exception in the first place)
change = ifelse(sign != sign_lag, 1 , 0), # check if there's a change
sequence=sequence(rle(as.character(sign))$lengths)) %>%
group_by(sequence) %>%
mutate(diffDate = as.numeric(difftime(Date, lag(Date,1))),
diffDate=ifelse(Net_Available_Qty <0, NA, ifelse((sign=='pos'| sign=='zero') & sequence==1, diffDate, NA))) %>%
ungroup() %>%
select(Date, Product, Net_Available_Qty, diffDate)
Schilker had a great idea using rle. I am building on his answer and offering a slightly shorter version that also uses cumsum:
Date <- c("2019-1-1","2019-1-2","2019-1-3","2019-1-4","2019-1-5","2019-1-6","2019-1-7","2019-1-8")
Product <- rep("Product A",8)
Net_Available_Qty <- c(-2,-2,10,8,-5,-6,-7,0)
sample_df <- data.frame(Date,Product,Net_Available_Qty)
library(tidyverse)
sample_df %>%
mutate(
diffDate = c(1, diff(as.Date(Date))),
sequence = sequence(rle(Net_Available_Qty >= 0)$lengths),
group = cumsum(c(TRUE, diff(sequence)) != 1L)
) %>%
group_by(group) %>%
mutate(n_days = max(cumsum(diffDate)))
#> # A tibble: 8 x 7
#> # Groups: group [4]
#> Date Product Net_Available_Qty diffDate sequence group n_days
#> <fct> <fct> <dbl> <dbl> <int> <int> <dbl>
#> 1 2019-1-1 Product A -2 1 1 0 2
#> 2 2019-1-2 Product A -2 1 2 0 2
#> 3 2019-1-3 Product A 10 1 1 1 2
#> 4 2019-1-4 Product A 8 1 2 1 2
#> 5 2019-1-5 Product A -5 1 1 2 3
#> 6 2019-1-6 Product A -6 1 2 2 3
#> 7 2019-1-7 Product A -7 1 3 2 3
#> 8 2019-1-8 Product A 0 1 1 3 1
Created on 2020-02-23 by the reprex package (v0.3.0)
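To get one row per shortage rather than a repeated n_days column, the runs can be summarised. This is a minimal sketch under two assumptions: Date is parsed as a real Date, and recovery is measured from the first shortage day to the first following day with a non-negative quantity. The column names short, run and next_date are illustrative.

```r
library(dplyr)

sample_df <- data.frame(
  Date = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                   "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08")),
  Product = "Product A",
  Net_Available_Qty = c(-2, -2, 10, 8, -5, -6, -7, 0)
)

shortages <- sample_df %>%
  mutate(
    short = Net_Available_Qty < 0,
    # run id increases each time the shortage state flips
    run = cumsum(short != lag(short, default = first(short))),
    # date of the following observation
    next_date = lead(Date)
  ) %>%
  filter(short) %>%
  group_by(run) %>%
  summarise(
    shortage_start = min(Date),
    recovered_on = max(next_date),
    days_to_recover = as.numeric(recovered_on - shortage_start)
  )

shortages
```

On the sample data this gives recovery times of 2 and 3 days, matching the figures stated in the question. A shortage still open on the last row would come out as NA. For several products, adding Product to the grouping would be the natural extension.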
I would like to generate a dummy treatment variable "treatment" based on country variable "iso" and earthquakes dummy variable "quake" (for dataset "data").
I would basically like a dummy variable "treatment" such that, if quake==1 at least once for a country in my entire timeframe (let's say 2000-2018), all observations for that "iso" have "treatment"==1, and all observations for every other country have "treatment"==0. So countries that are affected by earthquakes have all observations equal to 1, and others 0.
I have tried using dplyr but since I'm still very green at R, it has taken me multiple tries and I haven't found a solution yet. I've looked on this website and google.
I suspect the solution should be something along the lines of the following, but I can't finish it myself:
data %>%
filter(quake==1) %>%
group_by(iso) %>%
mutate(treatment)
Welcome to StackOverflow ! You should really consider Sotos's links for your next questions on SO :)
Here is a dplyr solution (following what you started) :
## data
set.seed(123)
data <- data.frame(year = rep(2000:2002, each = 26),
iso = rep(LETTERS, times = 3),
quake = sample(0:1, 26*3, replace = TRUE))
## solution (dplyr option)
library(dplyr)
data2 <- data %>% arrange(iso) %>%
group_by(iso) %>%
mutate(treatment = if_else(sum(quake) == 0, 0, 1))
data2
# A tibble: 78 x 4
# Groups: iso [26]
year iso quake treatment
<int> <fct> <int> <dbl>
1 2000 A 0 1
2 2001 A 1 1
3 2002 A 1 1
4 2000 B 1 1
5 2001 B 1 1
6 2002 B 0 1
7 2000 C 0 1
8 2001 C 0 1
9 2002 C 1 1
10 2000 D 1 1
# ... with 68 more rows
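With random data nearly every country ends up treated, which makes the result hard to check by eye. Here is a minimal sketch on a small hand-made table (the values are hypothetical) that uses any() in place of the sum() test, assuming quake is coded 0/1 with no missing values.

```r
library(dplyr)

# Hypothetical panel: country A has one quake, country B has none
data <- data.frame(
  year = rep(2000:2002, times = 2),
  iso = rep(c("A", "B"), each = 3),
  quake = c(0, 1, 0, 0, 0, 0)
)

data2 <- data %>%
  group_by(iso) %>%
  # 1 for every row of a country that had at least one quake, else 0
  mutate(treatment = as.integer(any(quake == 1))) %>%
  ungroup()

data2
```

Note that filtering on quake==1 first, as in the question's attempt, would drop the untreated rows before the dummy can be assigned, so the grouping is applied to the full data instead.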