My dataset is large, containing many observations (Dependent variable = DV) on individuals (Name) across set periods (Period) of a testing session. A small example of my dataset is as follows:
ExampleData <- data.frame(Name = c("Tom","Tom","Tom","Tom","Tom","Tom","Tom","Tom", "Tom", "Tom",
"Ben","Ben","Ben","Ben","Ben","Ben","Ben","Ben", "Ben", "Ben"),
Period = c(0,0,1,1,1,0,0,0,1,1,
0,0,0,1,1,1,0,0,1,1),
DV = runif(20, 1.5, 2.8))
When ExampleData$Period==1 an individual is undergoing an exercise test, which varies in time/ length. Breaks in between each test are represented by ExampleData$Period==0. To avoid manually entering when a person is undergoing a test and adding the sequential periods in, I wish to include a column that declares when a group of 1's, seperated by a group of 0's, is a new period - across each person's data. How do I please go about doing this?
My anticipated output would be:
ExampleData$Descriptor <- c(NA,NA,"Period One", "Period One","Period One",NA,NA,NA,"Period Two","Period Two",
NA,NA,NA,"Period One","Period One","Period One",NA,NA,"Period Two","Period Two")
My question is similar to another of mine, located here, although I now have multiple entries for each individual. I have tried the dplyr syntax of:
Test_df <- ExampleData %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
Although, how do I account for each different Name/ individual in my dataset?
Thank you.
Here's an alternative approach with dplyr
library(dplyr)
ExampleData %>%
group_by(Name) %>%
mutate(Descriptor = with(rle(Period == 1),
rep(replace(paste("Period", cumsum(values)), !values, NA), lengths)))
# # A tibble: 20 x 4
# # Groups: Name [2]
# Name Period DV Descriptor
# <fctr> <dbl> <dbl> <chr>
# 1 Tom 0 2.641044 <NA>
# 2 Tom 0 2.692745 <NA>
# 3 Tom 1 1.515797 Period 1
# 4 Tom 1 2.601471 Period 1
# 5 Tom 1 1.669399 Period 1
# 6 Tom 0 2.700371 <NA>
# 7 Tom 0 1.993971 <NA>
# 8 Tom 0 2.203379 <NA>
# 9 Tom 1 2.488742 Period 2
# 10 Tom 1 1.596458 Period 2
# 11 Ben 0 2.578924 <NA>
# 12 Ben 0 1.916804 <NA>
# 13 Ben 0 2.676466 <NA>
# 14 Ben 1 2.508759 Period 1
# 15 Ben 1 2.447217 Period 1
# 16 Ben 1 2.728756 Period 1
# 17 Ben 0 2.326854 <NA>
# 18 Ben 0 1.748016 <NA>
# 19 Ben 1 1.703044 Period 2
# 20 Ben 1 1.783434 Period 2
Here is an option using data.table
library(data.table)
setDT(ExampleData)[ , grp := rleid(Period == 1), .(Name)][Period == 1,
Descriptor := paste("Period", match(grp, unique(grp))), Name][, grp := NULL][]
# Name Period DV Descriptor
# 1: Tom 0 2.764916 NA
# 2: Tom 0 1.537837 NA
# 3: Tom 1 1.848110 Period 1
# 4: Tom 1 2.621724 Period 1
# 5: Tom 1 2.206875 Period 1
# 6: Tom 0 1.715299 NA
# 7: Tom 0 1.882378 NA
# 8: Tom 0 2.244155 NA
# 9: Tom 1 2.094944 Period 2
#10: Tom 1 1.713493 Period 2
#11: Ben 0 1.794261 NA
#12: Ben 0 1.608199 NA
#13: Ben 0 2.053490 NA
#14: Ben 1 1.791563 Period 1
#15: Ben 1 1.652090 Period 1
#16: Ben 1 2.510483 Period 1
#17: Ben 0 2.345984 NA
#18: Ben 0 2.754110 NA
#19: Ben 1 1.675527 Period 2
#20: Ben 1 1.709622 Period 2
Base R option:
unlist(with(ExampleData, tapply(Period, Name, function(x) c(0, cumsum(ifelse(diff(x) < 0, 0, diff(x)))) * x)))
I was able to successfully complete this by running the following:
Test_df <- ExampleData %>%
group_by(Name) %>%
mutate(
Descriptor = case_when(
Period > 0 ~ "Period",
Period == 0 ~ "Rest"),
rleid = cumsum(Descriptor != lag(Descriptor, 1, default = "NA")),
Descriptor = case_when(
Descriptor == "Period" ~ paste0(Descriptor, rleid %/% 2),
TRUE ~ "Rest"),
rleid = NULL
)
I also used "Rest" instead of NA as this more accurately depicts what transpired.
Related
I have the following data frame in R:
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
And another data frame:
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
So, if we take a look at df we could sort the data by target and combination and we will notice that there are basically "groups". For example for target=1 and comb=0 there are four entries p1_start,p1_end,p2_start,p2_end and it is the same for all other target/comb combinations.
On the other side data contains entries with time being a timestamp.
Goal: I want to map the values from both data frames based on time.
Example: The first entry of data has time=2 meaning it happened between p1_start,p1_end so it should get the values target=1 and comb=0 mapped to the data data frame.
Example 2: The entries of data with time=14 happened between p2_start,p2_end so they should get the values target=1 and comb=1 mapped to the data data frame.
Idea: I thought I iterate over df by target and comb and for each combination of them check if there are rows in data whose time is between. The second could be done with the following command:
data[which(data$time > p1_start & data$time < p2_end),]
once I get the rows it is easy to append the values.
Problem: how could I do the iteration? I tried with the following:
df %>%
group_by(target, comb) %>%
print(data[which(data$time > df$p1_start & data$time < df$p2_end),])
But I am getting an error that time has not been initialized
Your problem is best known as performing non-equi join. We need to find a range in some given dataframe that corresponds to each value in one or more given vectors. This is better handled by the data.table package.
We would first transform your df into a format suitable for performing the join and then join data with df by time <= end while time >= start. Here is the code
library(data.table)
setDT(df)[, c("type", "name") := tstrsplit(name, "_", fixed = TRUE)]
df <- dcast(df, ... ~ name, value.var = "time")
cols <- c("target", "comb", "type")
setDT(data)[df, (cols) := mget(paste0("i.", cols)), on = .(time<=end, time>=start)]
After dcast, df looks like this
target comb type end start
1: 1 0 p1 3 1
2: 1 0 p2 7 5
3: 1 1 p1 11 9
4: 1 1 p2 15 13
5: 2 0 p1 19 17
6: 2 0 p2 23 21
7: 2 1 p1 27 25
8: 2 1 p2 31 29
And the output is
> data
time name target comb type
1: 2 a 1 0 p1
2: 5 b 1 0 p2
3: 8 c NA NA <NA>
4: 14 d 1 1 p2
5: 14 e 1 1 p2
6: 20 f NA NA <NA>
7: 21 g 2 0 p2
8: 26 h 2 1 p1
9: 28 i NA NA <NA>
10: 28 j NA NA <NA>
Here is a tidyverse solution:
library(tidyr)
library(dplyr)
df %>%
rename(name_df=name) %>%
mutate(x = time +1) %>%
pivot_longer(
cols = c(time, x),
names_to = "helper",
values_to = "time"
) %>%
right_join(data, by="time") %>%
select(time, name, target, comb)
time name target comb
<dbl> <chr> <dbl> <dbl>
1 2 a 1 0
2 5 b 1 0
3 8 c 1 0
4 14 d 1 1
5 14 e 1 1
6 20 f 2 0
7 21 g 2 0
8 26 h 2 1
9 28 i 2 1
10 28 j 2 1
df <- data.frame(name = c('p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end','p1_start','p1_end','p2_start','p2_end'),
time = c(1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31),
target = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
comb = c(0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1))
data <- data.frame(time = c(2,5,8,14,14,20,21,26,28,28),
name = c('a','b','c','d','e','f','g','h','i','j'))
library(fuzzyjoin)
library(tidyverse)
tmp <- df %>%
separate(name,
into = c("p", "period"),
sep = "_",
remove = TRUE) %>%
pivot_wider(
id_cols = c(p, target, comb),
names_from = period,
values_from = time
) %>%
select(-p)
fuzzy_left_join(
x = data,
y = tmp,
by = c("time" = "start",
"time" = "end"),
match_fun = list(`>=`, `<=`))
#> time name target comb start end
#> 1 2 a 1 0 1 3
#> 2 5 b 1 0 5 7
#> 3 8 c NA NA NA NA
#> 4 14 d 1 1 13 15
#> 5 14 e 1 1 13 15
#> 6 20 f NA NA NA NA
#> 7 21 g 2 0 21 23
#> 8 26 h 2 1 25 27
#> 9 28 i NA NA NA NA
#> 10 28 j NA NA NA NA
Created on 2022-01-11 by the reprex package (v2.0.1)
I have a big dataset of about 4 Milion rows.
the columns are
Idx - dog serial number
date - date of event YYYY-MM-DD ( 2016 till 2021)
Is_sterilized - 1 if the dog was sterilized and 0 if not sterilized.
each dog can appear many times in a year,
It can appear in 2016 and 2020 but not in 2017-2019.
I want to count how many dogs were sterilized each year, meaning, if a dog change from Is_serilized==0 to Is_sterilized ==1 in a year I count it as sterilized that year, the first year it appears sterilized counted as his year fo sterilization.
The issue is that my database is not clean and for some dogs goes from sterilized to not sterilized, this can not happen since sterilization is one-way ticket surgery.
It can happen that a dog appears sterilized, 3 years consecutive and then one year by mistake unsterilized and then sterilized for 2 years.
What I'm asking is if there is a logic that I can estimate/count how many dogs having the wrong direction.
And if so, how can I deduce those dogs from my dataset?
In the example data, Idx = A and C make sense but B and D does not make senese
df_test <- data.frame(Idx=c( 'A', 'B', 'B', 'B','A', 'A', 'C', 'C', 'D','D','D','D','D','D','C', 'C','A' ),
YEAR_date=as.Date(c("2016-01-01","2016-01-29","2017-01-01","2016-05-01","2016-05-06","2016-05-01","2016-03-03","2016-04-22","2018-05-05", "2017-02-01"," 2021-11-12"," 2019-09-13"," 2019-11-12"," 2019-08-17", "2011-09-01"," 2011-07-05","2021-01-05")),
Is_sterilized =c(0,1,0,1,1,1,1,1,1,1,0,1,0,1,1,1,1)
)
df_test[,c( "Idx" ,"YEAR_date", "Is_sterilized")] %>% arrange(Idx ,YEAR_date )
Idx YEAR_date Is_sterilized
1 A 2016-01-01 0
2 A 2016-05-01 1
3 A 2016-05-06 1
4 A 2021-01-05 1
5 B 2016-01-29 1
6 B 2016-05-01 1
7 B 2017-01-01 0
8 C 2011-07-05 1
9 C 2011-09-01 1
10 C 2016-03-03 1
11 C 2016-04-22 1
12 D 2017-02-01 1
13 D 2018-05-05 1
14 D 2019-08-17 1
15 D 2019-09-13 1
16 D 2019-11-12 0
17 D 2021-11-12 0
I have more columns is if you thing anything else is relevant please write and I'll check I have it.
Any hint idea anything will be helpul
Thanks You in advance
Here's some dplyr code to identify instances where a dog's sterilization went from 1 to zero:
library(dplyr)
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup()
# A tibble: 3 x 4
Idx YEAR_date Is_sterilized change
<chr> <date> <dbl> <dbl>
1 B 2017-01-01 0 -1
2 D 2021-11-12 0 -1
3 D 2019-11-12 0 -1
If you want to count the number of dogs in that list, add %>% count(Idx) at the end.
df_test %>%
group_by(Idx) %>%
mutate(change = Is_sterilized-lag(Is_sterilized, default = 0)) %>%
filter(change == -1) %>%
ungroup() %>%
count(Idx, name = "times_desterilized")
# A tibble: 2 x 2
Idx times_desterilized
<chr> <int>
1 B 1
2 D 2
I currently have the following data format:
df = data.frame(c(rep("A", 12), rep("B", 12)), rep(1:12, 2), seq(-12, 11))
colnames(df) = c("station", "month", "mean")
df
df_master = data.frame(c(rep("A", 10), rep("B", 10)), rep(c(27:31, 1:5), 2), rep(c(rep(1, 5), rep(2, 5)), 2), rep(seq(-4,5), 2))
colnames(df_master) = c("station", "day", "month", "value")
df_master
Effectively df is a monthly average value for each station and I want to compute a new variable in the df_master data set which computes the difference from the monthly mean for each daily observation. I have managed to do this with an overall average incuding all the data, but since the mean values vary from each station so I would like to make the new variable station specific.
I have tried the following code to match the monthly value, but this currently doesn't account for cross station differences:
df_master$mean = df$mean[match(df_master$month, df$month)]
df_master = df_master %>% mutate(diff = value - mean)
How can I progress this further so that the averages are taken per station?
With dplyr using a left join
library(dplyr)
left_join(df_master, df, by = c('station', 'month')) %>%
mutate(monthdiff = value - mean) %>%
select(-mean)
If you convert them to data.tables, you can add the difference column with an update join, joining df_master with df on the condition that the values for both station and month are equal.
library(data.table)
setDT(df_master)
setDT(df)
df_master[df, on = .(station, month),
diff_monthmean := value - i.mean]
df_master
# station day month value diff_monthmean
# 1: A 27 1 -4 8
# 2: A 28 1 -3 9
# 3: A 29 1 -2 10
# 4: A 30 1 -1 11
# 5: A 31 1 0 12
# 6: A 1 2 1 12
# 7: A 2 2 2 13
# 8: A 3 2 3 14
# 9: A 4 2 4 15
# 10: A 5 2 5 16
# 11: B 27 1 -4 -4
# 12: B 28 1 -3 -3
# 13: B 29 1 -2 -2
# 14: B 30 1 -1 -1
# 15: B 31 1 0 0
# 16: B 1 2 1 0
# 17: B 2 2 2 1
# 18: B 3 2 3 2
# 19: B 4 2 4 3
# 20: B 5 2 5 4
Another option could be:
transform(df_master,
diff = value - merge(df_master, df, by = c('station', 'month'), all.x = TRUE)$mean)
Or, using match with interaction
transform(df_master,
diff = value - df$mean[match(interaction(df_master[c("month", "station")]), interaction(df[c("month", "station")]))])
I've tried creating a variable that represents the lagged version of another variable relative to the whole change of the variable within the group.
Let's use this example dataframe:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I've tried the following, but while lagging the variable works, I am not able to create a new variable that shows the lag of the variable relative to the whole change for the player:
result <-
+ game_data %>%
+ group_by(player) %>%
+ mutate(
+ lag_score = score - dplyr::lag(score, n=1, default = NA),
+ lag_score_relative = lag_score/sum(lag_score))
result
# A tibble: 9 x 5
# Groups: player [3]
player level score lag_score lag_score_relative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 NA NA
2 1 2 150 150 NA
3 1 3 170 20 NA
4 2 1 80 NA NA
5 2 2 100 20 NA
6 2 3 110 10 NA
7 3 1 75 NA NA
8 3 2 100 25 NA
9 3 3 0 -100 NA
For example, for player 1 it should be in
Level 1: NA/170 = NA
Level 2: 150/170
Level 3: 20/170
Thanks in advance, I hope anyone can help.
If you sum the lagged scores you include an NA. The sum then returns NA. You divide by NA which in the end returns NA for every value. To avoid this just set the na.rm argument to TRUE in your call of sum and NAs do not get included in the sum:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3),
score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data %>%
group_by(player) %>%
mutate(
lag_score = score - dplyr::lag(score, n=1, default = NA),
lag_score_relative = lag_score/sum(lag_score, na.rm = TRUE))
My sample dataset
df <- data.frame(period=rep(1:3,3),
product=c(rep('A',9)),
account= c(rep('1001',3),rep('1002',3),rep('1003',3)),
findme= c(0,0,0,1,0,1,4,2,0))
My Desired output dataset
output <- data.frame(period=rep(1:3,2),
product=c(rep('A',6)),
account= c(rep('1002',3),rep('1003',3)),
findme= c(1,0,1,4,2,0))
Here my conditions are....
I want to eliminate records 3 records from 9 based on below conditions.
If all my periods (1, 2 and 3) meet “findme” value is equal to ‘Zero’ and
if that happens to the same product and
and same account.
Rule 1: It should meet Periods 1, 2, 3
Rule 2: Findme value for all periods = 0
Rule 3: All those 3 records (Preiod 1,2,3) should have same Product
Rule 4: All those 3 recods (period 1,2,3) should have one account.
If I understand correctly, you want to drop all records from a product-account combination where findme == 0, if all periods are included in this combination?
library(dplyr)
df %>%
group_by(product, account, findme) %>%
mutate(all.periods = all(1:3 %in% period)) %>%
ungroup() %>%
filter(!(findme == 0 & all.periods)) %>%
select(-all.periods)
# A tibble: 6 x 4
period product account findme
<int> <fctr> <fctr> <dbl>
1 1 A 1002 1
2 2 A 1002 0
3 3 A 1002 1
4 1 A 1003 4
5 2 A 1003 2
6 3 A 1003 0
Here is an option with data.table
library(data.table)
setDT(df)[df[, .I[all(1:3 %in% period) & !all(!findme)], .(product, account)]$V1]
# period product account findme
#1: 1 A 1002 1
#2: 2 A 1002 0
#3: 3 A 1002 1
#4: 1 A 1003 4
#5: 2 A 1003 2
#6: 3 A 1003 0