Choose groups with consecutive year-quarters - r

I hope to choose the identifiers that have consecutive year-quarter records. For example, ID 111 will be selected because it has all year-quarters. ID 113 will be selected because the year-quarter combinations are consecutive, although the ID only has a portion of the total year-quarters. ID 112 will not be selected because the year-quarter is not consecutive. It lacks 201601, 201602, 201603.
Identifer year-quarter
111 201503
111 201504
111 201601
111 201602
111 201603
111 201604
112 201503
112 201504
112 201604
113 201503
113 201504
113 201601
My current code (below) can only deal with selecting IDs that have the full year-quarter combinations. I wonder how to achieve my desired outcome.
df2 = group_by(df1, Identifer) %>% summarize(total = n()) %>% filter(total == 6)
The desired outcome is
Identifer
111
113

To select the identifiers, convert 'year.quarter' to zoo::yearqtr, take the difference between consecutive values within each group, and check whether all differences are 0.25*.
library(zoo)
tapply(as.yearqtr(as.character(d$year.quarter), format = "%Y%q"), d$Identifer,
FUN = function(x) all(diff(as.numeric(x)) == 0.25))
# 111 112 113
# TRUE FALSE TRUE
To select corresponding rows, use a similar logic with ave:
d[as.logical(ave(as.yearqtr(as.character(d$year.quarter), format = "%Y%q"), d$Identifer,
FUN = function(x) all(diff(x) == 0.25))), ]
# Identifer year.quarter
# 1 111 201503
# 2 111 201504
# 3 111 201601
# 4 111 201602
# 5 111 201603
# 6 111 201604
# 10 113 201503
# 11 113 201504
# 12 113 201601
*From ?as.yearqtr:
The "yearqtr" class is used to represent quarterly data. Internally it holds the data as year plus 0 for Quarter 1, 1/4 for Quarter 2 and so on
The post was improved by comments from @G.Grothendieck. Thanks!
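The same consecutiveness check can also be sketched in base R without zoo. This is an alternative to the answer above, not a restatement of it. It assumes year.quarter is stored as a number of the form YYYYQQ, as in the question, and quarter_index is a hypothetical helper name introduced only for this illustration.

```r
# Sample data from the question
d <- data.frame(
  Identifer = rep(c(111, 112, 113), times = c(6, 3, 3)),
  year.quarter = c(201503, 201504, 201601, 201602, 201603, 201604,
                   201503, 201504, 201604,
                   201503, 201504, 201601)
)

# Hypothetical helper: map YYYYQQ to a running quarter count,
# so that consecutive quarters always differ by exactly 1
quarter_index <- function(yq) (yq %/% 100) * 4 + (yq %% 100)

# TRUE for groups whose sorted quarter indices have no gaps
consecutive <- tapply(quarter_index(d$year.quarter), d$Identifer,
                      function(x) all(diff(sort(x)) == 1))

selected_ids <- as.numeric(names(consecutive)[consecutive])
selected_ids
# [1] 111 113
```

Working with integer quarter counts avoids comparing floating-point values, at the cost of assuming the YYYYQQ layout.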

One way to do this is to use dplyr and lubridate together. Group by Identifer, use the yq function to convert year-quarter to a date, take the difference between consecutive dates, and keep the groups where every gap is more than 89 and less than 120 days. Consecutive quarter start dates are 90 to 92 days apart, so a gap in that range means no quarter was skipped.
library(dplyr)
library(lubridate)
df %>%
group_by(Identifer) %>%
mutate(yearq = c(90, diff(yq(year.quarter)))) %>%
filter(all(yearq > 89 & yearq < 120)) %>%
select(Identifer) %>%
unique()
# Identifer
# <int>
#1 111
#2 113
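As a quick check of the day thresholds used above, the following base R sketch lists the gaps between consecutive quarter start dates. The date range 2015 to 2020 is an arbitrary choice for illustration.

```r
# First day of every quarter from 2015 Q1 to 2020 Q1 (range chosen for illustration)
quarter_starts <- seq(as.Date("2015-01-01"), as.Date("2020-01-01"), by = "quarter")

# Days between consecutive quarter starts
gaps <- as.numeric(diff(quarter_starts))
range(gaps)
# [1] 90 92

# Skipping a single quarter roughly doubles the gap
skipped_gap <- as.numeric(as.Date("2015-07-01") - as.Date("2015-01-01"))
skipped_gap
# [1] 181
```

Any cutoff above 92 and below 181 days therefore separates consecutive quarters from skipped ones; 120 is one such choice.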

Related

How to order and mark duplicated rows at the same time

I am looking to make a new variable to mark which of my data is duplicated, selecting the oldest datapoint to be the "original". My dataframe is not ordered by date, but by ID.
ID Name Number Datetime (dd/mm/yyy/hh/MM)
1 ace 114 15.03.2019 15:26
2 bert 197 18.03.2019 07:28
3 vance 245 16.03.2019 14:03
4 chad 116 17.03.2019 02:02
5 chad 116 18.03.2019 18:23
6 ace 114 12.03.2019 23:15
Ordering the dataframe works and selecting the duplicated lines also works, but not in combination, which leads to the originals not being the first occurrence. Even if I order the dataframe before marking the repeats, the dataframe seems to be unordered for the next command, and linking the two commands with %>% is not working.
df %>% arrange(Datetime)
df$representations <- if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
df$represntations <- df %>%
arrange(Datetime) %>%
if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
How can I be sure that the originals will be the first datapoint for each number (like this)?
ID Name Number Datetime (dd/mm/yyy/hh/MM) representation
1 ace 114 15.03.2019 15:26 1
2 bert 197 18.03.2019 07:28 0
3 vance 245 16.03.2019 14:03 0
4 chad 116 17.03.2019 02:02 0
5 chad 116 18.03.2019 18:23 1
6 ace 114 12.03.2019 23:15 0
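Try the code below:
library(dplyr)
df <- df %>%
# oldest first; assumes Datetime is a real date-time, since dd.mm.yyyy text sorts incorrectly
arrange(Datetime) %>%
# duplicated() has no .keep_all argument; 1 marks a repeat, 0 the original
mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
arrange(ID)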
library(dplyr)
df %>%
arrange(`Datetime(dd/mm/yyy/hh/MM)`) %>%
mutate(flag = duplicated(Number)*1) %>%
arrange(ID)
1 1 ace 114 15.03.2019 1
2 2 bert 197 18.03.2019 0
3 3 vance 245 16.03.2019 0
4 4 chad 116 17.03.2019 0
5 5 chad 116 18.03.2019 1
6 6 ace 114 12.03.2019 0
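For a self-contained check of the approach, here is a base R sketch that rebuilds the sample data, parses the timestamps, and flags every repeat of a Number after its earliest occurrence. It assumes Datetime is stored as text in dd.mm.yyyy hh:MM form, as shown in the question; the column name parsed and the UTC time zone are choices made here for illustration.

```r
# Sample data from the question; Datetime is plain text
df <- data.frame(
  ID = 1:6,
  Name = c("ace", "bert", "vance", "chad", "chad", "ace"),
  Number = c(114, 197, 245, 116, 116, 114),
  Datetime = c("15.03.2019 15:26", "18.03.2019 07:28", "16.03.2019 14:03",
               "17.03.2019 02:02", "18.03.2019 18:23", "12.03.2019 23:15"),
  stringsAsFactors = FALSE
)

# Parse the text so that sorting is chronological, not alphabetical
df$parsed <- as.POSIXct(df$Datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Flag duplicates in chronological order, then write the flags back
# to the original row positions so the ID order is untouched
ord <- order(df$parsed)
flag <- integer(nrow(df))
flag[ord] <- as.integer(duplicated(df$Number[ord]))
df$representation <- flag

df$representation
# [1] 1 0 0 0 1 0
```

Writing the flags back through ord avoids the final re-sort by ID.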
I ended up using this code and the sample I checked seemed to be correct, thank you! (as.Date originally changed the year from 2019 to 2020 because the format string used %y, which reads only two digits of the year; with %Y, as below, the year is parsed correctly. The order was correct either way.)
# split time and date, so as.Date can be used
emerge$date <- as.Date(sapply(strsplit(as.character(emerge$Falleinzeitdatum.Notfall), " "), "[", 1), format = "%d.%m.%Y")
# arrange as proposed
emerge <- emerge %>%
arrange(date) %>%
mutate(re = if_else(duplicated(Patientennummer), 1, 0))
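The year shift from 2019 to 2020 mentioned above is what the two-digit %y specifier does with a four-digit year. A minimal base R sketch of the difference, using one date from the sample data:

```r
# %y consumes only the first two digits of the year ("20" of "2019")
two_digit <- as.Date("15.03.2019", format = "%d.%m.%y")
format(two_digit, "%Y")
# [1] "2020"

# %Y consumes the full four-digit year
four_digit <- as.Date("15.03.2019", format = "%d.%m.%Y")
format(four_digit, "%Y")
# [1] "2019"
```

Since %y maps every date onto 2020 here, the ordering stays correct only as long as all dates fall within one year.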

subset data based on condition in r [duplicate]

I want to select those households where every member's age is greater than 20 in R.
household Members_age
100 75
100 74
100 30
101 20
101 50
101 60
102 35
102 40
102 5
Here two households satisfy the condition: households 100 and 101.
How can I do this in R?
What I tried is the following, but it is not working.
sqldf("select household,Members_age from data group by household having Members_age > 20")
household Members_age
100 75
102 35
Please suggest. Here is the sample dataset
library(dplyr)
library(sqldf)
data <- data.frame(household = c(100,100,100,101,101,101,102,102,102),
Members_age = c(75,74,30,20,50,60,35,40,5))
You can use ave.
data[ave(data$Members_age, data$household, FUN=min) > 20,]
# household Members_age
#1 100 75
#2 100 74
#3 100 30
or only the households.
unique(data$household[ave(data$Members_age, data$household, FUN=min) > 20])
#[1] 100
I understand SQL's HAVING clause, but your request "all member's age is greater than 20" does not match your sqldf output. This is because HAVING without an aggregate is really only looking at the first row for each household, which is why we see 102 (and shouldn't) and why we don't see 101 (which we shouldn't either, since one member is exactly 20).
To implement your logic, I suggest changing your sqldf code to the following:
sqldf("select household,Members_age from data group by household having min(Members_age) > 20")
# household Members_age
# 1 100 30
which is effectively the SQL analog of GKi's ave answer.
An alternative:
library(dplyr)
data %>%
group_by(household) %>%
filter(all(Members_age > 20)) %>%
ungroup()
# # A tibble: 3 x 2
# household Members_age
# <dbl> <dbl>
# 1 100 75
# 2 100 74
# 3 100 30
and if you just need one row per household, then add %>% distinct(household) or perhaps %>% distinct(household, .keep_all = TRUE).
But for base R, I think nothing is likely to be better than GKi's use of ave.
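To see the per-household decision itself rather than the filtered rows, a short base R sketch with tapply may help. It uses the sample data from the question; the names all_over_20 and all_at_least_20 are introduced here for illustration.

```r
# Sample data from the question
data <- data.frame(household = c(100, 100, 100, 101, 101, 101, 102, 102, 102),
                   Members_age = c(75, 74, 30, 20, 50, 60, 35, 40, 5))

# One logical per household: is every member strictly older than 20?
all_over_20 <- tapply(data$Members_age, data$household, function(a) all(a > 20))
names(all_over_20)[all_over_20]
# [1] "100"

# With >= instead of >, household 101 (which has a member aged exactly 20) also passes
all_at_least_20 <- tapply(data$Members_age, data$household, function(a) all(a >= 20))
names(all_at_least_20)[all_at_least_20]
# [1] "100" "101"
```

The second variant matches the two households named in the question, which suggests the intended condition may have been "20 or older".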

R - Transpose columns and rows with conditions

I am working with the data frame 'by_class_survival' and trying to convert it to another format by swapping rows and columns and applying conditions. I have already solved it in a very crude way, but I am wondering whether there is a better way to transpose the columns and rows while applying the conditions during the transposition.
library(dplyr)
titanic_tbl <- as_tibble(Titanic)
titanic_tbl <- titanic_tbl %>%
mutate(across(Class:Survived, factor))
by_class_survival <- titanic_tbl %>%
group_by(Class, Survived) %>%
summarize(Count = sum(n))
Original dataframe
# Class Survived Count
# 1 1st No 122
# 2 1st Yes 203
# 3 2nd No 167
# 4 2nd Yes 118
# 5 3rd No 528
# 6 3rd Yes 178
# 7 Crew No 673
# 8 Crew Yes 212
Creating a new dataframe based on the values from by_class_survival
first <- c(122,203)
second <- c(167, 118)
third <- c(528,178)
crew <- c(673,212)
titanic.df = data.frame(first,second,third,crew)
library(data.table)
t_titanic.df <- transpose(titanic.df)
rownames(t_titanic.df) <- colnames(titanic.df)
colnames(t_titanic.df) <- c("No survivor", "Survivor")
Expected result
## No survivor Survivor
## first 122 203
## second 167 118
## third 528 178
## crew 673 212
Is there a better way to reach the expected result?
You can do it in one step with reshape2::dcast:
library(reshape2)
library(dplyr)
titanic_tbl %>%
dcast(Class ~ Survived, value.var = "n", sum)
Class No Yes
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
or you can use tidyr::spread on the summarised data frame:
library(tidyr)
titanic_tbl %>%
group_by(Class, Survived) %>%
summarise(sum = sum(n)) %>%
spread(Survived, sum)
# A tibble: 4 x 3
# Groups: Class [4]
Class No Yes
<chr> <dbl> <dbl>
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
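Because Titanic ships with R as a four-dimensional table (Class, Sex, Age, Survived), the same result can also be sketched in base R by summing over the two unused dimensions. This is an additional option, not one proposed in the answers above; the object name wide is chosen here for illustration.

```r
# Titanic is a built-in 4-d table; keep dimensions 1 (Class) and 4 (Survived)
# and sum over Sex and Age
wide <- apply(Titanic, c(1, 4), sum)

# Rename the columns to match the expected result
colnames(wide) <- c("No survivor", "Survivor")

wide["1st", "No survivor"]
# [1] 122
wide["Crew", "Survivor"]
# [1] 212
```

The result is a numeric matrix with the classes as row names; as.data.frame(wide) turns it into a data frame if one is needed.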

I am trying to combine or aggregate 2 rows of data into 1 row by a certain criteria

I am attempting to combine 2 rows into 1 row and select the value to keep depending on a different column.
ID score date std error
123 87 1/15/2018 5
123 92 1/15/2018 10
155 78 3/10/2018 8
155 82 1/15/2018 7
In the data set I only want 1 row per ID. When there are two different test scores I want to keep the score value with the corresponding test date that is closest to present day. If the date is the same then I want to take the test score with the smallest standard error.
End result would look like this:
ID score test date std error
123 87 1/15/2018 5
155 78 3/10/2018 8
Been going at it for a few hours and cannot seem to figure this out.
Thanks
Arrange by date (descending order) and std error (ascending order), then take the first row from each group:
df %>%
arrange(desc(as.Date(date, '%m/%d/%Y')), std.error) %>%
group_by(ID) %>% slice(1)
# A tibble: 2 x 4
# Groups: ID [2]
# ID score date std.error
# <int> <int> <fct> <int>
#1 123 87 1/15/2018 5
#2 155 78 3/10/2018 8
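The same sort-then-keep-first idea can be sketched in base R. The data frame below rebuilds the sample from the question; the column name std.error and the object names are assumptions made for this illustration.

```r
# Sample data from the question; dates are month/day/year text
df <- data.frame(
  ID = c(123, 123, 155, 155),
  score = c(87, 92, 78, 82),
  date = c("1/15/2018", "1/15/2018", "3/10/2018", "1/15/2018"),
  std.error = c(5, 10, 8, 7),
  stringsAsFactors = FALSE
)

# Parse the dates so they sort chronologically
parsed <- as.Date(df$date, format = "%m/%d/%Y")

# Sort by ID, then most recent date first, then smallest standard error
ord <- order(df$ID, -as.numeric(parsed), df$std.error)
sorted <- df[ord, ]

# Keep the first row of each ID
result <- sorted[!duplicated(sorted$ID), ]
result$score
# [1] 87 78
```

Negating the numeric form of the dates gives a descending sort while the other keys stay ascending.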

How to assign a value depending on two conditions including column names. (add environmental variable to tracking data)

I have a data frame (track) with the position (longitude and latitude) and date (day of the year) of tracking points for different animals, and another data frame (Var) which gives the mean temperature for every day of the year at different locations.
I would like to add a new column Temp to my data frame (track), where the value comes from (Var) and corresponds to the date and GPS location of each tracking point in (track).
Here is a really simple subset of my data and what I would like to obtain.
track = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5))
Var = data.frame(
Longitude=c(117,117,116,116),
Latitude=c(18,20,18,20),
Day1=c(22,23,24,21),
Day2=c(21,28,27,29),
Day3=c(12,13,14,11),
Day4=c(17,19,20,23),
Day5=c(32,33,34,31)
)
TrackPlusVar = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5),
Temp= c(22,11,19,22,31)
)
I have no idea how to look up the value for the matching date and GPS location, since the date is encoded in a column name. Any ideas would be very useful!
This is a dplyr and tidyr approach.
library(dplyr)
library(tidyr)
# reshape table Var
Var %>%
gather(Day,Temp,-Longitude, -Latitude) %>%
mutate(Day = as.numeric(gsub("Day","",Day))) -> Var2
# join tables
track %>% left_join(Var2, by=c("Longitude", "Latitude", "Day"))
# animals Longitude Latitude Day Temp
# 1 1 117 18 1 22
# 2 1 116 20 3 11
# 3 1 117 20 4 19
# 4 2 117 18 1 22
# 5 2 116 20 5 31
If the process that creates your tables makes sure that all your cases belong to both tables, then you can use inner_join instead of left_join to make the process faster.
If you're still not happy with the speed you can use a data.table join process to check if it is faster, like:
library(data.table)
Var2 = setDT(Var2, key = c("Longitude", "Latitude", "Day"))
track = setDT(track, key = c("Longitude", "Latitude", "Day"))
Var2[track][order(animals,Day)]
# Longitude Latitude Day Temp animals
# 1: 117 18 1 22 1
# 2: 116 20 3 11 1
# 3: 117 20 4 19 1
# 4: 117 18 1 22 2
# 5: 116 20 5 31 2
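If reshaping is not wanted, the lookup can also be sketched in base R with matrix indexing: find the row of Var by location and the column by day name, then pull one cell per tracking point. This is an alternative to the joins above and assumes every track row has a matching location and a DayN column in Var; row_idx and col_idx are names introduced here.

```r
# Sample data from the question
track <- data.frame(
  animals = c(1, 1, 1, 2, 2),
  Longitude = c(117, 116, 117, 117, 116),
  Latitude = c(18, 20, 20, 18, 20),
  Day = c(1, 3, 4, 1, 5))
Var <- data.frame(
  Longitude = c(117, 117, 116, 116),
  Latitude = c(18, 20, 18, 20),
  Day1 = c(22, 23, 24, 21),
  Day2 = c(21, 28, 27, 29),
  Day3 = c(12, 13, 14, 11),
  Day4 = c(17, 19, 20, 23),
  Day5 = c(32, 33, 34, 31))

# Row of Var that has the same location as each tracking point
row_idx <- match(paste(track$Longitude, track$Latitude),
                 paste(Var$Longitude, Var$Latitude))

# Column of Var whose name matches the day of each tracking point
col_idx <- match(paste0("Day", track$Day), names(Var))

# A two-column index matrix picks one (row, column) cell per tracking point
track$Temp <- as.matrix(Var)[cbind(row_idx, col_idx)]
track$Temp
# [1] 22 11 19 22 31
```

Unmatched locations or days would produce NA indices, so real data may need a check on row_idx and col_idx first.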
