How to order and mark duplicated rows at the same time - r

I want to create a new variable that marks which rows of my data are duplicates, treating the oldest data point as the "original". My dataframe is ordered by ID, not by date.
ID Name Number Datetime (dd/mm/yyy/hh/MM)
1 ace 114 15.03.2019 15:26
2 bert 197 18.03.2019 07:28
3 vance 245 16.03.2019 14:03
4 chad 116 17.03.2019 02:02
5 chad 116 18.03.2019 18:23
6 ace 114 12.03.2019 23:15
Ordering the dataframe works, and so does marking the duplicated rows, but not in combination, so the originals are not always the first presentation. Even if I order the dataframe before marking the representations, it seems to be unordered again for the next command, and chaining the two commands with %>% does not work.
df %>% arrange(Datetime)
df$representations <- if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
df$represntations <- df %>%
arrange(Datetime) %>%
if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
How can I be sure that the originals will be the first data point for each number (like this)?
ID Name Number Datetime (dd/mm/yyy/hh/MM) representation
1 ace 114 15.03.2019 15:26 1
2 bert 197 18.03.2019 07:28 0
3 vance 245 16.03.2019 14:03 0
4 chad 116 17.03.2019 02:02 0
5 chad 116 18.03.2019 18:23 1
6 ace 114 12.03.2019 23:15 0

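Try the code below. It sorts by date, flags the duplicates, and then restores the ID order:
library(dplyr)
df <- df %>%
arrange(Datetime) %>%
# duplicated() has no .keep_all argument (that one belongs to distinct()), so it is dropped
mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
arrange(ID)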
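One caveat with sorting on the raw Datetime column: strings in dd.mm.yyyy order sort alphabetically, which matches chronological order only while all rows fall in the same month and year. A minimal base R sketch, assuming the column is stored as text in the format shown in the question (the names parsed and ord are made up for illustration), parses the timestamps first and writes the flag back in the original row order:

```r
# Sample data from the question
df <- data.frame(
  ID = 1:6,
  Name = c("ace", "bert", "vance", "chad", "chad", "ace"),
  Number = c(114, 197, 245, 116, 116, 114),
  Datetime = c("15.03.2019 15:26", "18.03.2019 07:28", "16.03.2019 14:03",
               "17.03.2019 02:02", "18.03.2019 18:23", "12.03.2019 23:15"),
  stringsAsFactors = FALSE
)

# Assumption: timestamps are text in "dd.mm.yyyy HH:MM" form
df$parsed <- as.POSIXct(df$Datetime, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Chronological row order, oldest first
ord <- order(df$parsed)

# Flag duplicates in chronological order, then store each flag at its original row
df$representation <- integer(nrow(df))
df$representation[ord] <- as.integer(duplicated(df$Number[ord]))

df[, c("ID", "Name", "Number", "Datetime", "representation")]
```

Because the flags are assigned through ord, the dataframe itself never has to be reordered, so there is no need to sort back by ID afterwards.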

library(dplyr)
df %>%
arrange(`Datetime(dd/mm/yyy/hh/MM)`) %>%
mutate(flag = duplicated(Number)*1) %>%
arrange(ID)
ID Name Number Datetime(dd/mm/yyy/hh/MM) flag
1 1 ace 114 15.03.2019 1
2 2 bert 197 18.03.2019 0
3 3 vance 245 16.03.2019 0
4 4 chad 116 17.03.2019 0
5 5 chad 116 18.03.2019 1
6 6 ace 114 12.03.2019 0

I ended up using this code and the sample I checked seemed to be correct, thank you! (Note: with the format "%d.%m.%y", as.Date changed the year from 2019 to 2020, although the order stayed correct; %Y parses the four-digit year.)
# split time and date, so as.Date can be used; %Y reads the four-digit year
emerge$date <- as.Date(sapply(strsplit(as.character(emerge$Falleinzeitdatum.Notfall), " "), "[", 1), format = "%d.%m.%Y")
# arrange as proposed
emerge <- emerge %>%
arrange(date) %>%
# duplicated() takes no .keep_all argument, so it is dropped
mutate(re = if_else(duplicated(Patientennummer), 1, 0))
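The year shift mentioned above comes from the format string rather than from the data. A small sketch with two of the sample timestamps from the question (the variable names are illustrative) compares %y, which reads only two digits of the year, with %Y, which reads all four:

```r
raw <- c("15.03.2019 15:26", "12.03.2019 23:15")

# Drop the time part, keeping only "dd.mm.yyyy"
date_part <- sub(" .*$", "", raw)

# %y consumes only "20" from "2019" and ignores the trailing "19"
two_digit <- as.Date(date_part, format = "%d.%m.%y")

# %Y consumes the full four-digit year
four_digit <- as.Date(date_part, format = "%d.%m.%Y")

format(two_digit)   # years come out as 2020
format(four_digit)  # years come out as 2019
```

Since every row is shifted in the same way, the relative order is unaffected here, which is why the duplicate flags still looked right.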

Pivot wider to one row in R

Here is the sample code that I am using
library(dplyr)
naics <- c("000000","000000",123000,123000)
year <- c(2020,2021,2020,2021)
January <- c(250,251,6,9)
February <- c(252,253,7,16)
March <- c(254,255,8,20)
sample2 <- data.frame (naics, year, January, February, March)
Here is the intended result
Jan2020 Feb2020 March2020 Jan2021 Feb2021 March2021
000000 250 252 254 251 253 255
123000 6 7 8 9 16 20
Is this something that is done with pivot_wider or is it more complex?
Yes, pivot_wider can do this: take names_from from 'year' and values_from from the month columns, and build the column names with names_glue. If needed, convert 'naics' to row names with column_to_rownames (from tibble).
library(tidyr)
library(tibble)
pivot_wider(sample2, names_from = year, values_from = January:March,
names_glue = "{substr(.value, 1, 3)}{year}")%>%
column_to_rownames('naics')
-output
Jan2020 Jan2021 Feb2020 Feb2021 Mar2020 Mar2021
000000 250 251 252 253 254 255
123000 6 9 7 16 8 20
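If the columns should come out grouped by year, as in the intended result, pivot_wider has a names_vary argument. The sketch below assumes tidyr 1.2.0 or later, where that argument is available, and rebuilds sample2 from the question:

```r
library(tidyr)

# Sample data from the question
sample2 <- data.frame(
  naics = c("000000", "000000", "123000", "123000"),
  year = c(2020, 2021, 2020, 2021),
  January = c(250, 251, 6, 9),
  February = c(252, 253, 7, 16),
  March = c(254, 255, 8, 20)
)

wide <- pivot_wider(
  sample2,
  names_from = year,
  values_from = January:March,
  names_glue = "{substr(.value, 1, 3)}{year}",
  # "slowest" keeps all months of one year together before moving to the next year
  names_vary = "slowest"
)

names(wide)
```

With the default names_vary = "fastest" the years alternate within each month, which is the ordering shown in the output above.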
With the reshape function from base R:
reshape(sample2, dir = "wide", sep="",
idvar = "naics",
timevar = "year",
new.row.names = unique(naics))[,-1]
# January2020 February2020 March2020 January2021 February2021 March2021
# 000000 250 252 254 251 253 255
# 123000 6 7 8 9 16 20
This takes a longer route than @akrun's answer. I will leave it here in case it gives more intuition about the steps being taken. Otherwise, @akrun's answer is more resource efficient.
sample2 %>%
tidyr::pivot_longer(-c(naics, year), names_to = "month",
values_to = "value") %>%
mutate(Month=paste0(month, year)) %>%
select(-year, - month) %>%
tidyr::pivot_wider(names_from = Month,values_from = value)
# A tibble: 2 x 7
naics January2020 February2020 March2020 January2021 February2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 000000 250 252 254 251 253
2 123000 6 7 8 9 16
# ... with 1 more variable: March2021 <dbl>

Summing row values using mutate_at

I am trying to sum row values by specific columns using mutate_at and sum function. The dataset is given below:
Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
Chili 0 0 0 3 1 2
Chili 1 0 1 4 2 1
China 23 26 123 12 56 70
China 45 25 56 23 16 18
I am using the following code, but instead of the column sums I am getting zeroes.
tb <- confirmed_raw %>% group_by(`Country/Region`) %>%
filter(`Country/Region` != "Cruise Ship") %>%
select(-`Province/State`, -Lat, -Long) %>%
mutate_at(vars(-group_cols()), ~sum)
The output which I want is:
Country/Region 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
Chili 2 0 1 7 3 3
China 68 51 179 35 72 88
But instead of above, all the date columns are coming 0. How can I solve this?
Can you try summarise_all instead of mutate_at(vars(-group_cols()), ~sum)? Pass sum directly; funs() is deprecated in current dplyr.
tb %>% group_by(`Country/Region`) %>% summarise_all(sum)
PS: I think there are a few typos in your expected output; for example, the first Chili value (1/22/20) should be 1, not 2. Also, the example code does not entirely correspond to the data (there is no Cruise Ship or Province/State in it). Ignoring those, I found that this generates the expected output.
For completeness, another option:
tb %>% group_by(`Country/Region`) %>% mutate_all(sum) %>% distinct(`Country/Region`,.keep_all = TRUE)
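In dplyr 1.0 and later, the scoped verbs (mutate_at, summarise_all and friends) are superseded by across(). A minimal sketch, assuming dplyr >= 1.0 and using only the first two date columns of the sample data (the object names confirmed and totals are made up for illustration):

```r
library(dplyr)

# First two date columns of the sample data from the question
confirmed <- tibble(
  `Country/Region` = c("Chili", "Chili", "China", "China"),
  `1/22/20` = c(0, 1, 23, 45),
  `1/23/20` = c(0, 0, 26, 25)
)

# One row per country; everything() skips the grouping column inside across()
totals <- confirmed %>%
  group_by(`Country/Region`) %>%
  summarise(across(everything(), sum), .groups = "drop")

totals
```

summarise collapses each group to a single row, so no distinct() step is needed afterwards.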

How to calculate the number of flights with a specific condition

I'm using the nycflights13::flights dataframe and want to calculate the number of flights an airplane has flown before its first delay of more than 1 hour. How can I do this? I've tried with group_by and filter, but I haven't been able to. Is there a method to count the rows until a condition is met (e.g. until the first dep_delay > 60)?
Thanks.
library(dplyr)
library(nycflights13)
data("flights")
There may be more elegant ways, but this code counts the total number of flights made by each plane (omitting cancelled flights) and joins this with flights that were not cancelled, grouping on the unique plane identifier (tailnum), sorting on departure date/time, assigning the row_number less 1, filtering on delays>60, and taking the first row.
select(
filter(flights, !is.na(dep_time)) %>%
count(tailnum, name="flights") %>% left_join(
filter(flights, !is.na(dep_time)) %>%
group_by(tailnum) %>%
arrange(month, day, dep_time) %>%
mutate(not_delayed=row_number() -1) %>%
filter(dep_delay>60) %>% slice(1)),
tailnum, flights, not_delayed)
# A tibble: 4,037 x 3
tailnum flights not_delayed
<chr> <int> <dbl>
1 D942DN 4 0
2 N0EGMQ 354 53
3 N10156 146 9
4 N102UW 48 25
5 N103US 46 NA
6 N104UW 47 3
7 N10575 272 0
8 N105UW 45 22
9 N107US 41 20
10 N108UW 60 36
# ... with 4,027 more rows
The plane with tailnum N103US has made 46 flights, of which none have been delayed by more than 1 hour. So the number of flights it has made before its first 1 hour delay is undefined or NA.
I got the answer:
flights %>%
#Eliminate the NAs
filter(!is.na(dep_time)) %>%
#Sort by date and time
arrange(time_hour) %>%
group_by(tailnum) %>%
#cumulative number of flights delayed more than one hour
mutate(acum_delay = cumsum(dep_delay > 60)) %>%
#count the number of flights
summarise(before_1hdelay = sum(acum_delay < 1))
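To see how the cumsum step behaves, here is the same pipeline on a tiny made-up table (the planes A, B and C and their delays are invented for illustration, not taken from nycflights13):

```r
library(dplyr)

# Hypothetical flights: A is delayed on its 3rd flight, B never, C on its 1st
toy <- tibble(
  tailnum = c("A", "A", "A", "A", "B", "B", "C"),
  time_hour = as.POSIXct("2013-01-01 00:00", tz = "UTC") +
    3600 * c(1, 2, 3, 4, 1, 2, 1),
  dep_delay = c(5, 10, 75, 0, 0, 20, 90)
)

before <- toy %>%
  arrange(time_hour) %>%
  group_by(tailnum) %>%
  # running count of delays over 60 minutes; stays 0 until the first one
  mutate(acum_delay = cumsum(dep_delay > 60)) %>%
  # rows where the running count is still 0 are the flights before the first delay
  summarise(before_1hdelay = sum(acum_delay < 1))

before
```

Note one difference from the join-based answer: a plane that is never delayed by more than an hour (B here) gets its total number of flights rather than NA.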

R - Transpose columns and rows with conditions

I am working with the dataframe 'by_class_survival' and want to convert it to another format, swapping rows and columns and applying conditions. I have already solved this in a rather crude way, but I wonder whether there is a better way to transpose rows and columns while applying conditions during the transposition.
library(dplyr)
titanic_tbl <- as_tibble(Titanic)  # tbl_df() is deprecated
titanic_tbl <- titanic_tbl %>%
mutate_at(vars(Class:Survived), factor)  # funs() is deprecated; pass the function directly
by_class_survival <- titanic_tbl %>%
group_by(Class, Survived) %>%
summarize(Count = sum(n))
Original dataframe
# Class Survived Count
# 1 1st No 122
# 2 1st Yes 203
# 3 2nd No 167
# 4 2nd Yes 118
# 5 3rd No 528
# 6 3rd Yes 178
# 7 Crew No 673
# 8 Crew Yes 212
Creating a new dataframe based on the values from by_class_survival
first <- c(122,203)
second <- c(167, 118)
third <- c(528,178)
crew <- c(673,212)
titanic.df = data.frame(first,second,third,crew)
library(data.table)
t_titanic.df <- transpose(titanic.df)
rownames(t_titanic.df) <- colnames(titanic.df)
colnames(t_titanic.df) <- c("No survivor", "Survivor")
Expected result
## No survivor Survivor
## first 122 203
## second 167 118
## third 528 178
## crew 673 212
Is there a better way to reach the expected result?
You can do it in one step with reshape2::dcast:
library(reshape2)
library(dplyr)
titanic_tbl %>%
dcast(Class ~ Survived, value.var = "n", sum)
Class No Yes
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
or you can use tidyr::spread on the summarised data frame:
library(tidyr)
titanic_tbl %>%
group_by(Class, Survived) %>%
summarise(sum = sum(n)) %>%
spread(Survived, sum)
# A tibble: 4 x 3
# Groups: Class [4]
Class No Yes
<chr> <dbl> <dbl>
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
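Since Titanic ships with R as a four-dimensional table (Class, Sex, Age, Survived), the same counts can also be obtained without any packages by summing over the two unused dimensions. A base R sketch (the name class_survival is made up for illustration):

```r
# Titanic is a built-in table with dimensions Class, Sex, Age, Survived.
# Keep dimensions 1 (Class) and 4 (Survived) and sum over Sex and Age.
class_survival <- apply(Titanic, c(1, 4), sum)

# Rename the columns to match the expected result
colnames(class_survival) <- c("No survivor", "Survivor")

class_survival
```

The result is a matrix with the classes as row names, so it already has the layout of the expected result without a separate transpose step.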

Choose groups with consecutive year-quarters

I hope to choose the identifiers that have consecutive year-quarter records. For example, ID 111 will be selected because it has all year-quarters. ID 113 will be selected because the year-quarter combinations are consecutive, although the ID only has a portion of the total year-quarters. ID 112 will not be selected because the year-quarter is not consecutive. It lacks 201601, 201602, 201603.
Identifer year-quarter
111 201503
111 201504
111 201601
111 201602
111 201603
111 201604
112 201503
112 201504
112 201604
113 201503
113 201504
113 201601
My current code (below) can only deal with selecting IDs that have the full year-quarter combinations. I wonder how to achieve my desired outcome.
df2 = group_by(df1, Identifer) %>% summarize(total = n()) %>% filter(total == 6)
The desired outcome is
Identifer
111
113
To select 'Identifiers', convert 'year.quarter' to zoo::yearqtr, take the difference between consecutive values by group, and check whether all differences are 0.25*.
library(zoo)
tapply(as.yearqtr(as.character(d$year.quarter), format = "%Y%q"), d$Identifer,
FUN = function(x) all(diff(as.numeric(x)) == 0.25))
# 111 112 113
# TRUE FALSE TRUE
To select corresponding rows, use a similar logic with ave:
d[as.logical(ave(as.yearqtr(as.character(d$year.quarter), format = "%Y%q"), d$Identifer,
FUN = function(x) all(diff(x) == 0.25))), ]
# Identifer year.quarter
# 1 111 201503
# 2 111 201504
# 3 111 201601
# 4 111 201602
# 5 111 201603
# 6 111 201604
# 10 113 201503
# 11 113 201504
# 12 113 201601
*From ?as.yearqtr:
The "yearqtr" class is used to represent quarterly data. Internally it holds the data as year plus 0 for Quarter 1, 1/4 for Quarter 2 and so on
The post was improved by comments from @G.Grothendieck. Thanks!
Another way is to use dplyr and lubridate together. We group_by Identifer, convert year-quarter to a date with the yq function, and take the difference between consecutive dates. We then keep the groups where every difference is between 90 and 119 days, which covers the gap between the start dates of two consecutive quarters (the first value in each group is padded with 90).
library(dplyr)
library(lubridate)
df %>%
group_by(Identifer) %>%
mutate(yearq = c(90, diff(yq(year.quarter)))) %>%
filter(all(yearq > 89 & yearq < 120)) %>%
select(Identifer) %>%
unique()
# Identifer
# <int>
#1 111
#2 113
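The same check can be sketched in base R without date classes by turning each YYYYQQ code into a running quarter number (year * 4 + quarter), so that consecutive quarters differ by exactly 1. This assumes year.quarter is numeric in the form shown in the question; quarter_index, is_consecutive and selected are illustrative names:

```r
# Sample data from the question
df1 <- data.frame(
  Identifer = c(rep(111, 6), rep(112, 3), rep(113, 3)),
  year.quarter = c(201503, 201504, 201601, 201602, 201603, 201604,
                   201503, 201504, 201604,
                   201503, 201504, 201601)
)

# Running quarter number: integer part is the year, last two digits the quarter
quarter_index <- (df1$year.quarter %/% 100) * 4 + df1$year.quarter %% 100

# Per identifier: are all gaps between sorted quarters exactly one quarter?
is_consecutive <- tapply(quarter_index, df1$Identifer,
                         function(q) all(diff(sort(q)) == 1))

selected <- names(is_consecutive)[is_consecutive]
selected
```

An identifier with a single row counts as consecutive here, because all() of an empty set of differences is TRUE; add a minimum-length condition if that is not wanted.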
