Deleting duplicated rows based on condition (position) - r

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and time identifier - e.g. I have 3 different Alpha 1970 values. I would like to retain only one observation per id/Year, specifically the last one that appears for each id/Year.
The final dataset should look something like this:
final <- data.frame("id" = c("Alpha", "Alpha", "Beta", "Beta", "Beta"),
                    "Year" = c(1970, 1971, 1980, 1981, 1982),
                    "Val" = c(-2, 5, 5, 3, 5))
Does anyone know how I can approach the problem?
Thanks a lot in advance for your help

If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.

How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
This translates to "within each id-Year group, take only the row whose row number equals the size of the group, i.e. the last row under the current ordering."
You could also use filter(), e.g. filter(row_number() == n()), or distinct() (in which case you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct() keeps the first distinct row, so you'd need to reverse the row ordering first, as sketched below.
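A minimal sketch of that distinct() variant, reversing the rows so the last observation per id/Year becomes the first distinct one:
library(dplyr)
df[rev(seq_len(nrow(df))), ] %>%            # reverse the row order
  distinct(id, Year, .keep_all = TRUE) %>%  # now keeps what was the last row per id/Year
  arrange(id, Year)                         # restore a tidy ordering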

An option with base R. The formula Val ~ . groups Val by all the remaining columns, and the extra argument 1 is passed on to tail() so each group keeps its last value:
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row
aggregate(Val ~ ., df, head, 1)
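A closely related base idiom, for reference: duplicated() with fromLast = TRUE flags every row except the last one per id/Year, so negating it keeps exactly the rows we want:
df[!duplicated(df[c("id", "Year")], fromLast = TRUE), ]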

Related

Applying a function to rows but referencing a different table

I have 2 tables
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"), as.Date("2020-1-10"), by = "days")))
df2 = data.frame("observations" = c("a", "b", "c", "d"), "start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")), "end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
I would like to know the number of observation periods that occur on each day of df1, based on the start/stop dates in df2. E.g. on 1/1/2020, observations a and b were in progress, hence "2".
The expected output would be df1 with a "number" column giving, for each date, the count of observation periods covering it - i.e. 2, 2, 1, 0, 1, 1, 1, 1, 1, 2 for the ten days above.
I've tried using sum():
df1$number = sum(as.Date(df2$start) <= df1$dates & as.Date(df2$end) >= df1$dates)
But that sums over the entire column at once.
I've then tried to create a custom function for this:
df1$number = apply(df1, 1, function(x) sum(df2$start <= x & df2$end>=x))
But it returns an NA value.
I then tried to embed an ifelse() within it, but get the same issue with NAs:
apply(df1, 1, function(x) sum(ifelse(df2$start <= x & df2$end>=x, 1, 0)))
Can anyone suggest what the issue is? Thanks!
Edit: an interval join was suggested, which is not what I'm trying to get - I think naming the observations with numeric labels caused the confusion. I am trying to find the TOTAL number of observation periods that cover each day, as opposed to doing a 1:1 match.
Regards
Sing
Define the comparison in a function f and pass it to outer(); rowSums() then gives what you're looking for.
# f(x, y) is TRUE when the x-th date of df1 falls inside the y-th period of df2
f <- \(x, y) df1[x, 1] >= df2[y, 2] & df1[x, 1] <= df2[y, 3]
# outer() evaluates f on every date/period index pair; rowSums() counts matches per date
cbind(df1, number = rowSums(outer(1:nrow(df1), 1:nrow(df2), f)))
# dates number
# 1 2020-01-01 2
# 2 2020-01-02 2
# 3 2020-01-03 1
# 4 2020-01-04 0
# 5 2020-01-05 1
# 6 2020-01-06 1
# 7 2020-01-07 1
# 8 2020-01-08 1
# 9 2020-01-09 1
# 10 2020-01-10 2
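As an aside, a likely culprit in the apply() attempt is that apply() coerces the data frame's rows to a character matrix; a minimal fix is to iterate over the date vector itself, which keeps the Date class:
# count, for each date, the periods of df2 that cover it
df1$number <- sapply(df1$dates, function(d) sum(df2$start <= d & df2$end >= d))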
Here is a potential solution using dplyr/tidyverse functions and the %within% function from the lubridate package. This approach is similar to Left Join Subset of Column Based on Date Interval, with two important differences: summarise() is used instead of filter() to avoid 'losing' dates where "number" == 0, and the join is by = character() (a cross join), as the datasets share no common columns:
library(dplyr)
library(lubridate)
df1 = data.frame("dates" = c(seq(as.Date("2020-1-1"),
as.Date("2020-1-10"),
by = "days")))
df2 = data.frame("observations" = c("1", "2", "3", "4"),
"start" = as.Date(c("2019-12-30", "2020-1-1", "2020-1-5","2020-1-10")),
"end"=as.Date(c("2020-1-3", "2020-1-2", "2020-1-12","2020-1-14")))
df1 %>%
full_join(df2, by = character()) %>%
mutate(number = dates %within% interval(start, end)) %>%
group_by(dates) %>%
summarise(number = sum(number))
#> # A tibble: 10 × 2
#> dates number
#> <date> <dbl>
#> 1 2020-01-01 2
#> 2 2020-01-02 2
#> 3 2020-01-03 1
#> 4 2020-01-04 0
#> 5 2020-01-05 1
#> 6 2020-01-06 1
#> 7 2020-01-07 1
#> 8 2020-01-08 1
#> 9 2020-01-09 1
#> 10 2020-01-10 2
Created on 2022-06-27 by the reprex package (v2.0.1)
Does this approach work with your actual data?

R - dplyr- Reducing data package 'storms'

I am working with dplyr and the data package 'storms'.
I need a table in which each measured storm gets its own row, and I want to give each storm an ID.
So far I have
storm_ID <- storms %>%
  select(year, month, name) %>%
  group_by(year, month, name) %>%
  summarise(ID = n())
storm_ID
View(storm_ID)
The only thing is that this doesn't get me what I need. I don't quite understand how I can see every single storm in the table. I had previously sorted them by name, which gives 214 storms. However, storms with the same name occur in several years.
At the end I want something like:
name | year | month | day | ID
Zeta | 2005 | 12    | 31  | Zeta1
Zeta | 2006 |  1    |  1  | Zeta1
...  |      |       |     |
Zeta | 2020 | 10    | 24  | Zeta2
To do this, I need to know whether a storm spans two years (i.e. runs from 2005-12-31 to 2006-01-01), since such a storm should only be counted once. After that I should be able to evaluate the duration, wind speed difference and pressure difference per storm - which I had already computed, but with the wrong grouping.
Help would be nice.
Thanks in advance.
If you count records on consecutive days as one storm, splitting whenever there is a gap of days without a record of the same name, then the following code might be what you want.
The variable Thresh is set to the maximum gap, in days, between records that still counts as the same storm.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
  count(name, year, month, day) %>%
  group_by(name) %>%
  mutate(Date = as.Date(ISOdate(year, month, day)),
         DDiff = c(0, diff(Date)) > Thresh,
         DDiff = cumsum(DDiff)) %>%
  group_by(name, DDiff) %>%
  mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
  ungroup() %>%
  group_by(name) %>%
  summarise(name = first(name),
            year = first(year),
            n = sum(n))
#> # A tibble: 512 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 AL011993 1993 8
#> 2 AL012000 2000 4
#> 3 AL021992 1992 5
#> 4 AL021994 1994 6
#> 5 AL021999 1999 4
#> 6 AL022000 2000 12
#> 7 AL022001 2001 5
#> 8 AL022003 2003 4
#> 9 AL022006 2006 5
#> 10 AL031987 1987 32
#> # ... with 502 more rows
Created on 2022-04-15 by the reprex package (v2.0.1)
Edit
After seeing the OP's answer, I have revised mine and they are now nearly identical.
The main difference is that, even with the gap threshold Thresh at 5, storm Dorian gets split: it has 5 consecutive days without records, between July 27th 2013 and August 2nd 2013, yet it should still be considered one storm. To get that result, increase Thresh to an appropriate value, for instance 30 (days), and the two outputs then match.
I have left it like this to show this point and to show what the variable Thresh is meant for.
In the code that follows, I assign the result of my code above to the data frame rui; the OP's result is cbind'ed with id and piped into a count() instruction, then saved as storm_count. The two outputs are compared for differences with anti_join() after removing the id suffix from my name column.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
  count(name, year, month, day) %>%
  group_by(name) %>%
  mutate(Date = as.Date(ISOdate(year, month, day)),
         DDiff = c(0, diff(Date)) > Thresh,
         DDiff = cumsum(DDiff)) %>%
  group_by(name, DDiff) %>%
  mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
  ungroup() %>%
  group_by(name) %>%
  summarise(name = first(name),
            year = first(year),
            n = sum(n)) -> rui
id <- c()
j <- 1
k <- 1
for (i in storms$name) {
  if (k - 1 == 0) {
    id <- append(id, j)
    k <- k + 1
    next
  }
  if (i != storms$name[k - 1]) {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}
cbind(storms, id) %>%
  count(name, id) -> storm_count
# two rows
anti_join(
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
storm_count,
by = c("name", "n")
)
#> # A tibble: 2 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 Dorian 2013 16
#> 2 Dorian 2013 4
# just one row
anti_join(
storm_count,
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
by = c("name", "n")
)
#> name id n
#> 1 Dorian 397 20
# see here the dates of 2013-07-27 and 2013-08-02
storms %>%
filter(name == "Dorian", year == 2013) %>%
count(name, year, month, day)
#> # A tibble: 7 x 5
#> name year month day n
#> <chr> <dbl> <dbl> <int> <int>
#> 1 Dorian 2013 7 23 1
#> 2 Dorian 2013 7 24 4
#> 3 Dorian 2013 7 25 4
#> 4 Dorian 2013 7 26 4
#> 5 Dorian 2013 7 27 3
#> 6 Dorian 2013 8 2 1
#> 7 Dorian 2013 8 3 3
Created on 2022-04-15 by the reprex package (v2.0.1)
For your first problem:
storm_ID <- storms %>%
  select(year, month, name) %>%
  group_by(year, month, name) %>%
  mutate(ID = stringr::str_c(name, cur_group_id()))
This creates a unique storm-name ID, e.g. Amy1, Amy2, etc.
This is how you can check whether a storm has happened in consecutive years:
storms %>%
  group_by(name) %>%
  mutate(consec_helper = cumsum(c(1, diff(year) != 1))) %>%
  group_by(name, consec_helper) %>%
  filter(n() > 1)
I find this to be true only for Zeta
name year
<chr> <dbl>
1 Zeta 2005
2 Zeta 2006
Thanks for your approaches; unfortunately, none of them was quite the solution I needed.
I asked my professor for help, and he said I could solve it with a loop (I didn't expect an answer). So I check each name against the one on the previous row and look for changes. The data set is sorted by date, so Zeta does not occur on consecutive rows unless it is the same storm.
My current solution is:
install.packages("dplyr")
library(dplyr)
id <- c()
j <- 1
k <- 1
for (i in storms$name) {
  if (k - 1 == 0) {
    id <- append(id, j)
    k <- k + 1
    next
  }
  if (i != storms$name[k - 1]) {
    j <- j + 1
  }
  id <- append(id, j)
  k <- k + 1
}
storms <- cbind(storms, id)
View(storms)
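For reference, the loop above computes a run-length ID over name; a vectorized sketch, under the same assumption that the rows are sorted by date:
# increment the ID whenever the name changes from one row to the next
storms$id <- cumsum(c(1, head(storms$name, -1) != tail(storms$name, -1)))
# or equivalently, with data.table: storms$id <- data.table::rleid(storms$name)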
I have now manually checked the dataset and think it is the appropriate solution to my problem.
This brings me to 511 different storms. (As of 22-04-15)
Nevertheless, thank you for all the solutions, I appreciate it very much.

Weighted mean of a group, where weight is from another group

Suppose you have a long data.frame of the following form:
ID Group Year Field VALUE
 1     1 2016    AA    10
 2     1 2016    AA    16
 1     1 2016 TOTAL   100
 2     1 2016 TOTAL   120
etc.
and you want to create a grouped output of weighted.mean(VALUE, ??) for each group_by(Group, Year, Field), using the Field == 'TOTAL' values as the weights, for years > 2013.
So far I am using dplyr:
dat %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, VALUE[Field == 'TOTAL'])) %>%
  ungroup()
Now the problem (to my understanding) is that once I group_by(Group, Year, Field), each group only sees its own Field value, so from within the Field == 'AA' group I can no longer reference the 'TOTAL' rows. Transforming the data from long to wide is not a solution, as I have >1000 different Field values which potentially increase over time, and this code will be run daily at some point.
First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but I am sure you could do this with fill() or across().
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1, 2, 1, 2),
                  Group = rep(1, 4),
                  Year = rep(2016, 4),
                  Field = c("AA", "AA", "TOTAL", "TOTAL"),
                  VALUE = c(10, 16, 100, 120))
# Make a new dataframe containing the TOTAL values
weights <- dat %>%
  filter(Field == "TOTAL") %>%
  mutate(w = VALUE) %>%
  select(-Field, -VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120
# Make a new frame containing the original values and the weights
new_dat <- left_join(dat, weights, by = c("Group", "Year", "ID"))
# Summarize, weighting each VALUE by its group's TOTAL
new_dat %>%
  filter(Year > 2013) %>%
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, w)) %>%
  ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.
Created on 2020-11-03 by the reprex package (v0.3.0)
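For what it's worth, here is a join-free sketch of the same idea, assuming exactly one TOTAL row per ID/Group/Year: a grouped mutate() can broadcast the TOTAL value as the weight directly.
library(dplyr)
dat %>%
  group_by(ID, Group, Year) %>%
  mutate(w = VALUE[Field == "TOTAL"]) %>%  # broadcast each group's TOTAL as the weight
  group_by(Group, Year, Field) %>%
  summarize(m = weighted.mean(VALUE, w), .groups = "drop")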

R: Grouping in a hierarchy

I'm working on a dataset with a six-digit grouping system. The first two digits denote the top-level group, the next two denote different sub-groups, and the last two digits denote the specific type within the sub-group. I want to group the data at the top level of the hierarchy (the first two digits only) and count unique names in each group.
An example for the GroupID 010203:
01 denotes BMW
02 denotes 3-series
03 denotes 320i (the exact model)
All I care about in this example is how many of each brand there is.
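A tiny sketch of pulling the three levels out of a GroupID stored as a character string (so the leading zero survives):
g <- "010203"
substr(g, 1, 2)  # "01" - the brand (BMW)
substr(g, 3, 4)  # "02" - the series (3-series)
substr(g, 5, 6)  # "03" - the exact model (320i)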
Toy dataset and wanted output (note GroupID is given as character strings - as bare numeric literals the leading zeros would be dropped, e.g. 010203 would become 10203):
df <- data.table(Quarter = c('Q4', 'Q4', 'Q4', 'Q4', 'Q3'),
                 GroupID = c('010203', '150503', '010101', '150609', '010000'),
                 Name = c('AAAA', 'AAAA', 'BBBB', 'BBBB', 'CCCC'))
Output:
Quarter Group Counts
Q3      01    1
Q4      01    2
Q4      15    2
Using data.table we could do:
library(data.table)
dt[, Group := substr(GroupID, 1, 2)][
, Counts := .N, by = list(Group, Quarter)][
, head(.SD, 1), by = .(Quarter, Group, Counts)][
, .(Quarter, Group, Counts)]
Returns:
Quarter Group Counts
1: Q4 01 2
2: Q4 15 2
3: Q3 01 1
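One caveat: .N counts rows, while the question asks for unique names. With the toy data the two coincide, but if a name can repeat within a group, a sketch counting distinct names instead (data.table's uniqueN):
df[, .(Counts = uniqueN(Name)), by = .(Quarter, Group = substr(GroupID, 1, 2))]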
With dplyr and stringr we could do something like:
library(dplyr)
library(stringr)
df %>%
  mutate(Group = str_sub(GroupID, 1, 2)) %>%
  group_by(Group, Quarter) %>%
  summarise(Counts = n()) %>%
  ungroup()
Returns:
# A tibble: 3 x 3
Group Quarter Counts
<chr> <fct> <int>
1 01 Q3 1
2 01 Q4 2
3 15 Q4 2
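As a small aside, group_by() + summarise(n()) can be collapsed with count(), which also accepts variables created on the fly; a sketch with the same data:
df %>% count(Group = str_sub(GroupID, 1, 2), Quarter, name = "Counts")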
Since you are already using data.table, you can do:
df[, Group := substr(GroupID, 1, 2)]
df <- df[, Counts := .N, .(Group, Quarter)][, .(Group, Quarter, Counts)]
df <- unique(df)
print(df)
   Group Quarter Counts
1:    01      Q4      2
2:    15      Q4      2
3:    01      Q3      1
Here's my simple solution with plyr and base R; it is lightning fast.
library(plyr)
df$breakid <- substr(df$GroupID, start = 1, stop = 2)
d <- plyr::count(df, c("Quarter", "breakid"))
Result
Quarter breakid freq
Q3      01      1
Q4      01      2
Q4      15      2
Alternatively, using tapply (and data.table indexing):
df$Brand <- substr(df$GroupID, 1, 2)
tapply(df$Brand, df[, .(Quarter, Brand)], length)
(If you don't care about the output being a matrix).

Calculate largest value for multiple overlapping events in a specific range

I have multiple large data frames that capture events that last a certain amount of time. This example gives a simplified version of my data set
Data frame 1:
ID Days Date Value
1 10 80 30
1 10 85 30
2 20 75 20
2 10 80 20
3 5 90 30
Data frame 2:
ID Days Date Value
1 20 0 30
1 10 3 20
2 20 5 30
3 20 1 10
3 10 10 10
The same ID is used for the same person in all datasets
Days specifies the length of the event (if Days has the value 10 then the event lasts 10 days)
Date specifies the day on which the event starts. In this case, Date can be any number between 0 and 90 or 91 (the data represent days within a quarter)
Value is an attribute that applies for the number of Days specified. For example, for the first row in df1, the value 30 applies for 10 days starting from day 80
What I am interested in is, for each ID in each data frame, the highest value reached on any single day. Keep in mind that multiple events can overlap, and their values then have to be summed.
The final data frame should look like this:
ID HighestValuedf1 HighestValuedf2
1 60 80
2 40 30
3 30 20
For example, for ID 1 three events overlapped and resulted in the highest value of 80 in data frame 2. There was no overlap between the events of df1 and df2 for ID 3, only an overlap within df2.
I would prefer a solution that avoids merging all data frames into one data frame because of the size of my files.
EDIT
I rearranged my data so that all events that overlap are in one data frame. I only need the highest overlap value for every data frame.
Code to reproduce the data frames:
ID <- c(1, 1, 2, 2, 3)
Date <- c(80, 85, 75, 80, 90)
Days <- c(10, 10, 20, 10, 5)
Value <- c(30, 30, 20, 20, 30)
df1 <- data.frame(ID, Days, Date, Value)
ID <- c(1, 1, 2, 3, 3)
Date <- c(1, 3, 5, 1, 10)
Days <- c(20, 10, 20, 20, 10)
Value <- c(30, 20, 30, 10, 10)
df2 <- data.frame(ID, Days, Date, Value)
ID <- c(1, 2, 3)
HighestValuedf1 <- c(60, 40, 30)
HighestValuedf2 <- c(80, 30, 20)
df3 <- data.frame(ID, HighestValuedf1, HighestValuedf2)
I am interpreting "highest value per day" to mean the highest value reached on any single day throughout the time period. This is probably not the most efficient solution, since I expect something could be done with map or apply functions, but I didn't see how on first look. Using df1 and df2 as defined above:
EDIT: Modified code upon understanding that df1 and df2 represent sequential quarters. I think the easiest way to handle this is simply to stack the data frames, so anything that overlaps is automatically caught (i.e. day 1 of df2 is day 91 overall). Because quarters differ in length, you will probably need to adjust this code manually, or preferably convert days-of-quarter into actual calendar dates (df1 day 1 becomes January 1st 2017, for example). The code below just shifts df2 by 90 days to achieve this, and then produces the per-quarter results by filtering on days 1:90 and 91:180, as shown.
ID <- c(1, 1, 2, 2, 3)
Date <- c(80, 85, 75, 80, 90)
Days <- c(10, 10, 20, 10, 5)
Value <- c(30, 30, 20, 20, 30)
df1 <- data.frame(ID, Days, Date, Value)
ID <- c(1, 1, 2, 3, 3)
Date <- c(1, 3, 5, 1, 10)
Days <- c(20, 10, 20, 20, 10)
Value <- c(30, 20, 30, 10, 10)
df2 <- data.frame(ID, Days, Date, Value)
library(tidyverse)
#> -- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.1 --
#> v ggplot2 2.2.1.9000 v purrr 0.2.4
#> v tibble 1.4.2 v dplyr 0.7.4
#> v tidyr 0.7.2 v stringr 1.2.0
#> v readr 1.1.1 v forcats 0.2.0
#> -- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
df2 <- df2 %>%
  mutate(Date = Date + 90)
# Make a dataframe with complete set of day-ID combinations
df_completed <- df1 %>%
  mutate(day = factor(Date, levels = 1:180)) %>% # set to total day length
  complete(ID, day) %>%
  mutate(daysum = 0) %>%
  select(ID, day, daysum)
# Function to apply to each data frame containing events
# Should take each event and add value to the appropriate days
sum_df_daily <- function(df_complete, df) {
  for (i in 1:nrow(df)) {
    event_days <- seq(df[i, "Date"], df[i, "Date"] + df[i, "Days"] - 1)
    df_complete <- df_complete %>%
      mutate(
        to_add = case_when(
          ID == df[i, "ID"] & day %in% event_days ~ df[i, "Value"],
          !(ID == df[i, "ID"] & day %in% event_days) ~ 0
        ),
        daysum = daysum + to_add
      )
  }
  return(df_complete)
}
df_filled <- df_completed %>%
  sum_df_daily(df1) %>%
  sum_df_daily(df2) %>%
  mutate(
    quarter = case_when(
      day %in% 1:90 ~ "q1",
      day %in% 91:180 ~ "q2"
    )
  )
df_filled %>%
  group_by(quarter, ID) %>%
  summarise(maxsum = max(daysum))
#> # A tibble: 6 x 3
#> # Groups: quarter [?]
#> quarter ID maxsum
#> <chr> <dbl> <dbl>
#> 1 q1 1.00 60.0
#> 2 q1 2.00 40.0
#> 3 q1 3.00 30.0
#> 4 q2 1.00 80.0
#> 5 q2 2.00 30.0
#> 6 q2 3.00 40.0
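For what it's worth, a shorter sketch under the same stacking assumption (starting from df1 and df2 as originally defined, before the +90 shift above), expanding each event into its individual days rather than looping row by row; purrr and tidyr are part of the tidyverse loaded above:
library(purrr)
bind_rows(df1, mutate(df2, Date = Date + 90)) %>%    # stack quarters: df2 day 1 becomes day 91
  mutate(day = map2(Date, Date + Days - 1, seq)) %>% # list each event's individual days
  unnest(day) %>%                                    # one row per event-day
  group_by(ID, day) %>%
  summarise(daysum = sum(Value)) %>%                 # total value per ID and day
  ungroup() %>%
  mutate(quarter = if_else(day <= 90, "q1", "q2")) %>%
  group_by(quarter, ID) %>%
  summarise(maxsum = max(daysum))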
