Keep less recent duplicate row in R

So, I have a dataset with bill numbers, day, month, year, and aggregate value. There are a bunch of bill number duplicates, and I want to keep the earliest one of each. If there is a duplicate with the same day, month, and year, I want to keep the one with the highest amount in aggregate value.
For example, if the dataset now looks like this:
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4
I want the result to look like this:
Bill Number Day Month Year Ag. Value
1 10 4 1998 10
2 23 11 2001 12
3 11 3 2005 8
I'm not sure if there is a command I can use and just introduce all these arguments or if I should do it in stages, but either way I'm not sure how to begin. I used duplicated() and unique() and then got stuck.
Thanks!

library( data.table )
dt <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
dt[ !duplicated( Bill_Number), ]
# Bill_Number Day Month Year Ag_Value
# 1: 1 10 4 1998 10
# 2: 2 23 11 2001 12
# 3: 3 11 3 2005 8
or
dt[, .SD[1], by = .(Bill_Number) ] #other approach, a bit slower

duplicated() flags the entries that are identical to an earlier one (i.e. one with a smaller subscript). Therefore, sorting your bill numbers by date (earliest at the top) and then removing duplicates should do the trick. Combining your day, month and year columns into one date column might be helpful.
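The sort-then-deduplicate idea can be sketched in base R as follows. This is only an illustration under some assumptions: the data frame name bills, the helper column Date, and the underscore column names are made up here, and ties on the same date are broken by the larger Ag_Value as the question asks.

```r
# Sample data from the question (column names with underscores are an assumption)
bills <- data.frame(
  Bill_Number = c(1, 1, 2, 2, 3, 3, 3),
  Day         = c(10, 11, 23, 23, 11, 12, 13),
  Month       = c(4, 4, 11, 11, 3, 3, 3),
  Year        = c(1998, 1998, 2001, 2001, 2005, 2005, 2005),
  Ag_Value    = c(10, 14, 12, 9, 8, 9, 4)
)

# Combine day, month and year into a single Date column
bills$Date <- as.Date(ISOdate(bills$Year, bills$Month, bills$Day))

# Sort: bill number, then earliest date, then highest value first
ord <- order(bills$Bill_Number, bills$Date, -bills$Ag_Value)
sorted <- bills[ord, ]

# duplicated() marks every repeat after the first, so negating it keeps the first row per bill
result <- sorted[!duplicated(sorted$Bill_Number), ]
result
```

Because the sort puts the preferred row first within each bill number, a single duplicated() call is enough and no loop is needed.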

This answer uses dplyr package and satisfies your condition: "If there is a duplicate with the same day, month, and year, I want to keep the one with the highest amount in aggregate value."
library(data.table)
library(dplyr)
myData <- fread("Bill_Number Day Month Year Ag_Value
1 10 4 1998 10
1 11 4 1998 14
2 23 11 2001 12
2 23 11 2001 9
3 11 3 2005 8
3 12 3 2005 9
3 13 3 2005 4", header = TRUE)
myData <- as_tibble(myData) #tibble form
sData <- arrange(myData, Bill_Number, Year, Month, Day, desc(Ag_Value)) #sort so the earliest date (and, on ties, the highest value) comes first per bill
fData <- distinct(sData, Bill_Number, .keep_all = TRUE) #final data: first row per bill number
fData
# A tibble: 3 x 5
Bill_Number Day Month Year Ag_Value
<int> <int> <int> <int> <int>
1 1 10 4 1998 10
2 2 23 11 2001 12
3 3 11 3 2005 8

I used some loops and condition checks, and tried with a test set besides the "base" set you mentioned.
library(tidyverse)
#base dataset
billNumber <- c(1,1,2,2,3,3,3)
day <- c(10,11,23,23,11,12,13)
month <- c(4,4,11,11,3,3,3)
year <- c(1998,1998,2001,2001,2005,2005,2005)
agValue <- c(10,14,12,9,8,9,4)
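#test dataset (these assignments overwrite the base vectors above; skip them to reproduce the output shown below)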
billNumber <- c(1,1,2,2,3,3,3,4,4,4)
day <- c(10,11,23,23,11,12,13,15,15,15)
month <- c(4,4,11,11,3,3,3,6,6,6)
year <- c(1998,1998,2001,2001,2005,2005,2005,2020,2020,2020)
agValue <- c(10,14,9,12,8,9,4,13,15,8)
#build the dataset
df <- data.frame(billNumber,day,month,year,agValue)
#add a couple of working columns
df_full <- df %>%
  mutate(
    concat = paste(billNumber, day, month, year, sep = "-"),
    flag = ""
  )
df_full
billNumber day month year agValue concat flag
1 1 10 4 1998 10 1-10-4-1998
2 1 11 4 1998 14 1-11-4-1998
3 2 23 11 2001 12 2-23-11-2001
4 2 23 11 2001 9 2-23-11-2001
5 3 11 3 2005 8 3-11-3-2005
6 3 12 3 2005 9 3-12-3-2005
7 3 13 3 2005 4 3-13-3-2005
#separate records with one/multiple occurrences as defined in the question
row_single <- df_full %>% count(concat) %>% filter(n == 1)
df_full_single <- df_full[df_full$concat %in% row_single$concat,]
row_multi <- df_full %>% count(concat) %>% filter(n > 1)
df_full_multi <- df_full[df_full$concat %in% row_multi$concat,]
#flag the rows with a single occurrence
df_full_single[1,]$flag = "Y"
for (row in 2:nrow(df_full_single)) {
  if (df_full_single[row,]$billNumber == df_full_single[row-1,]$billNumber) {
    df_full_single[row,]$flag = "N"
  } else {
    df_full_single[row,]$flag = "Y"
  }
}
df_full_single
#flag the rows with multiple occurrences
df_full_multi[1,]$flag = "Y"
for (row in 2:nrow(df_full_multi)) {
  # && is the scalar "and", which is what an if() condition expects
  if (
    (df_full_multi[row,]$billNumber == df_full_multi[row-1,]$billNumber) &&
    (df_full_multi[row,]$agValue > df_full_multi[row-1,]$agValue)
  ) {
    df_full_multi[row,]$flag = "Y"
    df_full_multi[row-1,]$flag = "N"
  } else {
    df_full_multi[row,]$flag = "N"
  }
}
df_full_multi
#rebuild full dataset and retrieve the desired output
df_full_final <- rbind(df_full_single,df_full_multi)
df_full_final <- df_full_final[df_full_final$flag == "Y",c(1,2,3,4,5)]
df_full_final <- df_full_final[order(df_full_final$billNumber),]
df_full_final
billNumber day month year agValue
1 1 10 4 1998 10
3 2 23 11 2001 12
5 3 11 3 2005 8

Related

How to group non-exclusive (overlapping) year ranges using a loop with dplyr

I'm new here, so my question may be hard to follow. I have some data with date information, and I need to compute the mean of the data over year ranges. These ranges are not mutually exclusive: for example, the first range is 2013-2015, then 2014-2016, then 2015-2017, and so on. I think it could be done with a loop and dplyr, but I don't know how to do it. I'd be very thankful if someone could help me.
Thank you,
Alejandro
What I tried was like:
for (i in Year){
Year_3=c(i, i+1, i+2)
db %>% group_by(Year_3)
#....etc
}
As you note, each observation would be used in multiple groups, so one approach could be to make copies of your data accordingly:
df <- data.frame(year = 2013:2020, value = 1:8)
library(dplyr)
df %>%
tidyr::uncount(3, .id = "grp") %>%
mutate(group_start = year - grp + 1,
group_name = paste0(group_start, "-", group_start + 2)) %>%
group_by(group_name) %>%
summarise(value = mean(value),
n = n())
# A tibble: 10 × 3
group_name value n
<chr> <dbl> <int>
1 2011-2013 1 1
2 2012-2014 1.5 2
3 2013-2015 2 3
4 2014-2016 3 3
5 2015-2017 4 3
6 2016-2018 5 3
7 2017-2019 6 3
8 2018-2020 7 3
9 2019-2021 7.5 2
10 2020-2022 8 1
Or we might take a more algebraic approach, noting that the sum over a three-year period is the cumulative amount two years in the future minus the cumulative amount for the prior year. This approach excludes the partial ranges.
df %>%
mutate(cuml = cumsum(value),
value_3yr = (lead(cuml, n = 2) - lag(cuml, default = 0)) / 3)
year value cuml value_3yr
1 2013 1 1 2
2 2014 2 3 3
3 2015 3 6 4
4 2016 4 10 5
5 2017 5 15 6
6 2018 6 21 7
7 2019 7 28 NA
8 2020 8 36 NA
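For comparison, the loop-style idea from the question can be sketched in base R by iterating over the start year of each complete window. This is a minimal sketch, assuming the same example df as above; the names starts and res are illustrative only.

```r
df <- data.frame(year = 2013:2020, value = 1:8)

# Start years of every complete three-year window
starts <- min(df$year):(max(df$year) - 2)

# For each start year s, average the values with year in [s, s + 2]
res <- data.frame(
  group_name = paste0(starts, "-", starts + 2),
  value = sapply(starts, function(s) {
    mean(df$value[df$year >= s & df$year <= s + 2])
  })
)
res
```

Like the cumulative-sum version, this skips the partial ranges at the edges; it simply trades speed for readability on small data.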

How to select only the individuals that appear in every year of a dataframe in r

I'm trying to subset the individuals that have been present for the duration of the whole study starting in 2014 and ending in 2019. So, the output would be a list of names that are present in every year of the dataframe.
I've tried the following code:
big_data <- dplyr::bind_rows(df1, df2, df3, df4, df5, df6) # I've bound 6 different dataframes (each with data from one of the years) by row. These dfs have a different number of rows and columns. Some columns repeat in different years, while others don't.
Date <- as.POSIXlt.Date(big_data$Date)
Year <- separate(big_data, Date, into = c('Month', 'Day', 'Year') %>% select(Year)) # I've extracted the Year from the Date variable (DD/MM/YYYY)
Year <- big_data$Year # I've added it to the big_data
Interval <- Year %between% c("2014", "2019") # I've created a timeperiod with the start and end years of the study
big_data [, all.names(FocalID %in% Interval)] # I've tried to get the names of the individuals (in variable FocalID) that are present in the interval (but probably doesn't mean in every year)
Obviously this code didn't work. Could you help me out? Thank you!
If your data frame has rows with id and year, for example:
big_data <- data.frame(
id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
year = c(2014:2019, 2014:2019, 2014:2018)
)
id year
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 1 2018
6 1 2019
7 1 2014
8 2 2015
9 2 2016
10 2 2017
11 2 2018
12 3 2019
13 3 2014
14 3 2015
15 3 2016
16 3 2017
17 3 2018
You can use the dplyr package from the tidyverse to group_by individual subject id, and then check that each id's rows contain all of the years 2014-2019 in year. This keeps all rows for a given id only if every year is represented.
library(dplyr)
big_data %>%
group_by(id) %>%
filter(all(2014:2019 %in% year))
A base R option would be the following:
big_data[big_data$id %in% Reduce(intersect, split(big_data$id, big_data$year)), ]
In this example, id of 1 and 3 include all years 2014-2019.
Output
id year
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 1 2018
6 1 2019
7 1 2014
12 3 2019
13 3 2014
14 3 2015
15 3 2016
16 3 2017
17 3 2018
Another option with data.table
library(data.table)
setDT(big_data)[big_data[, .I[all(2014:2019 %in% year)], id]$V1]
Output
# id year
# 1: 1 2014
# 2: 1 2015
# 3: 1 2016
# 4: 1 2017
# 5: 1 2018
# 6: 1 2019
# 7: 1 2014
# 8: 3 2019
# 9: 3 2014
#10: 3 2015
#11: 3 2016
#12: 3 2017
#13: 3 2018
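Since the question asks for a list of names rather than the filtered rows, the same per-id check can be reduced to a vector of ids. The following is a base R sketch using the example data; the names years_needed, present and always_present are illustrative assumptions.

```r
big_data <- data.frame(
  id = c(1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,3),
  year = c(2014:2019, 2014:2019, 2014:2018)
)

years_needed <- 2014:2019

# For each id, check whether every required year appears among its rows
present <- vapply(
  split(big_data$year, big_data$id),
  function(y) all(years_needed %in% y),
  logical(1)
)

# Names of the ids that are present in every year (returned as character)
always_present <- names(present)[present]
always_present
```

With real data, id would be replaced by the FocalID column and year by the year extracted from Date.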

Extracting data from different columns in a data frame based on row values

From each row in the data frame, df, I want to extract values in columns, as explained below and create a new data frame, output.
When Year is equal to 2003, I need values in Y_2001 and Y_2002 columns, in output data frame as Year 1 and Year 2. They are the values corresponding to two years prior to year specified in Year column. Similarly, if year equals to 2006, I need values in Y_2004 and Y_2005 in output data frame. Likewise, for all years in Year column.
> df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005
[1,] 1 2003 2 4 6 4 3
[2,] 2 2004 5 9 7 1 2
[3,] 3 2006 4 3 5 7 8
[4,] 4 2004 7 6 4 8 9
> output
ID Year Year1 Year2
[1,] 1 2003 2 4
[2,] 2 2004 9 7
[3,] 3 2006 7 8
[4,] 4 2004 6 4
Can someone please help me to create a code to get above output? Highly appreciate any support.
Here is a tidyverse solution:
First, put the data into long form with pivot_longer. The values of interest are those where the "column" year is 1 or 2 years less than the Year in that row. You can filter on these differences (the filter here is explicit for differences of 1 or 2 years).
An additional column is created with mutate to hold your column names Year1 and Year2 (note that Year1 is a difference of 2 years and Year2 is a difference of 1 year, so the difference is subtracted from 3 to reverse the order). Finally, pivot_wider puts the data back into wide form.
library(tidyverse)
df %>%
  # names_transform converts the year part of the column name from character to integer
  pivot_longer(cols = -c(ID, Year), names_to = c(".value", "Year_Sep"), names_sep = "_", names_transform = list(Year_Sep = as.integer)) %>%
  filter(Year - Year_Sep == 1 | Year - Year_Sep == 2) %>%
  mutate(YearCol = paste0("Year", 3 - (Year - Year_Sep))) %>%
  pivot_wider(id_cols = c(ID, Year), names_from = YearCol, values_from = Y)
Output
# A tibble: 4 x 4
ID Year Year1 Year2
<int> <int> <int> <int>
1 1 2003 2 4
2 2 2004 9 7
3 3 2006 7 8
4 4 2004 6 4
Bit of a clunky solution, but ...
i.col <- function(data, n) { # Returns the column index corresponding to the year
sapply(data$Year-n, function(x) grep(x, names(data)))
}
df$Year1 <- diag(as.matrix(df[, i.col(df, n=2)]))
df$Year2 <- diag(as.matrix(df[, i.col(df, n=1)]))
Edit:
Apparently using diag is very slow. Using cbind to access matrix elements is preferred.
df$Year1 <- df[cbind(seq_len(nrow(df)), i.col(df, n=2))] # one (row, column) index pair per row
df$Year2 <- df[cbind(seq_len(nrow(df)), i.col(df, n=1))]
df
ID Year Y_2001 Y_2002 Y_2003 Y_2004 Y_2005 Year1 Year2
1 1 2003 2 4 6 4 3 2 4
2 2 2004 5 9 7 1 2 9 7
3 3 2006 4 3 5 7 8 7 8
4 4 2004 7 6 4 8 9 6 4
Here is one way with row-wise apply assuming that you can find out the starting year (2001).
cbind(df[1:2], t(apply(df[-1], 1, function(x)
{ vals <- x[1] - 2001; x[c(vals:(vals + 1))]})))
# ID Year 1 2
#1 1 2003 2 4
#2 2 2004 9 7
#3 3 2006 7 8
#4 4 2004 6 4
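Because df is never constructed in the question, here is a self-contained base R sketch of the matrix-indexing idea. It assumes df is a data frame with the values printed in the question and that the year columns are always named Y_ followed by the year; the names rows, col1, col2 and m are illustrative.

```r
# Data frame rebuilt from the printed example (an assumption about its structure)
df <- data.frame(
  ID = 1:4,
  Year = c(2003, 2004, 2006, 2004),
  Y_2001 = c(2, 5, 4, 7),
  Y_2002 = c(4, 9, 3, 6),
  Y_2003 = c(6, 7, 5, 4),
  Y_2004 = c(4, 1, 7, 8),
  Y_2005 = c(3, 2, 8, 9)
)

rows <- seq_len(nrow(df))

# Column positions of the years two and one before each row's Year
col1 <- match(paste0("Y_", df$Year - 2), names(df))
col2 <- match(paste0("Y_", df$Year - 1), names(df))

# Index the numeric matrix with (row, column) pairs, one pair per row
m <- as.matrix(df)
output <- data.frame(
  ID = df$ID,
  Year = df$Year,
  Year1 = m[cbind(rows, col1)],
  Year2 = m[cbind(rows, col2)]
)
output
```

Using match on exact column names avoids the partial matches that grep could produce, and a year without a matching column yields NA instead of an error.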

Remove rows out of a specific year range, without using for-loop in R

I am looking for a way to omit the rows that do not belong to a complete run between two specific values, without using a for loop. All values in the year column are between 1999 and 2002, but some runs do not include all years between these two dates. The initial data looks like this:
a <- data.frame(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
id=c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
year id
1 2000 4
2 2001 6
3 2002 2
4 1999 1
5 2000 3
6 2001 5
7 2002 7
8 1999 4
9 2000 2
10 2001 0
11 2002 -1
12 1999 -3
13 2000 4
14 2001 3
Processed dataset should only include consecutive rows between 1999:2002. The following data.frame is exactly what I need:
year id
1 1999 1
2 2000 3
3 2001 5
4 2002 7
5 1999 4
6 2000 2
7 2001 0
8 2002 -1
When I execute the following for loop, I get previous data.frame without any problem:
for(i in 1:which(a$year == 2002)[length(which(a$year == 2002))]){
if(a[i,1] == 1999 & a[i+3,1] == 2002){
b <- a[i:(i+3),]
}else{next}
if(!exists("d")){
d <- b
}else{
d <- rbind(d,b)
}
}
However, I have more than 1 million rows and I need to do this process without using for loop. Is there any faster way for that?
You could try this. First we create groups of consecutive numbers, then we join with the full date range, then we filter if any group is not full. If you already have a grouping variable, this can be cut down a lot.
library(tidyverse)
df <- tibble(year = c(2000:2002,1999:2002,1999:2002,1999:2001),
             id = c(4,6,2,1,3,5,7,4,2,0,-1,-3,4,3))
df %>%
  # start a new group whenever the year does not follow the previous one
  mutate(groups = cumsum(c(0, diff(year) != 1))) %>%
  nest(data = -groups) %>%
  # join each group with the full range; missing years show up as NA ids
  mutate(data = map(data, ~full_join(.x, tibble(year = 1999:2002), by = "year")),
         drop = map_lgl(data, ~any(is.na(.x$id)))) %>%
  filter(drop == FALSE) %>%
  unnest(data) %>%
  select(-c(groups, drop))
#> # A tibble: 8 x 2
#> year id
#> <int> <dbl>
#> 1 1999 1
#> 2 2000 3
#> 3 2001 5
#> 4 2002 7
#> 5 1999 4
#> 6 2000 2
#> 7 2001 0
#> 8 2002 -1
Created on 2018-08-31 by the reprex package (v0.2.0).
There is a function that can filter rows by value.
First, install the dplyr package (or the whole tidyverse) with the command install.packages("dplyr") or install.packages("tidyverse").
Then, load the package with library(dplyr).
Then, use the filter function: a_filtered <- filter(a, year >= 1999 & year <= 2002).
This should be fast even when there are many rows. Note that it only filters on the value of year; it does not check that each run contains all four consecutive years.
We could also create a grouping column that starts a new group at every 'year' equal to 1999 (the cumulative sum of the logical expression), then filter to the groups whose first 'year' is 1999, whose last 'year' is 2002, and which contain every 'year' in between.
library(dplyr)
a %>%
group_by(grp = cumsum(year == 1999)) %>%
filter(dplyr::first(year) == 1999,
dplyr::last(year) == 2002,
all(1999:2002 %in% year)) %>%
ungroup %>% # in case to remove the 'grp'
select(-grp)
# A tibble: 8 x 2
# year id
# <int> <dbl>
#1 1999 1
#2 2000 3
#3 2001 5
#4 2002 7
#5 1999 4
#6 2000 2
#7 2001 0
#8 2002 -1
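The same grouping idea can be sketched without any packages. This is a minimal base R version, assuming (as in the example) that every run starts at 1999 whenever it is complete; the names grp, complete and result are illustrative.

```r
a <- data.frame(year = c(2000:2002, 1999:2002, 1999:2002, 1999:2001),
                id = c(4, 6, 2, 1, 3, 5, 7, 4, 2, 0, -1, -3, 4, 3))

# A new group starts at every row where year is 1999
grp <- cumsum(a$year == 1999)

# Per group: 1 if the years are exactly 1999, 2000, 2001, 2002 in order, else 0
complete <- ave(a$year, grp, FUN = function(y) {
  identical(as.integer(y), 1999:2002)
})

# Keep only the rows that belong to complete runs
result <- a[complete == 1, ]
result
```

Everything here is vectorised except the per-group check inside ave, so it avoids growing a data frame with rbind inside a loop.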

Choose a month of a year to rank then give resulting ranks to the rest years

Sample data:
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34))
id year month new.employee
1 A 2014 1 4
2 A 2014 2 6
3 A 2015 1 2
4 A 2015 2 6
5 B 2014 1 23
6 B 2014 2 2
7 B 2015 1 5
8 B 2015 2 34
Desired outcome:
desired_df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
new.employee=c(4,6,2,6,23,2,5,34),
new.employee.rank=c(1,1,2,2,2,2,1,1))
id year month new.employee new.employee.rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
The ranking rule is: I choose month 2 in each year to rank number of new employees between A and B. Then I need to give those ranks to month 1. i.e., month 1 of each year rankings must be equal to month 2 ranking in the same year.
I tried this code to get rankings for each month and each year,
library(data.table)
df1 <- data.table(df1)
df1[,rank:=rank(new.employee), by=c("year","month")]
If anyone can roll the rank value within a column, so that the rank of month 1 is replaced by the rank of month 2, that might be a solution.
You've tried a data.table solution, so here's how would I do this using data.table
library(data.table) # V1.9.6+
temp <- setDT(df1)[month == 2L, .(id, frank(-new.employee)), by = year]
df1[temp, new.employee.rank := i.V2, on = c("year", "id")]
df1
# id year month new.employee new.employee.rank
# 1: A 2014 1 4 1
# 2: A 2014 2 6 1
# 3: A 2015 1 2 2
# 4: A 2015 2 6 2
# 5: B 2014 1 23 2
# 6: B 2014 2 2 2
# 7: B 2015 1 5 1
# 8: B 2015 2 34 1
This is similar in spirit to the dplyr solution: it ranks the ids per year using month 2 only and joins the ranks back to the original data set. I'm using data.table V1.9.6+ here.
Here's a dplyr-based solution. The idea is to reduce the data to the parts you want to compare, make the comparison, then join the results back into the original data set, expanding it to fill all of the relevant slots. Note the edits to your code for creating the sample data.
df1 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=rep(c(2014,2014,2015,2015), 2),
month=rep(c(1,2), 4),
new.employee=c(4,6,2,6,23,2,5,34))
library(dplyr)
df1 %>%
# Reduce the data to the slices (months) you want to compare
filter(month==2) %>%
# Group the data by year, so the comparisons are within and not across years
group_by(year) %>%
# Create a variable that indicates the rankings within years in descending order
mutate(rank = rank(-new.employee)) %>%
# To prepare for merging, reduce the new data to just that ranking var plus id and year
select(id, year, rank) %>%
# Use left_join to merge the new data (.) with the original df, expanding the
# new data to fill all rows with id-year matches
left_join(df1, .) %>%
# Order the data by id, year, and month to make it easier to review
arrange(id, year, month)
Output:
Joining by: c("id", "year")
id year month new.employee rank
1 A 2014 1 4 1
2 A 2014 2 6 1
3 A 2015 1 2 2
4 A 2015 2 6 2
5 B 2014 1 23 2
6 B 2014 2 2 2
7 B 2015 1 5 1
8 B 2015 2 34 1
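The rank-then-join-back idea also works in base R. The sketch below assumes the corrected sample data from the dplyr answer; the names feb and key are illustrative, and match on a pasted id-year key stands in for the join.

```r
df1 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                  year = rep(c(2014, 2014, 2015, 2015), 2),
                  month = rep(c(1, 2), 4),
                  new.employee = c(4, 6, 2, 6, 23, 2, 5, 34))

# Keep only the reference month (month 2)
feb <- df1[df1$month == 2, ]

# Rank within each year, highest number of new employees first
feb$new.employee.rank <- ave(-feb$new.employee, feb$year, FUN = rank)

# Look up each row's (id, year) in the month-2 table and copy the rank over
key <- paste(df1$id, df1$year)
df1$new.employee.rank <- feb$new.employee.rank[match(key, paste(feb$id, feb$year))]
df1
```

Every month of a given id and year receives the month-2 rank, so month 1 automatically matches month 2.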
