Counting duplicates in R

I would like to know how it would be possible to make a new variable counting how many ID duplicates I have for certain years. For example, below I want to count for the year 2014 how many times before that year that ID was repeated. That way in the year 2015 it is counting the ID's in both 2013 and 2014.
ID Term Year Repeats
122 L 2013 N/A
112 L 2013 N/A
002 L 2013 N/A
152 L 2013 N/A
124 L 2013 N/A
122 L 2014 1
102 L 2014 N/A
142 L 2014 N/A
152 L 2014 N/A
120 L 2014 N/A
198 L 2014 N/A
122 L 2015 2
012 L 2015 N/A
101 L 2015 N/A
092 L 2015 N/A
031 L 2015 N/A

If Year is in ascending order:
df$Repeats <- 0L
i <- which(duplicated(df$ID))
df$Repeats[i] <- with(df[i, ], unsplit(lapply(split(ID, ID), seq_along), ID))
df
# ID Term Year Repeats
#1 122 L 2013 0
#2 112 L 2013 0
#3 2 L 2013 0
#4 152 L 2013 0
#5 124 L 2013 0
#6 122 L 2014 1
#7 102 L 2014 0
#8 142 L 2014 0
#9 152 L 2014 1
#10 120 L 2014 0
#11 198 L 2014 0
#12 122 L 2015 2
#13 12 L 2015 0
#14 101 L 2015 0
#15 92 L 2015 0
#16 31 L 2015 0
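
If the rows are not already in ascending Year order, sorting first keeps the count meaningful. A minimal sketch on made-up data (the toy values below are assumptions, not the asker's data), using only base R:

```r
# Toy data (hypothetical), deliberately not sorted by Year
df <- data.frame(ID   = c(122, 152, 122, 112, 122, 152),
                 Year = c(2015, 2014, 2013, 2013, 2014, 2013))

# Sort by Year so that earlier occurrences of an ID come first
df <- df[order(df$Year), ]

# Running count of earlier rows with the same ID
df$Repeats <- ave(df$ID, df$ID, FUN = seq_along) - 1
df
```

Because the count is positional, it only reflects "how many times before this year" once the rows are ordered by Year.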

Another base R solution:
d$Repeats <- ave(d$ID, d$ID, FUN = function(x) seq_along(x)-1)
# or a bit cleaner (thx to @DavidArenburg):
d$Repeats <- with(d, ave(ID, ID, FUN = seq_along)) - 1
which gives:
> d
ID Term Year Repeats
1 122 L 2013 0
2 112 L 2013 0
3 2 L 2013 0
4 152 L 2013 0
5 124 L 2013 0
6 122 L 2014 1
7 102 L 2014 0
8 142 L 2014 0
9 152 L 2014 1
10 120 L 2014 0
11 198 L 2014 0
12 122 L 2015 2
13 12 L 2015 0
14 101 L 2015 0
15 92 L 2015 0
16 31 L 2015 0
A solution using data.table:
library(data.table)
setDT(d, key = c('ID','Year'))
d[, Repeats := 0:(.N-1), by = ID]
which gives:
> d
ID Term Year Repeats
1: 2 L 2013 0
2: 12 L 2015 0
3: 31 L 2015 0
4: 92 L 2015 0
5: 101 L 2015 0
6: 102 L 2014 0
7: 112 L 2013 0
8: 120 L 2014 0
9: 122 L 2013 0
10: 122 L 2014 1
11: 122 L 2015 2
12: 124 L 2013 0
13: 142 L 2014 0
14: 152 L 2013 0
15: 152 L 2014 1
16: 198 L 2014 0
Alternatively, you can use the rowid function (in the development version of data.table when this was written; released in data.table 1.9.8 and later):
d[, Repeats := rowid(ID)-1]
With dplyr:
library(dplyr)
d %>% group_by(ID) %>% mutate(Repeats = row_number()-1)
If you want NAs instead of zeros, you could use:
# seq_len(.N-1) is empty for single-row groups, whereas 1:(.N-1) would give c(1, 0)
d[, Repeats := c(NA_integer_, seq_len(.N-1)), by = ID]
which will give:
ID Term Year Repeats
1: 2 L 2013 NA
2: 12 L 2015 NA
3: 31 L 2015 NA
4: 92 L 2015 NA
5: 101 L 2015 NA
6: 102 L 2014 NA
7: 112 L 2013 NA
8: 120 L 2014 NA
9: 122 L 2013 NA
10: 122 L 2014 1
11: 122 L 2015 2
12: 124 L 2013 NA
13: 142 L 2014 NA
14: 152 L 2013 NA
15: 152 L 2014 1
16: 198 L 2014 NA
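
The NA variant also has a base R counterpart. A minimal sketch, assuming a toy data frame d with only an ID column (the values are made up for illustration): compute the zero-based count with ave, then blank out the first occurrences.

```r
# Toy data (hypothetical): 122 occurs three times, 152 twice, 112 once
d <- data.frame(ID = c(122, 112, 122, 152, 122, 152))

# Zero-based running count per ID
d$Repeats <- ave(d$ID, d$ID, FUN = seq_along) - 1

# First occurrences have a count of 0; turn those into NA
d$Repeats[d$Repeats == 0] <- NA
d
```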

Related

Add multiple columns lagged by one year

I need to add a 1-year-lagged version of multiple columns from my dataframe. Here's my data:
data<-data.frame(Year=c("2011","2011","2011","2012","2012","2012","2013","2013","2013"),
Country=c("America","China","India","America","China","India","America","China","India"),
Value1=c(234,443,754,334,117,112,987,903,476),
Value2=c(2,4,5,6,7,8,1,2,2))
And I want to add two columns that contain Value1 and Value2 at t-1, so that my dataframe looks like this:
How can I do this? Would this be the correct way to lag my variables by year?
Thanks in advance!
Using data.table:
library(data.table)
setDT(data)
cols <- grep("^Value", colnames(data), value = TRUE)
data[, paste0(cols, "_lag") := lapply(.SD, shift), .SDcols = cols, by = Country]
# Year Country Value1 Value2 Value1_lag Value2_lag
# 1: 2011 America 234 2 NA NA
# 2: 2011 China 443 4 NA NA
# 3: 2011 India 754 5 NA NA
# 4: 2012 America 334 6 234 2
# 5: 2012 China 117 7 443 4
# 6: 2012 India 112 8 754 5
# 7: 2013 America 987 1 334 6
# 8: 2013 China 903 2 117 7
# 9: 2013 India 476 2 112 8
In dplyr, use lag by group:
library(dplyr) #1.1.0
data %>%
mutate(across(contains("Value"), lag, .names = "{col}_lagged"), .by = Country)
Year Country Value1 Value2 Value1_lagged Value2_lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
Below 1.1.0:
data %>%
group_by(Country) %>%
mutate(across(c(Value1, Value2), lag, .names = "{col}_lagged")) %>%
ungroup()
Another way to get the job done using dplyr:
library(dplyr)
data_lagged <- data %>%
group_by(Country) %>%
mutate(Value1_Lagged = lag(Value1),
Value2_Lagged = lag(Value2),
Year = as.integer(as.character(Year)) + 1)
data_final <- cbind(data, data_lagged[, c("Value1_Lagged", "Value2_Lagged")])
data_final
Output:
Year Country Value1 Value2 Value1_Lagged Value2_Lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
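
Note that shift() and lag() move values by one row within each group, which equals a one-year lag only when every country has exactly one row per consecutive year, sorted by Year. If years can be missing, one possible alternative is a self-merge on Year + 1. A rough base R sketch, assuming a numeric Year column and a made-up toy data set (the names toy, prev and out are hypothetical):

```r
# Toy data (hypothetical): country A has no row for 2013
toy <- data.frame(Year    = c(2011, 2012, 2014, 2011, 2012),
                  Country = c("A", "A", "A", "B", "B"),
                  Value1  = c(1, 2, 3, 10, 20))

# Copy of the data in which each value is relabelled with the following year
prev <- toy
prev$Year <- prev$Year + 1
names(prev)[names(prev) == "Value1"] <- "Value1_lag"

# Left join: each row picks up the value recorded for Year - 1, if any
out <- merge(toy, prev, by = c("Country", "Year"), all.x = TRUE)
out
```

Here the 2014 row for A gets NA rather than the 2012 value, which a purely row-based lag would have returned.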

Hi is there a way to code the minimum value that I want to get?

my data is as follows:
Year Type Amount
2013 A 100
2013 B 150
2013 C 100
2013 D 300
2014 A 200
2014 B 150
2014 C 170
2014 D 100
2014 E 120
2015 A 100
2015 B 350
2015 C 670
2015 D 300
2015 E 220
I'd like to only extract such that it gets the earliest and latest year of each type (A,B,C,D,E)
As seen, the earliest year of E starts from 2014, instead of 2013.
The output that I want will look like this:
Year Type Amount
2013 A 100
2013 B 150
2013 C 100
2013 D 300
2014 E 120
2015 A 100
2015 B 350
2015 C 670
2015 D 300
2015 E 220
Is there any way to code this, without hardcoding? This is in a dataframe format
Using dplyr, you can group by Type and keep only the rows whose Year is the minimum or maximum Year for that Type:
library(dplyr)
df %>%
group_by(Type) %>%
filter(Year == min(Year) | Year == max(Year))
Gives us:
Year Type Amount
<int> <chr> <int>
1 2013 A 100
2 2013 B 150
3 2013 C 100
4 2013 D 300
5 2014 E 120
6 2015 A 100
7 2015 B 350
8 2015 C 670
9 2015 D 300
10 2015 E 220
For your follow up, to calculate percent increase:
df %>%
group_by(Type) %>%
filter(Year == min(Year) | Year == max(Year)) %>%
arrange(Type) %>%
mutate(pct_change = (Amount[Year == max(Year)]/Amount[Year == min(Year)] - 1)*100)
Gives us:
Year Type Amount pct_change
<int> <chr> <int> <dbl>
1 2013 A 100 0
2 2015 A 100 0
3 2013 B 150 133.
4 2015 B 350 133.
5 2013 C 100 570
6 2015 C 670 570
7 2013 D 300 0
8 2015 D 300 0
9 2014 E 120 83.3
10 2015 E 220 83.3
You can use ave testing for each Type if Year is either min or max:
x[ave(x$Year, x$Type, FUN=function(y) y==min(y) | y==max(y))==1,]
# Year Type Amount
#1 2013 A 100
#2 2013 B 150
#3 2013 C 100
#4 2013 D 300
#9 2014 E 120
#10 2015 A 100
#11 2015 B 350
#12 2015 C 670
#13 2015 D 300
#14 2015 E 220
or using range and %in%
x[ave(x$Year, x$Type, FUN=function(y) y %in% range(y))==1,]
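
The == 1 in both ave calls is there because ave returns a vector of the same type as its first argument, so the logical result of the test comes back as 0/1. A small self-contained sketch on made-up data (toy values are assumptions) that makes this visible:

```r
# Toy data (hypothetical): type A spans three years, type E two
x <- data.frame(Year   = c(2013, 2014, 2015, 2014, 2015),
                Type   = c("A", "A", "A", "E", "E"),
                Amount = c(100, 200, 100, 120, 220))

# ave() stores the TRUE/FALSE test result in a numeric vector (1/0)
flag <- ave(x$Year, x$Type, FUN = function(y) y %in% range(y))
flag

# Convert back to logical and subset
keep <- flag == 1
x[keep, ]
```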

How to create a new column using looping and rbind in r?

I have data similar to this. I would like to make 3 columns (date1, date2, date3) by using looping and rbind, because I am required to do it by that method only.
(All I was told is to make a loop, subset the data, sort it, make a new data frame, then rbind it to make a new column.)
year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103
The result I expect is:
date1: number of days from 2011, January 1st, start again from 1 in a new year.
date2: number of days of an id working in a year, start again from 1 in a new year.
date3: number of days open within a year, start again from 1 in a new year.
(all of the dates are in ascending order)
year month day id date1 date2 date3
2011 1 5 3101 5 1 1
2011 1 14 3101 14 2 2
2011 2 3 3101 34 3 3
2011 2 4 3101 35 4 4
2012 1 27 3153 27 1 1
2012 2 20 3153 51 2 2
2012 2 22 3153 53 3 3
2012 3 1 3153 60 4 4
2013 1 31 3103 31 1 1
2013 2 1 3103 32 2 2
2013 2 4 3103 35 3 3
2013 3 4 3103 94 4 4
2013 3 6 3103 96 5 5
Please help! Thank you.
You can do it without a for loop or subset; here is one way:
df <- read.table(text =" year month day id
2011 1 5 3101
2011 1 14 3101
2011 2 3 3101
2011 2 4 3101
2012 1 27 3153
2012 2 20 3153
2012 2 22 3153
2012 3 1 3153
2013 1 31 3103
2013 2 1 3103
2013 2 4 3103
2013 3 4 3103
2013 3 6 3103",header = T)
library(lubridate)
df$date1 <- yday(mdy(paste0(df$month,"-",df$day,"-",df$year)))
df$date2 <- ave(df$year, df$id, FUN = seq_along)
df$date3 <- ave(df$year, df$year, FUN = seq_along)
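
If the loop, subset, sort and rbind structure really is a requirement, a minimal base R sketch could look like the following. It uses a shortened, made-up version of the data, and the names pieces, sub and result are hypothetical; treat it as one possible reading of the assignment, not the only one.

```r
# Shortened toy data (hypothetical subset of the layout in the question)
df <- data.frame(year  = c(2011, 2011, 2012, 2012, 2012),
                 month = c(1, 2, 1, 2, 3),
                 day   = c(5, 3, 27, 20, 1),
                 id    = c(3101, 3101, 3153, 3153, 3153))

pieces <- list()
for (yr in unique(df$year)) {
  # Subset one year and sort it by date
  sub <- df[df$year == yr, ]
  sub <- sub[order(sub$month, sub$day), ]

  # date1: day of the year ("%j"); 1 March 2012 is day 61 because 2012 is a leap year
  d <- as.Date(ISOdate(sub$year, sub$month, sub$day))
  sub$date1 <- as.integer(format(d, "%j"))

  # date3: running count of rows within the year
  sub$date3 <- seq_len(nrow(sub))

  pieces[[length(pieces) + 1]] <- sub
}

# Stack the per-year pieces back into one data frame
result <- do.call(rbind, pieces)
result
```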

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have the data which is represented by various subjects in 5 years. I need to remove all the subjects, which are missing any of years from 2011 to 2015. How can I accomplish it, so in given data only subject A is left?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == TRUE, ][,keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires that the unique years be exactly equal to 2011:2015. If there is a 2016, for example that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == TRUE, ][,keep := NULL]
Thus, if for example, A had a 2016 year and a 2010 year it would still keep all of A. But if anyone is missing a year in 2011:2015 this would exclude them.
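
The difference between the two tests can be checked on plain vectors. A small base R sketch (the vector names are made up for illustration); it also shows that identical is sensitive to storage type and order, and that setequal is an order-insensitive way to ask for exactly these years:

```r
target <- 2011:2015

yrs_extra <- 2010:2016                    # all target years plus two more
yrs_gap   <- c(2011L, 2012L, 2013L)       # two target years missing
yrs_dbl   <- as.numeric(2011:2015)        # same values stored as double
yrs_rev   <- rev(2011:2015)               # same values, reversed order

identical(unique(yrs_extra), target)      # FALSE: extra years break the strict test
all(target %in% yrs_extra)                # TRUE:  the lenient test tolerates them
all(target %in% yrs_gap)                  # FALSE: missing years fail both tests

identical(yrs_dbl, target)                # FALSE: double vs integer
identical(yrs_rev, target)                # FALSE: order matters
setequal(yrs_rev, target)                 # TRUE:  same set of years
```

So the strict version silently depends on year being an integer column sorted in ascending order within each Name.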
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == TRUE, 1] ,]
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year:
library(dplyr)
library(tidyr)
df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
group_by(Name) %>%
filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
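
Counting rows per name assumes that no subject has a duplicated year or a year outside 2011 to 2015. If that cannot be guaranteed, counting distinct in-range years is a safer variant. A minimal sketch on made-up data (the toy data frame and the names n_years and keepers are hypothetical):

```r
# Toy data (hypothetical): D has five rows but 2011 twice and no 2015
toy <- data.frame(Name = c(rep("A", 5), "B", rep("D", 5)),
                  year = c(2011:2015, 2011L, 2011L, 2011L, 2012L, 2013L, 2014L))

# Row counts alone would keep D as well as A
table(toy$Name)

# Number of distinct years within 2011:2015 for each name
n_years <- tapply(toy$year, toy$Name,
                  function(y) length(unique(y[y %in% 2011:2015])))

keepers <- names(n_years)[n_years == 5]
toy[toy$Name %in% keepers, ]
```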
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
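# One column per Name, holding that subject's years (NA elsewhere)
df2 <- spread(data = df, key = Name, value = year)
# 2011 + ... + 2015 = 10065, so for this data a column sum above 10000 means all five years are present
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
# Keep only the columns of complete subjects, then drop the rows of everyone else
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)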
All steps can of course be constructed as one single pipe with the %>% operator.

Inserting rows into a table

I have this table (visit_ts) -
Year Month Number_of_visits
2011 4 1
2011 6 3
2011 7 23
2011 12 32
2012 1 123
2012 11 3200
The aim is to insert rows with Number_of_visits as 0, for months which are missing in the table.
Do not insert rows for 2011 where month is 1,2,3 or 2012 where month is 12.
The following code works correctly -
vec_month=c(1,2,3,4,5,6,7,8,9,10,11,12)
vec_year=c(2011,2012,2013,2014,2015,2016)
i=1
startyear=head(visit_ts$Year,n=1)
endyear=tail(visit_ts$Year,n=1)
x=head(visit_ts$Month,n=1)
y=tail(visit_ts$Month,n=1)
for (year in vec_year)
{
  if(year %in% visit_ts$Year)
  {
    a=subset(visit_ts,visit_ts$Year==year)
    index= which(!vec_month %in% a$Month)
    for (j in index)
    {
      if((year==startyear & j>x )|(year==endyear & j<y))
        visit_ts=rbind(visit_ts,c(year,j,0))
      else
      {
        if(year!=startyear & year!=endyear)
          visit_ts=rbind(visit_ts,c(year,j,0))
      }
    }
  }
  else
  {
    i=i+1
  }
}
As I am new to R I am looking for an alternative/better solution to the problem which would not involve hard-coding the year and month vectors. Also please feel free to point out best programming practices.
We can use expand.grid with merge or left_join
library(dplyr)
expand.grid(Year = min(df1$Year):max(df1$Year), Month = 1:12) %>%
filter(!(Year == min(df1$Year) & Month %in% 1:3|
Year == max(df1$Year) & Month == 12)) %>%
left_join(., df1) %>%
mutate(Number_of_visits=replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
#1 2012 1 123
#2 2012 2 0
#3 2012 3 0
#4 2011 4 1
#5 2012 4 0
#6 2011 5 0
#7 2012 5 0
#8 2011 6 3
#9 2012 6 0
#10 2011 7 23
#11 2012 7 0
#12 2011 8 0
#13 2012 8 0
#14 2011 9 0
#15 2012 9 0
#16 2011 10 0
#17 2012 10 0
#18 2011 11 0
#19 2012 11 3200
#20 2011 12 32
We can make it more dynamic by grouping by 'Year', get the sequence of 'Month' from minimum to maximum in a list, unnest the column, join with the original dataset (left_join) and replace the NA values with 0.
library(tidyr)
df1 %>%
group_by(Year) %>%
summarise(Month = list(min(Month):max(Month))) %>%
unnest(Month) %>%
left_join(., df1) %>%
mutate(Number_of_visits=replace(Number_of_visits, is.na(Number_of_visits), 0))
# Year Month Number_of_visits
# <int> <int> <dbl>
#1 2011 4 1
#2 2011 5 0
#3 2011 6 3
#4 2011 7 23
#5 2011 8 0
#6 2011 9 0
#7 2011 10 0
#8 2011 11 0
#9 2011 12 32
#10 2012 1 123
#11 2012 2 0
#12 2012 3 0
#13 2012 4 0
#14 2012 5 0
#15 2012 6 0
#16 2012 7 0
#17 2012 8 0
#18 2012 9 0
#19 2012 10 0
#20 2012 11 3200
Another option is data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1); then, grouped by 'Year', build the sequence from the minimum to the maximum 'Month', join it with the original dataset on 'Year' and 'Month', and replace the NA values with 0.
library(data.table)
setDT(df1)
df1[df1[, .(Month=min(Month):max(Month)), Year],
on = c("Year", "Month")][is.na(Number_of_visits), Number_of_visits := 0][]
# Year Month Number_of_visits
# 1: 2011 4 1
# 2: 2011 5 0
# 3: 2011 6 3
# 4: 2011 7 23
# 5: 2011 8 0
# 6: 2011 9 0
# 7: 2011 10 0
# 8: 2011 11 0
# 9: 2011 12 32
#10: 2012 1 123
#11: 2012 2 0
#12: 2012 3 0
#13: 2012 4 0
#14: 2012 5 0
#15: 2012 6 0
#16: 2012 7 0
#17: 2012 8 0
#18: 2012 9 0
#19: 2012 10 0
#20: 2012 11 3200
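
The per-year min(Month):max(Month) sequences give the desired result here because 2011 happens to end in December and 2012 to start in January. A variant that does not rely on that is to number the months consecutively, fill the whole range between the first and last observed month, and merge. A base R sketch, assuming the table from the question (the names idx, full, grid and out are hypothetical):

```r
visit_ts <- data.frame(Year  = c(2011, 2011, 2011, 2011, 2012, 2012),
                       Month = c(4, 6, 7, 12, 1, 11),
                       Number_of_visits = c(1, 3, 23, 32, 123, 3200))

# Consecutive month number: one step per calendar month across year boundaries
idx  <- visit_ts$Year * 12 + (visit_ts$Month - 1)
full <- seq(min(idx), max(idx))

# Translate the month numbers back to Year / Month
grid <- data.frame(Year = full %/% 12, Month = full %% 12 + 1)

# Left join onto the full grid and fill the gaps with 0
out <- merge(grid, visit_ts, by = c("Year", "Month"), all.x = TRUE)
out$Number_of_visits[is.na(out$Number_of_visits)] <- 0
out
```

Nothing before April 2011 or after November 2012 is added, and no year or month vector is hard-coded.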
