Remove duplicate year rows by groups [duplicate] - r

I have a data.table of the following form:
data <- data.table(group = rep(1:3, each = 4),
year = c(2011:2014, rep(2011:2012, each = 2),
2012, 2012, 2013, 2014), value = 1:12)
This is only an excerpt of my data.
Group 2 has two values each for 2011 and 2012, and group 3 has two values for 2012. I want to keep only the first row for each duplicated year within a group.
In effect, my data.table should become the following:
data <- data.table(group = c(rep(1, 4), rep(2, 2), rep(3, 3)),
year = c(2011:2014, 2011, 2012, 2012, 2013, 2014),
value = c(1:5, 7, 9, 11, 12))
How can I achieve this? Thanks in advance.

Try this data.table option with duplicated
> data[!duplicated(cbind(group, year))]
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12

For data.tables, you can pass the by argument to unique:
library(data.table)
unique(data, by = c('group', 'year'))
# group year value
#1: 1 2011 1
#2: 1 2012 2
#3: 1 2013 3
#4: 1 2014 4
#5: 2 2011 5
#6: 2 2012 7
#7: 3 2012 9
#8: 3 2013 11
#9: 3 2014 12

Using base R
subset(data, !duplicated(cbind(group, year)))
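The same duplicated idea can be sketched on a plain data.frame, with no packages loaded. This is only an illustration: the name data_df is hypothetical, and the data is rebuilt from the question so that the block runs on its own.

```r
# Plain data.frame copy of the example data (data_df is a hypothetical name)
data_df <- data.frame(
  group = rep(1:3, each = 4),
  year  = c(2011:2014, rep(2011:2012, each = 2), 2012, 2012, 2013, 2014),
  value = 1:12
)

# duplicated() on a data.frame compares whole rows, so restricting it to the
# key columns marks every repeat of a (group, year) pair after the first one
is_dup <- duplicated(data_df[, c("group", "year")])

# Keep only the rows that are not marked as repeats
dedup <- data_df[!is_dup, ]
dedup
```

Because duplicated flags only the later occurrences, the first row of each (group, year) pair is the one that survives, which is what the question asks for.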

One solution would be to use distinct from dplyr like so:
library(dplyr)
data %>%
distinct(group, year, .keep_all = TRUE)
Output:
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12

This should do the trick:
library(tidyverse)
data %>%
group_by(group, year) %>%
filter(row_number() == 1) # keep the first row of each (group, year) pair

Related

In R: How can I check that I have consecutive years of data (to later be able to calculate growth)?

I have the dataframe (sample) below:
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
I have used a for loop in order to try and create a sequence column that starts a new number for each new sequence of numbers. I am new to R so my definitions may be a bit wrong. My for loop looks like this:
size1 <- c(1:3)
s <- 0
for (val1 in size) {
m <- max(sample[sample$companyID == val1, 4])
size2 <- c(1:m)
for (val2 in size2){
row <- sample[which(sample$companyID == val1 & sample$yearID == val2)]
m1 <- sample[sample$companyID == val1 & sample$yearID == val2, 2]
m2 <- sample[sample$CompanyID == val1 & sample$yearID == (val2-1), 2]
if(val2>1 && m1-m2 > 1) {
sample$sequence[row] s = s+1}
else {s = s}
}
}
Where m is the max value of the yearID per companyID, row is to identify that the value should be entered on the row where companyID = val1 and yearID = val2, m1 is from the year variable and is the latter year, whereas m2 is the former year. What I have tried to do is to change the sequence every time m1-m2 > 1 (when val2 > 1 also).
Desired outcome:
companyID year yearID sequence
1 2010 1 1
1 2011 2 1
1 2012 3 1
1 2013 4 1
2 2010 1 2
2 2011 2 2
2 2016 3 3
2 2017 4 3
2 2018 5 3
3 2010 1 4
3 2011 2 4
3 2014 3 5
3 2017 4 6
3 2018 5 6
Super appreciative if anyone can help!!
This is a good question!
First, group_by companyID.
Calculate the difference between consecutive rows of the year column with lag, to identify whether the years are consecutive.
Then group_by companyID and yearID.
Use mutate to add a helper column sequence1 that marks, with a 1, the first row of each run of consecutive years.
ungroup and increase the sequence number each time a 1 occurs in sequence1.
Finally, remove the helper columns sequence1 and deltaLag1.
library(tidyverse)
df1 <- df %>%
group_by(companyID) %>%
mutate(deltaLag1 = year - lag(year, 1)) %>%
group_by(companyID, yearID) %>%
mutate(sequence1 = case_when(is.na(deltaLag1) | deltaLag1 > 1 ~ 1,
TRUE ~ 2)) %>%
ungroup() %>%
mutate(sequence = cumsum(sequence1==1)) %>%
select(-deltaLag1, -sequence1)
data
df <- tribble(
~companyID, ~year, ~yearID,
1, 2010, 1,
1, 2011, 2,
1, 2012, 3,
1, 2013, 4,
2, 2010, 1,
2, 2011, 2,
2, 2016, 3,
2, 2017, 4,
2, 2018, 5,
3, 2010, 1,
3, 2011, 2,
3, 2014, 3,
3, 2017, 4,
3, 2018, 5)
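The same logic (flag the start of each run, then take a cumulative sum of the flags) can also be sketched in base R. This is a minimal sketch that assumes the rows are already sorted by companyID and then by year; the helper names gap and new_run are made up for illustration.

```r
# Example data from the question, rebuilt as a plain data.frame
df <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2016, 2017, 2018,
           2010, 2011, 2014, 2017, 2018)
)

# Gap to the previous year within each company; NA on a company's first row
gap <- ave(df$year, df$companyID, FUN = function(y) c(NA, diff(y)))

# A new run starts on a company's first row or wherever the gap exceeds one year
new_run <- is.na(gap) | gap > 1

# Counting the run starts seen so far gives the run number for every row
df$sequence <- cumsum(new_run)
df
```

On the sample data this reproduces the sequence column from the desired outcome in the question.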
It's not clear whether you want the exact desired outcome or only want to check that you have consecutive years by companyID.
Going by the title of your question:
sample <- read.table(header = TRUE, text = "
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
")
library(data.table)
sample <- setDT(sample)
sample[ , diff_year := year - shift(year), by = companyID]
sample <- setDF(sample)
sample
#> companyID year yearID diff_year
#> 1 1 2010 1 NA
#> 2 1 2011 2 1
#> 3 1 2012 3 1
#> 4 1 2013 4 1
#> 5 2 2010 1 NA
#> 6 2 2011 2 1
#> 7 2 2016 3 5
#> 8 2 2017 4 1
#> 9 2 2018 5 1
#> 10 3 2010 1 NA
#> 11 3 2011 2 1
#> 12 3 2014 3 3
#> 13 3 2017 4 3
#> 14 3 2018 5 1
# Created on 2021-03-13 by the reprex package (v1.0.0.9002)
Related to Calculate difference between values in consecutive rows by group
Regards,
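If a single yes/no answer per company is enough, the year differences can be reduced with all. The following base R sketch assumes the years are sorted within each company; sample_df and is_consecutive are hypothetical names, and the data is copied from the question.

```r
# Example data from the question (sample_df is a hypothetical name)
sample_df <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2016, 2017, 2018,
           2010, 2011, 2014, 2017, 2018)
)

# For each company, TRUE only if every step from one year to the next is exactly 1
is_consecutive <- tapply(sample_df$year, sample_df$companyID,
                         function(y) all(diff(y) == 1))
is_consecutive
```

Only company 1 has an unbroken series here; companies 2 and 3 each contain at least one gap.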

Create incremental column year based on id and year column in R

I have the dataframe below and I want to create the 'create_col' column, presumably with some kind of seq() function, using the 'year' column as the start of the sequence. How could I do that?
id <- c(1,1,2,3,3,3,4)
year <- c(2013, 2013, 2015,2017,2017,2017,2011)
create_col <- c(2013,2014,2015,2017,2018,2019,2011)
Ideal result:
id year create_col
1 1 2013 2013
2 1 2013 2014
3 2 2015 2015
4 3 2017 2017
5 3 2017 2018
6 3 2017 2019
7 4 2011 2011
You can add row_number() to the minimum year in each id:
library(dplyr)
df %>%
group_by(id) %>%
mutate(create_col = min(year) + row_number() - 1)
# id year create_col
# <dbl> <dbl> <dbl>
#1 1 2013 2013
#2 1 2013 2014
#3 2 2015 2015
#4 3 2017 2017
#5 3 2017 2018
#6 3 2017 2019
#7 4 2011 2011
data
df <- data.frame(id, year)
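For comparison, the same "minimum year plus position within the group" computation can be sketched in base R with ave, where seq_along plays the role of row_number(). This is an illustrative alternative, not part of the dplyr answer.

```r
# Example data from the question
id <- c(1, 1, 2, 3, 3, 3, 4)
year <- c(2013, 2013, 2015, 2017, 2017, 2017, 2011)
df <- data.frame(id, year)

# Within each id: start at the smallest year and count upwards by row position
df$create_col <- ave(df$year, df$id,
                     FUN = function(y) min(y) + seq_along(y) - 1)
df
```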

Combine data in many rows into a column

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1. If every year has the same number of data points, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If the years do not all have the same number of data points, you could use rbind.fill.matrix from plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
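The unequal-length case can also be handled without extra packages by padding each year's values with NA before binding them together. The sketch below relies on the fact that assigning a longer length to a vector pads it with NA; by_year, n_max and wide are hypothetical names, and the data includes the extra 2015 row used above.

```r
# Example data, including the single extra observation for 2015
df <- data.frame(
  year = c(rep(2011:2013, each = 3), 2015),
  Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3, 5)
)

# One vector of values per year
by_year <- split(df$Male, df$year)

# Length of the longest group; shorter groups get padded up to this
n_max <- max(lengths(by_year))

# Extending a vector's length fills the new positions with NA,
# so sapply() can bind the equal-length results into a matrix
wide <- sapply(by_year, function(v) {
  length(v) <- n_max
  v
})
wide
```

The result is a matrix with one column per year; wrap it in as.data.frame() if a data frame is needed.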

How to conditionally update a table in R?

My table looks like this:
# Year Month WaterYear
# 1993 3
# 2000 4
# 2013 10
# 2015 6
# 2000 7
# 2008 12
# 2008 9
# 2012 10
# 2000 11
# 2000 12
I am trying to update this table so that WaterYear equals Year + 1 for rows where the month falls between October and December.
I am working in R and hoping to find the easiest way to make this work.
A simple ifelse will do the trick.
Using your data:
# Create data
Year <- c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000)
Month <- c(3, 4, 10, 6, 7, 12, 9 ,10, 11, 12)
WaterYear <- rep("",length(Year))
dat <- data.frame(Year, Month, WaterYear)
# If Month is greater than or equal to 10, set WaterYear to Year + 1;
# otherwise leave it blank
dat$WaterYear <- ifelse(dat$Month >= 10, dat$Year + 1, "")
Results in
Year Month WaterYear
1993 3
2000 4
2013 10 2014
2015 6
2000 7
2008 12 2009
2008 9
2012 10 2013
2000 11 2001
2000 12 2001
We can also do
i1 <- dat$Month >=10
dat$WaterYear[i1] <- dat$Year[i1] + 1
dat
# Year Month WaterYear
#1 1993 3
#2 2000 4
#3 2013 10 2014
#4 2015 6
#5 2000 7
#6 2008 12 2009
#7 2008 9
#8 2012 10 2013
#9 2000 11 2001
#10 2000 12 2001
Or using data.table, convert the 'data.frame' to 'data.table' (setDT(dat)), specify the logical condition in 'i' (Month >= 10), and assign (:=) the 'Year' + 1 to 'WaterYear'
library(data.table)
setDT(dat)[Month >=10, WaterYear := as.character(Year + 1)]
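If a numeric WaterYear for every row is acceptable, the condition itself can be used in the arithmetic, since a logical value is coerced to 0 or 1. Note the assumption: the question leaves WaterYear blank outside October to December, whereas this sketch fills those rows with the unchanged Year. The name dat2 is hypothetical.

```r
# Example data from the question (dat2 is a hypothetical name)
Year <- c(1993, 2000, 2013, 2015, 2000, 2008, 2008, 2012, 2000, 2000)
Month <- c(3, 4, 10, 6, 7, 12, 9, 10, 11, 12)
dat2 <- data.frame(Year, Month)

# (Month >= 10) is TRUE/FALSE, which becomes 1/0 when added to Year,
# so only October to December rows are shifted to the next year
dat2$WaterYear <- dat2$Year + (dat2$Month >= 10)
dat2
```

This keeps the column numeric, which may be convenient if WaterYear is later used for grouping or arithmetic.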

Merge 2 resulting vectors into 1 data frame using R

I have a df like this
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df <- data.frame(Month ,Category,Year,Number_Combinations)
df
Month Category Year Number_Combinations
1 JAN A 2014 3
2 JAN A 2015 2
3 JAN B 2015 3
4 JAN C 2015 4
5 FEB A 2014 1
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
9 MAY E 2015 1
10 MAY F 2013 1
I have another df that I got from the above dataframe with a condition
df1 <- subset(df,Number_Combinations > 2)
df1
Month Category Year Number_Combinations
1 JAN A 2014 3
3 JAN B 2015 3
4 JAN C 2015 4
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
Now I want to create a table reporting the month, the total number of rows for that month in df, and the total number of rows for that month in df1.
Desired Output would be
Month Number_Month_df Number_Month_df1
1 JAN 4 3
2 FEB 2 1
3 MAR 1 1
4 APR 1 1
5 MAY 2 0
I used table(df) and table(df1) and tried merging the results, but I am not getting the desired output. Could someone please help me get the above dataframe?
We get the table of the 'Month' column from both 'df' and 'df1', convert to 'data.frame' (as.data.frame), merge by the 'Var1', and change the column names accordingly.
res <- merge(as.data.frame(table(df$Month)),
as.data.frame(table(df1$Month)), by='Var1')
colnames(res) <- c('Month', 'Number_Month_df', 'Number_Month_df1')
res <- data.frame(Number_Month_df=sort(table(df$Month),T),
Number_Month_df1=sort(table(df1$Month),T))
res$Month <- rownames(res)
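One caveat: since R 4.0.0, data.frame no longer turns strings into factors by default, so table(df1$Month) has no entry for a month that is absent from df1 (MAY here), and a default merge would drop that row. A minimal sketch that sidesteps this is to count both columns against the same explicit set of levels; months and res2 are hypothetical names.

```r
# Example data from the question
Month <- c('JAN', 'JAN', 'JAN', 'JAN', 'FEB', 'FEB', 'MAR', 'APR', 'MAY', 'MAY')
Category <- c('A', 'A', 'B', 'C', 'A', 'E', 'B', 'D', 'E', 'F')
Year <- c(2014, 2015, 2015, 2015, 2014, 2013, 2015, 2014, 2015, 2013)
Number_Combinations <- c(3, 2, 3, 4, 1, 3, 6, 5, 1, 1)
df <- data.frame(Month, Category, Year, Number_Combinations)
df1 <- subset(df, Number_Combinations > 2)

# Months in order of first appearance; used as a shared set of factor levels
months <- unique(as.character(df$Month))

# Counting against the same levels keeps months with zero rows in df1
res2 <- data.frame(
  Month = months,
  Number_Month_df  = as.vector(table(factor(df$Month, levels = months))),
  Number_Month_df1 = as.vector(table(factor(df1$Month, levels = months)))
)
res2
```

Using the order of first appearance as the levels also keeps the months in the order shown in the desired output, instead of sorting them alphabetically.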
