Retaining unique values per individual id in a dataframe in R

A very basic question! I searched a lot and tried to work it out myself, but eventually had to come here. :)
Here is a sample dataframe:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.90,2.90,2.90,2.90,2.21,2.21,2.21,2.21))
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 2.75
3 1 3 2015 2.75
4 1 4 2015 2.75
5 2 1 2015 2.90
6 2 2 2015 2.90
7 2 3 2015 2.90
8 2 4 2015 2.90
9 3 1 2015 2.21
10 3 2 2015 2.21
11 3 3 2015 2.21
12 3 4 2015 2.21
I need a unique value per id, so I use this:
df$value[duplicated(df$value)]<-NA
And I get what I need.
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
Now let's say that I have a new dataframe in which different ids share the same value:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
If I use the same code, I will end up with data missing for ID 2 as well.
How could I retain unique values for every ID per year?
Any help is much appreciated.

Here is a base R solution using ave + duplicated, which looks for repeated values only within each id/year group:
df <- within(df,value <- ave(value,
id,
year,
FUN = function(v) ifelse(duplicated(v),NA,v)))
which, applied to the first data frame, gives
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
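
The output above is for the first sample data frame. As a quick check, the same ave call can be run on the second data frame from the question (the one where ids 1 and 2 share the value 2.75). This is only a sketch in base R; it spells out df$ instead of using within():

```r
df <- data.frame(
  id      = c(1,1,1,1,2,2,2,2,3,3,3,3),
  quarter = c(1,2,3,4,1,2,3,4,1,2,3,4),
  year    = c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
  value   = c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21)
)

# Within each id/year group, keep the first occurrence of a value
# and replace any repeats with NA
df$value <- ave(df$value, df$id, df$year,
                FUN = function(v) ifelse(duplicated(v), NA, v))
print(df)
```

ID 2 keeps its first 2.75 here because duplicates are only looked for inside each id/year group.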

Calling duplicated on cbind(df$id, df$year) instead of on value should give you the desired result:
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Using this solution on your second data.frame, the one that lost the values for ID 2:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Returns:
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2016 2.75
6 2 2 2016 NA
7 2 3 2016 NA
8 2 4 2016 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
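
The two answers agree on the question's data, but they are not the same rule. Flagging duplicates on the id/year key keeps only the first row of each group, whatever its value; the ave approach keeps the first row of each distinct value within the group. A small sketch with made-up data (the 3.10 values are hypothetical, not from the question) shows the difference:

```r
# Hypothetical data: one id/year group holding two distinct values
df <- data.frame(id    = c(1, 1, 1, 1),
                 year  = c(2015, 2015, 2015, 2015),
                 value = c(2.75, 2.75, 3.10, 3.10))

# Rule 1: blank every row after the first per id/year key
by_key <- df$value
by_key[duplicated(df[c("id", "year")])] <- NA

# Rule 2: blank only repeated values within each id/year group
by_value <- ave(df$value, df$id, df$year,
                FUN = function(v) ifelse(duplicated(v), NA, v))

print(data.frame(df, by_key, by_value))
```

Which rule is right depends on whether a value can legitimately change within an id/year group.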

Related

Fill observation from latest year in grouped data using data.table

For each id, I am trying to fill the value in the code column corresponding to the latest year using data.table.
Data:
df <- data.frame(
id=c(1,1,1,2,2,2,3,3,3),
year=c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016),
code=c(1,2,2, 1,2,3, 3,4,5)
)
> df
id year code
1 1 2014 1
2 1 2015 2
3 1 2016 2
4 2 2015 1
5 2 2015 2
6 2 2016 3
7 3 NA 3
8 3 NA 4
9 3 2016 5
In dplyr:
df %>% group_by(id) %>%
mutate(code2=last(na.omit(code[order(year, na.last=F)])))
# A tibble: 9 x 4
# Groups: id [3]
id year code code2
<dbl> <dbl> <dbl> <dbl>
1 1 2014 1 2
2 1 2015 2 2
3 1 2016 2 2
4 2 2015 1 3
5 2 2015 2 3
6 2 2016 3 3
7 3 NA 3 5
8 3 NA 4 5
9 3 2016 5 5
Attempt in data.table:
df %>%
as.data.table() %>%
.[,code2:=last(na.omit(code[order(year, na.last=F)]), by=id)] %>%
as.data.table()
Try data.table as below; by goes in its own argument after the := expression, and which.max ignores the NA years:
> setDT(df)[,code2:=code[which.max(year)],id][]
id year code code2
1: 1 2014 1 2
2: 1 2015 2 2
3: 1 2016 2 2
4: 2 2015 1 3
5: 2 2015 2 3
6: 2 2016 3 3
7: 3 NA 3 5
8: 3 NA 4 5
9: 3 2016 5 5

how to obtain an element/column even when it's NA with tapply in R

I have a dataset like this:
df <- data.frame("y"=c(2010,2011,2012,2013,2010,2012,2010,2011,2012),"x"=c(1,2,1,1,2,2,4,4,4),"a"=c(5,3,0,2,3,0,2,3,0))
y x a
1 2010 1 5
2 2011 2 3
3 2012 1 0
4 2013 1 2
5 2010 2 3
6 2012 2 0
7 2010 4 2
8 2011 4 3
9 2012 4 0
And I want to sum 'a' for each 'y' and 'x', using:
sum <- tapply(df$a,list(df$y,df$x),sum)
That is:
1 2 4
2010 5 3 2
2011 NA 3 3
2012 0 0 0
2013 2 NA NA
How can I also obtain the '3' column, even though the value 3 never occurs in column x of df?
Something like this:
1 2 3 4
2010 5 3 NA 2
2011 NA 3 NA 3
2012 0 0 NA 0
2013 2 NA NA NA
Convert the x column to a factor whose levels include every value between the min and max of x, so that tapply creates a column for each level:
df$x <- factor(df$x, levels = seq(min(df$x), max(df$x)))
tapply(df$a,list(df$y,df$x),sum)
# 1 2 3 4
#2010 5 3 NA 2
#2011 NA 3 NA 3
#2012 0 0 NA 0
#2013 2 NA NA NA
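
seq(min, max) assumes x holds whole numbers and that every value in that range is wanted. A variant sketch, assuming the full set of categories is known in advance (x_levels is a hypothetical name), passes the factor straight to tapply and leaves df$x untouched:

```r
df <- data.frame(y = c(2010, 2011, 2012, 2013, 2010, 2012, 2010, 2011, 2012),
                 x = c(1, 2, 1, 1, 2, 2, 4, 4, 4),
                 a = c(5, 3, 0, 2, 3, 0, 2, 3, 0))

x_levels <- 1:4  # assumed: the categories that should appear as columns

# Unused factor levels still get a column; empty cells are NA
res <- tapply(df$a, list(df$y, factor(df$x, levels = x_levels)), sum)
print(res)
```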

How to get all my values within the same category to be equal in my dataframe?

So, I have a dataset that looks just like that :
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2
But I do not want to have NAs in the cat column. Instead, I want every line within the same site to get the same value of cat.
Just like this :
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA 1
3 10 2015 2.0 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0.0 2
8 11 2010 NA 2
9 11 2009 1.0 2
Any idea on how I can do that?
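Use na.aggregate to fill in the NA values, with ave applying it by site. By default na.aggregate replaces each NA with the group mean, which here equals the single cat value of the site.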
library(zoo)
transform(DF, cat = ave(cat, site, FUN = na.aggregate))
giving:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
Note
The input used, in reproducible form, is:
Lines <- "
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2"
DF <- read.table(text = Lines)
A complete base R alternative:
transform(DF, cat = ave(cat, site, FUN = function(x) x[!is.na(x)][1]))
which gives:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
The same logic implemented with dplyr:
library(dplyr)
DF %>%
group_by(site) %>%
mutate(cat = na.omit(cat)[1])
Or with na.locf of the zoo-package:
library(zoo)
transform(DF, cat = ave(cat, site, FUN = function(x) na.locf(na.locf(x, fromLast = TRUE, na.rm = FALSE))))
Or with fill from tidyr:
library(tidyr)
library(dplyr)
DF %>%
group_by(site) %>%
fill(cat) %>%
fill(cat, .direction = "up")
NOTE: I wonder what the added value of the cat column is when cat has to be the same for each site. You'll end up with two grouping variables that do exactly the same thing, which makes one of them redundant in my opinion.
You can also use tidyr::fill
library(dplyr)
library(tidyr)
DF %>%
group_by(site) %>%
fill(cat,.direction = "up") %>%
fill(cat,.direction = "down") %>%
ungroup
# # A tibble: 9 x 4
# site year territories cat
# <int> <int> <dbl> <int>
# 1 10 2017 0 1
# 2 10 2016 NA 1
# 3 10 2015 2 1
# 4 10 2014 NA 1
# 5 10 2013 NA 1
# 6 11 2012 NA 2
# 7 11 2011 0 2
# 8 11 2010 NA 2
# 9 11 2009 1 2
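
All of the answers above rely on each site having exactly one distinct non-NA cat value. A base R sketch that checks this assumption before filling (n_cats is a hypothetical name; the data are the question's):

```r
Lines <- "site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2"
DF <- read.table(text = Lines, header = TRUE)

# Count distinct non-NA cat values per site; stop if any site is ambiguous
n_cats <- tapply(DF$cat, DF$site, function(x) length(unique(x[!is.na(x)])))
stopifnot(all(n_cats == 1))

# Safe to copy the first non-NA cat to every row of the site
DF$cat <- ave(DF$cat, DF$site, FUN = function(x) x[!is.na(x)][1])
print(DF)
```

If the check fails, the fill direction starts to matter and the na.locf and fill variants can give different results from the first-non-NA approach.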

How to "extrapolate" values of panel data in R?

I have panel data with NA values like below:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 NA
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 NA
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
I would like to perform a linear interpolation, so I wrote this code:
library(dplyr)
library(zoo)
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.approx(value, na.rm=FALSE))
then I get the output:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
Here na.approx interpolates the interior NA values successfully but does not extrapolate at the ends.
Is there a good way to replace the values in the 1st and 2nd rows with the first non-NA value of this user (30)? Similarly, how can I replace the values in the 9th and 10th rows with the last non-NA value of this user (50)?
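One way to fill the edge NAs is na.spline() from the same zoo package; note that it extrapolates along the fitted spline rather than repeating the nearest non-NA value: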
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.spline(value))
panel_df
Source: local data frame [10 x 5]
Groups: uid [2]
uid year month day value
<int> <int> <int> <int> <dbl>
1 1 2016 8 1 40
2 1 2016 8 2 35
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 55
10 2 2016 8 5 60
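
If the goal is exactly what the question describes, repeating the first and last non-NA value at the edges rather than extrapolating a trend, base R's approx with rule = 2 does that. A minimal sketch, assuming each uid has at least two non-NA values (fill_edges is a hypothetical helper name):

```r
panel_df <- data.frame(
  uid   = rep(1:2, each = 5),
  year  = 2016,
  month = 8,
  day   = rep(1:5, 2),
  value = c(NA, NA, 30, NA, 20, 40, NA, 50, NA, NA)
)

# Linear interpolation inside the observed range; rule = 2 makes approx()
# return the nearest observed value outside that range
fill_edges <- function(v) {
  idx <- seq_along(v)
  ok  <- !is.na(v)
  approx(idx[ok], v[ok], xout = idx, rule = 2)$y
}

panel_df$value <- ave(panel_df$value, panel_df$uid, FUN = fill_edges)
print(panel_df)
```

zoo's na.approx forwards extra arguments to approx, so na.approx(value, na.rm = FALSE, rule = 2) inside the grouped mutate should behave the same way.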

R paired column index

Say I have two data frames, A and B:
mth <- c(rep(1:5,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
year <- c(2008:2012)
mth <- c(1:5)
B <- data.frame(cbind(year,mth))
What I want should look like:
mth <- c(rep(2008:2012,2))
day <- c(rep(10,5),rep(11,5))
hr <- c(3,4,5,6,7,3,4,5,6,7)
v <- c(3,4,5,4,3,3,4,5,4,3)
A <- data.frame(cbind(mth,day,hr,v))
Basically, I need to replace the mth column in A with the matching year from B. Maybe I didn't search for the right keyword, but I was not able to get what I want (I tried which()). Please help, thank you.
A2 <- merge(A,B, by = "mth")[ , -1]
names(A2)[(which(names(A2)=="year"))] <- "mth"
> A2
day hr v mth
1 10 3 3 2008
2 11 3 3 2008
3 11 4 4 2009
4 10 4 4 2009
5 11 5 5 2010
6 10 5 5 2010
7 11 6 4 2011
8 10 6 4 2011
9 10 7 3 2012
10 11 7 3 2012
Probably the easiest solution is to use merge, which is equivalent to a sql join in a lot of ways:
merge(A, B)
#-----
mth day hr v year
1 1 10 3 3 2008
2 1 11 3 3 2008
3 2 11 4 4 2009
4 2 10 4 4 2009
5 3 11 5 5 2010
6 3 10 5 5 2010
7 4 11 6 4 2011
8 4 10 6 4 2011
9 5 10 7 3 2012
10 5 11 7 3 2012
You could also probably use match like this to replace mth in place:
A$mth <- B[match(A$mth, B$mth),1]
#-----
mth day hr v
1 2008 10 3 3
2 2009 10 4 4
3 2010 10 5 5
4 2011 10 6 4
5 2012 10 7 3
6 2008 11 3 3
7 2009 11 4 4
8 2010 11 5 5
9 2011 11 6 4
10 2012 11 7 3
While a little dense, that code looks up each A$mth value in B$mth with match, uses the resulting row positions to index B, and takes the first column (year).
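
The same lookup can be written with a named vector, which some find easier to read than match. A sketch under the assumption that every mth in A has exactly one row in B (lookup is a hypothetical name):

```r
A <- data.frame(mth = rep(1:5, 2),
                day = c(rep(10, 5), rep(11, 5)),
                hr  = c(3, 4, 5, 6, 7, 3, 4, 5, 6, 7),
                v   = c(3, 4, 5, 4, 3, 3, 4, 5, 4, 3))
B <- data.frame(year = 2008:2012, mth = 1:5)

# Named vector: names are the month codes, values are the years
lookup <- setNames(B$year, B$mth)

# Index by name, then drop the names from the result
A$mth <- unname(lookup[as.character(A$mth)])
print(A)
```

Like match, and unlike merge, this keeps the rows of A in their original order.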
