Selecting distinct rows in dplyr [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 4 years ago.
dat <- data.frame(loc.id = rep(1:2, each = 3),
year = rep(1981:1983, times = 2),
prod = c(200,300,400,150,450,350),
yld = c(1200,1250,1200,3000,3200,3200))
If I want to select for each loc.id distinct values of yld, I do this:
dat %>% group_by(loc.id) %>% distinct(yld)
loc.id yld
<int> <dbl>
1 1200
1 1250
2 3000
2 3200
However, what I want is this: within each loc.id, if several years share the same yld, keep only the row with the lowest prod value. The final dataframe should also include the prod and year columns, like this:
loc.id year prod yld
1 1981 200 1200
1 1982 300 1250
2 1981 150 3000
2 1983 350 3200

We can arrange by 'prod' and then, after grouping by 'loc.id' and 'yld', slice the first observation of each group:
dat %>%
arrange(loc.id, prod) %>%
group_by(loc.id, yld) %>%
slice(1)
# A tibble: 4 x 4
# Groups: loc.id, yld [4]
# loc.id year prod yld
# <int> <int> <dbl> <dbl>
#1 1 1981 200 1200
#2 1 1982 300 1250
#3 2 1981 150 3000
#4 2 1983 350 3200
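The same selection can probably be written without a separate sort. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where slice_min() is available); the name `res` is only illustrative.

```r
library(dplyr)

dat <- data.frame(loc.id = rep(1:2, each = 3),
                  year = rep(1981:1983, times = 2),
                  prod = c(200, 300, 400, 150, 450, 350),
                  yld = c(1200, 1250, 1200, 3000, 3200, 3200))

# For each (loc.id, yld) pair keep the single row with the smallest prod.
# with_ties = FALSE keeps exactly one row per group even if prod is tied.
res <- dat %>%
  group_by(loc.id, yld) %>%
  slice_min(prod, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(loc.id, year)

res
```

With with_ties = FALSE a tie on prod is broken by row order, so this matches the arrange-then-slice approach only when that tie-breaking is acceptable.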

Related

Finding a row from a dataframe

I am trying to write a function get_value <- function(gyear, gmonth) that selects a row from an existing dataframe. If the requested month falls between two recorded time points, it should return the most recent earlier row. The dataframe is e.g.
df <- read.table(header = TRUE, text = "
year month var1 var2
2022 1 123 987
2021 4 234 876
2021 1 345 765
2020 7 456 654
2020 3 567 543
2020 1 678 432
2019 1 789 321")
For example in year 2021 the same row
year month var1 var2
2021 1 345 765
is valid for months 1,2,3 and then comes a change and the next row
year month var1 var2
2021 4 234 876
is valid for months 4,5,6,7,8,9,10,11,12.
If the year & month are already in the dataframe, then I can have the row like
library(dplyr)
get_value <- function(gyear, gmonth){
  # return the matching row directly instead of assigning it to a temporary
  df %>% filter(year == gyear & month == gmonth)
}
get_value(gyear = 2020, gmonth = 1)
but what I want is also to be able to have rows (months) that are between the months that are included in the dataframe. For example I would like to be able to call
get_value(gyear = 2021, gmonth = 5)
that returns row
year month var1 var2
2021 4 234 876
because in year 2021 the month is between 4-12.
Thanks in advance for your help!
You could first create a new column month2 to indicate the ending months.
library(dplyr)
df2 <- df %>%
group_by(year) %>%
arrange(month, .by_group = TRUE) %>%
mutate(month2 = lead(month-1, default = 12), .after = month) %>%
ungroup()
# # A tibble: 7 × 5
# year month month2 var1 var2
# <int> <int> <dbl> <int> <int>
# 1 2019 1 12 789 321
# 2 2020 1 2 678 432
# 3 2020 3 6 567 543
# 4 2020 7 12 456 654
# 5 2021 1 3 345 765
# 6 2021 4 12 234 876
# 7 2022 1 12 123 987
Then write a custom filter function that takes a dataframe as input and extracts the rows where gmonth >= month & gmonth <= month2.
get_value <- function(data, gyear, gmonth){
data %>%
filter(year == gyear & gmonth >= month & gmonth <= month2)
}
I don't recommend putting the code that derives month2 from df inside this function. If you do, the data will be transformed again every time get_value() is called, which is wasteful.
Output
get_value(df2, gyear = 2020, gmonth = 1)
# # A tibble: 1 × 5
# year month month2 var1 var2
# <int> <int> <dbl> <int> <int>
# 1 2020 1 2 678 432
get_value(df2, gyear = 2021, gmonth = 5)
# # A tibble: 1 × 5
# year month month2 var1 var2
# <int> <int> <dbl> <int> <int>
# 1 2021 4 12 234 876
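For occasional lookups, the end-month column can arguably be skipped altogether. The sketch below is one possible alternative, not the answer's method: the helper name get_value_latest is hypothetical, and it assumes dplyr 1.0.0 or later for slice_max().

```r
library(dplyr)

df <- read.table(header = TRUE, text = "
year month var1 var2
2022 1 123 987
2021 4 234 876
2021 1 345 765
2020 7 456 654
2020 3 567 543
2020 1 678 432
2019 1 789 321")

# Hypothetical helper: within the requested year, keep rows that start
# on or before gmonth, then take the one with the latest starting month.
get_value_latest <- function(data, gyear, gmonth) {
  data %>%
    filter(year == gyear, month <= gmonth) %>%
    slice_max(month, n = 1, with_ties = FALSE)
}

get_value_latest(df, gyear = 2021, gmonth = 5)
```

This filters on every call, so the precomputed month2 column is likely the better choice when the function is called many times. If the year has no row starting on or before gmonth, the result has zero rows.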

Insert new rows of imputed data into data table by group [duplicate]

This question already has answers here:
Complete dataframe with missing combinations of values
(2 answers)
Interpolate NA values in a data frame with na.approx
(3 answers)
Closed 2 years ago.
I have a data table and I would like to insert new rows, imputing values for the years between two observed years. This will be done over many ID groups. How do I go about replicating the data into the new rows?
library(data.table)
# data table
dt <- data.table(ID=c(rep(1:3,each=3)),
attrib1=rep(c("sdf","gghgf","eww"),each=3),
attrib2=rep(c("444","222","777"),each=3),
Year = rep(c(1990, 1995, 1996), 3),
value = c(12,6,7,6,3,1,9,17,18))
So for all groups (ID), Year would run from 1990 to 1996, and the values for the years between 1990 and 1995 would be imputed linearly. All other attributes would remain the same and be copied into the new rows.
I've done this with a hideously long workaround and attempted a custom function, but to no avail.
You can use tidyr::complete to expand the years and zoo::na.approx for interpolation of the values.
library(dplyr)
dt %>%
group_by(ID, attrib1, attrib2) %>%
tidyr::complete(Year = min(Year):max(Year)) %>%
mutate(value = zoo::na.approx(value))
# ID attrib1 attrib2 Year value
# <int> <chr> <chr> <dbl> <dbl>
# 1 1 sdf 444 1990 12
# 2 1 sdf 444 1991 10.8
# 3 1 sdf 444 1992 9.6
# 4 1 sdf 444 1993 8.4
# 5 1 sdf 444 1994 7.2
# 6 1 sdf 444 1995 6
# 7 1 sdf 444 1996 7
# 8 2 gghgf 222 1990 6
# 9 2 gghgf 222 1991 5.4
#10 2 gghgf 222 1992 4.8
# … with 11 more rows
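Since the question starts from a data.table, the same expansion and interpolation can also be sketched without tidyr or zoo, using stats::approx() inside a grouped data.table call. This is an illustrative alternative rather than the answer's approach; the name `filled` is an assumption.

```r
library(data.table)

dt <- data.table(ID = rep(1:3, each = 3),
                 attrib1 = rep(c("sdf", "gghgf", "eww"), each = 3),
                 attrib2 = rep(c("444", "222", "777"), each = 3),
                 Year = rep(c(1990, 1995, 1996), 3),
                 value = c(12, 6, 7, 6, 3, 1, 9, 17, 18))

# Per group: build the full run of years, then linearly interpolate
# value at those years. The grouping columns are repeated automatically.
filled <- dt[, {
  yrs <- seq(min(Year), max(Year))
  list(Year = yrs, value = approx(Year, value, xout = yrs)$y)
}, by = .(ID, attrib1, attrib2)]

filled
```

Grouping by the attribute columns as well as ID is what copies them into the new rows; this relies on the attributes being constant within each ID, as they are in the example data.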

R: using the mutate function to combine strings

I've got a dataset called data1 with columns Year and Count.
My sample data looks like this:
Year Count
1 2005 3000
2 2006 4000
3 2007 5000
4 2008 6000
I add another column to the data which works out the yearly increase. This is my code:
data1growth <- data1 %>%
mutate(Growth = Count - lag(Count))
I want to be able to add another column called period so that I can get the following output:
Year Count Growth Period
1 2005 3000 NA NA
2 2006 4000 1000 2005-2006
3 2007 5000 1000 2006-2007
4 2008 6000 1000 2007-2008
What code should I add to the mutate function to get the desired output, or am I off the mark completely? Any help is appreciated.
Thanks everyone.
library(dplyr)
data1 %>%
mutate(
Growth = Count - lag(Count),
period = if_else(
row_number() > 1,
paste0(lag(Year), "-", Year),
NA_character_
)
)
# Year Count Growth period
# 1 2005 3000 NA <NA>
# 2 2006 4000 1000 2005-2006
# 3 2007 5000 1000 2006-2007
# 4 2008 6000 1000 2007-2008
Reproducible data
data1 <- data.frame(
Year = seq(2005L, 2008L, 1L),
Count = seq(3000L, 6000L, 1000L)
)
If you want 'Period' to just be a string, you can just use another mutate:
library(tidyverse)
data1 <- tibble(Year = 2005:2008, Count = c(3000, 4000, 5000, 6000))
data1growth <- data1 %>%
mutate(Growth = Count - lag(Count))
# Period as string
data1growth %>%
mutate(Period = paste0(Year-1, "-", Year))
#> # A tibble: 4 x 4
#> Year Count Growth Period
#> <int> <dbl> <dbl> <chr>
#> 1 2005 3000 NA 2004-2005
#> 2 2006 4000 1000 2005-2006
#> 3 2007 5000 1000 2006-2007
#> 4 2008 6000 1000 2007-2008
# Period as string (don't include NA Growth)
data1growth %>%
mutate(Period = ifelse(is.na(Growth), NA, paste0(Year-1, "-", Year)))
#> # A tibble: 4 x 4
#> Year Count Growth Period
#> <int> <dbl> <dbl> <chr>
#> 1 2005 3000 NA <NA>
#> 2 2006 4000 1000 2005-2006
#> 3 2007 5000 1000 2006-2007
#> 4 2008 6000 1000 2007-2008
Here is a base R option
transform(data1,
Growth = c(NA, diff(Count)),
Period = c(NA, paste0(Year[-nrow(data1)], "-", Year[-1]))
)
which gives
Year Count Growth Period
1 2005 3000 NA <NA>
2 2006 4000 1000 2005-2006
3 2007 5000 1000 2006-2007
4 2008 6000 1000 2007-2008

How to summarize `Number of days since first date` and `Number of days seen` by ID and for a large data frame

The dataframe df1 summarizes detections of individuals (ID) through the time (Date). As a short example:
library(lubridate)
df1 <- data.frame(ID = c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since it was first detected (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays by Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using the summarising functions in dplyr:
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
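df12 <- aggregate(Date ~ ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
df12 <- cbind(df12["ID"], df12$Date)  # aggregate() returns a matrix column; split it into Ndays and Ndifdays
df12$Prop <- df12$Ndifdays / df12$Ndays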
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', count the unique values of 'Date' with n_distinct to get 'Ndifdays', and then divide the number of distinct dates by Ndays to get 'Prop'.
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294

Calculate change over time with tidy data in R - do you have to spread and gather?

Quick question about calculating a change over time for tidy data. Do I need to spread the data, mutate the variable and then gather the data again (see below), or is there a quicker way to do this while keeping the data tidy?
Here is an example:
df <- data.frame(country = c(1, 1, 2, 2),
year = c(1999, 2000, 1999, 2000),
value = c(20, 30, 40, 50))
df
country year value
1 1 1999 20
2 1 2000 30
3 2 1999 40
4 2 2000 50
To calculate the change in value between 1999 and 2000 I would:
library(dplyr)
library(tidyr)
df2 <- df %>%
spread(year, value) %>%
mutate(change.99.00 = `2000` - `1999`) %>%
gather(year, value, c(`1999`, `2000`))
df2
country change.99.00 year value
1 1 10 1999 20
2 2 10 1999 40
3 1 10 2000 30
4 2 10 2000 50
This seems a laborious way to do this. I assume there should be a neat way to do this while keeping the data in narrow, tidy format, by grouping the data or something but I can't think of it and I can't find an answer online.
Is there an easier way to do this?
After grouping by 'country', get the diff of 'value' filtered with the logical expression year %in% 1999:2000
library(dplyr)
df %>%
group_by(country) %>%
mutate(change.99.00 = diff(value[year %in% 1999:2000]))
# A tibble: 4 x 4
# Groups: country [2]
# country year value change.99.00
# <dbl> <dbl> <dbl> <dbl>
#1 1 1999 20 10
#2 1 2000 30 10
#3 2 1999 40 10
#4 2 2000 50 10
NOTE: Here, we assume that the 'year' is not duplicated per 'country'
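diff() also depends on the rows of each country being ordered by year. If that cannot be guaranteed, one option is to pick the two values by year explicitly. The sketch below uses a deliberately unsorted variant of the example data (a modification made here for illustration) and still assumes each year appears exactly once per country.

```r
library(dplyr)

# Same values as the example, but country 1 is listed with 2000 first
df <- data.frame(country = c(1, 1, 2, 2),
                 year = c(2000, 1999, 1999, 2000),
                 value = c(30, 20, 40, 50))

# Select each year's value by name, so row order within a group
# no longer affects the sign of the change.
res <- df %>%
  group_by(country) %>%
  mutate(change.99.00 = value[year == 2000] - value[year == 1999]) %>%
  ungroup()

res
```

With diff() the unsorted rows for country 1 would give a change of -10; the explicit subtraction gives 10 for both countries.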

Resources