I have two datasets I would like to merge in R: one is a long catch dataset and the other is a small effort dataset. I would like to join these so that I can multiply values for the same year AND industry together. E.g., the small effort columns will be repeated many times over, as they are industry-wide characteristics. I think this is a very simple merge, but I am having trouble making it work!
Catch <- data.frame(
  Species = c("a", "a", "c", "c", "a", "b"),
  Industry = c("ag", "fi", "ag", "fi", "ag", "fi"),
  Year = c("1990", "1990", "1991", "1992", "1990", "1990"),
  Catch = c(0, 1, 4, 7, 5, 6))
Effort <- data.frame(
  Industry = c("ag", "ag", "ag", "fi", "fi", "fi"),
  Year = c("1990", "1991", "1992", "1990", "1991", "1992"),
  Effort = c(0, 1, 4, 7, 5, 6))
What I have tried so far:
effort_catch <- merge(Effort, Catch, by.x = Year, by.y = Year)
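First, note that your attempt errors because by.x and by.y expect column names as quoted strings (by.x = "Year"), not bare names, and joining on Year alone would also match rows across industries. A minimal sketch of a corrected call, joining on both keys:
effort_catch <- merge(Effort, Catch, by = c("Industry", "Year"))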
Here are two options; I am not sure which one is what you need:
transform(
  merge(Catch, Effort, by = c("Industry", "Year"), all.x = TRUE),
  prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 fi 1990 a 1 7 7
5 fi 1990 b 6 7 42
6 fi 1992 c 7 6 42
or
transform(
  merge(Catch, Effort, by = c("Industry", "Year"), all = TRUE),
  prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 ag 1992 <NA> NA 4 NA
5 fi 1990 a 1 7 7
6 fi 1990 b 6 7 42
7 fi 1991 <NA> NA 5 NA
8 fi 1992 c 7 6 42
Here's a solution using dplyr
library(dplyr)
full_join(Catch, Effort) %>%
mutate(Multiplied = Catch * Effort)
#> Joining, by = c("Industry", "Year")
#> Species Industry Year Catch Effort Multiplied
#> 1 a ag 1990 0 0 0
#> 2 a fi 1990 1 7 7
#> 3 c ag 1991 4 1 4
#> 4 c fi 1992 7 6 42
#> 5 a ag 1990 5 0 0
#> 6 b fi 1990 6 7 42
#> 7 <NA> ag 1992 NA 4 NA
#> 8 <NA> fi 1991 NA 5 NA
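To make the join keys explicit (and silence the "Joining, by" message), you can pass by yourself; by is a documented argument of full_join:
full_join(Catch, Effort, by = c("Industry", "Year")) %>%
  mutate(Multiplied = Catch * Effort)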
I'm trying to use dplyr to calculate medians by grouping 3 different columns and in 3 year increments.
My data looks like this:
data <- data.frame(
  Year = c("1990", "1990", "1992", "1993", "1994", "1990", "1991",
           "1990", "1991", "1992", "1994", "1995"),
  Type = c("Al", "Al", "Al", "Al", "Al", "Al", "Al",
           "Cu", "Cu", "Cu", "Cu", "Cu"),
  Frac = c("F", "F", "F", "F", "F", "UF", "UF",
           "F", "F", "UF", "UF", "UF"),
  Value = c(0.1, 0.2, 0.3, 0.6, 0.7, 1.3, 1.5, 0.4, 0.2, 0.9, 2.3, 2.9))
I would like to calculate the median of "Value" in 3 year groupings and also grouping by "Type" and "Frac".
The problem is that sometimes a year is missing, so I want it to group in 3-year increments based on the data that I have. To show what I mean with my example data, it would be grouped like this: (1990, 1992, 1993) for Al and F, then just (1994) for Al and F since there's no more data for Al and F; then (1990, 1991) for Al and UF since there are only 2 years' worth of data. So basically I want it grouped by 3 years if possible, but if not, then whatever is left over.
This is the end table I would like to have:
stats_wanted <- data.frame(
  Year = c("1990, 1992, 1993", "1994", "1990, 1991", "1990, 1991", "1992, 1994, 1995"),
  Type = c("Al", "Al", "Al", "Cu", "Cu"),
  Frac = c("F", "F", "UF", "F", "UF"),
  Median = c(0.25, 0.7, 1.4, 0.3, 2.3))
Hopefully this makes sense... let me know if you have any questions :)!
I do not know dplyr, but here is a data.table solution.
library(data.table)
setDT(data)
data = data[order(Type,Frac,Year)]
# data = data[order(Year)] also works fine
data[
!duplicated(.SD,by=c('Year','Type','Frac')),
yeargroup:=0:(.N-1) %/% 3,
.(Type,Frac)]
# !duplicated... selects only the first unique row by year,type,frac
# 0:(.N-1) gives 0 to N-1 for each Type,Frac group
# %/% 3 gives the integer quotient (floor division), so every 3 unique years share a group
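# To see the index arithmetic on its own (a quick sketch):
0:8 %/% 3
# [1] 0 0 0 1 1 1 2 2 2  <- every 3 consecutive positions share a group id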
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 NA <- NA because dupe Year,Type,Frac
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
# handle dupe Year,Type,Frac rows:
data[,yeargroup:=max(yeargroup,na.rm=T),.(Year,Type,Frac)]
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 0 <- fixed NA
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
stats_wanted = data[,
.(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
.(Type,Frac,yeargroup)]
> stats_wanted
Type Frac yeargroup Year Median
1: Al F 0 1990, 1992, 1993 0.25
2: Al F 1 1994 0.70
3: Al UF 0 1990, 1991 1.40
4: Cu F 0 1990, 1991 0.30
5: Cu UF 0 1992, 1994, 1995 2.30
PS: @ronak-shah posted a concise dplyr solution, which inspired me to post another data.table solution that is even more concise:
> data[
order(Year),
.(Year,Value,group=(rleid(Year)-1)%/%3),
.(Type,Frac)
][,
.(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
.(Type,Frac,group)
]
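For anyone unfamiliar with rleid: it assigns a run-length id, so repeated years share an index before the %/% 3 split (a quick sketch):
rleid(c("1990", "1990", "1992", "1993", "1994"))
# [1] 1 1 2 3 4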
Here's a dplyr solution -
For each Type and Frac, we create a group column which assigns the same number to every 3 values. For each group, we concatenate the Year value and calculate the median.
library(dplyr)
data %>%
group_by(Type, Frac) %>%
mutate(group = match(Year, unique(Year)),
group = ceiling(group/3)) %>%
group_by(group, .add = TRUE) %>%
summarise(Year = toString(unique(Year)),
Median = median(Value), .groups = 'drop') %>%
select(Year, Type, Frac, Median)
# Year Type Frac Median
# <chr> <chr> <chr> <dbl>
#1 1990, 1992, 1993 Al F 0.25
#2 1994 Al F 0.7
#3 1990, 1991 Al UF 1.4
#4 1990, 1991 Cu F 0.3
#5 1992, 1994, 1995 Cu UF 2.3
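The group construction can be checked in isolation, e.g. on the Al/F years (a quick sketch):
yrs <- c("1990", "1990", "1992", "1993", "1994")
ceiling(match(yrs, unique(yrs)) / 3)
# [1] 1 1 1 1 2  <- 1990/1992/1993 in one group, 1994 in the next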
I have a dataframe of daily water chemistry values taken from deployed sensors. I'm trying to calculate rolling 7-day averages of daily maximum values. This is in-situ environmental data, so the data can be a bit messy.
Here are the rules for calculating the averages and assigning quality levels:
Data is graded and given a quality value (DQL) for the day (dyDQL).
'A' is high quality, 'B' is medium, and 'E' is poor.
The 7-day average is calculated at the end of a 7-day period.
Only 6 complete days are needed to calculate a 7-day average (i.e., the window can miss 1 day of data).
If there are at least 6 days' worth of 'A'- and 'B'-graded data plus 1 day of 'E', discard the 'E' data and calculate the 7-day average using the 6 days of 'A' and 'B' data.
I have this working with a loop that, for each row, creates a new dataframe containing the 7-day window and then calculates the moving average. See the minimal example below.
Note that there are missing dates for the 11th, 16th, 17th, and 18th in this example:
daily_data <- tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A",
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A",
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A",
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A",
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A",
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A",
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A"
)
for (l in seq_len(nrow(daily_data))) {
  station_7day <- dplyr::filter(daily_data,
    dplyr::between(date,
                   daily_data[[l, "date"]] - lubridate::days(6),
                   daily_data[[l, "date"]]))
  daily_data[l, "ma.max7"] <- dplyr::case_when(
    nrow(subset(station_7day, dyDQL %in% c("A"))) == 7 & l >= 7 ~ mean(station_7day$dyMax),
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    max(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~
      mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ mean(station_7day$dyMax),
    TRUE ~ NA_real_)
  daily_data[l, "ma.max7_DQL"] <- dplyr::case_when(
    nrow(subset(station_7day, dyDQL %in% c("A"))) == 7 & l >= 7 ~ "A",
    nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    max(station_7day$dyDQL == "E") & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >= 7 ~ "B",
    nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >= 7 ~ "E",
    TRUE ~ NA_character_)
}
The expected results are:
tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL, ~ma.max7, ~ma.max7_DQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A", NA, NA,
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A", NA, NA,
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A", NA, NA,
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A", NA, NA,
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A", NA, NA,
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A", NA, NA,
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A", 24.8991428571429, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B", 24.7855714285714, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E", 24.5141428571429, "B",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A", 24.15, "B",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A", 23.531, "E",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A", 23.354, "E",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A", 23.2975, "E",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A", 23.583, "E",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A", NA, NA
)
The code works fine, but it is very slow when calculating values over multi-year spans of data with multiple water quality parameters at multiple locations.
Because a 7-day value can be calculated from 6 days of data, I don't think I can use any of the rolling functions from the zoo package. Nor do I think I can use roll_mean from the roll package, given the variable rule of discarding 1 day's worth of 'E' data when there are 6 days of 'A' or 'B' data.
Is there a way to vectorize this, in order to avoid looping through every row of data?
I used tidyverse and runner, combining everything in a single piped expression. An explanation of the syntax:
I first collected seven days' worth (per the logic provided) of DQL and max values into a list column using runner.
Before doing that, I converted dyDQL into an ordered factor, which the final summarise step uses.
Second, I used purrr::map to modify each list element according to the given conditions:
windows with fewer than six days are not counted;
if there is exactly one 'E' among 7 values, that row is not counted.
Finally, I unnested the list using unnest_wider.
library(tidyverse)  # dplyr, purrr (map) and tidyr (unnest_wider) are used below
library(runner)
daily_data %>%
  mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = TRUE),
         d = runner(x = data.frame(a = dyMax, b = dyDQL),
                    k = "7 days",
                    lag = 0,
                    idx = date,
                    f = function(x) list(x))) %>%
  mutate(d = map(d, ~ .x %>% group_by(b) %>%
                   mutate(c = n()) %>%
                   ungroup() %>%
                   filter(!n() < 6) %>%
                   filter(!(b == "E" & c == 1 & n() == 7)) %>%
                   summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)),
                             ma.max7.DQL = max(b)))) %>%
  unnest_wider(d)
# A tibble: 15 x 7
Monitoring.Location.ID date dyMax dyMin dyDQL ma.max7 ma.max7.DQL
<chr> <date> <dbl> <dbl> <ord> <dbl> <ord>
1 River 1 2018-07-01 24.2 22.5 A NA NA
2 River 1 2018-07-02 24.6 20.4 A NA NA
3 River 1 2018-07-03 24.8 20.1 A NA NA
4 River 1 2018-07-04 25.3 20.7 A NA NA
5 River 1 2018-07-05 25.5 20.9 A NA NA
6 River 1 2018-07-06 25.0 21.0 A 24.9 A
7 River 1 2018-07-07 24.8 20.7 A 24.9 A
8 River 1 2018-07-08 23.4 20.8 B 24.8 B
9 River 1 2018-07-09 22.7 18.9 E 24.8 B
10 River 1 2018-07-10 22.3 18.2 A 24.4 B
11 River 1 2018-07-12 22.9 19.0 A 23.5 E
12 River 1 2018-07-13 24.0 19.5 A 23.4 E
13 River 1 2018-07-14 24.5 19.9 A 23.3 E
14 River 1 2018-07-15 25.1 20.6 A 23.6 E
15 River 1 2018-07-19 24.9 20.7 A NA NA
Here's a vectorized approach using slider::slide_index to calculate the high-quality and backup-quality values, then combine for the best available:
library(tidyverse); library(slider)
The following function groups by location, calculates the weekly mean (including everything from date - 6 up to and including date) and the number of observations included, then filters to windows with at least 6 observations.
get_weekly_by_loc <- function(df) {
df %>%
group_by(Monitoring.Location.ID) %>%
mutate(mean = slide_index_dbl(dyMax, date, mean, .complete = TRUE, .before = lubridate::days(6)),
count = slide_index_dbl(dyMax, date, ~sum(!is.na(.)), .before = lubridate::days(6))) %>%
ungroup() %>%
filter(count >= 6)
}
Then we can run this function on just A/B data and overall:
daily_data_high_quality <- daily_data %>%
filter(dyDQL %in% c("A", "B")) %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, high_qual_mean = mean)
daily_data_backup <- daily_data %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, backup_mean = mean)
Then join those and use the high quality if available:
daily_data %>%
left_join(daily_data_high_quality) %>%
left_join(daily_data_backup) %>%
mutate(max7_DQL = coalesce(high_qual_mean, backup_mean)) %>%
mutate(moar_digits = format(max7_DQL, nsmall = 6))
Result
# A tibble: 15 x 9
Monitoring.Location.ID date dyMax dyMin dyDQL high_qual_mean backup_mean max7_DQL moar_digits
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 River 1 2018-07-01 24.2 22.5 A NA NA NA " NA"
2 River 1 2018-07-02 24.6 20.4 A NA NA NA " NA"
3 River 1 2018-07-03 24.8 20.1 A NA NA NA " NA"
4 River 1 2018-07-04 25.3 20.7 A NA NA NA " NA"
5 River 1 2018-07-05 25.5 20.9 A NA NA NA " NA"
6 River 1 2018-07-06 25.0 21.0 A NA NA NA " NA"
7 River 1 2018-07-07 24.8 20.7 A 24.9 24.9 24.9 "24.899143"
8 River 1 2018-07-08 23.4 20.8 B 24.8 24.8 24.8 "24.785571"
9 River 1 2018-07-09 22.7 18.9 E NA 24.5 24.5 "24.514143"
10 River 1 2018-07-10 22.3 18.2 A 24.4 24.2 24.4 "24.398833"
11 River 1 2018-07-12 22.9 19.0 A NA 23.5 23.5 "23.531000"
12 River 1 2018-07-13 24.0 19.5 A NA 23.4 23.4 "23.354000"
13 River 1 2018-07-14 24.5 19.9 A NA 23.3 23.3 "23.297500"
14 River 1 2018-07-15 25.1 20.6 A NA 23.6 23.6 "23.583000"
15 River 1 2018-07-19 24.9 20.7 A NA NA NA " NA"
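A side note on why slide_index (rather than a fixed-width roller) fits here: the window is defined on the date index, so missing calendar days shrink the window instead of shifting it. A minimal standalone sketch with made-up numbers:
library(slider)
x <- c(1, 2, 4, 8)
d <- as.Date(c("2018-07-01", "2018-07-02", "2018-07-04", "2018-07-09"))
slide_index_dbl(x, d, mean, .before = lubridate::days(6))
# [1] 1.000000 1.500000 2.333333 6.000000  <- last window spans 07-03..07-09, so only 4 and 8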
I am trying to make a table that counts the number of consecutive years grouped by columns "state" and "p" that looks like this:
data_right <- data.table(
  state = c("NY", "NY", "NY", "NY", "NY", "NY", "PA", "PA", "PA", "PA", "PA", "PA"),
  p = c("n", "n", "n", "n", "p", "p", "n", "n", "n", "p", "p", "p"),
  Year = c("1973", "1974", "1977", "1978", "1988", "1989", "1991", "1992", "1993", "1920", "1929", "1931"),
  Consecutive_Yrs = c(1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1))
The code I am using right now is not working properly. I'm trying mutate and group_by statements in dplyr but am having no luck. I also cannot use the data.table package because my R version is not up to date.
Any help to get this output is greatly appreciated!
library(dplyr)
data_right %>%
group_by(state, p) %>%
mutate(grp = cumsum(c(TRUE, diff(as.integer(Year)) > 1))) %>%
group_by(state, p, grp) %>%
mutate(cy = row_number()) %>%
ungroup() %>%
select(-grp)
# # A tibble: 12 x 5
# state p Year Consecutive_Yrs cy
# <chr> <chr> <chr> <dbl> <int>
# 1 NY n 1973 1 1
# 2 NY n 1974 2 2
# 3 NY n 1977 1 1
# 4 NY n 1978 2 2
# 5 NY p 1988 1 1
# 6 NY p 1989 2 2
# 7 PA n 1991 1 1
# 8 PA n 1992 2 2
# 9 PA n 1993 3 3
# 10 PA p 1920 1 1
# 11 PA p 1929 1 1
# 12 PA p 1931 1 1
Assumes the data is already ordered by Year.
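The grp column works by flagging rows where the year jumps by more than 1 and cumulatively summing the flags, so each run of consecutive years gets its own id (a quick sketch):
yrs <- c(1973, 1974, 1977, 1978)
cumsum(c(TRUE, diff(yrs) > 1))
# [1] 1 1 2 2  <- two runs: 1973-1974 and 1977-1978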
So I have a table that looks like this currently:
data_wrong <- data.table(
  State = c("NY", "NY", "NY", "NY", "PA", "PA", "PA", "NJ", "NJ", "NJ"),
  Year = c("1973", "1974", "1975", "2005", "1992", "1993", "2001", "1930", "1931", "1932"),
  Consecutive_Yrs = c(1, 2, 3, 1, 1, 6, 1, 1, 9, 10))
And I'd like it to look like this:
data <- data.table(
  State = c("NY", "NY", "NY", "NY", "PA", "PA", "PA", "NJ", "NJ", "NJ"),
  Year = c("1973", "1974", "1975", "2005", "1992", "1993", "2001", "1930", "1931", "1932"),
  Consecutive_Yrs = c(1, 2, 3, 1, 1, 2, 1, 1, 2, 3))
This is the code I'm using right now to get my table:
data$diff <- NA
data <- data %>%
  group_by(State) %>%
  arrange(State) %>%
  mutate(diff = Year - lag(Year, default = first(Year)))
data$Consecutive_Yrs <- 1
data$Consecutive_Yrs <- ifelse(data$diff == 1, cumsum(data$Consecutive_Yrs), 1)
Any help would be greatly appreciated :)
As it is a data.table, an option is to use data.table methods:
library(data.table)
data_wrong[, grp := cumsum(c(TRUE, diff(as.numeric(Year)) > 1)),
.(State)][, Consecutive_Yrs := as.numeric(seq_len(.N)), .(State, grp)]
data_wrong
# State Year Consecutive_Yrs grp
# 1: NY 1973 1 1
# 2: NY 1974 2 1
# 3: NY 1975 3 1
# 4: NY 2005 1 2
# 5: PA 1992 1 1
# 6: PA 1993 2 1
# 7: PA 2001 1 2
# 8: NJ 1930 1 1
# 9: NJ 1931 2 1
#10: NJ 1932 3 1
Or use rowid
data_wrong[, Consecutive_Yrs2 := rowid(rleid(as.numeric(Year) -
shift(as.numeric(Year), fill = as.numeric(Year[1])) >1)), .(State)]
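Broken down on NY's years, the gap flag and the run-restarted counter look like this (a quick sketch):
yr <- c(1973, 1974, 1975, 2005)
yr - shift(yr, fill = yr[1]) > 1
# [1] FALSE FALSE FALSE TRUE
rowid(rleid(yr - shift(yr, fill = yr[1]) > 1))
# [1] 1 2 3 1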
I have a csv file where rows 1-5 represent one state, rows 6-10 another, etc. I also have a column with the years 1970, 1980, ..., 2010 repeated for each state. In R (although I'm not opposed to an Excel solution if that is easier), I want, for each state, to calculate the percent difference between that year and 1970, i.e. for Alabama 1990 it would be (AL 1990 - AL 1970)/(AL 1970), and add it as a new column to the data table so I can export it to a csv.
State, Year, Num
AL, 1970, 1
AL, 1980, 2
AL, 1990, 3
AL, 2000, 4
AL, 2010, 6
Output would be a column
pct_change
0
1
2
3
5
The dplyr package includes the function first which provides an easy method for getting the first value of a group. So if we arrange by Year to make it so that 1970 will be the first value of each group, when we group_by(State), we can use first(Num) to get that first value of Num which represents the value from 1970:
# Example data with 2 states
df <- structure(list(State = c("AL", "AL", "AL", "AL", "AL", "TX",
"TX", "TX", "TX", "TX"), Year = c(1970L, 1980L, 1990L, 2000L,
2010L, 1970L, 1980L, 1990L, 2000L, 2010L), Num = c(1, 2, 3, 4,
6, 5, 2, 10, 12, 6)), class = "data.frame", row.names = c(NA,
-10L))
library(dplyr)
df %>%
arrange(State, Year) %>%
group_by(State) %>%
mutate(perc_diff = 100 * (Num - first(Num))/first(Num))
# A tibble: 10 x 4
# Groups: State [2]
State Year Num perc_diff
<chr> <int> <dbl> <dbl>
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), order by 'State' and 'Year' in i, then, grouped by 'State', get the difference of 'Num' from the first value of 'Num' and assign (:=) to create 'perc_diff':
library(data.table)
setDT(df)[order(State, Year), perc_diff :=
100 * (Num - first(Num))/first(Num), State][]
# State Year Num perc_diff
# 1: AL 1970 1 0
# 2: AL 1980 2 100
# 3: AL 1990 3 200
# 4: AL 2000 4 300
# 5: AL 2010 6 500
# 6: TX 1970 5 0
# 7: TX 1980 2 -60
# 8: TX 1990 10 100
# 9: TX 2000 12 140
#10: TX 2010 6 20
Or using base R
v1 <- with(df, ave(Num, State, FUN = function(x) x[1]))
df$perc_diff <- with(df, 100 * (Num - v1)/v1)
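For reference, v1 here is just the first Num per State recycled to each group's length (this relies on 1970 being the first row within each State, which the data above satisfies):
v1
# [1] 1 1 1 1 1 5 5 5 5 5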
Base R solution using tapply
df <- df[with(df, order(State, Year)), ]
df$pct_change <- unlist(tapply(df$Num, df$State, function(x) 100 * (x - x[1]) / x[1]))
> df
State Year Num pct_change
1 AL 1970 1 0
2 AL 1980 2 100
3 AL 1990 3 200
4 AL 2000 4 300
5 AL 2010 6 500
6 TX 1970 5 0
7 TX 1980 2 -60
8 TX 1990 10 100
9 TX 2000 12 140
10 TX 2010 6 20
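Note that tapply splits df$Num by the factor levels of State (alphabetical by default) and unlist flattens the result in that level order, so the order(State, Year) sort on the first line is what keeps the pasted-back values aligned with the rows.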