new variable by subtracting rows based on condition in grouped data - r

I have data for every census tract in the country at three time points (2000, 2013, 2019) that is grouped by CBSA. I'm trying to create a new variable called cont_chg_pedu_colplus that is the difference between the 2013 and 2000 value for pedu_colplus. So, in my example below, I want to create a new column called cont_chg_pedu_colplus that returns the value of 3.0 (14.6 - 11.6). Ideally, each group of tracts would have the same value, since I'm only interested in the difference between time 1 and time 2.
tractid year CBSA_name pedu_colplus
<chr> <dbl> <chr> <dbl>
1 48059030101 2000 Abilene, TX 11.6
2 48059030101 2013 Abilene, TX 14.6
3 48059030101 2019 Abilene, TX 20.6
4 48059030102 2000 Abilene, TX 11.6
5 48059030102 2013 Abilene, TX 14.2
6 48059030102 2019 Abilene, TX 20.2
Below is the code I have so far. It throws the error shown after it, I think because I'm subsetting on just one year (37 rows instead of the 111 in the dataset). I'd prefer not to make my data wide, because I have a bunch of other data manipulations to do. I couldn't get lag to work.
gent_vars_prelim <- outcome_data %>%
mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000], na.rm = TRUE) %>%
glimpse()
Problem with mutate() input cont_chg_pedu_colplus.
x Input cont_chg_pedu_colplus can't be recycled to size 37.
ℹ Input cont_chg_pedu_colplus is pedu_colplus[year == 2013] - pedu_colplus[year == 2000].
ℹ Input cont_chg_pedu_colplus must be size 37 or 1, not 0.
ℹ The error occurred in group 1: CBSA_name = "Abilene, TX", year = 2000
Any thoughts? Thanks.

I'll assume that for each unique pair of tractid and CBSA_name, there are up to 3 entries for year (possible values 2000, 2013, or 2019) and no two years are the same for a given pair of tractid and CBSA_name.
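The last line of the error shows that the data is currently grouped by CBSA_name and year, so each group contains a single year and one of the two subsets is always empty (size 0). First, we'll regroup the values in the data frame by tractid and CBSA_name, which replaces that grouping. Each group will then have up to 3 rows, one for each year. We do this with dplyr::group_by(tractid, CBSA_name).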
Next, we'll force the group to have all 3 years. We do this with tidyr::complete(year = c(2000, 2013, 2019)) as you suggested in the comments. (This is better than my comment using filter(n() == 3), because we actually wouldn't care if only 2019 were missing, and we are able to preserve incomplete groups.)
Then, we can compute the result you're interested in: dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]). We just need to dplyr::ungroup() after this and we're done.
Final code:
library(dplyr)  # provides the %>% pipe

gent_vars_prelim <- outcome_data %>%
  dplyr::group_by(tractid, CBSA_name) %>%
  tidyr::complete(year = c(2000, 2013, 2019)) %>%  # add NA rows for missing years
  dplyr::mutate(cont_chg_pedu_colplus = pedu_colplus[year == 2013] - pedu_colplus[year == 2000]) %>%
  dplyr::ungroup() %>%
  dplyr::glimpse()
Test case:
outcome_data <- data.frame(tractid = c(48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030101, 48059030102, 48059030102, 48059030102, 48059030103),
year = c(2000, 2013, 2019, 2000, 2013, 2019, 2000, 2013, 2019, 2000),
CBSA_name = c("Abilene, TX", "Abilene, TX", "Abilene, TX", "Austin, TX", "Austin, TX", "Austin, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX", "Abilene, TX"),
pedu_colplus = c(11.6, 14.6, 20.6, 8.4, 9.0, 9.6, 11.6, 14.2, 20.2, 4.0))
Result:
> tibble(gent_vars_prelim)
# A tibble: 12 x 1
gent_vars_prelim$tractid $CBSA_name $year $pedu_colplus $cont_chg_pedu_colplus
<dbl> <fct> <dbl> <dbl> <dbl>
1 48059030101 Abilene, TX 2000 11.6 3
2 48059030101 Abilene, TX 2013 14.6 3
3 48059030101 Abilene, TX 2019 20.6 3
4 48059030101 Austin, TX 2000 8.4 0.600
5 48059030101 Austin, TX 2013 9 0.600
6 48059030101 Austin, TX 2019 9.6 0.600
7 48059030102 Abilene, TX 2000 11.6 2.60
8 48059030102 Abilene, TX 2013 14.2 2.60
9 48059030102 Abilene, TX 2019 20.2 2.60
10 48059030103 Abilene, TX 2000 4 NA
11 48059030103 Abilene, TX 2013 NA NA
12 48059030103 Abilene, TX 2019 NA NA
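If adding placeholder rows is undesirable, a possible variant skips tidyr::complete() and instead returns NA whenever one of the two years is missing from a tract. This is only a sketch: value_in_year is a hypothetical helper (not part of dplyr), and the small data frame is a cut-down version of the test case above.

```r
library(dplyr)

# Hypothetical helper: the single value of x where year == target,
# or NA when that year is absent (or duplicated) within the group.
value_in_year <- function(x, year, target) {
  v <- x[year == target]
  if (length(v) == 1) v else NA_real_
}

# Cut-down version of the test case: the third tract only has a 2000 row.
outcome_data <- data.frame(
  tractid      = c(rep(48059030101, 3), rep(48059030102, 3), 48059030103),
  year         = c(2000, 2013, 2019, 2000, 2013, 2019, 2000),
  CBSA_name    = "Abilene, TX",
  pedu_colplus = c(11.6, 14.6, 20.6, 11.6, 14.2, 20.2, 4.0)
)

gent_vars_alt <- outcome_data %>%
  group_by(tractid, CBSA_name) %>%
  mutate(cont_chg_pedu_colplus =
           value_in_year(pedu_colplus, year, 2013) -
           value_in_year(pedu_colplus, year, 2000)) %>%
  ungroup()
```

The row count stays at 7 here, and the tract with only a 2000 row gets NA rather than triggering the recycling error.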

Related

Calculating the change in % of data by year

I am trying to calculate the % change by year in the following dataset. Does anyone know if this is possible?
I have the difference but am unsure how to turn it into a percentage:
diff(economy_df_by_year$gdp_per_capita)
df
year gdp
1998 8142.
1999 8248.
2000 8211.
2001 7926.
2002 8366.
2003 10122.
2004 11493.
2005 12443.
2006 13275.
2007 15284.
Assuming that gdp is the total value, you could do something like this:
library(tidyverse)
tribble(
~year, ~gdp,
1998, 8142,
1999, 8248,
2000, 8211,
2001, 7926,
2002, 8366,
2003, 10122,
2004, 11493,
2005, 12443,
2006, 13275,
2007, 15284
) -> df
df |>
  mutate(pdiff = 100 * (gdp - lag(gdp)) / lag(gdp))  # change relative to the previous year
#> # A tibble: 10 × 3
#> year gdp pdiff
#> <dbl> <dbl> <dbl>
#> 1 1998 8142 NA
#> 2 1999 8248 1.30
#> 3 2000 8211 -0.449
#> 4 2001 7926 -3.47
#> 5 2002 8366 5.55
#> 6 2003 10122 21.0
#> 7 2004 11493 13.5
#> 8 2005 12443 8.27
#> 9 2006 13275 6.69
#> 10 2007 15284 15.1
This relies on the tidyverse.
If gdp is already the difference rather than the total, you will also need the total to compute a percentage, if that is what you mean by change in percentage by year.
df$change <- NA
df$change[2:10] <- (df[2:10, "gdp"] - df[1:9, "gdp"]) / df[1:9, "gdp"]
This assigns the yearly GDP growth (as a fraction; multiply by 100 for a percentage) to each row except the first one, where it remains NA.
df$diff <- c(0,diff(df$gdp))
df$percentDiff <- 100*(c(0,(diff(df$gdp)))/(df$gdp - df$diff))
This is another possibility.
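The base R variants above can be wrapped in a small helper. The sketch below uses a hypothetical function name, pct_change, and expresses each change relative to the previous year's value, with NA for the first year.

```r
# Hypothetical helper: percent change of each element relative to the one before it.
pct_change <- function(x) {
  # diff(x) has length(x) - 1 elements; head(x, -1) drops the last value,
  # so each difference is divided by the value it started from.
  c(NA, 100 * diff(x) / head(x, -1))
}

df <- data.frame(
  year = 1998:2007,
  gdp  = c(8142, 8248, 8211, 7926, 8366, 10122, 11493, 12443, 13275, 15284)
)
df$pct_change <- pct_change(df$gdp)
```

Leaving the first element as NA, rather than 0, avoids suggesting that the first year had no change.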

Find average change in timeseries

I have an annual mean timeseries dataset for 15 years, and I am trying to find the average change/increase/decrease in this timeseries.
The timeseries I have is spatial (average values for each grid-cell/pixel, years repeat).
How can I do this in R via dplyr?
Sample data
year = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean = c(24, 24.5, 25.8,25, 24.8, 25, 23.5, 23.8, 24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)
Code
library(tidyverse)
df = data.frame(year, Tmean)
change = df$year %>%
# Sort by year
arrange(year) %>%
mutate(Diff_change = Tmean - lag(Tmean), # Difference in Tmean between years
Rate_percent = (Diff_change / year)/Tmean * 100) # Percent change # **returns inf values**
Average_change = mean(change$Rate_percent, na.rm = TRUE)
To find the average: mean(). To find the differences or changes: diff()
So, to find the average change:
> avg_change <- mean(diff(Tmean))
> print(avg_change)
[1] 0.06666667
If you need that in percentage, then you want to find out how much the difference between an element and its previous one (this year - last year) is in percentage with respect to last year, like so:
> pct_change <- Tmean[2:length(Tmean)] / Tmean[1:(length(Tmean)-1)] - 1
> avg_pct_change <- mean(pct_change) * 100
> print(avg_pct_change)
[1] 0.3101632
We can put those vectors into a data frame to use with dplyr (...if that's how you want to do it; this is straightforward with base R as well).
library(dplyr)
df <- data.frame(year, Tmean)
change <- df %>%
  arrange(year) %>%
  mutate(Diff_change = Tmean - lag(Tmean),               # Difference in Tmean from the previous row
         Rate_percent = Diff_change / lag(Tmean) * 100)  # Percent change relative to the previous row
# Dividing by the difference in year is left out: years repeat, so that difference is often 0 and yields Inf
Average_change = mean(change$Rate_percent, na.rm = TRUE)
Results (with updated question data)
> change
year Tmean Diff_change Rate_percent
1 2005 24.0 NA NA
2 2005 24.5 0.5 2.0833333
3 2005 25.8 1.3 5.3061224
4 2005 25.0 -0.8 -3.1007752
5 2006 24.8 -0.2 -0.8000000
6 2006 25.0 0.2 0.8064516
7 2006 23.5 -1.5 -6.0000000
8 2006 23.8 0.3 1.2765957
9 2007 24.8 1.0 4.2016807
10 2007 25.0 0.2 0.8064516
11 2007 25.2 0.2 0.8000000
12 2007 25.8 0.6 2.3809524
13 2008 25.3 -0.5 -1.9379845
14 2008 25.6 0.3 1.1857708
15 2008 25.2 -0.4 -1.5625000
16 2008 25.0 -0.2 -0.7936508
> Average_change
[1] 0.3101632
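Because each year appears several times (one row per grid cell), consecutive rows within the same year are not really a change over time. One possible alternative, sketched here under the assumption that the quantity of interest is the year-to-year change in the spatial mean, is to average Tmean within each year first and then difference the yearly means. The names yearly_mean, change, and pct_change are made up for illustration.

```r
library(dplyr)

year  <- c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006,
           2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean <- c(24, 24.5, 25.8, 25, 24.8, 25, 23.5, 23.8,
           24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)

yearly_change <- data.frame(year, Tmean) %>%
  group_by(year) %>%
  summarise(yearly_mean = mean(Tmean)) %>%           # one row per year
  arrange(year) %>%
  mutate(change     = yearly_mean - lag(yearly_mean),  # change from the previous year
         pct_change = 100 * change / lag(yearly_mean))

# Average year-to-year change in the spatial mean (first year is NA and is dropped).
avg_yearly_change <- mean(yearly_change$change, na.rm = TRUE)
```

Whether this or the row-by-row version above is appropriate depends on what the repeated rows represent in the real data.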

R: How do I avoid getting an error when merging two data frames (group by/summarise)?

I have a big data frame of 80,000 rows. It was created by combining individual data frames from different years. The origin variable indicates the year of the entry's original data frame.
Here is an example of the first few of the big data frame rows that show how data frames from 2003 and 2011 were combined.
df_1:
ID City State origin
1 NY NY 2003
2 NY NY 2003
3 SF CA 2003
1 NY NY 2011
3 SF CA 2011
2 NY NY 2011
4 LA CA 2011
5 SD CA 2011
Now I want to create a new variable called first_appearance that takes the min of the origin variable for each ID:
final_df:
ID City State origin first_appearance
1 NY NY 2003 2003
2 NY NY 2003 2003
3 SF CA 2003 2003
1 NY NY 2011 2003
3 SF CA 2011 2003
2 NY NY 2011 2003
4 LA CA 2011 2011
5 SD CA 2011 2011
So far, I've tried using:
prestep_final <- df_1 %>% group_by(ID) %>% summarise(first_apperance = min(origin))
final_df <- merge(prestep_final, df_1, by = "ID")
Prestep_final works and produces a data frame with the ID and the first_appearance.
Unfortunately, the merge step doesn't work and yields a data frame with NA entries only.
How can I improve my code so that I can produce a table like final_df above? I'd appreciate any suggestions and don't have package preferences.
If you change summarise to mutate you get your desired result without merging:
library(tidyverse)
df <- tibble::tribble(
~ID, ~City, ~State, ~origin,
1, 'NY', 'NY', 2003,
2, 'NY', 'NY', 2003,
3, 'SF', 'CA', 2003,
1, 'NY', 'NY', 2011,
3, 'SF', 'CA', 2011,
2, 'NY', 'NY', 2011,
4, 'LA', 'CA', 2011,
5, 'SD', 'CA', 2011
)
df %>% group_by(ID) %>%
mutate(first_appearance = min(origin))
#> # A tibble: 8 x 5
#> # Groups: ID [5]
#> ID City State origin first_appearance
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 NY NY 2003 2003
#> 2 2 NY NY 2003 2003
#> 3 3 SF CA 2003 2003
#> 4 1 NY NY 2011 2003
#> 5 3 SF CA 2011 2003
#> 6 2 NY NY 2011 2003
#> 7 4 LA CA 2011 2011
#> 8 5 SD CA 2011 2011
Created on 2020-06-10 by the reprex package (v0.3.0)
An option with data.table
library(data.table)
setDT(df)[, first_appearance := min(origin), ID]
Or in base R
df$first_appearance <- with(df, ave(origin, ID, FUN = min))
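If the two-step route from the question is preferred, the summary table can be joined back onto the original rows. The cause of the NA-only result in the question cannot be determined from the code shown, so this is just a sketch of that route with dplyr::left_join(), using the example data from above. The name first_seen is made up.

```r
library(dplyr)

df_1 <- data.frame(
  ID     = c(1, 2, 3, 1, 3, 2, 4, 5),
  City   = c("NY", "NY", "SF", "NY", "SF", "NY", "LA", "SD"),
  State  = c("NY", "NY", "CA", "NY", "CA", "NY", "CA", "CA"),
  origin = c(2003, 2003, 2003, 2011, 2011, 2011, 2011, 2011)
)

# Step 1: one row per ID with its earliest origin year.
first_seen <- df_1 %>%
  group_by(ID) %>%
  summarise(first_appearance = min(origin))

# Step 2: attach that value to every original row; left_join keeps the row order of df_1.
final_df <- left_join(df_1, first_seen, by = "ID")
```

Compared with merge(), left_join() does not reorder the rows, which makes the result easier to compare against the input.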

R Markdown: Transforming pooled cross section data into a panel data set

I am currently trying to transform a cross-sectional data set into a panel data set.
Currently I have a variable called "state" and a variable called "year". I would like to re-arrange the observations, so that they are displayed per state per year and the numbers display averages of the other variables (e.g. income) per state per year respectively.
Does anyone have an idea how I could proceed?
Thank you very much in advance!
If I understand your question correctly, the code below should help. When asking questions, it helps to include a small example data set and your desired output.
This answer uses the dplyr package
library(dplyr)
Example data:
data <- tibble(state = c("florida", "florida", "florida",
"new_york", "new_york", "new_york"),
year = c(1990, 1990, 1992, 1992, 1992, 1994),
income = c(19, 13, 45, 34, 66, 34))
To produce:
# A tibble: 6 x 3
state year income
<chr> <dbl> <dbl>
1 florida 1990 19
2 florida 1990 13
3 florida 1992 45
4 new_york 1992 34
5 new_york 1992 66
6 new_york 1994 34
Code to summarise data (using dplyr package)
data %>%
group_by(state, year) %>%
summarise(
mean_income = mean(income)
)
Produces this output:
# A tibble: 4 x 3
# Groups: state [?]
state year mean_income
<chr> <dbl> <dbl>
1 florida 1990 16
2 florida 1992 45
3 new_york 1992 50
4 new_york 1994 34
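For comparison, here is a base R sketch of the same aggregation with aggregate(), which needs no extra packages. It reuses the example values from above; panel is a made-up name.

```r
data <- data.frame(
  state  = c("florida", "florida", "florida", "new_york", "new_york", "new_york"),
  year   = c(1990, 1990, 1992, 1992, 1992, 1994),
  income = c(19, 13, 45, 34, 66, 34)
)

# Mean income for every state/year combination that occurs in the data.
panel <- aggregate(income ~ state + year, data = data, FUN = mean)

# Sort by state, then year, and rename the averaged column.
panel <- panel[order(panel$state, panel$year), ]
names(panel)[names(panel) == "income"] <- "mean_income"
```

Additional variables can be averaged in the same call by putting them on the left-hand side of the formula with cbind().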

Column manipulation in R - matching correct names

I have a data.frame composed of multiple columns and thousands of rows. Below I attempt to display its (head):
|year |state_name|idealPoint| vote_no| vote_yes|
|:--------------|---------:|---------:|---------:|---------:|
|1971 | China | -25.0000| 31.0000| 45.4209|
|1972 | China | -26.2550| 38.2974| 45.4209|
|1973 | China | 28.2550| 35.2974| 45.4209|
|1994 | Czech | 27.2550| 34.2974| 45.4209|
As you can see, not all countries (there are 196 of them) joined voting at the UN in the same year.
What I want to do is create a new column in my data.frame (votes) that contains the absolute difference between China's ideal point and the Czech ideal point for a given year. I know how to create a new column with dplyr, but how do I match the correct countries from the list of 196? (Years in which only one of the two countries voted can then be deleted manually, I think.)
The final output should be a new data.frame (or new columns in votes) looking like this, assuming for instance that China's ideal point in 1994 was 2.2550:
|year |state_name|idealPoint|Abs.Difference China_Czech|
|:--------------|---------:|---------:|-------------------------:|
|1971 | China | -25.0000| NA |
|1972 | China | -26.2550| NA |
|1973 | China | 28.2550| NA |
|1994 | Czech | 27.2550| 25.0000 |
Code:
df1 <- data.frame(year = c(1994,1995,1996,1997,1994,1995,1996,1997),
state_name = c("China","China","China","China","Czech_Republic","Czech_Republic","Czech_Republic","Czech_Republic"),
idealpoints = c(-25.0000,-26.2550,28.2550,27.2550,-27.0000,-28.2550,29.2550,22.2550),
vote_no = c(31.0000,38.2974,35.2974,34.2974,33.0000,36.2974,37.2974,38.2974),
vote_yes = c(45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209))
china_df <- df1[df1$state_name == "China",]
czech_df <- df1[df1$state_name == "Czech_Republic",]
china_czech_merge <- merge(china_df,czech_df,by = "year")
china_czech_merge$Abs_diff <- abs(china_czech_merge$idealpoints.x - china_czech_merge$idealpoints.y)
Output:
year state_name.x idealpoints.x vote_no.x vote_yes.x state_name.y idealpoints.y vote_no.y vote_yes.y Abs_diff
1 1994 China -25.000 31.0000 45.4209 Czech_Republic -27.000 33.0000 45.4209 2
2 1995 China -26.255 38.2974 45.4209 Czech_Republic -28.255 36.2974 45.4209 2
3 1996 China 28.255 35.2974 45.4209 Czech_Republic 29.255 37.2974 45.4209 1
4 1997 China 27.255 34.2974 45.4209 Czech_Republic 22.255 38.2974 45.4209 5
I think this will work for you.
Thanks
Does this perhaps solve your problem?
library(tibble)
library(dplyr)
a <- tribble(
~year, ~ctry, ~vote,
1994, "China", 5,
1995, "China", 100,
1996, "China", 600,
1997, "China", 45,
1998, "China", 9,
1994, "Czech_Republic", 1,
1995, "Czech_Republic", 5,
1996, "Czech_Republic", 100,
1997, "Czech_Republic", 40,
1998, "Czech_Republic", 6,
)
a %>%
group_by(year) %>%
mutate(foo = abs(lag(lead(vote) - vote)))
Output:
# A tibble: 10 x 4
# Groups: year [5]
year ctry vote foo
<dbl> <chr> <dbl> <dbl>
1 1994 China 5 NA
2 1995 China 100 NA
3 1996 China 600 NA
4 1997 China 45 NA
5 1998 China 9 NA
6 1994 Czech_Republic 1 4
7 1995 Czech_Republic 5 95
8 1996 Czech_Republic 100 500
9 1997 Czech_Republic 40 5
10 1998 Czech_Republic 6 3
You'll have to filter down the data to fit your needs, e.g. by country.
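Note that the lag()/lead() trick depends on row order and on each year group containing only the two countries. A base R sketch that names the two countries explicitly is given below. It keeps the data long and yields NA for years in which either country is missing. The small data frame combines a few values from the question and is only illustrative; abs_diff_china_czech is a made-up column name.

```r
votes <- data.frame(
  year        = c(1971, 1994, 1995, 1994, 1995),
  state_name  = c("China", "China", "China", "Czech_Republic", "Czech_Republic"),
  idealpoints = c(-25.0, -25.0, -26.255, -27.0, -28.255)
)

china <- votes[votes$state_name == "China", ]
czech <- votes[votes$state_name == "Czech_Republic", ]

# For every row, look up both countries' ideal points for that row's year.
# match() returns NA for years a country did not vote, which propagates to the result.
votes$abs_diff_china_czech <- abs(
  china$idealpoints[match(votes$year, china$year)] -
  czech$idealpoints[match(votes$year, czech$year)]
)
```

With 196 countries, the two country names could be turned into function arguments so the same lookup works for any pair.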
