Attempting to create panel data from cross-sectional data - r

I'm attempting to transform data from the Global Terrorism Database so that instead of the unit being individual terror events, it will be "Country_Year", with one variable holding the number of terror events in that year.
I've managed to create a dataframe with one column holding all the Country_Year combinations as a single variable. I've also found that `table(GTD_94_Land$country_txt, GTD_94_Land$iyear)` shows exactly the values I would like the new variable to have. What I can't figure out is how to store these counts as a variable.
So my data looks like this:
eventid iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 199401010008 1994 1 1 1 182 Somalia
2 199401010012 1994 1 1 1 209 Turkey
3 199401010013 1994 1 1 1 209 Turkey
4 199401020003 1994 1 1 1 209 Turkey
5 199401020007 1994 1 1 0 106 Kuwait
6 199401030002 1994 1 1 1 209 Turkey
7 199401030003 1994 1 1 1 228 Yemen
8 199401030006 1994 1 1 0 53 Cyprus
9 199401040005 1994 1 1 0 209 Turkey
10 199401040006 1994 1 1 0 209 Turkey
11 199401040007 1994 1 1 1 209 Turkey
12 199401040008 1994 1 1 1 209 Turkey
and I would like to transform it so that I have:
Terror attacks iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1994 1 1 1 182 Somalia
2 8 1994 1 1 1 209 Turkey
5 1 1994 1 1 0 106 Kuwait
7 1 1994 1 1 1 228 Yemen
8 1 1994 1 1 0 53 Cyprus
I've looked at some solutions, but most of them seem to assume that the number the new variable should contain is already in the data.
All help is appreciated!

Assuming df is the original dataframe:
df_out = df %>%
  dplyr::select(-eventid) %>%
  dplyr::group_by(country_txt, iyear) %>%
  dplyr::mutate(Terrorattacks = n()) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Ideally, I would use summarise, but since I don't know the summarising criteria for the other columns, I have simply used mutate and slice.
Note: the 'crit' column values will be those of the first occurrence of each 'country_txt' and 'iyear' combination.
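For reference, a summarise()-based version is sketched below. It assumes (since the question doesn't say) that the crit and country columns should simply keep the first value seen within each country-year group:
library(dplyr)

df_out <- df %>%
  group_by(country_txt, iyear) %>%
  summarise(
    across(c(crit1, crit2, crit3, country), first),  # keep first value within each group
    Terrorattacks = n(),                             # number of events per country-year
    .groups = "drop"
  )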

Here's a data.table solution. If the data set has already been filtered so that crit1 and crit2 equal 1 (which you gave as a condition in a comment), you can drop the first argument (crit1 == 1 & crit2 == 1); see the shortened call after the output below.
library(data.table)
set.seed(1011)
dat <- data.table(eventid = round(runif(100, 1000, 10000)),
                  iyear = sample(1994:1996, 100, rep = T),
                  crit1 = rbinom(100, 1, .9),
                  crit2 = rbinom(100, 1, .9),
                  crit3 = rbinom(100, 1, .9),
                  country = sample(1:3, 100, rep = T))
dat[, country_txt := LETTERS[country]]
## remove crit variables
dat[crit1 == 1 & crit2 == 1, .N, .(country, country_txt, iyear)]
#> country country_txt iyear N
#> 1: 1 A 1994 10
#> 2: 1 A 1995 4
#> 3: 3 C 1995 10
#> 4: 1 A 1996 7
#> 5: 2 B 1996 9
#> 6: 3 C 1996 5
#> 7: 2 B 1994 8
#> 8: 3 C 1994 13
#> 9: 2 B 1995 10
Created on 2019-09-24 by the reprex package (v0.3.0)
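If the data really has been pre-filtered, the call reduces to the sketch below; setnames() simply gives the count column a more descriptive name (Terrorattacks is an illustrative choice, not a column from the question's data):
## assuming dat is already restricted to crit1 == 1 & crit2 == 1
res <- dat[, .N, by = .(country, country_txt, iyear)]
setnames(res, "N", "Terrorattacks")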

Related

How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable:
| id | year | profit |
|:---|:-----|:-------|
| 1  | 2001 | -0.4   |
| 1  | 2002 | -0.89  |
| 2  | 2001 | 1.89   |
| 2  | 2002 | 2.79   |
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start. Should I separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet, but with no success:
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC = (current_value / previous_value - 1) * 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1, 250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
  pivot_wider(names_from = year,
              names_prefix = "profit_",
              values_from = value,
              values_fn = mean)
res <- df_wide %>%
  mutate(evol_rate = (profit_2002 / profit_2001 - 1) * 100) %>%
  round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format, you can use:
library(tidyverse)
id <- sort(rep(seq(1, 250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
  group_by(id) %>%
  mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag() to compare each year's value to the prior year's.
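If you only want a single rate per firm (the 2002 row), a small follow-up step on the same new_df could drop the NA rows:
new_df %>%
  filter(!is.na(ROC)) %>%   # keeps only rows with a prior year to compare against
  select(id, ROC)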

Penalized cumulative sum in r

I need to calculate a penalized cumulative sum.
Individuals "A", "B" and "C" were supposed to get tested every other year. Every time they get tested, they accumulate 1 point. However, when they miss a test, their cumulative score gets deducted in 1.
I have the following code:
data.frame(year = rep(1990:1995, 3),
           person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)),
           needs.testing = rep(c("Yes", "No"), 9),
           test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)),
           penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
...which gives the following:
year person.id needs.testing test.compliance penalized.compliance.cum.sum
1 1990 A Yes 1 1
2 1991 A No 0 1
3 1992 A Yes 1 2
4 1993 A No 0 2
5 1994 A Yes 1 3
6 1995 A No 0 3
7 1990 B Yes 1 1
8 1991 B No 0 1
9 1992 B Yes 1 2
10 1993 B No 0 2
11 1994 B Yes 0 1
12 1995 B No 0 1
13 1990 C Yes 1 1
14 1991 C No 0 1
15 1992 C Yes 0 0
16 1993 C No 0 0
17 1994 C Yes 0 -1
18 1995 C No 0 -1
As is evident, "A" fully complied. "B" somewhat complied (in 1994 he was supposed to get tested but missed the test, so his cumulative sum drops from 2 to 1). Finally, "C" complies just once (in 1990; every time she needs to get tested after that, she misses the test).
What I need is some code to get the "penalized.compliance.cum.sum" variable.
Please note:
Tests are every other year.
The "penalized.compliance.cum.sum" variable keeps adding the previous score.
But starts deducting only if the individual misses the test on the testing year (denoted in the "needs.testing" variable).
For instance, individual "C" complies in year 1990. In 1991 she doesn't need to get tested, and hence keeps her score of 1. Then, she misses the 1992 test, and 1 is subtracted from her cumulative score, getting a score of 0 in 1992. Then she keeps missing test getting a -1 at the end of the study.
Also, I need to assign different penalties (i.e. different numbers). In this example, it's just 1. However, I need to be able to penalize using other numbers such as 0.5, 0.1, and others.
Thanks!
Using case_when
library(dplyr)
df1 %>%
  group_by(person.id) %>%
  mutate(res = cumsum(case_when(needs.testing == "Yes" ~ 1 - 2 * (test.compliance < 1),
                                TRUE ~ 0)))
base R
do.call(rbind, by(dat, dat$person.id,
function(z) transform(z, res = cumsum(ifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)))
))
# year person.id needs.testing test.compliance penalized.compliance.cum.sum res
# A.1 1990 A Yes 1 1 1
# A.2 1991 A No 0 1 1
# A.3 1992 A Yes 1 2 2
# A.4 1993 A No 0 2 2
# A.5 1994 A Yes 1 3 3
# A.6 1995 A No 0 3 3
# B.7 1990 B Yes 1 1 1
# B.8 1991 B No 0 1 1
# B.9 1992 B Yes 1 2 2
# B.10 1993 B No 0 2 2
# B.11 1994 B Yes 0 1 1
# B.12 1995 B No 0 1 1
# C.13 1990 C Yes 1 1 1
# C.14 1991 C No 0 1 1
# C.15 1992 C Yes 0 0 0
# C.16 1993 C No 0 0 0
# C.17 1994 C Yes 0 -1 -1
# C.18 1995 C No 0 -1 -1
by splits a frame up by the INDICES (dat$person.id here); inside the function, z is the data for just that group. This lets us operate on the data without worrying about the person changing partway through a vector.
by returns a list, and the canonical base-R way to combine lists into a frame is either rbind(a, b) when only two frames, or do.call(rbind, list(...)) when there may be more than two frames in the list.
The 1-2*(.) is just a trick to waffle between +1 and -1 based on test.compliance.
This has the side effect of potentially changing the order of the rows. For instance, if the data were ordered first by year and then person.id, the by-group calculations would still be correct, but the output would come back grouped by person.id (and ordered by year within each group). A minor point, but note it if row order matters to you.
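If keeping the original row order matters, one base-R alternative sketch uses ave(), which applies cumsum() within each group and returns the result in the original row order:
score <- with(dat, ifelse(needs.testing == "Yes", 1 - 2 * (test.compliance < 1), 0))
dat$res <- ave(score, dat$person.id, FUN = cumsum)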
dplyr
library(dplyr)
dat %>%
  group_by(person.id) %>%
  mutate(res = cumsum(if_else(needs.testing == "Yes", 1 - 2 * (test.compliance < 1), 0))) %>%
  ungroup()
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[, res := cumsum(fifelse(needs.testing == "Yes", 1-2*(test.compliance < 1), 0)), by = .(person.id)]
This might do the trick for you?
df <- data.frame(year = rep(1990:1995, 3),
                 person.id = c(rep("A", 6), rep("B", 6), rep("C", 6)),
                 needs.testing = rep(c("Yes", "No"), 9),
                 test.compliance = c(c(1,0,1,0,1,0), c(1,0,1,0,0,0), c(1,0,0,0,0,0)),
                 penalized.compliance.cum.sum = c(c(1,1,2,2,3,3), c(1,1,2,2,1,1), c(1,1,0,0,-1,-1)))
library("dplyr")
penalty <- -1
df %>%
  group_by(person.id) %>%
  mutate(cumsum = cumsum(ifelse(needs.testing == "Yes" & test.compliance == 0, penalty, test.compliance)))
## A tibble: 18 x 6
## Groups: person.id [3]
# year person.id needs.testing test.compliance penalized.compliance.cum.sum cumsum
# <int> <chr> <chr> <dbl> <dbl> <dbl>
# 1 1990 A Yes 1 1 1
# 2 1991 A No 0 1 1
# 3 1992 A Yes 1 2 2
# 4 1993 A No 0 2 2
# 5 1994 A Yes 1 3 3
# 6 1995 A No 0 3 3
# 7 1990 B Yes 1 1 1
# 8 1991 B No 0 1 1
# 9 1992 B Yes 1 2 2
#10 1993 B No 0 2 2
#11 1994 B Yes 0 1 1
#12 1995 B No 0 1 1
#13 1990 C Yes 1 1 1
#14 1991 C No 0 1 1
#15 1992 C Yes 0 0 0
#16 1993 C No 0 0 0
#17 1994 C Yes 0 -1 -1
#18 1995 C No 0 -1 -1
You can then easily adjust the penalty variable to whatever penalty you want, for example:
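For example, a softer penalty (the value 0.5 here is purely illustrative):
penalty <- -0.5
df %>%
  group_by(person.id) %>%
  mutate(cumsum = cumsum(ifelse(needs.testing == "Yes" & test.compliance == 0,
                                penalty, test.compliance)))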

Spread valued column into binary 'time series' in R

I'm attempting to spread a valued column first into a set of binary columns and then gather them again in a 'time series' format.
By way of example, consider locations that have been conquered at certain times, with data that looks like this:
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
locationID conquered_in
1 1 1931
2 2 1932
3 3 1929
I'm attempting to reshape the data to look like this:
df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3), year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932), conquered = c(0,0,1,1,0,0,0,0,1,1,1,1))
locationID year conquered
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 0
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
My original strategy was to spread on conquered and then attempt a gather. This answer seemed close, but I can't seem to get it right with fill, since I'm trying to populate the later years with 1's also.
You can use complete() to expand the data frame and then use cumsum() when conquered equals 1 to fill the grouped data downwards.
library(tidyr)
library(dplyr)
df1 %>%
  mutate(conquered = 1) %>%
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  group_by(locationID) %>%
  mutate(conquered = cumsum(conquered == 1))
# A tibble: 12 x 3
# Groups: locationID [3]
locationID conquered_in conquered
<dbl> <dbl> <int>
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
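If a location could appear more than once in df1, the running sum might exceed 1; a slightly more defensive variant of the same idea caps the indicator at 0/1:
df1 %>%
  mutate(conquered = 1) %>%
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  group_by(locationID) %>%
  mutate(conquered = as.integer(cumsum(conquered) > 0))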
Using complete from tidyr is a good choice here too. Be aware, though, that the conquered years may not cover every year from the beginning to the end of the period you care about.
library(dplyr)
library(tidyr)
library(magrittr)
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
# A data frame covering all the years you want
df2 <- data.frame(year = seq(1929, 1940, by = 1))
# Create a data frame with every combination of year and location, plus the conquered flag
df3 <- full_join(df2, df1, by = c("year" = "conquered_in")) %>%
  mutate(conquered = if_else(!is.na(locationID), 1, 0)) %>%
  complete(year, locationID) %>%
  arrange(locationID) %>%
  filter(!is.na(locationID))
# calculate conquered based on the first year a location gets conquered, grouping by location
df3 %<>%
  group_by(locationID) %>%
  # the 2000 inside min() just covers the case of a location that is never conquered
  mutate(conquered = if_else(year >= min(2000, year[conquered == 1], na.rm = T), 1, 0)) %>%
  ungroup()
df3 %>% filter(year<=1932)
# A tibble: 12 x 3
year locationID conquered
<dbl> <dbl> <dbl>
1 1929 1 0
2 1930 1 0
3 1931 1 1
4 1932 1 1
5 1929 2 0
6 1930 2 0
7 1931 2 0
8 1932 2 1
9 1929 3 1
10 1930 3 1
11 1931 3 1
12 1932 3 1

Fill NAs with next columns for moving average

set.seed(123)
df <- data.frame(loc.id = rep(c(1:3), each = 4*10),
                 year = rep(rep(c(1980:1983), each = 10), times = 3),
                 day = rep(1:10, times = 3*4),
                 x = sample(123:200, 4*3*10, replace = T))
I want to add one more column, x.mv, which is the 3-day moving average of x for each loc.id and year combination:
df %>% group_by(loc.id,year) %>% mutate(x.mv = zoo::rollmean(x, 3, fill = "NA", align = "right"))
loc.id year day x x.mv
<int> <int> <int> <int> <dbl>
1 1 1980 1 145 NA
2 1 1980 2 184 NA
3 1 1980 3 154 161
4 1 1980 4 191 176.
5 1 1980 5 196 180.
6 1 1980 6 126 171
7 1 1980 7 164 162
8 1 1980 8 192 161.
9 1 1980 9 166 174
10 1 1980 10 158 172
What I want to do is to replace the NAs in the x.mv column with x. I tried this:
df %>% group_by(loc.id,year) %>% mutate(x.mv = zoo::rollmean(x, 3, fill = x[1:2], align = "right"))
loc.id year day x x.mv
<int> <int> <int> <int> <dbl>
1 1 1980 1 145 145
2 1 1980 2 184 145
3 1 1980 3 154 161
4 1 1980 4 191 176.
5 1 1980 5 196 180.
6 1 1980 6 126 171
7 1 1980 7 164 162
8 1 1980 8 192 161.
9 1 1980 9 166 174
10 1 1980 10 158 172
But what it is doing instead is filling the NAs with the first value of x instead of the corresponding value of x. How do I fix it?
Skip the fill argument and pad manually:
df %>%
  group_by(loc.id, year) %>%
  mutate(x.mv = c(x[1:2], zoo::rollmean(x, 3, align = "right"))) %>%
  ungroup
# # A tibble: 120 x 5
# loc.id year day x x.mv
# <int> <int> <int> <int> <dbl>
# 1 1 1980 1 145 145.0000
# 2 1 1980 2 184 184.0000
# 3 1 1980 3 154 161.0000
# 4 1 1980 4 191 176.3333
# 5 1 1980 5 196 180.3333
# 6 1 1980 6 126 171.0000
# 7 1 1980 7 164 162.0000
# 8 1 1980 8 192 160.6667
# 9 1 1980 9 166 174.0000
# 10 1 1980 10 158 172.0000
# # ... with 110 more rows
You might want to use dplyr::cummean(x[1:2]) instead of x[1:2], so that the second value is already an average, or, in this case, follow @G. Grothendieck's suggestion in the comments and rewrite your mutate call as mutate(x.mv = rollapplyr(x, 3, mean, partial = TRUE)).
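For reference, a sketch of the full pipeline with the rollapplyr() variant; with partial = TRUE it averages whatever window is available, so the first two rows become x[1] and mean(x[1:2]) rather than copies of x:
library(dplyr)
library(zoo)
df %>%
  group_by(loc.id, year) %>%
  mutate(x.mv = zoo::rollapplyr(x, 3, mean, partial = TRUE)) %>%
  ungroup()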

Rearranging data frame in R with summarizing values

I need to rearrange a data frame, which currently looks like this:
> counts
year score freq rounded_year
1: 1618 0 25 1620
2: 1619 2 1 1620
3: 1619 0 20 1620
4: 1620 1 6 1620
5: 1620 0 70 1620
---
11570: 1994 107 1 1990
11571: 1994 101 2 1990
11572: 1994 10 194 1990
11573: 1994 1 30736 1990
11574: 1994 0 711064 1990
But what I need is the count of the unique values in score per decade (rounded_year).
So, the data frame should looks like this:
rounded_year 0 1 2 3 [...] total
1620 115 6 1 0 122
---
1990 711064 30736 0 0 741997
I've played around with aggregate and ddply, but so far without success. I hope it's clear what I mean; I don't know how to describe it better.
Any ideas?
A simple example using dplyr and tidyr.
dt = data.frame(year = c(1618, 1619, 1620, 1994, 1994, 1994),
                score = c(0, 1, 0, 2, 2, 3),
                freq = c(3, 5, 2, 6, 7, 8),
                rounded_year = c(1620, 1620, 1620, 1990, 1990, 1990))
dt
# year score freq rounded_year
# 1 1618 0 3 1620
# 2 1619 1 5 1620
# 3 1620 0 2 1620
# 4 1994 2 6 1990
# 5 1994 2 7 1990
# 6 1994 3 8 1990
library(dplyr)
library(tidyr)
dt %>%
  group_by(rounded_year, score) %>%
  summarise(freq = sum(freq)) %>%
  mutate(total = sum(freq)) %>%
  spread(score, freq, fill = 0)
# Source: local data frame [2 x 6]
#
# rounded_year total 0 1 2 3
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1620 10 5 5 0 0
# 2 1990 21 0 0 13 8
In case you prefer to work with data.table (as the dataset you provide looks more like a data.table), you can use this; a dcast()-based variant that avoids tidyr entirely is sketched after the output:
library(data.table)
library(tidyr)
dt = setDT(dt)[, .(freq = sum(freq)) ,by=c("rounded_year","score")]
dt = dt[, total:= sum(freq) ,by="rounded_year"]
dt = spread(dt,score,freq, fill=0)
dt
# rounded_year total 0 1 2 3
# 1: 1620 10 5 5 0 0
# 2: 1990 21 0 0 13 8
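An alternative sketch that stays entirely within data.table uses dcast() for the reshape; it starts again from the original long dt as first created above (i.e. before the steps that overwrite it):
library(data.table)
agg <- as.data.table(dt)[, .(freq = sum(freq)), by = .(rounded_year, score)]
wide <- dcast(agg, rounded_year ~ score, value.var = "freq", fill = 0)
wide[, total := rowSums(.SD), .SDcols = setdiff(names(wide), "rounded_year")]  # row totals across the score columns
wide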
