Rearranging data frame in R with summarizing values - r

I need to rearrange a data frame, which currently looks like this:
> counts
year score freq rounded_year
1: 1618 0 25 1620
2: 1619 2 1 1620
3: 1619 0 20 1620
4: 1620 1 6 1620
5: 1620 0 70 1620
---
11570: 1994 107 1 1990
11571: 1994 101 2 1990
11572: 1994 10 194 1990
11573: 1994 1 30736 1990
11574: 1994 0 711064 1990
But what I need is the summed frequency of each unique value in score per decade (rounded_year).
So, the data frame should look like this:
rounded_year 0 1 2 3 [...] total
1620 115 6 1 0 122
---
1990 711064 30736 0 0 741997
I've played around with aggregate and ddply, but so far without success. I hope it's clear what I mean; I don't know how to describe it better.
Any ideas?

A simple example using dplyr and tidyr.
dt = data.frame(year = c(1618,1619,1620,1994,1994,1994),
score = c(0,1,0,2,2,3),
freq = c(3,5,2,6,7,8),
rounded_year = c(1620,1620,1620,1990,1990,1990))
dt
# year score freq rounded_year
# 1 1618 0 3 1620
# 2 1619 1 5 1620
# 3 1620 0 2 1620
# 4 1994 2 6 1990
# 5 1994 2 7 1990
# 6 1994 3 8 1990
library(dplyr)
library(tidyr)
dt %>%
group_by(rounded_year, score) %>%
summarise(freq = sum(freq)) %>%
mutate(total = sum(freq)) %>%
spread(score,freq, fill=0)
# Source: local data frame [2 x 6]
#
# rounded_year total 0 1 2 3
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1620 10 5 5 0 0
# 2 1990 21 0 0 13 8
In case you prefer to work with data.table (as the dataset you provide looks more like a data.table), you can use this:
library(data.table)
library(tidyr)
dt = setDT(dt)[, .(freq = sum(freq)) ,by=c("rounded_year","score")]
dt = dt[, total:= sum(freq) ,by="rounded_year"]
dt = spread(dt,score,freq, fill=0)
dt
# rounded_year total 0 1 2 3
# 1: 1620 10 5 5 0 0
# 2: 1990 21 0 0 13 8
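For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustration on the toy data from the answer above; the names wide and result are made up for the example. xtabs() sums freq for every rounded_year/score pair and fills missing pairs with 0, and rowSums() supplies the per-decade total.

```r
# Toy data copied from the answer above
dt <- data.frame(year = c(1618, 1619, 1620, 1994, 1994, 1994),
                 score = c(0, 1, 0, 2, 2, 3),
                 freq = c(3, 5, 2, 6, 7, 8),
                 rounded_year = c(1620, 1620, 1620, 1990, 1990, 1990))

# Sum freq per rounded_year x score; combinations that never occur become 0
wide <- xtabs(freq ~ rounded_year + score, data = dt)

# Turn the contingency table into a data frame and append the row totals
result <- cbind(as.data.frame.matrix(wide), total = rowSums(wide))
result
```

Here the decades end up as row names rather than as a column, which may or may not suit later processing.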

Related

How do I add a column indicating the years since a binary variable in R?

I thought this would be trivial, and I think it must be, but I am very tired and stuck at this problem at the moment.
Consider a df with two columns, one with a year, and the other with a binary variable indicating some event.
df <- data.frame(year = c(2000,2001,2002,2003,2004, 2005,2006,2007,2008,2010),
flag = c(0,0,0,1,0,0,0,1,0,0))
I want to create a third column that simply counts the years since the last flag and that resets when a new flag appears.
I thought this code would do the job:
First, add a 0 as "year_since" for every year with a flag, then, if there was a flag in the previous year, add 1 to the value of the previous "year_since".
df <- df %>% mutate(year_since = ifelse(flag == 1, 0, NA)) %>%
mutate(year_since = ifelse(dplyr::lag(flag, n=1, order_by = "year") == 1 & is.na(year_since),
dplyr::lag(year_since, n=1, order_by = "year")+1, year_since))
However, this returns NA for every row that should be 1,2,3, and so on.
You could do
df %>%
group_by(group = cumsum(flag)) %>%
mutate(year_since = ifelse(group == 0, NA, seq(n()) - 1)) %>%
ungroup() %>%
select(-group)
#> # A tibble: 10 x 3
#> year flag year_since
#> <dbl> <dbl> <dbl>
#> 1 2000 0 NA
#> 2 2001 0 NA
#> 3 2002 0 NA
#> 4 2003 1 0
#> 5 2004 0 1
#> 6 2005 0 2
#> 7 2006 0 3
#> 8 2007 1 0
#> 9 2008 0 1
#> 10 2010 0 2
Created on 2022-09-16 with reprex v2.0.2
Using data.table
library(data.table)
setDT(df)[, year_since := (NA^!cummax(flag)) * rowid(cumsum(flag))-1]
Output:
> df
year flag year_since
<num> <num> <num>
1: 2000 0 NA
2: 2001 0 NA
3: 2002 0 NA
4: 2003 1 0
5: 2004 0 1
6: 2005 0 2
7: 2006 0 3
8: 2007 1 0
9: 2008 0 1
10: 2010 0 2
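Note that both answers count rows since the last flag, so 2010 gets 2 even though 2009 is missing from the data and 2010 is three calendar years after the 2007 flag. If calendar years are what is wanted, a base R sketch along the following lines might do; last_flag_year is a helper column invented for this example.

```r
df <- data.frame(year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2010),
                 flag = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0))

# Each flag starts a new group; group 0 is everything before the first flag
grp <- cumsum(df$flag)

# Year of the flag that opened each group (NA for rows before the first flag)
df$last_flag_year <- ave(ifelse(df$flag == 1, df$year, NA), grp,
                         FUN = function(x) x[1])

# Calendar years elapsed since that flag
df$year_since <- df$year - df$last_flag_year
df
```

With this version the 2010 row gets 3 instead of 2; the other rows match the answers above.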

Spread valued column into binary 'time series' in R

I'm attempting to spread a valued column first into a set of binary columns and then gather them again in a 'time series' format.
By way of example, consider locations that have been conquered at certain times, with data that looks like this:
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
locationID conquered_in
1 1 1931
2 2 1932
3 3 1929
I'm attempting to reshape the data to look like this:
df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3), year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932), conquered = c(0,0,1,1,0,0,0,1,1,1,1,1))
locationID year conquered
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
My original strategy was to spread on conquered and then attempt a gather. This answer seemed close, but I can't seem to get it right with fill, since I'm trying to populate the later years with 1's also.
You can use complete() to expand the data frame and then use cumsum() when conquered equals 1 to fill the grouped data downwards.
library(tidyr)
library(dplyr)
df1 %>%
mutate(conquered = 1) %>%
complete(locationID, conquered_in = seq(min(conquered_in), max(conquered_in)), fill = list(conquered = 0)) %>%
group_by(locationID) %>%
mutate(conquered = cumsum(conquered == 1))
# A tibble: 12 x 3
# Groups: locationID [3]
locationID conquered_in conquered
<dbl> <dbl> <int>
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
Using complete from tidyr would be a better choice, though we need to be aware that the conquered years may not cover every year from the beginning to the end of the war.
library(dplyr)
library(tidyr)
library(magrittr)
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
# A data frame with every year you want to cover
df2 <- data.frame(year=seq(1929, 1940, by=1))
# Build every combination of year and location, plus the conquered flag
df3 <- full_join(df2, df1, by=c("year"="conquered_in")) %>%
mutate(conquered=if_else(!is.na(locationID), 1, 0)) %>%
complete(year, locationID) %>%
arrange(locationID) %>%
filter(!is.na(locationID))
# Set conquered from the first year each location was conquered, grouping by location
df3 %<>%
group_by(locationID) %>%
# the 2000 inside min() is a sentinel for locations that were never conquered
mutate(conquered=if_else(year>=min(2000, year[conquered==1], na.rm=TRUE), 1, 0)) %>%
ungroup()
df3 %>% filter(year<=1932)
# A tibble: 12 x 3
year locationID conquered
<dbl> <dbl> <dbl>
1 1929 1 0
2 1930 1 0
3 1931 1 1
4 1932 1 1
5 1929 2 0
6 1930 2 0
7 1931 2 0
8 1932 2 1
9 1929 3 1
10 1930 3 1
11 1931 3 1
12 1932 3 1
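The underlying rule, a location counts as conquered in every year from conquered_in onward, can also be written as a direct comparison. The following base R sketch assumes each locationID appears exactly once in df1; panel is a name chosen for the example.

```r
df1 <- data.frame(locationID = c(1, 2, 3), conquered_in = c(1931, 1932, 1929))

# Every year between the earliest and latest conquest
years <- seq(min(df1$conquered_in), max(df1$conquered_in))

# All year x location combinations (year varies fastest)
panel <- expand.grid(year = years, locationID = df1$locationID)

# Look up each row's conquest year and compare it with the row's year
conquest_year <- df1$conquered_in[match(panel$locationID, df1$locationID)]
panel$conquered <- as.integer(panel$year >= conquest_year)

panel <- panel[, c("locationID", "year", "conquered")]
panel
```

To cover a longer period, years could simply be replaced by a wider sequence such as 1929:1940.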

Attempting to create panel-data from cross sectional data

I'm attempting to transform data from the Global Terrorism Database so that instead of the unit being terror events, it will be "Country_Year" with one variable having the number of terror events that year.
I've managed to create a data frame with a single column holding all the Country_Year combinations. I've also found that `table(GTD_94_Land$country_txt, GTD_94_Land$iyear)` shows the values that I would like the new variable to have. What I can't figure out is how to store these numbers as a variable.
So my data look like this
eventid iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 199401010008 1994 1 1 1 182 Somalia
2 199401010012 1994 1 1 1 209 Turkey
3 199401010013 1994 1 1 1 209 Turkey
4 199401020003 1994 1 1 1 209 Turkey
5 199401020007 1994 1 1 0 106 Kuwait
6 199401030002 1994 1 1 1 209 Turkey
7 199401030003 1994 1 1 1 228 Yemen
8 199401030006 1994 1 1 0 53 Cyprus
9 199401040005 1994 1 1 0 209 Turkey
10 199401040006 1994 1 1 0 209 Turkey
11 199401040007 1994 1 1 1 209 Turkey
12 199401040008 1994 1 1 1 209 Turkey
and I would like to transform so that I had
Terror attacks iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1994 1 1 1 182 Somalia
2 8 1994 1 1 1 209 Turkey
5 1 1994 1 1 0 106 Kuwait
7 1 1994 1 1 1 228 Yemen
8 1 1994 1 1 0 53 Cyprus
I've looked at some solutions, but most of them seem to assume that the value the new variable should have is already in the data.
All help is appreciated!
Assuming df is the original dataframe:
df_out = df %>%
dplyr::select(-eventid) %>%
dplyr::group_by(country_txt,iyear) %>%
dplyr::mutate(Terrorattacks = n()) %>%
dplyr::slice(1L) %>%
dplyr::ungroup()
Ideally, I would use summarise but since I don't know the summarising criteria for other columns, I have simply used mutate and slice.
Note: The 'crit' columns values would be the first occurrence of the 'country_txt' and 'iyear'.
Here's a data.table solution. If the data set has already been filtered to have crit1 and crit2 equal to 1 (which you gave as a condition in a comment), you can remove the first argument (crit1 == 1 & crit2 == 1)
library(data.table)
set.seed(1011)
dat <- data.table(eventid = round(runif(100, 1000, 10000)),
iyear = sample(1994:1996, 100, replace = TRUE),
crit1 = rbinom(100, 1, .9),
crit2 = rbinom(100, 1, .9),
crit3 = rbinom(100, 1, .9),
country = sample(1:3, 100, replace = TRUE))
dat[, country_txt := LETTERS[country]]
## keep rows where crit1 and crit2 are 1, then count events per country and year
dat[crit1 == 1 & crit2 == 1, .N, .(country, country_txt, iyear)]
#> country country_txt iyear N
#> 1: 1 A 1994 10
#> 2: 1 A 1995 4
#> 3: 3 C 1995 10
#> 4: 1 A 1996 7
#> 5: 2 B 1996 9
#> 6: 3 C 1996 5
#> 7: 2 B 1994 8
#> 8: 3 C 1994 13
#> 9: 2 B 1995 10
Created on 2019-09-24 by the reprex package (v0.3.0)
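Since the question already gets the right numbers from table(), one further option is to convert that table into a data frame directly. The sketch below is base R only and uses just the country_txt and iyear values from the twelve rows shown in the question; events, counts and terror_attacks are names invented for the example.

```r
# Country and year of the twelve events listed in the question
events <- data.frame(
  iyear = rep(1994, 12),
  country_txt = c("Somalia", "Turkey", "Turkey", "Turkey", "Kuwait", "Turkey",
                  "Yemen", "Cyprus", "Turkey", "Turkey", "Turkey", "Turkey"))

# Cross-tabulate, then flatten: one row per country-year with its event count
counts <- as.data.frame(table(country_txt = events$country_txt,
                              iyear = events$iyear),
                        responseName = "terror_attacks",
                        stringsAsFactors = FALSE)

# table() stores the years as labels, so convert them back to numbers
counts$iyear <- as.integer(counts$iyear)
counts
```

Unlike the grouped counts above, this keeps country-year combinations with zero events, which is often what a panel needs.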

How do I turn monadic data into dyadic data in R (country-year into pair-year)?

I have data organized by country-year, with an ID for a dyadic relationship. I want to organize this by dyad-year.
Here is how my data is organized:
dyadic_id country_codes year
1 1 200 1990
2 1 20 1990
3 1 200 1991
4 1 20 1991
5 2 300 1990
6 2 10 1990
7 3 100 1990
8 3 10 1990
9 4 500 1991
10 4 200 1991
Here's how I want my data to be organized:
dyadic_id_want country_codes_1 country_codes_2 year_want
1 1 200 20 1990
2 1 200 20 1991
3 2 300 10 1990
4 3 100 10 1990
5 4 500 200 1991
Here is reproducible code:
dyadic_id<-c(1,1,1,1,2,2,3,3,4,4)
country_codes<-c(200,20,200,20,300,10,100,10,500,200)
year<-c(1990,1990,1991,1991,1990,1990,1990,1990,1991,1991)
mydf<-as.data.frame(cbind(dyadic_id,country_codes,year))
I want mydf to look like df_i_want
dyadic_id_want<-c(1,1,2,3,4)
country_codes_1<-c(200,200,300,100,500)
country_codes_2<-c(20,20,10,10,200)
year_want<-c(1990,1991,1990,1990,1991)
my_df_i_want<-as.data.frame(cbind(dyadic_id_want,country_codes_1,country_codes_2,year_want))
We can reshape from 'long' to 'wide' using different methods. Two are described below.
Using 'data.table', we convert the 'data.frame', to 'data.table' (setDT(mydf)), create a sequence column ('ind'), grouped by 'dyadic_id' and 'year'. Then, we convert the dataset from 'long' to 'wide' format using dcast.
library(data.table)
setDT(mydf)[, ind:= 1:.N, by = .(dyadic_id, year)]
dcast(mydf, dyadic_id+year~ paste('country_codes', ind, sep='_'), value.var='country_codes')
# dyadic_id year country_codes_1 country_codes_2
#1: 1 1990 200 20
#2: 1 1991 200 20
#3: 2 1990 300 10
#4: 3 1990 100 10
#5: 4 1991 500 200
Or using dplyr/tidyr, we do the same, i.e. group by 'dyadic_id' and 'year', create an 'ind' column with mutate(...), and use spread from tidyr to reshape to 'wide' format.
library(dplyr)
library(tidyr)
mydf %>%
group_by(dyadic_id, year) %>%
mutate(ind= paste0('country_codes', row_number())) %>%
spread(ind, country_codes)
# dyadic_id year country_codes1 country_codes2
# (dbl) (dbl) (dbl) (dbl)
#1 1 1990 200 20
#2 1 1991 200 20
#3 2 1990 300 10
#4 3 1990 100 10
#5 4 1991 500 200
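The same long-to-wide step can be sketched with base R's reshape(), which some may find handy when neither package is available. As in the answers above, ind is a within-group counter; it assumes each dyad-year has its countries listed in the order you want them numbered.

```r
mydf <- data.frame(dyadic_id = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4),
                   country_codes = c(200, 20, 200, 20, 300, 10, 100, 10, 500, 200),
                   year = c(1990, 1990, 1991, 1991, 1990, 1990, 1990, 1990, 1991, 1991))

# Number the rows 1, 2, ... within each dyadic_id/year group
mydf$ind <- ave(seq_along(mydf$dyadic_id), mydf$dyadic_id, mydf$year,
                FUN = seq_along)

# One row per dyad-year, one country_codes_<ind> column per member
wide <- reshape(mydf, idvar = c("dyadic_id", "year"), timevar = "ind",
                direction = "wide", sep = "_")
wide
```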

Create Indicator Data Frame Based on Interval Ranges

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.
What I have looks like this:
f=data.frame(name=c("A","B","C"),
year.start=c(1990,1994,1993),year.end=c(1994,1995,1993))
name year.start year.end
1 A 1990 1994
2 B 1994 1995
3 C 1993 1993
Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.
What I would like is a long data frame that would look like this, with an entry for each of the possible years in the original data frame, eg, 1990 through 1995 where 1 = present and 0 = absent.
name year indicator
A 1990 1
A 1991 1
A 1992 1
A 1993 1
A 1994 1
A 1995 0
B 1990 0
B 1991 0
B 1992 0
B 1993 0
B 1994 1
B 1995 1
C 1990 0
C 1991 0
C 1992 0
C 1993 1
C 1994 0
C 1995 0
Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.
Thanks!
Someone else might have a suggestion for reshape2, but here is a base R solution:
years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
# years Var2 Freq
# 1 1990 A 1
# 2 1991 A 0
# 3 1992 A 0
# 4 1993 A 0
# 5 1994 A 1
# 6 1995 A 0
# 7 1990 B 0
# 8 1991 B 0
# 9 1992 B 0
# 10 1993 B 0
# 11 1994 B 1
# 12 1995 B 1
# 13 1990 C 0
# 14 1991 C 0
# 15 1992 C 0
# 16 1993 C 2
# 17 1994 C 0
# 18 1995 C 0
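As the output above shows, tabulating only the start and end years marks the endpoints but not the years in between (A gets 0 for 1991 to 1993, and C gets 2 for 1993). A base R sketch that tests each year against the whole interval could look like this; inside is a helper matrix introduced for the example.

```r
f <- data.frame(name = c("A", "B", "C"),
                year.start = c(1990, 1994, 1993),
                year.end = c(1994, 1995, 1993))

years <- seq(min(f$year.start), max(f$year.end))

# Rows = names, columns = years; TRUE where start <= year <= end
inside <- outer(f$year.start, years, "<=") & outer(f$year.end, years, ">=")

# Flatten name by name: t() makes the years run fastest within each name
result <- data.frame(name = rep(f$name, each = length(years)),
                     year = rep(years, times = nrow(f)),
                     indicator = as.integer(t(inside)))
result
```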
Here is a step-by-step breakdown, using data.table:
library(data.table)
f <- as.data.table(f)
## ALL OF NAME-YEAR COMBINATIONS
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]
## WHICH COMBINATIONS EXIST
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]
## SETKEYS FOR MERGING
setkey(ALL, name, year)
setkey(PRESENT, name, year)
## INITIALIZE INDICATOR TO ZERO, THEN SET TO 1 FOR THOSE PRESENT
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]
ALL
# (the output below is from the original example, in which A started in 1993)
name year indicator
1: A 1993 1
2: A 1994 1
3: A 1995 0
4: B 1993 0
5: B 1994 1
6: B 1995 1
7: C 1993 1
8: C 1994 0
9: C 1995 0
Here's another solution, similar to the ones above, which aims to be straightforward:
zz <- cbind(name=f[1],year=rep(min(f[-1]):max(f[-1]),each=nrow(f)))
zz$indicator <- as.numeric((f$name==zz$name &
f$year.start<=zz$year &
f$year.end >=zz$year))
result <- zz[order(zz$name,zz$year),]
The first line builds a template with all the names and all the years. The second line sets indicator based on whether it is present in the range. The third line just reorders the result.
Another base R solution
f=data.frame(name=c("A","B","C"),
year.start=c(1993,1994,1993),year.end=c(1994,1995,1993), stringsAsFactors=FALSE)
x <- expand.grid(unique(f$name),min(f$year.start):max(f$year.end))
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i) sum(x$name[i]==f$name & x$year[i] >= f$year.start & x$year[i] <= f$year.end))
x[order(x$name),]
