Spread valued column into binary 'time series' in R

I'm attempting to spread a valued column first into a set of binary columns and then gather them again in a 'time series' format.
By way of example, consider locations that have been conquered at certain times, with data that looks like this:
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
locationID conquered_in
1 1 1931
2 2 1932
3 3 1929
I'm attempting to reshape the data to look like this:
df2 <- data.frame(locationID = c(1,1,1,1,2,2,2,2,3,3,3,3), year = c(1929,1930,1931,1932,1929,1930,1931,1932,1929,1930,1931,1932), conquered = c(0,0,1,1,0,0,0,0,1,1,1,1))
locationID year conquered
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 0
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
My original strategy was to spread on conquered and then attempt a gather. This answer seemed close, but I can't seem to get it right with fill, since I'm trying to populate the later years with 1's also.

You can use complete() to expand the data frame and then use cumsum() when conquered equals 1 to fill the grouped data downwards.
library(tidyr)
library(dplyr)
df1 %>%
  mutate(conquered = 1) %>%
  complete(locationID,
           conquered_in = seq(min(conquered_in), max(conquered_in)),
           fill = list(conquered = 0)) %>%
  group_by(locationID) %>%
  mutate(conquered = cumsum(conquered == 1))
# A tibble: 12 x 3
# Groups: locationID [3]
locationID conquered_in conquered
<dbl> <dbl> <int>
1 1 1929 0
2 1 1930 0
3 1 1931 1
4 1 1932 1
5 2 1929 0
6 2 1930 0
7 2 1931 0
8 2 1932 1
9 3 1929 1
10 3 1930 1
11 3 1931 1
12 3 1932 1
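If you'd rather avoid the tidyverse, the same idea can be sketched in base R: build the full location-by-year grid with expand.grid(), then flag every year at or after each conquest. This reproduces the accepted answer's output (data frame and column names follow the question):

```r
# Base R sketch: full grid of locations and years, then a >= comparison
df1 <- data.frame(locationID = c(1, 2, 3), conquered_in = c(1931, 1932, 1929))
grid <- expand.grid(locationID = df1$locationID,
                    year = seq(min(df1$conquered_in), max(df1$conquered_in)))
grid <- merge(grid, df1, by = "locationID")
# conquered is 1 from the conquest year onwards
grid$conquered <- as.integer(grid$year >= grid$conquered_in)
grid <- grid[order(grid$locationID, grid$year),
             c("locationID", "year", "conquered")]
```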

Using complete() from tidyr would be a better choice. Note, though, that the conquest years may not fully cover every year from the beginning to the end of the war.
library(dplyr)
library(tidyr)
library(magrittr)
df1 <- data.frame(locationID = c(1,2,3), conquered_in = c(1931, 1932, 1929))
# A data frame covering all the years you want
df2 <- data.frame(year = seq(1929, 1940, by = 1))
# Create every combination of year and location, plus the conquered flag
df3 <- full_join(df2, df1, by = c("year" = "conquered_in")) %>%
  mutate(conquered = if_else(!is.na(locationID), 1, 0)) %>%
  complete(year, locationID) %>%
  arrange(locationID) %>%
  filter(!is.na(locationID))
# Set conquered from the first year each location was conquered, grouped by location
df3 %<>%
  group_by(locationID) %>%
  # the 2000 inside min() covers locations that are never conquered
  mutate(conquered = if_else(year >= min(2000, year[conquered == 1], na.rm = TRUE), 1, 0)) %>%
  ungroup()
df3 %>% filter(year<=1932)
# A tibble: 12 x 3
year locationID conquered
<dbl> <dbl> <dbl>
1 1929 1 0
2 1930 1 0
3 1931 1 1
4 1932 1 1
5 1929 2 0
6 1930 2 0
7 1931 2 0
8 1932 2 1
9 1929 3 1
10 1930 3 1
11 1931 3 1
12 1932 3 1

Related

r conditional subtract number

I am trying to apply the following logic to create a 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if firm i at year t also appears in year t+1, subtract its count at t+1 from sum_of_year at t+1;
if firm i does not appear in t+1, just use sum_of_year at t+1, as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way using dplyr with the help of tidyr::complete. We complete the missing combinations of rows for year and firm and fill count with 0. For each year, we subtract the count by sum of count for that entire year and finally for each firm, we take the value from the next year using lead.
library(dplyr)
df %>%
  tidyr::complete(year, firm, fill = list(count = 0)) %>%
  group_by(year) %>%
  mutate(n = sum(count) - count) %>%
  group_by(firm) %>%
  mutate(subtract = lead(n)) %>%
  filter(count != 0) %>%
  select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
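For comparison, here is a base-R sketch of the same lead-one-year logic, using only ave() and a lookup into the next year present in the data; df mirrors the sample (without the precomputed subtract column):

```r
df <- data.frame(year = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
                 firm = c("A", "B", "A", "C", "D", "C", "E"),
                 count = c(1, 1, 2, 1, 1, 3, 2))
# per-year totals
df$sum_of_year <- ave(df$count, df$year, FUN = sum)
years <- sort(unique(df$year))
# subtract: next year's total, minus the firm's own count if it appears then
df$subtract <- mapply(function(f, y) {
  ny <- years[match(y, years) + 1]
  if (is.na(ny)) return(NA_real_)  # no following year in the data
  nxt <- df[df$year == ny, ]
  s <- sum(nxt$count)
  own <- nxt$count[nxt$firm == f]
  if (length(own)) s - own else s
}, df$firm, df$year)
```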

Attempting to create panel-data from cross sectional data

I'm attempting to transform data from the Global Terrorism Database so that instead of the unit being terror events, it will be "Country_Year" with one variable having the number of terror events that year.
I've managed to create a dataframe with a single column holding all the Country_Year combinations. I've also found that `table(GTD_94_Land$country_txt, GTD_94_Land$iyear)` produces exactly the counts I would like the new variable to have. What I can't figure out is how to store these counts as a variable.
So my data look like this
eventid iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 199401010008 1994 1 1 1 182 Somalia
2 199401010012 1994 1 1 1 209 Turkey
3 199401010013 1994 1 1 1 209 Turkey
4 199401020003 1994 1 1 1 209 Turkey
5 199401020007 1994 1 1 0 106 Kuwait
6 199401030002 1994 1 1 1 209 Turkey
7 199401030003 1994 1 1 1 228 Yemen
8 199401030006 1994 1 1 0 53 Cyprus
9 199401040005 1994 1 1 0 209 Turkey
10 199401040006 1994 1 1 0 209 Turkey
11 199401040007 1994 1 1 1 209 Turkey
12 199401040008 1994 1 1 1 209 Turkey
and I would like to transform so that I had
Terror attacks iyear crit1 crit2 crit3 country country_txt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 1994 1 1 1 182 Somalia
2 8 1994 1 1 1 209 Turkey
5 1 1994 1 1 0 106 Kuwait
7 1 1994 1 1 1 228 Yemen
8 1 1994 1 1 0 53 Cyprus
I've looked at some solutions but most of them seems to assume that the number the new variable should have already is in the data.
All help is appreciated!
Assuming df is the original dataframe:
df_out <- df %>%
  dplyr::select(-eventid) %>%
  dplyr::group_by(country_txt, iyear) %>%
  dplyr::mutate(Terrorattacks = n()) %>%
  dplyr::slice(1L) %>%
  dplyr::ungroup()
Ideally, I would use summarise but since I don't know the summarising criteria for other columns, I have simply used mutate and slice.
Note: The 'crit' columns values would be the first occurrence of the 'country_txt' and 'iyear'.
Here's a data.table solution. If the data set has already been filtered to have crit1 and crit2 equal to 1 (which you gave as a condition in a comment), you can remove the first argument (crit1 == 1 & crit2 == 1)
library(data.table)
set.seed(1011)
dat <- data.table(eventid = round(runif(100, 1000, 10000)),
                  iyear = sample(1994:1996, 100, rep = TRUE),
                  crit1 = rbinom(100, 1, .9),
                  crit2 = rbinom(100, 1, .9),
                  crit3 = rbinom(100, 1, .9),
                  country = sample(1:3, 100, rep = TRUE))
dat[, country_txt := LETTERS[country]]
## remove crit variables
dat[crit1 == 1 & crit2 == 1, .N, .(country, country_txt, iyear)]
#> country country_txt iyear N
#> 1: 1 A 1994 10
#> 2: 1 A 1995 4
#> 3: 3 C 1995 10
#> 4: 1 A 1996 7
#> 5: 2 B 1996 9
#> 6: 3 C 1996 5
#> 7: 2 B 1994 8
#> 8: 3 C 1994 13
#> 9: 2 B 1995 10
Created on 2019-09-24 by the reprex package (v0.3.0)
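Since the asker already has the right numbers from table(), another minimal route is as.data.frame(), which turns the contingency table's counts into a regular column. A sketch with made-up rows (the real data frame is GTD_94_Land):

```r
GTD_94_Land <- data.frame(
  country_txt = c("Somalia", "Turkey", "Turkey", "Kuwait", "Turkey"),
  iyear = c(1994, 1994, 1994, 1994, 1994))
# table() counts events; as.data.frame() stores the count as a column
counts <- as.data.frame(table(country_txt = GTD_94_Land$country_txt,
                              iyear = GTD_94_Land$iyear),
                        responseName = "Terror_attacks")
```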

How do I merge two dataframes with conflicting values

I'm sorry if this is a duplicate question, but I have looked around at similar problems and haven't been able to find a real solution. Anyway, here goes:
I've read a .csv file into a table. There I'm dealing with 3 columns:
"ID"(author's ID), "num_pub"(number of articles published), and "year"(spans from 1930 to 2017).
I would like to get a final table where I would have "num_pub" for each "year", for every "ID". So rows would be "ID"s, columns would be "year"s, and underneath each year there would be the corresponding "num_pub" or 0 value if the author hasn't published anything.
I have tried creating two new tables and merging them in a few different ways described here but to no avail.
So first I read my file into a table:
tab <- read.table("mytable.csv", sep = ",", header = TRUE,
                  colClasses = c("character", "numeric", "factor"))
head(tab,10)
ID num_pub year
1 00002 1 1977
2 00002 2 1978
3 00002 1 1983
4 00002 4 1984
5 00002 3 1990
6 00002 1 1994
7 00002 2 1996
8 00004 3 1957
9 00004 1 1958
10 00004 1 1959
With that, I was then able to create a table where for each "ID", there was every single "year", and if the author published in that year, the value was 1, otherwise it was 0:
a<-table(tab[,1], tab[,3])
Calling head(a, 1) returns a table like the one shown in the linked picture.
I would like to know how to achieve the desired result described above: a table where rows are populated with "ID"s, columns with "year"s (from 1930 to 2017), and under each year there is the actual "num_pub" value or a 0. The structure would be just like the one in the linked picture.
Thank you for your time and help. I'm very new to R, and kind of stuck in the mud with this.
Edit: the reshape approach as explained here does not solve my problem. I need zeros in place of "NA"s, and I want my year to start with 1930 instead of the first year that the author has published.
using reshape2 & dcast one can change to a wide format and then pipe through to replace NAs with 0s.
library(reshape2)
library(dplyr)
dcast(tab, ID ~ year, value.var = "num_pub") %>%
  replace(is.na(.), 0)
ID 1957 1958 1959 1977 1978 1983 1984 1990 1994 1996
1 00002 0 0 0 1 2 1 4 3 1 2
2 00004 3 1 1 0 0 0 0 0 0 0
You can use complete to fill in the zeros for non available data, and then spread to turn your column of years into multiple columns (both from the tidyr package):
library(tidyr)
df_complete <- complete(df, ID, year, fill = list(num_pub = 0))
spread(df_complete, key = year, value = num_pub)
# A tibble: 2 x 11
ID `1957` `1958` `1959` `1977` `1978` `1983` `1984` `1990` `1994` `1996`
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 00002 0 0 0 1 2 1 4 3 1 2
2 00004 3 1 1 0 0 0 0 0 0 0
Data:
df <-
data.frame(ID = c("00002", "00002", "00002", "00002", "00002", "00002", "00002", "00004", "00004", "00004"),
num_pub = c(1, 2, 1, 4, 3, 1, 2, 3, 1, 1),
year = c(1977, 1978, 1983, 1984, 1990, 1994, 1996, 1957, 1958, 1959))
In base R this might be handled with a merge operation followed by some coercion to 0/1 by negating is.na and using as.numeric. (Admittedly, the complete function appears easier.)
temp <- merge(expand.grid(ID=sprintf("%05d", 2:4),year=1930:2018), tab, all=T)
str(temp)
#--------
'data.frame': 267 obs. of 3 variables:
$ ID : Factor w/ 3 levels "00002","00003",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 ...
$ num_pub: num NA NA NA NA NA NA NA NA NA NA ...
temp$any_pub <- as.numeric(!is.na(temp$num_pub))
head(temp)
ID year num_pub any_pub
1 00002 1930 NA 0
2 00002 1931 NA 0
3 00002 1932 NA 0
4 00002 1933 NA 0
5 00002 1934 NA 0
6 00002 1935 NA 0
tapply(temp$any_pub, temp$ID,sum)
#
00002 00003 00004
7 0 3
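A shorter base-R route to the same wide table is xtabs(): declaring year as a factor with levels 1930:2017 forces a column for every year, and absent combinations come out as 0. A sketch with a cut-down version of tab:

```r
tab <- data.frame(ID = c("00002", "00002", "00004"),
                  num_pub = c(1, 2, 3),
                  year = factor(c(1977, 1978, 1957), levels = 1930:2017))
# xtabs() sums num_pub per ID/year cell; missing cells become 0
wide <- xtabs(num_pub ~ ID + year, data = tab)
```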

grouping data in R and summing by decade

I have the following dataset:
ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931
I need to summarise the data by 1920's and 1930's. So I need total points for ireland, england and france in the 1920-1922 and then another total point for ireland,england and france in 1930,1931.
Any ideas? I have tried but failed.
Dataset:
x <- read.table(text = "ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = TRUE)
How about dividing the years by 10 and then summarizing?
library(dplyr)
x %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade) %>%
  summarize_all(sum) %>%
  select(-year)
# A tibble: 2 x 5
# decade ireland england france
# <dbl> <int> <int> <int>
# 1 1920 15 8 7
# 2 1930 5 6 7
A base R solution:
As A5C1D2H2I1M1N2O1R2T1 mentioned, you can use findInterval() to assign each year its decade, and then aggregate() to group by decade.
txt <-
"ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931"
df <- read.table(text=txt, header=T)
decades <- c(1920, 1930, 1940)
df$decade<- decades[findInterval(df$year, decades)]
aggregate(cbind(ireland,england,france) ~ decade , data = df, sum)
Output:
decade ireland england france
1 1920 15 8 7
2 1930 5 6 7
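A third, very compact base-R option is rowsum() with integer division to form the decade label (a sketch on the same data):

```r
x <- read.table(text = "ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = TRUE)
# %/% 10 * 10 truncates each year to its decade; rowsum() sums by that group
res <- rowsum(x[, c("ireland", "england", "france")],
              group = x$year %/% 10 * 10)
```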

Rearranging data frame in R with summarizing values

I need to rearrange a data frame, which currently looks like this:
> counts
year score freq rounded_year
1: 1618 0 25 1620
2: 1619 2 1 1620
3: 1619 0 20 1620
4: 1620 1 6 1620
5: 1620 0 70 1620
---
11570: 1994 107 1 1990
11571: 1994 101 2 1990
11572: 1994 10 194 1990
11573: 1994 1 30736 1990
11574: 1994 0 711064 1990
But what I need is the count of the unique values in score per decade (rounded_year).
So, the data frame should looks like this:
rounded_year 0 1 2 3 [...] total
1620 115 6 1 0 122
---
1990 711064 30736 0 0 741997
I've played around with aggregate and ddply, but so far without success. I hope it's clear what I mean; I don't know how to describe it better.
Any ideas?
A simple example using dplyr and tidyr.
dt = data.frame(year = c(1618,1619,1620,1994,1994,1994),
score = c(0,1,0,2,2,3),
freq = c(3,5,2,6,7,8),
rounded_year = c(1620,1620,1620,1990,1990,1990))
dt
# year score freq rounded_year
# 1 1618 0 3 1620
# 2 1619 1 5 1620
# 3 1620 0 2 1620
# 4 1994 2 6 1990
# 5 1994 2 7 1990
# 6 1994 3 8 1990
library(dplyr)
library(tidyr)
dt %>%
  group_by(rounded_year, score) %>%
  summarise(freq = sum(freq)) %>%
  mutate(total = sum(freq)) %>%
  spread(score, freq, fill = 0)
# Source: local data frame [2 x 6]
#
# rounded_year total 0 1 2 3
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1620 10 5 5 0 0
# 2 1990 21 0 0 13 8
In case you prefer to work with data.table (as the dataset you provide looks more like a data.table), you can use this:
library(data.table)
library(tidyr)
dt <- setDT(dt)[, .(freq = sum(freq)), by = c("rounded_year", "score")]
dt <- dt[, total := sum(freq), by = "rounded_year"]
dt <- spread(dt, score, freq, fill = 0)
dt
# rounded_year total 0 1 2 3
# 1: 1620 10 5 5 0 0
# 2: 1990 21 0 0 13 8
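The same crosstab can also be produced in base R with xtabs(), which sums freq into a rounded_year-by-score table; the totals are then just rowSums() (a sketch on the same dt):

```r
dt <- data.frame(year = c(1618, 1619, 1620, 1994, 1994, 1994),
                 score = c(0, 1, 0, 2, 2, 3),
                 freq = c(3, 5, 2, 6, 7, 8),
                 rounded_year = c(1620, 1620, 1620, 1990, 1990, 1990))
# xtabs() sums freq for every rounded_year/score cell, filling gaps with 0
wide <- xtabs(freq ~ rounded_year + score, data = dt)
total <- rowSums(wide)
```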
