Making Sense of Time Series Data with > 43,000 observations - r

Updated Post
After a lot of work, I have finally merged three different datasets. The result is a time series data frame with 43,396 observations of seven variables. Below, I have included a few rows of what my data looks like.
Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA 0 1 0
2360 2006 NA NA NA NA 0
2360 2007 1 3.0 0 1 0
2360 2008 1 4.0 0 1 0
2360 2009 3 3.33 1 4 0
2360 2010 1 3.0 NA NA 0
2360 2011 3 2.0 NA NA 0
2360 2012 1 2.0 NA NA 0
2360 2013 4 2.0 NA NA 0
If I am interested in comparing how different country pairs (dyads) differ in how often they launch attacks (either in cyberspace, physically with MIDs, or neither)...how should I go about doing this?
Since I am working with country/year data, how can I get descriptive statistics for the different country pairs in my Dyad variable? For example, I would like to know how the behavior of Dyad 2360 (USA and Iran) compares with other dyads.
I tried this code, but it just gave me a list of my unique dyad pairs:
table(final$Dyadpair)
names(sort(-table(final$Dyadpair)))
You mentioned using aggregate or dplyr, but I don't see how those will let me get descriptive statistics for all of my unique dyads. Would you mind elaborating on this?
Is it possible for code to return something like this: "For Dyad 2360 during the years 2005-2013, 80% were NA, 10% were cyber attacks, and 10% were MID attacks", etc.?
Update to clarify:
Ok, yes - the above example was just hypothetical. Based on the nine rows of data that I have provided - here is what I am hoping R can provide when it comes to descriptive statistics.
Dyad: 2360
No attacks: 22.22% (2/9) ….in 2005 and 2006
Cyber attacks: 77.78% (7/9) ….in the years 2007-2013
MID attacks: 11.11% (1/9) ….in 2009
Both cyber and MID: 11.11% (1/9) ….in 2009
Essentially, during a given time range (2005-2013 for the example I gave above), how many of those years result in NO attacks, how many of those years result in a cyber attack, how many of those years result in a MID attack, and how many of those years result in both a cyber and MID attack.
I do not know if this is possible with how my data is set up, since I aggregated cyber attacks and MID attacks per year. And yes, I would also like to take the severity of the attacks (both cyber attacks and MID attacks) into consideration, but I don't know how to do that.
Does this help clarify what I am looking for?

Here's a dplyr approach based on my best guess at what you want. It outputs a data frame with one row per dyad and the same set of summary statistics for each.
library(dplyr)
your_data %>%
  group_by(Dyad) %>%
  summarize(
    year_range = paste(min(year), max(year), sep = "-"),
    no_attacks = mean(is.na(cyberattack) & (is.na(MID) | MID == 0)),
    cyber_attacks = mean(!is.na(cyberattack)),
    MID_attacks = mean(!is.na(MID) & MID > 0),
    cyber_and_MID = mean(!is.na(cyberattack) & (!is.na(MID) & MID > 0)),
    cyber_sev_weighted = weighted.mean(cyberattack, w = cybersev, na.rm = TRUE)
  )
# # A tibble: 1 x 7
# Dyad year_range no_attacks cyber_attacks MID_attacks cyber_and_MID cyber_sev_weighted
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2360 2005-2013 0.222 0.778 0.111 0.111 1.86
Using this data:
your_data = read.table(text = 'Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA 0 1 0
2360 2006 NA NA NA NA 0
2360 2007 1 3.0 0 1 0
2360 2008 1 4.0 0 1 0
2360 2009 3 3.33 1 4 0
2360 2010 1 3.0 NA NA 0
2360 2011 3 2.0 NA NA 0
2360 2012 1 2.0 NA NA 0
2360 2013 4 2.0 NA NA 0', header = T)
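Since the question also mentioned aggregate, here is a base-R sketch of the same per-dyad proportions for comparison. It reproduces only the count-based statistics (not the weighted severity), and the column names are taken from the sample data above.

```r
# Base-R alternative using aggregate(): per-dyad share of years with a
# cyber attack and with a MID attack, using the sample data from above.
your_data <- read.table(text = 'Dyad year cyberattack cybersev MID MIDsev peace_score
2360 2005 NA NA 0 1 0
2360 2006 NA NA NA NA 0
2360 2007 1 3.0 0 1 0
2360 2008 1 4.0 0 1 0
2360 2009 3 3.33 1 4 0
2360 2010 1 3.0 NA NA 0
2360 2011 3 2.0 NA NA 0
2360 2012 1 2.0 NA NA 0
2360 2013 4 2.0 NA NA 0', header = TRUE)

# The logical columns inside cbind() contain no NAs, so no rows are dropped;
# mean() of a logical vector gives the proportion of TRUEs.
stats <- aggregate(
  cbind(cyber_attacks = !is.na(cyberattack),
        MID_attacks   = !is.na(MID) & MID > 0) ~ Dyad,
  data = your_data, FUN = mean)
```

With the nine sample rows this gives one row for Dyad 2360, with cyber_attacks = 7/9 and MID_attacks = 1/9, matching the dplyr output.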

Related

rowwise multiplication of two different dataframes dplyr

I have two data frames, and I want to multiply one column of one data frame (pop$Population) by parts of the other, sometimes by the mean of a column or a subset (here, e.g., the mean of Df$energy).
Since I want my results per year, I additionally need to multiply by 365 (days). I need the result for each Year.
age<-c("6 Months","9 Months", "12 Months")
energy<-c(2.5, NA, 2.9)
Df<-data.frame(age,energy)
Age<-1
Year<-c(1990,1991,1993, 1994)
Population<-c(200,300,400, 250)
pop<-data.frame(Age, Year,Population)
pop:
Age Year Population
1 1 1990 200
2 1 1991 300
3 1 1993 400
4 1 1994 250
Df:
age energy
1 6 Months 2.5
2 9 Months NA
3 12 Months 2.9
My attempt was the following, but I got an error:
pop$energy<-pop$Population%>%
rowwise()%>%
transmute("energy_year"= .%*% mean(Df$energy, na.rm = T)%*%365)
Error in UseMethod("rowwise") :
no applicable method for 'rowwise' applied to an object of class "c('double', 'numeric')"
I wished to result in a dataframe like this:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
pop$Population is a vector, not a data frame, hence the error.
For your use case the simplest thing to do would be:
pop %>% mutate(energy_year= Population * mean(Df$energy, na.rm = T) * 365)
This will give you the output:
Age Year Population energy_year
1 1 1990 200 197100
2 1 1991 300 295650
3 1 1993 400 394200
4 1 1994 250 246375
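The same result can also be had without dplyr at all, which shows why neither rowwise() nor %*% is needed here: Population is a plain vector, so ordinary * already works element-wise. A minimal sketch using the data frames from the question:

```r
# Data frames as defined in the question
Df  <- data.frame(age = c("6 Months", "9 Months", "12 Months"),
                  energy = c(2.5, NA, 2.9))
pop <- data.frame(Age = 1,
                  Year = c(1990, 1991, 1993, 1994),
                  Population = c(200, 300, 400, 250))

# Vectorized base-R equivalent of the mutate() call above
pop$energy_year <- pop$Population * mean(Df$energy, na.rm = TRUE) * 365
```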

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = 'Residual'. The value of the residual will be the UN value where sector == 'Total' minus the sum of the UN values for sectors 1-5, for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code, I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector != 'Total'), sum),
                    aggregate(UN ~ country + year, data = subset(df, sector == 'Total'), sum),
                    by = c("country", "year")),
              {UN <- UN.y - UN.x
               sector <- 'Residual'})

# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector != 'Total'),
                  agg[c("country", "year", "sector", "UN")],
                  subset(df, sector == 'Total'))

# ORDER ROWS AND RESET ROW NAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)), ])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I may recommend is to take advantage of the tidyverse suite of packages which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can talk about the power of dplyr's group_by(), summarise(), arrange(), and bind_rows() functions. There are also tons of great tutorials, cheat sheets, and documentation on all the tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame containing all of the Residual values and then bind it back onto your original data frame.
Step 1: Calculate all residual values
We want the residual, i.e. the Total value minus the sum of the sector values, grouped by country and year. We can achieve this with:
res_UN = UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add a sector column to res_UN with the value 'Residual'
The step above yields a data frame containing country, year, and UN; we now need to add a column sector with the value 'Residual' to satisfy your specification.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns, so they can be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

Impute only certain NA's for a variable in a data frame [closed]

Closed 4 years ago.
I'm new to R and exploring the different options it offers. I'm working on a data frame with a variable that has roughly 900 missing values (NAs).
I want to impute 3 different values for the NAs:
1st 300 NA's with Value 1.
2nd 300 NA's with Value 2.
3rd 300 NA's with Value 3.
There are a total of 23272 rows in the data.
dim(data)
[1] 23272 2
colSums(is.na(data))
month year
884 884
summary(data$month)
1 2 3 4 5 6 7 8 9 10 11 12 NA's
1977 1658 1837 1584 1703 1920 1789 2046 1955 2026 1845 2048 884
If we check months 8, 10 and 12, there are not many differences. Hence I thought of assigning these 3 months to the NAs by splitting at the ratio (300:300:284). Usually we go by the MODE, but I want to try this approach.
I assume you mean you have a long list, some of the values of which are NAs:
set.seed(42)
df <- data.frame(val = sample(c(1:3, NA_real_), size = 1000, replace = TRUE))
We can keep a running tally of NA's and assign those to the imputed value using integer division with %/%.
library(tidyverse)
df2 <- df %>%
  mutate(NA_num = if_else(is.na(val),
                          cumsum(is.na(val)),
                          NA_integer_),
         imputed = NA_num %/% 100 + 1)
Output:
df2 %>%
slice(397:410) # based on manual examination using this seed
val NA_num imputed
1 NA 98 1
2 NA 99 1
3 3 NA NA
4 1 NA NA
5 1 NA NA
6 3 NA NA
7 3 NA NA
8 2 NA NA
9 NA 100 2
10 1 NA NA
11 NA 101 2
12 2 NA NA
13 1 NA NA
14 2 NA NA
Without an example data set, I think something like this will work.
Basically, filter the NAs into a new table, do the calculation, and bind it back. Assume new_dt is the original data filtered to contain only the NAs:
library(tidyverse)
new_dt <- data.frame(x1 = rep(1:900), x2 = NA) %>%
  filter(is.na(x2)) %>%
  mutate(x3 = case_when((row_number() - 1) %/% 300 == 0 ~ 1,
                        (row_number() - 1) %/% 300 == 1 ~ 2,
                        (row_number() - 1) %/% 300 == 2 ~ 3))
dt <- rbind(dt, new_dt)  # dt being the original data without the NA rows
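For the actual data in the question (884 NAs in month, split 300:300:284 across months 8, 10 and 12), a hedged base-R sketch without any grouping machinery could look like this; the simulated data frame is an assumption standing in for the real one:

```r
# Simulated stand-in for the real data: 23272 rows with some NAs in month
set.seed(42)
data <- data.frame(month = sample(c(1:12, NA), 23272, replace = TRUE))

na_idx <- which(is.na(data$month))  # positions of the NAs, in row order
n      <- length(na_idx)

# First 300 NAs -> month 8, next 300 -> month 10, the rest -> month 12
# (the question's 300:300:284 split; here the NA count may differ, so the
# last group simply absorbs the remainder)
data$month[na_idx] <- rep(c(8, 10, 12), times = c(300, 300, n - 600))
```

This avoids the running-tally/if_else machinery entirely: which() gives the NA positions and rep(..., times = ...) builds the replacement vector in one step.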

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.var columns at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
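If you would rather stay in the tidyverse, tidyr's pivot_wider() (tidyr >= 1.0.0) produces the same wide layout in one call; this is offered as an alternative sketch, not part of the data.table answer above:

```r
library(tidyr)

# Same input data as in the question
data <- read.table(header = TRUE, text = 'Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8')

# One column per (variable, country) combination, e.g. VarA.USA, VarB.Canada
wide <- pivot_wider(data,
                    names_from  = Country,
                    values_from = c(VarA, VarB),
                    names_sep   = ".")
```

As with dcast, combinations missing from the input (e.g. USA in 2014) are filled with NA.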

Transform Variable

I have a data frame as shown below:
Country Year X1 X2 convex1 convex2
UK 2011 100 5
UK 2012 110 5.5
UK 2013 NA 10
UK 2014 115 2
US 2011 NA 10
US 2012 120 11
US 2013 118 9.2
US 2014 NA NA
And I would like to create two new variables to capture convexity. These two variables will apply the following logic:
Convex1: if X1 > average(X1 by country) then 1 else 0.
I have pinned down Convex1 using
dataa <- data.table(dataa, key = "Country")
dataa[, convex1 := X1 / mean(X1, na.rm = TRUE), by = "Country"]
dataa$convex1var <- ifelse(dataa$convex1 > 1.1, 1, 0)
But I would like to do it in one step, as I have to perform this for a long list of variables.
Convex2: if X2 >= 10 for two consecutive years (by country) then 1 else 0.
For Convex2, I am not sure how to do this. My attempt:
dataa$Convex2<-ifelse( dataa$X2 > lag(dataa$X2, by=Country) &
lag(dataa$X2, by=Country) > lag(dataa$X2,2, by=Country),1,0)
