Cumulative sums for the previous row - r

I'm trying to get cumulative sums for the previous row/year. Running cumsum(data$fonds) gives me the running totals including the current row, which doesn't work for what I want to do. I would like to have my data look like the following:
  year fond cumsum
1 1950    0      0
2 1951    1      0
3 1952    3      1
4 1953    0      4
5 1954    0      4
Any help would be appreciated.

data$cumsum <- c(0, cumsum(data$fonds)[-nrow(data)])

With data.table, we can use the shift function. By default, shift() uses type = "lag", so each value is moved down one row before the cumulative sum is taken.
library(data.table)
setDT(df1)[, Cumsum := cumsum(shift(fond, fill= 0))]
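Putting both answers side by side on the question's data gives a quick check; the small data frame below is reconstructed from the example:

```r
library(data.table)

# Data reconstructed from the question's example
df1 <- data.frame(year = 1950:1954, fond = c(0, 1, 3, 0, 0))

# Base R: drop the last running total and prepend a 0
df1$cumsum_base <- c(0, cumsum(df1$fond)[-nrow(df1)])

# data.table: shift() lags fond by one row (filling with 0) before summing
setDT(df1)[, Cumsum := cumsum(shift(fond, fill = 0))]

# Both new columns are 0, 0, 1, 4, 4, matching the desired output
```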

Related

Using tapply and cumsum function for multiple vectors in R

I have a data frame with four columns.
country date pangolin_lineage n cum_country
1 Albania 2020-09-05 B.1.236 1 1
2 Algeria 2020-03-02 B.1 2 2
3 Algeria 2020-03-08 B.1 1 3
4 Algeria 2020-06-09 B.1.1.119 1 4
5 Algeria 2020-06-15 B.1 1 5
6 Algeria 2020-06-15 B.1.36 1 6
I wished to calculate the cumulative sum of n across country and date. I was able to do that with this code:
date_country$cum_country <- as.numeric(unlist(tapply(date_country$n, date_country$country, cumsum)))
I now, however, would like to do the same thing, but the cumulative sum across country, pangolin_lineage, and date. I have tried to add another vector into the above function, but it seems you can only input one index input and one vector input for tapply. I get this error:
date_country$cum_country_pangol <- as.numeric(unlist(tapply(date_country$n, date_country$country, date_country$pangolin_lineage, cumsum)))
Error in match.fun(FUN) :
'date_country$pangolin_lineage' is not a function, character or symbol
Does anyone have any ideas how to use cumsum in tapply across multiple vectors (country, pangolin_lineage, date)?
If there is more than one group, wrap the grouping vectors in a list. Note, though, that tapply is a summarising function, and the result gets split up when we pass a function like cumsum.
tapply(date_country$n, list(date_country$country, date_country$pangolin_lineage), cumsum)
But this is much easier with ave, i.e. if we want to create a new column, we avoid the hassle of unlist etc. by just using ave:
ave(date_country$n, date_country$country,
date_country$pangolin_lineage, FUN = cumsum)
#[1] 1 2 3 1 4 1
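For completeness, the same grouped cumulative sum can be written with dplyr's group_by plus mutate; the data frame below is rebuilt from the question's six example rows:

```r
library(dplyr)

# Data rebuilt from the question's example rows
date_country <- data.frame(
  country = c("Albania", "Algeria", "Algeria", "Algeria", "Algeria", "Algeria"),
  pangolin_lineage = c("B.1.236", "B.1", "B.1", "B.1.1.119", "B.1", "B.1.36"),
  n = c(1, 2, 1, 1, 1, 1)
)

# Cumulative sum of n within each country/lineage combination
date_country <- date_country |>
  group_by(country, pangolin_lineage) |>
  mutate(cum_country_pangol = cumsum(n)) |>
  ungroup()

# cum_country_pangol is 1 2 3 1 4 1, the same as the ave result
```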

Construct a variable that conditionally takes a certain value until another condition is met

I have a panel dataset with data on conflicts for which I want to identify the post-conflict years.
So I constructed a variable myself, which codes a transition from conflict to peace with "3". Whenever the values for a new country begin, I coded that same variable with NA.
What I want to do now is to create a new binary variable which identifies post-conflict years with a 1, and conflict years and never-conflict years with 0. For that I would have to assign a 1 to every year following a 3 in the transition variable, until there is an NA in the same column. As follows:
    Country Year transition post-conflict
Afghanistan 1994          0             0
Afghanistan 1995          0             0
Afghanistan 1996          3             1
Afghanistan 1997          2             1
Afghanistan 1998          2             1
    Albania 1994         NA             0
    Albania 1994          2             0
How could I go about this?
You probably shouldn't use NA like that. It prevents functions like which, sum, and cumsum from working as you may want them to. You likely don't need to mark the first row of a new country anyway, since most R functions you would use for your analysis can group by Country without needing a special marker showing where each group starts.
Below I change NA to something different, and make transition a factor. Then you can use cumsum to create your new column.
library(data.table)
setDT(df) # assuming your data is called df
# fix transition column
df[is.na(transition), transition := 90]
df[, transition := as.factor(transition)]
# create post_conflict column
df[, post_conflict := cumsum(transition == 3), by = Country]
# Country Year transition post_conflict
# 1: Afghanistan 1994 0 0
# 2: Afghanistan 1995 0 0
# 3: Afghanistan 1996 3 1
# 4: Afghanistan 1997 2 1
# 5: Afghanistan 1998 2 1
# 6: Albania 1994 90 0
# 7: Albania 1994 2 0
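One caveat, assuming a country could contain more than one year coded 3: cumsum(transition == 3) would then climb past 1. cummax() keeps the flag binary; a sketch on data rebuilt from the example (with the NA marker already recoded to 90):

```r
library(data.table)

# Data rebuilt from the example output, NA already recoded to 90
df <- data.table(
  Country = c(rep("Afghanistan", 5), "Albania", "Albania"),
  Year = c(1994:1998, 1994, 1994),
  transition = c(0, 0, 3, 2, 2, 90, 2)
)

# cummax turns everything from the first 3 onward into 1, per country
df[, post_conflict := cummax(transition == 3), by = Country]
```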

R: Creating a table with the highest values by year

I hope I don't ask a question that has been asked already, but I couldn't quite find what I was looking for. I am fairly new to R and have no experience with programming.
I want to make a table with the top 10 values of three tests for each year. If my data looks something like this:
Year Country Test1 Test2 Test3
2000 ALB 500 497 501
2001 ALB NA NA NA
...
2000 ARG 502 487 354
2001 ARG NA NA NA
...
(My years go from 2000 to 2015, I only have observations for every three years, and even in those years still a lot of NA's for some countries or tests)
I would like to get a table in which I can see the 10 top values for each test for each year. So for the year 2000,2003,2006,...,2015 the top ten values and the countries that reached those values for test 1,2&3.
AND then (I am not sure if this should be a separate question) I would like to get the table into Latex.
Easier to see top values this way.
You could use dcast and melt from data.table package:
# convert to data table
setDT(df)
# convert it to long format and keep only the columns to be used
df1 <- melt(df, id.vars = 1:2)
df1 <- df1[, c(1, 2, 4)]
# collect the sorted values per year and country
df1 <- df1[, top_value := .(list(sort(value, decreasing = TRUE))),
           .(Year, Country)][, .(Year, Country, top_value)]
print(df1)
Year Country top_value
1: 2000 ALB 501,500,497
2: 2001 ALB
3: 2000 ARG 502,487,354
4: 2001 ARG
5: 2000 ALB 501,500,497
6: 2001 ALB
7: 2000 ARG 502,487,354
8: 2001 ARG
9: 2000 ALB 501,500,497
10: 2001 ALB
11: 2000 ARG 502,487,354
12: 2001 ARG
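The answer above collects all sorted values per year and country; to get the top 10 per year and per test (along with the countries that reached them), one possible sketch, with column names taken from the question and toy values assumed:

```r
library(data.table)

# Toy data in the question's shape (values assumed)
df <- data.table(
  Year = c(2000, 2001, 2000, 2001),
  Country = c("ALB", "ALB", "ARG", "ARG"),
  Test1 = c(500, NA, 502, NA),
  Test2 = c(497, NA, 487, NA),
  Test3 = c(501, NA, 354, NA)
)

# Long format, dropping NAs so the empty years disappear
long <- melt(df, id.vars = c("Year", "Country"),
             variable.name = "Test", na.rm = TRUE)

# Highest 10 values (here only 2 countries exist) per Year and Test
top10 <- long[order(-value), head(.SD, 10), by = .(Year, Test)]
```

For the LaTeX part, packages such as xtable or knitr (e.g. knitr::kable(top10, format = "latex")) can print the resulting table.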

Create time event based dummy variable in R - leads & lags

I am currently searching for a method to create a set of dummy variables indicating a time event in a panel. Specifically, I am trying to make dummy variables indicating the event 20 years prior to the event and 20 years after it, e.g. the effect of a war on trade over 20 years. I want to code this dummy for each partner in the dyads. How is it possible to elegantly program these event dummies? I would appreciate your help :)
iso_o iso_d year mid_o mid_d
ABW AFG 1980 0 1
ABW AFG 1981 0 1
ABW AFG 1982 0 1
ABW AFG 1983 0 2
ABW AFG 1984 0 1
ABW AFG 1985 0 1
ABW AFG 1986 0 1
ABW AFG 1987 0 1
ABW AFG 1988 0 0
ABW AFG 1989 0 1
So and this is where I want to go to:
iso_o iso_d year mid_o mid_d mid_o_t-20 mid_o_t-19 mid o_t-18 .... mid_d_t-20
ABW AFG 1980 0 1 0 0 0
ABW AFG 1981 0 1 0 0 0
ABW AFG 1982 0 1 0 0 0
ABW AFG 1983 0 2 0 0 0
ABW AFG 1984 0 1 0 0 0
ABW AFG 1985 0 1 0 0 0
I'm assuming here that da.f (short for data.frame, named to avoid collision with known functions) approximately follows your structure, since you did not include it in the question.
library(zoo)
#da.f is randomly generated in this example
da.f = data.frame(mid_o = sample(seq(0,4), 50, replace = TRUE), mid_d = sample(seq(0,4), 50, replace = TRUE))
#our result consists of 20 lags backward and forward in time
res = lag(as.zoo(da.f), -20:20, na.pad = TRUE)
On May 10th 2018 it was pointed out to me by #thistleknot (thanks!) that dplyr masks stats's own lag generic. Therefore make sure you don't have dplyr attached, or run stats::lag explicitly; otherwise my code won't run.
I think I found the culprit: github.com/tidyverse/dplyr/issues/1586
answer: This is a natural consequence of having lots of R packages.
Just be explicit and use stats::lag or dplyr::lag
Hello There and thank you for your help!
I found the solution to the problem: I had to convert the data.frame to a data.table in the first place. Secondly, I found a way to create multiple columns in data.table by combining the commands sprintf and shift. Thereby I could create 20 lags and 20 leads within only 4 lines of code.
df[, sprintf("mid_o_lag_%d", 1:20) := shift(mid_o, 1:20, type = "lag")]
df[, sprintf("mid_d_lag_%d", 1:20) := shift(mid_d, 1:20, type = "lag")]
df[, sprintf("mid_o_lead_%d", 1:20) := shift(mid_o, 1:20, type = "lead")]
df[, sprintf("mid_d_lead_%d", 1:20) := shift(mid_d, 1:20, type = "lead")]
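These four lines work because data.table's shift() accepts a vector of offsets and returns a list of shifted vectors, which := spreads across the generated column names. A minimal check on toy data:

```r
library(data.table)

# Toy column standing in for mid_o
df <- data.table(mid_o = 1:5)

# Two lags at once: shift() returns a list, := assigns it to both names
df[, sprintf("mid_o_lag_%d", 1:2) := shift(mid_o, 1:2, type = "lag")]

# mid_o_lag_1 is NA 1 2 3 4; mid_o_lag_2 is NA NA 1 2 3
```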

Create New Variable by Matching from adjacent Column to another data.table [duplicate]

This question already has answers here:
Adding a column based on another column R data.table
(3 answers)
Closed 7 years ago.
Looking for a more elegant solution using data.table code to accomplish the following:
I have two data tables, which can be captured by the following example:
library("data.table")
A <- data.table(country_name = c("afghanistan", "albania", "algeria"),
country_rank = c(1:3)) #primary data table
B <- data.table(country_name = c("afghanistan", "albania", "algeria"),
A2 = c("AF", "AL", "DZ")) #reference data table
A
# country_name country_rank
# 1: afghanistan 1
# 2: albania 2
# 3: algeria 3
B
# country_name A2
# 1: afghanistan AF
# 2: albania AL
# 3: algeria DZ
I would like to add a new column to A that is the 2-letter country code contained in B. I am currently accomplishing this using dplyr, in what feels like a quite convoluted manner; reading the command is unnecessarily confusing. I am wondering about the analogous solution within data.table.
FYI Within dplyr:
A <- mutate(A, A2 = B[match(A$country_name, B$country_name), A2])
A
country_name country_rank A2
1: afghanistan 1 AF
2: albania 2 AL
3: algeria 3 DZ
Many thanks!
data.table is set up to do these joins very naturally, but you need to specify the common key first.
setkey(A, country_name)
setkey(B, country_name)
A[B] ## join A with B on the common key 'country_name'
country_name country_rank A2
1: afghanistan 1 AF
2: albania 2 AL
3: algeria 3 DZ
You can use a join in dplyr as follows:
library(dplyr)
inner_join(A, B)
Joining by: "country_name"
country_name country_rank A2
1 afghanistan 1 AF
2 albania 2 AL
3 algeria 3 DZ
You can use select to relocate the last column where you need.
If B does not have all the country names, you can use left_join instead to get NAs into the missing rows.
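Since data.table 1.9.6 the join key can also be supplied inline with on=, and an update join adds the column to A by reference without calling setkey(); a sketch using the question's tables:

```r
library(data.table)

A <- data.table(country_name = c("afghanistan", "albania", "algeria"),
                country_rank = 1:3)
B <- data.table(country_name = c("afghanistan", "albania", "algeria"),
                A2 = c("AF", "AL", "DZ"))

# Update join: look up each row of A in B and copy B's A2 across by reference
A[B, A2 := i.A2, on = "country_name"]
```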
