Looping a Function over Groups and Years - r

I have a panel of many countries over several years that contains information on output (GDP). I would like to calculate the "Output Gap" using a function I found at R-Bloggers here, applying it to each country separately with the years in order, and storing the results in one matrix (bound row-wise).
The function looks as follows:
hp <- function(data, l = 1600) {
  # H-P filter code from Farnsworth
  hpfilterq <- function(x = data, lambda = l) {
    eye <- diag(length(x))
    result <- solve(eye + lambda * crossprod(diff(eye, lag = 1, differences = 2)), x)
    return(result)
  }
  hpfiltered <- hpfilterq(data)
  hpgap <- data - hpfiltered
  # time index and its powers for the polynomial trends
  t1 <- 1:length(data)
  t2 <- t1^2
  t3 <- t1^3
  t1 <- ts(t1)
  t2 <- ts(t2)
  t3 <- ts(t3)
  # combine the series and the trend terms
  datats <- ts(data)
  myseries <- ts.union(datats, t1, t2, t3)
  # linear, quadratic and cubic trend fits
  polynomial1 <- lm(datats ~ t1, data = myseries)
  polynomial2 <- lm(datats ~ t1 + t2, data = myseries)
  polynomial3 <- lm(datats ~ t1 + t2 + t3, data = myseries)
  # one column per gap measure
  returndata <- data.frame(hpgap, polynomial1$residuals, polynomial2$residuals, polynomial3$residuals)
  colnames(returndata) <- c("H-P Gap", "Poly1", "Poly2", "Poly3")
  return(returndata)
}
My sample hypothetically looks like:
country year output
1 AUS 2000 49709.21
2 AUS 2001 59805.90
3 AUS 2002 46501.57
4 AUS 2003 53521.78
5 AUS 2004 53824.41
6 AUS 2005 55001.43
7 AUS 2006 48356.12
8 AUS 2007 55125.00
9 AUS 2008 58551.84
10 AUS 2009 57805.95
11 AUS 2010 64858.86
12 AUS 2011 67395.81
13 AUS 2012 69043.00
14 AUS 2013 73789.00
15 AUS 2014 77869.09
16 BEL 2000 7110.00
17 BEL 2001 7235.10
18 BEL 2002 7204.10
19 BEL 2003 7327.60
20 BEL 2004 7558.70
21 BEL 2005 7123.10
22 BEL 2006 7539.00
23 BEL 2007 7943.40
24 BEL 2008 8052.50
25 BEL 2009 7509.60
26 BEL 2010 8455.50
27 BEL 2011 8749.40
28 BEL 2012 9694.10
29 BEL 2013 9614.40
30 BEL 2014 8707.50
I would like to apply function "hp" to "AUS",
H-P Gap Poly1 Poly2 Poly3
1 2393.2324 2751.8407 -3684.68922 -3536.276248
2 10838.8666 11069.4941 7391.47697 7412.678826
3 -4118.0018 -4013.8596 -5357.75042 -5414.832333
4 1239.8878 1227.3108 1793.15960 1698.566717
5 -135.2596 -249.0938 1802.10804 1702.622415
6 -657.9475 -851.1054 2261.06288 2181.148204
7 -9031.2784 -9275.4400 -5526.69186 -5570.726477
8 -4024.7279 -4285.5977 -324.65619 -324.656186
9 -2394.8369 -2637.7883 1110.95990 1154.994518
10 -4970.1209 -5162.7119 -2050.54360 -1970.628924
11 224.6324 111.1625 2162.36431 2261.849928
12 881.4586 869.0828 1434.93163 1529.524516
13 633.1949 737.2392 -606.65163 -549.569720
14 3474.5127 3704.2066 26.18952 4.987662
15 5646.3875 6005.2600 -431.26992 -579.682899
"BEL",
H-P Gap Poly1 Poly2 Poly3
1 291.55895 311.04333 -4.1179412 -188.99755
2 253.89032 266.24190 86.1497479 59.73838
3 59.93946 65.34048 -0.4624273 70.64511
4 19.96547 18.93905 46.6455333 164.48089
5 86.68999 80.13762 180.5736296 304.50392
6 -514.57784 -525.36381 -372.9781383 -273.42758
7 -266.08305 -279.36524 -95.8097705 -40.95538
8 -30.94904 -44.86667 149.0787330 149.07873
9 -92.93293 -105.66810 77.8873723 23.03298
10 -808.67250 -818.46952 -666.0838526 -765.63441
11 -37.24743 -42.47095 57.9650582 -65.96523
12 81.16801 81.52762 109.2341047 -8.60125
13 850.02282 856.32619 790.5232870 719.41575
14 594.71530 606.72476 426.6326050 453.04398
15 -487.48754 -470.07667 -785.2379412 -600.35833
and all the other countries, and store them in a matrix (or a list of some sort):
H-P Gap Poly1 Poly2 Poly3
1 2393.23236 2751.84068 -3684.6892235 -3536.276248
2 10838.86665 11069.49406 7391.4769723 7412.678826
3 -4118.00184 -4013.85956 -5357.7504192 -5414.832333
4 1239.88784 1227.31082 1793.1596021 1698.566717
5 -135.25961 -249.09380 1802.1080361 1702.622415
6 -657.94746 -851.10542 2261.0628828 2181.148204
7 -9031.27839 -9275.44005 -5526.6918577 -5570.726477
8 -4024.72789 -4285.59767 -324.6561855 -324.656186
9 -2394.83690 -2637.78829 1110.9598994 1154.994518
10 -4970.12091 -5162.71191 -2050.5436029 -1970.628924
11 224.63238 111.16247 2162.3643075 2261.849928
12 881.45859 869.08285 1434.9316306 1529.524516
13 633.19493 737.23923 -606.6516335 -549.569720
14 3474.51272 3704.20660 26.1895151 4.987662
15 5646.38752 6005.25998 -431.2699235 -579.682899
16 291.55895 311.04333 -4.1179412 -188.997549
17 253.89032 266.24190 86.1497479 59.738375
18 59.93946 65.34048 -0.4624273 70.645114
19 19.96547 18.93905 46.6455333 164.480888
20 86.68999 80.13762 180.5736296 304.503916
21 -514.57784 -525.36381 -372.9781383 -273.427580
22 -266.08305 -279.36524 -95.8097705 -40.955381
23 -30.94904 -44.86667 149.0787330 149.078733
24 -92.93293 -105.66810 77.8873723 23.032983
25 -808.67250 -818.46952 -666.0838526 -765.634411
26 -37.24743 -42.47095 57.9650582 -65.965228
27 81.16801 81.52762 109.2341047 -8.601250
28 850.02282 856.32619 790.5232870 719.415746
29 594.71530 606.72476 426.6326050 453.043978
30 -487.48754 -470.07667 -785.2379412 -600.358333
without having to call hp(data$output[c(1:15)]), hp(data$output[c(16:30)]) and so on by hand, which is very inefficient.

You can combine split and lapply in a small wrapper function: split the output column by country, then apply hp to each piece. The result is a list of data.frames, one per country.
hp_per_country <- function(x) {
  data_list <- split(x$output, x$country)
  # the value of the last expression is returned, so no explicit return() is needed
  lapply(data_list, FUN = hp)
}
overall_hp_results <- hp_per_country(input_data)
This assumes that input_data is already sorted by year within each country. The order matters here, because both the H-P filter and the polynomial trends treat the observations as a time series.
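If a single stacked table is preferred over a list, the per-country results can be bound row-wise. The following is a minimal sketch rather than the answer's own code: hp_gap is a hypothetical, shortened stand-in for hp that returns only the H-P gap, and the toy data reuses the first five AUS and BEL rows from the question.

```r
# Hypothetical stand-in for hp(): returns only the H-P gap column.
hp_gap <- function(y, lambda = 1600) {
  eye <- diag(length(y))
  trend <- solve(eye + lambda * crossprod(diff(eye, lag = 1, differences = 2)), y)
  data.frame(hp_gap = y - trend)
}

# Toy panel: first five years of AUS and BEL from the question.
input_data <- data.frame(
  country = rep(c("AUS", "BEL"), each = 5),
  year    = rep(2000:2004, times = 2),
  output  = c(49709.21, 59805.90, 46501.57, 53521.78, 53824.41,
              7110.00, 7235.10, 7204.10, 7327.60, 7558.70)
)

# Sort first, so each country's series is in year order.
input_data <- input_data[order(input_data$country, input_data$year), ]

# Apply the function per country and keep the identifying columns.
pieces  <- split(input_data, input_data$country)
results <- lapply(pieces, function(d) cbind(d[c("country", "year")], hp_gap(d$output)))

# Stack the per-country data frames into one.
combined <- do.call(rbind, results)
rownames(combined) <- NULL
combined
```

Keeping country and year next to the results makes the stacked rows traceable, which a bare matrix of gaps would not be.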


How to speed up the computation of the intersections between each pair of sets for a large number of pairs

I have the following dataframe:
> str(database)
'data.frame': 8547287 obs. of 4 variables:
$ cited_id : num 4.06e+08 5.41e+07 5.31e+07 5.04e+07 3.79e+08 ...
$ cited_pub_year : num 2014 1989 2002 2002 2015 ...
$ citing_id : num 3.34e+08 3.37e+08 4.06e+08 4.19e+08 4.25e+08 ...
$ citing_pub_year: num 2011 2011 2013 2014 2014 ...
The variables cited_id and citing_id contain the IDs of the objects from which this database has been obtained.
This is an example of the dataframe:
cited_id cited_pub_year citing_id citing_pub_year
1 405821349 2014 419185055 2011
2 405821349 1989 336621202 2011
3 53148996 2002 406314162 2013
4 53148996 2002 419185055 2014
5 379369076 2015 424901495 2014
6 53148996 2011 441055669 2015
7 405821349 2014 447519383 2015
8 405821349 2015 469644221 2016
9 329268142 2014 470861263 2016
10 45433355 2008 55422577 2008
For example the ID 405821349 has been cited by 419185055, 336621202, 447519383 and 469644221. For each pair of IDs I would like to calculate the intersection of their citing IDs. The quantity Pj.k below is the length of the intersection. I tried with the following code
total_id <- c(database$cited_id, database$citing_id)
total_id <- unique(total_id)
df <- data.frame(data_k = character(), data_j = character(), Pj.k = numeric(),
                 stringsAsFactors = F)
for (k in 1:(length(total_id) - 1)) {
  data_k <- total_id[k]
  citing_data_k <- database[database$cited_id == data_k, ]
  for (j in (k + 1):length(total_id)) {
    data_j <- total_id[j]
    citing_data_j <- database[database$cited_id == data_j, ]
    Pj.k <- length(intersect(citing_data_j$citing_id, citing_data_k$citing_id))
    dfxx = data.frame(data_k = data_k, data_j = data_j, Pj.k = Pj.k,
                      stringsAsFactors = F)
    df <- rbind(df, dfxx)
  }
}
Anyway, it takes too long! How could I speed it up?
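Using xtabs, tcrossprod and sparse matrices:
library(Matrix)
# sparse incidence matrix: rows are cited_id, columns are citing_id
m1 <- xtabs(data = database[, c(1, 3)], sparse = TRUE)
# tcrossprod(m1)[k, j] counts the citing IDs that rows k and j share;
# triu(k = 1) keeps each unordered pair once and drops the diagonal
m2 <- as(triu(tcrossprod(m1), k = 1), "TsparseMatrix")
# slots i and j are zero-based, hence the + 1L
df <- data.frame(
  data_k = row.names(m1)[m2@i + 1L],
  data_j = row.names(m1)[m2@j + 1L],
  Pj.k = m2@x,
  stringsAsFactors = FALSE
)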
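To see why a cross-product yields intersection sizes, here is a small dense sketch in base R on the ten example rows from the question. It is an illustration under the assumption that each (cited, citing) pair should count once. A dense matrix like this is only practical for toy data, which is why the answer uses sparse matrices.

```r
# Toy citation data taken from the question (cited_id and citing_id only).
database <- data.frame(
  cited_id  = c(405821349, 405821349, 53148996, 53148996, 379369076,
                53148996, 405821349, 405821349, 329268142, 45433355),
  citing_id = c(419185055, 336621202, 406314162, 419185055, 424901495,
                441055669, 447519383, 469644221, 470861263, 55422577)
)

# Drop repeated (cited, citing) rows so every incidence entry is 0 or 1.
pairs <- unique(database)

# Dense 0/1 incidence matrix: rows are cited IDs, columns are citing IDs.
inc <- unclass(table(pairs$cited_id, pairs$citing_id))

# shared[k, j] = number of citing IDs that rows k and j have in common.
shared <- tcrossprod(inc)

# Keep each unordered pair once (upper triangle, no diagonal).
idx <- which(upper.tri(shared), arr.ind = TRUE)
result <- data.frame(
  data_k = rownames(shared)[idx[, "row"]],
  data_j = rownames(shared)[idx[, "col"]],
  Pj.k   = shared[idx],
  stringsAsFactors = FALSE
)

# Only pairs with a non-empty intersection.
result[result$Pj.k > 0, ]
```

In this toy data only 405821349 and 53148996 share a citing ID (419185055), so a single pair has a non-zero count.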
Inspired by the answers to "Count combinations of categorical variables, regardless of order, in R?", you can count how often each unordered pair of IDs occurs:
database = read.table(header = TRUE, stringsAsFactors = FALSE, text =
"cited_id cited_pub_year citing_id citing_pub_year
1 405821349 2014 419185055 2011
2 405821349 1989 336621202 2011
3 53148996 2002 406314162 2013
4 53148996 2002 419185055 2014
5 379369076 2015 424901495 2014
6 53148996 2011 441055669 2015
7 405821349 2014 447519383 2015
8 405821349 2015 469644221 2016
9 329268142 2014 470861263 2016
10 45433355 2008 55422577 2008")
database |>
dplyr::count(pairs = paste(pmin(cited_id, citing_id),
pmax(cited_id, citing_id)))
#> pairs n
#> 1 329268142 470861263 1
#> 2 336621202 405821349 1
#> 3 379369076 424901495 1
#> 4 405821349 419185055 1
#> 5 405821349 447519383 1
#> 6 405821349 469644221 1
#> 7 45433355 55422577 1
#> 8 53148996 406314162 1
#> 9 53148996 419185055 1
#> 10 53148996 441055669 1
Depending on what you actually need you might find with(database, table(cited_id = cited_id, citing_id = citing_id)) useful too.

R function won't use my arguments. Cannot rename column of dataframe from within the function

I wrote a function:
ltd_EGST <- function (var, x, newvana="name1", string1="con", string2="pro") {
  print (table (var))
  x$w12d_gr <- ifelse(as.numeric(var)>2,1,0)
  x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1,string2))
  print (table (x$w12d_gr))
  x_w <- svydesign(ids=~0, weights = ~persgew, data = x)
  t <- svytable(~w12d_gr+welle, x_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  w12d <- tt[2,] # approval rates only
  print(str(w12d))
  w12d <- as.data.frame(w12d)
  rename(w12d, newvana = w12d)
}
Producing the following dataframe:
newvana
2001 17
2002 17
2003 20
2004 18
2005 19
2006 19
2007 14
2008 17
2010 15
2011 14
2012 16
2013 12
2014 13
2015 20
2016 16
2017 14
2018 18
2019 21
2020 15
2021 13
My question:
Why doesn't the function use the value of the newvana argument when it renames the column?
As you can see, the column is named newvana, not "name1".
Why is that?
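One likely explanation, assuming rename here is dplyr::rename: the name on the left of = is taken literally as the new column name, so newvana is never evaluated as a variable. Within dplyr the usual way around this is rename(w12d, !!newvana := w12d). The sketch below sidesteps the issue in base R. rename_value_column and the three-row data frame are hypothetical, made up for illustration.

```r
# Hypothetical helper: rename the column "w12d" to whatever string newvana holds.
rename_value_column <- function(df, newvana = "name1") {
  # names<- evaluates newvana, so the column gets the string stored in it
  names(df)[names(df) == "w12d"] <- newvana
  df
}

# Made-up stand-in for the data frame built inside ltd_EGST().
w12d <- data.frame(w12d = c(17, 17, 20), row.names = c("2001", "2002", "2003"))

renamed_default <- rename_value_column(w12d)
renamed_custom  <- rename_value_column(w12d, newvana = "approval")
names(renamed_default)
names(renamed_custom)
```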

How to melt a matrix with respect to all variables, where id variable are row.names? [duplicate]

I have got a matrix, that I generated using tapply.
It looks like this:
NON-ROAD NONPOINT ON-ROAD POINT
1999 522.94000 2107.625 346.82000 296.7950
2002 240.84692 1509.500 134.30882 569.2600
2005 248.93369 1509.500 130.43038 1202.4900
2008 55.82356 1373.207 88.27546 344.9752
As you can see, I don't have an id variable that I can use to melt all 4 columns into a single column.
The dataset is around 30 MB in size: data
To generate the matrix:
NEI <- readRDS("summarySCC_PM25.rds")
data <- with(NEI[NEI$fips=="24510",],tapply(Emissions,list(year,type),sum))
> class(data)
[1] "matrix"
Expected Output:
Year Type Emission
1 1999 NON-ROAD 522.94000
2 2002 NON-ROAD 240.84692
3 2005 NON-ROAD 248.93369
4 2008 NON-ROAD 55.82356
5 1999 NONPOINT 2107.62500
6 2002 NONPOINT 1509.50000
7 2005 NONPOINT 1509.50000
8 2008 NONPOINT 1373.20731
9 1999 ON-ROAD 346.82000
10 2002 ON-ROAD 134.30882
11 2005 ON-ROAD 130.43038
12 2008 ON-ROAD 88.27546
13 1999 POINT 296.79500
14 2002 POINT 569.26000
15 2005 POINT 1202.49000
16 2008 POINT 344.97518
Also, I have seen solutions that convert the original dataset directly into my expected output using aggregate. But I can't use that function, because I need to stay consistent with the other answers in my assignment.
It would also be great if the solution used only native R functions, if possible.
Exploit the fact that in R a matrix is stored as a vector with a dim attribute. Using as.vector strips the dimensions and lines up the data column by column. All you have to do then is combine that vector with the dimnames in a data.frame; rownames(m) is recycled across the columns.
res <- data.frame(year=rownames(m),
type=rep(colnames(m), each=nrow(m)),
emission=as.vector(m))
res
# year type emission
# 1 1999 NON.ROAD 522.94000
# 2 2002 NON.ROAD 240.84692
# 3 2005 NON.ROAD 248.93369
# 4 2008 NON.ROAD 55.82356
# 5 1999 NONPOINT 2107.62500
# 6 2002 NONPOINT 1509.50000
# 7 2005 NONPOINT 1509.50000
# 8 2008 NONPOINT 1373.20700
# 9 1999 ON.ROAD 346.82000
# 10 2002 ON.ROAD 134.30882
# 11 2005 ON.ROAD 130.43038
# 12 2008 ON.ROAD 88.27546
# 13 1999 POINT 296.79500
# 14 2002 POINT 569.26000
# 15 2005 POINT 1202.49000
# 16 2008 POINT 344.97520
Data:
m <- structure(c(522.94, 240.84692, 248.93369, 55.82356, 2107.625,
1509.5, 1509.5, 1373.207, 346.82, 134.30882, 130.43038, 88.27546,
296.795, 569.26, 1202.49, 344.9752), .Dim = c(4L, 4L), .Dimnames = list(
c("1999", "2002", "2005", "2008"), c("NON.ROAD", "NONPOINT",
"ON.ROAD", "POINT")))
One way would be to convert your data to a data frame, add a year column and reshape it to long format.
result <- reshape(data.frame(data, year = rownames(data), check.names = FALSE),
direction = "long", varying = list(colnames(data)),
v.names = "Emission", times = colnames(data),
idvar = 'year', timevar = 'Type')
rownames(result) <- NULL
result
Using tidyverse functions, we can do:
library(tibble)
library(tidyr)
data %>%
as.data.frame() %>%
rownames_to_column('year') %>%
pivot_longer(cols = -year, names_to = 'Type', values_to = 'Emission')
# year Type Emission
# <chr> <chr> <dbl>
# 1 1999 NON.ROAD 523.
# 2 1999 NONPOINT 2108.
# 3 1999 ON.ROAD 347.
# 4 1999 POINT 297.
# 5 2002 NON.ROAD 241.
# 6 2002 NONPOINT 1510.
# 7 2002 ON.ROAD 134.
# 8 2002 POINT 569.
# 9 2005 NON.ROAD 249.
#10 2005 NONPOINT 1510.
#11 2005 ON.ROAD 130.
#12 2005 POINT 1202.
#13 2008 NON.ROAD 55.8
#14 2008 NONPOINT 1373.
#15 2008 ON.ROAD 88.3
#16 2008 POINT 345.
Data:
data <- structure(c(522.94, 240.84692, 248.93369, 55.82356, 2107.625,
1509.5, 1509.5, 1373.207, 346.82, 134.30882, 130.43038, 88.27546,
296.795, 569.26, 1202.49, 344.9752), .Dim = c(4L, 4L), .Dimnames = list(
c("1999", "2002", "2005", "2008"), c("NON.ROAD", "NONPOINT", "ON.ROAD", "POINT")))
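Another base R route, offered here as a sketch rather than as one of the original answers, goes through as.table: as.data.frame on a table returns one row per cell, in column-major order. The matrix below is rebuilt from the values in the question, with the original hyphenated column names.

```r
# Matrix rebuilt from the question's values (hyphenated names kept).
m <- matrix(
  c(522.94, 240.84692, 248.93369, 55.82356,
    2107.625, 1509.5, 1509.5, 1373.207,
    346.82, 134.30882, 130.43038, 88.27546,
    296.795, 569.26, 1202.49, 344.9752),
  nrow = 4,
  dimnames = list(c("1999", "2002", "2005", "2008"),
                  c("NON-ROAD", "NONPOINT", "ON-ROAD", "POINT"))
)

# One row per cell; responseName sets the name of the value column.
long <- as.data.frame(as.table(m), responseName = "Emission",
                      stringsAsFactors = FALSE)

# The two dimension columns come out as Var1 and Var2.
names(long)[1:2] <- c("Year", "Type")
long
```

Because the dimnames are carried over as data rather than as column names, "NON-ROAD" is not mangled into "NON.ROAD".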

Calculation with apply

I have one table with five columns: Year, Revenue, Pensions, Income and Wages. I ran the following calculation on it:
library(dplyr)
#DATA
TEST<-data.frame(
Year= c(2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021),
Revenue =c(8634,5798,6022,6002,6266,6478,6732,7224,6956,6968,7098,7620,7642,8203,9856,20328,22364,22222,23250,25250,26250,27250),
Pensions =c(8734,5798,7011,7002,7177,7478,7731,7114,7957,7978,7098,7710,7742,8203,9857,10328,11374,12211,13150,15150,17150,17150),
Income =c(8834,5898,6033,6002,6366,6488,6833,8334,6956,6968,8098,8630,8642,8203,9856,30328,33364,32233,33350,35350,36350,38350),
Wages =c(8834,5598,8044,8002,8488,8458,8534,5444,8958,8988,5098,5840,5842,8203,9858,40328,44384,42244,43450,45450,48450,45450)
)
#FUNCTION
fun1 <- function(x){ ((x - lag(x))/lag(x))*100}
#CALCULATION
ESTIMATION_0<-mutate(TEST,
Nominal_growth_Revenue=fun1(Revenue),
Nominal_growth_Pensions=fun1(Pensions),
Nominal_growth_Income=fun1(Income),
Nominal_growth_Wages=fun1(Wages)
)
I would like to streamline this code and do the calculation with an apply function (or something similar). Right now I write one line per column, four in total, and I would like to do it in a single line. Can anybody help me with this?
Assuming you have a character vector with the relevant columns:
cols <- c("Revenue", "Pensions", "Income", "Wages")
Use apply():
TEST[paste0('Nominal_growth_', cols)] <- apply(TEST[cols], 2, fun1)
or data.table:
library(data.table)
setDT(TEST)
TEST[, (paste0('Nominal_growth_', cols)) := lapply(.SD, fun1), .SDcols = cols]
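The same idea also works without any package. The sketch below is an assumption-labelled variant, not the answer's code: growth is a hypothetical base R replacement for fun1 that avoids dplyr's lag, and the data frame is cut down to the first four years of two columns from the question.

```r
# Reduced copy of the question's data (first four years, two columns).
TEST <- data.frame(
  Year    = 2000:2003,
  Revenue = c(8634, 5798, 6022, 6002),
  Wages   = c(8834, 5598, 8044, 8002)
)

# Hypothetical base R version of fun1: percentage change on the previous row.
growth <- function(x) c(NA, diff(x) / head(x, -1)) * 100

# Every column except Year gets a growth column in one assignment.
cols <- setdiff(names(TEST), "Year")
TEST[paste0("Nominal_growth_", cols)] <- lapply(TEST[cols], growth)
TEST
```

lapply is used instead of apply here because it keeps the columns as a list of vectors and skips the conversion to a matrix.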
You could do this:
vars_names <- paste0("Nominal_growth_", names(select(TEST, -Year)))
TEST %>%
  bind_cols(TEST %>% mutate_at(vars(-Year), ~fun1(.x)) %>% select(-Year) %>% setNames(vars_names))
Year Revenue Pensions Income Wages Nominal_growth_Revenue Nominal_growth_Pensions Nominal_growth_Income Nominal_growth_Wages
1 2000 8634 8734 8834 8834 NA NA NA NA
2 2001 5798 5798 5898 5598 -32.8468844 -33.6157545 -33.2352275 -36.63119765
3 2002 6022 7011 6033 8044 3.8634012 20.9210072 2.2889115 43.69417649
4 2003 6002 7002 6002 8002 -0.3321156 -0.1283697 -0.5138405 -0.52212829
5 2004 6266 7177 6366 8488 4.3985338 2.4992859 6.0646451 6.07348163
6 2005 6478 7478 6488 8458 3.3833387 4.1939529 1.9164310 -0.35344015
7 2006 6732 7731 6833 8534 3.9209633 3.3832576 5.3175092 0.89855758
8 2007 7224 7114 8334 5444 7.3083779 -7.9808563 21.9669252 -36.20810874
9 2008 6956 7957 6956 8958 -3.7098560 11.8498735 -16.5346772 64.54812638
10 2009 6968 7978 6968 8988 0.1725129 0.2639186 0.1725129 0.33489618
11 2010 7098 7098 8098 5098 1.8656716 -11.0303334 16.2169920 -43.27992879
12 2011 7620 7710 8630 5840 7.3541843 8.6221471 6.5695233 14.55472734
13 2012 7642 7742 8642 5842 0.2887139 0.4150454 0.1390498 0.03424658
14 2013 8203 8203 8203 8203 7.3410102 5.9545337 -5.0798426 40.41424170
15 2014 9856 9857 9856 9858 20.1511642 20.1633549 20.1511642 20.17554553
16 2015 20328 10328 30328 40328 106.2500000 4.7783301 207.7110390 309.08906472
17 2016 22364 11374 33364 44384 10.0157418 10.1278079 10.0105513 10.05752827
18 2017 22222 12211 32233 42244 -0.6349490 7.3588887 -3.3898813 -4.82155732
19 2018 23250 13150 33350 43450 4.6260463 7.6897879 3.4653926 2.85484329
20 2019 25250 15150 35350 45450 8.6021505 15.2091255 5.9970015 4.60299194
21 2020 26250 17150 36350 48450 3.9603960 13.2013201 2.8288543 6.60066007
22 2021 27250 17150 38350 45450 3.8095238 0.0000000 5.5020633 -6.19195046

How to 'stretch' the cell of a column from a data frame in R

'stretch' may not be the most suitable way to put it, but I can't come up with any other word.
I have a data frame like this :
var1 <- c(rep(0, each=9),1999,rep(0, each=9),2000,rep(0, each=9),2001)
var2 <- c(rnorm(n=30))
df1 <- data.frame(var1,var2)
What I want to do is replace every 0 in column var1 with the next non-zero number encountered further down the column. Hence I want something like:
var1 <- c(rep(1999, each=10),rep(2000, each=10),rep(2001, each=10))
var2 <- c(rnorm(n=30))
df2 <- data.frame(var1,var2)
The values in var2 are specific and ordered, and I don't want to move them around.
The data frame is 500 000 rows long, so I would rather not look up the row number of every non-zero var1 by hand.
(it's likely that such question has been asked before, but since I couldn't find another word than 'stretch'...)
One way using na.locf from zoo:
library(zoo)
#convert zeros to NA in order to use na.locf afterwards
df1$var1[df1$var1 == 0] <- NA
#fromLast carries the observations backwards
df1$var1 <- na.locf(df1$var1, fromLast = TRUE)
Out:
> df1
var1 var2
1 1999 -0.04750614
2 1999 -0.35462388
3 1999 0.30700748
4 1999 1.09506443
5 1999 -0.61049306
6 1999 0.66687294
7 1999 0.54623236
8 1999 -0.04848903
9 1999 -0.56502719
10 1999 0.08067966
11 2000 -0.05474748
12 2000 0.27380898
13 2000 -0.21283353
14 2000 -0.89820808
15 2000 -0.18752047
16 2000 0.21827094
17 2000 0.56370895
18 2000 -1.21738551
19 2000 -0.61426847
20 2000 -1.34144736
21 2001 -0.52697208
22 2001 0.90209640
23 2001 -0.52040468
24 2001 -0.37432746
25 2001 -0.21218776
26 2001 0.88372231
27 2001 0.54274394
28 2001 0.06127087
29 2001 0.04263164
30 2001 0.52294204
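If adding zoo is not an option, the same backward fill can be sketched in base R with findInterval. fill_from_next is a hypothetical helper written for this illustration. It assumes, as in the question, that 0 marks a value to be filled, and it returns NA for trailing zeros that have no later non-zero value.

```r
# Hypothetical helper: replace each 0 with the next non-zero value below it.
fill_from_next <- function(x) {
  # positions of the values to carry backwards
  nz <- which(x != 0)
  # for each index, the number of non-zero positions strictly before it
  before <- findInterval(seq_along(x) - 1, nz)
  # the next non-zero position at or after each index (NA if there is none)
  x[nz[before + 1]]
}

var1 <- c(rep(0, 9), 1999, rep(0, 9), 2000, rep(0, 9), 2001)
filled <- fill_from_next(var1)
filled
```

The lookup is vectorised, so it should remain fast on a column with 500 000 rows.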
