How to properly use do.call within a function?

I'm progressively transitioning from SAS to R, and at the moment I am trying to replicate what I used to do with macros.
I have a table that contains all my data (let's call it IDF_pop), and from this table I create two others: YVE_pop and EPCI_pop, which are subsets of the main table. I prefer creating separate tables, though I guess this might not be optimal. Here's how I proceed:
## Let's say the main table contains 10 rows.
## codgeo is the city's postal code, epci is the area, and I have three
## variables that describe different parts of the population
library(reshape2)  # for melt()
library(dplyr)     # for %>%, filter, group_by, summarise
library(stringi)   # for stri_sub

codgeo <- c("75014","75020","78300","78520","78650","91200","91600","92500","93100","95230")
epci <- c("001","001","002","002","003","004","004","005","006","007")
pop0_15 <- 10000*runif(10)
pop15_64 <- 10000*runif(10)
pop65p <- 10000*runif(10)
IDF_pop <- data.frame(codgeo, epci, pop0_15, pop15_64, pop65p)

## I'd like my population figures to be in one single column; for this I'll use melt
IDF_pop_line <- melt(IDF_pop, c("codgeo","epci"))
## Now I want to create separate tables for the Yvelines department (codgeo starts with 78) and for EPCI 002
## I could do it in two lines, but I wanted practice writing functions, so here goes
localisation <- function(code_dep, lib_dep, code_epci, lib_epci){
  do.call("<<-",
          list(paste0(eval(lib_dep), "_pop_line"),
               IDF_pop_line %>% filter(stri_sub(codgeo, from = 1, length = 2) == code_dep)
          )
  )
  do.call("<<-",
          list(paste0(eval(lib_epci), "_pop_line"),
               IDF_pop_line %>% filter(epci == code_epci)
          )
  )
}
do.call("localisation", list("78","YVE","002","GPSO"))
With this, I have my 3 tables (IDF_, YVE_, GPSO_) and can now get to the main problem.
What I want to do next is summarise my tables. I'm trying to write a function that would work for all 3 tables.
I'd like it to be fully dependent on the parameter, but it seems that do.call won't accept a paste0 in its second argument.
## Aggregating the tables. I'll call the function 3 times, once for each level.
agregation <- function(lib){
  # This doesn't work:
  do.call("<<-",
          list(paste0(eval(lib), "_pop_agr"),
               paste0(eval(lib), "_pop_line") %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
do.call("agregation", list("IDF")) # This one doesn't work
agregation2 <- function(lib){
  do.call("<<-",
          list(paste0(eval(lib), "_pop_agr"),
               IDF_pop_line %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
do.call("agregation2", list("IDF")) # This one does
As you can see, the only working way I've found so far is to write out the full name of the table I'm aggregating. But this goes against the initial idea of having something that can be freely parameterized.
How can I modify the first version of my function so that it works for all three possible parameters?
Lastly, I am aware that a simple workaround would be to keep my IDF_pop_line table and filter at the last moment to create the 3 aggregated tables, but I prefer having separate tables from the get-go.
Thanks in advance for your help!

In your agregation function, the string paste0(eval(lib), "_pop_line") returns the name of a data frame, not the data frame itself.
Try get:
agregation <- function(lib){
  do.call("<<-",
          list(paste0(eval(lib), "_pop_agr"),
               get(paste0(eval(lib), "_pop_line")) %>%
                 group_by(variable) %>%
                 summarise(pop = sum(value))
          )
  )
}
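With get() in place, the same function works for all three prefixes. A quick usage sketch, assuming the *_pop_line tables created above exist; note that assign() is the conventional counterpart of get() and does the same job as do.call("<<-", ...):
for (lib in c("IDF", "YVE", "GPSO")) {
  agregation(lib)  # creates IDF_pop_agr, YVE_pop_agr and GPSO_pop_agr
}
# equivalent assignment without do.call (value is a placeholder here):
# assign(paste0(lib, "_pop_agr"), value, envir = .GlobalEnv)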

Here is a suggestion using data.table.
You can use the IDF_pop you created before defining any functions.
library(data.table)
# make a data.table out of IDF_pop
setDT(IDF_pop)
# create groups to summarise by
IDF_pop[epci == "002", GPSO := TRUE][]
IDF_pop[grepl("^78", codgeo), YVE := TRUE][]
# melt, keeping only rows where a filter is TRUE
dt <- data.table::melt(IDF_pop,
                       id.vars = c("codgeo", "epci", "pop0_15", "pop15_64", "pop65p"),
                       measure.vars = c("GPSO", "YVE"))[value == TRUE, ][]
Intermediate result (dt):
#    codgeo epci  pop0_15 pop15_64   pop65p variable value
# 1:  78300  002 6692.394 5441.225 4008.875     GPSO  TRUE
# 2:  78520  002 2128.604 6808.004 1889.822     GPSO  TRUE
# 3:  78300  002 6692.394 5441.225 4008.875      YVE  TRUE
# 4:  78520  002 2128.604 6808.004 1889.822      YVE  TRUE
# 5:  78650  003 8482.971 6556.482 5098.929      YVE  TRUE
Code:
# now summarising is easy: sum every pop column by variable group
dt[, lapply(.SD, sum), by = variable, .SDcols = names(dt)[grepl("^pop", names(dt))]]
Final output:
#    variable   pop0_15 pop15_64   pop65p
# 1:     GPSO  7171.683 5855.894 11866.55
# 2:      YVE 12602.153 8028.948 14364.21
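If you still prefer separate tables per group, as in the question, you can split the intermediate result instead of assigning into the global environment; a small sketch:
# one data.table per group, in a named list (data.table's split supports by =)
tables <- split(dt, by = "variable")
tables$YVE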

Related

Looping the r-package "seg" function with changing data selection

I am calculating the dissimilarity index of several groups compared to the total population with the function "seg" from the identically named package.
The data consists of about 450 rows, each a different district, and around 20 columns (groups that may be segregated). The values are the number of people from respective group living in respective district. Here are the first few rows of my csv file:
Region,Germany,EU15 without Germany,Poland,Former Yugoslavia and successor countries,Former Soviet Union and successor countries,Turkey,Arabic states,West Afrika,Central Afrika,East Afrika,North America,Central America and the Carribean,South America,East and Central Asia,South and Southeast Asia - excluding Vietnam,Australia and Oceania,EU,Vietnam,Non EU Europe,Total Population
1011101,1370,372,108,35,345,91,256,18,6,3,73,36,68,272,98,3,1979,19,437,3445
1011102,117,21,6,0,0,0,6,0,0,0,7,0,6,0,7,0,156,0,3,188
1011103,2180,482,181,102,385,326,358,48,12,12,73,24,75,175,129,12,3152,34,795,5159
Since the seg function only works with two columns as input, my current code to create a table with the index for all groups looks like this:
DI_table <- as.data.frame(0)
DI_table[1,1] <- print(seg(data = dfplrcountrygroups2019[, c("Germany", "Total.Population")]))
DI_table[1,2] <- print(seg(data = dfplrcountrygroups2019[, c(colnames(dfplrcountrygroups2019)[3], "Total.Population")]))
DI_table[1,3] <- print(seg(data = dfplrcountrygroups2019[, c(colnames(dfplrcountrygroups2019)[4], "Total.Population")]))
DI_table[1,4] <- print(seg(data = dfplrcountrygroups2019[, c(colnames(dfplrcountrygroups2019)[5], "Total.Population")]))
# and so on...
colnames(DI_table) <- colnames(dfplrcountrygroups2019[2:20])
This works well, but it's a hassle to recode every time I change something in my data, and I would like to use this method for other datasets too.
I thought I might try something like the following, but the seg function did not consider it a selection of two columns.
for (i in colnames(dfplrcountrygroups2019)) {
  di_matrix[i] <- seg(data = dfplrcountrygroups2019[, c("i", "Total.Population")])
}
Error in `[.data.frame`(dfplrcountrygroups2019, , c("i", "Total.Population")) :
  undefined columns selected
I also thought of the apply function, but I'm not sure how to make it repeat itself while changing the column that "Germany" occupies in the example. How do I make the selection of columns change each time the seg function is repeated?
my_function <- seg(data = dfplrcountrygroups2019[, c("Germany", "Total.Population")])
apply(X = dfplrcountrygroups2019,
      FUN = my_function,
      MARGIN = 2)
Error in get(as.character(FUN), mode = "function", envir = envir) :
  object 'my_function' of mode 'function' was not found
The seg package's functions such as dissim (seg::seg is being deprecated in its favor) have a specific expected data format. From the docs:
data - a numeric matrix or data frame with two columns that represent mutually exclusive population groups (e.g., Asians and non-Asians). If more than two columns are given, only the first two will be used for computing the index.
To get a data frame of the d values seg::dissim returns, where each column is a region's dissimilarity index, you can iterate over the columns, making a temporary data frame and calculating the index. Because the data you're starting with isn't made up of mutually-exclusive categories, you'll have to subtract each population from the total population column to get a not-X counterpart for each group X.
A base R option with sapply will return a named list, which you can then convert into a data frame. (In the code below, dat stands for your dfplrcountrygroups2019.)
di_table <- sapply(names(dat)[2:20], function(col) {
  tmp_df <- dat[col]
  tmp_df$other <- dat$Total.Population - dat[[col]]
  seg::dissim(data = tmp_df)$d
}, simplify = FALSE)

as.data.frame(di_table)
#> Germany EU15.without.Germany Poland
#> 1 0.03127565 0.03989693 0.02770549
#> Former.Yugoslavia.and.successor.countries
#> 1 0.160239
#> Former.Soviet.Union.and.successor.countries Turkey Arabic.states West.Afrika
#> 1 0.08808277 0.2047 0.02266828 0.1415519
#> Central.Afrika East.Afrika North.America Central.America.and.the.Carribean
#> 1 0.08004711 0.213581 0.1116014 0.2095969
#> South.America East.and.Central.Asia
#> 1 0.08486598 0.2282734
#> South.and.Southeast.Asia...excluding.Vietnam Australia.and.Oceania EU
#> 1 0.0364721 0.213581 0.04394527
#> Vietnam Non.EU.Europe
#> 1 0.05505789 0.06624686
A couple of tidyverse options: you can use purrr functions to do the same as above in one step.
dat[2:20] %>%
  purrr::map(~ data.frame(value = ., other = dat$Total.Population - .)) %>%
  purrr::map_dfc(~ seg::dissim(data = .)$d)
# same output
Or by reshaping the data and splitting by group. This takes more steps, but might fit a larger workflow better.
library(dplyr)
dat %>%
  tidyr::pivot_longer(c(-Region, -Total.Population)) %>%
  mutate(other = Total.Population - value) %>%
  split(.$name) %>%
  purrr::map_dfc(~ seg::dissim(data = .[c("value", "other")])$d)
# same output
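Since you want to reuse this for other datasets, the sapply version is easy to wrap up in a function. A sketch under the same assumptions as above (the seg_table name and its arguments are mine; one ID column plus one total column):
seg_table <- function(dat, id = "Region", total = "Total.Population") {
  groups <- setdiff(names(dat), c(id, total))
  res <- sapply(groups, function(col) {
    # each group vs. its mutually exclusive complement, as required by seg::dissim
    seg::dissim(data = data.frame(group = dat[[col]],
                                  other = dat[[total]] - dat[[col]]))$d
  }, simplify = FALSE)
  as.data.frame(res)
}
# seg_table(dfplrcountrygroups2019) reproduces the output above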

Recode values based on look up table with dplyr (R)

A relatively trivial question that has been bothering me for a while, but to which I have not yet found an answer - perhaps because I have trouble verbalizing the problem for search engines.
Here is a column of a data frame that contains identifiers.
data <- data.frame("id" = c("D78", "L30", "F02", "A23", "B45", "T01", "Q38", "S30", "K84", "O04", "P12", "Z33"))
Based on a lookup table, outdated identifiers are to be recoded into new ones. Here is an example lookup table.
recode_table <- data.frame("old" = c("A23", "B45", "K84", "Z33"),
                           "new" = c("A24", "B46", "K88", "Z33"))
What I need now can be done with a merge or a loop. Here a loop example:
for (ID in recode_table$old) {
  data[data$id == ID, "id"] <- recode_table[recode_table$old == ID, "new"]
}
But I am looking for a dplyr solution that avoids the "join" family. I would like something like this:
data <- mutate(data, id = ifelse(id %in% recode_table$old, filter(recode_table, old == id) %>% pull(new), id))
Obviously though, I can't use the column name ("id") of the table in order to identify the new ID.
References to corresponding passages in documentations or manuals are also appreciated. Thanks in advance!
You can use recode with unquote splicing (!!!) on a named vector:
library(dplyr)
# vector of new IDs...
recode_vec <- recode_table$new
# ...named with the old IDs
names(recode_vec) <- recode_table$old

data %>%
  mutate(id = recode(id, !!!recode_vec))
# id
# 1 D78
# 2 L30
# 3 F02
# 4 A24
# 5 B46
# 6 T01
# 7 Q38
# 8 S30
# 9 K88
# 10 O04
# 11 P12
# 12 Z33
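The same thing in a single step, building the named vector inline with setNames():
data %>%
  mutate(id = recode(id, !!!setNames(recode_table$new, recode_table$old)))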

How to add leading zeros to select rows in a data frame column using dplyr pipes

Consider data created here:
data <- data.frame(ID = sample(10000,100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100))
#Note that not all IDs are the same length
We have data for 100 IDs, where each individual has a unique ID number. Columns not.imp1:3 are only relevant to show the structure of the dataframe.
We want to add a leading zero to the first 95 ID numbers. I am trying to do this using dplyr pipes, but can't figure out how to add the zeros.
Here is how I subset the data that I want to add the zeros to:
library(dplyr)
data %>%
  select(ID) %>%
  slice(1:95)
I have tried several things, like adding %>% mutate(paste0("0", .)) to the pipe, but haven't gotten anything to work. What is the best way to do this?
Using sprintf() to pad to 3 digits:
data %>% mutate(ID = sprintf("%03d", ID))
You can change %03d to change how many leading zeros to add. Eg. %05d will ensure all IDs are at least 5 digits long.
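For example (note that the padded IDs become character strings):
sprintf("%05d", c(7, 42, 9449, 10000))
#> [1] "00007" "00042" "09449" "10000"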
You can try this approach
data2 <- data %>%
  mutate(ID = ifelse(row_number() <= 95, paste0("0", ID), ID))
head(data2)
# ID not.imp1 not.imp2 not.imp3
# 1 09449 -1.4297317 -2.2210106 0.1923912
# 2 07423 1.9010681 1.0825734 -0.8855694
# 3 06283 0.2508254 -0.5307967 2.1645044
# 4 05593 -2.2451267 0.1281156 -1.8528800
# 5 09194 -0.1677409 -0.7422480 -0.4237452
# 6 07270 -0.2536918 1.2289698 1.0083092
tail(data2)
# ID not.imp1 not.imp2 not.imp3
# 95 06538 1.0071791 0.1596557 -0.7099883
# 96 4829 0.2444440 0.8869954 -1.2938356
# 97 2571 -1.1012023 0.8343393 -0.6264487
# 98 150 0.2116460 -0.2146265 -1.8281045
# 99 3107 -1.2379193 0.3491078 1.4531531
# 100 9953 -0.9326725 1.1146032 -1.5542687
Use of str_pad is helpful:
library(stringr)  # str_pad lives here

data <- data.frame(ID = sample(10000, 100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100)) %>%
  mutate(ID = str_pad(string = ID, width = 4, side = 'left', pad = 0))
This would be a quick and easy way to do it. I didn't use dplyr or piping, but you could merge this idea with the code you already tried if you want to.
data[which(nchar(data[,"ID"]) == 3), "ID"] <- paste0("0", data[which(nchar(data[,"ID"]) == 3), "ID"])
data[which(nchar(data[,"ID"]) == 2), "ID"] <- paste0("00", data[which(nchar(data[,"ID"]) == 2), "ID"])
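And if the goal is literally a single leading zero on the first 95 rows, rather than padding to a fixed width, a minimal base R sketch:
data$ID <- as.character(data$ID)  # adding zeros forces IDs to character anyway
data$ID[1:95] <- paste0("0", data$ID[1:95])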

How to write this into a loop in R?

I am trying to arrange my data. The csv file that I load contains the results of 15 precincts for one locality. There are 150 rows because the names of the 10 candidates repeat for each of the 15 precincts.
My goal is to turn the names of the 10 candidates into columns, without repeating their names, with each candidate's results as the values. I use the code below, but I have to repeat it 15 times because I cut my data into intervals of 10 to extract the results of one precinct. It's the same for "binondov", except I have to cut that data into intervals of 8 because there are 8 candidates per precinct.
Is there a way to write my code as a loop? Thanks!
binondop1 <- binondop[1:10,]
binondop1a <- binondop1[order(binondop1[,2]),]
binondov1 <- binondov[1:8,]
binondov1a <- binondov1[order(binondov1[,2]),]
colnames(binondop1a) = colnames(binondov1a) =
c('X', 'Candidate', 'Party', 'Vote', 'Percentage')
binondo1 <- rbind(binondop1a, binondov1a)
binondo <- rbind(t(binondo1$Vote), t(binondo2$Vote),
t(binondo3$Vote), t(binondo4$Vote),
t(binondo5$Vote), t(binondo6$Vote),
t(binondo7$Vote), t(binondo8$Vote),
t(binondo9$Vote), t(binondo10$Vote),
t(binondo11$Vote), t(binondo12$Vote),
t(binondo13$Vote),t(binondo14$Vote),
t(binondo15$Vote))
colnames(binondo) <- c('Acosta', 'Aquino', 'DLReyes', 'EEjercito',
'Gordon', 'Madrigal', 'Perlas', 'Teodoro',
'Villanueva', 'Villar', 'Binay', 'Chipeco',
'Fernando', 'Legarda', 'Manzano', 'Roxas',
'Sonza', 'Yasay')
It's hard to say exactly without seeing a sample data set, but perhaps something like this will help get you where you need to go.
library(dplyr)
library(tidyr)

df <- data.frame(Candidate = c(rep('Acosta', 3), rep('Aquino', 3), rep('DLReyes', 3)),
                 Party = c('R','R','R','L','L','L','D','D','D'),
                 Vote = rep(c('A','B','C'), 3),
                 Percentage = c(5,4,2,6,8,3,1,3,2))

df2 <- df %>%
  mutate(Candidate = paste0(Candidate, ' (', Party, ')')) %>%
  select(-Party) %>%
  spread(Candidate, Percentage)
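To answer the literal looping question: since binondop stacks 15 precincts of 10 rows each, you can split it into blocks and treat each block identically, instead of writing binondop1 through binondop15 by hand. A sketch, assuming the rows are ordered precinct by precinct, with candidate names in column 2 and votes in column 4 (as in your code):
# 15 blocks of 10 rows, one per precinct
blocks <- split(binondop, rep(1:15, each = 10))
# order each block by candidate name, keep the vote column, one row per precinct
votes <- lapply(blocks, function(p) t(p[order(p[, 2]), 4]))
binondop_wide <- do.call(rbind, votes)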

Dynamic variable evaluation within dplyr or vectorized method

Normally, I would handle this kind of issue by running a loop (probably still not the best solution), but I am working with an extremely large dataset (7.8 million observations) and I have been trying to program it more efficiently. Here is a very small subset of my dataset:
df = data.frame(STATE = c("PA", "PA", "MD", "MD", "MO", "MO"),
                DIVISION = c("Middle_Atlantic", "Middle_Atlantic", "South_Atlantic", "South_Atlantic", "West_North_Central", "West_North_Central"),
                Middle_Atlantic_NSA = c(117.77, 119.43, 119.43, 120.72, 119.11, 117.77),
                Middle_Atlantic_SA = c(118.45, 119.65, 119.65, 120.73, 119, 118.45),
                South_Atlantic_NSA = c(134.45, 135.2, 135.2, 136.69, 134.07, 134.45),
                South_Atlantic_SA = c(134.25, 134.83, 134.83, 135.97, 133.86, 134.25),
                West_North_Central_NSA = c(152.24, 153.61, 153.61, 155.19, 151.08, 152.24),
                West_North_Central_SA = c(152.77, 153.19, 153.19, 154.44, 151.63, 152.77),
                DIV_HPI_NSA = c(117.77, 119.43, 135.2, 136.69, 151.08, 152.24),
                DIV_HPI_SA = c(118.45, 119.65, 134.83, 135.97, 151.63, 152.77))
I have included my desired output for the variables "DIV_HPI_NSA" and "DIV_HPI_SA". What I am attempting to do is look up the value in "DIVISION" (e.g. "Middle_Atlantic"), attach the suffix "_NSA" to it, and return the corresponding value of that column (in this case "Middle_Atlantic_NSA") into the new variable "DIV_HPI_NSA". I am doing the same thing for the "DIV_HPI_SA" variable. Currently, I am trying to use either the get() function or the eval(parse(text = "text_here")) method to evaluate the strings as column names and produce the correct values, but neither is working as desired. Ideally I would prefer a dplyr solution, as dplyr has been processing relatively quickly compared to loops. I am not sure why this is not working in dplyr, and I would like to understand why and how I could execute it successfully.
Here is my current code:
comb.df = df %>%
  mutate(DIV_HPI_NSA = get(paste0(DIVISION, "_NSA")),
         DIV_HPI_SA = eval(parse(text = paste0(DIVISION, "_SA"))))
This is how I would do it with a loop, which produces the correct result but takes a ridiculous amount of time:
for (i in 1:dim(comb.df)[1]) {
  comb.df$DIV_HPI_NSA[i] = comb.df[i, paste0(comb.df$DIVISION[i], "_NSA")]
  comb.df$DIV_HPI_SA[i] = comb.df[i, paste0(comb.df$DIVISION[i], "_SA")]
}
My current output (i.e. DIV_HPI_NSA) keeps returning the column that corresponds to the first element of the "DIVISION" column. For example, the dplyr method for "DIV_HPI_NSA" returns only values from the "Middle_Atlantic_NSA" column, since that is the first element in "DIVISION". The eval() approach has the same issue and does not generate the correct output by row.
Is there a better/faster method than dplyr, and/or how can I fix my dplyr code for it to work properly?
Please let me know if you need additional information.
Thanks in advance!
The answer will maybe depend on the number of values DIVISION can take.
Here is a little benchmark with only "_NSA", but obviously you can do the same with "_SA" later.
# your base function in a for loop
x1 = function(db){
  for (i in 1:dim(db)[1]) {
    db$DIV_HPI_NSA[i] = db[i, paste0(db$DIVISION[i], "_NSA")]
    db$DIV_HPI_SA[i] = db[i, paste0(db$DIVISION[i], "_SA")]
  }
  db
}

# the very same function using 'apply', which is supposed to be much faster than a base loop
x2 = function(db){
  db %>% apply(1, function(x){
    x["DIV_HPI_NSA2"] = x[paste0(x["DIVISION"], "_NSA")]
    x["DIV_HPI_SA2"] = x[paste0(x["DIVISION"], "_SA")]
    x
  }) %>% t %>% as.data.frame
}

# if DIVISION takes few values, you can use 'dplyr::case_when' this way
x3 = function(db){
  db %>% mutate(output2 = case_when(
    DIVISION == "Middle_Atlantic" ~ Middle_Atlantic_NSA,
    DIVISION == "South_Atlantic" ~ South_Atlantic_NSA,
    DIVISION == "West_North_Central" ~ West_North_Central_NSA
  ))
}

# but if DIVISION can take a lot of values, you may have to rlang the function a bit
x4 = function(db){
  db = db %>% mutate(output2 = -999) # start with a dummy value
  xx = data.frame(A = db$DIVISION, B = paste0(db$DIVISION, "_NSA"), stringsAsFactors = FALSE) %>%
    unique %>%
    split(seq(nrow(.))) # turns xx into a list of its rows
  for (i in xx) {
    db = db %>% mutate(output2 = case_when(DIVISION == i$A ~ !!sym(i$B), TRUE ~ output2))
  }
  db
}
#here are some replicates of your dataset to increase the number of lines
df60 = df[rep(seq_len(nrow(df)), 10),]
df600 = df[rep(seq_len(nrow(df)), 100),]
df6k = df[rep(seq_len(nrow(df)), 1000),]
df60k = df[rep(seq_len(nrow(df)), 10000),]
df600k = df[rep(seq_len(nrow(df)), 100000),]
library(microbenchmark)
# the benchmark of every function with every dataset
(mbm = microbenchmark(
base = x1(df),
base60 = df60 %>% x1,
base600 = df600 %>% x1,
base6k = df6k %>% x1,
apply = x2(df),
apply60 = df60 %>% x2,
apply600 = df600 %>% x2,
apply6k = df6k %>% x2,
dplyr = x3(df),
dplyr60 = x3(df60),
dplyr600 = x3(df600),
dplyr6k = x3(df6k),
dplyr60k = x3(df60k),
dplyr600k = x3(df600k),
dplyrcw = x4(df),
dplyrcw60 = x4(df60),
dplyrcw600 = x4(df600),
dplyrcw6k = x4(df6k),
dplyrcw60k = x4(df60k),
dplyrcw600k = x4(df600k),
times=6
))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base 515.283 599.3395 664.6767 683.396 739.3735 795.351 3 a
# base60 5125.835 5209.1620 5515.3047 5292.489 5710.0395 6127.590 3 a
# base600 53225.746 53300.1395 66678.0210 53374.533 73404.1585 93433.784 3 b
# base6k 587666.127 618005.9505 629841.8157 648345.774 650929.6600 653513.546 3 d
# apply 1220.559 1272.8895 1342.4810 1325.220 1403.4420 1481.664 3 a
# apply60 2265.710 2384.9575 2497.3980 2504.205 2613.2420 2722.279 3 a
# apply600 10852.649 11579.6225 12047.9227 12306.596 12645.5595 12984.523 3 a
# apply6k 114463.342 125155.8980 137072.6593 135848.454 148377.3180 160906.182 3 c
# dplyr 1298.964 1352.9355 1433.0417 1406.907 1500.0805 1593.254 3 a
# dplyr60 1604.559 1647.0435 1713.2313 1689.528 1767.5675 1845.607 3 a
# dplyr600 1357.676 1456.6845 1556.4223 1555.693 1655.7955 1755.898 3 a
# dplyr6k 1954.644 1970.1425 2025.0260 1985.641 2060.2170 2134.793 3 a
# dplyr60k 6366.085 6584.1590 6809.2833 6802.233 7030.8825 7259.532 3 a
# dplyr600k 46893.576 53406.6235 58086.0983 59919.671 63682.3595 67445.048 3 b
# dplyrcw 5824.182 5834.0285 5999.5897 5843.875 6087.2935 6330.712 3 a
# dplyrcw60 5591.885 5683.0535 6032.4097 5774.222 6252.6720 6731.122 3 a
# dplyrcw600 5664.820 5811.2360 5900.6413 5957.652 6018.5520 6079.452 3 a
# dplyrcw6k 6390.883 6522.7120 9003.2733 6654.541 10309.4685 13964.396 3 a
# dplyrcw60k 14379.395 14936.6140 15179.6070 15493.833 15579.7130 15665.593 3 a
# dplyrcw600k 85238.503 86607.3005 92601.6017 87976.098 96283.1510 104590.204 3 b
Conclusion
For a 6k-row dataset:
apply (~137 ms) is roughly 4-5x faster than base (~630 ms)
vanilla dplyr is much faster still (~2 ms)
rlang-ed dplyr is a bit slower than vanilla (~9 ms)
Times grow roughly linearly with base and apply, at about 0.1 ms per row, so 8M rows should take around 800 seconds (roughly 13 minutes) with the base loop.
The dplyr timings grow much more slowly (at worst about linearly at the largest sizes), so it should cope with your big dataset far better, though I cannot guarantee that from this benchmark alone.
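For completeness, one option missing from the benchmark: a fully vectorized base R lookup using matrix indexing, which avoids per-row work entirely and tends to be very fast for this pattern. A sketch (the m and idx names are mine):
# numeric matrix of just the lookup columns
m <- as.matrix(df[, paste0(unique(df$DIVISION), "_NSA")])
# for each row, which column of m to read
idx <- match(paste0(df$DIVISION, "_NSA"), colnames(m))
df$DIV_HPI_NSA2 <- m[cbind(seq_len(nrow(df)), idx)]
# repeat with "_SA" for DIV_HPI_SA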
