Error: target of assignment expands to non-language object - r

I am trying to dynamically clean up some column names for a large number of tables and I get the above error.
I have a gut feeling that I should be using quo but I have no idea on how to do that.
Any ideas?
The apply_alias applies a set of business rules to clean the names.
apply_alias <- function(l){
which(l=="Geography")
l[which(l=="Geography")] <- "GEO"
toupper(l)
}
The cleanup_column_names_tbl applies the alias_function to a list of table
cleanup_column_names_tbl <- function(PID){
for(p in PID){
names(get(paste0("tbl_",p))) <- apply_alias(names(get(paste0("tbl_",p))))
}
}
cleanup_column_names_tbl("14100287")
When I try to run it i get the following error message:
> cleanup_column_names_tbl("14100287")
Error in names(get(paste0("tbl_", p))) <- apply_alias(names(get(paste0("tbl_", :
target of assignment expands to non-language object
Sample data:
> dput(tbl_14100287[1,])
structure(list(V1 = 0L, REF_DATE = "1976-01", GEO = "Canada",
DGUID = "2016A000011124", `Labour force characteristics` = "Population",
Sex = "Both sexes", `Age group` = "15 years and over", Statistics = "Estimate",
`Data type` = "Seasonally adjusted", UOM = "Persons", UOM_ID = 249L,
SCALAR_FACTOR = "thousands", SCALAR_ID = 3L, VECTOR = "v2062809",
COORDINATE = "1.1.1.1.1.1", VALUE = 16852.4, STATUS = "",
SYMBOL = NA, TERMINATED = NA, DECIMALS = 1L), class = c("data.table",
"data.frame"), row.names = c(NA, -1L), .internal.selfref = <pointer: 0x000002123cf21ef0>)

You cannot assign a value to get since there is no function get<-. The right way of doing it would be something like the following.
apply_alias <- function(l){
l[which(l == "Geography")] <- "GEO"
toupper(l)
}
cleanup_column_names_tbl <- function(PID, envir = .GlobalEnv){
pid_full <- paste0("tbl_", PID)
res <- lapply(pid_full, function(p){
nms <- apply_alias(names(get(p)))
DF <- get(p)
names(DF) <- nms
DF
})
names(res) <- pid_full
list2env(res, envir = envir)
invisible(NULL)
}
cleanup_column_names_tbl("14100287")
names(tbl_14100287)
# [1] "V1" "REF_DATE"
# [3] "GEO" "DGUID"
# [5] "LABOUR FORCE CHARACTERISTICS" "SEX"
# [7] "AGE GROUP" "STATISTICS"
# [9] "DATA TYPE" "UOM"
#[11] "UOM_ID" "SCALAR_FACTOR"
#[13] "SCALAR_ID" "VECTOR"
#[15] "COORDINATE" "VALUE"
#[17] "STATUS" "SYMBOL"
#[19] "TERMINATED" "DECIMALS"

My solution:
Creating and expression and evaluating it. It is short but somehow does not feel like it is the proper way of doing things since I am breaking away from the functional paradigm of R.
cleanup_column_names_tbl <- function(PID){
for(p in PID){
expr1 <- paste0("names(", paste0("tbl_",p), ") <- apply_alias(names(", paste0("tbl_",p),"))")
eval(rlang::parse_expr(expr1))
}
}
Edit:
A slightly different way that:
avoids using strings
a little more flexible since it allows the use of strings and symbol
library(rlang)
test_df <- data.frame(a=1:10,b=1:10)
test_df2 <- data.frame(a=1:10,b=1:10)
fix_names <- function(df){
x <- ensym(df)
expr1 <- expr(names(!!x) <- toupper(names(!!x)))
eval(expr1, envir = parent.env(environment()))
# expr1
}
fix_names(test_df)
fix_names("test_df2")
names(test_df)
#> [1] "A" "B"
names(test_df2)
#> [1] "A" "B"

Related

Parallelizing a loop with updating during each iteration

I have some R code that puts together demographic data from the Census for all of states in the US into a list object. The block-level code can take a week to run as a sequential loop since there are ~11M blocks, so I am trying to parallelize the loop over states to make it faster. I have accomplished this goal with this:
states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
"ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
"MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
"ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
"VT","VA","WA","WV","WI","WY","DC","PR")
library(future.apply)
plan(multiprocess)
ptm <- proc.time()
CensusObj_block_age_sex = list()
CensusObj_block_age_sex[states] <- future_lapply(states, function(s){
county <- census_geo_api(key = "XXX", state = s, geo = "county", age = TRUE, sex = TRUE)
tract <- census_geo_api(key = "XXX", state = s, geo = "tract", age = TRUE, sex = TRUE)
block <- census_geo_api(key = "XXX", state = s, geo = "block", age = TRUE, sex = TRUE)
censusObj[[s]] <- list(state = s, age = TRUE, sex = TRUE, block = block, tract = tract, county = county)
}
)
However, I need to make it more robust. Sometimes there are problem with the Census API, so I would like the CensusObj to be updated at each state iteration so that I don't loose my completed data if something wrong. That way I can restart the loop over the remaining state if something does goes wrong (like if I spell "WY" as "WU")
Would it be possible to accomplish this somehow? I am open to other methods of parallelization.
The code above runs, but it seems to run into memory issues:
Error: Failed to retrieve the value of MultisessionFuture (future_lapply-3) from cluster RichSOCKnode #3 (PID 80363 on localhost ‘localhost’). The reason reported was ‘vector memory exhausted (limit reached?)’. Post-mortem diagnostic: A process with this PID exists, which suggests that the localhost worker is still alive.
I have R_MAX_VSIZE = 8Gb in my .Renviron, but I am not sure how that would get divided between the 8 cores on my machine. This all suggests that I need to store the results of each iteration rather than try to keep it all in memory, and then append the objects together at the end.
Here is a solution that uses doParallel (with the options for UNIX systems, but you can also use it on Windows, see here) and foreach that stores the results for every state separately and afterwards reads in the single files and combines them to a list.
library(doParallel)
library(foreach)
path_results <- "my_path"
ncpus = 8L
registerDoParallel(cores = ncpus)
states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
"ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
"MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
"ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
"VT","VA","WA","WV","WI","WY","DC","PR")
results <- foreach(state = states) %dopar% {
county <- census_geo_api(key = "XXX", state = state, geo = "county", age = TRUE, sex = TRUE)
tract <- census_geo_api(key = "XXX", state = state, geo = "tract", age = TRUE, sex = TRUE)
block <- census_geo_api(key = "XXX", state = state, geo = "block", age = TRUE, sex = TRUE)
results <- list(state = state, age = TRUE, sex = TRUE, block = block, tract = tract, county = county)
# store the results as rds
saveRDS(results,
file = paste0(path_results, "/", state, ".Rds"))
# remove the results
rm(county)
rm(tract)
rm(block)
rm(results)
gc()
# just return a string
paste0("done with ", state)
}
library(purrr)
# combine the results to a list
result_files <- list.files(path = path_results)
CensusObj_block_age_sex <- set_names(result_files, states) %>%
map(~ readRDS(file = paste0(path_results, "/", .x)))
You could use a tryCatch inside future_lapply to try to relaunch the calculation in case of API error, for a maximum of maxtrials.
In the resulting list, you get for each calculation the number of trials and the final status, OK or Error:
states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
"ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
"MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
"ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
"VT","VA","WA","WV","WI","WY","DC","PR")
library(future.apply)
#> Le chargement a nécessité le package : future
plan(multiprocess)
ptm <- proc.time()
maxtrials <- 3
census_geo_api <-
function(key = "XXX",
state = s,
geo = "county",
age = TRUE,
sex = TRUE) {
paste(state,'-', geo)
}
CensusObj_block_age_sex <- future_lapply(states, function(s) {
ntrials <- 1
while (ntrials <= maxtrials) {
hasError <- tryCatch({
#simulate random error
if (runif(1)>0.3) {error("API failed")}
county <- census_geo_api(key = "XXX", state = s, geo = "county", age = TRUE, sex = TRUE)
tract <- census_geo_api(key = "XXX", state = s, geo = "tract", age = TRUE, sex = TRUE)
block <- census_geo_api(key = "XXX", state = s, geo = "block", age = TRUE, sex = TRUE)
},
error = function(e)
e)
if (inherits(hasError, "error")) {
ntrials <- ntrials + 1
} else { break}
}
if (ntrials > maxtrials) {
res <- list(state = s, status = 'Error', ntrials = ntrials-1, age = NA, sex = NA, block = NA, tract = NA, county = NA)
} else {
res <- list(state = s, status = 'OK' , ntrials = ntrials, age = TRUE, sex = TRUE, block = block, tract = tract, county = county)
}
res
}
)
CensusObj_block_age_sex[[1]]
#> $state
#> [1] "AL"
#>
#> $status
#> [1] "OK"
#>
#> $ntrials
#> [1] 3
#>
#> $age
#> [1] TRUE
#>
#> $sex
#> [1] TRUE
#>
#> $block
#> [1] "AL - block"
#>
#> $tract
#> [1] "AL - tract"
#>
#> $county
#> [1] "AL - county"
<sup>Created on 2020-08-19 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
One possible solution that I have is to log the value of CensusObj to a text file i.e print the CensusObj in each iteration. The doSNOW package can be used for logging for example
library(doSNOW)
cl <- makeCluster(1, outfile="abc.out")
registerDoSNOW(cl)
states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
"ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
"MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
"ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
"VT","VA","WA","WV","WI","WY","DC","PR")
foreach(i=1:length(states), .combine=rbind, .inorder = TRUE) %dopar% {
county <- "A"
tract <- "B"
block <- "C"
censusObj <- data.frame(state = states[i], age = TRUE, sex = TRUE, block = block, tract = tract, county = county)
# edit: print json objects to easily extract from the file
cat(sprintf("%s\n",rjson::toJSON(censusObj)))
}
stopCluster(cl)
This would log the value of censusObj in abc.out and also logs the error if program crashes but you will get the latest value of censusObj logged in abc.out.
Here is the output of the last iteration from the log file:
Type: EXEC {"state":"PR","age":true,"sex":true,"block":"C","tract":"B","county":"A"} Type: DONE
Type:EXEC means that the iteration has started and Type:DONE means execution is completed. The result of cat will be present between these two statements of each iteration. Now, the value of CensusObj can be extracted from the log file as shown below:
Lines = readLines("abc.out")
results = list()
for(i in Lines){
# skip processing logs created by doSNOW
if(!startsWith(i, "starting") && !startsWith(i, "Type:")){
results = rlist::list.append(results, jsonlite::fromJSON(i))
}
}
results will contain the elements all the values printed in abc.out.
> head(results, 1)
[[1]]
[[1]]$state
[1] "AL"
[[1]]$age
[1] TRUE
[[1]]$sex
[1] TRUE
[[1]]$block
[1] "C"
[[1]]$tract
[1] "B"
[[1]]$county
[1] "A"
It is not a very clean solution but works.

Group and summarise character in R?

I this in R
> con_promedio_por_curso_transpuesta
A B C D
Description "Abc" "Bcd" "Cde" "Def"
mean(X7) "5.000000" "4.105263" "4.733333" "4.680000"
mean(X8) "5.000000" "3.736842" "4.400000" "4.760000"
mean(X9) "5.000000" "3.950000" "4.600000" "4.840000"
mean(X10) "5.000000" "4.210526" "4.333333" "4.560000"
I want to delete the first row
"Abc" "Bcd" "Cde" "Def"
I run the next instruction:
without_first_row <- con_promedio_por_curso_transpuesta[-c(1),])
Until now, everything is OK, but. If I want to group and summarise the without_first_row I get an error.
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('matrix', 'character')"
I run the type of data,
> typeof(con_promedio_por_curso_transpuesta)
[1] "character"
How I cast "character" to any type of data for grouping ?
Thanks.
You have an error in your syntax near where you create without_first_row, follow along below:
con_promedio_por_curso_transpuesta <-
data.frame(
row.names = c('Description','mean(X7)','mean(X8)','mean(X9)','mean(X10)'),
'A' = c("Abc","5.000000","5.000000","5.000000","5.000000"),
'B' = c("Bcd","4.105263","3.736842","3.950000","4.210526"),
'C' = c("Cde","4.733333","4.400000","4.600000","4.333333"),
'D' = c("Def","4.680000","4.760000","4.840000","4.560000"),
'ID' = c(NA, 1, 1, 2, 2) # added for this example
)
## without_first_row <- con_promedio_por_curso_transpuesta[-c(1), ]) <- this is your error, you added a ')' unnecessarily
without_first_row <- con_promedio_por_curso_transpuesta[-c(1), ] # with fixed syntax
> class(without_first_row)
[1] "data.frame"
# just to show you can group_by and summarise with data
without_first_row %>%
mutate_at(.vars = vars(c(A,B,C,D)), funs(as.numeric)) %>%
group_by(ID) %>%
summarise_all(mean)

Loop error: Error in `[.data.frame`(company, , i) : undefined columns selected

I have 2 data.frames:
The first includes Companies ("n" columns)
The second a benchmark (1 column)
Both have the same length of rows
After the calculations, each column (in both of them) has "k" levels more.
Eg. if the first data.frame had 3 Companies in each column {a,b,c} and 3 levels then it will be like {a.1, b.1, c.1, a.2,..., c.3} and the second will be {bench.1,bench.2,bench.3}
I'm trying to create a table that has the companies as rows and the regression coefficients between the two data.frames as columns.
Here's what I have tried so far:
a1 <- data.frame(); b1 <- data.frame(); rs <- data.frame()
# company: the company data.frame (levels included)
# benchmark: the benchmark data.frame (levels included)
# levels: the levels
k <- ncol(company)/levels
l <- 1 - k
for (j in 1:levels) {
l <- l + k
k <- k + k
for (i in l:k) {
mod <- lm(company[,i] ~ benchmark[,j])
a1[i,j] <- mod$coefficients[1]
b1[i,j] <- mod$coefficients[2]
rs[i,j] <- summary(mod)$adj.r.squared
}
}
table <- data.frame('Alpha_Coef' = a1, 'Beta_Coef' = b1, 'Adj.R_Squared' = rs)
I would appreciate any kind of help.
Thank you very much
Edit:
The Company Dataframe (4 levels)
> dput(wsf[1:30, seq(1, 356, 32)])
structure(list(W1free = c(-0.0412000679217938, 0.0880096344938139,
0.050722366154044, -0.0946016947369898, 0.0473490608961103, -0.0316948337480497,
-0.000178721497686476, 0.0330043255435735, -0.0130213571012855,
-0.0210134019518695, 0.00515539322334187, 0.0313594382649759,
-0.0636728947738869, 0.069783976440601, -0.0233500288324757,
0.000310577992801808, -0.0106422243645882, -0.00844960399798073,
0.0207508743372746, -0.00655907030852679, -0.00917878754640728,
0.0422985993411094, -0.0583530401655691, 0.00316447724460396,
0.0548253425520056, -0.0303784828505676, -0.0178226463347567,
0.0141621070101102, 0.0040525928183183, 0.00756095153768141),
W1free.32 = c(-0.0303584128041137, 0.0781259056778406, 0.0543601664896256,
-0.0838929634247089, 0.0168995513463754, -0.00116320770298013,
-0.00661142266491474, 0.00316786967673743, 0.014865018488982,
0.00115744890492633, -0.0133139018524208, 0.00357666761370058,
-0.00984757332103789, 0.0121936464190736, -0.0116509859273921,
0.0214700671630899, -9.18974370460937e-05, -0.0291337664969924,
0.0183613008541499, -0.00623650470163079, 0.00188765595333864,
0.0169715922156494, -0.0202622159786693, 0.00117256873232021,
0.0188193851765861, -0.0330732214410425, 0.0276667753578936,
-0.00912446778879232, -0.00645760419877771, 0.0229548860537124
), W1free.64 = c(-0.041320747786529, 0.0940721256003125,
0.0546722155282992, -0.0776194922127773, -0.0188218918393148,
0.0369018784168118, -0.0159710004093249, 0.00270145290316121,
0.00593858603645259, -0.0018398928749183, -0.0203943374367289,
0.0409079997097826, -0.0302107159471338, 0.00687409495325281,
-0.00278673048827845, 0.00579509928589942, 0.00450071408544868,
-0.0167040789608537, 0.0183985148340536, -0.0140021183403113,
0.00302717840065282, 0.010014802876346, -0.0108024225605952,
0.00664477277427707, -0.00998679780654633, 0.0093455115960892,
-0.0157862445056117, 0.0278399667684328, -0.0124817099428715,
-0.00574730232120169), W2free.7 = c(-0.723798800294376, -1.69462545197809,
-0.322808715747857, 0.983076466907199, 0.601543663890055,
0.00738341380613653, -0.17043040850117, -0.130795012258449,
0.0877012527311328, 0.144654492689526, -0.0233580207307919,
-0.130528770674957, -0.0811443177437124, 0.00455323978774625,
0.0322004951423808, 0.041028915171238, 0.0397928561027241,
-0.0146377819487017, -0.0564390552961818, -0.0157108172121778,
0.0418289577687561, 0.0407167980322358, -0.000920052532455742,
-0.0455497658600707, -0.0588733856819542, 0.00103356528290383,
0.072999787419253, 0.0566294953344014, -0.0139270074576876,
-0.051740788996522), W2free.39 = c(-0.719321984671057, -1.67269117285615,
-0.320838224757322, 0.99401218132634, 0.631713043453369,
0.00311609762638836, -0.215156520118741, -0.148974422352992,
0.0967557896976805, 0.120695957831705, -0.0371356083199943,
-0.0698672528876954, -0.0239853298540248, -0.00317615704745786,
0.0069410289690482, 0.00582109106839021, 0.00118481377532628,
-0.000737420044277918, -0.00796548329789375, -0.000991745340389675,
0.00911571004844975, 0.00143748164424383, -0.00218700317993713,
0.00467672676232776, 0.0132823977786716, 0.00346746468015986,
-0.0284915884601326, -0.0246832031983181, 0.0100296772438092,
0.0157902406702257), W2free.71 = c(-0.743286786555513, -1.68048071636107,
-0.304449505870913, 1.00901538641735, 0.626394565701536,
-0.00615227460862292, -0.212468284990251, -0.146055149093144,
0.0968104699798898, 0.129746894309015, -0.0313266854247715,
-0.0756053816865943, -0.0334202489014552, -0.0195998665482904,
0.010111846958308, 0.0392121845384003, 0.00544740230543033,
-0.0252594118720365, -0.0189333126817274, -0.0200581449364624,
0.00172381708912141, 0.0377707777006468, 0.0253024954648705,
0.00375902559668281, -0.000786565490537995, -0.0183404009358504,
-0.0144607525884234, 0.00635615971857548, -0.01995715314952,
-0.0339339332556635), W3free.14 = c(-0.541971799096092, -0.547254597852541,
-0.351417129791281, -0.079614566169948, 0.190159527265416,
0.369077378264343, 0.344217272166032, 0.307260788133197,
0.335851696193595, 0.389852108956907, 0.508410193658837,
0.454576753894716, 0.0246283521212561, -0.468731535325678,
-0.911314207864435, -1.10081031169091, -0.693894242109856,
-0.176018831349174, 0.268013221596553, 0.584156224372314,
0.483437264811605, 0.31842908645562, 0.211725648202932, 0.0860050283532339,
0.0374128558225933, -0.0175973788383141, -0.104572345794254,
-0.139452927246105, -0.102203406030287, -0.0486154328959697
), W3free.46 = c(-0.551809713896474, -0.554226067631237,
-0.344691113458163, -0.0662397128588583, 0.210840558688842,
0.392862015453828, 0.361004880017924, 0.315708806714992,
0.335952251181492, 0.38343966622271, 0.499384154321493, 0.451678337070302,
0.0251530591544403, -0.472351170164682, -0.920102359468484,
-1.12201190642854, -0.722117102849035, -0.19757656345329,
0.257496857320272, 0.585660993514737, 0.489353812024943,
0.318486723025294, 0.201872000937277, 0.0775858599501224,
0.0511079098302739, 0.0245387675955284, -0.0362827260167453,
-0.0723029499363708, -0.0746172353011349, -0.0670522978529944
), W3free.78 = c(-0.54111517653805, -0.544975921014698, -0.340169990337836,
-0.0694109571903436, 0.196448997132187, 0.3710738269895,
0.339483189903327, 0.305610659199469, 0.344043131600595,
0.405907239122663, 0.527335328093289, 0.469685973096099,
0.02498782328932, -0.485994428238778, -0.940087328347237,
-1.13732999528212, -0.724247605957125, -0.192167012739451,
0.266786191889965, 0.598018937656833, 0.497551375682964,
0.324775953859068, 0.203400777243459, 0.0638344266267895,
0.0231419137529647, -0.0116987397715524, -0.0687128256203292,
-0.0832420106844102, -0.056424343917657, -0.0291186092992814
), W4free.21 = c(0.118918655781396, 0.108718779015937, 0.102775253233938,
0.0828524472480944, 0.0410407234124081, -0.012381397981727,
-0.0746511451766081, -0.126967081904642, -0.138260372435415,
-0.142146232655087, -0.144033266870221, -0.113074511945608,
-0.0670368640581458, -0.0159506388411364, 0.0576398958821921,
0.114505138680487, 0.138987380604024, 0.17298154560344, 0.190728780916805,
0.172497528928508, 0.148714427830889, 0.111056896498362,
0.0322016534964464, -0.0411902802201208, -0.108714452830289,
-0.193403780805043, -0.242468575602981, -0.2518526201175,
-0.226528938042806, -0.16536045623284), W4free.53 = c(0.110582824081414,
0.0990631039125744, 0.092803742919437, 0.07531563115484,
0.0364487613311845, -0.0148493771602927, -0.0766616219694969,
-0.130115230718347, -0.140678702662368, -0.141590690581832,
-0.140856314457181, -0.108167295871846, -0.0604430195571222,
-0.00767211515738517, 0.0691273756544755, 0.129924506626998,
0.154897007967789, 0.186742438470891, 0.201270731261766,
0.177042350952871, 0.147222730892732, 0.105238351658075,
0.0205834624120472, -0.0576989254447123, -0.127161196169879,
-0.214118057849679, -0.26440281808828, -0.269229187033526,
-0.237685035633724, -0.172411380276773), W4free.85 = c(0.118750904966437,
0.105876306066672, 0.0980390856567921, 0.0763519039332738,
0.0332059062273466, -0.0198714852858678, -0.0829673978857073,
-0.13673208596903, -0.147663483183383, -0.150600183457387,
-0.15128240500669, -0.117726475726204, -0.0690026777475602,
-0.0158895662765118, 0.0616902413279015, 0.122399296935057,
0.147735045409867, 0.181929151536837, 0.19882362481352, 0.177391946124788,
0.150509412599221, 0.110259155225688, 0.0273385567351872,
-0.0487018627269797, -0.116082830918324, -0.200781736711009,
-0.248333357539446, -0.252999585674256, -0.222777574439045,
-0.157452460819487)), .Names = c("W1free", "W1free.32", "W1free.64",
"W2free.7", "W2free.39", "W2free.71", "W3free.14", "W3free.46",
"W3free.78", "W4free.21", "W4free.53", "W4free.85"), row.names = c(NA,
30L), class = "data.frame")
Benchmark Dataframe (4 levels):
> dput(wbf[1:30,])
structure(list(W1free = c(-0.0432455343158943, 0.0874909119474779,
0.0571001565877175, -0.0884507864841058, 0.0172050650848323,
-0.00224255555827861, 0.000722631485183195, 0.00993908327736091,
-0.0117418265042027, 0.00534988466357996, -0.000940017698511891,
0.00546041849046642, -0.0103733510325452, 0.00956961911965048,
-0.0113459297172276, 0.0133958729726377, -0.0029416252299541,
-0.011414032468681, 0.0141883947414388, -0.00822524736978799,
0.00316136152084385, 0.00521921925859321, -0.0159379140722575,
0.00678918411451175, 0.0148541387078586, -0.0129222646391021,
-0.00765917374935741, 0.0110142405769407, -0.00458116020357859,
0.0168472964297673), W2free = c(-0.740522709681838, -1.67769972639025,
-0.307890697506331, 1.00015354619176, 0.626337664518445, 0.00505640584038886,
-0.205521408512004, -0.150946308038834, 0.0857142318047878, 0.123609501408063,
-0.0266672430532372, -0.0735840422655566, -0.0316973358168455,
-0.00275393824752691, 0.00718734780106011, 0.00446732335756168,
0.000838919259401042, 0.00297041882635259, 6.04838862344908e-05,
-0.00405610663093049, -0.00221649131469466, 0.00381412534328539,
0.0066934836198532, -0.000657836733240365, -0.00444029910039748,
0.00143349288299151, 0.00265511863053606, -0.000743159102980881,
-0.00661664414444181, -0.00754557691213225), W3free = c(-0.537883232834564,
-0.543185462048805, -0.341173228804443, -0.0719862903967931,
0.195766331288493, 0.373205978054011, 0.344376017839598, 0.312014362375209,
0.346039880023554, 0.403777644728145, 0.52334517757098, 0.46282661064603,
0.0212821638525336, -0.487411032478064, -0.942703893546479, -1.13740960834218,
-0.725250237305827, -0.192667529664807, 0.268680928939938, 0.598063378890107,
0.499012500582589, 0.328774353296679, 0.21316016955454, 0.0826700443418289,
0.0440645204568091, 0.00181643263159242, -0.0732510545397724,
-0.107882422379997, -0.0915052281639699, -0.0597510688882035),
W4free = c(0.120565311954663, 0.110601962226259, 0.105408918589852,
0.0865592864118751, 0.0450732522189251, -0.00895997768433414,
-0.0730317587513421, -0.127472031073358, -0.139664402331508,
-0.144382004669276, -0.14768736817263, -0.118078134288994,
-0.0729438783923505, -0.0220437686110972, 0.0529380577249222,
0.111858330376543, 0.137386196550584, 0.172903566190241,
0.192861994141253, 0.176582979701844, 0.1548325160191, 0.118996625676438,
0.040367522055528, -0.0335387828973377, -0.101185748094811,
-0.187175302200655, -0.238647127560718, -0.250387805121693,
-0.227657278667077, -0.16931414087988)), .Names = c("W1free",
"W2free", "W3free", "W4free"), row.names = c(NA, 30L), class = "data.frame")
Consider a vectorized approach using Map where you pass in each dataframe's column names to a user defined function, reghandle, to run regression and return model results.
However, to use Map, lengths must match. Hence benchmark colnames are replicated by the k levels and sorted to align. At end, the function's returned list is iteratively cast to dataframe rows, then are binded together in a do.call(rbind, ...):
reghandle <- function(x, y){
mod <- lm(company[[x]] ~ benchmark[[y]])
return(list(Alpha_coef = unname(mod$coefficients[1]),
Beta_coef = unname(mod$coefficients[2]),
Adj.R_Squared = unname(summary(mod)$adj.r.squared)))
}
k <- ncol(company)/levels
benchmarknames <- sort(rep(names(benchmark), k))
benchmarknames
# [1] "W1free" "W1free" "W1free" "W2free" "W2free" "W2free" "W3free" "W3free"
# [9] "W3free" "W4free" "W4free" "W4free"
tablelist <- Map(reghandle, names(company), benchmarknames)
table <- do.call(rbind, lapply(tablelist, data.frame))
table
# Alpha_coef Beta_coef Adj.R_Squared
# W1free -0.0003729282 1.1575992 0.6540945
# W1free.32 0.0003352187 0.9155197 0.8303173
# W1free.64 -0.0002986047 0.9099407 0.6990809
# W2free.7 -0.0003745362 0.9973343 0.9925469
# W2free.39 0.0002856617 0.9957494 0.9991361
# W2free.71 -0.0011951258 1.0032802 0.9986368
# W3free.14 -0.0028038826 0.9759708 0.9991466
# W3free.46 0.0023389397 0.9871707 0.9987852
# W3free.78 0.0009924749 0.9972702 0.9992721
# W4free.21 -0.0012718476 0.9931372 0.9990408
# W4free.53 -0.0036123789 1.0318762 0.9914355
# W4free.85 -0.0029080074 1.0139958 0.9964788

Applying `ar` (autoregressive model) for my data frame using `lapply` returns `numeric(0)`?

I'm working with a data.frame with all numeric data. I want to calculate the first order autoregressive coefficients for each column. I chose apply function to do the task and I defined a function as the following:
return.ar <- function(vec){
return(as.numeric(ar(vec)$ar))
}
Then I applied it to a data frame I subset by column names as the following
lapply(df_return[,col.names],return.ar)
I was expecting to get a vector with ar coefficients. But instead I got a list with all the coefficients put in the first element like the following
$C.Growth
[1] 0.35629140 -0.07671252 -0.08699333 -0.27404355 0.21448342
[6] -0.19049197 0.06610908 -0.23077602
$Mkt.ret
numeric(0)
$SL
numeric(0)
$SM
numeric(0)
$SH
numeric(0)
$LL
numeric(0)
$LM
numeric(0)
$LH
numeric(0)
I don't understand what's going on.
The output of dput(head(df_return)) looks like the following:
structure(list(Year = c(1929, 1930, 1931, 1932, 1933, 1934),
C.Growth = c(0.94774902516838, 0.989078396169958, 0.911586749357132,
0.996183522774413, 1.08170234030149, 1.05797659377887), S.Return = c(-19.7068321696574,
-31.0834309393085, -45.2864376593084, -9.42504715968666,
57.0992131145999, 4.05781718258972), Rf = c(4.79316783034255,
2.58656906069154, 1.24356234069162, 0.954952840313344, 0.199213114599945,
0.147817182589718), Inflation = c(-0.0531678303425544, -0.15656906069154,
-0.15356234069162, -0.00495284031334435, 0.100786885400055,
0.0321828174102824), Mkt.ret = c(-14.9668321696574, -28.6534309393085,
-44.1964376593084, -8.47504715968666, 57.3992131145999, 4.23781718258972
), SL = c(-45.2568321696575, -35.1134309393085, -41.1864376593084,
-5.28504715968666, 166.0392131146, 34.1378171825897), SM = c(-30.7368321696574,
-31.9034309393085, -48.5364376593084, -8.94504715968666,
118.7092131146, 19.7578171825897), SH = c(-36.7568321696575,
-45.1834309393085, -51.5364376593084, 2.78495284031334, 125.7792131146,
7.95781718258972), LL = c(-19.6968321696574, -26.2734309393085,
-36.2264376593084, -7.31504715968666, 44.1492131145999, 10.6978171825897
), LM = c(0.673167830342554, -29.2434309393085, -59.9864376593084,
-16.7150471596867, 89.4692131145999, -2.93218281741028),
LH = c(-4.35683216965745, -43.1934309393085, -57.7364376593084,
-4.30504715968666, 114.7092131146, -21.8421828174103)), .Names = c("Year",
"C.Growth", "S.Return", "Rf", "Inflation", "Mkt.ret", "SL", "SM",
"SH", "LL", "LM", "LH"), row.names = c(NA, 6L), class = "data.frame")
Once you include your data, diagnose becomes easy.
ar will do auto-section of p based on AIC. Some of your columns have strong evidence to be white noise, hence ar has selected p = 0, in which case $ar field will be numeric(0).
I suggest you also use the following:
lapply(df_return[col.names], function (x) ar(x, order.max = 5)$order)
or even better:
fit_ar <- function(x) ar(x, order.max = 5)[c("order", "ar")]
lapply(df_return[col.names], fit_ar)
The latter returns both p as well as AR coefficients for each column. I have set order.max = 5, so that ar won't choose it itself.
You tried to convince me that lapply is doing wrong, by using this for loop:
ar.vec <- numeric()
for (name in col.names)
ar.vec <- c(ar.vec, return.ar(df_return[[ name ]]))
But unfortunately you won't get anything useful from this. Note you used concatenation c(), thus there is no way to tell which coefficient is for which column.
lapply is not identical to such loop. You should use:
ar.vec <- vector("list", length(col.names))
for (i in 1:length(col.names))
ar.vec[[i]] <- return.ar(df_return[[ col.names[i] ]])

R and Data Selection

I have a data table dt, as given below:
structure(list(IM = c(0.830088495575221, 0.681436210847976, 0.498810939357907,
0.47265400115141, 0.527908540685945, 0.580763582966226, 0.408069043807859,
0.467368671545006, 0.44662887412295, 0.0331974034502217, 0.0368210899219588,
0.0333698233772947, 0.0294312465832275, 0.578743426515361, 0.566950053134963,
0.808756701221038, 0.585507838980771, 0.61507839619537, 0.586388329979879,
0.794196637085474), CM = c(0.876991150442478, 0.996180290297937,
0.651605231866825, 0.824409902130109, 0.94418291862811, 0.961820851688693,
0.943861532396347, 1.10137922144883, 1.1524325077831, 0.128868067469359,
0.155932251596297, 0.159414951213752, 0.196968075413411, 1.19678937171326,
0.901168969181722, 3.42528220866977, 2.4377239516641, 2.0040870054458,
1.86099597585513, 1.51928615911568), RM = c(0.601769911504425,
0.495034377387319, 0.405469678953627, 0.368451352907311, 0.361802286482851,
0.320851688693098, 0.791548118347242, 0.816050925099649, 0.786622368849031,
0.545805622636092, 0.594370732740163, 0.594771872860171, 0.536043514857356,
0.617215610296153, 0.619287991498406, 0.602602774009141, 0.634069706132375,
0.596543561108693, 0.582203219315895, 0.695985131558462)), .Names = c("IM", "CM", "RM"), class = c("data.table", "data.frame"), row.names
= c(NA,
-20L), .internal.selfref = <pointer: 0x00000000003f0788>)
I have written a function as given below:
DSanity.markWinsorize <- function(dt, colnames)
{
PERnames <- unlist(lapply(colnames, function(x) paste0("PER",x)));
print(dt[,colnames])
if(length(colnames)>1)
{dt[,PERnames] <- sapply(dt[,colnames], Num.calPtile);}
else
{dt[,PERnames] <- Num.calPtile(dt[,colnames]);}
return(dt)
}
## Calculate Percentile score of a data vector
Num.calPtile <- function(x)
{
return((ecdf(x))(x))
}
The job of this function is to create new columns, calculating the percentile of each of the data points for the columns provided to the function markWinsorize.
Here I am trying to run the function markWinsorize:
colnames <- c('CM','AM','BM')
DSanity.markWinsorize(dt,colnames)
I get the following error:
> sdc1 <- DSanity.markWinsorize(sdc,colnames)
[1] "CM" "AM" "BM"
Show Traceback
Rerun with Debug
Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant", :
zero non-NA points In addition: Warning message:
In xy.coords(x, y) : NAs introduced by coercion
It would be great if some of you can help me out here. Thanks.
Your approach is quite unwieldy. I recommend a completely new approach.
library(dplyr)
colnames <- c("CM", "AM", "BM")
dt %>%
select_(.dots = colnames) %>%
mutate_each(funs(ntile(., 100)))
I think this gives what you want (perhaps with the addition of %>% bind_cols(dt)).

Resources