R and Data Selection - r

I have a data table dt, as given below:
structure(list(IM = c(0.830088495575221, 0.681436210847976, 0.498810939357907,
0.47265400115141, 0.527908540685945, 0.580763582966226, 0.408069043807859,
0.467368671545006, 0.44662887412295, 0.0331974034502217, 0.0368210899219588,
0.0333698233772947, 0.0294312465832275, 0.578743426515361, 0.566950053134963,
0.808756701221038, 0.585507838980771, 0.61507839619537, 0.586388329979879,
0.794196637085474), CM = c(0.876991150442478, 0.996180290297937,
0.651605231866825, 0.824409902130109, 0.94418291862811, 0.961820851688693,
0.943861532396347, 1.10137922144883, 1.1524325077831, 0.128868067469359,
0.155932251596297, 0.159414951213752, 0.196968075413411, 1.19678937171326,
0.901168969181722, 3.42528220866977, 2.4377239516641, 2.0040870054458,
1.86099597585513, 1.51928615911568), RM = c(0.601769911504425,
0.495034377387319, 0.405469678953627, 0.368451352907311, 0.361802286482851,
0.320851688693098, 0.791548118347242, 0.816050925099649, 0.786622368849031,
0.545805622636092, 0.594370732740163, 0.594771872860171, 0.536043514857356,
0.617215610296153, 0.619287991498406, 0.602602774009141, 0.634069706132375,
0.596543561108693, 0.582203219315895, 0.695985131558462)), .Names = c("IM", "CM", "RM"), class = c("data.table", "data.frame"), row.names
= c(NA,
-20L), .internal.selfref = <pointer: 0x00000000003f0788>)
I have written a function as given below:
DSanity.markWinsorize <- function(dt, colnames)
{
PERnames <- unlist(lapply(colnames, function(x) paste0("PER",x)));
print(dt[,colnames])
if(length(colnames)>1)
{dt[,PERnames] <- sapply(dt[,colnames], Num.calPtile);}
else
{dt[,PERnames] <- Num.calPtile(dt[,colnames]);}
return(dt)
}
## Calculate Percentile score of a data vector
Num.calPtile <- function(x)
{
return((ecdf(x))(x))
}
The job of this function is to create new columns, calculating the percentile of each of the data points for the columns provided to the function markWinsorize.
Here I am trying to run the function markWinsorize:
colnames <- c('CM','AM','BM')
DSanity.markWinsorize(dt,colnames)
I get the following error:
> sdc1 <- DSanity.markWinsorize(sdc,colnames)
[1] "CM" "AM" "BM"
Show Traceback
Rerun with Debug
Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant", :
zero non-NA points In addition: Warning message:
In xy.coords(x, y) : NAs introduced by coercion
It would be great if some of you can help me out here. Thanks.

Your approach is quite unwieldy. I recommend a completely new approach.
library(dplyr)
colnames <- c("CM", "AM", "BM")
dt %>%
select_(.dots = colnames) %>%
mutate_each(funs(ntile(., 100)))
I think this gives what you want (perhaps with the addition of %>% bind_cols(dt)).

Related

Error in if (class(x) == "numeric") { : the condition has length > 1

I`m trying to visualise data of the following form:
date volaEUROSTOXX volaSA volaKENYA25 volaNAM volaNIGERIA
1 10feb2012 0.29844454 0.1675901 0.007862087 0.12084170 0.10247617
2 17feb2012 0.31811157 0.2260064 0.157017220 0.33648935 0.22584127
3 24feb2012 0.30013672 0.1039974 0.083863921 0.11694768 0.16388161
To do so, I first converted the date (stored as a character in the original data frame) into a date-format. Which works just fine:
vola$date <- as.Date(vola$date)
str(vola$date)
Date[1:543], format: "2012-02-10" "2012-02-17" "2012-02-24" "2012-03-02" "2012-03-09"
However, if I now try to graph my data by using the chart.TimeSeries command, I get the following:
chart.TimeSeries(volatility_annul_stringdate,lwd=2,auto.grid=F,ylab="Annualized Log Volatility",xlab="Time",
main="Log Volatility",lty=1,
legend.loc="topright")
Error in if (class(x) == "numeric") { : the condition has length > 1
I tried:
Converting my date variable (in the date format) further into a time series object:
vola$date <- ts(vola$date, frequency=52, start=c(2012,9)) #returned same error from above
Converting the whole data set using its-command:
vol.xts <- xts(vola, order.by= vola$date, unique = TRUE ) # which then returned:
order.by requires an appropriate time-based object
#even though date is a time-series
What am I doing wrong? I am rather new to RStudio.. I really want to use the chart.TimeSeries command. Can someone help me?
Thanks in advance!
My MRE:
library(PerformanceAnalytics)
vola <- structure(list(date_2 = c("2012-02-10", "2012-02-17", "2012-02-24",
"2012-03-02"), volaEUROSTOXX = c(0.298444539308548, 0.318111568689346,
0.300136715173721, 0.299697518348694), volaKENYA25 = c(0.00786208733916283,
0.157017216086388, 0.0838639214634895, 0.152377054095268), volaNAM = c(0.120841704308987,
0.336489349603653, 0.116947680711746, 0.157027021050453), volaNIGERIA = c(0.102476172149181,
0.225841268897057, 0.163881614804268, 0.317349642515182), volaSA = c(0.167590111494064,
0.226006388664246, 0.103997424244881, 0.193037077784538), date = structure(c(1328832000,
1329436800, 1330041600, 1330646400), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
vola <- subset(vola, select = -c(date))
vola$date_2 <- as.Date(vola$date_2)
chart.TimeSeries(vola,lwd=2,auto.grid=F,ylab="Annualized Log Volatility",xlab="Time",
main="Log Volatility",lty=1,
legend.loc="topright")
#This returns the above mentioned error message.
#Thus, I tried the following:
vola$date_2 <- ts(vola$date_2, frequency=52, start=c(2012,9))
chart.TimeSeries(vola,lwd=2,auto.grid=F,ylab="Annualized Log Volatility",xlab="Time",
main="Log Volatility",lty=1,
legend.loc="topright")
#Which returned a different error (as described above)
#And I tried:
vol.xts <- xts(vola, order.by= vola$date_2, unique = TRUE )
#This also returned an error message.
#My intention was to then run:
#chart.TimeSeries(vol.xts,lwd=2,auto.grid=F,ylab="Annualized Log Volatility",xlab="Time",
main="Log Volatility",lty=1,
legend.loc="topright")
The documentation of PerformanceAnalytics::chart.TimeSeries is a bit vague. The issue is that when passing a dataframe you have to set the dates as row.names. To this end I first converted your data (which is a tibble) to a data.frame. Afterwards I add the dates as rownames and drop the date column:
library(PerformanceAnalytics)
vola <- as.data.frame(vola)
vola <- subset(vola, select = -c(date))
row.names(vola) <- as.Date(vola$date_2)
vola$date_2 <- NULL
chart.TimeSeries(vola,
lwd = 2, auto.grid = F, ylab = "Annualized Log Volatility", xlab = "Time",
main = "Log Volatility", lty = 1,
legend.loc = "topright"
)

Error message using mstate::msprep ??bug?

I have had a problem with an error abend using mstate::msprep to prepare my data for a pretty classical 3 state problem. I can run the code from the mstate package vignette with no difficulty. My problem is entirely parallel to the vignette example. Subjects receive an islet transplant, then may achieve insulin independence. Whether they do or do not, they may have islet graft failure (or loss of insulin independence if it was achieved.) The vignette example works with included covariates (retained by the keep = parameter). My version works fine if I don't include the keep parameter but fails consistently if I use the keep parameter. Since my example works perfectly well without the keep variable, I very much doubt that there is a problem with my main data. It must be some problem with the “keep” data. See below for the session output.
Neither data set has any missing data. I tried the vignette data limiting it to three covariates -- one categorical, one continuous, and the third with one of the event-time variables, exactly parallel to my three covariates. The vignette still works perfectly, but mine doesn’t. Both covariate "keep" lists are character vectors. In sum, I can't imagine a more parallel "real" question to the vignette example.
I have tracked the problem to a subroutine of msprep "msprepEngine" at line 85 at the second time through the processing loop, but I haven't been able to figure out what the problem is. I suspect that it is a bug, but since I can't identify it, I can't be sure.
I would be very grateful for anyone that can help me with this issue. The vignette code is available with the package. Unfortunately I am not free to share my problem's data, but as I said above, the program works perfectly without the keep parameter. There must be something about my "keep" covariates that is giving the system indigestion.
Thanks in advance for any suggestions.
Larry Hunsicker
> library(magrittr)
> library(survival)
> library(mstate)
>
> #Three state tmat:
> data(ebmt3)
> names(msbmt)
[1] "id" "from" "to" "trans" "Tstart" "Tstop" "time" "status" "dissub" "age"
[11] "prtime"
> dim(msbmt)
[1] 5577 11
> tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
> covs <- c('dissub', 'age', 'drmatch', 'tcd', 'prtime')
> class(covs)
[1] "character"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"),
+ data = ebmt3, trans = tmat, id = 'id', keep = covs)
>
> names(insfree3)
[1] "PatientID" "YrFree" "Free" "YrLossFail" "LossFail" "StudyID" "IEQ_kg"
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID')
>
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> covs <- c('StudyID', 'IEQ_kg', 'YrFree')
> class(covs)
[1] "character"
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID', keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I found the problem, and it is a bug. I just don't know whose bug it is. msprep() works when data is a data.frame, but not when it is a tibble. My repro example:
> library(survival)
> library(mstate)
> library(dplyr)
> data(ebmt3)
> class(ebmt3)
[1] "data.frame"
> tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx",
+ "PR", "RelDeath"))
> ebmt3$prtime <- ebmt3$prtime/365.25
> ebmt3$rfstime <- ebmt3$rfstime/365.25
> covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
> ebmt3 <- as_tibble(ebmt3)
> class(ebmt3)
[1] "tbl_df" "tbl" "data.frame"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I tracked the error down to line 157 in msprep()
ddcovs <- lapply(1:nkeep, function(i) rep(keep[, i], tbl))
When data is a data.frame, this line works. When it is a tibble, it abends with the above error message.
It was my impression that things that work with a data.frame should also work with a tibble, since a tibble is a data.frame. So I'm not sure whether this is a bug in msprep() or in the code for a tibble. But the way to avoid the error is to be sure that the data parameter in the call to msprep() is a data.frame, but not a tibble.
Larry Hunsicker

Binding multiple shapefiles results in rownames error

I've got a list of around 20 shapefiles that I want to bind into one. These shapefiles have different number of fields - some have 1 and some have 2. Examples are shown below:
# 1 field
> dput(head(shp[[1]]))
structure(list(area = c(1.60254096388, 1.40740270051, 0.093933438653,
0.609245720277, 22.892748868, 0.0468096597394)), row.names = 0:5, class = "data.frame")
# 2 fields
> dput(head(shp[[3]]))
structure(list(per = c(61, 70, 79, 90, 57, 66), area = c(2218.8,
876.414, 2046.94, 1180.21, 1779.12, 122.668)), row.names = c(0:5), class = "data.frame")
I used the following code to bind them and it worked just as I wanted:
merged<- raster::bind(shp, keepnames= FALSE, variables = area)
writeOGR(merged, './shp', layer= 'area', driver="ESRI Shapefile")
However, I now need to subset one of the shapefiles in the list. I do it in this way:
shp[[3]]#data <- shp[[3]]#data %>% subset(Area >= 50)
names(shp[[3]]#data)[names(shp[[3]]#data) == "Area"] <- "area"
When I run the bind command, however, this now gives me an error:
merged<- raster::bind(shp, keepnames= FALSE, variables = area)
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: <Anonymous> ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted
I'm not sure why that is. The shapefile hasn't changed, they are just subsetted. I tried deleting the rownames in the way shown below and it still throws the same error.
rownames(shp[[3]]#data) <- NULL
What could it be?
I think the problem is that that you subset #data (the attributes) but you should subset the entire object. Something like this
x <- shp[[3]] # for simplicity
x <- x[x$Area >= 50, ]
names(x)[names(x) == "Area"] <- "area"
shp[[3]] <- x

Error: target of assignment expands to non-language object

I am trying to dynamically clean up some column names for a large number of tables and I get the above error.
I have a gut feeling that I should be using quo but I have no idea on how to do that.
Any ideas?
The apply_alias applies a set of business rules to clean the names.
apply_alias <- function(l){
which(l=="Geography")
l[which(l=="Geography")] <- "GEO"
toupper(l)
}
The cleanup_column_names_tbl applies the alias_function to a list of table
cleanup_column_names_tbl <- function(PID){
for(p in PID){
names(get(paste0("tbl_",p))) <- apply_alias(names(get(paste0("tbl_",p))))
}
}
cleanup_column_names_tbl("14100287")
When I try to run it i get the following error message:
> cleanup_column_names_tbl("14100287")
Error in names(get(paste0("tbl_", p))) <- apply_alias(names(get(paste0("tbl_", :
target of assignment expands to non-language object
Sample data:
> dput(tbl_14100287[1,])
structure(list(V1 = 0L, REF_DATE = "1976-01", GEO = "Canada",
DGUID = "2016A000011124", `Labour force characteristics` = "Population",
Sex = "Both sexes", `Age group` = "15 years and over", Statistics = "Estimate",
`Data type` = "Seasonally adjusted", UOM = "Persons", UOM_ID = 249L,
SCALAR_FACTOR = "thousands", SCALAR_ID = 3L, VECTOR = "v2062809",
COORDINATE = "1.1.1.1.1.1", VALUE = 16852.4, STATUS = "",
SYMBOL = NA, TERMINATED = NA, DECIMALS = 1L), class = c("data.table",
"data.frame"), row.names = c(NA, -1L), .internal.selfref = <pointer: 0x000002123cf21ef0>)
You cannot assign a value to get since there is no function get<-. The right way of doing it would be something like the following.
apply_alias <- function(l){
l[which(l == "Geography")] <- "GEO"
toupper(l)
}
cleanup_column_names_tbl <- function(PID, envir = .GlobalEnv){
pid_full <- paste0("tbl_", PID)
res <- lapply(pid_full, function(p){
nms <- apply_alias(names(get(p)))
DF <- get(p)
names(DF) <- nms
DF
})
names(res) <- pid_full
list2env(res, envir = envir)
invisible(NULL)
}
cleanup_column_names_tbl("14100287")
names(tbl_14100287)
# [1] "V1" "REF_DATE"
# [3] "GEO" "DGUID"
# [5] "LABOUR FORCE CHARACTERISTICS" "SEX"
# [7] "AGE GROUP" "STATISTICS"
# [9] "DATA TYPE" "UOM"
#[11] "UOM_ID" "SCALAR_FACTOR"
#[13] "SCALAR_ID" "VECTOR"
#[15] "COORDINATE" "VALUE"
#[17] "STATUS" "SYMBOL"
#[19] "TERMINATED" "DECIMALS"
My solution:
Creating and expression and evaluating it. It is short but somehow does not feel like it is the proper way of doing things since I am breaking away from the functional paradigm of R.
cleanup_column_names_tbl <- function(PID){
for(p in PID){
expr1 <- paste0("names(", paste0("tbl_",p), ") <- apply_alias(names(", paste0("tbl_",p),"))")
eval(rlang::parse_expr(expr1))
}
}
Edit:
A slightly different way that:
avoids using strings
a little more flexible since it allows the use of strings and symbol
library(rlang)
test_df <- data.frame(a=1:10,b=1:10)
test_df2 <- data.frame(a=1:10,b=1:10)
fix_names <- function(df){
x <- ensym(df)
expr1 <- expr(names(!!x) <- toupper(names(!!x)))
eval(expr1, envir = parent.env(environment()))
# expr1
}
fix_names(test_df)
fix_names("test_df2")
names(test_df)
#> [1] "A" "B"
names(test_df2)
#> [1] "A" "B"

Applying `ar` (autoregressive model) for my data frame using `lapply` returns `numeric(0)`?

I'm working with a data.frame with all numeric data. I want to calculate the first order autoregressive coefficients for each column. I chose apply function to do the task and I defined a function as the following:
return.ar <- function(vec){
return(as.numeric(ar(vec)$ar))
}
Then I applied it to a data frame I subset by column names as the following
lapply(df_return[,col.names],return.ar)
I was expecting to get a vector with ar coefficients. But instead I got a list with all the coefficients put in the first element like the following
$C.Growth
[1] 0.35629140 -0.07671252 -0.08699333 -0.27404355 0.21448342
[6] -0.19049197 0.06610908 -0.23077602
$Mkt.ret
numeric(0)
$SL
numeric(0)
$SM
numeric(0)
$SH
numeric(0)
$LL
numeric(0)
$LM
numeric(0)
$LH
numeric(0)
I don't understand what's going on.
The output of dput(head(df_return)) looks like the following:
structure(list(Year = c(1929, 1930, 1931, 1932, 1933, 1934),
C.Growth = c(0.94774902516838, 0.989078396169958, 0.911586749357132,
0.996183522774413, 1.08170234030149, 1.05797659377887), S.Return = c(-19.7068321696574,
-31.0834309393085, -45.2864376593084, -9.42504715968666,
57.0992131145999, 4.05781718258972), Rf = c(4.79316783034255,
2.58656906069154, 1.24356234069162, 0.954952840313344, 0.199213114599945,
0.147817182589718), Inflation = c(-0.0531678303425544, -0.15656906069154,
-0.15356234069162, -0.00495284031334435, 0.100786885400055,
0.0321828174102824), Mkt.ret = c(-14.9668321696574, -28.6534309393085,
-44.1964376593084, -8.47504715968666, 57.3992131145999, 4.23781718258972
), SL = c(-45.2568321696575, -35.1134309393085, -41.1864376593084,
-5.28504715968666, 166.0392131146, 34.1378171825897), SM = c(-30.7368321696574,
-31.9034309393085, -48.5364376593084, -8.94504715968666,
118.7092131146, 19.7578171825897), SH = c(-36.7568321696575,
-45.1834309393085, -51.5364376593084, 2.78495284031334, 125.7792131146,
7.95781718258972), LL = c(-19.6968321696574, -26.2734309393085,
-36.2264376593084, -7.31504715968666, 44.1492131145999, 10.6978171825897
), LM = c(0.673167830342554, -29.2434309393085, -59.9864376593084,
-16.7150471596867, 89.4692131145999, -2.93218281741028),
LH = c(-4.35683216965745, -43.1934309393085, -57.7364376593084,
-4.30504715968666, 114.7092131146, -21.8421828174103)), .Names = c("Year",
"C.Growth", "S.Return", "Rf", "Inflation", "Mkt.ret", "SL", "SM",
"SH", "LL", "LM", "LH"), row.names = c(NA, 6L), class = "data.frame")
Once you include your data, diagnose becomes easy.
ar will do auto-section of p based on AIC. Some of your columns have strong evidence to be white noise, hence ar has selected p = 0, in which case $ar field will be numeric(0).
I suggest you also use the following:
lapply(df_return[col.names], function (x) ar(x, order.max = 5)$order)
or even better:
fit_ar <- function(x) ar(x, order.max = 5)[c("order", "ar")]
lapply(df_return[col.names], fit_ar)
The latter returns both p as well as AR coefficients for each column. I have set order.max = 5, so that ar won't choose it itself.
You tried to convince me that lapply is doing wrong, by using this for loop:
ar.vec <- numeric()
for (name in col.names)
ar.vec <- c(ar.vec, return.ar(df_return[[ name ]]))
But unfortunately you won't get anything useful from this. Note you used concatenation c(), thus there is no way to tell which coefficient is for which column.
lapply is not identical to such loop. You should use:
ar.vec <- vector("list", length(col.names))
for (i in 1:length(col.names))
ar.vec[[i]] <- return.ar(df_return[[ col.names[i] ]])

Resources