Formulate data for rpart - r

Concatenate column names of a list to prepare a formula for rpart?
I want to concatenate names(log_data), where log_data is a list of 60 distinct vectors, so that I can use the column names in an rpart formula in R, such as rpart(A ~ B + C + D + E, log_data). That is, I want to build
formula="A~B+C+D+E" as a single string, where A, B, C, D, E are the column names extracted from log_data. Or is there a better way to get a tree from the list?
I have tried,
a <- names(log_data)
rpart(a[1] ~ a[2] + a[3] + a[4], log_data)
getting an error
Error in paste(temp, yprob[, i], sep = " ") : subscript out of bounds
where
a[2]
[1] "X.u.crpice..vin20f1..vol.vin20f1v1.r_credit_credshare2...91...90."
a[3]
[1] "X.u.crpice..vin20f1..vol.vin20f1v1.r_credit_credshare2...92...90."
c<-paste(a[1], "~", sep="")
rpart_formula <- as.formula(paste(c, paste(a[2:60], collapse = " + "), sep = ""))
rpart(rpart_formula,log_data)
rpart appears to hang (as if in an infinite loop) at this call, perhaps because the column names are too long or because n = 60.
Can I assign shorter column names with colnames(log_data) <- c(?)? What should I put in place of "?" so that the tree is easy to draw for n = 60?

I believe you want
shortnames <- paste0("c",seq(ncol(log_data)))
names(log_data) <- shortnames
form <- reformulate(paste(shortnames[2:4],collapse="+"),
response=shortnames[1])
rpart(form,log_data)
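If renaming is not wanted, a minimal sketch along the same lines is to pass the name vector straight to reformulate, which accepts a character vector of term labels. Here the first four columns of the built-in mtcars data stand in for log_data (an assumption for illustration only), and the rpart call is guarded in case the package is not installed.

```r
# Hypothetical stand-in for log_data: a list of equal-length vectors.
log_data <- as.list(mtcars[, 1:4])
log_df <- as.data.frame(log_data)

nms <- names(log_df)

# reformulate() takes the term labels as a character vector,
# so no manual paste/collapse is needed.
form <- reformulate(nms[-1], response = nms[1])

# The same formula as a single string, if a string is really required.
form_string <- paste(deparse(form), collapse = " ")

# Fit the tree only if rpart is available.
if (requireNamespace("rpart", quietly = TRUE)) {
  fit <- rpart::rpart(form, data = log_df)
}
```

When every remaining column is a predictor, the shorthand formula mpg ~ . (response, tilde, dot) should give the same set of terms without building any string.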

Related

Eliminating partially overlapping parts of 2 vectors in R

I wonder if it might be possible to drop the parts in n1 character vector that partially overlap with elements in f1 formula.
For example, in n1, we see "study_typecompare" & "study_typecontrol" partially overlap with study_type in f1.
Thus in the desired_output, we want to drop the "study_type" part of them. Because other elements (ex. time_wk_whn) in n1 fully overlap with an element in f1, we leave them unchanged.
Is obtaining my desired_output possible in BASE R or tidyverse?
f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare","study_typecontrol","time_wk_whn",
"time_wk_btw","items_whn","items_btw","training_hr_whn",
"training_hr_btw")
desired_output <- c("compare","control", "time_wk_whn",
"time_wk_btw","items_whn","items_btw",
"training_hr_whn","training_hr_btw")
We create a function that takes the formula and the vector ('fmla', 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in the vector that are not among the formula variables (setdiff), builds a pattern by pasting those variables together, replaces the match with blank ("") using sub, updates 'vec', and returns the updated vector.
fun1 <- function(fmla, vec) {
v1 <- all.vars(fmla)
v2 <- setdiff(vec, v1)
v3 <- sub(paste(v1, collapse = "|"), "", v2)
vec[vec %in% v2] <- v3
vec
}
Checking:
> identical(fun1(f1, n1), desired_output)
[1] TRUE
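Because the pattern above is not anchored, sub removes the first match wherever it occurs in the string. A hedged variant, assuming the overlap is always a prefix as in this example, anchors the pattern at the start. The name strip_prefix is made up for this sketch.

```r
f1 <- gi ~ 0 + study_type + time_wk_whn + time_wk_btw + items_whn +
  items_btw + training_hr_whn + training_hr_btw
n1 <- c("study_typecompare", "study_typecontrol", "time_wk_whn",
        "time_wk_btw", "items_whn", "items_btw", "training_hr_whn",
        "training_hr_btw")

# Hypothetical helper: like fun1, but only strips a formula variable
# when it appears at the very start of the element.
strip_prefix <- function(fmla, vec) {
  v1 <- all.vars(fmla)
  v2 <- setdiff(vec, v1)                               # partial overlaps only
  pat <- paste0("^(", paste(v1, collapse = "|"), ")")  # anchored alternation
  vec[vec %in% v2] <- sub(pat, "", v2)
  vec
}

res <- strip_prefix(f1, n1)
```

This assumes the variable names contain no regular expression metacharacters; names with dots or brackets would need escaping first.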

How to create a function with already written for-loop to automatically create vectors in r environment

I have a df with a fixed set of columns and a variable number of rows. I create empty vectors, fill them with R commands as strings, and then run eval(parse(text = someVector)) so that the commands create the vectors for me. My for-loop works, but I would like to turn it into a function and/or use *apply(), and I don't know how. I would like to be able to choose the variables by name or position and always go through every row.
Working with the reprex, I expect 30 vectors to be created in the working environment: for each car model, one vector per specified column holding that row's value, plus 6 more vectors that store the R commands.
For example, one of the vectors should look like this: cyl_MazdaRX4Wag <- 6
# df
df <- mtcars[1:5,]
df$carmodel <- gsub("[[:space:]]", "", rownames(df))
# create empty vectors to store R command
carmodel <- c()
mpg <- c()
cyl <- c()
hp <- c()
gear <- c()
carb <- c()
# loop through every row to create an R command
for(i in 1:nrow(df)){
carmodel[i] <- paste0("carmodel_", df$carmodel[i] , " <- ", "'", df$carmodel[i], "'",";")
mpg[i] <- paste0("mpg_", df$carmodel[i], " <- ", df$mpg[i], ";")
cyl[i] <- paste0("cyl_", df$carmodel[i], " <- ", df$cyl[i], ";")
hp[i] <- paste0("hp_", df$carmodel[i], " <- ", df$hp[i], ";")
gear[i] <- paste0("gear_", df$carmodel[i], " <- ", df$gear[i], ";")
carb[i] <- paste0("carb_", df$carmodel[i], " <- ", df$carb[i], ";")
}
# collapse the vectors in one string
carmodel <- paste(carmodel, collapse = " ")
mpg <- paste(mpg, collapse = " ")
cyl <- paste(cyl, collapse = " ")
hp <- paste(hp, collapse = " ")
gear <- paste(gear, collapse = " ")
carb <- paste(carb, collapse = " ")
# execute R command
eval(parse(text = carmodel))
eval(parse(text = mpg))
eval(parse(text = cyl))
eval(parse(text = hp))
eval(parse(text = gear))
eval(parse(text = carb))
# delete vectors that store the R commands (only after all of them have been executed)
rm(list = c("carmodel","mpg","cyl", "hp","gear","carb"))
We can select the columns we want to work on, then create a named list holding each name and its value. The "_" separator matches the names requested in the question, e.g. cyl_MazdaRX4Wag.
cols <- c('carmodel', 'mpg', 'cyl', 'hp', 'gear', 'carb')
temp <- unlist(lapply(cols, function(x) as.list(setNames(df[[x]],
paste0(x, "_", df$carmodel)))), recursive = FALSE)
Usually, it is better to keep data as a list, rather than individual objects. If you need them as separate variables in the global environment we can use list2env.
list2env(temp, .GlobalEnv)
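Since the question asks for a function that selects columns by name or by position, here is a minimal sketch of that idea. The name make_values and its arguments are made up for illustration, and the result is kept as a list rather than exported to the global environment.

```r
# Hypothetical helper: build a named list with one element per
# (column, row) pair, named "<column>_<id>".
make_values <- function(data, cols, id_col = "carmodel") {
  # Allow selection by position as well as by name.
  if (is.numeric(cols)) cols <- names(data)[cols]
  out <- lapply(cols, function(x) {
    as.list(setNames(data[[x]], paste0(x, "_", data[[id_col]])))
  })
  unlist(out, recursive = FALSE)
}

df <- mtcars[1:5, ]
df$carmodel <- gsub("[[:space:]]", "", rownames(df))

vals <- make_values(df, c("cyl", "hp"))   # select by name
vals_by_pos <- make_values(df, c(2, 4))   # same columns, by position

vals$cyl_MazdaRX4Wag
```

If separate objects are really needed, the returned list can be passed to list2env afterwards.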

Lookup list of formulas in other list

I am comparing two lists of formulas to see if some previously computed models can be reused. Right now I'm doing this like this:
set.seed(123)
# create some random formulas
l1 <- l2 <- list()
for (i in 1:10) {
l1[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
l2[[i]] <- as.formula(paste("z ~", paste(sample(letters, 3), collapse = " + ")))
}
# at least one appears in the other list
l1[[5]] <- l2[[7]]
# helper function to convert formulas to character strings
as.formulaCharacter <- function(x) paste(deparse(x))
# convert both lists to strings
s1 <- sapply(l1, as.formulaCharacter)
s2 <- sapply(l2, as.formulaCharacter)
# look up elements of one vector in the other
idx <- match(s1, s2, nomatch = 0L) # position in s2 of each element of s1; 0 if absent (here 7 for the 5th element)
s2[idx] # found matching elements
However, I noticed that some formulas are not retrieved although they are practically equivalent.
f1 <- z ~ b + c + b:c
f2 <- z ~ c + b + c:b
match(as.formulaCharacter(f1), as.formulaCharacter(f2)) # no match
I get why this result is different, the strings just aren't the same, but I'm struggling with how to extend this approach to also work for formulas with reordered elements. I could use strsplit to first sort all formula components independently, but that sounds horribly inefficient to me.
Any ideas?
If the formulas are restricted to a sum of terms which contain colon separated variables then we can create a standardized string by extracting the term labels, exploding those with colons, sorting them, pasting the exploded terms back together, sorting this and turning that into a formula string.
stdize <- function(fo) {
s <- strsplit(attr(terms(fo), "term.labels"), ":")
terms <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
format(reformulate(terms, all.vars(fo)[1]))
}
stdize(f1) == stdize(f2)
## [1] TRUE
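To connect this back to the original lookup, the standardized strings can be used as keys for match. The sketch below uses a version of stdize that reads the terms from its argument fo, plus two small made-up lists. It assumes each formula is short enough that format returns a single string.

```r
# Standardize a formula: sort variables within each interaction term,
# then sort the terms themselves, and return the result as a string.
stdize <- function(fo) {
  s <- strsplit(attr(terms(fo), "term.labels"), ":")
  labs <- sort(sapply(lapply(s, sort), paste, collapse = ":"))
  format(reformulate(labs, all.vars(fo)[1]))
}

# Small illustrative lists (not the random ones from the question).
l1 <- list(z ~ b + c + b:c, z ~ a + d)
l2 <- list(z ~ x + y, z ~ c + b + c:b)

k1 <- vapply(l1, stdize, character(1))
k2 <- vapply(l2, stdize, character(1))

# Position in l2 of each formula in l1, or 0 when there is no equivalent.
idx <- match(k1, k2, nomatch = 0L)
```

Here the first formula in l1 is found at position 2 of l2 even though its terms are written in a different order, and the second has no counterpart.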

Converting a vector into formula

Given a data.frame and a vector containing only -1, 0 and 1, with length equal to the number of columns of the data.frame, is there a natural way to turn the vector into a formula in which the columns at positions holding -1 appear on the left-hand side and those holding +1 appear on the right-hand side?
For example, given the following data.frame
df = data.frame(
'a' = rnorm(10),
'b' = rnorm(10),
'c' = rnorm(10),
'd' = rnorm(10),
'e' = rnorm(10))
and following vector vec = c(-1,-1,0,1,1).
Is there a natural way to build formula a+b~d+e?
We assume that if there are no 1's in vec that we should use a right hand side of 1 and if there are no -1's in vec then the left hand side is empty.
The alternatives each produce a character string but if a formula class object is wanted use formula(s) where s is that string.
1) paste each side Subset out the names corresponding to vec -1 giving LHS and paste/collapse them and do the same with vec 1 giving RHS and paste those with ~ all together. If we knew that there were at least one 1 in vec we could omit the if statement. Of the solutions here this seems the most straightforward.
nms <- names(df)
LHS <- paste(nms[vec == -1], collapse = "+")
RHS <- paste(nms[vec == 1], collapse = "+")
if (RHS == "") RHS <- "1"
paste0(LHS, "~", RHS)
## [1] "a+b~d+e"
2) sapply Alternately combine the LHS and RHS lines into a single sapply. If we knew that there were at least one 1 in vec then we could simplify the code by omitting the if statement. This approach is shorter than (1).
sa <- sapply(c(-1, 1), function(x) paste(names(df)[vec == x], collapse = "+"))
if (sa[2] == "") sa[2] <- "1"
paste0(sa[1], "~", sa[2])
## [1] "a+b~d+e"
3) tapply We can alternately combine the LHS and RHS lines into a single tapply like this:
ta <- tapply(names(df), vec, paste, collapse = "+")
paste0(if (any(vec == -1)) ta[["-1"]], "~", if (any(vec == 1)) ta[["1"]] else 1)
## [1] "a+b~d+e"
If we knew that -1 and 1 each appear at least once in vec then we can simplify the last line to:
paste0(ta[["-1"]], "~", ta[["1"]])
## [1] "a+b~d+e"
Overall this approach is the shortest if we can guarantee that there will be at least one 1 and at least one -1 but otherwise handling the edge cases seems somewhat cumbersome compared to the other approaches.
We could also do this by grouping the names by vec with aggregate and pasting within each group
paste(aggregate(nm ~ vec, subset(data.frame(nm = names(df), vec,
stringsAsFactors = FALSE), vec != 0),
FUN = paste, collapse= ' + ')[['nm']], collapse=' ~ ')
#[1] "a + b ~ d + e"
Or another option is tapply
paste(tapply(names(df), vec, FUN = paste,
collapse= ' + ')[c('-1', '1')], collapse= ' ~ ')
#[1] "a + b ~ d + e"
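For reuse, the paste approach can be wrapped in a small function that returns a formula object and covers the edge cases described earlier (no 1's gives a right-hand side of 1, no -1's gives a one-sided formula). The function name vec_to_formula is made up for this sketch.

```r
# Hypothetical helper: names at positions with -1 go to the left-hand
# side, names at positions with 1 go to the right-hand side.
vec_to_formula <- function(nms, vec) {
  stopifnot(length(nms) == length(vec))
  lhs <- paste(nms[vec == -1], collapse = " + ")
  rhs <- paste(nms[vec == 1], collapse = " + ")
  if (rhs == "") rhs <- "1"          # no predictors: intercept only
  as.formula(paste(lhs, "~", rhs))   # an empty lhs gives a one-sided formula
}

nms <- c("a", "b", "c", "d", "e")

f_two_sided <- vec_to_formula(nms, c(-1, -1, 0, 1, 1))
f_one_sided <- vec_to_formula(nms, c(0, 0, 0, 1, 1))
f_no_rhs    <- vec_to_formula(nms, c(-1, 0, 0, 0, 0))
```

With a data frame, names(df) would be passed as nms.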

Obtain scaled predictor values when using `lm` in R

I'm using lm in R to do simple multilinear regression. Here's an example model:
m <- lm(formula = t ~ a + b + 0, data = df1)
where t, a and b are columns in df1. This model calculates 2 coefficients, let's call them a.coef and b.coef. If I then use this model to predict some other data, say in df2, I can get the predicted values like so:
predict(m, df2)
if I have the columns a and b in df2 as well. It essentially returns
df2$a * a.coef + df2$b * b.coef
What I'd like, however, are the columns df2$a * a.coef and df2$b * b.coef. R sums them and gives me the answer, but I'd like to see how the scaling affects these values.
Is there a convenient way to do this in R (esp in lm or predict.lm), or will I have to manually code this myself? I played with the terms argument in predict.lm, but I couldn't get anywhere.
Thanks for the help!
EDIT
I wrote this function:
scaled.fn <- function(dt, x, y, i) {
# dt is data.table
# x is dependent column (col name as str)
# y are predictor columns (col names as vector of str)
# i is name of column to multiply, as str
dep = paste(y, collapse = " + ")
my.formula = paste(x, " ~ ", dep, sep = "")
m = lm(formula = my.formula, data = dt)
# column names in dt are named in y
return(dt[, get(i) * coef(m)[i]])
}
Try this:
sweep(df2, MARGIN = 2, coef(m), '*')
EDIT: more specific solution:
sweep(df2[,c("a","b")], MARGIN = 2, coef(m), '*')
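A small self-contained check of the sweep approach, using simulated data (df1, df2 and the coefficients below are invented for illustration). Selecting the columns by names(coef(m)) keeps them in the same order as the coefficients, and the row sums of the scaled columns should reproduce predict for this no-intercept model.

```r
set.seed(1)

# Simulated training data with a known relationship and no intercept.
df1 <- data.frame(a = rnorm(20), b = rnorm(20))
df1$t <- 2 * df1$a - df1$b + rnorm(20, sd = 0.1)

# New data, with an extra column that is not part of the model.
df2 <- data.frame(a = rnorm(5), b = rnorm(5), other = 1:5)

m <- lm(t ~ a + b + 0, data = df1)

# Multiply each predictor column by its own coefficient.
contrib <- sweep(df2[, names(coef(m))], MARGIN = 2, STATS = coef(m), FUN = "*")

# The per-term contributions should add up to the usual prediction.
check <- all.equal(unname(rowSums(contrib)), unname(predict(m, df2)))
```

This only lines up so directly because the model has no intercept and only plain numeric predictors. With factors or interactions the coefficient names no longer match the column names.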
