Extract a predictors form constparty object (CHAID output) in R - r

I have a large dataset (questionnaire results) of mostly categorical variables. I have tested for dependency between the variables using chi-square test. There are incomprehensible number of dependencies between variables. I used the chaid() function in the CHAID package to detect interactions and separate out (what I hope to be) the underlying structure of these dependencies for each variable. What typically happens is that the chi-square test will reveal a large number of dependencies (say 10-20) for a variable and the chaid function will reduce this to something much more comprehensible (say 3-5). What I want to do is to extract the names of those variable that were shown to be relevant in the chaid() results.
The chaid() output is in the form of a constparty object. My question is how to extract the variable names associated with the nodes in such an object.
Here is a self contained code example:
library(evtree) # for the ContraceptiveChoice dataset
library(CHAID)
library(vcd)
library(MASS)
data("ContraceptiveChoice")
longform = formula(contraceptive_method_used ~ wifes_education +
husbands_education + wifes_religion + wife_now_working +
husbands_occupation + standard_of_living_index + media_exposure)
z = chaid(longform, data = ContraceptiveChoice)
# plot(z)
z
# This is the part I want to do programatically
shortform = formula(contraceptive_method_used ~ wifes_education + husbands_occupation)
# The thing I want is a programatic way to extract 'shortform' from 'z'
# Examples of use of 'shortfom'
loglm(shortform, data = ContraceptiveChoice)

One possible sollution:
nn <- nodeapply(z)
n.names= names(unlist(nn[[1]]))
ext <- unlist(sapply(n.names, function(x) grep("split.varid.", x, value=T)))
ext <- gsub("kids.split.varid.", "", ext)
ext <- gsub("split.varid.", "", ext)
dep.var <- as.character(terms(z)[1][[2]]) # get the dependent variable
plus = paste(ext, collapse=" + ")
mul = paste(ext, collapse=" * ")
shortform <- as.formula(paste (dep.var, plus, sep = " ~ "))
satform <- as.formula(paste (dep.var, mul, sep = " ~ "))
mosaic(shortform, data = ContraceptiveChoice)
#stp <- step(glm(satform, data=ContraceptiveChoice, family=binomial), direction="both")

Related

Creating a function to loop columns through an equation in R

Solution (thanks #Peter_Evan!) in case anyone coming across this question has a similar issue
(Original question is below)
## get all slopes (lm coefficients) first
# list of subfields of interest to loop through
sf <- c("left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG")
# dependent variables are sf, independent variable common to all models in the inner lm() call is ICV
# applies the lm(subfield ~ ICV, dataset = DF) to all subfields of interest (sf) specified previously
lm.results <- lapply(sf, function(dv) {
temp.lm <- lm(get(dv) ~ ICV, data = DF)
coef(temp.lm)
})
# returns a list, where each element is a vector of coefficients
# do.call(rbind, ) will paste them together
lm.coef <- data.frame(sf = sf,
do.call(rbind, lm.results))
# tidy up name of intercept variable
names(lm.coef)[2] <- "intercept"
lm.coef
## set up all components for the equation
# matrix to store output
out <- matrix(ncol = length(sf), nrow = NROW(DF))
# name the rows after each subject
row.names(out) <- DF$Subject
# name the columns after each subfield
colnames(out) <- sf
# nested for loop that goes by subject (j) and subfield (i)
for(j in DF$Subject){
for (i in sf) {
slope <- lm.coef[lm.coef$sf == i, "ICV"]
out[j,i] <- as.numeric( DF[DF$Subject == j, i] - (slope * (DF[DF$Subject == j, "ICV"] - mean(DF$ICV))) )
}
}
# check output
out
===============
Original Question:
I have a dataframe (DF) with 13 columns (12 different brain subfields, and one column containing total intracranial volume(ICV)) and 50 rows (each a different participant). I'm trying to automate an equation being looped over every column for each participant.
The data:
structure(list(Subject = c("sub01", "sub02", "sub03", "sub04",
"sub05", "sub06", "sub07", "sub08", "sub09", "sub10", "sub11",
"sub12", "sub13", "sub14", "sub15", "sub16", "sub17", "sub18",
"sub19", "sub20"), ICV = c(1.50813, 1.3964237, 1.6703585, 1.4641886,
1.6351018, 1.5524641, 1.4445532, 1.6384505, 1.6152434, 1.5278011,
1.4788126, 1.4373356, 1.4109637, 1.3634952, 1.3853583, 1.4855268,
1.6082085, 1.5644998, 1.5617522, 1.4304141), left_subiculum = c(411.225013,
456.168033, 492.968477, 466.030173, 533.95505, 476.465524, 448.278213,
476.45566, 422.617374, 498.995121, 450.773906, 461.989663, 549.805272,
452.619547, 457.545623, 451.988333, 475.885847, 490.127968, 470.686415,
494.06548), left_CA1 = c(666.893596, 700.982955, 646.21927, 580.864234,
721.170599, 737.413139, 737.683665, 597.392434, 594.343911, 712.781376,
733.157168, 699.820162, 701.640861, 690.942843, 606.259484, 731.198846,
567.70879, 648.887718, 726.219904, 712.367433), left_presubiculum = c(325.779458,
391.252815, 352.765098, 342.67797, 390.885737, 312.857458, 326.916867,
350.657957, 325.152464, 320.718835, 273.406949, 305.623938, 371.079722,
315.058313, 311.376271, 319.56678, 348.343569, 349.102678, 322.39908,
306.966008), `left_GC-ML-DG` = c(327.037756, 305.63224, 328.945065,
238.920358, 319.494513, 305.153183, 311.347404, 259.259723, 295.369164,
312.022281, 324.200989, 314.636501, 306.550385, 311.399107, 295.108592,
356.197094, 251.098248, 294.76349, 317.308576, 301.800253), left_CA3 = c(275.17038,
220.862237, 232.542718, 170.088695, 234.707172, 210.803287, 246.861975,
171.90896, 220.83478, 236.600832, 246.842024, 239.677362, 186.599097,
224.362411, 229.9142, 293.684776, 172.179779, 202.18936, 232.5666,
221.896625), left_CA4 = c(277.614028, 264.575987, 286.605092,
206.378619, 281.781858, 258.517989, 269.354864, 226.269982, 256.384436,
271.393257, 277.928824, 265.051581, 262.307377, 266.924683, 263.038686,
306.133918, 226.364556, 262.42823, 264.862956, 255.673948), right_subiculum = c(468.762375,
445.35738, 446.536018, 456.73484, 521.041823, 482.768261, 487.2911,
456.39996, 445.392976, 476.146498, 451.775611, 432.740085, 518.170065,
487.642399, 405.564237, 487.188989, 467.854363, 479.268714, 473.212833,
472.325916), right_CA1 = c(712.973011, 717.815214, 663.637105,
649.614586, 711.844375, 779.212704, 862.784416, 648.925038, 648.180611,
760.761704, 805.943016, 717.486756, 801.853608, 722.213109, 621.676321,
791.672796, 605.35667, 637.981476, 719.805053, 722.348921), right_presubiculum = c(327.285242,
364.937865, 288.322641, 348.30058, 341.309111, 279.429847, 333.096795,
342.184296, 364.245998, 350.707173, 280.389853, 276.423658, 339.439377,
321.534798, 302.164685, 328.365751, 341.660085, 305.366589, 320.04127,
303.83284), `right_GC-ML-DG` = c(362.391907, 316.853532, 342.93274,
282.550769, 339.792696, 357.867386, 342.512721, 277.797528, 309.585721,
343.770416, 333.524912, 302.505077, 309.063135, 291.29361, 302.510461,
378.682679, 255.061044, 302.545288, 313.93902, 297.167161), right_CA3 = c(307.007404,
243.839349, 269.063801, 211.336979, 249.283479, 276.092623, 268.183349,
202.947849, 214.642782, 247.844657, 291.206598, 235.864996, 222.285729,
201.427853, 237.654913, 321.338801, 199.035108, 243.204203, 236.305659,
213.386702), right_CA4 = c(312.164065, 272.905586, 297.99392,
240.765062, 289.98697, 306.459566, 284.533068, 245.965817, 264.750571,
296.149675, 290.66935, 264.821461, 264.920869, 246.267976, 266.07378,
314.205819, 229.738951, 274.152503, 256.414608, 249.162404)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
The equation:
adjustedBrain(participant1) = rawBrain(participant1) - slope*[ICV(participant1) - (mean of all ICV measures included in the calculation of the slope)]
The code (which is not working and I was hoping for some pointers):
adjusted_Brain <- function(DF, subject) {
subfields <- colnames(select(DF, "left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG"))
out <- matrix(ncol = length(subfields), nrow = NROW(DF))
for (i in seq_along(subfields)) {
DF[i] = DF[DF$Subject == "subject", "i"] -
slope * (DF[DF$Subject == "subject", "ICV"] -
mean(DF$ICV))
}
}
Getting this error:
Error: Can't subset columns that don't exist.
x Column `i` doesn't exist.
A few notes:
The slopes for each subject for each subfield will be different (and will come from a regression) -> is there a way to specify that in the function so the slope (coefficient from the appropriate regression equation) gets called in?
I have my nrow set to the number of participants right now in the output because I'd like to have this run through EVERY subject across EVERY subfield and spit out a matrix with all the adjusted brain volumes... But that seems very complicated and so for now I will just settle for running each participant separately.
Any help is greatly appreciated!
As others have noted in the comments, there are quite a few syntax issues that prevent your code from running, as well as a few unstated requirements. That aside, I think there is enough to recommend a few improvements that you can hopefully build on. Here are the top line changes:
You likely don't need this to be a function, but rather a nested for loop (if you want to do this with base R). As written, the code isn't flexible enough to merit a function. If you intend to apply this many times across different datasets, a function might make sense. However, it will require a much larger rewrite.
Assuming you are fitting a simple regression via lm, then you can pull out the coefficient of interest via the $ operator and indexing (see below). Some thought will need to go into how to handle different models in the loop. Here, we assume you only need one coefficient from one model.
There are a few areas where the syntax is incorrect and a review of sub setting in base R would be helpful. Others have pointed out in the comments were some of these are.
Here is one approach were we loop through each subject (j) through each feature or subfield (i) and store them in a matrix (out). This is just an approach and will almost certainly need tweaking on your end!
#NOTE: the dataset your provided is saved as x in this example.
#fit a linear model - here we assume there is only one coef. of interest, but you may need to alter
# depending on how the slope changes in each calculation
reg <- lm(ICV ~ right_CA3, x)
# view the coeff.
reg$coefficients
# pull out the slope by getting the coeff. of interest (via index) from the reg object
slope <- reg$coefficients[[1]]
# list of features/subfeilds to loop through
sf <- c("left_presubiculum", "right_presubiculum",
"left_subiculum", "right_subiculum", "left_CA1", "right_CA1",
"left_CA3", "right_CA3", "left_CA4", "right_CA4", "left_GC-ML-DG",
"right_GC-ML-DG")
# matrix to store output
out <- matrix(ncol = length(sf), nrow = NROW(x))
#name the rows after each subject
row.names(out) <- x$Subject
#name the columns after each sub feild
colnames(out) <- sf
# nested for loop that goes by subject (j) and features/subfeilds (i)
for(j in x$Subject){
for (i in sf) {
out[j,i] <- as.numeric( x[x$Subject == j, i] - (slope * (x[x$Subject == j, "ICV"] - mean(x$ICV))) )
}
}
# check output
out

Properly calling variable name when creating multiple Benford plots

I am creating Benford plots for all the numeric variables in my dataset. https://en.wikipedia.org/wiki/Benford%27s_law
Running a single variable
#install.packages("benford.analysis")
library(benford.analysis)
plot(benford(iris$Sepal.Length))
Looks great. And the legend says "Dataset: iris$Sepal.Length", perfect!.
Using apply to run 4 variables,
apply(iris[1:4], 2, function(x) plot(benford(x)))
Creates four plots, however, each plot's legend says "Dataset: x"
I attempted to use a for loop,
for (i in colnames(iris[1:4])){
plot(benford(iris[[i]]))
}
This creates four plots, but now the legends says "Dataset: iris[[i]]". And I would like the name of the variable on each chart.
I tried a different loop, hoping to get titles with an evaluated parsed string like "iris$Sepal.Length":
for (i in colnames(iris[1:4])){
plot(benford(eval(parse(text=paste0("iris$", i)))))
}
But now the legend says "Dataset: eval(parse(text=paste0("iris$", i)))".
AND, Now I've run into the infamous eval(parse(text=paste0( (eg: How to "eval" results returned by "paste0"? and R: eval(parse(...)) is often suboptimal )
I would like labels such as "Dataset: iris$Sepal.Length" or "Dataset: Sepal.Length". How can I create multiple plots with meaningfully variable names in the legend?
This is happening because of the first line within the benford function=:
benford <- function(data, number.of.digits = 2, sign = "positive", discrete=TRUE, round=3){
data.name <- as.character(deparse(substitute(data)))
Source: https://github.com/cran/benford.analysis/blob/master/R/functions-new.R
data.name is then used to name your graph. Whatever variable name or expression you pass to the function will unfortunately be caught by the deparse(substitute()) call, and will be used as the name for your graph.
One short-term solution is to copy and rewrite the function:
#install.packages("benford.analysis")
library(benford.analysis)
#install.packages("data.table")
library(data.table) # needed for function
# load hidden functions into namespace - needed for function
r <- unclass(lsf.str(envir = asNamespace("benford.analysis"), all = T))
for(name in r) eval(parse(text=paste0(name, '<-benford.analysis:::', name)))
benford_rev <- function{} # see below
for (i in colnames(iris[1:4])){
plot(benford_rev(iris[[i]], data.name = i))
}
This has negative side effects of:
Not being maintainable with package revisions
Fills your GlobalEnv with normally hidden functions in the package
So hopefully someone can propose a better way!
benford_rev <- function(data, number.of.digits = 2, sign = "positive", discrete=TRUE, round=3, data.name = as.character(deparse(substitute(data)))){ # changed
# removed line
benford.digits <- generate.benford.digits(number.of.digits)
benford.dist <- generate.benford.distribution(benford.digits)
empirical.distribution <- generate.empirical.distribution(data, number.of.digits,sign, second.order = FALSE, benford.digits)
n <- length(empirical.distribution$data)
second.order <- generate.empirical.distribution(data, number.of.digits,sign, second.order = TRUE, benford.digits, discrete = discrete, round = round)
n.second.order <- length(second.order$data)
benford.dist.freq <- benford.dist*n
## calculating useful summaries and differences
difference <- empirical.distribution$dist.freq - benford.dist.freq
squared.diff <- ((empirical.distribution$dist.freq - benford.dist.freq)^2)/benford.dist.freq
absolute.diff <- abs(empirical.distribution$dist.freq - benford.dist.freq)
### chi-squared test
chisq.bfd <- chisq.test.bfd(squared.diff, data.name)
### MAD
mean.abs.dev <- sum(abs(empirical.distribution$dist - benford.dist)/(length(benford.dist)))
if (number.of.digits > 3) {
MAD.conformity <- NA
} else {
digits.used <- c("First Digit", "First-Two Digits", "First-Three Digits")[number.of.digits]
MAD.conformity <- MAD.conformity(MAD = mean.abs.dev, digits.used)$conformity
}
### Summation
summation <- generate.summation(benford.digits,empirical.distribution$data, empirical.distribution$data.digits)
abs.excess.summation <- abs(summation - mean(summation))
### Mantissa
mantissa <- extract.mantissa(empirical.distribution$data)
mean.mantissa <- mean(mantissa)
var.mantissa <- var(mantissa)
ek.mantissa <- excess.kurtosis(mantissa)
sk.mantissa <- skewness(mantissa)
### Mantissa Arc Test
mat.bfd <- mantissa.arc.test(mantissa, data.name)
### Distortion Factor
distortion.factor <- DF(empirical.distribution$data)
## recovering the lines of the numbers
if (sign == "positive") lines <- which(data > 0 & !is.na(data))
if (sign == "negative") lines <- which(data < 0 & !is.na(data))
if (sign == "both") lines <- which(data != 0 & !is.na(data))
#lines <- which(data %in% empirical.distribution$data)
## output
output <- list(info = list(data.name = data.name,
n = n,
n.second.order = n.second.order,
number.of.digits = number.of.digits),
data = data.table(lines.used = lines,
data.used = empirical.distribution$data,
data.mantissa = mantissa,
data.digits = empirical.distribution$data.digits),
s.o.data = data.table(second.order = second.order$data,
data.second.order.digits = second.order$data.digits),
bfd = data.table(digits = benford.digits,
data.dist = empirical.distribution$dist,
data.second.order.dist = second.order$dist,
benford.dist = benford.dist,
data.second.order.dist.freq = second.order$dist.freq,
data.dist.freq = empirical.distribution$dist.freq,
benford.dist.freq = benford.dist.freq,
benford.so.dist.freq = benford.dist*n.second.order,
data.summation = summation,
abs.excess.summation = abs.excess.summation,
difference = difference,
squared.diff = squared.diff,
absolute.diff = absolute.diff),
mantissa = data.table(statistic = c("Mean Mantissa",
"Var Mantissa",
"Ex. Kurtosis Mantissa",
"Skewness Mantissa"),
values = c(mean.mantissa = mean.mantissa,
var.mantissa = var.mantissa,
ek.mantissa = ek.mantissa,
sk.mantissa = sk.mantissa)),
MAD = mean.abs.dev,
MAD.conformity = MAD.conformity,
distortion.factor = distortion.factor,
stats = list(chisq = chisq.bfd,
mantissa.arc.test = mat.bfd)
)
class(output) <- "Benford"
return(output)
}
I have just updated the package (GitHub version) to allow for a user supplied name.
Now the function has a new parameter called data.name in which you can provide a character vector with the name of the data and override the default. Thus, for your example you can simply run the following code.
First install the GitHub version (I will submit this version to CRAN soon).
devtools::install_github("carloscinelli/benford.analysis") # install new version
Now you can provide the name of the data inside the for loop:
library(benford.analysis)
for (i in colnames(iris[1:4])){
plot(benford(iris[[i]], data.name = i))
}
And all the plots will have the correct naming as you wish (below).
Created on 2019-08-10 by the reprex package (v0.2.1)

R Passing parameters to lda() or qda()

I'm trying to optimize my code by saving parameters in vector form and pass it to lda() for modeling. The following method works fine for lm, but not for qda or lda. The error message I received is highlighted in yellow.
intvars <- c("x*y","y*t","z*w")
intfm <- paste("clickthrough", "~", paste(intvars, collapse = " + "))
lda_model_int <- lda(intfm, data = s_train)
Error in lda.default(intfm, data = s_train) : 'x' is not a matrix
You will have to change your string to formula or you can reformulate
intvars <- c("x*y","y*t","z*w")
intfm <- reformulate(intvars,"clickthrough")
lda_model_int <- lda(intfm, data = s_train)
If you wanted to do it your way you will have to do
intvars <- c("x*y","y*t","z*w")
intfm <- as. formula(paste("clickthrough", "~", paste(intvars, collapse = " + ")))
lda_model_int <- lda(intfm, data = s_train)

Creating lavaan and mirt models permuations of varying size and length

hoping someone can offer some guidance here.
I'm creating a multivariate simulation using the simDesign package, I am varying the number of factors as well as items that load on each factor. I would like to write a command that identifies the number of factors present in factornumbers and assigns the appropriate items to them (no cross loading). I will be testing all combinations of the conditions below and more, and I would like to have a model command that acknowledge the iterations of differing models, so I don't have to write multiple model statements.
factornumbers<-c(1,2,3,5)
itemsperfactor<-c(5,10,30)
What lavaan and mirt are looking for is below:
mirtmodel<-mirt.model('
F1=1-15
F2=16-30
MEAN=F1,F2
COV=F1*F2')
lavmodel <- ' F1=~ Item_1 + Item_2 + Item_3 + Item_4 + Item_5 + Item_6 + Item_7 + Item_8 + Item_9 + Item_10 + Item_11 + Item_12 + Item_13 + Item_14 + Item_15
F2=~ Item_16 + Item_17 + Item_18 + Item_19 + Item_20 + Item_21 + Item_22 + Item_23 + Item_24 + Item_25 + Item_26 + Item_27 + Item_28 + Item_29 + Item_30'
The simDesign package offers this example, I would like to expand on it but I'm not sure I have the know-how:
lavmodel<-paste0('F=~ ', paste0(colnames(dat)[1L], ' + '),
paste0(colnames(dat)[-1L], collapse = ' + '))
What I would like is a single mirt and lavaan command that finds the number of factors specified in the factornumbers command and assigns the correct items specified in the data as well as itemsperfactor.
EDIT:
I would like the model identification to pick up on which factor & item structure is in use for that condition and fill in the model identification with the correct information.
For Example:
mirtmodel<-mirt.model('
F1=1-1
F2=6-10
F3=11-15
F4=16-20
F5=21-25
MEAN=F1,F2,F3,F4,F5
COV=F1*F2*F3*F4*F5')
Or
mirtmodel<-mirt.model('
F1=1-30
F2=31-60
MEAN=F1,F2
COV=F1*F2')
And also the corresponding lavaan models.
The idea here is to paste different strings together so that the condition input (row of the respective Design object) is all that is required to construct a suitable model specification string. Generating syntax for simulations is arguably the most annoying part of simulations, but at least in R there are a good number of helpful string operations (plus, packages like stringr).
Here's my interpretation of what you are currently looking for using base R functions.
library(SimDesign)
library(mirt)
Design <- createDesign(factornumbers = c(1,2,3,5),
itemsperfactor = c(5,10,30))
gen_syntax_mirt <- function(condition){
fn <- with(condition, factornumbers)
ipf <- with(condition, itemsperfactor)
nitems <- fn * ipf
maxloads <- sort(seq(nitems, ipf, length.out = fn))
minloads <- c(1, maxloads[-length(maxloads)] + 1)
fnames <- paste0('F', 1:fn)
df <- cbind(fnames, ' = ', minloads, '-', maxloads)
s1 <- apply(df, 1, paste0, collapse = '')
s2 <- paste0('MEAN = ', paste0(fnames, collapse = ','))
s3 <- paste0('COV = ', paste0(fnames, collapse = '*'))
ret <- paste0(c(s1, s2, s3), collapse = '\n')
mirt.model(ret)
}
gen_syntax_mirt(Design[1,])
gen_syntax_mirt(Design[10,])
The input to this function is a single row from the Design input to runSimulation(), so you can see here that it will work just fine. Do something similar for lavaan's syntax and you'll be set.

Getting error "variable lengths differ (found for 'columns_features')" in R

I am applying rpart function to a data frame named train having all the integer values.
There are too many features so for that I have created a formula.
columns_features <- (paste(colnames(train)[31:50], collapse = "+"))
formulas <- as.formula(train$left_eye_center_x ~ columns_features)
tree_pred <- rpart(formulas , data = train)
Here , I get the error message
Error in model.frame.default(formula = formulas, data = train, na.action = function (x) : variable lengths differ (found for 'columns_features')
When I check formulas it has
train$left_eye_center_x ~ columns_features
and for column_features it has
[1] "l_1+ l_2+ l_3+ l_4+ l_5+ l_6+ l_7+ l_8+ l_9+ l_10+ l_11+ l_12+ l_13+ l_14+ l_15+ l_16+ l_17+ l_18+ l_19+ l_20"
For checking purpose when I manually enter the column names here, it works
formulas <- as.formula(train$left_eye_center_x ~ l_1+ l_2+ l_3+ l_4+ l_5+ l_6+ l_7+ l_8+ l_9+ l_10+ l_11+ l_12+ l_13+ l_14+ l_15+ l_16+ l_17+ l_18+ l_19+ l_20 )
tree_pred <- rpart(formulas , data = train)
Is double quote creating the error? What could be solution to this? I have many features so I cannot afford to enter each and every feature manually.
From the ?as.formula examples:
## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
Which implies that in your case the following should work:
formulas <- as.formula(paste("train$left_eye_center_x ~", paste(colnames(train)[31:50], collapse = "+")))
A work-around, instead of using your approach would be (NB: I never used rpart, but I am confident that this works):
formulas <- as.formula(train$left_eye_center_x ~ .)
tree_pred <- rpart(formulas , data = train[,31:50])
If rpart does not like getting indexed data you could define a new dataframe:
train4rpart <- train[,31:50]
tree_pred <- rpart(formulas , data = train4rpart)
Actually, reading through ?rpart, you can skip the whole formula thing:
tree_pred <- rpart(train$left_eye_center_x ~ . , data = train[,31:50])
OR
tree_pred <- rpart(train$left_eye_center_x ~ . , data = train4rpart)

Resources