When running a regression analysis in R (using glm) cases are removed due to 'missingness' of the data. Is there any way to flag which cases have been removed? I would ideally like to remove these from my original dataframe.
Many thanks
The model fit object returned by glm() records the row numbers of the data that it excludes for their incompleteness. They are a bit buried but you can retrieve them like this:
## Example data.frame with some missing data
df <- mtcars[1:6, 1:5]
df[cbind(1:5,1:5)] <- NA
df
# mpg cyl disp hp drat
# Mazda RX4 NA 6 160 110 3.90
# Mazda RX4 Wag 21.0 NA 160 110 3.90
# Datsun 710 22.8 4 NA 93 3.85
# Hornet 4 Drive 21.4 6 258 NA 3.08
# Hornet Sportabout 18.7 8 360 175 NA
# Valiant 18.1 6 225 105 2.76
## Fit an example model, and learn which rows it excluded
f <- glm(mpg~drat, weights=disp, data=df)
as.numeric(na.action(f))
# [1] 1 3 5
Alternatively, to get the row indices without having to fit the model, use the same strategy with the output of model.frame():
as.numeric(na.action(model.frame(mpg~drat, weights=disp, data=df)))
# [1] 1 3 5
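Since the question also asks about removing those cases from the original data frame, here is a minimal sketch of that last step. It reuses the example data from above; the names dropped and df_used are only illustrative. The guard matters because na.action() returns NULL when nothing was excluded, and negative indexing with an empty vector would then drop every row.

```r
## Example data.frame with some missing data (same as above)
df <- mtcars[1:6, 1:5]
df[cbind(1:5, 1:5)] <- NA

## Fit the model, then look up which rows glm() excluded
f <- glm(mpg ~ drat, weights = disp, data = df)
dropped <- na.action(f)   # NULL if no rows were excluded

## Keep only the rows the model actually used
df_used <- if (is.null(dropped)) df else df[-as.integer(dropped), ]
rownames(df_used)
```

Rows 2 and 4 are kept even though they contain an NA, because the missing values there are in columns (cyl, hp) that the model does not use.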
Without a reproducible example I can't provide code tailored to your problem, but here's a generic method that should work. Assume your data frame is called df and your variables are called y, x1, x2, etc. And assume you want y, x1, x3, and x6 in your model.
# Make a vector of the variables that you want to include in your glm model
# (Be sure to include any weighting or subsetting variables as well, per Josh's comment)
glm.vars = c("y","x1","x3","x6")
# Create a new data frame that includes only those rows with no missing values
# for the variables that are in your model
df.glm = df[complete.cases(df[ , glm.vars]), ]
Also, if you want to see just the rows that have at least one missing value, do the following (note the addition of ! (the "not" operator)):
df[!complete.cases(df[ , glm.vars]), ]
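If you would rather not type the variable names by hand, one possible variation is to pull them out of the model formula with all.vars(). This is only a sketch on made-up toy data (the data frame and the formula fml below are assumptions for illustration); any weighting or subsetting variables still have to be added to the vector yourself.

```r
# Toy data with the column names assumed in the answer above
df <- data.frame(y  = c(1, 2, NA, 4),
                 x1 = c(1, NA, 3, 4),
                 x2 = c(NA, 2, 3, 4),   # not in the model, so its NA is ignored
                 x3 = 1:4,
                 x6 = 4:1)

# Derive the variable names from the formula instead of typing them
fml <- y ~ x1 + x3 + x6
glm.vars <- all.vars(fml)

# Rows that are complete for the model variables, and the rows that are not
keep <- complete.cases(df[, glm.vars])
df.glm <- df[keep, ]
df.missing <- df[!keep, ]
```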
I have a set of Fisher's discriminant linear functions that I need to multiply against some test data. Both data files are in the form of two matrices (variables lined up to match variable order), so I need to multiply them together.
Here is some example test data, to which I've added a constant=1 variable (you'll see why when we get to the coefficients):
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
> testdata
constant mpg disp hp
Mazda RX4 1 21.0 160 110
Mazda RX4 Wag 1 21.0 160 110
Datsun 710 1 22.8 108 93
Hornet 4 Drive 1 21.4 258 110
Hornet Sportabout 1 18.7 360 175
Valiant 1 18.1 225 105
Here is my coefficients matrix (the Fisher's discriminant linear functions):
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
> coefs
constant mpg disp hp
Function1 -67.67 4.01 0.14 0.13
Function2 -59.46 3.49 0.15 0.15
Function3 -89.70 3.69 0.22 0.20
I need to multiply the values in the test data by the respective coefficients to get 3 function scores per row. Here is how the values would be calculated:
for the first row, Function1 = 1*(-67.67)+21*(4.01)+160*(0.14)+110*(0.13)
for the first row, Function2 = 1*(-59.46)+21*(3.49)+160*(0.15)+110*(0.15)
for the first row, Function3 = 1*(-89.70)+21*(3.69)+160*(0.22)+110*(0.20)
It's kind of like a sumproduct of the coefficients against each row, done three times, once for each function.
So the resulting data frame/matrix should look like this when multiplied: the same number of rows, with 3 function score variables.
> df_result
Function1 Function2 Function3
row1 53.24 54.33 44.99
row2
Not ideal, but for now I'm taking the data out and doing it in Excel. If this is possible to do in R, any help is greatly appreciated. Many thanks
Are you just looking for the matrix product (the inner product of each data row with each row of coefficients)?
testdata <- cbind(constant=1,mtcars[ 1:6 ,c("mpg","disp","hp") ])
coefs <- data.frame(constant = c(-67.67, -59.46, -89.70),
mpg = c(4.01,3.49,3.69),
disp = c(0.14,0.15,0.22),
hp = c(0.13,0.15,0.20))
rownames(coefs) <- c("Function1","Function2","Function3")
as.matrix(testdata) %*% t(as.matrix(coefs))
# Function1 Function2 Function3
# Mazda RX4 53.240 54.330 44.990
# Mazda RX4 Wag 53.240 54.330 44.990
# Datsun 710 50.968 50.262 36.792
# Hornet 4 Drive 68.564 70.426 68.026
# Hornet Sportabout 80.467 86.053 93.503
# Valiant 50.061 53.209 47.589
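One thing the matrix product relies on is that the columns of the two tables are in the same order. A small sketch of a safer variant, assuming both tables carry matching column names: reorder the coefficients by name before multiplying. The coefficient columns are deliberately shuffled here to show the alignment; coefs_aligned, scores and df_result are just illustrative names.

```r
testdata <- cbind(constant = 1, mtcars[1:6, c("mpg", "disp", "hp")])

# Same coefficients as above, but with the columns in a different order
coefs <- data.frame(hp       = c(0.13, 0.15, 0.20),
                    constant = c(-67.67, -59.46, -89.70),
                    disp     = c(0.14, 0.15, 0.22),
                    mpg      = c(4.01, 3.49, 3.69),
                    row.names = c("Function1", "Function2", "Function3"))

# Align the coefficient columns to the data columns by name
coefs_aligned <- coefs[, colnames(testdata)]

# One score per row of testdata and per discriminant function
scores <- as.matrix(testdata) %*% t(as.matrix(coefs_aligned))
df_result <- as.data.frame(scores)
df_result
```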
I'm pretty new to R and this is my first post on the website. I am trying to omit NA rows from my data frame. I am using the na.omit function, which runs but doesn't omit the rows I want.
My data frame looks as shown below. I want to remove the rows with "na" values in the Gene.Symbol column only, without filtering on the other two columns.
I've tried
na.omit(data.frame, cols= Gene.Symbol(data.frame))
This runs, but doesn't remove any rows. I know from looking at the data frame that there are about 19 rows with "na", so the command isn't working at all.
thanks for the help!
Gene.Symbol Diag.A Rel.A
A2ML 173 17
na 02 95
ABCA10 18 97
ABCA4 14 na
ADCY2 81 98
If you're looking to exclude all missing values from N columns, complete.cases is another option. It returns a logical vector in which TRUE marks the rows that have no NA in the selected columns. I find it has better documentation than na.omit and it's much clearer:
tst <- mtcars[1:5, ]
tst$some_na <- c(NA, NA, 2, 2, 3)
tst$another_na <- c(NA, NA, 2, 2, NA)
# These are the columns you want to exclude `NA` from:
non_na <- complete.cases(tst[, c("some_na", "another_na")])
# No NA's
tst[non_na, ]
#> mpg cyl disp hp drat wt qsec vs am gear carb some_na
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
#> another_na
#> Datsun 710 2
#> Hornet 4 Drive 2
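For the single-column case in the question, a minimal sketch might look like this. It assumes the "na" entries are stored as the literal string "na" rather than as real NA values (which would explain why na.omit removed nothing); the data frame genes is rebuilt from the sample rows in the question and the name genes_clean is only illustrative.

```r
# Sample rows from the question, with "na" stored as plain text (an assumption)
genes <- data.frame(Gene.Symbol = c("A2ML", "na", "ABCA10", "ABCA4", "ADCY2"),
                    Diag.A      = c("173", "02", "18", "14", "81"),
                    Rel.A       = c("17", "95", "97", "na", "98"),
                    stringsAsFactors = FALSE)

# Turn the text "na" into a real missing value, in Gene.Symbol only
genes$Gene.Symbol[genes$Gene.Symbol == "na"] <- NA

# Drop rows where Gene.Symbol is missing; other columns are not checked
genes_clean <- genes[!is.na(genes$Gene.Symbol), ]
genes_clean
```

The ABCA4 row survives even though its Rel.A value is "na", because only Gene.Symbol is tested. If the data come from a file, passing na.strings = "na" to read.csv() is another way to get real NA values at import time, though that applies to every column.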
Assuming I'm understanding the documentation of [[ correctly, a matrix can be used to subset a data.frame:
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector. Negative indices are not allowed in the index matrix. NA and zero values are allowed: rows of an index matrix containing a zero are ignored, whereas rows containing an NA produce an NA in the result.
While this works for [, I'm struggling to understand how to do this with [[.
mtcars[1:6, 1:6]
#> mpg cyl disp hp drat wt
#> Mazda RX4 21.0 6 160 110 3.90 2.620
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
#> Datsun 710 22.8 4 108 93 3.85 2.320
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440
#> Valiant 18.1 6 225 105 2.76 3.460
(ind <- matrix(1:6, ncol = 2))
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 6
mtcars[ind]
#> [1] 110.00 3.90 2.32
mtcars[[ind]]
#> Error in as.matrix(x)[[i]]: attempt to select more than one element in vectorIndex
Is this a bug? Or am I misinterpreting the documentation?
Here is the source of [[.data.frame (v3.6.1)
function (x, ..., exact = TRUE)
{
na <- nargs() - !missing(exact)
if (!all(names(sys.call()) %in% c("", "exact")))
warning("named arguments other than 'exact' are discouraged")
if (na < 3L)
(function(x, i, exact) if (is.matrix(i))
as.matrix(x)[[i]]
else .subset2(x, i, exact = exact))(x, ..., exact = exact)
else {
col <- .subset2(x, ..2, exact = exact)
i <- if (is.character(..1))
pmatch(..1, row.names(x), duplicates.ok = TRUE)
else ..1
col[[i, exact = exact]]
}
}
The doc page (?Extract) you reference says that arrays can be indexed by matrices. Implicitly, I take that to mean non-arrays cannot be indexed by matrices. Data frames are not arrays, so they cannot be indexed by matrices. (Matrices are arrays, of course.)
I do think you're misinterpreting the documentation. You're looking at a documentation page that jointly documents [, [[, and $, together. In the argument description, it says
When indexing arrays by [ a single argument i can be a matrix with as many columns as there are dimensions of x...
The section you quote at the top of your question comes later on, under the heading Matrices and Arrays, which I take to be a section about subsetting matrices and arrays, not about using matrices as indices. (Look at the rest of the section, and the sections before and after, and I think you'll agree with me.)
Nowhere on that documentation page does it talk about using matrices as indices for [[.
I'm surprised it's handled specially in the [[ code you show - but near as I can tell, a matrix given to [[.data.frame will error out unless it's a 1x1 matrix, in which case the data frame is treated as a matrix and the single element is returned, for some arcane reason (probably "compatibility with S", though I've no good guess as to why S would allow it).
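If the goal is to pull out one element per (row, column) pair, a sketch of two routes that do work: index with [ and the matrix directly, or call the two-index form of [[ once per pair. The names via_bracket and via_double are just for illustration.

```r
ind <- matrix(1:6, ncol = 2)   # pairs (1,4), (2,5), (3,6)

# Matrix indexing is supported by `[` on a data frame
via_bracket <- mtcars[ind]

# `[[` takes one row index and one column index at a time
via_double <- mapply(function(i, j) mtcars[[i, j]], ind[, 1], ind[, 2])

all.equal(via_bracket, via_double)
```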
I'm going to run xgboost in R using the xgb.train function.
In order to use the xgb.train function, I know that the input data must first be converted using the xgb.DMatrix function.
But when I used this function on my data set, I got an error message:
Error in xgb.DMatrix(data = as.matrix(train)) :
[09:01:01] amalgamation/../dmlc-core/src/io/local_filesys.cc:66: LocalFileSystem.GetPathInfo 1 Error:No such file or directory
Following is my full R code. How should I transform the input data so that it can be used?
credit<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) credit[,i]=as.factor(credit[,i])
str(credit)
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=credit$Creditability, p=0.7, list=FALSE)
train<-credit[intrain, ]
test<-credit[-intrain, ]
d_train<-xgb.DMatrix(data=as.matrix(train))
If you still want to use factors you should use the model.matrix() function to convert your factors to dummy variables.
For example:
my.dat <- mtcars[c("mpg","cyl","disp")]
my.dat$cyl <- as.factor(my.dat$cyl)
# Convert data frame to X matrix
x.train <- model.matrix(mpg~.,data=my.dat)
head(x.train)
Output:
(Intercept) cyl6 cyl8 disp
Mazda RX4 1 1 0 160
Mazda RX4 Wag 1 1 0 160
Datsun 710 1 0 0 108
Hornet 4 Drive 1 1 0 258
Hornet Sportabout 1 0 1 360
Valiant 1 1 0 225
This creates dummy variables cyl6 and cyl8 where 4 cylinder vehicles would be the base group (where cyl6=0 and cyl8=0).
Then you can pass this matrix into the xgb.DMatrix function:
d_train<-xgb.DMatrix(x.train,label=my.dat$mpg)
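A short sketch of why the original call probably failed and of one variation on the fix. as.matrix() on a data frame that contains factors gives a character matrix, and xgb.DMatrix appears to treat character input as a file path, which would match the "No such file or directory" message. Dropping the constant (Intercept) column is an optional extra step, not something the answer above requires; the xgboost call is guarded so the rest runs even without the package installed.

```r
my.dat <- mtcars[c("mpg", "cyl", "disp")]
my.dat$cyl <- as.factor(my.dat$cyl)

# With a factor column present, as.matrix() falls back to character
is.character(as.matrix(my.dat))

# model.matrix() gives a numeric matrix; [, -1] drops the (Intercept) column
x.train <- model.matrix(mpg ~ ., data = my.dat)[, -1]
colnames(x.train)

# Build the DMatrix only if xgboost is available
if (requireNamespace("xgboost", quietly = TRUE)) {
  d_train <- xgboost::xgb.DMatrix(data = x.train, label = my.dat$mpg)
}
```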
Apologies for what is probably a very basic question.
I have created a linear model for a massive meteorological dataset using multiple regression. My goal is to use that model to "predict" data during a certain period using predictors 1, 2 and 3. I will then compare those predicted data to the observed data for that period.
My approach thus far has been to create a new vector for the predicted values and loop through the vector, creating predicted values based on the extracted coefficients of the linear model. Then, I will simply subtract the predicted values from the observed values. For some reason, this approach results in the new predicted vector being NULL. Any idea how I could approach this?
A sample is below. "data" refers to the dataset containing the predictors.
coef <- coefficients(multipleRegressionModel)
predictedValues=c()
for(i in 1:length(data$timePeriod)){
predictedValues[i] = append(predictedValues, data$coef[1]+data$predictor1[i]*data$coef[2]+data$predictor2[i]*data$coef[3]+
data$predictor3[i]*data$coef[4])
}
diff=c()
diff=observedValues - predictedValues
It looks like you are making this more difficult than it needs to be. R has a predict() function that does all of this for you. If you had a sample data.frame like so:
set.seed(26)
mydf = data.frame (a=1:20 , b = rnorm(20),
c = 1:20 + runif(20,2,3)*runif(20, 2, 3),
d = 1:20 + rpois(20,5)*runif(1:20)*sin(1:20))
And you wanted to train on some rows, and test on the others
trainRows<-sample(1:20, 16)
mydf.train<-mydf[trainRows,]
mydf.test<-mydf[-trainRows,]
Then fit the model and predict
model<-lm(a~b+c+d, data = mydf.train)
summary(model) #gives info about your model.
mydf.test$pred<-predict(model, newdata = mydf.test)
MSE<-mean((mydf.test$pred-mydf.test$a)^2) #calculate mean squared error
MSE
#[1] 0.06321
View the predictions with mydf.test$pred
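To connect this back to the coefficient-based approach in the question: the manual calculation is the design matrix multiplied by the coefficient vector, and it should agree with predict(). The sketch below uses mtcars and an arbitrary formula purely for illustration. Note that the coefficients live in their own vector, so they are indexed as coef(model)[k], not through the data frame (data$coef does not exist, which would explain the NULL result).

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Manual prediction: intercept column plus predictors, times the coefficients
X <- model.matrix(~ disp + hp + wt, data = mtcars)
manual <- drop(X %*% coef(model))

# The same thing via predict()
auto <- predict(model, newdata = mtcars)
all.equal(unname(manual), unname(auto))

# Observed minus predicted
differences <- mtcars$mpg - auto
```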
Here is a simple example using a glm on the mtcars data.
Line<- #setting up the linear model function
function (train_dat, test_dat, variables, y_var, family = "gaussian")
{
fm <- as.formula(paste(y_var, " ~", paste(variables, collapse = "+"))) #formula
glm1 <- glm(fm, data = train_dat, family = family) #run the model
pred <- predict(glm1, newdata = test_dat) #predict on the test data
pred #return the predictions visibly
}
data(mtcars)
y_var<-'mpg'
x_vars<-setdiff(names(mtcars),y_var)
mtcars[,'linear_prediction']<-Line(mtcars,mtcars,x_vars,y_var)
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb linear_prediction
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 22.59951
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 22.11189
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.25064
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 21.23740
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 17.69343
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 20.38304