I have two different data frames on which I would like to perform linear regression.
I have written the following code for it:
mydir<- "/media/dev/Daten/Task1/subject1/t1"
#multiple subject paths should be given here
# read full paths
myfiles<- list.files(mydir,pattern = "regional_vol*",full.names=T)
# initialise the dataframe from first file
df<- read.table( myfiles[1], header = F,row.names = NULL, skip = 3, nrows = 1,sep = "\t")
# [-c(1:3),]
df
#read all the other files and update dataframe
#we read 4 lines to read the header correctly, then remove 3
ans<- lapply(myfiles[-1], function(x){ read.table( x, header = F, skip = 3, nrows = 1,sep = "\t") })
ans
#update dataframe
#[-c(1:3),]
lapply(ans, function(x){df<<-rbind(df,x)} )
#this should be the required dataframe
uncorrect<- array(df)
# Linear regression of ICV extracted from global size FSL
# Location where your icv is located
ICVdir <- "/media/dev/Daten/Task1/T1_Images"
#loding csv file from ICV
mycsv <- list.files(ICVdir,pattern = "*.csv",full.names = T )
af<- read.csv(file = mycsv,header = TRUE)
ICV<- as.data.frame(af[,2],drop=FALSE)
#af[1,]
#we take into consideration second column of csv
#finalcsv <-lapply(mycsv[-1],fudnction(x){read.csv(file="global_size_FSL")})
subj1<- as.data.frame(rep(0.824,each=304))
plot(df ~ subj1, data = df,
xlab = "ICV value of each subject",
ylab = "Original uncorrected volume",
main="intercept calculation"
)
fit <- lm(subj1 ~ df )
The data frame df has 304 rows in the following format:
6433 6433
1430 1430
1941 1941
3059 3059
3932 3932
6851 6851
and the other data frame, subj1, has 304 values in the following format:
0.824
0.824
0.824
0.824
0.824
When I run my code I get the following error:
Error in model.frame.default(formula = subj1 ~ df, drop.unused.levels = TRUE) :
invalid type (list) for variable 'subj1'
Any suggestions as to why the values in the data frame subj1 are invalid?
As mentioned, you are passing a whole data frame where the formula expects a single variable, which is why lm() complains about type 'list'. Try:
fit <- lm(subj1 ~ ., data=df )
This will use all variables in the data frame, as long as subj1 is the name of the dependent variable and not a data frame by itself.
If df has two columns that are the predictors, and subj1 is the predicted (dependent) variable, combine the two, give them proper column names, and create the model in the format above.
Something like:
data <- cbind(df, subj1)
names(data) <- c("var1", "var2", "subj1")
fit <- lm(subj1 ~ var1 + var2, data = data)
Edit: some pointers:
Make sure you use a single data frame that holds all of your independent variables and your dependent variable.
The number of rows should be equal.
If an independent variable is a constant, it does not vary with the dependent variable, so it carries no information. If the dependent variable is a constant, there is no point in regressing: we can predict the value with 100% accuracy.
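The pointers above can be illustrated with a minimal sketch. It uses the first six volumes shown in the question together with the constant 0.824. The column names volume and icv, and the varying ICV values in the second part, are made up for illustration and are not from the question.

```r
# First six volumes from the question, paired with the constant ICV value.
dat <- data.frame(
  volume = c(6433, 1430, 1941, 3059, 3932, 6851),
  icv    = rep(0.824, 6)
)

# A constant predictor is collinear with the intercept, so lm() cannot
# estimate a slope for it and reports NA for that coefficient.
fit_const <- lm(volume ~ icv, data = dat)
print(coef(fit_const))

# Hypothetical ICV values that vary between rows (not from the question).
dat$icv <- c(0.90, 0.78, 0.80, 0.83, 0.85, 0.92)
fit_var <- lm(volume ~ icv, data = dat)
print(coef(fit_var))
```

With the constant column, the intercept is simply the mean of volume and the icv slope is NA. Once the predictor varies, both coefficients can be estimated.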
How do I use the names in a string vector in models? For example, why can't I do this?
loop_variables = c("Age", "BMI", "Height")
for (i in 1:length(loop_variables)){
basic_logistic_model = glm(outcome~loop_variables[i], data=DB, family="binomial")
summary(basic_logistic_model)
}
I see a lot of R users making vectors with the names of study variables and then looping over them.
What am I doing wrong?
It is the formula that needs to be updated. We may use paste or reformulate. In addition, it is better to store the output of summary in an object; a list suits this well.
summary_lst <- vector('list', length(loop_variables))
names(summary_lst) <- loop_variables
for (i in seq_along(loop_variables)) {
# convert the column to factor column
DB[[loop_variables[i]]] <- factor(DB[[loop_variables[i]]])
# create the formula
fmla <- reformulate(loop_variables[i], response = 'outcome')
basic_logistic_model = glm(fmla, data=DB, family="binomial")
# assign the summary output to the list element
summary_lst[[i]] <- summary(basic_logistic_model)
}
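As a self-contained sketch of the same idea, the loop below builds one formula per predictor name with reformulate() and fits a logistic model for each. The built-in mtcars data set stands in for DB here, with am as the binary outcome; both choices are assumptions made only so the example runs without the original data.

```r
# mtcars stands in for the asker's DB; `am` (0/1) plays the role of `outcome`.
loop_variables <- c("mpg", "wt", "hp")

fits <- lapply(loop_variables, function(v) {
  # reformulate("mpg", response = "am") builds the formula am ~ mpg
  fmla <- reformulate(v, response = "am")
  glm(fmla, data = mtcars, family = binomial)
})
names(fits) <- loop_variables

# One summary per predictor, addressable by name.
summaries <- lapply(fits, summary)
print(coef(summaries$mpg))
```

Writing glm(am ~ loop_variables[i], ...) would instead treat the character string itself as the predictor, which is why the formula has to be constructed from the name.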
In my view, the data transformation part (e.g. converting variables into factors) and the modelling part (e.g. using glm()) ought to be separated and not mixed in the loop, for the sake of code readability and efficiency.
Here, I will show how to run the loop using purrr::map(), with the data to be analysed transformed beforehand using dplyr::mutate().
Package loading
library(purrr) # for `map`, `set_names`
library(dplyr) # for `mutate`
Data transformation
Add new variables converted into factors using dummy coding
fct_ToothGrowth <- ToothGrowth |>
mutate(
fct_dose = dose |>
as.factor(),
fct_len = len |>
## The numeric variable `len` is converted
## into a three-level factor
cut(3) |>
as.factor()
)
contrasts(fct_ToothGrowth$fct_dose)
contrasts(fct_ToothGrowth$fct_len)
Add new variables converted into factors using non-dummy coding
Sum contrast and forward difference coding are used here as examples.
fct_ToothGrowth <- ToothGrowth |>
mutate(
fct_dose = `contrasts<-`(
factor(
dose,
levels = c("0.5", "1", "2")
), ,
## sum contrast coding (also known as deviation coding)
contr.sum(3)
),
fct_len = `contrasts<-`(
factor(
cut(len, 3)
), ,
## Forward difference coding
MASS::contr.sdif(3)
)
)
contrasts(fct_ToothGrowth$fct_dose)
contrasts(fct_ToothGrowth$fct_len)
Looping glm()
explanatory_variables <- c("fct_len", "fct_dose", "len", "dose")
summaries <- map(
.x = explanatory_variables,
## `.x` takes each of "fct_len", "fct_dose", "len", and "dose"
## in turn inside the formula below.
~ paste0("supp ~ ", .x) |>
## `supp ~ fct_len`, ..., `supp ~ dose` are inputted
## into the first argument of `glm()`, namely `formula` argument
glm(family = binomial, data = fct_ToothGrowth)
) |>
## set names to the returned sublists
set_names(nm = explanatory_variables)
summaries$fct_len
summaries$fct_dose
summaries$len
summaries$dose
I have run several (17) meta-analyses (identified by specific names) and I need to extract the models' outputs into one single table, as well as add a column with the name of each analysis. I have done it manually, but I was wondering if I could build a loop to do so.
I'm attaching the first three of the 17 analyses, the "names" being "cent", "dist", and "sqrs"
#meta-analyses
res_cent<-rma.mv(yi, vi, mods = ~ factor(drug)-1, random = list(~ 1 | publication_id,~ 1 | strain_def),
data = SR_meta,subset=(SR_meta$measure=="cent"))
res_dist<-rma.mv(yi, vi, mods = ~ factor(drug)-1, random = list(~ 1 | publication_id,~ 1 | strain_def),
data = SR_meta,subset=(SR_meta$measure=="dist"))
res_sqrs<-rma.mv(yi, vi, mods = ~ factor(drug)-1, random = list(~ 1 | publication_id,~ 1 | strain_def),
data = SR_meta,subset=(SR_meta$measure=="sqrs"))
#Creating list for model output - cent
list_cent<-coef(summary(res_cent))
list_cent<-setNames(cbind(rownames(list_cent), list_cent, row.names = NULL),
c("Drug", "Estimate", "se","zval","p-value","CI_l","CI_u"))
df_cent <- list_cent[ -c(3,4) ]
df_cent$Drug<-gsub("factor*","",df_cent$Drug)
df_cent$Drug<-gsub("drug*","",df_cent$Drug)
df_cent$Drug<-gsub("[[:punct:]]","",df_cent$Drug)
n_cent<-plyr::count(cent_sum2, vars = "drug")
names(n_cent)[names(n_cent) == "freq"] <- "n_cent"
df_cent<-cbind(df_cent,n_cent[2])
##same thing can be repeated for the other two measures "dist", and "sqrs".
The output is a data frame that contains the name of the drugs used as factors in the meta-analyses, their estimated effect sizes, p-values, confidence intervals, and how many measures we have per factor (n). I want to compile all of these outputs in a table, (at the end of the code called "matrix_ps") and add a column with the name of the measures.
I have done all the steps manually (below), but it looks extremely inefficient.
Is there a way to create a loop to do this, in which the names of the measures are changed and then the outcome is appended?
Something like
measures<-c("cent","dist","sqrs")
for(i in measures) - not sure how to continue?
matrix_cent<-data.frame(df_cent$Drug,list_cent$`p-value`,df_cent$n_cent,df_cent$Estimate,df_cent$CI_l,df_cent$CI_u)
matrix_dist<-data.frame(df_dist$Drug,list_dist$`p-value`,df_dist$n_dist,df_dist$Estimate,df_dist$CI_l,df_dist$CI_u)
matrix_sqrs<-data.frame(df_sqrs$Drug,list_sqrs$`p-value`,df_sqrs$n_sqrs,df_sqrs$Estimate,df_sqrs$CI_l,df_sqrs$CI_u)
matrix_cent$measure<-"cent"
matrix_dist$measure<-"dist"
matrix_sqrs$measure<-"sqrs"
matrix_cent<-matrix_cent%>% rename(drug=df_cent.Drug,measure=measure,p=list_cent..p.value.,n=df_cent.n_cent,estimate=df_cent.Estimate,ci_low=df_cent.CI_l,ci_up=df_cent.CI_u)
matrix_dist<-matrix_dist%>% rename(drug=df_dist.Drug,measure=measure,p=list_dist..p.value.,n=df_dist.n_dist,estimate=df_dist.Estimate,ci_low=df_dist.CI_l,ci_up=df_dist.CI_u)
matrix_sqrs<-matrix_sqrs%>% rename(drug=df_sqrs.Drug,measure=measure,p=list_sqrs..p.value.,n=df_sqrs.n_sqrs,estimate=df_sqrs.Estimate,ci_low=df_sqrs.CI_l,ci_up=df_sqrs.CI_u)
matrix_ps<-rbind(matrix_cent,matrix_dist,matrix_rear,matrix_sqrs,matrix_toa,matrix_eca,matrix_eoa,matrix_trans,matrix_dark,matrix_light,matrix_stps,matrix_rrs,matrix_time,matrix_toc,matrix_cross,matrix_hd,matrix_lat)
We don't have your data, but you can put all your code in a function:
get_result <- function(x, y) {
list_cent<-coef(summary(x))
list_cent<-setNames(cbind(rownames(list_cent), list_cent, row.names = NULL),
c("Drug", "Estimate", "se","zval","p-value","CI_l","CI_u"))
df_cent <- list_cent[ -c(3,4) ]
df_cent$Drug<-gsub("factor*","",df_cent$Drug)
df_cent$Drug<-gsub("drug*","",df_cent$Drug)
df_cent$Drug<-gsub("[[:punct:]]","",df_cent$Drug)
n_cent<-plyr::count(cent_sum2, vars = "drug")
names(n_cent)[names(n_cent) == "freq"] <- y
df_cent<-cbind(df_cent,n_cent[2])
return(df_cent)
}
Now, assuming all your analyses follow the pattern 'res_', you can do:
library(purrr)
library(dplyr) # for `%>%` and `inner_join`
list_models <- mget(ls(pattern = '^res_'))
result <- imap(list_models, get_result) %>% reduce(inner_join)
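If the goal is one long table with a measure column, as the question describes, a base R alternative is to loop over the measure names and stack the results. The sketch below is only an illustration under loud assumptions: the data frame toy is simulated, and lm() stands in for rma.mv() so that the example runs without metafor or the original data. The helper name summarise_measure is hypothetical.

```r
measures <- c("cent", "dist", "sqrs")

# Simulated stand-in for SR_meta (not the asker's data).
set.seed(1)
toy <- data.frame(
  measure = rep(measures, each = 20),
  drug    = rep(c("A", "B"), times = 30),
  yi      = rnorm(60)
)

# Hypothetical helper: fit one model per measure and tidy its coefficients.
# lm() is used here only as a placeholder for rma.mv().
summarise_measure <- function(m) {
  sub_data <- toy[toy$measure == m, ]
  fit <- lm(yi ~ factor(drug) - 1, data = sub_data)
  out <- as.data.frame(coef(summary(fit)))
  # Strip the "factor(drug)" prefix from the coefficient names.
  out$drug <- sub("^factor\\(drug\\)", "", rownames(out))
  out$measure <- m
  rownames(out) <- NULL
  out
}

# One block of rows per measure, stacked into a single table.
matrix_ps <- do.call(rbind, lapply(measures, summarise_measure))
print(matrix_ps)
```

The same pattern should carry over to the real models by swapping the body of the helper for the extraction code from the question.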
I am generating a model fit using glm. My data has a mix of integer variables and categorical variables. The categorical variables are stored as codes and are therefore of integer type in the data. Initially, when I tried to generate the model, I passed the categorical variables in integer format as they were and got the model. I was looking at the p-values to check which ones are significant and noticed that a few variables were significant that I was not expecting.
This is when I realized that the categorical variables in integer form might be causing an issue. For example, code 3 might get a higher importance than code 1 (I am not sure about this, and it would be great if someone could confirm it). After some research I found that we can convert a categorical integer variable to a factor. I did that and re-generated the model.
I also saw some posts that suggested converting to binary, so I did that as well. So now I have 3 results:
r1 >> with categorical integer variables
r2 >> with categorical factor variables
r3 >> with categorical variable converted to binary
I feel that output 1, with categorical integer variables, is incorrect (please confirm). But between outputs 2 and 3 I am confused about which one to use, as
the p-values are different.
Which one would be more accurate?
Can I relate the p-values of output 3 to output 2?
How does glm handle such variables?
I hope running glm inside a for loop is not an issue.
My database is big; can we do glm using data.table?
I am pasting my code below with some sample data so it can be reproduced:
library("plyr")
library("foreign")
library("data.table")
#####Generating sample data
set.seed(1200)
id <- 1:100
bill <- sample(1:3,100,replace = T)
nos <- sample(1:40,100,replace = T)
stru <- sample(1:4,100,replace = T)
type <- sample(1:7,100,replace = T)
value <- sample(100:1000,100,replace = T)
df1 <- data.frame(id,bill,nos,stru,type,value)
var1 <- c("bill","nos","stru")
options(scipen = 999)
r1 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r1 <- rbind(r1, df2)
}
}
##### converting the categorical numeric variables to factor variables
df1$bill_f <- as.factor(bill)
df1$stru_f <- as.factor(stru)
var1 <- c("bill_f","nos","stru_f")
r2 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r2 <- rbind(r2, df2)
}
}
#####converting the categorical numeric variables to binary format (1/0)
df1$bill_1 <- ifelse(df1$bill == 1,1,0)
df1$bill_2 <- ifelse(df1$bill == 2,1,0)
df1$bill_3 <- ifelse(df1$bill == 3,1,0)
df1$stru_1 <- ifelse(df1$stru == 1,1,0)
df1$stru_2 <- ifelse(df1$stru == 2,1,0)
df1$stru_3 <- ifelse(df1$stru == 3,1,0)
df1$stru_4 <- ifelse(df1$stru == 4,1,0)
var1 <- c("bill_1","bill_2","bill_3","nos","stru_1","stru_2","stru_3")
r3 <- data.frame()
for(type1 in unique(df1$type)){
for(var in var1){
# dynamically generate formula
fmla <- as.formula(paste0("value ~ ", var))
# fit glm model
fit <- glm(fmla, data=df1[df1$type == type1,],family='quasipoisson')
p.value <- coef(summary(fit))[8]
cfit <- coef(summary(fit))
# create data frame
df2 <- data.frame(var = var, type = type1, basket="value",p.value = cfit[8],stringsAsFactors = F)
r3 <- rbind(r3, df2)
}
}
Your feeling is mostly correct. For a GLM you should distinguish between continuous variables and discrete (categorical) variables.
Binary variables are variables that contain only 2 levels, for example 0 and 1.
Since your categorical variables have more than 2 levels, you should use the factor() function.
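A minimal sketch of the difference, using simulated data in the spirit of the question (the data frame df_demo and its values are made up): the integer-coded version fits a single slope across codes 1 to 4, while the factor version fits one coefficient per non-reference level.

```r
# Simulated data: a four-level code and an unrelated count-like response.
set.seed(1)
df_demo <- data.frame(stru = rep(1:4, times = 25))
df_demo$value <- sample(100:1000, 100, replace = TRUE)

# Integer coding: one slope, i.e. code 4 is assumed to be "more" than code 1.
fit_int <- glm(value ~ stru, data = df_demo, family = quasipoisson)

# Factor coding: separate coefficients for levels 2, 3 and 4 versus level 1.
fit_fct <- glm(value ~ factor(stru), data = df_demo, family = quasipoisson)

tab_int <- coef(summary(fit_int))
tab_fct <- coef(summary(fit_fct))
print(dim(tab_int))  # 2 rows: intercept and slope
print(dim(tab_fct))  # 4 rows: intercept and three level contrasts

# Select p-values by column name rather than by position.
print(tab_fct[, "Pr(>|t|)"])
```

Because the factor version has more rows, a positional index such as [8] no longer points at a p-value in that table (it lands in the standard error column), so selecting the column by name is the safer choice when comparing the three sets of results.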
I have been trying for hours to fit an SVM to a data frame, using the last column as the class.
I have this data frame:
# Fill the data frame
df = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep=",",
col.names=c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
fill=TRUE,
strip.white=TRUE)
lastColName <- colnames(df)[ncol(df)]
...
model <- svm(lastColName~.,
data = df,
kernel="polynomial",
degree = degree,
type = "C-classification",
cost = cost)
I'm getting either NULL or Error in model.frame.default(formula = str(lastColName) ~ ., data = df1, : invalid type (NULL) for variable 'str(lastColName)'. I understand that NULL appears when the column has no name. I don't understand the other error, since it is the last column's name.
Any ideas?
You have to use as.formula when you want to use a dynamic variable in the formula. For details see ?as.formula
The following code works fine:
library(e1071)
df_1 = read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep=",",
col.names=c("buying", "maint", "doors", "persons", "lug_boot", "safety", ""),
fill=TRUE,
strip.white=TRUE)
lastColName <- colnames(df_1)[ncol(df_1)]
model <- svm(as.formula(paste(lastColName, "~ .", sep = " ")),
data = df_1,
kernel="polynomial",
degree = 3,
type = "C-classification",
cost = 1)
# to predict on the data remove the last column
prediction <- predict(model, df_1[,-ncol(df_1)])
# The output
table(prediction)
# The output is:
prediction
acc good unacc vgood
0 0 1728 0
# Since this is a highly unbalanced classification the model is not doing a very good job
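To see why the original call failed, it can help to compare what each formula actually refers to. The sketch below uses the built-in mtcars data and lm() instead of the car data and svm(), purely so that it runs without a download or extra packages; the variable names are illustrative.

```r
# Name of the last column, held in a character variable.
last_col <- colnames(mtcars)[ncol(mtcars)]

# Written literally, the formula refers to a variable called `last_col`,
# not to the column whose name is stored in it.
literal_fmla <- last_col ~ .
print(all.vars(literal_fmla))

# Pasting the name into a string and converting it gives the intended formula.
dynamic_fmla <- as.formula(paste(last_col, "~ ."))
print(all.vars(dynamic_fmla))

# The dynamic formula can be passed to any modelling function.
fit <- lm(dynamic_fmla, data = mtcars)
print(length(coef(fit)))
```

The same as.formula(paste(...)) construction is what the svm() call above relies on.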
I have the famous Titanic data set from Kaggle's website. I want to predict the survival of the passengers using logistic regression. I am using the glm() function in R. I first divide my data frame (total rows = 891) into two data frames, i.e. train (rows 1 to 800) and test (rows 801 to 891).
The code is as follows:
data <- read.csv("train.csv", stringsAsFactors = FALSE)
names(data)
# [1] "PassengerId" "Survived" "Pclass" "Name" "Sex" "Age" "SibSp"
# [8] "Parch" "Ticket" "Fare" "Cabin" "Embarked"
# Replacing NA values in Age column with mean value of non-NA values of Age.
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
# Converting sex into binary values. 1 for males and 0 for females.
sexcode <- ifelse(data$Sex == "male",1,0)
# dividing data into train and test data frames
train <- data[1:800,]
test <- data[801:891,]
# setting up the model using glm()
model <- glm(Survived~sexcode[1:800]+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))
# creating a data frame
newtest <- data.frame(sexcode[801:891],test$Age,test$Pclass,test$Fare)
prediction <- predict(model,newdata = newtest,type='response')
And when I run the last line of code
prediction <- predict(model,newdata = newtest,type='response')
I get the following error
Error in eval(expr, envir, enclos) : object 'Age' not found
Can anyone please explain what the problem is? I have checked the newtest variable and there doesn't seem to be any problem with it.
Here is the link to titanic data set https://www.kaggle.com/c/titanic/download/train.csv
First, you should add sexcode directly to the data frame:
data$sexcode <- ifelse(data$Sex == "male",1,0)
Then, as I commented, the column names in your newtest data frame do not match the variable names used in the model, because you created it manually (data.frame() names the columns test.Age, test.Pclass, and so on). You can use the test data frame directly.
So here is your full working code:
data <- read.csv("train.csv", stringsAsFactors = FALSE)
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
data$sexcode <- ifelse(data$Sex == "male",1,0)
train <- data[1:800,]
test <- data[801:891,]
model <- glm(Survived~sexcode+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))
prediction <- predict(model,newdata = test,type='response')
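The name mismatch can be reproduced without the Titanic file. The sketch below uses the built-in mtcars data as a stand-in (an assumption made only so it runs anywhere): predict() looks up each model variable by name in newdata, so a manually built data frame whose columns are named differently fails in the same way.

```r
# mtcars stands in for the Titanic data; `am` is the binary outcome.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
model <- glm(am ~ mpg, family = binomial(link = "logit"), data = train)

# Built by hand: data.frame() names this column "test.mpg", not "mpg".
newtest <- data.frame(test$mpg)
print(names(newtest))

# predict() cannot find `mpg` in newtest, so this call errors.
bad <- try(predict(model, newdata = newtest, type = "response"), silent = TRUE)
print(inherits(bad, "try-error"))

# Passing the original test rows keeps the column names intact.
good <- predict(model, newdata = test, type = "response")
print(good)
```

Keeping every model variable as a column of the data frame before splitting, as done with data$sexcode above, avoids having to rebuild newdata by hand.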