How to select numerical columns for linear regression in R [duplicate] - r

I have the following dataset named fish_data
> structure(list(Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Bream", "Parkki", "Perch", "Pike", "Roach", "Smelt", "Whitefish"), class = "factor"),
> WeightGRAM = c(242, 290, 340, 363, 430, 450), VertLengthCM = c(23.2, 24, 23.9, 26.3, 26.5, 26.8),
> DiagLengthCM = c(25.4, 26.3, 26.5, 29, 29, 29.7),
> CrossLengthCM = c(30, 31.2, 31.1, 33.5, 34, 34.7),
> HeightCM = c(11.52, 12.48, 12.3778, 12.73, 12.444, 13.6024),
> WidthCM = c(4.02, 4.3056, 4.6961, 4.4555, 5.134, 4.9274)),
> row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`41` = 41L), class = "omit"))
How can I build a linear regression model named m1 with WeightGRAM as a function of Species and all the measurement variables, i.e. VertLengthCM, DiagLengthCM, CrossLengthCM, HeightCM and WidthCM?
I have the linear regression code below:
m1 <- lm(WeightGRAM~.,data = fish_data )
summary(m1)
But I want to exclude Species, as it is a factor.

You can try this:
#Index
index <- which(names(fish_data)=='Species')
#Model
m1 <- lm(WeightGRAM~.,data = fish_data[,-index] )
#Call:
#lm(formula = WeightGRAM ~ ., data = fish_data[, -index])
#Coefficients:
# (Intercept) VertLengthCM DiagLengthCM CrossLengthCM HeightCM WidthCM
# -827.56 -124.85 70.08 72.14 -23.41 72.52

You can check whether a column is numeric using is.numeric, which returns a logical value. You can use the result to subset fish_data.
cols <- sapply(fish_data, is.numeric)
m1 <- lm(WeightGRAM~.,data = fish_data[, cols])
m1
#Call:
#lm(formula = WeightGRAM ~ ., data = fish_data[, cols])
#Coefficients:
# (Intercept) VertLengthCM DiagLengthCM CrossLengthCM HeightCM WidthCM
# -827.6 -124.8 70.1 72.1 -23.4 72.5
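The same selection can also be written with vapply, which always returns a logical vector, and drop = FALSE, which keeps a data frame even if only one column survives. The sketch below is a minimal, self-contained variant that uses a subset of the posted columns; the names fish_small, num_cols and m_num are made up for illustration.

```r
# Subset of the posted fish_data (three measurement columns left out for brevity).
fish_small <- data.frame(
  Species    = factor(rep("Bream", 6)),
  WeightGRAM = c(242, 290, 340, 363, 430, 450),
  HeightCM   = c(11.52, 12.48, 12.3778, 12.73, 12.444, 13.6024),
  WidthCM    = c(4.02, 4.3056, 4.6961, 4.4555, 5.134, 4.9274)
)

# TRUE for numeric columns, FALSE for factors and character columns.
num_cols <- vapply(fish_small, is.numeric, logical(1))

# Fit on the numeric columns only, so Species never enters the model.
m_num <- lm(WeightGRAM ~ ., data = fish_small[, num_cols, drop = FALSE])
names(coef(m_num))
```

Because the response WeightGRAM is itself numeric, it stays in the subset and the formula WeightGRAM ~ . still works unchanged.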

Related

cannot coerce class ‘"formula"’ to a data.frame

I am trying to use the Hotelling test.
When I call hotelling.test(.~Number, bottle.df)
everything is OK.
However, when I try to run the Hotelling test on only one element,
```
bottle_elem1<-data.frame(bottle.df$Number,bottle.df$Mn)
hotelling.test(bottle.df.Number, bottle_elem1)
```
it gives an error:
> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class ‘"formula"’ to a data.frame
> Traceback:
> 1. data.frame(. ~ Number, bottle.df$Mn)
> 2. as.data.frame(x[[i]], optional = TRUE)
> 3. as.data.frame.default(x[[i]], optional = TRUE)
> 4. stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), domain = NA)
I understand that I should do it differently, but I don't know how. If I use .~Number as before, there is an error too.
What is the correct code to run the Hotelling test on a single column? Maybe I should extract the column differently, but I don't know how.
bottle.df is from the Hotelling package:
structure(list(Number = c(1L, 1L, 1L, 1L, 1L, 1L), Mn = c(56.1,
53.8, 58.7, 54.6, 58.6, 56.8), Ba = c(170.7, 166.2, 184.2, 170.5,
185.2, 180.5), Sr = c(145.1, 143.3, 156.5, 158.1, 161.3, 146.7
), Zr = c(77.4, 71.6, 78.2, 75.3, 83.9, 79.2), Ti = c(267.4,
270, 286.4, 273.6, 289.9, 274)), row.names = c(NA, 6L), class = "data.frame")

Problem with post-hoc emmeans() test after lmerTest

I have a dataset looking at a response variable (Fat %) over time (Week 0-4) and across a treatment condition (short vs. long day).
I used an lmer model to test whether the variables and the interaction term were significant, and the result was significant. I want to look further at the interaction term (basically a Tukey test, but still accounting for the repeated measures). That's when I started using the emmeans package, but it is not giving me the full output I would like. Any suggestions would be appreciated, thank you.
Here is my data set:
structure(list(`Bird ID` = c(61, 62, 71, 72, 73, 76, 77, 63,
64, 69), Day = c("long", "long", "long", "long", "long", "long",
"long", "short", "short", "short"), Week = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), `Body Weight` = c(34.57, 41.05, 37.74, 37.04, 33.38,
35.6, 31.88, 34.32, 35.5, 35.78), `Fat %` = c(2.42718446601942,
2.07515423443634, 11.7329093799682, 8.61137591356848, 5.36031238906638,
7.9879679144385, 1.2263099219621, 5.17970401691332, 8.73096446700508,
3.62993896562801), `Lean %` = c(97.5728155339806, 97.9248457655636,
88.2670906200318, 91.3886240864315, 94.6396876109336, 92.0120320855615,
98.7736900780379, 94.8202959830867, 91.2690355329949, 96.370061034372
), `Fat(g)` = c(0.7, 0.74, 3.69, 2.71, 1.51, 2.39, 0.33, 1.47,
2.58, 1.13), `Lean(g)` = c(28.14, 34.92, 27.76, 28.76, 26.66,
27.53, 26.58, 26.91, 26.97, 30), ID = c(1, 2, 3, 4, 5, 6, 7,
8, 9, 10)), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
code I have tried:
model:
model3b <- lmer( `Fat %` ~ Day + Week + Day:Week + (1|ID), data=jussara_data)
summary(model3b)
resp <- jussara_data$`Fat %`
f1 <- jussara_data$Week
f2 <- jussara_data$Day
fit1 = lm(log(resp) ~ f1 + f2 + f1:f2, data = jussara_data)
emm1 = emmeans(fit1, specs = pairwise ~ f1:f2)
emm1$emmeans
emm1$contrasts
I was hoping the contrasts would give me a summary looking something like this (but with the repeated measures included, not just this ANOVA analysis):
Fat % groups
4:short 32.065752 a
3:short 27.678036 a
2:short 21.358485 b
4:long 13.895404 c
1:short 13.138941 c
2:long 12.245741 c
3:long 12.138498 c
1:long 10.315978 cd
0:short 6.134327 d
0:long 5.631602 d
but instead it only gave me this:
f1 f2 emmean SE df lower.CL upper.CL
2 long 2.24 0.0783 66 2.09 2.40
2 short 2.80 0.0783 66 2.64 2.95
Results are given on the log (not the response) scale.
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
2 long - 2 short -0.556 0.111 66 -5.025 <.0001
Results are given on the log (not the response) scale.
Thank you for the help!

How can I create a volcano plot in r using muma package

I've been trying to create a volcano plot using the muma package in R, but I've had difficulty importing my data in CSV format. Every time I run the code below, I get the error message shown after it.
If you know an easier way to create a volcano plot using a different package, that would really help me.
Thanks.
explore.data(file="datosFin_A1AT.csv",scaling="Auto", scal = TRUE, normalize = TRUE,
imputation = TRUE, imput="mean")
Error in `[.data.frame`(comp, , 3:ncol(comp)) :
undefined columns selected
structure(list(Muestra = c("MI-001", "MI-003", "MI-009", "MI-012"), Class = c("Presencia",
"Ausencia", "Presencia", "Ausencia"), Per_Cintura = c(97.6, 92.8, 98.8, 113.4), HDL = c(38, 51, 51, 44), TG = c("195", "76", "160", "128"), ApoB = c(145, 161, 173, 50.9), Glucosa_mg=c(86, 85, 96, 79, 7), LBP = c(443.187, 438.925, 703.752,
540.541), IFABP = c(0.705485, 0.906843, 144.873, 145.884), CLD3 = c(0.2, 501.596, 315.582, 446.307), Acetico = c(NA, 745.654, NA, 105.378), Propionico = c(NA, 682.719, 86.628, 303.139), Butirico = c(NA, 571.421, 265.559, 135.674), Isobutirico = c(286.085, 0.0381631, 0.276992, 0.0467809), prevotella = c(0.12843, 0.07927, 0.22459, 0.01726), Pathogen=c(0.05639, 0.16051, 0.01617, 0.04398), Lachnospiraceae = c(0.24202, 0.73606, 0.67789, 0.62656), Aker_Bacter = c(0.06167, 0.00999, 0.03426, 0.0211), Ruminoco = c(0.33593, 3e-05, 0.01538, 0.01298), TNFa = c(14.16, 35.35, 43.71, 42.99), PCR = c(1.71, 1.84, 3.52, 2.32), IL33 = c(148.7, 207.6, 146.2, 162.6), IL8 = c(157.9, 115.3, NA, NA), IL1b = c(13.68, 12.36, 13.69, 19.06), IL18 = c(231.6, 293.5, 366.2, 298.5)))

Function to calculate median by column to an R dataframe that is done regularly to multiple dataframes

I am trying to write a function that combines several steps I regularly apply to an R data frame. At the moment I run the lines one by one, which is inefficient. Here is an example of each step I currently take:
library(scores)
MscoreMax <- 3 # threshold compared against labMscore below
labMedians <- mapply(median, df[-1], na.rm = T) #calculate the median for each column except 1st
LabGrandMedian <- median(mapply(median, df[-1], na.rm = T),na.rm = T)
labMscore <- as.vector(round(abs(scores_na(labMedians, "mad")), digits = 2)) #calculate mscore by lab
labMscoreIndex <- which(labMscore > MscoreMax) #get the position in the vector that exceeds Mscoremax
df[-1][labMscoreIndex] <- NA # discard values above the threshold by setting them to NA
An example of my df is below:
structure(list(Determination_No = 1:6, `2` = c(55.94, 55.7, 56.59,
56.5, 55.98, 55.93), `3` = c(56.83, 56.54, 56.18, 56.5, 56.51,
56.34), `4` = c(56.39, 56.43, 56.53, 56.31, 56.47, 56.35), `5` = c(56.32,
56.29, 56.31, 56.32, 56.39, 56.32), `7` = c(56.48, 56.4, 56.54,
56.43, 56.73, 56.62), `8` = c(56.382, 56.258, 56.442, 56.258,
56.532, 56.264), `10` = c(56.3, 56.5, 56.2, 56.5, 56.7, 56.5),
`12` = c(56.11, 56.46, 56.1, 56.35, 56.36, 56.37)), class = "data.frame", row.names = c(NA,
-6L))
I started by trying to get the individual lab medians and the grand median with the following, but got errors:
mediansFunction <- function(x){
analytemedians <- mapply(median(x[,-1]))
grandmedian <- median(x[,-1])
list(analytemedians,grandmedian)
}
mediansFunction(df)
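But I get "Error in median.default(x[, -1]) : need numeric data".
median() expects a numeric vector, but x[, -1] is a data frame. Apply median to each column with sapply, and convert the columns to a matrix to get the overall median. Try: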
mediansFunction <- function(x){
analytemedians <- sapply(x[-1], median)
median_of_median <- median(analytemedians)
grand_median <- median(as.matrix(x[-1]))
list(analytemedians = analytemedians,
median_of_median = median_of_median,
grand_median = grand_median)
}
mediansFunction(df)
#$analytemedians
# 2 3 4 5 7 8 10 12
#55.960 56.505 56.410 56.320 56.510 56.323 56.500 56.355
#$median_of_median
#[1] 56.3825
#$grand_median
#[1] 56.386
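To fold the remaining steps from the question (the score and the NA replacement) into one function, something like the following could work. It is a sketch under assumptions: scores_na is not shown in the question, so the score here is taken to be the absolute deviation of each lab median from the median of medians, divided by the MAD (base R's mad), and the names flag_outlier_labs, mscore_max and lab_df are made up for illustration.

```r
# The posted example data; check.names = FALSE keeps the numeric lab names as they are.
lab_df <- data.frame(
  Determination_No = 1:6,
  `2`  = c(55.94, 55.7, 56.59, 56.5, 55.98, 55.93),
  `3`  = c(56.83, 56.54, 56.18, 56.5, 56.51, 56.34),
  `4`  = c(56.39, 56.43, 56.53, 56.31, 56.47, 56.35),
  `5`  = c(56.32, 56.29, 56.31, 56.32, 56.39, 56.32),
  `7`  = c(56.48, 56.4, 56.54, 56.43, 56.73, 56.62),
  `8`  = c(56.382, 56.258, 56.442, 56.258, 56.532, 56.264),
  `10` = c(56.3, 56.5, 56.2, 56.5, 56.7, 56.5),
  `12` = c(56.11, 56.46, 56.1, 56.35, 56.36, 56.37),
  check.names = FALSE
)

flag_outlier_labs <- function(x, mscore_max = 3) {
  labs <- x[-1]                                   # every column except the first
  lab_medians  <- vapply(labs, median, numeric(1), na.rm = TRUE)
  grand_median <- median(lab_medians, na.rm = TRUE)
  # Assumed score: |median - grand median| / MAD, rounded to 2 digits.
  spread     <- mad(lab_medians, center = grand_median, na.rm = TRUE)
  lab_mscore <- round(abs(lab_medians - grand_median) / spread, 2)
  # which() drops NA comparisons, so labs with an undefined score are never flagged.
  flagged <- names(lab_mscore)[which(lab_mscore > mscore_max)]
  if (length(flagged) > 0) x[flagged] <- NA       # blank out whole lab columns
  list(data = x, lab_medians = lab_medians, grand_median = grand_median,
       lab_mscore = lab_mscore, flagged = flagged)
}

res <- flag_outlier_labs(lab_df)
res$flagged
```

Returning a list keeps the intermediate values available, so the same call can be repeated on other data frames with the same layout.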

Getting estimate and p-value into dataframe

I am fairly new to R. My data looks something like this (only with 9000 columns and 66 rows)
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88)
ID1 <- c(55030, 54539, 54937, 48897, 58160, 54686, 55393, 47191, 39805, 37601, 51328, 28882, 45587, 60061, 31892, 28670)
ID2 <- c(20485, 11907, 10571, 20974, 10462, 11149, 20970, NA, NA, 9295, NA, 8714, 24446, 10748, 9037, 11859)
ID3 <- c(93914, 44482, 43705, 51144, 49485, 43908, 44324, 37342, 18872, 39660,61673, 43837, 36528, 44738, 41648, 11100)
DF <- data.frame (Time, ID1, ID2, ID3)
I want to get a data frame that looks like this :
ID1, rho, p-value
ID2, rho, p-value
...
The rho and the p-value would be the results from a cor.test (spearman) with Time and each ID
Among other things I've tried this:
results <- data.frame(ID="", Estimate="", P.value="")
estimates = numeric(16)
pvalues = numeric(16)
for (i in 2:4){
test <- cor.test(DF[,1], DF[,i])
estimates[i] = test$estimate
pvalues[i] = test$p.value
}
And R gives me the following error:
Error: object 'test' not found
I've also tried:
result <- do.call(rbind,lapply(2:4, function(x) {
cor.result<-cor.test(DF[,1],DF[,x])
pvalue <- cor.result$p.value
estimate <- cor.result$estimate
return(data.frame(pvalue = pvalue, estimate = estimate))
})
)
And R gives me a similar error
Error: object 'cor.result' not found
I'm sure it's an easy fix but I can't seem to figure it out. Any help is more than welcome.
This is what I got after running
dput(head(SmallDataset[,1:5]))
structure(list(Species = c("Human.hsapiens", "Chimpanzee.ptroglodytes",
"Gorilla.ggorilla", "Orangutan.pabelii", "Gibbon.nleucogenys",
"Macaque.mmulatta"), Time = c(0, 6.4, 8.61, 15.2, 19.43, 28.1
), ID1 = c(55030, 54539, 54937, 48897, 58160, 54686), ID2 = c(20485,
11907, 10571, 20974, 10462, 11149), ID3 = c(93914, 44482, 43705,
51144, 49485, 43908)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
My solution involves defining a function within a lapply call
##
library(dplyr)
###Create dataframe
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88, 89)
ID1 <- c(55030, 54539, 54937, 48897, 58160, 54686, 55393, 47191, 39805, 37601, 51328, 28882, 45587, 60061, 31892, 28670)
ID2 <- c(20485, 11907, 10571, 20974, 10462, 11149, 20970, NA, NA, 9295, NA, 8714, 24446, 10748, 9037, 11859)
ID3 <- c(93914, 44482, 43705, 51144, 49485, 43908, 44324, 37342, 18872, 39660,61673, 43837, 36528, 44738, 41648, 11100)
DF <- data.frame (Time, ID1, ID2, ID3)
##Run the correlations
l2 <- lapply(2:4, function(i)cor.test(DF$Time, DF[,i]))
##Define function to extract p_value and coefficients
l3 <- lapply(l2, function(i){
return(tibble(estimate = i$estimate,
p_value = i$p.value))
})
##Create a dataframe with information
l4 <- bind_rows(l3) %>% mutate(ID = paste0("ID", 1:3)) ##Data frame with info
l4
Consider building a list of data frames with lapply (an iteration function similar to for, but one that returns a list of the same length as its input). Afterwards, row-bind all the data frame elements together:
results <- lapply(2:4, function(i){
test <- cor.test(DF[,1], DF[,i])
data.frame(ID = names(DF)[i],
estimate = unname(test$estimate),
pvalues = unname(test$p.value))
})
final_df <- do.call(rbind, results)
final_df
# ID estimate pvalues
# 1 ID1 -0.6238591 0.009805341
# 2 ID2 -0.2270515 0.455676037
# 3 ID3 -0.4964092 0.050481533
NOTE: Your posted data for Time is missing an observation and cannot immediately be combined with the other vectors in data.frame(). To resolve this, I added a sixth 88 at the end:
Time <- c(0, 6.4, 8.6, 15.2, 19.4, 28.1, 42.6, 73, 73, 85, 88, 88, 88, 88, 88, 88)
Using posted SmallDataset:
SmallDataset <- structure(...)
results <- lapply(3:5, function(i){
test <- cor.test(SmallDataset$Time, SmallDataset[[i]]) # [[i]] returns a vector even if SmallDataset is a tibble
data.frame(ID = names(SmallDataset)[i],
estimate = unname(test$estimate),
pvalues = unname(test$p.value))
})
final_df <- do.call(rbind, results)
final_df
# ID estimate pvalues
# 1 ID1 0.03251407 0.9512461
# 2 ID2 -0.41733336 0.4103428
# 3 ID3 -0.60732484 0.2010166
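The calls above use cor.test's default method, which is Pearson. Since the question asks for Spearman's rho, a variant passing method = "spearman" might look like the sketch below. It uses the posted SmallDataset values (without the Species column) and loops over column names instead of positions; the names small, id_cols and spearman_df are made up for illustration.

```r
# Posted SmallDataset values, rebuilt as a plain data frame.
small <- data.frame(
  Time = c(0, 6.4, 8.61, 15.2, 19.43, 28.1),
  ID1  = c(55030, 54539, 54937, 48897, 58160, 54686),
  ID2  = c(20485, 11907, 10571, 20974, 10462, 11149),
  ID3  = c(93914, 44482, 43705, 51144, 49485, 43908)
)

# Every column except Time is correlated against Time.
id_cols <- setdiff(names(small), "Time")

rows <- lapply(id_cols, function(id) {
  # exact = FALSE avoids the warning about exact p-values when the data contain ties.
  test <- cor.test(small$Time, small[[id]], method = "spearman", exact = FALSE)
  data.frame(ID = id, rho = unname(test$estimate), p_value = test$p.value)
})

spearman_df <- do.call(rbind, rows)
spearman_df
```

Looping over names rather than 2:4 means the same code should scale to the full data set with thousands of ID columns, as long as Time keeps its name.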
