cannot coerce class ‘"formula"’ to a data.frame - r

I am trying to run a Hotelling test.
When I call hotelling.test(.~Number, bottle.df)
everything is OK.
However, when I try to run the Hotelling test on only one element,
```
bottle_elem1<-data.frame(bottle.df$Number,bottle.df$Mn)
hotelling.test(bottle.df.Number, bottle_elem1)
```
it gives an error:
> Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class ‘"formula"’ to a data.frame
> Traceback:
> 1. data.frame(. ~ Number, bottle.df$Mn)
> 2. as.data.frame(x[[i]], optional = TRUE)
> 3. as.data.frame.default(x[[i]], optional = TRUE)
> 4. stop(gettextf("cannot coerce class %s to a data.frame", sQuote(deparse(class(x))[1L])), domain = NA)
I understand that I should do it differently, but I don't know how. If I use .~Number as before, there is an error too.
What is the correct code to run a Hotelling test on a single column? Maybe I should extract the column differently, but I don't know how.
The bottle data is from the Hotelling package:
structure(list(Number = c(1L, 1L, 1L, 1L, 1L, 1L), Mn = c(56.1,
53.8, 58.7, 54.6, 58.6, 56.8), Ba = c(170.7, 166.2, 184.2, 170.5,
185.2, 180.5), Sr = c(145.1, 143.3, 156.5, 158.1, 161.3, 146.7
), Zr = c(77.4, 71.6, 78.2, 75.3, 83.9, 79.2), Ti = c(267.4,
270, 286.4, 273.6, 289.9, 274)), row.names = c(NA, 6L), class = "data.frame")
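One way to read the traceback: data.frame(. ~ Number, bottle.df$Mn) tries to store the formula itself as a column, and a formula cannot be coerced to a data frame. A minimal sketch of the two pieces that seem to be needed, selecting columns by name and testing a single variable, is below. It uses the built-in mtcars data as a stand-in for bottle.df (am plays the role of Number, mpg the role of Mn). It relies on the fact that, for a single response and two groups, Hotelling's T^2 equals the square of the pooled-variance two-sample t statistic, so base R's t.test is enough. Whether hotelling.test itself accepts a one-column response is not checked here.

```r
# Stand-in data: mtcars ships with R; `am` (0/1) plays the role of Number,
# `mpg` the role of Mn. Swap in bottle.df and its column names as needed.
one_elem <- mtcars[, c("am", "mpg")]   # select columns by name, keeps the names

# Two-sample t test with pooled variance (one-variable analogue of Hotelling)
tt <- t.test(mpg ~ am, data = one_elem, var.equal = TRUE)

# Hotelling's T^2 computed by hand for a single variable
x1 <- one_elem$mpg[one_elem$am == 0]
x2 <- one_elem$mpg[one_elem$am == 1]
n1 <- length(x1)
n2 <- length(x2)
pooled_var <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
T2 <- (n1 * n2 / (n1 + n2)) * (mean(x1) - mean(x2))^2 / pooled_var

# T^2 and t^2 agree; the p-value is the one reported by the t test
c(T2 = T2, t_squared = unname(tt$statistic)^2, p_value = tt$p.value)
```

Selecting with bottle.df[, c("Number", "Mn")] instead of data.frame(bottle.df$Number, bottle.df$Mn) also keeps the original column names, so a formula such as Mn ~ Number can still refer to them.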

Related

svyglm and interactions package showing error in R4.0.2 not R3.6.3

I have svyglm code that works in R 3.6.3 but not in R 4.0.2, which gives the following error messages.
For svyglm:
Error: Must subset elements with a valid subscript vector.
x Subscript has the wrong type `omit`.
i It must be logical, numeric, or character.
Backtrace:
survey::svyglm(...)
survey:::svyglm.survey.design(...)
survey:::[.survey.design2(design, -nas, )
base::[.data.frame(x$variables, i, ..1, drop = FALSE)
vctrs:::[.vctrs_vctr(xj, i)
vctrs:::vec_index(x, i, ...)
vctrs::vec_slice(x, i)
For interactions package:
Error: <labelled> - is not permitted
Backtrace:
interactions::probe_interaction(...)
interactions::sim_slopes(...)
interactions:::center_ss(...)
interactions:::center_ss_survey(...)
jtools::gscale(vars = ndfvars, data = design, center.only = TRUE)
...
vctrs:::-.vctrs_vctr(left, right)
haven:::vec_arith.haven_labelled.default("-", e1, e2)
vctrs::stop_incompatible_op(op, x, y)
vctrs:::stop_incompatible(...)
vctrs:::stop_vctrs(...)
My code is:
design <-
svydesign(
id = ~I_i ,
data = sm ,
weights = ~w ,
)
fit <- svyglm(sc ~ sex + school +sex*school, design)
summary(fit)
probe_interaction(fit, pred = school, modx = sex)
I wish to stay on R 4.0.2, as downgrading R creates a lot of problems with other R packages. The probe_interaction part did not work on the older version of R (on a PC) either.
Any clue what went wrong?
Data:
structure(list(I_i = structure(c(-54, 1000987, 1000166), label = "ID institution", format.spss = "F39.0", display_width = 39L, labels = c(Filtered = -99, Don't know = -98, Refused = -97, Not in list = -96, Implausible value = -95, Not reached = -94, Does not apply = -93, Question erroneously not asked = -92, Survey aborted = -91, Unspecific missing = -90, Not participated = -56, Not determinable = -55, Missing by design = -54, Anonymized = -53, Implausible value removed = -52, No estimate in check module = -51 ), class = c("haven_labelled", "vctrs_vctr", "double")), w = c(0.197917401820921, 0.186422150045312, 0.275986837527898),sc = c(NA, 7.66666666666667, 8.66666666666667 ), sex = structure(c(1, 1, 0), label = "Gender", format.spss = "F3.0", display_width = 9L, labels = c(Implausible value = -95, Unspecific missing = -90, Missing by design = -54, <U+0085> male? = 1, <U+0085> female? = 2), class = c("haven_labelled", "vctrs_vctr", "double")), school = c(NA, 1L, 1L)), row.names = c(NA, 3L), class = "data.frame")
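Both backtraces end in vctrs/haven methods, which suggests that the haven_labelled columns (I_i and sex in the data above) are involved. A possible workaround, sketched here in base R only, is to turn those columns into plain numeric vectors before calling svydesign. The helper name strip_labelled and the toy data frame are made up for illustration, and the value labels are lost in the conversion, so this is an assumption to test rather than a confirmed fix.

```r
# Toy data frame mimicking the structure above: `sex` carries the
# haven_labelled class, the other columns are plain vectors.
sm <- structure(
  list(
    w   = c(0.2, 0.19, 0.28),
    sc  = c(NA, 7.67, 8.67),
    sex = structure(c(1, 1, 0), label = "Gender",
                    class = c("haven_labelled", "vctrs_vctr", "double"))
  ),
  row.names = c(NA, 3L), class = "data.frame"
)

# Hypothetical helper: replace every haven_labelled column by a bare numeric
# vector (as.numeric drops the class and the label attributes).
strip_labelled <- function(df) {
  is_lab <- vapply(df, inherits, logical(1), what = "haven_labelled")
  df[is_lab] <- lapply(df[is_lab], function(x) as.numeric(unclass(x)))
  df
}

sm_plain <- strip_labelled(sm)
sapply(sm_plain, class)   # every column is now "numeric"
```

If the haven package is installed, haven::zap_labels() does a similar job on labelled vectors; the cleaned data frame would then go into svydesign(data = sm_plain, ...).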

How to select numerical columns for linear regression in R [duplicate]

I have the following dataset named fish_data
> structure(list(Species = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Bream", "Parkki", "Perch", "Pike", "Roach", "Smelt", "Whitefish"), class = "factor"),
> WeightGRAM = c(242, 290, 340, 363, 430, 450), VertLengthCM = c(23.2, 24, 23.9, 26.3, 26.5, 26.8),
> DiagLengthCM = c(25.4, 26.3, 26.5, 29, 29, 29.7),
> CrossLengthCM = c(30, 31.2, 31.1, 33.5, 34, 34.7),
> HeightCM = c(11.52, 12.48, 12.3778, 12.73, 12.444, 13.6024),
> WidthCM = c(4.02, 4.3056, 4.6961, 4.4555, 5.134, 4.9274)),
> row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"), na.action = structure(c(`41` = 41L), class = "omit"))
How can I build a linear regression model named m1 with WeightGRAM as a function of Species and all the measurement variables, i.e. VertLengthCM, DiagLengthCM, CrossLengthCM, HeightCM, WidthCM?
I have the linear regression code below:
m1 <- lm(WeightGRAM~.,data = fish_data )
summary(m1)
But I want to exclude Species, as it is a factor.
You can try this:
#Index
index <- which(names(fish_data)=='Species')
#Model
m1 <- lm(WeightGRAM~.,data = fish_data[,-index] )
Call:
lm(formula = WeightGRAM ~ ., data = fish_data[, -index])
Coefficients:
(Intercept) VertLengthCM DiagLengthCM CrossLengthCM HeightCM WidthCM
-827.56 -124.85 70.08 72.14 -23.41 72.52
You can check whether a column is numeric using is.numeric, which returns a logical value. You can use it to subset fish_data.
cols <- sapply(fish_data, is.numeric)
m1 <- lm(WeightGRAM~.,data = fish_data[, cols])
m1
#Call:
#lm(formula = WeightGRAM ~ ., data = fish_data[, cols])
#Coefficients:
# (Intercept) VertLengthCM DiagLengthCM CrossLengthCM HeightCM WidthCM
# -827.6 -124.8 70.1 72.1 -23.4 72.5
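Another option is to leave the data frame untouched and drop the factor inside the formula with - Species. The sketch below uses the built-in iris data as a stand-in for fish_data, so the column names belong to iris, not to the fish data.

```r
# iris ships with R: four numeric measurements plus the factor Species
m_all     <- lm(Sepal.Length ~ ., data = iris)             # Species enters as dummy variables
m_numeric <- lm(Sepal.Length ~ . - Species, data = iris)   # Species removed in the formula

# Same coefficients as subsetting the numeric columns first
cols     <- sapply(iris, is.numeric)
m_subset <- lm(Sepal.Length ~ ., data = iris[, cols])

names(coef(m_numeric))
```

For the fish data the analogous call would be lm(WeightGRAM ~ . - Species, data = fish_data).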

Selecting columns where only one item is true

I am using the following code to determine if any of the columns in my data table have 1065. If any of the columns do have 1065, I get "TRUE" which works perfectly. Now I want to only output true if any of the columns notcancer0:notcancer33 contains 1065 AND all the rest are NA. Other columns may contain other values like 1064, 1066, etc. But I want to output "TRUE" for the rows where there is only 1065 and all the rest of the columns contain NAs for that row. What is the best way to do this?
biobank_nsaid[, ischemia1 := Reduce(`|`, lapply(.SD, `==`, "1065")), .SDcols=notcancer0:notcancer33]
Sample data:
biobank_nsaid = structure(list(aspirin = structure(c(2L, 1L, 1L, 1L), .Label =
c("FALSE", "TRUE"), class = "factor"), aspirinonly = c(TRUE, FALSE, FALSE,
FALSE), med0 = c(1140922174L, 1140871050L, 1140879616L, 1140909674L ), med1 =
c(1140868226L, 1140876592L, 1140869180L, NA), med2 = c(1140879464L, NA,
1140865016L, NA), med3 = c(1140879428L, NA, NA, NA)), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
Here are 2 options:
setDT(biobank_nsaid)[, ischemia1 :=
rowSums(is.na(.SD))==ncol(.SD)-1L & rowSums(.SD==1140909674, na.rm=TRUE)==1L,
.SDcols=med0:med3]
Or after some boolean manipulations:
biobank_nsaid[, ic2 :=
!(rowSums(is.na(.SD))!=ncol(.SD)-1L | rowSums(.SD==1140909674, na.rm=TRUE)!=1L),
.SDcols=med0:med3]
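The same row-wise logic can be checked without data.table. Here is a minimal base R sketch using only the med0 to med3 columns from the sample data; the names meds, target and only_target are made up here.

```r
# med0..med3 copied from the sample data above
meds <- data.frame(
  med0 = c(1140922174L, 1140871050L, 1140879616L, 1140909674L),
  med1 = c(1140868226L, 1140876592L, 1140869180L, NA),
  med2 = c(1140879464L, NA, 1140865016L, NA),
  med3 = c(1140879428L, NA, NA, NA)
)
target <- 1140909674L

# TRUE only when all but one column is NA and the remaining value is the target
only_target <- rowSums(is.na(meds)) == ncol(meds) - 1L &
  rowSums(meds == target, na.rm = TRUE) == 1L

only_target   # only row 4 is TRUE
```

The first condition counts the NAs per row, the second counts matches of the target; both must hold for a row to be flagged.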

Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)

I'm having some issues cleaning up a dataset after I manually extracted the data online; I'm guessing these are encoding issues. I'm trying to remove the "U+00A0" in the "Athlete" column cells along with the angle brackets around it. I looked up the corresponding code point and it is "No-Break Space". I'm also not sure how to replace the other escaped characters to make the names legible, e.g. getting U+008A to display as Š.
Subset of data
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
This is what I've tried so far to no success:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)
The following would remove the no-break space.
head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete)
I'm not sure how to convert the other characters. One problem could be that the codes are not in a format that R recognizes as UTF-8.
One example:
iconv('\u008A', from="UTF-8", to="LATIN1")
This seems to have an effect, unlike trying to convert the literal text U+008A, although the output is:
[1] "\x8a"
which is not the character you want. Hope this helps somehow.
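Building on the "\x8a" observation: byte 0x8A is Š in Windows-1252 (CP1252), so the <U+00XX> tokens may be CP1252 bytes that were mislabelled as code points. The sketch below rests on that assumption. decode_cp1252_escapes is a hypothetical helper, and the encoding name "CP1252" has to be supported by the platform's iconv.

```r
# Hypothetical helper: treat each literal "<U+00XX>" token as a CP1252 byte
# and replace it with the matching UTF-8 character.
decode_cp1252_escapes <- function(x) {
  m <- gregexpr("<U\\+00[0-9A-Fa-f]{2}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    vapply(tokens, function(tok) {
      byte <- as.raw(strtoi(substr(tok, 6, 7), 16L))   # the two hex digits XX
      iconv(rawToChar(byte), from = "CP1252", to = "UTF-8")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

athlete <- "<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)"
decoded <- decode_cp1252_escapes(athlete)

# Turn the no-break spaces (U+00A0) into ordinary spaces and trim the ends
clean <- trimws(gsub("\u00a0", " ", decoded, fixed = TRUE))
clean
```

Tokens such as <e9> in the data use a different notation and are not handled by this pattern; they would need a second, analogous replacement.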

Create subsets of time series from one data frame in order to run SeasonalMannKendall trend tests by station

I have the following data with about 13 different stations:
> dput(RawData[1:10,])
structure(list(Station = c(469, 469, 469, 469, 469, 469, 469,
469, 469, 469), Classification = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c(" Approved ", " Conditionally Approved ",
" Prohibited "), class = "factor"), SampleDate = structure(c(8504,
8504, 8505, 8505, 8532, 8532, 8533, 8533, 8561, 8561), class = "Date"),
Year = c(1993, 1993, 1993, 1993, 1993, 1993, 1993, 1993,
1993, 1993), SWTemp = c(10, 11, 10, 10, 14, 15, 12, 14, 15,
16), Salinity = c(26, 28, 28, 30, NA, NA, 30, 30, 28, 18),
FecalColiform = c(1.8, 2, 2, 1.8, 2, 2, 1.8, 1.8, 4.5, 2)), .Names = c("Station",
"Classification", "SampleDate", "Year", "SWTemp", "Salinity",
"FecalColiform"), row.names = c(NA, 10L), class = "data.frame")
I would like to run a SeasonalMannKendall test on the fecal coliform data for each station separately. I know the input has to be a time series. How do I make each station into its own time series so that I can run these tests?
I have tried to reshape the data to list station results by sample date, but this creates NAs for certain dates, and I can't run the test that way either.
What would my best approach be?
Thank you in advance!
You could do this in a few ways, but this way will give you a list of dataframes, where each dataframe has the data for only one Station. I also added in as.ts to convert the data into a time series using the TSA library. You might need to play with the dates to get them how you want them.
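library("TSA")
stations = unique(RawData$Station)
mylist = list()
for(i in seq_along(stations)){
  # keep only the rows for one station, then convert that subset to a time series
  mylist[[i]] = as.ts(RawData[RawData$Station == stations[i],])
}
names(mylist) = stations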
Now you'll be able to pick which site you want by doing something like mylist[["469"]].
If you just want separate dataframes that aren't in a list, you'd do something like:
Station1 = RawData[RawData$Station == 1,]
Station2 = RawData[RawData$Station == 2,]
...etc.
PS, in the future it would be helpful to use dput(yourdata) so that someone can easily copy-paste your problem and work on it. Thanks for giving a clear example though!
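Since a seasonal test needs a regularly spaced series, one possible approach is to collapse each station to one value per month and leave months without a sample as NA. The sketch below uses base R only; the toy data, the helper monthly_ts and the choice of the monthly median are assumptions, not part of the answer above.

```r
# Toy stand-in for RawData (made-up rows, two stations, irregular dates)
RawData <- data.frame(
  Station = c(469, 469, 469, 469, 470, 470, 470),
  SampleDate = as.Date(c("1993-04-14", "1993-04-15", "1993-05-12", "1993-07-08",
                         "1993-04-14", "1993-06-10", "1993-06-11")),
  FecalColiform = c(1.8, 2, 2, 4.5, 2, 1.8, 2)
)

# Hypothetical helper: monthly median for one station as a ts with frequency 12
monthly_ts <- function(d) {
  ym  <- format(d$SampleDate, "%Y-%m")
  med <- tapply(d$FecalColiform, ym, median)
  # every calendar month between the first and the last sample
  months <- format(seq(as.Date(paste0(min(ym), "-01")),
                       as.Date(paste0(max(ym), "-01")), by = "month"), "%Y-%m")
  values <- as.vector(med)[match(months, names(med))]   # NA where no sample exists
  start  <- as.integer(strsplit(months[1], "-")[[1]])   # c(year, month)
  ts(values, start = start, frequency = 12)
}

# One time series per station, named by station number
series <- lapply(split(RawData, RawData$Station), monthly_ts)
series[["469"]]
```

Each element could then be passed to SeasonalMannKendall from the Kendall package, for example with lapply(series, Kendall::SeasonalMannKendall), assuming that package is installed; how it treats the NA months should be checked in its documentation.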
