Having issues using order function in R

My data.frame is stateData and when I execute stateData[order(stateData$"heart failure"),], with heart failure being a column name, I'm getting my dataframe back with the heart failure column having increasing values like this:
10.0, 10.1, 10.3, 10.7, 15.0, 15.1, 15.9, 8.1, 8.3, 8.9, 9.0, 9.1
Here are the details from dput(head(stateData)):
heart failure = structure(c(97L, 44L, 25L, 6L, 52L, 57L ), .Label = c("10.0", "7.2", "7.3", "7.4", "7.5", "7.6", "7.7", "7.8", "7.9", "8.0", "8.1", "8.2", "8.3", "8.4", "8.5", "8.6", "8.7", "8.8", "8.9", "9.0", "9.1", "9.2", "9.3", "9.4", "9.5", "9.6", "9.7", "9.8", "9.9", "Not Available"), class = "factor"),
Why is it not sorting it all the way?
Any help is appreciated! Thank you!
Edit: Here is my solution! I got it, thanks for all of the advice!
stateData[,"heart failure"] <- as.numeric(levels(stateData[,"heart failure"]))[stateData[,"heart failure"]]
sortedData <- stateData[order(stateData[,"heart failure"]),]

The column stateData$"heart failure" is a factor, so when R sorts it, it puts it in alphabetical order. If you want the data sorted numerically, try this:
stateData$"heart failure" <- as.numeric(levels(stateData$"heart failure"))[stateData$"heart failure"]
stateData[order(stateData$"heart failure"),]
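To see why the conversion goes through levels(), here is a minimal sketch on a small made-up factor (the vector hf is hypothetical; its values are taken from the question's output, plus the "Not Available" level visible in the dput). It assumes nothing beyond base R. Note that "Not Available" becomes NA after conversion, and order() places NA values last by default.

```r
# Hypothetical stand-in for the "heart failure" column
hf <- factor(c("10.1", "8.3", "Not Available", "9.0", "15.0"))

# Ordering the factor follows its levels, which are sorted as text
as.character(hf)[order(hf)]

# Convert the level labels to numbers, then index by the factor codes.
# "Not Available" cannot be parsed, so it becomes NA (with a coercion warning,
# suppressed here on purpose).
hf_num <- suppressWarnings(as.numeric(levels(hf))[hf])

# Numeric ordering; NA is placed last by default
hf_num[order(hf_num)]
```

Calling as.numeric(hf) directly would return the internal integer codes rather than the printed values, which is why the level labels are converted first.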

Related

I want to create a ggplot scatter plot with 4 variables (3 continuous variables and 1 factorial variable)

I have a ggplot with three variables and would like to add a fourth variable, which is a factor, as the point shape.
This is my code:
ggplot(data, aes(x = Dim.1, y=Dim.2, color= soilph, shape= imputedsoil))
Dim.1, Dim.2 and soilph are continuous.
imputedsoil is a two-level factor.
This is the error code I receive:
Error: Aesthetics must be either length 1 or the same as the data (135): colour
Any help you could give would be appreciated, I can't find any advice online for a four-variable 'ggplot' where one of the 'extra' variables is a factor and one is continuous.
Here is a small sample of my data created using the dput function:
structure(list(Dim.1 = c(-0.826779300424281, -0.826779300424281,
-0.826779300424281, -0.826779300424281, -0.826779300424281, -0.826779300424281,
-0.826779300424281, -0.826779300424281, 3.13077029175822, 3.07814447486786,
3.89869453739013, 2.63240937463108, 2.78203007023669, 3.57985580551222,
-1.74653115781836, -1.74653115781836), Dim.2 = c(-2.21180913915791,
-2.21180913915791, -2.21180913915791, -2.21180913915791, -2.21180913915791,
-2.21180913915791, -2.21180913915791, -2.21180913915791, 1.41212418127028,
1.30760054372476, 0.865339639945497, 1.07343354599186, 1.20882751794457,
0.739609579390227, -3.87761036488784, -3.87761036488784), imputedsoil = c("Imputed",
"Imputed", "Imputed", "Imputed", "Imputed", "Imputed", "Imputed",
"Imputed", "Not Missing", "Not Missing", "Not Missing", "Not Missing",
"Not Missing", "Not Missing", "Not Missing", "Not Missing"),
soilph = c(6.92, 6.92, 6.92, 6.92, 6.92, 6.92, 6.92, 6.92,
7.16, 7.43, 7.13, 7.16, 7.43, 7.13, 4.81, 4.81)), row.names = c("46",
"47", "48", "49", "50", "51", "52", "53", "54", "55", "56", "57",
"58", "59", "60", "61"), class = "data.frame")
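As a hedged sketch only: this kind of error commonly appears when a mapped variable is not a column of the data frame passed to ggplot(), so that R picks up another object of a different length. That cause is an assumption here, since the full 135-row data set is not shown. The example below uses six rows from the sample above (values rounded, data frame name dat is hypothetical), keeps all four variables as columns of one data frame, and adds the geom_point() layer that the posted call lacks.

```r
library(ggplot2)

# Six rows taken from the dput sample above, rounded for brevity
dat <- data.frame(
  Dim.1       = c(-0.827, -0.827, 3.131, 3.078, 3.899, -1.747),
  Dim.2       = c(-2.212, -2.212, 1.412, 1.308, 0.865, -3.878),
  imputedsoil = factor(c("Imputed", "Imputed", "Not Missing",
                         "Not Missing", "Not Missing", "Not Missing")),
  soilph      = c(6.92, 6.92, 7.16, 7.43, 7.13, 4.81)
)

# Continuous variable mapped to colour, two-level factor mapped to shape
p <- ggplot(dat, aes(x = Dim.1, y = Dim.2,
                     colour = soilph, shape = imputedsoil)) +
  geom_point(size = 3)

print(p)
```

If soilph lives in a separate vector in the real session, adding it as a column of the plotted data frame first (and checking that its length equals nrow() of that data frame) would be the thing to verify.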

Cannot get Repeated measures ANOVA in RStudio to work

I am trying to work out how to conduct a repeated measures ANOVA. My data is structured as follows:
means <- structure(list(col = c("c", "v1", "b1", "v2", "b2"),
`1` = c(8.55,9.73, 8.93, 9.52, 9.91),
`2` = c(8.4, 9.97, 9.08, 9.66, 9.97),
`3` = c(8.48, 10.04, 9.13, 9.73, 10.04),
`4` = c(8.42, 9.63,8.9, 9.34, 9.82),
`5` = c(8.42, 9.59, 8.87, 9.39, 9.69),
`6` = c(8.52, 9.74, 9.02, 9.58, 9.84),
`7` = c(8.37, 9.67,8.98, 9.47, 9.74),
`8` = c(8.42, 9.67, 9.02, 9.52, 9.77),
`9` = c(8.56, 9.79, 9.36, 9.6, 9.78),
`10` = c(8.44, 9.63,9.15, 9.52, 9.67),
`11` = c(8.3, 9.58, 9.05, 9.49, 9.63),
`12` = c(8.03, 9.33, 8.82, 9.23, 9.38),
`13` = c(7.95, 9.08, 8.7, 9.04, 9.19),
`14` = c(8, 8.34, 8.37, 8.43, 8.54),
`15` = c(8.04,8.26, 8.4, 8.45, 8.61),
`16` = c(8.08, 8.09, 8.18, 8.16,8.28),
`17` = c(7.99, 8.06, 8.09, 8.15, 8.26),
`18` = c(8.06, 8.06, 8.09, 8.1, 8.22),
`19` = c(7.96, 7.96, 7.99, 8.03, 8.1),
`20` = c(7.96, 7.98, 7.99, 7.99, 8.11),
`21` = c(8.16, 8.22, 8.22, 8.26, 8.33),
`22` = c(8.08, 8.16, 8.13, 8.2, 8.2),
`23` = c(7.94, 7.97, 7.94, 7.98, 8.07),
`24` = c(8.02,8.03, 8, 8.08, 8.1),
`25` = c(8.03, 8.08, 8.09, 8.12, 8.15),
`26` = c(7.92, 7.95, 7.95, 7.96, 7.98)),
.Names = c("col","1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23","24", "25", "26"),
class = c("data.table", "data.frame"))
where "col" represents the different substrates (treatments) and the numbers in the header are measurements over time. This is only part of the data.
To conduct the repeated measures ANOVA (which is hopefully the right statistical test), I tried to follow several examples I found on the net, e.g. http://rtutorialseries.blogspot.de/2011/02/r-tutorial-series-one-way-repeated.html
# step 1 Define the levels:
levels <- c(1:26)
# define factor
factor <- as.factor(levels)
#define the frame
frame <- data.frame(factor)
# bind the columns
bind <- cbind (means$`1`,means$`2`,means$`3`,means$`4`,means$`5`,means$`6`,means$`7`,means$`8`,means$`9`,means$`10`,means$`11`,means$`12`,means$`13`,means$`14`,means$`15`,means$`16`,means$`17`,means$`18`,means$`19`,means$`20`,means$`21`,means$`22`,means$`23`,means$`24`,means$`25`,means$`26`)
# define the model
model <- lm(ph_bind ~ 1)
# ANOVA
analysis <- Anova(model, idata=frame, idesign= ~factor)
This results in:
> analysis <- Anova(model, idata = factor, idesign = ~factor)
Warning message:
In Anova.lm(model, idata = factor, idesign = ~factor) :
the model contains only an intercept: Type III test substituted
> summary (analysis)
Sum Sq Df F value Pr(>F)
Min. : 61.59 Min. : 1 Min. :20519 Min. :0
1st Qu.:2495.43 1st Qu.: 33 1st Qu.:20519 1st Qu.:0
Median :4929.27 Median : 65 Median :20519 Median :0
Mean :4929.27 Mean : 65 Mean :20519 Mean :0
3rd Qu.:7363.10 3rd Qu.: 97 3rd Qu.:20519 3rd Qu.:0
Max. :9796.94 Max. :129 Max. :20519 Max. :0
NA's :1 NA's :1
This is not the expected output I was hoping for. What am I doing wrong?
Grateful for any help:)

Trouble trying to clean a character vector in R data frame (UTF-8 encoding issue)

I'm having some issues cleaning up a dataset after I manually extracted the data online; I'm guessing these are encoding issues. I'm trying to remove the "U+00A0" in the "Athlete" column cells, along with the angle brackets around it. I looked up the corresponding code point and it is "No-Break Space". I'm also not sure how to replace the other characters to make the names legible, e.g. getting U+008A to display as Š.
Subset of data
head2007decathlon <- structure(list(Rank = 1:6, Athlete = c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)", "<U+00A0>Maurice Smith<U+00A0>(JAM)", "<U+00A0>Dmitriy Karpov<U+00A0>(KAZ)", "<U+00A0>Aleksey Drozdov<U+00A0>(RUS)", "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)", "<U+00A0>Aleksey Sysoyev<U+00A0>(RUS)"), Total = c(8676L, 8644L, 8586L, 8475L, 8371L, 8357L), `100m` = c(11.04, 10.62, 10.7, 10.97, 11.12, 10.8), LJ = c(7.56, 7.5, 7.19, 7.25, 7.42, 7.01), SP = c(15.92, 17.32, 16.08, 16.49, 14.12, 16.16), HJ = c(2.12, 1.97, 2.06, 2.12, 2.06, 2.03), `400m` = c(48.8, 47.48, 47.44, 50, 49.4, 48.42), `110mh` = c(14.33, 13.91, 14.03, 14.76, 14.51, 14.59), DT = c(48.75, 52.36, 48.95, 48.62, 44.48, 49.76), PV = c(4.8, 4.8, 5, 5, 5.3, 4.9), JT = c(71.18, 53.61, 59.84, 65.51, 63.28, 57.75), `1500m` = c(275.32, 273.52, 279.68, 276.93, 272.5, 276.16), Year = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "2007", class = "factor"), Nationality = c(NA, NA, NA, NA, NA, NA)), .Names = c("Rank", "Athlete", "Total", "100m", "LJ", "SP", "HJ", "400m", "110mh", "DT", "PV", "JT", "1500m", "Year", "Nationality"), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
This is what I've tried so far to no success:
1) head2007decathlon$Athlete <- gsub(pattern="\U00A0",replacement="",x=head2007decathlon$Athlete)
2) head2007decathlon$Athlete <- gsub(pattern="<U00A0>",replacement="",x=head2007decathlon$Athlete)
3) head2007decathlon$Athlete <- iconv(head2007decathlon$Athlete, from="UTF-8", to="LATIN1")
4) Encoding(head2007decathlon$Athlete) <- "UTF-8"
5) head2007decathlon$Athlete<- enc2utf8(head2007decathlon$Athlete)
The following would remove the no break space.
head2007decathlon$Athlete <- gsub(pattern="<U\\+00A0>",replacement="",x=head2007decathlon$Athlete)
I'm not sure how to convert the other characters. One problem could be that the codes are not stored in a form that R recognizes as UTF-8: they may be literal text such as <U+008A> rather than the encoded character.
For example:
iconv('\u008A', from="UTF-8", to="LATIN1")
This does have an effect, unlike trying to convert the literal text <U+008A>, although the output is:
[1] "\x8a"
which is not the character you want. Hope this helps somehow.
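One possible way to handle the remaining markers, sketched under the assumption that they really are literal text in the strings (as the dput output suggests), is to replace each marker with the character it stands for. The mapping below is an assumption built from the question: <U+008A> is treated as Š (byte 0x8A is Š in Windows-1252), <e9> as é, and the no-break space is swapped for an ordinary space so the country code stays separated from the name. The vector x is a hypothetical two-element excerpt of the Athlete column.

```r
# Two entries copied from the Athlete column in the question
x <- c("<U+00A0>Roman <U+008A>ebrle<U+00A0>(CZE)",
       "<U+00A0>Andr<e9> Niklaus<U+00A0>(GER)")

# fixed = TRUE treats the pattern as plain text, so "+" needs no escaping
x <- gsub("<U+00A0>", " ", x, fixed = TRUE)       # no-break space -> space
x <- gsub("<U+008A>", "\u0160", x, fixed = TRUE)  # assumed to mean S with caron
x <- gsub("<e9>", "\u00e9", x, fixed = TRUE)      # assumed to mean e with acute

# Drop the leading space left behind by the first marker
x <- trimws(x)
x
```

For a full column, any further markers would need their own entries; checking unique(unlist(regmatches(x, gregexpr("<[^>]+>", x)))) on the original strings is one way to list which markers occur.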

Forestplot in R superscript in tabletext matrix

I am creating a forestplot in R. Since I cannot display my own data I am using example code and data that I found here
https://cran.r-project.org/web/packages/forestplot/vignettes/forestplot.html
cochrane_from_rmeta <-
structure(list(
mean = c(NA, NA, 0.578, 0.165, 0.246, 0.700, 0.348, 0.139, 1.017, NA, 0.531),
lower = c(NA, NA, 0.372, 0.018, 0.072, 0.333, 0.083, 0.016, 0.365, NA, 0.386),
upper = c(NA, NA, 0.898, 1.517, 0.833, 1.474, 1.455, 1.209, 2.831, NA, 0.731)),
.Names = c("mean", "lower", "upper"),
row.names = c(NA, -11L),
class = "data.frame")
tabletext<-cbind(
c("", "Study", "Auckland", "Block",
"Doran", "Gamsu", "Morrison", "Papageorgiou",
"Tauesch", NA, "Summary"),
c("Deaths", "(steroid)", "36", "1",
"4", "14", "3", "1",
"8", NA, NA),
c("Deaths", "(placebo)", "60", "5",
"11", "20", "7", "7",
"10", NA, NA),
c("", "OR", "0.58", "0.16",
"0.25", "0.70", "0.35", "0.14",
"1.02", NA, "0.53"),
c("",NA,NA,NA,NA,NA,NA,NA,NA,NA,"Heterogeniety I^2 = 20%"))
library(forestplot)
forestplot(tabletext,
cochrane_from_rmeta,new_page = TRUE,
is.summary=c(TRUE,TRUE,rep(FALSE,8),TRUE),
clip=c(0.1,2.5),
xlog=TRUE,
col=fpColors(box="royalblue",line="darkblue", summary="royalblue"))
I would like to superscript the very last label for Heterogeniety. I tried expression(Heterogeniety~I^2), but I get the following error message:
Error in cbind(c("Outcome", "Death or BPD", NA, NA, NA, "Interaction p value = .xx", : cannot create a matrix from type 'expression'
How do I include a superscript in my forestplot?
R does not allow you to create a matrix of expressions. This is also mentioned in the documentation for forestplot:
You can also provide a matrix although this cannot have expressions by design
Luckily, the documentation for forestplot also mentions that labeltext can take a list:
The list should be wrapped in m x n number to resemble a matrix: list(list("rowname 1 col 1", "rowname 2 col 1"), list("r1c2", expression(beta))
Hence, you might want to convert your tabletext into a nested list as follows:
tabletext <- list(
list("", "Study", "Auckland", "Block",
"Doran", "Gamsu", "Morrison", "Papageorgiou",
"Tauesch", NA, "Summary"),
list("Deaths", "(steroid)", "36", "1",
"4", "14", "3", "1",
"8", NA, NA),
list("Deaths", "(placebo)", "60", "5",
"11", "20", "7", "7",
"10", NA, NA),
list("", "OR", "0.58", "0.16",
"0.25", "0.70", "0.35", "0.14",
"1.02", NA, "0.53"),
list("",NA,NA,NA,NA,NA,NA,NA,NA,NA, expression(Heterogeniety~I^2==20~'%')))
forestplot(tabletext,
cochrane_from_rmeta,new_page = TRUE,
is.summary=c(TRUE,TRUE,rep(FALSE,8),TRUE),
clip=c(0.1,2.5),
xlog=TRUE,
col=fpColors(box="royalblue",line="darkblue", summary="royalblue"))
HTH.

Error: "initial value in 'vmmin' is not finite" not in mle2() but in confint()

I know the web is plastered with questions (and answers) about the 'initial value in vmmin is not finite' error when trying to fit parameters for an mle2 object. I do not get this error when creating my mle2 object, but I DO get it when trying to find the 95% CI for a parameter from an mle2 object.
Here is a reproducible example:
Here are the data:
d = structure(list(SST_1YR = c(11.6, 11.7, 11.9, 12, 12.1, 12.2,
12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13, 13.1, 13.2, 13.3,
13.4, 13.5, 13.6, 13.7, 13.8, 13.9, 14, 14.2, 14.3, 14.4, 14.5,
14.6, 14.7, 14.8, 14.9, 15, 15.1, 15.2, 15.3, 15.4, 15.5, 15.6,
15.7, 15.8, 15.9, 16, 16.2, 16.3, 16.5, 16.6, 16.7, 16.9, 17,
17.1, 17.2, 17.3, 17.4, 17.5, 17.6, 17.7, 17.8, 17.9), DML = structure(c(84.5,
71, 114.75, 90.9473684210526, 31.7631578947368, 92.5, 80.4, 98.7021276595745,
70.8, 66.8382352941177, 70.2553191489362, 98.1111111111111, 86.5241379310345,
59.7209302325581, 38.7692307692308, 78.2028985507246, 86.3503649635037,
69.1161290322581, 61.9122807017544, 60.1212121212121, 98.5490196078431,
94.3145161290323, 76.5643564356436, 39.4230769230769, 98.42,
95.6129032258064, 65.9673202614379, 39, 64.0576923076923, 42.4166666666667,
59.6989247311828, 62.8039215686275, 74.5263157894737, 50.8888888888889,
64.35, 40.5, 53.7466666666667, 42, 49.5, 23.8888888888889, 39.6170212765957,
74.8947368421053, 42.8518518518519, 40.0344827586207, 53, 39.3333333333333,
24.1333333333333, 30, 39.4880952380952, 94.4883720930233, 69.1428571428571,
33.7179487179487, 26.1538461538462, 37.8965517241379, 38.4117647058824,
44.2727272727273, 68.3157894736842, 37.3, 43.4444444444444), .Dim = 59L, .Dimnames = list(
c("11.6", "11.7", "11.9", "12", "12.1", "12.2", "12.3", "12.4",
"12.5", "12.6", "12.7", "12.8", "12.9", "13", "13.1", "13.2",
"13.3", "13.4", "13.5", "13.6", "13.7", "13.8", "13.9", "14",
"14.2", "14.3", "14.4", "14.5", "14.6", "14.7", "14.8", "14.9",
"15", "15.1", "15.2", "15.3", "15.4", "15.5", "15.6", "15.7",
"15.8", "15.9", "16", "16.2", "16.3", "16.5", "16.6", "16.7",
"16.9", "17", "17.1", "17.2", "17.3", "17.4", "17.5", "17.6",
"17.7", "17.8", "17.9")))), .Names = c("SST_1YR",
"DML"), row.names = c(NA, -59L), class = "data.frame")
Here is the creation of the mle2 object (with no warnings...)
m = mle2(DML~dgamma(scale=(a+b*SST_1YR)/sh, shape=sh), start=list(a=170, b=-7.4, sh=10), data=d)
And here is where I get an NA and my vmmin warning for the lower bound of parameter b:
confint(m)
I have tried changing the starting values but nothing I have tried has helped. I have created other models with the same data but a different distribution and no error. Can anyone help me figure out what is causing this error?
Using package bbmle-1.0.17
There are a few things to try here. First look at the data (always a good idea):
library("ggplot2"); theme_set(theme_bw())
ggplot(d,aes(SST_1YR,DML)) + geom_point()+
geom_smooth(method="glm",method.args=list(family=Gamma(link="identity")))+
geom_smooth(method="lm",colour="red",fill="red")
Note that in this case the Gamma regression looks almost identical to a regular linear regression (i.e. the shape parameter is large). Also, the distribution of the x values is far from the origin -- this may lead to numeric problems.
library("bbmle")
m <- mle2(DML~dgamma(scale=(a+b*SST_1YR)/sh, shape=sh),
start=list(a=170, b=-7.4, sh=10), data=d)
confint(m)
Confirms the problem:
## 2.5 % 97.5 %
## a 132.05952 203.192159
## b NA -4.407289
## sh 6.83566 13.933383
I thought that setting parscale could help, but it appears to make the problem worse rather than better:
m2 <- update(m,control=list(parscale=c(a=170,b=8,sh=10)))
confint(m2)
## 2.5 % 97.5 %
## a NA 203.153230
## b NA -4.407281
## sh 6.835659 13.933383
Does centering the predictor variable help? scale(x,scale=FALSE) centers but doesn't scale x (using SST_1YR-mean(SST_1YR) might be clearer; that way we wouldn't have three scales floating around in the expression).
m3 <- mle2(DML~dgamma(scale=(a+b*scale(SST_1YR,scale=FALSE))/sh, shape=sh),
start=list(a=170, b=-7.4, sh=10), data=d)
confint(m3)
## 2.5 % 97.5 %
## a 56.462610 66.754118
## b -9.421521 -4.407262
## sh 6.835662 13.933384
Looks good, although it would be a little tricky to get the intercept terms back to the original scale (although we could just take them from the previous, uncentered fit).
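For the point estimate at least, the back-transformation is simple algebra: with a centered predictor the mean is a_c + b*(x - mean(x)) = (a_c - b*mean(x)) + b*x, so the uncentered intercept is a_c - b*mean(x) and the slope is unchanged. The sketch below checks this identity with ordinary least squares standing in for the gamma mle2 fit (a deliberate substitution to keep the example in base R), using the first six rows of d with DML rounded. It does not address how to transform the confidence limits for the intercept.

```r
# First six rows of d (DML rounded); lm() is a stand-in for the mle2 fit
x <- c(11.6, 11.7, 11.9, 12, 12.1, 12.2)
y <- c(84.5, 71, 114.75, 90.95, 31.76, 92.5)

fit_raw <- lm(y ~ x)        # uncentered predictor
xc <- x - mean(x)           # centered predictor
fit_cen <- lm(y ~ xc)

# Recover the uncentered intercept from the centered fit
a_back <- coef(fit_cen)[["(Intercept)"]] - coef(fit_cen)[["xc"]] * mean(x)

all.equal(a_back, coef(fit_raw)[["(Intercept)"]])
all.equal(coef(fit_cen)[["xc"]], coef(fit_raw)[["x"]])
```

The same subtraction applied to the mle2 estimates from m3 should reproduce the estimate of a from the uncentered fit, up to optimizer tolerance.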
It turns out you can also fit this model via
glm(DML~SST_1YR,family=Gamma(link="identity"),data=d)
although confint() again fails rather mysteriously (Error in y/mu: non-conformable arrays).
Some other things that I tried that didn't work particularly well (included here only for completeness):
Try to prevent the linear predictor from going negative:
## pmax() keeps the scale at or above 1e-5
mle2(DML~dgamma(scale=pmax((a+b*SST_1YR)/sh,1e-5),
shape=sh),
start=list(a=170, b=-7.4, sh=10), data=d)
Use a penalized form of dgamma to return bad likelihoods rather than NA when x<0:
dgamma_pen <- function(x,...,log=FALSE) {
## ifelse() is vectorized over x, unlike if()
r <- ifelse(x<0, -100, dgamma(x,...,log=TRUE))
if (log) r else exp(r)
}
m4 <- mle2(DML~dgamma_pen(scale=pmax((a+b*SST_1YR)/sh,1e-5),
shape=sh),
start=list(a=170, b=-7.4, sh=10), data=d)
