This is an example of fct_reorder():
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
This code is identical to boxplot(Sepal.Width ~ reorder(Species, -Sepal.Width), data = iris).
What is the advantage of fct_reorder() over reorder()?
The two functions are very similar, but have a few differences:
reorder() works with atomic vectors and defaults to using mean().
fct_reorder() only works with factors (or character vectors) and defaults to using median().
Example:
library(forcats)
x <- 1:10
xf <- factor(1:10)
y <- 10:1
reorder(x, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
fct_reorder(x, y)
#> Error: `f` must be a factor (or character vector).
fct_reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> Levels: 10 9 8 7 6 5 4 3 2 1
Created on 2022-01-07 by the reprex package (v2.0.1)
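As a further illustration (my own addition, not part of the answer above), here is a small sketch of the two default summary functions, using made-up values for which ordering by mean() and by median() disagree (forcats loaded as above):
f <- factor(c("a", "a", "a", "b", "b", "b"))
v <- c(1, 1, 10, 3, 3, 3)          # group "a": mean 4, median 1; group "b": mean 3, median 3
levels(reorder(f, v))              # reorder() orders by mean():       "b" "a"
levels(fct_reorder(f, v))          # fct_reorder() orders by median(): "a" "b"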
How do I replace the NA values in example with the corresponding values in example2? So 7 would take the place of the first NA, 8 would take the place of the second NA, and so on. My data is much larger, so I would not be able to replace the values individually for the multiple NAs. Thanks
example <- data.frame('count' = c(1,3,4,NA,8,NA,9,0,NA,NA,7,5,8,NA))
example2 <- data.frame('count' = c(7,8,4,6,7))
Another possible solution, based on replace:
example$count <- replace(example$count, is.na(example$count), example2$count)
example
#> count
#> 1 1
#> 2 3
#> 3 4
#> 4 7
#> 5 8
#> 6 8
#> 7 9
#> 8 0
#> 9 4
#> 10 6
#> 11 7
#> 12 5
#> 13 8
#> 14 7
You can try with:
example[is.na(example),] <- example2
Which will give you:
count
1 1
2 3
3 4
4 7
5 8
6 8
7 9
8 0
9 4
10 6
11 7
12 5
13 8
14 7
EDIT: Since you probably have more than just one column in your data frames, you should use:
example$count[is.na(example$count)] <- example2$count
Another option using which to check the index of NA values:
ind <- which(is.na(example$count))
example[ind, "count"] <- example2$count
Output:
count
1 1
2 3
3 4
4 7
5 8
6 8
7 9
8 0
9 4
10 6
11 7
12 5
13 8
14 7
The title says it all: Changing the (supposedly arbitrary) labels of your random-effects grouping variable (e.g. the names of your subjects in a repeated-measures experiment) can change the resulting output in lme4. Minimal example:
require(dplyr)
require(lme4)
require(digest)
df = faithful %>% mutate(subject = rep(as.character(1:8), each = 34),
subject2 = rep(as.character(9:16), each = 34))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df))$coefficients[2,1] # = 0.07567655
I think it happens because lme4 converts them to a factor, and different names produce different factor level orderings. E.g. this produces the problem:
df2 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(9:16), each = 34)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df2))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df2))$coefficients[2,1] # = 0.07567655
But this doesn't:
df3 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(1:8), each = 34),
levels = as.character(1:8),
labels = as.character(9:16)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df3))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df3))$coefficients[2,1] # = 0.07564181
This seems like an issue in lme4. Different arbitrary variable labels shouldn't produce different output, right? Am I missing something? Why does lme4 do this?
(I know the difference in output is small, but I got bigger differences in other cases, enough to, e.g., change a p value from .055 to .045. Also, if this is right, I think it could cause slight reproducibility issues -- e.g. if, after finishing their analyses, an experimenter anonymizes their human subjects data (by changing the names) and then posts it in a public repository.)
Your first sequence of labels, 1:8, gives the same order in numeric and character format, whereas the second one, 9:16, doesn't:
identical(order(1:8), order(as.character(1:8)))
# [1] TRUE
identical(order(9:16), order(as.character(9:16)))
# [1] FALSE
That's because numbers stored as characters are sorted lexicographically, starting with the first digit:
sort(9:16)
# [1] 9 10 11 12 13 14 15 16
sort(as.character(9:16))
# [1] "10" "11" "12" "13" "14" "15" "16" "9"
So if you use two different character sequences that each contain only one-digit numbers, there is seemingly no issue:
library(lme4)
fo1 <- eruptions ~ waiting + (waiting | sub)
fo2 <- eruptions ~ waiting + (waiting | sub2)
df1 <- transform(faithful, sub=rep(as.character(1:8), each=34),
sub2=rep(as.character(2:9), each=34))
summary(lmer(fo1, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
summary(lmer(fo2, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
However, the order of your grouping variable does indeed matter in lmer(). This can be shown by giving sub and sub2 the same levels but assigning them to the blocks of observations in a different order:
set.seed(840947)
df2 <- transform(faithful, sub=rep(sample(1:8), each=34), sub2=rep(sample(1:8), each=34))
summary(fit2a <- lmer(fo1, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564179
summary(fit2b <- lmer(fo2, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07567537
This again yields slightly different coefficients. The levels and level orders may be inspected like so:
fit2a@flist$sub
# [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [33] 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [65] 8 8 8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [97] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [129] 3 3 3 3 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [161] 6 6 6 6 6 6 6 6 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [193] 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [257] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# Levels: 1 2 3 4 5 6 7 8
fit2b@flist$sub2
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [33] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [65] 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [97] 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [129] 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [161] 7 7 7 7 7 7 7 7 7 7 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [193] 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [257] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# Levels: 1 2 3 4 5 6 7 8
There is already an issue filed on GitHub that you could join. Perhaps try to find a similar case beforehand that shows the ordering problem without a singular fit.
When I fit this model, I get a singular-fit warning. This is not a great sign: the variance accounted for by just a random intercept is practically 0, and you also have a random slope. A random effect here might not be doing anything meaningful in the model.
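A minimal sketch (my own addition, assuming the df with the subject column from the question is available) of how this singularity can be checked directly:
library(lme4)
fit <- lmer(eruptions ~ waiting + (waiting | subject), data = df)  # same model as in the question
isSingular(fit)   # expected to be TRUE here, matching the singular-fit warning
VarCorr(fit)      # inspect the estimated variance components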
Secondly, I question whether this is the right model for this situation; what follows is unsolicited advice, and I apologize if you think it is inappropriate. I would have made this a comment, but was unsure how to add images.
First, I did some exploratory plots and found that both your dependent variable and your fixed effect have a bimodal distribution. If we plot them against each other in a scatterplot, we can see that the trend might not be linear.
When we then look at the model residuals, we see heteroscedasticity, which is sub-optimal. I'm not a statistician, but I've had some consultants tell me that this is one of the worst assumptions to violate in a linear model.
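The plots from the original answer are not reproduced here; a residual plot of this kind could be sketched as follows (my own code, reusing the fit object from the sketch above):
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")  # look for non-constant spread
abline(h = 0, lty = 2)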
I think you may be seeing instability in the estimates due to the singular fit, but hopefully, someone else can come along that knows more that can clear this up.
I am new to R. In Java I would introduce a control variable to create a sequence such as
1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
I was thinking on doing something like
seq(from=c(1:5),to=c(5,10),by=1)
However that does not work...
Can that be solved purely with seq and rep?
How about this?
rep(0:4, each=5)+seq(from=1, to=5, by=1)
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
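To see why this works (my note, not part of the original answer), look at the two pieces separately; the shorter sequence 1:5 is recycled across the block-wise offsets:
rep(0:4, each=5)
[1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
seq(from=1, to=5, by=1)
[1] 1 2 3 4 5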
Try this. You can create a function that builds the sequence and apply it to an initial vector v1. Here is the code:
#Data
v1 <- 1:5
#Code
v2 <- c(sapply(v1, function(x) seq(from=x,by=1,length.out = 5)))
Output:
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
And a way using rep() and a sequence can be:
#Code2
rep(1:5, each = 5) + 0:4
Output:
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Using outer is pretty concise:
c(outer(1:5, 0:4, `+`))
#> [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Note, 0:4 is short for seq(from = 0, to = 4, by = 1)
A perfect use case for Map or mapply. I always prefer Map because it does not simplify the output by default.
Map(seq, from = 1:5, to = 5:9)
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 2 3 4 5 6
[[3]]
[1] 3 4 5 6 7
[[4]]
[1] 4 5 6 7 8
[[5]]
[1] 5 6 7 8 9
You can use unlist() to get it the way you want.
unlist(Map(seq, from = 1:5, to = 5:9))
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Note that `by = 1` is the default.
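For comparison (my addition, not part of the original answer), mapply() simplifies by default, so the same call returns a 5 x 5 matrix whose columns are the individual sequences; flattening it column by column gives the same vector:
mapply(seq, from = 1:5, to = 5:9)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    3    4    5    6
[3,]    3    4    5    6    7
[4,]    4    5    6    7    8
[5,]    5    6    7    8    9
c(mapply(seq, from = 1:5, to = 5:9))
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9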
I have a list of numbers and, for each number in a data.frame, I would like to find the next highest value from that list. I have:
list <- c(3,6,9,12)
X <- c(1:10)
df <- data.frame(X)
And I would like to add a variable Y to df containing the next highest number from the list, i.e.:
X Y
1 3
2 3
3 3
4 6
5 6
6 6
7 9
8 9
9 9
10 12
I've tried:
df$Y <- which.min(abs(list-df$X))
but that gives an error message, and it would only find the closest value in the list, not the next one above it.
Another approach is to use findInterval:
df$Y <- list[findInterval(X, list, left.open=TRUE) + 1]
> df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
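To see what findInterval() is doing here (my addition, not part of the original answer), these are the raw interval indices before adding 1 and indexing into list; left.open=TRUE makes each interval (lo, hi], so a value equal to a list element (3, 6, or 9) maps to that element itself rather than to the next one:
findInterval(X, list, left.open=TRUE)
[1] 0 0 0 1 1 1 2 2 2 3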
You could do this...
df$Y <- sapply(df$X, function(x) min(list[list>=x]))
df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
For each row in the matrix "result" shown below
A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6
I would like to plot a histogram with 3 bins, as shown below for the first row:
samp<-result[1,]
hist(samp, breaks = 3, col="lightblue", border="pink")
Now what is needed is to convert the histogram frequency counts into a vector, as follows:
If there are, say, 4 bins, and the first bin has a count of 5, the second bin a count of 2, and the fourth bin a count of 3, I want the bin counts for every row of result as a vector in my output.
row1 5 2 0 3
For hundreds of rows I would like to do this in an automated way, hence this question.
In the end the matrix should look like
      bin 2-4  bin 4-6  bin 6-8  bin 8-10
row 1 5 2 0 3
row 2
row 3
row 4
row 5
row 6
row 7
row 8
row 9
DF <- read.table(text="A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6", header=TRUE)
m <- as.matrix(DF)
apply(m, 1, function(x) hist(x, breaks = 3)$counts)
# $`1`
# [1] 5 2 0 3
#
# $`2`
# [1] 5 0 2 3
#
# $`3`
# [1] 6 3 1
#
# $`4`
# [1] 1 6 2 1
#
# $`5`
# [1] 3 3 4
#
# $`6`
# [1] 3 4 2 1
#
# $`7`
# [1] 2 5 3
#
# $`8`
# [1] 6 3 1
#
# $`9`
# [1] 4 4 0 2
Note that according to the documentation the number of breaks is only a suggestion. If you want to have the same number of breaks in all rows, you should do the binning outside of hist:
breaks <- 1:5*2
t(apply(m,1,function(x) table(cut(x,breaks,include.lowest = TRUE))))
# [2,4] (4,6] (6,8] (8,10]
# 1 5 2 0 3
# 2 1 4 5 0
# 3 4 2 3 1
# 4 1 6 2 1
# 5 3 3 4 0
# 6 0 3 6 1
# 7 2 5 3 0
# 8 2 4 3 1
# 9 0 4 6 0
You could access the counts vector which is returned by hist (see ?hist for details):
counts <- hist(samp, breaks = 3, col="lightblue", border="pink")$counts
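One possible way (my sketch, assuming the matrix m built in the previous answer) to combine this with fixed breaks so that every row gets the same four bins:
breaks <- seq(2, 10, by = 2)                                                # 2 4 6 8 10
t(apply(m, 1, function(x) hist(x, breaks = breaks, plot = FALSE)$counts))   # plot = FALSE suppresses drawing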