Difference between fct_reorder and reorder in R

This is an example of fct_reorder:
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
This code is identical to:
boxplot(Sepal.Width ~ reorder(Species, -Sepal.Width), data = iris)
What advantage does fct_reorder() have over reorder()?

The two functions are very similar, but have a few differences:
reorder() works with atomic vectors and defaults to using mean().
fct_reorder() only works with factors (or character vectors) and defaults to using median().
Example:
library(forcats)
x <- 1:10
xf <- factor(1:10)
y <- 10:1
reorder(x, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> attr(,"scores")
#> 1 2 3 4 5 6 7 8 9 10
#> 10 9 8 7 6 5 4 3 2 1
#> Levels: 10 9 8 7 6 5 4 3 2 1
fct_reorder(x, y)
#> Error: `f` must be a factor (or character vector).
fct_reorder(xf, y)
#> [1] 1 2 3 4 5 6 7 8 9 10
#> Levels: 10 9 8 7 6 5 4 3 2 1
Created on 2022-01-07 by the reprex package (v2.0.1)
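The different defaults are easiest to see with a small made-up example in which the mean and the median disagree (the variable names here are my own, not from the question):

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b"))
v <- c(1, 1, 10, 3, 3)      # group a: mean 4, median 1; group b: mean 3, median 3

levels(reorder(f, v))       # orders by mean:   "b" "a"
levels(fct_reorder(f, v))   # orders by median: "a" "b"
```

So beyond the type checking, the two functions can produce different level orders on the same data unless you pass the summary function explicitly.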

Related

How to replace NA values in one column of a data frame, with values from a column in a different data frame?

How do I replace the NA values in example with the corresponding values in example2? So 7 would take the place of the first NA, 8 the second, and so on. My data is much larger, so I cannot rename the values individually for the multiple NAs. Thanks
example <- data.frame('count' = c(1,3,4,NA,8,NA,9,0,NA,NA,7,5,8,NA))
example2 <- data.frame('count' = c(7,8,4,6,7))
Another possible solution, based on replace:
example$count <- replace(example$count, is.na(example$count), example2$count)
example
#> count
#> 1 1
#> 2 3
#> 3 4
#> 4 7
#> 5 8
#> 6 8
#> 7 9
#> 8 0
#> 9 4
#> 10 6
#> 11 7
#> 12 5
#> 13 8
#> 14 7
You can try:
example[is.na(example),] <- example2
which will give you:
count
1 1
2 3
3 4
4 7
5 8
6 8
7 9
8 0
9 4
10 6
11 7
12 5
13 8
14 7
EDIT: Since you probably have more than just one column in your dataframes, you should use:
example$count[is.na(example$count)] <- example2$count
Another option using which to check the index of NA values:
ind <- which(is.na(example$count))
example[ind, "count"] <- example2$count
Output:
count
1 1
2 3
3 4
4 7
5 8
6 8
7 9
8 0
9 4
10 6
11 7
12 5
13 8
14 7
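As a sketch of how this pattern generalizes to many columns (the helper name fill_na_cols is hypothetical, not from any answer above): fill each column's NAs, in order, from the corresponding column of a second data frame.

```r
# Hypothetical helper: for every column, replace its NAs (in order)
# with the matching column of `repl`. Assumes each column of `repl`
# holds exactly as many values as that column has NAs.
fill_na_cols <- function(df, repl) {
  df[] <- Map(function(col, r) { col[is.na(col)] <- r; col }, df, repl)
  df
}

example  <- data.frame(count = c(1, 3, 4, NA, 8, NA, 9, 0, NA, NA, 7, 5, 8, NA))
example2 <- data.frame(count = c(7, 8, 4, 6, 7))
fill_na_cols(example, example2)$count
# 1 3 4 7 8 8 9 0 4 6 7 5 8 7
```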

Changing the labels of your random-effects grouping variable changes the results in lme4

The title says it all: Changing the (supposedly arbitrary) labels of your random-effects grouping variable (e.g. the names of your subjects in a repeated-measures experiment) can change the resulting output in lme4. Minimal example:
require(dplyr)
require(lme4)
require(digest)
df = faithful %>% mutate(subject = rep(as.character(1:8), each = 34),
subject2 = rep(as.character(9:16), each = 34))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df))$coefficients[2,1] # = 0.07567655
I think it happens because lme4 converts them to a factor, and different names produce different factor level orderings. E.g. this produces the problem:
df2 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(9:16), each = 34)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df2))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df2))$coefficients[2,1] # = 0.07567655
But this doesn't:
df3 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(1:8), each = 34),
levels = as.character(1:8),
labels = as.character(9:16)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df3))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df3))$coefficients[2,1] # = 0.07564181
This seems like an issue in lme4. Different arbitrary variable labels shouldn't produce different output, right? Am I missing something? Why does lme4 do this?
(I know the difference in output is small, but I got bigger differences in other cases, enough to, e.g., change a p value from .055 to .045. Also, if this is right, I think it could cause slight reproducibility issues -- e.g. if, after finishing their analyses, an experimenter anonymizes their human subjects data (by changing the names) and then posts it in a public repository.)
The first sequence, 1:8, sorts the same way in numeric and in character form, whereas the second, 9:16, does not:
identical(order(1:8), order(as.character(1:8)))
# [1] TRUE
identical(order(9:16), order(as.character(9:16)))
# [1] FALSE
That's because numbers stored as characters are sorted lexicographically, character by character:
sort(9:16)
# [1] 9 10 11 12 13 14 15 16
sort(as.character(9:16))
# [1] "10" "11" "12" "13" "14" "15" "16" "9"
So if you use two different but single-digit character sequences, there is seemingly no issue:
library(lme4)
fo1 <- eruptions ~ waiting + (waiting | sub)
fo2 <- eruptions ~ waiting + (waiting | sub2)
df1 <- transform(faithful, sub=rep(as.character(1:8), each=34),
sub2=rep(as.character(2:9), each=34))
summary(lmer(fo1, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
summary(lmer(fo2, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
However, the order of your grouping variable's levels does indeed matter in lmer(). This can be shown by giving sub and sub2 the same levels but in a different order:
set.seed(840947)
df2 <- transform(faithful, sub=rep(sample(1:8), each=34), sub2=rep(sample(1:8), each=34))
summary(fit2a <- lmer(fo1, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564179
summary(fit2b <- lmer(fo2, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07567537
This once again yields different coefficients. The levels and level orders may be inspected like so:
fit2a@flist$sub
# [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [33] 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [65] 8 8 8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [97] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [129] 3 3 3 3 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [161] 6 6 6 6 6 6 6 6 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [193] 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [257] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# Levels: 1 2 3 4 5 6 7 8
fit2b@flist$sub2
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [33] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [65] 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [97] 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [129] 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [161] 7 7 7 7 7 7 7 7 7 7 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [193] 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [257] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# Levels: 1 2 3 4 5 6 7 8
There is already a ticket filed on GitHub that you could join. Ideally, try to find a similar case beforehand that shows the ordering problem without a singular fit.
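One pragmatic workaround sketch (my suggestion, not from the linked ticket): pin the factor levels to first-appearance order before fitting, so an arbitrary relabeling of subjects cannot change the integer coding the optimizer sees.

```r
# A relabeling combined with first-appearance level order yields the
# same integer coding, so the model would see identical groupings.
x  <- rep(as.character(1:8),  each = 34)
x2 <- rep(as.character(9:16), each = 34)   # same subjects, relabeled

f  <- factor(x,  levels = unique(x))
f2 <- factor(x2, levels = unique(x2))
identical(as.integer(f), as.integer(f2))   # TRUE
```

With the default factor() call, the levels would instead be sorted alphabetically, which is exactly where "9" vs "10" diverge.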
When I fit this model, I get a singular fit warning. This is not a great sign: the variance accounted for by just a random intercept is practically 0, and you also have a random slope. A random effect here might not be doing anything meaningful in the model.
Second, I question whether this is the right model for this situation. What follows is unsolicited advice, and I apologize if you think it is inappropriate; I would have made this a comment but was unsure how to add images.
First, I did some exploratory plots and found that both your dependent variable and your fixed effect have a bimodal distribution. A scatterplot of the two suggests the trend might not be linear.
When we then look at the model residuals, we see heteroscedasticity, which is sub-optimal. I'm not a statistician, but I've had some consultants tell me that this is one of the worst assumptions to violate in a linear model.
I think you may be seeing instability in the estimates due to the singular fit, but hopefully, someone else can come along that knows more that can clear this up.

Use rep() and seq() to create a vector

I am new to R. In Java, I would introduce a control variable to create a sequence such as
1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
I was thinking on doing something like
seq(from=c(1:5),to=c(5,10),by=1)
However that does not work...
Can that be solved purely with seq and rep?
How about this?
rep(0:4, each=5)+seq(from=1, to=5, by=1)
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Try this. You can create a function to create the sequence and apply to an initial vector v1. Here the code:
#Data
v1 <- 1:5
#Code
v2 <- c(sapply(v1, function(x) seq(from=x,by=1,length.out = 5)))
Output:
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
And the way using seq() and rep() can be:
#Code2
rep(1:5, each = 5) + 0:4
Output:
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Using outer is pretty concise:
c(outer(1:5, 0:4, `+`))
#> [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Note, 0:4 is short for seq(from = 0, to = 4, by = 1)
A perfect use case for Map or mapply. I always prefer Map because it does not simplify the output by default.
Map(seq, from = 1:5, to = 5:9)
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 2 3 4 5 6
[[3]]
[1] 3 4 5 6 7
[[4]]
[1] 4 5 6 7 8
[[5]]
[1] 5 6 7 8 9
You can use unlist() to get it the way you want.
unlist(Map(seq, from = 1:5, to = 5:9))
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Note that `by = 1` is the default.
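The rep() plus recycling trick also generalizes to n windows of width w; here is a small parameterized sketch (the helper name sliding_seq is made up):

```r
# n overlapping windows of width w, each shifted by one:
# rep() repeats each start value w times, and the recycled
# offset 0:(w-1) walks through each window.
sliding_seq <- function(n, w) rep(seq_len(n), each = w) + seq_len(w) - 1

sliding_seq(5, 5)
# 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
```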

How to find closest match from list in R

I have a list of numbers and would like to find which is the next highest compared to each number in a data.frame. I have:
list <- c(3,6,9,12)
X <- c(1:10)
df <- data.frame(X)
And I would like to add a variable to df being the next highest number in the list. i.e:
X Y
1 3
2 3
3 3
4 6
5 6
6 6
7 9
8 9
9 9
10 12
I've tried:
df$Y <- which.min(abs(list-df$X))
but that gives an error message and would just get the closest value from the list, not the next above.
Another approach is to use findInterval:
df$Y <- list[findInterval(X, list, left.open=TRUE) + 1]
> df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12
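To see why left.open = TRUE and the + 1 are needed, here is findInterval() on its own (base R; the example values are mine):

```r
vec <- c(3, 6, 9, 12)

# With left.open = TRUE, findInterval() returns for each x the index i
# such that vec[i] < x <= vec[i+1], i.e. the number of breakpoints
# strictly below x. Adding 1 then indexes the next value >= x.
findInterval(c(1, 3, 4, 10), vec, left.open = TRUE)
# 0 0 1 3
vec[findInterval(c(1, 3, 4, 10), vec, left.open = TRUE) + 1]
# 3 3 6 12
```

Without left.open = TRUE, an x exactly equal to a breakpoint (like 3) would be mapped to the *next* breakpoint instead of itself.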
You could do this...
df$Y <- sapply(df$X, function(x) min(list[list>=x]))
df
X Y
1 1 3
2 2 3
3 3 3
4 4 6
5 5 6
6 6 6
7 7 9
8 8 9
9 9 9
10 10 12

Converting multiple histogram frequency count into an array in R

For each row in the matrix "result" shown below
A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6
I would like to plot a histogram for each row with 3 bins as shown below:
samp<-result[1,]
hist(samp, breaks = 3, col="lightblue", border="pink")
Now what is needed is to convert the histogram frequency counts into a vector. For example, say I have 4 bins, where the first bin has count = 5, the second count = 2, and the fourth count = 3. I want the count of values falling in each bin, computed from result for every row, as my output:
row1 5 2 0 3
For hundreds of rows I would like to do this in an automated way, hence this question.
In the end the matrix should look like
bin 2-4 bin 4-6 bin 6-8 bin 8-10
row 1 5 2 0 3
row 2
row 3
row 4
row 5
row 6
row 7
row 8
row 9
DF <- read.table(text="A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6", header=TRUE)
m <- as.matrix(DF)
apply(m, 1, function(x) hist(x, breaks = 3, plot = FALSE)$counts)
# $`1`
# [1] 5 2 0 3
#
# $`2`
# [1] 5 0 2 3
#
# $`3`
# [1] 6 3 1
#
# $`4`
# [1] 1 6 2 1
#
# $`5`
# [1] 3 3 4
#
# $`6`
# [1] 3 4 2 1
#
# $`7`
# [1] 2 5 3
#
# $`8`
# [1] 6 3 1
#
# $`9`
# [1] 4 4 0 2
Note that according to the documentation the number of breaks is only a suggestion. If you want to have the same number of breaks in all rows, you should do the binning outside of hist:
breaks <- 1:5*2
t(apply(m,1,function(x) table(cut(x,breaks,include.lowest = TRUE))))
# [2,4] (4,6] (6,8] (8,10]
# 1 5 2 0 3
# 2 1 4 5 0
# 3 4 2 3 1
# 4 1 6 2 1
# 5 3 3 4 0
# 6 0 3 6 1
# 7 2 5 3 0
# 8 2 4 3 1
# 9 0 4 6 0
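Putting the fixed-breaks approach together into the labeled matrix the question asks for (a sketch using only the first two rows of the data; bin edges 2 through 10 assumed):

```r
# Two rows of the example data, binned with fixed breaks and labeled.
m <- matrix(c(4, 6, 3, 5, 9, 9, 9, 3, 4, 4,
              5, 7, 5, 5, 8, 8, 8, 7, 4, 5), nrow = 2, byrow = TRUE)
breaks <- seq(2, 10, by = 2)

counts <- t(apply(m, 1, function(x)
  table(cut(x, breaks, include.lowest = TRUE))))
rownames(counts) <- paste("row", seq_len(nrow(m)))
counts
#       [2,4] (4,6] (6,8] (8,10]
# row 1     5     2     0      3
# row 2     1     4     5      0
```

Because the breaks are supplied explicitly, every row is guaranteed to produce the same four bins, unlike hist()'s suggested breaks.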
You could access the counts vector which is returned by hist (see ?hist for details):
counts <- hist(samp, breaks = 3, col="lightblue", border="pink")$counts
