Interpolate different size data frames with approx in R

I have 2 datasets of different sizes, but with common data in the first column, like this:
x <- data.frame(cbind(c(1,2,3,4,5,6,7,8,9,10),c(1,4,3,2,5,4,6,7,1,3)))
y <- data.frame(cbind(c(0,2,4,6,8,10),c(6,5,4,7,5,4)))
> x
X1 X2
1 1 1
2 2 4
3 3 3
4 4 2
5 5 5
6 6 4
7 7 6
8 8 7
9 9 1
10 10 3
> y
X1 X2
1 0 6
2 2 5
3 4 4
4 6 7
5 8 5
6 10 4
I've been trying to use the approx function to interpolate X2 in y at the X1 values in x, but I haven't been able to find examples where the two data frames have different numbers of rows.

You could merge y with the common column in x and interpolate the merged X2 along the merged X1, passing x$X1 as xout. Supplying the merged X1 explicitly matters: if approx() is given only the X2 vector, it interpolates against the row indices 1..11 rather than the X1 values 0..10, shifting every result by one position. The NA pairs the merge introduces are dropped by approx() automatically.
m <- merge(x["X1"], y, all = TRUE)  # union of X1 values; X2 is NA where y has no point
data.frame(X1 = x$X1, X2 = approx(m$X1, m$X2, xout = x$X1)$y)
#    X1  X2
# 1   1 5.5
# 2   2 5.0
# 3   3 4.5
# 4   4 4.0
# 5   5 5.5
# 6   6 7.0
# 7   7 6.0
# 8   8 5.0
# 9   9 4.5
# 10 10 4.0
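In fact, since approx() ignores the NA rows, the merge step can be dropped entirely; this minimal equivalent gives the same result:
data.frame(X1 = x$X1, X2 = approx(y$X1, y$X2, xout = x$X1)$y)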

Related

Repeat sequences where the start increases

So what I just did is this:
s1 <- seq(1, 3, by = 0.5)
rep(s1, 3)
# [1] 1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0
s2 <- seq(-4, 4, by = 2)
rep(s2, each = 3)
# [1] -4 -4 -4 -2 -2 -2 0 0 0 2 2 2 4 4 4
Now I should code something that in the end should look like this:
1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
That is, 1 to 5 repeated 5 times, but with the starting number increasing by 1 on each repetition.
How can I do that?
I guess you can try embed like below
> n <- 5
> c(embed(seq(n + 4), n)[, n:1])
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Use mapply:
inds <- 1:5
c(mapply(seq, inds, inds + 4))
#[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Add the column indices to a corresponding matrix:
m <- matrix(0:4, 5, 5)
as.vector(m + col(m))
# [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
We can use rep
n <- 5
seq_len(n) + rep(0:4, each = n)
#[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
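On R >= 4.0.0, sequence() gained a from= argument, which does the whole thing in a single call:
n <- 5
sequence(rep(n, n), from = seq_len(n))
# [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9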

Changing the labels of your random-effects grouping variable changes the results in lme4

The title says it all: Changing the (supposedly arbitrary) labels of your random-effects grouping variable (e.g. the names of your subjects in a repeated-measures experiment) can change the resulting output in lme4. Minimal example:
require(dplyr)
require(lme4)
require(digest)
df = faithful %>% mutate(subject = rep(as.character(1:8), each = 34),
subject2 = rep(as.character(9:16), each = 34))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df))$coefficients[2,1] # = 0.07567655
I think it happens because lme4 converts them to a factor, and different names produce different factor level orderings. E.g. this produces the problem:
df2 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(9:16), each = 34)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df2))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df2))$coefficients[2,1] # = 0.07567655
But this doesn't:
df3 = faithful %>% mutate(subject = factor(rep(as.character(1:8), each = 34)),
subject2 = factor(rep(as.character(1:8), each = 34),
levels = as.character(1:8),
labels = as.character(9:16)))
summary(lmer(eruptions ~ waiting + (waiting | subject), data = df3))$coefficients[2,1] # = 0.07564181
summary(lmer(eruptions ~ waiting + (waiting | subject2), data = df3))$coefficients[2,1] # = 0.07564181
This seems like an issue in lme4. Different arbitrary variable labels shouldn't produce different output, right? Am I missing something? Why does lme4 do this?
(I know the difference in output is small, but I got bigger differences in other cases, enough to, e.g., change a p value from .055 to .045. Also, if this is right, I think it could cause slight reproducibility issues -- e.g. if, after finishing their analyses, an experimenter anonymizes their human subjects data (by changing the names) and then posts it in a public repository.)
Your first sequence, 1:8, has the same order in numeric and in character format, whereas your second, 9:16, doesn't:
identical(order(1:8), order(as.character(1:8)))
# [1] TRUE
identical(order(9:16), order(as.character(9:16)))
# [1] FALSE
That's because numbers stored as characters are sorted lexicographically, character by character:
sort(9:16)
# [1] 9 10 11 12 13 14 15 16
sort(as.character(9:16))
# [1] "10" "11" "12" "13" "14" "15" "16" "9"
So if you use two different single-digit character sequences, there is seemingly no issue:
library(lme4)
fo1 <- eruptions ~ waiting + (waiting | sub)
fo2 <- eruptions ~ waiting + (waiting | sub2)
df1 <- transform(faithful, sub=rep(as.character(1:8), each=34),
sub2=rep(as.character(2:9), each=34))
summary(lmer(fo1, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
summary(lmer(fo2, data=df1))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564181
However, the ordering of your grouping variable does indeed matter in lmer(). This can be shown by giving sub and sub2 the same levels but arranged in a different order:
set.seed(840947)
df2 <- transform(faithful, sub=rep(sample(1:8), each=34), sub2=rep(sample(1:8), each=34))
summary(fit2a <- lmer(fo1, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07564179
summary(fit2b <- lmer(fo2, data=df2))$coe[2, 1]
# boundary (singular) fit: see ?isSingular
# [1] 0.07567537
Once again this yields different coefficients. The levels and their ordering may be inspected like so:
fit2a@flist$sub
# [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [33] 4 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [65] 8 8 8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [97] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [129] 3 3 3 3 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [161] 6 6 6 6 6 6 6 6 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [193] 1 1 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [257] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# Levels: 1 2 3 4 5 6 7 8
fit2b@flist$sub2
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [33] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [65] 2 2 2 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
# [97] 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
# [129] 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
# [161] 7 7 7 7 7 7 7 7 7 7 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
# [193] 3 3 3 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
# [225] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# [257] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
# Levels: 1 2 3 4 5 6 7 8
There is already a ticket filed on GitHub that you could join. Perhaps try to find a similar case beforehand where there is an ordering problem but not a singular fit.
When I fit this model, I get a singular fit warning. This is not a great sign: the variance accounted for by the random intercept alone is practically 0, and you also have a random slope on top of it. A random effect here might not be doing anything meaningful in the model.
I also question whether this is the right model for this situation. What follows is unsolicited advice, and I apologize if you think it is inappropriate. (I would have made this a comment, but was unsure of how to add images.)
First, I did some exploratory plots and found that both your dependent variable and your fixed effect have a bimodal distribution. A scatterplot of the two (plot omitted here) suggests the trend might not be linear.
Second, when we look at the model residuals, we see heteroscedasticity (plot omitted here), which is sub-optimal. I'm not a statistician, but consultants have told me that this is one of the worst assumptions to violate in a linear model.
I think you may be seeing instability in the estimates due to the singular fit, but hopefully someone who knows more can come along and clear this up.
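One way to probe whether it is optimizer instability around the singular fit is isSingular() together with allFit(), which refits the same model with every available optimizer so the estimates can be compared (a sketch, assuming the df from the question):
library(lme4)
fit <- lmer(eruptions ~ waiting + (waiting | subject), data = df)
isSingular(fit)     # TRUE here -- the random-effect covariance is on the boundary
aa <- allFit(fit)   # refit with all available optimizers
summary(aa)$fixef   # compare the fixed-effect estimates across optimizers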

Average all other columns based on one column in matrix [duplicate]

This question already has answers here:
Summarize all group values and a conditional subset in the same call
(4 answers)
Closed 4 years ago.
I need to average a large number of columns based on the name in another column. My matrix looks like this (with separate unique row names):
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4
I need to average each row to look like this:
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 6 3.33 2.33 7 6 3.66
A.ammophius 3 6 2 7 5 5
P.sabaji 3.5 5.75 2.75 8.5 5.5 4.25
Can anyone help? Thank you!
This is pretty easy with dplyr. You can do:
dd %>% group_by(Names) %>% summarize_all(mean)
Tested with the following data:
dd<-read.table(text="Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4", header=TRUE)
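In current dplyr (>= 1.0.0), summarize_all() is superseded by across(); the equivalent call is:
library(dplyr)
dd %>% group_by(Names) %>% summarize(across(everything(), mean))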
You can use aggregate() for that.
Assuming your data is in a data frame named df:
aggregate(. ~ Names, data=df, FUN=mean)
Names X1 Y1 Z1 X2 Y2 Z2
1 A.ammophius 3.0 6.000000 2.000000 7.0 5.0 5.000000
2 P.maccus 6.0 3.333333 2.333333 7.0 6.0 3.666667
3 P.sabaji 3.5 5.750000 2.750000 8.5 5.5 4.250000
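For large tables, a grouped mean with data.table is another common route (a sketch, assuming the dd defined above and that data.table is installed):
library(data.table)
as.data.table(dd)[, lapply(.SD, mean), by = Names]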

Eliminate rows from a data frame in increasing order

x<-c(4,5,6,23,5,6,7,8,0,3)
y<-c(2,4,5,6,23,5,6,7,8,0)
z<-c(1,2,4,5,6,23,5,6,7,8)
df<-data.frame(x,y,z)
df
x y z
1 4 2 1
2 5 4 2
3 6 5 4
4 23 6 5
5 5 23 6
6 6 5 23
7 7 6 5
8 8 7 6
9 0 8 7
10 3 0 8
I would like to eliminate the number 23 from every column of df by removing one row per column at sequentially increasing positions (not by matching the value 23, but by its location, starting from its row in x).
df
x y z
1 4 2 1
2 5 4 2
3 6 5 4
4 5 6 5
5 6 5 6
6 7 6 5
7 8 7 6
8 0 8 7
9 3 0 8
Thank you
You can iterate through the columns and remove the element from each, then reassemble as a data frame:
result <- as.data.frame(lapply(1:ncol(df), function(x) df[-(x+3),x]))
names(result) <- names(df)
result
## x y z
## 1 4 2 1
## 2 5 4 2
## 3 6 5 4
## 4 5 6 5
## 5 6 5 6
## 6 7 6 5
## 7 8 7 6
## 8 0 8 7
## 9 3 0 8
df[-(x+3),x] is the column with the value removed, by location. To start with row N in column x you would use df[-(x+N-1),x].
You could also try:
n <- 4
df1 <- df[-n,]
df1[] <- unlist(df,use.names=FALSE)[-seq(n, prod(dim(df)), by=nrow(df)+1)]
df1
# x y z
#1 4 2 1
#2 5 4 2
#3 6 5 4
#5 5 6 5
#6 6 5 6
#7 7 6 5
#8 8 7 6
#9 0 8 7
#10 3 0 8
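Both approaches generalize to any starting row. A small helper (the name drop_diagonal is made up for illustration) drops row start + i - 1 from the i-th column and rebuilds the data frame:
drop_diagonal <- function(df, start) {
  # remove row (start + i - 1) from column i, then reassemble
  as.data.frame(Map(function(col, i) col[-(start + i - 1)], df, seq_along(df)))
}
drop_diagonal(df, 4)  # same result as above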

Converting multiple histogram frequency counts into an array in R

For each row in the matrix "result" shown below
A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6
I would like to plot a histogram for each row with 3 bins as shown below:
samp<-result[1,]
hist(samp, breaks = 3, col="lightblue", border="pink")
Now what I need is to convert the histogram frequency counts into an array. Say there are 4 bins, the first bin has count 5, the second count 2, the third count 0, and the fourth count 3; then the per-bin counts for that row of result should come out as a vector:
row1 5 2 0 3
For hundreds of rows I would like to do this in an automated way, hence this question.
In the end the matrix should look like
bin 2-4 bin 4-6 bin 6-8 bin 8-10
row 1 5 2 0 3
row 2
row 3
row 4
row 5
row 6
row 7
row 8
row 9
DF <- read.table(text="A B C D E F G H I J
1 4 6 3 5 9 9 9 3 4 4
2 5 7 5 5 8 8 8 7 4 5
3 7 5 4 4 7 9 7 4 4 5
4 6 6 6 6 8 9 8 6 3 6
5 4 5 5 5 8 8 7 4 3 7
6 7 9 7 6 7 8 8 5 7 6
7 5 6 6 5 8 8 7 3 3 5
8 6 7 4 5 8 9 8 4 6 5
9 6 8 8 6 7 7 7 7 6 6", header=TRUE)
m <- as.matrix(DF)
apply(m, 1, function(x) hist(x, breaks = 3)$counts)
# $`1`
# [1] 5 2 0 3
#
# $`2`
# [1] 5 0 2 3
#
# $`3`
# [1] 6 3 1
#
# $`4`
# [1] 1 6 2 1
#
# $`5`
# [1] 3 3 4
#
# $`6`
# [1] 3 4 2 1
#
# $`7`
# [1] 2 5 3
#
# $`8`
# [1] 6 3 1
#
# $`9`
# [1] 4 4 0 2
Note that according to the documentation the number of breaks is only a suggestion. If you want to have the same number of breaks in all rows, you should do the binning outside of hist:
breaks <- 1:5*2
t(apply(m,1,function(x) table(cut(x,breaks,include.lowest = TRUE))))
# [2,4] (4,6] (6,8] (8,10]
# 1 5 2 0 3
# 2 1 4 5 0
# 3 4 2 3 1
# 4 1 6 2 1
# 5 3 3 4 0
# 6 0 3 6 1
# 7 2 5 3 0
# 8 2 4 3 1
# 9 0 4 6 0
You could access the counts vector which is returned by hist (see ?hist for details):
counts <- hist(samp, breaks = 3, col="lightblue", border="pink")$counts
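If you only need the counts and not the plot itself, hist() also accepts plot = FALSE:
counts <- hist(samp, breaks = 3, plot = FALSE)$counts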
