Average all other columns based on one column in matrix [duplicate] - r

I need to average a large number of columns based on the name in another column. My matrix looks like this (with separate unique row names):
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4
I need to average the rows that share the same name, so that the result looks like this:
Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 6 3.33 2.33 7 6 3.67
A.ammophius 3 6 2 7 5 5
P.sabaji 3.5 5.75 2.75 8.5 5.5 4.25
Can anyone help? Thank you!

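This is pretty easy with dplyr. You can do
library(dplyr)
dd %>% group_by(Names) %>% summarize_all(mean)
(In dplyr 1.0.0 and later, summarise(across(everything(), mean)) is the preferred spelling of the same operation.)
Tested with the following data: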
dd<-read.table(text="Names X1 Y1 Z1 X2 Y2 Z2
P.maccus 4 2 2 6 5 3
P.maccus 6 5 3 7 6 5
P.maccus 8 3 2 8 7 3
A.ammophius 3 6 2 7 5 5
P.sabaji 2 5 3 8 4 5
P.sabaji 4 6 3 9 6 5
P.sabaji 5 7 2 8 7 3
P.sabaji 3 5 3 9 5 4", header=TRUE)

You can use aggregate() for that.
Assuming your data is in a data frame named df (the formula interface needs a data frame rather than a matrix):
aggregate(. ~ Names, data=df, FUN=mean)
Names X1 Y1 Z1 X2 Y2 Z2
1 A.ammophius 3.0 6.000000 2.000000 7.0 5.0 5.000000
2 P.maccus 6.0 3.333333 2.333333 7.0 6.0 3.666667
3 P.sabaji 3.5 5.750000 2.750000 8.5 5.5 4.250000
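
Since the question mentions a matrix, here is a minimal base R sketch that stays in matrix form. It assumes the measurements are in a numeric matrix and the names are in a separate character vector; the object names m, species, sums, counts and group_means are illustrative and do not come from the question.

```r
# Numeric measurements as a matrix; the species names are kept in a separate
# vector because a matrix cannot mix character and numeric columns.
m <- matrix(c(4, 2, 2, 6, 5, 3,
              6, 5, 3, 7, 6, 5,
              8, 3, 2, 8, 7, 3,
              3, 6, 2, 7, 5, 5,
              2, 5, 3, 8, 4, 5,
              4, 6, 3, 9, 6, 5,
              5, 7, 2, 8, 7, 3,
              3, 5, 3, 9, 5, 4),
            ncol = 6, byrow = TRUE,
            dimnames = list(NULL, c("X1", "Y1", "Z1", "X2", "Y2", "Z2")))
species <- c(rep("P.maccus", 3), "A.ammophius", rep("P.sabaji", 4))

# rowsum() adds up the rows of each group; dividing by the group sizes
# turns the sums into means.
sums <- rowsum(m, group = species)
counts <- table(species)[rownames(sums)]  # align counts with the row order of sums
group_means <- sums / as.vector(counts)   # recycling runs down each column
print(group_means)
```

The result is a matrix with one row per name, which may be convenient if later steps expect a matrix rather than a data frame.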

Related

how to get row average for certain columns in r data frame?

I have data that looks like this
t=c(3,2,9,8)
u=c(5,6,7,8)
v=c(3,2,1,9)
w=c(5,6,7,8)
x=c(1,2,3,4)
y=c(4,3,2,1)
z=data.frame(t,u,v,w,x,y)
output:
t u v w x y
1 3 5 3 5 1 4
2 2 6 2 6 2 3
3 9 7 1 7 3 2
4 8 8 9 8 4 1
I would like to get the mean of each row for the first three columns, and then get the mean of each row for the last three columns. Ex. mean of row 1, columns t-v and mean of row 1, columns w-y, and so on.
Desired output:
t u v avg w x y avg2
1 3 5 3 3.6 5 1 4 3
2 2 6 2 3.3 6 2 3 3.6
3 9 7 1 5.6 7 3 2 4
4 8 8 9 8.3 8 4 1 4.3
How can I go about doing this?
Use rowMeans(). Using column names:
z$avg <- rowMeans(z[c("t", "u", "v")])
z$avg2 <- rowMeans(z[c("w", "x", "y")])
Result:
t u v w x y avg avg2
1 3 5 3 5 1 4 3.666667 3.333333
2 2 6 2 6 2 3 3.333333 3.666667
3 9 7 1 7 3 2 5.666667 4.000000
4 8 8 9 8 4 1 8.333333 4.333333
Alternative using column indices, with re-arranged output:
z$avg <- rowMeans(z[1:3])
z$avg2 <- rowMeans(z[4:6])
z <- z[c(1:3, 7, 4:6, 8)]
Result:
t u v avg w x y avg2
1 3 5 3 3.666667 5 1 4 3.333333
2 2 6 2 3.333333 6 2 3 3.666667
3 9 7 1 5.666667 7 3 2 4.000000
4 8 8 9 8.333333 8 4 1 4.333333
One more alternative, using the tidyverse, is rowwise() with c_across():
library(dplyr)
z <- z %>%
  rowwise() %>%
  mutate(avg = mean(c_across(t:v)), avg2 = mean(c_across(w:y))) %>%
  ungroup()
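
If there are more than two groups of columns, the rowMeans() approach can be driven by a named list. The following is a minimal base R sketch; the names groups and means are illustrative, not part of the original answers.

```r
z <- data.frame(t = c(3, 2, 9, 8), u = c(5, 6, 7, 8), v = c(3, 2, 1, 9),
                w = c(5, 6, 7, 8), x = c(1, 2, 3, 4), y = c(4, 3, 2, 1))

# Each list element names one output column and the input columns it averages.
groups <- list(avg = c("t", "u", "v"), avg2 = c("w", "x", "y"))

# sapply() returns a matrix with one column of row means per group.
means <- sapply(groups, function(cols) rowMeans(z[cols]))
z <- cbind(z, means)
print(z)
```

Adding a third group is then a one-line change to the list rather than another assignment.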

Replace dataframe missing values with linear trend

I have an imported data frame in which some values of x2 are missing. Here is a simplified example.
I'd like to replace the missing values with a linear trend between the last and the next available values.
Any suggestions on how to do this?
a <- data.frame(x1=1:11, x2=c(6,"","","","",12,"","",4,"",20))
a
x1 x2
1 1 6
2 2
3 3
4 4
5 5
6 6 12
7 7
8 8
9 9 4
10 10
11 11 20
You can try approx(), as below:
transform(
a,
x2 = approx(x1[nzchar(x2)], na.omit(as.numeric(x2)), x1)$y
)
which gives
x1 x2
1 1 6.000000
2 2 7.200000
3 3 8.400000
4 4 9.600000
5 5 10.800000
6 6 12.000000
7 7 9.333333
8 8 6.666667
9 9 4.000000
10 10 12.000000
11 11 20.000000
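
The same idea can be spelled out step by step, which may be easier to adapt. This is a sketch assuming x2 was imported as character with empty strings for the gaps; the helper names x2_num and known are illustrative.

```r
a <- data.frame(x1 = 1:11,
                x2 = c(6, "", "", "", "", 12, "", "", 4, "", 20),
                stringsAsFactors = FALSE)

# Empty strings cannot be parsed as numbers, so they become NA.
x2_num <- as.numeric(a$x2)
known <- !is.na(x2_num)

# Interpolate linearly between the known points at every x1 position.
a$x2 <- approx(a$x1[known], x2_num[known], xout = a$x1)$y
print(a)
```

With the default rule = 1, approx() returns NA for positions outside the range of the known points, so gaps at the very start or end of the column would stay missing.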

How to lag multiple specific columns of a data frame in R

I would like to lag multiple specific columns of a data frame in R.
Let's take this generic example. Let's assume I have defined which columns of my dataframe I need to lag:
Lag <- c(0, 1, 0, 1)
Lag.Index <- is.element(Lag, 1)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
My initial dataframe:
x1 x2 x3 x4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
I would like to compute the following dataframe:
x1 x2 x3 x4
1 1 NA 1 NA
2 2 1 2 1
3 3 2 3 2
4 4 3 4 3
5 5 4 5 4
6 6 5 6 5
7 7 6 7 6
8 8 7 8 7
I know how to do this for a single lagged column, but I can't find an elegant way to do it for multiple lagged columns. Any help is very much appreciated.
You can use purrr's map2_dfc to lag different values by column.
purrr::map2_dfc(df, Lag, dplyr::lag)
# x1 x2 x3 x4
# <int> <int> <int> <int>
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or with data.table:
library(data.table)
setDT(df)[, names(df) := Map(shift, .SD, Lag)]
A data.table option using shift() along with Vectorize() (this returns a matrix rather than modifying df):
setDT(df)[, Vectorize(shift)(.SD, Lag)]
x1 x2 x3 x4
[1,] 1 NA 1 NA
[2,] 2 1 2 1
[3,] 3 2 3 2
[4,] 4 3 4 3
[5,] 5 4 5 4
[6,] 6 5 6 5
[7,] 7 6 7 6
[8,] 8 7 8 7
Not sure whether this is elegant enough, but I would use dplyr's mutate_at() to modify the chosen columns. Load dplyr first so that lag() resolves to dplyr::lag() rather than stats::lag():
library(dplyr)
df %>% mutate_at(.vars = vars(x2, x4), .funs = ~ lag(., default = NA))
We convert Lag to a logical vector, select the corresponding column names, and use across() from dplyr:
library(dplyr)
df %>%
mutate(across(names(.)[as.logical(Lag)], lag))
# x1 x2 x3 x4
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or we can do this in base R
df[as.logical(Lag)] <- rbind(NA, df[-nrow(df), as.logical(Lag)])
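
For lags other than 0 or 1, a small base R helper can apply a different lag to each column. This is only a sketch; the function name lag_cols and its arguments are hypothetical and not taken from any package.

```r
# Shift each column of `data` down by the matching entry of `k`,
# padding the top with NA. `k` needs one non-negative value per column.
lag_cols <- function(data, k) {
  stopifnot(length(k) == ncol(data), all(k >= 0))
  n <- nrow(data)
  data[] <- Map(function(col, lag_by) {
    if (lag_by == 0) return(col)
    c(rep(NA, min(lag_by, n)), head(col, max(n - lag_by, 0)))
  }, data, k)
  data
}

df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
Lag <- c(0, 1, 0, 1)
lagged <- lag_cols(df, Lag)
print(lagged)
```

Assigning with data[] keeps the result a data frame with the original column names.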

Interpolate different size data frames with approx in r

I have 2 datasets of different sizes but with common data in the first columns like this
x <- data.frame(cbind(c(1,2,3,4,5,6,7,8,9,10),c(1,4,3,2,5,4,6,7,1,3)))
y <- data.frame(cbind(c(0,2,4,6,8,10),c(6,5,4,7,5,4)))
> x
X1 X2
1 1 1
2 2 4
3 3 3
4 4 2
5 5 5
6 6 4
7 7 6
8 8 7
9 9 1
10 10 3
> y
X1 X2
1 0 6
2 2 5
3 4 4
4 6 7
5 8 5
6 10 4
I've been trying to use the approx function to do the interpolation on X2 in y, but I haven't been able to find examples where the inputs have different lengths.
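You could merge y with the common column in x and interpolate on the result, using x$X1 as xout. Pass both merged columns to approx(), so that the interpolation runs against X1 rather than against row positions; approx() skips the NA rows that the merge introduces.
m <- merge(x["X1"], y, all = TRUE)
data.frame(X1 = x$X1, X2 = approx(m$X1, m$X2, xout = x$X1)$y)
# X1 X2
# 1 1 5.5
# 2 2 5.0
# 3 3 4.5
# 4 4 4.0
# 5 5 5.5
# 6 6 7.0
# 7 7 6.0
# 8 8 5.0
# 9 9 4.5
# 10 10 4.0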
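
As a minimal sketch of the underlying idea: approx() does not need its inputs and outputs to have the same length, so the known points from y and the target positions from x can be passed directly. The name y_on_x is illustrative.

```r
x <- data.frame(X1 = 1:10, X2 = c(1, 4, 3, 2, 5, 4, 6, 7, 1, 3))
y <- data.frame(X1 = c(0, 2, 4, 6, 8, 10), X2 = c(6, 5, 4, 7, 5, 4))

# Known points come from y (6 rows); xout asks for values at x's 10 positions.
y_on_x <- data.frame(X1 = x$X1,
                     X2 = approx(x = y$X1, y = y$X2, xout = x$X1)$y)

# Sanity check: y contains the point (2, 5), so the value at X1 = 2 is 5.
print(y_on_x)
```

Positions in x that fall outside the range of y$X1 would come back as NA under the default rule = 1; rule = 2 would carry the nearest end value outward instead.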

How to retain only unique values in each row for a data frame

How can I retain only the unique values in each row of a data frame?
The input is as below:
1 1 2 3 4 1 6 7 8
2 2 5 5 7 8 9 0 0
6 6 6 6 5 1 2 3 4
Output would be as below
1 2 3 4 6 7 8
2 5 7 8 9
6 5 1 2 3 4
I tried plyr and unique(), but they return the unique values across the complete data set.
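You can use sapply or lapply to accomplish this. Both iterate over columns, so the output below assumes that each input row is stored as a column (x1, x2, x3) of df:
# supposing your data.frame is called 'df'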
sapply(df, unique)
#$x1
#[1] 1 2 3 4 6 7 8
#
#$x2
#[1] 2 5 7 8 9 0
#
#$x3
#[1] 6 5 1 2 3 4
or
lapply(df, unique)
#$x1
#[1] 1 2 3 4 6 7 8
#
#$x2
#[1] 2 5 7 8 9 0
#
#$x3
#[1] 6 5 1 2 3 4
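# Imagine D is your data.frame object
# Note: rle() collapses only consecutive repeats, so a value that reappears
# later in a row (such as the 1 in position 6 of row 1) is kept
apply(D,1, function(x) rle(x)$values)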
A=apply(dat,1,unique)
data.frame(t(sapply(A,`length<-`,max(lengths(A)))))
X1 X2 X3 X4 X5 X6 X7
1 1 2 3 4 6 7 8
2 2 5 7 8 9 0 NA
3 6 5 1 2 3 4 NA
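
Here is a row-wise sketch in base R that builds the question's input and pads the shorter results with NA. It uses lapply() over row indices rather than apply(), because apply() returns a matrix instead of a list when every row happens to have the same number of unique values. The names uniq, width and padded are illustrative.

```r
dat <- data.frame(rbind(c(1, 1, 2, 3, 4, 1, 6, 7, 8),
                        c(2, 2, 5, 5, 7, 8, 9, 0, 0),
                        c(6, 6, 6, 6, 5, 1, 2, 3, 4)))

# Unique values of each row, always returned as a list.
uniq <- lapply(seq_len(nrow(dat)),
               function(i) unique(unlist(dat[i, ], use.names = FALSE)))

# Pad every row to the length of the longest one so the rows fit in a matrix.
width <- max(lengths(uniq))
padded <- t(vapply(uniq,
                   function(v) c(v, rep(NA, width - length(v))),
                   numeric(width)))
print(padded)
```

The list uniq is the direct answer when ragged rows are acceptable; padded is only needed if a rectangular result is required.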
