R: get mean of n columns by row

I have a simple data.frame
> df <- data.frame(a=c(3,5,7), b=c(5,3,7), c=c(5,6,4))
> df
a b c
1 3 5 5
2 5 3 6
3 7 7 4
Is there a simple and efficient way to get a new data.frame with the same number of rows but with the mean of, for example, columns a and b by row? Something like this:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4

Use rowMeans() on the first two columns only. Then cbind() to the third column.
cbind(mean.of.a.and.b = rowMeans(df[-3]), df[3])
# mean.of.a.and.b c
# 1 4 5
# 2 4 6
# 3 7 4
Note: If you have any NA values in your original data, you may want to use na.rm = TRUE in rowMeans(). See ?rowMeans for more.
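To illustrate the note about missing values, here is a small sketch. The data frame df_na is a hypothetical variant of the question's data with one NA added; it is not part of the original question.

```r
# Hypothetical variant of the question's data with one missing value in column a
df_na <- data.frame(a = c(3, NA, 7), b = c(5, 3, 7), c = c(5, 6, 4))

# Default behaviour: any NA in a row makes that row's mean NA
means_default <- rowMeans(df_na[c("a", "b")])

# With na.rm = TRUE the mean is taken over the non-missing values in each row
means_narm <- rowMeans(df_na[c("a", "b")], na.rm = TRUE)

cbind(mean.of.a.and.b = means_narm, df_na["c"])
```

Note that with na.rm = TRUE the second row's mean is based on column b alone, which may or may not be what you want.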

Another option using the dplyr package:
library("dplyr")
df %>%
rowwise()%>%
mutate(mean.of.a.and.b = mean(c(a, b))) %>%
## Then if you want to remove a and b:
select(-a, -b)

I think the best option is rowMeans(), as posted by Richard Scriven. rowMeans and rowSums are equivalent to apply with FUN = mean or FUN = sum, but a lot faster. I post the apply version just for reference, in case we want to pass another function.
data.frame(mean.of.a.and.b = apply(df[-3], 1, mean), c = df[3])
Output:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
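As a sketch of the "pass another function" case, the same apply call can take any function of a row. The row-wise spread (max minus min) used here is only an illustrative choice, and the column name spread.of.a.and.b is made up for this example.

```r
df <- data.frame(a = c(3, 5, 7), b = c(5, 3, 7), c = c(5, 6, 4))

# MARGIN = 1 applies the function to each row of the selected columns
spread <- apply(df[c("a", "b")], 1, function(r) max(r) - min(r))

result <- data.frame(spread.of.a.and.b = spread, c = df$c)
result
```

For plain means or sums, rowMeans and rowSums remain the faster choice.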
A very verbose alternative using SQL with sqldf. Grouping by c yields row-wise means here only because every value of c is unique:
library(sqldf)
sqldf("SELECT (sum(a)+sum(b))/(count(a)+count(b)) as mean, c
FROM df group by c")
Output:
mean c
1 7 4
2 4 5
3 4 6

Related

Sort data.frame or data.table using vector of column names [duplicate]

I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which makes it very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
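A minimal sketch of what "by reference" means in practice, assuming the data.table package is installed. The object names DT, copy_sorted, a_before and a_after are made up for this illustration.

```r
library(data.table)

DT <- data.table(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c("C", "A")

# order() inside [ ] returns a new, sorted data.table; DT itself is untouched
copy_sorted <- DT[order(C, A)]
a_before <- DT$A

# setorderv() reorders DT in place, so no assignment is needed
setorderv(DT, sortby)
a_after <- DT$A
```

Because no copy is made, any other variable that refers to the same data.table sees the new row order as well.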
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at, which accepts string column names:
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or, with dplyr 1.0.0 or later:
DF %>% arrange(across(all_of(sortby)))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
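The base R call works because do.call passes each column of DF[sortby] to order as a separate argument, so it is equivalent to order(DF$C, DF$A). As a sketch, the same idea can be extended to mixed ascending/descending order, assuming the "radix" method of order, which accepts one decreasing flag per sort key. The names keys and idx are made up for this example.

```r
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c("C", "A")

# Unnamed list of sort keys, so a column name can never clash with an
# argument name of order()
keys <- unname(as.list(DF[sortby]))

# C ascending, A descending: one decreasing flag per key (radix method only)
idx <- do.call(order, c(keys, list(decreasing = c(FALSE, TRUE), method = "radix")))
DF[idx, ]
```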
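Also possible with dplyr, although get() looks up a single name, so this sorts by one column only (here the first entry of sortby):
DF %>%
arrange(get(sortby[1]))
But Ronak's answer is more elegant.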

Picking up only specific columns based on conditions on multiple columns in R [duplicate]

I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick, for each unique value of x, the row where y is largest. For example, rows 5 to 7 all have x = 3, and I would like to keep the one with y = 3 (the maximum); similarly, for x = 8 I'd like to keep the row with y = 4.
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer, but it only works in this specific case (picking the largest). Is there a better method, and what is the general solution?
One solution using dplyr
library(dplyr)
df %>%
group_by(x) %>%
slice(which.max(y)) # position of the largest y in the group, not its value
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
The base R alternative is aggregate, which returns only x and the per-group maximum of y (column z is dropped):
aggregate(y~x, df, max)
You can achieve the same result using a dplyr chain with group_by. After group_by, the remaining functions in the chain are applied within each group rather than to the whole data.frame. Here, filter keeps only the rows where y equals max(y) within each value of x. The same pattern can be adapted to the minimum of y or to a particular value.
I think it's generally good practice to ungroup the data at the end of a chain that uses group_by, to avoid unexpected behavior.
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
group_by(x) %>%
filter(y==max(y)) %>%
ungroup()
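One detail worth checking with this approach is how ties are handled. In the question's data, x = 6 has two rows with the same maximal y (z = "d" and z = "h"), and filter(y == max(y)) keeps both. The sketch below contrasts that with slice_max (available in dplyr 1.0.0 and later) using with_ties = FALSE, which keeps exactly one row per group. The names with_ties_kept and one_per_group are made up for this example.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 5, 6, 3, 3, 3, 6, 8, 8, 8, 8),
                 y = c(1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 4),
                 z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"))

# Keeps every row that attains the group maximum, so both x = 6 rows survive
with_ties_kept <- df %>%
  group_by(x) %>%
  filter(y == max(y)) %>%
  ungroup()

# Keeps exactly one row per group, even when several rows share the maximum
one_per_group <- df %>%
  group_by(x) %>%
  slice_max(y, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(with_ties_kept)  # 7
nrow(one_per_group)   # 6
```

Which of the two is right depends on whether tied rows should be kept, as they are in the question's desired output.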
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
group_by(x) %>%
summarise(y=mean(y)) %>%
ungroup()
Using data.table, we can use df[order(z), .I[which.max(y)], by = x] to get the row numbers of interest, e.g.:
library(data.table)
setDT(df)
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using dplyr package
library(dplyr)
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
df_out
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the data frame is ordered by df[order(df$x, df$y),] as it is in the example, you can use base R functions, split, lapply, and do.call/rbind to extract your desired rows using the "split / apply / combine" methodology.
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply which selects the last row of each data.frame, and returns these one row data.frames as a list. This list is then rbinded into a single data frame using do.call.
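If the data frame is not already sorted by y within x, a variant of the same split / apply / combine idea can pick the row with the largest y directly instead of taking the last row. This is a sketch rather than the answer's original code; the names pieces, picked and result are made up.

```r
df <- data.frame(x = c(1, 2, 5, 6, 3, 3, 3, 6, 8, 8, 8, 8),
                 y = c(1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 4),
                 z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"))

# split: one data.frame per distinct value of x
pieces <- split(df, df$x)

# apply: within each piece, keep the first row that attains the maximum of y
picked <- lapply(pieces, function(g) g[which.max(g$y), ])

# combine: stack the one-row data.frames back together
result <- do.call(rbind, picked)
result
```

Because which.max returns the first maximum, the tied x = 6 group yields row "d" here, whereas taking the last row yields "h".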

How to do something to each element in the group

Suppose I have a dataframe like so
a b c
1 2 3
1 3 4
1 4 5
2 5 6
2 6 7
3 7 8
4 8 9
What I want is the following:
a b c d
1 2 3 a
1 3 4 b
1 4 5 c
2 5 6 a
2 6 7 b
3 7 8 a
4 8 9 a
Essentially, for each group defined by column a, I want to create a new column that cycles through the letters from a to z in order. Group 1 has three elements, so its letters run from 'a' to 'c'. Groups 3 and 4 have only one element each, so they only get 'a'.
A data.table option is
library(data.table)
setDT(dd)[, d:= letters[seq_len(.N)], by = a]
One way to do this is with a split-apply-combine paradigm, as in plyr (or dplyr, data.table, ...).
Create data:
dd <- data.frame(a=rep(1:4,c(3,2,1,1)),
b=2:8,c=3:9)
Use ddply to split the data frame by variable a, transforming each piece by adding an appropriate variable, then recombine:
library("plyr")
ddply(dd,"a",
transform,
d=letters[1:length(b)])
Or in dplyr:
library("dplyr")
dd %>% group_by(a) %>%
mutate(d=letters[1:n()])
Or in base R (thanks @thelatemail):
dd$d <- ave(rownames(dd), dd$a,
FUN=function(x) letters[seq_along(x)] )
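The solutions above return NA once a group has more than 26 rows, because letters has only 26 elements. If the labels should wrap around to 'a' again, a modulo on the within-group position is one option. This is a sketch with hypothetical group sizes (28 rows in group 1), not data from the question.

```r
# Hypothetical grouping vector: group 1 has 28 rows, group 2 has 2 rows
a <- rep(1:2, c(28, 2))

# Position within each group, wrapped into the range 1..26
pos <- ave(seq_along(a), a, FUN = function(i) (seq_along(i) - 1) %% 26 + 1)

d <- letters[pos]
d[c(1, 26, 27, 28, 29)]  # "a" "z" "a" "b" "a"
```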

Summing down a column with multiple categories

If I have a data frame that looks like
dat<-data.frame(val= c(1,2,3,4,5,6,7),category= c("A","B","c","A","B","c","D"))
dat
val category
1 1 A
2 2 B
3 3 c
4 4 A
5 5 B
6 6 c
7 7 D
I'd like to AVERAGE by the category so the output looks like
A 2.5
B 3.5
C 4.5
D 7
What's the best way to do this?
The most straightforward way would be to use tapply as follows:
tapply(dat$val, dat$category, FUN = mean)
Note that if you have missing values, you'd want to ignore them when calculating the mean:
tapply(dat$val, dat$category, FUN = mean, na.rm = TRUE)
See ?tapply for details.
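tapply returns a named array rather than a data frame. If a data frame is more convenient, either convert the result or use aggregate with a formula. The sketch below shows both; the names means, out and agg are made up for this example.

```r
dat <- data.frame(val = c(1, 2, 3, 4, 5, 6, 7),
                  category = c("A", "B", "c", "A", "B", "c", "D"))

# Named array: one mean per category
means <- tapply(dat$val, dat$category, FUN = mean)

# Option 1: convert the tapply result into a two-column data frame
out <- data.frame(category = names(means), mean = as.vector(means))

# Option 2: aggregate returns a data frame directly
agg <- aggregate(val ~ category, data = dat, FUN = mean)
agg
```

The order of the categories in the result depends on how the locale sorts the lowercase "c" relative to the uppercase letters.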

R delete non max values in redundant rows

I have a matrix that contains following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are column names and
a, b, c are row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R. That is:
1) order the rows alphabetically by row name,
2) where several rows share the same row name, keep the maximum value in each column across those rows and delete the others.
I first did this in Python, but I was wondering whether there is a more convenient way to do it in R.
I would appreciate any help.
You can use data.table:
library(data.table)
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
as.matrix(data.frame(dt_max))
Here's a one-liner using data.table. Setting keep.rownames = TRUE stores the row names in a column called rn during conversion; lapply(.SD, max) then takes the maximum of every column within each rn group.
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
You can simply use the aggregate function, grouping by the row names (m is the matrix):
aggregate(m, by = list(rn = rownames(m)), FUN = max)
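For a reproducible check without extra packages, here is a base R sketch that rebuilds the matrix from the question and keeps the result as a matrix. The names m, rows_by_name and res are assumptions for this example; the approach (splitting row indices by row name) is an alternative to the answers above, not taken from them.

```r
# The matrix from the question, with repeated row names
m <- matrix(c(1, 3, 2, 5,
              3, 2, 5, 8,
              2, 1, 0, 9,
              4, 2, 1, 3,
              4, 3, 1, 1,
              2, 5, 1, 9),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("a", "b", "a", "a", "c", "b"),
                            c("A", "B", "C", "D")))

# Row indices grouped by row name; split() sorts the group names
rows_by_name <- split(seq_len(nrow(m)), rownames(m))

# Column-wise maximum within each group, stacked back into a matrix
res <- do.call(rbind, lapply(rows_by_name, function(i) {
  apply(m[i, , drop = FALSE], 2, max)
}))
res
```

drop = FALSE matters for groups with a single row (such as c), which would otherwise collapse to a plain vector before apply sees them.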
