Summing down a column with multiple categories - r

If I have a data frame that looks like
dat<-data.frame(val= c(1,2,3,4,5,6,7),category= c("A","B","c","A","B","c","D"))
val category
1 1 A
2 2 B
3 3 c
4 4 A
5 5 B
6 6 c
7 7 D
I'd like to AVERAGE by the category so the output looks like
A 2.5
B 3.5
C 4.5
D 7
What's the best way to do this?

The most straightforward way would be to use tapply as follows:
tapply(dat$val, dat$category, FUN = mean)
Note that if you have missing values, you'd want to amend it to ignore them when calculating the mean:
tapply(dat$val, dat$category, FUN = mean, na.rm = TRUE)
See ?tapply for details.
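As a small illustration of the na.rm point, here is a sketch that reuses the question's data with one value replaced by NA (the NA is an assumption added for the example, it is not in the original data):

```r
# Same data as in the question, but with one val set to NA for illustration
dat <- data.frame(val = c(1, 2, NA, 4, 5, 6, 7),
                  category = c("A", "B", "c", "A", "B", "c", "D"))

# Without na.rm = TRUE the mean of category "c" would be NA
means <- tapply(dat$val, dat$category, FUN = mean, na.rm = TRUE)
print(means)
```

The result is a named vector, so a single group can be looked up with means[["A"]].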


How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' on column a does not have a unique value in b, I want to drop both observations to get a new data.frame as this:
a b
4 B 4
7 C 5
I don't mind to have repeated values across b, as long as for every same level on a there is the same value in b.
Is there a way to do this? Thanks!
This one maybe?
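ag <- aggregate(b~a, df, unique)   # b holds the unique values of b for each level of a
ag[lengths(ag$b) == 1, ]           # keep the levels of a that have exactly one distinct b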
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table:
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
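For comparison, the same idea (keep a level of a only when it has a single distinct value of b) can be sketched in base R with ave. This is one possible variation, not taken from the answers above, and n_distinct_b is just an illustrative name:

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 3))
b <- c(1, 1, 2, 4, 4, 4, 5, 5, 5)
df <- data.frame(a, b)

# For every row, count the distinct values of b within its level of a
n_distinct_b <- ave(df$b, df$a, FUN = function(v) length(unique(v)))

# Keep rows from levels with a single distinct b, then drop repeated rows
out <- unique(df[n_distinct_b == 1, ])
print(out)
```

Because ave returns a vector as long as the data frame, the original row names (4 and 7 here) are preserved.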

Picking up only specific columns based on conditions on multiple columns in R

I have a data frame, say
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
it looks like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 1 e
6 3 2 f
7 3 3 g
8 6 1 h
9 8 1 i
10 8 2 j
11 8 3 k
12 8 4 l
I would like to pick the unique elements of column x, keeping for each one the row where y is at its maximum. For example, rows 5 to 7 all have x = 3, and I would like to keep the one with y = 3 (the maximum value); similarly, for x = 8 I'd like to keep the row with y = 4.
the output should look like this
x y z
1 1 1 a
2 2 1 b
3 5 1 c
4 6 1 d
5 3 3 g
6 6 1 h
7 8 4 l
I have a solution, which I am posting as an answer, but it only works in this specific case (picking the largest value). Is there a better method, and what is the general-case solution?
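One solution using dplyr:
library(dplyr)
df %>%
group_by(x) %>%
slice(which.max(y))   # assumed final step (this line is missing from the post); keeps the first max-y row per x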
# x y z
# (dbl) (dbl) (chr)
#1 1 1 a
#2 2 1 b
#3 3 3 g
#4 5 1 c
#5 6 1 d
#6 8 4 l
A base R alternative uses aggregate (note that it returns only x and y, dropping z):
aggregate(y~x, df, max)
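Since aggregate(y ~ x, df, max) returns only the x and y columns, a possible base R sketch for keeping the whole row (including z, and any tied rows) is to compare y against the per-group maximum computed with ave. This is an illustrative variation, not part of the original answer, and group_max is a name chosen for the example:

```r
df <- data.frame(x = c(1, 2, 5, 6, 3, 3, 3, 6, 8, 8, 8, 8),
                 y = c(1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 4),
                 z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"))

# Per-row copy of the maximum y within each x
group_max <- ave(df$y, df$x, FUN = max)

# Keep every row whose y equals its group maximum (ties are all kept)
keep <- df[df$y == group_max, ]
print(keep)
```

Replacing max with min, or the comparison with another condition, gives the more general case asked about in the question.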
You can achieve the same result using a dplyr chain and dplyr's group_by function. Once you use group_by, the remaining functions in the chain are applied within each group rather than to the whole data.frame. So here I filter down to the rows where y equals max(y) for each value of x. This can be extended to use the min of y or a particular value.
I think it's generally good practice to ungroup the data at the end of a chain that uses group_by, to avoid any unexpected behavior.
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df %>%
group_by(x) %>%
filter(y==max(y)) %>%
ungroup()
To make it more general... say instead you wanted the mean of y for a given x as opposed to the max. You could then use the summarise function instead of the filter as shown below.
df %>%
group_by(x) %>%
summarise(y=mean(y)) %>%
ungroup()
Using data.table (after converting the data frame with setDT(df)), df[order(z), .I[which.max(y)], by = x] gives the row numbers of interest, e.g.:
df[df[order(z), .I[which.max(y)], by = x][, V1]]
x y z
1: 1 1 a
2: 2 1 b
3: 5 1 c
4: 6 1 d
5: 3 3 g
6: 8 4 l
Here is my solution using the dplyr package:
df <- data.frame(x = c(1,2,5,6,3,3,3,6,8,8,8,8),
y = c(1,1,1,1,1,2,3,1,1,2,3,4),
z = c("a","b","c","d","e","f","g","h","i","j","k","l"))
df <- arrange(df,desc(y))
df_out <- df[!duplicated(df$x),]
Printing df_out
x y z
1 8 4 l
2 3 3 g
6 1 1 a
7 2 1 b
8 5 1 c
9 6 1 d
Assuming the rows are in increasing order of y within each value of x (for instance after df[order(df$x, df$y),]), as they are in the example, you can use the base R functions split, lapply, and do.call to extract your desired rows using the "split / apply / combine" methodology:
do.call(rbind, lapply(split(df, df$x), function(i) i[nrow(i),]))
x y z
1 1 1 a
2 2 1 b
3 3 3 g
5 5 1 c
6 6 1 h
8 8 4 l
split breaks up the data.frame into a list based on x. This list is fed to lapply, which selects the last row of each data.frame and returns these one-row data.frames as a list. This list is then rbinded into a single data frame using do.call.
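The split / apply / combine steps can also be sketched without relying on the row order, by picking the max-y row of each piece with which.max instead of the last row. This is a variation on the answer rather than its exact code; pieces, top and res are illustrative names, and on ties which.max keeps the first matching row:

```r
df <- data.frame(x = c(1, 2, 5, 6, 3, 3, 3, 6, 8, 8, 8, 8),
                 y = c(1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 3, 4),
                 z = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"))

pieces <- split(df, df$x)                                  # split: one data.frame per value of x
top <- lapply(pieces, function(g) g[which.max(g$y), ])     # apply: row with the largest y
res <- do.call(rbind, top)                                 # combine: stack the one-row pieces
print(res)
```

Swapping which.max for which.min, or for any function that returns row positions, adapts the same pattern to other selection rules.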

r get mean of n columns by row

I have a simple data.frame
> df <- data.frame(a=c(3,5,7), b=c(5,3,7), c=c(5,6,4))
> df
a b c
1 3 5 5
2 5 3 6
3 7 7 4
Is there a simple and efficient way to get a new data.frame with the same number of rows, but with the row-wise mean of, for example, columns a and b? Something like this:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
Use rowMeans() on the first two columns only. Then cbind() to the third column.
cbind(mean.of.a.and.b = rowMeans(df[-3]), df[3])
# mean.of.a.and.b c
# 1 4 5
# 2 4 6
# 3 7 4
Note: If you have any NA values in your original data, you may want to use na.rm = TRUE in rowMeans(). See ?rowMeans for more.
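To illustrate the na.rm note, here is a sketch using the question's data with one value replaced by NA (the NA is an assumption added for the example). Columns are selected by name rather than by position, which is a small design choice to make the intent explicit:

```r
# Question's data, with one value of a set to NA for illustration
df <- data.frame(a = c(3, NA, 7), b = c(5, 3, 7), c = c(5, 6, 4))

# Row 2 is averaged over the single non-missing value (b = 3)
out <- cbind(mean.of.a.and.b = rowMeans(df[c("a", "b")], na.rm = TRUE),
             df["c"])
print(out)
```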
Another option using the dplyr package:
library(dplyr)
df %>%
mutate(mean.of.a.and.b = (a + b) / 2) %>%   # row-wise; mean(c(a, b)) would return one overall mean
## Then if you want to remove a and b:
select(-a, -b)
I think the best option is using rowMeans(), as posted by Richard Scriven. rowMeans and rowSums are equivalent to apply with FUN = mean or FUN = sum, but a lot faster. I post the version with apply just for reference, in case we would like to pass another function.
data.frame(mean.of.a.and.b = apply(df[-3], 1, mean), c = df[3])
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
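As a sketch of "passing another function", the same apply call can compute, say, the row-wise maximum of a and b. The choice of max and the column name max.of.a.and.b are illustrative assumptions, not part of the original answer:

```r
df <- data.frame(a = c(3, 5, 7), b = c(5, 3, 7), c = c(5, 6, 4))

# MARGIN = 1 applies the function to each row of the selected columns
row_max <- apply(df[c("a", "b")], 1, max)

out <- data.frame(max.of.a.and.b = row_max, c = df$c)
print(out)
```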
A very verbose alternative using SQL with sqldf (this works here only because every row has a distinct value of c):
sqldf("SELECT (sum(a)+sum(b))/(count(a)+count(b)) as mean, c
FROM df group by c")
mean c
1 7 4
2 4 5
3 4 6

R delete non max values in redundant rows

I have a matrix that contains following:
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are the column names and
a, b, c are the row names.
I want to make it look like
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R. That is, I want to:
1) order the rows alphabetically by row name, and
2) if there are redundant rows (i.e. other rows with the same row name), keep the maximum value among them for each column and delete the others.
I first used python to do this process, but I was wondering if there is
more convenient way for this job in R.
I would appreciate any help.
You can use data.table
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
Here's a one-liner using data.table: keep the row names while converting to a data.table, then apply the max function over all columns using lapply(.SD, ...), grouped by the rn variable (the saved row names).
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
You can simply use the aggregate function (here m is the matrix), grouping by the row names:
aggregate(data.frame(m, row.names = NULL), by = list(name = rownames(m)), FUN = max)
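The question does not show how the matrix is built, so here is a self-contained sketch that constructs it (the name m and the construction are assumptions) and takes the per-row-name maximum of each column with tapply. It is a plain base R variation on the answers above, not their exact code:

```r
# Matrix from the question, with repeated row names
m <- matrix(c(1, 3, 2, 5,
              3, 2, 5, 8,
              2, 1, 0, 9,
              4, 2, 1, 3,
              4, 3, 1, 1,
              2, 5, 1, 9),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("a", "b", "a", "a", "c", "b"),
                            c("A", "B", "C", "D")))

# For each column, take the maximum within each row name;
# tapply returns the groups sorted by name, so rows come out as a, b, c
res <- apply(m, 2, function(col) tapply(col, rownames(m), max))
print(res)
```

The result stays a matrix with unique row names, which may be convenient if the next step expects a matrix rather than a data.table or data.frame.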

sort and number within levels of a factor in r

If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e. I want to sort the whole data frame within the levels of the factor type, based on vector x, get the length of each level (a = 3, b = 2, c = 1), and then number the rows in decreasing order in a new vector y.
My starting place is currently with sort()
tapply(y, x, sort)
Would it be best to first try and use sapply to split everything first?
There are many ways to skin this cat. Here is one solution using base R and vectorized code in two steps (without any apply):
1) Sort the data using order and xtfrm.
2) Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
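The result above numbers each group upwards (1, 2, 3), while the question asks for a countdown (3, 2, 1). A possible sketch of that variation, reusing the same order and rle steps, is shown here; runs is an illustrative name, and since x is numeric a plain minus sign is used in place of xtfrm:

```r
dat <- data.frame(z = 1:6,
                  type = c("a", "a", "a", "b", "b", "c"),
                  x = c(4, 5, 6, 1, 0.9, 4),
                  stringsAsFactors = FALSE)

# Sort by type (ascending), then by x (descending) within type
r <- dat[order(dat$type, -dat$x), ]

# Run lengths of type: one entry per group, in sorted order
runs <- rle(r$type)$lengths

# Count down from the group size to 1 within each group
r$y <- unlist(lapply(runs, function(n) rev(seq_len(n))))
print(r)
```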
I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=T)
Three lines of code:
dat <- dat[order(dat$type, -dat$x), ]   # also sort x in decreasing order within type, as in the desired output
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to address the comments Andrie mentioned. This works, but if you went this route instead of Andrie's you're crazy.
