Mean of Multiple Columns in R

I am trying to take the mean of a list of columns in R and am running into an issue. Let's say I have:
A B C D
1 2 3 4
5 6 7 8
9 10 11 12
What I am trying to do is take the row-wise mean of columns c(A,C) and save it as a new column, say E, and likewise take the mean of columns c(B,D) and save it as a different column, say F. Is that possible?
E F
2 3
6 7
10 11

Check out dplyr:
library(dplyr)
df <- df %>% mutate(E=(A+C)/2, F=(B+D)/2)
df
A B C D E F
1 1 2 3 4 2 3
2 5 6 7 8 6 7
3 9 10 11 12 10 11

We can subset the dataset with columns 1 & 2, another one with 3 & 4, add them together, divide by 2, and change the column names with setNames
setNames((df1[1:2] + df1[3:4])/2, c("E", "F"))
# E F
#1 2 3
#2 6 7
#3 10 11
Another option is rowMeans: use a recycling logical vector to split the columns into two subsets, keep them in a list, then loop through the list (using sapply) and get the rowMeans of each
i1 <- c(TRUE, FALSE)
sapply(list(df1[i1], df1[!i1]), rowMeans)
Or another option is to unlist the dataset, convert it to a 3 x 2 x 2 array, and use apply to take the mean across the third dimension
apply(array(unlist(df1), c(3, 2, 2)), c(1,2), mean)
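The rowMeans approach above can also be driven by column names instead of positions, which helps when the groups are not interleaved. This is only a sketch: the `groups` list and its names are assumptions made for illustration, and `df1` is rebuilt here from the question's data.

```r
# Data from the question
df1 <- data.frame(A = c(1, 5, 9), B = c(2, 6, 10),
                  C = c(3, 7, 11), D = c(4, 8, 12))

# Hypothetical mapping: new column name -> source columns to average
groups <- list(E = c("A", "C"), F = c("B", "D"))

# rowMeans() over each column subset; sapply() binds the results into a matrix
res <- as.data.frame(sapply(groups, function(cols) rowMeans(df1[cols])))
res
```

Adding another output column then only requires adding another entry to `groups`.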

Related

R logical indexing by equality on multiple columns

Say I have the following dataframe, df:
A B C
1 4 25 a
2 3 79 b
3 4 25 c
4 6 17 d
5 4 21 e
6 5 25 f
How do I index the lines where elements in certain columns match a vector, e.g. df$A == 4 & df$B == 25?
I would have thought something like: df[df[,c("A", "B")] == c(4, 25),] would have worked, but this doesn't give me the right answer (it returns no lines).
I would like a method that uses a vector of column names to match on, and a vector of values to match.
Try this:
df[colSums(t(df[,c("A", "B")]) == c(4, 25))==2,]
x A B C
1 1 4 25 a
3 3 4 25 c
This works by recycling the vector c(4, 25) down the columns of the transposed data, so each row of df is compared against both values.
You could also use a simple merge, which has analogues in data.table, dplyr, and SQL as well:
merge(df, setNames(list(4,25),c("A","B")))
# A B C
#1 4 25 a
#2 4 25 c
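The attempt in the question fails because `==` recycles c(4, 25) down the rows of each column rather than across the columns. A base R alternative that takes a vector of column names and a vector of values, as the question asks, might look like the sketch below. The names `cols`, `vals`, `hits`, and `keep` are illustrative, and `df` is rebuilt from the question's data.

```r
df <- data.frame(A = c(4, 3, 4, 6, 4, 5),
                 B = c(25, 79, 25, 17, 21, 25),
                 C = c("a", "b", "c", "d", "e", "f"))

cols <- c("A", "B")   # columns to match on
vals <- c(4, 25)      # value required in each column, in the same order as cols

# One logical vector per column, then combine them with an element-wise AND
hits <- Map(function(col, val) df[[col]] == val, cols, vals)
keep <- Reduce(`&`, hits)
df[keep, ]
```

Unlike the colSums version, this never converts the data frame to a matrix, so columns of different types are compared in their own type.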

How to do something to each element in the group

Suppose I have a dataframe like so
a b c
1 2 3
1 3 4
1 4 5
2 5 6
2 6 7
3 7 8
4 8 9
What I want is the following:
a b c d
1 2 3 a
1 3 4 b
1 4 5 c
2 5 6 a
2 6 7 b
3 7 8 a
4 8 9 a
Essentially, I want to do a cycling: for each group defined by column a, I want to create a new column which cycles through the letters from a to z in order. Group 1 has three elements, so the letters go from 'a' to 'c'. Groups 3 and 4 have only one element each, so they only get assigned 'a'.
A data.table option is
library(data.table)
setDT(dd)[, d:= letters[seq_len(.N)], by = a]
One way to do this is with a split-apply-combine paradigm, as in plyr (or dplyr or data.table or ...).
Create data:
dd <- data.frame(a=rep(1:4,c(3,2,1,1)),
b=2:8,c=3:9)
Use ddply to split the data frame by variable a, transforming each piece by adding an appropriate variable, then recombine:
library("plyr")
ddply(dd,"a",
transform,
d=letters[1:length(b)])
Or in dplyr:
library("dplyr")
dd %>% group_by(a) %>%
mutate(d=letters[1:n()])
Or in base R (thanks @thelatemail):
dd$d <- ave(rownames(dd), dd$a,
FUN=function(x) letters[seq_along(x)] )
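The answers above index letters directly, so a group with more than 26 rows would get NA from row 27 onward. If "cycling" is meant to wrap back to 'a' after 'z', a modulo on the within-group position is one way to do it. This is a sketch under that assumption; the 30-row group is made up purely to show the wrap.

```r
# Hypothetical data: group 1 has 30 rows, group 2 has 2 rows
dd <- data.frame(a = rep(1:2, c(30, 2)))

# Position within each group: 1, 2, ..., group size
pos <- ave(seq_along(dd$a), dd$a, FUN = seq_along)

# Map positions 1..26 to a..z, then position 27 back to "a", and so on
dd$d <- letters[(pos - 1) %% 26 + 1]
```

The same `(pos - 1) %% 26 + 1` index can replace `seq_len(.N)` or `1:n()` in the data.table and dplyr versions.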

r get mean of n columns by row

I have a simple data.frame
> df <- data.frame(a=c(3,5,7), b=c(5,3,7), c=c(5,6,4))
> df
a b c
1 3 5 5
2 5 3 6
3 7 7 4
Is there a simple and efficient way to get a new data.frame with the same number of rows but with the mean of, for example, columns a and b by row? Something like this:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
Use rowMeans() on the first two columns only. Then cbind() to the third column.
cbind(mean.of.a.and.b = rowMeans(df[-3]), df[3])
# mean.of.a.and.b c
# 1 4 5
# 2 4 6
# 3 7 4
Note: If you have any NA values in your original data, you may want to use na.rm = TRUE in rowMeans(). See ?rowMeans for more.
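To illustrate the note about missing values, here is a small sketch. The data frame `df_na` is made up for this example and is not from the question.

```r
# Hypothetical data with missing values in a and b
df_na <- data.frame(a = c(3, NA, 7), b = c(5, 3, NA), c = c(5, 6, 4))

# Default behaviour: any row containing an NA yields NA
rowMeans(df_na[c("a", "b")])

# With na.rm = TRUE the mean is taken over the non-missing values in each row
m <- rowMeans(df_na[c("a", "b")], na.rm = TRUE)
cbind(mean.of.a.and.b = m, df_na["c"])
```

Be aware that a row in which every selected value is NA gives NaN rather than NA when na.rm = TRUE.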
Another option using the dplyr package:
library("dplyr")
df %>%
rowwise()%>%
mutate(mean.of.a.and.b = mean(c(a, b))) %>%
## Then if you want to remove a and b:
select(-a, -b)
I think the best option is using rowMeans() posted by Richard Scriven. rowMeans and rowSums are equivalent to use of apply with FUN = mean or FUN = sum but a lot faster. I post the version with apply just for reference, in case we would like to pass another function.
data.frame(mean.of.a.and.b = apply(df[-3], 1, mean), c = df[3])
Output:
mean.of.a.and.b c
1 4 5
2 4 6
3 7 4
A very verbose alternative using SQL with sqldf:
library(sqldf)
sqldf("SELECT (sum(a)+sum(b))/(count(a)+count(b)) as mean, c
FROM df group by c")
Output:
mean c
1 7 4
2 4 5
3 4 6

Subset columns using logical vector

I have a dataframe from which I want to drop columns whose NA rate is above 70%, or in which a single dominant value takes up over 99% of the rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {//some code}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks to @MrFlick for the advice to change from colSums()/nrow()), as shown at the bottom of this answer.
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
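The question also asks about dropping columns where one value dominates more than 99% of rows, which the code above does not cover. One possible sketch follows; the helper `dominantRate` and the extra column `w` are made up for illustration, and the rate is computed against all rows (including NA rows), which is an assumption about what the asker wants.

```r
d <- data.frame(x = rep(NA, 5),
                y = c(1, NA, NA, 1, 1),
                z = c(rep(NA, 3), 1, 2),
                w = rep(7, 5))          # hypothetical constant column

# Share of rows taken by the most frequent non-missing value (0 if all NA)
dominantRate <- function(x) {
  counts <- table(x)                    # table() drops NA by default
  if (length(counts) == 0) return(0)
  max(counts) / length(x)
}

# Keep columns that pass both the NA-rate check and the dominant-value check
keep <- colMeans(is.na(d)) <= 0.7 & sapply(d, dominantRate) <= 0.99
d[keep]
```

Here x is dropped for its NA rate and w for being constant, leaving y and z.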
Maybe this will help too. The second argument to apply(), 2, means the function is applied column-wise to the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result has to live in two columns, since an ordinary data frame column holds a single value per row. (Unless you want to resort to list columns or complex numbers.) head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
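Since the question also asks about summing or averaging the offset range, the same embed() matrix can be passed to rowSums() or rowMeans(). This is a sketch; the column names `b_sum` and `b_mean` are made up for illustration.

```r
df <- data.frame(a = 1:10)

# Columns: value 2 rows back, value 1 row back (NA where no such row exists)
lags <- embed(c(NA, NA, df$a), 3)[, c(3, 2)]

df$b_sum  <- rowSums(lags)    # NA for rows 1 and 2, where the window is incomplete
df$b_mean <- rowMeans(lags)
df
```

Passing na.rm = TRUE would use the partial window on row 2 instead of returning NA, but row 1 has no values at all, so rowSums() would give 0 there and rowMeans() would give NaN.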
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources