Calculate the mean of multiple columns in R and generate a new table - r

I have a data set in .csv format. It contains multiple columns, for example:
Group Wk1 Wk2 Wk3 Wk4 Wk5 Wk6
A     1   2   3   4   5   6
B     7   8   9   1   2   3
C     4   5   6   7   8   9
D     1   2   3   4   5   6
I want the means of Wk1 & Wk2 combined, Wk3 alone, Wk4 & Wk5 combined, and Wk6 alone. How can I do that?
The result should look like:
Group 1    2    3   4
mean  3.75 5.25 4.5 6
And how can I save it into a new table?
Thanks in advance.

You can melt your data.frame, create your groups using some basic indexing, and use aggregate:
library(reshape2)
# mydf is the example data frame from the question
X <- melt(mydf, id.vars = "Group")
# Map each week to an output group; indexing the named vector by the
# factor X$variable selects entries by level position
Match <- c(Wk1 = 1, Wk2 = 1, Wk3 = 2, Wk4 = 3, Wk5 = 3, Wk6 = 4)
aggregate(value ~ Match[X$variable], X, mean)
# Match[X$variable] value
# 1 1 3.75
# 2 2 5.25
# 3 3 4.50
# 4 4 6.00
tapply is also an appropriate candidate here:
tapply(X$value, Match[X$variable], mean)
# 1 2 3 4
# 3.75 5.25 4.50 6.00
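To address the saving part of the question (my addition; the original answer stops at computing the means): the result is an ordinary data frame, so you can assign it to a name and write it out. A minimal sketch, with "week_means.csv" as an illustrative file name:
result <- aggregate(value ~ Match[X$variable], X, mean)
names(result) <- c("Group", "mean")
write.csv(result, "week_means.csv", row.names = FALSE)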

Related

create new var by product of preceding result per group id in dplyr

I have the following data, and what I need is to create a new variable new, obtained as the running product of the preceding row values of column z within each group id. E.g., for id x = 1 the first values of column new are 0.9, 0.9*0.1, and 0.9*0.1*0.5.
data <- data.frame(x = c(1,1,1,1,2,2,3,3,3,4,4,4,4),
                   y = c(4,2,2,6,5,6,6,7,8,2,1,6,5),
                   z = c(0.9,0.1,0.5,0.12,0.6,1.2,2.1,0.9,0.4,0.8,0.45,1.3,0.85))
Desired outcome:
x y z new
1 1 4 0.90 0.9000
2 1 2 0.10 0.0900
3 1 2 0.50 0.0450
4 1 6 0.12 0.0054
5 2 5 0.60 0.6000
6 2 6 1.20 0.7200
7 3 6 2.10 2.1000
8 3 7 0.90 1.8900
9 3 8 0.40 0.7560
10 4 2 0.80 0.8000
11 4 1 0.45 0.3600
12 4 6 1.30 0.4680
13 4 5 0.85 0.3978
We can use cumprod from base R together with dplyr grouping:
library(dplyr)
data %>%
  group_by(x) %>%
  mutate(new = cumprod(z)) %>%
  ungroup()
Or with base R only:
data$new <- with(data, ave(z, x, FUN = cumprod))
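For intuition, cumprod just returns the running product of a vector, so grouping restarts that product for each id:
cumprod(c(0.9, 0.1, 0.5, 0.12))
# [1] 0.9000 0.0900 0.0450 0.0054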

Add observation number by group in R [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
This is a silly question but I am new to R and it would make my life so much easier if I could figure out how to do this!
So here is some sample data
data <- read.table(text = "Category Y
A 5.1
A 3.14
A 1.79
A 3.21
A 5.57
B 3.68
B 4.56
B 3.32
B 4.98
B 5.82
",header = TRUE)
I want to add a column that counts the number of observations within a group. Here is what I want it to look like:
Category Y OBS
A 5.1 1
A 3.14 2
A 1.79 3
A 3.21 4
A 5.57 5
B 3.68 1
B 4.56 2
B 3.32 3
B 4.98 4
B 5.82 5
I have tried:
data <- data %>% group_by(Category) %>% mutate(count = c(1:length(Category)))
which just creates another column numbered from 1 to 10, and
data <- data %>% group_by(Category) %>% add_tally()
which just creates another column of all 5s.
Base R:
data$OBS <- ave(seq_len(nrow(data)), data$Category, FUN = seq_along)
data
# Category Y OBS
# 1 A 5.10 1
# 2 A 3.14 2
# 3 A 1.79 3
# 4 A 3.21 4
# 5 A 5.57 5
# 6 B 3.68 1
# 7 B 4.56 2
# 8 B 3.32 3
# 9 B 4.98 4
# 10 B 5.82 5
BTW: one can use any of the frame's columns as the first argument, including ave(data$Category, data$Category, FUN=seq_along), but ave chooses its output class based on the input class, so using a string as the first argument will result in a return of strings:
ave(data$Category, data$Category, FUN = seq_along)
# [1] "1" "2" "3" "4" "5" "1" "2" "3" "4" "5"
While not heinous, it needs to be an intentional choice. Since it appears that you wanted an integer in that column, I chose the simplest integer-in, integer-out approach. It could also have used rep(1L,nrow(data)) or anything that is both integer and the same length as the number of rows in the frame, since seq_along (the function I chose) won't otherwise care.
library(data.table)
setDT(data)[, OBS := seq_len(.N), by = .(Category)]
data
Category Y OBS
1: A 5.10 1
2: A 3.14 2
3: A 1.79 3
4: A 3.21 4
5: A 5.57 5
6: B 3.68 1
7: B 4.56 2
8: B 3.32 3
9: B 4.98 4
10: B 5.82 5
library(dplyr)
data %>% group_by(Category) %>% mutate(Obs = row_number())
# A tibble: 10 x 3
# Groups: Category [2]
Category Y Obs
<chr> <dbl> <int>
1 A 5.1 1
2 A 3.14 2
3 A 1.79 3
4 A 3.21 4
5 A 5.57 5
6 B 3.68 1
7 B 4.56 2
8 B 3.32 3
9 B 4.98 4
10 B 5.82 5
Or, applying ave directly to the grouping column (note this returns character values, as discussed above):
data$OBS <- ave(data$Category, data$Category, FUN = seq_along)
data
Category Y OBS
1 A 5.10 1
2 A 3.14 2
3 A 1.79 3
4 A 3.21 4
5 A 5.57 5
6 B 3.68 1
7 B 4.56 2
8 B 3.32 3
9 B 4.98 4
10 B 5.82 5
Another base R approach:
category <- c(rep('A', 5), rep('B', 5))
sequence <- sequence(rle(as.character(category))$lengths)
data <- data.frame(category = category, sequence = sequence)
head(data, 10)
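One more compact option (my addition, not one of the original answers): data.table ships a rowid() helper that produces exactly this within-group index:
library(data.table)
# rowid() returns a sequence along each distinct value of its argument
data$OBS <- rowid(data$Category)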

Complex aggregate function construction in R? [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 2 years ago.
Probably this is not that complex, but I couldn't figure out how to write a concise title explaining it:
I'm trying to use the aggregate function in R to return (1) the lowest value of a given column (val) by category (cat.2) in a data frame and (2) the value of another column (cat.1) on the same row. I know how to do part #1, but I can't figure out part #2.
The data:
cat.1 <- c(1,2,3,4,5,1,2,3,4,5)
cat.2 <- c(1,1,1,2,2,2,2,3,3,3)
val <- c(10.1,10.2,9.8,9.7,10.5,11.1,12.5,13.7,9.8,8.9)
df <- data.frame(cat.1, cat.2, val)
> df
cat.1 cat.2 val
1 1 1 10.1
2 2 1 10.2
3 3 1 9.8
4 4 2 9.7
5 5 2 10.5
6 1 2 11.1
7 2 2 12.5
8 3 3 13.7
9 4 3 9.8
10 5 3 8.9
I know how to use aggregate to return the minimum value for each cat.2:
> aggregate(df$val, by=list(df$cat.2), FUN=min)
Group.1 x
1 1 9.8
2 2 9.7
3 3 8.9
The second part of it, which I can't figure out, is to return the value in cat.1 on the same row of df where aggregate found min(df$val) for each cat.2. Not sure I'm explaining it well, but this is the intended result:
> ...
Group.1 x cat.1
1 1 9.8 3
2 2 9.7 4
3 3 8.9 5
Any help much appreciated.
If we need the output after the aggregate, we can do a merge with the original dataset:
merge(aggregate(df$val, by = list(df$cat.2), FUN = min),
      df, by.x = c('Group.1', 'x'), by.y = c('cat.2', 'val'))
# Group.1 x cat.1
#1 1 9.8 3
#2 2 9.7 4
#3 3 8.9 5
But this can be done more easily with dplyr, using slice to keep the row with the minimum 'val' within each 'cat.2' group:
library(dplyr)
df %>%
  group_by(cat.2) %>%
  slice(which.min(val))
# A tibble: 3 x 3
# Groups: cat.2 [3]
# cat.1 cat.2 val
# <dbl> <dbl> <dbl>
#1 3 1 9.8
#2 4 2 9.7
#3 5 3 8.9
Or with data.table
library(data.table)
setDT(df)[, .SD[which.min(val)], cat.2]
Or in base R, this can be done with ave:
df[with(df, val == ave(val, cat.2, FUN = min)),]
# cat.1 cat.2 val
#3 3 1 9.8
#4 4 2 9.7
#10 5 3 8.9
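A behavioral note (mine, not from the original answers): which.min() keeps only the first row that attains the minimum, while the ave() filter keeps every tied row. A tiny hypothetical illustration:
tie <- data.frame(cat.2 = c(1, 1), val = c(5, 5))
tie[with(tie, val == ave(val, cat.2, FUN = min)), ]  # both tied rows
tie[which.min(tie$val), ]                            # first row only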

rotating variable in ddply

I am trying to get means from columns in a data frame based on a unique value. So I am trying to get the mean of column b and column c in this example based on the unique values in column a. I thought the .(a) would make it calculate by unique value in a (it gives the unique values of a), but it just gives a mean for the whole column b or c.
df2 <- data.frame(a = seq(1:5), b = c(1:10), c = c(11:20))
simVars <- c("b", "c")
for (var in simVars) {
  print(var)
  dat = ddply(df2, .(a), summarize, mean_val = mean(df2[[var]])) ## my script
  assign(var, dat)
}
c
a mean_val
1 15.5
2 15.5
3 15.5
4 15.5
5 15.5
How can I have it take an average for the column based on the unique value from column a?
thanks
You don't need a loop. Just calculate the means of b and c within a single call to ddply and the means will be calculated separately for each value of a. And, as #Gregor said, you don't need to re-specify the data frame name inside mean():
ddply(df2, .(a), summarise,
      mean_b = mean(b),
      mean_c = mean(c))
a mean_b mean_c
1 1 3.5 13.5
2 2 4.5 14.5
3 3 5.5 15.5
4 4 6.5 16.5
5 5 7.5 17.5
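If you do want to keep the loop over column names, here is a sketch of one way to repair the original code (the bug is that mean(df2[[var]]) references the full data frame rather than each group's subset). Passing ddply a function makes the per-group piece available:
library(plyr)
for (var in simVars) {
  # d is the subset of df2 for one value of a, so d[[var]] is group-local
  dat <- ddply(df2, .(a), function(d) data.frame(mean_val = mean(d[[var]])))
  assign(var, dat)
}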
UPDATE: To get separate data frames for each column of means:
# Add a few additional columns to the data frame
df2 = data.frame(a = seq(1:5), b = c(1:10), c = c(11:20), d = c(21:30), e = c(31:40))
# New data frame with means by each level of column a
library(dplyr)
dfmeans = df2 %>%
  group_by(a) %>%
  summarise_each(funs(mean))
# Separate each column of means into a separate data frame and store it in a list:
means.list = lapply(names(dfmeans)[-1], function(x) {
  cbind(dfmeans[, "a"], dfmeans[, x])
})
means.list
[[1]]
a b
1 1 3.5
2 2 4.5
3 3 5.5
4 4 6.5
5 5 7.5
[[2]]
a c
1 1 13.5
2 2 14.5
3 3 15.5
4 4 16.5
5 5 17.5
[[3]]
a d
1 1 23.5
2 2 24.5
3 3 25.5
4 4 26.5
5 5 27.5
[[4]]
a e
1 1 33.5
2 2 34.5
3 3 35.5
4 4 36.5
5 5 37.5
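A side note (mine, not part of the original answer): summarise_each() and funs() have since been deprecated in dplyr; the modern equivalent of that step uses across():
library(dplyr)
dfmeans <- df2 %>%
  group_by(a) %>%
  summarise(across(everything(), mean))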

sequential subtraction in r

I would highly appreciate if somebody could help me out with this. This looks simple but I have no clue how to go about it.
I am trying to work out the percentage change in one row with respect to the previous one. For example: my data frame looks like this:
day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8
. .
. .
. .
365 27.2
What I am trying to do is to calculate the percentage change in each row with respect to previous row. For example:
day value
1 21
2 ((day2-day1)/day1)*100
3 ((day3-day2)/day2)*100
4 ((day4-day3)/day3)*100
5 ((day5-day4)/day4)*100
6 ((day6-day5)/day5)*100
7 ((day7-day6)/day6)*100
8 ((day8-day7)/day7)*100
. .
. .
. .
365 ((day365-day364)/day364)*100
and then print out only those days where there was a percentage increase of >50% from the previous row.
Many thanks
You are looking for diff(). See its help page by typing ?diff. Here are the indices of days that fulfill your criterion:
> value <- c(21, 23.4, 10.7, 5.6, 3.2, 35.2, 12.9, 67.8)
> which(diff(value) / head(value, -1) > 0.5) + 1
[1] 6 8
Use diff, dividing by the previous value (the vector without its last element):
pct <- 100 * diff(value) / value[-length(value)]
Here's one way:
dat <- data.frame(day = 1:10, value = 1:10)
dat2 <- transform(dat, value2 = c(value[1], diff(value) / head(value, -1) * 100))
day value value2
1 1 1 1.00000
2 2 2 100.00000
3 3 3 50.00000
4 4 4 33.33333
5 5 5 25.00000
6 6 6 20.00000
7 7 7 16.66667
8 8 8 14.28571
9 9 9 12.50000
10 10 10 11.11111
dat2[dat2$value2 > 50, ]
day value value2
2 2 2 100
You're looking for the diff function:
x<-c(3,1,4,1,5)
diff(x)
[1] -2 3 -3 4
Here is another way:
#dummy data
df <- read.table(text="day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8", header=TRUE)
#get index of rows with >50% change from the previous row
x <- sapply(2:nrow(df), function(i) ((df$value[i] - df$value[i-1]) / df$value[i-1]) > 0.5)
#output
df[c(FALSE, x), ]
# day value
#6 6 35.2
#8 8 67.8
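For completeness, a tidyverse-style sketch of the same filter (my addition; lag() shifts the column down by one so each row can see the previous value):
library(dplyr)
df %>%
  mutate(pct = (value - lag(value)) / lag(value) * 100) %>%
  filter(pct > 50)
#   day value       pct
# 1   6  35.2 1000.0000
# 2   8  67.8  425.5814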
