How to calculate value for an observation by group? - r

I have a data frame like so:
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
mydf
group score
1 a 10
2 a 9
3 a 2
4 a 3
5 b 1
6 b 10
7 b 1
8 b 10
9 c 3
10 c 7
11 c 1
12 c 3
I can calculate the mean of each group like so:
> by(mydf[,c("score")], mydf$group, mean)
mydf$group: a
[1] 6
-------------------------------------------------------------------
mydf$group: b
[1] 5.5
-------------------------------------------------------------------
mydf$group: c
[1] 3.5
But what I wish to do is create a new column, say called residual, which contains each score's residual from the mean of its group. It seems like there should be some way to use one of the apply functions to do this, but for some reason I can't see it.
I would want my end result to look like so:
mydf
group score residual
1 a 10 4
2 a 9 3
3 a 2 -4
4 a 3 -3
5 b 1 -4.5
6 b 10 4.5
7 b 1 -4.5
8 b 10 4.5
9 c 3 -.5
10 c 7 3.5
11 c 1 -2.5
12 c 3 -.5
Any ideas or pointers in the right direction are appreciated.

How about:
mydf$score - tapply(mydf$score, mydf$group, mean)[as.character(mydf$group)]
tapply works the same as by but with nicer output: a vector of group means named by group. The [as.character(mydf$group)] subsets and replicates tapply's output by name so that it aligns with mydf$group.
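To see the name-based indexing at work, here is a minimal sketch that uses the fixed scores printed in the question instead of a random sample. The names group_means and row_means are illustrative and do not appear in the answer.

```r
# Scores copied from the data frame printed in the question
mydf <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  score = c(10, 9, 2, 3, 1, 10, 1, 10, 3, 7, 1, 3)
)

# One mean per group, named "a", "b", "c"
group_means <- tapply(mydf$score, mydf$group, mean)

# Indexing by name repeats each group's mean once per row of mydf
row_means <- group_means[as.character(mydf$group)]

# as.vector() drops the names so the new column is a plain numeric vector
mydf$residual <- mydf$score - as.vector(row_means)
print(mydf)
```

The as.character() call matters mainly when group is a factor: indexing with a factor would use its integer codes rather than its labels.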

library(dplyr)
mydf %>% group_by(group) %>% mutate(residual = score - mean(score))
I take the data, I group by group, then I add a column (using mutate) which is the difference between the variable score and the mean of that variable in each group.
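If dplyr is not available, the same "subtract the group mean from every row" idea can be sketched in base R with ave(), which returns the group mean repeated for each row (mean is its default function). This is a swapped-in alternative, not the answerer's code, and it reuses the fixed scores from the question.

```r
# Scores copied from the data frame printed in the question
mydf <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  score = c(10, 9, 2, 3, 1, 10, 1, 10, 3, 7, 1, 3)
)

# ave() computes mean(score) within each group and returns it row by row
mydf$residual <- mydf$score - ave(mydf$score, mydf$group)
print(mydf)
```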

library(hash)
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
byResult <- by(mydf[,c("score")], mydf$group, mean)
h <- hash(keys= names(byResult), values =byResult)
residualsVar <- apply(mydf,1,function(row){
  # residual = score minus the mean of the row's group
  as.numeric(row[2])-as.vector(values(h,row[1]))
})
df <- cbind(mydf,residualsVar)

Related

How to recursively compute average over time in R

Consider the following dataset
period<-c(1,2,3,4,5)
x<-c(3,6,7,4,6)
cumulative_average<-c((3)/1,(3+6)/2,(3+6+7)/3,(3+6+7+4)/4,(3+6+7+4+6)/5)
df_test<-data.frame(period, value=x, cum_average=cumulative_average)
df_test
period value cum_average
1 3 3
2 6 4.5
3 7 5.3
4 4 5.0
5 6 5.2
Assume that the 5 observations in 'x' represent the values taken by a variable in 'period' 1 to 5, respectively. How can I produce the column 'cum_average'?
I believe that this could be done using zoo::timeAverage, but when I try to load the package on my relatively old machine I run into a conflict and cannot use it.
Any help would be much appreciated!
Solution
new_df <- df_test %>% mutate(avgT = cumsum(value)/period)
did the trick.
Thank you so much for your answers!
Maybe you are looking for this. You can first compute the cumulative sum, as mentioned by @tmfmnk, and then, if the mean is required, divide by the row number, which tracks the number of observations so far. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df_test %>% mutate(AvgTime=cumsum(x)/row_number())
Output:
period x AvgTime
1 1 3 3.000000
2 2 6 4.500000
3 3 7 5.333333
4 4 4 5.000000
5 5 6 5.200000
If only cumulative sum is needed:
#Code2
newdf <- df_test %>% mutate(CumTime=cumsum(x))
Output:
period x CumTime
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Or only base R:
#Base R
df_test$Cumsum <- cumsum(df_test$x)
Output:
period x Cumsum
1 1 3 3
2 2 6 9
3 3 7 16
4 4 4 20
5 5 6 26
Using standard R:
period<-c(1,2,3,4,5)
value<-c(3,6,7,4,6)
recursive_average<-cumsum(value) / (1:length(value))
df_test<-data.frame(value, recursive_average)
df_test
value recursive_average
1 3 3.000000
2 6 4.500000
3 7 5.333333
4 4 5.000000
5 6 5.200000
If your period vector is the one you wish to use as the divisor, simply replace 1:length(value) with period.
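Dividing by period only gives the running mean when period counts observations as 1, 2, 3, and so on. A minimal base R sketch that avoids this assumption by dividing by the observation index, and then cross-checks the result against the definition (running_mean and by_definition are illustrative names):

```r
value <- c(3, 6, 7, 4, 6)

# Running mean: cumulative sum divided by the number of observations so far
running_mean <- cumsum(value) / seq_along(value)

# Cross-check against the definition: mean of the first i values
by_definition <- sapply(seq_along(value), function(i) mean(value[1:i]))

print(running_mean)
print(all.equal(running_mean, by_definition))
```

seq_along(value) is used instead of 1:length(value) because it also behaves sensibly for a zero-length vector.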
We can use cummean
library(dplyr)
df_test %>%
mutate(AvgTime=cummean(value))
-output
# period value AvgTime
#1 1 3 3.000000
#2 2 6 4.500000
#3 3 7 5.333333
#4 4 4 5.000000
#5 5 6 5.200000
data
df_test <- structure(list(period = c(1, 2, 3, 4, 5), value = c(3, 6, 7,
4, 6)), class = "data.frame", row.names = c(NA, -5L))

Combining elements of one column into two columns by group in R

I have a two-column data.frame: one column contains group labels and the other contains integer values ordered from smallest to largest. How can the data be expanded into pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all possible ordered pairs, but in all unique combinations, starting from the lowest value.
In R, the combn function gives the desired output when groups are not considered, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one apply a similar method by group?
For example, given a data.frame
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr) # provides filter()
test <- data.frame(Group = rep(c("A","B"),each=4),
                   Val = c(1,3,6,8,2,4,5,7))
j=1
for(i in unique(test$Group)){
  if(j==1){
    # first group: start the result data frame
    one <- filter(test,i == Group)
    two <- data.frame(t(combn(one$Val,2)))
    test1 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
    j=j+1
  }else{
    # later groups: append to the result
    one <- filter(test,i == Group)
    two <- data.frame(t(combn(one$Val,2)))
    test2 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
    test1 <- rbind(test1,test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
You can set simplify = F in combn() so that it returns a list of pairs, and then spread each pair into columns with unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
group_by(Group) %>%
summarise(Val = combn(Val, 2, simplify = F)) %>%
unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
library(tidyverse)
# use the question's data frame `test`; n is each group's size rather than a fixed 4
df2 <- split(test$Val, test$Group) %>%
  map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")
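For comparison, the same split, combine, and bind pattern can be sketched in base R alone. This is not one of the answers above, and the names pieces and result are illustrative. It assumes every group has at least two values, since combn() fails otherwise.

```r
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))

# For each group, build all unique pairs and label them with the group
pieces <- lapply(split(test, test$Group), function(d) {
  m <- t(combn(d$Val, 2))  # one row per pair
  data.frame(Group = d$Group[1], Val1 = m[, 1], Val2 = m[, 2])
})

# Stack the per-group results into one data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

Building all pieces first and binding once avoids the repeated rbind() inside a loop, which is what makes the loop version slow as the number of groups grows.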

R: Filtering by two columns using "is not equal" operator dplyr/subset

This question must have been answered before, but I cannot find it anywhere. I need to filter/subset a data frame using the values in two columns to decide which rows to remove. In the example, I want to drop only the rows where replicate is 1 and treatment is "a" at the same time, and keep everything else. However, both subset and filter remove every row with replicate 1 and every row with treatment "a". I could solve it by using which and then indexing, but that does not fit well with the pipe operator. Do you know why filter/subset do not drop rows only when both conditions are true?
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
#filtering data
> filter(df, replicate!=1, treatment!="a")
replicate treatment
1 2 b
2 3 b
3 2 b
4 3 b
> subset(df, (replicate!=1 & treatment!="a"))
replicate treatment
8 2 b
9 3 b
11 2 b
12 3 b
#solution by which - indexing
index = which(df$replicate==1 & df$treatment=="a")
> df[-index,]
replicate treatment
2 2 a
3 3 a
5 2 a
6 3 a
7 1 b
8 2 b
9 3 b
10 1 b
11 2 b
12 3 b
I think you're looking to use an "or" condition here. How does this look:
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
df %>%
filter(replicate != 1 | treatment != "a")
replicate treatment
1 2 a
2 3 a
3 2 a
4 3 a
5 1 b
6 2 b
7 3 b
8 1 b
9 2 b
10 3 b
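The reason "or" is the right operator is De Morgan's law: negating "replicate is 1 and treatment is a" gives "replicate is not 1 or treatment is not a". A minimal base R sketch that checks the two forms select the same rows (is_dropped, keep_not, and keep_or are illustrative names):

```r
replicate <- rep(1:3, times = 4)
treatment <- rep(c("a", "b"), each = 6)
df <- data.frame(replicate, treatment)

# Rows to remove: both conditions hold at once
is_dropped <- df$replicate == 1 & df$treatment == "a"

# Form 1: negate the combined condition
keep_not <- df[!is_dropped, ]

# Form 2: De Morgan's equivalent, "not 1" OR "not a"
keep_or <- df[df$replicate != 1 | df$treatment != "a", ]

print(identical(keep_not, keep_or))
print(nrow(keep_or))
```

By contrast, replicate != 1 & treatment != "a" keeps only rows that pass both tests, which is why the original calls returned just four rows.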

R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?
That is the documented behaviour: because range returns two values, aggregate stores each result as a two-column matrix inside the data frame. You can undo the effect with:
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
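A short sketch, using the question's example data, that makes the matrix columns visible before and after flattening (flat is an illustrative name):

```r
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1 = 1:6, var2 = 7:12)
df_range <- aggregate(df[, 2:3], list(df$group), range)

# Each summary column is a single two-column matrix (min, max)
print(is.matrix(df_range$var1))
print(ncol(df_range))

# do.call(data.frame, ...) expands every matrix column into separate columns
flat <- do.call(data.frame, df_range)
print(names(flat))
print(dim(flat))
```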
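Here's an approach using dplyr. Note that it returns the width of each range (max minus min) per group rather than the two endpoints:
library(dplyr)
df %>%
  group_by(group) %>%
  summarise(across(c(var1, var2), ~ max(.x) - min(.x)))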

length of column, split by group

Hi, I am using R to count the number of data points in a column, split by group, like so:
Type Value
---- -----
A 1
A 6
A 4
A 6
B 8
B 10
B 3
B 8
C 7
C 4
I want to plot 3 bars showing how many A's, how many B's, and how many C's there are. The values in the Value column are not important.
How do I do this?
If my data were in different columns I could obviously use
sapply(list(col1,col2,col3),length)
but I don't want to transform my data.
Thanks
If the Value column doesn't matter, then ggplot2 can help you in that regard
library(ggplot2)
set.seed(9001)
df <- data.frame(Type = c(rep("A", 4), rep("B", 4), rep("C", 2)), Value = sample(1:20, 10))
df
## Type Value
## 1 A 5
## 2 A 19
## 3 A 4
## 4 A 12
## 5 B 1
## 6 B 18
## 7 B 13
## 8 B 17
## 9 C 8
## 10 C 14
ggplot(df) + geom_bar(aes(x = Type))
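If only the counts are needed, or ggplot2 is not installed, a base R sketch with table() gives the number of rows per Type without reshaping the data. This is an alternative to the answer above, not part of it, and it uses the values from the question's table.

```r
df <- data.frame(Type = c(rep("A", 4), rep("B", 4), rep("C", 2)),
                 Value = c(1, 6, 4, 6, 8, 10, 3, 8, 7, 4))

# Count rows per Type; the Value column is ignored
counts <- table(df$Type)
print(counts)

# barplot(counts) would then draw one bar per Type
```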
