How do I work with continuous variables in frequency tables? [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 6 years ago.
I have a thousand-row table like the one below and need to calculate the sum and mean of a continuous "count" variable for every level of the categorical "df" variable.
I have attempted to solve this with the table() function, but since I am using a continuous variable, I can't work my way towards a solution.
df count
1 a 5
2 f 3
3 g 8
4 l 2
5 a 10
6 s 4
7 l 6
8 s 8
9 a 2
10 g 1

If I am not mistaken, you are looking for the following code:
library(dplyr)
daf %>%
  group_by(df) %>%
  summarise(Sum = sum(count), Count = n()) %>%
  ungroup() %>%
  arrange(df)
"daf" is the data set that I am working on.
Enjoy R programming!!!

Maybe this will help you out:
> df3 <- aggregate(count ~ df , df, mean)
> df3
df count
1 a 5.666667
2 f 3.000000
3 g 4.500000
4 l 4.000000
5 s 6.000000
> df2 <- aggregate(count ~ df , df, sum)
> df2
df count
1 a 17
2 f 3
3 g 9
4 l 8
5 s 12
Simple aggregate() calls can do it. count in df3 is the mean and count in df2 is the sum.
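If you prefer a single call, both statistics can also come from one aggregate() (a sketch; when FUN returns a vector, the count column becomes a two-column matrix):
df4 <- aggregate(count ~ df, df, function(x) c(sum = sum(x), mean = mean(x)))
df4  # prints count.sum and count.mean for each level of df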

This isn't an especially unique question, but the suggested duplicate questions only ask for a single summary statistic. As this is a simple problem to solve in dplyr, I thought I'd throw this in.
dframe <- data.frame(df = c("a", "f", "g", "l", "a", "s", "l", "s", "a", "g"),
                     count = c(5, 3, 8, 2, 10, 4, 6, 8, 2, 1))
dframe
df count
1 a 5
2 f 3
3 g 8
4 l 2
5 a 10
6 s 4
7 l 6
8 s 8
9 a 2
10 g 1
library(dplyr)
dframe %>% group_by(df) %>% summarise(sum = sum(count), mean = mean(count))
Source: local data frame [5 x 3]
df sum mean
(fctr) (dbl) (dbl)
1 a 17 5.666667
2 f 3 3.000000
3 g 9 4.500000
4 l 8 4.000000
5 s 12 6.000000
You can see that summarise() lets you calculate whatever summary statistics, and however many of them, you like for each group.
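For instance, a sketch extending the same chain with a few more statistics:
dframe %>%
  group_by(df) %>%
  summarise(sum = sum(count),
            mean = mean(count),
            n = n(),                 # group size
            median = median(count))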

Related

Filter groups based on the difference between the two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be smaller than or equal to 2 (<= 2); this means that group B should be removed, because its highest value is 8 and its second-highest value is 3, which is a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
  group_by(group) %>%
  filter(abs(diff(sort(value, decreasing=T)[1:2])) <= 2) %>%
  ungroup()
# A tibble: 8 × 2
group value
<chr> <dbl>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
  abs(diff(sort(x, decreasing=T)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
  list(df[df$group == g & v,]), grp$group, grp$value), make.row.names=F))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
A possibility would be to first create a vector with the groups that meet your condition and then filter the original data.frame. Here is how I thought of it:
library(dplyr)
group_to_keep <-
  df %>%
  group_by(group) %>%
  slice_max(value, n = 2) %>%
  filter(abs(diff(value)) <= 2) %>%
  pull(group) %>%
  unique()

df %>%
  filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
In case you can be sure that you always have at least two values per group, you can use:
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
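If single-value groups can occur, a hedged variant of the dplyr approach above (my sketch, not from the original answers) drops them explicitly; such groups fail the n() >= 2 condition and are removed rather than producing NA comparisons:
df %>%
  group_by(group) %>%
  filter(n() >= 2,
         abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
  ungroup()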

How to rename values by frequency in R

I am making several graphs based on the clustering data from DAPC. I need the colors to be the same across all the graphs, and I'd like to use specific colors for the largest groups. The important thing for this question is that I get a data set from DAPC like this:
my_df <- data.frame(
  ID = c(1:10),
  Group = c("a", "b", "b", "c", "a", "b", "a", "b", "b", "c")
)
> my_df
ID Group
1 a
2 b
3 b
4 c
5 a
6 b
7 a
8 b
9 b
10 c
I know how to find the group with the most members like this:
freqs <- table(my_df$Group)
freqs <- freqs[order(freqs, decreasing = TRUE)]
> freqs
b a c
5 3 2
Is there a way to change the values based on their frequency? Each time I rerun DAPC, it changes the groups, so I'd like to write code that does this automatically instead of having to redo it manually. Here's how I'd like the dataframe to be changed:
> my_df > my_new_df
ID Group ID Group
1 a 1 '2nd'
2 b 2 '1st'
3 b 3 '1st'
4 c 4 '3rd'
5 a 5 '2nd'
6 b 6 '1st'
7 a 7 '2nd'
8 b 8 '1st'
9 b 9 '1st'
10 c 10 '3rd'
You may use ave and create a factor out of it with the corresponding labels=. To avoid hard-coding, define the labels in a vector lb beforehand.
lb <- c("1st", "2nd", "3rd", paste0(4:10, "th"))
with(my_df, factor(as.numeric(ave(as.character(Group), as.character(Group), FUN=table)),
                   labels=rev(lb[1:length(unique(table(Group)))])))
# [1] 2nd 1st 1st 3rd 2nd 1st 2nd 1st 1st 3rd
# Levels: 3rd 2nd 1st
To convert more columns like this, use sapply.
sapply(my_df[selected.columns], function(x) {
  factor(as.numeric(ave(as.character(x), as.character(x), FUN=table)),
         labels=rev(lb[1:length(unique(table(x)))]))
})
Do you mean something like this:
my_df %>%
  left_join(my_df %>% group_by(Group) %>% summarise(N=n())) %>%
  arrange(desc(N)) %>%
  select(-N)
ID Group
1 2 b
2 3 b
3 6 b
4 8 b
5 9 b
6 1 a
7 5 a
8 7 a
9 4 c
10 10 c
Update
This can be useful. I hope this helps.
my_df %>%
  left_join(my_df %>% group_by(Group) %>% summarise(N=n()) %>% arrange(desc(N)) %>%
              bind_cols(my_df %>% select(Group) %>% distinct() %>% rename(key=Group)) %>%
              rename(NewGroup=Group, Group=key)) %>%
  select(-c(Group, N)) %>%
  rename(Group=NewGroup)
ID Group
1 1 b
2 2 a
3 3 a
4 4 c
5 5 b
6 6 a
7 7 b
8 8 a
9 9 a
10 10 c
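For the literal '1st'/'2nd'/'3rd' labels in the desired output, here is a compact base R sketch of my own (assuming no two groups tie in frequency), matching each value against the frequency-sorted group names:
lb <- c("1st", "2nd", "3rd", paste0(4:10, "th"))
ranked <- names(sort(table(my_df$Group), decreasing = TRUE))  # "b" "a" "c"
my_new_df <- transform(my_df, Group = lb[match(Group, ranked)])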

Reshaping R dataframe (compute average of a column based on multiple 'level' columns)

I have an R dataframe with this type of structure (dummy example):
df
A B C D
1 a 3 5
1 a 5 3
1 b 2 8
2 a 4 7
2 a 6 5
2 b 4 3
...
"A", "B", "C", and "D" are column headers.
I want to reshape this dataframe so that I get the average (mean) of "C" and "D" for each combination of the levels of "A" and "B".
So the final product I want would be:
new_df
A BaC BbC BaD BbD
1 4 2 4 8
2 5 4 6 3
I managed to do it in a very crude way:
spread_df_C <- spread(df, B, C)
aggregated_df_C <- aggregate(spread_df_C$a, list(spread_df_C$A), mean)
spread_df_D <- spread(df, B, D)
aggregated_df_D <- aggregate(spread_df_D$a, list(spread_df_D$A), mean)
new_df <- merge(aggregated_df_C, aggregated_df_D, by = "A")
This would get me to the final product eventually, but I am laboriously computing the mean for each of the levels. I need to do this for several levels, and there has to be a more elegant way of executing it.
Experts, please help
An option using the reshape2 package.
library(reshape2)
dcast(melt(df, measure.vars = c("C", "D")), A ~ B + variable, fun.aggregate = mean)
# A a_C a_D b_C b_D
#1 1 4 4 2 8
#2 2 5 6 4 3
The first step is to melt columns C and D and then cast the resulting dataframe back to wide format.
Consider base R's reshape() after aggregation, switching the column-name parts before/after the period:
agg <- aggregate(. ~ A + B, df, mean)
rdf <- reshape(agg, idvar = "A", timevar = "B", direction = "wide")
names(rdf)[-1] <- paste0("B", substr(names(rdf)[-1], 3, 3), substr(names(rdf)[-1], 1, 1))
rdf
# A BaC BaD BbC BbD
# 1 1 4 4 2 8
# 2 2 5 6 4 3
With tidyverse, you can do:
df %>%
  gather(var, val, -c(1:2)) %>%
  group_by_at(1:3) %>%
  summarise(val = mean(val)) %>%
  ungroup() %>%
  mutate(var = paste(var, B, sep = "_")) %>%
  select(-2) %>%
  spread(var, val)
A C_a C_b D_a D_b
<int> <dbl> <dbl> <dbl> <dbl>
1 1 4 2 4 8
2 2 5 4 6 3
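gather() and spread() are superseded in current tidyr; assuming tidyr >= 1.0 (and that df holds the example data), a pivot_wider() sketch can also produce the exact BaC-style names via names_glue:
library(dplyr)
library(tidyr)
df %>%
  group_by(A, B) %>%
  summarise(across(c(C, D), mean), .groups = "drop") %>%
  pivot_wider(names_from = B, values_from = c(C, D),
              names_glue = "B{B}{.value}")
# A BaC BbC BaD BbD (column order may differ)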

Divide (and name) one group of columns by another group in dplyr

After a (very scary) dplyr pipeline I've ended up with a dataset like this:
year A B C [....] Z count.A count.B count.C [....] count.Z
1999 10 20 10 ... 6 3 5 67 ... 6
2000 3 5 5 ... 7 5 2 5 ... 5
Some example data to reproduce:
df <- data.frame(year = c(1999, 2000),
                 A = c(10, 20),
                 B = c(3, 6),
                 C = c(1, 2),
                 count.A = c(1, 2),
                 count.B = c(8, 9),
                 count.C = c(5, 7))
What I really need is to combine each column with its "count" counterpart, i.e.
weight.A = A / count.A,
weight.B = B / count.B
I have to do that programmatically as I have hundreds of columns. Is there a way to do that in a dplyr pipeline?
Don't store variables in column names. If you reshape your data to make it tidy, the calculation is really simple:
library(tidyverse)
df %>% gather(var, val, -year) %>% # reshape to long
separate(var, c('var', 'letter'), fill = 'left') %>% # extract var from former col names
mutate(var = coalesce(var, 'value')) %>% # add name for unnamed var
spread(var, val) %>% # reshape back to wide
mutate(weight = value / count) # now this is very simple
#> year letter count value weight
#> 1 1999 A 1 10 10.0000000
#> 2 1999 B 8 3 0.3750000
#> 3 1999 C 5 1 0.2000000
#> 4 2000 A 2 20 10.0000000
#> 5 2000 B 9 6 0.6666667
#> 6 2000 C 7 2 0.2857143
If your columns are consistently named (and easy enough to retrieve), you could easily do this using lapply:
cols <- c("A","B","C")
df[,paste0("weighted.",cols)] <- lapply(cols, function(x) df[,x] / df[, paste0("count.",x)])
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2 2000 20 6 2 2 9 7 10 0.6666667 0.2857143
Assuming that the columns are in order, we can use data.table. Specify the columns of interest in .SDcols, divide the first half of the .SD columns by the second half, and assign (:=) the result to new columns.
library(data.table)
setDT(df)[, paste0("weighted.", names(df)[2:4]) := .SD[, 1:3]/.SD[, 4:6], .SDcols = A:count.C]
df
# year A B C count.A count.B count.C weighted.A weighted.B weighted.C
#1: 1999 10 3 1 1 8 5 10 0.3750000 0.2000000
#2: 2000 20 6 2 2 9 7 10 0.6666667 0.2857143
Assuming you can programmatically create a vector of all column names, here is how I'd do it for your example above:
for (c.name in c("A", "B", "C")) {
  c.weight <- sprintf("weight.%s", c.name)
  c.count <- sprintf("count.%s", c.name)
  df[, c.weight] <- df[, c.name] / df[, c.count]
}
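For completeness, the same idea as a dplyr sketch of my own (assuming dplyr >= 1.0 for across() and cur_column(), and that every column X has a matching count.X):
library(dplyr)
cols <- sub("^count\\.", "", grep("^count\\.", names(df), value = TRUE))  # "A" "B" "C"
df %>%
  mutate(across(all_of(cols),
                ~ .x / get(paste0("count.", cur_column())),
                .names = "weight.{.col}"))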

How to summarize values not matching the group using dplyr

I want to sum the values of rows which belong to groups other than the row's own group. For example, using this sample data:
> df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
> df
id group val
1 1 A 9
2 2 A 7
3 3 B 5
4 4 B 3
5 5 A 1
Summarizing with dplyr by group
> df %>% group_by(group) %>% summarize(sumval = sum(val))
Source: local data frame [2 x 2]
group sumval
(fctr) (dbl)
1 A 17
2 B 8
What I want is for rows belonging to group A to get the sumval of everything not in group A, i.e. the final result is:
id group val notval
1 1 A 9 8
2 2 A 7 8
3 3 B 5 17
4 4 B 3 17
5 5 A 1 8
Is there a way to do this in dplyr? Preferably in a single chain?
We can do this with base R
s1 <- sapply(unique(df$group), function(x) sum(df$val[df$group !=x]))
s1[with(df, match(group, unique(group)))]
#[1] 8 8 17 17 8
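The same idea also fits in a base R one-liner, a sketch using ave() for the per-group sums:
df$notval <- sum(df$val) - ave(df$val, df$group, FUN = sum)
df$notval
#[1] 8 8 17 17 8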
Or using data.table
library(data.table)
setDT(df)[, notval := sum(df$val[df$group != group]), group]
@akrun's answers are best. But if you want to do it in dplyr, this is a roundabout way.
df <- data.frame(id=1:5, group=c("A", "A", "B", "B", "A"), val=seq(9, 1, -2))
df %>%
  mutate(TotalSum = sum(val)) %>%
  group_by(group) %>%
  mutate(valsumval = TotalSum - sum(val))
Source: local data frame [5 x 5]
Groups: group [2]
id group val TotalSum valsumval
(int) (fctr) (dbl) (dbl) (dbl)
1 1 A 9 25 8
2 2 A 7 25 8
3 3 B 5 25 17
4 4 B 3 25 17
5 5 A 1 25 8
This also works even if there are more than two groups.
Also, just this works:
df %>% group_by(group) %>% mutate(notval = sum(df$val) - sum(val))
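This works because df$val bypasses the grouping: sum(df$val) is the grand total while sum(val) is the group's own sum. If you would rather not hard-code the data frame's name, a sketch that computes the total before grouping does the same:
df %>%
  mutate(total = sum(val)) %>%        # grand total, taken before grouping
  group_by(group) %>%
  mutate(notval = total - sum(val)) %>%
  ungroup() %>%
  select(-total)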
