This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 1 year ago.
I have a dataframe like the following in R:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5),ncol=2)
For each row of this dataframe, I want to include the total value for each class (A, B, C) so that the dataframe will look like this:
df <- matrix(c('A','A','A','A','B','B','B','B','C','C','C','C',4,6,8,2,2,7,2,8,9,1,2,5,20,20,20,20,19,19,19,19,17,17,17,17),ncol=3)
What's a way I could accomplish this?
Thanks in advance for your help.
Using base R:
df <- data.frame(df)
df$Total <- ave(as.numeric(df$X2), df$X1, FUN = sum)
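Since df was built from a character matrix, both columns are character; converting X2 to numeric once up front (an optional variation on the call above) keeps later steps simpler:
df <- data.frame(df)            # columns come in as X1 and X2, both character
df$X2 <- as.numeric(df$X2)      # convert the value column once
df$Total <- ave(df$X2, df$X1, FUN = sum)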
A dplyr solution would be this:
library(dplyr)
data.frame(df) %>%
  group_by(X1) %>%
  mutate(Sum = sum(as.numeric(X2)))
# A tibble: 12 × 3
# Groups: X1 [3]
X1 X2 Sum
<chr> <chr> <dbl>
1 A 4 20
2 A 6 20
3 A 8 20
4 A 2 20
5 B 2 19
6 B 7 19
7 B 2 19
8 B 8 19
9 C 9 17
10 C 1 17
11 C 2 17
12 C 5 17
This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed last year.
I have the following data frame:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6))
and I want to record the number of occurrences of each category so the output would be:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
                 category_count=c(2,4,4,4,4,2,1))
Is there a simple way to do this?
# load package
library(data.table)
# set as data.table
setDT(df)
# count by category
df[, category_count := .N, category]
With dplyr:
library(dplyr)
df %>%
  group_by(category) %>%
  mutate(category_count = n()) %>%
  ungroup()
# A tibble: 7 × 3
category value category_count
<chr> <dbl> <int>
1 a 1 2
2 b 5 4
3 b 3 4
4 b 6 4
5 b 7 4
6 a 4 2
7 c 6 1
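A shorter dplyr variant (not shown in the original answer) is add_count(), which folds the group_by/mutate/ungroup steps into one call:
df %>% add_count(category, name = "category_count")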
With base R:
df <- data.frame(category=c("a","b","b","b","b","a","c"), value=c(1,5,3,6,7,4,6),
                 category_count=c(2,4,4,4,4,2,1))
df$res <- with(df, ave(x = seq(nrow(df)), list(category), FUN = length))
df
#> category value category_count res
#> 1 a 1 2 2
#> 2 b 5 4 4
#> 3 b 3 4 4
#> 4 b 6 4 4
#> 5 b 7 4 4
#> 6 a 4 2 2
#> 7 c 6 1 1
Created on 2022-02-08 by the reprex package (v2.0.1)
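Another base R idiom (just a sketch, equivalent to the ave() call above) uses table() as a lookup:
df$res2 <- as.vector(table(df$category)[df$category])  # per-row group sizes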
This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
Hi, suppose I have a table with many columns (in the thousands) and some rows that are duplicates. What I'd like to do is sum the duplicated rows, for every column. I'm stuck because I don't want to hard-code or loop through each column and re-merge. Is there a better way to do this? Here is an example with only 3 columns for simplicity.
dat <- read.table(text='name etc4 etc1 etc2
A 9 0 3
A 10 10 4
A 11 9 4
B 2 7 5
C 40 6 0
C 50 6 1',header=TRUE)
# I could aggregate one column at a time
# but is there a way to do this for every column without hard coding?
aggregate( etc4 ~ name, data = dat, sum)
We can use . in the aggregate formula, which stands for all the columns other than 'name':
aggregate(. ~ name, data = dat, sum)
name etc4 etc1 etc2
1 A 30 19 11
2 B 2 7 5
3 C 90 12 1
Or, if we need finer control, i.e. there are other columns we want to leave out, either subset the data with select or use cbind:
aggregate(cbind(etc1, etc2, etc4) ~ name, data = dat, sum)
name etc1 etc2 etc4
1 A 19 11 30
2 B 7 5 2
3 C 12 1 90
If we need to store the column names and reuse them, subset the data and then convert to a matrix:
cname <- c("etc4", "etc1" )
aggregate(as.matrix(dat[cname]) ~ name, data = dat, sum)
name etc4 etc1
1 A 30 19
2 B 2 7
3 C 90 12
Or this may also be done more quickly with fsum:
library(collapse)
fsum(get_vars(dat, is.numeric), g = dat$name)
etc4 etc1 etc2
A 30 19 11
B 2 7 5
C 90 12 1
A tidyverse approach:
library(dplyr)
dat %>%
  group_by(name) %>%
  summarise(across(.cols = starts_with("etc"), .fns = sum))
# A tibble: 3 x 4
name etc4 etc1 etc2
<chr> <int> <int> <int>
1 A 30 19 11
2 B 2 7 5
3 C 90 12 1
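If the numeric columns don't share a common prefix (the question mentions thousands of columns), selecting by type is a small variation on the same idea:
dat %>%
  group_by(name) %>%
  summarise(across(.cols = where(is.numeric), .fns = sum))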
This question already has answers here:
How I can select rows from a dataframe that do not match?
(3 answers)
Subsetting a data frame to the rows not appearing in another data frame
(5 answers)
Closed 2 years ago.
I would like to get the ids that are not sampled
id <- rep(1:10,each=2)
trt <- rep(c("A","B"),2)
score <- rnorm(20,0,1)
df <- data.frame(id,trt,score)
df$id <- as.factor(df$id)
df
id trt score
1 1 A 0.4920104
2 1 B 0.5030771
3 2 A 1.4030437
4 2 B 0.4132130
5 3 A -2.4449382
6 3 B -1.0981531
7 4 A -0.6013329
8 4 B -0.8411616
9 5 A -0.2696329
10 5 B -0.9869931
11 6 A 1.0681588
12 6 B 1.7500570
13 7 A 0.6008876
14 7 B -0.2181209
15 8 A -1.2943954
16 8 B -2.4495156
17 9 A 0.7680115
18 9 B 0.5497457
19 10 A -1.9713569
20 10 B -0.7696987
df <- df %>% filter(id %in% sample(levels(id),5))
df
id trt score
1 3 A 1.8816245
2 3 B 0.8614810
3 5 A 0.5508704
4 5 B -1.4144959
5 7 A 0.5174229
6 7 B 0.5244466
7 9 A 0.4318934
8 9 B -1.6376436
9 10 A 0.1746228
10 10 B 1.6319294
Here I would like to get the other ids. How can I code this? Suppose there are many ids, so it is not possible to select them manually:
id trt score
1 1 A 0.07040075
2 1 B -0.70388700
3 2 A 0.78421333
4 2 B -0.90052385
7 4 A -0.48052247
8 4 B -0.66198818
11 6 A 1.12168455
12 6 B 0.90454813
15 8 A 1.54550328
16 8 B 0.64822307
If we assign the filtered object to a new name ('df1') instead of overwriting the original object, one option is anti_join:
library(dplyr)
anti_join(df, df1, by = 'id')
Or another option is filter
df %>%
  filter(!id %in% df1$id)
data
df1 <- df %>%
  filter(id %in% sample(levels(id), 5))
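For completeness, a base R equivalent of the filter() step (a sketch, not part of the original answer):
df[!df$id %in% df1$id, ]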
This question already has answers here:
Calculate row means on subset of columns
(7 answers)
Closed 3 years ago.
I have the following data frame
ID <- c(1,1,2,3,4,5,6)
Value1 <- c(20,50,30,10,15,10,NA)
Value2 <- c(40,33,84,NA,20,1,NA)
Value3 <- c(60,40,60,10,25,NA,NA)
Grade1 <- c(20,50,30,10,15,10,NA)
Grade2 <- c(40,33,84,NA,20,1,NA)
DF <- data.frame(ID,Value1,Value2,Value3,Grade1,Grade2)
ID Value1 Value2 Value3 Grade1 Grade2
1 1 20 40 60 20 40
2 1 50 33 40 50 33
3 2 30 84 60 30 84
4 3 10 NA 10 10 NA
5 4 15 20 25 15 20
6 5 10 1 NA 10 1
7 6 NA NA NA NA NA
I would like to group by ID, select the columns whose names contain the string "Value", and get the mean of these columns with NAs excluded.
Here is an example of the desired output
ID mean(Value)
1 41
2 58
3 10
....
In my attempt to solve this challenge, I wrote the following code
library(tidyverse)
DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.,na.rm = TRUE))
The code groups the data by ID, selects the columns whose names contain "Value", and attempts to summarise the selected columns with the mean function. When I run my code, I get the following output
> DF %>% group_by (ID) %>% select(contains("Value")) %>% summarise(mean(.))
Adding missing grouping variables: `ID`
# A tibble: 6 x 2
ID `mean(.)`
<dbl> <dbl>
1 1 NA
2 2 NA
3 3 NA
4 4 NA
5 5 NA
6 6 NA
I would appreciate your help with this.
You should try using pivot_longer to get your data from wide to long form. See the latest tidyr update on pivot_longer & pivot_wider (https://tidyr.tidyverse.org/articles/pivot.html).
library(tidyverse)
ID <- c(1,2,3,4,5,6)
Value1 <- c(50,30,10,15,10,NA)
Value2 <- c(33,84,NA,20,1,NA)
Value3 <- c(40,60,10,25,NA,NA)
DF <- data.frame(ID,Value1,Value2,Value3)
DF %>%
  pivot_longer(-ID) %>%
  group_by(ID) %>%
  summarise(mean = mean(value, na.rm = TRUE))
Output:
ID mean
<dbl> <dbl>
1 1 41
2 2 58
3 3 10
4 4 20
5 5 5.5
6 6 NaN
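If you keep the original DF with the Grade columns, restricting the pivot to the Value columns works the same way (a sketch; note that ID 1 then averages over both of its rows):
DF %>%
  pivot_longer(cols = starts_with("Value")) %>%
  group_by(ID) %>%
  summarise(mean = mean(value, na.rm = TRUE))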
Without using dplyr or any other package, this would help:
DF$mean <- rowMeans(DF[, 2:4], na.rm = TRUE)
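Selecting the Value columns by name instead of by position makes the same call more robust if the column order changes (a minor variation):
DF$mean <- rowMeans(DF[grep("^Value", names(DF))], na.rm = TRUE)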
I get a freq table, but can I save this table in a CSV file or, better, sort it or extract the largest values?
library(plyr)
count(birthdaysExample, 'month')
I'm guessing at what the relevant part of your data looks like, but in any case this should get you a frequency table sorted by values:
library(plyr)
birthdaysExample <- data.frame(month = round(runif(200, 1, 12)))
freq_df <- count(birthdaysExample, 'month')
freq_df[order(freq_df$freq, decreasing = TRUE), ]
This gives you:
month freq
5 5 29
9 9 24
3 3 22
4 4 18
6 6 17
7 7 15
2 2 14
10 10 14
11 11 14
8 8 13
1 1 10
12 12 10
To get the highest 3 values:
library(magrittr)
freq_df[order(freq_df$freq, decreasing = TRUE), ] %>% head(., 3)
month freq
5 5 29
9 9 24
3 3 22
Or, with just base R:
head(freq_df[order(freq_df$freq, decreasing = TRUE), ], 3)
With dplyr
dplyr is a newer approach to many routine data manipulations in R (one of many tutorials) that is a bit more intuitive:
library(dplyr)
library(magrittr)
freq_df2 <- birthdaysExample %>%
  group_by(month) %>%
  summarize(freq = n()) %>%
  arrange(desc(freq))
freq_df2
This returns:
Source: local data frame [12 x 2]
month freq
1 5 29
2 9 24
3 3 22
4 4 18
5 6 17
6 7 15
7 2 14
8 10 14
9 11 14
10 8 13
11 1 10
12 12 10
The object it returns is not a data frame anymore, so if you want to use base R functions with it, it might be easier to convert it back, with something like:
my_df <- as.data.frame(freq_df2)
And if you really want, you can write this to a CSV file with:
write.csv(my_df, file="foo.csv")
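If you don't want the automatic row-names column in the file, row.names = FALSE is a common addition:
write.csv(my_df, file = "foo.csv", row.names = FALSE)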