Aggregate by one variable but adding other variables [duplicate] - r

This question already has an answer here:
How to GROUP and choose lowest value in R [duplicate]
(1 answer)
Closed 6 years ago.
I have a data.frame with this structure:
id time var1 var2 var3
1 2 4 5 6
1 4 8 51 7
1 1 9 17 38
2 12 8 9 21
2 15 25 6 23
For each id, I want the row that contains the minimum time. In the example it would be this:
id time var1 var2 var3
1 1 9 17 38
2 12 8 9 21
I think that the aggregate function would be useful, but I'm not sure how to use it.

Your title may be misleading, since you really just want to keep the row with the minimum time for every id. Try this:
library(dplyr)
df %>%
group_by(id) %>%
arrange(id, time) %>%
filter(row_number() == 1)
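For readers without dplyr, the same "sort, then keep the first row per id" idea can be sketched in base R. This is only an illustration; the names `sorted` and `first_rows` are made up here, and the data is retyped from the question.

```r
# Example data, retyped from the question
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L),
  time = c(2L, 4L, 1L, 12L, 15L),
  var1 = c(4L, 8L, 9L, 8L, 25L),
  var2 = c(5L, 51L, 17L, 9L, 6L),
  var3 = c(6L, 7L, 38L, 21L, 23L)
)

# Sort by id, then by time, so the earliest row of each id comes first
sorted <- df[order(df$id, df$time), ]

# duplicated() is FALSE for the first occurrence of each id and TRUE afterwards,
# so negating it keeps exactly one row per id
first_rows <- sorted[!duplicated(sorted$id), ]
first_rows
```

Like the dplyr version, this keeps a single row per id even when several rows share the minimum time.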

We can use by, do.call, and the ever useful which.min function to get what we need:
do.call('rbind', by(df, df$id, function(x) x[which.min(x$time), ]))
# id time var1 var2 var3
# 1 1 1 9 17 38
# 2 2 12 8 9 21
And if you suspect there may be more than one minimum value per id, you can eschew the which.min function and use which(x$time == min(x$time)):
do.call('rbind', by(df, df$id, function(x) x[which(x$time == min(x$time)), ]))
# id time var1 var2 var3
# 1 1 1 9 17 38
# 2 2 12 8 9 21
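The difference between the two variants only shows up when the minimum is tied, which the example data does not contain. A small sketch with made-up data (the data frame `ties` is purely hypothetical) may make it visible:

```r
# Hypothetical data: id 1 has two rows that share the minimum time of 1
ties <- data.frame(
  id   = c(1L, 1L, 1L, 2L),
  time = c(3L, 1L, 1L, 7L),
  var1 = c(10L, 20L, 30L, 40L)
)

# which.min() returns only the first position of the minimum
first_only <- do.call(rbind, by(ties, ties$id, function(x) x[which.min(x$time), ]))

# which(x == min(x)) returns every position that equals the minimum
all_ties <- do.call(rbind, by(ties, ties$id, function(x) x[which(x$time == min(x$time)), ]))

nrow(first_only)  # 2: one row per id
nrow(all_ties)    # 3: both tied rows of id 1, plus the single row of id 2
```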
Data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L),
time = c(2L, 4L, 1L, 12L, 15L),
var1 = c(4L, 8L, 9L, 8L, 25L),
var2 = c(5L, 51L, 17L, 9L, 6L),
var3 = c(6L, 7L, 38L, 21L, 23L)),
.Names = c("id", "time", "var1", "var2", "var3"),
class = "data.frame", row.names = c(NA, -5L))
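Since the question asks about aggregate specifically, here is one way it could be used. This is a sketch rather than the only approach: aggregate on its own returns just the grouping column and the minimum, so the result is merged back onto the original data to recover the other columns. The names `min_time` and `res` are made up for the illustration.

```r
# Example data, retyped from the question
df <- data.frame(
  id   = c(1L, 1L, 1L, 2L, 2L),
  time = c(2L, 4L, 1L, 12L, 15L),
  var1 = c(4L, 8L, 9L, 8L, 25L),
  var2 = c(5L, 51L, 17L, 9L, 6L),
  var3 = c(6L, 7L, 38L, 21L, 23L)
)

# Step 1: minimum time per id (a data frame with columns id and time)
min_time <- aggregate(time ~ id, data = df, FUN = min)

# Step 2: join back on both keys to pick up var1, var2 and var3
res <- merge(min_time, df, by = c("id", "time"))
res
```

Note that if an id has several rows sharing the minimum time, the merge returns all of them.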

dplyr using the function slice
library(dplyr)
df %>%
group_by(id) %>%
slice(which.min(time))
Output:
Source: local data frame [2 x 5]
Groups: id [2]
id time var1 var2 var3
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 9 17 38
2 2 12 8 9 21
sqldf
library(sqldf)
sqldf('SELECT id, MIN(time) time, var1, var2, var3
FROM df
GROUP BY id')
Output:
id time var1 var2 var3
1 1 1 9 17 38
2 2 12 8 9 21

Related

Creating loop to count the number of unique values in column based on values in another column

So, for example, I have the following dataframe, data:
col1 col2
1 5
1 5
1 3
2 10
2 11
3 11
Now, I want to make a new column, col3, which gives me the number of unique values in col2 for every grouping in col1.
So far, I have the following code:
length(unique(data$col2[data$col1 == 1]))
Which would here return the number 2.
However, I'm having a hard time making a loop that goes through all the values in col1 to create the new column, col3.
We can use n_distinct after grouping
library(dplyr)
data <- data %>%
group_by(col1) %>%
mutate(col3 = n_distinct(col2)) %>%
ungroup
-output
data
# A tibble: 6 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 2
2 1 5 2
3 1 3 2
4 2 10 2
5 2 11 2
6 3 11 1
Or with data.table
library(data.table)
setDT(data)[, col3 := uniqueN(col2), col1]
data
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
You want the counts for every row, so using a for loop you would do
data$col3 <- NA_real_
for (i in seq_len(nrow(data))) {
data$col3[i] <- length(unique(data$col2[data$col1 == data$col1[i]]))
}
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
However, a row-by-row for loop like this repeats the same computation for every row of a group and is often slow in R. In this case we can use the grouping function ave, which comes with base R.
data <- transform(data, col3=ave(col2, col1, FUN=\(x) length(unique(x))))
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
Data:
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
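If a loop-like structure is still preferred, one possible middle ground is to compute the count once per group with tapply and then look the result up for each row. This is a sketch under that assumption; the name `counts` is made up here.

```r
# Example data, retyped from the question
data <- data.frame(
  col1 = c(1L, 1L, 1L, 2L, 2L, 3L),
  col2 = c(5L, 5L, 3L, 10L, 11L, 11L)
)

# One count per distinct col1 value; the result is named by those values
counts <- tapply(data$col2, data$col1, function(x) length(unique(x)))

# Look up each row's group by name; as.integer() drops the names again
data$col3 <- as.integer(counts[as.character(data$col1)])
data
```

The unique values are counted once per group here rather than once per row, which matters mostly when groups are large.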

Aggregating columns based on columns name in R

I have this dataframe in R
Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6
I want to aggregate it so that all the Pro columns and all the Anti columns are summed by party,
for example
Party ProSum AntiSum
R. 234. 245
D. 234. 245
How would I do that in R?
You can use:
library(tidyverse)
df %>%
pivot_longer(-Party,
names_to = c(".value", NA),
names_pattern = "([a-zA-Z]*)([0-9]*)") %>%
group_by(Party) %>%
summarise(across(where(is.numeric), \(x) sum(x, na.rm = TRUE)))
# A tibble: 2 x 3
Party Pro Anti
<chr> <int> <int>
1 D 50 34
2 R 5 78
I would suggest a tidyverse approach: reshape the data and then compute the sum of the values:
library(tidyverse)
#Data
df <- structure(list(Party = c("R", "R", "D", "D"), Pro2005 = c(1L,
1L, 13L, 12L), Anti2005 = c(18L, 19L, 7L, 8L), Pro2006 = c(0L,
0L, 3L, 3L), Anti2006 = c(7L, 7L, 4L, 4L), Pro2007 = c(2L, 1L,
10L, 9L), Anti2007 = c(13L, 14L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-4L))
The code:
df %>% pivot_longer(cols = -1) %>%
#Format strings
mutate(name=gsub('\\d+','',name)) %>%
#Aggregate
group_by(Party,name) %>% summarise(value=sum(value,na.rm=T)) %>%
pivot_wider(names_from = name,values_from=value)
The output:
# A tibble: 2 x 3
# Groups: Party [2]
Party Anti Pro
<chr> <int> <int>
1 D 34 50
2 R 78 5
Split by party, sum over the Pro/Anti columns using sapply, and finally rbind the results.
res <- data.frame(Party=sort(unique(d$Party)), do.call(rbind, by(d, d$Party, function(x)
sapply(c("Pro", "Anti"), function(y) sum(x[grep(y, names(x))])))))
res
# Party Pro Anti
# D D 50 34
# R R 5 78
An outer solution is also suitable.
t(outer(c("Pro", "Anti"), c("R", "D"),
Vectorize(function(x, y) sum(d[d$Party %in% y, grep(x, names(d))]))))
# [,1] [,2]
# [1,] 5 78
# [2,] 50 34
Data:
d <- read.table(header=T, text="Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6 ")
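Another base R route, sketched here as an illustration only, is to collapse the year columns into one Pro and one Anti total per row first and then sum those totals by party. The name `row_totals` is made up, and the column selection assumes every relevant column name starts with "Pro" or "Anti".

```r
# Example data, retyped from the question
d <- data.frame(
  Party    = c("R", "R", "D", "D"),
  Pro2005  = c(1L, 1L, 13L, 12L),
  Anti2005 = c(18L, 19L, 7L, 8L),
  Pro2006  = c(0L, 0L, 3L, 3L),
  Anti2006 = c(7L, 7L, 4L, 4L),
  Pro2007  = c(2L, 1L, 10L, 9L),
  Anti2007 = c(13L, 14L, 5L, 6L)
)

# Per-row totals across years, selected by column-name prefix
row_totals <- data.frame(
  Party = d$Party,
  Pro   = rowSums(d[startsWith(names(d), "Pro")]),
  Anti  = rowSums(d[startsWith(names(d), "Anti")])
)

# Sum the per-row totals within each party
res <- aggregate(cbind(Pro, Anti) ~ Party, data = row_totals, FUN = sum)
res
```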

How to find the average of several lines with the same id in a big R dataframe? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 2 years ago.
I have a big data frame (more than 100,000 entries) that looks something like this:
ID Pre temp day
134 10 6 1
134 20 7 1
134 10 8 1
234 5 1 2
234 10 4 2
234 15 10 3
I want to reduce my data frame by finding the mean value of pre, temp and day for identical ID values.
At the end, my data frame would look something like this
ID Pre temp day
134 13.3 7 1
234 10 5 2.3
I'm not sure how to do it.
Thank you in advance !
With the dplyr package you can group_by your ID value and then use summarise to take the mean
library(dplyr)
df %>%
group_by(ID) %>%
summarise(Pre= mean(Pre),
temp = mean(temp),
day = mean(day))
# A tibble: 2 x 4
ID Pre temp day
<dbl> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
With dplyr, a solution looks like this:
textFile <- "ID Pre temp day
134 10 6 1
134 20 7 1
134 10 8 1
234 5 1 2
234 10 4 2
234 15 10 3"
data <- read.table(text = textFile,header=TRUE)
library(dplyr)
data %>% group_by(ID) %>%
summarise(.,Pre = mean(Pre),temp = mean(temp),day=mean(day))
...and the output:
ID Pre temp day
<int> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
You can try next:
library(dplyr)
#Data
df <- structure(list(ID = c(134L, 134L, 134L, 234L, 234L, 234L), Pre = c(10L,
20L, 10L, 5L, 10L, 15L), temp = c(6L, 7L, 8L, 1L, 4L, 10L), day = c(1L,
1L, 1L, 2L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
#Code
df %>% group_by(ID) %>% summarise_all(mean,na.rm=T)
# A tibble: 2 x 4
ID Pre temp day
<int> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
With summarise_all there is no need to name each variable individually.
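The linked duplicate is about aggregate, so for comparison here is a base R sketch of the same computation. The dot in the formula stands for "all other columns"; the name `res` is made up for the illustration.

```r
# Example data, retyped from the question
df <- data.frame(
  ID   = c(134L, 134L, 134L, 234L, 234L, 234L),
  Pre  = c(10L, 20L, 10L, 5L, 10L, 15L),
  temp = c(6L, 7L, 8L, 1L, 4L, 10L),
  day  = c(1L, 1L, 1L, 2L, 2L, 3L)
)

# Mean of every non-grouping column, per ID
res <- aggregate(. ~ ID, data = df, FUN = mean)
res
```

One caveat: the formula interface of aggregate drops rows that contain an NA in any column by default (na.action = na.omit), which is not the same as passing na.rm = TRUE to mean for each column.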

R using dplyr group_by/ sum in for loop, output as concatenated list

I am using the dplyr package to group by a week variable and get the sum for three variables. The results for the three variables should be stacked on top of each other.
Here is my data frame df:
week var1 var2 var3
1 1 2 3
1 2 2 3
2 4 4 5
2 2 2 6
3 6 6 6
3 4 4 4
My command is
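calculate <- function(vars){
# .data[[vars]] looks the column up by its name; sum(vars) would try to sum the string itself
x <- df %>% group_by(week) %>% summarise(summe = sum(.data[[vars]])) %>% mutate(group = vars)
x
}
cols <- c("var1", "var2", "var3")
total <- NULL # initialise the accumulator before the loop
for (i in seq_along(cols)){
var <- cols[i]
cal <- calculate(var)
total <- rbind(total,cal)
}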
The expected output should be
week summe group
1 3 var1
2 6 var1
3 10 var1
1 4 var2
2 6 var2
3 10 var2
1 6 var3
2 11 var3
3 10 var3
My question is: Is there a better way instead of using a for loop?
Cheers,
Andi
We could pivot to 'long' format and then do a group by 'sum'
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'), names_to = 'group') %>%
group_by(week, group) %>%
summarise(summe = sum(value)) %>%
ungroup %>%
arrange(group) %>%
select(week, summe, group)
# A tibble: 9 x 3
# week summe group
# <int> <int> <chr>
#1 1 3 var1
#2 2 6 var1
#3 3 10 var1
#4 1 4 var2
#5 2 6 var2
#6 3 10 var2
#7 1 6 var3
#8 2 11 var3
#9 3 10 var3
We can also do the sum grouped by 'week' first and then pivot to 'long' format
df %>%
group_by(week) %>%
summarise_at(vars(-group_cols()), sum) %>%
pivot_longer(cols = starts_with('var'), names_to = 'group', values_to = 'summe') %>%
select(week, summe, group)
data
df <- structure(list(week = c(1L, 1L, 2L, 2L, 3L, 3L), var1 = c(1L,
2L, 4L, 2L, 6L, 4L), var2 = c(2L, 2L, 4L, 2L, 6L, 4L), var3 = c(3L,
3L, 5L, 6L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
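If the per-variable structure of the original loop should be kept, one option is to replace the loop with lapply and bind the pieces once at the end. This is a base R sketch only; the names `pieces` and `total` are chosen for the illustration.

```r
# Example data, retyped from the question
df <- data.frame(
  week = c(1L, 1L, 2L, 2L, 3L, 3L),
  var1 = c(1L, 2L, 4L, 2L, 6L, 4L),
  var2 = c(2L, 2L, 4L, 2L, 6L, 4L),
  var3 = c(3L, 3L, 5L, 6L, 6L, 4L)
)

cols <- c("var1", "var2", "var3")

# One small data frame per variable: week, summe, group
pieces <- lapply(cols, function(v) {
  out <- aggregate(df[[v]], by = list(week = df$week), FUN = sum)
  names(out)[2] <- "summe"  # aggregate() names the value column "x" here
  out$group <- v
  out
})

# Stack the pieces in one step instead of growing a data frame inside a loop
total <- do.call(rbind, pieces)
total
```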

Getting rowSums for triplicate records and retaining only the one with highest value

I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain some information I would like to sum, and only retain the one triplicate with the higher rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So in this example I want to keep only rows 2 and 6 (which have the highest rowSums among the triplicates of each animal), and do the same for the whole data frame. What I want as a result is
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can create the row sums separately and use ave to find, within each ID, the row whose sum equals the group maximum. Then use the resulting logical vector to subset the rows of the dataset
nm1 <- startsWith(names(df1), "V")
OP updated the column names. In that case, either an index
nm1 <- 3:7
Or select the columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
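The output above still uses the original V1 to V5 names. As a sketch of how the same steps look with the column names shown in the question (Acet, Cell, Fibe, Mega, Tera), assuming ID and Trip are the only non-numeric columns:

```r
# Example data, retyped from the question with its updated column names
df1 <- data.frame(
  ID   = c(4L, 4L, 4L, 12L, 12L, 12L),
  Trip = c("A", "B", "C", "A", "B", "C"),
  Acet = c(2L, 9L, 1L, 4L, 6L, 5L),
  Cell = c(4L, 3L, 2L, 6L, 8L, 5L),
  Fibe = c(9L, 7L, 4L, 7L, 1L, 7L),
  Mega = c(8L, 5L, 8L, 2L, 1L, 3L),
  Tera = c(3L, 5L, 6L, 3L, 2L, 3L)
)

# Every column except the identifiers is summed
value_cols <- setdiff(names(df1), c("ID", "Trip"))
row_total <- rowSums(df1[value_cols], na.rm = TRUE)

# ave() returns the group maximum repeated for each row of that group
keep <- row_total == ave(row_total, df1$ID, FUN = max)
result <- df1[keep, ]
result
```

For the real data with 65 columns, `value_cols` would have to select columns 13 to 65 instead, for example with `names(df1)[13:65]`.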
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is kept per ID even if several rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)
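The "exactly one row per ID" behaviour can also be sketched in base R with which.max, which returns only the first position of the maximum. The data frame `toy` below is made up for the illustration and deliberately contains a tie in ID 4.

```r
# Hypothetical data: rows B and C of ID 4 have the same row sum (12)
toy <- data.frame(
  ID   = c(4L, 4L, 4L, 12L, 12L),
  Trip = c("A", "B", "C", "A", "B"),
  V1   = c(2L, 9L, 9L, 4L, 6L),
  V2   = c(4L, 3L, 3L, 6L, 1L)
)

row_total <- rowSums(toy[c("V1", "V2")])

# Row indices grouped by ID
rows_by_id <- split(seq_len(nrow(toy)), toy$ID)

# For each ID, keep the index of the first row with the largest sum
keep <- vapply(rows_by_id,
               function(rows) rows[which.max(row_total[rows])],
               integer(1))

result <- toy[keep, ]
result
```

With a comparison such as `row_total == max(row_total)` per group, both tied rows of ID 4 would be returned instead.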
