Sum a variable by group & create new column with frequency [duplicate] - r

This question already has answers here:
Apply several summary functions (sum, mean, etc.) on several variables by group in one call
(7 answers)
Closed 6 years ago.
I have 2 columns of data. The first one is an id and the second one a value.
There may be many occurrences of the same id.
I need to aggregate the data by summing all the values for the same id AND I would like to create a new column with the number of occurrences of the same id.
For example:
id value
1 15
1 10
2 5
3 7
1 4
3 12
4 16
I know I can use aggregate to sum the values and reduce the table to 4 rows, but I would like an extra column with the number of occurrences of the id like this:
id value freq
1 29 3
2 5 1
3 19 2
4 16 1
Thank you

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', get the sum of 'value' and also the number of rows with (.N)
library(data.table)
setDT(df1)[, .(value=sum(value), freq = .N) , by = id]
# id value freq
#1: 1 29 3
#2: 2 5 1
#3: 3 19 2
#4: 4 16 1
Or, as @Frank commented:
dcast(setDT(df1), id ~ ., fun = list(sum, length))
Or a similar approach with dplyr
library(dplyr)
df1 %>%
group_by(id) %>%
summarise(value = sum(value), freq = n())

Using base R, one can combine aggregate() and table() like this:
cbind(aggregate(value ~ id, df1, sum), freq=as.vector(table(df1$id)))
# id value freq
#1 1 29 3
#2 2 5 1
#3 3 19 2
#4 4 16 1
data used in this example:
df1 <- structure(list(id = c(1L, 1L, 2L, 3L, 1L, 3L, 4L),
value = c(15L, 10L, 5L, 7L, 4L, 12L, 16L)),
.Names = c("id", "value"), class = "data.frame",
row.names = c(NA, -7L))
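The cbind() approach above relies on aggregate() and table() returning the ids in the same order. As a minimal sketch of an alternative, a single aggregate() call can compute both summaries at once. The object names agg and res are only illustrative.

```r
# Same sample data as in the question
df1 <- data.frame(id = c(1L, 1L, 2L, 3L, 1L, 3L, 4L),
                  value = c(15L, 10L, 5L, 7L, 4L, 12L, 16L))

# One aggregate() call returning both summaries per id;
# the 'value' column of the result is a two-column matrix
agg <- aggregate(value ~ id, data = df1,
                 FUN = function(x) c(value = sum(x), freq = length(x)))

# Flatten the matrix column into ordinary columns
res <- data.frame(id = agg$id, agg$value)
res
```

Because the sum and the count come from the same grouping step, they cannot get out of sync.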

Related

Finding minimum by groups and among columns

I am trying to find the minimum value among different columns and group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by the groups and for each group, find the row which contains the minimum group score among both group scores and then also get the name of the column which contains the minimum (group_score_1 or group_score_2),
so basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas and eventually ended up dividing the data into several new data frames, filtering by group, selecting the relevant columns and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
For each group, you can calculate the minimum value and select the row in which that value exists in one of the columns.
library(dplyr)
df %>%
group_by(group) %>%
filter({tmp = min(group_score_1, group_score_2);
group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is not practical to list each of them as group_score_1 == tmp | group_score_2 == tmp etc. In that case, reshape the data to long format, get the cut value corresponding to the minimum value, and join back to the original data. This assumes cut is unique within each group.
df %>%
tidyr::pivot_longer(cols = starts_with('group_score')) %>%
group_by(group) %>%
summarise(cut = cut[which.min(value)]) %>%
left_join(df, by = c("group", "cut"))
Here is a base R option using pmin + ave + subset
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
Data
df <- structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
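The question also asks for the name of the column that holds the minimum, which the answers above do not return. A minimal base R sketch of one way to get it, under the assumption that ties can be resolved by taking the first match; the helper columns row_min and min_col are hypothetical names introduced here.

```r
# Same sample data as above
df <- data.frame(group = c("a", "b", "a", "b", "a", "b"),
                 cut = c(1L, 2L, 0L, 3L, 2L, 1L),
                 group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L),
                 group_score_2 = c(5, 4, 2.5, 4, 6, 1))

score_cols <- grep("^group_score", names(df), value = TRUE)
scores <- as.matrix(df[score_cols])

# Row-wise minimum and the name of the column holding it
df$row_min <- apply(scores, 1, min)
df$min_col <- score_cols[apply(scores, 1, which.min)]

# Per group, keep the row whose row-wise minimum is smallest
# (which.min() takes the first row in case of ties)
res <- do.call(rbind, lapply(split(df, df$group),
                             function(d) d[which.min(d$row_min), ]))
res
```

For the sample data this selects the row with cut 0 for group a (minimum in group_score_1) and the row with cut 1 for group b (minimum in group_score_2).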

Fill in missing rows in data in R

Suppose I have a data frame like this:
1 8
2 12
3 2
5 -6
6 1
8 5
I want to add a row in the places where the 4 and 7 would have gone in the first column and have the second column for these new rows be 0, so adding these rows:
4 0
7 0
I have no idea how to do this in R.
In excel, I could use a vlookup inside an iferror. Is there a similar combo of functions in R to make this happen?
Edit: also, suppose that row 1 was missing and needed to be filled in similarly. Would this require another solution? What if I wanted to add rows until I reached ten rows?
Use tidyr::complete to fill in the missing sequence between min and max values.
library(tidyr)
library(rlang)
complete(df, V1 = min(V1):max(V1), fill = list(V2 = 0))
#Or using `seq`
#complete(df, V1 = seq(min(V1), max(V1)), fill = list(V2 = 0))
# V1 V2
# <int> <dbl>
#1 1 8
#2 2 12
#3 3 2
#4 4 0
#5 5 -6
#6 6 1
#7 7 0
#8 8 5
If we already know the min and max we want, we can use them directly. For example, to get rows for V1 = 1 to 10:
complete(df, V1 = 1:10, fill = list(V2 = 0))
If we don't know the column names beforehand, we can do something like :
col1 <- names(df)[1]
col2 <- names(df)[2]
complete(df, !!sym(col1) := 1:10, fill = as.list(setNames(0, col2)))
data
df <- structure(list(V1 = c(1L, 2L, 3L, 5L, 6L, 8L), V2 = c(8L, 12L,
2L, -6L, 1L, 5L)), class = "data.frame", row.names = c(NA, -6L))
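The question mentions a VLOOKUP inside an IFERROR in Excel. A rough base R analogue, sketched here under the assumption that the wanted ids are 1 to 10, is match() followed by replacing the resulting NA values. The names ids, v2 and res are only illustrative.

```r
# Same sample data as above
df <- data.frame(V1 = c(1L, 2L, 3L, 5L, 6L, 8L),
                 V2 = c(8L, 12L, 2L, -6L, 1L, 5L))

ids <- 1:10                        # the full set of ids we want rows for
v2 <- df$V2[match(ids, df$V1)]     # NA where an id is absent (a failed lookup)
v2[is.na(v2)] <- 0L                # replace failed lookups with 0

res <- data.frame(V1 = ids, V2 = v2)
res
```

This assumes V1 has no duplicates, since match() only returns the first matching position.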

Create incremental value with restart with condition within ID

I have data with two fields, ID and Time:
ID Time
1 12
1 15
1 16
2 12
2 11
I want to increment a counter when the difference between Time and the previous Time is less than 2 (for example) within the same ID, otherwise keep the same value, and restart at 1 when the ID changes.
Desired output:
ID Time ID_SESSION
1 12 1
1 15 1
1 16 2
2 12 1
2 11 1
This is needed in dplyr/sparklyr, for a Spark implementation with R.
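A one-liner using base R. Note that its result for the last row differs from the desired output, because the negative difference for ID 2 also satisfies diff(i) <= 2:
with(df1, ave(Time, ID, FUN = function(i) cumsum(c(TRUE, diff(i) <= 2))))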
#[1] 1 1 2 1 2
Maybe we need
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(ID_SESSION = (lag(c(FALSE, diff(Time) > 2), default= FALSE)) + 1)
Or in a one-liner with data.table
library(data.table)
setDT(df1)[, ID_SESSION := shift(c(FALSE, diff(Time) > 2), fill = FALSE) + 1, ID]
df1
# ID Time ID_SESSION
#1: 1 12 1
#2: 1 15 1
#3: 1 16 2
#4: 2 12 1
#5: 2 11 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Time = c(12L, 15L,
16L, 12L, 11L)), class = "data.frame", row.names = c(NA, -5L))
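The question does not say how a negative gap should be treated or whether the counter should keep accumulating over longer runs. As a minimal sketch under two assumptions (only gaps from 0 up to, but not including, 2 trigger an increment, and the counter is cumulative within each ID), a base R version could look like this. The helper session_id and its max_gap parameter are hypothetical names.

```r
# Same sample data as above
df1 <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L),
                  Time = c(12L, 15L, 16L, 12L, 11L))

# Hypothetical helper: start at 1 and add 1 each time the gap to the
# previous Time is non-negative and below max_gap (assumed rule)
session_id <- function(time, max_gap = 2) {
  gap <- diff(time)
  1L + cumsum(c(FALSE, gap >= 0 & gap < max_gap))
}

# Apply the helper separately within each ID
df1$ID_SESSION <- ave(df1$Time, df1$ID, FUN = session_id)
df1
```

With these assumptions the sample data yields ID_SESSION values 1, 1, 2, 1, 1, matching the desired output in the question.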

Create a column with number of times a value has appeared so far in R? [duplicate]

This question already has answers here:
Cumulative count of each value [duplicate]
(6 answers)
Closed 6 years ago.
I have a data table:
ID FREQUENCY
"jso" 3
"and" 2
"jso" 3
"mo" 1
"jso" 3
"and" 2
It has a column with the frequency. However, I want to create a table with how many times the id has appeared so far. So I'd want my data table to look like this:
ID FREQUENCY
"jso" 1
"and" 1
"jso" 2
"mo" 1
"jso" 3
"and" 2
How would you do this?
This can be done with a group-by operation. With data.table, convert the 'data.frame' to 'data.table' (setDT(df1)); then, grouped by 'ID', get the sequence of rows (seq_len(.N)) and assign (:=) it to 'FREQUENCY'.
library(data.table)
setDT(df1)[,FREQUENCY := seq_len(.N) , by = ID]
Or using dplyr, where row_number() is a convenient function for the sequence of rows (after grouping by 'ID').
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(FREQUENCY = row_number())
Or with base R
with(df1, ave(FREQUENCY, ID, FUN = seq_along))
#[1] 1 1 2 1 3 2
data
df1 <- structure(list(ID = c("jso", "and", "jso", "mo", "jso", "and"
), FREQUENCY = c(3L, 2L, 3L, 1L, 3L, 2L)), .Names = c("ID", "FREQUENCY"
), class = "data.frame", row.names = c(NA, -6L))
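The base R call above only prints the running count. A small sketch that stores it in a separate column and checks it against the existing totals may help; the column name SO_FAR is a hypothetical choice made here so that FREQUENCY is not overwritten.

```r
# Same sample data as above
df1 <- data.frame(ID = c("jso", "and", "jso", "mo", "jso", "and"),
                  FREQUENCY = c(3L, 2L, 3L, 1L, 3L, 2L))

# Running count of each ID, stored in a new column
df1$SO_FAR <- ave(seq_along(df1$ID), df1$ID, FUN = seq_along)

# Sanity check: the largest running count per ID equals its total frequency
max_so_far <- tapply(df1$SO_FAR, df1$ID, max)
totals <- tapply(df1$FREQUENCY, df1$ID, max)
all(max_so_far == totals)

df1
```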

Fill a column's blank spaces contingent on a second column in R

I'd appreciate some help with this one. I have something similar to the data below.
df$A df$B
1 .
1 .
1 .
1 6
2 .
2 .
2 7
What I need to do is fill in df$B with each value that corresponds to the end of the run of values in df$A. Example below.
df$A df$B
1 6
1 6
1 6
1 6
2 7
2 7
2 7
Any help would be welcome.
It seems to me that the missing values are denoted by ".". It is better to read the dataset with na.strings="." so that the missing values become NA. In the current dataset, the 'B' column is of character or factor class, depending on whether stringsAsFactors=FALSE or TRUE was used in read.table/read.csv (TRUE was the default before R 4.0.0).
Using data.table, we convert the data.frame to data.table (setDT(df1)), change the 'character' class to 'numeric' (B:= as.numeric(B)). This will also result in coercing the . to NA (a warning will appear). Grouped by "A", we change the "B" values to the last element (B:= B[.N])
library(data.table)
setDT(df1)[,B:= as.numeric(B)][,B:=B[.N] , by = A]
# A B
#1: 1 6
#2: 1 6
#3: 1 6
#4: 1 6
#5: 2 7
#6: 2 7
#7: 2 7
Or with dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(B= as.numeric(tail(B,1)))
Or using ave from base R
df1$B <- with(df1, as.numeric(ave(B, A, FUN=function(x) tail(x,1))))
data
df1 <- structure(list(A = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), B = c(".",
".", ".", "6", ".", ".", "7")), .Names = c("A", "B"),
class = "data.frame", row.names = c(NA, -7L))
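The na.strings suggestion above can be sketched as follows. The data are read from an inline string here purely for illustration; the sketch assumes every group has at least one non-missing value of B.

```r
# Inline copy of the sample data; "." marks a missing value
txt <- "A B
1 .
1 .
1 .
1 6
2 .
2 .
2 7"

# na.strings = "." turns the dots into NA, so B is read as a number
df1 <- read.table(text = txt, header = TRUE, na.strings = ".")
n_missing <- sum(is.na(df1$B))

# Fill each group with its last non-missing value
df1$B <- ave(df1$B, df1$A, FUN = function(x) tail(x[!is.na(x)], 1))
df1
```

Reading the column as numeric from the start avoids the coercion warning that as.numeric() gives on the "." strings.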
