R Get Sum of Column

I have a simple data frame in R:
df1 <- data.frame(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
The basic idea is to add a column giving, for each row, the sum of Value over that user's earlier questions (those with a lower questionID).
I tried the function below (after first trying to pipe into sum, which just gave the usual non-numeric errors everybody seems to face):
f2 <- function(x){
  Value_out <- filter(df1, questionID < x['questionID'] & userID == x['userID']) %>%
    select(Value) %>%
    summarize_if(is.numeric, sum, na.rm = TRUE)
}
out <- mutate(df1, Expert = apply(df1, 1, f2))
While this works if you print it out, the Expert column is saved as a list of data frames. All I want is an integer or numeric with the sum of Value. Is there any way to do this? By the way, yes, I've looked all over for ways to do this; most answers just summarize the whole column in a way that doesn't work for me.
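(For reference, a minimal sketch of fixing the approach above: have the function return a bare numeric via pull() and sum(), and vectorize over the two key columns with mapply() instead of apply(). The rewritten f2 signature is an editorial choice, not from the original post.)
library(dplyr)
f2 <- function(q, u) {
  df1 %>%
    filter(questionID < q, userID == u) %>%  # this user's earlier questions
    pull(Value) %>%                          # a plain numeric vector, not a data frame
    sum()
}
out <- mutate(df1, Expert = mapply(f2, questionID, userID))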

I think I would avoid writing my own function altogether and use data.table for this one. You can do what you want in just a couple of lines, and I wouldn't be surprised if there were a way to golf it down even further.
Given your same data, we create a data.table object:
library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
Next, we shift our values by 1 (lag) within each userID:
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
And finally, we cumsum those by userID; then, for rows sharing the same userID and questionID, we replace Expert with the group minimum, so that repeated rows of the same question all get the sum from strictly earlier questions (0 for a user's first question, thanks to fill = 0 in the shift above):
dt[, Expert := cumsum(lastVal), by = .(userID)
   ][, Expert := min(Expert), by = .(userID, questionID)]
So, putting that all together, we have:
library(data.table)
dt <- data.table(
  questionID = c(1, 1, 3, 4, 5, 5),
  userID = c(101, 101, 102, 101, 102, 101),
  Value = c(10, 20, 30, 40, 50, 10))
dt[, lastVal := shift(Value, n = 1, fill = 0), by = .(userID)]
dt[, Expert := cumsum(lastVal), by = .(userID)
   ][, Expert := min(Expert), by = .(userID, questionID)]
dt
   questionID userID Value lastVal Expert
1:          1    101    10       0      0
2:          1    101    20      10      0
3:          3    102    30       0      0
4:          4    101    40      20     30
5:          5    102    50      30     30
6:          5    101    10      40     70
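As for golfing it down: the helper column can be skipped by folding shift() directly into cumsum(); a sketch that should produce the same Expert column:
dt[, Expert := cumsum(shift(Value, fill = 0)), by = userID][
   , Expert := min(Expert), by = .(userID, questionID)]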

Using dplyr and purrr::map_dbl, one approach would be to group_by userID and, for each row, sum Value over the questionIDs that are less than the current one.
library(dplyr)
df1 %>%
  group_by(userID) %>%
  mutate(Expert = purrr::map_dbl(questionID, ~ sum(Value[questionID < .x])))
#  questionID userID Value Expert
#       <dbl>  <dbl> <dbl>  <dbl>
#1          1    101    10      0
#2          1    101    20      0
#3          3    102    30      0
#4          4    101    40     30
#5          5    102    50     30
#6          5    101    10     70
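For comparison, the same per-row logic in base R (a sketch, using df1 from the question):
df1$Expert <- sapply(seq_len(nrow(df1)), function(i) {
  sum(df1$Value[df1$userID == df1$userID[i] &
                df1$questionID < df1$questionID[i]])
})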

I'm trying to make a frequency table where Var1 only uses one value and Var3 does not appear in the table but filters data in the table

I'm trying to make a frequency table where VAR1 keeps some but not all of its values and VAR3 filters which rows are used to make the table.
ISCED is Var1 and I want only the values 12 and 15. If possible I don't want to see FALSE in my table. Secondly, I want to use only the rows where PK_T = 1. I can't figure out how to do it.
My DF:
ISCED EMTAK PK_T
   12   233    1
   12   245    0
   12   233   NA
   13   233    1
   15   433    1
   15   245    0
I want:
ISCED EMTAK Freq
 True   233    1
 True   245    0
 True   433    1
If we assume that your dataset is as you presented it:
myDF <- data.frame(ISCED=c(12, 12, 12, 13, 15, 15), EMTAK=c(233, 245, 233, 233, 433, 245), PK_T=c(1, 0, NA, 1, 1, 0))
Then using only the dplyr package and table() from base R you can do:
install.packages("dplyr")
library(dplyr)
newDF <- myDF %>%
  na.omit %>%                        # remove rows with NA values
  filter(PK_T == 1) %>%
  filter(ISCED == 12 | ISCED == 15) %>%
  select(EMTAK) %>%                  # drop the column ISCED
  mutate(Freq = table(.$EMTAK))      # mutate() creates the new column "Freq"
Giving:
  EMTAK Freq
1   233    1
2   433    1
If you want other combinations, you adjust the arguments inside filter().
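For instance, the two filter() calls above can also be collapsed into one using %in%:
filter(PK_T == 1, ISCED %in% c(12, 15))  # equivalent to the two filter() steps above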
Using your data and the data.table package, I subset to the rows where ISCED is 12 or 15, group by EMTAK and PK_T == 1, and use .N to count the number of occurrences. I then chain a second expression to keep only the groups where PK_T is TRUE. Note I also drop the PK_T variable from the final table; it's meaningless, as there is no variation left in it.
library(data.table)
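# assumed setup (not shown in the original answer): the sample data as a data.table
dt1 <- as.data.table(myDF)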
dt1[ISCED %in% c(12, 15), .N, by = .(EMTAK, PK_T == 1)][PK_T == TRUE, -"PK_T"]
Giving
   EMTAK N
1:   233 1
2:   433 1
Which is the correct answer for your data: there is only one occurrence of 233 where PK_T is 1; in the other occurrence it is NA.
A base R equivalent (with the sample data in a data.frame, myDF) would be
table(myDF[myDF$ISCED %in% c(12, 15) & myDF$PK_T == 1, "EMTAK"])
You can use table to get the frequency. To include levels that occur 0 times, convert EMTAK to a factor with as.factor. You can use [ to subset to the cases where x$PK_T == 1, and %in% to keep only the rows with ISCED values of 12 or 15:
y <- table(as.factor(x$EMTAK)[x$ISCED %in% c(12, 15) &
                              !is.na(x$PK_T) & x$PK_T == 1])
data.frame(ISCED = TRUE, EMTAK = names(y), Freq = as.vector(y))
#  ISCED EMTAK Freq
#1  TRUE   233    1
#2  TRUE   245    0
#3  TRUE   433    1
Or use aggregate, counting only the rows where all conditions are true:
aggregate(cbind(Freq = PK_T) ~ EMTAK, x[x$ISCED %in% c(12, 15) &
                                        !is.na(x$PK_T) & x$PK_T == 1, ], length)
#  EMTAK Freq
#1   233    1
#2   433    1
Data:
x <- read.table(header=TRUE, text="ISCED EMTAK PK_T
12 233 1
12 245 0
12 233 NA
13 233 1
15 433 1
15 245 0")

looking for mutate_if, but for rows not columns [duplicate]

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:
library(data.table)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace = T),
                 space = sample(1:4, 50, replace = T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace = T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace = T),
                 cf = runif(50))
# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit',
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]
Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.
These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:
1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}
DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.
mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}
DF %>%
  group_by(is.exit = measure == 'exit') %>%
  mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  ungroup() %>%
  select(-is.exit)
2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.
library(dplyr)
DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
  select(-is.exit)
3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.
library(sqldf)
DF %>%
  do(sqldf(c("update '.'
              set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
              where measure = 'exit'",
             "select * from '.'")))
4) row_case_when Also check out row_case_when defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.
library(dplyr)
DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )
Note 1: We used this as DF
set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace = T),
                 space = sample(1:4, 50, replace = T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace = T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace = T),
                 cf = runif(50))
Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.
You can do this with magrittr's two-way pipe %<>%:
library(dplyr)
library(magrittr)
dt[dt$measure == "exit", ] %<>% mutate(qty.exit = qty,
                                       cf = 0,
                                       delta.watts = 13)
This reduces the amount of typing, but is still much slower than data.table.
Here's a solution I like:
mutate_when <- function(data, ...) {
  dots <- eval(substitute(alist(...)))
  for (i in seq(1, length(dots), by = 2)) {
    condition <- eval(dots[[i]], envir = data)
    mutations <- eval(dots[[i + 1]], envir = data[condition, , drop = FALSE])
    data[condition, names(mutations)] <- mutations
  }
  data
}
It lets you write things like e.g.
mtcars %>% mutate_when(
  mpg > 22,    list(cyl = 100),
  disp == 160, list(cyl = 200)
)
which is quite readable -- although it may not be as performant as it could be.
As eipi10 shows above, there's not a simple way to do a subset replacement in dplyr, because data.table uses pass-by-reference semantics while dplyr passes by value. dplyr requires the use of ifelse() on the whole vector, whereas data.table will do the subset and update by reference (returning the whole data.table). So, for this exercise, data.table will be substantially faster.
You could alternatively subset first, then update, and finally recombine:
dt.sub <- dt[dt$measure == "exit", ] %>%
  mutate(qty.exit = qty, cf = 0, delta.watts = 13)
dt.new <- rbind(dt.sub, dt[dt$measure != "exit", ])
But data.table is going to be substantially faster:
(edited to use eipi10's newer answer)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt = {dt <- dt[measure == 'exit',
                              `:=`(qty.exit = qty,
                                   cf = 0,
                                   delta.watts = 13)]},
               eipi10 = {dt[dt$measure == "exit", ] %<>% mutate(qty.exit = qty,
                                                                cf = 0,
                                                                delta.watts = 13)},
               alex = {dt.sub <- dt[dt$measure == "exit", ] %>%
                         mutate(qty.exit = qty, cf = 0, delta.watts = 13)
                       dt.new <- rbind(dt.sub, dt[dt$measure != "exit", ])})
Unit: microseconds
   expr      min        lq      mean    median       uq      max neval cld
     dt  591.480  672.2565  747.0771  743.341  780.973 1837.539   100 a
 eipi10 3481.212 3677.1685 4008.0314 3796.909 3936.796 6857.509   100  b
   alex 3412.029 3637.6350 3867.0649 3726.204 3936.985 5424.427   100  b
I just stumbled across this and really like mutate_cond() by @G. Grothendieck, but thought it might come in handy to also handle new variables. The version below has two additions:
Unrelated: the second-to-last line is made a bit more dplyr-ish by using filter().
Three new lines at the beginning get the variable names for use in mutate() and initialize any new variables in the data frame before mutate() occurs. New variables are initialized for the remainder of the data frame using new_init, which is set to missing (NA) by default.
mutate_cond <- function(.data, condition, ..., new_init = NA, envir = parent.frame()) {
  # Initialize any new variables as new_init
  new_vars <- substitute(list(...))[-1]
  new_vars %<>% sapply(deparse) %>% names %>% setdiff(names(.data))
  .data[, new_vars] <- new_init

  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data %>% filter(condition) %>% mutate(...)
  .data
}
Here are some examples using the iris data:
Change Petal.Length to 88 where Species == "setosa". This will work in the original function as well as this new version.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88)
Same as above, but also create a new variable x (NA in rows not included in the condition). Not possible before.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE)
Same as above, but rows not included in the condition for x are set to FALSE.
iris %>% mutate_cond(Species == "setosa", Petal.Length = 88, x = TRUE, new_init = FALSE)
This example shows how new_init can be set to a list to initialize multiple new variables with different values. Here, two new variables are created, with excluded rows being initialized using different values (x initialised as FALSE, y as NA).
iris %>% mutate_cond(Species == "setosa" & Sepal.Length < 5,
                     x = TRUE, y = Sepal.Length ^ 2,
                     new_init = list(FALSE, NA))
One concise solution would be to do the mutation on the filtered subset and then add back the non-exit rows of the table:
library(dplyr)
dt %>%
  filter(measure == 'exit') %>%
  mutate(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  rbind(dt %>% filter(measure != 'exit'))
You could split the dataset and do a regular mutate() call on the TRUE part.
The split can be done with either dplyr::group_split() or base::split(); I like the base version better here since it preserves names, see the discussion at https://github.com/tidyverse/dplyr/issues/4223 .
df1 <- data.frame(site = sample(1:6, 50, replace = T),
                  space = sample(1:4, 50, replace = T),
                  measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                   replace = T),
                  qty = round(runif(50) * 30),
                  qty.exit = 0,
                  delta.watts = sample(10.5:100.5, 50, replace = T),
                  cf = runif(50),
                  stringsAsFactors = F)
library(tidyverse)
df1 %>%
  group_split(measure == "exit", .keep = FALSE) %>%
  modify_at(2, ~ mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#> # A tibble: 50 × 7
#>     site space measure   qty qty.exit delta.watts    cf
#>    <int> <int> <chr>   <dbl>    <dbl>       <dbl> <dbl>
#>  1     5     1 linear     22        0       100.  0.126
#>  2     3     3 led        12        0        61.5 0.161
#>  3     6     1 led        26        0        25.5 0.307
#>  4     5     2 cfl        16        0        26.5 0.865
#>  5     6     3 linear     19        0        57.5 0.684
#>  6     1     4 led        12        0        14.5 0.802
#>  7     6     4 led         5        0        90.5 0.547
#>  8     5     4 linear     28        0        54.5 0.171
#>  9     1     2 linear      5        0        24.5 0.775
#> 10     1     2 cfl        24        0        96.5 0.144
#> # … with 40 more rows
df1 %>%
  split(~ measure == "exit") %>%
  modify_at("TRUE", ~ mutate(., qty.exit = qty, cf = 0, delta.watts = 13)) %>%
  bind_rows()
#> site space measure qty qty.exit delta.watts cf
#> 1 5 1 linear 22 0 100.5 0.125646491
#> 2 3 3 led 12 0 61.5 0.160692291
#> 3 6 1 led 26 0 25.5 0.307239765
#> 4 5 2 cfl 16 0 26.5 0.864969074
#> 5 6 3 linear 19 0 57.5 0.683945200
#> 6 1 4 led 12 0 14.5 0.802398642
#> 7 6 4 led 5 0 90.5 0.547211378
#> 8 5 4 linear 28 0 54.5 0.170614207
#> 9 1 2 linear 5 0 24.5 0.774603932
#> 10 1 2 cfl 24 0 96.5 0.144310557
#> 11 3 4 linear 21 0 93.5 0.682622390
#> 12 4 4 led 2 0 48.5 0.941718646
#> 13 4 4 cfl 2 0 100.5 0.918448627
#> 14 5 2 led 11 0 63.5 0.998143780
#> 15 4 1 led 21 0 53.5 0.644740176
#> 16 1 3 cfl 5 0 28.5 0.110610285
#> 17 1 3 linear 24 0 41.5 0.538868200
#> 18 4 3 led 29 0 19.5 0.998474289
#> 19 2 3 cfl 4 0 22.5 0.008167536
#> 20 5 1 led 20 0 56.5 0.740833476
#> 21 3 2 led 5 0 44.5 0.223967706
#> 22 1 4 led 27 0 32.5 0.199850583
#> 23 3 4 cfl 17 0 61.5 0.104023080
#> 24 1 3 cfl 11 0 34.5 0.399036247
#> 25 2 3 linear 29 0 65.5 0.600678235
#> 26 2 4 cfl 23 0 29.5 0.291611352
#> 27 6 2 linear 13 0 37.5 0.225021614
#> 28 2 3 led 17 0 62.5 0.879606956
#> 29 2 4 led 29 0 51.5 0.301759669
#> 30 5 1 led 11 0 54.5 0.793816856
#> 31 2 3 led 20 0 29.5 0.514759195
#> 32 3 4 linear 6 0 68.5 0.475085443
#> 33 1 4 led 21 0 34.5 0.133207588
#> 34 2 4 linear 25 0 80.5 0.164279355
#> 35 5 3 led 7 0 73.5 0.252937836
#> 36 6 2 led 15 0 99.5 0.554864929
#> 37 3 2 linear 6 0 44.5 0.377257874
#> 38 4 4 exit 15 15 13.0 0.000000000
#> 39 3 3 exit 10 10 13.0 0.000000000
#> 40 5 1 exit 15 15 13.0 0.000000000
#> 41 4 2 exit 1 1 13.0 0.000000000
#> 42 5 3 exit 10 10 13.0 0.000000000
#> 43 1 3 exit 14 14 13.0 0.000000000
#> 44 5 2 exit 12 12 13.0 0.000000000
#> 45 2 2 exit 30 30 13.0 0.000000000
#> 46 6 3 exit 28 28 13.0 0.000000000
#> 47 1 1 exit 14 14 13.0 0.000000000
#> 48 3 3 exit 21 21 13.0 0.000000000
#> 49 4 2 exit 13 13 13.0 0.000000000
#> 50 4 3 exit 12 12 13.0 0.000000000
Created on 2022-10-07 by the reprex package (v2.0.1)
mutate_cond is a great function, but it gives an error if there is an NA in the column(s) used to create the condition. I feel that a conditional mutate should simply leave such rows alone. This matches the behavior of filter(), which returns rows when the condition is TRUE, but omits both rows with FALSE and NA.
With this small change the function works like a charm:
mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  condition[is.na(condition)] <- FALSE
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}
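For example, a quick sketch using airquality (chosen here only because its Ozone column contains NAs; not part of the original answer): the unpatched version errors on the NA rows of the condition, while this version updates just the rows where the condition is TRUE:
library(dplyr)
airquality %>%
  mutate_cond(Ozone > 80, Temp = 999) %>%  # NA Ozone rows are left untouched
  filter(Ozone > 80)                       # show only the updated rows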
I don't actually see any changes to dplyr that would make this much easier. case_when is great when there are multiple different conditions and outcomes for one column, but it doesn't help for this case where you want to change multiple columns based on one condition. Similarly, recode saves typing if you are replacing multiple different values in one column, but doesn't help with doing so in multiple columns at once. Finally, mutate_at etc. only apply conditions to the column names, not the rows in the data frame. You could potentially write a function for mutate_at that would do it, but I can't figure out how you would make it behave differently for different columns.
That said, here is how I would approach it using nest from tidyr and map from purrr.
library(data.table)
library(dplyr)
library(tidyr)
library(purrr)
# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace = T),
                 space = sample(1:4, 50, replace = T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace = T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace = T),
                 cf = runif(50))
dt2 <- dt %>%
  nest(-measure) %>%
  mutate(data = if_else(
    measure == "exit",
    map(data, function(x) mutate(x, qty.exit = qty, cf = 0, delta.watts = 13)),
    data
  )) %>%
  unnest()
With the creation of rlang, a slightly modified version of Grothendieck's 1a example is possible, eliminating the need for the envir argument, as enquo() captures the environment that .p is created in automatically.
mutate_rows <- function(.data, .p, ...) {
  .p <- rlang::enquo(.p)
  .p_lgl <- rlang::eval_tidy(.p, .data)
  .data[.p_lgl, ] <- .data[.p_lgl, ] %>% mutate(...)
  .data
}
dt %>% mutate_rows(measure == "exit", qty.exit = qty, cf = 0, delta.watts = 13)
I think this answer has not been mentioned before. It runs almost as fast as the 'default' data.table solution.
Use base::replace()
df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
              cf = replace(cf, measure == 'exit', 0),
              delta.watts = replace(delta.watts, measure == 'exit', 13))
replace recycles the replacement value, so when you want the values of column qty entered into column qty.exit, you have to subset qty as well; hence the qty[measure == 'exit'] in the first replacement.
Now, you will probably not want to retype measure == 'exit' all the time, so you can create an index vector containing that selection and use it in the functions above.
# build an index vector matching the condition
index.v <- which(df$measure == 'exit')
df %>% mutate(qty.exit = replace(qty.exit, index.v, qty[index.v]),
              cf = replace(cf, index.v, 0),
              delta.watts = replace(delta.watts, index.v, 13))
benchmarks
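The original post doesn't include the benchmark code, so here is a sketch of what it presumably looked like (assuming df holds the sample data as a data.frame, dt a data.table copy, and index.v as defined above):
library(microbenchmark)
microbenchmark(
  data.table   = dt[measure == 'exit', `:=`(qty.exit = qty, cf = 0, delta.watts = 13)],
  wimpel       = df %>% mutate(qty.exit = replace(qty.exit, measure == 'exit', qty[measure == 'exit']),
                               cf = replace(cf, measure == 'exit', 0),
                               delta.watts = replace(delta.watts, measure == 'exit', 13)),
  wimpel.index = df %>% mutate(qty.exit = replace(qty.exit, index.v, qty[index.v]),
                               cf = replace(cf, index.v, 0),
                               delta.watts = replace(delta.watts, index.v, 13))
)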
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#   data.table 1.005018 1.053370 1.137456 1.112871 1.186228 1.690996   100
#       wimpel 1.061052 1.079128 1.218183 1.105037 1.137272 7.390613   100
# wimpel.index 1.043881 1.064818 1.131675 1.085304 1.108502 4.192995   100
At the expense of breaking with the usual dplyr syntax, you can use within from base R. Note that within takes a single expression, so multiple assignments need to be wrapped in braces:
dt %>% within({
  qty.exit[measure == 'exit'] <- qty[measure == 'exit']
  delta.watts[measure == 'exit'] <- 13
})
It seems to integrate well with the pipe, and you can do pretty much anything you want inside it.

Find index of first and last occurrence in data table

I have a data table that looks like
userId 36 37 38 39 40
     1  1  0  3  0  0
     2  3  0  0  0  1
Each numbered column (36-40) represents a week number. I want to calculate the number of weeks between the first and the last occurrence of a non-zero value.
For instance, for userId 1 in my dataset, the first non-zero value appears in week 36 and the last in week 38, so the value I want is 2. For userId 2 it's 40 - 36, which is 4.
I would like to store the data like:
userId lifespan
     1        2
     2        4
I'm struggling to do this, can someone please help?
The general method I would take is to melt the table, convert the character column names to numeric, and take the delta by each userId. Here is an example using data.table.
library(data.table)
dt <- fread("userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1",
            header = TRUE)
dt <- melt(dt, id.vars = "userId")
dt[, variable := as.numeric(as.character(variable))]
dt
#     userId variable value
#  1:      1       36     1
#  2:      2       36     3
#  3:      1       37     0
#  4:      2       37     0
#  5:      1       38     3
#  6:      2       38     0
#  7:      1       39     0
#  8:      2       39     0
#  9:      1       40     0
# 10:      2       40     1
dt[!value == 0, .(lifespan = max(variable) - min(variable)), by = .(userId)]
#    userId lifespan
# 1:      1        2
# 2:      2        4
Here's a dplyr method:
df %>%
  gather(var, value, -userId) %>%
  mutate(var = as.numeric(sub("X", "", var))) %>%
  group_by(userId) %>%
  slice(c(which.max(value != 0), max(which(value != 0)))) %>%
  summarize(lifespan = var[2] - var[1])
Result:
# A tibble: 2 x 2
   userId lifespan
    <int>    <dbl>
1       1        2
2       2        4
Data:
df = read.table(text = "userId|36|37|38|39|40
1|1|0|3|0|0
2|3|0|0|0|1", header = TRUE, sep = "|")
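For completeness, a base R sketch of the same computation (assuming the df read in above; read.table prefixes the numeric column names with "X"):
weeks <- as.numeric(sub("X", "", names(df)[-1]))
data.frame(userId = df$userId,
           lifespan = apply(df[-1], 1, function(r) {
             nz <- which(r != 0)  # positions of the non-zero weeks
             weeks[max(nz)] - weeks[min(nz)]
           }))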

binned frequency from multiple variables in same column

Using R, I have melted my table into two columns, variable and value. There are multiple variables and values represent lengths. I would like to end up with three columns: variable, fixed pre-determined bins and frequency of each bin per variable.
Example Data:
original table/data frame:
Var Value
  a   509
  a  1000
  b   251
  b   330
...
new table/data frame:
Var      Bin Count
  a    0-250    50
  a  251-500    30
  a 501-1000     1
  b    0-250    80
  b  251-500    55
  b 501-1000    30
...
To my surprise, I haven't found a duplicate question googling the question's title.
So, here are three possible solutions using cut() implemented in base R, dplyr, and data.table.
# pre-define break points of bins
breaks <- seq(0L, 1000L, by = 250L)
Note that cut() creates left-open, right-closed intervals by default, e.g., 250 < x <= 500. For the first interval, the parameter include.lowest = TRUE forces a left-closed interval as well, e.g., 0 <= x <= 250, which seems to be in line with the OP's expected output.
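A quick check of the boundary behaviour:
cut(c(0, 250, 251, 500), breaks, include.lowest = TRUE)
# [1] [0,250]   [0,250]   (250,500] (250,500]
# Levels: [0,250] (250,500] (500,750] (750,1e+03]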
base R
with(DF, table(Var, cut(Value, breaks, include.lowest = TRUE)))
Var   [0,250] (250,500] (500,750] (750,1e+03]
  a         0         0         1           1
  b         1         2         0           0
dplyr
library(dplyr)
DF %>%
  group_by(Var, Bin = cut(Value, breaks, include.lowest = TRUE)) %>%
  count()
# A tibble: 4 x 3
# Groups:   Var, Bin [4]
    Var         Bin     n
  <chr>      <fctr> <int>
1     a   (500,750]     1
2     a (750,1e+03]     1
3     b     [0,250]     1
4     b   (250,500]     2
data.table
library(data.table)
setDT(DF)[, .N, by = .(Var, Bin = cut(Value, breaks, include.lowest = TRUE))]
   Var         Bin N
1:   a   (500,750] 1
2:   a (750,1e+03] 1
3:   b     [0,250] 1
4:   b   (250,500] 2
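If you want bins labelled like the OP's expected output, cut() also accepts a labels argument. A sketch with the dplyr variant (the label strings here are illustrative, chosen to match the question):
labs <- c("0-250", "251-500", "501-750", "751-1000")
DF %>%
  group_by(Var, Bin = cut(Value, breaks, labels = labs, include.lowest = TRUE)) %>%
  count()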
Data
library(data.table)
DF <- fread("Var Value
a 509
a 1000
b 0
b 251
b 330")

Find most frequent combination of values in a data.frame

I would like to find the most frequent combination of values in a data.frame.
Here's some example data:
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
In this example the result I am looking for is the combination of age=55, sex=1 and bmi=25, since that is the most frequent combination of column values.
My real data has about 30000 rows and 20 columns. What would be an efficient way to find the most common combination of these 20 values among the 30000 observations?
Many thanks!
Here's an approach with data.table:
dt <- data.table(dat)
setkeyv(dt, names(dt))
dt[, .N, by = key(dt)]
dt[, .N, by = key(dt)][N == max(N)]
#    age sex bmi N
# 1:  55   1  25 2
And an approach with base R:
x <- data.frame(table(dat))
x[x$Freq == max(x$Freq), ]
#    age sex bmi Freq
# 11  55   1  25    2
I don't know how well either of these scale though, particularly if the number of combinations is going to be large. So, test back and report!
Replace x$Freq == max(x$Freq) with which.max(x$Freq) and N == max(N) with which.max(N) if you are really just interested in one row of results.
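For example:
x[which.max(x$Freq), ]                # base R: first row attaining the max
dt[, .N, by = key(dt)][which.max(N)]  # data.table: same idea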
The quick and dirty solution. I am sure there is a fancier way to do it, though, with the plyr package or similar.
> (tab <- table(apply(dat, 1, paste, collapse=", ")))
50, 0, 20 50, 1, 20 55, 1, 25 60, 1, 30
        1         1         2         1
> names(which.max(tab))
[1] "55, 1, 25"
Something like this?
> dat[duplicated(dat), ]
  age sex bmi
5  55   1  25
using while (maybe time consuming)
Here's another data.frame with more than 1 case duplicated
> dat <- data.frame(age = c(50, 55, 60, 50, 55, 55, 60),
                    sex = c(1, 1, 1, 0, 1, 1, 1),
                    bmi = c(20, 25, 30, 20, 25, 25, 30))
> dat[duplicated(dat), ]   # see data.frame
  age sex bmi
5  55   1  25
6  55   1  25
7  60   1  30
# finding the most repeated item
> while(any(duplicated(dat))){
    dat <- dat[duplicated(dat), ]
    # print(dat)
  }
> print(dat)
  age sex bmi
6  55   1  25
Here's a tidyverse solution. Grouping by all variables and getting the count per group has the benefit that you can see the counts of all other groups, not just the max.
library(tidyverse)
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
dat %>%
  group_by_all() %>%
  summarise(count = n()) %>%
  arrange(desc(count))
#> # A tibble: 4 x 4
#> # Groups:   age, sex [4]
#>     age   sex   bmi count
#>   <dbl> <dbl> <dbl> <int>
#> 1    55     1    25     2
#> 2    50     0    20     1
#> 3    50     1    20     1
#> 4    60     1    30     1
Created on 2018-10-17 by the reprex package (v0.2.0).
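As a side note, essentially the same result can be had with count(), which rolls the group_by/summarise/arrange steps into one call (the count column is then named n):
dat %>% count(age, sex, bmi, sort = TRUE)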
