R: Count number of times B follows A using dplyr

I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value either "below" or "above" a threshold and would like to count the number of times the average value transitions: "below to above", "above to below", "above to above", or "below to below".
df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))
A bit of digging into a Matlab answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i = c(1, 1, 2), j = c(1, 2, 1))
Produces this result, which I can't yet interpret:
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?

Sure, just use group_by() and count the values:
library(dplyr)

df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  group_by(grp = paste(level, lead(level))) %>%
  summarise(n = n()) %>%
  # drop the observation that does not have a "next" value
  filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#>   grp             n
#>   <chr>       <int>
#> 1 above above     2
#> 2 above below     1
#> 3 below above     1
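If you prefer, count() rolls the group_by() and summarise() steps into one call; a minimal sketch with the same df:
df %>%
  count(grp = paste(level, lead(level))) %>%
  filter(!grepl(pattern = "NA", x = grp))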

You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])

        above below
  above     2     1
  below     1     0
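If you want the counts in long form rather than as a contingency table, one option (a sketch) is as.data.frame(); Var1 is the following level and Var2 the level it follows:
as.data.frame(table(df$level[-1], df$level[-nrow(df)]))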
EDIT in response to @HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U <- matrix(runif(4 * 20), nrow = 20)
dfU <- data.frame(round(U))

library(plyr) # for mapvalues
df2 <- data.frame(apply(dfU,
                        FUN = function(x) mapvalues(x, from = 0:1, to = c('below', 'above')),
                        MARGIN = 2))
so that df2 contains random 'above' and 'below':
X1 X2 X3 X4
1 below above above above
2 below below above below
3 above above above below
4 above below above below
5 below below above above
6 above below above below
7 above below below below
8 above below below above
9 above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
      FUN = function(x) as.vector(table(x[-1], x[-nrow(df2)])),
      MARGIN = 2)
which gives us
X1 X2 X3 X4
[1,] 5 2 7 2
[2,] 5 6 4 6
[3,] 6 5 3 6
[4,] 3 6 5 5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
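Short of melt, one possible way to recover the labels (a sketch, assuming every column contains both 'below' and 'above', so all per-column tables share the same 2 x 2 layout):
res <- apply(df2,
             FUN = function(x) as.vector(table(x[-1], x[-nrow(df2)])),
             MARGIN = 2)
# Rebuild the labels from one column's table: rows are the "next" value,
# columns the "current" value, and as.vector() flattens column by column.
tab1 <- table(df2$X1[-1], df2$X1[-nrow(df2)])
rownames(res) <- as.vector(outer(rownames(tab1), colnames(tab1),
                                 function(to, from) paste(from, "to", to)))
res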

Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. Single pass through the vector.
library(dplyr)

df %>%
  summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
            decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
            constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0)))

A slightly different version:
library(dplyr)
library(stringr)

df %>%
  group_by(level = str_c(level, lead(level), sep = " ")) %>%
  count(level) %>%
  na.omit()
  level           n
  <chr>       <int>
1 above above     2
2 above below     1
3 below above     1

Another possible solution, based on tidyverse:
library(tidyverse)

df <- data.frame(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

df %>%
  mutate(changes = str_c(lag(level), level, sep = "_")) %>%
  count(changes) %>%
  drop_na(changes)
#>       changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)

dt <- data.table(value = c(130, 200, 240, 230, 130),
                 level = c("below", "above", "above", "above", "below"))

dt[, changes := paste(shift(level), level, sep = "_")
   ][2:.N][, .(n = .N), keyby = .(changes)]
#>        changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1


findInterval by group with dplyr [duplicate]

In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
library(dplyr) # for tibble()

set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
             val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data, as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
                            purrr::map_dfr(1:3, function(gr) {
                              c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
                            })) %>%
  nest(bp = -gr) %>%
  mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error: Problem with `mutate()` input `lvl`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
                          ~ findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient: with my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a single group_by() + mutate() step:
library(dplyr)

df %>%
  group_by(gr) %>%
  mutate(breakpoints = findInterval(val,
                                    c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
  ungroup
#       gr    val breakpoints
#    <int>  <dbl>       <int>
#  1     1  0.440           1
#  2     1  0.770           2
#  3     1  2.56            4
#  4     1  1.07            3
#  5     1  1.13            3
#  6     1  2.72            4
#  7     1  1.46            4
#  8     1 -0.265           1
#  9     1  0.313           1
# 10     1  0.554           2
# … with 20 more rows
findInterval is applied for each gr separately.
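If the grouped dplyr version is still too slow on the 1.2-million-row data, the same idea translates directly to data.table (a sketch, not benchmarked):
library(data.table)
setDT(df)[, lvl := findInterval(val, c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf)),
          by = gr][]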

Row mean of two matching columns with same name but differ by: '_1' and '_2'

Let's say I have the data frame:
z <- data.frame(col_1 = c(1, 2, 3, 4), col_2 = c(3, 4, 5, 6))

  col_1 col_2
1     1     3
2     2     4
3     3     5
4     4     6
I want to take columns whose names are the same except for the numeric suffix, e.g. '_1' and '_2', and take their pairwise mean. In reality I have a big data frame with many pairs, and they are not in a nice order, so I am looking for a clever solution that can be applied generally.
So the output should look like this:
  col
1   2
2   3
3   4
4   5
The column name should be the shared stem of the pair, with the numeric suffix removed.
Any help would be great, thanks.
Here is a base R option using list2DF + split.default + rowMeans:
list2DF(lapply(split.default(z, gsub("_\\d+", "", names(z))), rowMeans))
which gives
  col
1   2
2   3
3   4
4   5
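Note that list2DF() was added in R 4.0.0; on older versions, as.data.frame() should do the same job here (a sketch):
as.data.frame(lapply(split.default(z, gsub("_\\d+", "", names(z))), rowMeans))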
Try this tidyverse approach. By using separate() you can extract the stem of each name, and then reshaping gets you to the desired output. Here is the code:
library(dplyr)
library(tidyr)

# Data
z <- data.frame(col_1 = c(1, 2, 3, 4), col_2 = c(3, 4, 5, 6))

# Code
z1 <- z %>%
  mutate(id = 1:n()) %>%
  pivot_longer(-id) %>%
  separate(name, c('var1', 'var2'), sep = '_') %>%
  group_by(id, var1) %>%
  summarise(Mean = mean(value)) %>%
  pivot_wider(names_from = var1, values_from = Mean) %>%
  ungroup() %>%
  select(-id)
Output:
# A tibble: 4 x 1
    col
  <dbl>
1     2
2     3
3     4
4     5
Here is a purrr oriented solution:
library(purrr)
library(stringr)
split.default(z, str_remove(names(z), "[:digit:]+$")) %>% map_dfc(rowMeans)
#> # A tibble: 4 x 1
#>    col_
#>   <dbl>
#> 1     2
#> 2     3
#> 3     4
#> 4     5
It works even if z is:
z <- data.frame(col_1 = c(1, 2, 3, 4),
                col_2 = c(3, 4, 5, 6),
                anothercol_1 = c(1, 2, 3, 4),
                anothercol_2 = c(3, 4, 5, 6))
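For that wider z, the same pipeline should return one column per stem (output reconstructed by hand, so treat it as indicative):
split.default(z, str_remove(names(z), "[:digit:]+$")) %>% map_dfc(rowMeans)
#> # A tibble: 4 x 2
#>   anothercol_  col_
#>         <dbl> <dbl>
#> 1           2     2
#> 2           3     3
#> 3           4     4
#> 4           5     5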

Create ranges by accumulating values

I have the DF1:
KEY <- c(11, 12, 22, 33, 44, 55, 66, 77, 88, 99, 1010, 1111, 1212, 1313,
         1414, 1515, 1616, 1717, 1818, 1919, 2020)
PRICE <- c(0, 0, 1, 5, 7, 10, 20, 80, 110, 111, 200, 1000, 2500, 2799, 3215,
           4999, 7896, 8968, 58914, 78422, 96352)
DF1 <- data.frame(KEY, PRICE)
I want to group DF1 into ranges, accumulating the values of the two columns (count the KEY column and sum the PRICE column). This is the result I hope for:
INTERVAL <- c('0', 'UP_TO_10', 'UP_TO_100', 'UP_TO_1000', 'UP_TO_5000',
              'UP_TO_10000', 'UP_TO_100000')
COUNT_KEY <- c(2, 6, 8, 12, 16, 18, 21)
SUM_PRICE <- c(0, 23, 123, 1544, 15057, 31921, 265609)
DF2 <- data.frame(INTERVAL, COUNT_KEY, SUM_PRICE)
How do I make this table?
If you have a vector of limits or thresholds, such as:
LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
You could obtain a count of rows where PRICE is below each limit:
unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
[1] 2 6 8 12 16 18 21
And a sum of these prices as well:
unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
[1] 0 23 123 1544 15057 31921 265609
Is this what you had in mind?
This is everything all together:
LIMITS <- c(0, 10, 100, 1000, 5000, 10000, 100000)
COUNT_KEY <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE <= x)))
SUM_PRICE <- unlist(lapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x])))
data.frame(INTERVAL = c(0, paste("UP_TO", LIMITS[-1], sep="_")), COUNT_KEY, SUM_PRICE)
      INTERVAL COUNT_KEY SUM_PRICE
1            0         2         0
2     UP_TO_10         6        23
3    UP_TO_100         8       123
4   UP_TO_1000        12      1544
5   UP_TO_5000        16     15057
6  UP_TO_10000        18     31921
7 UP_TO_100000        21    265609
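The lapply/unlist pairs can also be written with sapply, which does the simplification for you:
COUNT_KEY <- sapply(LIMITS, function(x) sum(DF1$PRICE <= x))
SUM_PRICE <- sapply(LIMITS, function(x) sum(DF1$PRICE[DF1$PRICE <= x]))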
You have to manually define your boundaries first:
X <- c(-Inf, 0, 10, 100, 1000, 5000, 10000, 100000)
Then you use cut to assign entries to your labels (reusing the INTERVAL labels defined in the question), and first summarize the counts and total price within each interval.
library(dplyr)

DF1 %>%
  mutate(LABELS = cut(DF1$PRICE, X, INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE))
# A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             4        23
3 UP_TO_100            2       100
4 UP_TO_1000           4      1421
5 UP_TO_5000           4     13513
6 UP_TO_10000          2     16864
7 UP_TO_100000         3    233688
This is close to what you want, except that COUNT_KEY and SUM_PRICE should be cumulative. That can be achieved with mutate_if(is.numeric, cumsum):
DF1 %>%
  mutate(LABELS = cut(DF1$PRICE, X, INTERVAL, include.lowest = TRUE)) %>%
  group_by(LABELS) %>%
  summarise(COUNT_KEY = n(), SUM_PRICE = sum(PRICE)) %>%
  mutate_if(is.numeric, cumsum)
To give:
# A tibble: 7 x 3
  LABELS       COUNT_KEY SUM_PRICE
  <fct>            <int>     <dbl>
1 0                    2         0
2 UP_TO_10             6        23
3 UP_TO_100            8       123
4 UP_TO_1000          12      1544
5 UP_TO_5000          16     15057
6 UP_TO_10000         18     31921
7 UP_TO_100000        21    265609
Okay, here's an all-in-one, tidy way to handle this using dplyr ;)
library(dplyr)

DF1 %>%
  mutate(
    INTERVAL =
      factor(
        case_when( # create discrete variable
          PRICE == 0      ~ '0',
          PRICE <= 10     ~ 'UP_TO_10',
          PRICE <= 100    ~ 'UP_TO_100',
          PRICE <= 1000   ~ 'UP_TO_1000',
          PRICE <= 5000   ~ 'UP_TO_5000',
          PRICE <= 10000  ~ 'UP_TO_10000',
          PRICE <= 100000 ~ 'UP_TO_100000'
        ),
        levels = # set the factor levels
          c(
            '0',
            'UP_TO_10',
            'UP_TO_100',
            'UP_TO_1000',
            'UP_TO_5000',
            'UP_TO_10000',
            'UP_TO_100000'
          )
      )
  ) %>%
  group_by(INTERVAL) %>% # create desired group
  summarise( # and summary variables
    COUNT_KEY = n(),
    SUM_PRICE = sum(PRICE)
  ) %>%
  mutate( # cumulative totals
    COUNT_KEY_CUM = cumsum(COUNT_KEY),
    SUM_PRICE_CUM = cumsum(SUM_PRICE)
  )

sum by group including intermediate groups

I have:
df <- data.frame(group = c(1, 1, 2, 4, 4, 5), value = c(3, 1, 5, 2, 3, 6))
aggregate(value ~ group, data = df, FUN = 'sum')

  group value
1     1     4
2     2     5
3     4     5
4     5     6
Is there a way to include the intermediate groups and return the result below? I realise this could be done by creating a data frame with all the desired groups and matching in the results from aggregate(), but I am hoping there is a cleaner way. It would need to be as fast as aggregate and use only base R packages, due to restrictions in my workplace.
  group value
1     1     4
2     2     5
3     3     0
4     4     5
5     5     6
You can try this:
library(tidyr)
library(dplyr)

df %>%
  mutate(group = factor(group, 1:5)) %>%
  complete(group) %>%
  group_by(group) %>%
  dplyr::summarise(value = sum(value, na.rm = TRUE))
   group value
  <fctr> <dbl>
1      1     4
2      2     5
3      3     0
4      4     5
5      5     6
You can do this easily with the tidyverse:
library(dplyr)
library(tidyr)
df %>%
  group_by(group) %>%
  summarise(valuesum = sum(value)) %>%
  full_join(., expand(df, group = 1:5)) %>%
  complete(group, fill = list(valuesum = 0))
The result:
# A tibble: 5 x 2
  group valuesum
  <dbl>    <dbl>
1     1        4
2     2        5
3     3        0
4     4        5
5     5        6
Or a bit more difficult to understand with data.table:
library(data.table)
setDT(df)[.(group = 1:5), on = 'group', sum(value, na.rm = TRUE), by = .EACHI]
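Here .(group = 1:5) supplies the full set of groups to join on, and by = .EACHI evaluates sum(value, na.rm = TRUE) once per joined group; a group with no match (group 3) contributes only an NA, so its sum comes out as 0. The expected result (written out by hand):
#>    group    V1
#> 1:     1     4
#> 2:     2     5
#> 3:     3     0
#> 4:     4     5
#> 5:     5     6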
You can use merge from base R. I've changed the name of your data.frame to dat, since df is the name of an R function.
dat <- read.table(text = "
group value
1 4
2 5
4 5
5 6
", header = TRUE)
str(dat)
res <- aggregate(value ~ group, data = dat, FUN = 'sum')
merge(res, data.frame(group = seq(from = min(res$group), to = max(res$group))), all = TRUE)
Note that there will be a NA, not a zero. I believe that you should solve that by leaving it as a missing value.
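If you do need the zeros from the desired output, one possible follow-up (a sketch) is to fill them in after the merge:
out <- merge(res, data.frame(group = seq(from = min(res$group), to = max(res$group))), all = TRUE)
out$value[is.na(out$value)] <- 0
out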

Find most frequent combination of values in a data.frame

I would like to find the most frequent combination of values in a data.frame.
Here's some example data:
dat <- data.frame(age = c(50, 55, 60, 50, 55),
                  sex = c(1, 1, 1, 0, 1),
                  bmi = c(20, 25, 30, 20, 25))
In this example the result I am looking for is the combination of age=55, sex=1 and bmi=25, since that is the most frequent combination of column values.
My real data has about 30000 rows and 20 columns. What would be an efficient way to find the most common combination of these 20 values among the 30000 observations?
Many thanks!
Here's an approach with data.table:
library(data.table)

dt <- data.table(dat)
setkeyv(dt, names(dt))
dt[, .N, by = key(dt)]
dt[, .N, by = key(dt)][N == max(N)]
#    age sex bmi N
# 1:  55   1  25 2
And an approach with base R:
x <- data.frame(table(dat))
x[x$Freq == max(x$Freq), ]
#    age sex bmi Freq
# 11  55   1  25    2
I don't know how well either of these scale though, particularly if the number of combinations is going to be large. So, test back and report!
Replace x$Freq == max(x$Freq) with which.max(x$Freq) and N == max(N) with which.max(N) if you are really just interested in one row of results.
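Written out, those single-row variants would be:
x[which.max(x$Freq), ]                # base R
dt[, .N, by = key(dt)][which.max(N)]  # data.table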
The quick and dirty solution. I am sure there is a fancier way to do it, though, with the plyr package or similar.
> (tab <- table(apply(dat, 1, paste, collapse = ", ")))

50, 0, 20 50, 1, 20 55, 1, 25 60, 1, 30
        1         1         2         1
> names(which.max(tab))
[1] "55, 1, 25"
Something like this??
> dat[duplicated(dat), ]
  age sex bmi
5  55   1  25
Using while (maybe time-consuming):
Here's another data.frame with more than 1 case duplicated
> dat <- data.frame(age = c(50, 55, 60, 50, 55, 55, 60),
                    sex = c(1, 1, 1, 0, 1, 1, 1),
                    bmi = c(20, 25, 30, 20, 25, 25, 30))
> dat[duplicated(dat), ] # see the duplicated rows
  age sex bmi
5  55   1  25
6  55   1  25
7  60   1  30
# finding the most repeated item
> while (any(duplicated(dat))) {
    dat <- dat[duplicated(dat), ]
    # print(dat)
  }
> print(dat)
  age sex bmi
6  55   1  25
Here's a tidyverse solution. Grouping by all variables and getting the count per group has the benefit that you can see the counts of all other groups, not just the max.
library(tidyverse)
dat <- data.frame(age = c(50, 55, 60, 50, 55),
                  sex = c(1, 1, 1, 0, 1),
                  bmi = c(20, 25, 30, 20, 25))

dat %>%
  group_by_all() %>%
  summarise(count = n()) %>%
  arrange(desc(count))
#> # A tibble: 4 x 4
#> # Groups:   age, sex [4]
#>     age   sex   bmi count
#>   <dbl> <dbl> <dbl> <int>
#> 1    55     1    25     2
#> 2    50     0    20     1
#> 3    50     1    20     1
#> 4    60     1    30     1
Created on 2018-10-17 by the reprex package (v0.2.0).
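In current dplyr the scoped verbs like group_by_all() are superseded; count() with across() gets the same ranking in one call (a sketch):
dat %>% count(across(everything()), sort = TRUE)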
