I would like to find the most frequent combination of values in a data.frame.
Here's some example data:
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
In this example the result I am looking for is the combination of age=55, sex=1 and bmi=25, since that is the most frequent combination of column values.
My real data has about 30000 rows and 20 columns. What would be an efficient way to find the most common combination of these 20 values among the 30000 observations?
Many thanks!
Here's an approach with data.table:
library(data.table)
dt <- data.table(dat)
setkeyv(dt, names(dt))
dt[, .N, by = key(dt)]
dt[, .N, by = key(dt)][N == max(N)]
# age sex bmi N
# 1: 55 1 25 2
And an approach with base R:
x <- data.frame(table(dat))
x[x$Freq == max(x$Freq), ]
# age sex bmi Freq
# 11 55 1 25 2
I don't know how well either of these scale though, particularly if the number of combinations is going to be large. So, test back and report!
Replace x$Freq == max(x$Freq) with which.max(x$Freq) and N == max(N) with which.max(N) if you are really just interested in one row of results.
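For example, to pull just that single row (a quick sketch using the same objects as above):
x[which.max(x$Freq), ]
dt[, .N, by = key(dt)][which.max(N)]
Both return one row even if several combinations tie for the maximum.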
The quick and dirty solution. I am sure there is a fancier way to do it though, with the plyr package or similar (see the sketch after the output below).
> (tab <- table(apply(dat, 1, paste, collapse=", ")))
50, 0, 20 50, 1, 20 55, 1, 25 60, 1, 30
1 1 2 1
> names(which.max(tab))
[1] "55, 1, 25"
Something like this??
> dat[duplicated(dat), ]
age sex bmi
5 55 1 25
Using while (this may be time consuming):
Here's another data.frame with more than 1 case duplicated
> dat <- data.frame(age=c(50,55,60,50,55, 55, 60),
sex=c(1,1,1,0,1, 1,1),
bmi=c(20,25,30,20,25, 25,30))
> dat[duplicated(dat), ] # see data.frame
age sex bmi
5 55 1 25
6 55 1 25
7 60 1 30
# finding the most repeated item
> while(any(duplicated(dat))){
dat <- dat[duplicated(dat), ]
#print(dat)
}
> print(dat)
age sex bmi
6 55 1 25
Here's a tidyverse solution. Grouping by all variables and getting the count per group has the benefit that you can see the counts of all other groups, not just the max.
library(tidyverse)
dat <- data.frame(age=c(50,55,60,50,55),sex=c(1,1,1,0,1),bmi=c(20,25,30,20,25))
dat %>%
group_by_all() %>%
summarise(count = n()) %>%
arrange(desc(count))
#> # A tibble: 4 x 4
#> # Groups: age, sex [4]
#> age sex bmi count
#> <dbl> <dbl> <dbl> <int>
#> 1 55 1 25 2
#> 2 50 0 20 1
#> 3 50 1 20 1
#> 4 60 1 30 1
Created on 2018-10-17 by the reprex package (v0.2.0).
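With more recent dplyr (1.0.0 or later), counting across all columns is a little shorter; a sketch:
dat %>%
  count(across(everything()), sort = TRUE)
This gives the same per-combination counts, sorted with the most frequent combination first.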
I have a dataset that has 2 columns; column A is State_Name, which has 5 different states, and column B is Total_Spend, which has the average total spend of that state per day. There are 365 observations for each state.
What I want to do is count the number of outliers PER STATE using the 1.5 IQR rule and save the count of outliers per state to a new df or table.
So I would expect an output something like:
State   Outlier Count
ATL     5
GA      20
MI      11
NY      50
TX      23
I have managed to get it to work by doing it one state at a time, but I can't figure out how to achieve this in a single go.
Here is my code at the moment (to return the result for a single state):
outlier_df <- daily_agg %>%
  select(State_Name, Total_Spend) %>%
  filter(State_Name == "NY")
outlier_NY <- length(boxplot.stats(outlier_df$Total_Spend)$out)
Any help would be appreciated.
Thanks!
EDIT WITH TEST DATASET
outlier_mtcars <-
mtcars %>%
select(cyl, disp) %>%
filter(cyl == "6")
outliers <- length(boxplot.stats(outlier_mtcars$disp)$out)
The above shows me 1 outlier for 6 cyl cars but I want a table that shows how many outliers for 4, 6, 8 cyl cars
Since I'm not very familiar with the function boxplot.stats, I didn't use it in my solution and instead manually calculate the upper fence, 1.5 * IQR above the upper quartile (so only high outliers are flagged here).
Here mtcars is used as an example. Records that are outliers are flagged as TRUE, and we can then sum the flags in summarize.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>%
summarize(Outlier = sum(flag))
# A tibble: 3 × 2
cyl Outlier
<dbl> <int>
1 4 0
2 6 1
3 8 0
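If you do want to stick with boxplot.stats from the question (which uses both the lower and upper fences), a grouped version might look like this (a sketch, not benchmarked):
mtcars %>%
  group_by(cyl) %>%
  summarize(Outlier = length(boxplot.stats(disp)$out))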
Since I don't have your data, I'll make some up with the two columns you mention:
df<-data.frame(state=sample(c("ny","fl"),100, replace=TRUE),
spend=sample(1:100, 100, replace=TRUE))
> head(df)
state spend
1 ny 3
2 fl 87
3 ny 91
4 fl 97
5 ny 47
6 fl 8
Then set your upper and lower bounds (these could be quartiles, absolute values, whatever you like):
df%>%
group_by(state)%>%
mutate(lower_bound=quantile(spend,0.25),
upper_bound=quantile(spend,0.75))%>%
mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))
# A tibble: 10 × 5
# Groups: state [2]
state spend lower_bound upper_bound is_outlier
<chr> <int> <dbl> <dbl> <lgl>
1 ny 3 38 84 TRUE
2 fl 87 26 87 FALSE
3 ny 91 38 84 TRUE
4 fl 97 26 87 TRUE
Then if you only want to see the output, summarise by is_outlier:
df%>%
group_by(state)%>%
mutate(lower_bound=quantile(spend,0.25),upper_bound=quantile(spend,0.75))%>%
mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))%>%
summarise(outliers=sum(is_outlier))
state outliers
<chr> <int>
1 fl 19
2 ny 30
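Note that the bounds above are the plain quartiles; if you want the 1.5 * IQR rule from the question instead, a hedged sketch would be:
df %>%
  group_by(state) %>%
  summarise(outliers = sum(spend < quantile(spend, 0.25) - 1.5 * IQR(spend) |
                           spend > quantile(spend, 0.75) + 1.5 * IQR(spend)))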
I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value either "below" or "above" a threshold and would like to count the number of times the average value goes "below to above", "above to below", "above to above" or "below to below".
df <- data.frame(value = c(130, 200, 240, 230, 130),
level = c("below", "above","above","above", "below"))
A bit of digging into Matlab answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i=c(2,2,2,1), j=c(2,2,2))
Produces this result which I can't yet interpret.
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?
Sure, just use group by and count the values
library(dplyr)
df <- data.frame(value = c(130, 200, 240, 230, 130),
level = c("below", "above","above","above", "below"))
df %>%
group_by(grp = paste(level, lead(level))) %>%
summarise(n = n()) %>%
# drop the observation that does not have a "next" value
filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#> grp n
#> <chr> <int>
#> 1 above above 2
#> 2 above below 1
#> 3 below above 1
You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])
above below
above 2 1
below 1 0
EDIT in response to #HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U = matrix(runif(4*20),nrow = 20)
dfU=data.frame(round(U))
library(plyr) # for mapvalues
df2 = data.frame(apply(dfU,
FUN = function(x) mapvalues(x, from=0:1, to=c('below','above')),
MARGIN=2))
so that df2 contains random 'above' and 'below':
X1 X2 X3 X4
1 below above above above
2 below below above below
3 above above above below
4 above below above below
5 below below above above
6 above below above below
7 above below below below
8 above below below above
9 above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
FUN=function(x) as.vector(table(x[-1],
x[-nrow(df2)])),
MARGIN=2)
which gives us
X1 X2 X3 X4
[1,] 5 2 7 2
[2,] 5 6 4 6
[3,] 6 5 3 6
[4,] 3 6 5 5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
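One simple-minded way to label the rows (a sketch; it assumes both levels occur in every column, so each table is 2 x 2 in alphabetical order, with the "previous" level varying slowest when vectorized):
res <- apply(df2, MARGIN = 2,
             FUN = function(x) as.vector(table(x[-1], x[-nrow(df2)])))
rownames(res) <- paste(rep(c("above", "below"), each = 2),   # previous value
                       rep(c("above", "below"), times = 2),  # next value
                       sep = " -> ")
res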
Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. Single pass through the vector.
library(dplyr)
df %>%
  summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
            decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
            constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0)))
A slightly different version:
library(dplyr)
library(stringr)
df %>%
group_by(level = str_c(level, lead(level), sep = " ")) %>%
count(level) %>%
na.omit()
level n
<chr> <int>
1 above above 2
2 above below 1
3 below above 1
Another possible solution, based on tidyverse:
library(tidyverse)
df<-data.frame(value=c(130,200, 240, 230, 130),level=c("below", "above","above","above", "below"))
df %>%
mutate(changes = str_c(lag(level), level, sep = "_")) %>%
count(changes) %>% drop_na(changes)
#> changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)
dt<-data.table(value=c(130,200, 240, 230, 130),level=c("below", "above","above","above", "below"))
dt[, changes := paste(shift(level), level, sep = "_")
][2:.N][,.(n = .N), keyby = .(changes)]
#> changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1
I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got; I've been trying to tweak it but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)){
one <- maxsbp[[i]]
index <- which(one$sbp == max(one$sbp))
select <- one[index,]
r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep anything at or over that threshold. Then group_by id, arrange, and keep the first row.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
library(dplyr)
my.df %>%
  filter(sbp < 300) %>%   # keep only values below 300 (this also drops NA)
  group_by(id) %>%        # group by id
  arrange(-sbp) %>%       # arrange by sbp in descending order
  top_n(1, sbp)           # retain the first value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
In R, you will very rarely need explicit for loops for tasks like this.
There are functions available which will help you perform such grouped operations.
For example, in base R you can use subset and ave :
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x < 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr whose syntax is a little bit easier to understand.
library(dplyr)
df %>%
group_by(id) %>%
filter(sbp == max(sbp[sbp < 300], na.rm = TRUE))
slice_head can also be used
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Proceed simply like this (filtering out values of 300 or above first, so that the largest remaining sbp is kept for every id):
my.df %>% filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups: id, sex [3]
     id sex     sbp
  <dbl> <chr> <dbl>
1 13480 M       124
2 13520 M       124
3 13580 M       124
I have to perform some simple operations on a few vectors and rows.
Assume that I have a database such as:
observation outcome_1_a outcome_2_a outcome_1_b outcome_2_b choice_a choice_b
1 41 34 56 19 1 1
2 32 78 43 6 2 1
3 39 19 18 55 1 2
For each observation, outcome_1 and outcome_2 are the two possible outcomes, choice is the outcome chosen, and the suffix _i, with i = a, b, indexes the repetitions of the situation.
If I want to create variables storing the highest outcome for each situation (a, b), I can do:
max.a <- pmax(data$outcome_1_a, data$outcome_2_a)
max.b <- pmax(data$outcome_1_b, data$outcome_2_b)
Similarly, if I want to create variables storing the values chosen in each situation, I can do:
choice.a <- ifelse(data$choice_a == "1", data$outcome_1_a, data$outcome_2_a)
choice.b <- ifelse(data$choice_b == "1", data$outcome_1_b, data$outcome_2_b)
Finally, if I'd like to compute the mean by row for situations a and b, I can do:
library(data.table)
setDT(data)
data[, .(Mean = rowMeans(.SD)), by = observation, .SDcols = c("outcome_1_a","outcome_2_a", "outcome_1_b", "outcome_2_b")]
Now, all of these work just fine. However, I was wondering if such operations can be done in a more efficient way.
In the example there are only a few situations, but if in the future I have to deal with, say, 15 or more different situations (a, b, c, d, ...), writing such operations by hand will get tedious.
Is there a way to automate this process based on the different prefixes and/or suffixes of the variable names?
Thank you for your help
You can select columns with some regex. For example, to get your max.a value.
library(data.table)
setDT(data)
data[, do.call(pmax, .SD), .SDcols = names(data) %like% "\\d+_a$"]
[1] 41 78 39
Alternatively, you could select your columns with some regex outside of the data.table. Lots of ways to go about this.
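For instance, a rough sketch of that variant:
a_cols <- grep("outcome_\\d+_a$", names(data), value = TRUE)  # outcome_1_a, outcome_2_a
data[, do.call(pmax, .SD), .SDcols = a_cols]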
Similar application to your last command.
data[,
.(Mean = rowMeans(.SD)),
by = observation,
.SDcols = names(data) %like% "^outcome"]
observation Mean
1: 1 37.50
2: 2 39.75
3: 3 32.75
For choice.a, how would you choose between b, c, d, e etc?
For instance:
outcome_1_a outcome_2_a outcome_1_b outcome_2_b outcome_1_c outcome_2_c outcome_1_d outcome_2_d outcome_1_e outcome_2_e choice_a choice_b choice_c choice_d choice_e
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 85 32 28 91 42 32 96 27 29 2 1 1 1 1
2 17 22 84 53 11 69 16 66 11 41 1 2 2 1 1
3 92 98 76 83 18 27 21 51 92 41 1 1 1 1 2
4 63 49 61 64 100 28 43 51 22 94 1 2 1 1 1
Define an index variable that will help you go through the loops:
seqmax <- seq(1, 10, by = 2)
seqmax is 1 3 5 7 9. The reason is that there are 5 letters: "a", "b", "c", "d", "e", so this sequence patterns the loop. It can be automated for any number of letters: just find the index of the last column before choice_a, i.e. seq(1, grep(names(data), pattern = "choice_a") - 1, by = 2). The by = 2 argument can be adjusted to the number of columns per letter.
I use lapply with <<- to assign the new columns to data.
lapply(c(1:5), function(x){
data[, paste0("max.", letters[x])] <<- apply(data[, c(seqmax[x], seqmax[x] + 1)], 1, max)
data[, paste0("choice.", letters[x])] <<- ifelse(
data[, grep(names(data), pattern = paste0("choice_", letters[x]), value = T)] == 1,
data[, seqmax[x]], data[, seqmax[x] + 1])
data[, paste0("mean.", letters[x])] <<- rowMeans(
data[, grep(names(data), pattern = paste0("outcome_\\d+_", letters[x]), value = T)])
})
I have a data table like this
ID DAYS FREQUENCY
"ads" 20 3
"jwa" 45 2
"mno" 4 1
"ads" 13 3
"jwa" 60 2
"ads" 18 3
I want to add a column that, within each id, subtracts the closest smaller DAYS value from each DAYS value.
My new table would like like this:
ID    DAYS FREQUENCY DAYS DIFF
"ads"   20         3 2  (because 20 - 18)
"jwa"   45         2 NA (because there is no smaller DAYS value for that id)
"mno"    4         1 NA
"ads"   13         3 NA
"jwa"   60         2 15
"ads"   18         3 5
Bonus: Is there a way to use the merge function?
Here's an answer using dplyr:
require(dplyr)
mydata %>%
mutate(row.order = row_number()) %>% # row numbers added to preserve original row order
group_by(ID) %>%
arrange(DAYS) %>%
mutate(lag = lag(DAYS)) %>%
mutate(days.diff = DAYS - lag) %>%
ungroup() %>%
arrange(row.order) %>%
select(ID, DAYS, FREQUENCY, days.diff)
Output:
ID DAYS FREQUENCY days.diff
<fctr> <int> <int> <int>
1 ads 20 3 2
2 jwa 45 2 NA
3 mno 4 1 NA
4 ads 13 3 NA
5 jwa 60 2 15
6 ads 18 3 5
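As for the bonus question about merge: a rough base R sketch (not benchmarked; it assumes the data frame is called mydata as above) could self-join on ID, keep the pairs where the partner DAYS is smaller, take the closest one, and merge the difference back on:
m <- merge(mydata, mydata, by = "ID", suffixes = c("", ".other"))
m <- m[m$DAYS.other < m$DAYS, ]                               # keep only smaller partner days
closest <- aggregate(DAYS.other ~ ID + DAYS, data = m, max)   # closest smaller day per row
closest$days.diff <- closest$DAYS - closest$DAYS.other
merge(mydata, closest[, c("ID", "DAYS", "days.diff")], by = c("ID", "DAYS"), all.x = TRUE)
Note that merge sorts the result by ID and DAYS rather than keeping the original row order.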
You can do this using dplyr and a quick loop:
library(dplyr)
# Rowwise data.frame creation because I'm too lazy not to copy-paste the example data
df <- tibble::frame_data(
~ID, ~DAYS, ~FREQUENCY,
"ads", 20, 3,
"jwa", 45, 2,
"mno", 4, 1,
"ads", 13, 3,
"jwa", 60, 2,
"ads", 18, 3
)
# Subtract each number in a numeric vector with the one following it
rolling_subtraction <- function(x) {
out <- vector('numeric', length(x))
for (i in seq_along(out)) {
out[[i]] <- x[i] - x[i + 1] # x[i + 1] is NA if the index is out of bounds
}
out
}
# Arrange data.frame in order of ID / Days and apply rolling subtraction
df %>%
arrange(ID, desc(DAYS)) %>%
group_by(ID) %>%
mutate(days_diff = rolling_subtraction(DAYS))
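The helper could also be written without an explicit loop; the same idea using dplyr's lead() (a sketch):
df %>%
  arrange(ID, desc(DAYS)) %>%
  group_by(ID) %>%
  mutate(days_diff = DAYS - lead(DAYS))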