Conditional subset of data frame by special condition - r

df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
df1
df2 <-
data.frame(Sector=c("auto","auto","auto"),
Topic=c("1","2","3"),
Frequency=c(1,2,5))
df2
I have the data frame df1 above and want a conditional subset of it that looks like df2. The condition is as follows:
"If at least one observation in a sector has a frequency greater than 3, all observations of that sector should be kept; if not, all observations of that sector should be dropped."
In the example above, only the three observations of the auto sector remain; industry is dropped.
Does anybody have an idea how I might achieve this subset?

We can use group_by and filter from dplyr to achieve this.
library(dplyr)
df2 <- df1 %>%
group_by(Sector) %>%
filter(any(Frequency > 3)) %>%
ungroup()
df2
# # A tibble: 3 x 3
# Sector Topic Frequency
# <fct> <fct> <dbl>
# 1 auto 1 1.
# 2 auto 2 2.
# 3 auto 3 5.

Here is a solution with base R:
df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
subset(df1, ave(Frequency, Sector, FUN=max) >3)
and a solution with data.table:
library("data.table")
setDT(df1)[, if (max(Frequency)>3) .SD, by=Sector]
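To see why the ave() call in the base R answer works, it may help to look at the intermediate vector it returns. The following is only an illustrative sketch; the names group_max and kept are made up for it. ave() repeats each sector's maximum on every row of that sector, so comparing the result with 3 gives one TRUE/FALSE per row.

```r
df1 <- data.frame(
  Sector = c("auto", "auto", "auto", "industry", "industry", "industry"),
  Topic = c("1", "2", "3", "3", "5", "5"),
  Frequency = c(1, 2, 5, 2, 3, 2)
)

# ave() returns a vector as long as the input:
# every row receives the maximum Frequency of its own sector
group_max <- ave(df1$Frequency, df1$Sector, FUN = max)
print(group_max)  # 5 5 5 3 3 3

# Keep the rows whose sector maximum exceeds 3 (hypothetical helper name)
kept <- df1[group_max > 3, ]
print(kept)
```

Because the comparison happens row by row on a full-length vector, no merge back to the original data is needed.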

Related

Select rows from grouped dataframe based on duplicate values

I have a dataframe with 3 columns: the id of each individual, the number of the group they belong to (gr), and location codes (loc). What I am trying to do is identify which individuals visit 2 locations in the following sequence: Location 1 -> Location 2 -> Location 1.
Dummy dataset:
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df<- data.frame(id,gr, loc)
I have tried using a diff function, to identify differences between the locations:
dif<- diff(as.numeric(df$loc))
But I can't find any other way to move forward. In addition this approach doesn't account for the groups of each individual (and the ids repeat between groups). I was thinking maybe using a lag function but not sure how or if it helps at all. Any recommendations? Many thanks in advance, I'm still pretty new in R.
Desired output:
id<- c(1,4)
gr<- c(1,4)
out<- data.frame(cbind(id, gr))
A possible data.table option
unique(
setDT(df)[
,
q := rleid(loc), .(id, gr)
][
,
.SD[uniqueN(q) == 3 & first(loc) == last(loc)], .(id, gr)
][
,
.(id, gr)
]
)
gives
id gr
1: 1 1
2: 4 4
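The same run-based idea can be sketched in base R, assuming the pattern of interest is exactly three runs of loc where the first and last run share a value. The helper name is_aba is hypothetical, and rle() stands in for data.table's rleid().

```r
id  <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr  <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df  <- data.frame(id, gr, loc)

# Hypothetical helper: TRUE when the collapsed runs look like A -> B -> A
is_aba <- function(x) {
  runs <- rle(x)$values          # one value per run of identical locations
  length(runs) == 3 && runs[1] == runs[3]
}

# Evaluate the helper once per (id, gr) combination
flags <- aggregate(loc ~ id + gr, data = df, FUN = is_aba)

# Keep only the combinations where the pattern was found
result <- flags[as.logical(flags$loc), c("id", "gr")]
print(result)
```

Grouping on both id and gr matters here, since the question states that ids repeat between groups.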
Maybe this works
library(dplyr)
library(data.table)
df %>%
# ids repeat between groups, so group by both id and gr
group_by(id, gr) %>%
filter(n_distinct(rleid(loc)) >2) %>%
slice_tail(n = 1) %>%
select(-loc) %>%
ungroup
# A tibble: 2 x 2
# id gr
# <dbl> <dbl>
#1 1 1
#2 4 4

Aggregate data frame by column, filtering on a different column

I want to aggregate some columns of a data frame using a factor (group in the example) but I want to use only the rows with the highest values in a different column (time in the example)
df=data.frame(group=c(rep('a',5),rep('b',5)),
time=c(1:5,2:6),
V1=c(1,1,1,2,2,1,1,1,1,1),
V2=c(2,2,1,1,1,1,1,1,1,5))
I know how to do it using ddply but it's pretty slow
ddply(df,'group',summarize,
V1=sum(V1[order(time,decreasing = T)[1:2]]),
V2=sum(V2[order(time,decreasing = T)[1:2]]))
"group" "V1" "V2"
"a" 4 2
"b" 2 6
Is there a faster way to do it (aggregate or data.table)?
We can arrange the data by group and time, group_by group, and sum the last 2 values of each group (the rows with the highest time) using tail.
This can be done using dplyr :
library(dplyr)
df %>%
arrange(group, time) %>%
group_by(group) %>%
summarise_at(vars(V1:V2), ~sum(tail(., 2)))
# group V1 V2
# <fct> <dbl> <dbl>
#1 a 4 2
#2 b 2 6
and in data.table as :
library(data.table)
setDT(df)[order(group, time), lapply(.SD, function(x) sum(tail(x, 2))),
.SDcols = c('V1', 'V2'), group]
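For readers without dplyr or data.table, the same sort-then-tail idea can be sketched with base R's aggregate(). This is only an illustration under the assumption that "highest values" means the two largest time values per group; the names sorted and res are made up.

```r
df <- data.frame(group = c(rep("a", 5), rep("b", 5)),
                 time  = c(1:5, 2:6),
                 V1    = c(1, 1, 1, 2, 2, 1, 1, 1, 1, 1),
                 V2    = c(2, 2, 1, 1, 1, 1, 1, 1, 1, 5))

# Sort so that, within each group, the highest times come last
sorted <- df[order(df$group, df$time), ]

# Sum the last two rows of each group for both value columns
res <- aggregate(cbind(V1, V2) ~ group, data = sorted,
                 FUN = function(x) sum(tail(x, 2)))
print(res)
```

Sorting once up front avoids calling order() separately for every column, which is what the ddply version in the question does.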

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values. In this data the count is 4 for age group 1-2. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group ages and aggregate the values corresponding to each group?
I know this way:
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual Age.
I want the values aggregated over multiple ages as one group, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
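The arithmetic behind grp may not be obvious at first sight. As a small sketch, using the age values from the sample data that appears further down: each %in% test yields TRUE/FALSE, which R treats as 1/0, so the sum is 1 for ages 1-2, 2 for ages 3-4, and 0 for anything else.

```r
age <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4)

# TRUE counts as 1 and FALSE as 0, so this maps ages 1-2 to 1 and ages 3-4 to 2
grp <- (age %in% 1:2) + 2 * (age %in% 3:4)
print(grp)  # 1 1 1 1 2 2 2 2 2 2

# Count the rows that fall into each group code
counts <- tapply(age, grp, length)
print(counts)
```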
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
library(dplyr)
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = FALSE)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
# grp n
# <fct> <int>
# 1 (0,2] 4
# 2 (2,4] 6
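The interval labels produced by cut() can be replaced with friendlier ones. A minimal base R sketch, assuming the labels "1-2" and "3-4" are what is wanted (they are not from the original answer): by default the intervals are closed on the right, so (0,2] contains 1 and 2, and (2,4] contains 3 and 4.

```r
age <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4)

# Intervals are right-closed by default: (0,2] and (2,4]
# The labels below are illustrative names for those two intervals
grp <- cut(age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# Frequency of each age group
counts <- table(grp)
print(counts)
```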

Keeping IDs conditional on repeating variable

I have data that looks like this:
Is there a way I can very efficiently (without much R code) retain only the 'ID' cases where all instances of 'x' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
THIS ISSUE IS CLOSED - THERE ARE MULTIPLE STRONG ANSWERS IN THE COMMENTS BELOW
Using the data.table package, I was able to quickly pull this together:
library(data.table)
df <- data.table(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x ==0), y, x), by = ID][ident== TRUE] #aggregate, x, y and identifier by each ID
df[, ident := NULL] # get rid of redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the ID's where x is not zero (subset(df, x!=0)$ID), and then exclude cases with those ID's (!ID %in% subset(df, x!=0)$ID)
Try this: first get all IDs for which any row has a non-zero value, then use that to subset.
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the ID if all values (x) for the ID are 0.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(x == 0)) %>%
ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.
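All of the answers above rest on the same per-ID test. As a rough sketch in base R, with the made-up names all_zero, keep_ids and kept, tapply() shows that test as one flag per ID before it is used to filter rows:

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 y  = c(5, 6, 4, 6, 3, 1, 9, 5, 5),
                 x  = c(1, 0, 0, 0, 1, 1, 0, 0, 0))

# One logical per ID: TRUE only if every x value of that ID is zero
all_zero <- tapply(df$x == 0, df$ID, all)
print(all_zero)

# tapply() names its result by ID, so convert the names back to numbers
keep_ids <- as.numeric(names(all_zero)[all_zero])

# Keep the rows belonging to the flagged IDs
kept <- df[df$ID %in% keep_ids, ]
print(kept)
```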

To create a frequency table with dplyr to count the factor levels and missing values and report it

Some questions are similar to this topic (here or here, as an example) and I know one solution that works, but I want a more elegant response.
I work in epidemiology and I have variables coded 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the entries equal to "1" in each variable. It's a classic frequency table, but dplyr is making things more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But what I really want is to select the variables I need, count how many 0 (or NA) and how many 1 values each of them has, report it, and have this output
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups: var, val [6]
# var val n
# <chr> <fct> <int>
# 1 var1 0 100
# 2 var1 1 100
# 3 var1 NA 100
# 4 var2 0 100
# 5 var2 1 100
# 6 var2 NA 100
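A comparable table can also be sketched in base R without reshaping, assuming every variable is coded only as 0, 1 or NA as in the sample data. The name freq is made up; table() with useNA = "always" adds a row for missing values.

```r
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100),
                      var2 = rep(c(NA, 1, 0), 100))

# One column per variable, one row per value (0, 1, NA)
# factor() fixes the level order so every variable reports the same rows
freq <- sapply(dataset, function(v) {
  table(factor(v, levels = c(0, 1)), useNA = "always")
})
print(freq)
```

The result is a plain matrix, which is convenient when the counts for many variables need to be read side by side.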
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() reports the number of occurrences of each level of the factor.
