Select rows from grouped dataframe based on duplicate values - r

I have a dataframe with 3 columns: the id of each individual (id), the number of the group they belong to (gr), and location codes (loc). What I am trying to do is identify which individuals visit 2 locations in the following sequence: Location 1 -> Location 2 -> Location 1.
Dummy dataset:
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr<-c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df<- data.frame(id,gr, loc)
I have tried using a diff function, to identify differences between the locations:
dif<- diff(as.numeric(df$loc))
But I can't find a way to move forward from there. In addition, this approach doesn't account for the groups of each individual (and the ids repeat between groups). I was thinking of maybe using a lag function, but I'm not sure how, or whether it helps at all. Any recommendations? Many thanks in advance, I'm still pretty new to R.
Desired output:
id<- c(1,4)
gr<- c(1,4)
out<- data.frame(cbind(id, gr))

A possible data.table option
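library(data.table)
unique(
  setDT(df)[
    , q := rleid(loc), .(id, gr)  # number the runs of equal loc within each (id, gr)
  ][
    , .SD[uniqueN(q) == 3 & first(loc) == last(loc)], .(id, gr)  # exactly 3 runs, same first and last loc
  ][
    , .(id, gr)
  ]
)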
gives
id gr
1: 1 1
2: 4 4
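For readers without data.table, the same run-length idea can be sketched in base R with rle(). This is only an illustration under the question's dummy data; the helper name is_aba is made up here and is not part of any package.

```r
# Dummy data from the question
id  <- c(1,1,1,1,1,1,1,2,2,2,2,2,4,4,4,4,4,4,4,4)
gr  <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,4,4,4,4,4,4,4)
loc <- c(5,5,4,4,5,5,5,3,3,3,3,2,2,2,2,3,3,2,2,2)
df  <- data.frame(id, gr, loc)

# Hypothetical helper: TRUE when the run-collapsed sequence is exactly A -> B -> A
is_aba <- function(x) {
  v <- rle(x)$values               # collapse consecutive repeats, e.g. 5 5 4 4 5 -> 5 4 5
  length(v) == 3 && v[1] == v[3]   # three runs, first and last location equal
}

# Apply the check within each (id, gr) combination
flag <- aggregate(loc ~ id + gr, data = df, FUN = is_aba)

# Keep only the combinations that pass
out <- flag[flag$loc, c("id", "gr")]
rownames(out) <- NULL
out
```

On the dummy data this should keep the pairs (id 1, gr 1) and (id 4, gr 4), matching the desired output in the question.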

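Maybe this works
library(dplyr)
library(data.table)  # for rleid()
df %>%
  group_by(id, gr) %>%  # ids repeat across groups, so group by both columns
  filter(n_distinct(rleid(loc)) == 3, first(loc) == last(loc)) %>%  # exactly A -> B -> A
  slice_tail(n = 1) %>%
  select(-loc) %>%
  ungroup()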
# A tibble: 2 x 2
# id gr
# <dbl> <dbl>
#1 1 1
#2 4 4

Related

How to subset dataframe based on conditions between columns across rows depending on values

I have a dataframe with information on individual id, period and code of work place. I would like to know who are the individuals who have worked alone for the entire time span of the dataset.
Consider the very simple example below. Individual A worked alone at two work places (x,y) in period 1. Individual B and C worked together at work place z in period 1. Individual B worked alone at work place w in period 2. Individual D worked alone at place k in period 2.
mydf <- data.frame(id=c('A','A','B','C','B','D'),
period=c(1,1,1,1,2,2),
work_place=c('x','y','z','z','w','k'))
I would like to identify the rows concerning those who have worked alone for the entire period, which in this case are those referring individuals A and D.
ids_alone <- data.frame(id=c('A','A','D'),
period=c(1,1,2),
work_place=c('x','y','k'))
Grouped by 'period', 'work_place', create a column 'n' with the number of distinct 'id's, then grouped by 'id', filter those 'id's having all elements of 'n' as 1
library(dplyr)
mydf %>%
group_by(period, work_place) %>%
mutate(n = n_distinct(id)) %>%
group_by(id) %>%
filter(all(n ==1)) %>%
ungroup %>%
select(-n)
-output
# A tibble: 3 x 3
# id period work_place
# <chr> <dbl> <chr>
#1 A 1 x
#2 A 1 y
#3 D 2 k
A data.table option (following the same idea as @akrun)
library(data.table)
setDT(mydf)[
,
n := uniqueN(id),
.(period, work_place)
][
,
.SD[mean(n) == 1], id
][
,
n := NULL
][]
which gives
id period work_place
1: A 1 x
2: A 1 y
3: D 2 k
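The same two-step logic (count distinct ids per work place and period, then keep ids whose counts are all 1) can also be written in base R with ave(). This is a minimal sketch using the question's example data; the intermediate names n_ids and alone are chosen here for illustration.

```r
mydf <- data.frame(id = c('A','A','B','C','B','D'),
                   period = c(1,1,1,1,2,2),
                   work_place = c('x','y','z','z','w','k'))

# Step 1: for every row, the number of distinct ids sharing its (period, work_place)
n_ids <- ave(seq_along(mydf$id), mydf$period, mydf$work_place,
             FUN = function(i) length(unique(mydf$id[i])))

# Step 2: flag ids for which every row has n_ids == 1
alone <- ave(n_ids == 1, mydf$id, FUN = all)

# Rows of individuals who always worked alone
ids_alone <- mydf[alone, ]
ids_alone
```

Here ave() is passed row indices in step 1 so that the grouping function can look up the ids belonging to each (period, work_place) cell.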

How to group multiple rows based on some criteria and sum values in R?

Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values. In this data the count is 4 for age group 1-2. Similarly, I want to group ages 3-4 and count the values; here the count for age group 3-4 is 6.
How can I group age and aggregate the values correspond to it?
I know this way: code-
data.frame(df %>% group_by(df$Age) %>% tally())
But this aggregates the values for each individual age.
I want the values aggregated over a range of ages treated as one group, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
# grp n
# <chr> <int>
#1 1:2 4
#2 3:4 6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = FALSE)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
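If dplyr is not available, a base R sketch of the same binning idea is cut() followed by table(). Note that cut() builds right-closed intervals by default, so breaks of 0, 2, 4 give (0,2] and (2,4]. The labels "1-2" and "3-4" below are chosen for illustration only.

```r
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4))

# Bin ages into (0,2] and (2,4], with readable labels
grp <- cut(df$age, breaks = c(0, 2, 4), labels = c("1-2", "3-4"))

# Count rows per bin and return the result as a data frame
counts <- as.data.frame(table(grp), responseName = "n")
counts
```

With the example data this should give 4 rows in the first bin and 6 in the second, in line with the counts shown in the answers above.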

finding the minimum value of multiple variables by group

I would like to find the minimum value of a variable (time) at which each of several other variables is equal to 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to find this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here is my example data and my solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1==1,], .(cat), summarise,
start= min(time))
How about this, using dplyr
library(dplyr)
d %>%
  group_by(cat) %>%
  # across() replaces the deprecated summarise_at()/funs() combination
  summarise(across(contains("var"), ~ time[which(.x == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
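The one-liner above returns a single value per 'cat'. If one value per variable is wanted, as asked in the question, a base R sketch along the following lines might do it. The helper name first_time is made up here, and time is written with rep() instead of relying on recycling.

```r
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = rep(1:10, 2),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))

# Hypothetical helper: time of the first row where v equals 1 (NA if there is none)
first_time <- function(v, t) t[which(v == 1)[1]]

# All columns whose name starts with "var"
vars <- grep("^var", names(d), value = TRUE)

# For each variable, compute the first time per category via row indices
res <- sapply(vars, function(v)
  tapply(seq_len(nrow(d)), d$cat,
         function(i) first_time(d[[v]][i], d$time[i])))
res
```

The result is a matrix with one row per category and one column per variable, so res["A", "var1"] is the first time at which var1 equals 1 within category A.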
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")

Keeping IDs conditional on repeating variable

I have data that looks like this:
Is there a way I can very efficiently (without much R code) retain only 'ID' cases where instances of 'X' are equal to zero? For example, in this case only ID number 3 should be retained in my data set.
THIS ISSUE IS CLOSED - THERE ARE MULTIPLE STRONG ANSWERS IN THE COMMENTS BELOW
using the data.table package, I was able to quickly pull this together
library(data.table)
df <- data.table(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
df <- df[, .(ident = all(x ==0), y, x), by = ID][ident== TRUE] #aggregate, x, y and identifier by each ID
df[, ident := NULL] # get rid of redundant identifier column
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
subset(df, !ID %in% subset(df, x!=0)$ID)
That is, first find the ID's where x is not zero (subset(df, x!=0)$ID), and then exclude cases with those ID's (!ID %in% subset(df, x!=0)$ID)
try this:
first get all IDs for which any row has a non-zero value
Then use that to subset
df <- data.frame(ID=c(1,1,1,2,2,2,3,3,3), y=c(5,6,4,6,3,1,9,5,5), x=c(1,0,0,0,1,1,0,0,0))
exclude <- subset(df, x!=0)$ID
new_df <- subset(df, ! ID %in% exclude)
A base R option using ave, where we select the ID if all values (x) for the ID are 0.
df[ave(df$x == 0, df$ID, FUN = all), ]
# ID y x
#7 3 9 0
#8 3 5 0
#9 3 5 0
An equivalent dplyr solution would be
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(x == 0)) %>%
ungroup()
# A tibble: 3 x 3
# ID y x
# <dbl> <dbl> <dbl>
#1 3. 9. 0.
#2 3. 5. 0.
#3 3. 5. 0.

Conditional subset of data frame by special condition

df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
df1
df2 <-
data.frame(Sector=c("auto","auto","auto"),
Topic=c("1","2","3"),
Frequency=c(1,2,5))
df2
I have dataframe 1 (df1) above and want a conditional subset of it that looks like df2. The condition is as follows:
"If at least one observation of a sector has a frequency larger than 3, all observations of that sector should be kept; if not, all observations of that sector should be dropped."
In the example above, only the three observations of the auto sector remain; industry is dropped.
Does anybody have an idea how I might achieve this subset?
We can use group_by and filter from dplyr to achieve this.
library(dplyr)
df2 <- df1 %>%
group_by(Sector) %>%
filter(any(Frequency > 3)) %>%
ungroup()
df2
# # A tibble: 3 x 3
# Sector Topic Frequency
# <fct> <fct> <dbl>
# 1 auto 1 1.
# 2 auto 2 2.
# 3 auto 3 5.
Here is a solution with base R:
df1 <-
data.frame(Sector=c("auto","auto","auto","industry","industry","industry"),
Topic=c("1","2","3","3","5","5"),
Frequency=c(1,2,5,2,3,2))
subset(df1, ave(Frequency, Sector, FUN=max) >3)
and a solution with data.table:
library("data.table")
setDT(df1)[, if (max(Frequency)>3) .SD, by=Sector]
