Remove IDs with fewer than 9 unique observations - r

I am trying to filter my data and remove IDs that have fewer than 9 unique month observations. I would also like to create a list of IDs that includes the count.
I've tried using a few different options:
library(dplyr)
count <- bind %>% group_by(IDS) %>% filter(n(data.month)>= 9) %>% ungroup()
count2 <- subset(bind, with(bind, IDS %in% names(which(table(data.month)>=9))))
Neither of these worked.
This is what my data looks like:
data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4
In the end, I would like this:
IDs
2
4
I would also like this
IDs Count
2 12
5 2
7 1
4 12
So far this code is the closest, but still just gives error codes:
count <- bind %>%
group_by(IDs) %>%
filter(length(unique(bind$data.month >=9)))
Error in filter_impl(.data, quo) :
Argument 2 filter condition does not evaluate to a logical vector

You can do this with unique and length:
library(dplyr)
df %>% group_by(ID) %>% summarise(Count=length(unique(data.month)))
# A tibble: 4 x 2
ID Count
<int> <int>
1 2 12
2 4 12
3 5 2
4 7 1
If you want to get the IDs with at least 9 unique months:
df %>% group_by(ID) %>% summarise(Count = length(unique(data.month))) %>% filter(Count >= 9) %>% select(ID)
# A tibble: 2 x 1
ID
<int>
1 2
2 4
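For reference, the error in the question most likely comes from where the comparison sits: `length(unique(x >= 9))` compares first and then counts the distinct TRUE/FALSE values, so it returns an integer rather than a logical. A minimal base R sketch of the difference, using a made-up vector `months` that is not taken from the question:

```r
# Hypothetical month values for a single ID (illustration only)
months <- c(1, 2, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12)

# Comparison inside unique(): counts distinct TRUE/FALSE values -> an integer
wrong <- length(unique(months >= 9))

# Comparison outside: counts distinct months, then tests the count -> a logical
right <- length(unique(months)) >= 9

print(wrong)  # 2
print(right)  # TRUE
```

filter() needs the second form, because it expects one logical value per group (or per row).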

We can use n_distinct.
To remove IDs with fewer than 9 unique observations:
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
pull(ID) %>% unique
#[1] 2 4
Or
df %>%
group_by(ID) %>%
filter(n_distinct(data.month) >= 9) %>%
distinct(ID)
# ID
# <int>
#1 2
#2 4
For unique counts of each ID
df %>%
group_by(ID) %>%
summarise(count = n_distinct(data.month))
# ID count
# <int> <int>
#1 2 12
#2 4 12
#3 5 2
#4 7 1
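If dplyr is not available, the same counts and filter can be cross-checked in base R. This is only a sketch: it assumes the question's sample data is rebuilt by hand as a data frame called `df`, and the names `counts`, `keep_ids` and `filtered` are made up for the example.

```r
# Sample data from the question, rebuilt as a base R data frame
df <- data.frame(
  data.month = c(1:4, 5, 5, 6, 6, rep(7, 5), 8:12,   # ID 2
                 1, 1, 2,                            # ID 5
                 1, 1, 1,                            # ID 7
                 1:4, 5, 5, 6, 6, rep(7, 5), 8:12),  # ID 4
  ID = c(rep(2, 18), rep(5, 3), rep(7, 3), rep(4, 18))
)

# Number of distinct months per ID (names of the result are the IDs)
counts <- tapply(df$data.month, df$ID, function(m) length(unique(m)))

# IDs with at least 9 distinct months
keep_ids <- as.numeric(names(counts)[counts >= 9])

# Rows belonging to those IDs only
filtered <- df[df$ID %in% keep_ids, ]

print(counts)
print(keep_ids)        # 2 4
print(nrow(filtered))  # 36
```

Note that tapply sorts the IDs, so the counts come back in the order 2, 4, 5, 7 rather than in order of first appearance.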

Here is a data.table approach:
library(data.table)
# IDs with 9 or more unique months
DT[, .(n = uniqueN(data.month)), by = ID][n >= 9, ID]
#[1] 2 4
# Number of unique months per ID
unique(DT, by = c("data.month", "ID"))[, .(counts = .N), by = .(IDs = ID)]
# IDs counts
# 1: 2 12
# 2: 5 2
# 3: 7 1
# 4: 4 12
sample data
DT <- fread("data.month ID
01 2
02 2
03 2
04 2
05 2
05 2
06 2
06 2
07 2
07 2
07 2
07 2
07 2
08 2
09 2
10 2
11 2
12 2
01 5
01 5
02 5
01 7
01 7
01 7
01 4
02 4
03 4
04 4
05 4
05 4
06 4
06 4
07 4
07 4
07 4
07 4
07 4
08 4
09 4
10 4
11 4
12 4")

Related

Conditional rolling counting function

I would like to implement a rolling count function for the working days in a month. Weekends (Saturday and Sunday) should be assigned a NA.
A replicable example:
#Change the language if you are in a non-English locale like me
Sys.setlocale("LC_TIME", "C")
workdays <- c("Mon","Tue","Wed","Thu","Fri")
dataset <- data.frame(Date = seq(as.Date("2020-03-01"),as.Date("2020-04-01")-1,"days"))
dataset$Day <- format(dataset$Date,format="%d")
dataset$WeekDay <- format(dataset$Date,format="%a")
dataset$Month <- format(dataset$Date,format="%m")
dataset$Year <- format(dataset$Date,format="%y")
dataset$Workday <- dataset$WeekDay %in% workdays
I wanted to use dplyr, grouped by the respective month and year, to count the working days cumulatively.
dataset %>%
group_by(Month,Year) %>%
mutate(WorkdayNo = ???)
In my example, the first ten rows should then look like this:
[1] NA 1 2 3 4 5 NA NA 6 7 (...)
cumsum with if_else should help:
library(dplyr)
dataset %>%
group_by(Month,Year) %>%
mutate(WorkdayNo = if_else(Workday, cumsum(Workday), NA_integer_)) %>%
ungroup
# Date Day WeekDay Month Year Workday WorkdayNo
# <date> <chr> <chr> <chr> <chr> <lgl> <int>
# 1 2020-03-01 01 Sun 03 20 FALSE NA
# 2 2020-03-02 02 Mon 03 20 TRUE 1
# 3 2020-03-03 03 Tue 03 20 TRUE 2
# 4 2020-03-04 04 Wed 03 20 TRUE 3
# 5 2020-03-05 05 Thu 03 20 TRUE 4
# 6 2020-03-06 06 Fri 03 20 TRUE 5
# 7 2020-03-07 07 Sat 03 20 FALSE NA
# 8 2020-03-08 08 Sun 03 20 FALSE NA
# 9 2020-03-09 09 Mon 03 20 TRUE 6
#10 2020-03-10 10 Tue 03 20 TRUE 7
# … with 21 more rows
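The same idea can be sketched in base R with ave(), which applies cumsum within each month/year group. To avoid depending on the locale, this version derives the weekday from `as.POSIXlt(...)$wday` instead of the formatted day name; that swap and the helper names `wday` and `running` are my own choices, not part of the answer above.

```r
# March 2020, as in the question
dataset <- data.frame(
  Date = seq(as.Date("2020-03-01"), as.Date("2020-03-31"), by = "day")
)

# wday is 0 for Sunday and 6 for Saturday, independent of the locale
wday <- as.POSIXlt(dataset$Date)$wday
dataset$Workday <- !(wday %in% c(0, 6))
dataset$Month <- format(dataset$Date, "%m")
dataset$Year <- format(dataset$Date, "%Y")

# Running count of working days within each month/year group
running <- ave(as.integer(dataset$Workday), dataset$Month, dataset$Year,
               FUN = cumsum)

# Blank out the weekends
dataset$WorkdayNo <- ifelse(dataset$Workday, running, NA_integer_)

print(head(dataset$WorkdayNo, 10))  # NA 1 2 3 4 5 NA NA 6 7
```

The cumulative sum keeps running across weekends, which is why Monday the 9th continues at 6 rather than restarting.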

In R how can I find the number of connections I have in a given dataframe and produce a variable representing it?

So I currently have a dataframe which represents a social network like follows:
id age id1 id2 id3
01 14 02 05 03
02 23 01 05 03
03 52 04 01 02
04 41 03
05 32 01 02
Ideally I would like a new data frame like the following:
id age id1 id2 id3 Connections
01 14 02 05 03 3
02 23 01 05 03 3
03 52 04 01 02 3
04 41 03 1
05 32 01 02 2
With a new variable that represents the number of connections the "id" has. As of now I have code like the following:
links <- df
links <- as.matrix(links)
links <- as.data.frame(rbind(links[,c(1,3)], links[,c(1,4)]), links[,c(1,5)])
head(links)
library(igraph)
g = graph.data.frame(links)
m = as.matrix(get.adjacency(g))
m
pmax(rowSums(m), colSums(m))
Which gives me:
1 2 3 4 5 NA
3 3 3 1 2 3
How can I then incorporate this into the dataframe to create the "Connections" variable? Ideally my other data contains up to 50 connections so I would like an easier way in which I don't have to recreate a dataframe.
A quick tidyverse way is to reshape the data into a long shape, add up how many non-NA values each ID has, and reshape back to wide.
library(tidyverse)
df %>%
gather(key = key, value = val, -id, -age) %>%
group_by(id, age) %>%
mutate(connections = sum(!is.na(val))) %>%
head()
#> # A tibble: 6 x 5
#> # Groups: id, age [5]
#> id age key val connections
#> <chr> <dbl> <chr> <chr> <int>
#> 1 01 14 id1 02 3
#> 2 02 23 id1 01 3
#> 3 03 52 id1 04 3
#> 4 04 41 id1 03 1
#> 5 05 32 id1 01 2
#> 6 01 14 id2 05 3
df %>%
gather(key = key, value = val, -id, -age) %>%
group_by(id, age) %>%
mutate(connections = sum(!is.na(val))) %>%
spread(key = key, value = val)
#> # A tibble: 5 x 6
#> # Groups: id, age [5]
#> id age connections id1 id2 id3
#> <chr> <dbl> <int> <chr> <chr> <chr>
#> 1 01 14 3 02 05 03
#> 2 02 23 3 01 05 03
#> 3 03 52 3 04 01 02
#> 4 04 41 1 03 <NA> <NA>
#> 5 05 32 2 01 02 <NA>
But I wouldn't consider your first approach wrong. Since you're working with a network, it makes sense to use network analysis tools and calculate the degree of each node, same as the number of connections.
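As a rough illustration of the degree idea without igraph, the sketch below builds an undirected edge list in base R and counts how many distinct edges touch each id. It assumes the question's data is typed in by hand with missing links as NA; `link_cols`, `edges` and `undirected` are names invented for the example.

```r
# Data from the question; ids kept as character, missing links as NA
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA),
  stringsAsFactors = FALSE
)

link_cols <- c("id1", "id2", "id3")

# Long edge list: one row per (id, linked id) pair, NAs dropped
edges <- data.frame(
  from = rep(df$id, times = length(link_cols)),
  to   = unlist(df[link_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
edges <- edges[!is.na(edges$to), ]

# Treat links as undirected: order each pair, then drop repeats
undirected <- unique(data.frame(
  a = pmin(edges$from, edges$to),
  b = pmax(edges$from, edges$to),
  stringsAsFactors = FALSE
))

# Degree = number of undirected edges touching each id (0 if isolated)
degree <- table(factor(c(undirected$a, undirected$b), levels = df$id))
df$Connections <- as.integer(degree[df$id])

print(df$Connections)  # 3 3 3 1 2
```

Unlike a plain count of non-NA columns, this also picks up links that only the other person listed, which matters if the network is not recorded symmetrically.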
library(dplyr)
# Toy data
df = data.frame(id = c(1,2,3,4),
age = c(1, 1, 1, 1),
id1 = c(1, 2, 3, 4),
id2 = c(1, 2, 3, NA),
id3 = c(1,2, NA, NA))
df$Connections = df %>%
select(-id, -age) %>% # Remove unnecessary columns
apply(1, function(row) {
binary_row = as.numeric(!is.na(row)) # Convert each column to binary
sum(binary_row) # Return connection count
})
What about something like this:
First, using regex we determine the columns corresponding to connections
# here connections columns must contain the pattern "id"+digit(s)
connectionsNames <- grepl("id\\d+", names(df), perl = TRUE)
Then we use rowSums to create the new column
# count the non-NA entries in the connection columns only
df$connections <- rowSums(!is.na(df[, connectionsNames, drop = FALSE]))
Here the result
df
id age id1 id2 id3 connections
1 1 1 1 1 1 3
2 2 1 2 2 2 3
3 3 1 3 3 NA 2
4 4 1 4 NA NA 1
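Applied to the data from the question rather than the toy data, the rowSums idea might look like the sketch below. It assumes empty cells were read in as NA, and it anchors the pattern so that only names of the form "id" plus digits count as connection columns; `link_cols` is just a name chosen for the example.

```r
# Data from the question, with empty cells as NA
df <- data.frame(
  id  = c("01", "02", "03", "04", "05"),
  age = c(14, 23, 52, 41, 32),
  id1 = c("02", "01", "04", "03", "01"),
  id2 = c("05", "05", "01", NA, "02"),
  id3 = c("03", "03", "02", NA, NA),
  stringsAsFactors = FALSE
)

# Connection columns: "id" followed by digits only (so plain "id" is excluded)
link_cols <- grepl("^id[0-9]+$", names(df))

# Number of non-missing links per row
df$Connections <- rowSums(!is.na(df[, link_cols, drop = FALSE]))

print(df$Connections)  # 3 3 3 1 2
```

Because the column selection is pattern-based, the same two lines should carry over to a data frame with id1 through id50.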

dplyr: filter a value by existing in two conditions [duplicate]

I have an R dataset x as below:
ID Month
1 1 Jan
2 3 Jan
3 4 Jan
4 6 Jan
5 6 Jan
6 9 Jan
7 2 Feb
8 4 Feb
9 6 Feb
10 8 Feb
11 9 Feb
12 10 Feb
13 1 Mar
14 3 Mar
15 4 Mar
16 6 Mar
17 7 Mar
18 9 Mar
19 2 Apr
20 4 Apr
21 6 Apr
22 7 Apr
23 8 Apr
24 10 Apr
25 1 May
26 2 May
27 4 May
28 6 May
29 7 May
30 8 May
31 2 Jun
32 4 Jun
33 5 Jun
34 6 Jun
35 9 Jun
36 10 Jun
I am trying to figure out an R function/code to identify all IDs that exist at least once in every month.
In the above case, ID 4 & 6 are present in all months.
Thanks
First, split the df$ID by Month and use intersect to find elements common in each sub-group.
Reduce(intersect, split(df$ID, df$Month))
#[1] 4 6
If you want to subset the corresponding data.frame, do
df[df$ID %in% Reduce(intersect, split(df$ID, df$Month)),]
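To see what Reduce is doing here, the sketch below rebuilds the question's data by hand and also keeps the intermediate intersections with `accumulate = TRUE`. The names `by_month`, `common` and `steps` are only for illustration.

```r
# Data from the question
df <- data.frame(
  ID = c(1, 3, 4, 6, 6, 9,     # Jan
         2, 4, 6, 8, 9, 10,    # Feb
         1, 3, 4, 6, 7, 9,     # Mar
         2, 4, 6, 7, 8, 10,    # Apr
         1, 2, 4, 6, 7, 8,     # May
         2, 4, 5, 6, 9, 10),   # Jun
  Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun"), each = 6),
  stringsAsFactors = FALSE
)

# One vector of IDs per month (months come out in alphabetical order)
by_month <- split(df$ID, df$Month)

# Fold intersect over the list: intersect(intersect(Apr, Feb), Jan), and so on
common <- Reduce(intersect, by_month)

# accumulate = TRUE keeps every intermediate result
steps <- Reduce(intersect, by_month, accumulate = TRUE)

print(common)      # 4 6
print(steps[[2]])  # 2 4 6 8 10
```

The order of the months does not affect the final result, since set intersection is commutative and associative.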
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'ID', get the row index (.I) where the number of unique 'Month' values equals the number of unique 'Month' values in the whole dataset, and subset the data with it:
library(data.table)
setDT(df1)[df1[, .I[uniqueN(Month) == uniqueN(df1$Month)], ID]$V1]
# ID Month
# 1: 4 Jan
# 2: 4 Feb
# 3: 4 Mar
# 4: 4 Apr
# 5: 4 May
# 6: 4 Jun
# 7: 6 Jan
# 8: 6 Jan
# 9: 6 Feb
#10: 6 Mar
#11: 6 Apr
#12: 6 May
#13: 6 Jun
To extract the 'ID's
setDT(df1)[, ID[uniqueN(Month) == uniqueN(df1$Month)], ID]$V1
#[1] 4 6
Or with base R
1) Using table with rowSums
v1 <- rowSums(table(df1) > 0)
names(v1)[v1==max(v1)]
#[1] "4" "6"
This info can be used for subsetting the data
subset(df1, ID %in% names(v1)[v1 == max(v1)])
2) Using tapply
lst <- with(df1, tapply(Month, ID, FUN = unique))
names(which(lengths(lst) == length(unique(df1$Month))))
#[1] "4" "6"
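One caveat on the table approach in 1): comparing against `max(v1)` assumes that at least one ID really is present in every month. A slightly more defensive sketch compares against the number of month columns instead. The data is rebuilt by hand here, and `tab`, `months_per_id` and `ids_all` are names picked for the example.

```r
# Data from the question
df1 <- data.frame(
  ID = c(1, 3, 4, 6, 6, 9,     # Jan
         2, 4, 6, 8, 9, 10,    # Feb
         1, 3, 4, 6, 7, 9,     # Mar
         2, 4, 6, 7, 8, 10,    # Apr
         1, 2, 4, 6, 7, 8,     # May
         2, 4, 5, 6, 9, 10),   # Jun
  Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun"), each = 6),
  stringsAsFactors = FALSE
)

# Rows are IDs, columns are months, cells are row counts
tab <- table(df1$ID, df1$Month)

# tab > 0 turns counts into presence/absence; rowSums counts months per ID
months_per_id <- rowSums(tab > 0)

# Keep IDs whose month count equals the total number of months
ids_all <- names(months_per_id)[months_per_id == ncol(tab)]

print(ids_all)  # "4" "6"
```

If no ID covers every month, this returns an empty character vector, whereas the `max(v1)` version would return the IDs with the best partial coverage.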
Or using dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n_distinct(Month)== n_distinct(df1$Month)) %>%
.$ID %>%
unique
#[1] 4 6
or if we need to get the rows
df1 %>%
group_by(ID) %>%
filter(n_distinct(Month)== n_distinct(df1$Month))
# A tibble: 13 x 2
# Groups: ID [2]
# ID Month
# <int> <chr>
# 1 4 Jan
# 2 6 Jan
# 3 6 Jan
# 4 4 Feb
# 5 6 Feb
# 6 4 Mar
# 7 6 Mar
# 8 4 Apr
# 9 6 Apr
#10 4 May
#11 6 May
#12 4 Jun
#13 6 Jun
An alternative solution using dplyr and purrr:
tib %>%
dplyr::group_by(Month) %>%
dplyr::group_split(.keep = FALSE) %>%
purrr::reduce(dplyr::intersect)
# A tibble: 2 x 1
# ID
# <dbl>
# 1 4
# 2 6
returns the desired IDs, where tib is a tibble containing the input data.

Counting the distinct values for each day and group and inserting the value in an array in R

I want to transform the data below to give me an association array with the count of each unique id in each group for each day. So, for example, from the data below
Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B
I want to make an array that gives:
[1] (04/26/2014)
Grp 1 2 3
1 0 1 0
2 1 0 0
3 0 0 0
[2] (05/12/2014)
Grp 1 2 3
1 0 0 1
2 0 0 2
3 1 2 0
[3] (05/19/2015)
Grp 1 2 3
1 0 1 0
2 1 0 1
3 0 1 0
The 'Grp' label is just to indicate the group number. I know how to count the distinct values in the table overall, but I'm trying to use for loops to insert the appropriate count for each day: for example, the number of unique IDs present in both group 1 and group 2 on 04/26/2014 should go into the group 1 / group 2 cells of that day's association matrix. Any help would be appreciated.
I don't quite understand how you get the second one, but you can try this
dd <- read.table(header = TRUE, text = "Year Month Day Group ID
2014 04 26 1 A
2014 04 26 1 B
2014 04 26 2 B
2014 04 26 2 C
2014 05 12 1 B
2014 05 12 2 E
2014 05 12 2 F
2014 05 12 2 G
2014 05 12 3 G
2014 05 12 3 F
2015 05 19 1 F
2015 05 19 1 D
2015 05 19 2 E
2015 05 19 2 G
2015 05 19 2 D
2015 05 19 3 A
2015 05 19 3 E
2015 05 19 3 B")
dd <- within(dd, {
date <- as.Date(apply(dd[, 1:3], 1, paste0, collapse = '-'))
Group <- factor(Group)
Year <- Month <- Day <- NULL
})
E.g., for the first one:
sp <- split(dd, dd$date)[[1]]
tbl <- table(sp$ID, sp$Group)
`diag<-`(crossprod(tbl), 0)
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
And do them all at once
lapply(split(dd, dd$date), function(x) {
cp <- crossprod(table(x$ID, x$Group))
diag(cp) <- 0
cp
})
# $`2014-04-26`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 0
# 3 0 0 0
#
# $`2014-05-12`
#
# 1 2 3
# 1 0 0 0
# 2 0 0 2
# 3 0 2 0
#
# $`2015-05-19`
#
# 1 2 3
# 1 0 1 0
# 2 1 0 1
# 3 0 1 0
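Why crossprod works here: the ID-by-Group table is an incidence matrix, so entry (i, j) of t(tbl) %*% tbl adds up, over all IDs, whether the ID is in both group i and group j. The sketch below checks that against explicit set intersections and then stacks the daily matrices into the 3-D array the question asks for. It assumes each ID appears at most once per group per day (true for this data; repeats would inflate the crossprod counts), and the function names `shared_counts` and `shared_crossprod` are made up.

```r
# Data from the question, with the date already combined
dd <- data.frame(
  date = as.Date(rep(c("2014-04-26", "2014-05-12", "2015-05-19"),
                     times = c(4, 6, 8))),
  Group = factor(c(1, 1, 2, 2,
                   1, 2, 2, 2, 3, 3,
                   1, 1, 2, 2, 2, 3, 3, 3)),
  ID = c("A", "B", "B", "C",
         "B", "E", "F", "G", "G", "F",
         "F", "D", "E", "G", "D", "A", "E", "B"),
  stringsAsFactors = FALSE
)

# Shared-ID counts for one day, written out with explicit intersections
shared_counts <- function(x) {
  groups <- levels(x$Group)
  ids <- split(x$ID, x$Group)  # IDs per group; empty groups are kept
  m <- sapply(groups, function(g) sapply(groups, function(h)
    length(intersect(ids[[g]], ids[[h]]))))
  diag(m) <- 0                 # a group is not compared with itself
  m
}

# Same counts via crossprod on the ID x Group incidence table
shared_crossprod <- function(x) {
  cp <- crossprod(table(x$ID, x$Group))
  diag(cp) <- 0
  cp
}

per_day <- split(dd, dd$date)
by_loop <- lapply(per_day, shared_counts)
by_cp   <- lapply(per_day, shared_crossprod)

# Stack the daily matrices into a 3 x 3 x (number of days) array
arr <- simplify2array(by_cp)
print(arr[, , "2014-05-12"])
```

Both routes give the same matrices for this data; the array form lets you index a single day with `arr[, , "2014-05-12"]` or a single group pair across days with `arr["2", "3", ]`.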
A possible solution with dplyr and tidyr is as follows (here df is the input data frame):
library(dplyr)
library(tidyr)
df$date <- as.Date(paste(df$Year, df$Month, df$Day, sep = '-'))
df %>%
expand(date, Group) %>%
left_join(., df) %>%
group_by(date, Group) %>%
summarise(nID = n_distinct(ID)) %>%
split(., .$date)
Resulting output:
$`2014-04-26`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-04-26 1 2
2 2014-04-26 2 2
3 2014-04-26 3 1
$`2014-05-12`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2014-05-12 1 1
2 2014-05-12 2 3
3 2014-05-12 3 2
$`2015-05-19`
Source: local data frame [3 x 3]
Groups: date [1]
date Group nID
(date) (int) (int)
1 2015-05-19 1 2
2 2015-05-19 2 3
3 2015-05-19 3 3
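One thing to watch in this output: group 3 on 2014-04-26 shows nID 1 even though it has no rows that day. After expand and left_join its ID is NA, and n_distinct counts NA as a value unless `na.rm = TRUE` is passed. A base R sketch of the same effect, with the two ID vectors typed in by hand as an assumption about what the join produces:

```r
# IDs for group 3 on 2014-04-26 after the join: nothing matched, so one NA
ids_empty_group <- NA_character_

# IDs for group 1 on the same day
ids_group1 <- c("A", "B")

# Distinct values with NA included (what the output above reflects)
with_na <- length(unique(ids_empty_group))

# Distinct non-missing values only
without_na <- length(unique(ids_empty_group[!is.na(ids_empty_group)]))

print(with_na)                     # 1
print(without_na)                  # 0
print(length(unique(ids_group1)))  # 2
```

In the dplyr pipeline, `n_distinct(ID, na.rm = TRUE)` should give 0 for such empty groups.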
