How to prevent R from dropping groups of size 0

I am running the following code to count the number of days on which each site is over 90.
temp2 <- temp %>%
filter(ds > '2017-12-31') %>%
filter(over90 == 1) %>%
group_by(site) %>%
tally()
However, this drops sites that have no days over 90. I want my output to include those sites with a value of zero rather than omitting them entirely. Is there any way to do this?
Pretend data:
site over90
a 1
a 0
a 1
b 0
What happens:
site n
a 2
What I want:
site n
a 2
b 0

If your site is a factor variable, you can use tidyr::complete() to fill in the blanks at the end:
library(dplyr)
library(tidyr)
filter(dd, over90 == 1) %>%
  group_by(site) %>%
  tally() %>%
  complete(site, fill = list(n = 0))
# A tibble: 2 x 2
site n
<fctr> <dbl>
1 a 2
2 b 0
Data:
dd <- read.table(text =
"site over90
a 1
a 0
a 1
b 0",
header = TRUE)
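On R 4.0 and later, read.table() returns site as a character column by default, in which case complete(site, ...) has no factor levels to expand from. A minimal sketch for that case, assuming the full set of sites can be taken from the original data, is to pass the values to complete() explicitly:
library(dplyr)
library(tidyr)
dd %>%
  filter(over90 == 1) %>%
  group_by(site) %>%
  tally() %>%
  complete(site = unique(dd$site), fill = list(n = 0))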

tally() isn't doing quite what you want here. Since you just want to count the rows where over_90 equals one, and that variable is binary, you can sum the column to get the effect of counting without filtering any groups away.
library(dplyr)
df = tibble(site = sample(letters[1:3], 10, replace = TRUE),
            over_90 = sample(0:1, 10, replace = TRUE))
df
# A tibble: 10 x 2
# site over_90
# <chr> <int>
# 1 c 0
# 2 b 1
# 3 c 0
# 4 b 0
# 5 c 1
# 6 b 0
# 7 a 0
# 8 a 1
# 9 c 1
#10 c 0
df$over_90[df$site == 'c'] = 0  # force site c to have no days over 90, to show it is still kept
df %>% group_by(site) %>% summarise(n = sum(over_90))
# A tibble: 3 x 2
# site n
# <chr> <dbl>
#1 a 1
#2 b 1
#3 c 0
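As a sketch of the same idea, dplyr::count() with a wt argument should give an equivalent one-liner, since wt sums the given column within each group:
df %>% count(site, wt = over_90)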

One option is to use table() instead. But you first have to convert site to a factor so that counts for all levels are included when the table() function is called.
library(dplyr)
df %>% mutate(site = as.factor(site)) %>% #This line is very important.
filter(over90 ==1) %>%
group_by(site) %>%
table()
# site 1
# a 2
# b 0
Data:
df <- read.table(text =
"site over90
a 1
a 0
a 1
b 0",
header = TRUE, stringsAsFactors = FALSE)
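If you need the result as a data frame rather than a table object, one possible follow-up (a sketch, not part of the original answer) is to pipe the counts through as.data.frame():
library(dplyr)
df %>%
  mutate(site = as.factor(site)) %>%
  filter(over90 == 1) %>%
  group_by(site) %>%
  table() %>%
  as.data.frame() %>%   # columns: site, over90 (all 1 after the filter), Freq
  select(site, n = Freq)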

How to run Excel-like formulas using dplyr?

In the reproducible R code below, I'd like to add a column "adjust" that results from a series of calculations that in Excel would use cumulative COUNTIFS, MAX, and MATCH formulas. (To be fully general, the adjust column should also use MATCH, since there could be more than one element in the list starting in row 15, but I think the intent is clear without it.) The yellow shading in my illustration shows what the reproducible code generates, and the blue shading shows the series of Excel calculations that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)
myData <- data.frame(
  Element = c("A", "B", "B", "B", "B", "B", "B", "B"),
  Group = c(0, 1, 1, 1, 2, 2, 3, 3)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
as.data.frame()
myDataGroups
We can use data.table::rowid() to recompute 'Group' as the running count within each Element/Group combination, flag counts greater than 2 as the binary 'excessOver2', then take the cumulative sum of that flag and lag it to get 'adjust':
library(dplyr)
library(data.table)
myDataGroups %>%
  mutate(Group = rowid(Element, Group),
         excessOver2 = +(Group > 2),
         adjust = lag(cumsum(excessOver2), default = 0))
Output:
Element Group origOrder ElementCnt groupCt excessOver2 adjust
1 A 1 1 1 -1 0 0
2 B 1 2 1 0 0 0
3 B 2 3 2 0 0 0
4 B 3 4 3 0 1 0
5 B 1 5 4 1 0 1
6 B 2 6 5 1 0 1
7 B 1 7 6 2 0 1
8 B 2 8 7 2 0 1
Another dplyr option counts rows within each Element/Group combination and flags any row beyond the second:
library(dplyr)
myData %>%
group_by(Element, Group) %>%
summarize(ElementCnt = row_number(), over2 = 1 * (ElementCnt > 2),
.groups = "drop_last") %>%
mutate(adjust = cumsum(lag(over2, default = 0))) %>%
ungroup()
Result
# A tibble: 8 × 5
Element Group ElementCnt over2 adjust
<chr> <dbl> <int> <dbl> <dbl>
1 A 0 1 0 0
2 B 1 1 0 0
3 B 1 2 0 0
4 B 1 3 1 0
5 B 2 1 0 1
6 B 2 2 0 1
7 B 3 1 0 1
8 B 3 2 0 1

Filter Rows Between with Multiple Events per Subject

I have a large data set and I'm trying to filter to the days following a specific event for each subject. The issue is that the "event" of interest may happen multiple times for some subjects, and for a few subjects the event doesn't happen at all (in which case they can simply be removed from the summarized data).
Here is an example of the data and what I've tried:
library(tidyverse)
set.seed(355)
subject <- c(rep(LETTERS[1:4], each = 40), rep("E", times = 40))
event <- c(sample(0:1, size = length(subject)-40, replace = T, prob = c(0.95, 0.05)), rep(0, times = 40))
df <- data.frame(subject, event)
df %>%
filter(event == 1) %>%
count(subject, event, sort = T)
# A tibble: 4 x 3
subject event n
<fct> <dbl> <int>
1 D 1 3
2 A 1 2
3 B 1 2
4 C 1 2
So we see that subject D has had the event 3 times while subjects A, B, and C have had the event 2 times. Subject E has not had the event at all.
My next step was to create an "event" tag that identifies where each event happened and produces an NA for all other rows. I also created an event sequence, which counts along between events, because I thought it might be useful, but I didn't end up using it.
df_cleaned <- df %>%
group_by(subject, event) %>%
mutate(event_seq = seq_along(event == 1),
event_detail = ifelse(event == 1, "event", NA)) %>%
as.data.frame()
I tried two different approaches using a filter() and between() to get each event and the 2 rows following each event. Both of these approaches create an error because of the multiple events within subject. I can't figure out a good workaround for it.
Approach 1:
df_cleaned %>%
group_by(subject) %>%
filter(., between(row_number(),
left = which(!is.na(event_detail)),
right = which(!is.na(event_detail)) + 1))
Approach 2:
df_cleaned %>%
group_by(subject) %>%
mutate(event_group = cumsum(!is.na(event_detail))) %>%
filter(., between(row_number(), left = which(event_detail == "event"), right = which(event_detail == "event") + 2))
If you want to get the rows with 1 in event and the following two rows, you can do the following. Thanks to Ananda Mahto, the author of the splitstackshape package, we can handle this type of operation with getMyRows() from his SOfun package, which returns a list. You can specify a range of rows in the function; here I used 0:2, so I am asking R to take each row with 1 in event plus the following two rows. I used bind_rows() to return a data frame, but if you need to work with a list, you do not have to do that.
remotes::install_github("mrdwab/SOfun")  # or devtools::install_github()
library(SOfun)
library(dplyr)
ind <- which(x = df$event == 1)
bind_rows(getMyRows(data = df, pattern = ind, range = 0:2))
subject event
1 A 1
2 A 0
3 A 0
4 A 1
5 A 0
6 A 0
7 B 1
8 B 0
9 B 0
10 B 1
11 B 0
12 B 0
13 C 1
14 C 0
15 C 0
16 C 1
17 C 0
18 C 0
19 D 1
20 D 0
21 D 0
22 D 1
23 D 0
24 D 0
25 D 1
26 D 0
27 D 0
Here is a tidyverse approach which uses cumsum() to create groups of rows after (and including) an event and which picks the top 3 rows of each group:
df %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, add = TRUE) %>%
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 27 x 3
# Groups: subject, event_group [9]
subject event event_group
<fct> <dbl> <int>
1 A 1 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 0 2
6 A 0 2
7 B 1 1
8 B 0 1
9 B 0 1
10 B 1 2
# … with 17 more rows
For testing an edge case, here is a modified data set where subject A starts with three consecutive events. Furthermore, I have added row numbers rn in order to check that the correct rows are picked:
df2 <- df %>%
mutate(event = ifelse(row_number() <= 2L, 1L, event),
rn = row_number())
Now we get
df2 %>%
group_by(subject) %>%
mutate(event_group = cumsum(event == 1L)) %>%
group_by(event_group, add = TRUE) %>%
filter(event_group > 0 & row_number() <= 3L)
# A tibble: 29 x 4
# Groups: subject, event_group [11]
subject event rn event_group
<fct> <dbl> <int> <int>
1 A 1 1 1
2 A 1 2 2
3 A 1 3 3
4 A 0 4 3
5 A 0 5 3
6 A 1 22 4
7 A 0 23 4
8 A 0 24 4
9 B 1 59 1
10 B 0 60 1
# … with 19 more rows
which is in line with my expectations for this edge case.
Here is a base R option which looks similar to @jazzurro's approach. We get the row indices where event == 1, then select the next two rows after each index, use unique() so that overlapping indices are selected only once, and use the result to subset the original df.
inds <- which(df$event == 1)
df[unique(c(sapply(inds, `+`, 0:2))), ]
# subject event
#3 A 1
#4 A 0
#5 A 0
#22 A 1
#23 A 0
#24 A 0
#59 B 1
#60 B 0
#61 B 0
#62 B 1
#63 B 0
#64 B 0
#....
Another option using dplyr is lag():
library(dplyr)
df %>%
group_by(subject) %>%
filter(event == 1 | lag(event) == 1 | lag(event, 2) == 1)
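This hard-codes a window of two rows after each event. A sketch of a generalisation (not from the answer above, assuming a window size k) ORs together one lagged indicator per offset:
library(dplyr)
k <- 2   # number of rows to keep after each event
df %>%
  group_by(subject) %>%
  filter(event == 1 |
           Reduce(`|`, lapply(seq_len(k), function(i) lag(event, i, default = 0) == 1))) %>%
  ungroup()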

Frequency between couples of words

Having a data frame like this:
df <- data.frame(id = c(1,2,3,4,5), keywords = c("google, yahoo, air, cookie", "cookie, air", "air, cookie", "google", "yahoo, google"))
How is it possible to extract a table like
df_binary_exist <- data.frame(id = c(1,2,3,4,5), google = c(1,0,0,1,1), yahoo = c(1,0,0,0,1), air = c(1,1,1,0,0), cookie = c(1,1,1,0,0))
df_binary_exist
id google yahoo air cookie
1 1 1 1 1 1
2 2 0 0 1 1
3 3 0 0 1 1
4 4 1 0 0 0
5 5 1 1 0 0
and from this table find the most frequent couples?
df_frequency <- data.frame(couple = c("yahoo-google", "cookie-air"), freq = c(2,3))
df_frequency
couple freq
1 yahoo-google 2
2 cookie-air 3
The first part can be achieved by using separate_rows, count and spread
library(dplyr)
library(tidyr)
df1 <- df %>% separate_rows(keywords)
df1 %>%
dplyr::count(id, keywords) %>%
spread(keywords, n, fill = 0)
# id air cookie google yahoo
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1
For the second part I used a base R method: first split the keywords by id, then paste every combination of 2 elements together (or the single element, for ids with only one keyword) and count their frequencies using table().
data.frame(sort(table(unlist(sapply(split(df1$keywords, df1$id), function(x)
combn(sort(x), pmin(2, length(x)), paste, collapse = "-")))), decreasing = TRUE))
# Var1 Freq
#1 air-cookie 3
#2 google-yahoo 2
#3 air-google 1
#4 air-yahoo 1
#5 cookie-google 1
#6 cookie-yahoo 1
#7 google 1
One tidyverse possibility could be:
df %>%
mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
unnest() %>%
full_join(df %>%
mutate(keywords = strsplit(keywords, ", ", fixed = TRUE)) %>%
unnest(), by = c("id" = "id")) %>%
filter(keywords.x != keywords.y) %>%
count(keywords.x, keywords.y) %>%
transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
n) %>%
distinct(keywords, .keep_all = TRUE)
keywords n
<chr> <int>
1 cookie-air 3
2 google-air 1
3 yahoo-air 1
4 google-cookie 1
5 yahoo-cookie 1
6 yahoo-google 2
It first splits the "keywords" column on ", " and then performs a full join of the result with itself. Second, it filters out the rows where the two values are the same, since the OP is interested in pairs of distinct values. Third, it counts the number of occurrences of each pair. Finally, it creates an ordered pair variable and keeps only the distinct rows based on it.
Or the same using separate_rows():
df %>%
separate_rows(keywords) %>%
full_join(df %>%
separate_rows(keywords), by = c("id" = "id")) %>%
filter(keywords.x != keywords.y) %>%
count(keywords.x, keywords.y) %>%
transmute(keywords = paste(pmax(keywords.x, keywords.y), pmin(keywords.x, keywords.y), sep = "-"),
n) %>%
distinct(keywords, .keep_all = TRUE)
We can do this easily with
library(qdapTools)
cbind(df[1], mtabulate(strsplit(as.character(df$keywords), ", ")))
# id air cookie google yahoo
#1 1 1 1 1 1
#2 2 1 1 0 0
#3 3 1 1 0 0
#4 4 0 0 1 0
#5 5 0 0 1 1
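The mtabulate() call covers the first part of the question. For the second part, one possible sketch (not part of the original answer) takes a cross-product of the binary matrix, so that entry [i, j] counts how many ids contain both keyword i and keyword j:
library(qdapTools)
m  <- mtabulate(strsplit(as.character(df$keywords), ", "))
co <- crossprod(as.matrix(m))            # keyword-by-keyword co-occurrence counts
co[upper.tri(co, diag = TRUE)] <- NA     # keep each unordered pair only once
pairs <- subset(as.data.frame(as.table(co)), !is.na(Freq) & Freq > 0)
pairs[order(-pairs$Freq), ]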

I am trying to identify patterns of missing values in rows of a dataset

I am trying to find patterns in missing values in rows.
For example if I have this data set:
a b c d
1 0.1 NA NA
2 NA 3 4
5 NA 6 NA
I expect the output to be:
n a b c d m
1 0 0 1 1 2
1 0 1 0 0 1
1 0 1 0 1 2
where column n shows the number of rows that have a given missingness pattern, column m shows how many values are missing in that pattern, and the 1's (in all columns except n and m) mark which variables are missing. That is, the interpretation of the first row of the output is: 1 row is missing 2 values, namely c and d; the second row: 1 row is missing 1 value, in variable b; and so on.
I have tried using the subtable() function in the extracat package (archived version), but I can't find the locations of missing values in each variable; I can only find frequencies.
rowmiss<-rowSums(is.na(dat1[1:ncol(dat1)]))
r1<-matrix(rowmiss, nrow=nrow(dat1))
subtable(rowmiss,1)
I expect the output to be as shown above. What I am finding so far is the frequency of missing values in rows, but I need the patterns and positions of the missing values.
Here's a tidyverse approach. The n column seems redundant, should it be doing something else?
library(tidyverse)
df %>%
rowid_to_column() %>%
gather(col, val, -rowid) %>%
mutate(val = is.na(val) * 1) %>%
group_by(rowid) %>% mutate(m = sum(val)) %>% ungroup() %>%
spread(col, val) %>%
mutate(n = 1) %>%
select(n, a:d, m)
# A tibble: 3 x 6
n a b c d m
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 1 1 2
2 1 0 1 0 0 1
3 1 0 1 0 1 2
An alternative way of doing this with tidyverse:
library(tidyverse)
df %>%
mutate_all(~ is.na(.) %>% as.numeric()) %>%
mutate(m = rowSums(.)) %>%
group_by_all() %>%
count()
Output (you may also want to ungroup() if doing anything further with the df):
# A tibble: 3 x 6
# Groups: a, b, c, d, m [3]
a b c d m n
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0 0 1 1 2 1
2 0 1 0 0 1 1
3 0 1 0 1 2 1
mice::md.pattern() also does basically what you want, but it returns a matrix with some of the useful information in the rownames, so it would require a bit of processing to turn into a data frame.
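A rough sketch of that processing, assuming df is the example data with NAs from the question and a recent version of mice (where md.pattern() codes 1 = observed and 0 = missing, stores the per-pattern row counts in the rownames, and appends a summary row plus a per-pattern missing-count column):
library(mice)
pat <- md.pattern(df)                   # recent versions also draw a plot
pat <- pat[-nrow(pat), , drop = FALSE]  # drop the summary row at the bottom
n   <- as.integer(rownames(pat))        # per-pattern row counts live in the rownames
rownames(pat) <- NULL
out <- as.data.frame(pat)
names(out)[ncol(out)] <- "m"            # last column: number of missing values in the pattern
out[names(df)] <- 1 - out[names(df)]    # flip the coding so 1 marks a missing value
cbind(n = n, out)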

Check values exceeding certain thresholds and count the number of times the threshold is exceeded, by id and label

I have a dataframe df
df <- data.frame(id = c(1, 2, 1, 4, 1, 5, 6),
                 label = c("a", "b", "a", "a", "a", "e", "a"),
                 color = c("g", "a", "g", "g", "a", "a", "a"),
                 threshold = c(12, 10, 12, 12, 12, 35, 40),
                 value = c(32.1, 0, 15.0, 10, 1, 50, 45),
                 stringsAsFactors = FALSE)
The threshold value depends on the label.
For each id and its respective label, I should get a table counting how many times value exceeds threshold.
Color plays no part in calculating the exceed counts.
I tried this:
final_df <- df %>%
mutate(check = if_else(value > threshold, 1, 0)) %>%
group_by(id, label) %>%
summarise(exceed = sum(check))
But instead of getting the count for each respective id, I got the overall total in exceed.
With base R only, use aggregate(). Passing seq.int(nrow(df)) on the left-hand side of the formula hands each group its row indices, so the function can compare value to threshold within the group:
aggregate(seq.int(nrow(df)) ~ id + label, df, function(i) sum(df[i, 4] < df[i, 5]))
# id label seq.int(nrow(df))
#1 1 a 2
#2 4 a 0
#3 6 a 1
#4 2 b 0
#5 5 e 1
In order to match the expected output posted in the question, it will take a little extra work.
exceed <- seq.int(nrow(df))
agg <- aggregate(exceed ~ id + label, df, function(i) sum(df[i, 4] < df[i, 5]))
res <- merge(df[1:3], agg)
unique(res)
# id label color exceed
#1 1 a g 2
#3 1 a a 2
#4 2 b a 0
#5 4 a g 0
#6 5 e a 1
#7 6 a a 1
By a small modification of your code:
df %>%
group_by(id, label) %>%
mutate(check = if_else(value > threshold, 1, 0)) %>%
summarise(exceed = sum(check)) %>%
group_by(id, label)
id label exceed
<dbl> <chr> <dbl>
1 1 a 2
2 2 b 0
3 4 a 0
4 5 e 1
5 6 a 1
To match the expected output more closely:
df %>%
group_by(id, label) %>%
mutate(exceed = sum(if_else(value > threshold, 1, 0))) %>%
group_by(id, label, color) %>%
filter(row_number() == 1)
id label color threshold value exceed
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 a g 12 32.1 2
2 2 b a 10 0 0
3 4 a g 12 10 0
4 1 a a 12 1 2
5 5 e a 35 50 1
6 6 a a 40 45 1
Using group_by() together with slice() to keep one row per id/label:
library(dplyr)
df %>%
group_by(id, label) %>%
mutate(exceed = sum(value > threshold)) %>%
slice(1)
id label color threshold value exceed
<dbl> <chr> <chr> <dbl> <dbl> <int>
1 1 a g 12 32.1 2
2 2 b a 10 0 0
3 4 a g 12 10 0
4 5 e a 35 50 1
5 6 a a 40 45 1
If you would like the output to contain a separate row for each combination of id, label and color, just add another group_by() before the slice() call:
df %>%
group_by(id, label) %>%
mutate(exceed = sum(value > threshold)) %>%
group_by(id, label, color) %>%
slice(1)
id label color threshold value exceed
<dbl> <chr> <chr> <dbl> <dbl> <int>
1 1 a a 12 1 2
2 1 a g 12 32.1 2
3 2 b a 10 0 0
4 4 a g 12 10 0
5 5 e a 35 50 1
6 6 a a 40 45 1
A little change to your code:
final_df <- df %>%
  mutate(check = if_else(value > threshold, 1, 0)) %>%
  group_by(id, label) %>%
  filter(check == 1)
unique(final_df$id)
We could use table() and merge():
table_ <- table(subset(df,value>threshold, c("id","label")))
df2 <- merge(unique(df[c("id","label","color")]),table_,all.x=TRUE)
df2$Freq[is.na(df2$Freq)] <- 0
# id label color Freq
# 1 1 a g 2
# 2 1 a a 2
# 3 2 b a 0
# 4 4 a g 0
# 5 5 e a 1
# 6 6 a a 1
