Finding unique rows that are NOT within an interval in R

I'm trying to find a way to filter a data set so that I see only the rows that do NOT have a measurement in a particular interval. For some reason my brain cannot seem to put the logic together. I've created an example dataset below to try and explain my thinking:
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
                 number = c(-10, -9, -8, -1, -0.5, 0.0, 0.23, 5, -2, -1.1, -0.88, 1.2, 4, -10, 10))
df
So here, ideally, I want to find the unique ids that do NOT have values between -1 and 0. IDs 1 and 2 both have values between -1 and 0, so they would not be included. I can see those in-interval rows with:
df %>% filter(between(number, -1, 0))
But ID 3 only has measurements of -10 and 10, so that ID has no measurements in the interval -1 to 0. I'm trying to get those two rows with ID 3 as my final output, but I can't think of a way to achieve that.
Thanks in advance!

You could use group_by and then filter for the groups where all values fall outside the range, like this:
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
                 number = c(-10, -9, -8, -1, -0.5, 0.0, 0.23, 5, -2, -1.1, -0.88, 1.2, 4, -10, 10))

df %>%
  group_by(id) %>%
  filter(all(!between(number, -1, 0)))
#> # A tibble: 2 × 2
#> # Groups: id [1]
#> id number
#> <dbl> <dbl>
#> 1 3 -10
#> 2 3 10
Created on 2022-09-30 with reprex v2.0.2

Equivalently, you can negate any() instead of wrapping the condition in all():
df %>% group_by(id) %>% filter(!any(between(number, -1, 0)))
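If you prefer base R, a minimal sketch of the same idea uses ave() to flag, per id, whether any value falls in the interval:
# flag, for every row, whether its id has any value in [-1, 0]
in_interval <- ave(df$number >= -1 & df$number <= 0, df$id, FUN = any)
df[!in_interval, ]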

How to find the sum of a column when the month and year are the same

I am wondering how I can find the sum of a column (in this case the AgeGroup_20_to_24 column) for a given month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total number of cases per month/year.
For example: 01/2020 = total sum of AgeGroup cases
02/2020 = total sum of AgeGroup cases
I tried the code below, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid %>%
  mutate(dates = mdy(Date), year = year(dates), month = month(dates)) %>%
  mutate(total = sum(AgeGroup_20_to_24)) %>%
  select(Date, year, month, AgeGroup_20_to_24) %>%
  group_by(year)

View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure if your question and your data match up. You're asking for by-month summaries of data, but your data only includes March entries. I've provided two examples of summarizing your data below, one that uses the entire date and one that uses by-day summaries since we can't use month. If your full data set has more months included, you can just swap the day for month instead. First, a quick summary of just the dates can be done with this code:
#### Load Libraries ####
library(tidyverse)
library(lubridate)

#### Pivot and Summarise Data ####
covid %>%
  pivot_longer(cols = c(everything(), -Date),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Date) %>%
  summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to long format, group by day, and then summarize the cases:
#### Theoretical Example ####
covid %>%
  mutate(Date = mdy(Date),
         Year = year(Date),
         Month = month(Date),
         Day = day(Date)) %>%
  pivot_longer(cols = c(everything(), -Date, -Year, -Month, -Day),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Day) %>% # use Day instead of Month here
  summarise(Sum_Cases = sum(Cases))
The result is shown below; here the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
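If your full data set does span several months, a sketch of the by-month version (assuming the same covid data frame and column names as above) would be:
covid %>%
  mutate(Date = mdy(Date),
         Year = year(Date),
         Month = month(Date)) %>%
  pivot_longer(cols = c(everything(), -Date, -Year, -Month),
               names_to = "AgeGroup",
               values_to = "Cases") %>%
  group_by(Year, Month) %>%
  summarise(Sum_Cases = sum(Cases), .groups = "drop")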

How can I select only the dummy variable columns?

Is there a way to select only those columns in a data frame whose values are all either 0 or 1?
I have a data frame with a mix of values, including strings as well as numbers.
I tried dplyr's select but could not find a way to evaluate the values contained in the columns.
Sample data is shown below.
data <- tribble(
  ~id, ~gender, ~height, ~smoking,
  1,   1,       170,     0,
  2,   0,       150,     0,
  3,   1,       160,     1
)
You can pass a function (or an rlang tilde function) to select_if and look for columns that contain only 0s and 1s.
tribble(
  ~id, ~gender, ~height, ~smoking,
  1,   1,       170,     0,
  2,   0,       150,     0,
  3,   1,       160,     1
) %>%
  select_if(~ all(. %in% 0:1))
# # A tibble: 3 x 2
# gender smoking
# <dbl> <dbl>
# 1 1 0
# 2 0 0
# 3 1 1
If you may have NA in a dummy-variable column, you may want to instead use %in% c(0:1, NA) in the predicate.
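Note that select_if is superseded in current dplyr; a sketch of the same selection using select() with the where() helper (dplyr >= 1.0), which also allows for NAs, would be:
data %>%
  select(where(~ all(. %in% c(0, 1, NA))))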

How to mutate columns in R based on the ordering of a subset of these columns?

To begin with, let's suppose we have a dataset like this:
data <- data.frame(
  id = 1:5,
  time = c(0.1, 0.2, 0.1, 0.1, 0.2),
  obj_a_size = c(1, 3, 8, 4, 2),
  obj_a_cuteness = c(3, 6, 4, 1, 2),
  obj_b_size = c(5, 4, 4, 2, 5),
  obj_b_cuteness = c(6, 2, 10, 9, 6),
  obj_c_size = c(3, 6, 7, 1, 6),
  obj_c_cuteness = c(10, 1, 6, 8, 8)
)
It has columns concerning the whole experiment (like time) and object-specific columns (like X_size and X_cuteness). The objects are in random order, though, so I'd like to mutate these columns to order the objects by size within each experiment. I expect the result to look like this:
data <- data.frame(
  id = 1:5,
  time = c(0.1, 0.2, 0.1, 0.1, 0.2),
  obj_max_size = c(5, 6, 8, 4, 6),
  obj_max_cuteness = c(6, 1, 4, 1, 8),
  obj_2nd_size = c(3, 4, 7, 2, 5),
  obj_2nd_cuteness = c(10, 2, 6, 9, 6),
  obj_min_size = c(1, 3, 3, 1, 2),
  obj_min_cuteness = c(3, 6, 10, 8, 2)
)
Notice that cuteness isn't ordered ascending or descending: cuteness is part of an object, so obj_max_cuteness should be the cuteness of the object with the largest size, obj_2nd_cuteness the cuteness of the second-largest object, and so on.
The number of objects is known in advance (there are four of them), the columns are known as well, and there are four columns describing each object. There is no missing data. I'm willing to use any package, if necessary. Also, the original dataset is about 500k rows by 30 columns, so bonus points for quick or memory-friendly code.
EDIT: Some noticed that the description is not very clear. What I'm after is a somewhat object-oriented thing: in the case above, each object within an experiment could be described as follows (the X in obj_X_ means that it belongs to experiment no. X):
obj_1_a = {"size": 1, "cuteness": 3}
obj_1_b = {"size": 5, "cuteness": 6}
obj_1_c = {"size": 3, "cuteness": 10}
obj_2_a = {"size": 3, "cuteness": 6}
...
I want to reorder them by size so that (in the resulting data frame):
obj_1_max = {"size": 5, "cuteness": 6}
obj_1_2nd = {"size": 3, "cuteness": 10}
obj_1_min = {"size": 1, "cuteness": 3}
obj_2_max = {"size": 6, "cuteness": 1}
...
Is this what you are after?
The min and max value calculations are straightforward. To find the 2nd max you need to do a bit more work. My interpretation of the 2nd value is that it is the second of the sorted, unique values. My output differs from yours, but that may be due to a different interpretation of what you mean by the 2nd value. My reading: within each group of columns (size, cuteness), you are looking for the first value down from the max.
library(dplyr)
library(tidyr) # for pivot_longer()

data <- data.frame(
  id = 1:5,
  time = c(0.1, 0.2, 0.1, 0.1, 0.2),
  obj_a_size = c(1, 3, 8, 4, 2),
  obj_a_cuteness = c(3, 6, 4, 1, 2),
  obj_b_size = c(5, 4, 4, 2, 5),
  obj_b_cuteness = c(6, 2, 10, 9, 6),
  obj_c_size = c(3, 6, 7, 1, 6),
  obj_c_cuteness = c(10, 1, 6, 8, 8)
)
obj_max_size <- data %>%
  pivot_longer(cols = contains('size')) %>%
  group_by(id) %>%
  summarise(obj_max_size = max(value)) %>%
  ungroup() %>%
  select(obj_max_size)

obj_min_size <- data %>%
  pivot_longer(cols = contains('size')) %>%
  group_by(id) %>%
  summarise(obj_min_size = min(value)) %>%
  ungroup() %>%
  select(obj_min_size)

obj_2nd_size <- data %>%
  pivot_longer(cols = contains('size')) %>%
  group_by(id) %>%
  distinct(value) %>%
  arrange(desc(value)) %>%
  slice(2) %>%
  ungroup() %>%
  select(obj_2nd_size = value)

obj_max_cuteness <- data %>%
  pivot_longer(cols = contains('cuteness')) %>%
  group_by(id) %>%
  summarise(obj_max_cuteness = max(value)) %>%
  ungroup() %>%
  select(obj_max_cuteness)

obj_min_cuteness <- data %>%
  pivot_longer(cols = contains('cuteness')) %>%
  group_by(id) %>%
  summarise(obj_min_cuteness = min(value)) %>%
  ungroup() %>%
  select(obj_min_cuteness)

obj_2nd_cuteness <- data %>%
  pivot_longer(cols = contains('cuteness')) %>%
  group_by(id) %>%
  distinct(value) %>%
  arrange(desc(value)) %>%
  slice(2) %>%
  ungroup() %>%
  select(obj_2nd_cuteness = value)

output <- bind_cols(id = data$id, obj_max_size, obj_min_size, obj_2nd_size,
                    obj_max_cuteness, obj_min_cuteness, obj_2nd_cuteness)
With output looking like this:
> output
# A tibble: 5 x 7
id obj_max_size obj_min_size obj_2nd_size obj_max_cuteness obj_min_cuteness obj_2nd_cuteness
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 3 10 3 6
2 2 6 3 4 6 1 2
3 3 8 4 7 10 4 6
4 4 4 1 2 9 1 8
5 5 6 2 5 8 2 6
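If you need cuteness to travel with its object (as the question's expected output suggests), here is a sketch using tidyr's pivot_longer/pivot_wider. It assumes exactly three objects per experiment, as in the example, so the ranks can be labelled max, 2nd, and min:
library(dplyr)
library(tidyr)

data %>%
  # one row per object, with its size and cuteness side by side
  pivot_longer(cols = starts_with("obj_"),
               names_pattern = "obj_(.)_(.*)",
               names_to = c("obj", ".value")) %>%
  group_by(id) %>%
  arrange(desc(size), .by_group = TRUE) %>%
  mutate(rank = c("max", "2nd", "min")) %>% # assumes three objects per id
  ungroup() %>%
  select(-obj) %>%
  pivot_wider(names_from = rank,
              values_from = c(size, cuteness),
              names_glue = "obj_{rank}_{.value}")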

How to create a new dataset based on multiple conditions in R?

I have a dataset called carcom that looks like this:
carcom <- data.frame(household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422),
                     individuals = c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
Here individuals refers to the father ("1"), the mother ("2"), and children ("3" and "4"). I would like to get two new columns: the first should indicate the number of children in the household, if any; the second should assign a weight to each individual, namely 1 for the father, 0.5 for the mother, and 0.3 for each child. My new dataset should look like this:
newcarcom <- data.frame(household = c(173, 256, 319, 422),
                        child = c(0, 0, 1, 2),
                        weight = c(1, 1.5, 1.8, 2.1))
I have been trying to find a solution for days. It would be appreciated if someone could help me. Thanks!
We can count the number of individuals with value 3 or 4 in each household. To calculate the weight, we map the values 1:4 to their corresponding weights using recode and then take the sum.
library(dplyr)

newcarcom <- carcom %>%
  group_by(household) %>%
  summarise(child = sum(individuals %in% 3:4),
            weight = sum(recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3)))
# household child weight
# <dbl> <int> <dbl>
#1 173 0 1
#2 256 0 1.5
#3 319 1 1.8
#4 422 2 2.1
Base R version suggested by @markus:
# x^-1 maps 1 -> 1 and 2 -> 0.5; anything below 0.5 (the children) is replaced by 0.3
newcarcom <- do.call(data.frame, aggregate(individuals ~ household, carcom, function(x)
  c(child = sum(x %in% 3:4), weight = sum(replace(y <- x^-1, y < 0.5, 0.3)))))
An option with data.table (recode here is dplyr's, hence the namespace qualifier):
library(data.table)

setDT(carcom)[, .(child = sum(individuals %in% 3:4),
                  weight = sum(dplyr::recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3))),
              by = household]
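If you want to stay entirely within data.table, a sketch using fcase (available in data.table >= 1.12.8) instead of dplyr's recode would be:
setDT(carcom)[, .(child = sum(individuals %in% 3:4),
                  weight = sum(fcase(individuals == 1, 1,
                                     individuals == 2, 0.5,
                                     default = 0.3))),
              by = household]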

Summary of N recent values

I am trying to get summary statistics (sum and max here) over the N most recent values.
Starting data:
library(data.table)

dt = data.table(id = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'),
                week = c(1, 2, 3, 4, 1, 2, 3, 4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2))
Desired result:
dt = data.table(id = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'),
                week = c(1, 2, 3, 4, 1, 2, 3, 4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2),
                sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
                max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))
With this data, I would like to have the sum and max of the two (N = 2) most recent values for each row, by id. The 4th (sum_recent2week) and 5th (max_recent2week) columns are my desired output.
You can use rollsum and rollmax from the zoo package, combined with data.table's shift:
library(zoo)

dt[, `:=`(sum_recent2week = shift(rollsum(value, 2, align = 'left', fill = NA), 2),
          max_recent2week = shift(rollmax(value, 2, align = 'left', fill = NA), 2)),
   by = id]
For the sum, if you're using data.table version >= 1.12, you can use data.table::frollmean. The default for frollmean is fill = NA, so there is no need to specify that here.
dt[, `:=`(sum_recent2week = shift(frollmean(value, 2, align = 'left') * 2, 2),
          max_recent2week = shift(rollmax(value, 2, align = 'left', fill = NA), 2)),
   by = id]
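Newer data.table versions (>= 1.12.4) also ship frollsum, so the sum can be computed directly rather than via frollmean * 2; a sketch:
dt[, sum_recent2week := shift(frollsum(value, 2, align = 'left'), 2), by = id]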
I'm sure it can be done in a much more elegant way, but here is one tidyverse possibility:
library(tidyverse)

dt %>%
  group_by(id) %>%
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
  rowid_to_column() %>%
  select(-week, -value) %>%
  top_n(-2) %>%
  right_join(dt %>% rowid_to_column(),
             by = c("rowid" = "rowid", "id" = "id")) %>%
  select(-rowid)
id sum_recent2week max_recent2week week value
<chr> <dbl> <dbl> <dbl> <dbl>
1 a NA NA 1. 2.
2 a NA NA 2. 3.
3 a 5. 3. 3. 1.
4 a 4. 3. 4. 0.
5 b NA NA 1. 5.
6 b NA NA 2. 7.
7 b 12. 7. 3. 3.
8 b 10. 7. 4. 2.
First, it computes "sum_recent2week" and "max_recent2week" per group. Second, it selects the last two rows per group. Finally, it merges the result back into the original data.
Or if you want to compute it for all rows, not just for the last two rows per group:
dt %>%
  group_by(id) %>%
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
