Subset rows with first observation after a given occurrence - r

I am trying to accomplish the following:
1. Group data by id.
2. Remove any rows after '3' occurs.
3. Find the closest '1', '2' or NA that precedes '3' and keep only that row.
My data:
data <- data.frame(
id=c(1,1,1,1,1, 2,2,2,2, 3,3,3),
a=c(NA,1,2,3,3, NA,3,2,3, 1,5,3))
Desired output:
desired <- data.frame(
id=c(1,2,3), a=c(2,NA,1))
For steps 1-2, I have tried:
data %>% group_by(id) %>% slice(if(first(a) == 3))
but that seems quite off.
Thank you.

This breaks the problem into separate steps:
library(dplyr)
data %>%
  group_by(id) %>%
  filter(row_number() < first(which(a == 3))) %>% # drop the first 3 and everything after it
  filter(a %in% c(1, 2, NA)) %>% # only keep 1, 2 or NA
  filter(row_number() == n()) # choose the last row in each group
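For comparison, the same three steps can be sketched in base R. This is only an illustration: `keep_row` is a hypothetical helper name, and dropping groups that contain no 3 is an assumption, since the question does not say what should happen to them.

```r
data <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  a  = c(NA, 1, 2, 3, 3, NA, 3, 2, 3, 1, 5, 3))

# Hypothetical helper: given the row indices of one id, return the index of
# the last 1, 2 or NA that comes before the first 3 (or nothing at all).
keep_row <- function(rows, a) {
  vals   <- a[rows]
  first3 <- which(vals == 3)[1]             # which() skips NA comparisons
  if (is.na(first3)) return(integer(0))     # assumption: no 3 means no row kept
  before <- seq_len(first3 - 1)             # positions strictly before the first 3
  ok     <- before[vals[before] %in% c(1, 2, NA)]
  tail(rows[ok], 1)                         # closest qualifying row to the 3
}

idx    <- unlist(lapply(split(seq_len(nrow(data)), data$id), keep_row, a = data$a))
result <- data[idx, ]
rownames(result) <- NULL
result
```

On the example data this keeps rows 3, 6 and 10 of `data`, which matches the `desired` data frame in the question.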

Related

How to subtract using max(date) and second latest (month) date

I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- country %>% cbind(my_dates) %>% cbind(var1) %>% as.data.frame()
df1$my_dates <- as.Date(df1$my_dates)
df1$var1 <- as.numeric(df1$var1)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
mutate(dif_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates)-lag(max(my_dates), n=1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
#this gives me the value of var1 of the latest date
df2 <- df1 %>%
mutate(value_1month = var1[my_dates==max(my_dates)])
#this gives me the date of the second latest date
df2 <- df1 %>%
mutate(month1 = max(my_dates) %m-%months(1))
#This gives me the second to latest value
df2 <- df1 %>%
mutate(var1_1month = var1[my_dates==max(my_dates) %m-%months(1)])
#This gives me the difference of the latest value and the second to last of var1
df2 <- df1 %>%
mutate(diff_1month = var1[my_dates==max(my_dates)] - var1[my_dates==max(my_dates) %m-%months(1)])
mutate requires each new column to have either length 1 or the same length as the number of rows in the data (or in the group). When we subset inside mutate, the result can have a different length, so we may need ifelse or case_when instead.
library(dplyr)
library(lubridate)
df1 %>%
mutate(diff_1month = case_when(my_dates==max(my_dates) ~
my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are not clear.
Based on the OP's update, we may arrange by date first, grab the last two 'var1' values and take their difference:
df1 %>%
arrange(my_dates) %>%
mutate(dif_1month = diff(tail(var1, 2)))
. my_dates var1 dif_1month
1 XYZ 2021-08-01 3 -1
2 XYZ 2021-09-01 2 -1
3 XYZ 2021-10-01 1 -1
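The arrange-based answer picks the last two rows by position, so it relies on every month being present. A minimal base R sketch that looks the previous month up by date instead is shown here. It assumes the dates always fall on the first of the month, and the variable names `latest` and `previous` are only illustrative.

```r
df1 <- data.frame(
  country  = c("XYZ", "XYZ", "XYZ"),
  my_dates = as.Date(c("2021-10-01", "2021-09-01", "2021-08-01")),
  var1     = c(1, 2, 3)
)

latest   <- max(df1$my_dates)
# Step back one calendar month from the latest date
previous <- seq(latest, by = "-1 month", length.out = 2)[2]

# Latest value minus the previous month's value
dif_1month <- df1$var1[df1$my_dates == latest] -
  df1$var1[df1$my_dates == previous]
dif_1month
```

If the previous month is missing from the data, the second subset is empty and `dif_1month` has length zero, which makes the gap visible rather than silently comparing against an older month.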

R - Identifying only strings ending with A and B in a column

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B at the end, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried stripping the A's and B's at the end and then counting how many times each resulting name occurs, but again, this does not give me the desired results.
Any suggestions?
We could group by the sample name without its last character, use substring to get the last character, and keep only the groups whose last characters include both 'A' and 'B':
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character from Samples into a separate column, keep only the groups that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
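The same grouping idea can be sketched without any packages. The names `stem`, `suffix` and `has_both` are made up for this illustration, and the sketch assumes, like the answers above, that the suffix is always exactly the last character.

```r
samples <- c("S_026A", "S_026B", "S_028A", "S_028B",
             "S_038A", "S_040_B", "S_026B", "S_38A")

stem   <- sub(".$", "", samples)             # name without its last character
suffix <- sub("^.*(.)$", "\\1", samples)     # the last character only

# For each stem, check whether both "A" and "B" occur among its suffixes
has_both <- vapply(split(suffix, stem),
                   function(s) all(c("A", "B") %in% s),
                   logical(1))

both <- names(has_both)[has_both]
both
```

Note that "S_040_B" ends up with the stem "S_040_" and "S_38A" with "S_38", so neither is matched with "S_038A". Any normalisation of such names would have to happen before the grouping.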

How can I select the values that only appear once in the column of a data table in R?

Like the title, the question is very straightforward. (pardon my ignorance)
I have a column, character type, in a data table.
And there are several different words/values stored, some of them only appear once, others appear multiple times.
How can I select out the ones that only appear once??
Any help is appreciated! Thank you!
One option would be to group by the column and then keep only the groups that have a single row:
library(data.table)
dt1 <- dt[, .SD[.N == 1], .(col)]
library(dplyr)
df %>%
group_by(column) %>%
dplyr::filter(n() == 1) %>%
ungroup()
Example:
data = tibble(text = c("a","a","b","c","c","c"))
data %>%
group_by(text) %>%
dplyr::filter(n() == 1) %>%
ungroup()
# A tibble: 1 x 1
text
<chr>
1 b
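If no package is available, a base R sketch of the same idea is to mark every value that is duplicated when scanning from either end of the vector and keep the rest. The vector `x` is just the example data from above.

```r
x <- c("a", "a", "b", "c", "c", "c")

# A value occurs more than once if it is flagged as a duplicate
# when scanning from the front or from the back.
repeated <- duplicated(x) | duplicated(x, fromLast = TRUE)

once <- x[!repeated]
once
```

The same logical vector can be used to subset the rows of a data frame or data table, for example `df[!repeated, ]` after computing `repeated` from the relevant column.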

Remove rows below certain row number/condition by group

I'm trying to subset a dataframe in R. It contains several categories. The first few rows for each category need to be removed. The number of rows to remove is inconsistent, but there is a row that indicates the cutoff. How do I remove everything above the cutoff (including that row) for each group?
Example data:
category <- c(rep("A", 3), rep("B", 5), rep("C", 4))
info <- as.character(c("Junk", "Border", "Useful",
"This", "is", "Useless", "Border", "Yes please",
"Unwanted", "Row", "Border", "Required"))
example_df <- data.frame(category, info)
example_df$row_number <- 1:nrow(example_df)
I can extract the row numbers of the border and the start of each group:
border_rows <- which(example_df$info == "Border")
start_rows <- example_df %>%
group_by(category) %>%
slice(1)
start_rows <- start_rows$row_number
I've tried the following, but this only removes the first two rows (i.e. the ones that need to be removed for group A).
for(i in 1:length(border_rows)) {
new_df <- example_df[-(start_rows[i]:border_rows[i]), ]
}
You can do this with the dplyr package (this assumes each category has exactly one "Border" row):
library(dplyr)
example_df %>%
group_by(category) %>%
filter(row_number() > which(info == "Border")) %>%
ungroup()
# A tibble: 3 x 2
category info
<fct> <fct>
1 A Useful
2 B Yes please
3 C Required
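When a category might contain no "Border" row, or more than one, a running count per group is a little more forgiving. The following base R sketch keeps everything after the first "Border" in each category. That choice is an assumption, as the question only covers the single-border case, and `seen_border` is an illustrative name.

```r
example_df <- data.frame(
  category = c(rep("A", 3), rep("B", 5), rep("C", 4)),
  info = c("Junk", "Border", "Useful",
           "This", "is", "Useless", "Border", "Yes please",
           "Unwanted", "Row", "Border", "Required"),
  stringsAsFactors = FALSE
)

# Running count of "Border" rows within each category:
# 0 before the first border, at least 1 from the border row onwards.
seen_border <- ave(as.integer(example_df$info == "Border"),
                   example_df$category, FUN = cumsum)

# Keep rows at or after the border, then drop the border rows themselves
result <- example_df[seen_border > 0 & example_df$info != "Border", ]
result
```

Categories without any "Border" row are dropped entirely under this rule, because their running count never leaves zero.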

R dedupe records that are not exactly duplicates

I have a list of records that I need to dedupe. The rows are combinations of the same set of values, but the regular deduplication functions do not work because the values are swapped between the two columns, so the rows are not exact duplicates. Below is a reproducible example.
df <- data.frame( A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame( A = c("2","2","2","331","391","481"),
B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find when the elements in column A first appear in column B
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx)
df[is.na(idx) | seq_along(idx) < idx,]
Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column
library(tidyverse)
df %>% mutate(idx = match(A, B)) %>%
filter(is.na(idx) | seq_along(idx) < idx) %>%
select(-idx)
You can remove all rows that would be duplicates under some reordering of the two columns with
library(dplyr)
df %>%
apply(1, sort) %>% t %>%
data.frame %>%
group_by_all %>%
slice(1)
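A base R sketch of the same "duplicate under reordering" idea builds an order-independent key for each row and drops repeated keys. The `key` column name and the "|" separator are arbitrary choices made for this illustration.

```r
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490",
        "501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481",
        "2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)

# Put the smaller string first so that (A, B) and (B, A) share one key
key <- paste(pmin(df$A, df$B), pmax(df$A, df$B), sep = "|")

# Keep the first row seen for each unordered pair, in the original order
deduped <- df[!duplicated(key), ]
deduped
```

On the example data this keeps 9 rows, because the pairs 43/501, 43/502 and 501/502 each count as a distinct unordered pair. The `df_Final` in the question has only 6 rows, so pair-based deduplication appears to answer a slightly different question than the `match`-based approach.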
