I have the following dataset I am working with:
day descent_cd
<int> <chr>
1 26 B
2 19 W
3 19 B
4 16 B
5 1 W
6 2 W
7 2 B
8 2 B
9 3 W
10 3 W
# … with 1,283 more rows
In short: the "day" variable is the day of the month. "Descent_cd" is race (black or white).
I am trying to organize it so that I get a column for "B" and a column for "W", both sorted by the total arrests made that day. That is, count all the "B"s for day 1, do the same for "W", and so on through the rest of the month.
I ultimately want to plot this as a ridgeline graph (for example with geom_density_ridges from the ggridges package).
Is this what you are looking for?
library(tidyverse)
#sample data
df <- tibble::tribble(
~day, ~descent_cd,
26L, "B",
19L, "W",
19L, "B",
16L, "B",
1L, "W",
2L, "W",
2L, "B",
2L, "B",
3L, "W",
3L, "W"
)
df %>%
  group_by(day, descent_cd) %>%
  summarise(total_arrest = n()) %>% # number of arrests per day per descent_cd
  pivot_wider(names_from = descent_cd, values_from = total_arrest) %>% # create columns W and B
  mutate(W = if_else(is.na(W), 0L, W), # replace NAs with 0 (meaning 0 arrests that day)
         B = if_else(is.na(B), 0L, B)) %>%
  arrange(desc(W + B)) # arrange df in descending order of total arrests per day
# A tibble: 6 x 3
# Groups: day [6]
day W B
<int> <int> <int>
1 2 1 2
2 3 2 0
3 19 1 1
4 1 1 0
5 16 0 1
6 26 0 1
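For comparison, the same count-then-widen step can be sketched in base R with table(). This is only a minimal sketch on the ten sample rows; the object name counts is made up for illustration and is not part of the answer above.

```r
# Sample rows from the question
df <- data.frame(
  day = c(26L, 19L, 19L, 16L, 1L, 2L, 2L, 2L, 3L, 3L),
  descent_cd = c("B", "W", "B", "B", "W", "W", "B", "B", "W", "W")
)

# Cross-tabulate day by descent_cd; combinations that never occur are counted as 0
counts <- as.data.frame.matrix(table(df$day, df$descent_cd))
counts$day <- as.integer(rownames(counts))

# Sort by total arrests per day, largest first (ties keep ascending day order)
counts <- counts[order(-(counts$B + counts$W)), c("day", "W", "B")]
rownames(counts) <- NULL
counts
```

Because table() fills absent combinations with 0, no separate NA replacement step is needed here.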
I am trying to analyze how estimating percent cover of a reef changes as the number of points used to analyze the reef changes. My actual dataset consists of 150 photos each with 50 points. The idea is to have R estimate percent cover with all those points and then remove 1 point from each photo and reanalyze, then remove another point and reanalyze etc.
Any pointers on how to write such a function, or where to look for one, are welcome, as I am very new to all this! Below is a sample dataset with just 3 plots and 5 points per plot. As mentioned, the idea is to analyze with all points, then randomly remove one point from each plot, reanalyze, and repeat. In this sample, the first analysis would use 15 points, the next a total of 12 points, etc.
Sample dataset:
Plot ID
1 S
1 S
1 S
1 T
1 T
2 S
2 C
2 C
2 SP
2 S
3 S
3 T
3 T
3 C
3 T
Thank you!
base R
set.seed(42)
dat[ave(rep(TRUE, nrow(dat)), dat$Plot,
        FUN = function(z) length(z) <= 1 | !seq_along(z) %in% sample(length(z), 1)), ]
# Plot ID
# 2 1 S
# 3 1 S
# 4 1 T
# 5 1 T
# 6 2 S
# 7 2 C
# 8 2 C
# 9 2 SP
# 12 3 T
# 13 3 T
# 14 3 C
# 15 3 T
I added logic to keep every Plot at a minimum size of 1 row (the length(z) <= 1 check, which leaves a single-row Plot untouched). Bump this threshold up if you have different needs, or remove the condition if you don't mind a Plot disappearing once it is down to one row.
dplyr
library(dplyr)
set.seed(42)
dat %>%
group_by(Plot) %>%
sample_n(n() - 1) %>%
ungroup()
# # A tibble: 12 x 2
# Plot ID
# <int> <chr>
# 1 1 S
# 2 1 T
# 3 1 T
# 4 1 S
# 5 2 C
# 6 2 SP
# 7 2 S
# 8 2 C
# 9 3 S
# 10 3 C
# 11 3 T
# 12 3 T
Here is a base R function with tapply/sample.
Its arguments are the data.frame and the grouping column.
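sample_rows <- function(data, group){
  group <- as.character(substitute(group))
  # index into x rather than calling sample(x, 1): for a one-row group,
  # sample(x, 1) would draw from 1:x instead of returning x
  tapply(seq_len(nrow(data)), data[[group]], \(x) x[sample.int(length(x), 1)])
}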
set.seed(2021)
i <- sample_rows(df1, Plot)
df2 <- df1[-i, ]
nrow(df2)
#[1] 12
i <- sample_rows(df2, Plot)
df2 <- df2[-i, ]
nrow(df2)
#[1] 9
Data
df1 <-
structure(list(Plot = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L), ID = c("S", "S", "S", "T", "T", "S", "C",
"C", "SP", "S", "S", "T", "T", "C", "T")), class = "data.frame",
row.names = c(NA, -15L))
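To tie the pieces together, here is a rough sketch of the whole remove-and-reanalyze loop in base R. The helper names drop_one_per_plot and percent_cover are hypothetical, and "percent cover" is assumed to mean the percentage of points falling in each ID category; swap in your own definition as needed.

```r
df1 <- data.frame(
  Plot = rep(1:3, each = 5),
  ID = c("S", "S", "S", "T", "T", "S", "C", "C", "SP", "S",
         "S", "T", "T", "C", "T")
)

# Hypothetical helper: drop one randomly chosen row from every plot
drop_one_per_plot <- function(data) {
  idx <- tapply(seq_len(nrow(data)), data$Plot,
                function(x) x[sample.int(length(x), 1)])
  data[-as.integer(idx), , drop = FALSE]
}

# Assumed definition: percentage of points in each ID category
percent_cover <- function(data) {
  round(100 * prop.table(table(data$ID)), 1)
}

set.seed(1)
current <- df1
results <- list()
repeat {
  # store the estimate under the number of points it was computed from
  results[[as.character(nrow(current))]] <- percent_cover(current)
  # stop once any plot is down to its last point
  if (any(table(current$Plot) <= 1)) break
  current <- drop_one_per_plot(current)
}
names(results)
```

Each element of results holds the estimate for one sample size, so the estimates can be compared across 15, 12, 9, 6 and 3 points. For the real data you would probably repeat the whole loop many times with different seeds and average the results.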
I have a dataframe in R that has a large number of bank_account_IDs and Vendor_Codes. Bank_account_IDs should not be shared between Vendor_Codes, but sometimes a fraudulent vendor exists that shares another vendor's bank_account_ID.
I want to add a new field to the dataframe that provides a count for the number of times an account_ID exists with more than 1 Vendor_Code.
My sample dataframe is as follows:
bank_account_ID <- c("a", "b", "c", "a", "a", "d", "e", "f", "b", "c")
Vendor_Code <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data.frame(bank_account_ID, Vendor_Code)
My ideal new dataframe should look something like this:
bank_account_ID Vendor_Code duplicate_count
a 1 2
b 2 1
c 3 1
a 4 2
a 5 2
d 6 0
e 7 0
f 8 0
b 9 1
c 10 1
Thanks in advance!
We can get the number of distinct elements with n_distinct grouped by the 'bank_account_ID' and subtract 1
library(dplyr)
df %>%
group_by(bank_account_ID) %>%
mutate(dupe_count = n_distinct(Vendor_Code)-1) %>%
ungroup
-output
# A tibble: 10 x 4
# bank_account_ID Vendor_Code duplicate_count dupe_count
# <chr> <int> <int> <dbl>
# 1 a 1 2 2
# 2 b 2 1 1
# 3 c 3 1 1
# 4 a 4 2 2
# 5 a 5 2 2
# 6 d 6 0 0
# 7 e 7 0 0
# 8 f 8 0 0
# 9 b 9 1 1
#10 c 10 1 1
data
df <- structure(list(bank_account_ID = c("a", "b", "c", "a", "a", "d",
"e", "f", "b", "c"), Vendor_Code = 1:10, duplicate_count = c(2L,
1L, 1L, 2L, 2L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
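If you would rather avoid the dplyr dependency, a base R sketch of the same idea uses ave() to count the distinct vendor codes per account and subtract 1. The column name dupe_count simply mirrors the one used above.

```r
df <- data.frame(
  bank_account_ID = c("a", "b", "c", "a", "a", "d", "e", "f", "b", "c"),
  Vendor_Code = 1:10
)

# For every row, count the distinct vendor codes sharing its account, minus 1
df$dupe_count <- ave(
  df$Vendor_Code, df$bank_account_ID,
  FUN = function(v) length(unique(v)) - 1L
)
df
```

Accounts used by a single vendor get 0, so any row with dupe_count > 0 is a candidate for review.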
I have a data frame where I'd like to remove entire groups if their y value is the same across 6 time points.
Patients Time Status
1 a 5
1 b 5
1 c 5
1 d 5
1 e 5
1 f 5
2 a 4
2 b 4
2 c 5
2 d 5
2 e 5
2 f 5
Basically, I'd like to remove all patients from this data frame who have a status of "5" at ALL time points. If a patient has any value apart from 5 at any point in time I'd like to include them.
I tried
df <- df %>%
filter(a !=5 & b !=5 & c !=5 & d !=5 & e !=5 & f !=5)
To no avail, unfortunately. Would appreciate any help. Thank you!
You can use any/all :
library(dplyr)
df %>% group_by(Patients) %>% filter(any(Status != 5))
#With `all`
#df %>% group_by(Patients) %>% filter(!all(Status == 5))
# Patients Time Status
# <int> <chr> <int>
#1 2 a 4
#2 2 b 4
#3 2 c 5
#4 2 d 5
#5 2 e 5
#6 2 f 5
This can also be written with base R :
subset(df, ave(Status != 5, Patients, FUN = any))
#and `data.table` :
library(data.table)
setDT(df)[, .SD[any(Status != 5)], Patients]
Without grouping by Patients you can do :
subset(df, Patients %in% unique(Patients[Status != 5]))
data
df <- structure(list(Patients = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), Time = c("a", "b", "c", "d", "e", "f", "a", "b",
"c", "d", "e", "f"), Status = c(5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L,
5L, 5L, 5L, 5L)), row.names = c(NA, -12L), class = "data.frame")
Something like this?
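library(dplyr)
df <- data.frame(
  patients = c(rep(1,6),rep(2,6)),
  time = rep(letters[1:6],2),
  status = c(rep(5,6),rep(4,2),rep(5,4))
)
# n() is the group size (6 here); a group whose status never varies is dropped entirely.
# Caveat: rows whose status equals the group mean are dropped even in a varying group.
df %>%
  group_by(patients) %>%
  dplyr::filter(status * n() != sum(status))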
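Comparing each status against the group total works for the sample data, but it can misfire when different values happen to add up to the same total. The base R sketch below adds a made-up third patient (statuses 4, 6, 5, 5, 5, 5) to show the difference from a group-wise any() check; that extra patient is an assumption for illustration, not part of the question's data.

```r
df <- data.frame(
  patients = rep(1:3, each = 6),
  time = rep(letters[1:6], 3),
  status = c(rep(5, 6), rep(4, 2), rep(5, 4), 4, 6, 5, 5, 5, 5)
)

# Sum-based rule: keep rows where status * group size differs from the group sum
group_size <- ave(df$status, df$patients, FUN = length)
group_sum <- ave(df$status, df$patients, FUN = sum)
keep_sum <- df$status * group_size != group_sum

# any()-based rule: keep every row of a patient with at least one status != 5
keep_any <- ave(df$status != 5, df$patients, FUN = any)

table(df$patients[keep_sum]) # patient 3 loses its four rows with status 5
table(df$patients[keep_any]) # patients 2 and 3 keep all six rows
```

The any()-based rule tests the stated condition directly, so it does not depend on the group size or on how the values sum.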
If I understood your problem correctly, one of these options should help:
library(dplyr)
library(data.table)
# your test data
df <- data.table::fread("Patients Time Status
1 a 5
1 b 5
1 c 5
1 d 5
1 e 5
1 f 5
2 a 4
2 b 4
2 c 5
2 d 5
2 e 5
2 f 5")
# one option to get all rows with a Status different from 5
df %>%
# exclude everything where Status is 5
dplyr::filter(Status != 5)
Patients Time Status
1: 2 a 4
2: 2 b 4
# one option to get all distinct patients
df %>%
# exclude everything where Status is 5
dplyr::filter(Status != 5) %>%
# unique values per column or column combination
dplyr::distinct(Patients)
Patients
1: 2
# one option to get all data of patients with at least one status != 5
df %>%
# exclude everything where Status is 5
dplyr::filter(Status != 5) %>%
# unique values per column or column combination
dplyr::distinct(Patients) %>%
# join back on original data to get all values for specific patients
dplyr::inner_join(df, by = "Patients")
Patients Time Status
1: 2 a 4
2: 2 b 4
3: 2 c 5
4: 2 d 5
5: 2 e 5
6: 2 f 5
I'd like to find consecutive months by client. I thought this would be easy, but
I still can't find a solution.
My goal is to find months' consecutive purchases for each client. Any
My data
Client Month consecutive
A 1 1
A 1 2
A 2 3
A 5 1
A 6 2
A 8 1
B 8 1
In base R, we can use ave
df$consecutive <- with(df, ave(Month, Client, cumsum(c(TRUE, diff(Month) > 1)),
FUN = seq_along))
df
# Client Month consecutive
#1 A 1 1
#2 A 1 2
#3 A 2 3
#4 A 5 1
#5 A 6 2
#6 A 8 1
#7 B 8 1
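To see why the cumsum(c(TRUE, diff(Month) > 1)) trick works, it may help to build the pieces one at a time. The sketch below uses client A's months only; the intermediate names gap and run_id are made up for illustration.

```r
month <- c(1L, 1L, 2L, 5L, 6L, 8L)

# TRUE marks the start of a new run: the first row, or a jump of more than one month
gap <- c(TRUE, diff(month) > 1)
gap
# TRUE FALSE FALSE TRUE FALSE TRUE

# The running total of run starts gives every row a run identifier
run_id <- cumsum(gap)
run_id
# 1 1 1 2 2 3

# Number the rows within each run
consecutive <- ave(month, run_id, FUN = seq_along)
consecutive
# 1 2 3 1 2 1
```

Passing Client as an additional grouping variable to ave(), as in the answer above, keeps runs from spilling over from one client to the next.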
In dplyr, we can create a new group with lag to compare the current month with the previous month and assign row_number() in each group.
library(dplyr)
df %>%
group_by(Client,group=cumsum(Month-lag(Month, default = first(Month)) > 1)) %>%
mutate(consecutive = row_number()) %>%
ungroup %>%
select(-group)
We can create a grouping variable based on the difference in adjacent 'Month' for each 'Client' and use that to create the sequence
library(dplyr)
df1 %>%
group_by(Client) %>%
group_by(grp = cumsum(c(TRUE, diff(Month) > 1)), .add = TRUE) %>% # .add replaces the deprecated add argument (dplyr >= 1.0.0)
mutate(consec = row_number()) %>%
ungroup %>%
select(-grp)
# A tibble: 7 x 4
# Client Month consecutive consec
# <chr> <int> <int> <int>
#1 A 1 1 1
#2 A 1 2 2
#3 A 2 3 3
#4 A 5 1 1
#5 A 6 2 2
#6 A 8 1 1
#7 B 8 1 1
Or using data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(Month) > 1)), Client
][, consec := seq_len(.N), .(Client, grp)
][, grp := NULL][]
data
df1 <- structure(list(Client = c("A", "A", "A", "A", "A", "A", "B"),
Month = c(1L, 1L, 2L, 5L, 6L, 8L, 8L), consecutive = c(1L,
2L, 3L, 1L, 2L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-7L))
dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
An option is to do a group by 'STUDY', 'ID' and filter out the duplicated 0 values in 'CYCLE'
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
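Alternatively, if a group has several CYCLE == 0 rows and you only want to remove the one where 'TIME' equals the group maximum (note that this also drops a lone CYCLE == 0 row when it carries the group's maximum TIME):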
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
Or using base R
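# order() returns row positions, so use them to reorder the data frame first
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
# columns 1:3 are STUDY, ID, CYCLE; drop repeated CYCLE == 0 rows after the first (lowest TIME)
dfin1[!(duplicated(dfin1[1:3]) & dfin1$CYCLE == 0), ]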
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
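As a variation on the filters above, here is a base R sketch that does not rely on row order: within each STUDY/ID it keeps the CYCLE == 0 row(s) with the smallest TIME and leaves all other cycles alone. The names is_zero and min_zero_time are made up for illustration; CYCLE == 0 rows tied at the smallest TIME are all kept.

```r
dfin <- data.frame(
  STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L, 0L, 1L),
  TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)
)

is_zero <- dfin$CYCLE == 0

# Smallest TIME among the CYCLE == 0 rows of each STUDY/ID (NA if there are none)
min_zero_time <- ave(
  ifelse(is_zero, dfin$TIME, NA), dfin$STUDY, dfin$ID,
  FUN = function(t) if (all(is.na(t))) NA else min(t, na.rm = TRUE)
)

# Drop only CYCLE == 0 rows that are later than the earliest one in their group
dfout <- dfin[!(is_zero & dfin$TIME > min_zero_time), ]
dfout
```

A group with a single CYCLE == 0 row is left untouched, because that row's TIME is itself the minimum.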