I have a dataset like this:
ID      Amount  MemberCard
345890  251000  NO
341862  400238  YES
345791  678921  YES
341750  87023   NO
345716  12987   YES
I need to delete all observations with an Amount > 250000, but I have to keep the IDs 341862 and 345791. So I was wondering whether some kind of "except" option exists in R when subsetting, instead of creating a separate data frame with these 2 observations and using rbind afterwards.
Select a row if its ID is one of c(341862, 345791) OR if its Amount is less than or equal to 250000.
We can use subset in base R -
res <- subset(df, ID %in% c(341862, 345791) | Amount <= 250000)
res
#      ID Amount MemberCard
#2 341862 400238        YES
#3 345791 678921        YES
#4 341750  87023         NO
#5 345716  12987        YES
Or with dplyr::filter -
library(dplyr)
df %>% filter(ID %in% c(341862, 345791) | Amount <= 250000)
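For a reproducible check, the sample data can be rebuilt by hand and filtered with plain logical indexing. This is only a sketch: the column types are assumed from the table in the question, `keep_ids` is a name made up here for illustration, and the limit of 250000 is the one stated in the question.

```r
# Sample data typed in from the question (column types are an assumption)
df <- data.frame(
  ID = c(345890, 341862, 345791, 341750, 345716),
  Amount = c(251000, 400238, 678921, 87023, 12987),
  MemberCard = c("NO", "YES", "YES", "NO", "YES"),
  stringsAsFactors = FALSE
)

# IDs to keep regardless of Amount (hypothetical helper name)
keep_ids <- c(341862, 345791)

# Keep a row if it is one of the exceptions OR its Amount is within the limit
res <- df[df$ID %in% keep_ids | df$Amount <= 250000, ]
res
```

Only ID 345890 is dropped here, since it is over the limit and not in the exception list.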
If all you want is to blank out the values for observations with Amount > 250000 while keeping the rows, you can use replace():
library(tidyverse)
df_new <- df %>%
  mutate(Amount = replace(Amount, Amount > 250000, NA))
If you want this applied to both columns, add MemberCard to the same mutate(). It has to come before Amount, because mutate() evaluates its arguments in order and the condition would otherwise see Amount values that have already been set to NA:
df_new <- df %>%
  mutate(MemberCard = replace(MemberCard, Amount > 250000, NA),
         Amount = replace(Amount, Amount > 250000, NA))
This will preserve the ID, but removes all other values if the condition is met. Hope this helps. 😉
We may also use
subset(df, ID == 341862 | ID == 345791 | Amount <= 250000)
Related
Let's say I have this dataframe:
df <- data.frame(Sequence_ID = c(100,100,100,100,101,101,101,101,102,102,102,102,103,103,103,103), Success = c(1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1))
If any one of the rows that have the same Sequence_ID have a 1 in the Success column, then I want all rows in that group to have 1 in the success column.
I can get my desired output with the following code:
for (i in seq_len(nrow(df))) {
  x <- df$Sequence_ID[i]
  # If any row in this Sequence_ID group has Success == 1, set the whole group to 1
  if (any(df$Success[df$Sequence_ID == x] == 1)) {
    df$Success[df$Sequence_ID == x] <- 1
  }
}
I was wondering if there is a way to do this in dplyr. Thanks in advance
library(dplyr)
df %>%
  group_by(Sequence_ID) %>%
  mutate(
    Success = as.integer(any(as.logical(Success)))
  )
does the trick :)
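If dplyr is not available, the same group-wise fill can be sketched in base R with ave(). This is an alternative technique, not the one used in the answer above, and it assumes Success only ever contains 0 and 1, so the group maximum is 1 exactly when some row in the group is 1.

```r
# Data from the question
df <- data.frame(
  Sequence_ID = rep(c(100, 101, 102, 103), each = 4),
  Success = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1)
)

# ave() applies FUN within each Sequence_ID group and returns
# a vector of the same length as the input
df$Success <- ave(df$Success, df$Sequence_ID, FUN = max)
df
```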
df <- read.csv("http://www.sharecsv.com/dl/da89d0f973c81ad8c0ff4bcb0e7293b0/testdata.csv")
df %>% dplyr::group_by(TOF)
I want to look at duplicated TOF values. Whenever a duplicated value is found (in other words, TOF values belonging to the same dplyr::group), I want to keep those that satisfy the following condition:
intFT > max(intFT) * 0.1 ### this condition is valid within-group, i.e. max(intFT) refers to the highest intFT in a certain TOF group grouped by dplyr::group_by
Furthermore, in every TOF group, only top three elements with the highest intFT should be kept.
NA values should not be removed.
This returns an incorrect solution:
df %>% dplyr::group_by(TOF) %>% filter(intFT > max(intFT) * 0.1)
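I don't have your data, but something like this could work:
df %>%
  dplyr::group_by(TOF) %>%
  add_tally() %>%
  # in duplicated groups (n > 1), flag rows at or below 10% of the group maximum
  mutate(remove_it = if_else(n > 1 & intFT <= max(intFT) * 0.1, "yes", "no")) %>%
  filter(remove_it == "no") %>%
  # keep the three highest intFT values per group
  top_n(3, intFT)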
Example piece of the dataset: In this case, I would want to identify ID# 02002075 because the two entries with this ID have different DOBs
id dob
00000745 19150406
00000745 19150406
00102316 19231110
00102316 19231110
02002075 19450506
02002075 19350107
I have a large data set and am currently focused on two columns. One is ID number and the other is DOB. There are some repeated ID numbers for multiple entries. However, some of the entries have the same ID numbers but different DOB. I need to identify these cases.
This gives me a data table of all the duplicated ID numbers, but I need help in figuring out how to then identify all the entries with a different DOB
library(readr)
d <- read_delim('data_headers_MS.txt', delim='\t'); dim(d)
x <- d[duplicated(d$id), ]; dim(x)
head(x)
ss <- x$id[x$id!='999999999']; length(ss)
ss <- unique(ss); length(ss)
y <- subset(d, d$id %in% ss, select=c(id, soc.sec, dob, name.last, name.first, dx.age)); dim(y)
head(y)
y <- y[order(y$id), ]
library(dplyr)
d %>%
  group_by(id) %>%
  summarize(distinct_dob = length(unique(dob))) %>%
  filter(distinct_dob > 1) %>%
  ungroup()
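The same idea, counting distinct DOBs per ID, can also be sketched in base R on the example rows from the question. The ids and dobs are stored as character strings here (an assumption, made to preserve the leading zeros), and `conflicting_ids` is a name introduced only for this illustration.

```r
# Example rows typed in from the question
d <- data.frame(
  id  = c("00000745", "00000745", "00102316", "00102316", "02002075", "02002075"),
  dob = c("19150406", "19150406", "19231110", "19231110", "19450506", "19350107"),
  stringsAsFactors = FALSE
)

# Number of distinct dob values per id
n_dob <- tapply(d$dob, d$id, function(x) length(unique(x)))

# ids that occur with more than one dob
conflicting_ids <- names(n_dob)[n_dob > 1]

# All entries belonging to those ids
d[d$id %in% conflicting_ids, ]
```

On these six rows the only id returned is 02002075.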
I have a list of records that I need to deduplicate. The rows look like combinations of the same set of values in a different order, so the regular deduplication functions do not work because the two columns are not duplicates as they stand. Below is a reproducible example.
df <- data.frame(A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
                 B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame(A = c("2","2","2","331","391","481"),
                       B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find when the elements in column A first appear in column B
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx)
df[is.na(idx) | seq_along(idx) < idx,]
Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column
library(tidyverse)
df %>%
  mutate(idx = match(A, B)) %>%
  filter(is.na(idx) | seq_along(idx) < idx) %>%
  select(-idx)
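As a quick sanity check, the match() approach can be run on the example data and compared with the desired output. This sketch uses only base R; stringsAsFactors = FALSE is added as a precaution for older R versions, and `res` is just an illustrative name.

```r
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)
df_Final <- data.frame(
  A = c("2","2","2","331","391","481"),
  B = c("43","501","502","491","496","490"),
  stringsAsFactors = FALSE
)

# Position of the first occurrence of each A value in column B (NA if absent)
idx <- match(df$A, df$B)

# Keep rows whose A value never appears in B, or appears in B only later
res <- df[is.na(idx) | seq_along(idx) < idx, ]
rownames(res) <- NULL

# Compare with the desired output
all.equal(res, df_Final)
```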
You can remove all rows which would be duplicates under some reordering with
require(dplyr)
df %>%
  apply(1, sort) %>% t %>%
  data.frame %>%
  group_by_all %>%
  slice(1)
I am working on medical claims data and the data file is as showcased below
claim_id status
abc123 P
abc123 R
xyz374 P
xyz386 R
I would like to create a new column called flag. Grouping by claim_id, if the statuses for the same claim_id include both "P" and "R", the flag column should contain "Yes", otherwise "No".
claim_id status flag
abc123 P Yes
abc123 R Yes
xyz374 P No
xyz386 R No
My approach using dplyr:
data <- data1 %>%
  group_by(claim_id) %>%
  mutate(flag = ifelse(any(status == "P" | status == "R"),
                       "Yes",
                       as.character(status)))
This approach takes a long time and also marks every row as Yes in the flag column.
Try this:
data1 <- data1 %>% group_by(claim_id) %>% mutate(flag = (n_distinct(status) == 2))
This one assumes that those are the only two possible values for the status field. If that is not true, you will need to do something like this:
data1 <- data1 %>% group_by(claim_id) %>% mutate(flag = (('P' %in% status) & ('R' %in% status)))
You can also do
data1 %>%
  group_by(claim_id) %>%
  mutate(flag = ifelse(all(c("P", "R") %in% status), "Yes", "No"))
However, it might be even better to use a logical flag. It avoids the ifelse altogether (making it faster) and makes subsetting really easy afterwards:
data1 %>%
  group_by(claim_id) %>%
  mutate(flag = all(c("P", "R") %in% status))
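For comparison, a logical flag can also be built without any grouping at all, which may help if the grouped version is slow on a large claims file (no timing is claimed here). This is a base R sketch on the four example rows; `has_p` and `has_r` are names made up for illustration.

```r
# Example rows typed in from the question
data1 <- data.frame(
  claim_id = c("abc123", "abc123", "xyz374", "xyz386"),
  status   = c("P", "R", "P", "R"),
  stringsAsFactors = FALSE
)

# claim_ids that have at least one "P" row, and at least one "R" row
has_p <- data1$claim_id[data1$status == "P"]
has_r <- data1$claim_id[data1$status == "R"]

# A claim is flagged when its id appears in both sets
data1$flag <- data1$claim_id %in% intersect(has_p, has_r)

# Subsetting on a logical flag is then a one-liner
data1[data1$flag, ]
```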