I have this table:
Date | Column1 | Column2
------+---------+--------
6/1/1 | A | 3
5/1/1 | B | 4
4/1/1 | C | 5
1/1/1 | A | 1
7/1/1 | B | 2
1/1/1 | C | 3
I need this table:
Date | Column1 | Column2
------+---------+--------
6/1/1 | A | 3
4/1/1 | C | 5
7/1/1 | B | 2
How can I remove the older rows based on two criteria (Column1, Column2)?
Group by Column1 and Column2, sort by date in descending order, then keep the first row of each group with slice, like this:
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  arrange(desc(as.Date(Date))) %>% # newest first; arrange() ignores groups unless .by_group = TRUE
  slice(1) %>% # keep the first (newest) row of each group
  ungroup()
Your error occurs because your date format is non-standard. I recommend using lubridate::parse_date_time, which is more flexible than the base R date-time functions:
library(lubridate)
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  # parse_date_time takes the format via `orders`; "mdy" means month-day-year
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # newest first
  slice(1) %>% # keep the first (newest) row of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to:
library(lubridate)
library(dplyr)
ans <- df %>%
  group_by(Column1, Column2) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the max-Date row of each group
  ungroup()
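For reference, here is a minimal self-contained sketch of the same idea using slice_max. It rests on a few assumptions: the sample dates below are made up with four-digit years in month/day/year form (the dates in the question are abbreviated), `parsed` is a helper column introduced only for illustration, and the grouping is on Column1 alone because the desired output has one row per Column1 value. Adjust the grouping columns to your real data.

```r
library(dplyr)

# Hypothetical sample data: same layout as the question, assumed full dates
df <- data.frame(
  Date    = c("6/1/2001", "5/1/2001", "4/1/2001", "1/1/2001", "7/1/2001", "1/1/2001"),
  Column1 = c("A", "B", "C", "A", "B", "C"),
  Column2 = c(3, 4, 5, 1, 2, 3)
)

ans <- df %>%
  mutate(parsed = as.Date(Date, format = "%m/%d/%Y")) %>% # parse once, up front
  group_by(Column1) %>%
  slice_max(parsed, n = 1, with_ties = FALSE) %>%         # newest row per group
  ungroup() %>%
  select(-parsed)

print(ans)
```

With these assumed dates, the result keeps one row per Column1 value: the newest one.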
I have run into another challenge: combining two or more rows into one, based on an identifier column.
My dataset looks like this:
var<-c("round","round","round","hhid","hhid","chid","chid","sex")
dfile<-c("df1","df2","df3","df1","df2","df1","df2","df1")
uniquevar<-c("df1::round","df2::round","df3::round", "df1::hhid","df2::hhid","df1::chid","df2::chid","df1::sex")
flag<-c("dup","dup","dup","dup","dup","dup","dup","NA")
df<-data.frame(var, dfile,flag)
I am trying to do the following:
- find the observations that are marked as "dup"
- if an observation is marked as "dup", combine the two, three, or more rows into one, using the format:
df1::var | df2::var | df3::var
So the ideal outcome would look like this:
var    dfile            uniquevar                             flag
round  df1 | df2 | df3  df1::round | df2::round | df3::round  dup
hhid   df1 | df2        df1::hhid | df2::hhid                 dup
chid   df1 | df2        df1::chid | df2::chid                 dup
sex    df1                                                    NA
So far I can only do this manually in Excel, which is really time-consuming. I would appreciate being told how to achieve this in R, which would be much faster considering the dataset contains over 600,000 observations.
Thanks a lot!
You can paste cells together after using group_by(var). Use sep = "::" to specify the separator between different columns, and collapse = " | " for the separator representing rows. You can do this inside summarize from the dplyr package.
library(dplyr)
df %>%
  group_by(var) %>%
  summarize(uniquevar = ifelse(all(flag == "dup"),
                               paste(dfile, var, sep = "::", collapse = " | "),
                               ""),
            dfile = paste(dfile, collapse = " | "),
            dup = flag[1]) %>%
  select(var, dfile, uniquevar, dup)
#> # A tibble: 4 x 4
#> var dfile uniquevar dup
#> <chr> <chr> <chr> <chr>
#> 1 chid df1 | df2 "df1::chid | df2::chid" dup
#> 2 hhid df1 | df2 "df1::hhid | df2::hhid" dup
#> 3 round df1 | df2 | df3 "df1::round | df2::round | df3::round" dup
#> 4 sex df1 "" NA
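To see the difference between sep and collapse in isolation, here is a small sketch using values from the question (the variable names `pairs` and `combined` are just for illustration):

```r
dfile <- c("df1", "df2", "df3")

# sep joins the pieces of each element, giving one string per input element
pairs <- paste(dfile, "round", sep = "::")
print(pairs)

# collapse then joins those elements into a single string
combined <- paste(dfile, "round", sep = "::", collapse = " | ")
print(combined)
```

Without collapse the result has one element per row of the group; with collapse it is a single string, which is what summarize needs.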
Here is a sample of the dataset that I have. I am looking to find the state that has the maximum number of stores (in this case, CA), and also to see how many IDs come from that state.
| ID  | State | Stores |
| --- | ----- | ------ |
| a11 | CA    | 16585  |
| a12 | CA    | 45552  |
| a13 | AK    | 7811   |
| a14 | MA    | 4221   |
I have this code using dplyr:
max_state <- df %>%
  group_by(State) %>%
  summarise(total_stores = sum(Stores)) %>%
  top_n(1) %>%
  select(State)
This gives me "CA".
Can I pass this variable, max_state, through a filter and use summarise(n()) to count the number of IDs for CA?
A few ways:
# this takes your max_state (CA) and brings in the parts of
# your original table that have the same State
max_state %>%
  left_join(df, by = "State") %>%
  summarize(n = n())
# filter the State in df to match the State in max_state
df %>%
  filter(State == max_state$State) %>%
  summarize(n = n())
# Add Stores_total for each State, drop the grouping so that max()
# is taken over all rows, keep only the rows of the max State,
# and count the rows (IDs) therein
df %>%
  group_by(State) %>%
  mutate(Stores_total = sum(Stores)) %>%
  ungroup() %>%
  filter(Stores_total == max(Stores_total)) %>%
  count(State)
You can combine more operations into one summarize call that will be applied to the same group:
df |>
  group_by(State) |>
  summarize(gsum = sum(Stores), nids = n()) |>
  filter(gsum == max(gsum))
#> # A tibble: 1 × 3
#>   State  gsum  nids
#>   <chr> <dbl> <int>
#> 1 CA    62137     2
Where the dataset df is obtained by:
df <- data.frame(ID = c("a11", "a12","a13", "a14"),
State = c("CA", "CA", "AK", "MA"),
Stores = c(16585, 45552, 7811, 4221))
I have a data set that might contain some very similar keys, for example one row of data for each of the email addresses john.doe#foo.com and john.m.doe#foo.com. How can I combine similarly named keys and aggregate over them in R?
Sample input
|Email | Subscriptions |
-------------------------------------
|john.doe#foo.com | 10 |
|john.m.doe#foo.com | 11 |
|jane.doe#foo.com | 20 |
Expected result
|Email | Subscriptions |
-------------------------------------
|john.doe#foo.com | 21 |
|jane.doe#foo.com | 20 |
I know agrep and a few other libraries can do fuzzy matching, but how do I use it to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
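library(dplyr)
df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20))
df %>%
  rowwise() %>%
  # for each address, list the row indices of all addresses within 2 edits of it
  mutate(new = paste(agrep(mail, df$mail, max.distance = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>% # total per group of similar addresses
  slice(1)                   # keep one representative row per group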
mail sub new
<fct> <dbl> <chr>
1 john.doe#foo.com 21 1,2
2 jane.doe#foo.com 20 3
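To make the matching step easier to inspect, here is a small sketch that calls agrep on its own with the same sample addresses (the names `mails` and `idx` are just for illustration). It assumes the same threshold of at most 2 edits as in the answer above; whether that threshold suits real data is something to check, since too large a value will merge unrelated addresses.

```r
mails <- c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com")

# Indices of all entries that approximately match the first address
idx <- agrep(mails[1], mails, max.distance = 2, ignore.case = TRUE)
print(idx)

# value = TRUE returns the matching strings instead of their indices
print(agrep(mails[1], mails, max.distance = 2, ignore.case = TRUE, value = TRUE))
```

Note that grouping on the pasted index string, as done above, only merges rows whose match sets are identical, so it is worth checking the `new` column before aggregating.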
Consider the data frame given below:
Sample DataFrame
| Name | Age | Type |
---------------------
| EF | 50 | A |
| GH | 60 | B |
| VB | 70 | C |
Code to perform Filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a data frame with a single column and a single row.
I would like to perform a conditional filter where, if a certain type is not present, the name is treated as NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
This must give the following output:
| Name |
--------
| NA |
instead of throwing an error or returning nothing. Any input would be really helpful. Solutions using dplyr or any other method are welcome.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no row has Type "D", the subset operation returns character(0). We can compare the output against this and return NA as appropriate.
Data:
df <- data.frame(Name=c("EF", "GH", "VB"),
Age=c(50, 60, 70),
Type=c("A", "B", "C"),
stringsAsFactors=FALSE)
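The same check can be wrapped in a small function. This is only a sketch: `name_for_type` is a hypothetical helper name, and it tests the length of the result rather than comparing against character(0), which also works if the column is not a character vector.

```r
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age  = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: Name(s) for a given Type, or NA if the Type is absent
name_for_type <- function(data, type) {
  name <- data[data$Type == type, "Name"]
  if (length(name) == 0) NA_character_ else name
}

print(name_for_type(df, "C"))
print(name_for_type(df, "D"))
```

Unlike the ifelse version, this returns all matching names when several rows share the requested type.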
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)
df1 %>%
  complete(Type = LETTERS) %>% # specify which Types you expect; missing ones are added, with NA in the other columns
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
I have a table like this (the number of columns can vary; there are several ref_* + alt_* pairs):
+--------+-------+-------+-------+-------+
| GeneID | ref_a | alt_a | ref_b | alt_b |
+--------+-------+-------+-------+-------+
| a1 | 0 | 1 | 1 | 3 |
| a2 | 1 | 1 | 7 | 8 |
| a3 | 0 | 1 | 1 | 3 |
| a4 | 0 | 1 | 1 | 3 |
+--------+-------+-------+-------+-------+
I need to filter out rows that have ref_a + alt_a < 10 and ref_b + alt_b < 10. It's easy to do with apply, by creating additional columns and filtering on them, but I'm learning to keep my data tidy, so I'm trying to do it with dplyr.
I would use mutate first to create columns with the sums and then filter by these sums, but I can't figure out how to use mutate in this case.
Edited:
The number of columns is not fixed!
You do not need to mutate here. Just do the following:
library(tidyverse)
df %>%
  filter(ref_a + alt_a < 10 & ref_b + alt_b < 10)
If you want to use mutate first you could go with:
df %>%
  mutate(sum1 = ref_a + alt_a, sum2 = ref_b + alt_b) %>%
  filter(sum1 < 10 & sum2 < 10)
Edit: The fact that we don't know the number of variables in advance makes it a bit more complicated. However, I think you could use the following code to perform this task (assuming that the variable names are all formatted with "_a", "_b" and so on). I hope there is a shorter way to do this :)
df$GeneID <- as.character(df$GeneID)
df %>%
  gather(variable, value, -GeneID) %>%                          # long format: one row per GeneID/column
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%    # keep only the suffix ("a", "b", ...)
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%                               # ref + alt per GeneID and suffix
  filter(sum < 10) %>%
  summarise(keepGeneID = n() == (ncol(df) - 1)/2) %>%           # TRUE if every pair passed
  filter(keepGeneID) %>%
  select(GeneID) -> ids
df %>%
  filter(GeneID %in% ids$GeneID)
Edit 2: After some rework I was able to improve the code a bit:
df$GeneID <- as.character(df$GeneID)
df %>%
  gather(variable, value, -GeneID) %>%
  rowwise() %>%
  mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
  ungroup() %>%
  group_by(GeneID, variable) %>%
  summarise(sum = sum(value)) %>%     # ref + alt per GeneID and suffix
  group_by(GeneID) %>%
  summarise(max = max(sum)) %>%       # largest pair sum per GeneID
  filter(max < 10) -> ids
df %>%
  filter(GeneID %in% ids$GeneID)
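As a possible shorter variant, here is a sketch of the same logic with pivot_longer, which has since superseded gather in tidyr. It assumes every non-ID column is named `<kind>_<sample>` with exactly one underscore, and the helper names (`kind`, `sample`, `total`, `keep`, `keep_ids`) are introduced only for illustration. Like the code above, it keeps the rows in which every ref + alt sum is below 10.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(GeneID = c("a1", "a2", "a3", "a4"),
                 ref_a = c(0, 1, 0, 0), alt_a = c(1, 1, 1, 1),
                 ref_b = c(1, 7, 1, 1), alt_b = c(3, 8, 3, 3))

keep_ids <- df %>%
  # split "ref_a" into kind = "ref" and sample = "a"
  pivot_longer(-GeneID, names_to = c("kind", "sample"), names_sep = "_") %>%
  group_by(GeneID, sample) %>%
  summarise(total = sum(value), .groups = "drop") %>% # ref + alt per pair
  group_by(GeneID) %>%
  summarise(keep = all(total < 10)) %>%               # every pair below 10?
  filter(keep) %>%
  pull(GeneID)

result <- df %>%
  filter(GeneID %in% keep_ids)

print(result)
```

Because the pairs are discovered from the column names, the same code works for any number of ref_*/alt_* pairs.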