Consider the data frame below:
Sample data frame:

| Name | Age | Type |
|------|-----|------|
| EF   | 50  | A    |
| GH   | 60  | B    |
| VB   | 70  | C    |
Code to perform the filter:
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a data frame with a single column and a single row.
I would like to perform a conditional filter: if a certain Type is not present, the Name should come back as NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
should give the output

| Name |
|------|
| NA   |

instead of returning an empty data frame. Any input would be really helpful. Either dplyr or any other method would be appreciated.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
Should no Type match "D", the subset operation returns character(0). We can compare the output against this and return NA as appropriate.
Data:
df <- data.frame(Name = c("EF", "GH", "VB"),
                 Age = c(50, 60, 70),
                 Type = c("A", "B", "C"),
                 stringsAsFactors = FALSE)
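
A pure dplyr variant is also possible. Here is a minimal sketch (assuming the df above): out-of-range subscripting of a character vector yields NA, so indexing inside summarise gives a one-row result whether or not the Type exists.

library(dplyr)

# Name[Type == "D"] is character(0) when nothing matches;
# [1] then yields NA, so the result is always one row
df %>%
  summarise(Name = Name[Type == "D"][1])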
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)
df1 %>%
  complete(Type = LETTERS) %>% # specify which Types you'd expect; other values are filled with NA
  filter(Type == 'D') %>%
  select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
I've encountered another challenge: combining two or more rows into one based on an identifier column.
My dataset looks like this:
var <- c("round", "round", "round", "hhid", "hhid", "chid", "chid", "sex")
dfile <- c("df1", "df2", "df3", "df1", "df2", "df1", "df2", "df1")
uniquevar <- c("df1::round", "df2::round", "df3::round", "df1::hhid", "df2::hhid", "df1::chid", "df2::chid", "df1::sex")
flag <- c("dup", "dup", "dup", "dup", "dup", "dup", "dup", "NA")
df <- data.frame(var, dfile, flag)
I am trying to:
1. Find the observations marked "dup".
2. For those marked "dup", combine the two/three/multiple rows into one with the format df1::var | df2::var | df3::var.
So the ideal outcome would look like this:

var     dfile             uniquevar                              flag
round   df1 | df2 | df3   df1::round | df2::round | df3::round   dup
hhid    df1 | df2         df1::hhid | df2::hhid                  dup
chid    df1 | df2         df1::chid | df2::chid                  dup
sex     df1                                                      NA
So far I can only do this manually in Excel, which is really time-consuming. I would appreciate being shown how to achieve it in R, which would be much faster given the dataset contains over 600,000 observations.
Thanks a lot!
You can paste cells together after using group_by(var). Use sep = "::" for the separator between different columns and collapse = " | " for the separator between rows. You can do this inside summarize from the dplyr package.
library(dplyr)
df %>%
  group_by(var) %>%
  summarize(uniquevar = ifelse(all(flag == "dup"),
                               paste(dfile, var, sep = "::", collapse = " | "),
                               ""),
            dfile = paste(dfile, collapse = " | "),
            dup = flag[1]) %>%
  select(var, dfile, uniquevar, dup)
#> # A tibble: 4 x 4
#> var dfile uniquevar dup
#> <chr> <chr> <chr> <chr>
#> 1 chid df1 | df2 "df1::chid | df2::chid" dup
#> 2 hhid df1 | df2 "df1::hhid | df2::hhid" dup
#> 3 round df1 | df2 | df3 "df1::round | df2::round | df3::round" dup
#> 4 sex df1 "" NA
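
If you prefer data.table for a table of this size, a roughly equivalent sketch (same logic as above, assuming the df from the question) would be:

library(data.table)

setDT(df) # convert df to a data.table by reference
df[, .(dfile = paste(dfile, collapse = " | "),
       uniquevar = if (all(flag == "dup")) paste(dfile, var, sep = "::", collapse = " | ") else "",
       flag = flag[1]),
   by = var]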
Here is a sample of the dataset that I have. I am looking to find the state with the maximum number of stores (in this case, CA) and also to see how many IDs come from that state.
| ID  | State | Stores |
|-----|-------|--------|
| a11 | CA    | 16585  |
| a12 | CA    | 45552  |
| a13 | AK    | 7811   |
| a14 | MA    | 4221   |
I have this code using dplyr
max_state <- df %>%
  group_by(State) %>%
  summarise(total_stores = sum(Stores)) %>%
  top_n(1) %>%
  select(State)
This gives me "CA"
Can I pass this max_state variable through a filter and use summarise(n()) to count the number of IDs for CA?
A few ways:
# this takes your max_state (CA) and brings in the rows of
# your original table that have the same State
max_state %>%
  left_join(df) %>%
  summarize(n = n())

# filter the State in df to match the State in max_state
df %>%
  filter(State == max_state$State) %>%
  summarize(n = n())

# add Stores_total for each State, ungroup, keep only the rows of
# the State with the largest total, and count the IDs therein
df %>%
  group_by(State) %>%
  mutate(Stores_total = sum(Stores)) %>%
  ungroup() %>% # without this, max() is taken per group and nothing is filtered out
  filter(Stores_total == max(Stores_total)) %>%
  summarize(n = n())
You can combine more operations into one summarize call that will be applied to the same group:
df |>
  group_by(State) |>
  summarize(gsum = sum(Stores), nids = n()) |>
  filter(gsum == max(gsum))
##> # A tibble: 1 × 3
##>   State  gsum  nids
##>   <chr> <dbl> <int>
##> 1 CA    62137     2
Where the dataset df is obtained by:
df <- data.frame(ID = c("a11", "a12", "a13", "a14"),
                 State = c("CA", "CA", "AK", "MA"),
                 Stores = c(16585, 45552, 7811, 4221))
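
For comparison, a base R sketch of the same computation (assuming the df above):

totals <- tapply(df$Stores, df$State, sum) # total stores per state
top_state <- names(which.max(totals))      # state with the largest total
sum(df$State == top_state)                 # number of IDs from that state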
I have a data set that might contain some very similar keys, something like a row of data for each of the email addresses john.doe@foo.com and john.m.doe@foo.com. How can I combine similarly named keys and do an aggregate in R?
Sample input:

| Email              | Subscriptions |
|--------------------|---------------|
| john.doe@foo.com   | 10            |
| john.m.doe@foo.com | 11            |
| jane.doe@foo.com   | 20            |

Expected result:

| Email            | Subscriptions |
|------------------|---------------|
| john.doe@foo.com | 21            |
| jane.doe@foo.com | 20            |
I know agrep and a few other functions can do fuzzy matching, but how do I employ them to combine rows in a data set?
Here is one way to use agrep in combination with dplyr:
library(dplyr)

df <- data.frame(mail = c("john.doe@foo.com", "john.m.doe@foo.com", "jane.doe@foo.com"),
                 sub = c(10, 11, 20))

df %>%
  rowwise() %>%
  mutate(new = paste(agrep(mail, df$mail, max = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>%
  slice(1)
  mail                 sub new
  <fct>              <dbl> <chr>
1 john.doe@foo.com      21 1,2
2 jane.doe@foo.com      20 3
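
Note that this relies on each row in a group producing the identical set of agrep matches. If near-matches can chain (a matches b, b matches c, but a does not match c), a hedged alternative sketch is to cluster an explicit edit-distance matrix; the cutoff h = 2 here is an assumption you would tune:

# pairwise edit distances between all emails
d <- adist(as.character(df$mail))
# cluster labels: emails within edit distance 2 end up together
cl <- cutree(hclust(as.dist(d)), h = 2)
# sum subscriptions per cluster
aggregate(sub ~ cl, data = transform(df, cl = cl), FUN = sum)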
This question is essentially a duplicate of this question, except I am working in R. The pyspark solution looks solid, but I haven't been able to figure out how to apply collect_list over a window function in the same way in sparklyr.
I have a Spark DataFrame with the following structure:
------------------------------
userid | date       | city
------------------------------
1      | 2018-08-02 | A
1      | 2018-08-03 | B
1      | 2018-08-04 | C
2      | 2018-08-17 | G
2      | 2018-08-20 | E
2      | 2018-08-23 | F
I am trying to group the DataFrame by userid, order each group by date, and collapse the city column into a concatenation of its values. Desired output:
------------------
userid | cities
------------------
1      | A, B, C
2      | G, E, F
The trouble is that every method I've tried has resulted in some users (approx. 3% in a test of 5,000 users) not having their "cities" column in the correct order.
Attempt 1: using dplyr and collect_list.
my_sdf %>%
  dplyr::group_by(userid) %>%
  dplyr::arrange(date) %>%
  dplyr::summarise(cities = paste(collect_list(city), sep = ", "))
Attempt 2: using replyr::gapply since the operation fits the description of "Grouped-Order-Apply".
get_cities <- . %>%
  summarise(cities = paste(collect_list(city), sep = ", "))

my_sdf %>%
  replyr::gapply(gcolumn = "userid",
                 f = get_cities,
                 ocolumn = "date",
                 partitionMethod = "group_by")
Attempt 3: write as a SQL window function.
spark_session(sc) %>%
  sparklyr::invoke("sql",
                   "SELECT userid, CONCAT_WS(', ', collect_list(city)) AS cities
                      OVER (PARTITION BY userid
                            ORDER BY date)
                    FROM my_sdf") %>%
  sparklyr::sdf_register() %>%
  sparklyr::sdf_copy_to(sc, ., "my_sdf", overwrite = T)
^ throws the following error:
Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'OVER' expecting <EOF>(line 2, pos 19)
== SQL ==
SELECT userid, conversion_location, CONCAT_WS(' > ', collect_list(channel)) AS path
OVER (PARTITION BY userid, conversion_location
-------------------^^^
ORDER BY occurred_at)
FROM paths_model
Solved! I misunderstood how collect_list() and Spark SQL could work together. I didn't realize a list could be returned; I thought the concatenation had to take place within the query. Note the explicit window frame below: with an ORDER BY and no frame clause, Spark defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING is needed to collect every city for each user. The following produces the desired result:
spark_output <- spark_session(sc) %>%
  sparklyr::invoke("sql",
                   "SELECT userid, collect_list(city)
                      OVER (PARTITION BY userid
                            ORDER BY date
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
                      AS cities
                    FROM my_sdf") %>%
  sdf_register() %>%
  group_by(userid) %>%
  filter(row_number(userid) == 1) %>%
  ungroup() %>%
  mutate(cities = paste(cities, sep = " > ")) %>%
  sdf_register()
Ok: so I admit that the following solution is not at all efficient (it uses a for loop and is actually a lot of code for what seems like it could be a simple task), but I believe this should work:
# install.packages("tidyverse") # if needed
library(tidyverse)

df <- tribble(
  ~userid, ~date,        ~city,
  1,       "2018-08-02", "A",
  1,       "2018-08-03", "B",
  1,       "2018-08-04", "C",
  2,       "2018-08-17", "G",
  2,       "2018-08-20", "E",
  2,       "2018-08-23", "F"
)
cityPerId <- df %>%
  spread(key = date, value = city)

toMutate <- NA
for (i in 1:nrow(cityPerId)) {
  cities <- cityPerId[i, ][2:ncol(cityPerId)] %>%
    t() %>%
    as.vector() %>%
    na.omit()
  collapsedCities <- paste(cities, collapse = ",")
  toMutate <- c(toMutate, collapsedCities)
}
toMutate <- toMutate[2:length(toMutate)]

final <- cityPerId %>%
  mutate(cities = toMutate) %>%
  select(userid, cities)
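
For a local tibble like the one above, a far shorter dplyr sketch achieves the same result (this is not a sparklyr translation; it assumes the data fits in memory):

df %>%
  group_by(userid) %>%
  arrange(date, .by_group = TRUE) %>% # order each user's rows by date
  summarise(cities = paste(city, collapse = ", "))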
I have this table:

Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
5/1/1 | B       | 4
4/1/1 | C       | 5
1/1/1 | A       | 1
7/1/1 | B       | 2
1/1/1 | C       | 3
I need this table:

Date  | Column1 | Column2
------+---------+--------
6/1/1 | A       | 3
4/1/1 | C       | 5
7/1/1 | B       | 2
How do I remove the old rows, keeping only the most recent row (by Date) for each value of Column1, along with its Column2?
Group by Column1, arrange by Date in descending order, then keep the first row of each group with slice, like this:
library(dplyr)

ans <- df %>%
  group_by(Column1) %>%
  arrange(desc(as.Date(Date))) %>% # most recent dates first
  slice(1) %>%                     # keep the first (most recent) row of each group
  ungroup()
Your error is occurring because your date format is a bit unusual. I recommend lubridate::parse_date_time, which is more robust than base R's datetime functions:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1) %>%
  arrange(desc(parse_date_time(Date, orders = "mdy"))) %>% # dates are month-day-year
  slice(1) %>% # keep the first (most recent) row of each group
  ungroup()
EDIT
Based on a helpful comment by @count, we can simplify the dplyr chain to:
library(lubridate)
library(dplyr)

ans <- df %>%
  group_by(Column1) %>%
  slice(which.max(parse_date_time(Date, orders = "mdy"))) %>% # keep the max-Date row of each group
  ungroup()
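
For reference, the snippets above can be tested against sample data matching the table in the question:

df <- data.frame(Date = c("6/1/1", "5/1/1", "4/1/1", "1/1/1", "7/1/1", "1/1/1"),
                 Column1 = c("A", "B", "C", "A", "B", "C"),
                 Column2 = c(3, 4, 5, 1, 2, 3),
                 stringsAsFactors = FALSE)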