Subset a dplyr result - r

I'm trying to subset the result of a dplyr call. Can someone explain why this doesn't work?
library(dplyr)
df <- data.frame(name = c("bob", "ann"), age = c(22, 24), random = c(1, 2))
View(df %>% filter(name == "bob")) # works fine
# Now, to avoid showing the random column, I tried:
View(df %>% filter(name == "bob")[, c(1, 2)]) # standard subset notation to remove column 3 doesn't work here

I think if you're going to use dplyr to filter the data frame, you should also use dplyr to select from it. I'm not sure whether there's any performance difference.
df %>%
  filter(name == "bob") %>%
  select(1, 2)
df %>%
  filter(name == "bob") %>%
  select(name, age)
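For completeness, here is a sketch of why the original attempt fails, based on my understanding of magrittr's rules: `%>%` treats the whole expression `filter(...)[, c(1, 2)]` as the right-hand call, so `df` is inserted as the first argument of `[` rather than of `filter()`. Wrapping the piped expression in parentheses makes standard subset notation work on the filtered result:

```r
library(dplyr)

df <- data.frame(name = c("bob", "ann"), age = c(22, 24), random = c(1, 2))

# Parentheses force the pipe to finish before `[` is applied,
# so base subsetting operates on the filtered data frame:
res <- (df %>% filter(name == "bob"))[, c(1, 2)]
res
```

That said, staying within dplyr (`filter()` then `select()`) is the more idiomatic route.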

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter or subset from dplyr to get a new data frame containing only the rows whose value in a selected column occurs exactly 2 times in the original data.frame.
I try this:
df2 <- df %>%
  group_by(x) %>%
  mutate(duplicate = n()) %>%
  filter(duplicate == 2)
and this
df2 <- subset(df, duplicated(x))
but neither option works.
In the group_by, just use the unquoted column name. Also, we don't need to create a column in mutate before filtering; it can be done directly on the fly in filter:
library(dplyr)
df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()
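A worked run of the approach above on hypothetical toy data (the column values are made up for illustration), plus a base-R equivalent with subset()/ave() since the question also tried subset():

```r
library(dplyr)

# Toy data: "b" occurs exactly twice, "a" once, "c" three times
df <- data.frame(x = c("a", "b", "b", "c", "c", "c"))

# Keep only rows whose value occurs exactly twice
df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()

# Base-R equivalent: ave() computes the per-group count for each row
df2_base <- subset(df, ave(seq_along(x), x, FUN = length) == 2)
```

Both results contain only the two "b" rows.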

Filtering by Numerical Variable but Need to Satisfy Multiple Categorical Groups

I'm working with a modified version of the babynames dataset, which can be obtained by installing the babynames package and running:
# to install the package
install.packages('babynames')
# to load the package
library(babynames)
# to get the one data frame of interest from the package
babynames <- babynames::babynames
# the modified data that I'm working with
babynames_prop_sexes <- babynames %>%
  select(-prop, -year) %>%
  group_by(name, sex) %>%
  mutate(total_occurrence = sum(n))
I need to sort out names that have more than 10000 occurrences for both sexes. How can I approach this? (Preferably using dplyr, but any method is welcome.)
Thanks in advance for any help!
There might be a more elegant solution, but this should get you a list of names that appear with more than 10000 entries as both an M and an F.
For the method, I just kept going with dplyr verbs. After using filter to drop the entries that appear 10000 times or fewer, I could then group_by the name and use tally(), knowing that n = 2 when that entry appeared twice: once for M and once for F.
large_total_both_genders_same_name <- babynames %>%
  group_by(name, sex) %>%
  summarize(total = sum(n)) %>%
  filter(total > 10000) %>%
  arrange(name) %>%
  group_by(name) %>%
  tally() %>%
  arrange(desc(n)) %>%
  filter(n == 2) %>%
  dplyr::select(name)
And if you want to filter your original file by that shortlist of names you can use a semi_join on the table we created, to shorten up the list. In this case, it wouldn't be obvious what you are looking at unless you also included the year column, which you removed.
original_babynames_shortened <- babynames_prop_sexes %>%
  filter(name %in% large_total_both_genders_same_name$name)
But anyway, this is a common process: create a summary table of some kind that is saved as its own 'intermediary' table, so to speak, then join it to your original as a filter. Sometimes this can all be done in one go, but in my opinion it's often easier to break it into two pieces.
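The semi_join() mentioned above can be sketched with hypothetical toy tables (`shortlist` and `original` are made-up names for illustration); it keeps the rows of the left table that have a match in the right table, without adding any of the right table's columns:

```r
library(dplyr)

# Hypothetical summary table of qualifying names
shortlist <- data.frame(name = c("ann", "bob"))

# Hypothetical original data
original <- data.frame(
  name = c("ann", "bob", "carl", "ann"),
  n    = c(10, 20, 30, 40)
)

# Keep only rows of `original` whose name appears in `shortlist`
shortened <- original %>%
  semi_join(shortlist, by = "name")
```

This is equivalent in spirit to the `filter(name %in% ...)` call shown above, but it scales more naturally when the join key has multiple columns.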

Filter the first group after group_by

Sometimes it is handy to take a test case out of your data when working with group_by() from the dplyr library. I was wondering if there is any fast way to just grab the first group of a grouped dataframe and cast it to a new dataframe.
All I could come up with was this workaround:
library(dplyr)
smalldf <- mtcars %>% group_by(gear) %>% group_split(.) %>% .[[1]]
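An alternative sketch that avoids splitting the whole data frame, assuming dplyr >= 1.0 (where cur_group_id() is available inside grouped verbs): filter to the rows of the first group directly. Like group_split(), group order follows the sorted grouping values, so for mtcars this is the gear == 3 group:

```r
library(dplyr)

# Keep only rows belonging to the first group, then drop the grouping
smalldf <- mtcars %>%
  group_by(gear) %>%
  filter(cur_group_id() == 1) %>%
  ungroup()
```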

Using map on specific column in list?

I'm trying to split a data frame into a list of data frames and then sort each one by a specific variable using map(). I thought my approach would work, but I'm obviously not passing something to the function correctly, and I'm unsure how to fix it. For instance, using lapply() I could do this:
library(tidyverse)
df = iris
df %>%
  group_split(Species) %>%
  {lapply(., function(x) {x %>% arrange(desc(Sepal.Length))})}
Using map(), I've tried this approach but it's not working:
df %>%
  group_split(Species) %>%
  map(., arrange(Sepal.Length), desc)
How can I structure this so it works? I only want to apply the map() to one of the columns as in the lapply() example.
df %>%
  group_split(Species) %>%
  map(~arrange(.data = .x, desc(Sepal.Length)))
or
df %>%
  group_split(Species) %>%
  map(~.x %>% arrange(desc(Sepal.Length)))
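As a design note, if the only goal is a list of sorted pieces, the map()/lapply() step can be avoided entirely by arranging before splitting; group_split() preserves row order within each group, so each piece comes out already sorted:

```r
library(dplyr)

df <- iris

# Arrange first, then split: every element of the list is sorted
# by descending Sepal.Length within its Species
sorted_list <- df %>%
  arrange(Species, desc(Sepal.Length)) %>%
  group_split(Species)
```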

How to use group_by() with rep_len() in R

Let me know if I need a dummy example for this, but essentially I have a df of subgroups, each subgroup a different length (typically 30-35k values). I'd like to bind in a vector with partial vector recycling of c(1:200). From this question I figure I can use rep_len() to get around the data frame's anti-partial-recycling. The problem is, I can't define length.out in rep_len(), since length.out changes with each subgroup. I tried doing this:
df_new <- df %>%
  group_by(subgroup) %>%
  mutate(newcol <- rep_len(1:200, length.out=.))
Which threw an invalid length.out error. I also tried
df_new <- df %>%
  group_by(subgroup) %>%
  mutate(newcol <- rep_len(1:200, length.out=nrow(.)))
But this throws an error because length.out ends up being the length of my entire df, not the length of each subgroup. Any help would be appreciated!
The dplyr package has a group-size function n() which could work:
mtcars %>%
  group_by(cyl) %>%
  mutate(newcol = rep_len(1:200, length.out = n()))
Also, in the mutate statement it should be an "=", not a "<-".
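An equivalent sketch without rep_len(), in case it's useful: the same 1..200 cycle can be built from the within-group row number with modular arithmetic, since row_number() restarts at 1 in each group:

```r
library(dplyr)

# (row_number() - 1) %% 200 + 1 cycles 1, 2, ..., 200, 1, 2, ...
# within each group, matching rep_len(1:200, n())
df_new <- mtcars %>%
  group_by(cyl) %>%
  mutate(newcol = (row_number() - 1) %% 200 + 1) %>%
  ungroup()
```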
