How do I get counts from multiple columns in R?

I have a data frame with three columns: State1, State2, State3. Is there a way to get the counts of each state in one dataframe, using all three columns (preferably with dplyr and without an explicit loop)? I only figured out how to do one column:
df %>% group_by(State1) %>% summarise(n=sum(!is.na(State1)))

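You're close. Gather all three columns into a single column first, then group_by and summarise. Note that column names are case-sensitive, so they must match State1, State2, State3 exactly.
library(dplyr)
library(tidyr)
df %>%
gather("key", "value", State1, State2, State3) %>%
group_by(value) %>%
summarise(n = n())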
Note: This also counts the number of NA entries if you have any.
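If the NA entries should not be counted, one option is to drop them while reshaping. The following is only a sketch, assuming tidyr 1.0.0 or later (where pivot_longer() supersedes gather()); the toy data frame is made up for illustration and is not the asker's data.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data with the same three columns as in the question
df <- data.frame(
  State1 = c("NY", "CA", "TX"),
  State2 = c("CA", NA, "NY"),
  State3 = c("NY", "TX", NA),
  stringsAsFactors = FALSE
)

state_counts <- df %>%
  pivot_longer(
    cols = c(State1, State2, State3),
    names_to = "key",
    values_to = "value",
    values_drop_na = TRUE   # leave NA entries out of the counts
  ) %>%
  count(value, name = "n")  # shorthand for group_by(value) %>% summarise(n = n())

print(state_counts)
```

count() is used here only for brevity; the group_by/summarise pair from the answer gives the same table.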

Related

R filter or subset for finding a specific repeat count for data.frame

I want to use filter from dplyr, or subset, to get a new data frame that contains only the rows whose value in the selected column occurs exactly 2 times in the original data.frame.
I tried this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this:
df2 <- subset(df, duplicated(x))
but neither option works.
In group_by, just use the unquoted column name. Also, there is no need to create a column with mutate before filtering; the count can be computed on the fly inside filter:
library(dplyr)
df %>%
group_by(x) %>%
filter(n() == 2) %>%
ungroup()
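To see what the grouped filter keeps, here is a small self-contained sketch on made-up data (the data frame and its y column are assumptions for illustration). It also shows why the subset/duplicated attempt answers a different question.

```r
library(dplyr)

# Hypothetical example data: "a" occurs twice, "b" once, "c" three times
df <- data.frame(
  x = c("a", "a", "b", "c", "c", "c"),
  y = 1:6
)

df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%   # keep only groups with exactly two rows
  ungroup()

print(df2)

# duplicated(x) flags every occurrence after the first one, whatever the
# total count, so this keeps one "a" row and two "c" rows instead
dups <- subset(df, duplicated(x))
print(dups)
```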

Filter the first group after group_by

Sometimes it is handy to take a test case out of your data when working with group_by() from the dplyr library. I was wondering if there is any fast way to just grab the first group of a grouped dataframe and cast it to a new dataframe.
All I could come up with was this workaround:
library(dplyr)
smalldf <- mtcars %>% group_by(gear) %>% group_split(.) %>% .[[1]]
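One possible alternative to the group_split() workaround, sketched under the assumption that dplyr 1.0.0 or later is installed (cur_group_id() was added in that release), is to filter on the group index directly:

```r
library(dplyr)

# cur_group_id() gives each group an integer id (1, 2, ...) following
# the sorted order of the group keys, so id 1 is the smallest gear value
smalldf <- mtcars %>%
  group_by(gear) %>%
  filter(cur_group_id() == 1) %>%
  ungroup()

print(unique(smalldf$gear))
print(nrow(smalldf))
```

This keeps everything in one pipeline and returns an ordinary ungrouped tibble; whether it is faster than group_split() has not been measured here.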

How to remove outliers in only one column after grouping by another column in R

I want to remove outliers from a variable MEASURE after grouping by TYPE. I tried the following code, but it didn't work. I've searched, but I've only come across how to remove outliers for the whole dataframe or for one column, not after grouping.
df2 <- df %>%
group_by(TYPE) %>%
mutate(MEASURE_WITHOUT_OUTLIERS = remove_outliers(MEASURE))
You can use boxplot.stats to get outlier values in each group and use filter to remove them.
library(dplyr)
df2 <- df %>%
group_by(TYPE) %>%
filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
ungroup
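A minimal runnable sketch of the same approach on made-up data (the values below are assumptions chosen so that one group has an obvious outlier). boxplot.stats() uses the usual 1.5 * IQR rule by default.

```r
library(dplyr)

# Hypothetical example data: group "a" contains one extreme value (100)
df <- data.frame(
  TYPE    = rep(c("a", "b"), each = 5),
  MEASURE = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

df2 <- df %>%
  group_by(TYPE) %>%
  # boxplot.stats() is evaluated once per group, so the outlier
  # thresholds are computed from each TYPE's own values
  filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
  ungroup()

print(df2)
```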

What does n=n( ) mean in R?

The other day I was reading the following lines of R and I don't understand what %>%, summarise(n=n()), and summarise(total=n()) mean. I understand the group_by and ungroup functions, though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>% summarise(n=n()) %>%
ungroup() %>% group_by(n) %>% summarise(total=n())
This is from the dplyr package. n=n() means that a variable named n will be assigned the number of rows (think number of observations) in each group of the data being summarised.
The %>% operator is read as "and then" and is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, and then summarise each group by its number of rows, and then ungroup that result, and then group the ungrouped data by n, and then summarise that by the total number of rows in each of the new groups.
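The two summarise steps can be followed on a tiny made-up data frame. This is only an illustration: the net values below are assumptions, and per_pair is a hypothetical intermediate name added so each stage can be inspected.

```r
library(dplyr)

# Hypothetical example data: six observations of (nodeid, epoch) pairs
net <- data.frame(
  nodeid = c(1, 1, 2, 2, 2, 3),
  epoch  = c(1, 1, 1, 1, 2, 1)
)

# Step 1: n = number of rows in each (nodeid, epoch) group
per_pair <- net %>%
  group_by(nodeid, epoch) %>%
  summarise(n = n()) %>%
  ungroup()

# Step 2: total = number of (nodeid, epoch) pairs that share the same n
net.multiplicity <- per_pair %>%
  group_by(n) %>%
  summarise(total = n())

print(per_pair)
print(net.multiplicity)
```

Here two pairs occur once and two pairs occur twice, so the final table has one row for n = 1 and one row for n = 2, each with total = 2.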

dplyr: getting group_by-column even when not selecting it

When selecting columns I get one column I haven't selected but it's a group_by column:
library(magrittr)
library(dplyr)
df <- data.frame(i=c(1,1,1,1,2,2,2,2), j=c(1,2,1,2,1,2,1,2), x=runif(8))
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
filter(i==1) %>%
select(s)
I get column i even though I haven't selected it:
  i         s
1 1 0.8355195
2 1 0.9322474
Why does this happen (why not column j?) and how can I avoid it? Okay I could filter at the beginning....
That's because grouping variables are carried along by default. See the dplyr vignette:
Grouping affects the verbs as follows: grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
Note that (each) summarize peels off one layer of grouping (in your case, j), so after the summarize, your data is only grouped by i and that is printed in the output. If you don't want that, you can ungroup the data before selecting s:
require(dplyr)
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
ungroup() %>%
filter(i==1) %>%
select(s)
#Source: local data frame [2 x 1]
#
#         s
#1 1.129867
#2 1.265131
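As a variation on the ungroup() fix, and assuming dplyr 1.0.0 or later, the .groups argument of summarise() can drop all grouping in the same step. The set.seed() call is added here only to make the sketch reproducible.

```r
library(dplyr)

set.seed(1)  # only for a reproducible illustration
df <- data.frame(i = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 1, 2, 1, 2, 1, 2),
                 x = runif(8))

result <- df %>%
  group_by(i, j) %>%
  # .groups = "drop" returns an ungrouped result, so select() below
  # has no grouping variable left to retain
  summarise(s = sum(x), .groups = "drop") %>%
  filter(i == 1) %>%
  select(s)

print(names(result))
```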
