When selecting columns, I get one column I haven't selected; it is one of the group_by columns:
library(magrittr)
library(dplyr)
df <- data.frame(i=c(1,1,1,1,2,2,2,2), j=c(1,2,1,2,1,2,1,2), x=runif(8))
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
filter(i==1) %>%
select(s)
I get column i even though I haven't selected it:
i s
1 1 0.8355195
2 1 0.9322474
Why does this happen (why not column j?) and how can I avoid it? Okay I could filter at the beginning....
That's because the grouping variable is carried along by default. Please see the dplyr vignette:
Grouping affects the verbs as follows: grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
Note that (each) summarize peels off one layer of grouping (in your case, j), so after the summarize, your data is only grouped by i and that is printed in the output. If you don't want that, you can ungroup the data before selecting s:
require(dplyr)
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
ungroup() %>%
filter(i==1) %>%
select(s)
#Source: local data frame [2 x 1]
#
# s
#1 1.129867
#2 1.265131
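As an alternative to a separate ungroup() step, here is a minimal sketch that assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument. The set.seed() call is my addition so the example is reproducible; it is not in the original question.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
df <- data.frame(i = c(1, 1, 1, 1, 2, 2, 2, 2),
                 j = c(1, 2, 1, 2, 1, 2, 1, 2),
                 x = runif(8))

# .groups = "drop" returns an ungrouped result, so no grouping
# variable is left for select() to retain
res <- df %>%
  group_by(i, j) %>%
  summarize(s = sum(x), .groups = "drop") %>%
  filter(i == 1) %>%
  select(s)

names(res)
```

With the grouping dropped inside summarize(), select(s) should return only the s column.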
I want to use filter or subset to get a new data frame containing only the rows whose value in the selected column occurs exactly 2 times in the original data.frame.
I tried this:
df2 <-
df %>%
group_by(x) %>% mutate(duplicate = n()) %>%
filter(duplicate == 2)
and this
df2 <- subset(df,duplicated(x))
but neither option works.
In the group_by, just use the unquoted column name. Also, we don't need to create a column with mutate before filtering; the count can be computed on the fly inside filter:
library(dplyr)
df %>%
group_by(x) %>%
filter(n() ==2) %>%
ungroup()
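To make the behaviour concrete, here is a small sketch on made-up data (the data frame and its id column are hypothetical, not from the question). It also hints at why subset(df, duplicated(x)) gives a different result: duplicated() flags only the second and later occurrences of a value, regardless of how often the value appears in total.

```r
library(dplyr)

# Hypothetical sample data: "a" occurs twice, "b" three times, "c" once
df <- data.frame(x = c("a", "b", "a", "b", "b", "c"), id = 1:6)

# Keep only rows whose x value occurs exactly twice
df2 <- df %>%
  group_by(x) %>%
  filter(n() == 2) %>%
  ungroup()

# For comparison: duplicated() marks repeats, not "exactly two" values
dup_rows <- subset(df, duplicated(x))
```

Here df2 keeps both "a" rows, while dup_rows keeps the repeat occurrences of both "a" and "b".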
Sometimes it is handy to take a test case out of your data when working with group_by() from the dplyr library. I was wondering if there is any fast way to just grab the first group of a grouped dataframe and cast it to a new dataframe.
All I could come up with was this workaround:
library(dplyr)
smalldf <- mtcars %>% group_by(gear) %>% group_split(.) %>% .[[1]]
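One possible alternative, sketched under the assumption that dplyr 1.0.0 or later is available (cur_group_id() was introduced there): filter on the group index instead of splitting the whole data frame. Groups are ordered by their key, so for mtcars grouped by gear the first group is gear == 3.

```r
library(dplyr)

# Keep only the rows belonging to the first group, then drop the grouping
smalldf <- mtcars %>%
  group_by(gear) %>%
  filter(cur_group_id() == 1) %>%
  ungroup()
```

Whether this is faster than group_split() on large data is something I have not measured; it mainly avoids building the full list of groups.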
I have a data frame with three columns: State1, State2, State3. Is there a way to get the counts of each state in one dataframe, using all three columns (preferably with dplyr and without an explicit loop)? I only figured out how to do one column:
df %>% group_by(State1) %>% summarise(n=sum(!is.na(State1)))
You're close. You should gather all your columns into one column first, then group_by and summarize.
df %>%
gather("key", "value", State1, State2, State3) %>%
group_by(value) %>%
summarise(n=n())
Note: This also counts the number of NA entries if you have any.
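In current tidyr, gather() is superseded by pivot_longer(). The sketch below shows the same idea with pivot_longer() and count(), on a made-up data frame (the state values are hypothetical). Setting values_drop_na = TRUE is one way to leave NA entries out of the counts, if that is what you want.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data with the three columns from the question
df <- data.frame(State1 = c("NY", "CA", NA),
                 State2 = c("CA", "TX", "NY"),
                 State3 = c("NY", NA, "TX"),
                 stringsAsFactors = FALSE)

# Stack the three columns into one, drop NAs, then count each state
counts <- df %>%
  pivot_longer(c(State1, State2, State3),
               names_to = "key", values_to = "value",
               values_drop_na = TRUE) %>%
  count(value, name = "n")
```

Omit values_drop_na = TRUE to get an extra row counting the NA entries, matching the behaviour noted above.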
library(ggmosaic)
library(tidyverse)
Below is the sample code
happy2<-happy%>%
select(sex,marital,degree,health)%>%
group_by(sex,marital,degree,health)%>%
summarise(Count=n())
The following code splits the dataset into a nested list with tables of male and female (sex variable) for each category of the degree variable.
happy2 %>%
split(.$degree) %>%
lapply(function(x) split(x, x$sex))
This is where I'm now struggling. I would like to reshape, or using Tidyr, spread the "marital" variable, or perhaps this should be split again, so that each category of "marital" is a column header with each column containing the "health" variable and corresponding "Count". The redundant "sex" and "degree" columns can be dropped.
Since I'm working with a list, I've been attempting to use Tidyverse methods, for example, I've been trying to use purrr to drop variables:
happy2 %>% map(~select(.x, -sex))
I'm thinking that I can also spread using purrr, but I'm having trouble making this work.
To help illustrate what I'm looking for, I attached a pic of the possible structure. I didn't include all categories and the counts are not correct since I'm only showing the structure. I suppose the "marital" category could also be a third split variable as well if that's easier? So what I'm hoping for is male and female tables for each category of degree, with marital by health and showing the corresponding count.
Help would be appreciated...
Would the following work? I changed the syntax for split by sex so that I can chain the subsequent commands together:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>%
spread(health, Count)))
Edit:
This would give you a separate table for each marital status:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital)))
And if you don't want the first column indicating marital status, the following version drops that:
happy2 %>%
ungroup() %>%
split(.$degree) %>%
lapply(function(x) x %>% split(.$sex) %>%
lapply(function(x) x %>% select(-sex, -degree) %>% split(.$marital) %>%
lapply(function(x) x %>% select(-marital))))
What about this:
# cleaned up your code a bit
# removed the select (as it does nothing)
# consistent column names (count is lower case like the rest of the variables)
# added spacing
happy2 <- happy %>%
group_by(sex, marital, degree, health) %>%
summarise(count=n())
happy2 %>%
dplyr::ungroup() %>%
split(list(.$degree, .$sex, .$marital)) %>%
lapply(. %>% select(health, count))
Or do you really want the "marital" status as the table heading above the "health" column, as in your picture?
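If the goal is one table per degree and sex, with each marital category as a column header, a sketch along these lines might work. It uses a tiny made-up data frame (toy) as a stand-in for happy2, and pivot_wider() from tidyr 1.0.0 or later in place of spread(); both are my assumptions, not part of the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for happy2 (already summarised and ungrouped)
toy <- data.frame(
  sex     = c("male", "male", "male", "female"),
  degree  = c("bachelor", "bachelor", "bachelor", "bachelor"),
  marital = c("married", "divorced", "married", "married"),
  health  = c("good", "good", "poor", "good"),
  count   = c(10L, 4L, 3L, 7L),
  stringsAsFactors = FALSE
)

# One table per degree/sex combination; marital categories become columns,
# health stays as the row label, cells hold the counts
tables <- toy %>%
  split(list(.$degree, .$sex), drop = TRUE) %>%
  lapply(function(d) d %>%
    select(-sex, -degree) %>%
    pivot_wider(names_from = marital, values_from = count))

tables[["bachelor.male"]]
```

Combinations that do not occur in the data show up as NA in the widened table.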
Is it possible to select all unique values from a column of a data.frame using select function in dplyr library?
Something like "SELECT DISTINCT field1 FROM table1" in SQL notation.
Thanks!
In dplyr 0.3 and later this can be easily achieved using the distinct() function.
Here is an example:
distinct_df = df %>% distinct(field1)
You can get a vector of the distinct values with:
distinct_vector = distinct_df$field1
You can also select a subset of columns at the same time as you perform the distinct() call, which can be cleaner to look at if you examine the data frame using head/tail/glimpse:
distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
Just to add to the other answers, if you would prefer to return a vector rather than a dataframe, you have the following options:
dplyr >= 0.7.0
Use the pull verb:
mtcars %>% distinct(cyl) %>% pull()
dplyr < 0.7.0
Enclose the dplyr functions in parentheses and combine them with the $ syntax:
(mtcars %>% distinct(cyl))$cyl
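A short sketch of the pull() route on mtcars, naming the column explicitly so it does not rely on pull()'s default of taking the last column:

```r
library(dplyr)

# distinct() keeps the first occurrence of each value, in order of appearance
vals <- mtcars %>%
  distinct(cyl) %>%
  pull(cyl)

vals
```

Wrap the result in sort() if you need the values ordered.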
The dplyr select function selects specific columns from a data frame. To return unique values in a particular column of data, you can use the group_by function. For example:
library(dplyr)
# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))
# Return the distinct values of x
dat %>%
group_by(x) %>%
summarise()
x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
If you want to change the column name you can add the following:
dat %>%
group_by(x) %>%
summarise() %>%
select(unique.x=x)
This both selects column x from among all the columns in the data frame that dplyr returns (and of course there's only one column in this case) and changes its name to unique.x.
You can also get the unique values directly in base R with unique(dat$x).
If you have multiple variables and want all unique combinations that appear in the data, you can generalize the above code as follows:
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE),
y=sample(letters[1:5], 100, replace=TRUE))
dat %>%
group_by(x,y) %>%
summarise() %>%
select(unique.x=x, unique.y=y)
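For comparison, the same unique combinations can probably be obtained more directly with distinct(), which returns an ungrouped result. This is a sketch on the same fake data; arrange() is only there to mimic the sorted order that group_by() plus summarise() produces.

```r
library(dplyr)

set.seed(5)
dat <- data.frame(x = sample(1:10, 100, replace = TRUE),
                  y = sample(letters[1:5], 100, replace = TRUE))

# All unique (x, y) combinations, sorted and renamed
combos <- dat %>%
  distinct(x, y) %>%
  arrange(x, y) %>%
  rename(unique.x = x, unique.y = y)
```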