Select unique values with 'select' function in 'dplyr' library

Is it possible to select all unique values from a column of a data.frame using the select function in the dplyr library?
Something like "SELECT DISTINCT field1 FROM table1" in SQL notation.
Thanks!

In dplyr 0.3 this can be easily achieved using the distinct() method.
Here is an example:
distinct_df = df %>% distinct(field1)
You can get a vector of the distinct values with:
distinct_vector = distinct_df$field1
You can also select a subset of columns at the same time as you perform the distinct() call, which can be cleaner to look at if you examine the data frame using head/tail/glimpse (in dplyr 0.3, distinct() keeps all columns, so the select() trims the result to the one of interest):
distinct_df = df %>% distinct(field1) %>% select(field1)
distinct_vector = distinct_df$field1
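A version note: from dplyr 0.5 onward, distinct(field1) keeps only field1, and the .keep_all argument restores the old keep-all-columns behaviour (first row per distinct value). A minimal sketch, assuming the same df/field1 names as above:
# dplyr >= 0.5: keep every column, one row per distinct value of field1
distinct_df = df %>% distinct(field1, .keep_all = TRUE)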

Just to add to the other answers, if you would prefer to return a vector rather than a dataframe, you have the following options:
dplyr >= 0.7.0
Use the pull verb:
mtcars %>% distinct(cyl) %>% pull()
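Note that pull() with no argument extracts the last column, so naming the column explicitly is clearer when the pipeline returns more than one:
mtcars %>% distinct(cyl) %>% pull(cyl)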
dplyr < 0.7.0
Enclose the dplyr pipeline in parentheses and combine it with the $ syntax:
(mtcars %>% distinct(cyl))$cyl

The dplyr select function selects specific columns from a data frame. To return unique values in a particular column of data, you can use the group_by function. For example:
library(dplyr)
# Fake data
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE))
# Return the distinct values of x
dat %>%
group_by(x) %>%
summarise()
    x
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
If you want to change the column name you can add the following:
dat %>%
group_by(x) %>%
summarise() %>%
select(unique.x=x)
This both selects column x from among all the columns in the data frame that dplyr returns (and of course there's only one column in this case) and changes its name to unique.x.
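On current dplyr the renaming step is more often written with rename(), which only changes the name and leaves column selection separate; the same result as above:
dat %>%
group_by(x) %>%
summarise() %>%
rename(unique.x = x)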
You can also get the unique values directly in base R with unique(dat$x).
If you have multiple variables and want all unique combinations that appear in the data, you can generalize the above code as follows:
set.seed(5)
dat = data.frame(x=sample(1:10,100, replace=TRUE),
y=sample(letters[1:5], 100, replace=TRUE))
dat %>%
group_by(x,y) %>%
summarise() %>%
select(unique.x=x, unique.y=y)
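For completeness, distinct() accepts several columns at once, so on current dplyr the whole pipeline above collapses to a single call returning one row per unique (x, y) combination:
dat %>% distinct(x, y)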

Related

Why won't the group_by() function in R work properly?

I have a large dataframe and I am trying to sort by 8 categories in one column and then find the sum of their weight (kg) using the group_by() and summarise() functions from dplyr package.
However, in the 'total' variable created, the sums of some of the categories produce N/A and I'm not sure why as they should be numerical values. There isn't anything weird about the dataframe which I can see.
code:
totals <- db %>% group_by(category) %>% summarise(kilos = sum(weight))
By default, sum() returns NA if any of the values are NA. Specify the na.rm argument as TRUE and it will ignore the NA values. The below should work:
totals <- db %>%
group_by(category) %>%
summarise(kilos = sum(weight, na.rm = TRUE))
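If you want to see which categories actually contain missing weights before dropping them, a quick diagnostic (using the same db/category/weight names as in the question) is:
db %>%
group_by(category) %>%
summarise(n_missing = sum(is.na(weight)))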

R function to aggregate all elements in a list

Using the tidyverse package, I can easily aggregate a single variable. However, I wish to create a function which will allow me to aggregate multiple variables simultaneously.
I understand I have to convert the dataframe containing multiple variables to a list and then lapply an aggregating function across this list. However, I am unable to create this function.
Following is a REPREX of what I am trying to do:
# Load package
library(dplyr)
# Load dataset
dat <- data.frame(Titanic)
# Select variables
dat <- dat[, c('Class', 'Sex', 'Age','Survived')]
# Aggregate a single variable
dat %>% group_by(Class) %>% summarise(n=n())
# Desired outcome: Aggregate all variables simultaneously using a function
dat_ls <- as.list(dat) ## Create a list with all the variables
dat_agg <- lapply(dat_ls, function(???)) ## Apply aggregating function to each element in the list
With the list, we can use table
lapply(dat_ls, table)
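As an aside, a data.frame is itself a list of its columns, so the as.list() step can be skipped and table applied directly:
lapply(dat, table)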
Another option is to reshape to 'long' format and then use count
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(everything()) %>%
count(name, value)

How do I get a count for each unique value in a dataframe column, even if I don't know what the unique values are?

Basically, I'd like to identify the unique values in an R dataframe column and get a count of each one, with the ultimate goal of ranking them largest count to smallest. Any ideas how I can go about doing this?
Thank you so much in advance!
The base R function is table
table(df$column)
A reproducible example using mtcars
> data(mtcars)
> table(mtcars$cyl)
 4  6  8 
11  7 14 
> sort(table(mtcars$cyl),decreasing=TRUE)
 8  4  6 
14 11  7 
One option is add_count, which creates a column with the frequency count; that column can then be used to order the rows:
library(dplyr)
df1 %>%
add_count(col1) %>%
arrange(desc(n))
If we need only the summarised values, use count
df1 %>%
count(col1) %>%
arrange(desc(n))
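count() also takes a sort argument, which folds the arrange() step into the call:
df1 %>% count(col1, sort = TRUE)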
A reproducible example using mtcars
data(mtcars)
mtcars %>%
add_count(vs) %>%
arrange(desc(n))
Here's a different dplyr solution:
library(dplyr)
df <- as.data.frame(table(df$colname)) %>%
arrange(desc(Freq)) # sort by the Freq count column, not the Var1 values

rowDiffs type function, keeping "row 1" as the reference row per group

Say I have this simple data frame with a grouping variable, and three xs per group:
df <- data.frame(grp=rep(letters[1:3], each=3),
                 x=rnorm(9))
  grp          x
1   a  1.9561455
2   a -2.3916438
3   a  0.7267603
4   b -0.8794693
5   b -0.3089820
6   b -1.7228825
7   c -0.3964017
8   c -0.6237301
9   c -0.1522535
I want to, per group, take the initial row as a reference row, and get the difference between x and this reference x (first row) for all rows, such that the outcome is:
  grp          x      xdiff
1   a  1.9561455  0.0000000
2   a -2.3916438 -4.3477893
3   a  0.7267603 -1.2293853
4   b -0.8794693  0.0000000
5   b -0.3089820  0.5704873
6   b -1.7228825 -0.8434132
7   c -0.3964017  0.0000000
8   c -0.6237301 -0.2273284
9   c -0.1522535  0.2441482
I was able to do it this way:
rowOne<-df %>% group_by(grp) %>% filter(row_number()==1)
names(rowOne)[2]<-"x_initial"
df %>% left_join(rowOne) %>% mutate(xdiff=x-x_initial)
But I'm hoping there is a simpler way to do it, that doesn't require creating new datasets, merging and subtracting.
I have a dozen or so columns I need to do this for, and I'd like to be able to just do something like:
df %>% group_by(grp) %>% mutate(xdiff=rowDiffs(x))
But, obviously, this is not the correct function. Is there a function out there I haven't come across, or an easier way to program R to do this task?
Thanks!
The difference between a column and the first value of that column, grouped by another column, can be computed using data.table, dplyr, or base R methods.
If we are doing this for a single column, the compact data.table method is one option. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by the grouping column ('grp'), take the difference between the column ('x') and the first value in that column (x[1L]). Note that I used the integer representation, i.e. 1L; plain x[1] would also work, but in some cases the integer index can be a bit faster.
library(data.table)
setDT(df)[, xdiff:=x-x[1L] , by = grp]
Or a similar option with dplyr is piping (%>%) the steps from left to right: take the dataset ('df'), group by 'grp', and create a new column with mutate. Note that dplyr has a first function to select the first observation; it also has other arguments (?first).
library(dplyr)
df %>%
group_by(grp) %>%
mutate(xdiff= x- first(x))
Or a base R option suggested by @David Arenburg (the FUN argument belongs inside the ave() call):
df$xdiff <- with(df, ave(x, grp, FUN = function(x) x - x[1L]))
If you have many columns, we can use mutate_each (from dplyr) after the grouping step, change the column names with setNames (NOTE: if there are multiple functions, i.e. more than one, we could change the names within mutate_each itself), and bind the original columns with bind_cols.
df1 %>%
group_by(grp) %>%
mutate_each(funs(.-first(.))) %>%
setNames(., c(names(df1)[1L], paste0(names(df1)[-1L], 'diff'))) %>%
ungroup() %>%
select(-grp) %>%
bind_cols(df1, .)
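mutate_each() has since been deprecated; if you are on dplyr >= 1.0, a sketch of the same many-column difference uses across() with its .names argument (the grouping column is excluded from everything() automatically):
df1 %>%
group_by(grp) %>%
mutate(across(everything(), ~ .x - first(.x), .names = "{.col}diff")) %>%
ungroup()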
Or using data.table, we can create new columns by assigning (:=). Here, we loop the columns under consideration with lapply (.SD is the Subset of DataTable) and get the difference grouped by 'grp'.
nm1 <- setdiff(names(df1), 'grp')
setDT(df1)[, paste0(nm1, 'diff') := lapply(.SD, function(x) x - x[1L]), grp]
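If df1 ever gains extra columns that should not be differenced, spelling out .SDcols keeps .SD aligned with nm1; a defensive variant of the same call:
setDT(df1)[, paste0(nm1, 'diff') := lapply(.SD, function(x) x - x[1L]), by = grp, .SDcols = nm1]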
data
set.seed(24)
df1 <- cbind(df, y= rnorm(9))

dplyr: getting group_by-column even when not selecting it

When selecting columns, I get one column I haven't selected, but it's a group_by column:
library(magrittr)
library(dplyr)
df <- data.frame(i=c(1,1,1,1,2,2,2,2), j=c(1,2,1,2,1,2,1,2), x=runif(8))
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
filter(i==1) %>%
select(s)
I get column i even though I haven't selected it:
  i         s
1 1 0.8355195
2 1 0.9322474
Why does this happen (and why not column j?), and how can I avoid it? Okay, I could filter at the beginning...
That's because the grouping variable is carried on by default. Please see the dplyr vignette:
Grouping affects the verbs as follows: grouped select() is the same as ungrouped select(), except that grouping variables are always retained.
Note that (each) summarize peels off one layer of grouping (in your case, j), so after the summarize, your data is only grouped by i and that is printed in the output. If you don't want that, you can ungroup the data before selecting s:
require(dplyr)
df %>%
group_by(i,j) %>%
summarize(s=sum(x)) %>%
ungroup() %>%
filter(i==1) %>%
select(s)
#Source: local data frame [2 x 1]
#
# s
#1 1.129867
#2 1.265131
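In newer dplyr (>= 1.0), the same effect is available inside summarize() itself via the .groups argument, which drops all grouping as part of the summary:
df %>%
group_by(i,j) %>%
summarize(s=sum(x), .groups = "drop") %>%
filter(i==1) %>%
select(s)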
