Group by all columns in a data.table - r

I'm working with the iris data.table in R.
As a reminder of what it looks like, here are the first six rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
6: 5.4 3.9 1.7 0.4 setosa
I would like to calculate the number of rows, grouped by all columns. Of course, we could list every variable in by, like this:
iris[, .(Freq = .N), by = .(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
1: 5.1 3.5 1.4 0.2 setosa 1
2: 4.9 3.0 1.4 0.2 setosa 1
3: 4.7 3.2 1.3 0.2 setosa 1
4: 4.6 3.1 1.5 0.2 setosa 1
5: 5.0 3.6 1.4 0.2 setosa 1
6: 5.4 3.9 1.7 0.4 setosa 1
However, I wonder if there is a way to group by all variables without typing out all the column names?

In case you are looking for duplicates, uniqueN will default to using all columns:
uniqueN(as.data.table(iris))
# [1] 149
This doesn't answer your question directly, but it might be a more direct way of accomplishing what you were trying to do in the first place.
Likewise, if you want to know which rows are duplicated, you can use the data.table method of duplicated, which also defaults to using all columns:
iris[duplicated(iris)]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.8 2.7 5.1 1.9 virginica

We can use
library(data.table)
out1 <- as.data.table(iris)[, .N, by = names(iris)]
# checking against the OP's approach
out2 <- as.data.table(iris)[, .N, by = .(Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width, Species)]
identical(out1, out2)
#[1] TRUE
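As a minimal sketch of the same idea on a smaller table: by accepts a character vector of column names, so names(dt) groups by every column. The toy table and its column names x and y are made up for illustration, and filtering on Freq > 1 is just one possible follow-up step.

```r
library(data.table)

# Hypothetical toy table; the column names x and y are illustrative only.
dt <- data.table(x = c(1, 1, 2, 2, 2),
                 y = c("a", "a", "b", "b", "c"))

# Group by every column without typing the names out.
counts <- dt[, .(Freq = .N), by = names(dt)]

# Keep only the combinations that occur more than once.
dups <- counts[Freq > 1]
```

Because names(dt) is evaluated when the query runs, the same line keeps working if columns are added or renamed later.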

Here is an approach in Base-R
Freq <- table(apply(iris,1,paste0, collapse=" "))
iris$Freq <- apply(iris,1, function(x) Freq[names(Freq) %in% paste0(x,collapse=" ")])
output:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
... ... ... ... ... ... ...
140 6.9 3.1 5.4 2.1 virginica 1
141 6.7 3.1 5.6 2.4 virginica 1
142 6.9 3.1 5.1 2.3 virginica 1
143 5.8 2.7 5.1 1.9 virginica 2
144 6.8 3.2 5.9 2.3 virginica 1
145 6.7 3.3 5.7 2.5 virginica 1
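A possible variation on the base R idea, sketched on a made-up data frame: build one key string per row with paste and let ave count how often each key occurs. The toy data and the "\r" separator are assumptions; the separator only needs to be a character that never appears in the data.

```r
# Hypothetical toy data frame; column names are illustrative only.
df <- data.frame(x = c(1, 1, 2, 2, 2),
                 y = c("a", "a", "b", "b", "c"))

# One key per row, built from all columns.
key <- do.call(paste, c(df, sep = "\r"))

# For each row, the number of rows sharing its key.
df$Freq <- ave(seq_along(key), key, FUN = length)
```

This avoids the row-by-row lookup, since ave computes each group size once and writes it back to every row of that group.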

Related

In a data.table remove identical consecutive values over certain times by group

In a data.table, if a certain column has identical values occurring consecutively over a certain number of times, I'd like to remove the corresponding rows. I also would like to do this by group.
For example, say dt is my data.table. I would like to remove rows if the same value occurs consecutively over 2 times in Petal.Width grouped by Species.
dt <- iris[c(1:3, 7:7, 51:53, 62:63), ]
setDT(dt)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 62 5.9 3.0 4.2 1.5 versicolor
# 63 6.0 2.2 4.0 1.0 versicolor
The desired outcome is a data.table with the following rows.
# 7 4.6 3.4 1.4 0.3 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 63 6.0 2.2 4.0 1.0 versicolor
Here is an option:
library(data.table)
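setDT(dt)[dt[, {
  rl <- rleid(Species, Petal.Width)  # run id for each consecutive (Species, Petal.Width) block
  rw <- rowid(rl)                    # position of each row within its run
  .I[!rl %in% rl[rw > 2]]            # drop runs in which the value repeats more than 2 times
}]]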
output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 4.6 3.4 1.4 0.3 setosa
2: 7.0 3.2 4.7 1.4 versicolor
3: 6.0 2.2 4.0 1.0 versicolor
Here's an option:
library(data.table)
dt <- iris[c(1:3, 7:7, 51:53, 62:63), ]
setDT(dt)
dt[dt[, .I[.N < 3], by = .(rleid(Petal.Width), Species)]$V1]
Thanks to @chinsoon12 for suggesting wrapping rleid() around Petal.Width to filter out consecutive values.
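To see what rleid() contributes, here is a small sketch on a made-up table. The columns g and v and the helper columns run and run_len are hypothetical names; the threshold of 2 mirrors the example in the question.

```r
library(data.table)

# Hypothetical toy data: g is the group, v the value checked for repeats.
dt <- data.table(g = c("a", "a", "a", "a", "b", "b", "b"),
                 v = c(1, 1, 1, 2, 3, 3, 4))

# rleid() gives each consecutive block of identical (g, v) pairs its own id.
dt[, run := rleid(g, v)]

# Length of the run that each row belongs to.
dt[, run_len := .N, by = run]

# Keep rows whose run is at most 2 long, i.e. drop runs longer than 2.
kept <- dt[run_len <= 2]
```

Passing the grouping column to rleid() alongside the value means a run never continues across a group boundary, even when the value happens to be the same on both sides.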

Select the first and last rows of a data frame?

Is there a function in base R that shows the first and last rows of a data frame? I know that ropls::strF, or printing a data.table object, can do this. This is not the same as the topic "Select first and last row from grouped data".
ropls::strF(iris)
#Sepal.Length Sepal.Width ... Petal.Width Species
#numeric numeric ... numeric factor
#nRow nCol size NAs
#150 5 0 Mb 0
#Sepal.Length Sepal.Width ... Petal.Width Species
#1 5.1 3.5 ... 0.2 setosa
#2 4.9 3 ... 0.2 setosa
#... ... ... ... ... ...
#149 6.2 3.4 ... 2.3 virginica
#150 5.9 3 ... 1.8 virginica
library(data.table)
a <- as.data.table(iris)
a
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1: 5.1 3.5 1.4 0.2 setosa
#2: 4.9 3.0 1.4 0.2 setosa
#3: 4.7 3.2 1.3 0.2 setosa
#4: 4.6 3.1 1.5 0.2 setosa
#5: 5.0 3.6 1.4 0.2 setosa
#---
#146: 6.7 3.0 5.2 2.3 virginica
#147: 6.3 2.5 5.0 1.9 virginica
#148: 6.5 3.0 5.2 2.0 virginica
#149: 6.2 3.4 5.4 2.3 virginica
#150: 5.9 3.0 5.1 1.8 virginica
As others said in the comments, there isn't a function in base R to do this, but it's straightforward enough to write a function that binds together the first N rows and last N rows.
head_and_tail <- function(x, n = 1) {
  rbind(
    head(x, n),
    tail(x, n)
  )
}
head_and_tail(iris, n = 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
Created on 2018-12-22 by the reprex package (v0.2.1)
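One edge case worth noting: when the data frame has 2 * n rows or fewer, head() and tail() overlap and rbind() repeats rows. A hedged variant that guards against this could look like the following; head_and_tail_safe is a hypothetical name, not part of the answer above.

```r
# Sketch of a guarded variant; the function name is made up for illustration.
head_and_tail_safe <- function(x, n = 1) {
  # With 2 * n rows or fewer, head and tail would overlap: return x unchanged.
  if (nrow(x) <= 2 * n) {
    return(x)
  }
  rbind(head(x, n), tail(x, n))
}

big   <- head_and_tail_safe(iris, n = 3)         # first 3 and last 3 rows
small <- head_and_tail_safe(iris[1:4, ], n = 3)  # all 4 rows, no repeats
```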

Multiple list nesting with split(), R

Given a dataset with multiple unique elements in a column, I'd like to split those unique elements into new dataframes, but have the dataframe nested one level down. Essentially adding an extra level to the split() command.
For instance (using the built-in iris table as an example):
iris
mylist <- split(iris, iris$Species)
produces a list, mylist, that contains 3 sublists, setosa, versicolor, virginica.
mylist[["setosa"]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
But I would actually like to nest that data frame in a sublist called results while keeping the upper-level list name as setosa, such that:
mylist$setosa["results"]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
I could do this with manual manipulation, but I'd like this to run automatically. I've tried unsuccessfully with mapply
mapply(function(names, df)
names <- split(df, df[["Species"]]),
unique(iris$Species), iris)
Any advice? Also happy to use a tidyr package if that makes things easier...
Consider by (an object-oriented wrapper for tapply), which is very similar to split but lets you run a function on each subset. Many useRs run split + lapply, unaware that both can be replaced with by:
mylist <- by(iris, iris$Species, function(sub) list(results=sub), simplify = FALSE)
head(mylist$setosa$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
head(mylist$versicolor$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# 56 5.7 2.8 4.5 1.3 versicolor
head(mylist$virginica$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 101 6.3 3.3 6.0 2.5 virginica
# 102 5.8 2.7 5.1 1.9 virginica
# 103 7.1 3.0 5.9 2.1 virginica
# 104 6.3 2.9 5.6 1.8 virginica
# 105 6.5 3.0 5.8 2.2 virginica
# 106 7.6 3.0 6.6 2.1 virginica
setNames in lapply will keep the names of the list you're iterating through
iris
mylist <- split(iris, iris$Species)
mylist2 <- lapply(setNames(names(mylist), names(mylist)), function(x){
list(results = mylist[[x]])
})
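Since lapply preserves the names of a list it iterates over, a shorter sketch is to apply it directly to the result of split. The variable name nested is made up for illustration.

```r
# split() returns a named list; lapply() keeps those names on its result.
nested <- lapply(split(iris, iris$Species), function(d) list(results = d))

# Each species now holds a one-element list named "results".
head(nested$setosa$results)
```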

How to retrieve column for row-wise maximum value in an R data.table?

I have the following R data.table:
library(data.table)
iris = as.data.table(iris)
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
...
Let's say I want to find the maximum value in each row, using only a subset of the data.table columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
I would use the following code:
iris[, maximum_element :=max(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), by=1:nrow(iris)]
Which outputs
Sepal.Length Sepal.Width Petal.Length Petal.Width Species maximum_element
1: 5.1 3.5 1.4 0.2 setosa 5.1
2: 4.9 3.0 1.4 0.2 setosa 4.9
3: 4.7 3.2 1.3 0.2 setosa 4.7
4: 4.6 3.1 1.5 0.2 setosa 4.6
5: 5.0 3.6 1.4 0.2 setosa 5.0
For my problem, I'm actually not interested in the value, but which column the value came from, i.e. I would like the following output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species maximum_column
1: 5.1 3.5 1.4 0.2 setosa Sepal.Length
2: 4.9 3.0 1.4 0.2 setosa Sepal.Length
3: 4.7 3.2 1.3 0.2 setosa Sepal.Length
4: 4.6 3.1 1.5 0.2 setosa Sepal.Length
5: 5.0 3.6 1.4 0.2 setosa Sepal.Length
(In this case, the maximum value in each row comes from Sepal.Length.)
How do I "retrieve" the column name with the maximum value?
Here is an option with pmax
iris[, maximum_element := do.call(pmax, .SD), .SDcols = 1:4]
and to find the column names, use max.col on .SD after specifying the .SDcols as the numeric columns, i.e. columns 1 to 4
iris[,maximum_column := names(.SD)[max.col(.SD)], .SDcols = 1:4]
head(iris, 4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species maximum_column
#1: 5.1 3.5 1.4 0.2 setosa Sepal.Length
#2: 4.9 3.0 1.4 0.2 setosa Sepal.Length
#3: 4.7 3.2 1.3 0.2 setosa Sepal.Length
#4: 4.6 3.1 1.5 0.2 setosa Sepal.Length
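One detail to keep in mind: max.col breaks ties at random by default (ties.method = "random"), so rows with equal maxima can give different answers between runs. A minimal sketch with a deterministic tie rule, on a made-up table whose column names a, b, c are illustrative only:

```r
library(data.table)

# Hypothetical toy table with ties in rows 2 and 3.
dt <- data.table(a = c(1, 5, 3),
                 b = c(4, 5, 2),
                 c = c(2, 1, 3))
cols <- c("a", "b", "c")

# ties.method = "first" always picks the leftmost column among tied maxima.
dt[, max_col := cols[max.col(as.matrix(.SD), ties.method = "first")],
   .SDcols = cols]
```

With ties.method = "first", row 2 (a and b both 5) and row 3 (a and c both 3) resolve to a every time.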

Order data frame columns by factor - factor order needs to be reordered

I need to make a barplot based on two columns of a data frame. For the right order I need to reorder the factor of one column, and to reorder the rest of the data frame with it. I tried to reorder the factor, but the rest of the columns remained the same. How can I sort the whole data frame? I will show what I did with the iris data (Note that my data actually has two nominal columns)
> d<-iris
> head(d)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> d$Species<-factor(d$Species, labels=c("virginica","setosa","versicolor"))
> head(d)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 virginica
2 4.9 3.0 1.4 0.2 virginica
3 4.7 3.2 1.3 0.2 virginica
4 4.6 3.1 1.5 0.2 virginica
5 5.0 3.6 1.4 0.2 virginica
6 5.4 3.9 1.7 0.4 virginica
>
As you can see, the Species column is ordered as I wanted, but the rest of the columns stayed the same. What can I do?
EDIT
I tried out the answers and realized that I did not specify my question well enough:
I need to rearrange the order of the levels of the Species factor when I re-sort the data frame.
I am looking for this result in addition to the reordered data frame, so that this order will be used when plotting:
head(d1$Species)
[1] virginica virginica virginica virginica virginica virginica
Levels: virginica setosa versicolor
Until now however, the order is still not what I want:
d1 <- d[order(factor(d$Species, levels=c("virginica","setosa","versicolor"))),]
head(d1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#101 6.3 3.3 6.0 2.5 virginica
#102 5.8 2.7 5.1 1.9 virginica
#103 7.1 3.0 5.9 2.1 virginica
#104 6.3 2.9 5.6 1.8 virginica
#105 6.5 3.0 5.8 2.2 virginica
#106 7.6 3.0 6.6 2.1 virginica
Update
To change the levels of the factor Species, you would have to:
d$Species <- factor(d$Species, levels=c("virginica","setosa","versicolor"))
d1 <- d[order(d$Species),]
levels(d1$Species)
#[1] "virginica" "setosa" "versicolor"
head(d1,2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#101 6.3 3.3 6.0 2.5 virginica
#102 5.8 2.7 5.1 1.9 virginica
require(plyr)
d <- iris
arrange(d, factor(d$Species, levels = c("virginica","setosa","versicolor")))
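The root of the problem in the question is the difference between the labels and levels arguments of factor(): labels renames the existing levels in place, so the data values change, while levels only changes the level order. A small sketch on a made-up three-element factor; the variable names are illustrative.

```r
f <- factor(c("setosa", "versicolor", "virginica"))

# labels = renames: the first level (setosa) is now called "virginica", etc.
relabelled <- factor(f, labels = c("virginica", "setosa", "versicolor"))

# levels = reorders: the values stay the same, only the level order changes.
releveled <- factor(f, levels = c("virginica", "setosa", "versicolor"))

as.character(relabelled)  # values were overwritten
as.character(releveled)   # values are untouched
levels(releveled)         # new order, used for sorting and plotting
```

This is why the data in the question appeared unsorted: the rows never moved, the species names were simply swapped.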
