Select the first and last rows of a data frame? - r

Is there a function in base R that shows the first and last rows of a data frame? I know that ropls::strF and the print method for data.table objects can do this. This is not the same as the topic "Select first and last row from grouped data".
ropls::strF(iris)
#Sepal.Length Sepal.Width ... Petal.Width Species
#numeric numeric ... numeric factor
#nRow nCol size NAs
#150 5 0 Mb 0
#Sepal.Length Sepal.Width ... Petal.Width Species
#1 5.1 3.5 ... 0.2 setosa
#2 4.9 3 ... 0.2 setosa
#... ... ... ... ... ...
#149 6.2 3.4 ... 2.3 virginica
#150 5.9 3 ... 1.8 virginica
library(data.table)
a <- as.data.table(iris)
a
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1: 5.1 3.5 1.4 0.2 setosa
#2: 4.9 3.0 1.4 0.2 setosa
#3: 4.7 3.2 1.3 0.2 setosa
#4: 4.6 3.1 1.5 0.2 setosa
#5: 5.0 3.6 1.4 0.2 setosa
#---
#146: 6.7 3.0 5.2 2.3 virginica
#147: 6.3 2.5 5.0 1.9 virginica
#148: 6.5 3.0 5.2 2.0 virginica
#149: 6.2 3.4 5.4 2.3 virginica
#150: 5.9 3.0 5.1 1.8 virginica

As others said in the comments, there isn't a function in base R to do this, but it's straightforward enough to write a function that binds together the first N rows and last N rows.
head_and_tail <- function(x, n = 1) {
  rbind(
    head(x, n),
    tail(x, n)
  )
}
head_and_tail(iris, n = 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
Created on 2018-12-22 by the reprex package (v0.2.1)
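One caveat with the function above: if the data frame has fewer than 2 * n rows, head() and tail() overlap and rbind() repeats the shared rows. A minimal sketch of a variant that avoids this by selecting row indices instead (the name head_and_tail_unique is hypothetical, not part of base R or of the original answer):

```r
# Hypothetical helper: like head_and_tail(), but no row is returned twice
# when the first n and last n rows overlap.
head_and_tail_unique <- function(x, n = 1) {
  rows <- seq_len(nrow(x))
  # head()/tail() on the index vector never run past its ends
  idx <- unique(c(head(rows, n), tail(rows, n)))
  x[idx, , drop = FALSE]
}

rownames(head_and_tail_unique(iris, n = 3))
#> [1] "1"   "2"   "3"   "148" "149" "150"
nrow(head_and_tail_unique(iris[1:4, ], n = 3))
#> [1] 4
```

Working on indices rather than binding two data frames also keeps the original row names, which makes it easy to see where the gap is.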

Related

Conditional filtering with data.table with multiple statements

I would like to know if there is an elegant and concise way to do conditional filtering with data.table.
My aim is the following: if condition 1 is met, filter based on condition 2.
For instance, with the iris dataset, how can I drop the observations where Species == "setosa" and Sepal.Length < 5.5, while keeping all observations with Sepal.Length < 5.5 for the other species?
I know how to do this in steps, but I wonder if there is a better way to do it in a one-liner.
# This is how I would do it in steps.
library(data.table)
library(dplyr)  # for full_join()
data("iris")
# First, select only the setosa observations I am interested in keeping.
iris1 <- setDT(iris)[Sepal.Length >= 5.5 & Species == "setosa"]
# Second, select all non-setosa observations.
iris2 <- setDT(iris)[Species != "setosa"]
# Join the two subsets.
iris_final <- full_join(iris1, iris2)
head(iris_final)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.8 4.0 1.2 0.2 setosa
2: 5.7 4.4 1.5 0.4 setosa
3: 5.7 3.8 1.7 0.3 setosa
4: 5.5 4.2 1.4 0.2 setosa
5: 5.5 3.5 1.3 0.2 setosa # only keeping setosa with Sepal.Length>=5.5. Note that for other species, Sepal.Length can be <5.5
6: 7.0 3.2 4.7 1.4 versicolor
Is there a more concise and elegant way of doing this?

Is something like the following what you are looking for? It is not very clear what you want.
library(data.table)
dt <- data.table(iris)
dt[Sepal.Length >= 5.5 & Species == "setosa" | Species != "setosa"]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.8 4.0 1.2 0.2 setosa
#> 2: 5.7 4.4 1.5 0.4 setosa
#> 3: 5.7 3.8 1.7 0.3 setosa
#> 4: 5.5 4.2 1.4 0.2 setosa
#> 5: 5.5 3.5 1.3 0.2 setosa
#> ---
#> 101: 6.7 3.0 5.2 2.3 virginica
#> 102: 6.3 2.5 5.0 1.9 virginica
#> 103: 6.5 3.0 5.2 2.0 virginica
#> 104: 6.2 3.4 5.4 2.3 virginica
#> 105: 5.9 3.0 5.1 1.8 virginica

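You can use the | (or) operator.
This removes any rows where Species == "setosa" & Sepal.Length < 5.5 and keeps rows where Sepal.Length > 5.5. Here iris is assumed to be the full data.table created by setDT(iris) in the question, not the iris1 subset:
iris[!(Species == "setosa" & Sepal.Length < 5.5) | Sepal.Length > 5.5]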
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.8 4.0 1.2 0.2 setosa
2: 5.7 4.4 1.5 0.4 setosa
3: 5.7 3.8 1.7 0.3 setosa
4: 5.5 4.2 1.4 0.2 setosa
5: 5.5 3.5 1.3 0.2 setosa
---
101: 6.7 3.0 5.2 2.3 virginica
102: 6.3 2.5 5.0 1.9 virginica
103: 6.5 3.0 5.2 2.0 virginica
104: 6.2 3.4 5.4 2.3 virginica
105: 5.9 3.0 5.1 1.8 virginica
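
The two answers above select the same rows. A small sketch for checking this, assuming data.table is installed. It also suggests that the trailing | Sepal.Length > 5.5 in the second answer is redundant, since any row with Sepal.Length > 5.5 already passes the negated condition (dt, id and the res_* names are illustrative, not from the original answers):

```r
library(data.table)

dt <- as.data.table(iris)
dt[, id := .I]  # row id, added only so the results can be compared

# First answer: keep qualifying setosa rows, plus every non-setosa row
res_a <- dt[Sepal.Length >= 5.5 & Species == "setosa" | Species != "setosa"]
# Second answer: drop short setosa rows, or keep long rows
res_b <- dt[!(Species == "setosa" & Sepal.Length < 5.5) | Sepal.Length > 5.5]
# Second answer without the extra clause
res_c <- dt[!(Species == "setosa" & Sepal.Length < 5.5)]

identical(res_a$id, res_b$id)
#> [1] TRUE
identical(res_b$id, res_c$id)
#> [1] TRUE
nrow(res_c)
#> [1] 105
```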

Group by all columns in a data.table

I'm working with the iris dataset as a data.table in R.
As a reminder, here are its first six rows:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
6: 5.4 3.9 1.7 0.4 setosa
I would like to calculate the number of rows, grouped by all columns. Of course we may write all variables in by, like this:
iris[, .(Freq = .N), by = .(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
1: 5.1 3.5 1.4 0.2 setosa 1
2: 4.9 3.0 1.4 0.2 setosa 1
3: 4.7 3.2 1.3 0.2 setosa 1
4: 4.6 3.1 1.5 0.2 setosa 1
5: 5.0 3.6 1.4 0.2 setosa 1
6: 5.4 3.9 1.7 0.4 setosa 1
However, I wonder if there is a way to group by all variables without needing to type all the column names?

In case you are looking for duplicates, uniqueN will default to using all columns:
uniqueN(as.data.table(iris))
# [1] 149
This doesn't answer your question directly, but it might be a more direct way of accomplishing what you were trying to do in the first place.
Similarly, if you're looking for which rows are duplicated, you can use duplicated's data.table method which similarly defaults to using all columns:
iris[duplicated(iris)]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.8 2.7 5.1 1.9 virginica

We can pass all the column names to by:
library(data.table)
out1 <- as.data.table(iris)[, .N, by = names(iris)]
# Check against the OP's approach.
out2 <- as.data.table(iris)[, .N, by = .(Sepal.Length,
    Sepal.Width, Petal.Length, Petal.Width, Species)]
identical(out1, out2)
#[1] TRUE

Here is an approach in base R:
Freq <- table(apply(iris,1,paste0, collapse=" "))
iris$Freq <- apply(iris,1, function(x) Freq[names(Freq) %in% paste0(x,collapse=" ")])
output:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Freq
... ... ... ... ... ... ...
140 6.9 3.1 5.4 2.1 virginica 1
141 6.7 3.1 5.6 2.4 virginica 1
142 6.9 3.1 5.1 2.3 virginica 1
143 5.8 2.7 5.1 1.9 virginica 2
144 6.8 3.2 5.9 2.3 virginica 1
145 6.7 3.3 5.7 2.5 virginica 1
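
The base R approach above works, but apply() converts every row to character and the lookup runs once per row. A shorter sketch of the same idea using ave(), working on a copy named df so the built-in iris is left untouched (df, key and the "|" separator are illustrative choices, not from the original answer):

```r
df <- iris

# One string key per row, built from all columns
key <- do.call(paste, c(df, sep = "|"))

# For each row, the number of rows that share its key
df$Freq <- ave(seq_along(key), key, FUN = length)

df[df$Freq > 1, ]
```

The separator only needs to be a string that does not occur in the data, so that two different rows cannot collapse to the same key.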

String manipulation in mutate with stringr

Let's say that I want to locate a pattern in a string and, if the pattern exists, keep only the part of the string up to the pattern. My problem is that if the pattern does not exist, str_locate returns NA and the final result is NA. I want to get the original string back when the pattern does not exist.
library(stringr)
library(dplyr)
unique(iris$Species)
#> [1] setosa versicolor virginica
#> Levels: setosa versicolor virginica
test <- iris %>%
  mutate(Species = str_sub(Species, 1, str_locate(Species, "t")[, 1]))
head(test)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 set
#> 2 4.9 3.0 1.4 0.2 set
#> 3 4.7 3.2 1.3 0.2 set
#> 4 4.6 3.1 1.5 0.2 set
#> 5 5.0 3.6 1.4 0.2 set
#> 6 5.4 3.9 1.7 0.4 set
tail(test)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 145 6.7 3.3 5.7 2.5 <NA>
#> 146 6.7 3.0 5.2 2.3 <NA>
#> 147 6.3 2.5 5.0 1.9 <NA>
#> 148 6.5 3.0 5.2 2.0 <NA>
#> 149 6.2 3.4 5.4 2.3 <NA>
#> 150 5.9 3.0 5.1 1.8 <NA>
Created on 2019-07-14 by the reprex package (v0.3.0)

We can use a regex lookbehind with str_remove. If the pattern is not found, the original string is returned. Here, we match all characters (.*) that follow the first 't' and, if any are found, remove them.
library(dplyr)
library(stringr)
test <- iris %>%
  mutate(Species = str_remove(Species, "(?<=t).*"))
head(test)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 set
#2 4.9 3.0 1.4 0.2 set
#3 4.7 3.2 1.3 0.2 set
#4 4.6 3.1 1.5 0.2 set
#5 5.0 3.6 1.4 0.2 set
#6 5.4 3.9 1.7 0.4 set
tail(test)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#145 6.7 3.3 5.7 2.5 virginica
#146 6.7 3.0 5.2 2.3 virginica
#147 6.3 2.5 5.0 1.9 virginica
#148 6.5 3.0 5.2 2.0 virginica
#149 6.2 3.4 5.4 2.3 virginica
#150 5.9 3.0 5.1 1.8 virginica
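
If you would rather keep the original locate-then-cut logic, the missing-match case can be handled explicitly. A base R sketch, assuming a fixed (non-regex) pattern; keep_through is a hypothetical helper name, not a stringr function:

```r
# Hypothetical helper: keep each string up to and including the first
# occurrence of `pattern`; strings without a match are returned unchanged.
keep_through <- function(x, pattern) {
  x <- as.character(x)                    # factors (like Species) become strings
  m <- regexpr(pattern, x, fixed = TRUE)  # -1 where there is no match
  start <- as.integer(m)
  end <- start + attr(m, "match.length") - 1L
  ifelse(start == -1L, x, substr(x, 1L, end))
}

keep_through(c("setosa", "versicolor", "virginica"), "t")
#> [1] "set"        "versicolor" "virginica"
```

Because it takes and returns a plain vector, it can be dropped into the pipeline above as mutate(Species = keep_through(Species, "t")).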

Multiple list nesting with split(), R

Given a dataset with multiple unique elements in a column, I'd like to split it into one data frame per unique element, but have each data frame nested one level down. Essentially, I want to add an extra level to the split() command.
For instance (using the built-in iris table as an example):
iris
mylist <- split(iris, iris$Species)
This produces a list, mylist, that contains 3 data frames: setosa, versicolor and virginica.
mylist[["setosa"]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
But I would actually like to nest that data frame in a sublist element called results, while keeping the upper-level list name as setosa, such that:
mylist$setosa["results"]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
I could do this with manual manipulation, but I'd like this to run automatically. I've tried unsuccessfully with mapply
mapply(function(names, df)
names <- split(df, df[["Species"]]),
unique(iris$Species), iris)
Any advice? Also happy to use a tidyr package if that makes things easier...

Consider by (an object-oriented wrapper for tapply), which is very similar to split but lets you run a function on each subset. Many useRs run split + lapply, unaware that both can be replaced with by:
mylist <- by(iris, iris$Species, function(sub) list(results=sub), simplify = FALSE)
head(mylist$setosa$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
head(mylist$versicolor$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 51 7.0 3.2 4.7 1.4 versicolor
# 52 6.4 3.2 4.5 1.5 versicolor
# 53 6.9 3.1 4.9 1.5 versicolor
# 54 5.5 2.3 4.0 1.3 versicolor
# 55 6.5 2.8 4.6 1.5 versicolor
# 56 5.7 2.8 4.5 1.3 versicolor
head(mylist$virginica$results)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 101 6.3 3.3 6.0 2.5 virginica
# 102 5.8 2.7 5.1 1.9 virginica
# 103 7.1 3.0 5.9 2.1 virginica
# 104 6.3 2.9 5.6 1.8 virginica
# 105 6.5 3.0 5.8 2.2 virginica
# 106 7.6 3.0 6.6 2.1 virginica

Using setNames inside lapply keeps the names of the list you're iterating through:
mylist <- split(iris, iris$Species)
mylist2 <- lapply(setNames(names(mylist), names(mylist)), function(x) {
  list(results = mylist[[x]])
})
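
Since lapply() already preserves the names of a named list, the setNames step can be skipped by iterating over the split result directly. A minimal sketch (nested is an illustrative name):

```r
# split() returns a list named by species; lapply() keeps those names
nested <- lapply(split(iris, iris$Species), function(df) list(results = df))

names(nested)
#> [1] "setosa"     "versicolor" "virginica"
head(nested$setosa$results, 3)
```

Unlike by(), this returns a plain list, which may be easier to pass on to other functions.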

Creating a random sample from a dataframe with a nested structure

This question builds from the SO post found here
I am trying to extract a random sample of rows in a data frame using a nesting condition.
Using the following dummy dataset (modified from iris):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
5 5.2 3.7 1.3 0.2 virginica
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
8 4.7 3.2 1.3 0.2 virginica
9 4.0 3.1 1.5 0.2 versicolor
10 5.0 3.6 1.4 0.2 versicolor
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
The code below works fine to take a simple sample of 2 rows:
iris[sample(nrow(iris), 2), ]
However, what I would like to do is take a sample of 2 rows for each level of a specific variable. For example, create a random sample of 2 rows for each level of the variable 'Species', like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
4 5.3 2.9 1.5 0.2 setosa
6 4.7 3.2 1.5 0.2 virginica
7 3.9 3.1 1.4 0.2 virginica
11 4.6 3.1 1.5 0.2 versicolor
12 5.0 3.6 1.5 0.2 versicolor
Thanks for your help!

Very easy with dplyr:
library(dplyr)
iris %>%
  group_by(Species) %>%
  sample_n(size = 2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.6 3.4 1.4 0.3 setosa
# 2 5.2 3.5 1.5 0.2 setosa
# 3 6.5 2.8 4.6 1.5 versicolor
# 4 5.7 2.8 4.5 1.3 versicolor
# 5 5.8 2.8 5.1 2.4 virginica
# 6 7.7 2.6 6.9 2.3 virginica
You can group by as many columns as you'd like
CO2 %>% group_by(Type, Treatment) %>% sample_n(size = 2)
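
For reference, the same stratified sample can be drawn without dplyr. A base R sketch, assuming every group has at least 2 rows (idx and sampled are illustrative names; set.seed is only there to make the draw repeatable):

```r
set.seed(42)

# Split the row numbers by species, then draw 2 row numbers from each group
idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), sample, size = 2))
sampled <- iris[idx, ]

table(sampled$Species)
```

In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(n = 2), which respects group_by() in the same way.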
