String manipulation in mutate with stringr - r

Let's say I want to locate a pattern in a string and, if the pattern exists, keep only the part of the string up to and including the first match. My problem is that if the pattern does not exist, str_locate returns NA and the final result is NA. I want to get the original string back when the pattern does not exist.
library(stringr)
library(dplyr)
unique(iris$Species)
#> [1] setosa versicolor virginica
#> Levels: setosa versicolor virginica
test <- iris %>%
mutate(Species = str_sub(Species, 1, str_locate(Species, "t")[,1] ))
head(test)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 set
#> 2 4.9 3.0 1.4 0.2 set
#> 3 4.7 3.2 1.3 0.2 set
#> 4 4.6 3.1 1.5 0.2 set
#> 5 5.0 3.6 1.4 0.2 set
#> 6 5.4 3.9 1.7 0.4 set
tail(test)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 145 6.7 3.3 5.7 2.5 <NA>
#> 146 6.7 3.0 5.2 2.3 <NA>
#> 147 6.3 2.5 5.0 1.9 <NA>
#> 148 6.5 3.0 5.2 2.0 <NA>
#> 149 6.2 3.4 5.4 2.3 <NA>
#> 150 5.9 3.0 5.1 1.8 <NA>
Created on 2019-07-14 by the reprex package (v0.3.0)

We can use a regex lookbehind with str_remove. If the pattern is not found, str_remove returns the original string. Here, (?<=t).* matches every character after the first 't', and those characters are removed when a match is found.
library(dplyr)
library(stringr)
test <- iris %>%
mutate(Species = str_remove(Species, "(?<=t).*"))
head(test)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 set
#2 4.9 3.0 1.4 0.2 set
#3 4.7 3.2 1.3 0.2 set
#4 4.6 3.1 1.5 0.2 set
#5 5.0 3.6 1.4 0.2 set
#6 5.4 3.9 1.7 0.4 set
tail(test)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#145 6.7 3.3 5.7 2.5 virginica
#146 6.7 3.0 5.2 2.3 virginica
#147 6.3 2.5 5.0 1.9 virginica
#148 6.5 3.0 5.2 2.0 virginica
#149 6.2 3.4 5.4 2.3 virginica
#150 5.9 3.0 5.1 1.8 virginica
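As a hedged alternative that stays closer to the original str_sub approach, the NA positions returned by str_locate can be replaced with the full string length before subsetting. This is only a sketch on a plain character vector; the names x, end and no_match are illustrative and not part of the answer above.

```r
library(stringr)

# Illustrative input: the three species names as a character vector
x <- c("setosa", "versicolor", "virginica")

# Position of the first "t" in each string; NA when there is no match
end <- str_locate(x, "t")[, "start"]

# Fall back to the full string length so unmatched strings stay intact
no_match <- is.na(end)
end[no_match] <- str_length(x)[no_match]

trimmed <- str_sub(x, 1, end)
trimmed
```

Inside mutate, the same idea would need Species converted with as.character first, since the fallback relies on string lengths.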

Conditional filtering with data.table with multiple statements

I would like to know if there is an elegant and concise way to do conditional filtering with data.table.
My aim is the following:
if condition 1 is met, filter based on condition 2.
For instance, in the case of the iris dataset,
how can I drop the observations among Species=="setosa" where Sepal.Length<5.5, while keeping all observations with Sepal.Length<5.5 for other species?
I know how to do this in steps, but I wonder if there is a better way to do it in a one-liner.
# this is how I would do it in steps.
library(data.table)
library(dplyr)  # for full_join
data("iris")
# first, select only the setosa observations I am interested in keeping
iris1 <- setDT(iris)[Sepal.Length >= 5.5 & Species == "setosa"]
# second, keep all observations that are not setosa
iris2 <- setDT(iris)[Species != "setosa"]
# join the two pieces
iris_final <- full_join(iris1, iris2)
head(iris_final)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.8 4.0 1.2 0.2 setosa
2: 5.7 4.4 1.5 0.4 setosa
3: 5.7 3.8 1.7 0.3 setosa
4: 5.5 4.2 1.4 0.2 setosa
5: 5.5 3.5 1.3 0.2 setosa # only keeping setosa with Sepal.Length>=5.5. Note that for other species, Sepal.Length can be <5.5
6: 7.0 3.2 4.7 1.4 versicolor
Is there a more concise and elegant way of doing this?
Is something like the following what you are looking for? It is not very clear what you want.
library(data.table)
dt <- data.table(iris)
dt[Sepal.Length >= 5.5 & Species == "setosa" | Species != "setosa"]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 5.8 4.0 1.2 0.2 setosa
#> 2: 5.7 4.4 1.5 0.4 setosa
#> 3: 5.7 3.8 1.7 0.3 setosa
#> 4: 5.5 4.2 1.4 0.2 setosa
#> 5: 5.5 3.5 1.3 0.2 setosa
#> ---
#> 101: 6.7 3.0 5.2 2.3 virginica
#> 102: 6.3 2.5 5.0 1.9 virginica
#> 103: 6.5 3.0 5.2 2.0 virginica
#> 104: 6.2 3.4 5.4 2.3 virginica
#> 105: 5.9 3.0 5.1 1.8 virginica
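You can use the | (or) operator:
The condition below removes rows where Species=="setosa" & Sepal.Length<5.5 and keeps every other row, so the extra | Sepal.Length>5.5 clause does not change the result. Note that iris1 here is assumed to be the full iris data as a data.table (the output has 105 rows), not the setosa-only subset from the question.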
iris1[!(Species=="setosa" & Sepal.Length<5.5) | Sepal.Length>5.5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.8 4.0 1.2 0.2 setosa
2: 5.7 4.4 1.5 0.4 setosa
3: 5.7 3.8 1.7 0.3 setosa
4: 5.5 4.2 1.4 0.2 setosa
5: 5.5 3.5 1.3 0.2 setosa
---
101: 6.7 3.0 5.2 2.3 virginica
102: 6.3 2.5 5.0 1.9 virginica
103: 6.5 3.0 5.2 2.0 virginica
104: 6.2 3.4 5.4 2.3 virginica
105: 5.9 3.0 5.1 1.8 virginica
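The two answers above select the same rows, which a short check can confirm. The following is a minimal sketch, assuming the full iris data is converted with as.data.table; the names dt, kept and same are illustrative.

```r
library(data.table)

dt <- as.data.table(iris)  # copy, so the built-in iris is left untouched

# Drop only the setosa rows with Sepal.Length < 5.5; keep everything else
kept <- dt[!(Species == "setosa" & Sepal.Length < 5.5)]

# Same rows as the two-branch condition used in the first answer
same <- dt[Sepal.Length >= 5.5 & Species == "setosa" | Species != "setosa"]

nrow(kept)
```

Negating the single "drop" condition tends to read closer to the stated goal than listing both "keep" branches.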

How can I draw a random sample from a dataset, proportionate to size, based on different proportions for each value of a factor variable, in R

I want to draw a random sample from my dataset, using a different proportion for each value of a factor variable, as well as weights stored in another column. A dplyr solution using pipes is preferred, as it can be inserted easily into longer code.
Let's take the iris dataset as an example. The Species column has three values with 50 rows each. Let's also assume the sample weights are stored in the Sepal.Length column. If I have to sample equal proportions (or equal numbers of rows) per species, the problem is easy to solve:
library(tidyverse)
iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)
# A tibble: 15 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.4 3.7 1.5 0.2 setosa
2 5.3 3.7 1.5 0.2 setosa
3 5.7 4.4 1.5 0.4 setosa
4 5 3.5 1.6 0.6 setosa
5 4.8 3.1 1.6 0.2 setosa
6 6.1 2.9 4.7 1.4 versicolor
7 6.7 3.1 4.7 1.5 versicolor
8 5 2 3.5 1 versicolor
9 7 3.2 4.7 1.4 versicolor
10 5.7 2.9 4.2 1.3 versicolor
11 7.2 3.2 6 1.8 virginica
12 6.7 2.5 5.8 1.8 virginica
13 6.4 2.8 5.6 2.1 virginica
14 6.3 3.3 6 2.5 virginica
15 7.2 3 5.8 1.6 virginica
But I get stuck when I need to sample a different proportion for each species, say 10%, 20%, and 25% respectively.
iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)
#Error: `prop` must be a single number
OR
iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0
Please help
If I understand you right:
iris %>%
group_split(Species) %>%
map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
[[1]]
# A tibble: 5 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.9 3 1.4 0.2 setosa
2 4.8 3 1.4 0.1 setosa
3 5.2 4.1 1.5 0.1 setosa
4 5 3.5 1.6 0.6 setosa
5 5.2 3.5 1.5 0.2 setosa
[[2]]
# A tibble: 10 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.3 2.5 4.9 1.5 versicolor
2 5.5 2.6 4.4 1.2 versicolor
3 6.9 3.1 4.9 1.5 versicolor
4 6.6 2.9 4.6 1.3 versicolor
5 6.1 3 4.6 1.4 versicolor
6 5.7 2.8 4.5 1.3 versicolor
7 6.7 3.1 4.4 1.4 versicolor
8 5.1 2.5 3 1.1 versicolor
9 5.7 3 4.2 1.2 versicolor
10 7 3.2 4.7 1.4 versicolor
[[3]]
# A tibble: 12 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.4 3.2 5.3 2.3 virginica
2 7.2 3.2 6 1.8 virginica
3 6.3 3.3 6 2.5 virginica
4 6.2 2.8 4.8 1.8 virginica
5 7.6 3 6.6 2.1 virginica
6 5.7 2.5 5 2 virginica
7 4.9 2.5 4.5 1.7 virginica
8 6.7 3.1 5.6 2.4 virginica
9 7.7 2.8 6.7 2 virginica
10 6.7 3.3 5.7 2.5 virginica
11 6 3 4.8 1.8 virginica
12 5.6 2.8 4.9 2 virginica
Just change map2 to map2_df if you want a data frame returned:
iris %>%
group_split(Species) %>%
map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
# A tibble: 27 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.7 3.8 1.7 0.3 setosa
2 4.8 3.1 1.6 0.2 setosa
3 5.1 3.8 1.5 0.3 setosa
4 4.9 3.6 1.4 0.1 setosa
5 4.8 3.4 1.6 0.2 setosa
6 5.7 2.8 4.1 1.3 versicolor
7 6.6 3 4.4 1.4 versicolor
8 6.8 2.8 4.8 1.4 versicolor
9 5.8 2.7 4.1 1 versicolor
10 6.4 3.2 4.5 1.5 versicolor
# ... with 17 more rows
A similar solution using purrr.
First we specify our proportions for each Species.
props <- c(setosa=0.1, versicolor=0.2, virginica=0.5)
Then we iterate over each name-value pair in props using imap_dfr. For each pair (.x is the proportion, .y is the species name), we filter the rows of the data frame to only contain that species, and then sample the specified proportion using slice_sample.
imap_dfr(props,
~filter(iris, Species==.y) %>%
slice_sample(prop=.x))
Using imap_dfr then puts together the three data frames (one for each species) into a single data frame.
Here's the result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.8 3.1 1.6 0.2 setosa
2 5.0 3.5 1.3 0.3 setosa
3 5.1 3.8 1.6 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 4.9 3.1 1.5 0.2 setosa
6 6.7 3.1 4.7 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.1 3.0 4.6 1.4 versicolor
9 5.6 3.0 4.5 1.5 versicolor
10 6.6 2.9 4.6 1.3 versicolor
11 5.5 2.6 4.4 1.2 versicolor
12 6.7 3.0 5.0 1.7 versicolor
13 5.7 2.6 3.5 1.0 versicolor
14 5.9 3.2 4.8 1.8 versicolor
15 5.4 3.0 4.5 1.5 versicolor
16 5.8 2.8 5.1 2.4 virginica
17 6.7 3.3 5.7 2.1 virginica
18 7.4 2.8 6.1 1.9 virginica
19 6.4 2.8 5.6 2.1 virginica
20 6.7 3.1 5.6 2.4 virginica
21 6.1 3.0 4.9 1.8 virginica
22 6.0 2.2 5.0 1.5 virginica
23 6.3 2.7 4.9 1.8 virginica
24 6.3 2.8 5.1 1.5 virginica
25 7.2 3.2 6.0 1.8 virginica
26 7.7 2.6 6.9 2.3 virginica
27 5.8 2.7 5.1 1.9 virginica
28 4.9 2.5 4.5 1.7 virginica
29 6.7 3.0 5.2 2.3 virginica
30 7.7 3.8 6.7 2.2 virginica
31 6.9 3.1 5.4 2.1 virginica
32 5.8 2.7 5.1 1.9 virginica
33 6.8 3.0 5.5 2.1 virginica
34 6.3 2.5 5.0 1.9 virginica
35 6.9 3.1 5.1 2.3 virginica
36 6.3 3.3 6.0 2.5 virginica
37 7.6 3.0 6.6 2.1 virginica
38 6.5 3.0 5.5 1.8 virginica
39 7.7 2.8 6.7 2.0 virginica
40 6.5 3.2 5.1 2.0 virginica
You can keep the information of proportion in the dataframe itself and sample rows from it.
library(dplyr)
iris %>%
distinct(Species) %>%
mutate(prop = c(0.1, 0.2, 0.25)) %>%
inner_join(iris, by = 'Species') %>%
group_by(Species) %>%
sample_n(first(prop)*n()) -> result
result %>% count(Species)
# Species n
# <fct> <int>
#1 setosa 5
#2 versicolor 10
#3 virginica 12
I expected slice_sample(prop = first(prop)) to work, but it doesn't; hence, I used sample_n.
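None of the answers above passes the weights that the question asked for. A hedged sketch that combines the per-species proportions with weight_by = Sepal.Length, assuming group_split returns the groups in factor-level order (setosa, versicolor, virginica) so they line up with props; the names props and sampled are illustrative:

```r
library(dplyr)
library(purrr)

# One proportion per species, in the same order as the factor levels
props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)

set.seed(1)  # only to make the sketch reproducible
sampled <- iris %>%
  group_split(Species) %>%
  # .x is one species' rows, .y is the matching proportion
  map2(props, ~ slice_sample(.x, prop = .y, weight_by = Sepal.Length)) %>%
  bind_rows()

count(sampled, Species)
```

Because the pairing is positional, it is worth checking the counts per species after sampling.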

Select the first and last rows within a data frame?

Is there a function in base R that shows the first and last rows of a data frame? I know that functions like ropls::strF, and printing a data.table object, can do this. This is not the same as the topic "Select first and last row from grouped data".
ropls::strF(iris)
#Sepal.Length Sepal.Width ... Petal.Width Species
#numeric numeric ... numeric factor
#nRow nCol size NAs
#150 5 0 Mb 0
#Sepal.Length Sepal.Width ... Petal.Width Species
#1 5.1 3.5 ... 0.2 setosa
#2 4.9 3 ... 0.2 setosa
#... ... ... ... ... ...
#149 6.2 3.4 ... 2.3 virginica
#150 5.9 3 ... 1.8 virginica
library(data.table)
a <- as.data.table(iris)
a
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1: 5.1 3.5 1.4 0.2 setosa
#2: 4.9 3.0 1.4 0.2 setosa
#3: 4.7 3.2 1.3 0.2 setosa
#4: 4.6 3.1 1.5 0.2 setosa
#5: 5.0 3.6 1.4 0.2 setosa
#---
#146: 6.7 3.0 5.2 2.3 virginica
#147: 6.3 2.5 5.0 1.9 virginica
#148: 6.5 3.0 5.2 2.0 virginica
#149: 6.2 3.4 5.4 2.3 virginica
#150: 5.9 3.0 5.1 1.8 virginica
As others said in the comments, there isn't a function in base R to do this, but it's straightforward enough to write a function that binds together the first N rows and last N rows.
head_and_tail <- function(x, n = 1) {
rbind(
head(x, n),
tail(x, n)
)
}
head_and_tail(iris, n = 3)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
Created on 2018-12-22 by the reprex package (v0.2.1)
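One caveat with binding head and tail is that rows are repeated when the data frame has fewer than 2 * n rows. A possible variant, sketched here under the assumption that duplicates are unwanted, selects unique row indices instead; head_and_tail_unique is a hypothetical name.

```r
# Hypothetical variant: pick row indices first so overlapping rows appear once
head_and_tail_unique <- function(x, n = 1) {
  rows <- seq_len(nrow(x))
  idx <- unique(c(head(rows, n), tail(rows, n)))
  x[idx, , drop = FALSE]
}

head_and_tail_unique(iris, n = 3)

# With only 4 rows and n = 3, each row is shown once
small <- head(iris, 4)
head_and_tail_unique(small, n = 3)
```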

Select entire row based on calculation done to column in data.table [duplicate]

I understand that data.table allows you to do computations based on groups within a column. For example:
Reproducible example
library(data.table)
iris <- as.data.table(iris)
iris[,.SD[which.min(Petal.Width)], by=Species]
generating
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 4.9 3.1 1.5 0.1
2: versicolor 4.9 2.4 3.3 1.0
3: virginica 6.1 2.6 5.6 1.4
I want every row where the minimum is met, not just the first, which is easily achieved with a data.frame. For example, this:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
10 4.9 3.1 1.5 0.1 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
33 5.2 4.1 1.5 0.1 setosa
38 4.9 3.6 1.4 0.1 setosa
58 4.9 2.4 3.3 1.0 versicolor
61 5.0 2.0 3.5 1.0 versicolor
63 6.0 2.2 4.0 1.0 versicolor
68 5.8 2.7 4.1 1.0 versicolor
80 5.7 2.6 3.5 1.0 versicolor
82 5.5 2.4 3.7 1.0 versicolor
94 5.0 2.3 3.3 1.0 versicolor
135 6.1 2.6 5.6 1.4 virginica
What I don't want is just the first row where the minimum is met.
The result I want is equivalent to doing something like this with a data.frame:
iris
iris <- as.data.frame(iris) #in case reader does not start new R session
f.min <- function(spec) {
  spec.sub <- iris[iris$Species == spec, ]
  # return every row that ties for the minimum Petal.Width
  spec.sub[spec.sub$Petal.Width == min(spec.sub$Petal.Width), ]
}
do.call(rbind, lapply(levels(iris$Species), f.min ))
There are some powerful features in data.table that are worth learning, which is why I would like to know the equivalent in data.table.
Try:
iris[,.SD[which.min(Petal.Width)], by=Species]
This gives the minimum per group but does not show ties.
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1: setosa 4.9 3.1 1.5 0.1
2: versicolor 4.9 2.4 3.3 1.0
3: virginica 6.1 2.6 5.6 1.4
A dplyr solution showing the ties as well would be:
require(dplyr)
require(magrittr)
iris %>%
group_by(Species) %>%
filter(rank(Petal.Width, ties.method= "min") == 1)
Source: local data table [13 x 5]
Groups: Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.1 1.5 0.1 setosa
2 4.8 3.0 1.4 0.1 setosa
3 4.3 3.0 1.1 0.1 setosa
4 5.2 4.1 1.5 0.1 setosa
5 4.9 3.6 1.4 0.1 setosa
6 4.9 2.4 3.3 1.0 versicolor
7 5.0 2.0 3.5 1.0 versicolor
8 6.0 2.2 4.0 1.0 versicolor
9 5.8 2.7 4.1 1.0 versicolor
10 5.7 2.6 3.5 1.0 versicolor
11 5.5 2.4 3.7 1.0 versicolor
12 5.0 2.3 3.3 1.0 versicolor
13 6.1 2.6 5.6 1.4 virginica
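The 'ties.method' parameter controls how ties are ranked. With "min", all tied values share the lowest rank, so every row equal to the group minimum gets rank 1 and is kept by the filter.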
Hope this helps.
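For the data.table equivalent that the question asks about, one common idiom is to compare each value with the group minimum inside .SD instead of using which.min, which returns only the first position. A minimal sketch, using a copy named dt (an illustrative name) so the built-in iris is not modified:

```r
library(data.table)

dt <- as.data.table(iris)

# Keep every row whose Petal.Width equals its species' minimum (ties included)
ties <- dt[, .SD[Petal.Width == min(Petal.Width)], by = Species]

ties[, .N, by = Species]
```

This should return the same 13 rows as the data.frame and dplyr versions shown above.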

How to replicate a ddply behavior that uses a custom function with dplyr?

I'm trying to replace all my plyr calls with dplyr. There are still a few snags and one of them is with the group_by function. I imagine it acts the same way as the second ddply argument and does a split, apply and combine based on the grouping variables I list. But that doesn't appear to be the case. Here is a rather trivial example.
Let's define a silly function
mm <- function(x) return(x[1:5, ])
Now we can split the iris dataset by species and apply this function to each piece.
ddply(iris, .(Species), mm)
This works as intended. However, when I try the same with dplyr, it doesn't work as expected.
iris %>% group_by(Species) %>% mm
What am I doing wrong?
As shown in ?do, you can refer to a group with . in your expression. The following will replicate your ddply output:
iris %>% group_by(Species) %>% do(.[1:5, ])
# Source: local data frame [15 x 5]
# Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 7.0 3.2 4.7 1.4 versicolor
# 7 6.4 3.2 4.5 1.5 versicolor
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 5.5 2.3 4.0 1.3 versicolor
# 10 6.5 2.8 4.6 1.5 versicolor
# 11 6.3 3.3 6.0 2.5 virginica
# 12 5.8 2.7 5.1 1.9 virginica
# 13 7.1 3.0 5.9 2.1 virginica
# 14 6.3 2.9 5.6 1.8 virginica
# 15 6.5 3.0 5.8 2.2 virginica
More generally, to apply a custom function to groups with dplyr, you can do something like the following (thanks #docendodiscimus):
iris %>% group_by(Species) %>% do(mm(.))
slice was created for this:
library(dplyr)
iris %>% group_by(Species) %>% slice(1:5)
#> # A tibble: 15 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 7 3.2 4.7 1.4 versicolor
#> 7 6.4 3.2 4.5 1.5 versicolor
#> 8 6.9 3.1 4.9 1.5 versicolor
#> 9 5.5 2.3 4 1.3 versicolor
#> 10 6.5 2.8 4.6 1.5 versicolor
#> 11 6.3 3.3 6 2.5 virginica
#> 12 5.8 2.7 5.1 1.9 virginica
#> 13 7.1 3 5.9 2.1 virginica
#> 14 6.3 2.9 5.6 1.8 virginica
#> 15 6.5 3 5.8 2.2 virginica
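In recent dplyr versions do() is marked as superseded, and group_modify() is one possible replacement for applying a custom function to each group. The following is a sketch assuming dplyr 0.8.1 or later; note that group_modify passes the group's rows without the grouping column and adds it back to the result.

```r
library(dplyr)

# The custom function from the question
mm <- function(x) x[1:5, ]

# group_modify() calls the function once per group; .x holds that group's rows
first_five <- iris %>%
  group_by(Species) %>%
  group_modify(~ mm(.x)) %>%
  ungroup()

first_five
```

Unlike the ddply output, the Species column ends up first in the result.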
