Is there a tidyr::extract equivalent for character vectors?

I started wondering about this after coming across another question.
library(tidyverse)
set.seed(42)
df <- data.frame(x = cut(runif(100), c(0,25,75,125,175,225,299)))
tidyr::extract does a nice job of splitting a column into groups defined by the regex:
df %>%
  extract(x, c("start", "end"), "(\\d+),(\\d+)") %>% head
#> start end
#> 1 0 25
#> 2 0 25
#> 3 0 25
#> 4 0 25
#> 5 0 25
#> 6 0 25
Below is the desired output on a character vector. I know I could write a new function for this, but I wondered whether one already exists.
x_chr <- as.character(df$x)
des_res <- str_split(str_extract(x_chr, "(\\d+),(\\d+)"), ",")
head(des_res)
#> [[1]]
#> [1] "0" "25"
#>
#> [[2]]
#> [1] "0" "25"
#>
#> [[3]]
#> [1] "0" "25"
#>
#> [[4]]
#> [1] "0" "25"
#>
#> [[5]]
#> [1] "0" "25"
#>
#> [[6]]
#> [1] "0" "25"

You can use strcapture in base R:
strcapture("(\\d+),(\\d+)", x_chr,
           proto = list(start = numeric(), end = numeric()))
# start end
#1 0 25
#2 0 25
#3 0 25
#4 0 25
#5 0 25
#6 0 25
#...
#...
You can also use stringr::str_match:
stringr::str_match(x_chr, "(\\d+),(\\d+)")[, -1]
In the matrix returned by str_match, the first column holds the complete match and each subsequent column holds one capture group.
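The same "full match first, capture groups after" layout can be reproduced in base R with regexec() and regmatches(). The following is only a minimal sketch: the three-element x_chr is a hypothetical stand-in for as.character(df$x), and it assumes every element matches the pattern (a non-matching element would yield an empty vector and silently drop out of the matrix).

```r
# Hypothetical sample input standing in for as.character(df$x)
x_chr <- c("(0,25]", "(25,75]", "(75,125]")

# regexec() locates the full match and each capture group;
# regmatches() pulls out the corresponding substrings
matches <- regmatches(x_chr, regexec("(\\d+),(\\d+)", x_chr))

# Stack into a character matrix and drop the first column (the complete match)
groups <- do.call(rbind, matches)[, -1, drop = FALSE]
colnames(groups) <- c("start", "end")
groups
```

The result is a character matrix, so wrap the columns in as.numeric() if numbers are needed.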

Related

Subset with only consecutive numbers

I want to generate all subsets of {1,2,3,4} that contain only consecutive numbers. (For example, I want the subsets {1}, {1,2} or {2,3,4}, but not {2,4}.)
This is what I have been trying:
library(ggm)
p2<-powerset(1:4, sort = TRUE, nonempty = TRUE)
m2<-p2
for (i in 1:length(p2)){
  ifelse(length(p2[[i]]) < 2, m2 <- m2,
         ifelse(max(diff(as.numeric(p2[[i]]))) > 1, m2 <- m2[-c(i)], m2 <- m2))
}
My plan is to first generate the power set of {1,2,3,4} and then exclude the subsets with non-consecutive numbers. But when I run the
m2<-m2[- c(i)]
command in the second ifelse to exclude those subsets, I believe I change the indices of the power set, so I keep getting the wrong subsets.
Any suggestions on how to do it correctly?
Thanks!

You can get all unique ascending sequences between 1 and 4 in base R with the following one-liner:
apply(which(upper.tri(diag(4), TRUE), TRUE), 1, function(x) x[1]:x[2])
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 1 2
#>
#> [[3]]
#> [1] 2
#>
#> [[4]]
#> [1] 1 2 3
#>
#> [[5]]
#> [1] 2 3
#>
#> [[6]]
#> [1] 3
#>
#> [[7]]
#> [1] 1 2 3 4
#>
#> [[8]]
#> [1] 2 3 4
#>
#> [[9]]
#> [1] 3 4
#>
#> [[10]]
#> [1] 4
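If you prefer to keep the "build the power set, then exclude" idea from the question, filtering avoids the shifting-index problem because nothing is deleted while looping. This is a rough sketch rather than the approach above: it builds the subsets with combn() instead of ggm::powerset(), and is_consecutive is a hypothetical helper name.

```r
n <- 4

# All non-empty subsets of 1:n, built size by size with combn()
subsets <- unlist(
  lapply(seq_len(n), function(k) combn(n, k, simplify = FALSE)),
  recursive = FALSE
)

# Hypothetical helper: TRUE when neighbouring elements always differ by 1
# (single-element subsets pass, because all() of an empty vector is TRUE)
is_consecutive <- function(s) all(diff(s) == 1)

# Filter() keeps the matching subsets without modifying the list in place
consecutive <- Filter(is_consecutive, subsets)
length(consecutive)
```

The filtered list contains the same ten subsets as the one-liner, in a different order.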

Change every element of every list (nested lists) using purrr

I have hundreds of observations of census data. Each feature is stored within a list named census. I am trying to perform an action on:
a) all elements of all lists: I want to make all non-character elements numeric.
b) a named element present within each list: I want to remove a prefix from a named column in every list.
A toy example is below.
census is a list of nested lists:
library(tidyverse)
library(purrr)
POA_CODE = c("POA101","POA102")
dogs = c(4,4)
cats = c(3,2)
children = c(0, 1)
salary = c(100, 120)
employed.prop = c(1,0.5)
pets <- list(POA_CODE, as.integer(dogs), as.integer(cats))
children <-list(POA_CODE, as.integer(children))
employment <-list(POA_CODE, salary, employed.prop)
census <- list(pets, children, employment)
Attempt to change all non-numeric elements in every list to numeric
#change all non-numeric elements to numeric
census_num <- census %>%
map(function(x){
ifelse(is.character == TRUE, x,
as.numeric(x))}
)
I get the following error message:
Error in is.character == TRUE :
comparison (1) is possible only for atomic and list types
Attempt to remove prefix from every postcode in census[[]]$'POA_CODE'
#Remove "POA" prefix from every postcode
census_code <- pmap(census, ~.x[["POA_CODE"]],function(x){
str_replace(POA_CODE,"POA","")
})
I get the error
Error: Element 2 of `.l` must have length 1 or 3, not 2

You have a nested list, so you need nested maps:
library(purrr)
map(census, function(x) map_if(x, is.character, ~as.numeric(sub('POA', '', .x))))
#[[1]]
#[[1]][[1]]
#[1] 101 102
#[[1]][[2]]
#[1] 4 4
#[[1]][[3]]
#[1] 3 2
#[[2]]
#[[2]][[1]]
#[1] 101 102
#[[2]][[2]]
#[1] 0 1
#[[3]]
#[[3]][[1]]
#[1] 101 102
#[[3]][[2]]
#[1] 100 120
#[[3]][[3]]
#[1] 1.0 0.5
In base R, we can solve it with nested lapply:
lapply(census, function(x) lapply(x, function(y)
if(is.character(y)) as.numeric(sub('POA', '', y)) else y))
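The error in the question comes from writing is.character == TRUE, which compares the function itself with TRUE instead of calling it on a value. A minimal sketch of the nested lapply with the test pulled out into a helper might look like this; strip_prefix is a hypothetical name, and the two-element census is a cut-down stand-in for the toy data.

```r
# Cut-down stand-in for the toy census list
census <- list(
  list(c("POA101", "POA102"), c(4L, 4L), c(3L, 2L)),
  list(c("POA101", "POA102"), c(0L, 1L))
)

# Hypothetical helper: call is.character() on the element itself,
# strip the prefix from character vectors, leave everything else alone
strip_prefix <- function(y) {
  if (is.character(y)) as.numeric(sub("^POA", "", y)) else y
}

# Outer lapply walks the top-level lists, inner lapply walks their elements
cleaned <- lapply(census, function(x) lapply(x, strip_prefix))
str(cleaned)
```

Using if/else rather than ifelse() matters here: the condition is a single TRUE or FALSE per element, and the whole vector should be returned.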

You could use rapply() in base R:
rapply(
census,
function(x) if(is.character(x)) as.numeric(sub("^\\D+","", x)) else x,
how = "replace")
#> [[1]]
#> [[1]][[1]]
#> [1] 101 102
#>
#> [[1]][[2]]
#> [1] 4 4
#>
#> [[1]][[3]]
#> [1] 3 2
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 101 102
#>
#> [[2]][[2]]
#> [1] 0 1
#>
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 101 102
#>
#> [[3]][[2]]
#> [1] 100 120
#>
#> [[3]][[3]]
#> [1] 1.0 0.5
Or use purrr::map_depth():
library(purrr)
map_depth(census, 2, ~if(is.character(.)) as.numeric(sub("^\\D+","", .)) else .)
#> [[1]]
#> [[1]][[1]]
#> [1] 101 102
#>
#> [[1]][[2]]
#> [1] 4 4
#>
#> [[1]][[3]]
#> [1] 3 2
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 101 102
#>
#> [[2]][[2]]
#> [1] 0 1
#>
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 101 102
#>
#> [[3]][[2]]
#> [1] 100 120
#>
#> [[3]][[3]]
#> [1] 1.0 0.5

We can use rrapply with parse_number:
library(rrapply)
library(readr)
rrapply(census, f = function(x) if(is.character(x)) readr::parse_number(x) else x)
#[[1]]
#[[1]][[1]]
#[1] 101 102
#[[1]][[2]]
#[1] 4 4
#[[1]][[3]]
#[1] 3 2
#[[2]]
#[[2]][[1]]
#[1] 101 102
#[[2]][[2]]
#[1] 0 1
#[[3]]
#[[3]][[1]]
#[1] 101 102
#[[3]][[2]]
#[1] 100 120
#[[3]][[3]]
#[1] 1.0 0.5

Replacing values in a list based on a condition

I have a list of values called squares and would like to replace every value that is 0 with 40.
I tried:
replace(squares, squares==0, 40)
but the list remains unchanged.

If it is a list, loop through it with lapply and use replace on each element:
squares <- lapply(squares, function(x) replace(x, x==0, 40))
squares
#[[1]]
#[1] 40 1 2 3 4 5
#[[2]]
#[1] 1 2 3 4 5 6
#[[3]]
#[1] 40 1 2 3
data
squares <- list(0:5, 1:6, 0:3)
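One thing that may explain why the list "remains unchanged": replace() returns a modified copy and never alters its input, so the result has to be assigned. A small sketch using the same data (the name replaced is only illustrative):

```r
squares <- list(0:5, 1:6, 0:3)

# replace() works on a copy of each vector; nothing is modified in place
replaced <- lapply(squares, function(x) replace(x, x == 0, 40))

squares[[1]]   # still starts with 0
replaced[[1]]  # starts with 40
```

Assigning the result back to squares, as in the answer above, is what makes the change stick.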

If every element of the list is a single value, I think you can just treat it as if it were a vector:
squares=list(2,4,6,0,8,0,10,20)
squares[squares==0]=40
Output:
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 6
[[4]]
[1] 40
[[5]]
[1] 8
[[6]]
[1] 40
[[7]]
[1] 10
[[8]]
[1] 20

Remove outliers based on a preceding value

How can I remove outliers using the criterion that a value cannot be more than 2-fold higher than its preceding one?
Here is my try:
x<-c(1,2,6,4,10,20,50,10,2,1)
remove_outliers <- function(x, na.rm = TRUE, ...) {
for(i in 1:length(x))
x < (x[i-1] + 2*x)
x
}
remove_outliers(x)
Expected outcome: 1, 2, 4, 10, 20, 2, 1
Thanks!

I think the first 10 should also be removed from your data, because 10 > 2*4. Here's a way to do what you want without loops, using the dplyr version of lag.
library(dplyr)
x<-c(1,2,6,4,10,20,50,10,2,1)
x[c(TRUE,na.omit(x<=dplyr::lag(x)*2))]
[1] 1 2 4 20 10 2 1
EDIT
To use this with a data.frame:
df <- data.frame(id=1:10, x=c(1,2,6,4,10,20,50,10,2,1))
df[c(TRUE,na.omit(df$x<=dplyr::lag(df$x,1)*2)),]
id x
1 1 1
2 2 2
4 4 4
6 6 20
8 8 10
9 9 2
10 10 1
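If you would rather not load dplyr, the lagged vector can be built by hand. This is a sketch of the same comparison in base R; prev and keep are illustrative names, and like the code above it compares each value with the original preceding value, not with the last value that was kept.

```r
x <- c(1, 2, 6, 4, 10, 20, 50, 10, 2, 1)

# Shift x one position to the right; the first element has no predecessor
prev <- c(NA, head(x, -1))

# Keep the first value, plus every value at most twice its predecessor
keep <- is.na(prev) | x <= 2 * prev
x[keep]
#> [1]  1  2  4 20 10  2  1
```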

A simple sapply:
bool<-sapply(seq_along(x),function(i) {ifelse(x[i]<2*x[i-1],FALSE,TRUE)})
bool
[[1]]
logical(0)
[[2]]
[1] TRUE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] TRUE
[[6]]
[1] TRUE
[[7]]
[1] TRUE
[[8]]
[1] FALSE
[[9]]
[1] FALSE
[[10]]
[1] FALSE
resulting in:
x[unlist(bool)]
[1] 1 2 4 10 20 1

Multiple operators in a string

I have some operators in a list:
[[1]]
[1] "*"
[[2]]
[1] "-"
[[3]]
[1] "+"
[[4]]
[1] "/"
[[5]]
[1] "^"
I want to apply these operations between two datasets of the same dimensions, for example dataset1*dataset2, dataset1-dataset2, etc. Is this possible using the strings in the list?

Yes, here is one example:
ops <- list("+", "-")
x <- y <- 1:10
lapply(ops, function(op) eval(parse(text = paste0("x", op, "y"))))
# [[1]]
# [1] 2 4 6 8 10 12 14 16 18 20
#
# [[2]]
# [1] 0 0 0 0 0 0 0 0 0 0
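Since operators in R are ordinary functions, another option is to look each one up by name with match.fun() rather than building code strings for eval(parse()). This is a sketch of that alternative, not the approach shown above; results is an illustrative name.

```r
ops <- list("*", "-", "+", "/", "^")
x <- y <- 1:10

# match.fun("+") returns the function behind the operator,
# which can then be called like any other two-argument function
results <- lapply(ops, function(op) match.fun(op)(x, y))
names(results) <- unlist(ops)

results[["+"]]
#>  [1]  2  4  6  8 10 12 14 16 18 20
```

Because the operators are vectorised, the same call should also work element-wise on two matrices or data frames of equal dimensions.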
