I have a vector of strings where people were asked to guess someone's age; it includes statements like "50-60", "ca. 50" or ">50". I want to use regular expressions to match these cases and get the real numerical values. "50-60" should produce 55 (the mean of both values), the other two examples 50.
For each variant, I wanted to have a case in a switch like below, but it doesn't seem to work. Is it even possible to use a regex in a switch?
switch(string,
  str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+") = {
    first <- str_sub(string, 1, 2) %>% as.numeric()
    second <- str_sub(string, str_length(string) - 1, str_length(string)) %>% as.numeric()
    value <- mean(c(first, second))
  },
  str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+") = {
    value <- str_sub(string, str_length(string) - 1, str_length(string)) %>% as.numeric()
  },
  str_detect(string, "[:digit:]+[:punct:]") = {
    value <- str_sub(string, 1, 2) %>% as.numeric()
  },
  print(string, " could not be matched")
)
The expressions themselves work as intended (as far as I tested), so I guess I just can't use them in the switch like this. However, I couldn't find a solution anywhere.
Edit: Added what the expected output for the examples is
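For what it's worth, base switch() compares a character EXPR against the literal names of its arguments, so a pattern like "[:digit:]+" is never interpreted as a regex there; a tiny illustration of the exact-match behaviour:
# switch() matches the value against literal case labels, not patterns
switch("ca. 50",
       "50-60"  = "range",
       "ca. 50" = "approximation",
       "no exact label matched")
#> [1] "approximation"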
We can do this with tidyverse methods:
Convert the string to a tibble/data.frame.
Remove the characters not needed with str_remove_all.
Then separate the column into two by specifying the sep.
Get the rowMeans.
library(dplyr)
library(tidyr)
library(stringr)
tibble(mystring) %>%
mutate(mystring = str_remove_all(mystring, "[A-Za-z.><]+")) %>%
separate(mystring, into = c('col1', 'col2'), sep="[- ]+",
convert = TRUE) %>%
transmute(out = rowMeans(., na.rm = TRUE))
-output
# A tibble: 3 x 1
out
<dbl>
1 55
2 50
3 50
data
mystring <- c("50-60", "ca. 50", ">50")
You can use a nested if/else approach -
library(stringr)
string <- "50-60"
if(str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+")) {
  first <- str_sub(string, 1, 2) %>% as.numeric()
  second <- str_sub(string, str_length(string) - 1, str_length(string)) %>% as.numeric()
  value <- mean(c(first, second))
  value
} else if(str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+")) {
  value <- str_sub(string, str_length(string) - 1, str_length(string)) %>% as.numeric()
  value
} else if(str_detect(string, "[:digit:]+[:punct:]")) {
  value <- str_sub(string, 1, 2) %>% as.numeric()
  value
} else NA
#[1] 55
For string <- "ca. 50" it returns 50.
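If the whole vector should be handled in one go, the same chain can be wrapped in a small helper and applied with sapply(); a sketch using the regexes from above (guess_to_age is just a made-up name):
library(stringr)

guess_to_age <- function(string) {
  if (str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+")) {
    # range such as "50-60": average the first and last two-digit numbers
    first  <- as.numeric(str_sub(string, 1, 2))
    second <- as.numeric(str_sub(string, str_length(string) - 1, str_length(string)))
    mean(c(first, second))
  } else if (str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+")) {
    # "ca. 50", ">50", "~50": take the trailing number
    as.numeric(str_sub(string, str_length(string) - 1, str_length(string)))
  } else if (str_detect(string, "[:digit:]+[:punct:]")) {
    as.numeric(str_sub(string, 1, 2))
  } else NA_real_
}

sapply(c("50-60", "ca. 50", ">50"), guess_to_age, USE.NAMES = FALSE)
#> [1] 55 50 50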
mystring <- c("50-60", "ca. 50", ">50")
library(stringr)
lapply(str_extract_all(mystring, "[0-9]+"),
function(x) if (length(x) == 1) as.numeric(x[1]) else mean(as.numeric(x)))
[[1]]
[1] 55
[[2]]
[1] 50
[[3]]
[1] 50
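If a plain numeric vector is preferred over a list, sapply() with mean() alone also works, since the mean of a single extracted number is that number (a small variation on the above):
sapply(str_extract_all(mystring, "[0-9]+"),
       function(x) mean(as.numeric(x)))
#> [1] 55 50 50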
Related
I have
df = data.frame(Col1 = c( NA, 1," ", 2345.768,"hi","", NA, 3.4, "44.99"))
and want to format specific values, so I created a UDF:
format_it = function(y, n_decimals, dash_type, suffix = ""){
  if(is.na(y)) return(dash_type)
  if(nchar(gsub(" ", "", y)) == 0) return(y)
  has_letter = grep("[A-z]+", y)
  if(is_empty(has_letter) == TRUE) {
    return(paste0(format(round(as.numeric(y), n_decimals), nsmall = n_decimals, big.mark = ","), suffix))
  }
  if(has_letter == 1){
    return(y)
  } else{
    x = as.numeric(y)
    ifelse(is.na(x),
           dash_type,
           paste0(format(round(as.numeric(x), n_decimals), nsmall = n_decimals, big.mark = ","), suffix))
  }
}
I tested each value individually, i.e. format_it(df$Col1[1], 1, "-"), and each one worked OK.
But when I set up a set_formatter in flextable,
df %>%
flextable() %>%
set_formatter(Col1 = function(x) format_it(x,1,"-"))
I hoped the results would be correct, but received the wrong results,
with the message: the condition has length > 1 and only the first element will be used
I tried updating to include Vectorize, but received the same error
Any suggestions?
I would like to see
I'm a little confused by your function, but a fresh approach to recreating your table (based on your function) in a reproducible way is below, and it produces your desired output. It first replaces any NA values in the original data with "-", then uses grepl to find the values made up only of letters (i.e., "hi") and keeps those the same, then standardizes the decimal places of the numeric values with sprintf. This stays within the dplyr "world" using mutate() and case_when() and does not use a user-defined function.
df %>%
  mutate(Col1 = case_when(
    is.na(Col1) ~ "-",
    !grepl("[^A-Za-z]", Col1) ~ Col1,
    grepl(".", Col1) ~ sprintf("%.1f", as.numeric(Col1)),
  )) %>%
  flextable::flextable()
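If the original format_it() should be kept as-is, the "condition has length > 1" problem can also be worked around by applying it element-wise inside set_formatter(), because the formatter receives the whole column at once while the function's if() calls expect a single value. A sketch, assuming format_it() is defined as in the question:
library(flextable)

df %>%
  flextable() %>%
  # sapply() feeds one cell at a time to format_it(), so its scalar if() calls see single values
  set_formatter(Col1 = function(x) sapply(x, format_it, n_decimals = 1, dash_type = "-"))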
I have the following data frame:
val
4.20
4.00
I would like to remove the trailing zeros and the decimal point if there are no additional decimals. Thus, my desired end result is
val
4.2
4
I know that I can use str_remove to get rid of the trailing zeros, but I am left with the decimal point on the 4. How can I update this code to drop that decimal point?
library(tidyverse)
data.frame(val = c("4.20", "4.00")) %>%
mutate(val = str_remove(val, "0+$"))
Edit: The numbers must be stored as a character.
Edit 2: Figured out the solution below. Since the data is stored as characters, the solution needs to be robust to other instances in which you might be dealing with regular old character strings, etc. This function deals with all contingencies:
decimal_func <- function(x) {
  decimalVal_check <- function(y) {
    case_when(
      str_count(y, "\\.") <= 1 &
        str_detect(str_replace(y, "\\.", ""), "^[:digit:]+$") == T ~ "Valid",
      TRUE ~ "Invalid"
    )
  }
  if(decimalVal_check(x) == "Valid") {
    if(str_count(x, "\\.") == 0) {
      x
    } else {
      str_remove(x, "0+$") %>%
        ifelse(substr(., nchar(.), nchar(.)) == ".", str_replace(., "\\.", ""), .)
    }
  } else {
    x
  }
}
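For example, applied element-wise with sapply() (the function uses a scalar if(), so it handles one value at a time):
sapply(c("4.20", "4.00", "4", "abc", "1.2.3"), decimal_func, USE.NAMES = FALSE)
#> [1] "4.2"   "4"     "4"     "abc"   "1.2.3"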
One way:
data.frame(val = c("4.20", "4.00")) %>%
type.convert(as.is =TRUE) %>%
as_tibble()%>%
mutate(val = as.character(val))
# A tibble: 2 x 1
val
<chr>
1 4.2
2 4
Using str_remove:
data.frame(val = c("4.20", "4.00")) %>%
mutate(val = str_remove(val, '\\.?0+$'))
val
1 4.2
2 4
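One caveat: that pattern also strips trailing zeros from whole numbers such as "40". If such values can occur, the removal can be limited to strings that actually contain a decimal point, e.g. along these lines:
library(stringr)
val <- c("4.20", "4.00", "40", "4.05")
# only drop trailing zeros (and a dangling dot) where a decimal point is present
ifelse(str_detect(val, "\\."), str_remove(val, "\\.?0+$"), val)
#> [1] "4.2"  "4"    "40"   "4.05"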
Any of the following can work:
formatC(c(1,2.40,5.06), zero.print = "")
[1] "1" "2.4" "5.06"
prettyNum(c(1,2.40,5.06), zero.print = "")
[1] "1" "2.4" "5.06"
prettyNum(c(1,2.40,5.06), drop0trailing = TRUE)
[1] "1" "2.4" "5.06"
formatC(c(1,2.40,5.06), drop0trailing = TRUE)
[1] "1" "2.4" "5.06"
I have a vector of strings which I want to sort by the number at the end of each string and, within each number, alphabetically.
Final output should be "AGSHIM1", "AGSHIU1", "AGSHIZ1","AGSHIH2", "AGSHIM2","AGSHIU2", "AGSHIZ2"
d<-c("AGSHIZ2", "AGSHIZ1", "AGSHIU1", "AGSHIM1", "AGSHIH2", "AGSHIM2",
"AGSHIU2")
d[order(d,as.numeric(substr(d, nchar(d), nchar(d))))]
>"AGSHIH2" "AGSHIM1" "AGSHIM2" "AGSHIZ1" "AGSHIZ2" "AGSHIU1" "AGSHIU2"
What you can do is separate the number from the string, and sort by the number first, and then within each group of numbers sort alphabetically:
sortSpecial <- function(d) {
  df <- data.frame(
    original = d,
    chars = gsub("[[:digit:]]", "", d),
    nums = gsub("[^[:digit:]]", "", d)
  )
  df <- df[with(df, order(nums, chars)), ]
  return(df$original)
}
d <- sortSpecial(d)
d
# [1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
There should be a more elegant approach, I just don't know it. Nevertheless, let me know if it helps.
Update
I could not help but get inspired by Karthik S's approach. If you don't want to generate the function first, you can do the same steps as before using dplyr:
library(dplyr)
d <- data.frame(d = d) %>%
  mutate(
    chars = gsub("[[:digit:]]", "", d),
    nums = gsub("[^[:digit:]]", "", d)
  ) %>%
  arrange(nums, chars) %>%
  pull(d)
Again, the steps are identical so the choice of approach is a matter of preference.
Another approach, though I am sure a shorter solution exists.
library(dplyr)
library(stringr)
library(tibble)
d %>%
  as.tibble() %>%
  transmute(dig = str_extract(value, '\\d'),
            ltrs = str_remove(value, '\\d')) %>%
  type.convert(as.is = 1) %>%
  arrange(dig, ltrs) %>%
  transmute(d = str_c(ltrs, dig, sep = '')) %>%
  pull(d)
[1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
[1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
Here is one base R option using gsub + order
> d[order(as.numeric(gsub("\\D", "", d)), d)]
[1] "AGSHIM1" "AGSHIU1" "AGSHIZ1" "AGSHIH2" "AGSHIM2" "AGSHIU2" "AGSHIZ2"
I'm trying to convert an integer value to its bit expression and convert parts of that again to integer (as per this blogpost, for context).
Transforming integers to bits does seem to work, but I'm missing something when transforming the values back into the initial integer.
intToBase2 <- function(x){
  x %>%
    intToBits() %>%
    rev %>%
    as.character() %>%
    {sapply(strsplit(., "", fixed = TRUE), '[', 2)} %>%
    paste0(., collapse = '')
}
val <- 1855928457
intToBase2(val) # does seem to return the correct bit expression, as expected
However, if I try to reverse the same logic, something gets lost somewhere:
val %>%
intToBase2 %>% # Get expression in bits
{strsplit(.,'')[[1]]} %>% # split
sprintf('0%s',.) %>% # add leading zeros
rev %>% # reverse order
as.raw %>% # expression to raw
readBin(.,what='integer')
R> [1] 16777217
What am I missing? I assume some of the steps I did are incorrect.
This function reverses yours, and is vectorized:
base2ToInt <- function(string)
{
sapply(strsplit(string, ""), function(x) sum(rev(+(x == "1")) * 2^(seq(x)-1)))
}
So you can do
base2ToInt(c("1", "0", "10", "11111111"))
#> [1] 1 0 2 255
and
base2ToInt(intToBase2(val))
#> [1] 1855928457
> base2ToInt(intToBase2(val)) == val
#> [1] TRUE
It works by splitting the string(s) into characters, so each string becomes a vector of "1"s and "0"s. These are converted to numeric values, then reversed, then multiplied by a sequence of powers of 2 of the same length. This is summed to give an answer for each string.
It's a bit "code golf", but bit manipulation often is...
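As for what goes wrong in the question's reverse pipeline: readBin() consumes whole bytes (four per integer), while each as.raw() value there encodes only a single bit. packBits() is the natural inverse of intToBits() and packs the 32 bit values back into one integer; a sketch:
bits <- strsplit(intToBase2(val), "")[[1]] == "1"   # logical bits, most significant first
packBits(rev(bits), type = "integer")               # packBits() expects least-significant-first, like intToBits()
#> [1] 1855928457
For bit strings that fit into an integer, strtoi(x, base = 2L) is another ready-made inverse.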
I'm attempting to write a function to count the number of consecutive instances of a pattern. As an example, I'd like the string
string<-"A>A>A>B>C>C>C>A>A"
to be transformed into
"3 A > 1 B > 3 C > 2 A"
I've got a function that counts the instances of each string, see below. But it doesn't achieve the ordering effect that I'd like. Any ideas or pointers?
Thanks,
R
Existing function:
fnc_gen_PathName <- function(string) {
  p <- strsplit(as.character(string), ";")
  p1 <- lapply(p, table)
  p2 <- lapply(p1, function(x) {
    sapply(1:length(x), function(i) {
      if(x[i] == 25){
        paste0(x[i], "+ ", names(x)[i])
      } else{
        paste0(x[i], "x ", names(x)[i])
      }
    })
  })
  p3 <- lapply(p2, function(x) paste(x, collapse = "; "))
  p3 <- do.call(rbind, p3)
  return(p3)
}
As commented by @MrFlick, you could try the following using rle and strsplit:
with(rle(strsplit(string, ">")[[1]]), paste(lengths, values, collapse = " > "))
## [1] "3 A > 1 B > 3 C > 2 A"
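For a whole vector of such strings, the same rle() idea can be wrapped in vapply(), for example:
strings <- c("A>A>A>B>C>C>C>A>A", "X>X>Y")
vapply(strsplit(strings, ">"),
       function(x) with(rle(x), paste(lengths, values, collapse = " > ")),
       character(1))
#> [1] "3 A > 1 B > 3 C > 2 A" "2 X > 1 Y"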
Here are two dplyr solutions: one regular and one with rle. Advantages: you can input multiple strings as a vector, and it builds a tidy intermediate dataset before (ugh) re-nesting.
library(dplyr)
library(tidyr)
library(stringi)
strings = "A>A>A>B>C>C>C>A>A"
data_frame(string = strings) %>%
mutate(string_split =
string %>%
stri_split_fixed(">")) %>%
unnest(string_split) %>%
mutate(ID =
string_split %>%
lag %>%
`!=`(string_split) %>%
plyr::mapvalues(NA, TRUE) %>%
cumsum) %>%
count(string, ID, string_split) %>%
group_by(string) %>%
summarize(new_string =
paste(n,
string_split,
collapse = " > ") )
data_frame(string = strings) %>%
group_by(string) %>%
do(.$string %>%
first %>%
stri_split_fixed(">") %>%
first %>%
rle %>%
unclass %>%
as.data.frame) %>%
summarize(new_string =
paste(lengths, values, collapse = " > "))