I have the following data frame:
val
4.20
4.00
I would like to remove the trailing zeros and the decimal point if there are no additional decimals. Thus, my desired end result is
val
4.2
4
I know that I can use str_remove to get rid of the trailing zeros, but I am left with the decimal point on the 4. How can I update this code to drop that decimal point?
library(tidyverse)
data.frame(val = c("4.20", "4.00")) %>%
mutate(val = str_remove(val, "0+$"))
Edit: The numbers must be stored as a character.
Edit 2: I figured out the solution below. Since the data is stored as character, the solution needs to be robust to ordinary non-numeric strings as well. This function is meant to handle those cases:
decimal_func <- function(x) {
decimalVal_check <- function(y) {
case_when(str_count(y, "\\.") <= 1 & str_detect(str_replace(y, "\\.", ""), "^[:digit:]+$") ~ "Valid", TRUE ~ "Invalid")
}
if(decimalVal_check(x) == "Valid") {
if(str_count(x, "\\.") == 0) {
x
} else {
str_remove(x, "0+$") %>%
ifelse(substr(., nchar(.), nchar(.)) == ".", str_replace(., "\\.", ""), .)
}
} else {
x
}
}
One way:
data.frame(val = c("4.20", "4.00")) %>%
type.convert(as.is =TRUE) %>%
as_tibble()%>%
mutate(val = as.character(val))
# A tibble: 2 x 1
val
<chr>
1 4.2
2 4
Using str_remove:
data.frame(val = c("4.20", "4.00")) %>%
mutate(val = str_remove(val, '\\.?0+$'))
val
1 4.2
2 4
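One caveat with the pattern '\\.?0+$': because the decimal point is optional, it also strips trailing zeros from whole numbers, so "100" would become "1". Here is a minimal base R sketch that only touches zeros after a decimal point. The helper name drop_trailing_zeros is made up for illustration, and it assumes the input is a character vector of plain decimal numbers.

```r
# Hypothetical helper: strip trailing zeros only when they follow a decimal point
drop_trailing_zeros <- function(x) {
  # Step 1: remove zeros at the end of the fractional part ("4.20" -> "4.2", "4.00" -> "4.")
  x <- sub("(\\.[0-9]*?)0+$", "\\1", x, perl = TRUE)
  # Step 2: remove a decimal point left dangling at the end ("4." -> "4")
  sub("\\.$", "", x)
}

vals <- c("4.20", "4.00", "100", "4.05", "10.50")
result <- drop_trailing_zeros(vals)
print(result)
```

Strings without a decimal point, such as "100", pass through unchanged because the first pattern requires a literal dot.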
Any of the following can work:
formatC(c(1,2.40,5.06), zero.print = "")
[1] "1" "2.4" "5.06"
prettyNum(c(1,2.40,5.06), zero.print = "")
[1] "1" "2.4" "5.06"
prettyNum(c(1,2.40,5.06), drop0trailing = TRUE)
[1] "1" "2.4" "5.06"
formatC(c(1,2.40,5.06), drop0trailing = TRUE)
[1] "1" "2.4" "5.06"
My situation looks like this:
B = Billion, M = Million
df_example <- c("5,14B", "7,4B", "65,12M", "992,3M", "1,2B")
is.character(df_example)
> TRUE
# Function for converting
converting_mio_bio <- function(x) {
if (str_contains(x, "B")) {
x <- x %>% str_replace_all(",",".")
x <- x %>% str_remove_all("B")
x <- x %>% as.numeric()
x <- x * 1000
}
else {
x <- x %>% str_replace_all(",",".")
x <- x %>% str_remove_all("M")
x <- x %>% as.numeric()
}
print(x)
}
# Applying function
converting_mio_bio(df_example)
> converting_mio_bio(df_example)
[1] 5140 7400   NA   NA 1200
Warning message:
In x %>% as.numeric() : NAs introduced by coercion
This is the result I get, and I am not sure why. I know the warning appears when the string still contains a "," or letters, but the function should take care of both. Can somebody tell me why this happens?
Thank you,
NAs produced by transforming character to numeric
Using ex in the Note at the end (we have renamed it since the name in the question suggests it is a data frame and it is actually a character vector) we use gsubfn to perform the indicated replacements.
library(gsubfn)
gsubfn(".", list("," = ".", B = "e9", M = "e6", K = "e3"), ex) |>
as.numeric()
giving:
[1] 5.140e+09 7.400e+09 6.512e+07 9.923e+08 1.200e+09
Note
ex <- c("5,14B", "7,4B", "65,12M", "992,3M", "1,2B")
We may use
library(stringr)
library(magrittr)
chartr(",", ".", df_example) %>%
str_replace_all(setNames(c("e0", "e3"), c("M", "B"))) %>%
as.numeric
-output
[1] 5140.00 7400.00 65.12 992.30 1200.00
Or similar approach in base R
chartr(",", ".", df_example) |>
gsub("M", "e0", x = _) |>
gsub("B", "e3", x = _) |>
as.numeric()
[1] 5140.00 7400.00 65.12 992.30 1200.00
in base R
with(read.table(text=gsub("([MB])", " \\1", df_example), dec=','), V1*c(B=1000, M=1)[V2])
B B M M B
5140.00 7400.00 65.12 992.30 1200.00
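As for why the original function returns NA: if() looks at a single condition, not one condition per element. Assuming str_contains() here is the one from sjmisc, it appears to return one TRUE for the whole vector as soon as any element contains "B", so every element goes through the "B" branch and the "M" is never removed from "65,12M" and "992,3M". A minimal vectorized sketch in base R (the function name convert_mio_bio is made up for illustration):

```r
df_example <- c("5,14B", "7,4B", "65,12M", "992,3M", "1,2B")

# Hypothetical vectorized rewrite: decide per element, not once for the whole vector
convert_mio_bio <- function(x) {
  is_billion <- grepl("B", x, fixed = TRUE)                 # one TRUE/FALSE per element
  num <- as.numeric(gsub("[BM]", "", chartr(",", ".", x)))  # "5,14B" -> 5.14
  ifelse(is_billion, num * 1000, num)                       # billions expressed in millions
}

result <- convert_mio_bio(df_example)
print(result)
```

ifelse() is appropriate here because the test has the same length as the data, so each element gets its own branch.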
I am attempting to automate many of the tasks done when creating new shiny apps, by writing the needed code to files based on any given dataset. While creating code to be used as a starting point for factor levels, I have become stuck.
The idea is to gather all the unique values appearing in certain columns and create character vectors from them that can then be altered as needed. The issue is that some of the desired levels span multiple columns, as more than one can be selected. I have managed to write almost-working code, but it fails to behave as I expected at one point: all but the first string are dropped when mapping over a list of vectors. Sorry, it is hard to explain. Hopefully you can see what I am doing below, and please ask if anything is still not clear.
### Starting point
data <- tibble(
a = rep(c("foo", "bar"), 3),
b = rep(c("baz", "zap"), 3),
c = rep(c("yes", "no"), 3),
c_opt_one = rep(c("c_one", ""), 3),
c_opt_two = rep(c("c_two", ""), 3)
)
levels_meta <- tibble(
column = c("a", "b", "c", "c_opt", "c_opt"),
blah = rep(c("blah"), 5) # multiple other columns, not needed here
)
### Desired output, with problem noted
#>levels
# a_responses <- c(
# "foo" = "foo", <- only first entry kept
# "bar" = "bar" <- missing
# )
#
# b_responses <- c(
# "baz" = "baz", <- only first entry kept
# "zap" = "zap" <- missing
# )
#
# c_responses <- c(
# "yes" = "yes", <- only first entry kept
# "no" = "no" <- missing
# )
#
# c_opt_responses <- c(
# "c_opt_one" = "c_one", <- all kept as desired, but only because these
# "c_opt_two" = "c_two" <- come from single element vectors before combined
# )
### Processing code
level_names <- levels_meta %>%
select(column) %>%
group_by(column) %>%
add_count()
multi_col_level_names <- level_names %>%
filter(n > 1) %>%
pull(column) %>%
unique()
single_col_level_names <- setdiff(level_names$column, multi_col_level_names)
levels <- lapply(data, unique) %>%
lapply(setdiff, "")
levels <- map(levels, ~ paste0(" \"", .x, "\"", " = \"", .x, "\""))
# Problem occurs here - only first entry is kept.
# I did try replacing the FALSE arg with levels[[.x]], but same result.
levels <- imap(levels, ~ ifelse(length(.x) == 1, str_replace(.x, "\\w+", .y), .x))
# Rest of code does work, including in case anyone could suggest a more efficient way
multi_col_levels <- map(
multi_col_level_names,
function(prefix) levels %>%
keep(startsWith(names(.), prefix)) %>%
set_names(str_replace(names(.), names(.), prefix))
) %>% squash()
multi_col_levels <- map(
set_names(multi_col_level_names),
~ unlist(multi_col_levels[names(multi_col_levels) == .], use.names = FALSE)
)
levels <- c(levels[single_col_level_names], multi_col_levels)
levels <- map(levels, ~ paste0(.x, collapse = ",\n"))
levels <- imap(levels, ~ paste0(.y, "_responses <- c(\n", .x, "\n)"))
paste_lvls <- function(out, input) paste(out, input, sep = "\n\n")
levels <- levels %>% reduce(paste_lvls)
My suggestion is to keep it simpler than your imap/ifelse solution. The problem is fairly small, so a simple for loop can solve it with less hassle and more clarity (given that the rest of the code does what you want):
for (eachlevel in names(levels)) {
if(length(levels[[eachlevel]]) == 1) {
levels[[eachlevel]] <- str_replace(levels[[eachlevel]], "\\w+", eachlevel)
}
}
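The reason the imap()/ifelse() version keeps only the first entry is that ifelse() returns a result shaped like its test. The test length(.x) == 1 always has length 1, so only the first element of .x comes back. A plain if/else returns the chosen object whole. A minimal base R sketch of the difference, where x and lv are made-up stand-ins for the question's data and Map() is used in place of imap():

```r
# Hypothetical stand-in for one element of `levels` with two entries
x <- c("\"foo\" = \"foo\"", "\"bar\" = \"bar\"")

# ifelse() returns something shaped like its test; the test has length 1,
# so only the first element of `x` is returned
truncated <- ifelse(length(x) == 1, "replaced", x)

# if/else returns the selected object in full
kept <- if (length(x) == 1) "replaced" else x

# The same idea applied to a named list
lv <- list(a = x, c_opt_one = " \"c_one\" = \"c_one\"")
fixed <- Map(function(v, nm) if (length(v) == 1) sub("\\w+", nm, v) else v,
             lv, names(lv))
print(fixed)
```

With purrr, the equivalent change would be to swap ifelse(...) for if (...) ... else ... inside the imap() lambda.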
I am not sure if the approach below is what you are after:
library(tidyverse)
levels_meta$column %>%
unique %>%
set_names(., paste0(., "_response")) %>%
map(. ,
~ {
dat <- select(data, starts_with(.x) & ends_with(.x))
if(length(dat) == 0) {
dat <- select(data, starts_with(.x))
}
if (length(dat) == 1) {
set_names(unique(dat[[.x]]))
} else if (length(dat) > 1) {
map(dat, ~ unique(.x[which(.x != "")]))
} else {
NULL
}
}
)
#> $a_response
#> foo bar
#> "foo" "bar"
#>
#> $b_response
#> baz zap
#> "baz" "zap"
#>
#> $c_response
#> yes no
#> "yes" "no"
#>
#> $c_opt_response
#> $c_opt_response$c_opt_one
#> [1] "c_one"
#>
#> $c_opt_response$c_opt_two
#> [1] "c_two"
Created on 2022-06-02 by the reprex package (v2.0.1)
I have a vector of strings where people were asked to guess someone's age. It includes statements like "50-60", "ca. 50" or ">50". I want to use regular expressions to match these cases and get the real numerical values. "50-60" should produce 55 (the mean of both values), and the other two examples should produce 50.
For each variant, I wanted to have a case in a switch like below, but it doesn't seem to work. Is it even possible to use a regex in a switch?
switch (string,
str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+") = {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
},
str_detect(string, "((ca)\.?)|>|~[:blank:]*[:digit:]+") = {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
},
str_detect(string, "[:digit:]+[:punct:]") = {
value <- str_sub(string, 1, 2) %>% as.numeric()
},
print(string, " could not be matched")
)
The expressions themselves are working as intended (as far as I tested), so I guess I can't use them in the switch like this. However I couldn't find a solution anywhere.
Edit: Added what the expected output for the examples is
We can do this with tidyverse methods
Convert the string to a tibble/data.frame
Remove the characters not needed with str_remove_all
Then, separate the column into two by specifying the sep
Get the rowMeans
library(dplyr)
library(tidyr)
library(stringr)
tibble(mystring) %>%
mutate(mystring = str_remove_all(mystring, "[A-Za-z.><]+")) %>%
separate(mystring, into = c('col1', 'col2'), sep="[- ]+",
convert = TRUE) %>%
transmute(out = rowMeans(., na.rm = TRUE))
-output
# A tibble: 3 x 1
out
<dbl>
1 55
2 50
3 50
data
mystring <- c("50-60", "ca. 50", ">50")
You can use a nested if/else approach -
library(stringr)
string <- "50-60"
if(str_detect(string, "[:digit:]+[:blank:]*(-|_)[:blank:]*[:digit:]+")) {
first <- str_sub(string, 1, 2) %>% as.numeric()
second <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value <- mean(c(first, second))
value
} else if(str_detect(string, "((ca)\\.?)|>|~[:blank:]*[:digit:]+")) {
value <- str_sub(string, str_length(string)-1, str_length(string)) %>% as.numeric()
value
} else if(str_detect(string, "[:digit:]+[:punct:]")) {
value <- str_sub(string, 1, 2) %>% as.numeric()
value
} else NA
#[1] 55
For string <- "ca. 50" it returns 50.
mystring <- c("50-60", "ca. 50", ">50")
library(stringr)
lapply(str_extract_all(mystring, "[0-9]+"),
function(x) if (length(x) == 1) as.numeric(x[1]) else mean(as.numeric(x)))
[[1]]
[1] 55
[[2]]
[1] 50
[[3]]
[1] 50
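On the original question of whether switch can take a regex: switch() picks a branch by comparing a single value against the branch names, so it cannot evaluate a predicate per branch. One workaround is to classify the string with a regex first and then switch on the resulting label. A minimal base R sketch, where the function name guess_age and the labels "range", "single" and "unknown" are made up for illustration:

```r
# Hypothetical helper: classify with regex first, then dispatch with switch()
guess_age <- function(string) {
  # Pull out every run of digits as numbers
  nums <- as.numeric(regmatches(string, gregexpr("[0-9]+", string))[[1]])
  kind <- if (grepl("^[[:space:]]*[0-9]+[[:space:]]*[-_][[:space:]]*[0-9]+[[:space:]]*$", string)) {
    "range"
  } else if (length(nums) == 1) {
    "single"
  } else {
    "unknown"
  }
  switch(kind,
         range  = mean(nums),  # "50-60" -> 55
         single = nums,        # "ca. 50", ">50" -> 50
         NA_real_)             # fallback when nothing matched
}

result <- vapply(c("50-60", "ca. 50", ">50"), guess_age, numeric(1), USE.NAMES = FALSE)
print(result)
```

Extracting the digits with a regex also avoids the fixed two-character str_sub() positions, which assume every age has exactly two digits.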
I have a dataframe that contains one column separated by ; like this
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is to remove everything except the codes that start with AB:
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate(), but I don't know how to continue. Any suggestions?
If your data frame is called df and your column is called V1, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
I thought of using stringr and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
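Note that str_extract_all() returns a list column, which is why the output prints with commas. If a single semicolon-separated string per row is wanted, the matches can be pasted back together. A minimal base R sketch of that step, assuming the codes are in a plain character vector (called codes here for illustration); the lookbehind restricts matches to codes that begin right after a semicolon or at the start of the string:

```r
codes <- c("AB00001;09843;AB00002;GD00001",
           "AB84375;34",
           "AB84375;AB84375",
           "74859375;AB001;4455;FG3455")

# Find every code starting with "AB" at the start of the string or right after ";"
matches <- regmatches(codes, gregexpr("(?<![^;])AB[^;]*", codes, perl = TRUE))

# Collapse each row's matches back into one semicolon-separated string
result <- vapply(matches, paste, character(1), collapse = ";")
print(result)
```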
1) Base R Assuming DF shown reproducibly in the Note at the end we prefix each line with a semicolon and then use the gsub with the pattern shown and finally remove the semicolon we added. No packages are used.
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
2) dplyr/tidyr This one is longer than the others in this answer, but it is straightforward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Note
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)
I'm attempting to write a function to count the number of consecutive instances of a pattern. As an example, I'd like the string
string<-"A>A>A>B>C>C>C>A>A"
to be transformed into
"3 A > 1 B > 3 C > 2 A"
I've got a function that counts the instances of each string, see below. But it doesn't achieve the ordering effect that I'd like. Any ideas or pointers?
Thanks,
R
Existing function:
fnc_gen_PathName <- function(string) {
p <- strsplit(as.character(string), ";")
p1 <- lapply(p, table)
p2 <- lapply(p1, function(x) {
sapply(1:length(x), function(i) {
if(x[i] == 25){
paste0(x[i], "+ ", names(x)[i])
} else{
paste0(x[i], "x ", names(x)[i])
}
})
})
p3 <- lapply(p2, function(x) paste(x, collapse = "; "))
p3 <- do.call(rbind, p3)
return(p3)
}
As commented by @MrFlick, you could try the following using rle and strsplit:
with(rle(strsplit(string, ">")[[1]]), paste(lengths, values, collapse = " > "))
## [1] "3 A > 1 B > 3 C > 2 A"
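To see what rle is doing, and to handle several strings at once, the one-liner can be wrapped in a small function. This is only a sketch: the name count_runs and the second input string are made up for illustration.

```r
# Hypothetical wrapper: run-length encode each ">"-separated path
count_runs <- function(strings, sep = ">") {
  vapply(strsplit(strings, sep, fixed = TRUE), function(parts) {
    runs <- rle(parts)  # runs$values: the symbols, runs$lengths: consecutive counts
    paste(runs$lengths, runs$values, collapse = " > ")
  }, character(1))
}

result <- count_runs(c("A>A>A>B>C>C>C>A>A", "B>B>A"))
print(result)
```

Unlike table(), rle() only merges adjacent repeats, which is what preserves the order and keeps the two separate runs of A apart.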
Here are two dplyr solutions: one regular and one with rle. Advantages are: can input multiple strings as a vector, builds a tidy intermediate dataset before (ugh) renesting.
library(dplyr)
library(tidyr)
library(stringi)
strings = "A>A>A>B>C>C>C>A>A"
tibble(string = strings) %>%
mutate(string_split =
string %>%
stri_split_fixed(">")) %>%
unnest(string_split) %>%
mutate(ID =
string_split %>%
lag %>%
`!=`(string_split) %>%
plyr::mapvalues(NA, TRUE) %>%
cumsum) %>%
count(string, ID, string_split) %>%
group_by(string) %>%
summarize(new_string =
paste(n,
string_split,
collapse = " > ") )
tibble(string = strings) %>%
group_by(string) %>%
do(.$string %>%
first %>%
stri_split_fixed(">") %>%
first %>%
rle %>%
unclass %>%
as.data.frame) %>%
summarize(new_string =
paste(lengths, values, collapse = " > "))