R: Replace last 3 zeroes with K in a column

I am trying to replace the last three zeros of each value in a data frame column with K,
e.g.:
data <- data.frame(abc = c(1000, 100000, 450000))
If abc is 1000, the result should be 1K; if abc is 100000, the result should be 100K.
gsub with a plain pattern replaces the first three zeros it finds rather than the last three.
I tried this:
lapply(data$abc, gsub, pattern = "000", replacement = "K", fixed = TRUE)
Also, how can I make it work on an interval like:
data <- data.frame(abc = c("150000-250000", "100000-150000", "250000K+"))

One option is to use integer division (%/%) by 1000 and paste "K" onto the result:
library(dplyr)
library(stringr)
data %>%
mutate(abc = str_c(abc %/% 1000, "K"))
Or use sub to match the three zeros at the end ($) of the string and replace them with "K". Setting scipen first keeps 100000 from being converted to "1e+05":
options(scipen = 999)
sub("0{3}$", "K", data$abc)
#[1] "1K" "100K" "450K"
If the string holds an interval instead, change the pattern to match three zeros either at the end ($) or just before a -, and replace them with "K":
gsub("0{3}(?=-|$)", "K", "150000-250000", perl = TRUE)
#[1] "150K-250K"
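The lookahead pattern above can also be applied to a whole column. The following is only a sketch: the interval labels are made up for illustration, and the open-ended bucket written as "250000+" (and the extra + in the lookahead) is an assumption, not something taken from the question's data.

```r
# Hypothetical interval labels; the open-ended "250000+" bucket is an assumption
data <- data.frame(abc = c("150000-250000", "100000-150000", "250000+"),
                   stringsAsFactors = FALSE)

# Replace "000" only when it is followed by "-", "+", or the end of the string
data$abc <- gsub("0{3}(?=[-+]|$)", "K", data$abc, perl = TRUE)
print(data$abc)
```

Because gsub is vectorized over its input, no lapply is needed, and the column stays a plain character vector.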

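Here is a slight modification of your code. format turns off scientific notation, gsub is already vectorized so no lapply or sapply is needed, and 000$ matches only the zeros at the end. Note that format pads every value to a common width, so shorter results keep leading spaces; wrap the result in trimws() to drop them.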
data <- data.frame(abc = c(1000, 100000, 450000))
data$abc <- format(data$abc, scientific = FALSE)
gsub(pattern = "000$", replacement = "K", data$abc)
# [1] "  1K" "100K" "450K"

Related

Move substring to end of string in R

I have a set of df with a large number of columns. The column names follow a pattern like so:
my.df <- data.frame(sentiment_brand1_1 = c(1,0,0,1), sentiment_brand1_2 = c(0,1,1,0),
sentiment_brand2_1 = c(1,1,1,1),
sentiment_brand2_2 = c(0,0,0,0),
brand1_rating_1 = c(1,2,3,4),
brand2_rating_1 = c(4,3,2,1))
I'd like to programmatically rename the columns, moving the substrings "brand1" and "brand2" from the middle of the column name to the end, e.g.:
desired_colnames <- c("sentiment_1_brand1",
"sentiment_2_brand1",
"sentiment_1_brand2",
"sentiment_2_brand2",
"rating_1_brand1",
"rating_1_brand2")
Capture the substrings as groups and rearrange them in the replacement:
sub("(.*)_(brand1)(.*)", "\\1\\3_\\2", v1)
-output
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
data
v1 <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
## Input:
Test <- c("variable_brand1_1", "_brand1_stuff_1", "_brand1thing")
library("stringr")
paste(str_remove(Test, "_brand1"), "_brand1", sep = "")
## OutPut:
[1] "variable_1_brand1" "_stuff_1_brand1" "thing_brand1"
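Both answers hard-code "brand1" and assume a leading underscore. A possible generalization to the column names from the question, where the brand token can also sit at the very start, is sketched below. The token pattern brand[0-9]+ and the helper variable names are assumptions made for illustration.

```r
# Column names from the question
nms <- c("sentiment_brand1_1", "sentiment_brand1_2",
         "sentiment_brand2_1", "sentiment_brand2_2",
         "brand1_rating_1", "brand2_rating_1")

# Pull out the brand token (assumed to look like "brand" followed by digits)
brand <- regmatches(nms, regexpr("brand[0-9]+", nms))

# Drop the token together with the underscore before it (if any),
# then strip a leading underscore left behind when the token came first
rest <- sub("(^|_)brand[0-9]+", "", nms)
rest <- sub("^_", "", rest)

new_names <- paste(rest, brand, sep = "_")
print(new_names)
```

The result could then be assigned with names(my.df) <- new_names. This relies on every name containing exactly one brand token; names without one would need to be handled separately.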

R: change round brace interval number by subtracting 1 from the value

I have data like this:
abc <- data.frame( a = c("[100-150)", "[150, 200)"))
I want to alter it to make it like this:
abc <- data.frame(a = c("100-149", "150-199"))
I know how to replace the brackets:
abc$a <- lapply(abc$a, gsub, pattern = "[", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = "]", replacement = "", fixed = TRUE)
abc$a <- lapply(abc$a, gsub, pattern = ")", replacement = "", fixed = TRUE)
The problem is subtracting 1 from the number at the end of each interval.
Is there a way to do this?
Please note this is just an example, in reality my data has a column like this which is about 2000 rows.
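An option with gsubfn: match the number (\\d+) that follows the - or , separator, convert it to numeric, subtract 1, and paste it back after a -. The brackets are left in place, so they still have to be removed afterwards.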
library(gsubfn)
gsubfn("[-,] ?(\\d+)", ~ paste0("-", as.numeric(x) - 1) , as.character(abc$a))
#[1] "[100-149)" "[150-199)"
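For comparison, here is a base R sketch that produces the bracket-free form asked for in the question in one pass. It assumes every entry contains exactly two numbers (lower bound first, upper bound second); the variable names are illustrative.

```r
a <- c("[100-150)", "[150, 200)")

# Extract all runs of digits from each string: one character vector per entry
parts <- regmatches(a, gregexpr("[0-9]+", a))

# Keep the lower bound, subtract 1 from the upper bound, and join with "-"
out <- vapply(parts,
              function(p) paste(p[1], as.numeric(p[2]) - 1, sep = "-"),
              character(1))
print(out)
```

Since only the digits are extracted, the brackets and the differing separators ("-" versus ", ") drop out without any extra gsub calls.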

Converting character to number of months

I am working on a dataset with a column, account_age, in which the age is stored as a character string in the format "1YRS 5MON".
How can I convert it to a number of months?
We can match 'YRS' and 'MON' with gsubfn, replace them with arithmetic operators, and evaluate the resulting expression:
library(gsubfn)
unname(sapply(gsubfn("[A-Z]+", list(YRS = "*12 +", MON = "*1"),
df1$col1), function(x) eval(parse(text = x))))
#[1] 17
Another option is to extract the digits and take the sum of products:
library(tidyverse)
map_dbl(str_extract_all(df1$col1, "\\d+"), ~ as.numeric(.x) %*% c(12, 1))
#[1] 17
Or we can remove the letters, read the result with read.table, and get the sum of products:
as.matrix(read.table(text = gsub("[A-Z]+", "", df1$col1),
header = FALSE) )%*% c(12, 1)
data
df1 <- data.frame(col1 = "1YRS 5MON", stringsAsFactors = FALSE)
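The sum-of-products idea can also be written in base R without eval or extra packages. This sketch uses made-up sample values and assumes every entry has both a YRS and a MON part, in that order.

```r
# Hypothetical sample values in the "<n>YRS <m>MON" format
age <- c("1YRS 5MON", "0YRS 11MON", "10YRS 0MON")

# Extract the two numbers from each string
nums <- regmatches(age, gregexpr("[0-9]+", age))

# months = years * 12 + months * 1
months <- vapply(nums,
                 function(p) sum(as.numeric(p) * c(12, 1)),
                 numeric(1))
print(months)
```

Entries missing one of the two parts would be recycled against c(12, 1) incorrectly, so they should be filtered or checked first.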

String ID format checking and padding in R

Some codes are formatted as numbers separated by dashes (e.g., Social Security Numbers are typically formatted "ddd-dd-dddd", where d stands for any digit; call this 3-2-4 format, after the number of digits in each "chunk").
I need to read product codes that come in 5-4, 4-4 or 5-3 format, and then:
(a) validate that they conform to one of these formats, and (b) pad with zeros, so that the output is in 5-4 format.
Here is code that does that. Is there a nicer way? How can it be vectorized?
library(stringr)
as_product_code <- function(x) {
# Clean Product Codes
# Input: 5-4, 5-3, or 4-4 product code.
# Output: 5-4 product code.
chunks <- unlist(strsplit(x, split = "-", fixed = TRUE))
# Accept only two chunks whose lengths are 5-3, 5-4, or 4-4
if (length(chunks) == 2 && (identical(nchar(chunks), c(5L, 3L)) ||
identical(nchar(chunks), c(5L, 4L)) ||
identical(nchar(chunks), c(4L, 4L)))) {
output_code<- paste(str_pad(chunks[1], pad = "0", width = 5),
str_pad(chunks[2], pad = "0", width = 4),
sep = "-")
return(output_code)
} else {
warning("Unexpected format. Doing nothing.")
return(x)
}
}
You can use regular expressions and the stringr package. This returns NA for any entry that does not follow the specified pattern.
For regular expressions, have a look at the cheat sheet.
\\d stands for any digit (0-9), and the number in braces { } gives the number of repetitions (either {min,max} or {exact}). ^ anchors the match at the beginning of the string and $ at the end, so a string with ab at the end does not match.
test <- c("1234-1234", "12345-123", "12345-1234ab", "12345-1234", "1234-123")
ifelse(str_detect(test, "^(\\d{4,5})-(\\d{4})$|^(\\d{5})-(\\d{3})$"),
str_replace_all(test, c("^(\\d{4})-" = "0\\1-", "-(\\d{3})$" = "-0\\1")),
NA)
[1] "01234-1234" "12345-0123" NA "12345-1234" NA
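The same validate-then-pad idea can be sketched in base R as a function that is vectorized from the start. The function name pad_product_code is hypothetical, and returning NA for invalid codes follows the stringr answer rather than the original function, which returned the input with a warning.

```r
# Hypothetical helper: validate 5-4, 4-4, or 5-3 codes and left-pad to 5-4
pad_product_code <- function(x) {
  # TRUE where the code is in one of the three accepted formats
  ok <- grepl("^([0-9]{4,5}-[0-9]{4}|[0-9]{5}-[0-9]{3})$", x)

  # Pad a 4-digit first chunk and a 3-digit second chunk with a leading zero
  out <- sub("^([0-9]{4})-", "0\\1-", x)
  out <- sub("-([0-9]{3})$", "-0\\1", out)

  # Invalid codes become NA instead of being silently padded
  out[!ok] <- NA_character_
  out
}

test <- c("1234-1234", "12345-123", "12345-1234ab", "12345-1234", "1234-123")
result <- pad_product_code(test)
print(result)
```

Because grepl and sub operate on whole vectors, the function can be used directly on a data frame column without Vectorize or sapply.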
We can actually take advantage of the dataframe structure here to get some vectorization help.
# Create reproducible example
set.seed(9025)
d1 = sample(1:5, 1e5, replace=TRUE)
d2 = sample(1:5, 1e5, replace=TRUE)
codes = sapply(1:1e5, function(i) {
c1 = paste0(sample(1:9, d1[i]), collapse='')
c2 = paste0(sample(1:9, d2[i]), collapse='')
paste(c1, c2, sep='-')
})
library(stringr)
library(tidyverse)
# Create our dataframe, separate the product code, pad the values,
# and use vectorized ifelse to "remove" bad product codes.
output = codes %>%
tbl_df() %>%
separate(value, into=c('c1', 'c2'), sep='-', remove=TRUE) %>%
mutate(include = ifelse(nchar(c1) %in% 4:5 &
nchar(c2) %in% 3:4 &
(nchar(c1) + nchar(c2) > 7),
1, 0),
c1 = str_pad(c1, width=5, side='left', pad=0),
c2 = str_pad(c2, width=4, side='right', pad=0),
code = paste(c1, c2, sep='-')) %>%
mutate(code = ifelse(include == 1, code, '')) %>%
pull(code)
head(codes)
[1] "62971-2" "5-51864" "32419-328" "931-8"
[5] "18324-248" "8-628"
head(output)
[1] "" "" "32419-3280"
[4] "" "18324-2480" ""
You can use the base R Vectorize function:
as_product_code <- function(x) {
#your function
}
x <- c('1234-1234','1234-1234')
as_product_code_vec <- Vectorize(as_product_code, 'x', USE.NAMES = FALSE)
as_product_code_vec(x)

R How to perform math operation on Regular Expression matches

I'm working on a data frame that has non-detects (with different decimal separators), missing values, and measured values.
I want to replace each non-detect with half of the value after the less-than sign (<1 becomes 1/2 = 0.5).
The imported data frame looks like this:
df = data.frame(value=c("NA", "1.2", "<1.0", "<6,6"))
1) I convert the factor columns to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
2) I replace every "," with ".":
pattern = ","
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement=".", x, perl = TRUE)}))
3) I can find all non-detects and replace each with the value after the less-than sign:
pattern = "(^<)(\\d+)"
grep(pattern, df, value = TRUE)
df <- data.frame(lapply(df, function(x) {gsub(pattern=pattern, replacement="\\2", x, perl = TRUE)}))
I can't find how to perform a math operation on the matched text in the replacement, something like:
replacement = as.character((as.numeric("\\2"))/2)
You can use the following code in Step 2:
df$value = gsub(",", ".", df$value, fixed = TRUE)
It will replace literal commas with literal dots in the value column.
Then, you may use the gsubfn package to match and manipulate substrings matched with regex:
> library(gsubfn)
> df$value = gsubfn("^<(\\d*\\.?\\d+)", ~ as.numeric(x)/2, df$value)
> df
  value
1    NA
2   1.2
3   0.5
4   3.3
Here, ^<(\\d*\\.?\\d+) matches < at the start of the string, and \\d*\\.?\\d+ matches any float or integer value and captures it into Group 1. The callback function then receives that captured value and divides it by 2.
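If adding a package is not an option, the same result can be reached in base R by flagging the non-detects first and doing the arithmetic on a numeric vector. This is a sketch under the assumption that non-detects always start with "<"; unlike the gsubfn version it yields a numeric column rather than a character one.

```r
value <- c("NA", "1.2", "<1.0", "<6,6")

# Normalize decimal commas to dots
value <- gsub(",", ".", value, fixed = TRUE)

# Remember which entries are non-detects, then strip the "<" and convert
censored <- startsWith(value, "<")
num <- suppressWarnings(as.numeric(sub("^<", "", value)))

# Halve only the non-detects; the string "NA" becomes a real NA
num[censored] <- num[censored] / 2
print(num)
```

Keeping the censored flag around can be useful later, since it records which values were substituted rather than measured.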
