Replace specific value in column in R - r

I have a variable in a data frame whose values have either 5 or 6 characters. The 5-character values are all digits, e.g. 27701; the 6-character values all have a 'C' before the digits, e.g. C22701.
How can I replace the 'C' characters with, for example, 999?
I have tried:
replace(data$varname,'C',999)
Any ideas folks?

data$varname <- as.numeric(gsub('C', '999', data$varname)) should do the trick, I think, assuming you want a numeric vector in the end. If you want a character vector, you can leave as.numeric off.
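
A minimal sketch of that approach, using a made-up vector x in place of data$varname. Anchoring the pattern with ^ is an addition here (it is not in the answer above); it limits the replacement to a leading C, in case a C could ever appear elsewhere in a value.

```r
# Hypothetical stand-in for data$varname
x <- c("27701", "C22701")

# Replace only a leading "C" with "999"; values without a "C" are untouched
x_chr <- sub("^C", "999", x)

# Convert to numeric if a numeric vector is wanted in the end
x_num <- as.numeric(x_chr)

print(x_chr)
print(x_num)
```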

You can use substring to drop the first character and paste0 to prepend 999. Note that this replaces the first character whatever it is, not only a 'C'.
> x <- c("C000", "P1745")
> paste0("999", substring(x,2))
# [1] "999000" "9991745"

Related

Flag when character appears more than once in a string

I have seen something similar answered for Python but not for R. Say I have the sample data below, and I want to create the "want" column, which flags when the character "|" appears more than once in the string in the "var1" column. How would I do this in R? I know I can use grepl to flag whenever "|" appears, but this would also capture when it only appears once.
Sample data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
want<-c(0,0,1,1)
have<-as.data.frame(cbind(var1,want))
var1 want
BLUE|RED 0
RED|BLUE 0
WHITE|BLACK|ORANGE 1
BLACK|WHITE|ORANGE 1
str_count can be used: count the number of | characters (| is a regex metacharacter, so either escape it with \\ or wrap it in fixed()), create a logical vector with > 1, and convert the logical to binary (with as.integer or unary +).
library(stringr)
have$want <- +(str_count(have$var1, fixed("|") ) > 1)
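
If stringr is not available, the same count can be sketched in base R. This is an alternative to the answer above, not part of it; n_pipes is just an illustrative name.

```r
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")

# gregexpr finds every literal "|" (fixed = TRUE avoids regex escaping);
# regmatches extracts the matches and lengths counts them per string
n_pipes <- lengths(regmatches(var1, gregexpr("|", var1, fixed = TRUE)))

# Flag strings in which "|" appears more than once
want <- as.integer(n_pipes > 1)

print(n_pipes)
print(want)
```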

number some patterns in the string using R

I have a string that contains some patterns, like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, there is a repeating pattern: a [`nonnumber] followed by several [`number.num~] entries.
I want to count how many [`number.num~] entries there are between one [`nonnumber] and the next.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
but with this code the [`\D] parts of successive matches overlap, so it can't count how many times the pattern occurs.
If you know a way to do this, please leave a reply.
Using strsplit: split the string at the backticks, coerce the pieces to numeric (the non-numeric labels become NA), and take the difference between the positions of successive NAs, minus 1; that is the number of numeric values following each label. Note that we need to drop the first (empty) element after strsplit and append an NA to the numeric vector so that the last group is also closed off. setNames then names the resulting vector with the non-numeric elements (not very good names, but they show what's going on).
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
s[is.na(s.num)])
# d#k n$l !lk vnabgjd pogk( r!#^##niw
# 9 9 9 9 9 9
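
To see the intermediate steps, here is the same idea on a much shorter, made-up string (two labels, followed by two and three numbers respectively). The variable names are illustrative only.

```r
# Hypothetical short input with the same layout as my_string
short_string <- "`ab#`0.5`0.6`c$d`1.0`2.0`3.0`"

# Split at backticks; the first piece is empty because the string starts with one
s <- strsplit(short_string, "`", fixed = TRUE)[[1]][-1]

# Labels cannot be coerced to numeric and become NA
s_num <- suppressWarnings(as.numeric(s))

# Positions of the labels, plus one sentinel NA appended at the end
na_pos <- which(is.na(c(s_num, NA)))

# Gap between successive NA positions, minus 1, is the count of numbers
counts <- setNames(diff(na_pos) - 1, s[is.na(s_num)])

print(na_pos)
print(counts)
```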

Insert delimiter between characters

I have a data frame with a character column in which I want to insert a delimiter after every 2 characters. The strings vary in length. This is what it looks like:
id character
1 aaabdg
2 bjdbjhdj
3 bjbkjekkechj
4 jkfb
the output data frame I want is as below
id character
1 aa_ab_dg
2 bj_db_jh_dj
3 bj_bk_je_kk_ec_hj
4 jk_fb
I have been trying to create a regex to use in the code below but have not had any luck yet.
cat(paste0('[a-z]{2}', paste(str1, collapse="", ""), '[a-z]{2}'))
and
gsub("([a-z])", "\\,", str1)
Any help/suggestions would be much appreciated
Here is one option using gsub:
gsub("(..)(?!$)", "\\1_", "bjbkjekkechj", perl=TRUE)
[1] "bj_bk_je_kk_ec_hj"
This approach matches and captures each successive pair of characters, provided that at least one more character follows the pair. Each match is replaced with the two captured characters followed by an underscore. The negative lookahead (?!$) ensures that no underscore is added after the final pair; a single leftover character at the end of an odd-length string is never matched, so nothing is appended after it either.
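
Since gsub is vectorised, the same call can be applied to the whole column. A sketch, assuming the data frame is called df (the question does not name it) and the column holds character data rather than factors:

```r
# Data from the question; the name df is an assumption
df <- data.frame(id = 1:4,
                 character = c("aaabdg", "bjdbjhdj", "bjbkjekkechj", "jkfb"),
                 stringsAsFactors = FALSE)

# Insert "_" after every pair of characters except a pair at the very end
df$character <- gsub("(..)(?!$)", "\\1_", df$character, perl = TRUE)

# An odd-length string keeps its single trailing character after the last "_"
odd <- gsub("(..)(?!$)", "\\1_", "abcde", perl = TRUE)

print(df)
print(odd)
```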

Split a column with varying delimiters into 2, along with unfortunate fraction structure

I am fairly new to R and programming in general. I was given a data set to work with that unfortunately was structured fairly rough.
It is in the form of
W-X/Y"-Z
The first number being inches, however for values <1 inch it is simply
X/Y"-Z
I need a way to:
a) split Z off (the number after the last "-" delimiter),
as well as
b) convert the W-X/Y" or X/Y" value to its decimal equivalent.
So 1-1/2" to just 1.5
So split the original column into 2 columns, one with the Z value, and one with the decimal inches value. As shown below
input length bin
3-1/2"-14 3.5 14
3/4"-20 .75 20
We can split the 'input' column at the last - or at the " to get a list output. Loop over the list (with lapply), remove the blank elements (x[nzchar(x)]), replace the - with +, use eval(parse(text = ...)) to evaluate the resulting expression and get the numeric length, concatenate it with the second value, rbind the list elements, and assign (<-) the output to create two new columns.
df1[c("length", "bin")] <- do.call(rbind, lapply(strsplit(df1$input,
'-(?=[^-]+$)|"', perl=TRUE), function(x) {
x1 <- x[nzchar(x)]
c(eval(parse(text=sub("-", "+", x1[1]))), as.numeric(x1[2]))}))
df1
# input length bin
#1 3-1/2"-14 3.50 14
#2 3/4"-20 0.75 20
NOTE: If the "input" column is of class factor, convert it to character before passing it to strsplit, i.e. strsplit(as.character(df1$input), ...
data
df1 <- data.frame(input=c('3-1/2"-14', '3/4"-20'), stringsAsFactors=FALSE)
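
If evaluating text with eval(parse()) feels risky on untrusted input, the arithmetic can be done by hand instead. The following is only a sketch of that alternative: frac_to_decimal is a hypothetical helper, not part of the answer above, and it assumes every length contains a fraction X/Y, optionally preceded by a whole number W and a "-".

```r
# Hypothetical helper: turn "3-1/2" or "3/4" into decimal inches
frac_to_decimal <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  vapply(parts, function(p) {
    # The last piece is always the fraction X/Y (assumption stated above)
    frac <- as.numeric(strsplit(p[length(p)], "/", fixed = TRUE)[[1]])
    # A leading whole number is present only when there are two pieces
    whole <- if (length(p) == 2) as.numeric(p[1]) else 0
    whole + frac[1] / frac[2]
  }, numeric(1))
}

input <- c('3-1/2"-14', '3/4"-20')

# Z is everything after the last "-"
bin <- as.numeric(sub(".*-", "", input))

# Strip the trailing '"-Z' to leave W-X/Y or X/Y, then convert
len <- frac_to_decimal(sub('"-[^-]*$', "", input))

print(data.frame(input = input, length = len, bin = bin))
```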

R: Count number of rows in data frame, with matching character in specified position of string

I have a data frame with a column with characters:
strings
1 a;b;c;d
2 g;h;i;j
3 k;m
4 o
I would like to count the strings (rows) that have one of a specified set of characters at a specified position within the string.
Eg.
Get count of number of strings with 3rd character as one of the
characters in this set: {a,b,m}.
The output should be 2 in this case, since only the 1st and 3rd row
have any characters in {a,b,m} as their 3rd character within the
string.
I could only use this code to find any strings that contains 'b':
sum(grepl("b",df))
However, this is not good enough for the above task.
Please advise.
You can try grepl:
x = c('a;b;c;d','g;h;i;j','k;m','o')
sum(grepl('^.{2}[abm]', x))
#[1] 2
Try this:
sum(substr(df$strings,3,3) %in% c("a","b","m"))
Alternatively, if you want to treat ; as the delimiter and test the second field rather than the third character, you can do:
sum(sapply(strsplit(df$strings,";"),function(x) x[2] %in% c("a","b","m")))
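
The three approaches can be checked side by side on the sample data. This sketch assumes the column is stored as character (not factor); the variable names are illustrative. The position-based and field-based counts agree here only because every field is a single character.

```r
df <- data.frame(strings = c("a;b;c;d", "g;h;i;j", "k;m", "o"),
                 stringsAsFactors = FALSE)
chars <- c("a", "b", "m")

# Regex: any two characters from the start, then one of a, b, m
by_regex <- sum(grepl("^.{2}[abm]", df$strings))

# Position: extract the 3rd character and test membership
by_substr <- sum(substr(df$strings, 3, 3) %in% chars)

# Field: split on ";" and test the 2nd field (NA for "o", which counts as FALSE)
by_field <- sum(vapply(strsplit(df$strings, ";", fixed = TRUE),
                       function(x) x[2] %in% chars, logical(1)))

print(c(by_regex, by_substr, by_field))
```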
