Count repeated patterns in a string using R

I have a string that contains patterns like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] followed by a repeated [`number.num~].
So I want to count how many [`number.num~] items occur between consecutive [`nonnumber] items.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
But with this code the trailing [`\D] is consumed by each match, so consecutive matches overlap and I can't count how many times the pattern occurs.
If you know a method for this, please leave a reply.

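Using strsplit: split the string at the backticks, coerce the pieces to "numeric", and find which pieces become NA (these are the non-numeric labels). The difference between consecutive NA positions, minus one, is the number of numeric values that follow each label. Note that we need to drop the first element after strsplit (the empty string before the leading backtick) and append an NA to the numeric vector so that the last run is counted too. The result is a vector named with the non-numeric elements via setNames (not very good names actually, but it shows what's going on).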
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
s[is.na(s.num)])
#       d#k       n$l       !lk   vnabgjd     pogk( r!#^##niw
#         9         9         9         9         9         9
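The same counting idea can be wrapped in a small helper. This is only a sketch under some assumptions: count_numeric_runs is a hypothetical name, the string is assumed to start with a backtick followed by a non-numeric label, and anything as.numeric can parse is treated as a number. It groups tokens with cumsum instead of taking position differences, so a label with no numbers after it gets a count of 0.

```r
# Hypothetical helper (not from the original answer): count the numeric
# tokens that follow each non-numeric label in a backtick-delimited string.
count_numeric_runs <- function(x) {
  # Split on literal backticks and drop the empty piece before the first one.
  tokens <- strsplit(x, "`", fixed = TRUE)[[1]][-1]
  # Tokens that cannot be parsed as numbers are treated as labels.
  is_label <- is.na(suppressWarnings(as.numeric(tokens)))
  # Each label opens a new group; the numbers after it share that group.
  group <- cumsum(is_label)
  counts <- tapply(!is_label, group, sum)
  setNames(as.vector(counts), tokens[is_label])
}

# Small made-up input: "ab" is followed by two numbers, "c#d" by one.
demo <- "`ab`0.1`0.2`c#d`0.3`"
result <- count_numeric_runs(demo)
print(result)
```

Calling count_numeric_runs(my_string) on the string from the question should give the same counts as the setNames(diff(...)) version.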

Related

R: pairwise matrix of the number of characters that differ among strings

I have a vector containing a large number of strings that are all of the same length. For example:
vec = c("keep", "teem", "meat", "weep")
I would like to compare every possible pair of strings from within this vector and count the number of characters that differ between them. Using the vector above, "keep" would be compared to every other string in the vector, "teem" would be compared to every other string, and so on.
I'm only interested in counting the number of characters from the same position within each string that are different. So for example "keep" vs. "teem" would have 2 differences, "keep" vs. "meat" 3 differences, etc. I'd like to output the results as a pairwise matrix, where the strings in the vector make up the row names and column names.
I've learned from another post (How can I compare two strings to find the number of characters that match in R, using substitution distance?) that I can pass adist to mapply to calculate the number of differences between two strings:
mapply(adist,string1,string2)
But I'm not sure how to modify this to operate over every possible pairwise combination in my vector, and to place the results in a pairwise matrix. Any ideas for how to do that? Thanks!!
Do you mean using adist like below?
> `dimnames<-`(adist(vec),rep(list(vec),2))
     keep teem meat weep
keep    0    2    3    1
teem    2    0    3    2
meat    3    3    0    3
weep    1    2    3    0
An option with stringdistmatrix
library(stringdist)
out <- as.matrix(stringdistmatrix(vec))
dimnames(out) <- list(vec, vec)
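One caveat worth checking: adist computes a generalized Levenshtein (edit) distance, which allows insertions and deletions, so for equal-length strings it can be smaller than the number of positions that differ. The two agree for the example vector, but not in general. Below is a minimal base R sketch of a strictly position-wise count; hamming_matrix is a hypothetical helper name, and it assumes all strings have the same length.

```r
# Hypothetical helper: pairwise count of positions at which two
# equal-length strings differ (Hamming distance).
hamming_matrix <- function(strings) {
  # Position-wise comparison only makes sense for equal-length strings.
  stopifnot(length(unique(nchar(strings))) == 1)
  chars <- strsplit(strings, "")
  n <- length(strings)
  count_diff <- function(i, j) sum(chars[[i]] != chars[[j]])
  out <- outer(seq_len(n), seq_len(n), Vectorize(count_diff))
  dimnames(out) <- list(strings, strings)
  out
}

vec <- c("keep", "teem", "meat", "weep")
m <- hamming_matrix(vec)
print(m)

# A pair where the two measures disagree: every position differs,
# but one deletion plus one insertion turns "abcd" into "bcda".
shifted <- c("abcd", "bcda")
print(hamming_matrix(shifted))
print(adist(shifted))
```

If the stringdist package is preferred, its documentation lists a "hamming" method that should serve the same purpose.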

Why does my output have different spacing for a vector with different data types in R?

I have recently started learning R and was working on combining vectors. I was following a tutorial, and when I print a vector built with c() from character, complex, and integer values, the elements are separated by different amounts of space.
I have enclosed the snapshot for the same as I might not be able to articulate it properly in words.
As Roland commented, a vector can only contain one data type. Since one of your elements is a character string, all the other elements are coerced to character.
x <- c(123.56, 21, "rajat", 2+4i); print(x)
The uneven spacing, which as far as I understand should not be a problem, arises because the elements of the vector have different numbers of characters.
nchar(x)
[1] 6 2 5 4
Now, if every element has the same number of characters, the spacing is uniform:
x <- c(123.56, 210000, "rajata", 2+442i); print(x)
[1] "123.56" "210000" "rajata" "2+442i"
nchar(x)
[1] 6 6 6 6
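To make the coercion and the padding visible, here is a small sketch. It relies on format(), which pads character elements to a common width; as far as I can tell this is the same kind of padding print applies, but treat that link as an assumption.

```r
x <- c(123.56, 21, "rajat", 2+4i)

# Everything has been coerced to character.
print(class(x))

# The elements have different lengths ...
print(nchar(x))

# ... and format() pads them all to the width of the longest one.
padded <- format(x)
print(nchar(padded))

# Printing without quotes shows the padding more clearly.
print(x, quote = FALSE)
```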

R: Count number of rows in data frame, with matching character in specified position of string

I have a data frame with a column with characters:
  strings
1 a;b;c;d
2 g;h;i;j
3     k;m
4       o
I would like to count the strings (rows) that have one of a specified set of characters at a specified position within the string.
E.g., count the strings whose 3rd character is one of the characters in the set {a,b,m}.
The output should be 2 in this case, since only the 1st and 3rd rows have a character from {a,b,m} as their 3rd character.
I could only come up with this code, which finds any strings that contain 'b':
sum(grepl("b",df))
However, this is not good enough for the above task.
Please advise.
You can try grepl:
x = c('a;b;c;d','g;h;i;j','k;m','o')
sum(grepl('^.{2}[abm]', x))
#[1] 2
Try this:
sum(substr(df$strings,3,3) %in% c("a","b","m"))
Alternatively, if you want to use ; as the delimiter you can do:
sum(sapply(strsplit(df$strings,";"),function(x) x[2] %in% c("a","b","m")))
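For comparison, the three suggestions can be run side by side on the example data. This is a sketch that assumes the column is stored as character (hence stringsAsFactors = FALSE); the variable names are illustrative.

```r
# Recreate the example data, keeping the column as character.
df <- data.frame(strings = c("a;b;c;d", "g;h;i;j", "k;m", "o"),
                 stringsAsFactors = FALSE)
targets <- c("a", "b", "m")

# 1. Regex: any two characters, then one character from the set.
n_regex <- sum(grepl("^.{2}[abm]", df$strings))

# 2. Extract the 3rd character and test membership in the set.
n_substr <- sum(substr(df$strings, 3, 3) %in% targets)

# 3. Split on ";" and test the 2nd field. This equals the 3rd character
#    only when every field is a single character, as in this data.
fields <- strsplit(df$strings, ";", fixed = TRUE)
n_field <- sum(vapply(fields, function(p) p[2] %in% targets, logical(1)))

print(c(regex = n_regex, substr = n_substr, field = n_field))
```

Strings that are too short, such as "o", yield an empty string or NA at the requested position and are simply not counted.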

Using cbind on XTS object changes the dash (-) character in previous column names to a dot (.)

I have some R code that creates an XTS object, and then performs various cbind operations in the lifetime of that object. Some of my columns have names such as "adx-1". That is fine until another cbind() operation is performed. At that point, any "-" characters in the column names are changed to ".". So "adx-1" becomes "adx.1".
To reproduce:
library(xts)
x = xts(order.by=as.Date(c("2014-01-01","2014-01-02")))
x = cbind(x,c(1,2))
x
           ..2
2014-01-01   1
2014-01-02   2
colnames(x) = c("adx-1")
x
           adx-1
2014-01-01     1
2014-01-02     2
x = cbind(x,c(1,2))
x
           adx.1 ..2
2014-01-01     1   1
2014-01-02     2   2
It doesn't just do this with numbers either. It changes "test-text" to "test.text" as well. Multiple dashes are changed too. "test-text-two" is changed to "test.text.two".
Can someone please explain why this happens and, if possible, how to stop it from happening?
I can of course change my naming schemes, but it would be preferred if I didn't have to.
Thanks!
merge.xts converts the column names into syntactic names, which cannot contain -. According to ?Quotes:
Identifiers consist of a sequence of letters, digits, the period
('.') and the underscore. They must not start with a digit nor
underscore, nor with a period followed by a digit.
There is currently no way to alter this behavior.
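The renaming can be reproduced in base R with make.names, the standard function for producing syntactic names. Whether merge.xts calls it internally is an assumption here, but its output matches what the question reports.

```r
# make.names replaces characters that are invalid in syntactic names with ".".
mangled <- make.names(c("adx-1", "test-text", "test-text-two"))
print(mangled)

# data.frame applies the same conversion to column names by default.
df_checked <- data.frame("adx-1" = c(1, 2))
print(names(df_checked))
```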
The reason for the behavior is precisely the one Joshua Ulrich highlighted. It's common across many data types in R: you need "valid" names. Here is a great discussion of this "issue".
For data frames, you can pass the option check.names = FALSE as a workaround, but this is not implemented for xts objects. That said, there are plenty of other workarounds available to you.
For instance, you could simply rename the columns of interest after every cbind. Using your code, simply add:
colnames(x)[1] <- c("adx-1")
to force back your desired column name.
Alternatively, you could consider this gsub solution if you wanted something potentially more systematic.
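One way to make the renaming more systematic is to remember the names before each cbind and put them back afterwards. The sketch below uses a hypothetical helper, restore_colnames. To stay runnable without xts, it simulates the mangling with make.names on a plain matrix; the same call should presumably work on an xts object, since the question already assigns to colnames(x), but that is not tested here.

```r
# Hypothetical helper: overwrite the leading column names with saved originals.
restore_colnames <- function(obj, original) {
  colnames(obj)[seq_along(original)] <- original
  obj
}

# Stand-in for the object after a cbind: one old column, one new column.
m <- matrix(1:4, ncol = 2, dimnames = list(NULL, c("adx-1", "new")))
original <- "adx-1"

# Simulate the name mangling ("adx-1" becomes "adx.1").
colnames(m) <- make.names(colnames(m))

# Put the saved name back.
m <- restore_colnames(m, original)
print(colnames(m))
```

Unlike a blanket replacement of "." with "-", this leaves column names that legitimately contain a dot untouched.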

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
    firstcol secondcol
1          1         3
2          2         4
1.1        1         3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, so the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol], which many find more natural.
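Base R also has a few ways to refer to a column by its bare name, because functions such as subset() and with() evaluate their expression inside the data frame. A short sketch using the data frame from the question (the result variable names are illustrative):

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

# Row filtering: the condition must be evaluated where firstcol is visible.
rows_dollar <- mydf[mydf$firstcol == 1, ]       # explicit data frame name
rows_subset <- subset(mydf, firstcol == 1)      # evaluated inside mydf
rows_with   <- with(mydf, mydf[firstcol == 1, ])

# Column extraction: a quoted name with [ or [[, an unquoted name with $.
col_quoted <- mydf[, "firstcol"]
col_dollar <- mydf$firstcol
col_double <- mydf[["firstcol"]]

print(rows_subset)
```

Note that this filters on the value of firstcol, which is different from mydf[mydf$firstcol, ], where the values are used as row numbers.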
