Write table row names problems in R - r

I'm using write.table() to write a matrix to a text file. My matrix has row and column names, and I noticed that R alters those names.
First, names that start with a digit are written with an X prefix. For example, 1005_at becomes X1005_at.
Second, characters such as - and / are replaced with a dot (.).
Why is this happening? Is there a way to avoid it?

make.names is used to convert names to syntactically valid ones. Check out this small example:
> make.names(c(".1 - / q", "if", "0", "NA"))
[1] "X.1.....q" "if." "X0" "NA."
The documentation says:
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number.
<...>
The character "X" is prepended if necessary. All invalid characters
are translated to "."
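As a hedged illustration, the sketch below assumes the names are being altered when the file is read back with read.table() (or when the matrix is converted with data.frame()), since both apply make.names to column names by default through check.names = TRUE. The matrix m and the temporary file are made up for the example.

```r
# Hypothetical matrix with column names that are not syntactically valid in R
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("1005_at", "x-y/z")))

tf <- tempfile(fileext = ".txt")
# col.names = NA writes an empty header cell above the row names
write.table(m, tf, sep = "\t", quote = FALSE, col.names = NA)

# Default: check.names = TRUE runs the header through make.names()
mangled <- read.table(tf, header = TRUE, sep = "\t", row.names = 1)

# check.names = FALSE keeps the header exactly as written
kept <- read.table(tf, header = TRUE, sep = "\t", row.names = 1,
                   check.names = FALSE)

colnames(mangled)
colnames(kept)
unlink(tf)
```

If the conversion happens in data.frame() instead, the same check.names = FALSE argument is available there.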

Related

Extracting numeric character of length (1|2) from character list

I am scraping PDFs for data and am trying to search for a numeric character (1:9) that is either of length 1 or 2. Unfortunately the value I am after changes position across the PDFs so I cannot simply call the index of the value and assign it to a variable.
I have tried many regex functions and can get numbers out of the list, but cannot seem to implement the argument to only pull numbers of the specific length.
library(stringr)
# Data comes in as a long string
Test <- "82026-424 82026-424 1 CSX10 Store Room 75.74 75.74"
# Separate the data into individual pieces with str_split
Split_Test <- str_split(Test[1], "\\s+")
# We can easily unlist it with the following code (Not sure if needed)
Test_Unlisted<-unlist(Split_Test)
> Test_Unlisted
[1] "82026-424" "82026-424" "1" "CSX10" "Store" "Room"
[7] "75.74" "75.74"
My desired outcome would be to get the "1" out of the character list, and then if the value was "20" also be able to recognize that.
The best logic I can think of is below, but it does not work:
Test_Final<-str_match(Test_Unlisted, "\\d|\\d\\d")
Using this code I can grab anything of length 1, but it is not guaranteed to be a digit:
Test_Final<-which(sapply(Test_Unlisted, nchar)==1)
Thanks for all the help!
You need to use
Test<-("82026-424 82026-424 1 CSX10 Store Room 75.74 75.74, 20")
regmatches(Test, gregexpr("\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)", Test, perl=TRUE))
Details
\b - a word boundary
(?<!\d\.) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a digit and a dot
\d{1,2} - 1 or 2 digits
\b - a word boundary
(?!\.\d) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a dot and a digit.
Note that because the pattern uses lookarounds, the regex must be passed to the PCRE regex engine, hence the perl=TRUE argument is required.
With stringr, which is powered by the ICU regex engine, you may use
library(stringr)
str_extract_all(Test, "\\b(?<!\\d\\.)\\d{1,2}\\b(?!\\.\\d)")
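If the string has already been split into tokens, as in the question, a simpler sketch is to keep only the tokens that consist entirely of one or two digits. It uses base R only, and the trailing " 20" in the input is added here for illustration. The anchors ^ and $ force the whole token to match, which the str_match attempt in the question did not do.

```r
Test <- "82026-424 82026-424 1 CSX10 Store Room 75.74 75.74 20"
tokens <- strsplit(Test, "\\s+")[[1]]

# Keep tokens made up of exactly one or two digits and nothing else
short_numbers <- tokens[grepl("^[0-9]{1,2}$", tokens)]
short_numbers
```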

I need to remove the backslashes from the following string which is a URL I have in a data frame in R

Can someone help me fix this? I am trying to remove the backslashes and the numbers between them from the following string.
a<-c("/organization/energystone-games-100-a\307\201\265\347\377\263\306\270\270\306\210\217")
I want to remove the backslashes and the numbers so the expected result should look like below:
/organization/energystone-games-100-a
There are actually no backslashes in the input. A backslash followed by digits is how R renders certain special characters. To get rid of them, remove every character that is not a lowercase letter, uppercase letter, digit, slash or minus.
gsub("[^a-zA-Z0-9/-]", "", a)
## [1] "/organization/energystone-games-100-a"
No uppercase letters actually appear, so if you are only concerned with strings like this one, the pattern could be reduced to "[^a-z0-9/-]".
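A small sketch of the point above. It assumes the string holds raw bytes that may not be valid in the current locale, in which case regex functions can warn or fail. The useBytes = TRUE argument is not in the original answer and is added here as a precaution so that matching is done byte by byte.

```r
a <- "/organization/energystone-games-100-a\307\201\265\347\377\263\306\270\270\306\210\217"

# No literal backslash is stored in the string; "\307" is a single byte
has_backslash <- grepl("\\", a, fixed = TRUE, useBytes = TRUE)

# Drop every byte that is not an ASCII letter, digit, slash or minus
cleaned <- gsub("[^a-zA-Z0-9/-]", "", a, useBytes = TRUE)

has_backslash
cleaned
```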

Losing information when converting from character to numerical

I'm trying to convert characters like "9.230" to a numeric type.
First I erased the dots, because it was returning me "NA", and then I converted to numerical.
The problem is that when I convert to numerical I lose the trailing zero:
Example:
a<-9.230
as.numeric(gsub(".","",a,fixed=TRUE))
Returns: 923
Does anyone know how to avoid this?
You assign the number 9.230 which is the same as 9.23. How is the system supposed to know that there was a trailing zero? If you want to transform a string, work with the string "9.230".
Look at the result of
a<-9.230
gsub(".","",a,fixed=TRUE)
#[1] "923"
Why? Because fixed=TRUE was passed to gsub, the . is treated as a literal dot rather than a regex wildcard, so each dot is replaced by the replacement argument, which is "".
That is why as.numeric(gsub(".","",a,fixed=TRUE)) returns 923.
There is another point: how was a <- 9.230 turned into a character string inside gsub? The R documentation for gsub explains this:
Arguments: x, text
a character vector where matches are sought, or an object
which can be coerced by as.character to a character vector. Long
vectors are supported.
Final question: how can this behavior be avoided?
Don't use gsub. Use sprintf("%.3f",a), which formats the number with three decimal places and keeps the trailing zero.
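Both answers can be checked with a short sketch. It assumes the dot in the original data is a thousands separator, so that "9.230" should become 9230. That is one reading of the question, not something it states.

```r
# A numeric literal carries no trailing zero once it is coerced to text
as.character(9.230)

# Keep the value as a string and the zero survives the dot removal
s <- "9.230"
digits <- gsub(".", "", s, fixed = TRUE)
value <- as.numeric(digits)

# To display a number with three decimals, format it instead
shown <- sprintf("%.3f", 9.23)
```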

How to split a string by dashes outside of square brackets

I would like to split strings like the following:
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
by dash (-) on the condition that those dashes must not be contained inside a pair of []. The expected result would be
c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]", "s",
"[klas-bst-asdas foo]")
Notes:
There is no nesting of square brackets inside each other.
The square brackets can contain any characters / numbers / symbols except square brackets.
The other parts of the string are also variable so that we can only assume that we split by - whenever it's not inside [].
There's a similar question for python (How to split a string by commas positioned outside of parenthesis?) but I haven't yet been able to accurately adjust that to my scenario.
You could use look ahead to verify that there is no ] following sooner than a [:
-(?![^[]*\])
So in R:
strsplit(x, "-(?![^[]*\\])", perl=TRUE)
Explanation:
-: match the hyphen
(?! ): negative look ahead: if that part is found after the previously matched hyphen, it invalidates the match of the hyphen.
[^[]: match any character that is not a [
*: match any number of the previous
\]: match a literal ]. If this matches, it means we found a ] before finding a [. As all this happens in a negative look ahead, a match here means the hyphen is not a match. Note that a ] is a special character in regular expressions, so it must be escaped with a backslash (although it does work without escape, as the engine knows there is no matching [ preceding it -- but I prefer to be clear about it being a literal). And as backslashes have a special meaning in string literals (they also denote an escape), that backslash itself must be escaped again in this string, so it appears as \\].
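To see the result, the call can be checked against the expected vector from the question. strsplit returns a list with one element per input string, so [[1]] picks out the pieces for x.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
expected <- c("abc", "1230", "xyz", "[def-ghu-jkl---]", "[adsasa7asda12]",
              "s", "[klas-bst-asdas foo]")

# Split on hyphens that are not followed by a "]" before the next "["
parts <- strsplit(x, "-(?![^[]*\\])", perl = TRUE)[[1]]
identical(parts, expected)
```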
Instead of splitting, extract the parts:
library(stringr)
str_extract_all(x, "(\\[[^\\[]*\\]|[^-])+")
I am not familiar with R, but I believe it supports regex-based search and replace. Instead of struggling with a single regex split, I would do it in three steps:
1. Replace each - inside the [....] parts with an invisible character, such as \x99.
2. Split by -.
3. For each element of the split result, replace \x99 back with -.
For the first step, you can find the bracketed parts with \[[^]]*\]
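The three steps could be sketched in base R as follows. The placeholder "\001" is an arbitrary choice made here and is assumed never to occur in the input. The bracket pattern is also an assumption about what the answer intended.

```r
x <- "abc-1230-xyz-[def-ghu-jkl---]-[adsasa7asda12]-s-[klas-bst-asdas foo]"
placeholder <- "\001"  # assumed never to appear in the input

# Step 1: inside each [...] part, swap "-" for the placeholder
m <- gregexpr("\\[[^]]*\\]", x)
regmatches(x, m) <- lapply(regmatches(x, m), function(s) {
  gsub("-", placeholder, s, fixed = TRUE)
})

# Step 2: split on the remaining hyphens
parts <- strsplit(x, "-", fixed = TRUE)[[1]]

# Step 3: restore the hyphens inside the brackets
parts <- gsub(placeholder, "-", parts, fixed = TRUE)
parts
```

This is longer than a single lookahead regex, but each step is easy to inspect on its own.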

How to extract characters from a string based on the text surrounding them in R

I'm using R and I have many large lists of character strings that share a similar format. I am interested in the characters directly in front of a series of characters that is consistently in the string, but not in a consistent place within the string. For instance:
a <- "aabbccddeeff"
b <- "aabbddff"
c <- "aabbffgghhii"
d <- "bbffgghhii"
I am interested in extracting the two characters directly preceding the "ff" in each character string. I can't find any reasonable solution apart from breaking each character string down using grepl() and then processing them each independently, which seems like an inefficient way to do it.
You can match those two characters and capture them with sub and the right regular expression.
Strings = c("aabbccddeeff",
"aabbddff",
"aabbffgghhii",
"bbffgghhii")
sub(".*(\\w\\w)ff.*", "\\1", Strings)
[1] "ee" "dd" "bb" "bb"
Explanation: this replaces the entire string with the two characters before the "ff". If the string contains more than one "ff", this expression takes the two characters before the last "ff".
How this works: The three arguments to sub are:
1. a pattern to search for
2. What it will be replaced with
3. The strings to apply it to.
Most of the work is in the pattern, .*(\\w\\w)ff.*. The ff part of the pattern should be obvious: we are targeting things near the specific string ff. What comes right before it is (\\w\\w). \w refers to a "word character", meaning any letter a-z or A-Z, any digit 0-9, or the underscore _. We want two characters, so we write \\w\\w. Enclosing \\w\\w in parentheses turns this two-character pattern into a "capture group", a string that is saved for later use. Since this is the first (and only) capture group in the expression, those two characters can be referred to as \1.
We want only those two characters, so to discard everything before and after them we put .* at the front and back. . matches any character and * means zero or more times, so .* means zero or more copies of any character. The pattern now breaks the string into four parts: "ff", the two characters before "ff", everything before that, and everything after the "ff". Together these cover the entire string.
sub replaces the part that was matched (everything) with whatever the replacement says, in this case "\\1". That is how you write a string that evaluates to \1, the reference to the capture group holding the two characters we want. We write it that way because a backslash "escapes" whatever follows it: we want the literal character \, so we write \\ to get \, and "\\1" evaluates to \1. So the whole string is replaced by the two targeted characters. We apply this to every string in the vector Strings.
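Two edge cases are worth seeing in a sketch: a string with more than one "ff" and a string with none. The two extra inputs "xxffyyff" and "nomatch" are made up for illustration. Note that sub returns a string unchanged when the pattern does not match. The second approach below is an alternative, not part of the answer: a lookahead with regexpr picks the first "ff" and leaves NA where nothing matches.

```r
Strings <- c("aabbccddeeff", "aabbddff", "aabbffgghhii", "bbffgghhii",
             "xxffyyff", "nomatch")

# Greedy .* means the capture group sits before the LAST "ff";
# strings without a match come back unchanged
before_last <- sub(".*(\\w\\w)ff.*", "\\1", Strings)

# Alternative: two word characters followed by "ff", leftmost match
m <- regexpr("\\w\\w(?=ff)", Strings, perl = TRUE)
hit <- as.integer(m) > 0
before_first <- rep(NA_character_, length(Strings))
before_first[hit] <- regmatches(Strings, m)

before_last
before_first
```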
