I have a list of words, which I got from the code below.
tags_vector <- unlist(tags_used)
Some of the strings in this list have an ellipsis at the end, which I want to remove. Here I print the 5th element of the list and its class:
tags_vector[5]
#[1] "#b…"
class(tags_vector[5])
#[1] "character"
I am trying to remove the ellipsis from this 5th element using gsub, with the following code:
gsub("[…]", "", tags_vector[5])
#[1] "#b…"
This code doesn't work and I get "#b…" as output. But when I put the value of the 5th element into the same code directly, it works fine, as below:
gsub("[…]", "", "#b…")
#[1] "#b"
I even tried putting the value of tags_vector[5] in a variable x1 and using it in the gsub() call, but it still didn't work.
It might be a Unicode issue. In R (and RStudio), not all characters are created equal.
I tried to create a reproducible example:
# create the ellipsis from the definition (similar to your tags_used)
> ell_def <- rawToChar(as.raw(c(0xE2, 0x80, 0xA6))) # from the unicode definition here: http://www.fileformat.info/info/unicode/char/2026/index.htm
> Encoding(ell_def) <- 'UTF-8'
> ell_def
[1] "…"
> Encoding(ell_def)
[1] "UTF-8"
# create the ellipsis from text (similar to your string)
> ell_text <- '…'
> ell_text
[1] "…"
> Encoding(ell_text)
[1] "latin1"
# show that you can get strange results
> gsub(ell_text,'',ell_def)
[1] "…"
The reproducibility of this example may depend on your locale. In my case, I work in windows-1252, since you cannot set the locale to UTF-8 on Windows. According to this stringi source, "R lets strings in ASCII, UTF-8, and your platform's native encoding coexist peacefully". As the example above shows, this can sometimes give contradictory results.
Basically, the two strings look the same when printed, but they are not the same on a byte level.
If I run this example in the R terminal, I get similar results, but there the ellipsis is apparently displayed as a dot: ".".
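You can make the difference visible with charToRaw(), which shows the bytes behind each string (a quick check; the single latin1 byte assumes a windows-1252 locale as above):
> charToRaw(ell_def)
[1] e2 80 a6
> charToRaw(ell_text)
[1] 85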
A quick fix for your example would be to use the ellipsis definition in your gsub. E.g.:
gsub(ell_def,'',tags_vector[5])
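Alternatively, it might help to normalise everything to UTF-8 first, so the pattern and the target are in the same encoding; a sketch using base R's enc2utf8():
gsub(ell_def, '', enc2utf8(tags_vector))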
Related
I am trying to convert some strings of an input file from UTF-8 to ASCII. For most of the strings, the conversion works perfectly fine with iconv(). However, on some of them it returns NA. While manually fixing the issue in the file seems like the simplest option, it is unfortunately not available to me at the moment.
I have made a reproducible example of my problem, but we have to assume that I must figure out a way for iconv() to somehow convert the string in s1 and not get NA.
Here is the reproducible example:
s1 <- "Besançon" #as read from an input file I cannot modify
s2 <- "Paris"
s3 <- "Linköping"
s4 <- "Besançon" #Manual input for testing
s1 <- iconv(s1, to='ASCII//TRANSLIT')
s2 <- iconv(s2, to='ASCII//TRANSLIT')
s3 <- iconv(s3, to='ASCII//TRANSLIT')
s4 <- iconv(s4, to='ASCII//TRANSLIT')
I get the following output:
> s1
[1] NA
> s2
[1] "Paris"
> s3
[1] "Link\"oping"
> s4
[1] "Besancon"
After playing around with the code, I figured out that something was wrong with the entry "Besançon", which is copied exactly from the input file above. When I type it in manually, the problem is solved. Since I can't modify the input file at all, what do you think the exact issue is, and would you have any idea how to solve it?
Thanks in advance,
Edit:
After closer inspection, there is something odd in the characters of the first line. It seems to be stripped away by SO's formatting.
But the best I could do to reproduce it is these two images describing it. The first image places my cursor just before the #.
The second image is after pressing delete, which should delete the whitespace... it turns out it deletes the ". So there is definitely something weird there.
It turns out that using sub='' actually solved the issue, although I am quite unsure why.
iconv(s1, to='ASCII//TRANSLIT', sub='')
From the documentation of sub:
character string. If not NA it is used to replace any non-convertible
bytes in the input. (This would normally be a single character, but
can be more.) If "byte", the indication is "" with the hex code of
the byte. If "Unicode" and converting from UTF-8, the Unicode point in
the form "<U+xxxx>".
So I eventually figured out that there was a character in the string that I couldn't convert (or even see), and using sub was a way to eliminate it. I am still not sure what this character is, though. But the problem is solved.
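If you want to see what the invisible character actually is, you can dump the raw bytes, or let iconv() mark each non-convertible byte with its hex code via sub='byte' (a small diagnostic sketch; s1 is the problematic string from above):
charToRaw(s1)                      # inspect the raw bytes one by one
iconv(s1, to='ASCII', sub='byte')  # non-convertible bytes show up as <xx>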
There is probably a latin1 (or other non-UTF-8) character in your supposedly UTF-8 file. For example:
> latin=iconv('Besançon','utf8','latin1')
> iconv(latin,to='ascii//translit')
[1] NA
> iconv(latin,'utf8','ascii//translit')
[1] NA
> iconv(latin,'latin1','ascii//translit')
[1] "Besancon"
> iconv(latin,'Windows-1250','ascii//translit')
[1] "Besancon"
You can e.g. make one new vector or data column with the result of each character-set conversion of your data, and if one is NA, fall back to the next one, e.g.
utf8 = iconv(x,'utf8','ascii//translit')
latin1 = iconv(x,'latin1','ascii//translit')
win1250 = iconv(x,'Windows-1250','ascii//translit')
result = ifelse(
  is.na(utf8),
  ifelse(
    is.na(latin1),
    win1250,
    latin1
  ),
  utf8
)
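If you already use dplyr, the nested ifelse() can be collapsed with coalesce(), which takes the first non-NA value at each position (same vectors as above):
library(dplyr)
result <- coalesce(utf8, latin1, win1250)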
If these encodings don't work, make a file with just the problem word, then use the unix/linux file command to detect the encoding, or else try some likely encodings.
In the past I have just listed all of iconv's supported encodings, tried them all with lapply(), and then used whichever result worked for each string. But some "from" encodings will return a non-NA yet incorrect result, so it's best to run this on each unique character in your data to decide which subset of iconv's encodings to use, and in which order.
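A rough sketch of that brute-force approach (x is one problem string; inspect the survivors by eye, since non-NA does not mean correct):
encs <- iconvlist()  # every encoding name this iconv build knows
res <- lapply(encs, function(e)
  tryCatch(iconv(x, e, 'ASCII//TRANSLIT'), error = function(err) NA_character_))
names(res) <- encs
Filter(Negate(is.na), res)  # keep only conversions that produced a result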
I am using OpenCPU and R to create a web API that takes some inputs and returns a topoJSON file from a database, as well as some other information. OpenCPU automatically pushes the output through toJSON, which results in JSON output that has quoted JSON in it (i.e., the topoJSON). This is obviously not ideal, especially since it then gets incredibly cluttered with backslash-escaped quotes (\"). I tried using fromJSON to convert it to an R object, which could then be converted back (which is incredibly inefficient), but it returns slightly different syntax and the result is that it doesn't work.
I feel like there should be some way to convert the string to some other type of object that results in toJSON calling a different handler that tells it to just leave it alone, but I can't figure out how to do that.
> s <- '{"type":"Topology","objects":{"map": "0"}}'
> fromJSON(s)
$type
[1] "Topology"
$objects
$objects$map
[1] "0"
> toJSON(fromJSON(s))
{"type":["Topology"],"objects":{"map":["0"]}}
That's just the beginning of the file (I replaced the actual map with "0"), and as you can see, brackets appeared around "Topology" and "0". Alternatively, if I just keep it as a string, I end up with this mess:
> toJSON(s)
["{\"type\":\"Topology\",\"objects\":{\"0000595ab81ec4f34__csv\": \"0\"}}"]
Is there any way to fix this so that I just get the verbatim string, without the added quotes and backslashes?
EDIT: Note that because I'm using OpenCPU, the output needs to come from toJSON (so no other function can be used, unfortunately), and I can't do any post-processing.
To me it seems you just want scalar values rather than vectors. Set auto_unbox = TRUE to turn length-one vectors into scalar values:
toJSON(fromJSON(s), auto_unbox = TRUE)
# {"type":"Topology","objects":{"map":"0"}}
That does print without escaping for me (using jsonlite_1.5). Maybe you are using an older version of jsonlite. You can also get around that by using cat() to print the result; you won't see the backslashes when you do that.
cat(toJSON(fromJSON(s), auto_unbox = TRUE))
You can manually unbox the relevant entries:
library(jsonlite)
s <- '{"type":"Topology","objects":{"map": "0"}}'
j <- fromJSON(s)
j$type <- unbox(j$type)
j$objects$map <- unbox(j$objects$map)
toJSON(j)
# {"type":"Topology","objects":{"map":"0"}}
The file I'm reading contains one word per line.
I have issues with some of these words, as it seems some characters are unusual. See the following example with the first word of my list:
stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8")$V1
stopwords[1] # "a" , if you copy paste into R studio this character with the quotes around it, you'll see a little red dot preceding the a.
stopwords[1] == "a" # FALSE
How did this happen? How can I avoid it? And if I can't avoid it, how do I convert this dotted "a" into a regular "a"?
EDIT:
You can reproduce the issue by just copy-pasting this into RStudio:
"a" == "a" # FALSE
Here's where I got the file from:
https://sites.google.com/site/kevinbouge/stopwords-lists/stopwords_fr.txt?attredirects=0&d=1
The encoding of the file, according to Notepad++, is UTF-8-BOM. But using "UTF-8-BOM" as the encoding doesn't help, though it seemed to work in this answer:
Read a UTF-8 text file with BOM
stopwords <- read.csv("stopwords_fr.txt",stringsAsFactors = FALSE,header=FALSE,encoding="UTF-8-BOM")$V1
stopwords[1] # "a"
I have R version 3.0.2
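One thing worth trying: read.csv() distinguishes encoding= (which only marks the strings) from fileEncoding= (which re-encodes the connection and strips the BOM), and the linked answer uses the latter. A hedged sketch, with a manual fallback that removes a stray BOM character (U+FEFF) after the fact:
stopwords <- read.csv("stopwords_fr.txt", stringsAsFactors = FALSE,
                      header = FALSE, fileEncoding = "UTF-8-BOM")$V1
stopwords <- sub("^\ufeff", "", stopwords)  # fallback: strip a leading BOM by hand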
Using the RODBC package, I pulled data from Access into R.
One of the column names in the Access table is: "JED_credit\debit (local)"
When I try to refer to that column in R, I get:
JED_credit\debit (local)
"Error: unexpected input in "JED_credit\"
It may not be the recommended way of defining variables, but it works using backticks:
> `var` <- 'test'
> var
[1] "test"
> `var/bla` <- 'test'
> `var/bla`
[1] "test"
> `var()bla` <- 'test'
> `var()bla`
[1] "test"
> `var\bla` <- 'test'
> `var\bla`
[1] "test"
There are a few character sequences in almost all languages which have a special meaning. For example, \n stands for a newline character.
In your string, the parser treats \d in JED_credit\debit as the start of such an escape sequence rather than as part of the string. To make it part of the string, you need to escape the backslash with another \, thus writing the name as JED_credit\\debit.
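Applied to the original column name, access could then look like this (a sketch; df stands in for the data frame RODBC returned):
df$`JED_credit\\debit (local)`      # backticks, with the backslash escaped
df[["JED_credit\\debit (local)"]]   # or index by the escaped string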
I've been using gsub("toreplace", "replacement", myvector) to clean out data in R. While this works for commas and the like, removing "$" has no effect: if I do gsub("$", "", myvector), all the dollar signs remain in place.
I think this is because $ is a special character in R. I tried escaping it "\$" but that yields the same result (no effect). And I couldn't find a resource on escaping special characters in R.
Obviously I should do this in preprocessing. But I was wondering if anyone out there knew how to either a) escape special characters in R, or b) get rid of pesky $ in R directly. For science.
You have to escape it twice, first for R, second for the regex.
gsub('\\$', '', c("a$a", "bb$"))
[1] "aa" "bb"
See ?Quotes for details on quoting and escaping.
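To see the two layers of escaping, inspect the pattern itself: the R literal '\\$' is only two characters long, and the regex engine receives them as an escaped dollar sign:
cat('\\$')
# \$
nchar('\\$')
# [1] 2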
Use fixed = TRUE:
gsub('$', '', c("a$a", "bb$"), fixed = TRUE)
Then you don't need to worry about any special characters. In stringr, this is implemented a little differently:
library(stringr)
str_replace_all(c("$100","ta$ty"), fixed("$"), "")
Thanks to DiggyF and James for the examples!
Escaping characters can be a pain sometimes, but just putting the character in square brackets (making it a character class) helps with this:
> gsub("[$]","",c("$100","ta$ty"))
[1] "100" "taty"
If you have $ followed by a number in a set of data columns (e.g. $400,000), there is an easier way that worked like a charm for me:
library(dplyr)
library(readr)

data %>%
  mutate_at(5:6, parse_number)
where 5:6 are the positions of the data columns to convert.
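For instance, on a small made-up data frame (a sketch; the column position is adjusted to the toy data):
library(dplyr)
library(readr)
df <- tibble(city = c("Paris", "Lyon"), price = c("$400,000", "$1,250"))
df %>% mutate_at(2, parse_number)  # parse_number drops the $ and the commas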