How to remove unknown symbols from strings? - r

Sorry if this is a stupid question, but I tried searching for similar problems and did not find what I was looking for.
I scraped some text from the Internet and am now trying to work with it in R. I have run into a problem: unknown characters are inserted in the middle of some words. The text looks normal when I just display the table, but when I copy it, the extra symbol shows up. For example, if the cell in the table is "Example", when I copy it to the console, I see this:
This is a problem because R does not recognize the word in these cases and would not find the cell if I, for example, tried to find all cells that contain the word "Example". As the error seems random and doesn't apply only to specific words, I do not know how to fix it. Can anybody help me?
Thank you very much in advance!!

You can use the iconv function to remove all non-ASCII characters from the string. Please see the example below:
iconv("Ex·ample", from = "UTF-8", to = "ASCII", sub = "")
# Example
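As a minimal sketch of how this helps with the search problem from the question: the sample vector x below is made up for illustration, and the approach assumes the text is UTF-8 and that every non-ASCII character is unwanted (legitimate accented letters would be stripped too).

```r
# Hypothetical sample data: two of the strings contain a stray non-ASCII character
x <- c("Ex\u00b7ample", "Example", "Ex\u00adample")

# Drop every byte that cannot be represented in ASCII
clean <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Matching on the raw text misses the damaged cells ...
grepl("Example", x, fixed = TRUE)

# ... while matching on the cleaned text finds all of them
grepl("Example", clean, fixed = TRUE)
```

Keeping the cleaned text in a separate variable, as here, leaves the original strings available in case the removed characters turn out to matter.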

Related

R displaying unicode/utf-8 encoding rather than the special characters

I have a dataframe in R which has one row of utf-8 encoded special characters and one integer row.
If I display both rows, or open the data frame in View(), the characters are not displayed correctly.
However, if I only select the row with the special characters, it works. Any ideas?
This is the output (if I paste it, the encoding disappears):
This looks like a bug in R. I've worked around a number of these in the corpus package. Try the following:
library(corpus)
print.corpus_frame(WW_mapping[1:3,])
Alternatively, do
library(corpus)
class(WW_mapping) <- c("corpus_frame", "data.frame")
WW_mapping[1:3,]
Adding the "corpus_frame" class to the data frame changes the print and format methods; otherwise, it does not change the behavior of the object.
If that doesn't work, please report your sessionInfo() along with dput(WW_mapping). (Actually, even if this fix does work, please report this information so that we can let the R core developers know about the problem.)

characters converted to dates when using write.csv

my data frame has a column A with strings in character form
> df$A
[1] "2-60", "2-61", "2-62", "2-63" etc
I saved the table using write.csv, but when I open it with Excel column A appears formatted as date:
Feb-60
Feb-61
Feb-62
Feb-63
etc
Does anyone know what I can do to avoid this?
I tweaked the arguments of write.csv but nothing worked, and I can't seem to find an example in Stack Overflow that helps solve this problem.
As said in the comments, this is Excel's behaviour, not R's, and it can't be deactivated:
Microsoft Excel is preprogrammed to make it easier to enter dates. For
example, 12/2 changes to 2-Dec. This is very frustrating when you
enter something that you don't want changed to a date. Unfortunately
there is no way to turn this off. But there are ways to get around it.
Microsoft Office Article
The first workaround suggested in the article is not helpful here, because it relies on changing the cell formatting, and that is too late once you open the .csv file in Excel (the value has already been converted to an integer representing the date).
There is, however, a useful tip:
If you only have a few numbers to enter, you can stop Excel from
changing them into dates by entering:
An apostrophe (‘) before you enter a number, such as ’11-53 or ‘1/47. The apostrophe isn’t displayed in the cell after you press
Enter.
So you can make the data display as original by using
vec <- c("2-60", "2-61", "2-62", "2-63")
vec <- paste0("'", vec)
Just remember that the values will still have the apostrophe if you read them into R again, so you might have to strip the leading apostrophe with
vec <- sub("^'", "", vec)
This might not be ideal but at least it works.
One alternative is enclosing the text in =" ", as an Excel formula, but that has the same end result and uses more characters.
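A rough sketch of that formula-style alternative, under the assumption that Excel evaluates a CSV field of the form ="2-60" as a text formula. The data frame and the temporary file are made up for illustration.

```r
# Hypothetical data frame standing in for the one in the question
df <- data.frame(A = c("2-60", "2-61"), stringsAsFactors = FALSE)

# Wrap each value as ="value" so that Excel treats it as a text formula
out <- df
out$A <- paste0('="', out$A, '"')

# write.csv doubles the embedded quotes, which is the standard CSV escaping
tmp <- tempfile(fileext = ".csv")
write.csv(out, tmp, row.names = FALSE)
readLines(tmp)

# When reading the file back into R, strip the wrapper again
back <- read.csv(tmp, stringsAsFactors = FALSE)
back$A <- gsub('^="|"$', "", back$A)
```

As with the apostrophe approach, the wrapper is part of the stored value, so any other program reading the file has to remove it.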
Another solution, though a bit tedious: use Import Text File in Excel, click through the dialog boxes, and in Step 3 of 3 of the Text Import Wizard you will have the option of setting the column data format. Use "Text" for the column that has "2-60", "2-61", "2-62", "2-63". If you use General (the default), Excel tries to be smart and converts the values for you.
I solved the problem by saving the file using the .xlsx format by using the function
write.xlsx()
from the package xlsx (https://www.rdocumentation.org/packages/xlsx/versions/0.6.5)

importing txt into R

I am trying to read an ftp file from the internet ("ftp://ftp.cmegroup.com/pub/settle/stlint") into R, using the following command:
aaa<-read.table("ftp://ftp.cmegroup.com/pub/settle/stlint", sep="\t", skip=2, header=FALSE)
In the result, the 8th, 51st, 65th, 71st, 72nd, 73rd, 74th, etc. rows of the dataset each have additional rows appended to the end. Basically, instead of returning
{row8}
{row9}
etc
{row162}
{row163}
It returns (adding in the quotes around the \n)
{row8'\n'row9'\n'etc...etc...'\n'row162}
{row163}
If it seems like I'm picking arbitrary numbers, run the code above and take a look at the actual FTP file (as of midday Feb 18) and you'll see I'm not: it really is appending about 155 rows to the end of the 8th row. What I'm looking for is simply a way to read the file in without rows being appended to each other. Thanks, and apologies in advance: I'm new to R and was not able to find a fix after a while of searching.
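One common cause of this symptom is an apostrophe or double quote inside the data. This is an assumption here, since the file contents are not shown. read.table treats both characters as quoting characters by default, so everything up to the next matching quote, newlines included, ends up in one field. The sketch below reproduces the effect on a small made-up file and shows that passing quote = "" disables quote handling.

```r
# Hypothetical file contents: two lines contain an apostrophe
tmp <- tempfile(fileext = ".txt")
writeLines(c("row1 value",
             "row2 JUL'13",
             "row3 value",
             "row4 AUG'13",
             "row5 value"), tmp)

# Default quoting: the text between the two apostrophes is read as one field
merged <- read.table(tmp, sep = "\t", header = FALSE, stringsAsFactors = FALSE)

# Quoting disabled: every line becomes its own row
clean <- read.table(tmp, sep = "\t", header = FALSE, quote = "",
                    stringsAsFactors = FALSE)

nrow(merged)
nrow(clean)
```

If this is the cause, adding quote = "" to the original read.table() call would be the corresponding change.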

R not reading ampersand from .txt file with read.table() [duplicate]

I am trying to load a table into R in which some of the column headings start with a number, e.g. 55353, 555xx, abvab, 77agg.
After loading the file, I found that every heading that starts with a number has an X in front of it, i.e. the headings were changed to X55353, X555xx, abvab, X77agg.
Please note that not all column headings start with a number. What can I do to solve this problem?
Many thanks
Probably your issue will be solved by adding check.names=FALSE to your read.table() call.
For more information see ?read.table
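A small sketch of the difference, using a made-up temporary file with the headings from the question:

```r
# Hypothetical tab-separated file with headings that start with digits
tmp <- tempfile(fileext = ".txt")
writeLines(c("55353\t555xx\tabvab\t77agg",
             "1\t2\t3\t4"), tmp)

# Default: headings are made syntactically valid, so an X is prepended
fixed_names <- read.table(tmp, header = TRUE, sep = "\t")
names(fixed_names)

# check.names = FALSE keeps the headings exactly as they appear in the file
raw_names <- read.table(tmp, header = TRUE, sep = "\t", check.names = FALSE)
names(raw_names)

# Non-syntactic names need [[ ]] or backticks when accessing the column
raw_names[["55353"]]
```

The trade-off is that columns with such names can no longer be accessed as raw_names$55353 without backticks.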

How to inform R that an escaped character (en dash) is finished when the string is composed of numbers?

The answer is probably simple but I am unable to find it.
When plotting in R I want to add a text in the plot containing a number range (e.g. 60-80) with a code like this:
a<-rnorm(10,10)
b<-rnorm(10,10)
plot(a,b)
text(10,10,labels=("60\u2013 80"))
I got the following text in the plot: "60- 80", with an annoying space after the dash.
Obviously when trying ("60\201380") it simply fails. I cannot use the direct way ("60-80") because it doesn't work with Arial on my computer.
Any idea to tell R that \u2013 has finished?
Thanks a lot
Just paste it. :)
text(10,10,labels=paste("60\u2013","80", sep=""))
and it works.
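As a hedged aside, R's \u escape reads at most four hexadecimal digits, and a braced form \u{...} also exists, so the digits that follow should not be swallowed. The sketch below compares the three spellings; the variable names are only for illustration.

```r
# Pasting the two pieces together, as in the answer above
pasted <- paste("60\u2013", "80", sep = "")

# \u takes at most four hex digits, so "80" is read as ordinary text
direct <- "60\u201380"

# The braced form marks the end of the escape explicitly
braced <- "60\u{2013}80"

identical(pasted, direct)
identical(pasted, braced)
nchar(pasted)
```

Any of the three strings can be passed to the labels argument of text().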
