Special characters in R language

I have a table, which looks like this:
1β 2β
1.0199e-01 2.2545e-01
2.5303e-01 6.5301e-01
1.2151e+00 1.1490e+00
and so on...
I want to make a boxplot of this data. The commands I am using are:
pdf('rtest.pdf')
w1<-read.table("data_CMR",header=T)
w2<-read.table("data_C",header=T)
boxplot(w1[,], w2[,], w3[,],outline=FALSE,names=c(colnames(w1),colnames(w2),colnames(w3)))
dev.off()
The problem is that instead of the symbol beta (β), I get two dots (..) in the output.
Any suggestions for solving this problem?
Thank you in advance.

The suggestion to use check.names=FALSE will prevent an "X" from being prepended to "1β" and "2β", which would otherwise happen even once the encoding is sorted out (since column names are not supposed to start with numbers). (One could also have just used the "names" argument to boxplot.)
w1<-read.table(text="1β 2β
1.0199e-01 2.2545e-01
2.5303e-01 6.5301e-01
1.2151e+00 1.1490e+00",header=TRUE, check.names=FALSE, fileEncoding="UTF-8")
boxplot(w1)
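
To see what check.names is guarding against, here is a small sketch of the renaming that read.table applies by default through make.names. The ASCII headers are made up for illustration so that the result does not depend on the locale. As an assumption on my part, in a non-UTF-8 locale each byte of β would probably be treated as an invalid character and turned into a dot, which would explain the two dots.

```r
# make.names() is what read.table(check.names = TRUE) uses to repair headers.
# Hypothetical ASCII headers, chosen so the result is locale-independent.
fixed <- make.names(c("1b", "a/b", "ok_name"))
print(fixed)
# A name starting with a digit gets an "X" prefix; invalid characters become ".".

# With check.names = FALSE the header is kept exactly as written in the file.
w <- read.table(text = "1b 2b\n0.1 0.2\n0.3 0.4",
                header = TRUE, check.names = FALSE)
print(names(w))
```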

This also works:
pdf('rtest.pdf')
w1<-read.table("data_CMR",header=T)
w2<-read.table("data_C",header=T)
one<-expression(paste("1", beta,sep=""))
two <- expression(paste("2", beta,sep=""))
boxplot(w1[,], w2[,], w3[,],outline=FALSE, names=c(one,two))
dev.off()

This could be an encoding problem. Try adding encoding='UTF-8' to your read.table statements.
w1<-read.table("data_CMR",header=T,encoding='UTF-8')

Related

read.table unable to read tab delimited file?

I'm having trouble reading this table into R:
http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt
I tried all of the following:
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=7,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=8,header=FALSE)
read.table("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",skip=10,header=FALSE)
If I tell it that the separator is a tab, I get the wrong table:
d = read.table(file="http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt",header=FALSE,skip=7,sep="\t")
The only thing that seems to work is readLines, but then I don't know how to turn the lines into a data.frame.
d =readLines("http://www.census.gov/popest/about/geo/state_geocodes_v2012.txt")
Any suggestions? Thanks.
I agree that read.fwf will work once you've worked out the widths.
But yeah, I just hate people who allow whitespace inside elements (e.g. "South Dakota"). One other thing you can do is edit the source text file, replacing every run of two or more spaces with a tab. That will leave the state names as-is but give you a workable delimiter.
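
The same replacement can be done inside R instead of editing the file by hand. This is only a sketch: the sample lines are invented to resemble the census file, and it assumes columns are always separated by at least two spaces while names contain at most single spaces.

```r
# Made-up sample: columns are separated by runs of spaces, and some
# state names contain a single space.
raw <- c("Region  Division  State  Name",
         "2       4         46     South Dakota",
         "3       5         37     North Carolina",
         "1       1         09     Connecticut")

# Turn every run of two or more spaces into one tab; single spaces survive.
tabbed <- gsub(" {2,}", "\t", raw)

# colClasses = "character" keeps codes such as "09" from losing the leading zero.
d <- read.table(text = tabbed, sep = "\t", header = TRUE,
                colClasses = "character")
print(d)
```

For the real file you would presumably fill raw with readLines() on the URL and drop the introductory lines first; how many lines to drop is something to check against the file itself.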

parsing xml file manually with r

For some reason, I cannot download the R XML package at work. I have an XML file with contents like this:
x<-read.table("info.xml")
x
</name></content></item><item id="id-123"><content><name>
</name></content></item><item id="id-456"><content><name>
</name></content></item><item id="id-5559"><content><name>
I need to pick out the values that consist of id- followed by digits, such as
id-123, id-456, id-5559, etc.
I tried this:
str_extract_all(x, "id-[0-9]")
but it only prints id-1. I really need help quickly. Any ideas?
str_extract_all(x, "id-[0-9]+")
The regular expression "id-[0-9]" is missing a "+" at the end.
There may be more issues, but that one jumps out.
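
Since installing packages seems to be a problem at work, here is a base R sketch of the same extraction that does not need stringr. The vector x is a stand-in for the file contents shown in the question; in practice it would probably come from readLines("info.xml").

```r
# Stand-in for the lines of info.xml shown in the question.
x <- c('</name></content></item><item id="id-123"><content><name>',
       '</name></content></item><item id="id-456"><content><name>',
       '</name></content></item><item id="id-5559"><content><name>')

# gregexpr() locates every match of "id-" followed by one or more digits;
# regmatches() pulls the matched text out, one list element per input line.
matches <- regmatches(x, gregexpr("id-[0-9]+", x))
ids <- unlist(matches)
print(ids)
```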

read.fwf and the number sign

I am trying to read this file (3.8mb) using its fixed-width structure as described in the following link.
This command:
a <- read.fwf('~/ccsl.txt',c(2,30,6,2,30,8,10,11,6,8))
Produces an error:
line 37 did not have 10 elements
After replicating the issue with different values of the skip option, I figured that the lines causing the problem all contain the "#" symbol.
Is there any way to get around it?
As @jverzani already commented, this problem is probably caused by the fact that the # sign is often used as a character to signal a comment. Setting the comment.char argument of read.fwf to something other than # could fix the problem. I'll leave my answer below as a more general case that you can use for any character that causes problems (e.g. the 's in the Dutch city name 's Gravenhage).
I've had this problem occur with other symbols. The approach I took was to simply replace the # by either nothing, or by a character which does not generate the error. In my case it was no problem to simply replace the character, but this might not be possible in your case.
So my approach would be to delete the symbol that generates the error, or replace it with another character. This can be done using a text editor (find and replace), in an R script, or using Linux tools such as grep and sed. If you want to do this in an R script, use scan or readLines to read the lines. Once the text is in memory, you can use sub (or gsub, to replace every occurrence) to replace the character.
If you cannot drop the character for good, I would try the following approach: replace it with a character that does not generate an error, read the data into R using read.fwf, and finally replace that placeholder with the # character again.
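
A rough sketch of that round trip, with made-up lines and widths. The placeholder "~" is an assumption; any single character that does not occur in the data would do, and using a single character keeps the fixed widths intact.

```r
# Made-up fixed-width lines: 2-char country, 3-char code, 9-char city.
raw <- c("NL#01Amsterdam",
         "NL#02Utrecht  ")

placeholder <- "~"  # assumed not to occur anywhere in the data
stopifnot(!any(grepl(placeholder, raw, fixed = TRUE)))

# Step 1: swap "#" for the placeholder and write the lines to a temp file.
safe <- gsub("#", placeholder, raw, fixed = TRUE)
tmp <- tempfile(fileext = ".txt")
writeLines(safe, tmp)

# Step 2: read the sanitized file with the usual widths.
d <- read.fwf(tmp, widths = c(2, 3, 9),
              stringsAsFactors = FALSE, strip.white = TRUE)

# Step 3: put "#" back in every character column.
is_chr <- vapply(d, is.character, logical(1))
d[is_chr] <- lapply(d[is_chr],
                    function(col) gsub(placeholder, "#", col, fixed = TRUE))
print(d)
```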
Following up on the answer above: to get all characters to be read as literals, use both comment.char="" and quote="" (the latter takes care of @PaulHiemstra's problem with single quotes in Dutch proper nouns) in the call to read.fwf (this is documented in ?read.table).
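
For illustration, a minimal sketch of such a call. The lines and widths are invented (they are not the layout of ccsl.txt); read.fwf hands the two extra arguments on to read.table.

```r
# Invented fixed-width lines containing both "#" and a single quote:
# 2-char country, 3-char code, 13-char city.
lines <- c("NL#01's Gravenhage",
           "NL#02Utrecht      ")
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# comment.char = "" stops "#" from starting a comment;
# quote = "" stops ' and " from being treated as string delimiters.
d <- read.fwf(tmp, widths = c(2, 3, 13),
              comment.char = "", quote = "",
              stringsAsFactors = FALSE, strip.white = TRUE)
print(d)
```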

How to escape or sanitize a slash using regex in R?

I'm trying to read in a (tab-separated) csv file in R. When I want to read the column whose name includes a /, I get an error.
doSomething <- function(dataset) {
a <- dataset$data_transfer.Jingle/TCP.total_size_kb
...
}
The error says that this object cannot be found. I've tried escaping with a backslash, but it did not work.
If anybody has got some idea, I'd really appreciate it!
Run
head(dataset)
and look at the name the column has been given. Perhaps it is something like:
dataset$data_transfer.Jingle.TCP.total_size_kb
Two ways:
dataset[["data_transfer.Jingle/TCP.total_size_kb"]]
or
dataset$`data_transfer.Jingle/TCP.total_size_kb`
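
Both answers can be checked on a small example. This is a sketch with made-up values; only the column name is taken from the question. By default read.table rewrites the header so that the slash becomes a dot, and check.names = FALSE keeps the original name, which then needs [[ ]] or backticks.

```r
# Tab-separated sample; values and the second column are invented.
txt <- "data_transfer.Jingle/TCP.total_size_kb\tother\n10\t1\n20\t2"

# Default: header names are passed through make.names(), so "/" becomes ".".
mangled <- read.table(text = txt, header = TRUE, sep = "\t")
print(names(mangled))

# check.names = FALSE keeps the slash in the name.
kept <- read.table(text = txt, header = TRUE, sep = "\t", check.names = FALSE)
a <- kept[["data_transfer.Jingle/TCP.total_size_kb"]]
b <- kept$`data_transfer.Jingle/TCP.total_size_kb`
print(a)
```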

Rows being dropped in R with read.table?

I am loading a table in which the first column is a URL and reading it into R using read.table().
It seems that R is dropping about 1/3 of the rows and does not return any errors.
The URLs do not contain any # characters or tabs (my separator field), which I understand could be an issue. If I convert the URLs to integer IDs first, the problem goes away.
Is there something about the field that might be causing R to drop the rows?
Without a sample of the data, it's hard to say. But one small "gotcha" is that # is the default comment.char in read.table(). Try to set comment.char = "" and see if that fixes it.
Thanks for all your help,
Yes, so initially there were some hashes and I was able to handle them using comment.char = ''. The problem turned out to be that some of my URLs contained ' and " characters. The strangest thing about the situation is that it didn't return any errors. After I removed these characters using tr, I had no issues with loading the data.
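
As an alternative to stripping the characters with tr, the quotes can probably be left in place by turning off quote handling as well. A small sketch with made-up URLs (the example.* hosts and the second column are assumptions):

```r
# Made-up tab-separated rows whose URLs contain ', # and " characters.
txt <- paste("http://a.example/it's\t1",
             "http://b.example/page#top\t2",
             "http://c.example/say\"hi\t3",
             sep = "\n")

# quote = "" reads ' and " literally; comment.char = "" does the same for #.
d <- read.table(text = txt, sep = "\t", quote = "", comment.char = "",
                stringsAsFactors = FALSE)
print(nrow(d))
```

Comparing nrow(d) with the number of lines in the file is a quick way to notice silently merged rows.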
