I am pretty new to R, about 3 months in. When I was trying to run a regression, R gave me this error: Error: unexpected input in "reg1 <- lm(_". The variable I use has an underscore in its name, and so do some others. I don't know whether R supports underscores in a regression formula, as this is the first time I have had a variable with an underscore in its name. If it doesn't, how can I change the name?
As good practice, always begin variable/column names with letters (this isn't strictly the rule, and you can technically start with a period, but it will save hassle). When dealing with data imported into R with predefined column names (or with data frames in general), you can rename a column of the data frame df as follows:
names(df)[names(df) == 'OldName'] <- 'NewName'
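For example, a minimal sketch of that rename applied to the regression in the question (the data frame df and the column name _x1 here are hypothetical):
# hypothetical data: one column whose name starts with an underscore
df <- data.frame(`_x1` = rnorm(10), y = rnorm(10), check.names = FALSE)
names(df)[names(df) == '_x1'] <- 'x1'   # rename to a syntactic name
reg1 <- lm(y ~ x1, data = df)           # the formula now parses without backticks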
If you really need to, you can protect 'illegal' names with back-quotes (although I agree with other answers/comments that this is not good practice ...)
dd <- data.frame(`_y`=rnorm(10), x = 1:10, check.names=FALSE)
names(dd)
## [1] "_y" "x"
lm(`_y` ~ x, data = dd)
Please look at the column name "if" in the second column; the difference is that with check.names=FALSE the "." beside "if" disappears.
Sorry about the code: I tried to type some code to reproduce this data frame as in the picture, but I failed because of the "if". We know that "if" is a reserved word in R (like else, for, while, function), and here I deliberately used "if" as the column name (the 2nd column) to see whether R would do something unusual.
So, as a workaround, I typed the "if" in Excel and saved the file as CSV in order to use read.csv.
Question is:
Why does "if." change to "if" after I use check.names=FALSE?
?read.csv describes check.names= as follows:
check.names: logical. If 'TRUE' then the names of the variables in the
data frame are checked to ensure that they are syntactically
valid variable names. If necessary they are adjusted (by
'make.names') so that they are, and also to ensure that there
are no duplicates.
The default action is to allow you to do something like dat$<column-name>, but unfortunately dat$if will fail with Error: unexpected 'if' in "dat$if", ergo check.names=TRUE changes it to something that the parser will not trip over. Note, though, that dat[["if"]] will work even when dat$if will not.
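A short sketch of the difference (the extra x column here is made up):
dat <- read.csv(text = "x,if\n1,2", check.names = FALSE)
names(dat)
## [1] "x"  "if"
dat[["if"]]   # double brackets work where dat$if is a parse error
## [1] 2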
If you are wondering if check.names=FALSE is ever a bad thing, then imagine this:
dat <- read.csv(text = "a,a\n2,3")
dat
# a a.1
# 1 2 3
dat <- read.csv(text = "a,a\n2,3", check.names = FALSE)
dat
# a a
# 1 2 3
In the second case, how does one access the second column by name? dat$a returns 2 only. However, if you don't want to use $ or [[, you can match on the names instead: dat[, colnames(dat) == "a"] does return both of them.
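If you end up in the second situation but still want $ access, one option (a sketch continuing from the dat above) is to deduplicate the names after the fact with make.unique:
names(dat) <- make.unique(names(dat))   # same "a.1" suffix check.names would have produced
names(dat)
## [1] "a"   "a.1"
dat$a.1
## [1] 3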
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set; however, the variable names are extraordinarily messy, e.g. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column header has an equally messy name as this example, and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name-manipulation solution in R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name to be spelled out and assigned to a new variable. Spelling out the entire column name is tedious and impractical, and the special characters seem to break R's functions anyway.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()
Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
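If instead you just want to assign fresh names by column number and ignore the messy originals entirely, a minimal sketch (the replacement names here are made up):
names(d) <- paste0("col", seq_along(d))   # col1, col2, ... purely by position
names(d)[1] <- "value"                    # or rename a single column by its index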
I'm trying to read data from a *.txt or *.csv file into R with read.table or read.csv. However, my data is written as e.g. 1.4523e-9 in the file, denoting 1.4523*10^{-9}, and it ends up treated as a string instead of a number (ggplot treats it as discrete). Is there some sort of eval()-like function to convert it to its correct value?
Depending on the exact format of the csv file you import, read.csv and read.table often simply convert all columns to factors. Since a straightforward conversion to numeric has failed, I assume this is your problem. You can change this using the colClasses argument, as such:
# if every column should be numeric:
df <- read.csv("foobar.csv", colClasses = "numeric")
#if only some columns should be numeric, use a vector.
#to read the first as factor and the second as numeric:
read.csv("foobar.csv", colClasses = c("factor", "numeric")
Of course, both of the above are barebones examples; you probably want to supply other arguments as well, e.g. header = TRUE.
If you don't want to supply the classes of each column when you read the table (maybe you don't know them yet!), you can convert after the fact using either of the following:
df$a <- as.numeric(as.character(df$a)) #as you already discovered
df$a <- as.numeric(levels(df$a)[df$a])
Yes, these are both clunky, but they are standard and frequently recommended.
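As a quick sanity check that the scientific notation itself is not the problem (the value is the one from the question):
x <- factor("1.4523e-9")      # what an old read.csv default can give you
as.numeric(x)                 # wrong: this is just the underlying level code
## [1] 1
as.numeric(as.character(x))   # right: the scientific notation parses fine
## [1] 1.4523e-09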
I have a dataset and unfortunately some of the column labels in my dataframe contain signs (- or +). This doesn't seem to bother the dataframe, but when I try to plot this with qplot it throws me an error:
x <- 1:5
y <- x
names <- c("1+", "2-")
mydf <- data.frame(x, y)
colnames(mydf) <- names
mydf
qplot(1+, 2-, data = mydf)
and if I enclose the column names in quotes it will just give me a category (or something to that effect, it'll give me a plot of "1+" vs. "2-" with one point in the middle).
Is it possible to do this easily? I looked at aes_string but didn't quite understand it (at least not enough to get it to work).
Thanks in advance.
P.S. I have searched for a solution online but can't quite find anything that helps me with this (it could be due to some aspect I don't understand), so I reason it might be because this is a completely ill-advised naming scheme I have :p.
Since you have non-standard column names, you need to use backticks (`) in your column references.
For example:
mydf$`1+`
[1] 1 2 3 4 5
So, your qplot() call should look like this:
qplot(`1+`, `2-`, data = mydf)
You can find more information in ?Quotes and ?names
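If you'd rather not sprinkle backticks through your plotting code, here is a sketch using the .data pronoun instead (this assumes a reasonably recent version of ggplot2):
library(ggplot2)
ggplot(mydf, aes(.data[["1+"]], .data[["2-"]])) + geom_point()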
As said in the other answer, you have a problem because you don't have standard names. One solution that avoids the backtick notation is to convert the column names to a standard form. Another motivation to convert the names to regular ones is that you can't use backticks in a lattice plot, for example. Using gsub you can do this:
gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',c("1+", "2-","a--"))
[1] "a1" "a2" "aa"
Hence, applying this to your example :
colnames(mydf) <- gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',colnames(mydf))
qplot(a1,a2,data = mydf)
EDIT
You can use make.names with the option unique = TRUE:
make.names(c("10+", "20-", "10-", "a30++"),unique=T)
[1] "X10." "X20." "X10..1" "a30.."
If you don't like R's naming rules, here is a custom version using gsubfn:
library(gsubfn)
gsubfn("[+|-]|^[0-9]+",
function(x) switch(x,'+'= 'a','-' ='b',paste('x',x,sep='')),
c("10+", "20-", "10-", "a30++"))
"x10a" "x20b" "x10b" "a30aa" ## note x10b looks better than X10..1
I am creating a simple data frame like this:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579")
My understanding is that the column names should be "2D6" and "3A4", but they are actually "X2D6" and "X3A4". Why are the X's being added and how do I make that stop?
I do not recommend working with column names starting with numbers, but if you insist, use the check.names=FALSE argument of data.frame:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579",
check.names=FALSE)
qcCtrl
2D6 3A4
1 DNS00012345 DNS000013579
One of the reasons I caution against this is that the $ operator becomes trickier to work with. For example, the following fails with an error:
> qcCtrl$2D6
Error: unexpected numeric constant in "qcCtrl$2"
To get round this, you have to enclose your column name in back-ticks whenever you work with it:
> qcCtrl$`2D6`
[1] DNS00012345
Levels: DNS00012345
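Double-bracket indexing also works and avoids the backticks entirely; with a factor column (the pre-R 4.0 default) the output looks like this:
> qcCtrl[["2D6"]]
[1] DNS00012345
Levels: DNS00012345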
The X is being added because R does not like having a number as the first character of a column name, so make.names() prepends an X to produce a syntactically valid name. To turn this off, pass check.names = FALSE to data.frame, as shown above; converting the names with as.character() will not help, because the adjustment happens when the data frame is constructed.