I have a dataset and unfortunately some of the column labels in my dataframe contain signs (- or +). This doesn't seem to bother the dataframe, but when I try to plot this with qplot it throws me an error:
x <- 1:5
y <- x
names <- c("1+", "2-")
mydf <- data.frame(x, y)
colnames(mydf) <- names
mydf
qplot(1+, 2-, data = mydf)
and if I enclose the column names in quotes it will just give me a category (or something to that effect, it'll give me a plot of "1+" vs. "2-" with one point in the middle).
Is it possible to do this easily? I looked at aes_string but didn't quite understand it (at least not enough to get it to work).
Thanks in advance.
P.S. I have searched for a solution online but can't quite find anything that helps me with this (it could be due to some aspect I don't understand), so I reason it might be because this is a completely retarded naming scheme I have :p.
Since you have non-standard column names, you need to to use backticks (`)in your column references.
For example:
mydf$`1+`
[1] 1 2 3 4 5
So, your qplot() call should look like this:
qplot(`1+`, `2-`, data = mydf)
You can find more information in ?Quotes and ?names
As said in the other answer you have a problem because you you don't have standard names. When solution is to avoid backticks notation is to convert colnames to a standard form. Another motivation to convert names to regular ones is , you can't use backticks in a lattice plot for example. Using gsub you can do this:
gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',c("1+", "2-","a--"))
[1] "a1" "a2" "aa"
Hence, applying this to your example :
colnames(mydf) <- gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',colnames(mydf))
qplot(a1,a2,data = mydf)
EIDT
you can use make.names with option unique =T
make.names(c("10+", "20-", "10-", "a30++"),unique=T)
[1] "X10." "X20." "X10..1" "a30.."
If you don't like R naming rules, here a custom version with using gsubfn
library(gsubfn)
gsubfn("[+|-]|^[0-9]+",
function(x) switch(x,'+'= 'a','-' ='b',paste('x',x,sep='')),
c("10+", "20-", "10-", "a30++"))
"x10a" "x20b" "x10b" "a30aa" ## note x10b looks better than X10..1
Related
I am pretty new to R, about 3 months in, and when I was trying to run a regression R shot me this error, Error: unexpected input in "reg1 <- lm(_" the variable I use has an underscore, and some other variables too, I didn't know if R support underscore in a regression or not as thats the first time I had a variable with an underscore in it's name. If it doesn't, how can I change to name?
As good practice, always begin variable/column names with letters (although not explicitly the rule, and you can technically start with the period symbol, this will save hassle). When dealing with data imported into R with predefined column names (or just when dealing with dataframes in general) you can rename columns in the dataframe df as follows
names(df)[names(df) == 'OldName'] <- 'NewName'
If you really need to, you can protect 'illegal' names with back-quotes (although I agree with other answers/comments that this is not good practice ...)
dd <- data.frame(`_y`=rnorm(10), x = 1:10, check.names=FALSE)
names(dd)
## [1] "_y" "x"
lm(`_y` ~ x, data = dd)
I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try the str_extract method which utilizes the Stringr package.
If your data is separated by line, you can do:
str_extract("(?<=\\bsample_id=)([:digit:]+)") #this tells the extraction to target anything that is proceeded by a sample_id= and is a series of digits, the + captures all of the digits
This would extract just the numbers per line, if your data is all collected like that, it becomes a tad more difficult because you will have to tell the extraction to continue even if it has extracted something. The code would look something like this:
str_extract_all("((?<=sample_id=)\\d+)")
This code will extract all of the numbers you're looking for and the output will be a list. From there you can manipulate the list as you see fit.
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyways - I'm looking to do quite a bit of manipulation on this data set, however the variable names are extraordinarily messy: ie. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" - you get my gist. Each column head has an equally messy name as this example and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution via R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical and plus the special characters seem to break R's functions anyways.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()
Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
## X Y Z
## 1 0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
## X.Y.Z
## 1 0
I am trying to subset a large data frame with my columns of interest. I do so using the grep function, this selects one column too many ("has_socio"), which I would like to remove.
The following code does exactly what I want, but I find it unpleasant to look at. I want to do it in one line. Aside from just calling the first subset inside the second subset, can it be optimized?
DF <- read.dta("./big.dta")
DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))
I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions) but I do not find one that addresses this issue specifically. I recognize I could just write the grep RE more carefully, but I feel the above code more clearly expresses my intent.
Thanks.
Replace your grep with:
vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value=T, perl=T)
# [1] "has_bacon" "has_ham"
(?!...) is a negative lookahead operator, which looks ahead and makes sure that its contents do not follow the actual matching piece behind of it (has_ being the matching piece).
setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"
I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way is, to get rid of the rest of the name and renaming all files. For this I tried using strsplit (plyr).
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank your for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, independent of their digits.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, simply convert character to numeric:
mydb$index <- as.numeric(mydb$index)