one quick question
picture a data frame like
data=data.frame(x=c(1,2,3), y=c(4,5,6), Genes=c("AHS;AKS;AHS","AHS;IO","HU"))
so i want to plot x and y
plot(x,y)
and do the label for the dots like this
text(data$x+0.2,data$y+0.2,labels=data$Genes)
BUT i dont want to use all arguments from the genes column ONLY the first one (e.g. before the ";")
Can u please help me with that?
This is only an example, i have already read my data in with read.delim, so i cannot do a specific "read in" with string separation.
As per my comment, you can use gsub to do this:
gsub('^([A-Z]+);.*$', '\\1', data$Genes)
You could also use strsplit:
unlist(lapply(strsplit(data$Genes, ';'), '[', 1))
But that's yucky...
Its also probably worth mentioning the stringr package which collects a lot of these string munging functions in to a single place with predictable syntax and names.
Related
Hello there: I currently have a list of file names (100s) which are separated by multiple "/" at certain points. I would like to find the last "/" in each name and replace it with "/Old". A quick example of what I have tried:
I have managed to do it for a single file name in the list but can't seem to apply it to the whole list.
Test<- "Cars/sedan/Camry"
Then I know I tried finding the last "/" in the name I tried the following :
Last <- tail(gregexpr("/", Test)[[1]], n= 1)
str_sub(Test, Last, Last)<- "/Old"
Which gives me
Test[1] "Cars/sedan/OldCamry"
Which is exactly what I need but I am having troubling applying tail and gregexpr to my list of names so that it does it all at the same time.
Thanks for any help!
Apologies for my poor formatting still adjusting.
If your file names are in a character vector you can use str_replace() from the stringr package for this:
items <- c(
"Cars/sedan/Camry",
"Cars/sedan/XJ8",
"Cars/SUV/Cayenne"
)
stringr::str_replace(items, pattern = "([^/]+$)", replacement = "Old\\1")
[1] "Cars/sedan/OldCamry" "Cars/sedan/OldXJ8" "Cars/SUV/OldCayenne"
Keeping a stringi function as an alternative.
If your dataframe is "df" and your text is in column named "text.
library(stringi)
df %>%
mutate(new_text=stringi::stri_replace_last_fixed(text, '/', '/Old '))
I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try the str_extract method which utilizes the Stringr package.
If your data is separated by line, you can do:
str_extract("(?<=\\bsample_id=)([:digit:]+)") #this tells the extraction to target anything that is proceeded by a sample_id= and is a series of digits, the + captures all of the digits
This would extract just the numbers per line, if your data is all collected like that, it becomes a tad more difficult because you will have to tell the extraction to continue even if it has extracted something. The code would look something like this:
str_extract_all("((?<=sample_id=)\\d+)")
This code will extract all of the numbers you're looking for and the output will be a list. From there you can manipulate the list as you see fit.
This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values are continuous and each character indicates something. I am unable to figure out how to separate each value into columns and specify a heading for all these new columns which will be created.
I used this code, however it does not seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth”))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In R, we need to scape "\" that's why we write \d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the text matched, and the following columns the data captured. So, we can get rid of the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now, every column in result is of type character, you can transform each of them into other types. For example:
require(dplyr)
result <- result %>% mutate(ethnic=as.factor(ethnic),
age=as.integer(age),
smoke=as.factor(smoke),
preweight=as.numeric(preweight),
delweight=as.numeric(delweight),
breastfed=as.factor(breastfed),
brthwght=as.integer(brthwght),
brthlngth=as.numeric(brthlngth)
)
I have a dataset and unfortunately some of the column labels in my dataframe contain signs (- or +). This doesn't seem to bother the dataframe, but when I try to plot this with qplot it throws me an error:
x <- 1:5
y <- x
names <- c("1+", "2-")
mydf <- data.frame(x, y)
colnames(mydf) <- names
mydf
qplot(1+, 2-, data = mydf)
and if I enclose the column names in quotes it will just give me a category (or something to that effect, it'll give me a plot of "1+" vs. "2-" with one point in the middle).
Is it possible to do this easily? I looked at aes_string but didn't quite understand it (at least not enough to get it to work).
Thanks in advance.
P.S. I have searched for a solution online but can't quite find anything that helps me with this (it could be due to some aspect I don't understand), so I reason it might be because this is a completely retarded naming scheme I have :p.
Since you have non-standard column names, you need to to use backticks (`)in your column references.
For example:
mydf$`1+`
[1] 1 2 3 4 5
So, your qplot() call should look like this:
qplot(`1+`, `2-`, data = mydf)
You can find more information in ?Quotes and ?names
As said in the other answer you have a problem because you you don't have standard names. When solution is to avoid backticks notation is to convert colnames to a standard form. Another motivation to convert names to regular ones is , you can't use backticks in a lattice plot for example. Using gsub you can do this:
gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',c("1+", "2-","a--"))
[1] "a1" "a2" "aa"
Hence, applying this to your example :
colnames(mydf) <- gsub('(^[0-9]+)[+|-]+|[+|-]+','a\\1',colnames(mydf))
qplot(a1,a2,data = mydf)
EIDT
you can use make.names with option unique =T
make.names(c("10+", "20-", "10-", "a30++"),unique=T)
[1] "X10." "X20." "X10..1" "a30.."
If you don't like R naming rules, here a custom version with using gsubfn
library(gsubfn)
gsubfn("[+|-]|^[0-9]+",
function(x) switch(x,'+'= 'a','-' ='b',paste('x',x,sep='')),
c("10+", "20-", "10-", "a30++"))
"x10a" "x20b" "x10b" "a30aa" ## note x10b looks better than X10..1
In ggplot2, how do I refer to a variable name with spaces?
Why do qplot() and ggplot() break when used on variable names with quotes?
For example, this works:
qplot(x,y,data=a)
But this does not:
qplot("x","y",data=a)
I ask because I often have data matrices with spaces in the name. Eg, "State Income". ggplot2 needs data frames; ok, I can convert. So I'd want to try something like:
qplot("State Income","State Ideology",data=as.data.frame(a.matrix))
That fails.
Whereas in base R graphics, I'd do:
plot(a.matrix[,"State Income"],a.matrix[,"State Ideology"])
Which would work.
Any ideas?
Answer: because 'x' and 'y' are considered a length-one character vector, not a variable name. Here you discover why it is not smart to use variable names with spaces in R. Or any other programming language for that matter.
To refer to variable names with spaces, you can use either hadleys solution
a.matrix <- matrix(rep(1:10,3),ncol=3)
colnames(a.matrix) <- c("a name","another name","a third name")
qplot(`a name`, `another name`,data=as.data.frame(a.matrix)) # backticks!
or the more formal
qplot(get('a name'), get('another name'),data=as.data.frame(a.matrix))
The latter can be used in constructs where you pass the name of a variable as a string in eg a loop construct :
for (i in c("another name","a third name")){
print(qplot(get(i),get("a name"),
data=as.data.frame(a.matrix),xlab=i,ylab="a name"))
Sys.sleep(5)
}
Still, the best solution is not to use variable names with spaces.
Using get is not more "formal", actually I would argue the opposite. As the R help says (help("`")), you can almost always use a variable name that contains spaces, provided it's quoted. (Normally, with a backtick, as already suggested.)
Something similar was asked on ggplot2 mailing list and Mehmet Gültaş linked to this post. Another way of using strings to construct your ggplot call is through the aes_strings function. Note that you still have to put backticks inside the quotes for the thing to work for variables with spaces.
library(ggplot2)
names(mtcars)[1] <- "em pi dzi"
ggplot(mtcars, aes_string(x = "cyl", y = "`em pi dzi`")) +
theme_bw() +
geom_jitter()