How can I grep all the gene names starting with "Gm" from data1[,7]?
I tried data2[grep("^Gm",data2$Genes),], but it extracts every entire row whose string starts with "Gm".
data1[,7] <-
[1] "Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5"
[2] "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5"
[3] "Arhgap15,Gm22867"
One option is to split each string on , with strsplit(), and then use grep to keep the words that begin with "Gm". The output of strsplit is a list, so lapply can be used to loop over it. (^ anchors the match to the beginning of the string.)
lapply(strsplit(Genes, ','), function(x) grep('^Gm', x, value=TRUE))
#[[1]]
#[1] "Gm23940"
#[[2]]
#[1] "Gm5852" "Gm5773" "Gm9116" "Gm9117"
#[[3]]
#[1] "Gm22867"
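If the goal is to keep one result per row (for example, to store it next to data1[,7]), the matches from each element can be pasted back into a single comma-separated string. This is only a sketch under that assumption; the name gm_only is made up for illustration.

```r
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
           "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
           "Arhgap15,Gm22867")

# Split each element on commas, keep only the pieces that start with "Gm",
# then collapse the kept pieces back into one string per element.
gm_only <- vapply(
  strsplit(Genes, ",", fixed = TRUE),
  function(x) paste(grep("^Gm", x, value = TRUE), collapse = ","),
  character(1)
)

# gm_only has the same length as Genes, so it lines up row by row:
# element 1: "Gm23940"
# element 2: "Gm5852,Gm5773,Gm9116,Gm9117"
# element 3: "Gm22867"
```

Because the result has one entry per input string, it could be assigned to a new column of the original data frame.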
Or you could extract the words with stri_extract_all_regex from stringi
library(stringi)
stri_extract_all_regex(Genes, 'Gm[[:alnum:]]+')
Or, if you need it as a vector, you can use unlist on the above output, or use gsub to replace the words that don't begin with "Gm" (\\b(?!Gm)\\w+\\b) and the , separators with spaces, then read the result with scan.
scan(text=gsub('\\b(?!Gm)\\w+\\b|,', ' ',
Genes, perl=TRUE), what='', quiet=TRUE)
#[1] "Gm23940" "Gm5852" "Gm5773" "Gm9116" "Gm9117" "Gm22867"
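For anyone who prefers to stay in base R, a rough equivalent of the stringi call can be sketched with gregexpr and regmatches. The \\b word boundary is an addition of this sketch (it is not in the patterns above) so that "Gm" is only matched at the start of a word; gm_list and gm_vec are illustrative names.

```r
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
           "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
           "Arhgap15,Gm22867")

# gregexpr() locates every match in each element;
# regmatches() pulls the matched text out as a list (one entry per element).
gm_list <- regmatches(Genes, gregexpr("\\bGm[[:alnum:]]+", Genes, perl = TRUE))

# Flatten the list into a single character vector.
gm_vec <- unlist(gm_list)
# gm_vec holds: Gm23940, Gm5852, Gm5773, Gm9116, Gm9117, Gm22867
```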
Update
If you need to remove all the words starting with Gm
scan(text=gsub('\\bGm\\w+\\b|,', ' ', Genes, perl=TRUE),
what='', quiet=TRUE)
# [1] "Ighmbp2" "Mrpl21" "Cpt1a" "Mtl5" "Gal" "Ppp6r3"
# [7] "Lrp5" "Tdpoz4" "Tdpoz3" "Tdpoz5" "Arhgap15"
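The same removal can also be sketched without a regex replacement, by splitting first and asking grep for the non-matching words via invert = TRUE. The name non_gm is only for illustration.

```r
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
           "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
           "Arhgap15,Gm22867")

# invert = TRUE returns the elements that do NOT match the pattern.
non_gm <- unlist(lapply(
  strsplit(Genes, ",", fixed = TRUE),
  function(x) grep("^Gm", x, value = TRUE, invert = TRUE)
))
# non_gm holds the 11 names that do not start with "Gm",
# in their original order (Ighmbp2 first, Arhgap15 last).
```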
data
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
"Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
"Arhgap15,Gm22867")
Related
I need a way to clean up my vectors of strings, which contain characters and symbols,
for example
[1]c("hiv3=0", "comdiab=0", "ppl=0")
[2]c("fxet3=1", "hiv3=0", "ppl=0")
[3]c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1")
[4]c("escol4=0", "alcool=0", "ppl=0", "tipores3=1")
The intended string will produce
[1]"hiv3=0,comdiab=0, ppl=0"
[2]"fxet3=1, hiv3=0, ppl=0"
[3]"fxet3=1, escol4=0, alcool=0, tipores3=1"
[4]"escol4=0, alcool=0, ppl=0, tipores3=1"
Any solution is acceptable; so far I have tried using the gsub function.
A regex solution would also be very welcome.
Based on the post, it seems to be a list of vectors. We can use paste to collapse each vector in the list into a single string:
sapply(lst1, paste, collapse=", ")
#[1] "hiv3=0, comdiab=0, ppl=0"
#[2] "fxet3=1, hiv3=0, ppl=0"
#[3] "fxet3=1, escol4=0, alcool=0, tipores3=1"
#[4] "escol4=0, alcool=0, ppl=0, tipores3=1"
Alternatively, toString gives the same result, since it pastes with collapse = ", ":
sapply(lst1, toString)
data
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"), c("fxet3=1", "hiv3=0",
"ppl=0"), c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
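As a small variation on the two calls above, vapply can be used in place of sapply so that the return type is declared up front. This is just a sketch; joined and same are illustrative names.

```r
lst1 <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
             c("fxet3=1", "hiv3=0", "ppl=0"),
             c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
             c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))

# vapply() insists that every result is a single string (character(1)),
# so an unexpected element raises an error instead of changing the output type.
joined <- vapply(lst1, toString, character(1))

# Check that this matches the paste(..., collapse = ", ") version.
same <- identical(joined, sapply(lst1, paste, collapse = ", "))
```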
tidyverse answer
library(tidyverse)
my_strings <- list(c("hiv3=0", "comdiab=0", "ppl=0"),
c("fxet3=1", "hiv3=0", "ppl=0"),
c("fxet3=1", "escol4=0", "alcool=0", "tipores3=1"),
c("escol4=0", "alcool=0", "ppl=0", "tipores3=1"))
map_chr(.x = my_strings, .f = str_c, collapse = " ")
# [1] "hiv3=0 comdiab=0 ppl=0"
# [2] "fxet3=1 hiv3=0 ppl=0"
# [3] "fxet3=1 escol4=0 alcool=0 tipores3=1"
# [4] "escol4=0 alcool=0 ppl=0 tipores3=1"
Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because the split pattern includes \\d, that digit is consumed by the match and dropped from the output. Use a lookbehind for the digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
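To make the difference concrete, here is a small sketch that runs the lookbehind split and, as an alternative not mentioned above, pulls the two parts out with capture groups in sub. The names parts, code, and label are made up for illustration.

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# Lookbehind: the split point is the space itself; the digit before it is
# only checked, not consumed, so it stays in the first piece.
parts <- strsplit(sample.text, split = "(?<=\\d) ", perl = TRUE)

# Capture groups: group 1 is the leading run of digits, group 2 is the rest.
pattern <- "^([0-9]+)\\s+(.*)$"
code  <- sub(pattern, "\\1", sample.text)
label <- sub(pattern, "\\2", sample.text)
# code  holds "111110" and "0116"
# label holds "Soybean Farming" and "Soybeans"
```

The capture-group version returns two parallel character vectors rather than a list, which may be more convenient when the pieces are headed for data frame columns.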
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
I am having problems with accessing factors in R. I have a data frame column stored as a factor of tuples:
test1
#[1] (34.0467, -118.2470) (34.0637, -118.2440) (34.0438, -118.2547)
#[4] (34.0523, -118.2676) (34.0584, -118.2810) (34.0583, -118.2616)
#39497 Levels: (0, 0) (0.0000, 0.0000) ... (34.6837, -118.1853)
How do I access just the first number of each tuple?
thanks!
dput(test1)
...
"(34.3256, -118.4307)", "(34.3256, -118.4798)", "(34.3256, -118.5033)",
"(34.3257, -118.4244)", "(34.3258, -118.4343)", "(34.3262, -118.4104)",
"(34.3262, -118.4112)", "(34.3266, -118.4234)", "(34.3266, -118.4269)",
"(34.3266, -118.4323)", "(34.3269, -118.4278)", "(34.3272, -118.4365)",
"(34.3273, -118.4342)", "(34.3274, -118.4321)", "(34.3274, -118.4331)",
"(34.3275, -118.4247)", "(34.3275, -118.4298)", "(34.3276, -118.4115)",
"(34.3277, -118.4071)", "(34.3285, -118.4266)", "(34.3286, -118.4277)",
"(34.3287, -118.4286)", "(34.3292, -118.5048)", "(34.3293, -118.4246)",
"(34.3298, -118.4300)", "(34.3327, -118.5062)", "(34.3374, -118.5042)",
"(34.3760, -118.5254)", "(34.3767, -118.5263)", "(34.3775, -118.5270)",
"(34.3805, -118.5293)", "(34.4638, -118.1995)", "(34.5095, -117.9273)",
"(34.5304, -118.1418)", "(34.5453, -118.0405)", "(34.5650, -118.0856)",
"(34.5693, -118.0228)", "(34.5957, -118.1784)", "(34.6818, -118.0954)",
"(34.6837, -118.1853)"), class = "factor")
The beginning of the dput output is not available, so here is a small sample to work with:
test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))
First, convert the factor vector to a character vector.
test1 <- as.character(test1)
Then, remove all (s and )s, and split the strings by ,.
test1 <- gsub("\\(|\\)", "", test1)
test1 <- strsplit(test1, ",")
After that, convert the values from character to numeric.
test1 <- lapply(test1, as.numeric)
Finally, get the first coordinate of each point (change 1 to 2, if you want the second one).
test1 <- unlist(lapply(test1, '[[', 1))
Here is the output.
> test1
[1] 34.3242 34.3242 34.3243
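The steps above can also be collapsed into a single regex substitution per coordinate. This is an alternative sketch, not part of the answer above; coords, first, and second are illustrative names.

```r
test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))

# Work on the labels, not on the factor's integer codes.
coords <- as.character(test1)

# First value: everything between "(" and the comma.
first <- as.numeric(sub("^\\(([^,]+),.*$", "\\1", coords))

# Second value: everything between the comma (plus optional spaces) and ")".
second <- as.numeric(sub("^.*,\\s*([^)]+)\\)$", "\\1", coords))
# first  holds 34.3242, 34.3242, 34.3243
# second holds -118.4494, -118.4914, -118.4167
```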
Just index again
x[1][1]
x[2][1]
Try this
as.numeric(unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))[c(T,F)])
Explanation
gsub works on character vectors, so as.character(test1) first converts test1 from factor to character. Then I remove "(" and ")" like this
gsub("[\\(\\)]", "",as.character(test1))
#[1] "34.5693, -118.0228" "34.5957, -118.1784" "34.6818, -118.0954" "34.6837, -118.1853"
Next I split each string into two parts on the separator , as
strsplit(gsub("[\\(\\)]", "",as.character(test1)),",")
#[[1]]
#[1] "34.5693" " -118.0228"
#[[2]]
#[1] "34.5957" " -118.1784"
#[[3]]
#[1] "34.6818" " -118.0954"
#[[4]]
#[1] "34.6837" " -118.1853"
The previous output is a list; unlist turns it into a vector.
unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))
#[1] "34.5693" " -118.0228" "34.5957" " -118.1784" "34.6818" " -118.0954"
#[7] "34.6837" " -118.1853"
The index [c(T,F)] is recycled along the vector as an alternating sequence of TRUE and FALSE, so it selects every odd-positioned element, i.e. the first value of each pair.
Finally, I made the output numeric using as.numeric
Output
#[1] 34.5693 34.5957 34.6818 34.6837
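The recycling behaviour of a short logical index can be seen on a tiny vector. This sketch uses four of the split values shown above; firsts and seconds are illustrative names, and TRUE/FALSE are spelled out instead of T/F.

```r
parts <- c("34.5693", " -118.0228", "34.5957", " -118.1784")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE:
# positions 1 and 3 are kept.
firsts <- parts[c(TRUE, FALSE)]

# Swapping the pattern keeps positions 2 and 4 instead.
seconds <- parts[c(FALSE, TRUE)]

# as.numeric() ignores the leading space in the second values.
first_num  <- as.numeric(firsts)   # 34.5693 34.5957
second_num <- as.numeric(seconds)  # -118.0228 -118.1784
```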
Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How can I get rid of the doubled column names separated by a dot? E.g. for the first column I'd like to have SS29 instead of the repetitive SS29.SS29, for the second column PP1, and so on. Is there an automated way of doing it?
The simplest way would be to use sub to remove the substring after the dot . character.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
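A reproducible sketch on a two-column stand-in for a (values copied from the question) shows the renaming end to end. It also notes one difference between the patterns used in these answers: '\\.[^.]*' removes only the first dot-delimited piece, while "[.].*" removes everything from the first dot onward. The name "A.B.C" is a hypothetical example, not from the question.

```r
a <- data.frame(SS29.SS29 = c(669.9544, 651.2092),
                PP1.PP1   = c(1.068153, 1.164428))

# Drop everything from the first dot to the end of each name.
names(a) <- sub("[.].*", "", names(a))
new_names <- names(a)   # "SS29" "PP1"

# With more than one dot in a name the two patterns diverge:
only_first_piece <- sub("\\.[^.]*", "", "A.B.C")  # "A.C"
from_first_dot   <- sub("[.].*", "", "A.B.C")     # "A"
```

For names of the form X.X, as in the question, both patterns give the same result.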
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)
I have a data frame comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
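Here is a self-contained sketch of the same idea on the three example rows. It uses Map instead of apply so that the result is always a list, even when every row happens to yield the same number of letters (apply may simplify that case to a vector or matrix). The data frame words and its column names are assumptions for illustration.

```r
words <- data.frame(word1 = c("carpet", "bag", "dog"),
                    word2 = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

# Split a single string into its individual characters.
chars_for <- function(str) strsplit(str, "")[[1]]

# For each row: letters of word2 that do not appear in word1.
extra <- Map(function(w1, w2) setdiff(chars_for(w2), chars_for(w1)),
             words$word1, words$word2)

# Optionally store the result as one string per row.
words$extra <- unname(vapply(extra, paste, character(1), collapse = ""))
# words$extra holds "l", "fl", "i"
```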
Use regex :) Wrap each word from the first column in brackets [] to build a character class, and then use a regex replace function. The resulting pattern matches any letter listed in the brackets and replaces it with an empty string (in effect, it "removes" these letters from the second word).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
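The same character-class trick can be sketched in base R. gsub is not vectorised over its pattern argument, so this sketch pairs each pattern with its word through mapply; leftover is an illustrative name. It assumes the words contain only plain letters, since characters such as ], ^ or - have a special meaning inside brackets.

```r
x <- c("carpet", "bag", "dog")
y <- c("carplet", "flag", "smog")

# One character class per word in x: "[carpet]", "[bag]", "[dog]".
pattern <- paste0("[", x, "]")

# Apply the i-th pattern to the i-th word of y.
leftover <- mapply(function(p, s) gsub(p, "", s), pattern, y, USE.NAMES = FALSE)
# leftover holds "l", "fl", "sm"
```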
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
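The same pairing can be written with Map, which walks both lists in parallel without an explicit index. The sketch also shows one behavioural difference from setdiff: the %in% filter keeps repeated letters, while setdiff returns each letter once. The words "hello" and "he" are made-up inputs for that comparison.

```r
x <- c("carpet", "bag", "dog")
y <- c("carpelt", "flag", "dig")
sx <- strsplit(x, "")
sy <- strsplit(y, "")

# a = letters of the first word, b = letters of the second word.
extra <- Map(function(a, b) b[!b %in% a], sx, sy)
# extra is a list: "l"; "f" "l"; "i"

# Repeated letters: %in% filtering keeps them, setdiff() does not.
b <- strsplit("hello", "")[[1]]
a <- strsplit("he", "")[[1]]
kept_dups <- b[!b %in% a]   # "l" "l" "o"
no_dups   <- setdiff(b, a)  # "l" "o"
```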