I am having problems accessing factors in R. I have a data frame column that is a factor of coordinate tuples:
test1
#[1] (34.0467, -118.2470) (34.0637, -118.2440) (34.0438, -118.2547)
#[4] (34.0523, -118.2676) (34.0584, -118.2810) (34.0583, -118.2616)
#39497 Levels: (0, 0) (0.0000, 0.0000) ... (34.6837, -118.1853)
How do I access just the first number of each tuple?
thanks!
dput(test1)
...
"(34.3256, -118.4307)", "(34.3256, -118.4798)", "(34.3256, -118.5033)",
"(34.3257, -118.4244)", "(34.3258, -118.4343)", "(34.3262, -118.4104)",
"(34.3262, -118.4112)", "(34.3266, -118.4234)", "(34.3266, -118.4269)",
"(34.3266, -118.4323)", "(34.3269, -118.4278)", "(34.3272, -118.4365)",
"(34.3273, -118.4342)", "(34.3274, -118.4321)", "(34.3274, -118.4331)",
"(34.3275, -118.4247)", "(34.3275, -118.4298)", "(34.3276, -118.4115)",
"(34.3277, -118.4071)", "(34.3285, -118.4266)", "(34.3286, -118.4277)",
"(34.3287, -118.4286)", "(34.3292, -118.5048)", "(34.3293, -118.4246)",
"(34.3298, -118.4300)", "(34.3327, -118.5062)", "(34.3374, -118.5042)",
"(34.3760, -118.5254)", "(34.3767, -118.5263)", "(34.3775, -118.5270)",
"(34.3805, -118.5293)", "(34.4638, -118.1995)", "(34.5095, -117.9273)",
"(34.5304, -118.1418)", "(34.5453, -118.0405)", "(34.5650, -118.0856)",
"(34.5693, -118.0228)", "(34.5957, -118.1784)", "(34.6818, -118.0954)",
"(34.6837, -118.1853)"), class = "factor")
Can't get the beginning of that anyhow.
test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))
First, convert the factor vector to a character vector.
test1 <- as.character(test1)
Then, remove all (s and )s, and split the strings by ,.
test1 <- gsub("\\(|\\)", "", test1)
test1 <- strsplit(test1, ",")
After that, change the digits from character format to numeric format.
test1 <- lapply(test1, as.numeric)
Finally, get the first coordinate of each point (change 1 to 2, if you want the second one).
test1 <- unlist(lapply(test1, '[[', 1))
Here is the output.
> test1
[1] 34.3242 34.3242 34.3243
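For reuse, the steps above could be wrapped in a small helper. This is only a sketch: the name tuple_coord and its which argument are made up here, and it assumes every element has the "(a, b)" shape.

```r
# Hypothetical helper: pull one coordinate out of "(lat, lon)" strings.
tuple_coord <- function(f, which = 1) {
  s <- as.character(f)            # factor -> character
  s <- gsub("\\(|\\)", "", s)     # drop the parentheses
  parts <- strsplit(s, ",")       # split "lat, lon" on the comma
  # as.numeric() ignores the leading space in " -118.4494"
  vapply(parts, function(p) as.numeric(p[which]), numeric(1))
}

test1 <- factor(c("(34.3242, -118.4494)", "(34.3242, -118.4914)", "(34.3243, -118.4167)"))
lat <- tuple_coord(test1)       # first coordinate of each point
lon <- tuple_coord(test1, 2)    # second coordinate of each point
```

vapply is used instead of unlist(lapply(...)) so that a malformed element raises an error rather than silently changing the length of the result.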
Just index again
x[1][1]
x[2][1]
Try this
as.numeric(unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))[c(T,F)])
Explanation
gsub works on character vectors, so as.character(test1) first converts test1 from a factor to character. Then I remove "(" and ")" from the strings like this:
gsub("[\\(\\)]", "",as.character(test1))
#[1] "34.5693, -118.0228" "34.5957, -118.1784" "34.6818, -118.0954" "34.6837, -118.1853"
Next I split each string into two parts at the separator "," with
strsplit(gsub("[\\(\\)]", "",as.character(test1)),",")
#[[1]]
#[1] "34.5693" " -118.0228"
#[[2]]
#[1] "34.5957" " -118.1784"
#[[3]]
#[1] "34.6818" " -118.0954"
#[[4]]
#[1] "34.6837" " -118.1853"
The previous output is a list; unlist flattens it into a vector.
unlist(strsplit(gsub("[\\(\\)]", "",as.character(test1)),","))
#[1] "34.5693" " -118.0228" "34.5957" " -118.1784" "34.6818" " -118.0954"
#[7] "34.6837" " -118.1853"
[c(T,F)] is a logical index that R recycles along the whole vector, giving an alternating TRUE/FALSE pattern that keeps the first element of each pair.
Finally, I made the output numeric using as.numeric.
Output
#[1] 34.5693 34.5957 34.6818 34.6837
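The recycling of the logical index can be seen in isolation with a short vector. This is just an illustration; v is a made-up name and holds the first two pairs from the output above.

```r
v <- c("34.5693", " -118.0228", "34.5957", " -118.1784")

# The length-2 logical index is recycled to TRUE, FALSE, TRUE, FALSE
first  <- v[c(TRUE, FALSE)]   # elements 1 and 3
second <- v[c(FALSE, TRUE)]   # elements 2 and 4
```

Writing TRUE and FALSE in full is a little safer than T and F, since T and F are ordinary variables that can be reassigned.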
Related
I have a string of comma separated values that I'd like to split into several pieces based on the number of commas.
E.g.: Split the following string every 5 values or commas:
txt = "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
The result would be:
[1] 120923,120417,120416,105720,120925
[2] 120790,120792,120922,120928,120930
[3] 120918,120929,61065,120421
We could split the text on commas (',') and divide the values into groups of 5.
temp <- strsplit(txt, ",")[[1]]
split(temp, rep(seq_along(temp), each = 5, length.out = length(temp)))
#$`1`
#[1] "120923" "120417" "120416" "105720" "120925"
#$`2`
#[1] "120790" "120792" "120922" "120928" "120930"
#$`3`
#[1] "120918" "120929" "61065" "120421"
If you want each group as one concatenated string, we can use by with toString (note that toString joins the values with ", "):
as.character(by(temp, rep(seq_along(temp), each = 5,
length.out = length(temp)), toString))
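If the chunks should be joined with a bare comma, exactly as in the requested output, a variation along these lines might do. It is a sketch, not the answer's own method: grp and chunks are made-up names, and the grouping uses ceiling() instead of rep().

```r
txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
temp <- strsplit(txt, ",")[[1]]

# Group index: 1 for the first five values, 2 for the next five, and so on
grp <- ceiling(seq_along(temp) / 5)

# Paste each group back together with "," and drop the group names
chunks <- unname(vapply(split(temp, grp), paste, character(1), collapse = ","))
chunks
```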
One base R option would be to use gregexpr with the following regex pattern:
\d+(?:,\d+){0,4}
This pattern matches one number, followed greedily by zero to four further comma-separated numbers. Because the pattern is greedy, it always tries to match as many of the remaining numbers as possible.
txt <- "120923,120417,120416,105720,120925,120790,120792,120922,120928,120930,120918,120929,61065,120421"
regmatches(txt,gregexpr("\\d+(?:,\\d+){0,4}",txt))
[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
[3] "120918,120929,61065,120421"
Using str_extract_all from stringr (the {0,4} lower bound lets a single leftover value still match):
library(stringr)
str_extract_all(txt, "\\d+(,\\d+){0,4}")[[1]]
#[1] "120923,120417,120416,105720,120925" "120790,120792,120922,120928,120930"
#[3] "120918,120929,61065,120421"
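The lower bound of the {m,n} quantifier matters when the number of values is one more than a multiple of five. A small base R check, using a made-up six-value input (txt2) and perl = TRUE so the non-capturing group is interpreted by PCRE:

```r
txt2 <- "1,2,3,4,5,6"   # hypothetical input: one value left over after the first five

# Lower bound 0: the leftover "6" is matched as its own chunk
with_zero <- regmatches(txt2, gregexpr("\\d+(?:,\\d+){0,4}", txt2, perl = TRUE))[[1]]

# Lower bound 1: a lone number is never matched, so "6" is dropped
with_one <- regmatches(txt2, gregexpr("\\d+(?:,\\d+){1,4}", txt2, perl = TRUE))[[1]]
```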
I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for elements that mix numbers and letters and remove the numbers (and dots) from them. Then we define groups for collapsing: an element made of letters that directly follows another element made of letters is put in the same group as its predecessor, and every other element starts a new group. Finally we collapse all the values in the same group.
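To see how the grouping works, the intermediate vectors inside clean() can be computed on their own. This sketch only repeats lines from the function above; starts_group is a made-up name for the logical vector passed to cumsum.

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')

is_char <- grepl("[[:alpha:]]", myVec)
# TRUE where the previous element contained letters (FALSE for the first one)
prev_is_char <- c(FALSE, head(is_char, -1))
# A new group starts at every non-letter element, and at every letter
# element whose predecessor is not a letter element
starts_group <- !is_char | (is_char & !prev_is_char)
grp <- cumsum(starts_group)
grp
# 'asd'/'gkd' share group 2 and '1.3zyz'/'fva' share group 5
```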
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
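The last line does three things at once, so it may help to look at the intermediate string. A shortened sketch on the first four (already cleaned) elements; the names v, joined and merged are made up for illustration.

```r
v <- c("1.2", "asd", "gkd", "232")

joined <- paste(v, collapse = "|")
# joined is "1.2|asd|gkd|232"

# Remove a "|" only when it has a letter on both sides (lookbehind + lookahead)
merged <- gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", joined, perl = TRUE)

result <- unlist(strsplit(merged, split = "\\|"))
```

This assumes that "|" never occurs inside the data itself; otherwise another delimiter would be needed.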
I want to split a factor with 14 rows, where each row looks like "cg17205324 (Adolescence)".
I tried strsplit(), but always ended up with "cg17205324 ".
I Googled various methods to strip the trailing whitespace, but none of them worked, because it is a factor rather than a string.
Any tips?
We can use scan
scan(text=str1, what ="", quiet=TRUE)
#[1] "cg17205324" "(Adolescence)"
data
str1 <- "cg17205324 (Adolescence)"
You can try the following:
"cg17205324 (Adolescence)" -> outp
strsplit(outp, " ") # split on the single space between the two parts
[[1]]
[1] "cg17205324" "(Adolescence)"
a <- "cg17205324 (Adolescence)"
b <- strsplit(a, " ")
b
#[[1]]
#[1] "cg17205324" "(Adolescence)"
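Since the question is about a factor rather than a single string, here is a sketch of the same idea applied to a whole factor. The second row is invented for illustration, and f, ids and stage are made-up names.

```r
# Hypothetical two-row factor; only the first value comes from the question
f <- factor(c("cg17205324 (Adolescence)", "cg00000029 (Childhood)"))

# strsplit() needs character input, so convert the factor first
parts <- strsplit(as.character(f), " ", fixed = TRUE)

ids   <- vapply(parts, `[`, character(1), 1)                    # probe IDs
stage <- gsub("[()]", "", vapply(parts, `[`, character(1), 2))  # labels without parentheses
```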
How can I extract only the gene names starting with "Gm" from data1[,7]?
I tried data2[grep("^Gm",data2$Genes),], but it extracts every entire row that starts with "Gm".
data1[,7] <-
[1] "Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5"
[2] "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5"
[3] "Arhgap15,Gm22867"
One option would be to split each string on "," with strsplit and then, because the output is a list, use lapply with grep to extract the words that begin with "Gm" (^ denotes the beginning of the string).
lapply(strsplit(Genes, ','), function(x) grep('^Gm', x, value=TRUE))
#[[1]]
#[1] "Gm23940"
#[[2]]
#[1] "Gm5852" "Gm5773" "Gm9116" "Gm9117"
#[[3]]
#[1] "Gm22867"
Or you could extract the words with stri_extract_all_regex from stringi
library(stringi)
stri_extract_all_regex(Genes, 'Gm[[:alnum:]]+')
Or if you need it as a vector, you can use unlist on the above output, or use gsub to remove the words that don't begin with "Gm" (\\b(?!Gm)\\w+\\b) together with the ",", and then use scan.
scan(text=gsub('\\b(?!Gm)\\w+\\b|,', ' ',
Genes, perl=TRUE), what='', quiet=TRUE)
#[1] "Gm23940" "Gm5852" "Gm5773" "Gm9116" "Gm9117" "Gm22867"
Update
If you need to remove all the words starting with Gm
scan(text=gsub('\\bGm\\w+\\b|,', ' ', Genes, perl=TRUE),
what='', quiet=TRUE)
# [1] "Ighmbp2" "Mrpl21" "Cpt1a" "Mtl5" "Gal" "Ppp6r3"
# [7] "Lrp5" "Tdpoz4" "Tdpoz3" "Tdpoz5" "Arhgap15"
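If the result should stay aligned with the rows of the original column, one string per row, a sketch along these lines might work. gm_only is a made-up name, and Genes is repeated here so the snippet runs on its own.

```r
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
           "Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
           "Arhgap15,Gm22867")

# For each row: split on ",", keep names starting with "Gm", re-join with ","
gm_only <- vapply(strsplit(Genes, ",", fixed = TRUE),
                  function(x) paste(grep("^Gm", x, value = TRUE), collapse = ","),
                  character(1))
gm_only
```

A row without any "Gm" gene would come back as an empty string, which keeps the result the same length as the input.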
data
Genes <- c("Ighmbp2,Mrpl21,Cpt1a,Mtl5,Gal,Ppp6r3,Gm23940,Lrp5",
"Gm5852,Gm5773,Tdpoz4,Tdpoz3,Gm9116,Gm9117,Tdpoz5",
"Arhgap15,Gm22867")
I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur only in the word in the second column, e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
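A worked sketch of the same idea on the three example rows. The column names word1 and word2 are assumptions, and Map is used here in place of apply so that the two columns are passed as plain character vectors.

```r
chars_for <- function(str) strsplit(str, "")[[1]]

# Hypothetical data frame built from the example rows in the question
x <- data.frame(word1 = c("carpet", "bag", "dog"),
                word2 = c("carpelt", "flag", "dig"),
                stringsAsFactors = FALSE)

# Letters of word2 that do not occur in word1, row by row
res <- unname(Map(function(a, b) setdiff(chars_for(b), chars_for(a)),
                  x$word1, x$word2))
res
```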
Use regex :) Wrap each word in brackets [] to build a character class, then use a regex replace function. The resulting regex matches any letter listed in the brackets and replaces it with an empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
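The same character-class trick can be sketched in base R without stringi. It assumes the words contain only letters: a "]", "^" or "-" inside a word would change the meaning of the bracket expression. out is a made-up name.

```r
x <- c("carpet", "bag", "dog")
y <- c("carplet", "flag", "smog")

pattern <- paste0("[", x, "]")   # "[carpet]" "[bag]" "[dog]"

# gsub() is not vectorised over the pattern, so pair patterns and words with mapply()
out <- mapply(function(p, s) gsub(p, "", s), pattern, y, USE.NAMES = FALSE)
out
```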
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
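One difference between the %in% approach and setdiff is worth knowing: %in% keeps repeated letters, while setdiff returns each letter once. A small illustration with made-up words:

```r
a <- strsplit("he", "")[[1]]
b <- strsplit("hello", "")[[1]]

kept_dups <- b[!b %in% a]   # every occurrence that is not in a
uniq_only <- setdiff(b, a)  # each such letter only once
```

Which behaviour is wanted depends on whether a doubled letter in the second word should be reported twice.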