Unexpected outcome, not replacing, in R out of a gsub function

Unexpected outcome, not replacing, in R out of a gsub function - r

As the output of a certain operation, I have the following dataframe whith 729 observations.
> head(con)
Connections
1 r_con[C3-C3,Intercept]
2 r_con[C3-C4,Intercept]
3 r_con[C3-CP1,Intercept]
4 r_con[C3-CP2,Intercept]
5 r_con[C3-CP5,Intercept]
6 r_con[C3-CP6,Intercept]
As can be seen, the pattern to be removed is everything but the pair of Electrode information, for instance, in the first observation this would be C3-C3. Now, this is my take on the issue, which I'd expect to have the dataframe with everything removed. If I'm not wrong (which probably am) the regex syntax is ok and from my understanding I believe fixed=TRUE is also necessary. However, I do not understand the R output. When I would expect the pattern to be changed by nothing ""it returns this output, which doesn't make sense to me.
> gsub("r_con\\[\\,Intercept\\]\\","",con,fixed=TRUE)
[1] "3:731"
I believe this will probably be a silly question for an expert programmer, which I am far from being, and any insight would be much appreciated.
[UPDATE WITH SOLUTION]
Thanks to Tim and Ben I realised I was using a wrong regex syntax and a wrong source, this made it to me:
con2 <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", con$Connections)

I think your problem is that you're accessing "con" in your sub call. Also, as the user above me pointed out, you probably don't want to use sub.
I'm assuming, that your data is consistent, i.e., the strings in con$Connections follow more or less the same pattern. Then, this works:
I have set up this example:
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"))
library(stringr)
f <- function(x){
part <- str_split(x, ",")[[1]][1]
str_sub(part, 7, -1)
}
f(con$Connections[1])
sapply(con$Connections, f)

The sub function doesn't work this way. One viable approach would be to capture the quantity you want, then use this capture group as the replacement:
x <- "r_con[C3-C3,Intercept]"
term <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", x)
term
[1] "C3-C3"

Related

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))

You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out

You could try the str_extract method which utilizes the Stringr package.
If your data is separated by line, you can do:
str_extract("(?<=\\bsample_id=)([:digit:]+)") #this tells the extraction to target anything that is proceeded by a sample_id= and is a series of digits, the + captures all of the digits
This would extract just the numbers per line, if your data is all collected like that, it becomes a tad more difficult because you will have to tell the extraction to continue even if it has extracted something. The code would look something like this:
str_extract_all("((?<=sample_id=)\\d+)")
This code will extract all of the numbers you're looking for and the output will be a list. From there you can manipulate the list as you see fit.

R partial string matching ignore spaces omni-directional

I am having issue with partial string matching. I have pairs of people, and I need to compare their names. To do this I have run a charmatch both directions on the two last names, to see if name1 is part of name2, and vice versa. I have a small dataset below to demonstrate the question. I use charmatch below; I have used pmatch as well and it returns the same result.
When charmatch says seeks matches for the seeks matches for the elements of its first argument among those of its second... I take that to mean it will treat each group of characters in element1 as a pattern n see if same group exists in element2. But that's obviously not what's happening, it looks like it's direction specific.
So...is it direction specific? And if so...what else can I use to do what I am describing? My EG names pun intended, what I actually run into are lots of last names where husband has his name and wife has hers + husband. I need to be able to see if husband last name exists within wife last name.
I know it can be done with regular expressions but I am not familiar with them, probably should be, but am not, so I'd prefer an answer that does not use regex.
eg_data <- data.frame(name1 = c('Jimmy Conway', 'Jimmy'),
name2 = c('Conway','Jimmy Conway'))
eg_data$share_name1 <- mapply(charmatch, eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(charmatch, eg_data$name2, eg_data$name1)
eg_data$share_name <- 0
eg_data$share_name [(eg_data$share_name1==1 | eg_data$share_name2==1)]
<- 1

Same two lines, only string detect, not charmatch.
eg_data$share_name1 <- mapply(str_detect,eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(str_detect,eg_data$name2, eg_data$name1)
OR even
eg_data$share_name1 <- ifelse(mapply(str_detect,eg_data$name1, eg_data$name2)==TRUE,1,0)
eg_data$share_name2 <- ifelse(mapply(str_detect,eg_data$name2, eg_data$name1)==TRUE,1,0)
Thanks for anyone who looked. I hope this helps others.

This could be useful
> with(eg_data, intersect(name1, name2))
[1] "Jimmy Conway"

How to grep for all-but-one matching columns in R

I am trying to subset a large data frame with my columns of interest. I do so using the grep function, this selects one column too many ("has_socio"), which I would like to remove.
The following code does exactly what I want, but I find it unpleasant to look at. I want to do it in one line. Aside from just calling the first subset inside the second subset, can it be optimized?
DF <- read.dta("./big.dta")
DF0 <- na.omit(subset(DF, select=c(other_named_vars, grep("has_",names(DF)))))
DF0 <- na.omit(subset(DF0, select=-c(has_socio)))
I know similar questions have been asked (e.g. Subsetting a dataframe in R by multiple conditions) but I do not find one that addresses this issue specifically. I recognize I could just write the grep RE more carefully, but I feel the above code more clearly expresses my intent.
Thanks.

Replace your grep with:
vec <- c("blah", "has_bacon", "has_ham", "has_socio")
grep("^has_(?!socio$)", vec, value=T, perl=T)
# [1] "has_bacon" "has_ham"
(?!...) is a negative lookahead operator, which looks ahead and makes sure that its contents do not follow the actual matching piece behind of it (has_ being the matching piece).

setdiff(grep("has_", vec, value = TRUE), "has_socio")
## [1] "has_bacon" "has_ham"

Processing files in a particular order in R

I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way is, to get rid of the rest of the name and renaming all files. For this I tried using strsplit (plyr).
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank your for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, independent of their digits.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"

Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.

I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, simply convert character to numeric:
mydb$index <- as.numeric(mydb$index)

R:how to get grep to return the match, rather than the whole string

I have what is probably a really dumb grep in R question. Apologies, because this seems like it should be so easy - I'm obviously just missing something.
I have a vector of strings, let's call it alice. Some of alice is printed out below:
T.8EFF.SP.OT1.D5.VSVOVA#4
T.8EFF.SP.OT1.D6.LISOVA#1
T.8EFF.SP.OT1.D6.LISOVA#2
T.8EFF.SP.OT1.D6.LISOVA#3
T.8EFF.SP.OT1.D6.VSVOVA#4
T.8EFF.SP.OT1.D8.VSVOVA#3
T.8EFF.SP.OT1.D8.VSVOVA#4
T.8MEM.SP#1
T.8MEM.SP#3
T.8MEM.SP.OT1.D106.VSVOVA#2
T.8MEM.SP.OT1.D45.LISOVA#1
T.8MEM.SP.OT1.D45.LISOVA#3
I'd like grep to give me the number after the D that appears in some of these strings, conditional on the string containing "LIS" and an empty string or something otherwise.
I was hoping that grep would return me the value of a capturing group rather than the whole string. Here's my R-flavoured regexp:
pattern <- (?<=\\.D)([0-9]+)(?=.LIS)
nothing too complicated. But in order to get what I'm after, rather than just using grep(pattern, alice, value = TRUE, perl = TRUE) I'm doing the following, which seems bad:
reg.out <- regexpr(
"(?<=\\.D)[0-9]+(?=.LIS)",
alice,
perl=TRUE
)
substr(alice,reg.out,reg.out + attr(reg.out,"match.length")-1)
Looking at it now it doesn't seem too ugly, but the amount of messing about it's taken to get this utterly trivial thing working has been embarrassing. Anyone any pointers about how to go about this properly?
Bonus marks for pointing me to a webpage that explains the difference between whatever I access with $,# and attr.

Try the stringr package:
library(stringr)
str_match(alice, ".*\\.D([0-9]+)\\.LIS.*")[, 2]

You can do something like this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice)
If you only want the subset of alice where your pattern matches, try this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice[grepl(pat, alice)])