Remove characters from a dataset - r

I have a dataset as follows,
[1] "21/12/16, 14:25:10: abcd
[2] "21/12/16, 14:25:14: 1234
[3] "21/12/16, 14:25:22: XXX
[4] "21/12/16, 14:25:30: YYY
[5] "21/12/16, 14:25:47: ZZZ
Date variable has all the dates in the above dataset as,
> head(date)
[1] "21/12/16" "21/12/16" "21/12/16" "21/12/16" "21/12/16"
Time variable has all times from the dataset as,
> head(time)
[1] "14:25" "14:25" "14:25" "14:25" "14:25"
Now I want the dataset to be modified as,
[1] abcd
[2] 1234
[3] XXX
[4] YYY
[5] ZZZ
How can we do this? I tried gsub, but it did not work. Can someone help me out here?

You aren't completely precise as to the expected behavior, but for the dataset that you've supplied, splitting on ":" and taking the fourth element of each resulting vector (then trimming its leading space) will get the desired result. You should think about the use case and whether you can rely on that working in general, however. For example: will there always be exactly three colons before the string you want? Will the string you want never contain a colon?
Also, I think you're missing a closing quote mark in your rows.
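The split-and-take-the-fourth-piece idea can be sketched in base R as follows. The vector name x and the third input (a message that itself contains a colon) are made up for illustration. The second variant is one possible way to cope with colons inside the wanted text, assuming the prefix always contains exactly three colons.

```r
# Sample lines in the question's format; the last one is a made-up case
# whose message itself contains a colon.
x <- c("21/12/16, 14:25:10: abcd",
       "21/12/16, 14:25:14: 1234",
       "21/12/16, 14:25:22: ab:cd")

# Variant 1: split on ":" and keep the fourth piece, trimming the leading space.
pieces <- strsplit(x, ":", fixed = TRUE)
fourth <- trimws(vapply(pieces, function(p) p[4], character(1)))
fourth
# [1] "abcd" "1234" "ab"

# Variant 2: strip only the first three colon-terminated fields,
# so any later colon stays part of the message.
rest <- sub("^([^:]*:){3}[[:space:]]*", "", x)
rest
# [1] "abcd" "1234" "ab:cd"
```

Variant 1 silently truncates the third message, which is exactly the risk raised above; variant 2 keeps it whole.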

readLines(con = textConnection("21/12/16, 14:25:10: abcd
21/12/16, 14:25:14: 1234
21/12/16, 14:25:22: XXX
21/12/16, 14:25:30: YYY
21/12/16, 14:25:47: ZZZ")) -> text_file_lines
text_file_lines
## [1] "21/12/16, 14:25:10: abcd" "21/12/16, 14:25:14: 1234"
## [3] "21/12/16, 14:25:22: XXX" "21/12/16, 14:25:30: YYY"
## [5] "21/12/16, 14:25:47: ZZZ"
# built-in
# somewhat forgiving regex replace
sub("^[[:digit:]]+/[[:digit:]]+/[[:digit:]]+,[[:space:]]+[[:digit:]]+:[[:digit:]]+:[[:digit:]]+:[[:space:]]", "", text_file_lines)
## [1] "abcd" "1234" "XXX" "YYY" "ZZZ"
# external pkg
# this matches from last : onward and extracts the bits you want
stringi::stri_match_last_regex(text_file_lines, ": ([[:print:]]+)$")[,2]
## [1] "abcd" "1234" "XXX" "YYY" "ZZZ"

Related

How to extract text from a column using R

How would I go about extracting, for each row (there are ~56,000 records in an Excel file) in a specific column, only part of a string? I need to keep all text to the left of the last '/' forward slash. The challenge is that not all cells have the same number of '/'. There is always a filename (*.wav) at the end of the last '/', but the number of characters in the filename is not always the same (sometimes 5 and sometimes 6).
Below are some examples of the strings in the cells:
cloch/51.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav
grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav
AB_AeolinaL/025-C#.wav
AB_AeolinaL/026-D.wav
AB_violadamourL/rel99999/091-G.wav
AB_violadamourL/rel99999/092-G#.wav
AB_violadamourR/024-C.wav
AB_violadamourR/025-C#.wav
The extracted text should be:
cloch
grand/Grand_bombarde/02-suchy_Grand_bombarde
grand/Grand_bombarde/02-suchy_Grand_bombarde
AB_AeolinaL
AB_AeolinaL
AB_violadamourL/rel99999
AB_violadamourL/rel99999
AB_violadamourR
AB_violadamourR
Can anyone recommend a strategy using R?
You can use the str_remove(string, pattern) function from the stringr package, like:
library(stringr)
str = "grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav"
str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(str,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
Then you can just iterate over all other strings:
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
Output:
> str_remove(strings,"/[0-9]+[-]*[A-Z]*[#]*[.][a-z]+")
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
You can take a substring of each string using this method:
substr(strings,1,regexpr("\\/[^\\/]*$", strings)-1)
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"
Input
strings<-c("cloch/51.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav","grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav","AB_AeolinaL/025-C#.wav","AB_AeolinaL/026-D.wav","AB_violadamourL/rel99999/091-G.wav","AB_violadamourL/rel99999/092-G#.wav","AB_violadamourR/024-C.wav","AB_violadamourR/025-C#.wav")
In which this regex regexpr("\\/[^\\/]*$", strings) gives you the position of the last "/"
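To make the role of the position explicit, here is a small sketch of the same regexpr/substr idea. The sample vector x, including the made-up entry with no slash at all, is an assumption for illustration. regexpr returns -1 when nothing matches, so the sketch keeps such strings unchanged instead of passing a negative stop to substr.

```r
# Two sample paths; the second has no "/" and is a made-up edge case.
x <- c("AB_violadamourL/rel99999/091-G.wav", "51.wav")

# Position of the last "/" (a "/" followed only by non-slash characters
# up to the end of the string); -1 means no slash was found.
pos <- as.integer(regexpr("/[^/]*$", x))
pos
# [1] 25 -1

# Keep everything before the last slash; leave slash-free strings as they are.
dirs <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
dirs
# [1] "AB_violadamourL/rel99999" "51.wav"
```

Whether a slash-free string should be kept, emptied, or set to NA depends on the data, so treat the ifelse fallback as a placeholder choice.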
Assuming that the strings you propose are in a column of a dataframe:
df <- data.frame(x = 1:5, y = c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav"))
# I define a function that separates a string at each "/"
# throws the last piece and reattaches the pieces
cut_str <- function(s) {
st <- head((unlist(strsplit(s, "\\/"))), -1)
r <- paste(st, collapse = "/")
return(r)
}
# through the sapply function I get the desired result
new_strings <- as.vector(sapply(df$y, FUN = cut_str))
new_strings
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
You could use
dirname(strings)
If there is no /, this returns ., which you could remove afterwards if you like, e.g.:
res <- dirname(strings)
res[res=="."] <- ""
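Since the original data sits in a column, the dirname approach might be applied to a data frame as sketched below. The data frame df and its column names path, dir, and file are hypothetical, and basename is added only to show that the file name can be kept alongside the directory part.

```r
# Hypothetical data frame standing in for the imported Excel column.
df <- data.frame(
  path = c("cloch/51.wav",
           "AB_violadamourL/rel99999/092-G#.wav",
           "51.wav"),
  stringsAsFactors = FALSE
)

# Directory part: everything to the left of the last "/".
df$dir <- dirname(df$path)
# dirname() reports "." when there is no "/"; blank those out.
df$dir[df$dir == "."] <- ""

# File part, kept for reference.
df$file <- basename(df$path)

df$dir
# [1] "cloch" "AB_violadamourL/rel99999" ""
```

Both functions are vectorized, so no explicit loop or sapply is needed even for tens of thousands of rows.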
You could start the match with / followed by one or more characters that are neither a forward slash nor a whitespace character, using the negated character class [^\\s/]+
Then match .wav at the end of the string, anchored with $
Finally, replace the match with an empty string, using sub for example.
[^\\s/]+\\.wav$
strings <- c("cloch/51.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/038-D.wav",
"grand/Grand_bombarde/02-suchy_Grand_bombarde/039-D#.wav",
"AB_AeolinaL/025-C#.wav",
"AB_AeolinaL/026-D.wav",
"AB_violadamourL/rel99999/091-G.wav",
"AB_violadamourL/rel99999/092-G#.wav",
"AB_violadamourR/024-C.wav",
"AB_violadamourR/025-C#.wav")
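# perl = TRUE so that \\s inside the brackets is read as the whitespace class;
# R's default engine does not treat shorthand classes that way inside [...]
sub("/[^\\s/]+\\.wav$", "", strings, perl = TRUE)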
Output
[1] "cloch"
[2] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[3] "grand/Grand_bombarde/02-suchy_Grand_bombarde"
[4] "AB_AeolinaL"
[5] "AB_AeolinaL"
[6] "AB_violadamourL/rel99999"
[7] "AB_violadamourL/rel99999"
[8] "AB_violadamourR"
[9] "AB_violadamourR"

grep and regex for R phone numbers

I would like to get the phone numbers from a file. I know the numbers come in different forms, but I don't know how to code for each form using grep and regexpr in R. The numbers are written in these forms:
xxx-xxx-xxxx ,
(xxx)xxx-xxxx,
xxx xxx xxxx,
xxx.xxx.xxxx
Try this:
phones <- c("foo 111-111-1111 bar" , "(111)111-1111 quux", "who knows 111 111 1111", "111.111.1111 I do", "111)111-1111 should not work", "1111111111 ditto", "a 111-111-1111 b (222)222-2222 c")
re <- gregexpr("(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}", phones)
regmatches(phones, re)
# [[1]]
# [1] "111-111-1111"
# [[2]]
# [1] "(111)111-1111"
# [[3]]
# [1] "111 111 1111"
# [[4]]
# [1] "111.111.1111"
# [[5]]
# character(0)
# [[6]]
# character(0)
# [[7]]
# [1] "111-111-1111" "(222)222-2222"
In the data, I provide a few examples with other text on both, either, and neither side, as well as two examples that should not match. (That is: a starter "test set", as you want to make sure you both match good examples and no-match bad examples.) The last one hopes to match multiple numbers in one string/sentence.
gregexpr and regmatches are useful for finding and extracting or replacing regex-substrings within 1+ strings. For a "replace" example, one could do:
regmatches(phones, re) <- "GONE!"
phones
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
Obviously a contrived replacement, but certainly usable. Note though that the assignment form regmatches(...) <- value works by side effect: it modifies the phones vector in place instead of returning the new value. You can call the replacement function directly to get the result back without modifying phones, though it is a little less intuitive:
phones # I reset it to the original value
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
`regmatches<-`(phones, re, value = "GONE!")
# [1] "foo GONE! bar" "GONE! quux"
# [3] "who knows GONE!" "GONE! I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a GONE! b GONE! c"
phones
# [1] "foo 111-111-1111 bar" "(111)111-1111 quux"
# [3] "who knows 111 111 1111" "111.111.1111 I do"
# [5] "111)111-1111 should not work" "1111111111 ditto"
# [7] "a 111-111-1111 b (222)222-2222 c"
Edit: scope-creep.
out <- unlist(Filter(length, regmatches(phones, re)))
out
# [1] "111-111-1111" "(111)111-1111" "111 111 1111" "111.111.1111" "111-111-1111"
# [6] "(222)222-2222"
gsub("[^0-9]", "", out)
# [1] "1111111111" "1111111111" "1111111111" "1111111111" "1111111111" "2222222222"
out <- gsub("[^0-9]", "", out)
sprintf("(%s)%s-%s", substr(out, 1, 3), substr(out, 4, 6), substr(out, 7, 10))
# [1] "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111" "(111)111-1111"
# [6] "(222)222-2222"
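The extract, strip, and reformat steps above could be bundled into one helper, sketched below. The function name extract_phones and the sample input are made up for illustration. The pattern is the same one used in the answer.

```r
# Hypothetical helper: pull every phone number out of a character vector
# and return them all in the "(xxx)xxx-xxxx" layout.
extract_phones <- function(x) {
  re <- "(\\(\\d{3}\\)|\\d{3}[-. ])\\d{3}[-. ]\\d{4}"
  hits <- unlist(regmatches(x, gregexpr(re, x)))  # all matches, flattened
  digits <- gsub("[^0-9]", "", hits)              # keep the ten digits only
  sprintf("(%s)%s-%s",
          substr(digits, 1, 3), substr(digits, 4, 6), substr(digits, 7, 10))
}

found <- extract_phones(c("foo 111-111-1111 bar",
                          "a 333.333.3333 b (222)222-2222 c",
                          "1111111111 ditto"))
found
# [1] "(111)111-1111" "(333)333-3333" "(222)222-2222"
```

Strings without a match contribute nothing to the result, so the output no longer lines up one-to-one with the input. Keep the list returned by regmatches instead if that correspondence matters.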

Replace similar columns with some numerical values

I have dataframe like this:
Hashed_User_Id
[1] f2de2b4a6011a1ab52d3aefbc9b8a4103d7574f4
[2] 88cb5d85c41abb7ad99595ceb7c2fc98409dd4dc
[3] 25313021517412ce58072d798ccea29ba5d2f427
[4] f2de2b4a6011a1ab52d3aefbc9b8a4103d7574f4
[5] 88cb5d85c41abb7ad99595ceb7c2fc98409dd4dc
[6] 25313021517412ce58072d798ccea29ba5d2f427
I want to replace these hashed values by numeric values keeping same number for same values, something like this:
Hashed_User_Id
[1] 1
[2] 2
[3] 3
[4] 1
[5] 2
[6] 3
How can I achieve this?
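As Ronak suggested,
as.integer(as.factor(Hashed_User_Id))
Identical hashes get identical numbers, but the numbers follow the sorted order of the factor levels rather than the order of first appearance, so for the sample above this returns 3 2 1 3 2 1 rather than 1 2 3 1 2 3.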
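If the numbers should follow the order in which each hash first appears, as in the desired output of the question, one option is match against unique. This is a sketch that assumes Hashed_User_Id is a plain character vector. The variable names ids and ids2 are made up.

```r
Hashed_User_Id <- c("f2de2b4a6011a1ab52d3aefbc9b8a4103d7574f4",
                    "88cb5d85c41abb7ad99595ceb7c2fc98409dd4dc",
                    "25313021517412ce58072d798ccea29ba5d2f427",
                    "f2de2b4a6011a1ab52d3aefbc9b8a4103d7574f4",
                    "88cb5d85c41abb7ad99595ceb7c2fc98409dd4dc",
                    "25313021517412ce58072d798ccea29ba5d2f427")

# Number each hash by the position of its first occurrence.
ids <- match(Hashed_User_Id, unique(Hashed_User_Id))
ids
# [1] 1 2 3 1 2 3

# Same result through a factor whose levels are in order of appearance.
ids2 <- as.integer(factor(Hashed_User_Id, levels = unique(Hashed_User_Id)))
```

Either way, equal hashes map to equal integers. The only difference from a plain as.factor conversion is which integer each hash receives.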

Finding just taxonomic authority for a species using taxize in R

I'm using gnr_resolve in taxize (v. 0.7.0) to find the taxonomic authority (author and date) for a list of species. By setting canonical=FALSE I can get the record including the author and date, but is there a way to return just the taxonomic authority?
gnr_resolve("Anguina tritici", data_source_ids=11, canonical=FALSE)
submitted_name matched_name data_source_title score
1 Anguina tritici Anguina tritici (Steinbuch, 1799) GBIF Backbone Taxonomy 0.988
So in this case I would only want (Steinbuch, 1799).
Using your examples in the original question and comments:
library('taxize')
x <- c(gnr_resolve("Anguina tritici", data_source_ids=11, canonical=FALSE)$matched_name,
gnr_resolve("Contracaecum ogcocephali", canonical=FALSE)$matched_name)
# [1] "Anguina tritici (Steinbuch, 1799)" "Contracaecum ogcocephali"
# [3] "Contracaecum ogcocephali" "Contracaecum ogcocephali"
# [5] "Contracaecum ogcocephali" "Contracaecum ogcocephali"
# [7] "Contracaecum ogcocephali" "Contracaecum ogcocephali"
# [9] "Contracaecum ogcocephali" "Contracaecum ogcocephali"
# [11] "Contracaecum ogcocephali Olsen 1952" "Contracaecum ogcocephali Olsen 1952"
It looks like you can use a regex to extract the pattern "lastname followed by optional comma followed by 4-digit year"
gsub('(\\w+,?\\s+\\d{4})|.', '\\1', x)
# [1] "Steinbuch, 1799" "" "" "" ""
# [6] "" "" "" "" ""
# [11] "Olsen 1952" "Olsen 1952"
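where (\\w+,?\\s+\\d{4})|. says: save to the first capture group (\\1) one or more word characters, \\w+, followed by an optional comma, ,?, followed by one or more whitespace characters, \\s+, followed by exactly four digits, \\d{4}. The alternative after the |, a single ., matches any other character; the capture group is empty for those matches, so replacing with \\1 deletes them and leaves only the authority.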
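An alternative sketch uses regexpr and regmatches to extract the same "name, optional comma, four-digit year" pattern and returns NA where no authority is present. The input vector is hard-coded from the matched names shown above, so no call to gnr_resolve is made here. The variable names m and authority are made up.

```r
# Matched names copied from the output above (no network call needed).
x <- c("Anguina tritici (Steinbuch, 1799)",
       "Contracaecum ogcocephali",
       "Contracaecum ogcocephali Olsen 1952")

# First occurrence of "word, optional comma, whitespace, four digits".
m <- regexpr("\\w+,?\\s+\\d{4}", x)

# regmatches() returns only the strings that matched, so fill them
# into an NA vector at the matching positions.
authority <- rep(NA_character_, length(x))
authority[m != -1] <- regmatches(x, m)
authority
# [1] "Steinbuch, 1799" NA "Olsen 1952"
```

Like the gsub version, this captures only the last word before the year, so a multi-word or multi-author authority would come back truncated.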

R: order a vector of strings with both character and numeric values both alphabetically and numerically

I have a vector of strings that contain both character and numeric values. For example:
a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480","ILLUMINA:420:C2D7UACXX:1:1102:14592:3881","ILLUMINA:420:C2D7UACXX:1:1102:14592:37103","ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
I'd like to order the vector so that the characters are sorted alphabetically and the numbers numerically. The structure of the strings is always of the format:
"ILLUMINA:420:C2D7UACXX:1:<number>:<number>:<number>", so actually the order only applies to the last three colon separated numbers.
I did try mixedsort {gtools} but the result was the same as using sort and sort.int, which is:
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
Clearly the right order should be:
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
Is there any immediate solution?
EDIT: solution completely changed after the OP's clarification.
You can extract the last three fields into a data.frame:
dat = read.table(text=sub('.*:1:([0-9]+):([0-9]+):([0-9]+)','\\1|\\2|\\3',a),sep='|')
dat
V1 V2 V3
1 1102 14591 91480
2 1102 14592 3881
3 1102 14592 37103
4 1102 14592 37356
Then you order using 3 columns:
a[with(dat,order(V1,V2,V3))]
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
gtools::mixedsort does work in your case, actually:
> a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37103",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
>
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480"
[2] "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
I am using gtools_3.4.2 and R-3.2.0
Here's a faster solution:
library(data.table)
fields.list = strsplit(a,split=":")
sort.dt = data.table(t(sapply(fields.list,function(x) as.numeric(c(x[5],x[6],x[7])))))
# index the original vector a with the ordering of the three numeric fields
sorted.a = a[with(sort.dt,order(V1,V2,V3))]
> sorted.a
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
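For completeness, the same split-then-order idea can be sketched in base R without any extra package. The helper name field_num is made up, and the input is deliberately shuffled so that the effect of the ordering is visible. It assumes every string has at least seven colon-separated fields.

```r
# Shuffled sample input in the question's format.
a <- c("ILLUMINA:420:C2D7UACXX:1:1102:14592:37356",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
       "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103")

# Split once, then pull the i-th field of every string as a number.
fields <- strsplit(a, ":", fixed = TRUE)
field_num <- function(i) as.numeric(vapply(fields, `[`, character(1), i))

# Order by the 5th, 6th, and 7th fields, compared numerically.
sorted_a <- a[order(field_num(5), field_num(6), field_num(7))]
sorted_a
# [1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480"
# [2] "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
# [3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
# [4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
```

Because the comparison is numeric, 3881 sorts before 37103, which a plain character sort gets wrong.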

Resources