Remove all text before colon - r
I have a file containing a certain number of lines. Each line looks like this:
TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1
I would like to remove everything before the ":" character in order to retain only PKMYT1, which is a gene name.
Since I'm not an expert in regex scripting, can anyone help me do this using Unix (sed or awk) or in R?
Here are two ways of doing it in R:
foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
# Remove all before and up to ":":
gsub(".*:","",foo)
# Extract everything after ":":
regmatches(foo,gregexpr("(?<=:).*",foo,perl=TRUE))
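For a whole file, the same gsub call can be applied to a vector of lines; a minimal sketch, assuming the lines live in a file (the name genes.txt is hypothetical):
lines <- readLines("genes.txt")   # hypothetical file name; use your own path
genes <- gsub(".*:", "", lines)   # strip everything up to the last ":" on each line
head(genes)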
A simple regular expression used with gsub():
x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
gsub(".*:", "", x)
"PKMYT1"
See ?regex or ?gsub for more help.
There are certainly more than 2 ways in R. Here's another.
unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2))
If the string has a constant length, I imagine substr would be faster than this or the regex methods.
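If the colon position varies, you can still avoid regular expressions by combining substr with a fixed-string search; a small sketch, not part of the original answer (the second value below is made up for illustration):
foo <- c("TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1",
         "another_list/params.adj:TP53")
# position of the first ":" in each string, then take everything after it
substr(foo, regexpr(":", foo, fixed = TRUE) + 1L, nchar(foo))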
Using sed:
sed 's/.*://' < your_input_file > output_file
This will replace anything followed by a colon with nothing, so it'll remove everything up to and including the last colon on each line (because * is greedy by default).
As per Josh O'Brien's comment, if you wanted to only replace up to and including the first colon, do this:
sed "s/[^:]*://"
That will match anything that isn't a colon, followed by one colon, and replace with nothing.
Note that for both of these patterns they'll stop on the first match on each line. If you want to make a replace happen for every match on a line, add the 'g' (global) option to the end of the command.
Also note that on Linux (but not on OS X, where -i requires an argument) you can edit a file in place with -i, e.g.:
sed -i 's/.*://' your_file
Solution using str_remove from the stringr package:
str_remove("TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1", ".*:")
[1] "PKMYT1"
You can use awk like this:
awk -F: '{print $2}' /your/file
A very simple move I missed in the top answer by @Sacha Epskamp was to use the same gsub call, in this case to keep everything before the ":" (instead of removing it):
foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
# 1st, as in that answer, remove everything before and up to ":":
gsub(".*:","",foo)
# 2nd, to keep everything before and up to ":":
gsub(":.*","",foo)
Basically the same thing; just move the ":" to the other side of the pattern in the gsub call. Hope it helps.
If you have GNU coreutils available, use cut:
cut -d: -f2 infile
I was working on a similar issue. John's and Josh O'Brien's advice did the trick. I started with this tibble:
library(dplyr)
my_tibble <- tibble(Col1=c("ABC:Content","BCDE:MoreContent","FG:Content:with:colons"))
It looks like:
  | Col1
1 | ABC:Content
2 | BCDE:MoreContent
3 | FG:Content:with:colons
I needed to create this tibble:
  | Col1                   | Col2 | Col3
1 | ABC:Content            | ABC  | Content
2 | BCDE:MoreContent       | BCDE | MoreContent
3 | FG:Content:with:colons | FG   | Content:with:colons
And did so with this code (R version 3.4.2):
my_tibble2 <- mutate(my_tibble,
                     Col2 = unlist(lapply(strsplit(Col1, ':', fixed = TRUE), '[', 1)),
                     Col3 = gsub("^[^:]*:", "", Col1))
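As an aside, tidyr's separate() can do the same split in one step; a hedged sketch, not part of the original answer, assuming the colon-separated format above:
library(tidyr)
my_tibble2 <- separate(my_tibble, Col1, into = c("Col2", "Col3"),
                       sep = ":", extra = "merge", remove = FALSE)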
Below are 2 equivalent solutions:
The first uses Perl's -a autosplit feature to split each line into fields on :, populate the @F fields array, and print the 2nd field, $F[1] (fields are counted starting from 0):
perl -F: -lane 'print $F[1]' file
The second uses a regular expression substitution, s///, to replace ^.*: (from the beginning of the line, any characters up to and including the last colon) with nothing:
perl -pe 's/^.*://' file
Related
Extract all text after last occurrence of a special character
I have the string in R
BLCU142-09|Apodemia_mejicanus
and I would like to get the result Apodemia_mejicanus. Using the stringr R package, I have tried
str_replace_all("BLCU142-09|Apodemia_mejicanus", "[[A-Z0-9|-]]", "")
# [1] "podemia_mejicanus"
which is almost what I need, except that the A is missing.
You can use
sub(".*\\|", "", x)
This will remove all text up to and including the last pipe char.
Details:
.* - any zero or more chars, as many as possible
\| - a | char (| is a special regex metacharacter, the alternation operator, so it must be escaped, and since string literals in R can contain string escape sequences, the backslash itself is doubled).
R example:
x <- c("BLCU142-09|Apodemia_mejicanus", "a|b|c|BLCU142-09|Apodemia_mejicanus")
sub(".*\\|", "", x)
## => [1] "Apodemia_mejicanus" "Apodemia_mejicanus"
We can match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by |, in str_remove to remove that substring:
str1 <- "BLCU142-09|Apodemia_mejicanus"
library(stringr)
str_remove(str1, "^[^|]+\\|")
# [1] "Apodemia_mejicanus"
If we also use [A-Z] to match, it will match the uppercase letters and replace them with blank (""), as in the OP's str_replace_all.
You can always choose to _extract rather than _remove:
s <- "BLCU142-09|Apodemia_mejicanus"
stringr::str_extract(s, "[[:alpha:]_]+$")
## [1] "Apodemia_mejicanus"
Depending on how permissive you want to be, you could also use [[:alpha:]]+_[[:alpha:]]+ as your target.
I would keep it simple:
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
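For example, with the string from the question:
my_string <- "BLCU142-09|Apodemia_mejicanus"
substring(my_string, regexpr("|", my_string, fixed = TRUE) + 1L)
# [1] "Apodemia_mejicanus"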
How to remove multiple commas but keep one in between two values in a csv file?
I have a csv file with millions of records like below:
1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,
I have to remove the extra commas between two values and keep only one. The output for the sample input should look like:
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
How can I achieve this using shell, so it can be automated for the other files too? I need to load this data into a database. Can we do it using R?
sed method:
sed -e "s/,\+/,/g" -e "s/,$//" input_file > output_file
This turns multiple commas into a single comma and also removes a trailing comma on each line.
Edited to address the modified question. R solution. The original solution provided was just processing text. Assuming that your rows are in a structure, you can handle multiple rows with:
# Create data
Row1 = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,"
Row2 = "2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,"
Rows = c(Row1, Row2)
CleanedRows = gsub(",+", ",", Rows)          # Compress multiple commas
CleanedRows = sub(",\\s*$", "", CleanedRows) # Remove final comma, if any
[1] "1,a,4,456,3455" "2,b,5,567,4566"
But if you are trying to read this from a csv and compress the rows:
## Create sample data
Data = read.csv(text="1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,
2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,", header=FALSE)
Your code would probably say
Data = read.csv("YourFile.csv", header=FALSE)
Data = Data[which(!is.na(Data[1,]))]
Data
  V1 V8 V18 V27  V38
1  1  a   4 456 3455
2  2  b   5 567 4566
Note: this assumes that the non-blank fields are in the same place in every row.
Use tr -s:
echo 'a,,,,,,,,b,,,,,,,,,,c' | tr -s ','
Output:
a,b,c
If the input line has trailing commas, tr -s ',' would squeeze those trailing commas into one comma; getting rid of that last one requires adding a little sed code: tr -s ',' | sed 's/,$//'.
Speed. Tests on a 10,000,000-line test file consisting of the first line in the OP example, repeated:
3 seconds. tr -s ',' (but leaves a trailing comma)
9 seconds. tr -s ',' | sed 's/,$//'
30 seconds. sed -e "s/,\+/,/g" -e "s/,$//" (Jean-François Fabre's answer.)
If you have a file that's really a CSV file, it might have quoting of commas in a few different ways, which can make regex-based CSV parsing unhappy.
I generally use and recommend csvkit, which has a nice set of CSV parsing utilities for the shell. Docs at http://csvkit.readthedocs.io/en/latest/
Your exact issue is answered in csvkit with this set of commands. First, csvstat shows what the file looks like:
$ csvstat -H --max tmp.csv | grep -v None
  1. column1: 2
 11. column11: c
 27. column27: 6
 42. column42: 567
 63. column63: 4656
Then, now that you know that all of the data is in those columns, you can run this:
$ csvcut -c 1,11,27,42,63 tmp.csv
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
to get your desired answer.
Can we do it using R?
Provided your input is as shown, i.e., you want to skip the same columns in all rows, you can analyze the first line and then define column classes in read.table:
text <- "1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,"
tmp <- read.table(text = text, nrows = 1, sep = ",")
colClasses <- sapply(tmp, class)
colClasses[is.na(unlist(tmp))] <- "NULL"
Here I assume there are no actual NA values in the first line. If there could be, you'd need to adjust it slightly.
read.table(text = text, sep = ",", colClasses = colClasses)
#  V1 V11 V27 V42  V63
#1  1   a   4 456 3455
#2  1   b   5 467 3445
#3  2   c   6 567 4656
Obviously, you'd specify a file instead of text. This solution is fairly efficient for smallish to moderately sized data. For large data, substitute the second read.table with fread from package data.table (but that applies regardless of the skipping-columns problem).
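A minimal sketch of the fread substitution mentioned above (assuming the data sit in a file named, hypothetically, YourFile.csv):
library(data.table)
keep <- which(colClasses != "NULL")                 # indices of the non-empty columns found earlier
DT <- fread("YourFile.csv", header = FALSE, select = keep)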
Regex lines with exactly 4 semicolons
I want to filter lines with exactly 4 semicolons in them; lines with more or fewer semicolons should not be processed. I'm using regex/grep.
POSITIVE example: VES_I.MG;A;97;13;1
NEGATIVE example: VES_I.MG;A;97;13;1;2
For something this straightforward, I would actually just suggest counting the semicolons and subsetting based on that numeric vector. A fast way to do this is with stri_count_* from the "stringi" package:
library(stringi)
v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2") ## An example vector
stri_count_fixed(v, ";") ## How many semicolons?
# [1] 4 5
v[stri_count_fixed(v, ";") == 4] ## Just keep when count == 4
# [1] "VES_I.MG;A;97;13;1"
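If stringi isn't available, the same count can be done in base R by deleting the semicolons and comparing string lengths; a small sketch, not from the original answer:
v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2")
n <- nchar(v) - nchar(gsub(";", "", v, fixed = TRUE))  # number of ";" in each element
v[n == 4]
# [1] "VES_I.MG;A;97;13;1"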
^(?=([^;]*;){4}[^;]*$).*$
You can try this with grep -P if you have support for it. See the demo: http://regex101.com/r/lZ5mN8/22
[EDIT: Fixed stupid bug...]
The following will work with grep or any regex engine:
^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*$
When used on a command line, make sure you put it inside quotes (" on Windows; either kind on *nix) so that special characters aren't interpreted by the shell.
If you have awk available, you can also try:
awk -F';' 'NF==5' file
Just replace the 5 with n + 1, where n is your target count, e.g. the 4 in your question.
You don't need to use lookaheads, and you also don't need to enable the perl=TRUE parameter.
> v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2")
> grep("^(?:[^;]*;){4}[^;]*$", v)
[1] 1
> grep("^(?:[^;]*;){4}[^;]*$", v, value=TRUE)
[1] "VES_I.MG;A;97;13;1"
To match exactly four semicolons in a line, grep using the regex ^([^;]*;){4}[^;]*$:
grep -P "^([^;]*;){4}[^;]*$" ./input.txt
This could be done without regular expressions by using count.fields. The first line gives the counts, the second line reads in the file and keeps only the lines with 5 fields, and the final line parses those fields out and converts them to a data frame with 5 columns.
cnt <- count.fields("myfile.dat", sep = ";")
L <- readLines("myfile.dat")[cnt == 5]
read.table(text = L, sep = ";")
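To try the same idea without an intermediate file, count.fields also accepts a text connection; a sketch using the example strings from the question:
v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2")
cnt <- count.fields(textConnection(v), sep = ";")
read.table(text = v[cnt == 5], sep = ";")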
Split first column into two and preserve rest in unix
I need to split the first column, delimited by '#', into two columns. My data is in the following format:
1#b,a
2#b,a
5#c,d
Required output:
1,b,a
2,b,a
5,c,d
Other columns can have # in their values, so I want to apply the regex only to the first column. Thanks, Jitendra
A file (file.orig) contains this:
1#b,a #
2#b,a #
5#c,d #
Use sed:
sed 's/#/,/1' file.orig > file.new
Output (cat file.new):
1,b,a #
2,b,a #
5,c,d #
You didn't say where the data is. I'll assume it's in a file.
tr '#' ',' < some_file.txt
awk -F, '{OFS=","}{gsub(/#/,",",$1);}1' your_file
R grep pattern regex with brackets
I have a problem with grep in R:
patterns= c("AB_(1)","AB_(2)")
text= c("AB_(1)","DDD","CC")
grep(patterns[1],text)
> integer(0)
????
The grep command has a problem with the "()" brackets. Is there any as.XX(patterns[1]) that I can use?
You need to escape with a double backslash:
> patterns= c("AB_\\(1\\)","AB_(2)")
> text= c("AB_(1)","DDD","CC")
>
> grep(patterns[1],text)
[1] 1
If you don't need any special pattern-matching behaviour (as is the case in the example shown in the question, where the pattern is meant as a literal string), then use fixed=TRUE:
grep(patterns[1], text, fixed = TRUE)
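For example, with the vectors from the question:
patterns <- c("AB_(1)", "AB_(2)")
text <- c("AB_(1)", "DDD", "CC")
grep(patterns[1], text, fixed = TRUE)
# [1] 1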