merge partial matched strings - r

I am struggling to combine partially matched strings from two files.
File 1 contains a list of unique strings. These strings partially match a number of strings in File 2. How do I merge the rows in file 1 with file 2 for every matched case?
File1
mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660
File2
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
Desired output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
I have tried using pmatch() in R, but can't get it right. It looks like something perl would handle?
Maybe something like this:
perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1

Here is a brief Perl solution, which saves all the data from file1 in a hash and then retrieves it as file2 is scanned.
use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: identifier (before the underscore) => suffix,
# e.g. "mmu-miR-677-5p" => "MIMAT0017239"
my %file1 = do {
    open my $fh, '<', $files[0];
    map /([^_]+)_(\S+)/, <$fh>;
};

# Scan file2, look up each line's identifier, and print the matching
# file1 entry alongside the original file2 line
open my $fh, '<', $files[1];
while (<$fh>) {
    my ($key) = /([^_]+)/;
    printf "%-32s%s", "${key}_$file1{$key}", $_;
}
output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC

Of course you may do it in R. Indeed, pmatch() on whole strings won't give you the desired result - you've got to match the appropriate substrings.
I assume that in file 1 the first identifier is 677 and not 667; otherwise it's hard to guess the matching scheme (I assume your example is only part of a bigger database).
file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))
file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))
library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")
cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
## file1 file2
## [1,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"
## [2,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT"
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC"
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
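For completeness, the same matching can be done in base R without stringi; a minimal sketch, assuming the file1 and file2 vectors defined above:
## Base-R sketch: everything before the first "_" is the key
file1_id <- sub("_.*$", "", file1)
file2_id <- sub("_.*$", "", file2)
cbind(file1 = file1[match(file2_id, file1_id)], file2 = file2)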

You can use agrep() for fuzzy matching. You should play with the maximum distance; here I am fixing it manually at 11.
Basically I am doing this to extract the line numbers in file2 that match each string in file1:
sapply(file1,agrep,file2,max=11)
$`mmu-miR-677-5p_MIMAT0017239`
[1] 1 2 3
$`mmu-miR-181a-1-3p_MIMAT0000660`
[1] 4 5
To get the result as a data.frame:
do.call(rbind,
        lapply(file1,
               function(x)
                 data.frame(file1 = x,
                            file2 = agrep(x, file2, max = 11, value = TRUE))))
file1 file2
1 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
2 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
3 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
4 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
5 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
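If you also want to save the merged table to disk, a small sketch (the file name "merged.txt" and the tab separator are assumptions):
## Assign the merged data frame from above, then write it tab-separated
res <- do.call(rbind,
               lapply(file1,
                      function(x)
                        data.frame(file1 = x,
                                   file2 = agrep(x, file2, max = 11, value = TRUE))))
write.table(res, "merged.txt", sep = "\t", quote = FALSE, row.names = FALSE)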

Related

R's paste function equivalent in BASH?

I'm trying to translate this piece of R code into bash:
year <- 2010
emmissions <- paste("Emmision table from ",year, sep="")
em_table <- read.table(emmissions, head=F, as.is=T)
for (i in 1:nrow(em_table)) {
print(em_table[i,1])
}
But I can't figure out how to translate the paste function to concatenate a string with a variable. The expected outcome would be this script translated into bash.
You could perhaps use echo "Emission table from ${year}" inside your for loop, something like below:
for var in 2001 2003 2004;
do
echo "Emission table ${var}"
done
Updated on OP's request:
For sequence generation in bash, one can do: for i in {1..5} for a sequence from 1 to 5, or {0..20..4} for a step size of 4 over a sequence from 0 to 20.
Assuming a column-like structure, for example a table with two columns saved in a text file (table.txt), one can do this:
col1 col2
1 2
3 4
5 6
while read col1 col2
do
echo "Col1 : $col1"
echo "Col2 : $col2"
done < table.txt
There already is an accepted answer, but here is an awk solution.
Tested with just one file, named "test2020".
for var in 2020;
do
awk '{print $1}' "test${var}"
done
Output:
1
2
3
To read both columns, the awk line would be
awk '{print $1, $2}' "test${var}"
or, since the OP says the files only have two columns,
awk '{print $0}' "test${var}"
The file contents are:
1 a
2 b
3 c

R: Read in .csv file and convert into multiple column data frame

I am new to R and currently having plenty of trouble just reading in a .csv file and converting it into a data.frame with 7 columns. Here is what I am doing:
gene_symbols_table <- as.data.frame(read.csv(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
                                             header=TRUE, sep=","))
After that I am getting a data.frame with dim = 46761 x 1, but I need it to be 46761 x 7. I tried the following stackoverflow threads:
How can you read a CSV file in R with different number of columns
read.delim() - errors "more columns than column names" and "header and 'col.names' are of different lengths"
Split a column of a data frame to multiple columns
But somehow nothing is working in my case.
Here is how the table looks:
> head(gene_symbols_table, 3)
input.reason.matches.organism.name.primaryIdentifier.symbol.briefDescription.c
lass.secondaryIdentifier
1 WBGene00008675 MATCH 1 Caenorhabditis elegans
WBGene00008675 irld-26 Gene F11A5.7
2 WBGene00008676 MATCH 1 Caenorhabditis elegans
WBGene00008676 oac-15 Gene F11A5.8
3 WBGene00008677 MATCH 1 Caenorhabditis elegans
WBGene00008677 Gene F11A5.9
The .csv file in Excel looks like this:
input | reason | matches | organism.name | primaryIdentifier | symbol |
briefDescription
WBGene00008675 | MATCH | 1 | Caenorhabditis elegans | WBGene00008675 | irld-26 | ...
...
The following code:
gene_symbols_table <- read.table(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
                                 header=FALSE, sep=",",
                                 col.names = paste0("V", seq_len(7)), fill = TRUE)
It seems to be working; however, when I look at dim I can see right away that it is wrong: 20124 x 7. Then:
V1
1input;reason;matches;organism.name;primaryIdentifier;symbol;briefDescription;class;secondaryIdentifier
2 WBGene00008675;MATCH;1;Caenorhabditis
elegans;WBGene00008675;irld-26;;Gene;F11A5.7
3 WBGene00008676;MATCH;1;Caenorhabditis
elegans;WBGene00008676;oac-15;;Gene;F11A5.8
V2 V3 V4 V5
1
2
3
1
So, it is wrong.
Other attempts at read.table give me the error specified in the second stackoverflow thread.
I have also tried splitting the data.frame with one column into 7 columns, but so far no success.
The sep seems to be a space or semicolon, not a comma, judging from what the table looks like. So either try specifying that explicitly (a sketch follows below), or you could try fread from the data.table package, which automatically detects the separator.
library(data.table)
gene_symbols_table <- as.data.frame(fread(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
                                          header=TRUE))
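For the explicit-separator option, a minimal base-R sketch, assuming the file really is semicolon-separated as the printed output above suggests:
## A sketch of the explicit-separator option; sep=";" is an assumption based
## on the single-column output shown earlier.
gene_symbols_table <- read.table(file="/home/nikita/Desktop/CElegans_raw_data/gene_symbols_matching.csv",
                                 header=TRUE, sep=";", stringsAsFactors=FALSE)
dim(gene_symbols_table)   # check that the column count is what you expect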

How to remove multiple commas but keep one in between two values in a csv file?

I have a csv file with millions of records like below
1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,
I have to remove the extra commas between two values and keep only one. The output for the sample input should look like
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
How can I achieve this using the shell, since that automates it for the other files too? I need to load this data into a database. Can we do it using R?
sed method:
sed -e "s/,\+/,/g" -e "s/,$//" input_file > output_file
This turns multiple commas into a single comma and also removes the last comma on each line.
Edited to address modified question.
R solution.
The original solution provided was just processing text. Assuming that your rows are stored as character strings, you can handle multiple rows with:
# Create Data
Row1 = "1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,"
Row2 = "2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,"
Rows = c(Row1, Row2)
CleanedRows = gsub(",+", ",", Rows) # Compress multiple commas
CleanedRows = sub(",\\s*$", "", CleanedRows) # Remove final comma if any
[1] "1,a,4,456,3455" "2,b,5,567,4566"
But if you are trying to read this from a csv and compress the rows,
## Create sample data
Data =read.csv(text="1,,,,,,,a,,,,,,,,,,4,,,,,,,,,456,,,,,,,,,,,3455,,,,,,,
2,,,,,,,b,,,,,,,,,,5,,,,,,,,,567,,,,,,,,,,,4566,,,,,,,",
header=FALSE)
Your code would probably say
Data = read.csv("YourFile.csv", header=FALSE)
Data = Data[which(!is.na(Data[1,]))]
Data
V1 V8 V18 V27 V38
1 1 a 4 456 3455
2 2 b 5 567 4566
Note: This assumes that the non-blank fields are in the same place in every row.
Use tr -s:
echo 'a,,,,,,,,b,,,,,,,,,,c' | tr -s ','
Output:
a,b,c
If the input line has trailing commas, tr -s ',' would squeeze those trailing commas into one comma, but to get rid of that one requires adding a little sed code: tr -s ',' | sed 's/,$//'.
Speed. Tests on a 10,000,000 line test file consisting of the first line in the OP example, repeated.
3 seconds. tr -s ',' (but leaves trailing comma)
9 seconds. tr -s ',' | sed 's/,$//'
30 seconds. sed -e "s/,\+/,/g" -e "s/,$//" (Jean-François Fabre's answer.)
If you have a file that's really a CSV file, it might have quoting of commas in a few different ways, which can make regex-based CSV parsing unhappy.
I generally use and recommend csvkit which has a nice set of CSV parsing utilities for the shell. Docs at http://csvkit.readthedocs.io/en/latest/
Your exact issue is answered in csvkit with this set of commands. First, csvstat shows what the file looks like:
$ csvstat -H --max tmp.csv | grep -v None
1. column1: 2
11. column11: c
27. column27: 6
42. column42: 567
63. column63: 4656
Then, now that you know that all of the data is in those columns, you can run this:
$ csvcut -c 1,11,27,42,63 tmp.csv
1,a,4,456,3455
1,b,5,467,3445
2,c,6,567,4656
to get your desired answer.
Can we do it using R?
Provided your input is as shown, i.e., you want to skip the same columns in all rows, you can analyze the first line and then define column classes in read.table:
text <- "1,,,,,,,,,,a,,,,,,,,,,,,,,,,4,,,,,,,,,,,,,,,456,,,,,,,,,,,,,,,,,,,,,3455,,,,,,,,,,
1,,,,,,,,,,b,,,,,,,,,,,,,,,,5,,,,,,,,,,,,,,,467,,,,,,,,,,,,,,,,,,,,,3445,,,,,,,,,,
2,,,,,,,,,,c,,,,,,,,,,,,,,,,6,,,,,,,,,,,,,,,567,,,,,,,,,,,,,,,,,,,,,4656,,,,,,,,,,"
tmp <- read.table(text = text, nrows = 1, sep = ",")
colClasses <- sapply(tmp, class)
colClasses[is.na(unlist(tmp))] <- "NULL"
Here I assume there are no actual NA values in the first line. If there could be, you'd need to adjust it slightly.
read.table(text = text, sep = ",", colClasses = colClasses)
# V1 V11 V27 V42 V63
#1 1 a 4 456 3455
#2 1 b 5 467 3445
#3 2 c 6 567 4656
Obviously, you'd specify a file instead of text.
This solution is fairly efficient for smallish to moderately sized data. For large data, substitute the second read.table with fread from package data.table (but that applies regardless of the skipping columns problem).
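For example, a minimal sketch of that substitution, reusing the colClasses vector computed above (the file name "YourFile.csv" is an assumption, borrowed from the earlier answer; fread's drop argument takes the column numbers to skip):
library(data.table)
## Reuse colClasses from above and drop the columns marked "NULL"
Data <- fread("YourFile.csv", header = FALSE,
              drop = which(colClasses == "NULL"))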

Regex lines with exactly 4 semicolons

I want to filter lines with exactly 4 semicolons in them.
Lines with more or fewer semicolons should not be processed. I'm using regex/grep:
POSITIVE Example:
VES_I.MG;A;97;13;1
NEGATIVE Example:
VES_I.MG;A;97;13;1;2
For something this straightforward, I would actually just suggest counting the semicolons and subsetting based on that numeric vector.
A fast way to do this is with stri_count* from the "stringi" package:
library(stringi)
v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2") ## An example vector
stri_count_fixed(v, ";") ## How many semicolons?
# [1] 4 5
v[stri_count_fixed(v, ";") == 4] ## Just keep when count == 4
# [1] "VES_I.MG;A;97;13;1"
^(?=([^;]*;){4}[^;]*$).*$
You can try this with grep -P if you have support for it. See the demo:
http://regex101.com/r/lZ5mN8/22
[EDIT: Fixed stupid bug...]
The following will work with grep or any regex engine:
^[^;]*;[^;]*;[^;]*;[^;]*;[^;]*$
When used in a command line, make sure you put it inside quotes (" on Windows; either kind on *nix) so that special characters aren't interpreted by the shell.
If you have awk available, you can also try:
awk -F';' 'NF==5' file
Just replace the 5 with n + 1, where n is your target count; for example, the 4 in your question.
You don't need to use lookaheads, and you also don't need to enable the perl=TRUE parameter.
> v <- c("VES_I.MG;A;97;13;1", "VES_I.MG;A;97;13;1;2")
> grep("^(?:[^;]*;){4}[^;]*$", v)
[1] 1
> grep("^(?:[^;]*;){4}[^;]*$", v, value=TRUE)
[1] "VES_I.MG;A;97;13;1"
To match exactly four semicolons in a line, grep using the regex ^([^;]*;){4}[^;]*$:
grep -P "^([^;]*;){4}[^;]*$" ./input.txt
This could be done without regular expressions by using count.fields. The first line gives the counts, the second line reads in the lines and reduces them to those with 5 fields, and the final line parses the fields out and converts the result to a data frame with 5 columns.
cnt <- count.fields("myfile.dat", sep = ";")
L <- readLines("myfile.dat")[cnt == 5]
read.table(text = L, sep = ";")

How to read and print first head of a file in R?

I want to print the head of a file in R. I know how to use read.table and other input methods supported by R. I just want to know R alternatives to the unix commands cat or head that read in a file and print some of it.
Thank you,
SangChul
read.table() takes an nrows argument for just this purpose:
read.table(header=TRUE, text="
a b
1 2
3 4
", nrows=1)
# a b
# 1 1 2
If you are instead reading in (possibly less structured) files with readLines(), you can use its n argument instead:
readLines(textConnection("a b
1 2 3 4 some other things
last"), n=1)
# [1] "a b"
