How do I grep unique occurrences across multiple files?

I want to grep across multiple files and display each unique match together with the number of times it occurs. For example:
filea:
text1
text2
text3
fileb:
text1
text3
filec:
text2
text3
grep -r "text*" ???
output:
text1 2
text2 2
text3 3

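Try this (pipe the result through sort if you need a fixed order, since the traversal order of awk's for-in loop is unspecified):
awk '/text/{a[$0]++}END{for(x in a)print x,a[x]}' file*
I haven't tested it, but it should work.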
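Since the question asks for grep, here is a minimal sketch of a grep-based pipeline. It assumes a grep that supports -r, -h and -o (GNU and BSD grep do) and that the strings of interest look like "text" followed by digits. The temporary directory and sample files are stand-ins for filea, fileb and filec.

```shell
# Hypothetical sample data mirroring filea/fileb/filec from the question.
workdir=$(mktemp -d)
printf 'text1\ntext2\ntext3\n' > "$workdir/filea"
printf 'text1\ntext3\n'        > "$workdir/fileb"
printf 'text2\ntext3\n'        > "$workdir/filec"

# -r: recurse into the directory, -h: omit file names,
# -o: print only the matched part, -E: extended regex.
# sort groups identical matches so that uniq -c can count them;
# the final awk turns "count match" into "match count".
counts=$(grep -rhoE 'text[0-9]+' "$workdir" | sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"

rm -rf "$workdir"
```

Unlike the awk version, this counts individual matches rather than whole lines, so a line containing two matches contributes twice.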

Related

Using the same regex for multiple specific columns in R

I have the data as below
Data
df <- structure(list(obs = 1:4, text0 = c("nothing to do with this column even it contains keywords",
"FIFA text", "AFC text", "UEFA text"), text1 = c("here is some FIFA text",
"this row dont have", "some UEFA text", "nothing"), text2 = c("nothing here",
"I see AFC text", "Some samples", "End of text")), class = "data.frame", row.names = c(NA,
-4L))
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
4   4                                                UEFA text                nothing    End of text
Expected Output:
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
Question: I have several columns that contain some keywords (FIFA, UEFA, AFC) I am looking for. I want to filter on these keywords in specific columns only (in this case text1 and text2). Any row where one of the keywords is found in text1 or text2 should be kept, as in the expected output. The text0 column should be ignored. I am wondering if there is a regex that gives this result.
Using filter_at
library(dplyr)
library(stringr)
patvec <- c("FIFA", "UEFA", "AFC")
# // create a single pattern string by collapsing the vector with `|`
# // specify the word boundary (\\b) so as not to have any mismatches
pat <- str_c("\\b(", str_c(patvec, collapse="|"), ")\\b")
df %>%
  filter_at(vars(c('text1', 'text2')),
            any_vars(str_detect(., pat)))
With across(), filter() currently applies all_vars-style matching (every selected column must match) rather than any_vars. One option is rowwise() with c_across():
df %>%
  rowwise %>%
  filter(any(str_detect(c_across(c(text1, text2)), pat))) %>%
  ungroup
Also you can try (base R):
#Keys
keys <- c('FIFA', 'UEFA', 'AFC')
keys <- paste0(keys,collapse = '|')
#Filter
df[grepl(pattern = keys,x = df$text1) | grepl(pattern = keys,x = df$text2),]
Output:
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples
Another base R option:
pat <- sprintf("\\b(%s)\\b",paste(patvec, collapse = "|"))
subset(df, grepl(pat, do.call(paste, df[c("text1","text2")])))
  obs                                                    text0                  text1          text2
1   1 nothing to do with this column even it contains keywords here is some FIFA text   nothing here
2   2                                                FIFA text     this row dont have I see AFC text
3   3                                                 AFC text         some UEFA text   Some samples

Why isn't str_count working with multiple strings?

I have a string with text like this:
Text <- c("How are you","What is your name","Hi my name is","You ate your cake")
And I want an output that counts the number of times the word "you" or "your" appears
Text NumYou
"How are you" 1
"What is your name" 1
"Hi my name is" 0
"You ate your cake" 2
I tried using the str_count function but it was missing occurrences of "you" and "your"
NumYou = str_count(Text,c("you","your"))
Why isn't str_count working correctly?
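str_count() is vectorised over both the string and the pattern, so a pattern vector of length 2 is recycled against Text element by element ("you" against the first string, "your" against the second, and so on) instead of every string being tested against both patterns. It is also case-sensitive, so "You" is missed. Pass the pattern as one string and lower-case the input: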
stringr::str_count(tolower(Text),'you|your')
#[1] 1 1 0 2

Loop through column of r dataframe and pair records next to each other

I have a dataframe of words used in a speech and I'd like to count the number of times certain words are paired together. The words are in the right order so I just need to loop through the column and pair each word with the one before it.
Starting with
order | word
------------
1 | hello
------------
2 | my
------------
3 | name
------------
4 | is
Desired output would be something like this:
order | word | pair
--------------------
1 | hello| hello
--------------------
2 | my | hello my
--------------------
3 | name | my name
--------------------
4 | is | name is
Thanks in advance StackOverflow!
We can drop the last word (word[-length(word)]) and the first word (word[-1]), paste the two vectors together, and concatenate the result with the first element to create the 'pair' column.
df1$pair <- with(df1, c(word[1], paste(word[-length(word)], word[-1])))
df1$pair
#[1] "hello" "hello my" "my name" "name is"
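You can use the Lag function from the Hmisc package (note the leading space in the first element, because the lagged value there is an empty string):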
library(Hmisc)
df$pair <- with(df, paste(Lag(word), word))
df$pair
# [1] " hello" "hello my" "my name" "name is"

AWK or SED to remove pattern from every line while using grep

I have grepped two columns, say col1 and col2. In col2 I want to remove a pattern that occurs in every line. How can I use awk or sed for this?
Suppose ps -eaf | grep b produces the following output:
col1 col2 col3
1 a/b/rac 123
2 a/b/rac1 456
3 a/b/rac3 789
I want the output to be stored in a file like this:
1 rac
2 rac1
3 rac3
Vaguely speaking, this might do what you want:
$ awk 'sub(/.*b\//,"",$2){print $1, $2}' file
1 rac
2 rac1
3 rac3
assuming file contains:
col1 col2 col3
1 a/b/rac 123
2 a/b/rac1 456
3 a/b/rac3 789
You will want to pipe the output to
| awk '{sub(/^.*\//, "", $2); print $1, $2}'
The first command in the awk program changes the contents of the second field to only contain whatever was after the final / character, which appears to be what you want. The second command prints the first two fields.
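The question also mentions sed, so here is a hedged sed sketch of the same idea, assuming whitespace-separated columns and a sed that accepts -E for extended regular expressions (GNU and BSD sed do). The sample input and the temporary output file are placeholders for the real command output and destination.

```shell
# Sample input standing in for the `ps -eaf | grep b` output in the question.
input='col1 col2 col3
1 a/b/rac 123
2 a/b/rac1 456
3 a/b/rac3 789'

outfile=$(mktemp)   # hypothetical destination file

# -n together with the p flag prints only lines where the substitution
# succeeded, which also drops the header line (it contains no "/").
# Group 1 is the first field; group 2 is whatever follows the last "/"
# in the second field.
printf '%s\n' "$input" |
  sed -nE 's#^([^[:space:]]+)[[:space:]]+[^[:space:]]*/([^[:space:]]+).*#\1 \2#p' > "$outfile"

result=$(cat "$outfile")
printf '%s\n' "$result"
rm -f "$outfile"
```

Using # as the substitution delimiter avoids having to escape the / inside the pattern.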

merge partial matched strings

I am struggling with trying to combine partially matched strings from two files.
File 1 contains a list of unique strings. Each of these partially matches a number of strings in File 2. How do I merge the rows in File 1 with File 2 for every match?
File1
mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660
File2
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
Desired output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
I have tried using pmatch() in R, but can't get it right. It looks like something Perl would handle?
Maybe something like this:
perl -ne'exec q;perl;, "-ne", q $print (/\Q$.$1.q;/?"$. YES":$. .q\; NO\;);, "file2" if m;^(.*)_pat1;' file1
This is a brief Perl solution, which saves all the data from file1 in a hash and then retrieves it as file2 is scanned
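use strict;
use warnings;
use autodie;

my @files = qw/ file1.txt file2.txt /;

# Build a hash from file1: the part before the first "_" is the key,
# the rest of the line is the value.
my %file1 = do {
    open my $fh, '<', $files[0];
    map /([^_]+)_(\S+)/, <$fh>;
};

# For each line of file2, look up its prefix and print both entries side by side.
open my $fh, '<', $files[1];
while (<$fh>) {
    my ($key) = /([^_]+)/;
    printf "%-32s%s", "${key}_$file1{$key}", $_;
}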
output
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
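The same hash-lookup idea also fits in a short awk program. This is only a sketch, assuming the key is everything before the first underscore in both files; the temporary directory and file names are placeholders.

```shell
# Hypothetical copies of File1 and File2 from the question.
workdir=$(mktemp -d)
cat > "$workdir/file1" <<'EOF'
mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660
EOF
cat > "$workdir/file2" <<'EOF'
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
EOF

# -F_ splits each line on "_", so $1 is the shared prefix.
merged=$(awk -F_ '
  NR == FNR  { full[$1] = $0; next }   # first file: store the whole line under its prefix
  $1 in full { print full[$1], $0 }    # second file: print the stored line next to each match
' "$workdir/file1" "$workdir/file2")
printf '%s\n' "$merged"

rm -rf "$workdir"
```

Lines in file2 whose prefix does not appear in file1 are silently skipped.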
Of course you may do it in R. Indeed, pmatching whole strings won't give you the desired result - you've got to match appropriate substrings.
I assume that in file 1 the first identifier is 677 and not 667, otherwise it's hard to guess the matching scheme (I assume your example is only a part of a bigger database).
file1 <- readLines(textConnection('mmu-miR-677-5p_MIMAT0017239
mmu-miR-181a-1-3p_MIMAT0000660'))
file2 <- readLines(textConnection('mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC'))
library(stringi)
file1_id <- stri_extract_first_regex(file1, "^.*?(?=_)")
file2_id <- stri_extract_first_regex(file2, "^.*?(?=_)")
cbind(file1=file1[match(file2_id, file1_id)], file2=file2)
## file1 file2
## [1,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA"
## [2,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT"
## [3,] "mmu-miR-677-5p_MIMAT0017239" "mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT"
## [4,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC"
## [5,] "mmu-miR-181a-1-3p_MIMAT0000660" "mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC"
You can use agrep for fuzzy matching. You will need to experiment with the maximum distance; here I set it manually to 11.
Basically, this extracts the line numbers in file2 that match each string in file1:
sapply(file1, agrep, file2, max.distance = 11)
$`mmu-miR-677-5p_MIMAT0017239`
[1] 1 2 3
$`mmu-miR-181a-1-3p_MIMAT0000660`
[1] 4 5
To get the result as a data.frame:
do.call(rbind,
        lapply(file1,
               function(x)
                 data.frame(file1 = x,
                            file2 = agrep(x, file2, max.distance = 11, value = TRUE))))
file1 file2
1 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGA
2 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_CTTCAGTGATGATTAGCTTCTGACT
3 mmu-miR-677-5p_MIMAT0017239 mmu-miR-677-5p_TTCAGTGATGATTAGCTTCTGACT
4 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTAC
5 mmu-miR-181a-1-3p_MIMAT0000660 mmu-miR-181a-1-3p_ACCATCGACCGTTGATTGTACC
