Best way to search in 950 numbers in a single column? - sqlite

I have millions of rows in a sqlite3 database. The column 'points' of a single row contains the following example:
{1399808086,1366221142,1374614902,1374608759,1375598069,1375270116,1935207612,1914502332,1913478333,1930188205,1934563311,1942881023,1373508175,-778100129,-788765075,-788763091,-790856156,-790835404,-791756027,-795938489,-779165370,-778051210,-740437658,-749943514,-754136794,-770946570,1376606678,1380850053,1380854148,1381902468,1381971092,308228244,324940180,324948372,327078869,292409717,275550503,275554606,275547438,812554558,812489022,1894554398,1895733774,1895741966,1912515343,1943993629,1935471709,1918694493,1914490972,1913409788,1913475260,1913458860,1929778348,1900414380,1883651988 ... up to nearly 1'000 numbers, ending with }
What is the best way to search for rows where the column 'points' includes
a) all three example search values
1366221142,1374614902,1374608759 (Position two, three and four in the above content)
b) as many as possible (3 or 2 or 1) of the above 3 example search values
I tried it with indexes, but a search with LIKE and '%1366221142%' takes "forever".
Currently I am trying FTS5, but the import into a newly created virtual table seems to take several days.
Do you know any other possibility?

If you can use Python then this approach is 'pretty fast'.
import sqlite3
conn = sqlite3.connect(':memory:')
# The question's sample row, padded to just under 1,000 numbers by repeating
# a block of its values; the repeat count (22) is an approximation.
head = '1399808086,1366221142,1374614902,1374608759,1375598069,1375270116,1935207612,1914502332,1913478333,1930188205,1934563311,1942881023'
block = '1373508175,-778100129,-788765075,-788763091,-790856156,-790835404,-791756027,-795938489,-779165370,-778051210,-740437658,-749943514,-754136794,-770946570,1376606678,1380850053,1380854148,1381902468,1381971092,308228244,324940180,324948372,327078869,292409717,275550503,275554606,275547438,812554558,812489022,1894554398,1895733774,1895741966,1912515343,1943993629,1935471709,1918694493,1914490972,1913409788,1913475260,1913458860,1929778348,1900414380,1883651988'
# The last copy of the block is cut three numbers short, so the string ends with 1913458860
points = '{' + ','.join([head] + [block] * 22) + ',' + block.rsplit(',', 3)[0] + '}'
conn.execute('CREATE TABLE something (recno, points)')
for r in range(10000):
    conn.execute('INSERT INTO something (recno, points) values (?,?)', (r, points))
seeking = ['1366221142', '1374614902', '1374608759']
first = True
for row in conn.execute('SELECT recno, points FROM something'):
    # Strip the surrounding braces from this row's value and split it into numbers
    pointsList = row[1][1:-1].split(',')
    # Count how often each sought value occurs in this row
    counts = {s: pointsList.count(s) for s in seeking}
    if first:
        print(row)
        print(counts)
        first = False
The (abridged) output is:
(0, '{1399808086,1366221142,1374614902,1374608759,1375598069,1375270116,1935207612,1914502332,1913478333,1930188205,1934563311,1942881023,1373508175,-778100129,-788765075,-788763091,-790856156,-790835404,-791756027,-795938489,-779165370,... 324948372,327078869,292409717,275550503,275554606,275547438,812554558,812489022,1894554398,1895733774,1895741966,1912515343,1943993629,1935471709,1918694493,1914490972,1913409788,1913475260,1913458860}')
{'1374614902': 1, '1374608759': 1, '1366221142': 1}
Note that the code arranges to put 10,000 copies of your string, extended to 1,000 numbers, into the database and then processes them. Of course, the database is in memory, which is a factor to consider.
You could just try it.
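To cover both (a) and (b) with the same scan, one option is to count how many of the sought values each row contains and rank the rows by that count. This is only a minimal sketch: the helper name match_count and the three tiny sample rows are made up for illustration, and it still reads every row, just like the loop above.

```python
import sqlite3

def match_count(points_text, seeking):
    """Return how many distinct sought values occur in a '{a,b,c}' string."""
    values = set(points_text.strip('{}').split(','))
    return len(values & set(seeking))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE something (recno, points)')
# Hypothetical sample rows, much shorter than the real data
conn.executemany(
    'INSERT INTO something (recno, points) VALUES (?, ?)',
    [
        (0, '{1399808086,1366221142,1374614902,1374608759}'),
        (1, '{1366221142,-778100129,1374608759}'),
        (2, '{308228244,324940180}'),
    ],
)

seeking = ['1366221142', '1374614902', '1374608759']

# (b) keep every row with at least one hit, best matches first
ranked = []
for recno, points_text in conn.execute('SELECT recno, points FROM something'):
    n = match_count(points_text, seeking)
    if n:
        ranked.append((n, recno))
ranked.sort(reverse=True)

# (a) rows that contain all of the sought values
all_three = [recno for n, recno in ranked if n == len(seeking)]

print(ranked)
print(all_three)
```

Splitting on commas and comparing whole values also avoids the false hits a LIKE '%...%' search can give when one number is a substring of another.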

Related

Algorithm to count instances of a value from a file

I am reading through a file of financial data with beneficiaries. I need to count the number of beneficiaries and then calculate their allocated percentage. If there is 1 beneficiary, the allocation is 100%; if there are 2, 50% each; if there are 3, 33.33% each, etc. The file is sorted by investment then beneficiary, so if there is more than one beneficiary per investment they will be in order in the file. Here's an example:
input file data
the output that I want
Here is my code, but it's wrong because this way I am assigning 100% to the first beneficiary, 50% to the second beneficiary, 33.333% to the third, etc. How can I change it to do the count, then create the beneficiaries with the right count? (There is an outer loop which is a table of investments.)
iBeneficiaryCount = 0.
dTempPercentage = 100.
FOR EACH ttJointData WHERE ttJointData.inv-num EQ ttInvestment.inv-num:
    IF ttJointData.Joint_Type EQ "Joint" THEN DO:
        cTemp = "JT".
        RUN CreateOwner (....).
    END.
    ELSE IF ttJointData.Joint_Type EQ "Beneficiary" THEN DO:
        iBeneficiaryCount = iBeneficiaryCount + 1.
        dTempPercentage = 100 / iBeneficiaryCount.
        RUN AddBeneficiary(ttJointData.investment-num,ttInvestment.benficiary-id,dTempPercentage).
    END.
END.
What are the best ways to capture that beneficiary percentage? I am thinking that I need to read through the data and put that value into the ttJointData table. Or is there a way to do it on the loop? Regardless, I need a neat algorithm to count up the instances from an input file and create and assign a percentage value.
You can use a query to calculate the number of beneficiaries before you loop through them.
Something like
DEFINE VARIABLE dTempPercentage AS DECIMAL NO-UNDO.
DEFINE VARIABLE iBeneficiaryCount AS INTEGER NO-UNDO.
DEFINE QUERY qryJD FOR ttJointData.
dTempPercentage = 100.
FOR EACH ttInvestment:
    // calculate how many beneficiaries; must use PRESELECT here
    // note: as written, this counts every ttJointData row of the investment, "Joint" rows included
    OPEN QUERY qryJD PRESELECT EACH ttJointData WHERE ttJointData.inv-num EQ ttInvestment.inv-num.
    iBeneficiaryCount = QUERY qryJD:NUM-RESULTS.
    dTempPercentage = 100 / iBeneficiaryCount.
    GET FIRST qryJD .
    DO WHILE AVAILABLE ttJointData :
        IF ttJointData.Joint_Type EQ "Joint" THEN DO:
            cTemp = "JT".
            RUN CreateOwner (....).
        END.
        ELSE IF ttJointData.Joint_Type EQ "Beneficiary" THEN DO:
            RUN AddBeneficiary(ttJointData.investment-num,ttInvestment.benficiary-id,dTempPercentage).
        END.
        GET NEXT qryJD .
    END.
    CLOSE QUERY qryJD.
END.
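The count-first idea is independent of the language. As a rough illustration in Python rather than ABL, with made-up rows standing in for ttJointData, a first pass counts only the "Beneficiary" rows per investment and a second pass hands out equal shares:

```python
from collections import Counter

# Hypothetical rows: (investment number, joint type, person)
rows = [
    (1, "Beneficiary", "Ann"),
    (2, "Joint", "Bob"),
    (2, "Beneficiary", "Cy"),
    (2, "Beneficiary", "Di"),
    (3, "Beneficiary", "Ed"),
    (3, "Beneficiary", "Flo"),
    (3, "Beneficiary", "Gus"),
]

# Pass 1: count the beneficiaries of each investment (joint owners are skipped)
counts = Counter(inv for inv, kind, _ in rows if kind == "Beneficiary")

# Pass 2: every beneficiary of an investment gets the same share
allocations = [
    (inv, person, round(100 / counts[inv], 2))
    for inv, kind, person in rows
    if kind == "Beneficiary"
]

for allocation in allocations:
    print(allocation)
```

Because the count is complete before any percentage is assigned, the first beneficiary of a three-way split gets 33.33 rather than 100.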

Differences in result in two similar functions: finding the key with maximum value

I am currently having an issue. Basically, I have 2 similar functions in terms of concept but the results do not align. These are the codes I learned from Bioinformatics I on Coursera.
The first code is simply creating a dictionary of occurrences of each k-mer pattern from a text (which is a long stretch of nucleotides). In this case, k is 5.
def FrequencyMap(text,k):
    freq ={}
    for i in range (0, len(text)-k+1):
        freq[text[i:i+k]]=0
        for j in range (0, len(text)-k+1):
            if text[j:j+k] == text[i:i+k]:
                freq[text[i:i+k]] +=1
    return freq, max(freq)
The text and the result dictionary are kinda long, but the main point is when I call max(freq), it returns the key 'TTTTC', which has a value of 1.
Meanwhile, I wrote another code that is simply based on the previous code to generate the 5-mer patterns that have the max values (number of occurrences in the text).
def FrequentWords(text, k):
    a = FrequencyMap(text, k)
    m = max(a.values())
    words = []
    for i in a:
        if a[i]==m:
            words.append(i)
    return words,m
And this code returns 'ACCTA', which has the value of 99, meaning it appears 99 times in the text. This makes total sense.
I used the same text and k (k=5) for both codes. I ran the codes on Jupyter Notebook. Why does the first one not return 'ACCTA'?
Thank you so much,
Here is the text, if anyone wants to try:
"ACCATCCCTAGGGCATACCTAAGTCTACCTAAAAGGCTACCTAATACCATACCTAATTACCTAACTACCTAAAATAAGTCTACCTAATACCTAATACCTAAAGTTACCTAACGTACCTAATACCTAATACCTAACCACTACCTAATCCGATTTACCTAACAACCGATCGAGTACCTAATCGATACCTAAATAACGGACAATATACCTAATTACCTAATACCTAATACCTAAGTGTACCTAAGACGTCTACCTAATTGTACCTAACTACCTAATTACCTAAGATTAATACCTAATACCTAATTTACCTAATACCTAACGTGGACTACCTAATACCTAACTTTTCCCCTACCTAATACCTAACTGTACCTAAATACCTAATACCTAAGCTACCTAAAGAACAACATTGTACGTGCGCCGTACCTAAATACCTAACAACTACCTAACTGATACCTAATAGTGATTACCTAACGCTTCTACCTAACTACCTAAGTACCTAACGCTACCTAACTACCTAATGTCCACAAAATACCTAATACCTAATAGCTACCTAATTGTGTACCTAAGTACCTAACCTACCTAATAATACCTAAAAATACCTAAGTACCTAACGTACCTAAATTTTACCTAATCTACCTAACGTACCTAATACCTAATTATACCTAATTACCTAATGGTTACCTAAGTTACCTAATATGCCACTACCTAACCTTACCTAAGACCTACCTAATAGGTACCTAACTGGGTACCTAAGGCAGTTTACCTAATTCAGGGCTACCTAATGTACCTAATACCTAAGTACCTAATACCTAATCCCATACCTAATATTTACCTAAGGGCACCGGTACCTAATACCTAATACCTAATACCTAAACCTTCGTACCTAAATACCTAATCTACCTAATGTACCTAAGGTACCTAATACCTAAGTCACTACCTAATACCTAATACCTAATGGGAGGAGCTTACCTAAGGTTACCTAATTACCTAAATACCTAATCGTTACCTAA"
Why does the first one not return 'ACCTA'?
Because max(freq) returns the maximum key of the dictionary. In this case the keys are strings (the k-mers), and strings are compared alphabetically. Hence the maximum one is the last string when they are sorted alphabetically, regardless of its count.
If you want the first function to return the k-mer that occurs most often, you should change max(freq) to max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]. Here, max compares the (kmer, count) pairs (that's the key_value_pair parameter of the lambda expression) by their count, and the trailing [0] selects the k-mer from the winning pair.
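A small illustration of the difference, using a short made-up text instead of the long one from the question. The helper frequency_map is a simplified stand-in for FrequencyMap that returns only the dictionary:

```python
def frequency_map(text, k):
    """Count every k-mer of text in a single pass."""
    freq = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        freq[kmer] = freq.get(kmer, 0) + 1
    return freq

# Hypothetical short text: 'ACCTA' occurs twice, every other 5-mer once
freq = frequency_map("ACCTAACCTATTTTC", 5)

# max over a dict iterates its keys, so this is the alphabetically last k-mer
largest_key = max(freq)

# comparing the (kmer, count) pairs by count gives the most frequent k-mer
most_frequent = max(freq.items(), key=lambda key_value_pair: key_value_pair[1])[0]

print(largest_key, freq[largest_key])
print(most_frequent, freq[most_frequent])
```

max(freq, key=freq.get) is an equivalent, shorter spelling of the second call.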

R: how to extract the first integer or decimal number from a text, and if the first number is equal to specific numbers, extract the second integer/decimal

The data looks like this:
example - the name of the data table
detail - the first column, which contains strings with numbers in them (a number can be attached to $ etc., like 25m$, and can also be a decimal, like 1.2m$ or $1.2M)
Let's say the data table looks like this:
example$detail<- c("The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita", "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining", "The cole can make 23b$ per year ans help 1000000 familys living on it")
I want to add a column named "number" to the example data table that holds the first number in the string in column "detail". BUT if this number is equal to one of the numbers in the vector "years" (it is not in the example data table; it is a separate list I created), I want it to extract the second number of the string in example$detail instead.
So I created a separate years list (not part of the data table):
years<-c(2016:2030 )
I'm trying to create the new column, number.
What I did so far:
I managed to add a variable that extracts the first number of a string with the following commands:
example$number<-as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) ) # EXTRACT ONLY INTEGERS
example$number1<-format(round(as.numeric(str_extract(example$detail, "\\d+\\.*\\d*")), 2), nsmall = 2) # EXTRACT THE NUMBERS AS DECIMALS WITH TWO DIGITS AFTER THE . (IT'S ENOUGH FOR ME)
example$number1<-ifelse(example$number %in% years, TRUE, example$number1 ) # IF THE FIRST NUMBER EXTRACTED IS IN THE YEARS VECTOR, RETURN "TRUE"
Then I tried to write code that extracts the second number based on this condition, but it does not work; it just returns errors.
I tried:
gsub("[^\d]*[\d]+[^\d]+([\d]+)", example$detail)
str_extract(example$detail, "\d+(?=[A-Z\s.]+$)",[[2]])
as.integer( sub("\\D*(\\d+).*", "\\1", example$detail) )
as.numeric(strsplit(example$detail, "\\D+")[1])
I don't understand how to express "any number" (integer or decimal) in a pattern, or how to pick out THE SECOND number in a string.
Thanks a lot!!
Since no good example data is provided I'm just going to 'wing-it' here.
Imagine the dataframe df has the columns year (int) and details (char), then
library(dplyr)
df = mutate(df,
    clean_details = gsub("[^0-9.-]", "", details),  # keep only digits, dots and minus signs
    clean_details_part1 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 1)),  # digits before the dot, per row
    clean_details_part2 = as.integer(sapply(strsplit(clean_details, "[.]"), `[`, 2))   # digits after the dot, per row
)
This works with the code I wrote up. I didn't apply the logic because I see you're proficient enough to do that. I believe a simple ifelse statement would do to create a boolean that you can then filter on, or you could use a more direct approach.
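For the "first number, unless it is a year" rule itself, the logic can be sketched as follows. This is Python rather than R, the function name first_non_year_number is made up, and it assumes numbers are written as plain digits with an optional decimal part (so suffixes like M$ or b$ are ignored rather than scaled):

```python
import re

# Years to skip, mirroring years <- c(2016:2030) from the question
YEARS = set(range(2016, 2031))

# An integer or a decimal such as 500 or 1.2
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def first_non_year_number(text, years=YEARS):
    """Return the first number in text, or the second if the first is a year."""
    numbers = [float(m) for m in NUMBER.findall(text)]
    if not numbers:
        return None
    if numbers[0] in years and len(numbers) > 1:
        return numbers[1]
    return numbers[0]

details = [
    "The cole mine market worth every year 100M$ and the equipment they use worth 30$m per capita",
    "In 2017 the first enterpenur realized there is a potential of 500$M in cole mining",
    "The cole can make 23b$ per year ans help 1000000 familys living on it",
]
for detail in details:
    print(first_non_year_number(detail))
```

The same idea carries over to R by extracting all numbers of a string at once (for example with a "match all" function) and then indexing the first or second one.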

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries into their individual elements to get the following (i.e. for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations.
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually, running grep -c on the same data on the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
What I had intended to do was to stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location in the respective files), then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
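For each string, gregexpr returns a vector of match positions, and when nothing matches that vector is the single value -1. Its length is therefore 1 both for rows with no comma and for rows with exactly one comma. If you want to tell the rows which have a match from the ones which don't, you need to look at the returned value, not the length.
Try sapply(gregexpr(",", testdat$genome_coordinates), function(m) sum(m > 0)) to count the commas in each row, or grepl(",", testdat$genome_coordinates) to flag the rows with a comma. (Note that as.logical(-1) is TRUE, so converting the raw result to logical does not work.)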
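As for putting each location on its own row, the splitting logic is short enough to sketch. The example below is in Python rather than R, the function name expand_locations is made up, and it assumes every entry has exactly the [genome:chromosome:ranges:strand] shape shown in the question:

```python
def expand_locations(entry):
    """Split '[hg19:16:a-b,c-d:+]' into one (genome, chrom, start, end, strand) row per range."""
    genome, chrom, ranges, strand = entry.strip("[]").split(":")
    rows = []
    for span in ranges.split(","):
        start, end = span.split("-")
        rows.append((genome, chrom, int(start), int(end), strand))
    return rows

entries = [
    "[hg19:2:224840068-224840089:-]",
    "[hg19:17:37092945-37092969:-]",
    "[hg19:20:3904018-3904040:+]",
    "[hg19:16:67000244-67000248,67000628-67000647:+]",
]

# Number of commas per entry, i.e. the 'multiple' column from the question
multiple = [entry.count(",") for entry in entries]
print(multiple)

# One output row per genomic location
for entry in entries:
    for row in expand_locations(entry):
        print(*row)
```

In a real data frame the remaining columns of the original row would be copied onto each of the expanded rows.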

JPA/JPQL COUNT question

I have the following JPQL query -
SELECT f.md5
FROM File f, Collection leafCollections, Collection instCollections
WHERE (f.status = com.foo.bar.FileStatus.Happy OR f.status = com.foo.bar.FileStatus.Sad)
AND f.collectionId = leafCollections.collectionId
AND leafCollections.instanceCollectionId = instCollections.collectionId
GROUP BY f.md5, instCollections.collectionId
It basically returns the md5s for files which are organized in a hierarchy (tree) such that if the same MD5 appears in more than one leaf in a particular branch of the hierarchy it will be shown only once (thanks to the GROUP BY).
This works fine. Let's say I get 100 rows back. Each row containing an md5 as a string.
Now I want to get the COUNT of the rows returned. I thought I could simply do:
SELECT COUNT(f.md5)
FROM File f, Collection leafCollections, Collection instCollections
WHERE (f.status = com.foo.bar.FileStatus.Happy OR f.status = com.foo.bar.FileStatus.Sad)
AND f.collectionId = leafCollections.collectionId
AND leafCollections.instanceCollectionId = instCollections.collectionId
GROUP BY f.md5, instCollections.collectionId
However this returns 100 rows, each one containing a long representing the number of times the md5 appeared in a branch. What I wanted was simply to get 1 row back with a long value of 100 being the total count of rows the original query returned. I feel like I am missing something obvious.
Suggestions?
As doc_180 has commented, using getSingleResult() will yield the count.
Use the following code to get the result.
// Assuming that the query is strQ and that it returns exactly one row
Query q = entityManager.createQuery( strQ );
// COUNT returns a Long in JPQL, so go through Number instead of casting to Integer
long count = ( (Number) q.getSingleResult() ).longValue();
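The behaviour described in the question (one count per group instead of one total) is ordinary SQL grouping, which can be reproduced with a toy table. The sketch below uses plain SQL in SQLite through Python, not JPQL; the table file and its columns are made up, and a subquery in the FROM clause is an assumption that may not be accepted by a JPQL provider:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the File/Collection join: one row per file in a branch
conn.execute("CREATE TABLE file (md5 TEXT, branch INTEGER)")
conn.executemany(
    "INSERT INTO file VALUES (?, ?)",
    [("aa", 1), ("aa", 1), ("aa", 2), ("bb", 1)],
)

# COUNT combined with GROUP BY: one row per group, each holding that group's size
per_group = conn.execute(
    "SELECT COUNT(md5) FROM file GROUP BY md5, branch"
).fetchall()

# Counting the groups themselves: wrap the grouped query and count its rows
total_groups = conn.execute(
    "SELECT COUNT(*) FROM (SELECT md5 FROM file GROUP BY md5, branch)"
).fetchone()[0]

print(per_group)
print(total_groups)
```

Where such a subquery is not available, the number of groups is simply the size of the list returned by the original grouped query.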
