Astyanax: how to create a column family with composite columns

I need to create 3 column families whose columns use composite names with the following types:
Column family 1 needs UTF8 and byte[]
Column family 2 needs BigDecimal and byte[]
Column family 3 needs BigInteger and byte[]
How do I create such a column family in Astyanax? I see a setComparatorType(String), but I want the comparator to be the UTF8, BigDecimal, or BigInteger part (i.e. the first part of the composite name). I do not care what order the byte[] parts sort in, so that can be arbitrary.
Also, do I need to set anything else on the Astyanax ColumnFamilyDefinition object to create these column families?
Also, I see the example of putting values with a self-created annotated composite type. I am assuming I just call colMutation.putColumn(compositeTypeInst, value, theTime) to put it into Cassandra?
thanks,
Dean

Ah, actually that did work; the earlier exception came from persisting the composite value, not from creating the column family. Creating the column family worked great using
ColumnFamilyDefinition def = cluster.makeColumnFamilyDefinition()
.setName(colFamily)
.setKeyspace(keyspace.getKeyspaceName())
.setComparatorType("CompositeType(UTF8Type, BytesType)");
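For the write path asked about in the question, here is a minimal sketch of writing one value under an annotated composite column name with Astyanax. The class, field, and variable names below are illustrative assumptions rather than code from this question, and the exact putColumn overloads should be checked against the Astyanax version in use; the overall pattern (AnnotatedCompositeSerializer plus a MutationBatch) follows the library's annotated-composite examples. For the BigDecimal and BigInteger column families, the comparator strings would presumably be CompositeType(DecimalType, BytesType) and CompositeType(IntegerType, BytesType).
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.annotations.Component;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.AnnotatedCompositeSerializer;
import com.netflix.astyanax.serializers.StringSerializer;

public class CompositeWriteSketch {

    // Composite column name: UTF8 prefix + raw bytes suffix, matching
    // the CompositeType(UTF8Type, BytesType) comparator created above.
    public static class UtfBytesName {
        @Component(ordinal = 0) public String prefix;
        @Component(ordinal = 1) public byte[] suffix;

        public UtfBytesName() {}  // no-arg constructor required by the serializer
        public UtfBytesName(String prefix, byte[] suffix) {
            this.prefix = prefix;
            this.suffix = suffix;
        }
    }

    private static final AnnotatedCompositeSerializer<UtfBytesName> NAME_SERIALIZER =
            new AnnotatedCompositeSerializer<UtfBytesName>(UtfBytesName.class);

    // Row key assumed to be a String purely for illustration.
    private static final ColumnFamily<String, UtfBytesName> CF =
            new ColumnFamily<String, UtfBytesName>("colFamily1",
                    StringSerializer.get(), NAME_SERIALIZER);

    public static void writeOne(Keyspace keyspace, String rowKey,
                                String namePrefix, byte[] nameSuffix,
                                String value) throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(CF, rowKey)
         .putColumn(new UtfBytesName(namePrefix, nameSuffix), value, null);  // null = no TTL
        m.execute();
    }
}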

Related

How do you sort a dictionary by its values and then sum these values up to a certain point?

I was wondering what the best method would be to sort a dictionary of type Dict{String, Int} based on the value. I loop over a FASTQ file containing multiple sequence records; each record has a String identifier that serves as the key, and another string whose length I take as the value for that key.
For example:
testdict["ee0a"] = length("aatcg")
testdict["002e4"] = length("aatcgtga")
testdict["12-f9"] = length("aatcgtgacgtga")
In this case the key-value pairs would be "ee0a" => 5, "002e4" => 8, and "12-f9" => 13.
What I want to do is sort these pairs from highest value to lowest, after which I sum the values into a separate variable until that variable passes a certain threshold. I then need to save the keys I used so I can use them later on.
Is it possible to use the sort() function or a SortedDict to achieve this? I would imagine that if the sorting succeeded I could use a while loop to add my keys to a list and add my values to a separate variable until it is greater than my threshold, and then use the list of keys to create a new dictionary with my selected key-value pairs.
However, what would be the fastest way to do this? The FASTQ files I read in can contain multiple GBs worth of data, so I'd love to create a sorted dictionary while reading in the file and select the records I want before doing anything else with the data.
If your file holds multiple GBs worth of data, I would avoid storing it all in a Dict in the first place. I think it is better to process the file sequentially and store the keys that meet your condition in a PriorityQueue from the DataStructures.jl package. Of course, you can apply the same procedure if you read the data from a dictionary in memory (the source simply changes from a disk file to the dictionary).
Here is pseudocode of what you could consider (a full solution depends on how you read your data, which you did not specify).
Assume that you want to store elements until they exceed a threshold kept in the THRESH constant.
pq = PriorityQueue{String, Int}()
s = 0
while (there are more key-value pairs in the source file)
    key, value = read(source file)
    # this check avoids adding a key-value pair for which we are sure
    # that it is not interesting
    if s <= THRESH || value > peek(pq)[2]
        enqueue!(pq, key, value)
        s += value
        # if we added something to the queue we have to check
        # whether we should drop the smallest elements from it
        while s - peek(pq)[2] > THRESH
            s -= dequeue_pair!(pq)[2]  # dequeue_pair! returns the key => value pair
        end
    end
end
After this process pq will hold only the key-value pairs you are interested in. The key benefit of this approach is that you never need to store the whole data set in RAM; at any point in time you only store the key-value pairs that would be selected at that stage of processing.
Observe that this process does not give you an easily predictable result, because several keys might have the same value, and if that value falls on the cutoff border you do not know which of them will be retained (you did not specify what you want to do in this special case; if you specify the requirement, the algorithm should be updated a bit).
If you have enough memory to hold at least one or two full Dicts of the required size, you can use an inverted Dict, with the length as the key and an array of the original keys as the value, so that duplicate lengths do not overwrite one another.
I think that the code below is then what your question was leading toward:
d1 = Dict("a" => 1, "b" => 2, "c" => 3, "d" => 2, "e" => 1, "f" => 5)
d2 = Dict()
for (k, v) in d1
    d2[v] = haskey(d2, v) ? push!(d2[v], k) : [k]
end
println(d1)
println(d2)
for k in sort(collect(keys(d2)))
    print("$k, $(d2[k]); ")
    # here can delete keys under a threshold to speed further processing
end
If you don't have enough memory to hold an entire Dict, you may benefit
from first putting the data into a SQL database like SQLite and then doing
queries instead of modifying a Dict in memory. In that case, one column
of the table will be the data, and you would add a column for the data length
to the SQLite table. Or you can use a PriorityQueue as in the answer above.

Autofill based on list and value of a cell

I'm making a spreadsheet to help me with my personal accounting. I'm trying to create a formula in LibreOffice Calc that will search a given cell for a number of different text strings and, if one is found, return a text string.
For example, the formula should search for "burger" or "McDonalds" in $C6 and then return "Food" in $E6. It should not be case sensitive, and it needs to match partial strings as well, as in the case of Burger King. I need it to be able to search for other keywords and return those values too, like "AutoZone" returning "Auto" and "NewEgg" returning "Electronics".
I've had a tough time finding any kind of solution to this; the closest I could get was a MATCH formula, but once I nested it in an IF it would not work. I've also tried nested IF with OR; no joy with either.
Examples:
=IF(OR(D10="*hulu*",D10="*netflix*",D10="*movie*",D10="*theature*",D10="*stadium*",D10="*google*music*")=1,"Entertainment",IF(OR(D10="*taco*",D10="*burger*",D10="*mcdonald*",D10="*dq*",D10="*tokyo*",D10="*wendy*",D10="*cafe*",D10="*wing*",D10="*tropical*",D10="*kfc*",D10="*olive*",D10="*caesar*",D10="*costa*vida*",D10="*Carl*",D10="*in*n*out*",D10="*golden*corral*",D10="*nija*",D10="*arby*",D10="*Domino*",D10="*Subway*",D10="*Iggy*",D10="*Pizza*Hut*",D10="*Rumbi*",D10="*Custard*",D10="*Jimmy*")=1,"Food",IF(OR(D10="*autozone*",D10="*Napa*",D10="*OREILLY*")=1,"AUTO","-")))
I can create a separate table and use it as a lookup reference, so another way to put this: I need something that does the opposite of VLOOKUP and HLOOKUP and returns the header value for any matching data in the given columns.
Something like:
=IF(NOT(ISNA(MATCH(A1,B3:B99))),B2,IF(NOT(ISNA(MATCH(A1,C3:C99))),C2,0))
Here A1 is the value being tested, B2 and C2 are the headers, and the search runs in the ranges below those headers.
As per my comments, try this:
=IF(SUM(LEN(G150)-LEN(SUBSTITUTE(LOWER(G150),{"hulu","netflix","movie","theater"," stadium"},"")))>0,"Entertainment",IF(SUM(LEN(G150)-LEN(SUBSTITUTE(LOWER(G150),{"burger","taco","vida","cafe","wing","dairy","mcdonald","wendy","kfc","pizza","carl","domino","caesar","olive","jimmy","custard","subway","arby"},"")))>0,"Food",IF(SUM(LEN(G150)-LEN(SUBSTITUTE(LOWER(G150),{"autozone","Napa","oreilly"},"")))>0,"AUTO","-")))
It is an Array formula and must be confirmed with Ctrl-Shift-Enter.
You can do this various ways using INDEX/MATCH/VLOOKUP formulae. A couple of caveats: I am using Excel and have never used LibreOffice, so I hope this works; and you will need a mapping table that maps McDonalds to Food, Google Music to Entertainment, and so on (for all the possible cases).
Let's assume the mapping table in your screenshot is A6 to E9.
The formula in E10 is =VLOOKUP(C10,$C$6:$E$9,3,0)
Explanation: it looks up C10 (Burger King) in the table $C$6:$E$9, and the result comes from the 3rd column (E is the 3rd column counting from C, where C10 was looked up). The 0 gives you an exact match; if you want an approximate match, enter 1 there.
Note: if your mapping table is, say, in columns G and H (service name in G and type of service in H) and you are unsure how many entries it will have, a modified formula is =VLOOKUP(C10,$G:$H,2,0), or =VLOOKUP(C10,$G:$H,2,1) for an approximate match. Here, 3 is replaced by 2 because H is the 2nd column counting from G, where C10 will be looked up.
EDIT: Doing VLOOKUP with the INDEX and MATCH functions for an approximate match of text - this could be the solution you are looking for in your last comment(?)
Two things need to be done: (a) the reference table entries, and (b) applying the INDEX/MATCH formula.
Part a - in your reference table, you will have to wrap each lookup value between two *s, the way you mention in your example in the question (*movie*, *wendy*, etc.). That is really the trick that enables the lookup by cell reference. The corresponding return values like Entertainment/Food/etc. need to be their own full words. Let's assume you have this table prepared in columns G6:H26 (G = lookup value, H = return value).
Part b - in your cell F6 (as per your screenshot), you can try this formula: =INDEX($H$6:$H$26,MATCH(C6,$G$6:$G$26,0))
That is really just the INDEX/MATCH replacement for VLOOKUP.
Because the values stored in column G are wrapped in *s, the C6 reference in the MATCH formula will match partially.

Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

I am trying to build random forest models by group (School_ID, more than 3,000 of them) on a large model-input CSV file using the Spark Scala API. Each group contains about 3,000-4,000 records. The resources I have at my disposal are 20-30 AWS m3.2xlarge instances.
In R, I can construct models by group and save them to a list like this-
library(dplyr);library(randomForest);
Rf_model <- train %>% group_by(School_ID) %>%
do(school= randomForest(formula=Rf_formula, data=., importance = TRUE))
The list can be stored somewhere, and I can load the models when I need to use them, like below:
save(Rf_model.school,file=paste0(Modelpath,"Rf_model.dat"))
load(file=paste0(Modelpath,"Rf_model.dat"))
pred <- predict(Rf_model.school$school[school_index][[1]], newdata=test)
I was wondering how to do that in Spark, whether or not I need to split the data by group first and how to do it efficiently if it's necessary.
I was able to split up the file by School_ID with the code below, but it seems to create one individual subsetting job per iteration and takes a long time to finish. Is there a way to do it in one pass?
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID))
for (i <- 0 to schools.length - 1) {
    bySchoolArray(i).
        write.format("com.databricks.spark.csv").
        option("header", "true").
        save("model_input_bySchool/model_input_" + schools(i))
}
Source:
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
Edit 8/24/2015
I'm trying to convert my dataframe into a format that is accepted by the random forest model. I'm following the instructions in this thread:
How to create correct data frame for classification in Spark ML
Basically, I create a new variable "label" and store my class as a Double. Then I combine all my features using the VectorAssembler function and transform my input data as follows:
val assembler = new VectorAssembler().
    setInputCols(Array("COL1", "COL2", "COL3")).
    setOutputCol("features")
val model_input = assembler.transform(model_input_raw).
    select("SCHOOL_ID", "label", "features")
Partial error message (let me know if you need the complete log message):
scala.MatchError: StringType (of class
org.apache.spark.sql.types.StringType$)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:57)
This was resolved after converting all the variables to numeric types.
Edit 8/25/2015
The ml model doesn't accept the labels I coded manually, so I need to use StringIndexer to work around the problem, as indicated here. According to the official documentation, the most frequent label gets index 0, which makes the labels inconsistent across School_IDs. I was wondering if there's a way to create the labels without changing the order of the values.
val indexer = new StringIndexer().
    setInputCol("label_orig").
    setOutputCol("label")
Any suggestions or directions would be helpful and feel free to raise any questions. Thanks!
Since you already have a separate data frame for each school, there is not much left to be done here. Since you use data frames, I assume you want to use ml.classification.RandomForestClassifier. If so, you can try something like this:
1. Extract the pipeline logic. Adjust the RandomForestClassifier parameters and transformers according to your requirements:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
    val rf = new RandomForestClassifier()
    val pipeline = new Pipeline().setStages(Array(rf))
    pipeline.fit(df)
}
2. Train a model on each subset:
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
3. Save the models:
import java.io._
def saveModel(name: String, model: PipelineModel) = {
    val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
    oos.writeObject(model)
    oos.close()
}

schools.zip(bySchoolArrayModels).foreach {
    case (name, model) => saveModel(name, model)
}
Optional: since individual subsets are rather small, you can try an approach similar to the one I've described here to submit multiple training tasks at the same time.
If you use mllib.tree.model.RandomForestModel, you can omit step 3 and use model.save directly. Since there seem to be some problems with deserializing pipeline models (see How to deserialize Pipeline model in spark.ml? - as far as I can tell it works just fine, but better safe than sorry, I guess), this could be the preferred approach.
Edit
According to the official documentation:
VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
Since the error indicates that your column is a String, you should transform it first, for example using StringIndexer.

What is a good data structure to save a dictionary?

I am designing a word filter that can filter out bad words (a 200-word list) in an article (about 2,000 words). My problem is which data structure I should use to store the bad word list, so that the program spends as little time as possible finding the bad words in an article.
-- more details
If the bad word list has 2,000 entries, an article has 50,000 words, and the program will process about 1,000 articles at a time, which data structure should I choose for a better-than-O(n^2) search?
You can use a hash table because its average complexity is O(1) for insert and search, and your data is just 2,000 words.
http://en.wikipedia.org/wiki/Hash_table
A dictionary usually is a mapping from one thing (word in 1st language) to another thing (word in 2nd language). You don't seem to need this mapping here, but just a set of words.
Most languages provide a set data structure out of the box that has insert and membership testing methods.
A small example in Python, comparing a list and a set:
import random
import string
import time

def create_word(min_len, max_len):
    return "".join([random.choice(string.ascii_lowercase)
                    for _ in range(random.randint(min_len, max_len))])

def create_article(length):
    return [create_word(3, 10) for _ in range(length)]

wordlist = create_article(50000)
article = " ".join(wordlist)
good_words = []
bad_words_list = [random.choice(wordlist) for _ in range(2000)]

print("using list")
print(time.time())
for word in article.split(" "):
    if word in bad_words_list:
        continue
    good_words.append(word)
print(time.time())

good_words = []
bad_words_set = set(bad_words_list)
print("using set")
print(time.time())
for word in article.split(" "):
    if word in bad_words_set:
        continue
    good_words.append(word)
print(time.time())
This creates an "article" of 50000 randomly created "words" with a length between 3 and 10 letters, then picks 2000 of those words as "bad words".
First, the bad words are put in a list, and the article is scanned word by word to check whether each word is in that list. In Python, the in operator tests for membership. For an unordered list, there is no better way than scanning the whole list.
The second approach uses the set datatype, initialized with the list of bad words. A set has no ordering, but much faster lookup (again using the in operator) to test whether an element is contained. That seems to be all you need here.
On my machine, the timings are:
using list
1421499228.707602
1421499232.764034
using set
1421499232.7644095
1421499232.785762
So it takes about 4 seconds with a list and about two hundredths of a second with a set.
I think the best structure you can use here is a set - http://en.wikipedia.org/wiki/Set_%28abstract_data_type%29
It takes log_2(n) time to add an element to the structure (a one-time operation) and the same to answer each query. So with 200 elements in the data structure, your program needs only about 8 operations to check whether a word exists in the set.
You need a Bag-like data structure for this problem. In a Bag, elements have no order, but the structure is designed for fast lookup of an element; the lookup time complexity is O(1), so for N words in an article the overall complexity turns out to be O(N), which is the best you can achieve in this case. Java's Set (for example HashSet) is an example of such a structure in Java.
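To make the set-based suggestions above concrete, here is a minimal Java sketch (the class and method names are my own, not from the question) that loads the bad-word list into a HashSet once and then filters an article in a single O(N) pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BadWordFilter {

    // Bad words are stored in a HashSet, so each membership check while
    // scanning an article is O(1) on average.
    private final Set<String> badWords = new HashSet<String>();

    public BadWordFilter(List<String> badWordList) {
        for (String w : badWordList) {
            badWords.add(w.toLowerCase());
        }
    }

    // Single pass over the article: O(N) for N words.
    public List<String> removeBadWords(String article) {
        List<String> kept = new ArrayList<String>();
        for (String word : article.split("\\s+")) {
            if (!badWords.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BadWordFilter filter = new BadWordFilter(Arrays.asList("badword", "worse"));
        System.out.println(filter.removeBadWords("this badword article has one worse word"));
        // prints: [this, article, has, one, word]
    }
}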

SQLite: Numeric values in CSV treated as text?

I just imported a huge text file into a table using the .import command. Everything is OK, except that clearly numeric values seem to be treated as text. For instance, conditions such as WHERE field > 4 are always met. I did not specify datatypes when I created the table, but this has never seemed to matter when creating small tables.
Any advice would be welcome. Thanks!
Edit/conclusion: It turns out some of the values in my CSV file were blank. I ended up solving this by being a bit less lazy and declaring the datatypes explicitly.
The way SQLite handles types is described on this page: http://www.sqlite.org/datatype3.html
In particular:
Under circumstances described below, the database engine may convert values between numeric storage classes (INTEGER and REAL) and TEXT during query execution.
Section 3.4 (Comparison Example) should give you concrete examples, which are likely to explain the problem you have. This is probably this example:
-- Because column "a" has text affinity, numeric values on the
-- right-hand side of the comparisons are converted to text before
-- the comparison occurs.
SELECT a < 40, a < 60, a < 600 FROM t1;
0|1|1
To avoid having the affinity guessed, you can use an explicit CAST in the comparison, e.g. CAST(value AS INTEGER) (see section 3.2 too):
SQLite may attempt to convert values between the storage classes INTEGER, REAL, and/or TEXT before performing a comparison. Whether or not any conversions are attempted before the comparison takes place depends on the affinity of the operands. Operand affinity is determined by the following rules:
An expression that is a simple reference to a column value has the same affinity as the column. Note that if X and Y.Z are column names, then +X and +Y.Z are considered expressions for the purpose of determining affinity.
An expression of the form "CAST(expr AS type)" has an affinity that is the same as a column with a declared type of "type".
Otherwise, an expression has NONE affinity.
Here is another example:
CREATE TABLE test (value TEXT);
INSERT INTO test VALUES(2);
INSERT INTO test VALUES(123);
INSERT INTO test VALUES(500);
SELECT value, value < 4 FROM test;
2|1
123|1
500|0
It's likely that the CSV import created the columns with TEXT affinity.
