Get weight of words by occurrence

Maybe this belongs on math.stackexchange, but I am afraid that I would get a formula as an answer that I wouldn't understand.
I have products in our database, and I have products from different suppliers in another table.
What I want is to pair these suppliers' products with our products where possible, or at least to get a list of candidates where the match is strong.
I iterated through all the suppliers' products, split each product name on spaces, and stored each word in a table together with its occurrence count.
The table looks like this:
+--------+-------------+---------------+-------+
| id | word | originalWord | count |
+--------+-------------+---------------+-------+
| 220950 | Tracer | Tracer | 493 |
| 220951 | Destroyer | Destroyer | 3 |
| 220952 | Avago5050 | Avago5050 | 4 |
| 220953 | mouse | mouse | 2535 |
| 220954 | TRAMYS44916 | /TRAMYS44916/ | 2 |
| 220955 | GameZone | GameZone | 16 |
| 220956 | Enduro | Enduro | 3 |
| 220957 | AVAGO | AVAGO | 10 |
| 220958 | 5050 | 5050 | 4 |
| 220959 | optical | optical | 2370 |
| 220960 | USB | USB | 6160 |
+--------+-------------+---------------+-------+
and so on. Of course, in another table I store the product id for each word.
So what I want is to determine the weight of a word from its occurrence count.
As you can see, the word TRAMYS44916 occurs only twice, so it is almost certainly a part number, which makes it the heaviest word. Its weight should be 1.
Let's say the most frequent word is USB with 6160 occurrences, so its weight should be something like 0.01, I think.
What is the best way to get all the weights of the words?
There are other tables for other suppliers, so the distribution always changes.

This reminds me of Naive Bayes text classification. To determine which product a name belongs to, you can calculate the tf-idf of all the words.
Then, if you want to pair it with another product name, you can decompose that name into words again and select the product id based on the highest tf-idf value. You should probably specify a threshold for this, because in some cases the best match will not be that clear.
tf-idf = ("number of word matches in product name" / "word count of product name") * log("number of products" / "number of products that contain the word")
You can see how it is done in the example here (in your case the document is the full product name): https://en.wikipedia.org/wiki/Tf–idf#Example_of_tf.E2.80.93idf
Example implementation in Java: https://guendouz.wordpress.com/2015/02/17/implementation-of-tf-idf-in-java/
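To make the formula concrete, here is a minimal Python sketch of it. The product names, the function names (idf_weights, tf_idf, best_match) and the "sum of shared-word scores" matching rule are assumptions for illustration, not something from the question. Note that it counts how many product names contain a word, which equals the count column in the question only if a word never repeats inside one name.

```python
import math
from collections import Counter

def idf_weights(product_names):
    """Map each word to log(N / number of product names containing it)."""
    n = len(product_names)
    doc_freq = Counter()
    for name in product_names:
        # set() so a word repeated inside one name is counted once
        doc_freq.update(set(name.lower().split()))
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def tf_idf(name, idf):
    """tf-idf score of every known word in a single product name."""
    words = name.lower().split()
    counts = Counter(words)
    return {w: (c / len(words)) * idf[w] for w, c in counts.items() if w in idf}

def best_match(query, product_names, idf):
    """Hypothetical matching rule: sum the tf-idf of the words shared with the query."""
    query_words = set(query.lower().split())
    best = (0.0, None)
    for name in product_names:
        score = sum(s for w, s in tf_idf(name, idf).items() if w in query_words)
        if score > best[0]:
            best = (score, name)
    return best

# Made-up product names built from the words in the question's table
products = [
    "Tracer Destroyer Avago5050 mouse optical USB",
    "Tracer GameZone Enduro AVAGO 5050 mouse optical USB",
    "Generic keyboard USB",
]
idf = idf_weights(products)
score, match = best_match("Destroyer mouse USB", products, idf)
```

A word that appears in every product name (USB here) gets weight 0, while a word that appears in only one name gets the largest weight, which is the behaviour the question asks for. A threshold on score would decide when a match is too weak to accept.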

Related

Cumulative count of occurrences per value in array in Kusto

I'm looking to count how often each query parameter is used in the query strings of page views stored in Application Insights, using KQL. My query currently looks like:
pageViews
| project parsed=parseurl(url)
| project keys=bag_keys(parsed["Query Parameters"])
The results contain one row per page view, each holding the list of query parameter keys found in that URL.
I'd like to get the count of each key across all URLs, in order to answer questions like "How many times does page appear in the query string?". So the results might look like:
Page | From | ...
1000 | 67 | ...
Thanks in advance

You could try something along the following lines:
datatable(url:string)
[
"https://a.b.c/d?p1=hello&p2=world",
"https://a.b.c/d?p2=world&p3=foo&p4=bar"
]
| project parsed = parseurl(url)
| project keys = bag_keys(parsed["Query Parameters"])
| mv-expand key = ['keys'] to typeof(string)
| summarize count() by key
which returns:
| key | count_ |
|-----|--------|
| p1 | 1 |
| p2 | 2 |
| p3 | 1 |
| p4 | 1 |
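For anyone who wants to check the numbers outside Kusto, the same counting logic can be sketched in Python with the standard library. The function name count_query_keys is made up for this example; the URLs are the ones from the datatable above.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def count_query_keys(urls):
    """Count in how many URLs each query parameter key appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so "?page=" still counts as a use of "page"
        counts.update(parse_qs(query, keep_blank_values=True).keys())
    return counts

urls = [
    "https://a.b.c/d?p1=hello&p2=world",
    "https://a.b.c/d?p2=world&p3=foo&p4=bar",
]
key_counts = count_query_keys(urls)
```

Like bag_keys, parse_qs yields each key once per URL, so a parameter repeated within one URL is still counted a single time for that page view.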

How are complex conditions represented in decision table

I am trying to model a decision table template.
I understand that a simple rule like
(x>10 and y<10) print "red" can be represented in a decision table with one row, using two columns for the conditions and one column for the action.
+-----+-----+-------------+
| X | Y | Action |
+-----+-----+-------------+
| >10 | <10 | Print "red" |
+-----+-----+-------------+
How are conditions like
((x>10 and y<10) or x>1) or z<5 and y>5 print "red" represented in decision tables?
I assume the big condition above is represented as many rows, one for each combination of the smaller conditions that makes it true, with the same action repeated in every row. Is there a method to reduce conditions like this to a decision table?
However, in that case the action is fired by multiple rows, whereas we have only one action. Is there a column for grouping?

One approach is to give actions numbers, and reference them from decision tables. If an action has been fired during an evaluation run, subsequent firings are ignored.
Here is an example:
+-----+-----+-----+--------+
| X | Y | Z | Action |
+-----+-----+-----+--------+
| >10 | >10 | - | 1 |
+-----+-----+-----+--------+
| >10 | <10 | - | 2 |
+-----+-----+-----+--------+
| >50 | - | - | 2 |
+-----+-----+-----+--------+
| - | - | >5 | 2 |
+-----+-----+-----+--------+
The action number refers to an action from this table:
+-----+--------------+
| # | Action |
+-----+--------------+
| 1 | Print "red" |
+-----+--------------+
| 2 | Print "blue" |
+-----+--------------+
If action #2 is fired because x>10 AND y<10, it wouldn't fire again even if x>50 or z>5.
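A small Python sketch of this evaluation scheme, using the two tables above as data. The representation (a dict of conditions per row, with a missing variable meaning "-") and the names RULES, ACTIONS and evaluate are assumptions chosen for the example.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# One entry per decision table row: ({variable: (operator, limit)}, action number).
# A variable that is absent from the dict corresponds to "-" (don't care).
RULES = [
    ({"x": (">", 10), "y": (">", 10)}, 1),
    ({"x": (">", 10), "y": ("<", 10)}, 2),
    ({"x": (">", 50)}, 2),
    ({"z": (">", 5)}, 2),
]

ACTIONS = {1: 'Print "red"', 2: 'Print "blue"'}

def evaluate(values):
    """Return the actions fired for one set of inputs, each action at most once."""
    fired = []
    for conditions, action in RULES:
        matches = all(
            OPS[op](values[var], limit) for var, (op, limit) in conditions.items()
        )
        # Subsequent firings of an already fired action are ignored
        if matches and action not in fired:
            fired.append(action)
    return [ACTIONS[number] for number in fired]

once = evaluate({"x": 60, "y": 5, "z": 9})    # rows 2, 3 and 4 all match
both = evaluate({"x": 20, "y": 20, "z": 9})   # rows 1 and 4 match
```

In the first call three rows match, but because they all reference action #2 it is reported only once.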

What database schema to use for storing survey answers

I'm required to design a survey system for our customer.
It's based on ASP.NET, and the database is Oracle.
I've no experience here so I'd like to ask for advice about:
What database schema should I use for storing user answers? I'm afraid my current design is likely to have performance issues.
About the survey:
There'll be two or more surveys going on at the same time.
Surveys may be triggered once a year or more frequently, so I think I need a Survey Period table.
Surveys are targeting different products, so there'll be a mapping between products and surveys
Currently my design:
Survey Category table
+------------+--------------+
| CatageryId | CatageryName |
+------------+--------------+
| 1 | cat1 |
| 2 | cat2 |
+------------+--------------+
Survey Category version table
+-----------+------------+--------------------+
| VersionId | CatageryId | VersionDescription |
+-----------+------------+--------------------+
| 1 | 1 | 'cat1 version1' |
| 2 | 1 | 'cat1 version2' |
| 3 | 2 | 'cat2 version1' |
+-----------+------------+--------------------+
Survey Period Table
+----------+--------------------+
| PeriodId | PeriodDescription |
+----------+--------------------+
| 1 | 'cat1 period2016' |
| 2 | 'cat1 period2017' |
| 3 | 'cat2 period2016' |
+----------+--------------------+
Survey Period-Version map table
+----------+-----------+
| PeriodId | VersionId |
+----------+-----------+
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 3 | 3 |
+----------+-----------+
A Version-Question map table
+--------------+------------+
| VersionId | QuestionId |
+--------------+------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 3 | 1 |
+--------------+------------+
A Version-Product map table
+-----------+-----------+
| VersionId | ProductId |
+-----------+-----------+
| 1 | 'prodA' |
| 1 | 'prodB' |
| 1 | 'prodC' |
| 2 | 'prodA' |
+-----------+-----------+
To store the survey result data, I have to duplicate a lot of information across rows:
User Answer table
+----------+------------+----------+-----------+-----------+--------+-----------+
| AnswerId | QuestionId | PeriodId | UserId/Ip | ProductId | Answer | VersionId |
+----------+------------+----------+-----------+-----------+--------+-----------+
| 1 | 1 | 1 | 'adam' | 'prodA' | 'Yes' | 2 |
| 2 | 2 | 1 | 'Joe' | 'prodA' | 'Yes' | 2 |
| 3 | 1 | 2 | 'adam' | 'prodB' | 'A' | 3 |
+----------+------------+----------+-----------+-----------+--------+-----------+
We're expecting tens of products and thousands of users for this system.
So assume 30 products, 5000 users, 50 questions per survey and 4 surveys per year.
With the current design, 5000 * 4 * 50 * 30 = 30 million records will be added to the User Answer table per year.
I'm really afraid it won't keep working properly, so are there any suggestions for optimizing?
Edit 1:
Added the VersionId column to the User Answer table as suggested.

This looks like a case of premature optimization. You should probably worry more about correctness and flexibility than performance.
30 million rows per year, especially in these skinny tables, is a small amount of data for any Oracle system. Don't worry too much about indexes and partitioning yet, those can be added later if necessary.
Your solution is similar to the Entity Attribute Value (EAV) model. It's worth knowing that term since much has been written about it. There are 2 common problems with EAV models you want to avoid:
Avoid extremes. Don't use EAV for everything, but don't completely avoid it either. EAV is slow and inconvenient compared to a normal table structure. It should not be used for every interesting column, otherwise you have created a database within a database. For example, if virtually every survey has fields like a username and a date created, store those as regular columns and not in a generic column. It's OK to have a column that is only populated 99% of the time. On the other hand, it's a bad idea to always avoid EAV and try to hack something together with 1,000-column tables or object-relational types.
Always use the correct type. Always, always, always store data as the correct type. Store numbers as numbers, dates as dates, and strings as strings. Your queries will be easier, faster, and safer, if you have at least three columns for the data: ANSWER_NUMBER, ANSWER_STRING, ANSWER_DATE. I explain the type safety problem more in this answer. Those extra columns may look bad in the model diagram, but they are a life-saver when you're querying the data.
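As a rough illustration of the typed-columns idea, here is a sketch in Python using an in-memory SQLite database as a stand-in for Oracle. The table layout, the sample rows and the CHECK rule (exactly one typed column filled per row) are assumptions for the example; SQLite also has no real DATE type, so the date is kept as ISO-8601 text here, whereas Oracle would use a DATE column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_answer (
        answer_id     INTEGER PRIMARY KEY,
        question_id   INTEGER NOT NULL,
        answer_number REAL,
        answer_string TEXT,
        answer_date   TEXT,
        -- hypothetical rule: exactly one of the typed columns is filled
        CHECK ((answer_number IS NOT NULL)
             + (answer_string IS NOT NULL)
             + (answer_date IS NOT NULL) = 1)
    )
""")

conn.executemany(
    "INSERT INTO user_answer (question_id, answer_number, answer_string, answer_date) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 4, None, None),             # numeric rating
        (1, 5, None, None),
        (2, None, "Yes", None),         # free text / choice
        (3, None, None, "2016-05-01"),  # date answer
    ],
)

# Numeric answers can be aggregated directly, with no string-to-number casts
average = conn.execute(
    "SELECT AVG(answer_number) FROM user_answer WHERE question_id = 1"
).fetchone()[0]
```

With a single generic string column, the same average would need a conversion on every row, and one badly formatted value could break the query.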

Symfony2 dynamic form builder

I want to create a bundle for an application in which users can create forms.
These forms will be questionnaires with different answers. Do you know whether a similar bundle already exists?
If it does not exist, how should I proceed? So far I have only ever created forms in files like "Form\Type\UserType".
In this case they would have to be generated dynamically from the database, right? I'm just missing the approach; can anyone give me a hint on how to realize this?
Update:
Maybe I'm thinking too complicated. I'm not sure whether a form service solves my problem. I have now created a database structure to describe my initial situation.
In the backend, the user can create the following records:
A scale; a scale can contain many answers (yes, no, maybe, good, better ...)
A question category
A question, which can be assigned to many categories and many questionnaires
A questionnaire, which can contain many questions; here the user can assign a scale to each question
Table Scale
+----+--------+------------+----------+-----------+------------+
| id | name | alignment | isActive | isDeleted | createDate |
+----+--------+------------+----------+-----------+------------+
| 1 | Yes-No | horizontal | 1 | 1 | 2014-09-25 |
+----+--------+------------+----------+-----------+------------+
Table Items
+----+------+-------+------------+
| id | name | value | createDate |
+----+------+-------+------------+
| 1 | Yes | 1 | 2014-09-25 |
| 2 | No | 0 | 2014-09-25 |
+----+------+-------+------------+
ManyToMany 'scale_items'
+----------+----------+
| scale_id | items_id |
+----------+----------+
| 1 | 1 |
| 1 | 2 |
+----------+----------+
Table category for question categories
+----+---------+----------+-----------+------------+
| id | name | isActive | isDeleted | createDate |
+----+---------+----------+-----------+------------+
| 1 | General | 1 | 0 | 2014-09-25 |
+----+---------+----------+-----------+------------+
Table question
+----+-----------------------------------------+------------+
| id | question | createDate |
+----+-----------------------------------------+------------+
| 1 | Are you satisfied with the cleanliness? | 2014-09-25 |
+----+-----------------------------------------+------------+
ManyToMany 'question_category'
+-------------+-------------+
| question_id | category_id |
+-------------+-------------+
| 1 | 1 |
+-------------+-------------+
Table questionnaire
+----+-------------------+---------+----------+-----------+------------+
| id | name | version | isActive | isDeleted | createDate |
+----+-------------------+---------+----------+-----------+------------+
| 1 | General Questions | 2.2 | 1 | 0 | 2014-09-25 |
+----+-------------------+---------+----------+-----------+------------+
Now the database contains scales and items, questions and categories, and a table for the questionnaire. Next I created a larger relation to assign questions to questionnaires with a specified scale. A question can be assigned to different scales in different questionnaires.
Table questionnaire_question_scale
+----+-------------+------------------+----------+------+--------+
| id | question_id | questionnaire_id | scale_id | page | hash |
+----+-------------+------------------+----------+------+--------+
| 1 | 1 | 1 | 1 | 1 | X321Z1 |
+----+-------------+------------------+----------+------+--------+
In the final step I create a relation table to assign a questionnaire to a number of users.
Table questionnaire_user
+---------+------------------+
| user_id | questionnaire_id |
+---------+------------------+
| 21 | 1 |
+---------+------------------+
Now, when a user logs in, I want to render the information above as a form, and this is where my problem starts :)
I think my solution is inefficient, because when many users log in to fill out a questionnaire, I have to generate the questionnaire (the complex structure) as a form every time.
This is where I'm stuck; unfortunately I don't know how to go on.
I would be very grateful for ideas, tips and suggested approaches.

If you want to build it yourself: I suggest creating a form type, declaring it as a service and injecting the form repository (explained further down) into it.
For storing it in the database: you can create two entities (creating a separate entity for each field seems better, but for the sake of simplicity I use two): one for forms, YourBundle:Form, and one for form fields, YourBundle:FormField.
The Form entity can contain just an id, a name and a one-to-many association to FormField. The data you store in FormField will be: a many-to-one association to Form, the field name, the field type, and the field's options.
You can store the options as JSON or another format and decode them later.
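The two-entity layout can be sketched independently of Symfony. The following Python is only a language-neutral illustration of the data model, not Symfony or Doctrine code; the class names mirror the entities above, while field_type, options_json, build_form and the sample field are invented for the example.

```python
import json
from dataclasses import dataclass, field

@dataclass
class FormField:
    name: str
    field_type: str
    options_json: str = "{}"   # options stored as a JSON string, as in the database

    def options(self):
        """Decode the stored options when the form is built."""
        return json.loads(self.options_json)

@dataclass
class Form:
    id: int
    name: str
    fields: list = field(default_factory=list)   # one-to-many: Form -> FormField

def build_form(form):
    """Turn the stored rows into (name, type, options) tuples for a form builder."""
    return [(f.name, f.field_type, f.options()) for f in form.fields]

survey = Form(
    id=1,
    name="General Questions",
    fields=[
        FormField(
            "cleanliness",
            "choice",
            json.dumps({
                "label": "Are you satisfied with the cleanliness?",
                "choices": {"Yes": 1, "No": 0},
            }),
        ),
    ],
)
definition = build_form(survey)
```

In a Symfony form type, a loop over such a definition would be the place to add each field to the builder. Since the definition only changes when the questionnaire is edited, it could also be cached instead of being rebuilt for every user.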

Combine DataFrame rows into a new column

I am wondering if there is a simple way to achieve this in Julia besides iterating over the rows in a for-loop.
I have a table with two columns that looks like this:
| Name | Interest |
|------|----------|
| AJ | Football |
| CJ | Running |
| AJ | Running |
| CC | Baseball |
| CC | Football |
| KD | Cricket |
...
I'd like to create a table where each Name in first column is matched with a combined Interest column as follows:
| Name | Interest |
|------|----------------------|
| AJ | Football, Running |
| CJ | Running |
| CC | Baseball, Football |
| KD | Cricket |
...
How do I achieve this?
UPDATE: OK, so after trying a few things including print_joint and grpby, I realized that the easiest way to do this would be the by() function. I'm 99% there.
by(myTable, :Name, df->DataFrame(Interest = string(df[:Interest])))
This gives me my :Interest column as "UTF8String[\"Running\"]", and I can't figure out which method I should use instead of string() (or where to typecast) to get the desired ASCIIString output.
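The string() call prints the whole array, which is where the "UTF8String[...]" text comes from. What is needed per group is a join of the elements with a separator. Julia has a join function for this (join(df[:Interest], ", ") would be the analogous call, untested here). The grouping logic itself, sketched in plain Python with the sample rows from the question and a made-up function name:

```python
rows = [
    ("AJ", "Football"),
    ("CJ", "Running"),
    ("AJ", "Running"),
    ("CC", "Baseball"),
    ("CC", "Football"),
    ("KD", "Cricket"),
]

def combine_interests(rows):
    """Group interests by name, then join each group into one string."""
    grouped = {}
    for name, interest in rows:
        grouped.setdefault(name, []).append(interest)
    # Join the elements themselves instead of printing the list object
    return {name: ", ".join(interests) for name, interests in grouped.items()}

combined = combine_interests(rows)
```

The difference between str(list) and ", ".join(list) in Python is the same difference as between string() and a join over the column in the by() call above.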
