How to pass a record to a decision tree? - r

I'm working on a small project. Let's say I have a table of around 100K records with columns such as Age, Gender, Region, and Life (nominal: the range of days the product is used), where Life is the dependent variable and all the others are independent variables. I built a decision tree from the available data. My question is: given a new record, I want to know which terminal node it falls into after traversing the decision tree, i.e., which Life range it is assigned to. How can I pass that record to the decision tree and get an output?

predict(model,newdata)
Let's say your original data.frame had the 4 columns you list in your question. Your new record would need to be formatted as a data.frame with the same column names as your independent variables, e.g., newdata = data.frame(Age=15, Gender="Male", Region="Southwest") or whatever those values should be. Assuming you've stored your model as model = rpart(Life~., data=data, method="class"), then predict(model, newdata) will return the probability that the new record belongs to each of the terminal classes. You then need some cutoff logic to decide which group to assign it to.
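A minimal sketch putting those pieces together (the column values below are placeholders taken from the question, and data stands in for the asker's 100K-row training table):

library(rpart)

# fit the classification tree; Life is the dependent variable
model <- rpart(Life ~ ., data = data, method = "class")

# the new record: a one-row data.frame whose column names (and factor levels)
# match the independent variables the model was trained on
newdata <- data.frame(Age = 15, Gender = "Male", Region = "Southwest")

# class probabilities for the new record, one column per Life range
predict(model, newdata)

# or ask rpart for the predicted Life range directly, skipping manual cutoffs
predict(model, newdata, type = "class")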

Related

Does column order matter in RNN?

My question is somewhat similar to this one, but I want to ask whether the column order matters or not. I have some time series data, and for each cycle I computed some features (let's call them var1, var2, ...). I now train the model using the following column order, which of course will be kept consistent for the test set.
X_train = data[['var1', 'var2', 'var3', 'var4']]
After watching this video, I've concluded that the order in which the columns appear is significant, i.e., by swapping var1 and var3 as:
X_train = data[['var3', 'var2', 'var1', 'var4']]
I would get a different loss.
If the above is true, how does one figure out the correct feature order to minimize the loss, especially when the number of features could be in the dozens?

How to import and read excel rows in Julia?

I have a dataset like the following and want to read and extract each cell and assign the values to parameters in an optimization model. For example, considering only one part of a row:
ID, Min, speed, Distance, Time, Latitude, Longitude
1 2506 23271 11.62968 17.7 -37.731 144.898
Every row describes one person's information. So, is it better to define a dictionary of persons and put all of this into it? Or is it better to define a tuple? Or arrays (like below)?
for i in 1:n_people
    person_id = i
    push!(requests, Request(ID[i], Min[i], speed[i], Distance[i], Latitude[i], Longitude[i]))
end
In any case, how can I access (extract), let's say, the distance for a given person?
I mean, I need to have a set of people in my model, like
people[i], and then connect each of them to their information (model parameters) such as person i's distance, speed, etc., and then compare them with person j.
What is the best way to do that?
Since JuMP is agnostic to the format of the input data, the answer is: it depends on what you want to do with it. Pick whatever makes the most sense to you.
There are a few data-related tutorials that show how to read data into a DataFrame and use it to create JuMP variables and constraints. Reading those is a good next step:
https://jump.dev/JuMP.jl/stable/tutorials/getting_started/getting_started_with_data_and_plotting/
https://jump.dev/JuMP.jl/stable/tutorials/linear/diet/
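As a small sketch of the struct-and-vector approach from the question (the field types and the toy values below are assumptions standing in for the columns read from the spreadsheet):

struct Request
    ID::Int
    Min::Float64
    speed::Float64
    Distance::Float64
    Latitude::Float64
    Longitude::Float64
end

# toy stand-ins for the columns read from the sheet
ID = [1, 2]; Min = [2506.0, 2600.0]; speed = [23271.0, 21000.0]
Distance = [11.62968, 9.5]; Latitude = [-37.731, -37.8]; Longitude = [144.898, 144.9]

requests = Request[]
for i in eachindex(ID)
    push!(requests, Request(ID[i], Min[i], speed[i], Distance[i], Latitude[i], Longitude[i]))
end

# person i's distance, and a comparison between person i and person j
requests[1].Distance
requests[1].Distance > requests[2].Distance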

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function that recommends products. When R finishes, the return value is a single string holding all the product details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line is separated by \n, indicating end of line.)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The above data reaches Tableau through a calculated field as a single string, which I want to split on the pipe character ('|') into three columns.
I used the SPLIT function on the calculated field:
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I googled for best practices for handling the integration between R and Tableau, and all I could find were simple k-means clustering examples.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and clicking "Specific Dimensions". The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure, though not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension (say, Manufacturer) and get back one vector with one result per Manufacturer (or whatever combination of fields partitions your data for the table calc). It sounds like you are expecting an arbitrary-length list of recommendations. It shouldn't be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
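For example, a rough R-side sketch of that last idea (result_string is just the sample output string from the question):

# the pipe/newline-delimited string the recommender currently builds
result_string <- "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"

# return one element per record instead of one concatenated string; Tableau still
# expects the length of this vector to line up with the table calc's partition
unlist(strsplit(result_string, "\n", fixed = TRUE))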
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R, each vector having N values, and then have R return a vector of reals of length N, where the value at each position is a recommender score for the product at that position (which is why the ordering, a.k.a. addressing, of the vectors also matters).
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
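As a rough illustration of that shape (SUM([Sales]) is a stand-in measure and the z-score is a toy placeholder for a real recommender score; neither comes from the question), a calculated field along these lines sends one vector in and gets one score back per product:

SCRIPT_REAL("
  # .arg1 arrives as a vector with one value per product in the partition;
  # return exactly one score per input value
  as.numeric(scale(.arg1))
", SUM([Sales]))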
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

If I want 5 values that are in columns to the right of each key, what is the ideal way to train the Form Recognizer?

I have a column of numbers to the far left as my keys, of which, each entry has 5 design values I'm trying to pair to it. To train the model, I've used 15 completed pdf files, most of which were not scans. I also edited 3 of those, deleting the values but leaving the keys, and saved them with the same file name as the original, suffixed with "Empty".
The results returned from the model have no problem finding any of the numbers or their locations, but they are not in key-value pairs of any kind. I get that a key-value "pair" excludes any possibility of grabbing the column header and the row, but just the row and the position relative to the others would make things easy enough. I'm just hoping for some insight on how to train it to reuse the same key as it looks across the row.
I'm exporting the data to Word format and tabulating the values with a light border. I have no experience with machine learning. For the empty form, would there be any benefit to adding DocVariable fields to each of the 5 value columns, with the variable name being a combination of the row and column key names?
Actually, it's not necessary to delete these keys from your sample data to train the Form Recognizer model; it's even incorrect to do that. Form Recognizer needs to learn what the keys in your sample data are.
So you just need to follow the official tutorial Build a training data set for a custom model to train the model with more samples of a similar form layout with different keys and different values. Then, you can follow my answer to the SO thread How to improve the accuracy of Form Recognizer? to draw the keys and values and extract the values you want from the JSON result via their boundingBox values.
Yes, what I said means you need to design an algorithm that classifies these keys and values by classifying the geometry (boundingBox) values.
For example, you can try to draw several horizontal or vertical lines linking the upper-left points of keys and values, to find the geometric pattern for classifying these form cells.

Preventing duplicate SIMILAR relationships when using algo.similarity.jaccard on continuously updated data

I am computing the Jaccard similarity index for a category of nodes in a graph using the algo.similarity.jaccard algorithm from the Neo4j Graph Algorithms library. After calculating the Jaccard similarity and specifying a cutoff, I store the metric in a relationship between the nodes (this is a feature of the algorithm). I am trying to see how the graph changes over time as I get new data to add into the graph (I will be reloading my CSV file with new data and merging in new nodes/relationships).
A problem I foresee is that once I run the Jaccard algorithm again with the updated graph, it will create duplicate relationships. This is the Neo4j documentation example of the code that I am using:
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {topK: 1, similarityCutoff: 0.1, write:true})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, stdDev, p25, p50, p75, p90, p95, p99, p999, p100
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean, p95
Is there a way to specify that I do not want duplicate relationships each time I run this code on an updated graph? Manually, I'd use MERGE instead of CREATE, but seeing as this is an algorithm from a library, I'm not sure how to go about that. FYI, I will not be able to make changes to a library plugin, and it seems there is no way to store the relationship under a different type such as SIMILARITY2.
There are at least 2 ways to avoid duplicate relationships from multiple calls to algo.similarity.jaccard:
Delete the existing relationships (by default, they have the SIMILAR type) before each call, for example with a query like MATCH ()-[r:SIMILAR]->() DELETE r. This is probably the easiest approach.
Omit the write:true option when making the calls (so that the procedure won't create relationships at all), and write your own Cypher code to optionally create relationships that do not already exist (using MERGE).
Here is an example of the second approach (using the algo.similarity.jaccard.stream variant of the procedure, which yields more useful values for our purposes):
MATCH (p:Person)-[:LIKES]->(cuisine)
WITH {item:id(p), categories: collect(id(cuisine))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {topK: 1, similarityCutoff: 0.1})
YIELD item1, item2, similarity
WHERE item1 < item2
WITH algo.getNodeById(item1) AS n1, algo.getNodeById(item2) AS n2, similarity
MERGE (n1)-[s:SIMILAR]-(n2)
SET s.score = similarity
RETURN *
Since the procedure will return the same node pair twice (with the same similarity score), the WHERE clause is used to filter out one of the pairs, to speed up processing. The algo.getNodeById() utility function is used to get a node by its native ID. And the MERGE clause's relationship pattern does not specify a value for score, so that it will match an existing relationship even if it has a different value. The SET clause for setting the score is placed after the MERGE, which also helps to ensure the value is up to date.
