How to import and read excel rows in Julia? - julia

I have a dataset like the following picture and want to read and extract each cell and assign them into parameters in an optimization model. For example, considering only one part of a row:
ID, Min, speed, Distance, Time, Latitude, Longitude
1 2506 23271 11.62968 17.7 -37.731 144.898
Every row holds one person's information. So, is it better to define a dictionary of persons and put all of this into it? Or is it better to define a tuple? Or arrays (like below)?
for i in 1:n_people
    person_id = i
    push!(requests, Request(ID[i], Min[i], speed[i], Distance[i], Latitude[i], Longitude[i]))
end
In any case, how can I access (extract), let's say, distance for that person?
I mean, I need to have a set of people in my model like
people[i] and then for each of these connect them to their information (model parameters) like person's 'i' distance, speed,... and then compare them with person j.
What is the best way to do that?

Since JuMP is agnostic to the format of the input data, the answer is: it depends on what you want to do with it. Pick whatever makes the most sense to you.
There are a few data-related tutorials that show how to read data into a DataFrame and use it to create JuMP variables and constraints. Reading those is a good next step:
https://jump.dev/JuMP.jl/stable/tutorials/getting_started/getting_started_with_data_and_plotting/
https://jump.dev/JuMP.jl/stable/tutorials/linear/diet/

Related

Handle a string return from R to Tableau and SPLIT it

I connect Tableau to R and execute an R function for recommending products. When R ends, the return value is a string which will have all products details, like below:
ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA\nF003|NA|PROD_ABC\nF004|NA|PROD_ABC1\nC005|PROD_ABC2|NA\nC005|PRODABC3|PRODABC4
(Each line separated by \n indicating end of line)
On Tableau, I display the calculated field which is as below:
ID|Existing_Prod|Recommended_Prod
C001|NA|PROD008
C002|PROD003|NA
F003|NA|PROD_ABC
F004|NA|PROD_ABC1
C005|PROD_ABC2|NA
C005|PRODABC3|PRODABC4
The data above reaches Tableau through a calculated field as a single string, which I want to split into three columns on the pipe character ('|').
I used Split function on the calculated field :
SPLIT([R_Calculated_Field],'|',1)
SPLIT([R_Calculated_Field],'|',2)
SPLIT([R_Calculated_Field],'|',3)
But the error says "SPLIT function cannot be applied on Table calculations", which is self-explanatory. Are there any alternatives to solve this? I googled to check for best practices to handle integration between R and Tableau, and all I could find was simple k-means clustering code.
Make sure you understand how partitioning and addressing work for table calcs. Table calcs pass vectors of arguments to the R script, and receive a single vector in response. The cardinality of those vectors depends on the partitioning of the table calc. You can view that by editing the table calc and choosing specific dimensions. The fields that are not checked determine the partitioning, and thus the cardinality of the arguments you send to and receive from R.
This means it might be tricky to map your problem onto this infrastructure. Not necessarily impossible. It was designed to send a series of vector arguments with one cell per partitioning dimension, say, Manufacturer and get back one vector with one result per Manufacturer (or whatever combination of fields partition your data for the table calc). Sounds like you are expecting an arbitrary length list of recommendations. It shouldn’t be too hard to have your R script turn the string into a vector before returning, but the size of the vector has to make sense.
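The string-to-vector step mentioned above can be sketched as follows. This is only an illustration of the parsing logic, written in Python rather than R; the function name `split_recommendations` is hypothetical, and how the resulting vectors line up with your table calc partitioning still depends on the view.

```python
def split_recommendations(raw):
    """Split a newline/pipe-delimited string into one list per column.

    Assumes the first line is a header and every later line has the
    same number of '|'-separated fields as the header.
    """
    lines = [line for line in raw.split("\n") if line]
    header = lines[0].split("|")
    columns = {name: [] for name in header}
    for line in lines[1:]:
        for name, value in zip(header, line.split("|")):
            columns[name].append(value)
    return columns


raw = "ID|Existing_Prod|Recommended_Prod\nC001|NA|PROD008\nC002|PROD003|NA"
columns = split_recommendations(raw)
print(columns["Recommended_Prod"])
```

Each column comes back as a vector of the same length (one entry per data row), which is the shape a table calc expects to receive.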
As an example of an approach that fits this model more easily, say you had a Tableau view that had one row per Product (and you had N products) - and some other aggregated measure fields in the view per Product. (In Tableau speak, the view’s level of detail is at the Product level.)
It would be straightforward to pass those measures as a series of argument vectors to R - each vector having N values, and then have R return a vector of reals of length N where the value returned at each location was a recommender score for the product at that position. (Which is why the ordering aka addressing of the vectors also matters)
Then you could filter out low scoring products from the view and visually distinguish highly recommended products.
So the first step to understanding R integration is to understand how table calcs operate with partitioning and addressing and to think in terms of vectors of fixed lengths passed in both directions.
If this model doesn’t support your use case well, you might be able to do something useful with URL actions or the JavaScript API.

Best approach to splitting up clusters of data

I am working on a way to split up data in a CSV file based on a timestamp.
For example, for a given object id, check each entry's date and see if it is within a given, allowed range. So if a set of rows in the table were:
OBJECT ID - Info - Date
obj1 xyz 1/1/12
obj1 xyw 1/2/12
obj1 cya 1/3/12
obj1 abc 2/1/12
...
In this example, the fourth entry is well outside of the area of time that the other entries are in. Therefore, my desired behavior is for a script to assign that entry to a new object, say 'obj2' for example, such that it is separated from data within its own cluster. Note that the dataset this will be applied to will be somewhat large, at the very least in the 10s of thousands, so I don't know if manual algorithms will be fast enough.
I'm using R for the moment to try to get this done using the pam and pamk functions (pamk is in the fpc package and builds on pam from the cluster package). This gives me a plot of the clusters (I think), but I don't know how to apply this information to the actual data.
Any thoughts or ideas on the best way to do this?
I figured out a solution using the following steps:
library(fpc)  # provides pamk()
# Convert the timestamps to POSIXct date-times
data$date <- as.POSIXct(data$date, format="date_format_here")
# Split the data using the object ID as the parameter
splitData <- split(data, f=data$id)
# Iterate over the split groups, pasting each row's cluster number onto its object ID
for (i in seq_along(splitData)) {
  pamk.result <- pamk(splitData[[i]][dataColumnIndex])
  # newId is a new column, so the original ID column is left untouched
  splitData[[i]]$newId <- paste(splitData[[i]][, 1],
                                pamk.result$pamobject$clustering,
                                sep="delimiter_here")
}
Anyway, this is a rough outline of how I approached the problem. Maybe this will give some ideas to others down the line.
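For comparison, here is a minimal sketch of a simpler alternative to PAM clustering: sort each object's entries by date and start a new cluster whenever the gap to the previous entry exceeds a threshold. It is written in Python for illustration; the function name `relabel_by_gap`, the 7-day threshold, and the reading of the sample dates as month/day/year are all assumptions, not part of the approach above.

```python
from datetime import datetime, timedelta


def relabel_by_gap(rows, max_gap):
    """rows: list of (object_id, info, date) tuples.

    Returns (new_id, info, date) tuples, where new_id is the object id
    plus a cluster number that increases after every gap > max_gap.
    """
    by_object = {}
    for object_id, info, date in rows:
        by_object.setdefault(object_id, []).append((date, info))

    result = []
    for object_id, entries in by_object.items():
        entries.sort(key=lambda entry: entry[0])  # oldest first
        cluster = 1
        previous = None
        for date, info in entries:
            if previous is not None and date - previous > max_gap:
                cluster += 1  # gap too large: start a new cluster
            result.append((f"{object_id}_{cluster}", info, date))
            previous = date
    return result


rows = [
    ("obj1", "xyz", datetime(2012, 1, 1)),
    ("obj1", "xyw", datetime(2012, 1, 2)),
    ("obj1", "cya", datetime(2012, 1, 3)),
    ("obj1", "abc", datetime(2012, 2, 1)),
]
relabelled = relabel_by_gap(rows, max_gap=timedelta(days=7))
for new_id, info, date in relabelled:
    print(new_id, info, date.date())
```

This is one sort per object plus a single pass, so tens of thousands of rows should not be a problem; the trade-off is that you have to choose the gap threshold yourself instead of letting the clustering pick the number of groups.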

hash table suffix tree explanation

I am asking this here because I couldn't find the answer I am looking for elsewhere and I don't know where else I could ask this. I hope someone can reply without saying that the question is irrelevant to the forum. I have a biology background and I am currently using bioinformatics. I need to understand in lay language hash tables and suffix trees. Something simple, I don't get the O(n) concepts and all that stuff, I think they are both kind of the same: a way to store string data? But I would like to understand better the differences. This will help enormously to other people like me. We are a lot in this field now!
Thanks in advance.
OK, let's use bioinformatics to help illustrate the differences.
Let's say you have several DNA sequences that are pretty long, and we want to store these sequences in a data structure.
Hash table
A hashtable is a useful way to store a bunch of objects and still very quickly search the data structure to see if it already contains a particular object.
One bioinformatics usecase that we can solve with a hashtable is de-duping a large sequence set. Let's say we have a huge dataset of next-gen sequenced data and we want to de-duplicate it before we assemble. We can use a hashtable to store the unique sequences. Before inserting any sequences into the hashtable, we can first check to see if it already exists in the hashtable and if it does we skip that read. Only if it is not yet in the hashtable do we add it. Then when we are done the elements in the hash will be the unique sequences.
Hashtables are basically an array of LinkedLists. Each cell in the array we will call a "bin". When we insert or search for something in the hashtable, we have to first know what bin it is in. The way we determine which bin to use is by a hash algorithm.
We have to come up with a hash algorithm: something that will convert our sequence into a number. A requirement of this algorithm is that the same sequence must always evaluate to the same number. It's OK if different sequences evaluate to the same number (which is called a hash collision), since there are an infinite number of possible sequences and we will only have a limited range of possible number values in our hash.
A simple hash algorithm is to assign a value to each base, A=1, G=2, C=3, T=4 (assume no ambiguities), and then just sum up the bases in our sequence. This would mean that any sequences with the same number of As, Cs, Gs and Ts will have the same hash value. If we wanted, we could also have a more complicated algorithm that also takes position into account, so that to get the same number we would also have to have the same bases in the same order.
Once we have our hash algorithm, we can make a hash table by binning the sequences by their hash values. The more bins we have in our table, the fewer hash values per bin. This gives a very fast lookup because to see if a sequence is in our hashtable, or to add a new sequence to our hashtable, we just compute the hash value for the sequence to see what bin it is in, then we only have to look at the values inside that bin. We can ignore the rest of the bins.
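To make the binning idea concrete, here is a minimal sketch in Python of the de-duplication use case with the simple "sum of bases" hash described above. The class name `SequenceTable` and the choice of 16 bins are made up for illustration, and plain Python lists stand in for the linked lists. In practice you would use the language's built-in hash set, which works on the same principle.

```python
BASE_VALUES = {"A": 1, "G": 2, "C": 3, "T": 4}


def simple_hash(sequence):
    """Sum of per-base values; same base composition gives the same hash."""
    return sum(BASE_VALUES[base] for base in sequence)


class SequenceTable:
    def __init__(self, n_bins=16):
        # One list per bin; each bin holds the sequences that hash to it.
        self.bins = [[] for _ in range(n_bins)]

    def _bin_for(self, sequence):
        return self.bins[simple_hash(sequence) % len(self.bins)]

    def contains(self, sequence):
        # Only the one matching bin is searched, never the whole table.
        return sequence in self._bin_for(sequence)

    def add(self, sequence):
        """Insert the sequence; return False if it was already present."""
        bin_ = self._bin_for(sequence)
        if sequence in bin_:
            return False
        bin_.append(sequence)
        return True


reads = ["ACGT", "TGCA", "ACGT", "GGGG"]
table = SequenceTable()
unique_reads = [read for read in reads if table.add(read)]
print(unique_reads)
```

Note that ACGT and TGCA have the same base composition, so they collide and land in the same bin, but both are still stored because the final comparison inside the bin is on the full sequence.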
suffix tree
A suffix tree is a different data structure: a tree (a kind of graph) where each node is (in this case) a residue in our sequence, and edges point to the next node. So for example if our sequence was ACGT, the path in the graph will be A->C->G->T->$, where $ marks the end of the sequence. If we had another sequence ACTT, the path will be A->C->T->T->$. A full suffix tree stores a path like this for every suffix of a sequence (for ACGT: ACGT, CGT, GT and T), which is what lets us find a match starting at any position.
We can combine consecutive nodes if there is only one path, so in the previous example, since both sequences start with AC, the paths will be AC->G->T->$ and AC->T->T->$.
In bioinformatics this is really useful for substring matching (like finding repetitive regions or primer binding sites etc) since we can easily see where there are subpaths in our graph that match our motif.
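As a rough illustration of the substring matching, here is a minimal Python sketch that stores every suffix of a sequence as a path of nested dictionaries and then walks a motif down from the root. This is the uncompressed version (often called a suffix trie): it does not merge single-path nodes the way a real suffix tree does, so it uses far more memory and is only suitable for short examples. The function names are made up for illustration.

```python
END = "$"  # marks the end of a suffix


def build_suffix_trie(sequence):
    """Insert every suffix of the sequence as a path from the root."""
    root = {}
    for start in range(len(sequence)):
        node = root
        for base in sequence[start:]:
            node = node.setdefault(base, {})
        node[END] = {}
    return root


def contains_motif(trie, motif):
    """A motif occurs in the sequence if it is a path starting at the root."""
    node = trie
    for base in motif:
        if base not in node:
            return False
        node = node[base]
    return True


trie = build_suffix_trie("ACGTACTT")
print(contains_motif(trie, "GTAC"))
print(contains_motif(trie, "GGT"))
```

The lookup only takes as many steps as the motif is long, no matter how long the stored sequence is, which is the main attraction for motif searches.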
Hope that helps

How to pass a record to a decision tree?

I'm working on a small project. Let's say I have a table of around 100K records containing columns like Age, Gender, Region, Life (nominal: the range of days the product is used), etc. Here Life is the dependent variable and all the others are independent variables. I created a decision tree from the available data. Now my question is: suppose I have one new record, and I want to know in which terminal node that record falls after traversing the decision tree, i.e., under which Life range that record falls. How can I pass that record to the decision tree and get an output?
predict(model,newdata)
Let's say your original data.frame had 4 columns as you list in your question. Your new record would need to be formatted as a data.frame with the same column names as your independent variables, e.g., newdata = data.frame(Age=15,Gender="Male",Region="Southwest") or whatever those values should be. Let's assume you've stored your model thusly: model = rpart(Life~.,data=data,method="class"). Then predict(model,newdata) will return the probability that the new record belongs to each of the classes. You then need to have some cutoff logic to determine which group you'll assign it to, or you can call predict(model,newdata,type="class") to get the most probable class directly.

Determining distribution so I can generate test data

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?
While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.
You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K and look it up in that file structure with the mentioned "highest that's <=" and use the corresponding value.
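The cumulative-count lookup can be sketched in memory with Python's bisect module standing in for the btree or SQLite index. One boundary convention has to be pinned down, and the choice here is an assumption about the description above: each key is the cumulative count including that value, X is drawn from 0 to K-1, and the lookup takes the first key strictly greater than X, so every value is hit exactly as often as its count. The function names are made up for illustration.

```python
import bisect
import itertools
import random


def build_cumulative(pairs):
    """pairs: iterable of (value, count). Returns (values, cumulative keys)."""
    pairs = sorted(pairs)  # sorted by increasing value
    values = [value for value, _ in pairs]
    keys = list(itertools.accumulate(count for _, count in pairs))
    return values, keys


def draw(values, keys, rng=random):
    """Draw one value with probability proportional to its count."""
    x = rng.randrange(keys[-1])  # integer in 0 .. K-1
    return values[bisect.bisect_right(keys, x)]  # first key > x


values, keys = build_cumulative([(10, 3), (20, 2), (30, 5)])
sample = [draw(values, keys) for _ in range(1000)]
print(keys)
```

For 100M pairs the two lists may not fit comfortably in memory, which is where the on-disk index suggested above comes in; the lookup logic stays the same.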
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but, that's a personal choice!
To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.
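The straight-line check can also be done numerically. Here is a minimal Python sketch that fits a least-squares line to log(count) against log(value) and reports the slope and R squared. It assumes all values and counts are positive, the function name `loglog_slope` is made up, and the data is synthetic. Treat the result as a rough check in the same spirit as eyeballing the plot, not as a rigorous test for a power law.

```python
import math


def loglog_slope(pairs):
    """Least-squares fit of log(count) vs log(value).

    Returns (slope, r_squared). Requires value > 0 and count > 0.
    """
    xs = [math.log(value) for value, _ in pairs]
    ys = [math.log(count) for _, count in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared


# Synthetic data that follows count = 1000 * value^-2 exactly.
pairs = [(v, 1000.0 * v ** -2) for v in range(1, 50)]
slope, r_squared = loglog_slope(pairs)
print(slope, r_squared)
```

An R squared close to 1 means the points line up well on the log-log plot, and the slope is then an estimate of the power-law exponent.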
I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably bring in your 100m rows of values and counts using R's read.csv() function. Assuming you've got a header line labeled "values\t counts", that code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough freed up for 100m rows of data (storing character strings as factors will help reduce the footprint).
