How can I get the correlation between features and target variables using a user-defined function in Python? - pythomnic3k

I have two different datasets. One contains all the feature columns, such as consumer price and GDP.
The other contains information on different customers' orders. I want to find the correlation between each customer's orders and each feature. In the end, I want to store the results in a dataframe in which the first column contains the customer name, the second the feature name, and the third the correlation value.
I would be grateful if anyone could help me with this.
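A minimal sketch of what such a user-defined function could look like, assuming both datasets are already aligned on the same periods (one value per period, in the same order). The names `pearson`, `correlation_table`, `features`, and `orders`, and all the numbers, are made up for illustration. It uses only the standard library; the resulting rows can be passed to `pandas.DataFrame(rows, columns=["customer", "feature", "correlation"])` if pandas is available.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_table(features, orders):
    """Return one (customer, feature, correlation) row per pair."""
    return [
        (customer, feature, pearson(order_values, feature_values))
        for customer, order_values in orders.items()
        for feature, feature_values in features.items()
    ]


# Hypothetical data: each list holds one value per period, in the same order.
features = {
    "consumer_price": [100.0, 102.0, 104.0, 106.0],
    "gdp": [50.0, 49.0, 51.0, 48.0],
}
orders = {
    "customer_a": [10.0, 20.0, 30.0, 40.0],
    "customer_b": [8.0, 6.0, 4.0, 2.0],
}

rows = correlation_table(features, orders)
for customer, feature, corr in rows:
    print(customer, feature, round(corr, 3))
```

Note that `pearson` divides by zero if a column is constant, so real data may need a guard for that case.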

Related

Tableau---Getting count from 2 different data sources and combining into one total

I am a Tableau newbie and am trying to see if this is possible or not. I have 2 separate data sources where the same employees are listed: one is for closed cases and the other is for open cases. These data sources have some of the same columns, but for the most part they are different.
Is it possible to aggregate the case count for each employee on the closed and open data sources into a single column? For instance, if an employee has 50 closed cases and 23 open cases, I want it to show 73 for them.
I attempted to play around with the joins/unions but these didn't work properly and duplicated the data most times.
I think this is a great chance to leverage blends.
I have created a workbook with the Sample Superstore Excel dataset. This dataset has three sheets. I'll use the Orders and Returns sheets to demonstrate how we can calculate the net orders using blends.
The dataset I'm using can be found here.
Start by connecting to both the Orders and Returns separately. Once done with this step you should see the two data sources at the top of your data pane.
In this example, I'll calculate the Net Orders by Category. In your case, you're after the Total Cases by Employee, so just imagine Employee in place of Category.
Next, drag Category from the Orders data source onto the view, then select the Returns data source and click the chain icon to blend on Order ID.
You will need a common column between the two tables in order to blend.
Once blended I'll go back to the primary data source (indicated by the blue check mark) and create the Net Orders calculation.
This calculation uses the dot notation - similar to what you might see in SQL - to reference our other table.
To double check that our calculation is working properly, we can drag the components of this calculation onto the view and do the math.
Of course, once you are satisfied you can remove all but your blended calculation.
Blending isn't ideal in most cases but you could try it. Bring in each data source separately and "join" them within your workbook pane on Employee or hopefully an Employee_id. Click the little chain once you have them both loaded and you are on a worksheet tab. Then you could sum the counts by employee. Blending sometimes presents some issues with calculated fields across the two data sources but this is what I would try first.
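To illustrate the idea outside Tableau, here is a rough Python sketch of the same logic: count the cases per employee in each source separately, then add the two counts. The employee names and row lists are made up; only the 50 closed and 23 open cases come from the question.

```python
from collections import Counter

# Hypothetical rows: one entry per case, identified by employee name.
closed_cases = ["Ana"] * 50 + ["Ben"] * 12
open_cases = ["Ana"] * 23 + ["Cy"] * 4

# Aggregate each source on its own first, then combine the counts.
# Counting before combining avoids the row duplication a join can cause.
total_cases = Counter(closed_cases) + Counter(open_cases)

for employee, count in sorted(total_cases.items()):
    print(employee, count)
```

Employees that appear in only one source still get a total, because a missing count is treated as zero.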

R returning new columns instead of new rows for additional records

I'm trying to find an easy way to take output like the one below and convert it so that every additional product column becomes its own row, with the same link repeated in the first column. The ideal output is a data frame with three columns: one for the link, one for the product, and one for the price.
I'm scraping this data from a website for practice, but I'm having an issue with my output. Right now it's returning multiple columns per link where there are multiple products for a link; instead, I want a unique row for every product.

R RecordLinkage Identity

I am working with the RecordLinkage library in R.
I have a data frame with the columns id, name, phone, and mail.
My code looks like this:
ids = data$id
pairs = compare.dedup(data, identity=ids, blockfld=list(2,3,4))
The problem is that my ids do not appear in my result output,
so if I had this data:
id Name Phone Mail
233 Nathali 2222 nathali#dd.com
435 Nathali 2222
553 Jean 3444 jean#dd.com
In my result output I will have something like
id1 id2
1 2
Instead of
id1 id2
233 435
I want to know if there is a way to keep the ids instead of the index, or if someone could explain the identity parameter to me.
Thanks
The identity vector tells the getPairs method which of the input records belong to the same entity. It actually holds the information that you usually want to gain from record linkage, i.e. you have a couple of records and do not know in advance which of them belong together. However, when you use a training set to calibrate a method or you want to evaluate the accuracy of record linkage methods (the package was mainly written for this purpose), you start with an already deduplicated or linked data set.
In your example, the first two rows (ids 233, 435) obviously mean the same person and the third row a different one. A meaningful identity vector would therefore be:
c(1,1,2)
But it could also be:
c(42,42,128)
Just make sure that the identity vector has identical values exactly at those positions where the corresponding table rows hold matching records (vector index = row index).
About your question on how to display the ids in the result: You can get the full record pairs, including all data fields, with (see the documentation for more details):
getPairs(pairs)
There might be better ways to get hold of the original ids, depending on how you further process the record pairs (e.g. running a classification algorithm). Extend your example if you need more advice on this.
p.s.: I am one of the package authors. I have only very recently become aware that people ask questions about the package on Stack Overflow, so please excuse that a couple of questions have been around unanswered for a long time. I will look for a way to get notified on new questions posted here, but I would also like to mention that people can contact us directly via one of the email addresses listed in the package information.
You have to replace the index column with your id column.
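As a language-neutral illustration (written in Python, not with the RecordLinkage API), the relationship between row indices, original ids, and the identity vector can be sketched as follows. The ids and the identity vector come from the example above; generating all pairs without blocking is a simplifying assumption.

```python
from itertools import combinations

ids = [233, 435, 553]   # original id column from the example
identity = [1, 1, 2]    # same value = same real-world entity

# Pairs as 1-based row indices, the way the result output shows them.
index_pairs = list(combinations(range(1, len(ids) + 1), 2))

# Translate each index pair back to the original ids and flag true matches.
id_pairs = [
    (ids[i - 1], ids[j - 1], identity[i - 1] == identity[j - 1])
    for i, j in index_pairs
]

for id1, id2, is_match in id_pairs:
    print(id1, id2, is_match)
```

The index pair (1, 2) becomes (233, 435) and is flagged as a match because both rows carry the same identity value.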

Fix spelling errors in unique identifiers of data

I have 6,000 items (just a sampling of some 200,000 entries). The unique identifier is a company name (not my choosing). There are spelling mistakes in the company names. I'm using the Levenshtein distance algorithm to decide whether one company name is, say, 90% similar to another company name. If it is, I combine the entries. If I compare every company name against every other company name, I have 6,000^2 iterations. This takes over ten minutes. The data entries are stored in a C++ std::map, where the company names are the keys and the associated data are the values. Any ideas on how I can accurately decide whether two company names might be the same, allowing for small spelling errors or abbreviations, without a nested for loop?

R Models with Factors in Tableau

I'm attempting to build a model for sales in R that is then integrated into Tableau so I can look at the predictions as they relate to the actual values. The model I'm building for sales is in R, and I'm trying to integrate it into Tableau by creating a calculated field that uses the model to give the predicted value for each record using the SCRIPT_REAL function in Tableau. The records are all coming from a MySQL database connection. The issue that I'm having comes from using factors in my model (for example, month).
If I want to group all of the predictions by day of week, Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model. When it tries to aggregate month, not all of the values are the same, so it instead returns a "*". Obviously a prediction value then can't be reached because there is no value associated with a "*". Essentially what I'm trying to do is get a prediction value for each record that I have, and then aggregate those prediction values in various ways.
Okay, now I can understand a little bit better what you're talking about. A twbx with dummy data (and even a dummy model, as long as it generates the same problem you're facing) would help even more, but let me try to say a couple of things.
One thing that is important to understand is that SCRIPT functions are like table calculations, i.e., they are performed only with aggregated fields, they are computed last (after all aggregations, measures and regular calculations) and you can define the level of aggregation you want.
So, if you want to display values on a daily basis, put your date field on page, go to the day level, and for the calculation partition by DAY(date_field). If you want by week, same thing.
I find table calculations (including R scripts) very useful when they are an end, i.e. the calculation is the answer. They're not so useful (or better, not so easily manipulable) when they're a means, i.e. an intermediate step before a final calculation to get to the answer. That is mainly because the level of aggregation is based on the fields that are on page. So, for instance, if I have multiple orders from different clients and want to assess the average order value by customer, a table calculation is great: WINDOW_AVG(SUM(order_value)) partitioned by customer. If, for some reason, I want to sum all these values, then it's tricky. I can't do it directly, as the average order value is not stored anywhere and cannot be retrieved without all the clients being on page. So what I usually do is create the table with all customers, export it to mdb, and reconnect in Tableau.
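Outside Tableau, the two-stage calculation from that example looks like this in a short Python sketch; the customer names and order values are made up. The point is that the per-customer averages have to exist as stored values before they can be summed, which is what the export-and-reconnect step achieves.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical orders: (customer, order_value).
orders = [
    ("alice", 100.0), ("alice", 300.0),
    ("bob", 50.0), ("bob", 150.0), ("bob", 100.0),
]

# Stage 1: average order value per customer (the intermediate table).
values_by_customer = defaultdict(list)
for customer, value in orders:
    values_by_customer[customer].append(value)
avg_order_value = {c: mean(v) for c, v in values_by_customer.items()}

# Stage 2: aggregate the stored intermediate results.
total_of_averages = sum(avg_order_value.values())

print(avg_order_value, total_of_averages)
```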
I said all this because it might be your problem, when you say "Tableau can't perform the calculation because it tries to aggregate each field I'm using before passing it into the model". Yes, Tableau does that and there's nothing you can do about it except figure out a way around it. Creating an intermediate table in Tableau, exporting it, and connecting to it again in Tableau might be an answer. Performing the calculations in R, exporting the results, and then connecting to them in Tableau might be another way.
But again, without actually seeing what you're trying to do, it's hard to say what you need to do.
