How can I obtain the attribution of a channel per consumer in their purchase decision with an attribution model (Markov chain)? - r

In the last few days I have been working with Markov chains for a multi-touch (data-driven) attribution model, and I have found a lot of useful information at the macro level. For example, the ChannelAttribution package gives me the attribution of each channel in the process (TV, search, or call center) towards achieving a conversion, but this attribution is computed across all customer journeys, taking into account the removal (elimination) effect of each channel. My question is the following: at a micro level, can I find out, for each customer, which channel contributed most to their purchase decision? That is, which channel had the greatest impact on each customer's purchase? It does not matter whether a conversion was made or not.
For example, I imagine an output like the following:
|Customer ID|Channel Attribution by Customer|Conversion    |
|-----------|-------------------------------|--------------|
|1          |TV                             |Conversion    |
|2          |TV                             |Conversion    |
|3          |Search                         |Non-Conversion|
|4          |Call-center                    |Conversion    |
|5          |TV                             |Non-Conversion|
|6          |Call-center                    |Conversion    |
I would be grateful for any help. Sorry for my English; I hope the question is clear.

Sorry for the late reply; maybe this will be helpful for someone else.
The first thing you'll need to do is get your data into shape, specifically a long shape. I've built a sample below for the first 3 Customer IDs in your output table:
|Customer ID|Channel     |Conversion    |
|-----------|------------|--------------|
|1          |TV          |Conversion    |
|1          |TV          |Conversion    |
|1          |Call Centre |Conversion    |
|2          |TV          |Conversion    |
|2          |TV          |Conversion    |
|3          |Search      |Non-conversion|
|3          |Search      |Non-conversion|
|3          |Call Centre |Non-conversion|
Notice that if you look at the most frequent channel for each Customer ID, it corresponds to the channel attribution column in your desired output table.
You can do this by:
Grouping by Customer ID and Conversion (the Conversion-Customer ID relationship should be 1 to 1).
Counting how many times each channel occurs for each customer. This will give you:
|Customer ID|Channel     |Conversion    |Channel Count|
|-----------|------------|--------------|-------------|
|1          |TV          |Conversion    |2            |
|1          |TV          |Conversion    |2            |
|1          |Call Centre |Conversion    |1            |
|2          |TV          |Conversion    |2            |
|2          |TV          |Conversion    |2            |
|3          |Search      |Non-conversion|2            |
|3          |Search      |Non-conversion|2            |
|3          |Call Centre |Non-conversion|1            |
There is some duplication in the Conversion and Channel Count fields; ignore it for now.
With the same grouping as above, keep only the rows where Channel Count equals the maximum for that customer. This will give you:
|Customer ID|Channel     |Conversion    |Channel Count|
|-----------|------------|--------------|-------------|
|1          |TV          |Conversion    |2            |
|1          |TV          |Conversion    |2            |
|2          |TV          |Conversion    |2            |
|2          |TV          |Conversion    |2            |
|3          |Search      |Non-conversion|2            |
|3          |Search      |Non-conversion|2            |
Run a distinct across the dataset, this will give you:
|Customer ID|Channel     |Conversion    |Channel Count|
|-----------|------------|--------------|-------------|
|1          |TV          |Conversion    |2            |
|2          |TV          |Conversion    |2            |
|3          |Search      |Non-conversion|2            |
Which corresponds to your imagined output.
Tie breakers
Sometimes a Customer ID will have two or more channels with the same highest count, for instance two TV and two Search. How do we manage this? If you really must have one row per customer, then depending on what you're planning on doing, you'll need to either:
Build some priority ranking logic where rules dictate which channel is counted as the attribution.
Build some logic that randomly attributes the channel in the case of a tie.
I hope that helps. I've kept the answer code free but had R/Python in the back of my mind. It could possibly be implemented in Excel, but someone far smarter than I would need to contribute that answer.
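The answer stays code free on purpose, but the steps above (count channels per customer, keep the maximum, deduplicate) plus a tie-break rule could be sketched in Python roughly as follows. This is a minimal sketch using only the standard library. The function name `top_channel_per_customer`, the `PRIORITY` order, and the `tie_break` parameter are assumptions made for illustration and are not part of the original answer.

```python
import random
from collections import Counter, defaultdict

# Long-format touchpoints: (customer_id, channel, conversion), as in the sample table.
touchpoints = [
    (1, "TV", "Conversion"),
    (1, "TV", "Conversion"),
    (1, "Call Centre", "Conversion"),
    (2, "TV", "Conversion"),
    (2, "TV", "Conversion"),
    (3, "Search", "Non-conversion"),
    (3, "Search", "Non-conversion"),
    (3, "Call Centre", "Non-conversion"),
]

# Hypothetical priority order, used only to break ties (earlier entries win).
PRIORITY = ["TV", "Search", "Call Centre"]


def priority_rank(channel):
    # Channels missing from PRIORITY rank after every listed channel.
    return PRIORITY.index(channel) if channel in PRIORITY else len(PRIORITY)


def top_channel_per_customer(rows, tie_break="priority", seed=None):
    """Return {customer_id: (channel, conversion, count)} for the most frequent channel."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)  # customer -> channel -> number of touchpoints
    outcome = {}                   # customer -> conversion (assumed 1 to 1)
    for customer, channel, conversion in rows:
        counts[customer][channel] += 1
        outcome[customer] = conversion

    result = {}
    for customer, channel_counts in counts.items():
        best = max(channel_counts.values())
        tied = [ch for ch, n in channel_counts.items() if n == best]
        if len(tied) > 1 and tie_break == "random":
            winner = rng.choice(tied)             # random tie-break
        else:
            winner = min(tied, key=priority_rank)  # rule-based tie-break
        result[customer] = (winner, outcome[customer], best)
    return result


attribution = top_channel_per_customer(touchpoints)
for customer, (channel, conversion, count) in sorted(attribution.items()):
    print(customer, channel, conversion, count)
```

Keeping the result in a dictionary keyed by customer makes the "distinct" step implicit, since each customer can only appear once.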

Related

Computing an object 'similarity fingerprint'

Background
This is a theoretical problem similar to an actual problem I am having.
Imagine I have a database of millions of grades for students with rows looking something like the following:
student_id | first_name | last_name | maths_grade | english_grade | physics_grade | chemistry_grade | biology_grade
-------------------------------------------------------------------------------------------------------------------
15643 | John | Smith | 68 | 87 | 54 | 36 | 98
13465 | Alice | Jones | 87 | 54 | 52 | 84 | 23
....
The Problem
I want to find students with the most similar results so I can pair them up as study partners, that way they can both learn from each other, unlike a typical mentor/mentee relationship where only one student learns.
When a given student submits a request for a study partner, I want to find the student in our database that has the most similar grades to the student who made a request.
All subject grades have equal weighting.
The Required Solution
Since this database has millions of grades, simply looking through the entire database and computing a 'similarity score' for the given student won't work, it would be too slow and wouldn't scale.
Instead, we need to add a grade_fingerprint field to the database that contains a 'similarity fingerprint' of the students' grades. We can then index the table on this field and quickly find the two closest students to our given student.
This grade_fingerprint would be similar to a hash code in that it acts as a summary of the table entry. However, two objects with similar properties do not generally have similar hash codes, which makes ordinary hash codes unsuitable.
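For reference, the brute-force scan that the question rules out could look like the sketch below. It assumes Euclidean distance over the five grades as the 'similarity score' (the question only says all subjects are weighted equally), and it uses a small in-memory dictionary as a stand-in for the database. The function name `closest_student` and the third row are made up for illustration.

```python
import math

# Hypothetical stand-in for the grades table: student_id -> five subject grades
# (maths, english, physics, chemistry, biology).
grades = {
    15643: (68, 87, 54, 36, 98),
    13465: (87, 54, 52, 84, 23),
    20001: (70, 85, 50, 40, 95),  # made-up row so the example has a close match
}


def closest_student(student_id, table):
    """Linear scan: compare the given student against every other row."""
    target = table[student_id]
    best_id, best_distance = None, math.inf
    for other_id, other_grades in table.items():
        if other_id == student_id:
            continue
        distance = math.dist(target, other_grades)  # equal weight per subject
        if distance < best_distance:
            best_id, best_distance = other_id, distance
    return best_id, best_distance


partner, distance = closest_student(15643, grades)
print(partner, distance)
```

Every request touches every row, which is the cost the fingerprint field is meant to avoid.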
Accepted Answers
The name of a sub-domain of computer science that covers the fingerprinting of objects in a way that allows for easy searching for similar objects
The pseudocode of an algorithm that can compute a 'fingerprint' of an object that can be used to compare object similarity

DynamoDB Table/Index Modeling + Querying

Basic requirements:
I have a table with a bunch of attributes (20-30), but only 3 are used in querying: User, Category, and Date, and would be structured something like this...
User | Category | Date | ...
1 | Red | 5/15
1 | Green | 5/15
1 | Red | 5/16
1 | Green | 5/16
2 | Red | 5/18
2 | Green | 5/18
I want to be able to query this table in the following 2 ways:
Most recent rows (based on Date) by User. e.g., User=1 returns the 2 rows from 5/16 (3rd and 4th row)
Most recent rows (based on Date) by User and Category. e.g., User=1, Category=Red returns the 5/16 row only (the 3rd row).
Is the best way to model this with a HASH on User, RANGE on Date, and a GSI with HASH on User+Category and RANGE on Date? Is there anything else that might be more efficient? If that's the path of least resistance, I'd still need to know how many rows to return, which would require doing a count against distinct categories or something?
I've decided that it's going to be easier to just change the way I'm storing the documents. I'll move the category and other attributes into a sub-document so I can easily query against User+Date and I'll do any User+Category+Date querying with some client-side code against the User+Date result set.
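The client-side filtering mentioned above could be sketched as follows. This is only an illustration under assumptions: `rows` stands in for the result set of a User+Date query, the rows are kept flat rather than using a sub-document, the year 2024 is invented so the dates can be compared, and `most_recent` is a hypothetical helper. No DynamoDB calls are shown.

```python
from datetime import date

# Hypothetical result set; the year is an assumption, the question only gives month/day.
rows = [
    {"User": 1, "Category": "Red", "Date": date(2024, 5, 15)},
    {"User": 1, "Category": "Green", "Date": date(2024, 5, 15)},
    {"User": 1, "Category": "Red", "Date": date(2024, 5, 16)},
    {"User": 1, "Category": "Green", "Date": date(2024, 5, 16)},
    {"User": 2, "Category": "Red", "Date": date(2024, 5, 18)},
    {"User": 2, "Category": "Green", "Date": date(2024, 5, 18)},
]


def most_recent(rows, user, category=None):
    """Return all rows for the user (and optional category) that share the latest date."""
    matching = [
        r for r in rows
        if r["User"] == user and (category is None or r["Category"] == category)
    ]
    if not matching:
        return []
    latest = max(r["Date"] for r in matching)
    return [r for r in matching if r["Date"] == latest]


print(most_recent(rows, user=1))                  # both rows dated 5/16
print(most_recent(rows, user=1, category="Red"))  # only the Red row dated 5/16
```

Filtering on the latest date also sidesteps the question of how many rows to return, since the count falls out of the data.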

How to retrieve movies' genres from wikidata using R

I would like to retrieve information from wikidata and store it in a dataframe. For the sake of simplicity I am going to assume that I want to get the genre of the following movies and then filter those that belong to sci-fi:
movies = c("Star Wars Episode IV: A New Hope", "Interstellar",
"Happythankyoumoreplease")
I know there is a package called WikidataR. If I am not wrong, and according to its vignettes, there are two commands that may be useful: find_item and find_property, which let you retrieve a set of Wikidata items or properties whose aliases or descriptions match a particular search term. They seemed ideal for my purpose, so I thought of doing something like
library(WikidataR)
info = list()
for (i in movies) {
  info[[i]] = find_item(i)
}
This is what I get from each item:
> find_item("Interstellar")
Wikidata item search
Number of results: 10
Results:
1 Interstellar (Q13417189) - 2014 US science fiction film
2 Interstellar (Q6057099)
3 interstellar medium (Q41872) - matter and fields (radiation) that exist in the space between the star systems in a galaxy;includes gas in ionic, atomic or molecular form, dust and cosmic rays. It fills interstellar space and blends smoothly into the surrounding intergalactic space
4 space colonization (Q686876) - concept of permanent human habitation outside of Earth
5 rogue planet (Q167910) - planetary-mass object that orbits the galaxy directly
6 interstellar cloud (Q1054444) - accumulation of gas, plasma and dust in a galaxy
7 interstellar travel (Q834826) - term used for hypothetical manned or unmanned travel between stars
8 Interstellar Boundary Explorer (Q835898)
9 starship (Q2003852) - spacecraft designed for interstellar travel
10 interstellar object (Q2441216) - astronomical object in interstellar space, such as a comet
>
Unfortunately, the information that I get from find_item (see above) has two problems:
it is not a dataframe with all the Wikidata information of the item I am searching for, but a list of what seems to be metadata (Wikidata's id, link...).
it does not have the information I need (the Wikidata properties of each particular item).
Similarly, find_property provides metadata of a certain property. find_property("genre") retrieves the following information:
> find_property("genre")
Wikidata property search
Number of results: 4
Results:
1 genre (P136) - a creative work's genre or an artist's field of work (P101). Use main subject (P921) to relate creative works to their topic
2 radio format (P415) - describes the overall content broadcast on a radio station
3 sex or gender (P21) - sexual identity of subject: male (Q6581097), female (Q6581072), intersex (Q1097630), transgender female (Q1052281), transgender male (Q2449503). Animals: male animal (Q44148), female animal (Q43445). Groups of same gender use "subclass of" (P279)
4 gender of a scientific name of a genus (P2433) - determines the correct form of some names of species and subdivisions of species, also subdivisions of a genus
This has similar problems:
it is not a dataframe
it just stores metadata about the property
I don't find any way to link each property with each object in movies vector.
Is there any way to end up with a dataframe containing the genres of those movies? (Or a dataframe with all of Wikidata's information, which I would then manipulate to filter or select the data I want?)
These are just lists. You can get a picture of their structure with str(find_item("Interstellar")), for example.
Then you can go through each element of the list and pick the fields that you need. For example, to get the title and the label:
a <- find_item("Interstellar")
b <- Reduce(rbind,lapply(a, function(x) cbind(x$title,x$label)))
data.frame(b)
## X1 X2
## 1 Q13417189 Interstellar
## 2 Q6057099 Interstellar
## 3 Q41872 interstellar medium
## 4 Q686876 space colonization
## 5 Q167910 rogue planet
## 6 Q1054444 interstellar cloud
## 7 Q834826 interstellar travel
## 8 Q835898 Interstellar Boundary Explorer
## 9 Q2003852 starship
## 10 Q2441216 interstellar object
This works easily for regular data. If some element is missing, you will have to handle it; for example, some items don't have a description. You can get around that with the following.
Reduce("rbind", lapply(a,
  function(x) cbind(x$title,
                    x$label,
                    ifelse(length(x$description) == 0, NA, x$description))))

Power Bi graph like pivot graph

I'm new to Power BI. I have followed most of the tutorials on MS but haven't yet figured out how to create a graph that resembles this one, which I made with an Excel pivot chart using the same data table as the source.
What I need to recreate in Power Bi is a column graph with the most requested (pre-orders requests % of total sum) products in different price ranges.
Pivot Graph
Table ie.
| Date | Product | 3 to 5 Eur | 5 to 8 Eur | 8 to 11 Eur |
----------------------------------------------------------
| mar17| Coffe | 12 | 7 | 2 |
| mar17| Milk | 15 | 3 | 1 |
| mar17| Honey | 17 | 0 | 5 |
| mar17| Sugar | 20 | 9 | 8 |
Thanks in advance for the help.
Bests,
Alberto
Edit - Thanks to Mike Honey for pointing out the original request was for % of grand total. I have added an additional step to accomplish this and cleaned up some existing steps.
When I imported your sample data into Power BI, I got this (looking at the data in the Query Editor window).
From there, select the Date and Product columns and then click on Transform -> Unpivot Columns -> Unpivot Other Columns...
... which results in this.
Just to clean this up, I renamed the Attribute and Value columns and changed the data type of the Value column. In the end, it looks like this.
Then just click on Home -> Close & Apply to get back to the Report Editor window, where you can create a graph and configure it as follows:
Axis:
Price Range
Product
Value:
Quantity
Then click on the forked drill-down arrow in the top left corner of the graph to show Price Range and Product.
Which looks like this.
Next, although it is not necessary, I find it a nice touch: with the graph selected, click on the paint roller icon and expand the X-Axis category. In there, turn off Concatenate labels.
Finally, to get the bars to be % grand total, simply right click on Quantity in the Value section of the graph's fields and then select Show value as -> Percent of grand total.
The final result looks like this.
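Outside Power BI, the same two transformations (unpivot the price-range columns, then express each value as a percent of the grand total) could be sketched in Python as below. This is an illustration of the data shaping only, not of anything Power BI runs internally. The `unpivot` helper is hypothetical, and the column names Price Range and Quantity follow the renaming described above.

```python
# Wide table from the question: one column per price range.
wide = [
    {"Date": "mar17", "Product": "Coffe", "3 to 5 Eur": 12, "5 to 8 Eur": 7, "8 to 11 Eur": 2},
    {"Date": "mar17", "Product": "Milk", "3 to 5 Eur": 15, "5 to 8 Eur": 3, "8 to 11 Eur": 1},
    {"Date": "mar17", "Product": "Honey", "3 to 5 Eur": 17, "5 to 8 Eur": 0, "8 to 11 Eur": 5},
    {"Date": "mar17", "Product": "Sugar", "3 to 5 Eur": 20, "5 to 8 Eur": 9, "8 to 11 Eur": 8},
]

ID_COLUMNS = ("Date", "Product")


def unpivot(rows, id_columns):
    """Turn every non-id column into its own (Price Range, Quantity) row."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            long_rows.append({**ids, "Price Range": column, "Quantity": value})
    return long_rows


long_rows = unpivot(wide, ID_COLUMNS)
grand_total = sum(r["Quantity"] for r in long_rows)
for r in long_rows:
    r["Percent of grand total"] = 100 * r["Quantity"] / grand_total

for r in long_rows:
    print(r["Price Range"], r["Product"], round(r["Percent of grand total"], 1))
```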

JavaFX TableView every row not a single object

I am coming from Swing where in a JTable, I can just set the column and the row to a value. In a JavaFX TableView, I have to make each row represent an object. I am trying to represent a schedule for a race track. I have round and race number, and then whoever is in each lane.
Round | Race | Lane 1 | Lane 2
1     | 1    | Bob    | Joe
1     | 2    | Tom    | Sam
2     | 1    | Sam    | Joe
2     | 2    | Bob    | Tom
Each object in a lane (Bob, Tom, ...) is a Car object. It has various fields, but what should be shown in the table is whatever toString() returns, in this case the driver's name. I have an array of Round objects; each Round has an array of Races, and each Race has an array of Cars, one per lane. I need a way to represent this data structure in a TableView as shown above. Note that the number of lanes, races per round, and total rounds can be changed by the user at runtime.
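Independently of JavaFX, the row model described above can be illustrated by flattening the nested structure so that each race becomes one row and the lane columns are generated from the data. The sketch below is language-neutral in spirit and written in Python only for brevity. It uses plain driver names in place of Car objects, and the `flatten` helper is hypothetical; it does not show any TableView API.

```python
# Nested schedule from the question: rounds -> races -> driver per lane.
schedule = [
    [["Bob", "Joe"], ["Tom", "Sam"]],  # round 1
    [["Sam", "Joe"], ["Bob", "Tom"]],  # round 2
]


def flatten(schedule):
    """Return (header, rows) with one row per race and one column per lane."""
    lane_count = max((len(race) for rnd in schedule for race in rnd), default=0)
    header = ["Round", "Race"] + [f"Lane {i}" for i in range(1, lane_count + 1)]
    rows = []
    for round_no, rnd in enumerate(schedule, start=1):
        for race_no, race in enumerate(rnd, start=1):
            # Pad short races so every row has the same number of lane cells.
            padded = list(race) + [""] * (lane_count - len(race))
            rows.append([round_no, race_no] + padded)
    return header, rows


header, rows = flatten(schedule)
print(header)
for row in rows:
    print(row)
```

Because the header is derived from the data, changing the number of lanes, races, or rounds only requires calling `flatten` again.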
