TABLEAU: How can I measure similarity of sets of dimensions across dates? - r

This is a bit of a complicated one, but I'll do my best to explain. I have a dataset that I scrape from a particular video-on-demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10), and the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be roughly 33%.
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis, whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of those titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems the better way would be to do this within Tableau so that I can change these parameters in "real time", so to speak. I.e. if I want to focus on the top 6 rows and columns, I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)

You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you’ll have to write your function, of course, and then use the Rserve() function (from the Rserve package) to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above control the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau’s website for more information. The official name for this functionality is Tableau “analytic extensions”.
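To make that concrete, here is a minimal sketch of the two halves. The function, field names and calculated field are illustrative only (they are not taken from your data), and the R side just assumes the Rserve package is installed:
# R side: expose R to Tableau via Rserve
library(Rserve)
Rserve()   # starts a local Rserve instance on the default port (6311)

# The kind of logic Tableau could send: the share of today's titles
# that also appeared on the previous day
similarity_pct <- function(titles, prev_titles) {
  mean(titles %in% prev_titles)
}

# Tableau side (an illustrative calculated field, shown here as a comment):
# SCRIPT_REAL("mean(.arg1 %in% .arg2)", ATTR([Title]), ATTR([Prev Title]))
# .arg1/.arg2 are how Tableau hands the partitioned vectors to R; the
# [Prev Title] field is hypothetical - you would derive it (or reshape the
# data) so that both days are available to the table calculation.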

I am sharing a strategy to solve this in R.
Step 1: Load the libraries and data
library(tidyverse)
library(lubridate)

movies <- as_tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Step 2: Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Step 3: Get the final result:
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>% # restrict search to designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>% # remove duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>% # check whether it was screened the previous day
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = TRUE),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 17 0 0
2 2018-08-14 17 10 0.588
3 2018-08-15 17 9 0.529
If you change the filters to 12, 12 respectively, then
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = TRUE),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 68 0 0
2 2018-08-14 75 61 0.813
3 2018-08-15 72 54 0.75
Good Luck

As Alex has suggested, you can have the best of both worlds. But to the best of my knowledge, Tableau Desktop only interfaces with R (or Python, etc.) through calculated fields, i.e. SCRIPT_INT(), SCRIPT_REAL() and so on. These functions are table calculations, which in Tableau only work within a context (partitioning and addressing); you cannot hard-code the fields/columns they use, so they cannot be used independently of that context. Moreover, table calculations in Tableau can neither be further aggregated nor mixed with LOD expressions. So in your use case (again, to the best of my knowledge), the practical route is to build a parameter-dependent view in Tableau after pre-computing the values in the programming language of your choice. I therefore suggest creating a new column in your dataset, before importing it into Tableau, by running the following (or the equivalent in your preferred language):
movies_edited <- movies %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  ungroup()
write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, where 1 means the title was also available on the previous day, 0 means it was not screened on the immediately preceding day, and NA means it is the first day of its screening.
I imported this dataset into Tableau and created a parameter-dependent view, as you desired.

Related

Finding summary statistics. Struggling with having anything work after importing data into R from Excel

Very new to R here, also very new to the idea of coding and computer stuff.
Second week of class and I need to find some summary statistics from a set of data my professor provided. I downloaded the chart of data and tried to follow along with his verbal instructions during class, but I'm one of the only people without a computer science background in my degree program (I am an RN going for a degree in Health Informatics), so he went way too fast for me.
I was hoping for some input on just where to start with his list of tasks. I downloaded his data into an Excel file, then loaded it into R, and it is now a matrix. However, everything I try for getting the mean and standard deviation of the columns he wants comes up with an error. I understand that I need to convert these column names into some sort of vector, but every website online tells me to do these tasks differently. I don't even know where to start with this assignment.
Any help on how to get started would be greatly appreciated. I've included a screenshot of his instructions and of my matrix. And please excuse my ignorance/lack of familiarity compared to most of you here... this is my second week of my master's; I'm hoping I begin to pick this up soon, I'm just not there yet.
the instructions include:
# * Import the dataset
# * summarize the dataset, compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Tabulate smokers and age.level data with the variable and its frequency. How many smokers are in each age category?
# * Subset the dataset by the mothers that smoke and weigh less than 100kg; how many mothers meet these requirements?
# * Compute the mean and standard deviation for the three variables (columns): age, height, weight
# * Plot a histogram
Stack Overflow is not a place for homework, but I feel your pain. Let's go piece by piece.
First let's use a package that helps us do those tasks:
library(data.table) # if not installed, install it with install.packages("data.table")
Then, let's load the data:
library(readxl) #again, install it if not installed
dt = setDT(read_excel("path/to/your/file/here.xlsx"))
Now to the calculations:
1 summarize the dataset. Here you'll see the ranges, means, medians and other interesting data of your table.
summary(dt)
1A mean and standard deviation of age, height and weight (replace age with height or weight in the line below to get those):
dt[, .(meanValue = mean(age, na.rm = TRUE), stdDev = sd(age, na.rm = TRUE))]
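If it helps, here is a small sketch that gets all three in one call (assuming the columns are literally named age, height and weight):
# Mean and standard deviation of age, height and weight in one data.table call
dt[, .(variable  = c("age", "height", "weight"),
       meanValue = sapply(.SD, mean, na.rm = TRUE),
       stdDev    = sapply(.SD, sd, na.rm = TRUE)),
   .SDcols = c("age", "height", "weight")]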
2 tabulate smokers and age.level: get the counts for each combination:
dt[, .N, by = .(smoke, age.level)]
3 subset smoker mothers with weight < 100 (I'm assuming non-pregnant mothers have NA in the gestation field; adjust as necessary):
dt[smoke == 1 & weight < 100 & !is.na(gestation), .N]
4 Is the same as 1A.
5 Plot a histogram (you don't specify which variable, so let's say it's age):
hist(dt$age)
Keep on studying R, it's not that difficult. The book recommended in the comments is a very good start.

Implementing Code Conditionally in R Based on Features of Dataset

I'm looking to streamline my code and minimize manual tweaks depending on the data set I run through it. I receive batches of data by country, but each country is slightly different in terms of fields and field names, so it requires tweaking each time I run a new country. I would like to eliminate the tweaks and do some selective coding. (Many of the challenges I handle easily with ifelse(), but I haven't been able to do a conditional mutate, for example.)
This is a logic question, so please let me know if I should have uploaded a data set.
This is a new example I just added; I realized that since the one I had used was a mutate, there were many tools to answer the question. In this example, I am dealing with data from various countries, each data frame with varying dimensionality, which I want to keep. I could, of course, use different code for each country, but I think it would be cleaner if I used the same code and it accommodated the various country data.
I have created a version of this using mutate with ifelse, creating variables for these non-common dimensions, and that works. I'm wondering if there is an alternative in R where I can run select snippets of code (and a good answer may be that there is no such option inside pipes). I know how to do this with separate sets of code and if {} else {}.
Keep in mind, this is part of a much larger block of code that I need all the countries to run though...this is just an illustrative subset.
# As you can see, I comment out each country's unique variables (and spelling!)
P_Region_HP_Brand <- P_Region_HP %>%
left_join(M_brand) %>%
left_join(M_prodcat) %>%
group_by(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, rank_m, Launch_Year, Launch_Month, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
# HEARTMON, WTRRSST # USA
Sports, DIST_TYP # CHN
) %>%
summarize(Dollars = sum(Dollars), # ALL (inc USA)
Local_Currency = sum(Local_Currency), # ALL
Units = sum(Units)) %>%
select(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date, Launch_Year, Launch_Month,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
Units, Dollars, Local_Currency, rank_m, # ALL (inc USA)
# HEARTMON, WTRRSST, # USA
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
Sports, DIST_TYP # CHN
) %>%
as.data.frame() %>%
arrange(Country, desc(Date), desc(Local_Currency))
Does anyone know a solution for this that will allow me to keep my code simple enough, and run select lines for given countries?
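Not a full answer, but one pattern that might fit is dplyr's any_of() (dplyr >= 1.0), which silently skips column names that a given country's file doesn't have, so the same pipeline can run for every country. A minimal sketch with a trimmed set of columns (the names are copied from the snippet above; joins and most columns are omitted for brevity):
library(dplyr)

# Country-specific columns that may or may not exist in a given file
extra_cols  <- c("SPORTS", "VOICE.ASSISTANT.FUNCTION",   # JPN
                 "Sports", "Heart.Rate.Sensor",          # EU3
                 "HEARTMON", "WTRRSST",                   # USA
                 "DIST_TYP")                              # CHN
common_cols <- c("Calendar_Year", "Country", "BrandSummary", "Model")

P_Region_HP_Brand <- P_Region_HP %>%
  group_by(across(any_of(c(common_cols, extra_cols)))) %>%   # only groups by columns that exist
  summarize(Dollars = sum(Dollars),
            Local_Currency = sum(Local_Currency),
            Units = sum(Units),
            .groups = "drop") %>%
  select(any_of(c(common_cols, extra_cols)), Units, Dollars, Local_Currency) %>%
  arrange(Country, desc(Local_Currency))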

Removing duplicates where the relationship is only clear by comparing lines (relative reference in R)

The situation: I have some data about contracts, and how many acres are covered by a contract in a given year. The contracts I am dealing with have an obnoxious naming convention: contract renewals have the same name with 'a', 'b', 'c', etc. appended after the number.
Because contracts can be renewed at any time, calculating the acreage in a given year means that there is double-counting when the renewal begins. Some example data might help to explain:
example <- data.frame(contract = c('c300a', 'c300b'),
true_contract = c('c300', 'c300'),
acres_2007 = c(100, 0),
acres_2008 = c(100, 100),
acres_2009 = c(0, 100)
)
print(example)
contract true_contract acres_2007 acres_2008 acres_2009
1 c300a c300 100 100 0
2 c300b c300 0 100 100
As you can see, if the transition from 300a to 300b happened on (for example) May 20, 2008, then there is double-counting in 2008. Those 100 acres are the same piece of land. I would like a way to remove one of the 100s - it doesn't matter which, since both contracts are functionally "the same".
I can tell the problem by looking at it, but am completely puzzled about how I would address the issue using R. In fact, I have always been at a loss about how to deal with data issues where the relationship is only clear from looking at lines that are next to each other. This is a very Excel-style mindset (relative reference) but I am not good at Excel/VBA. In addition, I come up against problems like this often enough that figuring out how to map this problem to R solutions would help me a lot.
Here's a general solution that applies a rule to all contracts in all years. The rule I used was "For each contract-year with more than one contract, keep the largest one, and if more than one at that size, keep the later one."
library(dplyr); library(tidyr)
example %>%
  # Split contract name into two, putting the last letter/digit into a new column
  separate(contract, c("contract", "renewal_ltr"), sep = -1) %>%
  # Gather into long form to make counting easier
  gather(year_col, acres, -c(contract:true_contract)) %>%
  # Optional: extract year from year_col; dropped below but might be of use
  mutate(year = readr::parse_number(year_col)) %>%
  # For contracts with more than one value in a year, keep the larger one,
  # or if tied, keep the later one
  group_by(contract, year_col) %>%
  arrange(year, desc(acres), desc(renewal_ltr)) %>%
  slice(1) %>% # keep top row per group
  ungroup() %>%
  # Optional: spread back
  select(-year) %>%
  spread(year_col, acres, fill = 0)
Output
# A tibble: 2 x 6
contract renewal_ltr true_contract acres_2007 acres_2008 acres_2009
<chr> <chr> <fct> <dbl> <dbl> <dbl>
1 c300 a c300 100 0 0
2 c300 b c300 0 100 100
If I understood correctly, you want to remove one of the duplicated 100s in the acres_2008 column. This keeps the first value in that column and replaces the others with 0:
example$acres_2008 <- ave(
example$acres_2008,
example$true_contract,
FUN = function(a) replace(a, duplicated(a), 0)
)
The result with your example is:
  contract true_contract acres_2007 acres_2008 acres_2009
1    c300a          c300        100        100          0
2    c300b          c300          0          0        100

Juxtaposing Replicate Data

I have provided a sample dataset that I have arranged in column format (called "full.table").
These data were extracted from a 96-well PCR plate, and while collecting my data I always ran a duplicate experiment, meaning each variable (aka test) has 1 replicate. I would like to take all replicates and juxtapose them (have them be side by side), which would allow me to easily visualize replicates next to each other and finally calculate an average value for the variable "Cq" between the two.
The complications stem from having done multiple tests over several days (complication one), and NOT having my samples always run in the same fashion on the PCR plate (complication two). Typically, as you see in my data set below, Well A1 has a duplicate in Well B1; however, this is not always the case. Occasionally, Well A7 matches Well A8 (and NOT B7).
Replicates were always run on the same day, so an important variable here is "date", which I added via R before uploading to Stack Exchange. I am confused about how to re-arrange the data to get my desired result (I'm not even sure where to start).
I have provided an example of what I would like in the end, called “sample.finished.table”
Logically, with 768 observations in this example, dividing them in two should result in 384 total lines of data (385 with the header).
I appreciate any feedback. Thank you
full.table<- read.table("https://pastebin.com/raw/kTQhuttv", header=T, sep="")
sample.finished.table <- read.table("https://pastebin.com/raw/Phg7C9xD", header=T, sep="")
You can use dplyr here to group by sample and extract the requested values:
library(dplyr)
full.table %>%
  group_by(sample, date) %>%
  summarise(Well1 = first(Well), Cq1 = first(Cq),
            Well2 = last(Well), sample1 = last(sample),
            Cq2 = last(Cq), Cq_mean = mean(Cq[Cq > 0]))

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like this:
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
If I wanted to pull the relfreq of Medical Devices (row 5), how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
weight <- relfreq of industrytype parameter
thing2 <- thing1*weight
return(thing2)
}
So if I subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is there an easier way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you requested (industryType and relfreq).
The tidyverse package lets you do this intuitively: library(tidyverse)
data_want <- severity %>%
  subset(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step to the next, as if the calls were nested.
I think selecting the whole row first is better; then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
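Tying either approach back to the function in the question, a quick sketch (thing1 stays as the asker's placeholder for whatever is being weighted):
example <- function(industrytype, thing1) {
  # look up relfreq by industry name instead of by row index
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing1 * weight
}

example("Medical Devices", thing1 = 10)
# [1] 14.63415   (i.e. 10 * 1.463415)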
