Implementing Code Conditionally in R Based on Features of Dataset

I'm looking to streamline my code and minimize manual tweaks depending on the data set I run through it. That is, I receive batches of data by country, but each country is slightly different in terms of fields and field names, so the code requires tweaking each time I run a new country. I would like to eliminate the tweaks and do some selective coding. (Many of the challenges I handle easily with ifelse(), but I haven't been able to do a conditional mutate, for example.)
This is a logic question, so please let me know if I should have uploaded a data set.
This is a new example I just added; I realized that, since the one I had used was a mutate, there were many tools that could answer the question. In this example, I am dealing with data from various countries, each data frame with varying dimensionality, which I want to keep. I could, of course, use different code for each, but I think it would be cleaner if I used the same code and it accommodated the various country data.
I have created a version of this using mutate() with ifelse(), creating variables for these non-common dimensions, and that works. I'm wondering whether there is an alternative in R where I can run selected snippets of code (and a good answer may be that there is no such option inside pipes). [I know how to do this with separate sets of code and if {} else {}.]
Keep in mind, this is part of a much larger block of code that I need all the countries to run through... this is just an illustrative subset.
# As you can see, I comment out each country's unique variables (and spelling!)
P_Region_HP_Brand <- P_Region_HP %>%
left_join(M_brand) %>%
left_join(M_prodcat) %>%
group_by(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, rank_m, Launch_Year, Launch_Month, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
# HEARTMON, WTRRSST # USA
Sports, DIST_TYP # CHN
) %>%
summarize(Dollars = sum(Dollars), # ALL (inc USA)
Local_Currency = sum(Local_Currency), # ALL
Units = sum(Units)) %>%
select(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date, Launch_Year, Launch_Month,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
Units, Dollars, Local_Currency, rank_m, # ALL (inc USA)
# HEARTMON, WTRRSST, # USA
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
Sports, DIST_TYP # CHN
) %>%
as.data.frame() %>%
arrange(Country, desc(Date), desc(Local_Currency))
Does anyone know a solution that will allow me to keep my code simple enough and run selected lines for given countries?
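One pattern worth noting here is tidyselect's any_of(), which silently skips column names that are not present in the incoming data, so a single pipeline can tolerate the per-country columns without commenting. A minimal sketch, assuming the country-specific spellings are gathered into one vector (the common column list is trimmed for the sketch, not the full set used above):
country_cols <- c("SPORTS", "VOICE.ASSISTANT.FUNCTION",  # JPN
                  "Sports", "Heart.Rate.Sensor",         # EU3 / CHN
                  "HEARTMON", "WTRRSST",                 # USA
                  "DIST_TYP")                            # CHN

common_cols <- c("Calendar_Year", "Calendar_Quarter", "Calendar_Month", "Date",
                 "Country", "BrandSummary", "Model")     # trimmed for the sketch

# any_of() keeps only the names that actually exist in the current country's data.
P_Region_HP_Brand <- P_Region_HP %>%
  left_join(M_brand) %>%
  left_join(M_prodcat) %>%
  group_by(across(any_of(c(common_cols, country_cols)))) %>%
  summarize(Dollars = sum(Dollars),
            Local_Currency = sum(Local_Currency),
            Units = sum(Units)) %>%
  select(any_of(c(common_cols, "Units", "Dollars", "Local_Currency", country_cols))) %>%
  as.data.frame() %>%
  arrange(Country, desc(Date), desc(Local_Currency))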

Related

Selective choice of tuples with partially matching characteristics in R

I have a dataset with data about political careers.
Every politician has a unique identifier number (ui) and can occur in multiple electoral terms (electoral_terms). Every electoral term equals a period of 4 years in which the politician is in office.
Now I would like to find out which academic titles (academic_title) occur in the dataset and how often they occur.
The problem is that every politician is potentially mentioned multiple times, and I'm only interested in the last state of their academic title.
E.g. the correct answer would be:
1x Prof. Dr.
1x Dr. Med
Thanks in advance!
I tried this command:
Stammdaten_academic <- Stammdaten |> arrange(ui, academic_title) |> distinct(ui, .keep_all = TRUE)
Stammdaten_academic is the data frame where every politician is mentioned only once (similar to what a group-by would do).
Stammdaten is the original data frame with multiple occurrences of each politician.
Result:
I got the academic title that was mentioned in the first occurring row for each politician.
Problem:
I would like to receive the last state of everyone's academic title!
library(dplyr)
Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  arrange(electoral_term) |>
  slice(n())
This should give you the n()-th row from each group (ui), where n() is the number of rows in that group, i.e. the last row after sorting by electoral_term.
Academic titles are progressive and a person does not stop being a doctor or such.
I believe this solves your problem
# create your data frame
df <- data.frame(ui = c(1, 1, 1, 2, 2, 3),
                 electoral_term = c(1, 2, 3, 3, 4, 4),
                 academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA))
# get latest titles
titles <- df |>
  dplyr::group_by(ui) |>
  dplyr::summarise_at(dplyr::vars(electoral_term), max) |>
  dplyr::left_join(df, by = c("ui", "electoral_term")) |>
  tidyr::drop_na() # in case you don't want the people without a title
# count occurrences
table(titles$academic_title)
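A more recent dplyr idiom gets the same result without the join (a sketch, assuming dplyr >= 1.0): slice_max() keeps the latest electoral term per politician, and count() tallies the titles.
library(dplyr)
library(tidyr)

df |>
  group_by(ui) |>
  slice_max(electoral_term, n = 1, with_ties = FALSE) |> # latest term per politician
  ungroup() |>
  drop_na(academic_title) |>  # drop people without a title
  count(academic_title)       # tally the remaining titles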

TABLEAU: How can I measure similarity of sets of dimensions across dates?

This is a bit of a complicated one, but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video-on-demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10) - the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be about 33%.
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)
You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you'll have to write your function of course, and then use the function Rserve() to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above controls the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau's web site resources for more information. The official name for this functionality is Tableau "analytics extensions".
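To make the moving parts concrete, here is a minimal sketch of the two sides; the calculated field shown in the comment is the standard SCRIPT_REAL() example from Tableau's documentation, not your similarity logic, which would replace it.
# R side: start Rserve so Tableau's analytics extension can reach it
# (port 6311 is Rserve's default).
# install.packages("Rserve")   # once
library(Rserve)
Rserve()

# Tableau side (a calculated field, shown here as a comment for reference):
# SCRIPT_REAL("mean(.arg1)", SUM([Sales]))
# Tableau sends one vector per partition as .arg1 and expects a vector back.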
I am sharing a strategy to solve this in R.
Step-1 Load the libraries and data
library(tidyverse)
library(lubridate)
movies <- as_tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Getting final result
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>% # restrict search to the designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>% # remove duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>% # check whether it was screened the previous day
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 17 0 0
2 2018-08-14 17 10 0.588
3 2018-08-15 17 9 0.529
If you change the filters to 12, 12 respectively, then
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies / total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 68 0 0
2 2018-08-14 75 61 0.813
3 2018-08-15 72 54 0.75
Good Luck
As Alex has suggested, you can have the best of both worlds. But to the best of my knowledge, Tableau Desktop interfaces with R (or Python, etc.) only through calculated fields, i.e. SCRIPT_INT(), SCRIPT_REAL(), and so on. These functions are implemented as table calculations, which in Tableau work only in context: the values cannot be hard-coded, table calculations cannot be further aggregated, and they cannot be mixed with LOD expressions, so you are not at liberty to use them independently of context. Thus, in your use case (again, to the best of my knowledge), you can build a parameter-dependent view in Tableau after hard-coding the values with the programming language of your choice. I therefore suggest that, prior to importing the data into Tableau, you create a new column in your dataset by running the following (or the equivalent in another language):
movies_edited <- movies %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  ungroup()
write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, where 1 means the title was available on the previous day, 0 means it was not screened on the immediately previous day, and NA means it is the first day of its screening.
I have imported this dataset in tableau and created a parameter dependent view, as you desired.

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
industryType relfreq relsev
1 Consumer Products 2.032520 0.419048
2 Biotech/Pharma 0.650407 3.771429
3 Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5 Medical Devices 1.463415 3.028571
6 Software 0.758808 1.314286
7 Business/Consumer Services 0.623306 0.723810
8 Telecommunications 0.650407 4.247619
if I wanted to pull the relfreq of Medical Devices (row 5) - how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry i.e.
example <- function(industrytype) {
  weight <- ...  # the relfreq value for the industrytype parameter
  thing2 <- thing1 * weight
  return(thing2)
}
So if I do subset by an index, is there a way R would know which index corresponds to the industry type specified in the function parameter? Or is it easier/a way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you requested (industryType and relfreq).
There is a great collection of packages, the tidyverse, that allows you to do this intuitively: library(tidyverse)
data_want <- severity %>%
  subset(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step on to the next, as nesting would.
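For comparison, the same expression written as nested calls rather than a pipe:
# Equivalent without the pipe: each step wraps the previous one.
data_want <- select(subset(severity, industryType == "Medical Devices"),
                    industryType, relfreq)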
I think it is better to select the whole row first, then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
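Putting either answer into the function skeleton from the question — a minimal sketch, assuming severity is the data frame shown above and thing1 is defined elsewhere:
example <- function(industrytype) {
  # look up the relfreq value for the requested industry
  weight <- severity$relfreq[severity$industryType == industrytype]
  thing2 <- thing1 * weight
  return(thing2)
}

# example("Medical Devices") would use weight = 1.463415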

Reliability tests for classic content analysis (multiple categorial codes per item)

In classic content analysis (or qualitative content analysis), as typically done with Atlas.TI or Nvivo type tools (sometimes called QACDAS tools), you typically face the situation of having multiple raters rate many objects with many codes, so there are multiple codes that each rater might apply to each object. I think this is what the excellent John Ubersax page on agreement statistics calls "Two Raters, Polytomous Ratings".
For example you might have two raters read articles and code them with some group of topic codes from a coding scheme (e.g., diy, shelving, circular saw), and you are asking how well the coders agree on applying the codes.
What I'd like is to use the irr package functions, agree and kappa2, in these situations. Yet their documentation didn't help me figure out how to proceed, since they expect input in the form of "n*m matrix or dataframe, n subjects m raters." which implies that there is a single rating per rater, per object.
Given two raters using (up to) three codes to code two articles, the input data looks like this (two diy articles, the second with some topic tags):
article,rater,code
article1,rater1,diy
article1,rater2,diy
article2,rater1,diy
article2,rater2,diy
article2,rater1,circular-saw
article2,rater1,shelving
article2,rater2,shelving
I'd like to get:
Overall percentage agreement.
Percentage agreement for each code.
Contingency table for each code.
Ideally, I'd also like to get Positive agreement (how often do the raters agree that a code should be present?) and Negative Agreement (how often do the raters agree that a code should not be present). See discussion of these at http://www.john-uebersax.com/stat/raw.htm#binspe
I'm pretty sure that this involves breaking the input data.frame up and processing it code by code, using something like dplyr, but I wondered if others have tackled this problem.
(The kappa functions take the same input, so let's just keep this simple by using the agree function from the irr package, plus the positive and negative agreement only really make sense with percentage agreement).
Looking at the meta.stackexchange threads on answering one's own question, it seems that is an acceptable thing to do. Makes sense, good place to store stuff for others to find :)
I solved most of this with the following code:
library(plyr); library(dplyr); library(reshape2); library(irr)
# The irr package expects input in the form of n x m (objects in rows, raters in columns),
# which is confusing when each rater can apply multiple codes per item. Here we have 10 articles (to be coded) and
# many codes, so each rater rates each combination of article and code as present (or not).
# Basically you send only the ratings columns to agree and kappa2. You can send them all at
# once for overall agreement, or send only those for each code for code-by-code agreement.
# letter,code,rater
# letter1,code1,rater1
# letter1,code2,rater1
# letter2,code3,rater2
coding <- read.csv("CombinedCoding.csv")
# Now want:
# letter, code, rater1, rater2
# where 0 = no (this code wasn't used), 1 = yes (this code was used)
# dcast can do this, collapsing across a group. In this case we're not really
# grouping, so if the code was not present length gives a 0, if it was length
# gives a 1.
# This excludes all the times where we agreed that both codes weren't present.
ccoding <- dcast(coding, letter + code ~ rater, length)
# create data.frame from combination of letters and codes
# this handles the negative agreement parts.
codelist <- unique(coding$code)
letterlist <- unique(coding$letter)
coding_with_negatives <- merge(codelist, letterlist) # Gets the Cartesian product of these.
names(coding_with_negatives) <- c("code", "letter") # align the names
# merge this with the coding, produces NA for rows that don't exist in ccoding
coding_with_negatives <- merge(coding_with_negatives,ccoding,by=c("letter","code"), all.x=T)
# replace NAs with zeros.
coding_with_negatives[is.na(coding_with_negatives)] <- 0
# Now want agreement per code.
# need a function that returns a df
# this function gets given the split data frame (ie this happens once per code)
getagree <- function(df) {
  # for positive agreement remove the cases where we both coded it negative
  positive_df <- filter(df, (rater1 == 1 & rater2 == 1) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
  # for negative agreement remove the cases where we both coded it positive
  negative_df <- filter(df, (rater1 == 0 & rater2 == 0) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
  data.frame(positive_agree = round(agree(positive_df[, 3:4])$value, 2) # run agree on the rater columns, get the $value, and round it
             , negative_agree = round(agree(negative_df[, 3:4])$value, 2)
             , agree = round(agree(df[, 3:4])$value, 2)
             , used_in_articles = nrow(positive_df) # gives some idea of the prevalence
  )
}
# split the df up by code, run getagree on the sections
# recombine into a data frame.
results <- ddply(coding_with_negatives, .(code), getagree)
The confusion matrices can be gotten with:
print(table(coding_with_negatives[,3],coding_with_negatives[,4],dnn=c("rater1","rater2")))
I haven't done it but I think I could do that per code inside the function using print to push them into a text file.
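A sketch of that idea, done per code in a separate pass rather than inside getagree (the file name and the use of sink() are illustrative, not from the original post):
# Write one contingency table per code to a text file.
sink("contingency_tables.txt")
d_ply(coding_with_negatives, .(code), function(df) {
  cat("\nCode:", as.character(df$code[1]), "\n")
  print(table(df$rater1, df$rater2, dnn = c("rater1", "rater2")))
})
sink()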

How to store trees/nested lists in R?

I have a list of boroughs and a list of localities (like this one). Each locality lies in exactly one borough. What's the best way to store this kind of hierarchical structure in R, considering that I'd like to have a convenient and readable way of accessing these, and to use this list to accumulate locality-level data up to the borough level?
I've come up with the following:
localities <- list("Mitte" = c("Mitte", "Moabit", "Hansaviertel", "Tiergarten", "Wedding", "Gesundbrunnen",
"Friedrichshain-Kreuzberg" = c("Friedrichshain", "Kreuzberg")
)
But I am not sure if this is the most elegant and accessible way.
If I wanted to assign additional information at the locality level, I could do that by replacing the c(...) with some other call, like rbind(c('0201', '0202'), c("Friedrichshain", "Kreuzberg")). But if I wanted to add additional information at the borough level (like an abbreviated name and a full name for each list), how would I do this?
Edit: For example, I'd like to condense a table like this into a borough-wise version.
Hard to know without having a better view on how you intend to use this, but I would strongly recommend moving away from a nested list structure to a data frame structure:
library(reshape2)
loc.df <- melt(localities)
This is what the molten data looks like:
value L1
1 Mitte Mitte
2 Moabit Mitte
3 Hansaviertel Mitte
4 Tiergarten Mitte
5 Wedding Mitte
6 Gesundbrunnen Mitte
7 Friedrichshain Friedrichshain-Kreuzberg
8 Kreuzberg Friedrichshain-Kreuzberg
You can then use all the standard data frame and other computations:
loc.df$population <- sample(100:500, nrow(loc.df)) # make up population
tapply(loc.df$population, loc.df$L1, mean) # population by borough
gives mean population by Borough:
Friedrichshain-Kreuzberg Mitte
278.5000 383.8333
For more complex calculations you can use data.table and dplyr
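For example, the same borough-level mean with dplyr:
library(dplyr)

loc.df %>%
  group_by(L1) %>%                                # L1 holds the borough name after melt()
  summarise(mean_population = mean(population))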
You can extract all of this data directly into a data.frame using the XML library.
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Boroughs_and_localities_of_Berlin#List_of_localities"
tables <- readHTMLTable(theurl)
boroughs <- tables[[1]]$Borough
localities <- tables[c(3:14)]
names(localities) <- as.character(boroughs)
all <- do.call("rbind", localities)
@Roland, I think you will find data frames superior to lists for the reasons cited earlier, but also because there is other data on the web page you reference. Loading into a data frame will make it easy to go further if you wish. For example, making comparisons based on population density or other items provided "for free" on the page will be a snap from a data frame.
