Selective choice of tuples with partially matching characteristics in R - r

I have a dataset with data about political careers.
Every politician has a unique identifier number (ui) and can occur in multiple electoral terms (electoral_terms). Each electoral term corresponds to a period of four years in which the politician is in office.
Now I would like to find out which academic titles (academic_title) occur in the dataset and how often they occur.
The problem is that every politician is potentially mentioned multiple times, and I'm only interested in the last state of their academic title.
E.g. the correct answer would be:
1x Prof. Dr.
1x Dr. Med
Thanks in advance!
I tried this Command:
Stammdaten_academic <- Stammdaten |> arrange(ui, academic_title) |> distinct(ui, .keep_all = TRUE)
Stammdaten_academic is the data frame where every politician is only mentioned once (similar to what a group-by would do).
Stammdaten is the original data frame with multiple occurrences of each politician.
Result:
I got the academic title that was mentioned in the first occurring row for each politician.
Problem:
I would like to receive the last state of everyones' academic title!

library(dplyr)
Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  arrange(electoral_term) |>
  slice(n())
This should give you the n-th row from each group (ui), where n() is the number of rows in that group, i.e. the row for each politician's last electoral term.
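As a side note (my addition, not part of the original answer), newer dplyr versions also provide slice_max(), which keeps the row with the largest electoral_term per group and avoids the separate arrange step; a minimal sketch using the same column names:
library(dplyr)

Stammdaten_academic <- Stammdaten |>
  group_by(ui) |>
  slice_max(electoral_term, n = 1, with_ties = FALSE) |>  # latest term per politician
  ungroup()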

Academic titles are progressive and a person does not stop being a doctor or the like.
I believe this solves your problem:
# create your data frame
df <- data.frame(ui = c(1, 1, 1, 2, 2, 3),
                 electoral_term = c(1, 2, 3, 3, 4, 4),
                 academic_title = c(NA, "Dr.", "Prof. Dr.", "Dr. Med.", "Dr. Med.", NA))
# get latest titles
titles <- df |>
dplyr::group_by(ui) |>
dplyr::summarise_at(vars(electoral_term), max) |>
dplyr::left_join(df, by = c("ui", "electoral_term")) |>
tidyr::drop_na() # in case you don't want the people without title
# count occurrences
table(titles$academic_title)
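As a small addition (not part of the original answer), the same tally can also be produced with dplyr, assuming the titles data frame built above:
# count how often each latest academic title occurs
dplyr::count(titles, academic_title, sort = TRUE)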

Related

Calculate number of years worked with different end dates

Consider the following two datasets. The first dataset contains an id variable that identifies a person and the date when his or her unemployment benefits start.
The second dataset shows the number of service years, which makes it possible to calculate the maximum entitlement period. More precisely, each year is a dummy variable that equals one if someone built up unemployment benefit rights in that year (i.e. if someone worked), and zero otherwise.
df1<-data.frame( c("R005", "R006", "R007"), c(20120610, 20130115, 20141221))
colnames(df1)<-c("id", "start_UI")
df1$start_UI<-as.character(df1$start_UI)
df1$start_UI<-as.Date(df1$start_UI, "%Y%m%d")
df2<-data.frame( c("R005", "R006", "R007"), c(1,1,1), c(1,1,1), c(0,1,1), c(1,0,1), c(1,0,1) )
colnames(df2)<-c("id", "worked2010", "worked2011", "worked2012", "worked2013", "worked2014")
Just to summarize the information from the above two datasets. We see that person R005 worked in the years 2010 and 2011. In 2012 this person filed for Unemployment insurance. Thereafter person R005 works again in 2013 and 2014 (we see this information in dataset df2). When his unemployment spell started in 2012, his entitlement was based on the work history before he got unemployed. Hence, the work history is equal to 2. In a similar vein, the employment history for R006 and R007 is equal to 3 and 5, respectively (for R007 we assume he worked in 2014 as he only filed for unemployment benefits in December of that year. Therefore the number is 5 instead of 4).
Now my question is how I can merge these two datasets effectively such that I can get the following table
df_final<- data.frame(c("R005", "R006", "R007"), c(20120610, 20130115, 20141221), c(2,3,5))
colnames(df_final)<-c("id", "start_UI", "employment_history")
id start_UI employment_history
1 R005 20120610 2
2 R006 20130115 3
3 R007 20141221 5
I tried using "aggregate", but in that case I also include work history after the year someone filed for unemployment benefits and that is something I do not want. Does anyone have an efficient way how to combine the information from the two above datasets and calculate the unemployment history?
I appreciate any help.
base R
You should use Reduce with accumulate = TRUE.
df2$employment_history <- apply(df2[,-1], 1, function(x) sum(!Reduce(any, x==0, accumulate = TRUE)))
merge(df1, df2[c("id","employment_history")])
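To illustrate what the Reduce() call is doing (my own example, not part of the original answer), applied to the work-history row of R005:
x <- c(1, 1, 0, 1, 1)                          # worked2010 ... worked2014 for R005
Reduce(any, x == 0, accumulate = TRUE)         # FALSE FALSE  TRUE  TRUE  TRUE
sum(!Reduce(any, x == 0, accumulate = TRUE))   # 2 -> years worked before the first zero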
dplyr
Or use the built-in dplyr::cumany function:
library(dplyr)
library(tidyr)   # for pivot_longer()

df2 %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  summarise(employment_history = sum(value[!cumany(value == 0)])) %>%
  left_join(df1, .)
Output
id start_UI employment_history
1 R005 2012-06-10 2
2 R006 2013-01-15 3
3 R007 2014-12-21 5

TABLEAU: How can I measure similarity of sets of dimensions across dates?

This is a bit of a complicated one - but I'll do my best to explain. I have a dataset comprised of data that I scrape from a particular video-on-demand interface every day. Each day there are around 120 titles on display (a grid of 12 x 10) - the data includes a range of variables: date of scrape, title of programme, vertical/horizontal position of programme, genre, synopsis, etc.
One of the things I want to do is analyse the similarity of what's on offer on a day-to-day basis. What I mean by this is that I want to compare how many of the titles on a given day appeared on the previous date (ideally expressed as a percentage). So if 40 (out of 120) titles were the same as the previous day, the similarity would be about 33%.
Here's the thing - I know how to do this (thanks to some kindly stranger on this very site who helped me write a script using R). You can see the post here which gives some more detail: Calculate similarity within a dataframe across specific rows (R)
However, this method creates a similarity score based on the total number of titles on a day-to-day basis whereas I also want to be able to explore the similarity after applying other filters. Specifically, I want to narrow the focus to titles that appear within the first four rows and columns. In other words: how many of these titles are the same as the previous day in those positions? I could do this by modifying the R script, but it seems that the better way would be to do this within Tableau so that I can change these parameters in "real-time", so to speak. I.e. if I want to focus on the top 6 rows and columns I don't want to have to run the R script all over again and update the underlying data!
It feels as though I'm missing something very obvious here - maybe it's a simple table calculation? Or I need to somehow tell Tableau how to subset the data?
Hopefully this all makes sense, but I'm happy to clarify if not. Also, I can't provide you the underlying data (for research reasons!) but I can provide a sample if it would help.
Thanks in advance :)
You can have the best of both worlds. Use Tableau to connect to your data, filter as desired, then have Tableau call an R script to calculate similarity and return the results to Tableau for display.
If this fits your use case, you need to learn the mechanics to put this into play. On the Tableau side, you’ll be using the functions that start with the word SCRIPT to call your R code, for example SCRIPT_REAL(), or SCRIPT_INT() etc. Those are table calculations, so you’ll need to learn how table calculations work, in particular with regard to partitioning and addressing. This is described in the Tableau help. You’ll also have to point Tableau at the host for your R code, by managing external services under the Help->Settings and Performance menu.
On the R side, you’ll have write your function of course, and then use the function RServe() to make it accessible to Tableau. Tableau sends vectors of arguments to R and expects a vector in response. The partitioning and addressing mentioned above controls the size and ordering of those vectors.
It can be a bit tricky to get the mechanics working, but they do work. Practice on something simple first.
See Tableau’s web site resources for more information. The official name for this functionality is Tableau “analytic extensions”.
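As a rough illustration of these mechanics (my own sketch, not part of the original answer): on the R side you start Rserve so Tableau can reach it, and on the Tableau side you wrap R code in a SCRIPT_* calculated field. The measure [Matched Titles] below is purely hypothetical.
# R side: start Rserve (Tableau connects on the default port 6311)
library(Rserve)
Rserve()
// Tableau side: a calculated field; .arg1 arrives in R as one vector per partition
SCRIPT_REAL("mean(.arg1, na.rm = TRUE)", SUM([Matched Titles]))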
I am sharing a strategy to solve this in R.
Step-1 Load the libraries and data
library(tidyverse)
library(lubridate)
movies <- as_tibble(read.csv("movies.csv"))
movies$date <- as.Date(movies$date, format = "%d-%m-%Y")
Set the rows and columns you want to restrict your similarity search to in two variables. Say you are restricting the search to 5 columns and 4 rows only:
filter_for_row <- 4
filter_for_col <- 5
Getting final result
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>% # Restricting search to designated rows and columns
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>% # removing duplicate titles screened on any given day
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>% # checking whether it was screened the previous day
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies/total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 17 0 0
2 2018-08-14 17 10 0.588
3 2018-08-15 17 9 0.529
If you change the filters to 12, 12 respectively, then
filter_for_row <- 12
filter_for_col <- 12
movies %>%
  filter(rank <= filter_for_col, row <= filter_for_row) %>%
  group_by(Title, date) %>%
  mutate(d_id = row_number()) %>%
  filter(d_id == 1) %>%
  group_by(Title) %>%
  mutate(similarity = ifelse(lag(date) == date - lubridate::days(1), 1, 0)) %>%
  group_by(date) %>%
  summarise(total_movies_displayed = sum(d_id),
            similar_movies = sum(similarity, na.rm = T),
            similarity_percent = similar_movies/total_movies_displayed)
# A tibble: 3 x 4
date total_movies_displayed similar_movies similarity_percent
<date> <int> <dbl> <dbl>
1 2018-08-13 68 0 0
2 2018-08-14 75 61 0.813
3 2018-08-15 72 54 0.75
Good Luck
As Alex has suggested, you can have the best of both worlds. But to the best of my knowledge, Tableau Desktop only interfaces with R (or Python etc.) through calculated fields, i.e. SCRIPT_INT, SCRIPT_REAL, etc. These functions are implemented as table calculations, which in Tableau only work within a context; we cannot hard-code these values (fields/columns), and thus we are not at liberty to use them independently of that context. Moreover, table calculations in Tableau can neither be further aggregated nor be mixed with LOD expressions. Thus, in your use case (again to the best of my knowledge), you can build a parameter-dependent view in Tableau after hard-coding the values through any programming language of your choice. I therefore suggest that, prior to importing the data into Tableau, a new column be created in your dataset by running the following (or the equivalent in the programming language of your choice):
movies_edited <- movies %>% group_by(Title) %>%
mutate(similarity = ifelse(lag(date)== date - lubridate::days(1), 1, 0)) %>%
ungroup()
write.csv(movies_edited, "movies_edited.csv")
This creates a new column named similarity in the dataset, wherein 1 denotes that the title was available on the previous day, 0 denotes it was not screened on the immediately preceding day, and NA means it is the first day of its screening.
I have imported this dataset in tableau and created a parameter dependent view, as you desired.

Implementing Code Conditionally in R Based on Features of Dataset

I'm looking to streamline my code and minimize manual tweaks depending on the data set I run through it. I.e. I receive batches of data by country - but each country is slightly different in terms of fields and field names, so each new country requires tweaking before I run it. I would like to eliminate the tweaks and do some selective coding. (Many of the challenges I handle easily with ifelse(), but I haven't been able to do a conditional mutate, for example.)
This is a logic question, so please let me know if I should have uploaded a data set.
This is a new example I just added. I realized that since the one I had used was a mutate, there were many tools to answer the question. In this example, I am dealing with data from various countries, each df with varying dimensionality, which I will want to keep. I could, of course, use different code for each, but I think it would be cleaner if I used the same code and it accommodated the various country data.
I have created a version of this using mutate with ifelse, creating variables for these non-common dimensions, and that works. I'm wondering if there is an alternative in R where I can run select snippets of code (and a good answer may be that there is no such option inside pipes). [I know how to do this with separate sets of code and if {} else {}.]
Keep in mind, this is part of a much larger block of code that I need all the countries to run though...this is just an illustrative subset.
# As you can see, I comment out each country's unique variables (and spelling!)
P_Region_HP_Brand <- P_Region_HP %>%
left_join(M_brand) %>%
left_join(M_prodcat) %>%
group_by(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, rank_m, Launch_Year, Launch_Month, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
# HEARTMON, WTRRSST # USA
Sports, DIST_TYP # CHN
) %>%
summarize(Dollars = sum(Dollars), # ALL (inc USA)
Local_Currency = sum(Local_Currency), # ALL
Units = sum(Units)) %>%
select(Calendar_Year, Calendar_Quarter, Calendar_Month, Calendar_Month_txt, Date, Launch_Year, Launch_Month,
region_b_frcst5, region_b_frcst7, Country, country_b,
BrandSummary, Model, PriceSegment, SumProdCat, ProductCategory, True_Wireless, ProductType,
Units, Dollars, Local_Currency, rank_m, # ALL (inc USA)
# HEARTMON, WTRRSST, # USA
# SPORTS, VOICE.ASSISTANT.FUNCTION # JPN
# Sports, Heart.Rate.Sensor # EU3
Sports, DIST_TYP # CHN
) %>%
as.data.frame() %>%
arrange(Country, desc(Date), desc(Local_Currency))
Does anyone know a solution for this that will allow me to keep my code simple enough and run select lines for given countries?
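A hedged sketch of one possible direction (my addition, not an answer from the original thread): dplyr's any_of() selection helper tolerates columns that are absent, so a single group_by()/select() pipeline can name every country's special columns and simply skip the ones a given file lacks. The column names are taken from the question; country_specific_vars is a hypothetical helper vector, and the grouping columns are abbreviated for the sketch.
library(dplyr)

# columns that exist only in some countries' files; any_of() skips the missing ones
country_specific_vars <- c("SPORTS", "VOICE.ASSISTANT.FUNCTION",   # JPN
                           "Sports", "Heart.Rate.Sensor",          # EU3
                           "HEARTMON", "WTRRSST",                  # USA
                           "DIST_TYP")                             # CHN

P_Region_HP_Brand <- P_Region_HP %>%
  left_join(M_brand) %>%
  left_join(M_prodcat) %>%
  group_by(across(any_of(c("Calendar_Year", "Country", "BrandSummary", "Model",
                           country_specific_vars)))) %>%
  summarize(Units = sum(Units), .groups = "drop")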

dplyr merge two datasets for Finished Goods Sales and Bill of Materials

I am working on an analysis of the need for raw materials in my company, and the approach I am taking is to use the sales records of finished goods combined with the bill of materials for each finished good. The problem I am having now is that each finished product consists of multiple components, and many finished products share common components. I am trying to keep all individual sales records for each finished good and use UnitsSold multiplied by the unit qty for each component to get the demand for raw materials. Here is code for sample datasets:
fg_Sales <- data_frame(FG_PartNumber=rep(c("A","B","C"),2),
Order_Date=seq.Date(as.Date("2011-1-1"),as.Date("2012-1-10"),length.out = 6),
FG_UnitsSold=c(100,200,300,400,500,600))
bill_materials <- data_frame(FG_PartNumber=rep(c("A","B","C"),4),
Components=c("C1","C2","C3","C4","C5","C6","C7","C7","C7","C8","C8","C9"),
Qty=rnorm(3,1,n = 12))%>%
arrange(FG_PartNumber)
I am familiar with left_join in dplyr, but it does not seem to work because it would always give me just the first component for each finished product.
Can anyone kindly help with this?
Thanks.
Perhaps I am not understanding the question, but if you group your two data frames by FG_PartNumber and make pivot tables on the quantities you are interested in, you can get the totals you are looking for:
#Create data
set.seed(1)
fg_Sales <- data_frame(FG_PartNumber=rep(c("A","B","C"),2),
Order_Date=seq.Date(as.Date("2011-1-1"),as.Date("2012-1-10"),length.out = 6),
FG_UnitsSold=c(100,200,300,400,500,600))
bill_materials <- data_frame(FG_PartNumber=rep(c("A","B","C"),4),
Components=c("C1","C2","C3","C4","C5","C6","C7","C7","C7","C8","C8","C9"),
Qty=rnorm(3,1,n = 12))%>%
arrange(FG_PartNumber)
library(dplyr)
#make pivot tables for sales and quantity
tot_sales <- fg_Sales %>%
group_by(FG_PartNumber) %>%
summarise(tot_sales = sum(FG_UnitsSold))
tot_materials <- bill_materials %>%
group_by(FG_PartNumber) %>%
summarise(tot_qty = sum(Qty))
#join the pivot tables together
df <- left_join(tot_sales, tot_materials)
> df
# A tibble: 3 × 3
FG_PartNumber tot_sales tot_qty
<chr> <dbl> <dbl>
1 A 500 13.15087
2 B 700 14.76326
3 C 900 11.30953
I think inner_join from dplyr is the best choice here:
library(dplyr)
fg_Sales_ext <- inner_join(x = fg_Sales,
y = bill_materials,
by = "FG_PartNumber")
From the inner_join documentation: "If there are multiple matches between x and y, all combination of the matches are returned."
With fg_Sales_ext you can perform any kind of analysis now with group_by and summarise.
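As a follow-up sketch (my addition, not part of the original answer), the raw-material demand can then be computed by multiplying units sold by the per-unit component quantity; column names are taken from the sample data above:
library(dplyr)

component_demand <- fg_Sales_ext %>%
  mutate(component_units = FG_UnitsSold * Qty) %>%   # units sold x qty per finished good
  group_by(Components) %>%
  summarise(total_demand = sum(component_units))     # total requirement per component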

row wise count the number of the words in a review text in an R dataframe

I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
rowwise() %>%
summarise(n = n())
To get the results something like..
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how I can do it...
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to @lmo's answer above:
The code below will generate a data frame that contains all the words, row-wise, and their frequencies:
temp2 <- data.frame()
for (i in 1:length(temp)) {
  temp1 <- as.data.frame(table(temp[[i]]))   # word frequencies for row i
  temp1$ID <- paste0("Row_", i)              # tag rows with their review number
  temp2 <- rbind(temp2, temp1)
  temp1 <- NULL
}
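As a further sketch toward the desired output in the question (my addition, not part of the original answer), temp2 can then be summarised per row; Var1 and Freq are the default column names produced by as.data.frame(table(...)):
library(dplyr)

word_stats <- temp2 %>%
  group_by(ID) %>%
  summarise(total_words    = sum(Freq),
            min_occrd_word = paste0(Var1[which.min(Freq)], ": ", min(Freq)),
            max_occrd_word = paste0(Var1[which.max(Freq)], ": ", max(Freq)),
            average        = n() / sum(Freq))   # unique words / total words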
