Optimisation in R: Creating an exam schedule (ompr, ROI?)

I'm scheduling an exam for a course (using R).
I have 36 students, all of whom need to be assigned a time slot with one of 8 teachers on the exam day (1:1 exams).
Each student or teacher can obviously only be in one place at a time.
How do I run an optimisation to find a schedule that uses the fewest time slots?
# Teachers
examiners <- seq(from = 1, to = 8)
# Students
students <- seq(from = 1, to = 36)
Some packages of interest: ompr, ROI?
Many thanks!
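This can be set up as a small mixed-integer program. Below is an untested sketch using ompr with the ROI/GLPK plugin; the binary variable x[s, t, k] (student s sees teacher t in slot k), the slot count n_slots = ceiling(36/8) = 5, and the pack-into-early-slots objective are all my assumptions, not something fixed by the question.
library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

n_students <- 36
n_teachers <- 8
n_slots <- ceiling(n_students / n_teachers)  # 5 slots suffice in theory

model <- MIPModel() %>%
  # x[s, t, k] = 1 if student s is examined by teacher t in slot k
  add_variable(x[s, t, k],
               s = 1:n_students, t = 1:n_teachers, k = 1:n_slots,
               type = "binary") %>%
  # every student sits exactly one exam
  add_constraint(sum_over(x[s, t, k], t = 1:n_teachers, k = 1:n_slots) == 1,
                 s = 1:n_students) %>%
  # a teacher sees at most one student per slot
  add_constraint(sum_over(x[s, t, k], s = 1:n_students) <= 1,
                 t = 1:n_teachers, k = 1:n_slots) %>%
  # prefer early slots, which packs the schedule as tightly as possible
  set_objective(sum_over(k * x[s, t, k],
                         s = 1:n_students, t = 1:n_teachers, k = 1:n_slots),
                sense = "min")

result <- solve_model(model, with_ROI(solver = "glpk"))
schedule <- get_solution(result, x[s, t, k]) %>% filter(value > 0.5)
head(schedule)
Because each student sits exactly one exam, the "one place at a time" condition is automatic for students; the per-slot constraint enforces it for teachers.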

Related

Analysis to identify similar occupations by frequency of skills requested in job postings (in R)

I have access to a dataset of job postings, which for each posting has a unique posting ID, the job posting occupation, and a row for each skill requested in each job posting.
The dataset looks a bit like this:
posting_id  occ_code  occname         skillname
1           1         data scientist  analysis
1           1         data scientist  python
2           2         lecturer        teaching
2           2         lecturer        economics
3           3         biologist       research
3           3         biologist       biology
1           1         data scientist  research
1           1         data scientist  R
I'd like to perform analysis in R to identify "close" occupations by how similar their overall skill demand is in job postings. E.g. if many of the top 10 in-demand skills for financial analysts matched some of the top 10 in-demand skills for data scientists, those could be considered closely related occupations.
To be clear, I want to identify similar occupations by their overall skill demand in the postings, i.e. by summing the number of times each skill is requested for an occupation and finding which other occupations have similar frequently requested skills.
I am fairly new to R so would appreciate any help!
I think you might want an unsupervised clustering strategy; see the help page for hclust for a debugged worked example. This code is untested:
# Load necessary libraries
library(tidyverse)
library(reshape2)

# Read in the data
data <- read.csv("path/to/your/data.csv")

# Sum the number of times each skill is requested for each occupation
skill_counts <- data %>%
  group_by(occ_code, skillname) %>%
  summarise(count = n(), .groups = "drop")

# Keep the top 10 in-demand skills for each occupation
top_10_skills <- skill_counts %>%
  group_by(occ_code) %>%
  slice_max(count, n = 10) %>%
  ungroup()

# Reshape into an occupation-by-skill count matrix (missing skills become 0)
wide <- dcast(top_10_skills, occ_code ~ skillname, value.var = "count", fill = 0)
mat <- as.matrix(wide[, -1])
rownames(mat) <- wide$occ_code

# Cluster occupations by the distance between their skill profiles
fit <- hclust(dist(mat), method = "ward.D2")

# Plot the dendrogram
plot(fit, hang = -1, labels = rownames(mat), main = "Occupation Clustering")
The resulting dendrogram shows the relationships between the occupations based on their skill demand: closely related occupations are grouped together, while distantly related ones are only joined higher up the tree.
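If you want discrete groups rather than reading the tree by eye, cutree() slices the dendrogram. A small sketch, assuming the fit object above and an arbitrary choice of 3 groups:
# Assign each occupation to one of 3 groups (k = 3 is arbitrary)
groups <- cutree(fit, k = 3)
# Show which occupations fall in each group
split(names(groups), groups)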

Solving binomial distribution question in R language

Insurance policies are sold to 10 different people aged 25-30 years, all in good health. The probability that a person in similar condition will live more than 25 years is 4/5. Calculate the probability that within 25 years at most 2 of them will die. Perform this calculation in R syntax without using any direct built-in function.
n <- 10
p_live <- 4/5
p_notlive <- 0.2
Pzerodie <- combn(n,0)*(p_live^0)*(p_notlive^n-0)
print(Pzerodie)
I will do the same for P(one dies) and P(two die) and then add all three variables. The above code should print 1.024 * 10^-7 for Pzerodie, but instead it prints: [,1]. Can anyone guide me? Thanks
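For what it's worth: combn(n, 0) returns a matrix of combinations (here a zero-row matrix, which is why only the [,1] column header prints), while the binomial coefficient is choose(n, 0). Also, p_notlive^n-0 parses as (p_notlive^n) - 0 because ^ binds tighter than -, and P(zero die) should use p_live^10, not p_notlive^10 (0.2^10 = 1.024e-07 is the probability that all ten die). An untested sketch of the full calculation:
n <- 10
p_live <- 4/5   # P(a person survives more than 25 years)
p_die <- 1/5    # P(a person dies within 25 years)

# P(exactly k of the 10 die), via the binomial formula with choose()
p_k_die <- function(k) choose(n, k) * p_die^k * p_live^(n - k)

# P(at most 2 die) = P(0) + P(1) + P(2)
p_k_die(0) + p_k_die(1) + p_k_die(2)   # about 0.678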

Statistics on cluster member relationships over several days

Assume I have hourly data for 5 categories over 10 consecutive days, created as:
library(xts)
set.seed(123)
timestamp <- seq(as.POSIXct("2016-10-01"),as.POSIXct("2016-10-10 23:59:59"), by = "hour")
data <- data.frame(cat1 = rnorm(length(timestamp), 150, 5),
                   cat2 = rnorm(length(timestamp), 130, 3),
                   cat3 = rnorm(length(timestamp), 150, 5),
                   cat4 = rnorm(length(timestamp), 100, 8),
                   cat5 = rnorm(length(timestamp), 200, 15))
data_obj <- xts(data, timestamp)  # create time-series object
head(data_obj,2)
Now, for each day separately, I perform clustering and see how these categories behave with respect to each other using simple kmeans as:
daywise_data <- split.xts(data_obj, f = "days", k = 1)  # split the data day-wise
clus_obj <- lapply(daywise_data, function(x) {  # cluster each day separately
  kmeans(t(x), 2)
})
Once clustering is over, I visualise the cluster relationships over the 10 days with
sapply(clus_obj, function(x) x$cluster)  # day-wise cluster assignments
and inspect the resulting assignment table.
On visual inspection, it is clear that cat1 and cat3 always remain in the same cluster, while cat4 and cat5 are mostly in different clusters across the 10 days.
Apart from visual inspection, is there any automatic approach to gather this type of statistic from such clustering tables?
Note: this is a dummy example. I have a data frame containing 80 such categories over 100 consecutive days; an automatic summary like the one above would reduce the effort.
Pair-counting cluster evaluation measures offer an easy way to tackle this problem.
Rather than looking at object-cluster assignments, which are unstable, these methods look at whether or not two objects are in the same cluster (that is called a "pair").
So you could check if these pairs change much over time, or not.
Since k-means is randomized, you may also want to run it several times for every time slice, as they may return different clusterings!
You could then say that, e.g., series 1 is in the same cluster as series 2 in 90% of the results, etc.
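As a minimal sketch of that idea, building on the sapply() table from the question (rows are categories, columns are days): for each pair of categories, count the fraction of days on which they land in the same cluster. This is label-invariant, so it does not matter that k-means numbers its clusters arbitrarily each day.
# Matrix of cluster assignments: one row per category, one column per day
assign_mat <- sapply(clus_obj, function(x) x$cluster)

# pair_same[i, j] = fraction of days on which categories i and j co-cluster
pair_same <- outer(seq_len(nrow(assign_mat)), seq_len(nrow(assign_mat)),
                   Vectorize(function(i, j) mean(assign_mat[i, ] == assign_mat[j, ])))
dimnames(pair_same) <- list(rownames(assign_mat), rownames(assign_mat))
round(pair_same, 2)
Entries near 1 (e.g. cat1/cat3) almost always co-cluster; entries near 0 almost never do. With 80 categories this becomes an 80 x 80 matrix that you can threshold, or feed to hclust() as a similarity.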

Using calendar adjustment when forecasting

I am reading the online textbook "Forecasting: Principles and Practice" by George Athanasopoulos and Rob J. Hyndman, which has examples in R code.
A section on calendar adjustments explains that if we "look at average daily production instead of average monthly production, we effectively remove the variation due to the different month lengths. Simpler patterns are usually easier to model and lead to more accurate forecasts."
The example shows the same data plotted two ways: monthly production, then average daily production.
I don't understand the second line of the given example code:
monthdays <- rep(c(31,28,31,30,31,30,31,31,30,31,30,31), 14)
monthdays[26 + (4*12)*(0:2)] <- 29
par(mfrow = c(2,1))
plot(milk, main = "Monthly milk production per cow",
     ylab = "Pounds", xlab = "Years")
plot(milk/monthdays, main = "Average milk production per cow per day",
     ylab = "Pounds", xlab = "Years")
I understand that the first line creates a vector of the number of days in each month, repeated 14 times because the data set covers 14 years. But I have no idea what the second line is doing or where those numbers come from.
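For what it's worth, the second line marks the leap-year Februaries. The milk series covers 14 years starting in January 1962 (per the book's data), so February 1964 is element 26 (24 months for 1962-1963, plus 2), and leap years then recur every 4*12 = 48 months:
# Positions of the leap-year Februaries (Feb 1964, Feb 1968, Feb 1972)
26 + (4 * 12) * (0:2)
# [1]  26  74 122
# Those three entries of monthdays are set to 29 instead of 28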

How does R's recommenderlab calculate the rating of each item in a ratingMatrix?

Recently, I started using R's recommenderlab package in my studies.
This is the recommenderlab vignette:
http://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf
There are some examples in this document, but I have a big question.
First, load the recommenderlab package and the Jester5k data set.
library("recommenderlab")
data(Jester5k)
Use the first 1000 records (users) of Jester5k for training. The recommendation algorithm is POPULAR.
r <- Recommender(Jester5k[1:1000], method="POPULAR")
Then predict the 1001st user's recommendation list and show the top 5 items.
recom <- predict(r, Jester5k[1001], n=5)
as(recom, "matrix")
output:
[1] "j89" "j72" "j47" "j93" "j76"
Then I check the rating of the 5 items above.
rating <- predict(r, Jester5k[1001], type="ratings")
as(rating, "matrix")[, c("j89", "j72", "j47", "j93", "j76")]
output:
      j89       j72       j47       j93       j76
2.6476613 2.1273894 0.5867006 1.2997065 1.2956333
Why is the top-5 list "j89" "j72" "j47" "j93" "j76" when j47's rating is only 0.5867006?
I do not understand.
How does recommenderlab calculate the ratings of each item in ratingMatrix?
And how does it produce the TopN list?
To get a clearer picture of your issue, I suggest you read this:
"recommenderlab: A Framework for Developing and Testing Recommendation Algorithms"
Why is the top 5 list "j89" "j72" "j47" "j93" "j76"
You are using the POPULAR method. This means the top-5 list is chosen from the most-rated items (counting the number of ratings), not from the items with the highest predicted rating.
How does recommenderlab calculate the ratings of each item in ratingMatrix? And how does it produce the TopN list?
As for the predicted ratings, recommenderlab calculates them using the usual similarity measures (I have not checked whether it is Pearson or cosine). It then determines the rating as suggested by Breese et al. (1998): the user's mean rating plus a weighted factor calculated on the neighbourhood. You can consider the entire training set to be the neighbourhood of any user, which is why the predicted ratings for the same item have the same value for every user.
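If the count-based explanation is right, a quick untested check is to compare the predicted top-5 list against the most-rated jokes among the first 1000 users; if the lists differ, POPULAR is ranking by something else (e.g. the average normalised rating):
library(recommenderlab)
data(Jester5k)
# How many of the first 1000 users rated each joke
cnt <- colCounts(Jester5k[1:1000])
# The five most-rated jokes, to compare with c("j89", "j72", "j47", "j93", "j76")
head(names(sort(cnt, decreasing = TRUE)), 5)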
My best, L.
