How to calculate the duration of employment spells - r

I have monthly data on people's employment status for one year. The data frame includes 4 variables: person ID, country, month, and main activity in that specific month (Employed, Unemployed, Inactive, Other). Here is an example:
ID <- c(1:10, 1:10)
country <- c("AT", "BE", "CH", "CZ", "HR", "SO", "SV", "RU", "GR", "GE", "AT", "BE", "CH", "CZ", "HR",
"SO", "SV", "RU", "GR", "GE")
month <- c("Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec", "Nov", "Sep", "Jan", "Jun", "Jul", "Oct",
"Jan", "Feb", "Mar", "Apr", "May", "Aug", "Dec")
act <- c("Unemployed", "Employed", "Other", "Other", "Inactive", "Unemployed", "Employed",
"Employed", "Employed", "Unemployed", "Other", "Unemployed", "Unemployed", "Unemployed",
"Other", "Other", "Employed", "Other", "Other", "NA")
df <- data.frame(ID, country, month, act)
df[order(ID),]
ID country month act
1 1 AT Jan Unemployed
11 1 AT Jun Other
21 1 AT Nov Unemployed
2 2 BE Feb Employed
12 2 BE Jul Unemployed
22 2 BE Sep Unemployed
3 3 CH Mar Other
13 3 CH Oct Unemployed
23 3 CH Jan NA
4 4 CZ Apr Other
14 4 CZ Jan Unemployed
24 4 CZ Jun Unemployed
5 5 HR May Inactive
15 5 HR Feb Other
25 5 HR Jul Other
6 6 SO Aug Unemployed
16 6 SO Mar Other
26 6 SO Oct Employed
7 7 SV Dec Employed
17 7 SV Apr Employed
27 7 SV Nov Employed
8 8 RU Nov Employed
18 8 RU May Other
28 8 RU Jan NA
9 9 GR Sep Employed
19 9 GR Aug Other
29 9 GR Jun Inactive
10 10 GE Jan Unemployed
20 10 GE Dec NA
30 10 GE Aug Unemployed
My goal is to create a new data frame in which every row represents one spell of employment, with the condition that the spell is both preceded AND followed by a spell of unemployment. That way I include only spells where people switched from unemployment to employment and back to unemployment, and I can calculate the duration of these spells. Ideally the result would have 5 variables: person ID, country, duration of spell, starting month, and ending month. It should look like this:
ID country spell_duration starting ending
1 1 AT 5 Jan May
11 1 AT 5 Jun Oct
2 2 BE 7 Feb Aug
12 2 BE 6 Jul Dec
3 3 CH 10 Mar Dec
13 3 CH 1 Oct Oct
4 4 CZ 8 Apr Nov
14 4 CZ 5 Jan May
5 5 HR 5 May Sep
15 5 HR 4 Feb May
6 6 SO 2 Aug Sep
16 6 SO 6 Mar Aug
7 7 SV 1 Dec Dec
17 7 SV 9 Apr Dec
8 8 RU 8 Nov Dec
18 8 RU 7 May Nov
9 9 GR 3 Sep Nov
19 9 GR 2 Aug Sep
10 10 GE 8 Jan Aug
20 10 GE 1 Dec Dec
I already found this solution by Maria (How to calculate number and duration of categorical spells by ID in R), but her problem is different: I don't want the overall duration of employment, and I don't really need the number of spells.

I used the data.table package for the manipulation, and the loops seem to work.
EDIT: There was one extra "}" left over, which I removed. I've tried it and it works.
EDIT2: I added "setDT(df)" too.
library(data.table)
df <- fread(paste("ID country month act
1 AT Jan Unemployed
1 AT Jun Other
1 AT Nov Unemployed
2 BE Feb Employed
2 BE Jul Unemployed
2 BE Sep Unemployed
3 CH Mar Other
3 CH Oct Unemployed
3 CH Jan NA
4 CZ Apr Other
4 CZ Jan Unemployed
4 CZ Jun Unemployed
5 HR May Inactive
5 HR Feb Other
5 HR Jul Other
6 SO Aug Unemployed
6 SO Mar Other
6 SO Oct Employed
7 SV Dec Employed
7 SV Apr Employed
7 SV Nov Employed
8 RU Nov Employed
8 RU May Other
8 RU Jan NA
9 GR Sep Employed
9 GR Aug Other
9 GR Jun Inactive
10 GE Jan Unemployed
10 GE Dec NA
10 GE Aug Unemployed", collapse = '\n'))
setDT(df)
df[, monthInt := match(month, month.abb)]
df <- df[order(ID,monthInt)]
finalDt <- data.table()
for (i in unique(df[, ID])) {
tempT <- df[ID == i]
for (tim in seq_len(nrow(tempT) - 1)) { # seq_len() avoids the 1:0 trap when a person has a single row
timT <- data.table(ID = tempT[tim,ID],
country = tempT[tim, country],
spell_duration = tempT[tim+1, monthInt] - tempT[tim, monthInt],
starting = month.abb[tempT[tim, monthInt]],
ending = month.abb[tempT[tim+1, monthInt]-1])
finalDt <- rbind(finalDt,timT)
}
}

This is the first thing that came to my mind, without much thought. It is very cumbersome, and I'm sure there are more elegant solutions, but it doesn't require any additional packages.
data <- df
Empl_spells <- data.frame(ID = c(), Start = c(), End = c())
for(user in unique(data$ID)){
# subset per user
user_dat <- data[data$ID == user,]
# initiate a list to store where changes occur and a counter for
# entries to this list
if(nrow(user_dat) > 2){
Changes_data <- list()
entry <- 1
# for every row, check if it switches from employed to unemployed
# or the opposite. Mark with "break" if some other entry interrupts
for(i in 2:nrow(user_dat)){
if(user_dat$act[i] == "Employed" &
user_dat$act[i-1] == "Unemployed"){
Changes_data[[entry]] <- c("Start", i)
entry <- entry + 1
}else if(user_dat$act[i] == "Unemployed" &
user_dat$act[i-1] == "Employed"){
Changes_data[[entry]] <- c("End", i)
entry <- entry + 1
}else if(user_dat$act[i] != "Employed" &
user_dat$act[i] != "Unemployed"){
Changes_data[[entry]] <- c("Break", i)
entry <- entry + 1
}
}
# find where an "End" immediately follows a "Start" in the new list
Changes_df <- do.call(rbind.data.frame, Changes_data)
EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
if(length(EmplToUnempl) >= 1){
append <- data.frame(ID = user,
Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
# append the data to the data.frame for all of the people
Empl_spells <- rbind(Empl_spells, append)
}
}
}
Since I don't have your data, I didn't test this. Is this what you want?
Edit (wrapped in a function and applied with lapply instead of growing a data frame in a loop; probably faster):
data <- df
users <- unique(data$ID)
calculate <- function(user){
# subset per user
user_dat <- data[data$ID == user,]
# initiate a list to store where changes occur and a counter for
# entries to this list
if(nrow(user_dat) > 2){
Changes_data <- list()
entry <- 1
# for every row, check if it switches from employed to unemployed
# or the opposite. Mark with "break" if some other entry interrupts
for(i in 2:nrow(user_dat)){
if(user_dat$act[i] == "Employed" &
user_dat$act[i-1] == "Unemployed"){
Changes_data[[entry]] <- c("Start", i)
entry <- entry + 1
}else if(user_dat$act[i] == "Unemployed" &
user_dat$act[i-1] == "Employed"){
Changes_data[[entry]] <- c("End", i)
entry <- entry + 1
}else if(user_dat$act[i] != "Employed" &
user_dat$act[i] != "Unemployed"){
Changes_data[[entry]] <- c("Break", i)
entry <- entry + 1
}
}
# find where an "End" immediately follows a "Start" in the new list
Changes_df <- do.call(rbind.data.frame, Changes_data)
EmplToUnempl <- which(Changes_df[-nrow(Changes_df), 1] == "Start" & Changes_df[-1, 1] == "End")
if(length(EmplToUnempl) >= 1){
append <- data.frame(ID = user,
Start = user_dat$month[as.numeric(Changes_df[EmplToUnempl, 2])],
End = user_dat$month[as.numeric(Changes_df[EmplToUnempl + 1, 2])-1])
# append the data to the data.frame for all of the people
return(append)
}
}
}
empl_spells <- lapply(users, FUN = calculate)
Empl_spells <- do.call(rbind.data.frame, empl_spells)
Edit #2 (calculate duration):
MonthToNumeric <- function(x){
which(c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec") == x)
}
calcDuration <- function(Start, End){
return(MonthToNumeric(End) - MonthToNumeric(Start) + 1)
}
Empl_spells$Duration <- mapply(FUN = calcDuration, Start = Empl_spells[, 2], End = Empl_spells[, 3])
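A more compact alternative, sketched here under the assumption that each person has one row per consecutive month (no gaps), is to run-length encode the activity column with base R's rle() and keep only the "Employed" runs whose neighbouring runs are both "Unemployed". The toy data, the month_int column and the helper name find_spells are made up for illustration; they are not from the question.

```r
# Hypothetical helper: returns the employment spells that are both preceded
# and followed by an unemployment spell, for ONE person sorted by month.
find_spells <- function(act, month_int) {
  r <- rle(as.character(act))
  n <- length(r$lengths)
  ends <- cumsum(r$lengths)          # last row of each run
  starts <- ends - r$lengths + 1     # first row of each run
  idx <- which(r$values == "Employed")
  idx <- idx[idx > 1 & idx < n]      # need a run before and a run after
  # %in% (rather than ==) keeps missing values from producing NA indices
  keep <- r$values[idx - 1] %in% "Unemployed" & r$values[idx + 1] %in% "Unemployed"
  idx <- idx[keep]
  data.frame(spell_duration = r$lengths[idx],
             starting = month.abb[month_int[starts[idx]]],
             ending = month.abb[month_int[ends[idx]]],
             stringsAsFactors = FALSE)
}

# Toy data (assumption): two people observed in eight consecutive months
toy <- data.frame(
  ID = rep(1:2, each = 8),
  country = rep(c("AT", "BE"), each = 8),
  month_int = rep(1:8, times = 2),
  act = c("Unemployed", "Employed", "Employed", "Employed",
          "Unemployed", "Other", "Employed", "Unemployed",
          "Employed", "Employed", "Unemployed", "Employed",
          "Unemployed", "Unemployed", "Inactive", "Employed"),
  stringsAsFactors = FALSE
)

# Apply per person and bind the pieces together
spells <- do.call(rbind, lapply(split(toy, toy$ID), function(d) {
  d <- d[order(d$month_int), ]
  s <- find_spells(d$act, d$month_int)
  if (nrow(s) == 0) return(NULL)
  cbind(ID = d$ID[1], country = d$country[1], s)
}))
rownames(spells) <- NULL
spells
```

Spells that start in the first observed month, end in the last one, or border "Other"/"Inactive" are dropped, because the flanking unemployment cannot be confirmed for them.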


Insert a new column to a list of dataframes by parsing the date names of the

I have a list of data frames which I would like to insert a new column into
df <- data.frame(first = 1:10, second = 1:10)
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
I wish to insert a new column into each data frame by taking the name of each data frame and converting it to the corresponding YYYY QQ format.
Using the example above, the April 2018 ASP NDC-HCPCS Crosswalk dataframe will have a new column called date which will contain "2018 Q2" for all 10 rows, similarly the Apr 2019 ASP Pricing File dataframe will have a new date column containing "2019 Q2" for all rows and finally the Jan 18 ASP Pricing File will have a new date column with "2018 Q1"
Here is an example of an approach you can use and improve on. You should probably make the regex more robust--although it works for your example I can see situations where it could fail. Also, this works for years 2000 and up; you would need to change it if you have any earlier dates. Let me know if you have any questions!
# Getting month indices
# Extracting month names
new_mon <- gsub(paste0(".*(", paste(c(month.name, month.abb), collapse = "|"), ").*"), "\\1", names(ldf))
# Subsetting a "quarter" vector by month name/abbreviation and taking nonmissing values
new_mon <- pmax(sort(rep(1:4, 3))[match(new_mon, month.name)],
sort(rep(1:4, 3))[match(new_mon, month.abb)], na.rm = TRUE)
# Getting the year indices and adding "20" if there are only two characters
new_year <- gsub(".*?([0-9]{2,4}).*", "\\1", names(ldf)) # This is good for your example but you can make it more robust
new_year <- ifelse(nchar(new_year) == 2, paste0("20", new_year), new_year)
# Pasting them Together
new_cols <- paste0(new_year, " Q", new_mon)
Results:
# Adding the columns using Map
Map(function(x, y) cbind(x, "new_column" = y), ldf, new_cols)
$`April 2018 ASP NDC-HCPCS Crosswalk`
first second new_column
1 1 1 2018 Q2
2 2 2 2018 Q2
3 3 3 2018 Q2
4 4 4 2018 Q2
5 5 5 2018 Q2
6 6 6 2018 Q2
7 7 7 2018 Q2
8 8 8 2018 Q2
9 9 9 2018 Q2
10 10 10 2018 Q2
$`Apr 2019 ASP Pricing File`
first second new_column
1 1 1 2019 Q2
2 2 2 2019 Q2
3 3 3 2019 Q2
4 4 4 2019 Q2
5 5 5 2019 Q2
6 6 6 2019 Q2
7 7 7 2019 Q2
8 8 8 2019 Q2
9 9 9 2019 Q2
10 10 10 2019 Q2
$`Jan 18 ASP Pricing File`
first second new_column
1 1 1 2018 Q1
2 2 2 2018 Q1
3 3 3 2018 Q1
4 4 4 2018 Q1
5 5 5 2018 Q1
6 6 6 2018 Q1
7 7 7 2018 Q1
8 8 8 2018 Q1
9 9 9 2018 Q1
10 10 10 2018 Q1
Data:
df <- data.frame(first = 1:10, second = 1:10)
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
A way to start:
df <- data.frame(first = 1:10, second = 1:10)
ldf <- list(df, df, df)
names(ldf) <- c('April 2018 ASP NDC-HCPCS Crosswalk', 'Apr 2019 ASP Pricing File', 'Jan 18 ASP Pricing File')
lookup_quarters <- setNames(paste0("Q", rep(1:4, each = 3)),
c("jan", "feb", "mar", "apr", "may", "jun",
"jul", "aug", "sep", "oct", "nov", "dec"))
lapply(seq_along(ldf),
function(i) {
qtr <- substr(tolower(strsplit(names(ldf)[i], " ")[[1]][1]), 1, 3)
qtr <- lookup_quarters[qtr]
yr <- strsplit(names(ldf)[i], " ")[[1]][2]
yr <- if (nchar(yr) == 2) paste0("20", yr) else yr
res <- ldf[[i]]
res$newcol <- paste(yr, qtr)
res
})
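The month-to-quarter step in both answers can also be done with integer arithmetic instead of a lookup vector. The sketch below assumes, as in the example names, that every name starts with the month (full or abbreviated) followed by a space and a 2- or 4-digit year, and that 2-digit years belong to the 2000s. The helper name name_to_quarter is hypothetical.

```r
# Hypothetical helper: "April 2018 ..." -> "2018 Q2", "Jan 18 ..." -> "2018 Q1"
name_to_quarter <- function(x) {
  # the first three letters identify the month for full names and abbreviations
  mon_idx <- match(substr(x, 1, 3), month.abb)
  # first run of 2 to 4 digits after the leading word
  yr <- sub("^[A-Za-z]+ +([0-9]{2,4}).*$", "\\1", x)
  yr <- ifelse(nchar(yr) == 2, paste0("20", yr), yr)
  # months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4
  paste0(yr, " Q", (mon_idx - 1) %/% 3 + 1)
}

df <- data.frame(first = 1:10, second = 1:10)
ldf <- list(df, df, df)
names(ldf) <- c("April 2018 ASP NDC-HCPCS Crosswalk",
                "Apr 2019 ASP Pricing File",
                "Jan 18 ASP Pricing File")

# Map() keeps the list names and adds one constant "date" column per data frame
ldf <- Map(function(d, q) { d$date <- q; d }, ldf, name_to_quarter(names(ldf)))
str(ldf)
```

A name that does not follow the assumed pattern would yield an NA quarter or an unparsed year, so it is worth checking the output of name_to_quarter() before adding the column.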

Minimise objective function using R

Can somebody help me solve this multivariate parameter optimization problem in R? I have a data set like this. This is just a subset of the data; the full dataset has dimension n types * m regions * 12 months.
Month region type physics maths allsub
Jan r1 1 4 5 9
Feb r1 1 3 8 11
Mar r1 1 5 4 9
Apr r1 1 6 7 13
May r1 1 4 4 8
Jun r1 1 8 9 17
Jul r1 1 4 3 7
Aug r1 1 5 4 9
Sep r1 1 3 8 11
Oct r1 1 9 2 11
Nov r1 1 4 7 11
Dec r1 1 7 3 10
Jan r1 2 5 8 13
Feb r1 2 4 9 13
Mar r1 2 8 3 11
Apr r1 2 5 6 11
May r1 2 6 4 10
Jun r1 2 7 6 13
Jul r1 2 3 7 10
Aug r1 2 4 8 12
Sep r1 2 4 4 8
Oct r1 2 8 1 9
Nov r1 2 2 3 5
Dec r1 2 1 6 7
... ... .. ... ... ....
... ... .. ... ... ....
I have one more dataset which has maximum number of physics and maths students in each region. And my objective function is this, 100*(physics) + 65*(maths) >= 0. I want to minimize this function and my constraints are
1. sum of physics and maths should always be equal to allsub for that region and month.
2. total number of physics students in a region every month should be less than maximum number of physics students available in that region.
3. total number of maths students in a region every month should be less than maximum number of maths students available in that region.
I am trying to use R. The whole idea is to find the right number of physics and maths students in each region/type/month minimizing the objective function and meeting the constraints. Can someone help me with this?
EDIT : As requested in the comments.
Here is the total capacity dataset. dataframe name = totalcap
Month region physicscap mathscap
1 Jan r1 9 13
2 Feb r1 7 17
3 Mar r1 13 7
4 Apr r1 11 13
5 May r1 10 8
6 Jun r1 15 15
7 Jul r1 7 10
8 Aug r1 9 12
9 Sep r1 7 12
10 Oct r1 17 3
11 Nov r1 6 10
12 Dec r1 8 9
Here is the script I have tried,
library(dplyr)
library(MASS)
library(Rsolnp)
Month <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
region <- c('r1')
physicscap <- c(5,5,8,6,7,9,5,6,4,10,5,8)
mathscap <- c(5,8,5,8,5,10,5,5,8,5,8,5)
totalcap <- data.frame(Month,region,physicscap,mathscap)
#Constraints for the optimization.
constraints2 <- function(efforts){
# constraints are:
# 1. effort - allsub <= 0 in each region/month
#
efforts$effort_calculated <- efforts$physics + efforts$maths
reqeff <- summarise(group_by(efforts,region,Month),monthlyeffreg=sum(effort_calculated))
reqeffallsub <- summarise(group_by(efforts,region,Month),allsubsum=sum(allsub))
cons1 <- mutate(inner_join(reqeff,reqeffallsub,by=c('region'='region','Month'='Month'))
,diff=monthlyeffreg-allsubsum)
constout <- cons1$diff
# 2. sum(physics) - total physics available <= 0 in each region/month
#
phyreqeff <- summarise(group_by(efforts,region,Month),physicseff=sum(physics))
cons2 <- mutate(inner_join(totalcap,phyreqeff,by=c('region'='region','Month'='Month')),
diff=physicseff-physicscap)
constout <- c(constout,cons2$diff)
# 3. sum(maths) - total maths available <= 0 in each region/month
#
matreqeff <- summarise(group_by(efforts,region,Month),mathseff=sum(maths))
cons3 <- mutate(inner_join(totalcap,matreqeff,by=c('region'='region','Month'='Month')),
diff=mathseff-mathscap)
constout <- c(constout,cons3$diff)
constout
}
#Objective function to minimize the cost function.
objectivefunc <- function(efforts){
nb_physics <- sum(efforts$physics)
nb_maths <- sum(efforts$maths)
objective <- (100*nb_physics + 55*nb_maths - 110)
objective
}
Out2 <- solnp(pars = efforts,fun=objectivefunc,ineqfun=constraints2,ineqLB = rep(-100000,36),
ineqUB = rep(0,36), LB = rep(0,length(u)))
Here is the error I am getting,
Error in p0/vscale[(neq + 2):(nc + np + 1)] :
non-numeric argument to binary operator
I hope this clears up the questions in the comments. I tried my best here and hope someone can help me solve this.
Here is an approach with lpSolveAPI:
dat <- data.frame(
mon=rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),2),
region="r1",
type=c(rep("1", 12), rep("2", 12)),
physicsmin=1,
mathsmin=1,
allsub=c(9, 11, 9, 13, 8, 17, 7, 9, 11, 11, 11, 10, 13,13,11,11,10,13,10,12,8,9,5,7),
stringsAsFactors=FALSE
)
dat
capdat <- data.frame(
mon=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
region="r1",
physicscap=c(9,7,13,11,10,15,7,9,7,17,6,8),
mathscap=c(13,17,7,13,8,15,10,12,12,3,10,9),
stringsAsFactors=FALSE
)
capdat
Now for each month/region combination an optimization problem is to be solved. That is why
we wrap the calculation in a function:
library(lpSolveAPI)
ntypes <- length(unique(dat[,"type"])) # number of types
typemap <- setNames(seq.int(ntypes), unique(dat[,"type"])) # map typename to 1,...,ntypes
solve_one <- function(subdat, capdat) {
# create object
lprec <- make.lp(0, ncol=2*ntypes) # for each type, two decision variables
# By convention, we assume that the first ntypes variables are physics for type 1, ..., ntypes
# and the second ntypes variables are maths
# add objective and type
set.objfn(lprec, obj=c(rep(100, ntypes), rep(65, ntypes)))
set.type(lprec, columns=seq.int(2*ntypes), type="integer") # no reals
# add capacity constraints
idx <- which(capdat[,"mon"]==subdat[1,"mon"] & capdat[,"region"]==subdat[1,"region"]) # lookup the right cap
add.constraint(lprec, rep(1, ntypes), type="<=", rhs=capdat[idx,"physicscap"], indices=seq.int(ntypes))
add.constraint(lprec, rep(1, ntypes), type="<=", rhs=capdat[idx,"mathscap"], indices=seq.int(ntypes+1, 2*ntypes))
# add allsub equality constraints and minimum constraints
for (typ in subdat[,"type"]) {
add.constraint(lprec, c(1,1), type="=", rhs=subdat[typemap[typ], "allsub"], indices=c(typemap[typ], ntypes+typemap[typ]))
add.constraint(lprec, 1, type=">=", rhs=subdat[typemap[typ],"physicsmin"], indices=typemap[typ])
add.constraint(lprec, 1, type=">=", rhs=subdat[typemap[typ],"mathsmin"], indices=ntypes+typemap[typ])
}
# solution data.frame
ans <- subdat[, c("mon", "region", "type")]
# solve
if(solve(lprec)==0) {
sol <- get.variables(lprec)
for (i in seq.int(nrow(subdat))) {
ans[i, "physics"] <- sol[typemap[subdat[i,"type"]]]
ans[i, "maths"] <- sol[typemap[subdat[i,"type"]]+ntypes]
}
} else ans[,c("physics", "maths")] <- NA # no solution found
return(ans)
}
Now we apply the function to each subdataset which includes all types for each month/region combination. We
use a split/apply/combine approach here:
sp <- split(dat, list(dat[,"mon"], dat[,"region"]))
results <- lapply(sp, solve_one, capdat=capdat)
results <- do.call(rbind, results)
rownames(results) <- NULL
results
The code does not assume that all types are present for each month/region combination (some types may be omitted). However, the solution will be wrong if there are several entries for the same month/region/type combination; the code would need to be adapted for that case.
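Whatever solver is used, it can help to verify a returned allocation against the stated constraints. The following base R sketch does that for a results table with the same columns as above (mon, region, type, physics, maths). The function name check_solution and the two candidate allocations are made up for illustration, and the capacity constraints are treated as "<=" like in the lpSolveAPI model.

```r
# Toy inputs (assumption): one month, one region, two types
dat <- data.frame(mon = "Jan", region = "r1", type = c("1", "2"),
                  allsub = c(9, 13), stringsAsFactors = FALSE)
capdat <- data.frame(mon = "Jan", region = "r1",
                     physicscap = 9, mathscap = 13, stringsAsFactors = FALSE)

# Hypothetical checker: are the sum and capacity constraints satisfied?
check_solution <- function(results, dat, capdat) {
  # constraint 1: physics + maths equals allsub for every month/region/type
  m <- merge(results, dat[, c("mon", "region", "type", "allsub")],
             by = c("mon", "region", "type"))
  sum_ok <- all(m$physics + m$maths == m$allsub)
  # constraints 2 and 3: totals per month/region stay within the capacities
  tot <- aggregate(cbind(physics, maths) ~ mon + region, data = results, FUN = sum)
  tot <- merge(tot, capdat, by = c("mon", "region"))
  cap_ok <- all(tot$physics <= tot$physicscap & tot$maths <= tot$mathscap)
  c(sum_ok = sum_ok, cap_ok = cap_ok)
}

# Two hand-made candidate allocations
good <- data.frame(mon = "Jan", region = "r1", type = c("1", "2"),
                   physics = c(1, 8), maths = c(8, 5), stringsAsFactors = FALSE)
bad <- data.frame(mon = "Jan", region = "r1", type = c("1", "2"),
                  physics = c(0, 0), maths = c(9, 13), stringsAsFactors = FALSE)

check_solution(good, dat, capdat)
check_solution(bad, dat, capdat)
```

The second candidate is cheaper but puts 22 students into maths against a capacity of 13, which is why the capacity constraints matter for the minimum.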

how to calculate the distance and return the value of a specific variable with the shortest distance?

I have two separate datasets. One contains the locations of the participants; the other contains the locations of the measurement stations and the corresponding values at different time points. Below I generate sample datasets.
# dataset of value
yearmon <- c("Jan 1996","Jan 1996","Jan 1996","Jan 1996","Jan 1996","Jan 1996",
"Feb 1996","Feb 1996","Feb 1996","Feb 1996","Feb 1996","Feb 1996",
"Mar 1996","Mar 1996","Mar 1996","Mar 1996","Mar 1996","Mar 1996",
"Apr 1996","Apr 1996","Apr 1996","Apr 1996","Apr 1996","Apr 1996",
"May 1996","May 1996","May 1996","May 1996","May 1996","May 1996",
"Jun 1996","Jun 1996","Jun 1996","Jun 1996","Jun 1996","Jun 1996")
lon <- c(114.1592, 114.1294, 114.1144, 114.0228, 113.9763, 113.9431)
lat <- c(22.35694, 22.31306, 22.33000, 22.37167, 22.37639, 22.45111)
STN <- c("A","B","C","D","E","F")
value <- runif(n=36, min=10, max=20)
df<- data.frame(STN,lon,lat)
df<- rbind(df,df,df,df,df,df)
df <- cbind(df,yearmon,value)
df$value[df$value < 12] <- NA
# dataset of participant location
id <- c(1,2,3,4)
lon.p <- c(114.3608, 114.1850, 114.1581, 114.1683)
lat.p <- c(22.44500, 22.33000, 22.28528, 22.37167)
participant <- data.frame(id,lon.p,lat.p)
#
The sample datasets are shown below. I want to calculate the distance between each station (A-F) and each participant (1-4) at each time point (yearmon), and assign each participant the value of the closest station at that time point. I cannot assign the participants to a station up front, because the station locations may change between time points (although they do not change in the sample dataset).
For example, if participant 1 lives closest to Station A in Jan 1996, then he/she should be assigned the value 17.03357.
I would prefer the great-circle distance, perhaps calculated with something like this:
rdist.earth(location1, location2 ,miles=FALSE, R=6371)
head(df,10)
STN lon lat yearmon value
1 A 114.1592 22.35694 Jan 1996 17.03357
2 B 114.1294 22.31306 Jan 1996 NA
3 C 114.1144 22.33000 Jan 1996 17.98293
4 D 114.0228 22.37167 Jan 1996 15.98854
5 E 113.9763 22.37639 Jan 1996 16.78647
6 F 113.9431 22.45111 Jan 1996 18.89551
7 A 114.1592 22.35694 Feb 1996 NA
8 B 114.1294 22.31306 Feb 1996 19.90123
9 C 114.1144 22.33000 Feb 1996 17.88482
10 D 114.0228 22.37167 Feb 1996 13.80029
participant
id lon.p lat.p
1 1 114.3608 22.44500
2 2 114.1850 22.33000
3 3 114.1581 22.28528
4 4 114.1683 22.37167
At the end, I think this is what I would like to return. (But with the value filled in)
id lon.p lat.p Apr 1996 Feb 1996 Jan 1996 Jun 1996 Mar 1996 May 1996
1 1 114.3608 22.44500
2 2 114.1850 22.33000
3 3 114.1581 22.28528
4 4 114.1683 22.37167
Thank you.
Here's a way to do it in a couple of steps. Note that I created a naive_dist function just as a placeholder for the distance metric. The function comes from here.
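naive_dist <- function(long1, lat1, long2, lat2) {
R <- 6371 # Earth mean radius [km]
# spherical law of cosines; sin()/cos() expect radians, so coordinates given
# in degrees should be converted (x * pi / 180) before real use
d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
return(d) # Distance in km
}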
dist_by_id <- by(participant, participant$id, FUN = function(x)
#you would use your distance metric here
naive_dist(long1 = x$lon.p, long2 = df$lon, lat1 = x$lat.p, lat2 = df$lat)
)
#function to find the min for each yearmon, by id
find_min <- function(id, data, by_data){
data$dist_column = by_data[[id]]
by(data, data$yearmon, FUN = function(x) x[which.min(x$dist_column),]$value)
}
#initialize
participant[,4:9] = 0
#by() returns its groups in sorted (alphabetical) yearmon order, so the column names must be sorted the same way
names(participant)[4:9] = sort(as.character(unique(df$yearmon)))
#use a for loop to fill in the values
for(i in 1:4){
participant[i,4:9] = stack(find_min(id = i, data = df, by_data = dist_by_id))[,1]
}
participant
id lon.p lat.p Apr 1996 Feb 1996 Jan 1996 Jun 1996 Mar 1996 May 1996
1 1 114.3608 22.44500 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556
2 2 114.1850 22.33000 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556
3 3 114.1581 22.28528 18.57447 13.85192 17.52038 NA 16.14562 18.06435
4 4 114.1683 22.37167 17.36620 18.88409 19.53951 19.35646 13.00518 18.45556
Obviously once you change the distance metric these results may change.
Alternatively, here's an option that uses dplyr. I would tend to prefer this solution, since it is likely to perform better.
library(dplyr)
df2 <- merge(df, participant, all = TRUE) #no shared columns, so this pairs every station row with every participant
#calculate distance
df2$distance <- naive_dist(long1 = df2$lon, lat1 = df2$lat,
long2 = df2$lon.p, lat2 = df2$lat.p)
df3 <- df2 %>%
group_by(yearmon, id) %>%
filter(distance == min(distance)) %>%
select(id, yearmon, value)
participant2 <- participant
participant2[,4:9] <- 0
names(participant2)[4:9] <- as.character(unique(df$yearmon))
for(i in 1:4){
participant2[i,4:9] = c(subset(df3, id == i)$value)
}
participant2
id lon.p lat.p Jan 1996 Feb 1996 Mar 1996 Apr 1996 May 1996 Jun 1996
1 1 114.3608 22.44500 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
2 2 114.1850 22.33000 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
3 3 114.1581 22.28528 17.52038 13.85192 16.14562 18.57447 18.06435 NA
4 4 114.1683 22.37167 19.53951 18.88409 13.00518 17.36620 18.45556 19.35646
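Since the question asks for the great-circle distance, the placeholder metric could be swapped for a haversine function that converts degrees to radians first. The sketch below is one possible version, not the method used in the answer above. It reuses the station and participant coordinates from the question and only finds the nearest station for a single set of station locations. The function name haversine_km is made up.

```r
# Haversine great-circle distance in km; inputs are longitude/latitude in degrees
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * R * asin(pmin(1, sqrt(a)))
}

stations <- data.frame(
  STN = c("A", "B", "C", "D", "E", "F"),
  lon = c(114.1592, 114.1294, 114.1144, 114.0228, 113.9763, 113.9431),
  lat = c(22.35694, 22.31306, 22.33000, 22.37167, 22.37639, 22.45111),
  stringsAsFactors = FALSE
)
participant <- data.frame(
  id = 1:4,
  lon.p = c(114.3608, 114.1850, 114.1581, 114.1683),
  lat.p = c(22.44500, 22.33000, 22.28528, 22.37167)
)

# For each participant, compute the distance to every station and keep the closest
nearest <- vapply(seq_len(nrow(participant)), function(i) {
  d <- haversine_km(participant$lon.p[i], participant$lat.p[i],
                    stations$lon, stations$lat)
  stations$STN[which.min(d)]
}, character(1))
nearest
```

If station locations change over time, the same lookup would be repeated within each yearmon group, using that month's station coordinates.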

Adding a seasons column to data table based on month dates

I'm using data.table and I am trying to add a new column, called "season", which holds the corresponding season (e.g. summer, winter) based on a column called "MonthName".
I'm wondering whether there is a more efficient way to add a season column to a data table based on month values.
This is the first 6 of 300,000 observations, assume that the table is called "dt".
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb
I have tried the following code:
dt[, Season := ifelse(MonthName == c("Jun", "Jul", "Aug"),"Winter", ifelse(MonthName == c("Dec", "Jan", "Feb"), "Summer", ifelse(MonthName == c("Sep", "Oct", "Nov"), "Spring" , ifelse(MonthName == c("Mar", "Apr", "May"), "Autumn", NA))))]
Which returns:
rrp Year Month Finyear hourminute AvgPriceByTOD MonthName Season
1: 35.27500 1999 1 1999 00:00 33.09037 Jan NA
2: 21.01167 1999 1 1999 00:00 33.09037 Jan Summer
3: 25.28667 1999 2 1999 00:00 33.09037 Feb Summer
4: 18.42334 1999 2 1999 00:00 33.09037 Feb NA
5: 16.67499 1999 2 1999 00:00 33.09037 Feb NA
6: 18.90001 1999 2 1999 00:00 33.09037 Feb Summer
I get these warnings:
Warning messages:
1: In MonthName == c("Jun", "Jul", "Aug") :
longer object length is not a multiple of shorter object length
2: In MonthName == c("Dec", "Jan", "Feb") :
longer object length is not a multiple of shorter object length
3: In MonthName == c("Sep", "Oct", "Nov") :
longer object length is not a multiple of shorter object length
4: In MonthName == c("Mar", "Apr", "May") :
longer object length is not a multiple of shorter object length
Alongside this, for reasons that I don't understand, some of the summer months are correctly assigned "Summer" but others are assigned NA; e.g. rows 1 and 2 should both be summer, but they come out differently.
Thanks in advance!
One pretty straightforward way is to use a lookup table to map month names to seasons:
# create a named vector where names are the month names and elements are seasons
seasons <- rep(c("winter","spring","summer","fall"), each = 3)
names(seasons) <- month.abb[c(6:12,1:5)] # thanks thelatemail for pointing out month.abb
seasons
# Jun Jul Aug Sep Oct Nov Dec Jan
#"winter" "winter" "winter" "spring" "spring" "spring" "summer" "summer"
# Feb Mar Apr May
#"summer" "fall" "fall" "fall"
Use it:
dt[, season := seasons[MonthName]]
data:
dt <- setDT(read.table(text=" rrp Year Month Finyear hourminute AvgPriceByTOD MonthName
1: 35.27500 1999 1 1999 00:00 33.09037 Jan
2: 21.01167 1999 1 1999 00:00 33.09037 Jan
3: 25.28667 1999 2 1999 00:00 33.09037 Feb
4: 18.42334 1999 2 1999 00:00 33.09037 Feb
5: 16.67499 1999 2 1999 00:00 33.09037 Feb
6: 18.90001 1999 2 1999 00:00 33.09037 Feb",
header = TRUE, stringsAsFactors = FALSE))
A bit of typing, but the code is efficient
dt[MonthName %in% c("Jun","Jul","Aug"), Season := "Winter"]
dt[MonthName %in% c("Dec","Jan","Feb"), Season := "Summer"]
dt[MonthName %in% c("Sep","Oct","Nov"), Season := "Spring"]
dt[is.na(Season), Season := "Autumn"] # everything not yet assigned (Mar, Apr, May)
Here we are assigning by-reference on a subset of the data.table
I prefer this to a lot of nested ifelses
If you want to check if a value is in a vector, you have to use %in%. See the different behaviour of:
myVec <- c("a","b","c")
"a" == myVec
[1] TRUE FALSE FALSE
"a" %in% myVec
[1] TRUE
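The same difference explains the NA pattern in the question: with ==, the three-element vector is recycled along MonthName and compared position by position. A minimal sketch using the six MonthName values from the sample rows (the variable names are made up):

```r
MonthName <- c("Jan", "Jan", "Feb", "Feb", "Feb", "Feb")
summer <- c("Dec", "Jan", "Feb")

# == recycles `summer` to Dec, Jan, Feb, Dec, Jan, Feb and compares elementwise
by_equals <- MonthName == summer

# %in% asks, for each element of MonthName, whether it occurs anywhere in `summer`
by_in <- MonthName %in% summer

data.frame(MonthName, by_equals, by_in)
```

Rows 1, 4 and 5 fail the == comparison only because they happen to line up with a different month in the recycled vector, which matches the rows that came out as NA in the question.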
