Interpolate variables on subsets of dataframe - r

I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that it's time-location is lined up with (time1,obs1).
ie I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this. But keeping getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
df[,"new.obs2"] <- approxExtrap(df[,"time1"],
df[,"obs1"],
xout = df[,"obs2"],
method="linear",
rule=2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially - I am just trying to interpolate "obs2", within each state/survey.year combination, so it's time references line up with those of "obs1".
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!

This should be as simple as,
ddply(df,.(state,survey.year),transform,
new.obs2 = approxExtrap(time1,obs1,xout = obs2,
method = "linear",
rule = 2))
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)

Related

icc on dataframe with row for each rater

Let me start off with saying I'm completely new to R and trying to figure out how to run icc on my specific dataset which might be a bit different then normally.
The dataset looks as follows
+------------+------------------+--------------+--------------+--------------+
| date | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1 | 15.5 | 34.3 | 43.2 |
| 25-04-2020 | 2 | 21.2 | 12.3 | 2.2 |
| 25-04-2020 | 3 | 16.2 | 9.6 | 43.3 |
| 25-04-2020 | 4 | 27 | 1 | 6 |
+------------+------------------+--------------+--------------+--------------+
now I want to do icc on all of those rows since each row stands for a different rater. It should leave the date and measurement_type columns out.
can someone point me in the right direction, I have absolutely no idea how to go about this.
------- EDIT -------
I exported the actual dataset that will come out with some test data.
Which is available here
The 2 important sheets here are the first and third.
The first contains all the participants of the research and the third contains all 4 different reports for each participant. The code I have so far just to tie each report to the correct participant;
library("XLConnect")
library("sqldf")
library("irr")
library("dplyr")
library("tidyr")
# Load in Workbook
wb = loadWorkbook("Measuring.xlsx")
# Load in Worksheet
# Sheet 1 = Study Results
# Sheet 3 = Meetpunten
records = readWorksheet(wb, sheet=1)
reports = readWorksheet(wb, sheet=3)
for (record in 1:nrow(records)) {
recordId = records[record, 'Record.Id']
participantReports = sqldf(sprintf("select * from reports where `Record.Id` = '%s'", recordId))
baselineReport = sqldf("select * from participantReports where measurement_type = '1'")
drinkReport = sqldf("select * from participantReports where measurement_type = '2'")
regularReport = sqldf("select * from participantReports where measurement_type = '3'")
exerciseReport = sqldf("select * from participantReports where measurement_type = '4'")
}
Since in your data each row stands for a different rater, but the icc function in the irr package needs the raters to be columns, you can ignore the two first columns of your table, transpose it and run icc.
So, assuming this table:
+------------+------------------+--------------+--------------+--------------+
| date | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1 | 15.5 | 34.3 | 43.2 |
| 25-04-2020 | 2 | 21.2 | 12.3 | 2.2 |
| 25-04-2020 | 3 | 16.2 | 9.6 | 43.3 |
| 25-04-2020 | 4 | 27 | 1 | 6 |
+------------+------------------+--------------+--------------+--------------+
is stored in a variable called data, i would do it like this:
data2 = data.matrix(data[,-c(1,2)]) # generates the dataset without the first two columns
data2 is this table:
+--------------+--------------+--------------+
| measurement1 | measurement2 | measurement3 |
+--------------+--------------+--------------+
| 15.5 | 34.3 | 43.2 |
| 21.2 | 12.3 | 2.2 |
| 16.2 | 9.6 | 43.3 |
| 27 | 1 | 6 |
+--------------+--------------+--------------+
Then:
data2 = t(data2) # transpose data2 so as to have raters in the columns and their ratings in each line
icc(data2) # here i'm not bothering with the parameters, but you should explore the appropriate icc parameters for your needs.
should generate a correct run.

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yields to an empty dataframe as for some particular specifications I do not have data; hence no proportion table can be calculated. So, what I would like to do is to take the closest time step from which I have a non-empty subsetted dataframe and use it as an input for the empty subsetted dataframe.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refers to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005 in area 24 I had 8 individuals belonging to a length class (lenCLs) of 380 mm and age=3. It is worth to mention that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=T){
require(FSA)
# subset alk by year, quarter and area
sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
raw <- t(table(dfexp$lenCls, dfexp$age))
key <- round(prop.table(raw, margin=1), 3)
return(key)
if(alkplot==TRUE){
alkPlot(key,"area",xlab="Age")
}
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for either quarter 1 or 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (herby quarter 2 with the same area and year), and replace the NA from the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=T){
require(FSA)
# subset alk by year, quarter and area
sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
# In case of empty dataset
#if(is.data.frame(sALK) && nrow(sALK)==0){
if(sALK[rowSums(is.na(sALK)) > 0,]){
warning("Empty subset combination; data will be subsetted based on the
nearest timestep combination")
FIXME: INCLDUE IMPUTATION RULES HERE
}
dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
raw <- t(table(dfexp$lenCls, dfexp$age))
key <- round(prop.table(raw, margin=1), 3)
return(key)
if(alkplot==TRUE){
alkPlot(key,"area",xlab="Age")
}
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of someone's interest:
LAK <- function(df, Year="2005", Quarter="1", Area="22",alkplot=T){
require(FSA)
# subset alk by year, quarter, area and species
sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
print(sALK)
if(nrow(sALK)==1){
warning("Empty subset combination; data has been subsetted to the nearest input combination")
syear <- unique(as.numeric(as.character(sALK$year)))
sarea <- unique(as.numeric(as.character(sALK$area)))
sALK2 <- subset(df, year==syear & area==sarea)
vals <- as.data.frame(table(sALK2$comb_index))
colnames(vals)[1] <- "comb_index"
idx <- which(vals$Freq>1)
quarterId <- as.numeric(as.character(vals[idx,"comb_index"]))
imput <- subset(df,year==syear & area==sarea & comb_index==quarterId)
dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
key2 <- round(prop.table(raw2, margin=1), 3)
print(key2)
if(alkplot==TRUE){
alkPlot(key2,"area",xlab="Age")
}
} else {
dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
raw <- t(table(dfexp$lenCls, dfexp$age))
key <- round(prop.table(raw, margin=1), 3)
print(key)
if(alkplot==TRUE){
alkPlot(key,"area",xlab="Age")
}
}
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out how to deal when I do not have data for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data is distributed so that you can choose the method most suited for your problem. Check this brief explanation and the original package description

R - create a Panel dataset from 2 Cross sectional datasets

Could you please help me with the following task of creating a panel dataset from two cross-sectional datasets?
Specifically, a small portion of the cross-sectional datasets are:
1) - data1
ID| Yr | D | X
-------------------
1 | 2002 | F | 25
2 | 2002 | T | 27
& 2) - data2
ID | Yr | D | X
---------------------
1 | 2003 | T | 45
2 | 2003 | F | 35
And would like to create a panel of the form:
ID | Yr | D | X
-----------------------
1 | 2002 | F | 25
1 | 2003 | T | 45
2 | 2002 | T | 27
2 | 2003 | F | 35
The codes I have tried so far are:
IDvec<-data1[,1]
ID_panel=c()
for (i in 1:length(IDvec)) {
x<-rep(IDvec[i],2)
ID_panel<-append(ID_panel,x)
}
Years_panel<-rep(2002:2003,length(IDvec))
But cannot quite figure out how to link the 3rd and 4th columns. Any help would be greatly appreciated. Thank you.
Assuming that you want to concatenate data frames, then order by ID and Yr. Here's a dplyr approach:
library(dplyr)
data1 %>%
bind_rows(data2) %>%
arrange(ID, Yr)
ID Yr D X
1 1 2002 F 25
2 1 2003 T 45
3 2 2002 T 27
4 2 2003 F 35

Shift planning with Linear Programming

The Modeling and Solving Linear Programming with R book has a nice example on planning shifts in Sec 3.7. I am unable to solve it with R. Also, I am not clear with the solution provided in the book.
Problem
A company has a emergency center which is working 24 hours a day. In
the table below, is detailed the minimal needs of employees for each of the
six shifts of four hours in which the day is divided.
Shift Employees
00:00 - 04:00 5
04:00 - 08:00 7
08:00 - 12:00 18
12:00 - 16:00 12
16:00 - 20:00 15
20:00 - 00:00 10
R solution
I used the following to solve the above.
library(lpSolve)
obj.fun <- c(1,1,1,1,1,1)
constr <- c(1,1,0,0,0,0,
0,1,1,0,0,0,
0,0,1,1,0,0,
0,0,0,1,1,0,
0,0,0,0,1,1,
1,0,0,0,0,1)
constr.dir <- rep(">=",6)
constr.val <-c (12,25,30,27,25,15)
day.shift <- lp("min",obj.fun,constr,constr.dir,constr.val,compute.sens = TRUE)
And, I get the following result.
> day.shift$objval
[1] 1.666667
> day.shift$solution
[1] 0.000000 1.666667 0.000000 0.000000 0.000000 0.000000
This is nowhere close to the numerical solution mentioned in the book.
Numerical solution
The total number of solutions required as per the numerical solution is 38. However, since the problem stated that, there is a defined minimum number of employees in every period, how can this solution be valid?
s1 5
s2 6
s3 12
s4 0
s5 15
s6 0
Your mistake is at the point where you initialize the variable constr, because you don't define it as a matrix. Second fault is your matrix itself. Just look at my example.
I was wondering why you didn't stick to the example in the book because I wanted to check my solution. Mine is based on that.
library(lpSolve)
obj.fun <- c(1,1,1,1,1,1)
constr <- matrix(c(1,0,0,0,0,1,
1,1,0,0,0,0,
0,1,1,0,0,0,
0,0,1,1,0,0,
0,0,0,1,1,0,
0,0,0,0,1,1), ncol = 6, byrow = TRUE)
constr.dir <- rep(">=",6)
constr.val <-c (5,7,18,12,15,10)
day.shift <- lp("min",obj.fun,constr,constr.dir,constr.val,compute.sens = TRUE)
day.shift$objval
# [1] 38
day.shift$solution
# [1] 5 11 7 5 10 0
EDIT based on your question in the comments:
This is the distribution of the shifts on the periods:
shift | 0-4 | 4-8 | 8-12 | 12-16 | 16-20 | 20-24
---------------------------------------------------
20-4 | 5 | 5 | | | |
0-8 | | 11 | 11 | | |
4-12 | | | 7 | 7 | |
8-16 | | | | 5 | 5 |
12-20 | | | | | 10 | 10
18-24 | | | | | |
----------------------------------------------------
sum | 5 | 16 | 18 | 12 | 15 | 10
----------------------------------------------------
need | 5 | 7 | 18 | 12 | 15 | 10
---------------------------------------------------

Tidy Data Layout - convert variables into factors

I have the following data table
| State | Prod. |Non-Prod.|
|-------|-------|---------|
| CA | 120 | 23 |
| GA | 123 | 34 |
| TX | 290 | 34 |
How can I convert this table to tiny data format in R or any other software like Excel?
|State | Class | # of EEs|
|------|----------|---------|
| CA | Prod. | 120 |
| CA | Non-Prod.| 23 |
| GA | Prod. | 123 |
| GA | Non-Prod.| 34 |
Trying using reshape2:
library(reshape2)
melt(df,id.vars='State')
# State variable value
# 1 CA Prod 120
# 2 GA Prod 123
# 3 TX Prod 290
# 4 CA Non-Prod. 23
# 5 GA Non-Prod. 34
# 6 TX Non-Prod. 34

Resources