Simple function over twisting data - exercise - R

I have a function that I want to run over a dataset, but it does not seem to work, and it gives a message that I cannot quite figure out. I will describe the dataset first, because its layout is a bit twisted, and then show my function. The dataset is named cohort_1x1:
Year Age Male Female
1940 18 0.001234 0.003028
1940 19 0.005278 0.002098
.... .. ........ ........ #up to age 100
1940 100 0.004678 0.006548
1941 18 0.004567 0.002753 # start new year, with age 18
1941 19 0.005633 0.001983
.... .. ........ ........
1979 100 0.003456 0.00256
The dataset contains death rates for males and females for every age from 18 to 100, for the years 1940 to 1979. The function that I have written is this:
gompakmean=function(theta,y_0,h){
# Returns the expected remaining number of years at age "y_0" under
# the Gompertz-Makeham model with parameter "theta" (vector).
# "h" is the time increment for the numerical integration.
l_0=y_0/h
l_e=110/h
ll=(l_0+1):l_e
hh=(theta[2]/theta[3])*(exp(theta[3]*h)-1)
p=exp(-theta[1]*h-hh*exp(theta[3]*ll*h))
P=cumprod(p)
(0.5+sum(P))*h
}
This function should return the expected remaining number of years for every year and age group, computed separately for males and females.
But if I try it with the input below, I get the following message:
Input:
s=-c(8,9,2)
theta=exp(s)
y_0 = cohort_1x1$Male
h = 1
gompakmean(theta,y_0,h)
This leads to the following output and warning:
[1] 46.13826
Warning message:
In (l_0 + 1):l_e :
numerical expression has 3320 elements: only the first used
So I get an output for the first year (age?), which is 46.13826.
But then the function seems to stop, hence the warning. Is the reason somehow related to my dataset? Maybe after running over 1940 it needs to move on to 1941? But would that only give me output for age 18 in every year?
My aim is to calculate the expected number of years for every age group in every year, i.e. for every cohort in all the years.
Appreciate all answers!:)
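One possible reading of the warning: the `:` operator in `(l_0 + 1):l_e` only uses the first element of its arguments, so passing the whole `cohort_1x1$Male` column (3320 death rates) as `y_0` yields a single result. The comments in the function suggest that `y_0` is meant to be an age. A minimal sketch under that assumption, calling the function once per age with `sapply` (the `ages` vector and the use of `sapply` are assumptions, not part of the original code):

```r
# The function from the question, unchanged apart from spacing.
gompakmean <- function(theta, y_0, h) {
  l_0 <- y_0 / h
  l_e <- 110 / h
  ll <- (l_0 + 1):l_e
  hh <- (theta[2] / theta[3]) * (exp(theta[3] * h) - 1)
  p <- exp(-theta[1] * h - hh * exp(theta[3] * ll * h))
  P <- cumprod(p)
  (0.5 + sum(P)) * h
}

theta <- exp(-c(8, 9, 2))

# Assumption: y_0 is a single age, so loop over the ages in the dataset.
ages <- 18:100
e_x <- sapply(ages, function(a) gompakmean(theta, a, h = 1))

# One expected remaining lifetime per age.
head(data.frame(Age = ages, expected_years = e_x))
```

With a fixed `theta` the result depends only on the age, not on the year or sex. Getting different values per cohort would presumably require estimating a separate `theta` from the death rates of each cohort, which this sketch does not attempt.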

Related

Return closest matching value in separate dataframe in R

DATAFRAMES
DataFrame 1
GENDER HANDICAP PRIMARYREC
male 10 Model1
female 12 Model2
male 18 Model3
DataFrame 2
MODEL1Lofts MODEL2Lofts MODEL3Lofts
9 10.5 9
10.5 10.5
12
My current filter function:
GetDriver <- function(gender, handicap) {
  driverfunc = driver_model[driver_model$Gender==gender &
    driver_model$HandicapMin<=handicap & driver_model$HandicapMax>=handicap, ]
  return(driverfunc) }
**EXAMPLE INPUT** GetDriver(gender = "Male", handicap = 5)
Note: It automatically finds the recommended model
GOAL
What I'm looking to do is add a nested function within my filter (if possible) that takes the user's "current loft" as input, looks up the possible values in DataFrame2, and adds the closest match to driverfunc.
This is literally my 4th day doing R and I've searched around a lot but getting stuck and any help would be tremendously appreciated!
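A minimal sketch of the closest-value lookup, assuming the lofts for each model can be stored as one numeric vector per model. The `loft_table` values and the `closest_loft` helper are hypothetical, since the exact layout of DataFrame 2 is not clear from the question:

```r
# Hypothetical loft table: one vector of available lofts per model.
loft_table <- list(
  Model1 = c(9, 10.5, 12),
  Model2 = c(10.5, 12),
  Model3 = c(9, 10.5)
)

# Return the available loft closest to the user's current loft.
# On ties, which.min() picks the first (here: the lower) candidate.
closest_loft <- function(model, current_loft, lofts = loft_table) {
  candidates <- lofts[[model]]
  candidates <- candidates[!is.na(candidates)]
  candidates[which.min(abs(candidates - current_loft))]
}

closest_loft("Model1", current_loft = 10)
```

Inside the filter function, the result could be attached as an extra column of `driverfunc`, using the recommended model from that row as the `model` argument.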

Looping regressions and running column sum based on results

I have a data frame with panel data that looks as follows:
countrycode year 7111 7112 7119 7126 7129 7131 7132 7133 7138
1 AGO 1981 380491 149890 238832 0 166690 449982 710642 430481 890546
2 AGO 1982 339626 66434 183487 0 79682 108356 486799 186884 220545
3 AGO 1983 128043 2697 91404 148617 3988 432725 829958 138764 152822
4 AGO 1984 67832 0 85613 1251 45644 361733 1250272 237236 2952746
5 AGO 1985 354335 11225 143000 2130 7687 2204297 942071 408907 474666
There are 159 four-digit column variables like the ones shown above. There are also column variables named CEPI1_fw and CIPI1_fw. Furthermore, there are 46 countries and 34 years in the data set.
I would like to use the plm command to regress each of the numerical column variables on CEPI1_fw and CIPI1_fw. Then, I would like to sum the numerical column variables in the data frame above based on whether the coefficients from the regressions are above or below a certain threshold. The resulting output should be a pair of columns added to the data frame above.
There are a few ambiguities in your question, but I'll take a shot.
First, I'm going to revamp your code slightly: adding rows to data frames is very inefficient (probably doesn't matter in this application, but it's a bad habit to get into ...)
library(plm)
out <- list()  # collect one-row data frames here, combine once at the end
for (i in colnames(master5)) {
  f <- reformulate(c("CEPI1_fw","CIPI1_fw"),
                   response=paste0("master5$",i))
  m <- summary(plm(f, data = master4, model = "within"))
  out <- c(out, list(data.frame(yvar=i, coef=m$coefficients[1,1],
                                pval= m$coefficients[1,4],
                                stringsAsFactors=FALSE)))
}
out <- do.call(rbind, out) ## combine elements into a single data frame
Select only statistically significant response variables. From a statistical/inferential point of view, this is probably a bad idea ...
out <- out[out$pval<0.05,]
Select the names of variables where the coefficients are above a threshold
big_vars <- out$yvar[abs(out$coef)>threshold]
Compute column sums from another data set ...
colSums(other_data[big_vars])
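If the goal is a pair of new columns in the original data frame (one sum per row over the variables above the threshold, one over the rest), `rowSums` may be closer to what the question describes than `colSums`. A minimal sketch with made-up numbers: `panel`, `coefs` and `threshold` are hypothetical stand-ins for the real data and regression results:

```r
# Toy panel with three of the four-digit columns from the question.
panel <- data.frame(
  countrycode = c("AGO", "AGO"),
  year = c(1981, 1982),
  `7111` = c(380491, 339626),
  `7112` = c(149890, 66434),
  `7119` = c(238832, 183487),
  check.names = FALSE  # keep the numeric column names as they are
)

# Hypothetical coefficients, one per response variable.
coefs <- c(`7111` = 0.8, `7112` = -0.2, `7119` = 1.5)
threshold <- 0.5

above <- names(coefs)[coefs > threshold]
below <- names(coefs)[coefs <= threshold]

# One sum per row (country-year) for each group of variables.
panel$sum_above <- rowSums(panel[above])
panel$sum_below <- rowSums(panel[below])
panel
```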

R code optimizing for rep function

I'm working with data from an income/expense per home poll.
The 9,002 observations from the sample data base represent 3,155,937 homes through an expansion factor like this.
Homeid Income Factor
001 23456 678
002 42578 1073
.. .. ..
9002 62333 987
I'm trying to get an exact summary of the total income per decile by expanding each income value by its factor, which gives a vector of 3,155,937 observations, and then I'm using a 'for' loop to assign each value the decile it belongs to.
Three <- Nal %>% select(income,factor)
Five <- data.frame(income=rep(Three$income,Three$factor))
for(i in 1:31559379){if(i<=3155937){Five$Decil[i]=1}
else{if(i<=6311874){Five$Decil[i]=2}
else{if(i<=9467811){Five$Decil[i]=3}
else{if(i<=12623748){Five$Decil[i]=4}
else{if(i<=15779685){Five$Decil[i]=5}
else{if(i<=18935622){Five$Decil[i]=6}
else{if(i<=22091559){Five$Decil[i]=7}
else{if(i<=25247496){Five$Decil[i]=8}
else{if(i<=28403433){Five$Decil[i]=9}
else{Five$Decil[i]=10}
}}}}}}}}}
for(i in 1:10){Two=filter(Five,Decil==i);
TotDecil$inctot[i]=sum(Two$income)}
rm(Five);rm(Three);rm(Two);gc()
I want to know if you can help me optimize this code; it has been running for hours and still hasn't finished.
The ntile function from the dplyr package worked better:
Three <- Nal %>% select(income,factor)
Five <- data.frame(income=rep(Three$income,Three$factor))
Five$Decil <- ntile(Five$income,10)
# ^ This line works instead of that 'for' loop & it only takes seconds to run
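For comparison, a base R sketch of the same idea without dplyr: sort the expanded incomes, assign deciles by position in one vectorized step, and total them with `tapply`. The `homes` data frame is a made-up example, and the bucket boundaries may differ slightly from what `ntile` produces when the number of rows is not a multiple of 10:

```r
# Toy survey: income per home and its expansion factor.
homes <- data.frame(
  income = c(100, 200, 300, 400),
  factor = c(3, 2, 4, 1)
)

# Expand each income by its factor and sort, so deciles follow income rank.
expanded <- sort(rep(homes$income, homes$factor))

# Vectorized decile assignment by position; no loop needed.
decile <- ceiling(seq_along(expanded) * 10 / length(expanded))

# Total income per decile.
totals <- tapply(expanded, decile, sum)
totals
```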

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30 day window of the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
e<-x$compdate[i]
f<-e-29
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),]
b<-a[a$compdate<=e & a$compdate>=f,]
x$caseload[i]<-length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming; I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window))
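A small variation on the rolling-window idea above, as a sketch: using `findInterval(y, y)` instead of `seq_along(y)` counts every case decided on the same day, which seems to match what the loop in the question computes. The `count_in_window` helper and the toy data are assumptions for illustration, and `ave` is used here so the result stays aligned with the rows:

```r
# Toy data: two judges, with two cases on the same day for judge A.
x <- data.frame(
  idnumber = 1:6,
  judge = c("A", "B", "A", "A", "B", "A"),
  compdate = c(5, 100, 1, 40, 10, 5)
)

# Dates must be sorted within each judge for findInterval().
x <- x[order(x$judge, x$compdate), ]

# Number of cases with a date in (y - window, y], ties included.
count_in_window <- function(y, window = 30) {
  findInterval(y, y) - findInterval(y - window, y)
}

# Apply per judge; ave() returns values in the original row order.
x$caseload <- ave(x$compdate, x$judge, FUN = count_in_window)
x
```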
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes's simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

Aggregating, restructuring hourly time series data in R

I have a year's worth of hourly data in a data frame in R:
> str(df.MHwind_load) # compactly displays structure of data frame
'data.frame': 8760 obs. of 6 variables:
$ Date : Factor w/ 365 levels "2010-04-01","2010-04-02",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time..HRs. : int 1 2 3 4 5 6 7 8 9 10 ...
$ Hour.of.Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Wind.MW : int 375 492 483 476 486 512 421 396 456 453 ...
$ MSEDCL.Demand: int 13293 13140 12806 12891 13113 13802 14186 14104 14117 14462 ...
$ Net.Load : int 12918 12648 12323 12415 12627 13290 13765 13708 13661 14009 ...
While preserving the hourly structure, I would like to know how to extract
a particular month/group of months
the first day/first week etc of each month
all mondays, all tuesdays etc of the year
I have tried using "cut" without result and after looking online think that "lubridate" might be able to do so but haven't found suitable examples. I'd greatly appreciate help on this issue.
Edit: a sample of data in the data frame is below:
Date Hour.of.Year Wind.MW datetime
1 2010-04-01 1 375 2010-04-01 00:00:00
2 2010-04-01 2 492 2010-04-01 01:00:00
3 2010-04-01 3 483 2010-04-01 02:00:00
4 2010-04-01 4 476 2010-04-01 03:00:00
5 2010-04-01 5 486 2010-04-01 04:00:00
6 2010-04-01 6 512 2010-04-01 05:00:00
7 2010-04-01 7 421 2010-04-01 06:00:00
8 2010-04-01 8 396 2010-04-01 07:00:00
9 2010-04-01 9 456 2010-04-01 08:00:00
10 2010-04-01 10 453 2010-04-01 09:00:00
.. .. ... .......... ........
8758 2011-03-31 8758 302 2011-03-31 21:00:00
8759 2011-03-31 8759 378 2011-03-31 22:00:00
8760 2011-03-31 8760 356 2011-03-31 23:00:00
EDIT: Additional time-based operations I would like to perform on the same dataset
1. Perform hour-by-hour averaging for all data points, i.e. the average of all values in the first hour of each day in the year. The output will be an "hourly profile" of the entire year (24 time points)
2. Perform the same for each week and each month, i.e. obtain 52 and 12 hourly profiles respectively
3. Do seasonal averages, for example for June to September
Convert the date to the format which lubridate understands and then use the functions month, mday, wday respectively.
Suppose you have a data.frame with the time stored in column Date, then the answer for your questions would be:
library(lubridate)
###dummy data.frame
df <- data.frame(Date=c("2012-01-01","2012-02-15","2012-03-01","2012-04-01"),a=1:4)
##1. Select rows for particular month
subset(df,month(Date)==1)
##2a. Select the first day of each month
subset(df,mday(Date)==1)
##2b. Select the first week of each month
##get the week numbers which have the first day of the month
wkd <- subset(week(df$Date),mday(df$Date)==1)
##select the weeks with particular numbers
subset(df,week(Date) %in% wkd)
##3. Select all mondays
##(wday() counts from Sunday = 1 by default, so Monday is 2)
subset(df,wday(Date)==2)
1. First switch to a Date representation: as.Date(df.MHwind_load$Date)
2. Then call weekdays on the date vector to get a new factor labelled with day of week
3. Then call months on the date vector to get a new factor labelled with name of month
4. Optionally create a years variable (see below).
5. Now subset the data frame using the relevant combination of these.
Step 2. gets an answer to your task 3. Steps 3. and 4. get you to task 1. Task 2 might require a line or two of R. Or just select rows corresponding to, say, all the Mondays in a month and call unique, or its alter-ego duplicated on the results.
To get you going...
newdf <- df.MHwind_load ## build an augmented data set
newdf$d <- as.Date(newdf$Date)
newdf$month <- months(newdf$d)
newdf$day <- weekdays(newdf$d)
## for some reason R has no years function. Here's one
years <- function(x){ format(as.Date(x), format = "%Y") }
newdf$year <- years(newdf$d)
# get observations from January to March of every year
subset(newdf, month %in% c('January', 'February', 'March'))
# get all Monday observations
subset(newdf, day == 'Monday')
# get all Mondays in 1999
subset(newdf, day == 'Monday' & year == '1999')
# slightly fancier: _first_ Monday of each month
# flag the first occurrence of each (month, weekday) pair
first.week.of.month <- !duplicated(cbind(newdf$month, newdf$day))
# now pull out the mondays
subset(newdf, first.week.of.month & day=='Monday')
Since you're not asking about the time (hourly) part of your data, it is best to then store your data as a Date object. Otherwise, you might be interested in chron, which also has some convenience functions like you'll see below.
With respect to Conjugate Prior's answer, you should store your date data as a Date object. Since your data already follows the default format ('yyyy-mm-dd') you can just call as.Date on it. Otherwise, you would have to specify your string format. I would also use as.character on your factor to make sure you don't get errors. I know I've run into problems with factors-into-Dates for that reason (possibly corrected in the current version).
df.MHwind_load <- transform(df.MHwind_load, Date = as.Date(as.character(Date)))
Now you would do well to create wrapper functions that extract the information you desire. You could use transform like I did above to simply add those columns that represent months, days, years, etc, and then subset on them logically. Alternatively, you might do something like this:
getMonth <- function(x, mo) { # This function assumes w/in single year vector
  isMonth <- month(x) %in% mo # Boolean of matching months (month() is from lubridate)
  return(x[which(isMonth)]) # Return vector of matching months
} # end function
Or, in short form
getMonth <- function(x, mo) x[month(x) %in% mo]
This is just a tradeoff between storing that information (transform frame) or having it processed when desired (use accessor methods).
A more complicated process is your need for, say, the first day of a month. This is not entirely difficult, though. Below is a function that will return all of those values, but it is rather simple to just subset a sorted vector of values for a given month and take their first one.
getFirstDay <- function(x, mo) {
isMonth <- months(x) %in% mo
x <- sort(x[isMonth]) # Look at only those in the desired month.
# Sort them by date. We only want the first day.
nFirsts <- rle(as.numeric(x))$len[1] # Returns length of 1st days
return(x[seq(nFirsts)])
} # end function
The easier alternative would be
getFirstDayOnly <- function(x, mo) {sort(x[months(x) %in% mo])[1]}
I haven't prototyped these, as you didn't provide any data samples, but this is the sort of approach that can help you get the information you desire. It is up to you to figure out how to put these into your work flow. For instance, say you want to get the first day for each month of a given year (assuming we're only looking at one year; you can create wrappers or pre-process your vector to a single year beforehand).
# Return a vector of first days for each month
df <- transform(df, date = as.Date(as.character(date)))
sapply(unique(months(df$date)), # Iterate through months in Dates
function(month) {getFirstDayOnly(df$date, month)})
The above could also be designed as a separate convenience function that uses the other accessor function. In this way, you create a series of direct but concise methods for getting pieces of the information you want. Then you simply pull them together to create simple, easy-to-interpret functions that you can use in your scripts to get precisely what you want.
You should be able to use the above examples to figure out how to prototype other wrappers for accessing the date information you require. If you need help on those, feel free to ask in a comment.
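For the hour-by-hour averaging mentioned in the question's edit, a base R sketch with `aggregate` might look like the following. The `toy` data frame is made up (two days of hourly values); only the `datetime` and `Wind.MW` column names come from the question, and the `hour` and `month` helper columns are assumptions:

```r
# Toy data: 48 hourly observations starting at the question's first timestamp.
datetime <- seq(as.POSIXct("2010-04-01 00:00:00", tz = "UTC"),
                by = "hour", length.out = 48)
toy <- data.frame(
  datetime = datetime,
  Wind.MW = rep(1:24, 2) + rep(c(0, 10), each = 24)
)

# Helper columns: hour of day (0-23) and month ("01"-"12").
toy$hour <- as.integer(format(toy$datetime, "%H"))
toy$month <- format(toy$datetime, "%m")

# Hourly profile over the whole period: one mean per hour of day.
hourly_profile <- aggregate(Wind.MW ~ hour, data = toy, FUN = mean)

# One hourly profile per month: one mean per (month, hour) pair.
monthly_profile <- aggregate(Wind.MW ~ month + hour, data = toy, FUN = mean)

head(hourly_profile)
```

A seasonal average could follow the same pattern by first subsetting on `month`, for example keeping only the rows where `month` is one of "06" to "09", and then aggregating by `hour`.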

Resources