How to alter R code in an efficient way

I have a sample created as follows:
survival1a= data.frame(matrix(vector(), 50, 2,dimnames=list(c(), c("Id", "district"))),stringsAsFactors=F)
survival1a$Id <- 1:nrow(survival1a)
survival1a$district<- sample(1:4, size=50, replace=TRUE)
This sample has 50 individuals from 4 different districts.
I have a matrix of probabilities (Migdata) that shows the likelihood of migration from one district to another, as follows:
district   prob1     prob2     prob3     prob4
1          0.83790   0.08674   0.05524   0.02014
2          0.02184   0.88260   0.03368   0.06191
3          0.01093   0.03565   0.91000   0.04344
4          0.03338   0.06933   0.03644   0.86090
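For reproducibility, Migdata could be built directly in R like this (a sketch; the district labels 1:4 are assumed, matching the sample above):
Migdata <- data.frame(
  district = 1:4,
  prob1 = c(0.83790, 0.02184, 0.01093, 0.03338),
  prob2 = c(0.08674, 0.88260, 0.03565, 0.06933),
  prob3 = c(0.05524, 0.03368, 0.91000, 0.03644),
  prob4 = c(0.02014, 0.06191, 0.04344, 0.86090))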
I merge these probabilities with my data with this code:
survival1a<-merge( Migdata,survival1a, by.x=c("district"), by.y=c("district"))
I would like to know in which district each person resides by the end of the year, based on the migration probabilities I have (Migdata).
I have already written code that works perfectly, but with big data it is very time-consuming since it is based on a loop:
for (k in 1:nrow(survival1a)) {
  survival1a$migration[k] <- sample(1:4, size=1, replace=TRUE, prob=survival1a[k, 2:5])
}
Now, I want to write the code in a way that does not rely on a loop and still gives each person's district at the end of the year.
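One possible vectorized sketch (not from the original post) draws all migrations at once via the inverse-CDF trick, assuming the probability columns sit in columns 2:5 after the merge, as in the loop above:
prob_mat <- as.matrix(survival1a[, 2:5])             # per-person migration probabilities
cum_prob <- t(apply(prob_mat, 1, cumsum))            # row-wise cumulative probabilities
u <- runif(nrow(cum_prob))                           # one uniform draw per person
survival1a$migration <- max.col(u < cum_prob, ties.method = "first")
Each person's new district is the first column whose cumulative probability exceeds that person's uniform draw, which replaces the per-row sample() calls with a single vectorized step.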

Related

Statistics on cluster member relationships over several days

Assume I have hourly data corresponding to 5 categories for 10 consecutive days, created as:
library(xts)
set.seed(123)
timestamp <- seq(as.POSIXct("2016-10-01"),as.POSIXct("2016-10-10 23:59:59"), by = "hour")
data <- data.frame(cat1 = rnorm(length(timestamp),150,5),
cat2 = rnorm(length(timestamp),130,3),
cat3 = rnorm(length(timestamp),150,5),
cat4 = rnorm(length(timestamp),100,8),
cat5 = rnorm(length(timestamp),200,15))
data_obj <- xts(data, timestamp) # create time-series object
head(data_obj,2)
Now, for each day separately, I perform clustering and see how these categories behave with respect to each other using simple kmeans as:
daywise_data <- split.xts(data_obj,f="days",k=1) # split data day wise
clus_obj <- lapply(daywise_data, function(x){ # clustering day wise
return (kmeans(t(x), 2))
})
Once clustering is over, I visualize the cluster relationships over the 10 different days with
sapply(clus_obj,function(x) x$cluster) # clustering results
and I obtained a table of day-wise cluster assignments (output not shown here).
On visual inspection, it is clear that cat1 and cat3 always remained in the same cluster. Similarly, cat4 and cat5 are mostly in different clusters across the 10 days.
Apart from visual inspection, is there any automatic approach to gather this type of statistic from such clustering tables?
Note: This is a dummy example. I have a data frame containing such 80 categories over continuous 100 days. An automatic summary like above one will reduce the effort.
Pair-counting cluster evaluation measures show an easy way to tackle this problem.
Rather than looking at object-cluster assignments, which are unstable, these methods look at whether or not two objects are in the same cluster (that is called a "pair").
So you could check if these pairs change much over time, or not.
Since k-means is randomized, you may also want to run it several times for every time slice, as they may return different clusterings!
You could then say that e.g. series 1 is in the same cluster as series 2 in 90% of the results. etc.
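As a rough sketch of this idea (not from the original answer), one could count, for each pair of categories, the fraction of days on which they share a cluster:
clus_mat <- sapply(clus_obj, function(x) x$cluster)   # categories x days matrix of labels
pairs <- t(combn(nrow(clus_mat), 2))                  # all category pairs
same_rate <- apply(pairs, 1, function(p)
  mean(clus_mat[p[1], ] == clus_mat[p[2], ]))         # share of days spent in the same cluster
data.frame(cat_a = rownames(clus_mat)[pairs[, 1]],
           cat_b = rownames(clus_mat)[pairs[, 2]],
           same_cluster_rate = same_rate)
A pair with same_cluster_rate close to 1 (e.g. cat1/cat3) is consistently co-clustered, while a rate near 0 means the pair is almost always separated.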

HMM text recognition in R with depmixS4

I'm wondering how I would use the depmixS4 package for R to run an HMM on a dataset. What functions would I use to get a classification of a test data set?
I have a file of training data, a file of label data, and a file of test data.
The training data consists of 4620 rows, each with 1079 values. These values are 83 windows with 13 values per window; in other words, each row is made up of 83 states with 13 observations each. Each row of 1079 values is one spoken word, so there are 4620 utterances. In total the data only has 7 distinct words, and each distinct word has 660 different utterances, hence the 4620 rows of words.
So we have words (0-6)
The label file is a list where each row is labeled 0-6, corresponding to which word it is. For example, row 300 is labeled 2, row 450 is labeled 6, and row 520 is labeled 0.
The test file contains about 5000 rows structured exactly like the training data, except there are no labels associated with it.
I want to use an HMM trained on the training data to classify the test data.
How would I use depmixS4 to output a classification of my test data?
I'm looking at :
depmix(response, data=NULL, nstates, transition=~1, family=gaussian(),
prior=~1, initdata=NULL, respstart=NULL, trstart=NULL, instart=NULL,
ntimes=NULL,...)
but I don't know what response refers to or any of the other parameters.
Here's a quick, albeit incomplete, test to get you started, if only to familiarize you with the basic outline. Please note that this is a toy example and it merely scratches the surface of HMM design/analysis. The vignette for the depmixS4 package, for instance, offers quite a lot of context and examples. Meanwhile, here's a brief intro.
Let's say that you wanted to investigate if industrial production offers clues about economic recessions. First, let's load the relevant packages and then download the data from the St. Louis Fed:
library(quantmod)
library(depmixS4)
library(TTR)
fred.tickers <-c("INDPRO")
getSymbols(fred.tickers,src="FRED")
Next, transform the data into rolling 1-year percentage changes to minimize noise, and convert the data into data.frame format for analysis in depmixS4:
indpro.1yr <-na.omit(ROC(INDPRO,12))
indpro.1yr.df <-data.frame(indpro.1yr)
Now, let's run a simple HMM model and choose just 2 states--growth and contraction. Note that we're only using industrial production to search for signals:
model <- depmix(response = INDPRO ~ 1,
                family = gaussian(),
                nstates = 2,
                data = indpro.1yr.df,
                transition = ~1)
Now let's fit the resulting model, generate posterior states
for analysis, and estimate probabilities of recession. Also, we'll bind the data with dates in an xts format for easier viewing/analysis. (Note the use of set.seed(1), which is used to create a replicable starting value to launch the modeling.)
set.seed(1)
model.fit <- fit(model, verbose = FALSE)
model.prob <- posterior(model.fit)
prob.rec <-model.prob[,2]
prob.rec.dates <- xts(prob.rec, order.by=as.Date(index(indpro.1yr)))
Finally, let's review and ideally plot the data:
head(prob.rec.dates)
[,1]
1920-01-01 1.0000000
1920-02-01 1.0000000
1920-03-01 1.0000000
1920-04-01 0.9991880
1920-05-01 0.9999549
1920-06-01 0.9739622
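As a minimal plotting sketch (not part of the original answer), the xts object can be plotted directly:
plot(prob.rec.dates, main = "Estimated probability of the contraction state (INDPRO)")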
High values (>0.80 ??) indicate/suggest that the economy is in recession/contraction.
Again, a very, very basic introduction, perhaps too basic. Hope it helps.

Sum of subvectors

My vector contains the frequency per day of a certain event in a certain month.
I want to see which run of 16 days contains the highest frequency, and I would like to extract the dates which start and end it.
vector=table(date[year(date)==2001&month(date)==05])
I know how to do this, but my method is (obviously) too primitive.
max(c(sum(vector[1:16]),sum(vector[2:17]),sum(vector[3:18]),sum(vector[4:19]),sum(vector[5:20]),sum(vector[6:21]))/sum(vector))
Edit: For reproducibility the data in vector is provided in .csv form below:
"","Var1","Freq"
"1","2001-05-06",1
"2","2001-05-08",1
"3","2001-05-09",7
"4","2001-05-10",2
"5","2001-05-11",10
"6","2001-05-12",10
"7","2001-05-13",7
"8","2001-05-14",20
"9","2001-05-15",24
"10","2001-05-16",15
"11","2001-05-17",27
"12","2001-05-18",17
"13","2001-05-19",13
"14","2001-05-20",15
"15","2001-05-21",13
"16","2001-05-22",26
"17","2001-05-23",17
"18","2001-05-24",19
"19","2001-05-25",7
"20","2001-05-26",5
"21","2001-05-27",6
"22","2001-05-28",2
"23","2001-05-29",1
"24","2001-05-31",1
Assuming the data in vector is as shown in your data example (a data frame with columns Var1 and Freq), something like
library(zoo)  # rollmean() comes from the zoo package
max_start <- which.max(rollmean(vector$Freq, 16, align = "left"))
date_max_start <- vector$Var1[max_start]
date_max_end <- vector$Var1[max_start + 15]  # a 16-row window ends 15 rows after it starts
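If the share of all events captured by that best window is also of interest (as in the max(...)/sum(vector) expression in the question), a hedged follow-up, again using zoo, could be:
max_share <- max(rollsum(vector$Freq, 16, align = "left")) / sum(vector$Freq)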

Using Lm in dlply while sorting by variable

I have the following data, called dataframe:
planid (each plan indicated with a number from 1 till 126)
US_FRAC (a value between 0 and 1 for each fund in each year) and
market.premium (a value indicating the market premium for every fund in every year)
For every planid I want to do a regression where I regress US_FRAC against market.premium, as I have 10 years of data for every planid.
I used the following code:
mods=dlply(dataframe,.('planid'),lm,formula=ADJ_US_FRAC ~ market.premium)
I need both the t-statistic and the coefficient for every planid in a table, but I could only find code for the coefficient. I did something wrong, as I only get an output with one value for an intercept and nothing else.
Removing the quotes around planid (and the ADJ_ prefix before US_FRAC) worked for this sample data:
library(plyr)  # for dlply()/ddply()
dataframe <- data.frame(planid=round(runif(1000)*126), US_FRAC=runif(1000), market.premium=rnorm(1000))
dlply(dataframe,.(planid),lm,formula=US_FRAC ~ market.premium)
summary() performs the coefficient t-tests. You can create data frames with the fits with something like:
C <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['(Intercept)', ]})
Beta <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['market.premium', ]})
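To get both the slope coefficient and its t-statistic in a single table per planid (a sketch under the same assumptions, not part of the original answer):
coef_tbl <- ddply(dataframe, .(planid), function(x) {
  cf <- summary(lm(US_FRAC ~ market.premium, data = x))$coefficients
  data.frame(coef = cf["market.premium", "Estimate"],
             t_stat = cf["market.premium", "t value"])
})
head(coef_tbl)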
Kind regards.

Performance issues in for loop (moving towards vectorization with multiple sample()'s)

I am currently having performance issues with one of my scripts. I made the script as a result of this question, but I have been unable to increase its performance and figured that improving its performance is a different question from actually writing the code.
I wrote the code to generate a dummy webshop dataset with a hidden pattern that can be found with clustering, as an example in one of my courses. It does not, however, allow me to go beyond ~40,000 transactions with a reasonable runtime (i.e. a few hours).
The issue is as follows: using these parameters I will build a transaction/customer/product table:
set.seed(1) # Set seed to make reproducible
Parameters <- data.frame(
CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
PropCustTypes = c(.10, .45, .30, .15), # Probability for being in each group.
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability for each group
ByDirectCustomer = c(0.60, .30, 0.15, 0.05), # of coming through channel X
ByPartnerBlog = c(0.30, .30, 0.35, 0.35), #
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
# Some other parameters for later use.
NumDays = 1000
NumTransactions = 100000 # Note that more than these will be made, it's a starting point (excluding annual growth, weekend increases etc.)
SalesMultiplierWeekends = 1.5 # For example, I want more in weekends
StartDate <- as.Date("2009-01-04")
NumProducts <- 150
AnnualGrowth <- .1 # I also want an annual growth trend
I start with a 'Days' dataframe along with an almost equal division of total transactions over all days.
days <- data.frame( # Define the days
day = StartDate+1:NumDays,
DaysSinceStart = StartDate+1:NumDays - StartDate, # Used to compute a rising trend
CustomerRate = NumTransactions/NumDays)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*SalesMultiplierWeekends)
days$nPurchases <- as.integer(days$nPurchases+days$nPurchases * (days$DaysSinceStart/365)*AnnualGrowth)
Next I generate the transactions using this table:
Transactions <- data.frame(
ID = 1:sum(days$nPurchases),
Date = rep(days$day, times=days$nPurchases),
CustomerType = sample(Parameters$CustomerType, sum(days$nPurchases), replace=TRUE, prob=Parameters$PropCustTypes),
NewCustomer = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
CustomerID = NA, # Will be assigned later, NewCustomer: 0.8 and .2
ProductID = NA, # insinuate new/existing customers above
ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)
Now I'd like to dynamically assign products and customers to each transaction in order to make my pattern recognizable in the transaction dataset. I first make a product table from which I can choose, having convenient release dates so that I will be able to select a product for each transaction based on this info.
StartProductRelease <- StartDate-(365*2*max(Parameters$Timeliness)/12)
ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
ID = 1:NumProducts,
DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
as.numeric(Upper-Lower),
as.numeric(Upper-Lower)/NumProducts))),
SuggestedPrice = rnorm(NumProducts, 100, 50))
Products[Products$SuggestedPrice<10,]$SuggestedPrice <- 15 # Floor product prices below $10 at 15
Next I build a table of customers, deriving from the number of 'new customers' in the transaction dataset.
Customers <- data.frame(
ID=(1:sum(Transactions$NewCustomer)),
CustomerType = sample(Parameters$CustomerType, size=sum(Transactions$NewCustomer),
replace=TRUE, prob=Parameters$PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
I want to dynamically assign Customers and Products to each transaction, sampled from the 'Products' and 'Customers' dataframes in order to maintain the overall parameters I have defined above. I'd like to vectorize this, but I have no idea how to do so (I've already moved as much as I could out of the for loop). The part outside of the for loop:
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
Transactions <- merge(Transactions,Parameters, by="CustomerType") # Parameters are now
Transactions$Discount <- rnorm(length(Transactions$ID), # assigned to each transaction
Transactions$Discount,Transactions$Discount/20)
Transactions$Timeliness <- rnorm(length(Transactions$ID),
Transactions$Timeliness, Transactions$Timeliness/6)
Now the performance issues start to arise, the for loop:
for (i in 1:nrow(Transactions)){
# Only sample customers which share the same 'CustomerType' as the transaction
Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
This leads to my final question: how would I vectorize the last part here? I've been able to munge millions of rows with data.table in seconds; it seems strange that such a relatively simple task runs so slowly.
For loop / filling 100 rows: ~ 18 Seconds
For loop / filling 200 rows: ~ 37 Seconds
For loop / filling 1000 rows: ~ 3 minutes
For loop / filling 300000 rows: No idea, can't get that far?
Why is it running so slow and how can I solve this? Any help is greatly appreciated.
Below is how you would do the first part using data.table, adding CustomerID to the Transactions table. I have changed some names and dropped the placeholder columns as they will be added through the data.table joins.
Tr <- data.table(Transactions)
Tr[, CustomerID:=NULL]
Tr[, ProductID:=NULL]
Tr[, ReferredBy:=NULL] ## see #Arun's comment for a more compact way to do this
Cs <- data.table(Customers)
setnames(Cs, 'ID', 'CustomerID') ## So we avoid duplicate with Tr
## Add customer ID, matching customer types
setkey(Tr, CustomerType)
setkey(Cs, CustomerType)
# Make an index Transaction ID -> Customer ID
# Large interim matrix should not be formed, but I am not sure
TrID2CustID <- Cs[Tr, allow.cartesian=T][, list(CustomerID=sample(CustomerID, 1)), by=ID]
setkey(TrID2CustID, ID)
setkey(Tr, ID)
Tr <- Tr[TrID2CustID]
There is a large matrix that is the cartesian product of your Transactions and Customers tables (about 15M rows) which would exhaust the memory if it is explicitly computed. Judging by the fact that this takes about a second, I'd say it is not computed, but I am not sure.
I will work on the rest and edit the answer if I come up with the solutions quickly, but this ought to show you how to do this using data.table.
UPDATE 1: adding ReferredBy
Since the referral probabilities only vary by CustomerType, you can generate the referrals in blocks with replacement (much faster than by individual ID)
setkey(Tr, CustomerType)
Tr[, ReferredBy:=sample(ReferredByOptions, replace=TRUE, size=.N,
prob=c(BySearchEngine[1],
ByDirectCustomer[1],
ByPartnerBlog[1])),
by=CustomerType]
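A quick hedged check (not part of the original answer) could compare the realized referral shares per CustomerType with the probabilities in Parameters:
## Count referrals per type, then convert counts to within-type shares
Tr[, .N, by = .(CustomerType, ReferredBy)][, share := N / sum(N), by = CustomerType][order(CustomerType)]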
UPDATE 2: adding ProductID
This is proving trickier to do in a neat cartesian-product sort of way. I cannot think of an elegant way to generate the 31 dates (-15:15) for each purchase (melted matrix would probably be too big). The code below works as intended but is not as fast as the previous 2:
Pr <- data.table(Products)
setnames(Pr, 'ID', 'ProductID') ## not necessary here, but good practice
setkey(Tr, ID)  ## key (and reorder) Tr by ID first so that CenteredAround[.I] lines up with the rows of Tr
CenteredAround <- as.Date(Tr$Date - 30*Tr$Timeliness)
Tr[, ProductID:=sample(Pr[abs(Pr$DateReleased -
CenteredAround[.I]) <= 15, ProductID], 1), by=ID]
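As a hedged sanity check (not in the original answer), one could verify that each sampled product was indeed released within 15 days of its transaction's target date:
rel <- Pr$DateReleased[match(Tr$ProductID, Pr$ProductID)]
all(abs(as.numeric(rel - CenteredAround)) <= 15)   # should be TRUE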
A very simple optimization is to avoid modifying the data frame in the loop, as others have suggested. At least prior to R3.1, modifying a data frame is really expensive, so that's the last thing you want to be doing in a loop. Also, based on Hadley's comments and release notes for R3.1, it may be the case that modifying data frames is not as expensive with R3.1, but I haven't tested.
Here we get around the data frame modification by storing interim results in vectors, and then only inserting into the data frame after the loop. Consider:
system.time({
custId <- Transactions$CustomerID
refBy <- Transactions$ReferredBy
productID <- Transactions$ProductID
for (i in 1:100){
# Only sample customers which share the same 'CustomerType' as the transaction
custId[i] <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
1,replace=FALSE)
# Sample the 'ReferredBy' based upon the proportions described in 'Parameters'
refBy[i] <- sample(ReferredByOptions,1,replace=FALSE,
prob=Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
# Only sample products in the required range to maintain the 'timeliness' parameter.
CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
productID[i] <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Transactions$CustomerID <- custId
Transactions$ReferredBy <- refBy
Transactions$ProductID <- productID
})
Which times in at:
user system elapsed
0.66 0.06 0.71
The corresponding time with your original code is:
user system elapsed
5.01 1.78 6.79
So close to a 10x improvement with a minor change (avoiding modifying the data frame repeatedly).
I'm sure you can get further improvements, but this is a real low hanging fruit you can easily implement.
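One further hedged tweak (not from the original answer): precompute the per-type pools of customer IDs once with split(), so the loop no longer has to subset the Customers data frame on every iteration:
id_pool <- split(Customers$ID, Customers$CustomerType)  # named list, one ID vector per CustomerType
for (i in 1:100) {
  custId[i] <- sample(id_pool[[Transactions$CustomerType[i]]], 1)  # mirrors the custId line above
}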
