Sampling in stages in R

I am running some sampling simulations from census data and I would like to sample in 2 stages.
First I want to sample 25 households within each village.
Second I want to sample 1 person from each household.
My data is in long format, with a village identifier, a household identifier, and a binary disease status (0 = healthy, 1 = diseased). The following code runs a Monte Carlo simulation that samples 25 individuals per village 3000 times and records the number of malaria-positive individuals sampled.
However, I would like to sample 1 individual from each of 25 sampled households in each village, and I can't figure out how to do that.
Here is the link to my data:
d = read.table("data.txt", sep = ",", header = TRUE)

# one vector of malaria statuses per village
villages = split(d$malaria, d$villageid)

positives = vector("list", 3000)
for (i in 1:3000) {
  sampled = lapply(villages, sample, 25)     # 25 individuals per village
  positives[[i]] = lapply(sampled, sum)      # number of positives per village
}

How about this?
replicate(3000, sum(sapply(lapply(villages, sample, 25), sample, 1)))
lapply(villages, sample, 25)  -> gives 25 households for each of the 177 villages
sapply(., sample, 1)          -> samples 1 person from those 25 in each of the 177 villages
sum(.)                        -> sums the sampled values
replicate(3000, .)            -> repeats the same computation 3000 times

I figured out a workaround. It is quite convoluted and involves creating a second dataset from the original data (I did this in Stata, as my R capabilities are limited). First I sort the dataset by house number and load it into R (d.people). Then I create a new dataset by collapsing the old one by house number and load that into R (d.houses). The sampling is then done in 2 stages: first I sample 1 person from each household in the people dataset; after combining the houses dataset with that output, I can then sample 25 "household-sampled people" from each village.
d.people = read.table("people data", sep = ",", header = TRUE)
d.houses = read.table("houses data", sep = ",", header = TRUE)

positives = vector("list", 3000)
for (i in 1:3000) {
  houses = split(d.people$malaria, d.people$house)
  firststage = sapply(houses, sample, 1)               # stage 1: one person per household
  secondstage = cbind(d.houses, firststage)
  villages = split(secondstage$firststage, secondstage$village)
  sampled = lapply(villages, sample, 25)               # stage 2: 25 households per village
  positives[[i]] = lapply(sampled, sum)
}
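For reference, the same two-stage scheme can be sketched entirely in R from the original long-format data (untested; it assumes d has the villageid, house and malaria columns described above, and that every village has at least 25 households):

sample_one <- function(x) x[sample(length(x), 1)]   # safe even when a household has a single member

positives <- vector("list", 3000)
for (i in 1:3000) {
  # stage 1: sample one person's malaria status per household
  onePerHouse  <- tapply(d$malaria, d$house, sample_one)
  # village of each household (first record per household)
  houseVillage <- tapply(d$villageid, d$house, function(v) v[1])
  # stage 2: sample 25 household representatives per village and count positives
  byVillage    <- split(onePerHouse, houseVillage)
  positives[[i]] <- sapply(lapply(byVillage, sample, 25), sum)
}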

Related

Analysis to identify similar occupations by frequency of skills requested in job postings (in R)

I have access to a dataset of job postings, which for each posting has a unique posting ID, the job posting occupation, and a row for each skill requested in each job posting.
The dataset looks a bit like this:
posting_id  occ_code  occname         skillname
1           1         data scientist  analysis
1           1         data scientist  python
2           2         lecturer        teaching
2           2         lecturer        economics
3           3         biologist       research
3           3         biologist       biology
1           1         data scientist  research
1           1         data scientist  R
I'd like to perform analysis in R to identify "close" occupations by how similar their overall skill demand is in job postings. E.g. if many of the top 10 in-demand skills for financial analysts matched some of the top 10 in-demand skills for data scientists, those could be considered closely related occupations.
To be clearer, I want to identify similar occupations by their overall skill demand in the postings, i.e. by summing the number of times each skill is requested for an occupation and then identifying which other occupations have similar frequently requested skills.
I am fairly new to R so would appreciate any help!
I think you might want an unsupervised clustering strategy; see the help page for hclust for a worked example. The code below is untested.
# Load necessary libraries
library(tidyverse)
library(reshape2)

# Read in the data
data <- read.csv("path/to/your/data.csv")

# Count how many times each skill is requested for each occupation
skill_counts <- data %>%
  group_by(occname, skillname) %>%
  summarise(count = n(), .groups = "drop")

# Get the top 10 in-demand skills for each occupation
top_10_skills <- skill_counts %>%
  group_by(occname) %>%
  top_n(10, count) %>%
  ungroup()

# Convert to an occupation-by-skill matrix for clustering (missing skills become 0)
skill_matrix <- dcast(top_10_skills, occname ~ skillname, value.var = "count", fill = 0)
rownames(skill_matrix) <- skill_matrix$occname
skill_matrix$occname <- NULL

# Cluster the occupations (rows) by their skill profiles
fit <- hclust(dist(skill_matrix), method = "ward.D2")

# Plot the dendrogram
plot(fit, hang = -1, main = "Occupation Clustering")
The resulting dendrogram shows the relationships between the occupations based on their skill demand: closely related occupations are grouped together, while more distantly related occupations join only at greater heights.
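If you would rather have discrete groups than read them off the dendrogram, cutree() can cut the tree into a chosen number of clusters (the k = 5 below is arbitrary):

groups <- cutree(fit, k = 5)      # assign each occupation to one of 5 clusters
split(names(groups), groups)      # list the occupations in each cluster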

Backtesting in R for time series

I am new to the backtesting methodology (an algorithm to assess whether something works based on historical data), so I am trying to keep things simple in order to understand it. So far I have understood the following: let's say I have a time series data set:
date = seq(as.Date("2000/1/1"),as.Date("2001/1/31"), by = "day")
n = length(date);n
class(date)
y = rnorm(n)
data = data.frame(date,y)
I will keep the first 365 days as the in-sample period, do something with them, and then update them with one observation at a time for the next month. Am I correct here?
If so, I define the in-sample and out-of-sample periods:
T = dim(data)[1];T
outofsampleperiod = 31
initialsample = T-outofsampleperiod
For example, I want to find the empirical quantile at alpha = 0.01:
pre = data[1:initialsample,]
ypre = pre$y
quantile(ypre,0.01)
1%
-2.50478
Now the difficult part for me is doing the updating in a for loop in R.
Each time I want to add one observation, find the empirical quantile at alpha = 0.01 again, print them all, and check the condition of whether it is greater than the in-sample quantile obtained previously.
for (i in 1:outofsampleperiod) {
  qnew = quantile(data$y[1:(initialsample + i - 1)], 0.01)   # quantile of the y values, not of the row indices
  print(qnew)
}
You can create a little function that gets the quantile of column y, over rows 1 to i of a frame df like this:
func <- function(i,df) quantile(df[1:i,"y"],.01)
Then apply this function to each row of data
data$qnew = lapply(1:nrow(data),func,df=data)
Output (last six rows)
> tail(data)
date y qnew
392 2001-01-26 1.3505147 -2.253655
393 2001-01-27 -0.5096840 -2.253337
394 2001-01-28 -0.6865489 -2.253019
395 2001-01-29 1.0881961 -2.252701
396 2001-01-30 0.1754646 -2.252383
397 2001-01-31 0.5929567 -2.252065
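For the comparison step described in the question (checking each new out-of-sample observation against the quantile estimated from all data seen so far), a minimal sketch reusing initialsample and outofsampleperiod from above could be:

exceed <- logical(outofsampleperiod)
for (i in 1:outofsampleperiod) {
  window    <- data$y[1:(initialsample + i - 1)]   # all observations up to the previous day
  qnew      <- quantile(window, 0.01)              # expanding-window 1% quantile
  exceed[i] <- data$y[initialsample + i] > qnew    # is the new observation above it?
}
exceed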

How to alter R code in an efficient way

I have a sample created as follows:
survival1a <- data.frame(matrix(vector(), 50, 2, dimnames = list(c(), c("Id", "district"))),
                         stringsAsFactors = FALSE)
survival1a$Id <- 1:nrow(survival1a)
survival1a$district <- sample(1:4, size = 50, replace = TRUE)
This sample has 50 individuals from 4 different districts.
I have a matrix of probabilities that shows the likelihood of migration from one district to another (Migdata), as follows:
district   prob1     prob2     prob3     prob4
1          0.83790   0.08674   0.05524   0.02014
2          0.02184   0.88260   0.03368   0.06191
3          0.01093   0.03565   0.91000   0.04344
4          0.03338   0.06933   0.03644   0.86090
I merge these probabilities with my data with this code:
survival1a <- merge(Migdata, survival1a, by.x = "district", by.y = "district")
I would like to know which district each person resides in by the end of the year, based on the migration probabilities I have (Migdata).
I have already written code that works, but with big data it is very time-consuming since it is based on a loop:
for (k in 1:nrow(survival1a)) {
  survival1a$migration[k] <- sample(1:4, size = 1, replace = TRUE, prob = survival1a[k, 2:5])
}
Now I want to write the code in a way that does not rely on a loop and shows each person's district at the end of the year.
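One possible loop-free approach (an untested sketch assuming, as in the code above, that columns 2:5 of survival1a hold prob1 to prob4 after the merge) is to draw one uniform number per person and pick the destination district by comparing it against the cumulative migration probabilities of that person's current district:

probs <- as.matrix(survival1a[, 2:5])        # per-person migration probabilities
cum   <- t(apply(probs, 1, cumsum))          # row-wise cumulative probabilities
u     <- runif(nrow(survival1a))             # one uniform draw per person
survival1a$migration <- max.col(u <= cum, ties.method = "first")   # first district whose cumulative probability reaches u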

Time series with multiple stores multiple products

I have 300 stores and 20,000 products in my data set, and I want to produce a sales forecast for the next 3 months for each outlet at the product level. My data frame, which I pull from SQL Server 2016, looks like this (sample):
Date      outlet  produ  price
2019-Jan  A       W      10
2019-Feb  A       R      20
2019-Feb  A       W      15
2019-Jan  B       W      30
2019-Jan  B       F      40
2019-Feb  B       W      40
What I tried is to take the entire time series for a single product, fit a model to it, and get the output:
library(forecast)    # auto.arima(), forecast(), autoplot()
library(ggplot2)

## get the data set like this
outlet <- c('A', 'A', 'B', 'B')
produ  <- c('W', 'R', 'W', 'F')
price  <- c(10, 20, 30, 40)
df <- data.frame(outlet, produ, price)

## tried to get a single product
dpSingle <- dplyr::filter(df, produ == 'W')

## Quntity, year and month are not defined in the sample above
data.ts <- ts(Quntity, start = c(year, month), frequency = 12)
fit_arima <- auto.arima(data.ts, d = 1, D = 1, stepwise = FALSE,
                        approximation = FALSE, trace = TRUE)
fcast <- forecast(fit_arima, h = 24)
autoplot(fcast) + ggtitle("Forecast for the next 24 months") +
  ylab("quantity") + xlab("Time in days")
print(exp(fcast$mean))
But what I want is to loop through the data frame, identify the outlet first and then the product, take those particular observations with their features, pass them to my time series model, and get predictions individually for each outlet and each product.
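A minimal sketch of that per-outlet, per-product loop (untested; it assumes a data frame named sales with columns Date, outlet, produ and a numeric quantity column called qty, since the sample above has no quantity column):

library(forecast)

# sales, qty: placeholder names, adjust to the real table pulled from SQL Server
groups <- split(sales, list(sales$outlet, sales$produ), drop = TRUE)

fcasts <- lapply(groups, function(g) {
  g <- g[order(g$Date), ]                  # make sure each series is in time order
  y <- ts(g$qty, frequency = 12)           # monthly series for this outlet/product
  forecast(auto.arima(y), h = 3)           # forecast the next 3 months
})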

R: Percentile calculations on subsets of data

I have a data set which contains an rscore plus the following identifiers: gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores across all time spans (~1500) for a given gvkey, and then calculate percentile ranks within a given time span and sic2, by gvkey.
Calculating the percentiles across all time spans is a fairly quick process; however, once I add in the sic2 percentile ranks it becomes fairly slow, and we are likely looking at about 65,000 subsets in total. I'm wondering if there is a possibility of speeding up this process.
The data for one time span looks like the following:
gvkey sic2 cdom rscoreSum pct
1187 10 USA 8.00E-02 0.942268617
1265 10 USA -1.98E-01 0.142334654
1266 10 USA 4.97E-02 0.88565478
1464 10 USA -1.56E-02 0.445748247
1484 10 USA 1.40E-01 0.979807985
1856 10 USA -2.23E-02 0.398252565
1867 10 USA 4.69E-02 0.8791019
2047 10 USA -5.00E-02 0.286701209
2099 10 USA -1.78E-02 0.430915371
2127 10 USA -4.24E-02 0.309255308
2187 10 USA 5.07E-02 0.893020421
The code to calculate the industry ranks is below, and fairly straightforward.
library(plyr)

# generate percentile ranks within 2-digit industry SICs
dout <- ddply(dfSum, .(sic2), function(x) {
  indPct <- rank(x$rscoreSum) / nrow(x)
  gvkey <- x$gvkey
  data.frame(gvkey, indPct)
})

# merge 2-digit industry SIC percentile ranks with market percentile ranks
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'
Any suggestions to speed up the process would be appreciated!
You might try the data.table package for fast operations across relatively large datasets like yours. For example, my machine has no problem working through this:
library(data.table)

# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey = seq_len(n.rows),
                    sic2 = sample.int(n.sic2, n.rows, replace = TRUE),
                    cdom = "USA",
                    rscoreSum = rnorm(n.rows))

# Now make your dataset into a data.table
dfSum <- data.table(dfSum)

# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct := rank(rscoreSum) / length(rscoreSum), by = "sic2"]
whereas the plyr equivalent takes a while.
If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr", with support for faster data stores in the backend.
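For comparison, a dplyr sketch of the same per-sic2 percentile rank, using the column names from the example above, might look like this:

library(dplyr)

dfSum <- dfSum %>%
  group_by(sic2) %>%
  mutate(indPct = rank(rscoreSum) / n()) %>%
  ungroup()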
