I've been assigned to create a dataset of simulated patient data in R for an assignment. We've been provided variable names and that's it. I want to take a random sample of 100 rows and use set.seed() to make it reproducible, but when I ran the code I originally got different sample values each time I re-opened the script, and now I just get error messages and it won't run at all.
This is what I have:
pulse_data <- data.frame(
group = c(rep("control", "treatment")),
age = sample(c(20:75)),
gender = c(rep("male", "female")),
resting_pulse = sample(c(40:120)),
height_cm = sample(c(140:220))
)
set.seed(30)
pulse_sim <- sample_n(pulse_data, 100, replace = FALSE)
am I missing something fundamental?!
(total beginner, speak to me like an idiot and I might understand :) )
I've tried calling sample_n() straight on the data frame, with set.seed() before it, and putting set.seed() inside the pulse_sim line, but to no avail... and as for why I get errors now, I'm at my wits' end.
Realize that pulse_data is created from random draws, so each time the script runs you get different data. Only after creating it do you set the random seed, so sample_n() picks the same rows it did the last time you opened the script, but... the rows themselves hold different data. SOLUTION: set the random seed before you define pulse_data. As for the new errors: rep()'s second argument must be a count, so rep("control", "treatment") fails with "invalid 'times' argument", and data.frame() needs columns of compatible lengths, while your sample() calls return vectors of 56, 81 and 81 values.
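A minimal sketch of those two failure points (the length-6 value below is only for illustration):
# rep()'s 'times'/'length.out' argument must be numeric; to alternate two
# labels, pass them as one vector and state the desired length:
rep(c("control", "treatment"), length.out = 6)
# data.frame() needs equal-length columns; these ranges differ in length,
# which is why the corrected code below passes size= to every sample():
length(20:75); length(40:120); length(140:220)   # 56, 81, 81
With that in mind, here is a corrected script: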
library(dplyr)   # for sample_n()
set.seed(30)
pulse_data <- data.frame(
  group = rep(c("control", "treatment"), length.out = 30),
  age = sample(c(20:75), size = 30),
  gender = rep(c("male", "female"), length.out = 30),
  resting_pulse = sample(c(40:120), size = 30),
  height_cm = sample(c(140:220), size = 30)
)
pulse_sim <- sample_n(pulse_data, 10, replace = FALSE)
I have put that code, plus a bare pulse_sim line (to print it), in a file 74408236.R. (Note that I added set.seed() and length.out, and changed your sample size from 100 to 10, for the sake of this demonstration.) I can run it with this shell command (not in R):
$ Rscript.exe 74408236.R
group age gender resting_pulse height_cm
1 treatment 28 female 76 210
2 treatment 24 female 118 140
3 control 44 male 57 141
4 control 70 male 96 184
5 treatment 22 female 87 177
6 control 30 male 50 168
7 control 39 male 56 145
8 treatment 37 female 120 182
9 treatment 20 female 79 181
10 treatment 75 female 98 186
When I run it a few times in a row, I get the same output. For brevity, I'll demonstrate the sameness by showing its MD5 checksum; while MD5 is not the most secure hash (cryptographically), it is an easy way to show that the outputs are identical. (This is shell scripting, still not in R.)
$ for rep in $(seq 1 5) ; do Rscript.exe 74408236.R | md5sum; done
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
0f06ecd84c1b65d6d5e4ee36dea76add -
In fact, if I repeat it 100 times, I still see no change. I'll pipe through uniq -c to replace repeated output with the count (first number) and the output (everything else, the checksum).
$ for rep in $(seq 1 100) ; do /mnt/c/R/R-4.1.2/bin/Rscript.exe 74408236.R | md5sum; done | uniq -c
100 0f06ecd84c1b65d6d5e4ee36dea76add -
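The same reproducibility check can be done entirely inside R, without the shell; a minimal sketch:
# Re-seeding before each draw reproduces the draw exactly
set.seed(30); s1 <- sample(1:100, 10)
set.seed(30); s2 <- sample(1:100, 10)
identical(s1, s2)   # TRUE: same seed, same sample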
I created a contingency table from the Titanic passenger data by hypergeometric sampling, meaning that both sets of marginal totals are preset and equal. It was built by crossing the Sex and Survived columns of 328 cases (164 men and 164 women). This is the code:
First, I ungrouped the data and dropped the columns I don't need:
titanic = as.data.frame(Titanic)
titanic = titanic[rep(1:nrow(titanic),titanic$Freq),]
titanic = titanic[,c(2,4)]
Then I selected a sample of men:
men = subset(titanic, titanic$Sex == 'Male')
men = men [sample(nrow(men),164), ]
table(men$Sex, men$Survived)
# No Yes
# Male 133 31
# Female 0 0
Now the women's row must be filled in with the complementary values:
n = summary.factor(men$Survived)
womenYes = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='Yes'))
womenYes = subset(womenYes[1:n[1], ])
womenNo = subset(titanic, (titanic$Sex == 'Female' & titanic$Survived=='No'))
womenNo = subset(womenNo[1:n[2], ])
women = merge(womenYes, womenNo, all = TRUE)
hyperSample = merge(men, women, all = TRUE)
table(hyperSample$Sex, hyperSample$Survived)
# No Yes
# Male 133 31
# Female 31 133
It works, but it looks a bit ugly, and I suspect someone could find a much more elegant or efficient way to do it. Thanks.
You can sample in two stages, both using rhyper. First, determine the numbers of men and women, subject only to sampling 328 in total and assuming the population is sex-distributed as in the original data; this is what you might do if you were bootstrapping a statistic such as a rate ratio. Then use rhyper twice more to determine the numbers of survivors, subject to the survival probabilities in the original sample's rows.
MFmat <- apply(Titanic, c(2, 4), sum)
nMale <- rhyper(1, rowSums(MFmat)[1], rowSums(MFmat)[2], 328)
#[1] 262
nFemale <- 328 - nMale
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----
#         Survived
# Sex       No Yes
#   Male   223  42
#   Female  22  41
I suppose you could sample the two rows separately, and you should be able to reuse the logic above, if that is what you have decided to do. Which way is more appropriate will depend on the underlying problem.
# Fixed row marginals....
nMale <-164
nFemale <- 164
DMale <- rhyper(1, MFmat[1,1], MFmat[1,2], nMale)
SurvMale = nMale-DMale
DFemale = rhyper(1, MFmat[2,1], MFmat[2,2], nFemale)
SurvFemale = nFemale - DFemale
matrix( c( DMale, DFemale, SurvMale, SurvFemale), ncol=2,
dimnames=dimnames(MFmat) )
#----------------
#         Survived
# Sex       No Yes
#   Male   127  37
#   Female  39 125
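As an aside, if the goal is simply a random table with both margins fixed at 164 under the independence (hypergeometric) null, base R's r2dtable() does it in one call. Note this conditions on both margins, so it is not quite the same sampling scheme as either construction above; a sketch:
# r2dtable(n, r, c) draws n random two-way tables with row totals r and
# column totals c from the conditional (hypergeometric) distribution
tab <- r2dtable(1, r = c(164, 164), c = c(164, 164))[[1]]
dimnames(tab) <- list(Sex = c("Male", "Female"), Survived = c("No", "Yes"))
tab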
I have data on college course completions, with estimated numbers of students from each cohort completing after 1, 2, 3, ... 7 years. I want to use these estimates to calculate the total number of students outputting from each College and Course in any year.
The output of students in a given year will be the sum of the previous 7 cohorts outputting after 1, 2, 3, ... 7 years.
For example, the number of students outputting in 2014 from COLLEGE 1, COURSE A is equal to the sum of:
Output of 2013 cohort (College 1, Course A) after 1 year +
Output of 2012 cohort (College 1, Course A) after 2 years +
Output of 2011 cohort (College 1, Course A) after 3 years +
Output of 2010 cohort (College 1, Course A) after 4 years +
Output of 2009 cohort (College 1, Course A) after 5 years +
Output of 2008 cohort (College 1, Course A) after 6 years +
Output of 2007 cohort (College 1, Course A) after 7 years
So there are two dataframes: a lookup table that contains all the output estimates, and a smaller summary table that I'm trying to modify. I want to update dummy.summary$output with, for each row, the total output based on the above calculation.
The following code will replicate my data pretty well
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
# Summary table to be modified
dummy.summary <- aggregate(x = dummy.lookup["intake"], by = list(dummy.lookup$cohort, dummy.lookup$college, dummy.lookup$course), FUN = mean)
names(dummy.summary)[1:3] <- c("year", "college", "course")
dummy.summary <- dummy.summary[order(dummy.summary$year, dummy.summary$college, dummy.summary$course), ]
dummy.summary$output <- 0
The following code does not work, but shows the approach I've been attempting.
dummy.summary$output <- sapply(dummy.summary$output, function(x){
# empty vector to fill with output values
vec <- c()
# Find relevant output for college + course, from each cohort and exit year
for(j in 1:7){
append(x = vec,
values = dummy.lookup[dummy.lookup$college==dummy.summary[x, "college"] &
dummy.lookup$course==dummy.summary[x, "course"] &
dummy.lookup$cohort==dummy.summary[x, "year"]-j &
dummy.lookup$output.year==j, "output"])
}
# Sum and return total output
sum_vec <- sum(vec)
return(sum_vec)
}
)
I guess it doesn't work because I was hoping to use 'x' in the anonymous function to index particular rows of the dummy.summary data frame. But that clearly isn't happening: it only returns zero for each row, presumably because the starting value of 'x' is zero each time. I don't know whether it is possible to access the index position of each value that sapply loops over and use that to index my summary data frame.
Is this approach fixable or do I need a completely different approach?
Even if it is fixable, is there a more elegant/faster way to achieve what I'm trying to do?
Thanks in anticipation.
I've updated your output.year to output.year2, which, instead of a value from 1 to 7, holds the actual calendar year implied by the cohort.
I've realised that the output information you want corresponds to the output.year, while the intake information corresponds to the cohort. So I calculate them separately and then join the tables. This automatically creates empty output info (NA, which I convert to 0) for 1998.
# fix your random sampling
set.seed(24)
# Lookup table
dummy.lookup <- data.frame(cohort = rep(1998:2014, each = 210),
college = rep(rep(paste("College", 1:6), each = 35), 17),
course = rep(rep(paste("Course", LETTERS[1:5]), each = 7),102),
intake = rep(sample(x = 150:300, size = 510, replace=TRUE), each = 7),
output.year = rep(1:7, 510),
output = sample(x = 10:20, size = 3570, replace=TRUE))
library(dplyr)
# create result table for output info
dt_output =
dummy.lookup %>%
mutate(output.year2 = output.year+cohort) %>% # update output.year to get a year value
group_by(output.year2, college, course) %>% # for each output year, college, course
summarise(SumOutput = sum(output)) %>% # calculate sum of output
ungroup() %>%
arrange(college,course,output.year2) %>% # for visualisation purposes
rename(cohort = output.year2) # rename column
# create result for intake info
dt_intake =
dummy.lookup %>%
select(cohort, college, course, intake) %>% # select useful columns
distinct() # keep distinct rows/values
# join info
dt_intake %>%
full_join(dt_output, by=c("cohort","college","course")) %>%
mutate(SumOutput = ifelse(is.na(SumOutput),0,SumOutput)) %>%
arrange(college,course,cohort) %>% # for visualisation purposes
tbl_df() # for printing purposes
# Source: local data frame [720 x 5]
#
# cohort college course intake SumOutput
# (int) (fctr) (fctr) (int) (dbl)
# 1 1998 College 1 Course A 194 0
# 2 1999 College 1 Course A 198 11
# 3 2000 College 1 Course A 223 29
# 4 2001 College 1 Course A 198 45
# 5 2002 College 1 Course A 289 62
# 6 2003 College 1 Course A 163 78
# 7 2004 College 1 Course A 211 74
# 8 2005 College 1 Course A 181 108
# 9 2006 College 1 Course A 277 101
# 10 2007 College 1 Course A 157 109
# .. ... ... ... ... ...
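For completeness, the original sapply() approach from the question is also fixable: iterate over row indices rather than over the zeroed output values, and accumulate a total instead of discarding append()'s return value. A sketch, using dummy.lookup and dummy.summary as defined in the question:
# Loop over row indices of dummy.summary; append() returns a new vector
# rather than modifying vec in place, so accumulate a running sum instead
dummy.summary$output <- sapply(seq_len(nrow(dummy.summary)), function(i) {
  total <- 0
  for (j in 1:7) {
    total <- total + sum(dummy.lookup$output[
      dummy.lookup$college     == dummy.summary$college[i] &
      dummy.lookup$course      == dummy.summary$course[i] &
      dummy.lookup$cohort      == dummy.summary$year[i] - j &
      dummy.lookup$output.year == j])
  }
  total
})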
I have an R data frame describing the evolution of the sales of a product in approx. 2000 shops on a quarterly basis, with 5 columns (i.e. 5 periods of time). I'd like to know how to analyse it with R.
I've already tried some basic analysis: determining the average sales for the 1st period, the 2nd period, etc., and then comparing the evolution of each shop to this general evolution. For instance, if there is a total of 55,000 sales for the 1st period and 35,000 for the 5th, I assume that each shop's normal sales in the 5th period are 35/55 = 0.63 times its 1st-period sales: if shop X sold 100 items in the first period, it should normally sell about 63 items in the 5th period.
Obviously, this is an easy-to-do method, but it is not statistically rigorous.
I would like a method to fit a trend curve that minimizes the residual sum of squares. My objective is to analyse the sales of the shops after neutralizing the general trend: I'd like to know precisely which shops are underperforming and which are overperforming, with a statistically sound approach.
My dataframe is structured in this way :
shopID | sum | qt1 | qt2 | qt3 | qt4 | qt5
000001 | 150 | 45 | 15 | 40 | 25 | 25
000002 | 100 | 20 | 20 | 20 | 20 | 20
000003 | 500 | 200 | 0 | 100 | 100 | 100
... (2200 rows)
I've tried to put my time series into a ts object, which seemed to work, with the following code:
reversesales=t(data.frame(sales$qt1,sales$qt2,sales$qt3,sales$qt4,sales$qt5))
# I reverse rows and columns of the frame in order that the time periods be the rows
timeser<-ts(reversesales,start=1,end=5, deltat=1/4)
# deltat=1/4 because it is a quarterly basis, 1 and 5 because I have 5 quarters
Still, I am unable to do anything with this variable. I can't plot it with the plot function: since there are 2200 rows, R tries to draw 2200 successive plots, which is obviously not what I want.
In addition, I don't know how to determine the theoretical trend and the theoretical value of the sales for each period for each shop...
Thank you for your help! (and merry Christmas)
An implementation of a mixed model:
install.packages("nlme")
library("nlme")
library(dplyr)
# Generating some data with a structure like yours:
start <- round(sample(10:100, 50, replace = TRUE)*runif(50))
df <- data_frame(shopID = 1:50,
                 qt1 = start,
                 qt2 = round(qt1*runif(50, .5, 2)),
                 qt3 = round(qt2*runif(50, .5, 2)),
                 qt4 = round(qt3*runif(50, .5, 2)),
                 qt5 = round(qt4*runif(50, .5, 2)))
df <- as.data.frame(df)
# Converting in into the long format:
df <- reshape(df, idvar = "shopID", varying = names(df)[-1], direction = "long", sep = "")
Estimating the model:
mod <- lme(qt ~ time, random = ~ time | shopID, data = df)
# Extract the random effects for comparison:
random.effects(mod)
(Intercept) time
1 74.0790805 3.7034172
2 7.8713699 4.2138001
3 -8.0670810 -5.8754060
4 -16.5114428 16.4920663
5 -16.7098229 6.4685228
6 -11.9630688 -8.0411504
7 -12.9669777 21.3071366
8 -24.1099280 32.9274361
9 8.5107335 -9.7976905
10 -13.2707679 -6.6028927
11 3.6206163 -4.1017784
12 21.2342886 -6.7120725
13 -14.6489512 11.6847109
14 -14.7291647 2.1365768
15 10.6791941 3.2097199
16 -14.1524187 -1.6933291
17 5.2120647 8.0119320
18 -2.5172933 -6.5011416
19 -9.0094366 -5.6031271
20 1.4857512 -5.9913865
21 -16.5973442 3.5164298
22 -26.7724763 27.9264081
23 49.0764631 -12.9800871
24 -0.1512509 2.3589947
25 15.7723150 -7.9295698
26 2.1955489 11.0318875
27 -8.0890346 -5.4145977
28 0.1338790 -8.3551182
29 9.7113758 -9.5799588
30 -6.0257683 42.3140432
31 -15.7655545 -8.6226255
32 -4.1450984 18.7995079
33 4.1510104 -1.6384103
34 2.5107652 -2.0871890
35 -23.8640815 7.6680185
36 -10.8228653 -7.7370976
37 -14.1253093 -8.1738468
38 42.4114024 -9.0436585
39 -10.7453627 2.4590883
40 -12.0947901 -5.2763010
41 -7.6578305 -7.9630013
42 -14.9985612 -0.4848326
43 -13.4081771 -7.2655456
44 -11.5646620 -7.5365387
45 6.9116844 -10.5200339
46 70.7785492 -11.5522014
47 -7.3556367 -8.3946072
48 27.3830419 -6.9049164
49 14.3188079 -9.9334156
50 -15.2077850 -7.9161690
I would interpret the values as deviations from zero: positive values are positive deviations from the average shop, and negative values are negative deviations. The averages of the two columns are zero, as checked below:
round(apply(random.effects(mod), 2, mean))
(Intercept) time
0 0
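Since the question asks which shops under- or over-perform, you can rank the shops by their random slopes (their trend relative to the average trend); a short sketch using the fitted mod from above:
re <- random.effects(mod)
head(re[order(re$time), ])    # most strongly under-performing shops
head(re[order(-re$time), ])   # most strongly over-performing shops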
library(zoo)
#Reconstructing the data with four quarter columns (instead of five quarters as in your example)
shopID <- c(1, 2, 3, 4, 5)
sum <- c(150, 100, 500, 350, 50)
qt1 <- c(40, 10, 130, 50, 10)
qt2 <- c(40, 40, 110, 100, 15)
qt3 <- c(50, 30, 140, 150, 10)
qt4 <- c(20, 20, 120, 50, 15)
myDF <- data.frame(shopID, sum, qt1, qt2, qt3, qt4)
#The ts() function converts a numeric vector into an R time series object
ts1 <- ts(as.numeric((myDF[1,3:6])), frequency=4)
ts2 <- ts(as.numeric((myDF[2,3:6])), frequency=4)
ts3 <- ts(as.numeric((myDF[3,3:6])), frequency=4)
ts4 <- ts(as.numeric((myDF[4,3:6])), frequency=4)
ts5 <- ts(as.numeric((myDF[5,3:6])), frequency=4)
#Merge time series objects
tsm <- merge(a = as.zoo(ts1), b = as.zoo(ts2), c = as.zoo(ts3), d = as.zoo(ts4), e = as.zoo(ts5))
#Plotting the Time Series
plot.ts(tsm, plot.type = "single", lty = 1:5, xlab = "Time", ylab = "Sales")
The code is not optimized and can be improved, but I hope this gives some direction.
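For the full data frame with ~2200 shops, a single-panel overview does not need one ts object per shop; a sketch using base matplot() with the example myDF from above (columns 3:6 hold the quarters):
# Each column of t(myDF[, 3:6]) holds one shop's quarterly series, so
# matplot() draws every shop as a semi-transparent line in a single panel
matplot(t(myDF[, 3:6]), type = "l", lty = 1, col = rgb(0, 0, 0, 0.3),
        xlab = "Quarter", ylab = "Sales")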
I have created a barplot of Age vs. Population size (by gender) from Census data in ggplot2. Similarly, I have used the fitdist function from the fitdistrplus package to derive Weibull parameters for the population data, normalised by the maximum observed population across all Age bins.
What I would like to do is to overlay the plotted data with the distribution as a line plot. I have tried
+ geom_line (denscomp(malefit.w))
Plus numerous other (unsuccessful) strategies.
Any help that could be provided would be much appreciated! Please find the syntax appended below:
Data Structure
   Order      Age    Male  Female   Total  male.norm female.norm
1      1    0 - 5 2870000 2820000 5690000 1.00000000  1.00000000
2      2    5 - 9 2430000 2390000 4820000 0.84668990  0.84751773
3      3  10 - 14 2340000 2250000 4590000 0.81533101  0.79787234
4      4  15 - 19 2500000 2500000 5000000 0.87108014  0.88652482
5      5  20 - 24 2690000 2680000 5370000 0.93728223  0.95035461
6      6  25 - 29 2540000 2520000 5060000 0.88501742  0.89361702
7      7  30 - 34 2040000 1990000 4030000 0.71080139  0.70567376
8      8  35 - 39 1710000 1760000 3470000 0.59581882  0.62411348
9      9  40 - 44 1400000 1550000 2950000 0.48780488  0.54964539
10    10  45 - 49 1200000 1420000 2620000 0.41811847  0.50354610
11    11  50 - 54 1010000 1210000 2220000 0.35191638  0.42907801
12    12  55 - 59  812000  985000 1800000 0.28292683  0.34929078
13    13  60 - 64  612000  773000 1390000 0.21324042  0.27411348
14    14  65 - 69  402000  556000  958000 0.14006969  0.19716312
15    15  70 - 74  293000  455000  748000 0.10209059  0.16134752
16    16  75 - 79  165000  316000  481000 0.05749129  0.11205674
17    17  80 - 84  101000  222000  323000 0.03519164  0.07872340
18    18  85 plus   75500  180000  256000 0.02630662  0.06382979
This is the answer to the original question I posed above. In conjunction with the data posted in the question it is a beginning-to-end solution (i.e. raw data to plot).
Fitting of South-African age-population data (by gender) to a Weibull distribution (Theresa Cain and Ben Small)
Load libraries
library(MASS)
library(ggplot2)
Import dataset
age_gender2 <- read.csv("age_gender2.csv", sep=",", header = T)
Define the total population size by gender, i.e. sum the entire male/female population across all age bins and place the results in the objects 'total.male' and 'total.female' respectively
total.male <- sum(age_gender2$Male)
total.female <- sum(age_gender2$Female)
The object 'age.groups' is a length-one vector giving the number of age bins in the 'age_gender2' df
age.groups <- length(age_gender2$Age)
The object 'age.all' is a 1-row, 18-column zero matrix that will hold the midpoint age of each bin (category) from the 'Age' column of the age_gender2 df
age.all <- matrix(0,1,age.groups)
The next lines fill 'age.all' with the midpoint age of each bin. The first element is set by hand to 2.5; the 'for' loop then assigns each remaining column i the value (5*i) - 2.5, which is the middle age of the ith 5-year bin:
age.all[1,1] <- 2.5
for(i in 2:age.groups){
age.all[1,i] <- ((5*(i)) - 2.5)
}
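As an aside, the hand-set first element plus the loop collapse into one vectorised line; a sketch that produces the same values, given age.groups from above:
# Midpoints 2.5, 7.5, ..., 87.5 in a single expression
age.all <- matrix(5 * seq_len(age.groups) - 2.5, nrow = 1)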
The next command, 'rep', replicates each bin's midpoint age once per person, producing a (1 x 25190500) vector for the males, and the female equivalent below:
male.data <- rep(age.all,age_gender2$Male)
female.data <- rep(age.all,age_gender2$Female)
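A quick sanity check (a sketch): each vector should hold exactly one entry per person.
length(male.data) == total.male      # should be TRUE
length(female.data) == total.female  # should be TRUE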
Fit a Weibull distribution to age for males and females ('fitdistr' comes from the MASS package loaded above):
male.weib <- fitdistr(male.data, "weibull")
female.weib <- fitdistr(female.data, "weibull")
male.shape <- male.weib$estimate[1]
male.scale <- male.weib$estimate[2]
female.shape <- female.weib$estimate[1]
female.scale <- female.weib$estimate[2]
Add column "Age_Median" to 'age_gender2' df with median age. Need to transpose as 'age.all' is an 1 row X 18 column vector.
age_gender2["Age_Median"] <- t(age.all)
Apply the fitted Weibull distributions to the age bins
The function 'pweibull' is a CDF and gives the cumulative probability up to a given age; therefore we subtract its value at the bin's lower edge from its value at the upper edge to get the probability for that bin, and hence (multiplying by the total male population) the expected population for that bin.
male.p.weibull <- matrix(0,1,age.groups)
female.p.weibull <- matrix(0,1,age.groups)
for (i in 1:age.groups){
male.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, male.shape, male.scale) - pweibull(age.all[1,i]-2.5, male.shape, male.scale)
}
for (i in 1:age.groups){
female.p.weibull[1,i] <- pweibull(age.all[1,i]+2.5, female.shape, female.scale) - pweibull(age.all[1,i]-2.5, female.shape, female.scale)
}
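Since pweibull() is vectorised, the two loops can also be written as one expression each; a sketch that assigns into the existing 1 x 18 matrices so the code below stays unchanged:
mids <- age.all[1, ]
male.p.weibull[1, ]   <- pweibull(mids + 2.5, male.shape, male.scale) -
                         pweibull(mids - 2.5, male.shape, male.scale)
female.p.weibull[1, ] <- pweibull(mids + 2.5, female.shape, female.scale) -
                         pweibull(mids - 2.5, female.shape, female.scale)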
Add columns with the calculated population per age bin; 'transpose' turns the 1 x 18 matrices into 18-row by 1-column vectors
age_gender2["male.prob"] <- t(male.p.weibull * total.male)
age_gender2["female.prob"] <- t(female.p.weibull * total.female)
Create bar plots describing Age-Gender population distributions
Males (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.male <- ggplot(age_gender2, aes(x = reorder(Age, Order), y = Male, fill = Male)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Age Group (5 yr bin)") +
  ylab("Male Population (M)") +
  geom_smooth(aes(x = Age, y = male.prob, group = 1))
Females (real data) and super-imposed curve showing Weibull calculated probabilities (ggplot2)
agp.female <- ggplot(age_gender2, aes(x = reorder(Age, Order), y = Female, fill = Female)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Age Group (5 yr bin)") +
  ylab("Female Population (M)") +
  geom_smooth(aes(x = Age, y = female.prob, group = 1))
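Finally, print the two plot objects (ggplot objects are drawn only when printed):
print(agp.male)
print(agp.female)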