R - bootstrapping with control for age and gender

I have a very unbalanced dataset: thousands of healthy participants and 21 patients (16 male and 5 female). I want to use bootstrapping to define a new sample, but with control for age and gender.
This is the method I'm using:
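library(boot)
# Statistic functions for boot(); these definitions are assumed, they were not shown in the post
samplemean <- function(x, d) mean(x[d])
samplesd <- function(x, d) sd(x[d])
parametric_bootstrap_boot <- function(x) {
  # Perform bootstrap using boot package
  # Estimate mean ($t0 is the statistic computed on the original data)
  mu <- boot(x, samplemean, R = 1000)$t0
  # Estimate sd (named sigma so it does not mask stats::sd)
  sigma <- boot(x, samplesd, R = 1000)$t0
  # Sample 21 observations
  set.seed(1)
  samples <- rnorm(21, mu, sigma)
  return(samples)
}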
How can I control for age and gender when resampling the healthy participants?
My data looks like this:
Patient ID Age Mean_RR SDNN RMSSD nn50 pnn50 SEX Year of birth.0.0 Date of all cause dementia report.0.0 Source of all cause dementia report.0.0 Date of alzheimer's disease report.0.0 Source of alzheimer's disease report.0.0 Date of vascular dementia report.0.0 Source of vascular dementia report.0.0
1.53E+09 56 1257 397.34 468 2 33.33 Female 1961 NA NA NA NA NA NA
1.53E+09 56 1257 397.34 468 2 33.33 Female 1961 NA NA NA NA NA NA
This is how I call the function:
control_BPM <- abs(parametric_bootstrap_boot(control_BPM))
control_SDNN <- abs(parametric_bootstrap_boot(control_SDNN))
control_RMSSD <- abs(parametric_bootstrap_boot(control_RMSSD))
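One way to control for age and gender is to resample the healthy participants conditionally on each patient, instead of drawing from a normal distribution fitted to all controls. The following is a minimal sketch under stated assumptions, not the method from the post: `sample_matched_controls`, `age_tol` and `n_per_patient` are hypothetical names, the column names `Age` and `SEX` are taken from the data shown above, and the toy data frames stand in for the real data.

```r
# Hypothetical helper: for each patient, draw n_per_patient healthy controls of
# the same sex whose age is within age_tol years (sampling with replacement).
sample_matched_controls <- function(patients, controls, age_tol = 2, n_per_patient = 1) {
  rows <- lapply(seq_len(nrow(patients)), function(i) {
    pool <- which(controls$SEX == patients$SEX[i] &
                    abs(controls$Age - patients$Age[i]) <= age_tol)
    if (length(pool) == 0) return(NULL)  # no eligible control for this patient
    # Index into pool to avoid sample()'s special case for a length-one vector
    picked <- pool[sample.int(length(pool), n_per_patient, replace = TRUE)]
    controls[picked, , drop = FALSE]
  })
  do.call(rbind, rows)
}

# Toy data (made up for illustration): one control per age and sex
controls <- expand.grid(Age = 40:70, SEX = c("Male", "Female"),
                        stringsAsFactors = FALSE)
set.seed(1)
controls$SDNN <- rnorm(nrow(controls), mean = 400, sd = 50)
patients <- data.frame(Age = c(56, 61, 48),
                       SEX = c("Female", "Male", "Male"),
                       stringsAsFactors = FALSE)

matched <- sample_matched_controls(patients, controls)
matched
```

Calling the helper repeatedly (for example inside `replicate()`) would give bootstrap replicates of a control group whose age and sex distribution mirrors the 21 patients.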

Related

How to add in rows in data frame x to reflect rows data frame y in R

I'm trying to combine two data frames which have the same headings (Species, Age, n , kill.total, Contribution), and then need to calculate the average for Species*Age for the contribution.
Where:
Species is Species killed
Age is Age of Species killed
n is total number of kills for a species of a particular age class
kill.total is total number of kills made per season
Contribution is n/kill.total
`Lion.diet.all.Age <- do.call("rbind", list(Diet.1, Diet.2))
Lion.diet.all.Age.Table = aggregate(Contribution ~ Species*Age, data = Lion.diet.all.Age,
FUN = function(x) c(mean = mean(x), sd = std.error(x)))`
However, since not all the same species and age classes were killed each season, the average contribution that I calculate is incorrect.
e.g., Season 1:
`Species <-c("Buffalo", "Buffalo", "Impala", "Impala", "Impala", "Kudu")
Age <-c("Adult", "Juvenile", "Adult", "Sub-adult", "Juvenile", "Adult")
n<- c(2,5,10,3,11,4)
kill.total<- c(35,35,35,35,35,35)
Diet.1 <- data.frame(Species, Age, n, kill.total)
Diet.1$Contribution = (Diet.1$n/kill.total)*100`
Season 2:
` Species <-c("Buffalo", "Eland", "Impala", "Impala", "Impala", "Zebra", "Wildebeest", "Wildebeest")
Age <-c("Adult","Adult", "Sub-adult", "Juvenile","Neonate", "Neonate", "Juvenile", "Neonate")
n<- c(1,2,5,9,11,12,10,9)
kill.total<- c(59,59,59,59,59,59,59,59)
Diet.2 <- data.frame(Species, Age, n, kill.total)
Diet.2$Contribution = (Diet.2$n/kill.total)*100`
Using the above two seasons,
Lion.diet.all.Age <- do.call("rbind", list(Diet.1, Diet.2))
Output:
Species Age n kill.total Contribution
1 Buffalo Adult 2 35 5.714286
2 Buffalo Juvenile 5 35 14.285714
3 Impala Adult 10 35 28.571429
4 Impala Sub-adult 3 35 8.571429
5 Impala Juvenile 11 35 31.428571
6 Kudu Adult 4 35 11.428571
7 Buffalo Adult 1 59 1.694915
8 Eland Adult 2 59 3.389831
9 Impala Sub-adult 5 59 8.474576
10 Impala Juvenile 9 59 15.254237
11 Impala Neonate 11 59 18.644068
12 Zebra Neonate 12 59 20.338983
13 Wildebeest Juvenile 10 59 16.949153
14 Wildebeest Neonate 9 59 15.254237
Lion.diet.all.Age.Table = aggregate(Contribution ~ Species*Age, data = Lion.diet.all.Age, FUN = function(x) c(mean = mean(x), sd = std.error(x)))
Output:
Species Age Contribution.mean Contribution.sd
1 Buffalo Adult 3.70460048 2.00968523
2 Eland Adult 3.38983051 NA
3 Impala Adult 28.57142857 NA
4 Kudu Adult 11.42857143 NA
5 Buffalo Juvenile 14.28571429 NA
6 Impala Juvenile 23.34140436 8.08716707
7 Wildebeest Juvenile 16.94915254 NA
8 Impala Neonate 18.64406780 NA
9 Wildebeest Neonate 15.25423729 NA
10 Zebra Neonate 20.33898305 NA
11 Impala Sub-adult 8.52300242 0.04842615
However, since the species are not the same for both seasons, the averages are not correct.
i.e., mean(Eland Adult) should equal '(0/35 + 2/59)/2*100 = 1.69' but the above output gives me 3.38.
How do I tell R to repeat Eland Adult (and all other species*ages) across both seasons and make Contribution = 0 so that I get the correct average values?
I am using a much larger dataset where each data frame (Diet.1, Diet.2,...) is created from a subset of data from my input .csv file, therefore I cannot manually input all the Species*Age rows.
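A possible approach is to tag each data frame with its season, build every Species/Age pair crossed with every season, and fill the missing combinations with a Contribution of 0 before averaging. The sketch below uses base R only and the two example seasons from the question; the names `Season`, `pairs`, `grid`, `full` and `diet_mean` are introduced here for illustration.

```r
Diet.1 <- data.frame(
  Species = c("Buffalo", "Buffalo", "Impala", "Impala", "Impala", "Kudu"),
  Age = c("Adult", "Juvenile", "Adult", "Sub-adult", "Juvenile", "Adult"),
  n = c(2, 5, 10, 3, 11, 4), kill.total = 35, stringsAsFactors = FALSE)
Diet.2 <- data.frame(
  Species = c("Buffalo", "Eland", "Impala", "Impala", "Impala", "Zebra",
              "Wildebeest", "Wildebeest"),
  Age = c("Adult", "Adult", "Sub-adult", "Juvenile", "Neonate", "Neonate",
          "Juvenile", "Neonate"),
  n = c(1, 2, 5, 9, 11, 12, 10, 9), kill.total = 59, stringsAsFactors = FALSE)

# Tag each season and compute its contributions
seasons <- list(Diet.1, Diet.2)
for (i in seq_along(seasons)) {
  seasons[[i]]$Season <- i
  seasons[[i]]$Contribution <- seasons[[i]]$n / seasons[[i]]$kill.total * 100
}
all_diet <- do.call(rbind, seasons)

# Every Species/Age pair seen in any season, crossed with every season
pairs <- unique(all_diet[, c("Species", "Age")])
grid <- merge(pairs, data.frame(Season = unique(all_diet$Season)))  # cross join

# Left join the observed values; combinations that were not observed become 0
full <- merge(grid, all_diet[, c("Species", "Age", "Season", "Contribution")],
              all.x = TRUE)
full$Contribution[is.na(full$Contribution)] <- 0

diet_mean <- aggregate(Contribution ~ Species + Age, data = full, FUN = mean)
diet_mean
```

Because the grid is built from the data itself, no Species/Age rows have to be typed in by hand, and the same code works for any number of season data frames in `seasons`.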

approximate character matching using R

I have two data files. One of them contains only one column with the name of the company (usually a hospital), and the other contains a list of companies with their respective addresses. The problem is that the company names do not match exactly. How can I match them approximately?
> dput(head(HOSPITALS[130:140,], 10))
I would like to obtain one data file in which each company is matched with an address, if one is available in the address file.
Check out the fuzzyjoin package and the stringdist_join functions.
Here's a starting point. In your example data, ignore_case = TRUE solves the matching problem. Depending on how the full data looks, you will have to experiment with the arguments (e.g. max_dist) and possibly filter the result until you achieve what you want.
library(dplyr)
library(fuzzyjoin)
HOSPITALS %>%
stringdist_left_join(GH_MY,
by = c("hospital" = "hospital_name"),
ignore_case = TRUE,
max_dist = 2,
distance_col = "dist")
Result:
# A tibble: 10 x 6
hospital hospital_name adress district town dist
<chr> <chr> <chr> <chr> <chr> <dbl>
1 HOSPITAL PAPAR Hospital Papar Peti Surat No. 6, Papar Sabah 0
2 HOSPITAL PARIT BUNT~ Hospital Parit ~ Jalan Sempadan Parit Bun~ Perak 0
3 HOSPITAL PEKAN Hospital Pekan 26600 Pekan Pekan Pahang 0
4 HOSPITAL PENAWAR SD~ NA NA NA NA NA
5 HOSPITAL PORT DICKS~ Hospital Port D~ KM 11, Jalan Pantai Port Dick~ Negeri ~ 0
6 HOSPITAL PULAU PINA~ Hospital Pulau ~ Jalan Residensi Pulau Pin~ Pulau P~ 0
7 HOSPITAL PUSRAWI SD~ NA NA NA NA NA
8 HOSPITAL PUSRAWI SM~ NA NA NA NA NA
9 HOSPITAL PUTRAJAYA Hospital Putraj~ Pusat Pentadbiran Ker~ Putrajaya WP Putr~ 0
10 HOSPITAL QUEEN ELIZ~ NA NA NA NA NA
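If installing extra packages is not an option, a similar left join can be sketched in base R with `adist()` from the utils package, which computes edit distances between strings. This is only an illustration: the two data frames are toy stand-ins for HOSPITALS and GH_MY, the addresses are placeholders, "CLINIC EXAMPLE" is a made-up name with no counterpart, and `max_dist = 2` simply mirrors the value used above.

```r
# Toy stand-ins for HOSPITALS and GH_MY (addresses are placeholders)
HOSPITALS <- data.frame(
  hospital = c("HOSPITAL PAPAR", "HOSPITAL PEKAN", "CLINIC EXAMPLE"),
  stringsAsFactors = FALSE)
GH_MY <- data.frame(
  hospital_name = c("Hospital Papar", "Hospital Pekan"),
  adress = c("address A", "address B"),
  stringsAsFactors = FALSE)

# Edit-distance matrix: rows are HOSPITALS entries, columns are GH_MY entries
d <- adist(HOSPITALS$hospital, GH_MY$hospital_name, ignore.case = TRUE)
best <- apply(d, 1, which.min)                # closest candidate for each row
best_dist <- d[cbind(seq_along(best), best)]  # distance to that candidate
max_dist <- 2
keep <- best_dist <= max_dist                 # accept only close matches

matched <- data.frame(
  hospital = HOSPITALS$hospital,
  hospital_name = ifelse(keep, GH_MY$hospital_name[best], NA),
  adress = ifelse(keep, GH_MY$adress[best], NA),
  dist = ifelse(keep, best_dist, NA),
  stringsAsFactors = FALSE)
matched
```

Unlike `stringdist_left_join`, this keeps only the single closest candidate per row, so ties are resolved by taking the first one.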

How can I run a regression daily and save the coefficients in a new dataset?

I have a dataset containing the daily rate of return for every industry (in total 10 industries) per country (in total 16 countries) from 1975 to 2018. Now I need to run cross sectional regressions per day and per week and save the coefficients in a separate dataset.
I tried the following code. But the estimates are the same for every day.
fitted_models = Data %>%
group_by(Data$Date) %>%
do(model = lm(Data$RoR ~ Data$Country + Data$Industry, data=Data))
fitted_models$model
I need to include the following contrasts:
contrasts(All0$Country) <- contr.sum(16, contrasts=TRUE)
contrasts(All0$Industry) <- contr.sum(10, contrasts=TRUE)
but I get the following error message then
Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning messages:
1: contrasts dropped from factor Country due to missing levels
2: contrasts dropped from factor Industry due to missing levels
This is a sample of my data. As time goes on there are values for RoR.
Country Date Industry RoR
<chr> <date> <chr> <dbl>
1 Finland 1975-01-01 Basic Mats NA
2 Austria 1975-01-01 Basic Mats NA
3 Spain 1975-01-01 Basic Mats NA
4 United Kingdom 1975-01-01 Basic Mats NA
5 Norway 1975-01-01 Basic Mats NA
6 Germany 1975-01-01 Basic Mats NA
7 France 1975-01-01 Basic Mats NA
8 Italy 1975-01-01 Basic Mats NA
9 Portugal 1975-01-01 Basic Mats NA
10 Switzerland 1975-01-01 Basic Mats NA
Using the data.table package to do group-wise operations might be a good way to approach this. I'm using mtcars as an example data set since you haven't provided one, but the approach would be the same with your data. Here, I use cyl as the grouping column, but in your case it would be Date.
library(data.table)
DT <- as.data.table(mtcars)
DT[,as.list(lm(mpg ~ wt+qsec)$coefficients), by = .(cyl)]
# cyl (Intercept) wt qsec
# 1: 6 25.46173 -5.201906 0.5838640
# 2: 4 24.88427 -7.513576 0.9903892
# 3: 8 14.02093 -2.813754 0.7352592
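The estimates in the question are identical for every day because `Data$RoR`, `Data$Country` and `Data$Industry` always refer to the full data frame, whatever the grouping. A base R sketch of the per-date fit follows; the toy data and the helper name `fit_one_day` are made up for illustration, and the sum-to-zero contrasts are passed through the `contrasts` argument of `lm()` rather than set on the data beforehand.

```r
# Toy data (values are random): 3 dates x 3 countries x 2 industries
set.seed(42)
Data <- expand.grid(Date = as.Date("2018-01-01") + 0:2,
                    Country = c("Austria", "Finland", "Spain"),
                    Industry = c("Basic Mats", "Financials"),
                    stringsAsFactors = TRUE)
Data$RoR <- rnorm(nrow(Data))

fit_one_day <- function(d) {
  # Bare column names plus data = d, so lm() only sees this day's rows
  coef(lm(RoR ~ Country + Industry, data = d,
          contrasts = list(Country = "contr.sum", Industry = "contr.sum")))
}

# One row of coefficients per date; the row names are the dates
daily_coefs <- do.call(rbind, lapply(split(Data, Data$Date), fit_one_day))
daily_coefs
```

With real data, days on which RoR is entirely NA or on which a factor has fewer than two observed levels would still need to be filtered out before fitting, which is likely what triggers the contrasts error quoted in the question.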

How to find correlation in a data set

I wish to find the correlation between trip duration and age in the data set below. I am applying the function cor(age,df$tripduration). However, it gives me the output NA. Could you please let me know how I can compute the correlation? I derived "age" with the following syntax:
age <- (2017-as.numeric(df$birth.year))
and tripduration(seconds) as df$tripduration.
Below is the data. The number 1 in gender means male and 2 means female.
tripduration birth year gender
439 1980 1
186 1984 1
442 1969 1
170 1986 1
189 1990 1
494 1984 1
152 1972 1
537 1994 1
509 1994 1
157 1985 2
1080 1976 2
239 1976 2
344 1992 2
I think you are trying to subtract a data frame from a number, so it would not work. This worked for me:
birth <- df$birth.year
year <- 2017
age <- year - birth
cor(df$tripduration, age)
>[1] 0.08366848
# To check coefficient
cor(df$tripduration, df$birth.year)
>[1] -0.08366848
By the way, please format the question with easily reproducible data that people can just copy and paste into R. This actually helps you in getting an answer.
Based on the OP's comment, here is a new suggestion. Try deleting the rows with NA before performing a correlation test.
df <- df[complete.cases(df), ]
age <- (2017-as.numeric(df$birth.year))
cor(age, df$tripduration)
>[1] 0.1726607
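As an alternative to removing the rows first, `cor()` has a `use` argument that drops incomplete pairs on the fly. The sketch below rebuilds the rows shown in the question and sets one birth year to NA purely for illustration (the missing value is an assumption, it is not in the posted sample).

```r
df <- data.frame(
  tripduration = c(439, 186, 442, 170, 189, 494, 152, 537, 509, 157, 1080, 239, 344),
  birth.year = c(1980, 1984, 1969, 1986, 1990, 1984, 1972, 1994, 1994, 1985, 1976,
                 1976, 1992))
df$birth.year[3] <- NA  # hypothetical missing value, added for illustration

age <- 2017 - as.numeric(df$birth.year)

# Default behaviour: a single NA makes the whole result NA
r_default <- cor(age, df$tripduration)

# use = "complete.obs" ignores pairs in which either value is missing
r <- cor(age, df$tripduration, use = "complete.obs")
r
```

If birth.year was read in as a factor, `as.numeric()` returns the internal level codes rather than the years, so `as.numeric(as.character(df$birth.year))` would be the safer conversion in that case.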

Survival Analysis Data with country-year observations

I'm trying to fit a Cox Proportional Hazard model to analyze the impact of the number of protest events on the survival rates of different political regimes in different countries.
My dataset looks similar to this:
Country year sdate edate time evercollapsed protest GDPgrowth
Country A 2003 1996-11-24 2012-12-31 5881 0 78 14.78
Country A 2004 NA NA NA 0 99 8.56
Country A 2005 NA NA NA 0 25 3.56
Country B 2003 2000-10-26 2011-05-21 3859 1 13 2.33
Country B 2004 NA NA NA 1 28 5.43
Country B 2005 NA NA NA 1 7 1.89
So, basically my dataset provides yearly information on a number of variables for each year, but information about the start and end dates for the regime and the time of survival (measured in days) is only provided in the first row of each given political regime.
My data includes information for 48 different political regimes and 15 of them collapse in the time span I am looking at.
I fitted a Cox PH model with the survival package:
myCPH <- coxph(Surv(time, evercollapsed) ~ protest + GDPgrowth, data = mydata)
This gives me the following result:
Call:
coxph(formula = Surv(time, evercollapsed) ~ protest + GDPgrowth,
data = mydata)
coef exp(coef) se(coef) z p
protest 0.01630 1.01644 0.00722 2.26 0.024
GDPgrowth -0.03447 0.96612 0.01523 -2.26 0.024
Likelihood ratio test=9.26 on 2 df, p=0.00977
n= 48, number of events= 15
(556 observations deleted due to missingness)
So, these results imply that I'm losing 556 country years, because the rows in my data frame do not include the information on the survival time of the regime.
My question now is, how to include the country years into the analysis which do not provide the information on sdate, edate and time?
I assume, if I would just copy the information for each country-year, this would increase my number of regime collapses?
I assume I have to give a unique ID to every political regime to make sure R can distinguish the different cases. Then, how do I have to fit the Cox PH model so that it includes the information from the different country-years in the analysis?
Many thanks in advance!
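One common way to use every country-year is the counting-process layout, where each row covers an interval (tstart, tstop] and carries that year's covariate values, and the event indicator is 1 only in the last interval of a regime that collapsed. Copying evercollapsed = 1 onto every row would indeed inflate the number of events. The sketch below is an illustration under assumptions: the data are simulated, `regime_id`, `tstart`, `tstop` and `event` are names introduced here, and time is measured in years since the first observed year rather than since sdate.

```r
library(survival)

# Simulated regime-year data (hypothetical values): one row per regime and year
set.seed(1)
n_regimes <- 30
mydata <- do.call(rbind, lapply(seq_len(n_regimes), function(id) {
  n_years <- sample(3:10, 1)
  data.frame(regime_id = id,
             year = 2000 + seq_len(n_years),
             evercollapsed = rbinom(1, 1, 0.4),  # one draw per regime, recycled
             protest = rpois(n_years, 30),
             GDPgrowth = rnorm(n_years, 3, 2))
}))

# Counting-process layout: each row covers (tstart, tstop] in years
mydata$tstart <- ave(mydata$year, mydata$regime_id, FUN = function(y) y - min(y))
mydata$tstop <- mydata$tstart + 1

# The event is recorded only in the final interval of a regime that collapsed
is_last <- ave(mydata$year, mydata$regime_id, FUN = function(y) y == max(y)) == 1
mydata$event <- as.integer(is_last & mydata$evercollapsed == 1)

# cluster(regime_id) requests robust standard errors grouped by regime
fit <- coxph(Surv(tstart, tstop, event) ~ protest + GDPgrowth + cluster(regime_id),
             data = mydata)
fit
```

With the real data, tstart and tstop could instead be computed from sdate so that regimes which began before the first observed year enter the risk set late, and the unique regime ID keeps the rows of one regime together.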

Resources