Determine weights in multivariate weighted linear regression - r

I have a dataset containing insurance pricing and coverage information. The first column refers to the policy identifier, and the remaining columns refer to premium, limit, deductible, and further details as dummy variables (State and coverage).
Identifier
Price
Limit
Deductible
Peril1
Peril2
Peril3
Peril4
Peril5
Peril6
State1
State2
State3
State4
POL1
250.0
100000
500.0
1
1
1
0
0
1
1
0
0
0
POL1
625.0
100000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
1650.0
500000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
2500.0
1000000
1000.0
1
1
1
0
0
1
1
0
0
0
POL1
4375.0
2000000
2000.0
1
1
1
0
0
1
1
0
0
0
POL2
25.0
50000
500.0
0
0
1
1
0
0
1
0
0
0
POL3
60.25
25000
500.0
1
1
1
1
1
1
1
0
0
0
POL3
73.25
50000
500.0
1
1
1
1
1
1
1
0
0
0
Moreover, as it can be seen from the sample dataframe, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described in a single row.
I am trying to conduct a multivariate regression
reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df)
By conducting the multivariate regression, it emerges that the distribution of residual errors does not follow a normal distribution. I therefore decided to Log() the dependent variable. Moreover, in my dataframe there are several outliers and presence of heteroscedasticity.
For the reasons above I thought WLS regression could be a solution to my problem, because it can help me assigning an appropriate weight to each error term. Trying to understand the functioning and theory behind WLS I tried to conduct simple weighted regression as explained here
wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2  
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df, weight=wt)
But when looking at the results I don’t think this is the correct approach to tackle my problem, also considering the fact that by trying to solve this issue many rows are not considered.
From my understand, as the weight parameter of lm should be a vector, I could assign a specific weight to each policy. For instance, each row pertaining POL1 is 1/5. Despite having read documentation, relevant posts, and searched for packages that could facilitate my work, it is not clear to me how to implement WLS in my case.

Related

Regression with before and after

I have a dataset with four variables (df)
household
group
income
post
1
0
20'000
0
1
0
22'000
1
2
1
10'000
0
2
1
20'000
1
3
0
20'000
0
3
0
21'000
1
4
1
9'000
0
4
1
16'000
1
5
1
8'000
0
5
1
18'000
1
6
0
22'000
0
6
0
26'000
1
7
1
12'000
0
7
1
24'000
1
8
0
24'000
0
8
0
27'000
1
Group is a binary variable and is 1, when household got support from state. and post variable is also binary and is 1, when it is after some household got support from state.
Now I would like to run a before vs after regression that estimates the group effect by comparing post-period and before period for the supported group. I would like to put the dependent variable in logs, to have the effect in percentage, so the impact of state support on income.
I used that code, but I don't know if it is right to get the answer?
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
If you are looking for the classic 2x2 design your code was almost correct. Change '+' with '*'. This tell us that the supported group increased the income with 7 250 more than the group which not received support.
comparing = feols(income ~ group * post,data)
comparing_log = feols(log(income) ~ group * post,data)
etable(comparing,comparing_log)
PS: The interpretation of the coefficient as percentage change is a good approximation for small numbers. The correct formula for % change is: exp(beta)-1. In this case it is exp(0.5829)-1 = 0.7912.
So the change here is 79,12%.

Test to compare proportions / paired (small) samples / 7-levels categorical variables

I'm working on data from a pre-post survey: the same participants have been asked the same questions at 2 different times (so the sample are not independant). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two data bases (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always return NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One exemple of a pivot table (I have 22 participants):
> temp
1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD 0 0 0 0 0 1 0
2_Disagree 0 0 0 0 1 0 0
3_SomeD 0 0 0 0 0 1 1
4_Neither 0 0 1 1 2 2 2
5_SomeA 0 0 0 0 1 1 2
6_Agree 0 0 0 0 0 3 2
7_StronglyA 0 0 0 0 0 1 2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired sample? I've done research but I couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!

How to correctly merge two files and count values before Fisher's test in R?

I am very new to R, so I apologise if this looks simple to someone.
I try to to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
#Dominant model
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
if(case_count>args$casesize){
case_count<-args$casesize
}else if(case_count<0){
case_count<-0
}
if(control_count>args$controlsize){
control_count<-args$controlsize
}else if(control_count<0){
control_count<-0
}
mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value
and problem starts in here:
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
the result of case_count and control_count is NULL values, however corresponding columns in both input files are NOT empty.
I tried to run the script above with assigning absolute numbers (1000 and 2000) to variables case_count and control_count , and the script worked without issues.
The main purpose of the code:
https://github.com/mhguo1/TRAPD
Run burden testing This script will run the actual burden testing. It
performs a one-sided Fisher's exact test to determine if there is a
greater burden of qualifying variants in cases as compared to controls
for each gene. It will perform this burden testing under a dominant
and a recessive model.
It requires R; the script was tested using R v3.1, but any version of
R should work. The script should be run as: Rscript burden.R
--casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A
--casesize: Number of cases that were tested in Step 2A
--controlfile: Path to the counts file for the controls, as generated in Step 2B
--controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for
total sample size
--output: Output file path/name Output: A tab delimited file with 10 columns:
#GENE: Gene name CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene CASE_COUNT_CH: Number of cases
carrying potentially compound heterozygous qualifying variants in a
given gene CASE_COUNT_HOM: Number of cases carrying homozygous
qualifying variants in a given gene. CASE_TOTAL_AC: Total AC for a
given gene. CONTROL_COUNT_HET: Approximate number of controls carrying
heterozygous qualifying variants in a given gene CONTROL_COUNT_HOM:
Number of controlss carrying homozygous qualifying variants in a given
gene. CONTROL_TOTAL_AC: Total AC for a given gene. P_DOM: p-value
under the dominant model. P_REC: p-value under the recessive model.
I try to run genetic variant burden test with vcf files and external gnomAD controls. I found this repo suitable and trying to fix bugs now in it.
as a newbie in R statistics, I will be happy about any suggestion. Thank you!
If you want all row in two file. You can use full join with by = "GENE" and suffix as you wish
library(dplyr)
z <- outer_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
6 ENSG00000174776 NA NA NA NA
7 ENSG00000076864 NA NA NA NA
8 ENSG00000086015 NA NA NA NA
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
6 1 1 0 2
7 1 0 1 13
8 1 0 1 25
If you want only GENE that are in both rows, use inner_join
z <- inner_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7

afex in R: Error: Empty cells in within-subjects design (i.e., bad data structure). But the data structure is fine

I'm attempting to run an ANCOVA with 1 between-subjects variable and 2 within-subjects variables and I'm running into an error that makes no sense to me. My data looks like this:
Scan
ID
Region
ALFF
Age
Resp
1
20
AID
0.826
Adol
77.25
2
20
AID
1.116
Adol
73.18
1
22
AID
0.362
Adult
78.70
2
22
AID
0.849
Adult
72.58
1
20
MDM
0.826
Adol
79.25
2
20
MDM
1.116
Adol
71.18
1
22
MDM
0.778
Adult
79.70
2
22
MDM
0.291
Adult
73.58
My ANCOVA code is:
Full_Anova_ALFF<- AlFF_Resp %>% group_by(Region) %>% do(fit=aov_car(ALFF ~ AgeScan+Error(ID/ScanResp), data = .))
and I get this error when I run it:
Converting to factor: Age
Error: Empty cells in within-subjects design (i.e., bad data structure).
table(data[c("Scan", "Resp")])
Resp
Scan
X77.25
X73.1777777766667
X63.1944444433333
X70.3333333333333
X78.7
X72.5833333333333
X1
1
0
0
0
1
0
X2
0
1
0
0
0
1
X3
0
0
1
0
0
1
X4
0
0
0
1
0
0
Resp
Scan
X72.4833333333333
X78.25
X65.1833333333333
X71.9166666666667
X57.333333335
X65.55
X1
0
0
1
0
0
1
X2
0
0
0
1
0
0
X3
1
0
0
0
0
0
X4
0
1
0
Scan is factor variable and resp is numeric and I have no idea why this error is occurring. There're no empty cells! And this weird table that was outputted as a part of the error message is also quite strange. It appears to be treating respiration as a factor maybe? But I've definitely told it that respiration is numeric. When I take respiration out of the model, it runs completely fine. However, unfortunately, I do need to include respiration.
Anybody have any idea what's going wrong? Or even just a workaround I can use to get this done?
Thanks in advance for your help!!
For those interested, I figured it out! I ended up using a linear mixed-model instead of what I have above because my covariate was at the scan-level rather than a between-subjects variable.
My command ended up being:
Full_Anova_ALFF<- AlFF_Resp %>% group_by(Region) %>% do(fit=mixed(ALFF~Resp+Age*Scan+(1|ID)+(1|Scan),data=.))
Hope this helps someone in the future!

How transform (calculate) an ordinal variable to a dichotomous in R?

I want to transform an ordinal variabel (0-2) – where 0 is no rights, 1 is some rights, and 2 full rights – to a dichotomous variable.
The original ordinal variable is coded for each country and year (country-year unit).
I want to create a dichotomous variable, (let's call it Improvement), capturing all annual positive changes, for each country-year. So when it goes from 0 to 1 (or from 0 to 2, or from 1 to 0), I want it to be 1 for that year and country. And zero otherwise.
Below I give an example of how my data looks like. The "RIGHTS" is the original ordinal variable. The "MY DICHOTOMOUS" variable is what I want to calculate in R. How can I do it?
COUNTRY YEAR RIGHTS MY DICHOTOMOUS
A 1990 0 0
A 1991 0 0
A 1992 0 0
A 1993 1 1
A 1994 0 0
B 1990 1 1
B 1991 1 0
B 1992 1 0
B 1993 1 0
B 1994 1 0
Please, note that the original data can go the other away as well, i.e. it can go negative. I do not want to code for negative changes for this dichotomous variable.
We can use diff
df1$dichotomous <- +c(FALSE,diff(df1$RIGHTS)==1)
df1$dichotomous
#[1] 0 0 0 1 0 1 0 0 0 0
This assumes you don't consider starting with a 1 in rights as a 1 in dichotomous:
x <- rights
n <- length(x)
dichotomous <- c(0, as.numeric(x[-1] - x[-n] == 1))
Might have to do a series of ifelse() statements. But then again I might be miss reading your question. An example is posted below.
MY.DATA$MY.DICHOTOMOUS <- with(MY.DATA,ifelse(COUNTRY=="A",RIGHTS,ifelse(COUNTRY=="B"&YEAR==1990,1,factor(RIGHTS)))`

Resources