Dealing with outliers by using interquartile ranges - but afterwards just different outliers - r

I am dealing with a dataset consisting of several key banking balance sheet and income statement figures (deleted some variables for this post):
'data.frame': 52028 obs. of 38 variables:
$ institutionid : int 4307883 4209717 4558501 4392480 4306242 4303334 114518 4183859 4307849 4256486 ...
$ fiscalyear : Factor w/ 8 levels "2010","2011",..: 1 1 1 1 1 1 1 1 1 1 ...
$ institutionname : chr "Kure Shinkin Bank" "Shinkin Central Bank" "Shibata Shinkin Bank" "Takasaki Shinkin Bank" ...
$ Tier 1 Ratio : num 9.8 20.68 13.93 6.84 19.43 ...
$ snlindustryid : int 28 2 28 2 2 1 1 1 2 1 ...
$ snlindustryname : chr "Other Banking" "Savings Bank/Thrift/Mutual" "Other Banking" "Savings Bank/Thrift/Mutual" ...
$ countryname : chr "Japan" "Japan" "Japan" "Japan" ...
$ Interest Income : num 141.3 3330.3 16.2 83.6 289.8 ...
$ Net Interest Income : num 122.8 756.4 14.1 74.4 250.4 ...
$ Operating Revenue : num 137.8 NA 13.8 80.1 NA ...
$ Provision for Loan Losses: num 27.546 NA 0.535 13.26 NA ...
$ Compensation and Benefits: num NA NA 6.07 36.8 NA ...
$ EBIT : num 27.04 2814.57 5.05 16.67 88.05 ...
$ Net Income befoire Taxes : num 8.57 224.58 2.98 7.42 48.62 ...
$ Provision for Taxes : num -7.861 -113.864 0.159 0.125 14.525 ...
$ Net Income : num 16.43 338.45 2.83 7.29 34.1 ...
$ net_margin : num 2.98 1.06 3.56 3.05 2.5 ...
I am trying to run a DiD regression using net_margin, a figure calculated as net income / total gross loans. When I first plot net_margin, it looks like this:
Clearly, there are values included that don't make economic sense. This is partly due to the fact that some banks in the dataset have unreasonable figures for e.g. gross loans. If you divide by something close to zero, unreasonably large numbers will come out.
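For instance, with made-up numbers:
# Hypothetical figures: the same net income over a near-zero loan book
net_income  <- c(5, 5)
gross_loans <- c(500, 0.001)
net_income / gross_loans  # 0.01 vs 5000 -- the ratio explodes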
My first intuition was to just get rid of the outliers by doing this:
library(dplyr)  # for %>% and filter()
Q   <- quantile(dataset$net_margin, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- IQR(dataset$net_margin, na.rm = TRUE)  # avoid masking the IQR() function
up  <- Q[2] + 1.5 * iqr  # upper fence
low <- Q[1] - 1.5 * iqr  # lower fence
# Eliminating outliers
dataset_cleaned <- dataset %>%
  filter(net_margin < up & net_margin > low)
If I plot the data now it looks like this:
By removing the outliers I have basically created new medians and interquartile ranges, so my data is still plagued heavily by outliers relative to the new fences.
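To make this concrete, a sketch assuming dataset_cleaned from above: recomputing the fences on the filtered data flags a fresh batch of observations.
Q2   <- quantile(dataset_cleaned$net_margin, probs = c(0.25, 0.75), na.rm = TRUE)
iqr2 <- IQR(dataset_cleaned$net_margin, na.rm = TRUE)
# Observations outside the *new* fences -- typically nonzero after filtering
sum(dataset_cleaned$net_margin > Q2[2] + 1.5 * iqr2 |
    dataset_cleaned$net_margin < Q2[1] - 1.5 * iqr2, na.rm = TRUE)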
In other posts that suggested using the IQR to remove outliers, this was not the case, however.
I am a bit at a dead end with my own statistical (and R) knowledge. Is this the right practice for removing outliers from such a dataset? Thank you!

Related

2 survival functions, 1 left-truncated, 1 not truncated. How to create a survival function in R that assumes the same experience over the truncated interval?

I have two survival functions; one is not truncated, so I have experience for all time periods. The other is left-truncated until t = 4, so it has no experience until t > 4. I can plot the two together with the following R code using the survival package.
library(tidyverse)
library(survival)
library(ggfortify)
# create two survival functions
set1 <- tibble(start0 = rep(0,10), end0 = 1:10, event0 = rep(1,10))
set2 <- tibble(start0 = rep(4,10), end0 = c(5, 5, 7, 9, rep(10, 6)), event0 = rep(1,10))
combined_set <- bind_rows(set1, set2)
survival_fn <- survfit(Surv(start0, end0, event0) ~ start0, data = combined_set)
# plot the survival function:
autoplot(survival_fn, conf.int = FALSE)
I would like to show the difference in survival between the two functions if they had both experienced the same survival experience during the truncation period - i.e. up to t = 4. I've manually sketched the approximate graph I am trying to achieve (size of steps not to scale).
This is a simplified example - in practice I have eight different sets of data with different truncation periods, and around 2000 data-points in each set.
If you look at the structure of the survival_fn object (which is not a function but rather a list), you see:
str(survival_fn)
List of 17
$ n : int [1:2] 10 10
$ time : num [1:14] 1 2 3 4 5 6 7 8 9 10 ...
$ n.risk : num [1:14] 10 9 8 7 6 5 4 3 2 1 ...
$ n.event : num [1:14] 1 1 1 1 1 1 1 1 1 1 ...
$ n.censor : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
$ surv : num [1:14] 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 ...
$ std.err : num [1:14] 0.105 0.158 0.207 0.258 0.316 ...
$ cumhaz : num [1:14] 0.1 0.211 0.336 0.479 0.646 ...
$ std.chaz : num [1:14] 0.1 0.149 0.195 0.242 0.294 ...
$ strata : Named int [1:2] 10 4
..- attr(*, "names")= chr [1:2] "start0=0" "start0=4"
$ type : chr "counting"
$ logse : logi TRUE
$ conf.int : num 0.95
$ conf.type: chr "log"
$ lower : num [1:14] 0.732 0.587 0.467 0.362 0.269 ...
$ upper : num [1:14] 1 1 1 0.995 0.929 ...
$ call : language survfit(formula = Surv(start0, end0, event0) ~ start0, data = combined_set)
- attr(*, "class")= chr "survfit"
So one way of getting something like your goal, although still with an automatic start of the survival function at (t=0, S=1), would be to multiply all the $surv items in the 'start0=4' stratum by the surv value at t=4, and then redo the plot:
survival_fn[['surv']][11:14] <- survival_fn[['surv']][11:14] * survival_fn[['surv']][4]
autoplot(survival_fn, conf.int = FALSE)  # redo the plot with the rescaled stratum
I can see why this might not be a totally conforming answer, since there is still a blue line from S=1 out to t=5 and it doesn't actually start at the surv value for stratum 1 at t=4. That is, however, a limitation of using a "high-level" abstraction plotting paradigm. The customizability is inhibited by the many "helpful" assumptions built into the plotting grammar. It would not be as difficult to do this in base plotting, since you could "move things around" without as many constraints.
If you do need to build a step function from estimated survival proportions and times, you might look at this answer and then build an augmented dataset with a y adjustment at time=4 for the later stratum. You would need to add a time=0 value for the main stratum, and for the second stratum a time=4 value taken from the first stratum, as well as doing the adjustment shown above. See this question and answer: Reconstruct survival curve from coordinates
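Following that idea, here is a minimal sketch of the augmented-dataset approach with ggplot2's geom_step. It assumes a freshly fitted survival_fn (i.e. before the in-place $surv rescaling above); per $strata, entries 1:10 belong to 'start0=0' and entries 11:14 to 'start0=4'.
library(ggplot2)
s4 <- survival_fn$surv[4]  # stratum-1 survival at t = 4
df1 <- data.frame(time    = c(0, survival_fn$time[1:10]),
                  surv    = c(1, survival_fn$surv[1:10]),
                  stratum = "start0=0")
# Second stratum: borrow the first stratum's experience up to t = 4,
# then rescale its own estimates by the shared survival at t = 4.
df2 <- data.frame(time    = c(0, survival_fn$time[1:4], survival_fn$time[11:14]),
                  surv    = c(1, survival_fn$surv[1:4], survival_fn$surv[11:14] * s4),
                  stratum = "start0=4")
ggplot(rbind(df1, df2), aes(time, surv, colour = stratum)) +
  geom_step() +
  labs(x = "t", y = "S(t)")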

Calculation of Geometric Mean of Data that includes NAs

EDIT: The problem was not in the geoMean function, but in a wrong use of aggregate(), as explained in the comments.
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the built-in function geoMean from the package EnvStats like this
library("EnvStats")
aggregate(df[, 2:3], list(df$species), geoMean, na.rm = TRUE)
it works wonderfully and skips the NAs, giving me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over the NAs and returns NA as the result, even though there are, e.g., 10 numerical values and only one NA. This happens, for example, with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?
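For reference, a minimal dplyr sketch on the small example df above, which keeps the per-column NA handling explicit (geoMean from EnvStats, as before):
library(dplyr)
library(EnvStats)
df %>%
  group_by(species) %>%
  summarise(across(c(phen, hveg), ~ geoMean(.x, na.rm = TRUE)))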

PLM is not recognizing my id variable name

I'm doing a regression analysis with fixed effects using plm() from the package plm. I have selected the twoways method to account for both time and individual effects. However, after running the code below I keep receiving this message:
Error in pdata.frame(data, index) :
variable id does not exist (individual index)
Here the code:
pdata <- DATABASE[,c(2:4,13:21)]
pdata$id <- group_indices(pdata,ISO3.p,Productcode)
coutnin <- dcast.data.table(pdata,ISO3.p+Productcode~.,value.var = "id")
setcolorder(pdata,neworder=c("id","Year"))
pdata <- pdata.frame(pdata,index=c("id","Year"))
reg <- plm(pdata,diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), effect="twoways", model="within", index = c("id","Year"))
Please note that the pdata structure shows multiple levels in the id variable, which is numeric; I initially tried a string-type variable but kept receiving the same outcome:
Classes ‘data.table’ and 'data.frame': 1211800 obs. of 13 variables:
$ id : int 4835 6050 13158 15247 17164 18401 19564 23553 24895 27541 ...
$ Year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
$ Productcode: chr "101" "101" "101" "101" ...
$ ISO3.p : Factor w/ 171 levels "ABW","AFG","AGO",..: 8 9 20 22 27 28 29 34 37 40 ...
$ e : num 0.245 -0.238 1.624 0.693 0.31 ...
$ RERcp : num -0.14073 -0.16277 1.01262 0.03908 -0.00243 ...
$ RERpp : num -0.1712 NA NA NA -0.0952 ...
$ RER_GVC : num -3.44 NaN NA NA NaN ...
$ GDPR.p : num 27.5 26.6 23.5 20.3 27.8 ...
$ GDPR.r : num 30.4 30.4 30.4 30.4 30.4 ...
$ GVCPos : num 0.141 0.141 0.141 0.141 0.141 ...
$ GVCPar : num 0.436 0.436 0.436 0.436 0.436 ...
$ TV : num 17.1 17.1 17.1 17.1 17.1 ...
- attr(*, ".internal.selfref")=<externalptr>
When I convert the data.table into a pdata.frame I do not receive any warning; the error happens only after I run the plm function. Running View(table(index(pdata), useNA = "ifany")) displays no value larger than 1, so I assume I have no duplicate observations in my data.
Try putting the data argument in second place in the plm statement: plm() expects the formula as its first argument, so a data frame passed there is misinterpreted. If pdata has already been converted to a pdata.frame, leave out the index argument in the plm statement, i.e., try this:
reg <- plm(diff(TV,1) ~ diff(RERcp,1)+diff(GDPR.p,1)-diff(GDPR.r,1), data = pdata, effect = "twoways", model = "within")
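As a quick sanity check of the argument order, here is a minimal sketch using the package's built-in Grunfeld data:
library(plm)
data("Grunfeld", package = "plm")
# Formula first, data second; index identifies individual and time.
gp  <- pdata.frame(Grunfeld, index = c("firm", "year"))
reg <- plm(inv ~ value + capital, data = gp, effect = "twoways", model = "within")
summary(reg)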

“length of 'dimnames' [2] not equal to array extent”

So I have seen questions regarding this error code before, but the suggested troubleshooting that worked for those authors didn't help me diagnose my case. I'm self-learning R and new to Stack Overflow, so please give me constructive feedback on how to better ask my question, and I will do my best to provide the necessary information. I've seen many similar questions put on hold, so I want to help you to help me. I'm sure the error probably stems from my lack of experience in data prep.
I'm trying to run a panel data model on data loaded from a .csv file, and this error is returned when the model is run:
fixed = plm(Y ~ X, data=pdata, model = "within")
Error in `colnames<-`(`*tmp*`, value = "1") :
length of 'dimnames' [2] not equal to array extent
Running str() on my dataset shows that ID and Time are factors with 162 levels and 7 levels, respectively.
str(pdata)
Classes ‘plm.dim’ and 'data.frame': 1127 obs. of 11 variables:
$ ID : Factor w/ 162 levels "1","2","3","4",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Time : Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7 1 2 3 ...
$ Online.Service.Index : num 0.083 0.131 0.177 0.268 0.232 ...
$ Eparticipation : num 0.0345 0.0328 0.0159 0.0454 0.0571 ...
$ CPI : num 2.5 2.6 2.5 1.5 1.4 0.8 1.2 2.5 2.5 2.4 ...
$ GE.Est : num -1.178 -0.883 -1.227 -1.478 -1.466 ...
$ RL.Est : num -1.67 -1.71 -1.72 -1.95 -1.9 ...
$ LN.Pop : num 16.9 17 17 17.1 17.1 ...
$ LN.GDP.Cap : num 5.32 5.42 5.55 5.95 6.35 ...
$ Human.Capital.Index : num 0.268 0.268 0.268 0.329 0.364 ...
$ Telecommunication.Infrastructure.Index: num 0.0016 0.00173 0.00202 0.01576 0.03278 ...
Still, I don't see how that would create this error. I've tried transforming the data to a data frame or a matrix, with the same result (I got desperate, and that had worked for some people).
dim() yields
[1] 1127 11
I have some NA values, but I understand that these shouldn't cause a problem. Again, I'm self-taught and new here, so please take it easy on me! I hope I explained the problem well.

Truncate a Time-Series in R

I'm using continuous Morlet wavelet transform (CWT) analysis on a time series with the R package dplR. The time series is 15-minute data (gam_15min) of length 7968, corresponding to 83 days of measurements.
I have the following output:
cwtGamma=morlet(gam_15min,x1=seq_along(gam_15min),p2=NULL,dj=0.1,siglvl=0.95)
str(cwtGamma)
List of 9
$ y : Time-Series [1:7968] from 1 to 1993: 672 674 673 672 672 ...
$ x : int [1:7968] 1 2 3 4 5 6 7 8 9 10 ...
$ wave : cplx [1:7968, 1:130] -0.00332+0.0008i 0.00281-0.00181i -0.00194+0.00234i ...
$ coi : num [1:7968] 0.73 1.46 2.19 2.92 3.65 ...
$ period: num [1:130] 1.03 1.11 1.19 1.27 1.36 ...
$ Scale : num [1:130] 1 1.07 1.15 1.23 1.32 ...
$ Signif: num [1:130] 0.000382 0.001418 0.005197 0.018514 0.062909 ...
$ Power : num [1:7968, 1:130] 1.17e-05 1.11e-05 9.26e-06 7.09e-06 5.54e-06 ...
$ siglvl: num 0.95
In my analysis I want to truncate the time series (I suppose $wave) by removing one period length at the beginning and one period length at the end. How do I do that? Maybe it's easy, but I'm not seeing how. Thanks!
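One possible sketch, assuming "one period length" means a fixed number of time steps taken from a period of interest in $period (the 96 below, one day of 15-minute samples, is purely an assumption):
p    <- 96                        # assumed period of interest, in time steps
n    <- length(cwtGamma$x)
keep <- (p + 1):(n - p)           # drop one period length at each end
cwtTrunc <- cwtGamma
cwtTrunc$y     <- cwtGamma$y[keep]
cwtTrunc$x     <- cwtGamma$x[keep]
cwtTrunc$coi   <- cwtGamma$coi[keep]
cwtTrunc$wave  <- cwtGamma$wave[keep, , drop = FALSE]
cwtTrunc$Power <- cwtGamma$Power[keep, , drop = FALSE]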
