N size for not being considered a leverage point - R

Suppose I have a dataset with two variables, x and y, and I want to run the linear regression y ~ x.
All the x values are equal and y varies between 1 and 10.
For example (in R code):
x <- rep(100, 50)
y <- runif(50, 1, 10)
If I add a new observation with x = 75, this new value will be considered a leverage point:
x <- c(x, 75)
y <- c(y, runif(1, 1, 10))
fit <- lm(y ~ x)
im <- influence.measures(fit)
tail(im$is.inf)
How many 75's do I need to add to the dataset so that they are not considered leverage points?
Is there any R package that returns that critical N size?
Edit after @RuiBarradas's comments
The hat values with 51 observations (fifty 100's and one 75) are:
> im$infmat[, 6]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
49 50 51
0.02 0.02 1.00
What I want to know is how many 75's I have to add so that the 75's are not considered leverage points, because I'm deleting observations with high leverage from my analysis.
This has to be done programmatically for over 1000 cases.

Well, I don't think anyone is trying to question your approach, just to point out that as you drop high-leverage rows, more will appear, potentially with even higher leverage.
But it's best to simply arm you to do what you want. Here's a function that takes a dataset, a regression formula, and how_deep you want to go.
We'll apply it to mtcars with how_deep = 27, so it keeps removing the highest-leverage row while at least 27 of the 32 rows remain, printing the name and hat value of each row it drops. Initially the Lincoln has the highest leverage; once you eliminate it, it's a Chrysler, with what is actually higher leverage. Hopefully your data will behave differently.
recursive_leverage <- function(data, formula, how_deep) {
  # keep removing the row(s) with the largest hat value while at least
  # `how_deep` rows remain, printing what gets dropped at each step
  while (nrow(data) >= how_deep) {
    data$hatval <- hatvalues(lm(formula = formula, data = data))
    print(paste(rownames(data[which.max(data$hatval), ]), max(data$hatval)))
    data <- data[-which(data$hatval == max(data$hatval)), ]
  }
}
recursive_leverage(mtcars, mpg ~ wt + disp, 27)
#> [1] "Lincoln Continental 0.198891254526384"
#> [1] "Chrysler Imperial 0.240276382749128"
#> [1] "Cadillac Fleetwood 0.274528186610731"
#> [1] "Lotus Europa 0.224979770471144"
#> [1] "Honda Civic 0.223347867723116"
#> [1] "Ford Pantera L 0.21141405715464"


How to convert a list into a data.frame in R?

I've created a frequency table in R with the fdth package, using this code:
fdt(x, breaks = "Sturges")
The specific result was:
Class limits f rf rf(%) cf cf(%)
[-15.907,-11.817) 12 0.00 0.10 12 0.10
[-11.817,-7.7265) 8 0.00 0.07 20 0.16
[-7.7265,-3.636) 6 0.00 0.05 26 0.21
[-3.636,0.4545) 70 0.01 0.58 96 0.79
[0.4545,4.545) 58 0.00 0.48 154 1.27
[4.545,8.6355) 91 0.01 0.75 245 2.01
[8.6355,12.726) 311 0.03 2.55 556 4.57
[12.726,16.817) 648 0.05 5.32 1204 9.89
[16.817,20.907) 857 0.07 7.04 2061 16.93
[20.907,24.998) 1136 0.09 9.33 3197 26.26
[24.998,29.088) 1295 0.11 10.64 4492 36.90
[29.088,33.179) 1661 0.14 13.64 6153 50.55
[33.179,37.269) 2146 0.18 17.63 8299 68.18
[37.269,41.36) 2525 0.21 20.74 10824 88.92
[41.36,45.45) 1349 0.11 11.08 12173 100.00
It was given as a list:
> class(x)
[1] "fdt.multiple" "fdt" "list"
I need to convert it into a data frame object, so I can have a table. How can I do it?
I'm a beginner at using R :(
Since you did not provide a reproducible example of your data, I have used the example from the help page of ?fdt, which is close to what you have.
library(fdth)
mdf <- data.frame(c1 = sample(LETTERS[1:3], 1e2, TRUE),
                  c2 = as.factor(sample(1:10, 1e2, TRUE)),
                  n1 = c(NA, NA, rnorm(96, 10, 1), NA, NA),
                  n2 = rnorm(100, 60, 4),
                  n3 = rnorm(100, 50, 4),
                  stringsAsFactors = TRUE)
fdt <- fdt(mdf, breaks = 'FD', by = 'c1')
class(fdt)
#[1] "fdt.multiple" "fdt" "list"
You can extract the table component from each list element and bind them together.
result <- purrr::map_df(fdt, `[[`, 'table')
#In base R
#result <- do.call(rbind, lapply(fdt, `[[`, 'table'))
result
# Class limits f rf rf(%) cf cf(%)
#1 [8.1781,9.1041) 5 0.20833333 20.833333 5 20.833333
#2 [9.1041,10.03) 6 0.25000000 25.000000 11 45.833333
#3 [10.03,10.956) 10 0.41666667 41.666667 21 87.500000
#4 [10.956,11.882) 3 0.12500000 12.500000 24 100.000000
#5 [53.135,56.121) 4 0.16000000 16.000000 4 16.000000
#6 [56.121,59.107) 8 0.32000000 32.000000 12 48.000000
#7 [59.107,62.092) 8 0.32000000 32.000000 20 80.000000
#....
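If you also want to record which list element each block of rows came from, a small addition (a sketch that assumes the element names of the fdt list are meaningful labels):
# purrr: .id adds a column holding the list element name
result <- purrr::map_df(fdt, `[[`, 'table', .id = 'variable')
# Base R equivalent
tabs <- lapply(fdt, `[[`, 'table')
result <- do.call(rbind, Map(cbind, variable = names(tabs), tabs))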

Create matrix from dataset in R

I want to create a matrix from my data. My data consists of two columns: the date and my observation for each date. I want the matrix to have years as rows and days as columns, e.g.:
17 18 19 20 ... 31
1904 x11 x12 ...
1905
1906
.
.
.
2019
The days in this case are the days of December each year. I would like missing values to equal NA.
Here's a sample of my data:
> head(cdata)
# A tibble: 6 x 2
Datum Snödjup
<dttm> <dbl>
1 1904-12-01 00:00:00 0.02
2 1904-12-02 00:00:00 0.02
3 1904-12-03 00:00:00 0.01
4 1904-12-04 00:00:00 0.01
5 1904-12-12 00:00:00 0.02
6 1904-12-13 00:00:00 0.02
I figured that the first thing I need to do is split the date into year, month and day (YYYY-MM-DD format), so I did that, got rid of the date column (the one called Datum), and also got rid of the irrelevant days, namely the ones < 17.
cd <- cdata %>%
  dplyr::mutate(year = lubridate::year(Datum),
                month = lubridate::month(Datum),
                day = lubridate::day(Datum)) %>%
  dplyr::select(-Datum)
cu <- cd[which(cd$day > 16
               & cd$day < 32
               & cd$month == 12), ]
and now it looks like this:
> cu
# A tibble: 1,284 x 4
Snödjup year month day
<dbl> <dbl> <dbl> <int>
1 0.01 1904 12 26
2 0.01 1904 12 27
3 0.01 1904 12 28
4 0.12 1904 12 29
5 0.12 1904 12 30
6 0.15 1904 12 31
7 0.07 1906 12 17
8 0.05 1906 12 18
9 0.05 1906 12 19
10 0.04 1906 12 20
# … with 1,274 more rows
Now I need to fit my data into a matrix with missing values as NA. Is there any way to do this?
Base R approach, using by:
# one row per year (first 4 characters of the date), one column per day of December
r <- `colnames<-`(do.call(rbind, by(dat, substr(dat$date, 1, 4), function(x) x[[2]])), 1:31)
r[,17:31]
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 1904 -0.28 -2.66 -2.44 1.32 -0.31 -1.78 -0.17 1.21 1.90 -0.43 -0.26 -1.76 0.46 -0.64 0.46
# 1905 1.44 -0.43 0.66 0.32 -0.78 1.58 0.64 0.09 0.28 0.68 0.09 -2.99 0.28 -0.37 0.19
# 1906 -0.89 -1.10 1.51 0.26 0.09 -0.12 -1.19 0.61 -0.22 -0.18 0.93 0.82 1.39 -0.48 0.65
Toy data
set.seed(42)
dat <- do.call(rbind, lapply(1904:1906, function(x)
data.frame(date=seq(ISOdate(x, 12, 1, 0), ISOdate(x, 12, 31, 0), "day" ),
value=round(rnorm(31), 2))))
You can try:
library(dplyr)
library(tidyr)
cdata %>%
  mutate(year = lubridate::year(Datum),
         month = lubridate::month(Datum),
         day = lubridate::day(Datum)) %>%
  filter(month == 12, day >= 17) %>%
  complete(year, day = 17:31) %>%
  select(year, day, Snödjup) %>%
  pivot_wider(names_from = day, values_from = Snödjup)
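Another option, as a minimal base-R sketch (assuming the cu tibble from the question, with columns year, day and Snödjup): build an NA-filled matrix and fill it by (row, column) index, so any missing year/day combination simply stays NA.
years <- sort(unique(cu$year))  # or 1904:2019 if every year should appear as a row
days <- 17:31
m <- matrix(NA_real_, nrow = length(years), ncol = length(days),
            dimnames = list(years, days))
m[cbind(match(cu$year, years), match(cu$day, days))] <- cu$Snödjup
m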

Finding max of column by group with condition

I have a data frame like this:
For each gill, I would like to find the maximum time for which the diameter is different from 0. I have tried to use the aggregate function and the dplyr package, but this did not work. A combination of for, if and aggregate would probably work, but I did not find how to do it.
I'm not sure of the best way to approach this. I'd appreciate any help.
After grouping by 'Gill', subset the 'Time' values where 'Diametre' is not 0 and get the max (assuming 'Time' is of numeric class):
library(dplyr)
df1 %>%
  group_by(Gill) %>%
  summarise(Time = max(Time[Diametre != 0]))
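One caveat on the above (same column names assumed): if a gill has no nonzero diameter at all, max() on an empty vector returns -Inf with a warning, so you may want to guard against that, e.g.:
df1 %>%
  group_by(Gill) %>%
  summarise(Time = if (any(Diametre != 0)) max(Time[Diametre != 0]) else NA_real_)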
Here is how you can use aggregate:
> df<- data.frame(
Gill = rep(1:11, each = 2),
diameter = c(0,0,1,0,0,0,73.36, 80.08,1,25.2,53.48,61.21,28.8,28.66,71.2,80.25,44.55,53.50,60.91,0,11,74.22),
time = 0.16
)
> df
Gill diameter time
1 1 0.00 0.16
2 1 0.00 0.16
3 2 1.00 0.16
4 2 0.00 0.16
5 3 0.00 0.16
6 3 0.00 0.16
7 4 73.36 0.16
8 4 80.08 0.16
9 5 1.00 0.16
10 5 25.20 0.16
11 6 53.48 0.16
12 6 61.21 0.16
13 7 28.80 0.16
14 7 28.66 0.16
15 8 71.20 0.16
16 8 80.25 0.16
17 9 44.55 0.16
18 9 53.50 0.16
19 10 60.91 0.16
20 10 0.00 0.16
21 11 11.00 0.16
22 11 74.22 0.16
> # Remove diameter == 0 before aggregate
> dfnew <- df[df$diameter != 0, ]
> aggregate(dfnew$time, list(dfnew$Gill), max )
Group.1 x
1 2 0.16
2 4 0.16
3 5 0.16
4 6 0.16
5 7 0.16
6 8 0.16
7 9 0.16
8 10 0.16
9 11 0.16
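Note that gills whose diameters are all zero (gills 1 and 3 in this example) drop out of the aggregate result entirely; if you need them listed with NA instead, one sketch is to merge the result back against the full set of gills:
res <- aggregate(dfnew$time, list(Gill = dfnew$Gill), max)
merge(data.frame(Gill = unique(df$Gill)), res, all.x = TRUE)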
I would use a different approach than the elegant solution that akrun suggested. This method creates the MaxTime column that you show in your image.
# This will split your df into a list of data frames, one for each gill.
list.df <- split(df1, df1$Gill)
Then you can use lapply to find the maximum of Time for each Gill and store that value in a new column called MaxTime (note that the result has to be assigned back to list.df):
list.df <- lapply(list.df, function(x) mutate(x, MaxTime = max(Time[Diametre != 0])))
Then you can combine these split data frames back together using bind_rows():
df1 <- bind_rows(list.df)

How to build a nonlinear approximation?

I need to build an approximation (fit) of my data using the formula
y = a*(exp(x/b) - 1) (code below).
library("ggplot2")
df <- read.table(file='vah_p_1',header =TRUE)
p <- ggplot(df, aes(x = x, y = y)) + geom_point() +
geom_smooth(data = df, method = "nls",size=0.4, se=FALSE,color ='cyan2',
formula = y ~ a(exp^(x*b)-1),method.args = list(start=c(a=1.0,b=0.0)))
p
Unfortunately, the approximation line is not being built. I think the problem is in method.args = list(start = c(a = 1.0, b = 0.0)). How do I find a and b?
In vah_p_1 is located:
x y
0 4
0.25 5
0.27 6
0.29 7
0.31 8
0.33 10
0.34 13
0.36 16
0.37 20
0.38 23
0.39 28
0.4 37
0.41 43
0.42 55
0.43 67
0.44 81
0.45 94
0.46 118
0.47 143
0.48 187
0.49 225
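For what it's worth, here is a hedged sketch of one way to get the fit. Two things in the posted call look off: the formula a(exp^(x*b)-1) should be a*(exp(x/b) - 1) to match the stated model, and b = 0 cannot work as a starting value because of the division by b. Rough starting values can be taken from a log-linear fit (for x away from 0, y is roughly a*exp(x/b), so log(y) is roughly log(a) + x/b); they may still need hand-tuning before nls() converges.
library(ggplot2)
df <- read.table(file = 'vah_p_1', header = TRUE)
# starting values from a log-linear approximation
lf <- lm(log(y) ~ x, data = df)
start <- list(a = unname(exp(coef(lf)[1])), b = unname(1 / coef(lf)[2]))
fit <- nls(y ~ a * (exp(x / b) - 1), data = df, start = start)
coef(fit)
# the same starting values can be passed to geom_smooth()
ggplot(df, aes(x = x, y = y)) + geom_point() +
  geom_smooth(method = "nls", se = FALSE, size = 0.4, color = "cyan2",
              formula = y ~ a * (exp(x / b) - 1),
              method.args = list(start = start))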

Categorical Survey Analysis - data structure problems

I am trying to run a probability table for an entire survey. I then want to export these statistics into a csv where each column represents a single question. Each question in my original data is its own column, like so:
print(InternalSurveyPercent)
Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai
I'd like the results to look like this, but for each column.
InternalSurveyPercent$Q1
Q1
1 25%
2 25%
3 25%
4 0%
5 0%
9 25%
I use this function to generate the list (is lapply the right way to do this?)
InternalSurveyPercent = lapply(InternalSurvey, function(x) prop.table(table(x)))
Then I multiply by 100 because it makes graphing my data easier.
InternalSurveyPercent = sapply(InternalSurveyPercent, "*", 100)
I'm not really sure where to go from here. I'm very confused about how the data is being structured at this point.
str(InternalSurveyPercent)
List of 4
$ Q1: table [1:5(1d)] 25.00 25.00 25.00 0.00 0.00 25.00
..- attr(*, "dimnames")=List of 1
.. ..$ x: chr [1:5] "1" "2" "3" "4" ...
Why is it returning a list? Why not a data frame with 4 variables (columns)? Thoughts on where I am going wrong/getting lost?
Thank you!
It seems folks have different interpretations of the desired output, so I'd suggest re-framing the question and the desired output more clearly. Anyhow, here is a data.table solution based on how I understand the question.
# the data
df <- read.table(text="Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai", header=T, as.is=T)
library(data.table)
# one liner to get the %
setDT(df)[,lapply(.SD, function(x) prop.table(table(x))*100)][]
# Q1 Q2 Q3 Q4
# 1: 25 75 75 25
# 2: 25 25 25 25
# 3: 25 75 75 25
# 4: 25 25 25 25
# If you prefer to stitch the result table together with the original, you could:
df2 <- setDT(df)[, lapply(.SD, function(x) prop.table(table(x)) * 100)]
df[, paste0("Q", 1:4, "%") := df2][]
# Q1 Q2 Q3 Q4 Q1% Q2% Q3% Q4%
# 1: 1 3 2 Mazda 25 75 75 25
# 2: 2 3 4 Ford 25 25 25 25
# 3: 3 5 2 Toyota 25 75 75 25
# 4: 9 3 2 Hyundai 25 25 25 25
This may be helpful. I am guessing that you have six options in Q1-Q3 (i.e., 1, 2, 3, 4, 5, and 9), but Q4 is a different kind of question, so it may not share the same options. Therefore, you will see ten options in the outcome.
devtools::install_github("hadley/tidyr")
library(tidyr)
# I am following your idea with the data provided by @LyzandeR
ana <- lapply(InternalSurvey, function(x) prop.table(table(x)))
bob <- data.frame(t(unnest(lapply(ana, as.data.frame.list))), stringsAsFactors = FALSE)
bob <- replace(bob, is.na(bob), 0)
colnames(bob) <- gsub("X", "Q", colnames(bob))
# Q1 Q2 Q3 Q4
#X1 0.25 0.00 0.00 0.00
#X2 0.25 0.00 0.75 0.00
#X3 0.25 0.75 0.00 0.00
#X9 0.25 0.00 0.00 0.00
#X5 0.00 0.25 0.00 0.00
#X4 0.00 0.00 0.25 0.00
#Ford 0.00 0.00 0.00 0.25
#Hyundai 0.00 0.00 0.00 0.25
#Mazda 0.00 0.00 0.00 0.25
#Toyota 0.00 0.00 0.00 0.25
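To get exactly the layout in the question (one column per question, one row per response option, including the 0% rows), here is a small base-R sketch, assuming Q1-Q3 share the option set 1, 2, 3, 4, 5, 9 (Q4, being the car question, would be tabulated separately). Tabulating each column as a factor with those common levels makes every table the same length, which is also why lapply() gave you a ragged list in the first place.
opts <- c(1, 2, 3, 4, 5, 9)  # assumed full option set for Q1-Q3
pct <- sapply(InternalSurvey[c("Q1", "Q2", "Q3")],
              function(x) prop.table(table(factor(x, levels = opts))) * 100)
pct  # matrix: rows = options, columns = questions
write.csv(pct, "survey_percentages.csv")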
