Using R to determine if errors are normally distributed

Say I have a dataset called wage that looks like this:
wage
# A tibble: 935 x 17
     wage hours    iq   kww  educ exper tenure   age married  black  south  urban  sibs brthord meduc
    &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;  &lt;int&gt; &lt;int&gt;  &lt;fctr&gt; &lt;fctr&gt; &lt;fctr&gt; &lt;fctr&gt; &lt;int&gt;   &lt;int&gt; &lt;int&gt;
 1    769    40    93    35    12    11      2    31       1      0      0      1     1       2     8
 2    808    50   119    41    18    11     16    37       1      0      0      1     1      NA    14
 3    825    40   108    46    14    11      9    33       1      0      0      1     1       2    14
 4    650    40    96    32    12    13      7    32       1      0      0      1     4       3    12
 5    562    40    74    27    11    14      5    34       1      0      0      1    10       6     6
 6   1400    40   116    43    16    14      2    35       1      1      0      1     1       2     8
 7    600    40    91    24    10    13      0    30       0      0      0      1     1       2     8
 8   1081    40   114    50    18     8     14    38       1      0      0      1     2       3     8
 9   1154    45   111    37    15    13      1    36       1      0      0      0     2       3    14
10   1000    40    95    44    12    16     16    36       1      0      0      1     1       1    12
# ... with 925 more rows, and 2 more variables: feduc <int>, lwage <dbl>
Say I then look at a simple linear regression between wage and IQ:
m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients
which gives me:
## (Intercept) iq
## 116.991565 8.303064
I want to check that the errors are:
$\epsilon_i \sim N(0, \sigma^2)$
How do I check this using R?

There are a number of ways you can try.
One way would be to use shapiro.test to test for normality. A p-value greater than your alpha level (typically up to 10%) would mean that the null hypothesis (i.e. that the errors are normally distributed) cannot be rejected. However, the test is sensitive to sample size, so you might want to reinforce your results by looking at the Q-Q plot.
You can see that by plotting m_wage_iq (plot(m_wage_iq)) and looking at the second graph. If your points approximately lie on the x = y line, that would suggest that the errors follow a normal distribution.
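For example, a minimal sketch using the model fitted above (both checks work on the residuals of m_wage_iq):
# Shapiro-Wilk test on the residuals; a large p-value means normality cannot be rejected
shapiro.test(residuals(m_wage_iq))

# Q-Q plot of the residuals (the second plot produced by plot(m_wage_iq))
plot(m_wage_iq, which = 2)

# or build the Q-Q plot by hand
qqnorm(residuals(m_wage_iq))
qqline(residuals(m_wage_iq))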

Related

Making multiple matrices for different runs of a simulation using for loops

I have data from a simulation with counts of interactions among 10 individuals, and there are 80 runs. I would like to make a separate matrix for each run, and then use a function to calculate the ranking of the individuals from the matrix for each run.
Is it possible to write for loops for:
- making the matrices for each run
- running a function through all the matrices
I am new to R, so I don't really know how to make these iterative loops. I made separate matrices and ran the function separately for each matrix, but this is very time-consuming and error-prone.
This is what the data looks like:
head(A)
   [run number] distribution who-won1 who-won2 won-battle
1             3    4 patches        7        5         17
2             3    4 patches        9        4         31
3             3    4 patches        0        1         11
4             3    4 patches        2        1          7
5             3    4 patches        2        9          4
6             3    4 patches        5        7         36
7             3    4 patches        9        6         10
8             3    4 patches        2        7          3
9             3    4 patches        1        0         19
10            3    4 patches        3        7          7
Then I used this to make the matrices; each is an actor-receiver matrix with the count of fights won for every actor-receiver pair.
Alist <- vector("list", 40)
for(run in 1:40){
newmatrix <- matrix(nrow = 10, ncol = 10)
for (x in 1:90) { #90 rows per run
Actor = A$Actor[A$Group== run][x] + 1
Receiver = A$Receiver[A$Group== run][x] + 1
Won = A$`won-battle`[A$Group== run][x]
newmatrix[Actor,Receiver] = as.numeric(Won)
}
newmatrix[is.na(newmatrix)] <- 0
groomlosepatchylist[[run]] <- newmatrix
}
and it gives a matrix like this:
    ...1    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10
   &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
 1     1     0    11    19    23    11     9     1    12    34     3
 2     2    19     0    25    24    13    12     5    12    35    13
 3     3    14     7     0    14     6     3     1     3    38     4
 4     4    16     8    10     0     1     5     2     7    19     8
 5     5    30    19    35    35     0    17     9    16    67    18
 6     6    31    50    52    38    21     0    21    36    83    26
 7     7    69    42    46    38    35    43     0    62    66    59
 8     8    38    23    48    44    19    17     7     0    66    21
 9     9    26    14    31    24     4     2     5     6     0    12
10    10    41    35    43    48    31    33    10    34    64     0
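For what it's worth, here is a minimal sketch of how both steps could be collapsed into lapply() calls. It assumes the data frame has the columns used in the code above (Group, Actor, Receiver and `won-battle`), and rank_individuals() is just a placeholder name for your own ranking function:
# build one 10 x 10 actor-receiver matrix for a single run
build_matrix <- function(run, data) {
  sub <- data[data$Group == run, ]
  m <- matrix(0, nrow = 10, ncol = 10)
  # +1 because the individuals are coded 0-9
  m[cbind(sub$Actor + 1, sub$Receiver + 1)] <- sub$`won-battle`
  m
}

# one matrix per run, then the ranking function applied to every matrix
Alist    <- lapply(sort(unique(A$Group)), build_matrix, data = A)
rankings <- lapply(Alist, rank_individuals)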

How to get the row numbers n days prior to last observation in R?

I have a big dataset with over 1000 subjects; a small piece of it looks like this:
mydata <- read.table(header=TRUE, text="
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")
I would like to get the row number of the observation that is 28 or more days prior to the last observation. E.g. for Id = 01, the last observation is at day 85; 85 minus 28 is 57, which is row number 2. For Id = 02, the last observation is at day 106; 106 minus 28 is 78, and because day 78 does not exist we use the row number of day 70, which is 1, i.e. the first observation for Id = 02 (I will be getting the row number for each observation separately).
This should work:
library(dplyr)

mydata %>%
  group_by(Id) %>%
  mutate(max = max(DAYS) - 28,
         row_number = last(which(DAYS <= max)))
# A tibble: 10 x 6
# Groups:   Id [2]
      Id  DAYS    QS Event   max row_number
   &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;      &lt;int&gt;
 1     1    50     1     1    57          2
 2     1    57     4     1    57          2
 3     1    70     1     1    57          2
 4     1    78     2     1    57          2
 5     1    85     3     1    57          2
 6     2    70     2     1    78          1
 7     2    92     4     1    78          1
 8     2    98     5     1    78          1
 9     2   105     6     1    78          1
10     2   106     7     0    78          1
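If you only want one row number per subject instead of repeating it on every row, a summarise() version of the same idea (a sketch under the same assumptions) would be:
mydata %>%
  group_by(Id) %>%
  summarise(row_number = last(which(DAYS <= max(DAYS) - 28)))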

How to reshape data frame from a row level to person level in R

I have the following code for a Netflix experiment to reduce the price of Netflix and see if people watch more or less TV. Each time someone uses Netflix, it shows what they watched and how long they watched it for.
library(tidyverse)

sample_size <- 10000
set.seed(853)

viewing_data <-
  tibble(unique_person_id = sample(x = c(1:100),
                                   size = sample_size,
                                   replace = TRUE),
         tv_show = sample(x = c("Broadchurch", "Duty-Shame", "Drive to Survive", "Shetland", "The Crown"),
                          size = sample_size,
                          replace = TRUE))
I then want to write some code that would randomly assign people into one of two groups (treatment and control). However, the dataset is at the row level, as there are 10,000 observations. I want to change it to the person level in R, so that I can assign each person to be either treated or not. A person should not be both treated and not treated. However, each person shows up many times, once per tv_show row. Does anyone know how to reshape the dataset in this case?
library(dplyr)

treatment <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(treated = sample(c("yes", "no"), size = 100, replace = TRUE))

viewing_data %>%
  left_join(treatment, by = "unique_person_id")
You can change the way of sampling if you need to...
You can do the below; this groups your observations by person id and assigns a single "treated"/"control" label per group:
library(dplyr)

viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))
# A tibble: 10,000 x 3
# Groups:   unique_person_id [100]
   unique_person_id tv_show          group
              &lt;int&gt; &lt;chr&gt;            &lt;chr&gt;
 1                9 Drive to Survive control
 2               64 Shetland         treated
 3               90 The Crown        treated
 4               93 Drive to Survive treated
 5               17 Duty-Shame       treated
 6               29 The Crown        control
 7               84 Broadchurch      control
 8               83 The Crown        treated
 9                3 The Crown        control
10               33 Broadchurch      control
# … with 9,990 more rows
We can check our results: every id has only one group (treated or control):
newdata <- viewing_data %>%
  group_by(unique_person_id) %>%
  mutate(group = sample(c("treated", "control"), 1))

tapply(newdata$group, newdata$unique_person_id, n_distinct)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
In case you wanted random and equal allocation of persons into the two groups (complete random allocation), you can use the following code.
library(dplyr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = sample(100),                    # in case the ids are not truly random
         group = ifelse(group %% 2 == 0, 0, 1))  # works if only two groups
Persons
# A tibble: 100 x 2
   unique_person_id group
              &lt;int&gt; &lt;dbl&gt;
 1                1     0
 2                2     0
 3                3     1
 4                4     0
 5                5     1
 6                6     1
 7                7     1
 8                8     0
 9                9     1
10               10     0
# ... with 90 more rows
And to check that we've got 50 in each group:
Persons %>% count(group)
# A tibble: 2 x 2
  group     n
  &lt;dbl&gt; &lt;int&gt;
1     0    50
2     1    50
You could also use the randomizr package, which has many more features apart from complete random allocation.
library(randomizr)

Persons <- viewing_data %>%
  distinct(unique_person_id) %>%
  mutate(group = complete_ra(N = 100, m = 50))

Persons %>% count(group)  # Check
To link this back to the viewing_data, use inner_join.
viewing_data %>% inner_join(Persons, by="unique_person_id")
# A tibble: 10,000 x 3
   unique_person_id tv_show          group
              &lt;int&gt; &lt;chr&gt;            &lt;int&gt;
 1               10 Shetland             1
 2               95 Broadchurch          0
 3                7 Duty-Shame           1
 4               68 Drive to Survive     0
 5               17 Drive to Survive     1
 6               70 Shetland             0
 7               78 Drive to Survive     0
 8               21 Broadchurch          1
 9               80 The Crown            0
10               70 Shetland             0
# ... with 9,990 more rows

How can I avoid right-truncated subjects being dropped?

I'm doing a survival analysis of the time that individual components remain in the source code of a software project, but some of these components are being dropped by the survfit function.
This is what I'm doing:
library(survival)
data <- read.table(text = "component_id weeks removed
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 2 0
9 2 0
10 2 0
11 2 0
12 2 1
13 2 1
14 2 0
15 2 0
16 2 0
17 2 0
18 2 0
19 2 0
20 2 1
21 2 1
22 2 0
23 2 0
24 3 1
25 3 1
26 3 1
27 3 1
28 7 1
29 7 1
30 14 1
31 14 1
32 14 1
33 14 1
34 14 1
35 14 1
36 14 1
37 14 1
38 14 1
39 14 1
40 14 1
41 14 1
42 14 1
43 14 1
44 14 1
45 14 1
46 14 1
47 14 1
48 40 1
49 40 1
50 40 1
51 40 1
52 48 1
53 48 1
54 48 1
55 48 1
56 48 1
57 48 1
58 48 1
59 48 1
60 56 1
61 56 1
62 56 1
63 56 1
64 56 1
65 56 1
66 56 1
67 56 1
68 56 1
69 56 1", header = TRUE)
fit <- survfit(Surv(data$weeks, data$removed) ~ 1)
summary(fit, censored=TRUE)
And this is the output
Call: survfit(formula = Surv(data$weeks, data$removed) ~ 1)
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     69       7    0.899  0.0363        0.830        0.973
    2     62       4    0.841  0.0441        0.758        0.932
    3     46       4    0.767  0.0533        0.670        0.879
    7     42       2    0.731  0.0567        0.628        0.851
   14     40      18    0.402  0.0654        0.292        0.553
   40     22       4    0.329  0.0629        0.226        0.478
   48     18       8    0.183  0.0520        0.105        0.319
   56     10      10    0.000     NaN           NA           NA
I was expecting the number of events to be 69, but 12 subjects get dropped.
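Tabulating the censoring indicator shows where the 12 comes from: 57 components have removed = 1 and 12 have removed = 0, so the 57 events plus the 12 censored observations account for all 69 components.
table(data$removed)
##  0  1
## 12 57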
I initially thought I was misusing the package functions, so I tried a type = "interval2" approach, following a similar situation, but the drops keep happening, now with oddly non-integer counts of subjects and events:
as.t2 <- function(i, data) if (data$removed[i] == 1) data$weeks[i] else NA
size <- length(data$weeks)
t1 <- data$weeks
t2 <- sapply(1:size, as.t2, data = data)
interval_fit <- survfit(Surv(t1, t2, type="interval2") ~ 1)
summary(interval_fit, censored=TRUE)
Next, I found what I call a mid-air explanation, which clarifies the situation a bit further. I understand this is caused by non-censored subjects appearing after a "constant censoring time", but again, why?
That led me somehow to dig deeper and read about right-truncation and realized that type of studies mapped very closely to the drops I'm experiencing. Here's Klein & Moeschberger:
Truncation of survival data occurs when only those individuals whose event time lies within a certain observational window $(Y_L,Y_R)$ are observed. An individual whose event time is not in this interval is not observed and no information on this subject is available to the investigator.
Right truncation occurs when $Y_L$ is equal to zero. That is, we observe the survival time $X$ only when $X \leq Y_R$.
From my perspective, these drops carry important information for my study regardless of their time of entry.
How can I stop the drops?

Loop linear regression model

I have data like this, where Amount is the dependent variable and len, Age, quantity and Pos are explanatory variables. I am trying to make a regression of Amount on Age, quantity and Pos using stepwise selection.
ID Sym Month Amount len Age quantity Pos
11 10 1 500 5 17 0 12
22 10 1 300 6 11 0 11
33 10 1 200 2 10 0 10
44 10 1 100 2 11 0 11
55 10 1 150 4 15 0 12
66 10 1 250 4 16 0 14
11 20 1 500 5 17 0 12
22 20 1 300 6 11 0 11
33 20 1 200 2 10 0 10
44 20 1 100 2 11 0 11
55 20 1 150 4 15 0 12
66 20 1 250 4 16 0 14
77 20 1 700 4 17 0 11
88 20 1 100 2 16 0 12
11 30 1 500 5 17 0 12
22 30 1 300 6 11 0 11
33 30 1 200 2 10 0 10
44 30 1 100 2 11 0 11
55 30 1 150 4 15 0 12
66 30 1 250 4 16 0 14
11 40 1 500 5 17 2000 12
22 40 1 300 6 11 1000 11
33 40 1 200 2 10 1000 10
44 40 1 100 2 11 1000 11
55 40 1 150 4 15 1000 12
66 40 1 250 4 16 1000 14
And the output I want after running all the regressions should be a data frame that looks like this (it should help me detect outliers):
Id Month Sym Amount len Age Quantity Pos R^2 CookDistanse Residuals UpperLimit LowerLimit
11 1 10 500 5 17 null 12 0.7 1.5 -350 -500 1000
22 1 10 300 6 11 null 11 0.8 1.7 -400 -500 1000
This is the code that I am trying to run for Sym = 10, Sym = 20, Sym = 30, Sym = 40.
I have something like 400 Sym values to run a regression analysis on.
fit[i] <- step(lm (Sym[i]$Sum ~ len + Age + Quantity,
na.action=na.omit), direction="backward")
R_Sq <- summary(fit[i])$r.squared
Res[i] <- resid(fit[i])
D[i] <- cooks.distance(fit[i])
Q[i] <- quantile(resid(fit[i]), c(.25, .50, .75, .99))
L[i]<- Q[1][i] - 2.2* (Q[3][i]-Q[1][i])
U[i] <- Q[3][i] + 2.2*(Q[3][i]-Q[1][i])
"i" means the results for the regression of sym = i (10,20..).
Any way to do this on loop for every Sym value?
Any help will be highly appreciate.
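One possible shape for such a loop is sketched below. It is only a sketch: it assumes your data frame is called df, uses Amount as the response with predictor column names taken from the data above, and leaves out the per-observation Id, Month and Amount columns you listed in the desired output.
# split the data by Sym and fit one stepwise model per subset
models <- lapply(split(df, df$Sym), function(d) {
  step(lm(Amount ~ len + Age + quantity, data = d, na.action = na.omit),
       direction = "backward", trace = 0)
})

# collect per-observation diagnostics for every Sym into one data frame
diagnostics <- do.call(rbind, lapply(names(models), function(s) {
  fit <- models[[s]]
  res <- resid(fit)
  q   <- quantile(res, c(.25, .50, .75, .99))
  data.frame(Sym          = s,
             R_sq         = summary(fit)$r.squared,
             CookDistance = cooks.distance(fit),
             Residuals    = res,
             LowerLimit   = q[1] - 2.2 * (q[3] - q[1]),
             UpperLimit   = q[3] + 2.2 * (q[3] - q[1]))
}))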
