Remove two outliers in multiple regression - r

We have a problem removing two outliers from our dataset. The data come from an experiment with two independent variables and one dependent variable. We ran the multiple regression and inspected the "Normal Q-Q" plot, which flagged two outliers (cases 10 and 46). Now we would like to remove those two cases before rerunning the multiple regression without them.
We have already tried various commands recommended on several R forums, but unfortunately nothing worked.
We would be glad if anyone had an idea that could help us solve this problem.
Thank you very much for your help.

Since no data was provided, I fabricated some:
> x <- data.frame(a = c(10, 12, 14, 6, 10, 8, 11, 9), b = c(1, 2, 3, 24, 4, 1, 2, 4),
+                 c = c(2, 1, 3, 6, 3, 4, 2, 48))
> x
a b c
1 10 1 2
2 12 2 1
3 14 3 3
4 6 24 6
5 10 4 3
6 8 1 4
7 11 2 2
8 9 4 48
If the 4th case in column x$b and the 8th case in column x$c are outliers:
> x1 <- x[-c(4, 8), ]
> x1
a b c
1 10 1 2
2 12 2 1
3 14 3 3
5 10 4 3
6 8 1 4
7 11 2 2
Is this what you need?
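Back in the original setting, where the Q-Q plot of the fitted model flagged cases 10 and 46, the same indexing idea applies before refitting. A minimal sketch, assuming (hypothetically) that the data frame is called mydata with predictors x1 and x2 and response y:
# hypothetical object and column names; adapt to your own data
fit <- lm(y ~ x1 + x2, data = mydata)               # original multiple regression
plot(fit, which = 2)                                # Normal Q-Q plot that flagged cases 10 and 46
mydata_clean <- mydata[-c(10, 46), ]                # drop the two outlying rows by row index
fit_clean <- lm(y ~ x1 + x2, data = mydata_clean)   # refit without the outliers
summary(fit_clean)
Note that the numbers printed on the Q-Q plot are row names, so if your data frame has non-default row names you may need mydata[!rownames(mydata) %in% c("10", "46"), ] instead.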

Related

Divide data into chunks with multiple values in each chunk in R

I have a dataframe with observations from three years, with a column df$week that indicates the week of each observation. (The week count of the second year continues from the first, so the data contain 207 weeks.)
I would like to divide the data into longer time periods, adding a column df$period that groups all observations from several weeks together.
If a period were three weeks long and the data contained 13 observations over six weeks, the idea would be to divide
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
into
periods <- c(1, 1, 1, 2, 2, 3, 3), c(4, 5, 5, 6, 6, 6)
periods
[1] 1 1 1 2 2 3 3
[2] 4 5 5 6 6 6
so that the dataframe would look something like
> df
week period
1 1 1
2 1 1
3 1 1
4 2 1
5 2 1
6 3 1
7 3 1
8 4 2
9 5 2
10 5 2
11 6 2
12 6 2
13 6 2
>
The data contain more than 13k rows, so I would need to do some sort of map in the style of
mapPeriod <- function(df, fun) {
  out <- vector("vector_of_weeks", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
I just don't know what to put in fun to divide the weeks into the desired sequences of periods. Can the function rep be of assistance here? How?
I would be very grateful for any input and suggestions.
split(weeks, f = (weeks - 1) %/% 3)
$`0`
[1] 1 1 1 2 2 3 3
$`1`
[1] 4 5 5 6 6 6
From the comments below:
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
df <- data.frame(weeks)
library(data.table)
df$period <- data.table::rleid((weeks - 1) %/% 3)
# weeks period
# 1 1 1
# 2 1 1
# 3 1 1
# 4 2 1
# 5 2 1
# 6 3 1
# 7 3 1
# 8 4 2
# 9 5 2
# 10 5 2
# 11 6 2
# 12 6 2
# 13 6 2
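If you would rather not depend on data.table, the period index can also be computed directly in base R; a minimal sketch using the same example vector (note this numbers periods by their week range rather than by run, so an empty three-week window would not shift later period numbers):
weeks <- c(1, 1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6)
df <- data.frame(weeks)
# integer division maps weeks 1-3 to 0, weeks 4-6 to 1, and so on; add 1 to start at period 1
df$period <- (weeks - 1) %/% 3 + 1
df$period
# [1] 1 1 1 1 1 1 1 2 2 2 2 2 2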

Loop to sum observations of a time series in R

I'm sure this is very obvious, but I'm a beginner in R and I spent a good part of the afternoon trying to solve this...
I'm trying to create a loop to sum the observations in my time series in steps of five.
For example:
input:  1 2 3 4 5 5 6 6 7 4 5 5 4 4 5 6 5 6 4 4
output: 15 28 23 25
My time series has only one variable and 7825 observations.
The goal of the loop is to calculate the weekly realized volatility. My observations are squared returns; once I have my loop, I'll be able to take the square root and get my weekly realized volatility.
Thank you very much in advance for any help you can provide.
H.
We can create a grouping variable with gl and use that to get the sum in tapply
tapply(input, as.integer(gl(length(input), 5, length(input))),
FUN = sum, na.rm = TRUE)
# 1 2 3 4
# 15 28 23 25
data
input <- scan(text = "1 2 3 4 5 5 6 6 7 4 5 5 4 4 5 6 5 6 4 4", what = numeric())
Here is another base R option using sapply + split
> sapply(split(x,ceiling(seq_along(x)/5)),sum)
1 2 3 4
15 28 23 25
Data
x <- c(1, 2, 3, 4, 5, 5, 6, 6, 7, 4, 5, 5, 4, 4, 5, 6, 5, 6, 4, 4)
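Since the stated goal is weekly realized volatility from squared returns, the last step is just a square root of each weekly sum. A small sketch building on the tapply approach above; sq_returns stands in for the full 7825-observation series (here it reuses the short example data):
sq_returns <- c(1, 2, 3, 4, 5, 5, 6, 6, 7, 4, 5, 5, 4, 4, 5, 6, 5, 6, 4, 4)
weekly_sum <- tapply(sq_returns,
                     as.integer(gl(length(sq_returns), 5, length(sq_returns))),
                     FUN = sum, na.rm = TRUE)
weekly_rv <- sqrt(weekly_sum)   # weekly realized volatility
weekly_rv
#        1        2        3        4
# 3.872983 5.291503 4.795832 5.000000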

How do you efficiently return the order of an increasing index? [duplicate]

This question already has answers here:
Create group names for consecutive values
(4 answers)
Closed 4 years ago.
I have the following index vector:
TestVec = rep(c(6,8,9,11,18), each = 10)
This reads c(6, 6, ..., 6, 8, 8, ..., 8, 9, 9, ..., 9, ...).
I would like to convert this vector into c(1, 1, ..., 1, 2, 2, ..., 2, 3, 3, ..., 3, ...)
My try: I have improvised a quick-and-dirty method, as follows:
sapply(TestVec, function(x) {which(x == unique(TestVec))})
This works fine, but it takes a lot of time on a large dataset.
Is there a more efficient way to do this?
match(TestVec, unique(TestVec))
Another option:
as.numeric(as.factor(TestVec))
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5
Requiring data.table:
rleid(TestVec)
Here is another one (note the leading 0 so the first group starts at 1):
c(0, cumsum(diff(TestVec) != 0)) + 1
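A quick sanity check that the suggested approaches produce the same grouping on the example vector (a small sketch; the data.table line assumes the package is installed):
TestVec <- rep(c(6, 8, 9, 11, 18), each = 10)
a <- match(TestVec, unique(TestVec))
b <- as.numeric(as.factor(TestVec))   # works here because the values are increasing
d <- data.table::rleid(TestVec)       # requires data.table
e <- c(0, cumsum(diff(TestVec) != 0)) + 1
all(a == b, a == d, a == e)
# [1] TRUE
For timing on a large vector, wrapping each line in system.time() is usually enough.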

Deleting incomplete cases across multiple rows in R studio

Say I have a longitudinal data set as below
ID <- c(1, 1, 2, 2, 3, 3, 4, 4)
time <- c(1, 2, 1, 2, 1, 2, 1, 2)
value <- c(7, 5, 9, 2, NA, 3, 7, NA)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
In this dataset, we have 4 cases with data at two time points (let's say pre- and post-treatment).
What I want to do is set criteria to delete any case that is not complete for both time points. In this example, I would want to delete ID 3 (which is missing time point 1) and ID 4 (which is missing time point 2), like below:
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
I am not having much luck. I've tried variants of complete.cases() and which() to no avail.
I'm still new to R and would hugely appreciate it if anyone could help me out.
Edit: Thank you Ronak for answering my question. On reflection about my real data, I have encountered a second problem. My actual data are better reflected by the below:
ID <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8)
time <- c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1)
value <- c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6)
mydata <- data.frame(ID, time, value)
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
Here I would also want to remove IDs 5, 6, 7 and 8. These IDs have an entry for time 1 but not time 2. Hopefully this makes sense.
Thanks a heap
If you switch your data to wide format (where each time point is represented as its own column), then you can use na.omit. Using dplyr and tidyr functions:
library(dplyr)
mydata <- mydata %>%
  tidyr::spread(key = time, value = value) %>%  # reshape to wide
  na.omit() %>%  # delete cases with missingness on any variable (i.e. any time point)
  tidyr::gather(key = "time", value = "value", -ID)  # put it back in long format
> mydata
ID time value
1 1 1 7
2 2 1 9
3 1 2 5
4 2 2 2
Note that this will work (it will keep only cases with complete data for both time 1 and time 2) even when you have a time point missing without an explicit NA present in the data, like this:
> mydata
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
5 3 1 NA
6 3 2 3
7 4 1 7
8 4 2 NA
9 5 1 8
10 6 1 9
11 7 1 7
12 8 1 6
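As a side note, spread() and gather() have been superseded in newer tidyr releases; an equivalent sketch with pivot_wider()/pivot_longer() (assuming tidyr >= 1.0.0) would be:
library(dplyr)
library(tidyr)

# mydata as constructed in the edit above
mydata <- data.frame(ID    = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8),
                     time  = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1),
                     value = c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6))

mydata %>%
  pivot_wider(names_from = time, values_from = value) %>%    # one column per time point
  na.omit() %>%                                              # keep complete cases only
  pivot_longer(-ID, names_to = "time", values_to = "value")  # back to long format
The time column comes back as character; wrap it in as.integer() if the original type matters.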
You can do this easily with sqldf.
library(sqldf)
sqldf(' select * from (select ID, count(*) as cnt from mydata where value is not null group by id having cnt >1 ) t1 inner join mydata t2 on t1.ID=t2.ID')
You select the IDs that have a count greater than 1 among non-NA values, and then join back to the original data.
@Ronak already provided
mydata[!mydata$ID %in% mydata$ID[is.na(mydata$value)], ]
For the second part, you can just group over each ID and filter on its frequency:
k2 <- data.frame(table(mydata$ID))
k2$Var1[k2$Freq > 1]
and then do something like
mydata[mydata$ID %in% k2$Var1[k2$Freq > 1],]
See the updated answer
# Eliminates ID cases with NA
mydata = mydata[!mydata$ID %in% mydata[!complete.cases(mydata) ,]$ID, ]
library(plyr)
# counts all the IDs
cnt = count(mydata, "ID")
# Eliminates any ID that doesn't have 2 observations
mydata[mydata$ID %in% cnt[cnt$freq == 2, ]$ID, ]
ID time value
1 1 1 7
2 1 2 5
3 2 1 9
4 2 2 2
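For completeness, both requirements (no NA values and an entry at both time points) can also be expressed in a single grouped dplyr filter; a small sketch assuming, as in the example, that a complete case has exactly two rows:
library(dplyr)

# mydata as constructed in the edit above
mydata <- data.frame(ID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 7, 8),
                     time = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1),
                     value = c(7, 5, 9, 2, NA, 3, 7, NA, 8, 9, 7, 6))

mydata %>%
  group_by(ID) %>%
  filter(!any(is.na(value)), n() == 2) %>%  # drop IDs with any NA or with a missing time point
  ungroup()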

Create column in dataframe that samples from another column by factor levels

I would like column x3 of my dataframe dat to contain a random sample of column x2 but the random sample should only come from the same factor level given in column x1. I have researched the functions by(), ddply(), and sample(), but can't seem to make it work. I also checked a similar question but it didn't help me. You can see what I tried in the context of (what I hope is) a reproducible example below.
Here is the example dataframe:
dat <- data.frame(x1=c("a","a","a","b","b","b","c","c","c"),x2=1:9);
dat$x1 <- as.factor(dat$x1);
dat;
x1 x2
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 c 8
9 c 9
Then some of my non-working attempts to generate x3 were the following:
set.seed(99);
by(dat,FUN=dat$x1,dat$x3<-sample(dat$x1,1,replace=FALSE)); #this did not work at all
I also tried this
set.seed(99);
a <- by(dat,dat[,"x1"],function(d){sample(d$x2,3,replace=FALSE)},simplify=TRUE);
dat$x3<-a;
a;
dat[, "x1"]: a
[1] 2 1 3
---------------------------------------------------------------------------------------------------
dat[, "x1"]: b
[1] 6 5 4
---------------------------------------------------------------------------------------------------
dat[, "x1"]: c
[1] 9 7 8
dat;
> dat
x1 x2 x3
1 a 1 2, 1, 3
2 a 2 6, 5, 4
3 a 3 9, 7, 8
4 b 4 2, 1, 3
5 b 5 6, 5, 4
6 b 6 9, 7, 8
7 c 7 2, 1, 3
8 c 8 6, 5, 4
9 c 9 9, 7, 8
I kind of got what I needed into a, in that the random resampling by factor level is there, but a is not a simple vector. I feel that if a were a simple vector I would just about have what I need, as I could assign it to dat$x3. To sum up, I would want dat to turn out something like this:
dat
x1 x2 x3
1 a 1 2
2 a 2 1
3 a 3 3
4 b 4 6
5 b 5 5
6 b 6 4
7 c 7 9
8 c 8 7
9 c 9 8
The solution should be efficient for a dataframe with more than 2 million rows. Thanks to anyone for your help. I hope to return the help to others as I get better with R.
dat$x3 <- ave(dat$x2, dat$x1, FUN = sample)
The way you have constructed the output (with the same number of entries as there are rows of the dataframe), you will get permutations of the x2 values within distinct values of x1. (I edited your code to make it run.)
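If you are already using the tidyverse, the same grouped shuffle can be written with dplyr; a sketch (this is an alternative to the ave() answer above, not part of it):
library(dplyr)

dat <- data.frame(x1 = c("a", "a", "a", "b", "b", "b", "c", "c", "c"), x2 = 1:9)

set.seed(99)
dat <- dat %>%
  group_by(x1) %>%
  mutate(x3 = sample(x2)) %>%  # permute x2 within each level of x1
  ungroup()
One caveat for both versions: sample() applied to a single positive number n samples from 1:n, so if any factor level has only one row, use x2[sample.int(length(x2))] instead.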
