How to remove outliers using Cook's distance in R

I am writing a generic function in R that takes a data frame and a column name and returns the cleaned data frame without outliers:
cooks_dist <- function(dataframe,column){
dataframe <- dataframe %>% select_if(dataframe,is.numeric)
mod <- lm(column ~ ., data=dataframe)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd,na.rm=T))]) # influential row numbers
final <- dataframe[-influential,]
return(final)
}
But when I run this function, it fails with: Error: Can't convert a list to function
Data can be found at
http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv

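The error originates from dplyr::select_if(). The pipe already passes dataframe as the first argument, so the second dataframe inside select_if() is taken as the predicate, which is why dplyr complains that it cannot convert a list to a function. I believe you want a subset of all numeric columns, so you could alternatively create that subset with sapply(). Note: as your lm() line also produced errors, I've inserted the minimal (intercept-only) model instead.
So I think you want this: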
cooks_dist <- function(dataframe, column){
dataframe <- dataframe[, sapply(dataframe, is.numeric)]
mod <- lm(dataframe[, column] ~ 1, data = dataframe)
cooksd <- cooks.distance(mod)
influential <- as.numeric(names(cooksd)[(cooksd > 4 * mean(cooksd, na.rm = TRUE))])
final <- dataframe[-influential, ]
return(final)
}
df1 <- cooks_dist(df1, 4)
Yields:
> head(df1)
X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Rainfall House_Price
2 2 2 8294 8186 12694 1461 1752 210 3982000
3 3 3 11001 14399 16991 1340 1609 720 5401000
4 4 4 8301 11188 12289 1451 1748 620 5373000
5 5 5 10510 12629 13921 1770 2111 450 4662000
7 7 7 13153 11869 17811 1542 1858 1030 7224000
8 8 8 5882 9948 13315 1261 1507 1020 3772000
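If the goal is to regress the chosen column on all the other numeric columns, as the original lm(column ~ ., ...) line suggests, one possible sketch builds the formula from the column name. The function name cooks_filter and the use of the built-in mtcars data are assumptions made for illustration only; the 4 * mean threshold is taken from the question. Selecting rows by name with which() also avoids returning an empty data frame when no observation exceeds the threshold, which dataframe[-influential, ] does when influential is empty.

```r
# Sketch: drop rows whose Cook's distance exceeds 4 * mean(Cook's distance).
# `column` is the response column name, given as a string (assumption).
cooks_filter <- function(dataframe, column) {
  # keep numeric columns only
  num <- dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
  # build e.g. mpg ~ . from the column name
  mod <- lm(as.formula(paste(column, "~ .")), data = num)
  cooksd <- cooks.distance(mod)
  # which() skips NA/NaN distances and may legitimately return nothing
  influential <- names(cooksd)[which(cooksd > 4 * mean(cooksd, na.rm = TRUE))]
  num[!(rownames(num) %in% influential), , drop = FALSE]
}

# Usage on a built-in data set (illustration only)
cleaned <- cooks_filter(mtcars, "mpg")
nrow(mtcars) - nrow(cleaned)  # number of rows flagged as influential
```

The column is passed as a string here rather than as a position, so that the formula can be built from it.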

I used this code, with the Cook's distance threshold set to 4/n:
orig.mod <- lm(Outcome ~ Exposure, data=origdf)
origdf$cooksd <- cooks.distance(orig.mod)
origdf$cookyn <- ifelse(origdf$cooksd < 4/nrow(origdf), "keep","no")
minus.df <-subset(origdf, cookyn=="keep")
newmod.minuscooks <- lm(Outcome ~ Exposure, data=minus.df)

Related

wilcox.test() on many combinations of columns

I have a data.table with approximately 400 columns and 800,000 rows. The columns represent samples and the rows represent CpG sites. Example data here:
require(data.table)
samples <- replicate(200,replicate(1000,runif(1)))
cpgs <- paste0('cpg',1:1000)
n <- c('cpg',paste0('sample',1:200))
data <- data.table(cbind(cpgs,samples))
colnames(data) <- n
I want to run a wilcox.test() on randomly selected columns of this data 1000 times. I've currently implemented this the following way, but it's very slow on large numbers of permutations.
cases <- paste0('sample',1:10)
controls <- paste0('sample',30:40)
data[,wilcox_p:=wilcox.test( as.numeric(.SD[,mget(cases)]), as.numeric(.SD[,mget(controls)]) )$p.value,by=cpg]
Is there a more efficient way to do this? My complete use case, where getCpGSites() is the function described above, is here:
iterations_vec <- 1:1000
labels <- paste0('sample',1:200)
permutations <- foreach(i = iterations_vec, .combine='rbind', .multicombine = TRUE ) %dopar% {
case_labels <- sample(labels,num_cases,replace=FALSE)
control_labels <- labels[!labels %in% case_labels]
signature_cpgs <- getCpGSites(case_labels,control_labels)
num_signature_cpgs <- length(signature_cpgs)
out <- data.table('gene' = gene,
'iteration' = i,
'num_signature_cpgs' = num_signature_cpgs)
return(out)
}
Here's one approach, based on the tidyverse. First, convert all your character data to numeric, rather than delegating that to your function.
library(tidyverse)
numericData <- data %>% mutate(across(where(is.character), as.numeric))
Now write a function to perform a Wilcoxon test on a randomly selected pair of columns
randomWilcox <- function(d) {
cols <- sample(2:ncol(d), size=2, replace=FALSE)
d1 <- d %>% select(cpg, all_of(cols))
tibble(
col1=cols[1],
col2=cols[2],
p.value=wilcox.test(d1 %>% pull(2), d1 %>% pull(3))$p.value
)
}
Now use lapply to run the function 1000 times, with a very crude measure of speed:
startTime <- Sys.time()
lapply(1:1000, function(x) numericData %>% randomWilcox) %>% bind_rows()
endTime <- Sys.time()
# A tibble: 1,000 × 3
col1 col2 p.value
<int> <int> <dbl>
1 15 172 0.124
2 26 58 0.202
3 200 60 0.840
4 124 94 0.344
5 180 200 0.723
6 122 155 0.987
7 122 174 0.173
8 83 146 0.921
9 135 95 0.0605
10 168 174 0.0206
# … with 990 more rows
Each row of the output tibble contains the indices of the columns selected, and the p-value obtained from corresponding wilcox.test.
The time taken is about 13 seconds on my machine. Is that quick enough?
endTime - startTime
Time difference of 13.1156 secs
Edit
Removing the intermediate data frame reduces the time taken to just over nine seconds:
randomWilcox <- function(d) {
cols <- sample(2:ncol(d), size=2, replace=FALSE)
tibble(
col1=cols[1],
col2=cols[2],
p.value=wilcox.test(d %>% pull(cols[1]), d %>% pull(cols[2]))$p.value
)
}
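For the per-CpG comparison in the question (cases versus controls on every row), a base R sketch that works on a numeric matrix might look like the following. The matrix mat and its dimnames are assumptions that mirror the question's example data; no timing is claimed here, although avoiding mget() and character-to-numeric conversion inside every group is a plausible source of savings.

```r
# Sketch: one Wilcoxon test per row (CpG site) of a numeric matrix.
set.seed(1)
mat <- matrix(runif(1000 * 200), nrow = 1000,
              dimnames = list(paste0("cpg", 1:1000),
                              paste0("sample", 1:200)))

cases    <- paste0("sample", 1:10)
controls <- paste0("sample", 30:40)

# apply() hands each row over as a named numeric vector,
# so the sample labels can be used directly as indices
wilcox_p <- apply(mat, 1, function(row) {
  wilcox.test(row[cases], row[controls])$p.value
})

head(wilcox_p)
```

Inside a permutation loop, only cases and controls change between iterations, so the matrix needs to be built just once.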

Rerun the same code but changing column names at each loop

I have a script in R to calculate body condition residuals. I would like to apply this code to each set of columns, each of which corresponds to a specific category of individual.
For example, I would run this code to calculate body condition residuals of all individuals that are in category 1:
1) Select rows of interest
Data1 = RawData %>% select(ID, temperature, Bodysize1, mass1, year) %>% filter(temperature %in% c(20:29) & Bodysize1 %in% c(20:100) & mass1 %in% c(15:40))
2) Create a new model with created data
Model1 =lmer(log(mass1) ~ log(Bodysize1) + temperature + (1|year), data = Data1)
3) Extract residuals and add ID to the residuals
ResModel1 = resid(Model1)
ID=Data1$ID
Res1 =data.frame(ResModel1 ,ID)
4) Add residuals to my RawData
RawData2.0 = merge(RawData, Res1, by = c("ID"), all.x = T)
To avoid rerunning this code and manually changing every 1 to 2, then every 2 to 3, and so on, is there a way to automate this with loops or the apply family?
My data
ID TEMPERATURE BODYSIZE1 MASS1 BODYSIZE2 MASS2 YEAR
81-012 0.03830645 200 1450 205 1425 1981
84-069 0.26923078 200 1473 205 1498 1984
84-134 0.32692307 209 1448 195 1323 1984
84-145 0.27884614 197 1373 197 1498 1984
84-190 0.31129807 191 1248 195 1323 1984
85-155 0.33056709 198 1637 229 1988 1985
Thanks in advance
Without example data it is tough to say whether this will work, but creating a function might simplify your workflow:
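library(tidyverse)
library(lme4)  # provides lmer()
get_resid <- function(df, filters) {
  # rows that satisfy the filter condition are used to fit the model
  df_to_model <- df %>% filter({{filters}})
  # filtered rows whose original columns are bound to the residuals below
  df_to_keep <- df %>% filter({{filters}}, .preserve = FALSE)
  Data1 <- df_to_model %>%
    select(ID, temperature, Bodysize1, mass1, year)
  Model1 <- lmer(log(mass1) ~ log(Bodysize1) + temperature + (1|year), data = Data1)
  # residuals paired with the ID of each modelled row
  ResModel1 <- resid(Model1)
  ID <- Data1$ID
  Res1 <- data.frame(ResModel1, ID)
  Res1 %>%
    bind_rows(df_to_keep)
}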
Then you may use this function in your pipes:
RawData %>%
get_resid(temperature %in% c(20:29) & Bodysize1 %in% c(20:100) & mass1 %in% c(15:40))
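To repeat the model for category 1, 2, 3 and so on, one option is to build the column names from the suffix and loop with lapply(). The following is only a sketch under stated assumptions: it uses lm() without the (1|year) random effect so that it runs in base R (swap lmer() back in for the real analysis), it skips the filter step, the helper name resid_for is hypothetical, and the data frame reuses the six sample rows from the question with the lower-case column names used in the question's code.

```r
# Sample rows from the question, with the column names used in its code
RawData <- data.frame(
  ID = c("81-012", "84-069", "84-134", "84-145", "84-190", "85-155"),
  temperature = c(0.03830645, 0.26923078, 0.32692307,
                  0.27884614, 0.31129807, 0.33056709),
  Bodysize1 = c(200, 200, 209, 197, 191, 198),
  mass1 = c(1450, 1473, 1448, 1373, 1248, 1637),
  Bodysize2 = c(205, 205, 195, 197, 195, 229),
  mass2 = c(1425, 1498, 1323, 1498, 1323, 1988),
  year = c(1981, 1984, 1984, 1984, 1984, 1985)
)

# Hypothetical helper: fit the model for category i, return ID + residuals
resid_for <- function(i, data) {
  f <- as.formula(sprintf("log(mass%d) ~ log(Bodysize%d) + temperature", i, i))
  # na.exclude keeps the residual vector aligned with the rows of `data`
  fit <- lm(f, data = data, na.action = na.exclude)
  out <- data.frame(ID = data$ID, res = resid(fit))
  names(out)[2] <- paste0("Res", i)
  out
}

res_list <- lapply(1:2, resid_for, data = RawData)

# Merge every residual column back onto the raw data by ID
RawData2 <- Reduce(function(acc, res) merge(acc, res, by = "ID", all.x = TRUE),
                   res_list, RawData)
names(RawData2)
```

Extending 1:2 to the full set of categories is the only change needed once the columns follow the same naming pattern.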
You might try to i) create a tibble whose first column lists all dependent variables as strings,
ii) list the right-hand side of your model (the independent variables) in the second column,
iii) create the formula, as a string, in the third column,
iv) run your model in the fourth column.
# your_data and your_x_vars are placeholders for your own data frame and predictors
models <- tibble(dep = paste0("log(var", seq(1, 10, 1), ")"),
                 x = " ~ your_x_vars") %>%
  mutate(formula = paste0(dep, x)) %>%
  mutate(reg = map(formula, ~ lm(as.formula(.x), data = your_data)))
Then you can easily extract the residuals, e.g. with map(models$reg, resid).

Random sampling based on vector with multiple conditions R

I have a large data frame SYN_data with 150000 rows and 3 columns named SNP, Gene and count. There is also a list r of 2545 count values, some of them duplicated. I need to randomly sample 2545 rows without replacement from SYN_data whose count values appear in r. I have managed to do that much with this code:
test1 <- SYN_data[ sample( which( SYN_data$count %in% r ) , 2545 ) , ]
The second condition is that the 2545 sampled rows should contain exactly 1671 unique Genes, which means that some Genes have more than one SNP. Is there any way to incorporate this condition into the same code? Any other code that meets all the conditions would also be very helpful. Thanks!
Sample data:
# list
r <- c(1, 7, 3, 14, 9)
SYN_data$SNP <- c('1- 10068526', '1- 10129891', '1- 10200104',
'1- 10200491', '1- 10470141', '1- 10671598')
SYN_data$Gene <- c('AT1G28640', 'AT1G29030', 'AT1G29180',
'AT1G29180', 'AT1G29900', 'AT1G30290')
SYN_data$count <- c('14', '9', '3', '3', '7', '1')
Try using the following :
library(dplyr)
no_of_rows <- 2545
no_of_unique_gene <- 1671
temp <- SYN_data
while(n_distinct(temp$Gene) != no_of_unique_gene) {
gene <- sample(unique(SYN_data$Gene),no_of_unique_gene)
temp <- SYN_data[SYN_data$count %in% unique(r) & SYN_data$Gene %in% gene, ]
}
part1 <- temp %>% group_by(Gene) %>% sample_n(floor(no_of_rows/no_of_unique_gene))
part2 <- temp %>% anti_join(part1) %>% sample_n(no_of_rows - nrow(part1))
final <- bind_rows(part1, part2)
and now check length(unique(final$Gene)).
A possible approach is to sample 1671 unique genes first, then subset the dataset to the rows that belong to those genes and have a count in r. Here is an implementation of this approach in data.table:
# had to create some dummy data as it is not clear what the data looks like
library(data.table)
set.seed(0L)
nr <- 15e4
nSNP <- 1e3
nGene <- 1e4
ncount <- 1:14
r <- c(1,3,7,9,14)
SYN_data <- data.table(SNP=sample(nSNP, nr, TRUE),
Gene=sample(nGene, nr, TRUE), count=sample(ncount, nr, TRUE))
ncnt <- 2545
ng <- 1671
#sample 1671 genes
g <- SYN_data[, sample(unique(Gene), ng)]
#subset and sample the dataset
ix <- SYN_data[Gene %in% g & count %in% r, sample(.I, 1L), Gene]$V1
ans <- rbindlist(list(
SYN_data[ix],
SYN_data[-ix][Gene %in% g & count %in% r][, .SD[sample(.I, ncnt - ng)]]))
ans[, uniqueN(Gene)]
#1662 #not enough Gene in this dummy dataset
output:
SNP Gene count
1: 816 1261 14
2: 7 8635 1
3: 132 7457 1
4: 22 3625 3
5: 396 7640 7
---
2534: 423 6387 3
2535: 936 3908 7
2536: 346 9654 14
2537: 182 7492 3
2538: 645 635 1
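The shortfall noted above (1662 genes instead of 1671) can arise because the genes are sampled before filtering on count, so some chosen genes have no eligible row. A base R sketch that filters first and then samples the genes is shown below. The function name sample_rows_by_gene and the small dummy data are assumptions for illustration, and the sizes (120 rows, 100 genes) are scaled down from the question's 2545 and 1671.

```r
# Sketch: sample n_rows rows with count in r that cover exactly n_genes genes.
sample_rows_by_gene <- function(data, r, n_rows, n_genes) {
  pool <- data[data$count %in% r, ]                 # eligible rows only
  genes <- unique(as.character(pool$Gene))
  stopifnot(length(genes) >= n_genes)
  chosen <- genes[sample.int(length(genes), n_genes)]
  pool <- pool[as.character(pool$Gene) %in% chosen, ]
  stopifnot(nrow(pool) >= n_rows)
  idx <- seq_len(nrow(pool))
  # one guaranteed row per chosen gene
  first <- vapply(split(idx, as.character(pool$Gene)),
                  function(i) i[sample.int(length(i), 1)], integer(1))
  # fill the remainder from the other rows of the same genes
  rest_pool <- setdiff(idx, first)
  rest <- rest_pool[sample.int(length(rest_pool), n_rows - n_genes)]
  pool[c(first, rest), ]
}

# Dummy data (assumption): 2000 rows, 300 genes, counts between 1 and 14
set.seed(42)
SYN_small <- data.frame(SNP   = sample(1000, 2000, TRUE),
                        Gene  = sample(300, 2000, TRUE),
                        count = sample(1:14, 2000, TRUE))
r <- c(1, 3, 7, 9, 14)
out <- sample_rows_by_gene(SYN_small, r, n_rows = 120, n_genes = 100)
length(unique(out$Gene))
```

The stopifnot() guards fail early when the data cannot satisfy both conditions, instead of looping until a lucky draw.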

How to run a Student's t-test in a loop in R?

I'm new to R and what I want to do is very simple, but I need help.
I have a database that looks like the one above, where spot number = "name" of a protein, grupo = group I or II, and APF = fluorescence reading.
I want to run a Student's t-test for each protein, comparing groups I and II, but in a loop.
In the database above there is only 1 protein (147), but in my real database I have 444 proteins.
Starting with some fake data:
set.seed(0)
Spot.number <- rep(147:149, each=10)
grupo <- rep(rep(1:2, each=5), 3)
APF <- rnorm(30)
gel <- data.frame(Spot.number, grupo, APF)
> head(gel)
Spot.number grupo APF
1 147 1 2.1780699
2 147 1 -0.2609347
3 147 1 -1.6125236
4 147 1 1.7863384
5 147 1 2.0325473
6 147 2 0.6261739
You can use lapply to loop through the subsets of gel, split by the Spot.number:
tests <- lapply(split(gel, gel$Spot.number), function(spot) t.test(APF ~ grupo, spot))
or just
tests <- by(gel, gel$Spot.number, function(spot) t.test(APF ~ grupo, spot))
You can then move on to e.g. taking only the p values:
sapply(tests, "[[", "p.value")
# 147 148 149
#0.2941609 0.9723856 0.5726007
or confidence interval
sapply(tests, "[[", "conf.int")
# 147 148 149
# [1,] -0.985218 -1.033815 -0.8748502
# [2,] 2.712395 1.066340 1.4240488
And the resulting vector or matrix will already have the Spot.number as names which can be very helpful.
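With many proteins it may be handier to collect everything in a single data frame. A minimal sketch, reusing the fake gel data from this answer; the name summary_table and the chosen columns are assumptions:

```r
# Fake data as above
set.seed(0)
gel <- data.frame(Spot.number = rep(147:149, each = 10),
                  grupo = rep(rep(1:2, each = 5), 3),
                  APF = rnorm(30))

# One t-test per protein
tests <- lapply(split(gel, gel$Spot.number),
                function(spot) t.test(APF ~ grupo, data = spot))

# One row per protein: statistic, p value and confidence limits
summary_table <- data.frame(
  Spot.number = names(tests),
  t           = sapply(tests, function(tt) unname(tt$statistic)),
  p.value     = sapply(tests, "[[", "p.value"),
  ci.lower    = sapply(tests, function(tt) tt$conf.int[1]),
  ci.upper    = sapply(tests, function(tt) tt$conf.int[2]),
  row.names   = NULL
)
summary_table
```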
You can perform a t.test within each group using dplyr and my broom package. If your data is stored in a data frame called dat, you would do:
library(dplyr)
library(broom)
results <- dat %>%
group_by(Spot.number) %>%
do(tidy(t.test(APF ~ grupo, .)))
This works by performing t.test(APF ~ grupo, .) on each group defined by Spot.number. The tidy function from broom then turns it into a one-row data frame so that it can be recombined. The results data frame will then contain one row per protein (Spot.number) with columns including estimate, statistic, and p.value.
See this vignette for more on the combination of dplyr and broom.

R-sq values, linear regression of several trends within one dataset

I am running into a sticky spot trying to solve for variance accounted for by trend several times within a single data set.....
My data is structured like this
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750 " , header = T)
and I am trying to return something like this
STA R-sq
a n1
b n2
c n3
where n# is the corresponding r-squared value of the locations data in the original set....
I have tried
fit <- lm(VALUE ~ YEAR + STA, data = x)
to give the model of yearly trend of VALUE for each individual station over the years data is available for VALUE, within the master data set....
Any help would be greatly appreciated.... I am really stumped on this one and I know it is just a familiarity with R problem.
To get r-squared for VALUE ~ YEAR for each group of STA, you can take this previous answer, modify it slightly and plug-in your values:
library(plyr)  # provides dlply(), ldply() and ddply()
# assuming x is your data frame (make sure you don't have Hmisc loaded, it will interfere)
models_x <- dlply(x, "STA", function(df)
summary(lm(VALUE ~ YEAR, data = df)))
# extract the r.squared values
rsqds <- ldply(1:length(models_x), function(x) models_x[[x]]$r.squared)
# give names to rows and col
rownames(rsqds) <- unique(x$STA)
colnames(rsqds) <- "rsq"
# have a look
rsqds
rsq
a 0.6286064
b 0.5450413
c 0.8806604
EDIT: following mnel's suggestion here are more efficient ways to get the r-squared values into a nice table (no need to add row and col names):
# starting with models_x from above
rsqds <- data.frame(rsq =sapply(models_x, '[[', 'r.squared'))
# starting with just the original data in x, this is great:
rsqds <- ddply(x, "STA", summarize, rsq = summary(lm(VALUE ~ YEAR))$r.squared)
STA rsq
1 a 0.6286064
2 b 0.5450413
3 c 0.8806604
#first load the data.table package
library(data.table)
#transform your dataframe to a datatable (I'm using your example)
x<- as.data.table(x)
#calculate all the metrics needed (r^2, F-distribution and so on)
x[,list(r2=summary(lm(VALUE~YEAR))$r.squared ,
f=summary(lm(VALUE~YEAR))$fstatistic[1] ),by=STA]
STA r2 f
1: a 0.6286064 8.462807
2: b 0.5450413 5.990009
3: c 0.8806604 36.897258
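The same per-station values can also be obtained without extra packages. A base R sketch using split() and sapply() on the question's data; the name rsq is an assumption:

```r
# Data from the question
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750", header = TRUE)

# Fit VALUE ~ YEAR separately for each station and keep the r-squared
rsq <- sapply(split(x, x$STA),
              function(d) summary(lm(VALUE ~ YEAR, data = d))$r.squared)

data.frame(STA = names(rsq), rsq = unname(rsq))
```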
There's only one r-squared value, not three. Please edit your question.
# store the output
y <- summary( lm( VALUE ~ YEAR + STA , data = x ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
# or are you looking to run three separate
# lm() functions on 'a' 'b' and 'c' ..where this would be the first?
y <- summary( lm( VALUE ~ YEAR , data = x[ x$STA %in% 'a' , ] ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]

Resources