Extracting residuals of regressed data with different dimensions

I am running 500 linear regressions with a different dependent variable each time, but the same independent variables, using the following loop:
for(j in 1:500) {
lmj <- lm(formula = df[, j] ~ x1 + x2, data = df)
coeff[j,] <- t(lmj$coefficients)
}
However, the columns of df all have different ‘start’ and ‘end’ times, e.g.
> df[,1]
[1] NA NA NA NA NA NA NA NA
[9] NA NA NA NA NA NA NA NA
[17] NA NA -12.56643 2.90788 -15.80776 10.35763 18.22261 -8.33948
[25] -11.92777 3.35641 -9.13571 -27.46489 -14.18712 -3.75335 3.60028 -0.64753
[33] 1.07798 12.67291 8.83168 2.20233 11.13526 8.75306
> df[,2]
[1] NA NA NA NA NA NA NA 4.59821
[9] 1.80505 0.88652 1.05448 -7.39130 -0.46957 -5.85455 7.66825 -3.12985
[17] -6.58715 -9.43875 NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA
[33] NA NA NA NA NA NA
(Note all observations of the dependent variables are consecutive, and there are no NA values in either x1 or x2, i.e. x1 and x2 are both (38x1) vectors. Incidentally, x1 and x2 are the 501st and 502nd columns of df).
How can I save the residuals from each of these 500 regressions?
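One possible approach, sketched under assumptions rather than taken from an accepted answer: fit each model with na.action = na.exclude, so that residuals() pads the rows that were dropped for missing values with NA and every residual vector has the same length as df. The toy data frame and the names n_dep and res are made up for illustration; in the question n_dep would be 500.

```r
# Toy stand-in for the question's df: two dependent columns with NA at
# different positions, followed by the complete regressors x1 and x2.
set.seed(1)
n <- 38
df <- data.frame(
  y1 = c(rep(NA, 18), rnorm(20)),
  y2 = c(rep(NA, 7), rnorm(11), rep(NA, 20)),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

n_dep <- 2  # hypothetical name; 500 in the question
coeff <- matrix(NA_real_, nrow = n_dep, ncol = 3)
res   <- matrix(NA_real_, nrow = nrow(df), ncol = n_dep)

for (j in seq_len(n_dep)) {
  # na.exclude drops incomplete rows for fitting but remembers where they were
  fit <- lm(df[, j] ~ x1 + x2, data = df, na.action = na.exclude)
  coeff[j, ] <- coef(fit)
  # residuals() (unlike fit$residuals) re-inserts NA for the excluded rows
  res[, j] <- residuals(fit)
}
```

Each column of res then lines up row by row with df, with NA wherever the corresponding dependent variable was missing.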

Related

How to get name row as variable in function and plot density graph

I have a problem with my function; I don't know whether the problem is in the function itself or in the way I call it.
I have a big data frame with more than 20000 rows and around 700 columns, where each row is part of a gene. I want to calculate the density for each row and save a density plot labelled with the name of the gene.
baseM <- read.csv("expansions_full_omim_06_07_21.2.csv", sep = "\t")
rownames(baseM) <- paste(baseM$motif, baseM$chromosome, baseM$intervalle , baseM$gene , baseM$localisation, baseM$OMIM, sep = ".")
baseM.num <- baseM[sapply(baseM, is.numeric)]
names <- rownames(baseM.num.fltr)
d.density <- function(X, n){
#print(X)
d <- density(as.numeric(as.matrix(X)), na.rm=T)
peaks <- NULL
for (i in 2:(length(d$y)-1)) {
if (d$y[i-1] >= d$y[i] & d$y[i] <= d$y[i+1]) {
peaks <- cbind(peaks, c(d$x[i], d$y[i]))
}}
df <- data.frame(test =as.numeric(as.matrix(X)))
g <- ggplot(df, aes(x = as.numeric(as.matrix(test)))) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
ggsave(filename=paste("/work/gad/shared/analyse/STR/Marine/analysis/output/annotation/R_plots/", n, ".png", sep=""), plot=g)
#q <- plot(d)
#png(file=file_name)
#print(q)
#dev.off()
return(peaks)
}
baseM.num.fltr$peaks <- apply(temp, 1 , d.density, n=names)
I get my peaks correctly, but something is obviously wrong with the plot. I'm not sure the way I pass the name is correct; would something else be better or easier? I tried two ways of plotting, with and without ggplot2, but neither works. Thanks for your help!
This is the error I get:
NULL
Error: `device` must be NULL, a string or a function.
>
Example of my data :
> head(baseM)
motif chromosome intervalle gene localisation
1 AAAAAAAAAAAAAAAAAAAC chr2 (69131154, 69132154) BMP10 intergenic
2 AAAAAAAAAAAAAAAAAAAC chr2 (237411093, 237412093) IQCA1 intronic
3 AAAAAAAAAAAAAAAAAAAC chr2 (44378070, 44379070) LRPPRC intergenic
4 AAAAAAAAAAAAAAAAAAAC chr2 (105218444, 105219444) LINC01102 intergenic
5 AAAAAAAAAAAAAAAAAAAC chr2 (124310903, 124311903) LINC01826 intergenic
6 AAAAAAAAAAAAAAAAAAAC chr2 (30730559, 30731559) LCLAT1 intronic
OMIM
1 .
2 .
3 .,Mitochondrial complex IV deficiency, nuclear type 5, (French-Canadian), 220111 (3)
4 .
5 .
6 .
dijen003 dijen004 dijen005 dijen006 dijen007 dijen008 dijen009 dijen010
1 NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA
3 NA NA NA 5 NA NA NA NA
4 NA NA NA 5 NA NA NA NA
5 NA NA NA 5 NA NA NA NA
6 NA NA NA NA 5 NA NA NA
dijen011 dijen012 dijen013 dijen014 dijen015 dijen016 dijen017 dijen018
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
(Sorry, I know it's a short example, but the data is really big, and of course not all lines have that many NAs.)
For the device argument, use png or 'png'. (png() will also work, but only when the filename includes the '.png' extension.)
Example:
library(tidyverse)
set.seed(1L)
df <- tibble(a = rnorm(10))
df %>% ggplot(aes(a)) + geom_density()
ggsave("foo.png", device = "png")
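Separately from the device argument, the question also asks how to pass the name. In the question's call, n = names hands the whole vector of row names to every call, so paste() builds a vector of file names instead of one. A minimal sketch of one way to pair each row with its own name is shown here; it uses base graphics rather than ggplot2, and the matrix m, the helper plot_row_density and the output directory are hypothetical stand-ins, not the asker's objects.

```r
# Hypothetical stand-in for the numeric part of the data: one row per gene.
set.seed(1)
m <- matrix(rnorm(60), nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"), NULL))
out_dir <- tempdir()  # assumption: write the plots to a temporary directory

# Draw the density of one row into a PNG named after that row.
plot_row_density <- function(x, name, out_dir) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(NA_character_)  # density() needs at least 2 points
  file <- file.path(out_dir, paste0(name, ".png"))
  png(filename = file)
  on.exit(dev.off())
  plot(density(x), main = name)
  file
}

# Iterate over the row names so each call receives exactly one name.
files <- vapply(rownames(m),
                function(nm) plot_row_density(m[nm, ], nm, out_dir),
                character(1))
```

Looping over rownames(m) instead of using apply() keeps the row and its name together, which apply(X, 1, ...) does not do by itself.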

R na.approx error: need at least two non-NA values to interpolate

Sample Data
1/1/2000 NA NA NA 29.71 NA
1/2/2000 NA NA NA NA NA
1/3/2000 NA NA NA NA NA
1/4/2000 NA NA NA 29.25 NA
1/5/2000 NA NA NA 30.28 NA
1/6/2000 NA NA NA 27.66 NA
1/7/2000 NA NA NA 27.22 NA
1/8/2000 NA NA NA 27.27 NA
1/9/2000 170 4.1 NA 5.24 NA
1/10/2000 NA NA NA NA NA
1/11/2000 NA NA NA 27.65 NA
1/12/2000 NA NA NA 28.28 100.57
1/13/2000 NA NA NA 27.52 NA
I'm trying to interpolate a lot of NA values.
I have unique dates (the key), but most of the other data columns (combined_data_z[,a]) begin or end with NULL/NA values. I want to interpolate the empty values in these columns against the date, but I get this error when I try:
Error in approx(x[!na], y[!na], xout, ...) : need at least two non-NA values to interpolate
library(zoo)
#start with 2 because 1st column is date
a=2
for (i in parsedList)
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,a]
test1_z <- zoo(test1)
test1_z_approx <- na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend")
#print(test1_z_approx)
a=a+1
}
Update: apparently it has something to do with the for loop. When I removed it, tested with print statements and built up from there, I found that the code works when it is not enclosed in the loop's braces (but I need the loop).
dates <- combined_data_z[,1]
test1 <- combined_data_z[,4]
test1_z <- zoo(test1)
test1_z_approx <- na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend")
print(test1_z_approx)
For the following dataset you provided in comments this works:
library(zoo)
combined_data_z <- read.csv(file="http://thistleknot.sytes.net/wordpress/wp-content/uploads/2018/04/output_NoNA.csv")
test1_z_approx <- matrix(NA, ncol=ncol(combined_data_z)-2, nrow = nrow(combined_data_z))
for (i in 3:ncol(combined_data_z))
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,i]
test1_z <- zoo(test1)
test1_z_approx[,i-2] <-as.matrix( na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend"))[,1]
}
If your dataset starts with the "date" column, then the code will look like:
head(combined_data_z)
# date CPIAUCSL UNRATE MEHOINUSA672N INTDSRUSM193N CIVPART
# 1 1/1/2000 169.3 4 58544 5 67.3
# 2 1/2/2000 NA NA NA NA NA
# 3 1/3/2000 NA NA NA NA NA
# 4 1/4/2000 NA NA NA NA NA
# 5 1/5/2000 NA NA NA NA NA
# 6 1/6/2000 NA NA NA NA NA
test1_z_approx <- matrix(NA, ncol=ncol(combined_data_z)-1, nrow = nrow(combined_data_z))
for (i in 2:ncol(combined_data_z))
{
dates <- combined_data_z[,1]
test1 <- combined_data_z[,i]
test1_z <- zoo(test1)
test1_z_approx[,i-1] <-as.matrix( na.fill(na.approx(test1_z, x=dates, rule=2, na.rm = FALSE), "extend"))[,1]
}
head(test1_z_approx)
# [,1] [,2] [,3] [,4] [,5]
#[1,] 169.3000 4.000000 58544 5.000000 67.30000
#[2,] 224.0420 4.033100 59039 2.844406 64.07145
#[3,] 196.4639 3.959895 59039 4.579983 65.57215
#[4,] 188.9426 3.939930 59039 5.053322 65.98144
#[5,] 186.4355 3.933275 59039 5.211101 66.11786
#[6,] 183.9284 3.926620 59039 5.368881 66.25429
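The error message itself comes from approx(), which needs at least two observed points per column; in the sample data at the top of the question, one column is entirely NA and another has a single value. A minimal sketch of a guard for such columns follows. It calls base approx() directly rather than zoo's na.approx, and the helper interp_col, the numeric dates vector and the toy columns are assumptions made for illustration.

```r
# Interpolate one column against dates, but leave the column untouched when
# it has fewer than two non-NA values (the case that triggers the error).
interp_col <- function(dates, y) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)
  # rule = 2 extends the first/last observed value to leading/trailing NAs
  approx(dates[ok], y[ok], xout = dates, rule = 2)$y
}

dates <- 1:6  # assumption: dates already converted to a numeric scale
cols <- data.frame(
  a = c(NA, 2, NA, 4, NA, NA),  # two observed values: can be interpolated
  b = c(NA, 5, NA, NA, NA, NA), # one observed value: returned unchanged
  c = rep(NA_real_, 6)          # all NA: returned unchanged
)

filled <- as.data.frame(lapply(cols, interp_col, dates = dates))
```

Skipping the sparse columns is only one choice; dropping them or filling them with a constant would be equally defensible depending on the analysis.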
Thanks go to Katia for the assist (specifically, my x's and y's needed to be in separate data frames).
combined_data_z <- df3
#https://stackoverflow.com/a/50173660/1731972
#file begins with numeric iterations
#ncol(combined_data_z)
dates <- combined_data_z[1]
print(dates)
#important to start at 2!, otherwise na.approx will not work!
#either copy from 2: on or copy whole and drop first column (date)
#test1 <- combined_data_z[c(2:length(parsedList)+1)]
#drop date
test1 <- combined_data_z
test1[1] <- NULL
print(test1)
#wtf, had to add data.frame today!
test1_z <- zoo(data.frame(test1))
date_z <- zoo(data.frame(dates))
print(test1_z)
#colnames(test1_z)
print(dates)
test1_z_approx <- na.fill(na.approx(test1_z, dates$date, rule=2, na.rm = FALSE), "extend")
print(test1_z_approx)
#combine the dates with the interpolated columns into one data frame
new <- cbind(data.frame(dates), data.frame(test1_z_approx))
print(new)
write.csv(new, file = "output_test.csv")

Conditional subsetting of data frame keeping previous row

My data frame looks like this
Model w0 p0 w1 p1 w2 p.value
1 Null_model 3.950000e-05 0.7366921 0.988374029 0.000000e+00 1.296464
2 alt_test 1.366006e-02 0.4673263 0.139606503 3.049244e-01 1.146653
3 alt_ref 2.000000e-07 0.4673263 0.000846849 3.049244e-01 1.635038 5.550000e-15
8 Null_model 2.790000e-05 0.7240479 0.987016439 0.000000e+00 1.263556
9 alt_test 7.550000e-09 0.7231176 0.991768899 1.060000e-13 1.369259
10 alt_ref 2.770000e-05 0.7231176 0.995373167 1.060000e-13 1.192839 3.073496e-01
... ... ... ... ... ... ...
What I want is to subset my data.frame in a way that keeps every case where p.value < 0.05 but it also keeps the previous rows to these cases.
So ideally my output will be something like this
Model w0 w1 w2
2 alt_test 1.4e-02 0.139606503 1.146653
3 alt_ref 2.00e-07 0.000846849 1.635038
I've tried the following but it doesn't work quite right:
subset(v, p.value < 0.05, select = c(Model,w0,w1,w2))
the output doesn't have the alt_test row.
I have also tried
with(v, ifelse(p.value < 0.05, paste(dplyr::lag(c(w0,w1,w2),1)), ""))
and the output in this case looks like
[1] NA NA NA NA "0.013660056" NA NA NA NA ""
[11] NA NA NA NA "" NA NA NA NA ""
[21] NA NA NA NA "" NA NA NA NA ""
[31] NA NA NA NA "" NA NA NA NA ""
[41] NA NA NA NA "" NA NA NA NA ""
[51] NA NA NA NA "1.34e-11" NA NA NA NA "" ...
I also tried
subset(v, p.value < 0.05, select = c(w0, w1,w2, w0-1, w1-1, w2-1))
but this gives the previous column, so I was wondering if something similar can give previous rows instead?
Thank you
If your data.frame always has the alternating structure of alt_test followed by alt_ref, then you can construct the subset index manually as below:
library(data.table)
setDT(myDf)
myDf[Reduce(function(x,y) ifelse(!is.na(x), x, ifelse(!is.na(y), y, FALSE)),
shift(p.value < 0.05, n = 0:1, type = "lead")), .(Model,w0,w1,w2)]
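The same idea can also be sketched in base R, without data.table: find the rows where p.value < 0.05 and add the index of the row just before each of them. The data frame v below is a small reconstruction of the question's example (with only the columns needed here), so its exact contents are an assumption.

```r
# Small reconstruction of the question's data; blank p-values become NA.
v <- data.frame(
  Model   = rep(c("Null_model", "alt_test", "alt_ref"), 2),
  w0      = c(3.95e-05, 1.366006e-02, 2e-07, 2.79e-05, 7.55e-09, 2.77e-05),
  w1      = c(0.988374029, 0.139606503, 0.000846849,
              0.987016439, 0.991768899, 0.995373167),
  w2      = c(1.296464, 1.146653, 1.635038, 1.263556, 1.369259, 1.192839),
  p.value = c(NA, NA, 5.55e-15, NA, NA, 3.073496e-01)
)

hits <- which(v$p.value < 0.05)       # which() silently drops the NA rows
keep <- sort(unique(c(hits - 1, hits)))
keep <- keep[keep >= 1]               # guard in case the first row is a hit

result <- v[keep, c("Model", "w0", "w1", "w2")]
```

Here result contains the alt_test row together with the significant alt_ref row that follows it.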

Assign value to data frame in R to all elements conditionally

I am trying to assign a value to all cells in a data frame that have a specific value,
using this code:
train_data <- read.csv("train_set.csv",header=TRUE)
train_data[train_data == "<NA>"] <- 0
But it does not work; the values are still unchanged. How can I change them? The data in the CSV looks like this:
spec1 spec2
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
SP-0013 SP-0063
SP-0013 SP-0063
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
As @akrun and others have mentioned, we may need an actual copy of your data, but give this a shot:
train_data <- read.csv("train_set.csv", header=TRUE, na.strings = c("NA", "<NA>"), stringsAsFactors=FALSE)
train_data[is.na(train_data)] <- 0
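A short sketch of why the original comparison has no effect: a missing value is not the string "<NA>" (that is only how R prints NA in some contexts), and comparing NA with anything yields NA, so no cell is selected. The small data frame below is made up to mirror the question's columns.

```r
# Made-up data mirroring the question's two columns.
train_data <- data.frame(
  spec1 = c(NA, "SP-0013", NA),
  spec2 = c(NA, "SP-0063", NA),
  stringsAsFactors = FALSE
)

# No cell equals the literal string "<NA>", so nothing would be replaced.
n_literal <- sum(train_data == "<NA>", na.rm = TRUE)

# is.na() is the test that actually finds the missing cells.
train_data[is.na(train_data)] <- 0
```

Because the columns are character, the replacement is stored as the string "0" rather than the number 0.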

How can I add two columns in R

I cannot seem to add two columns in R.
When I try
dat$V1 + dat$V2
I get
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Warning message:
In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
Lots of other questions suggest doing what I have done, but as you can see it does not work for me. What is the problem?
Try converting your factor columns to numeric. If V1 and V2 are the first two columns:
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
For example:
dat <- data.frame(V1= factor(1:5), V2= factor(6:10))
dat$V1+dat$V2
#[1] NA NA NA NA NA
#Warning message:
#In Ops.factor(dat$V1, dat$V2) : + not meaningful for factors
dat[,1:2] <- lapply(dat[,1:2], function(x) as.numeric(as.character(x)))
dat$V1 +dat$V2
#[1] 7 9 11 13 15
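The as.character() step in the conversion matters. A factor stores integer level codes, and as.numeric() applied directly to a factor returns those codes rather than the values shown in the labels. A small sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "10"))

codes   <- as.numeric(f)                # internal level codes, not the labels
values  <- as.numeric(as.character(f))  # the numbers the labels represent
values2 <- as.numeric(levels(f))[f]     # equivalent conversion via the levels
```

Here codes is 1 2 1, while values and values2 are both 10 20 10.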

Resources