I have a dataframe that contains 7 columns.
str(df)
'data.frame': 8760 obs. of 7 variables:
$ G1_d20_2014.SE1_ : num 25.1 25.1 25 25 25.1 ...
$ G1_d20_2014.SE4_ : num 42.4 42.3 42.3 42.3 42.3 ...
$ G1_d20_2014.SE7_ : num 34.4 34.4 34.4 34.4 34.4 ...
$ G1_d20_2014.SE22_: num 42.5 42.4 42.3 42.4 42.3 ...
$ G1_d20_2014.SE14_: num 52.5 52.5 52.5 52.5 52.4 ...
$ G1_d20_2014.SE26 : num 40.8 40.8 40.8 40.8 40.8 ...
Each column represents a unique sensor, and the columns contain measurement data from the sensors. Some of the columns contain missing values. I want to fill the data gaps in each column by linear regression. I already did this manually, but there is one condition that is very important, and I'm looking for a function that handles it on its own, as it would take too much time to do this for all the columns. Here's the condition:
Let's say G1_d20_2014.SE1_ contains missing data. Then I want to fill the data gaps in that sensor with the complete dataset of whichever other sensor has the highest correlation coefficient with it.
Here is how I did that manually:
I created a function that builds an indicator variable: it is 1 where a value is not NA and 0 where it is NA. Then I added this variable as a column to the dataset:
Indvar <- function(t) {
  # 1 where a measurement is present, 0 where it is NA
  x <- numeric(length(t))
  x[!is.na(t)] <- 1
  return(x)
}
df$I <- Indvar(df$G1_d20_2014.SE1_)
Next I checked which sensor has the highest correlation coefficient with sensor 1 (in this case it was highest between SE1 and SE14). Then I computed the linear regression, took the equation from it, and put it into a for loop that fills up the NA values according to the equation whenever the indicator variable is 0:
lm(G1_d20_2014.SE1_ ~ G1_d20_2014.SE14_, data = df)

for (i in 1:nrow(df)) {
  if (df$I[i] == 0) {
    df$G1_d20_2014.SE1_[i] <- 8.037 + 0.315 * df$G1_d20_2014.SE14_[i]
  }
}
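(Side note: the loop can also be written without a loop, using a vectorized ifelse(); a minimal sketch with the same hard-coded coefficients:)

# vectorized equivalent of the loop above, using the same fitted coefficients
df$G1_d20_2014.SE1_ <- ifelse(df$I == 0,
                              8.037 + 0.315 * df$G1_d20_2014.SE14_,
                              df$G1_d20_2014.SE1_)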
This works perfectly fine, but it takes too much time because I have a lot of data frames that look like the one above.
I already tried impute_lm from the simputation package, but unfortunately it does not seem to care about where the correlation is highest before filling the data gaps. Here is what I wrote:
impute_fun <- impute_lm(df,
formula = SE1_ + SE4_ ~ SE14_ + SE26)
Since I wrote SE14_ + SE26, I checked whether it uses the values from SE14 for imputing the values in SE1, but it doesn't: the result differs from my manual result (presumably because impute_lm fits a single multiple regression on both predictors rather than picking the best-correlated one).
Is there any function that does what I want? I'm really frustrated because I've been looking for this for over 2 weeks now. I'd really really appreciate some help!
EDIT / reply to @jay.sf's answer below:
So I tried to make a function (see below) out of it, but there's something I struggle with:
I don't know how to tell the function to do this for every column, and to remove the name of the sensor currently being filled from the sapply(c("SE1_", "SE2_", ...)) vector. Obviously, if I run this for SE1_ while SE1_ is still in the vector, the correlation will be 1 and nothing happens. As you can see, this hard-coding is also a problem for the rest of the code, e.g. the line cor(df$SE1_, df[, x], use = "complete.obs") refers to df$SE1_ explicitly, and so does the df$SE1_imp <- ... line.
Of course I could just delete the sensor from the sapply(...) call so the first problem doesn't occur; I'm just wondering if there's a nicer way to do it. The same goes for the df$SE1_ parts: if I want to impute the values for SE2_, I'd have to change df$SE1_ to df$SE2_, and so on.
I ran the code like this (without SE1_ in the sapply(...), of course) and got the error: Error in df[, x] : incorrect number of dimensions.
Any ideas how to solve these issues?
impFUN <- function(df) {
corr <- sapply(c("SE1_", "SE2_", "SE4_", "SE5_","SE6_",
"SE7_", "SE12_", "SE13_","SE14_", "SE15_",
"SE16_", "SE22_","SE23", "SE24", "SE25",
"SE26", "SE33", "SE34", "SE35", "SE36",
"SE37", "SE46", "SE51", "SE52", "SE53",
"SE54", "SE59", "SE60", "SE61", "SE62",
"SE68", "SE69", "SE70", "SE71", "SE72",
"SE73","SE74", "SE82", "SE83", "SE84",
"SE85", "SE86", "SE87", "SE99","SE100",
"SE101", "SE102", "SE103","SE104",
"SE106", "SE107","SE121"), function(x)
cor(df$SE1_, df[, x], use = "complete.obs"))
imp.use <- names(which.max(corr))
regr.model <- lm(reformulate(imp.use, "SE1_"))
df$SE1_imp <-
ifelse(is.na(df$SE1_), lm.cf[1] + df[[imp.use]]*lm.cf[2], df$SE1_)
}
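For reference, here is a minimal sketch (my own generalization, not @jay.sf's code) of how the target sensor could be passed as an argument, so that the hard-coded df$SE1_ disappears and the target is dropped from the candidate list automatically. It assumes every column of df is a numeric sensor column:

impute_col <- function(df, target) {
  # candidate predictors: every column except the one being imputed
  candidates <- setdiff(names(df), target)
  corr <- sapply(candidates, function(x)
    cor(df[[target]], df[[x]], use = "complete.obs"))
  imp.use <- names(which.max(corr))            # best-correlated sensor
  lm.cf <- lm(reformulate(imp.use, target), data = df)$coef
  df[[target]] <- ifelse(is.na(df[[target]]),
                         lm.cf[1] + df[[imp.use]] * lm.cf[2],
                         df[[target]])
  df
}

# usage, e.g.: df <- impute_col(df, "SE1_")
# or for all sensors: for (s in names(df)) df <- impute_col(df, s)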
What about this? First check which sensor correlates most with sensor 1.
corr <- sapply(c("sensor.2", "sensor.3", "sensor.4"), function(x)
cor(dat$sensor.1, dat[,x], use="complete.obs"))
# sensor.2 sensor.3 sensor.4
# 0.04397132 0.26880412 -0.06487781
imp.use <- names(which.max(corr))
# [1] "sensor.3"
Calculate the regression model,
lm.cf <- lm(reformulate(imp.use, "sensor.1"), dat)$coef
and to impute sensor 1 use the coefficients in an ifelse like this:
dat$sensor.1.imp <-
ifelse(is.na(dat$sensor.1), lm.cf[1] + dat[[imp.use]]*lm.cf[2], dat$sensor.1)
Result
head(dat)
# sensor.1 sensor.2 sensor.3 sensor.4 sensor.1.imp
# 1 2.0348728 -0.6374294 2.0005714 0.03403394 2.0348728
# 2 -0.8830567 -0.8779942 0.7914632 -0.66143678 -0.8830567
# 3 NA 1.2481243 -0.9897785 -0.36361831 -0.1943438
# 4 NA -0.1162450 0.6672969 -2.84821295 0.2312968
# 5 1.0407590 0.1906306 0.3327787 1.16064011 1.0407590
# 6 0.5817020 -0.6133034 0.5689318 0.71543751 0.5817020
Toy data:
library('MASS')
set.seed(42)
M <- mvrnorm(n=1e2, mu=c(0, 0, 0, 0),
Sigma=matrix(c(1, .2, .3, .1,
.2, 1, 0, 0,
.3, 0, 1, 0,
.1, 0, 0, 1), nrow=4),
empirical=TRUE)
dat <- as.data.frame(`colnames<-`(M, paste0("sensor.", 1:4)))
dat[sample(1:nrow(dat), 30), "sensor.1"] <- NA ## generate 30% missings
Related
So I'm trying to find the roots for specific values of Y with uniroot(). I have them all in a column of a data frame, and I want to create a new column with the root found for each Y of the original column via lapply().
The way I'm creating the function that uniroot() takes as an argument is by subtracting the Y value from the constant term of the fitted polynomial, and that Y value is passed as an extra argument to uniroot() (according to the uniroot help page).
After a couple of hours trying to figure out what was happening, I realized that the value lapply() feeds to the function is the Y, but it's being read as the interval argument inside uniroot, thus giving me errors about this argument.
I think I could implement this another way, but it'd be much better and simpler if this way has a solution.
pol_mod <- lm(abs_p ~ poly(patron, 5, raw = TRUE), data = bradford)
a <- as.numeric(coefficients(pol_mod)[6])
b <- as.numeric(coefficients(pol_mod)[5])
c <- as.numeric(coefficients(pol_mod)[4])
d <- as.numeric(coefficients(pol_mod)[3])
e <- as.numeric(coefficients(pol_mod)[2])
f <- as.numeric(coefficients(pol_mod)[1])
fs <- function(x, y) { a*x^5 + b*x^4 + c*x^3 + d*x^2 + e*x + f - y }

interpol <- function(y, fs) {
  return(uniroot(fs, y = y, interval = c(0, 2000)))
}

bradford$concentracion <- lapply(bradford$abs_m, interpol, fs = fs)
The error I'm getting:
Error in uniroot(fs, y = y, interval = c(0, 2000)) :
f.lower = f(lower) is NA
Needless to say, everything works when applied outside of lapply().
I'd be really happy if someone could lend a hand! Thanks in advance!
EDIT: This is what the data frame looks like:
bradford
# A tibble: 9 x 3
patron abs_p abs_m
<dbl> <dbl> <dbl>
1 0 0 1.57
2 25 0.041 1.27
3 125 0.215 1.59
4 250 0.405 1.61
5 500 0.675 0.447
6 750 0.97 0.441
7 1000 1.23 NA
8 1500 1.71 NA
9 2000 2.04 NA
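One observation of mine (not part of the original post): abs_m is NA in rows 7 to 9, and an NA y makes fs() return NA at both interval endpoints, which triggers exactly the f.lower = f(lower) is NA error. A minimal guard, reusing fs from above, might look like this:

# hypothetical guard: skip NA inputs before calling uniroot()
interpol_safe <- function(y, fs) {
  if (is.na(y)) return(NA_real_)
  uniroot(fs, y = y, interval = c(0, 2000))$root   # keep just the root
}

bradford$concentracion <- vapply(bradford$abs_m, interpol_safe,
                                 numeric(1), fs = fs)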
I'm trying to return a parameter from a list, but I cannot find the parameter using str(list).
This is my code:
install.packages("meta")
library(meta)
m1 <- metacor(c(0.85, 0.7, 0.95), c(20, 40, 10))
m1
COR 95%-CI %W(fixed) %W(random)
1 0.8500 [0.6532; 0.9392] 27.9 34.5
2 0.7000 [0.4968; 0.8304] 60.7 41.7
3 0.9500 [0.7972; 0.9884] 11.5 23.7
Number of studies combined: k = 3
COR 95%-CI z p-value
Fixed effect model 0.7955 [0.6834; 0.8710] 8.48 < 0.0001
Random effects model 0.8427 [0.6264; 0.9385] 4.87 < 0.0001
How could I save COR (= 0.8427) or the p-value (< 0.0001) of the random effects model as a single parameter?
It seems that the numbers you are looking for (cor 0.8427) are created in print.meta. The function is too big, though, so I gave up trying to pinpoint exactly where the value gets calculated and what name it has. I don't think it is even saved within the function; rather, it is printed directly.
Anyway, I took the alternative road of capturing the output:
#capture the output of the summary - the fifth line gives us what we want
out <- capture.output(summary(m1))[5]
#capture all the numbers and return the first
unlist(regmatches(out, gregexpr("[[:digit:]]+\\.*[[:digit:]]*", out)))[1]
#[1] "0.8427"
I assume your problem is accessing the object.
The $ operator will help you with that: type the variable name, then the dollar sign, and press Tab, and the different components of the object will appear. According to your question, the values would be
> m1$cor[1]
[1] 0.85
> mysummary<-summary(m1)
> mysummary$fixed$p
[1] 2.163813e-17
> mysummary$fixed$z
[1] 8.484643
> ifelse(mysummary$fixed$p<0.0001, "<0.0001", "WHATEVER")
[1] "<0.0001"
To select a specific element, you can use [i], where i is an integer (for example, i = 1 gives 0.85).
To get the < 0.0001, I suggest using an ifelse() statement on the p-values or z with the corresponding rule. Cheers!
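If it helps, here is a sketch under the assumption that the meta object has its usual slots: metacor() pools on the Fisher-z scale by default (sm = "ZCOR"), so the random effects correlation should be recoverable by back-transforming TE.random:

# assumes the standard slots of a meta object and the default sm = "ZCOR"
tanh(m1$TE.random)   # back-transform Fisher's z to a correlation, ~0.8427
m1$pval.random       # p-value of the random effects model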
I have two densities that overlap as seen in the attached picture. I want to find out where the two lines meet. How would I go about doing that?
This is the code that produced the image:
... #reading in files etc.
pdf("test-plot.pdf")
d1 <- density(somedata)
d2 <- density(someotherdata)
plot(d1)
par(col="red")
lines(d2)
dev.off()
The original data are just two one-dimensional vectors, so what I'm interested in is the intersection point of their densities.
I tried to use the solution shown here, but unfortunately it neither gives me a number nor even draws the lines correctly.
EDIT: I have found what I was looking for:
# create and plot example data
set.seed(1)
plotrange <- c(-1,8)
d1 <- density(rchisq(1000, df=2), from=plotrange[1], to=plotrange[2])
d2 <- density(rchisq(1000, df=3)-1, from=plotrange[1], to=plotrange[2])
plot(d1)
lines(d2)
# look for points of intersection
poi <- which(diff(d1$y > d2$y) != 0)
# Mark those points with a circle:
points(x=d1$x[poi], y=d1$y[poi], col="red")
# or with lines:
abline(v=d1$x[poi], col="orange", lty=2)
abline(h=d1$y[poi], col="orange", lty=2)
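A possible refinement (my own sketch, not part of the solution above): the crossings found this way are only as precise as the density grid, so linearly interpolating between the two grid points that bracket each sign change gives a sharper estimate:

# refine each crossing by linear interpolation between bracketing grid points
dy  <- d1$y - d2$y
poi <- which(diff(sign(dy)) != 0)
x.cross <- d1$x[poi] - dy[poi] * diff(d1$x)[poi] / (dy[poi + 1] - dy[poi])
abline(v = x.cross, col = "blue", lty = 3)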
intersect(x,y)
see this help file
For example: If your data are in the same data.frame df
intersect(df$col1, df$col2)
Here is a small example extending John's answer.
require(ggplot2)
require(reshape2)
set.seed(12)
df <- data.frame(x = round(rnorm(100, 20, 10), 1),
                 y = round((100/log(100:199)), 1))

# Melt and plot
mdf <- melt(df)
str(mdf)
# 'data.frame': 200 obs. of 2 variables:
# $ variable: Factor w/ 2 levels "x","y": 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 16.8 25.7 20.5 22 19 ...
ggplot(mdf) +
geom_density(aes(x = value, color = variable))
# Find points that intersect
intersect(df$x, df$y)
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
# To make the answer more complete, here is the source code of intersect.
function (x, y)
{
y <- as.vector(y)
unique(y[match(as.vector(x), y, 0L)])
}
<bytecode: 0x10285d400>
<environment: namespace:base>
# It's actually possible to use unique and match to produce the same output
unique(as.vector(df$y)[match(as.vector(df$x), df$y, 0L)])
# [1] 18.9 20.1 21.3 21.5 21.0 19.6 19.0 20.0 19.8
I'm sure your answers are correct, but here's what finally worked for me:
d1$x[abs(d1$y - d2$y) < 0.00001 & d1$x < 1000 & d1$x > 500]  # & (not &&) so the comparison is vectorized
(I really only needed to find one value, and as a total R newbie I found it difficult to follow your answers, since I don't understand most basic R concepts yet. Thank you for your help, and sorry!)
I have a dataset of 162 rows by 152 columns. What I want to do is use stepwise regression, incorporating cross-validation, on the dataset to create a model and to test how accurate that model is.
ID RT (seconds) 76_TI2 114_DECC 120_Lop 212_PCD 236_X3Av
4281 38 4.086 1.2 2.322 0 0.195
4952 40 2.732 0.815 1.837 1.113 0.13
4823 41 4.049 1.153 2.117 2.354 0.094
3840 41 4.049 1.153 2.117 3.838 0.117
3665 42 4.56 1.224 2.128 2.38 0.246
3591 42 2.96 0.909 1.686 0.972 0.138
This is part of the dataset I have. I want to construct a model where my Y variable is RT (seconds) and my predictors are all the other 151 variables in my dataset. I was told to use the SuperLearner package, and the call for that is:
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
The problem is that I'm still rather new to R. The main way I've been reading my data in and running other machine learning algorithms on it is the following:
mydata <- read.csv("filepathway")
fit <- lm(RT..seconds~., data=mydata)
So how do I go about separating the RT (seconds) column from the rest of my data so that I can pass them to the function as X and Y? I.e. something along the lines of:
mydata <- read.csv("filepathway")
Y = mydata$RT..seconds. #separating my Y response variable
X = Alltheother151variables #separating all of my X predictor variables (all 151 of them)
SL.library <- c("SL.step")
test <- CV.SuperLearner(Y (i.e RT seconds column), X (all the other 151 variables that corresponds to the RT values), V = 10, SL.library = SL.library,
verbose = TRUE, method = "method.NNLS")
I hope this all makes sense. Thanks!
If the response variable is in the first column, you can simply use:
Y <- mydata[ , 1 ]
X <- mydata[ , -1 ]
The first argument of [ (the row number) is empty, so we keep all the rows,
and the second is either 1 (the first column) or -1 (everything but the first column).
If your response variable is elsewhere, you can use the column names instead:
Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]
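Putting the pieces together, a minimal sketch of the full call (assuming the SuperLearner package is installed and the response column is really named RT..seconds. after read.csv):

library(SuperLearner)

mydata <- read.csv("filepathway")   # placeholder path from the question

Y <- mydata[ , "RT..seconds." ]
X <- mydata[ , setdiff( colnames(mydata), "RT..seconds." ) ]

SL.library <- c("SL.step")
test <- CV.SuperLearner(Y = Y, X = X, V = 10, SL.library = SL.library,
                        verbose = TRUE, method = "method.NNLS")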
I tried the loop function given in this question and it seems to work. However, I still have two problems. First, I have 4753 comparisons, but R only lists those from 1946 to 4753. Is there a way to see the previous 1945 cases? I've already increased the length of my console buffer to 100000 lines, but that does not seem to work.
1946 1946 focushumrights pillar4info 0.867 1 0.352
1947 1947 focushumrights pillar4campagne 0.053 1 0.818
...
4752 4752 improveorglearning improvenetwork 49.064 9 0.000
4753 4753 improvetechexpert improvenetwork 43.738 9 0.000
Second, I get 4753 results, and of those only a few are significant. Is there a way to automatically filter out the significant cases, based on a p-value smaller than 0.1 or 0.05?
You are confusing what is displayed with what is stored. I presume you are using the answer from the question you reference in your own question. That answer is a function that returns a data frame: store the data frame and then select rows as needed. For example,
##Example function that returns a data frame
f <- function(N = 1000) {
  out <- data.frame("Row"        = 1:N,
                    "Column"     = 1:N,
                    "Chi.Square" = runif(N),
                    "df"         = sample(1:10, N, replace = TRUE),
                    "p.value"    = round(runif(N), 3))
  return(out)
}
#Would just print everything to the screen
f()
##Store in a data frame
results = f()
##Select rows as needed
results[results$p.value < 0.05,]
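The first problem (the rows that scrolled off the console) is handled the same way, by indexing the stored data frame:

head(results, 1945)                 # the first 1945 comparisons
results[results$p.value < 0.1, ]    # or the looser 0.1 cutoff from the question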