NA issues with linear regression in R

I want to regress y1 on x, then y2 on x, and so on, and save the slope, intercept, r2 and p values to a vector, such that the final vector contains values for y1...yn even when the value is NA.
The following is how my data looks
y1 y2 y3 y4 y5 x
NA 89 86 91 69 1
NA 86 NA 88 NA 2
NA 86 83 88 66 3
NA 100 NA 102 80 4
Using the following code, the slope values are not calculated for columns where all the values of y[,i] are NA, but they are still calculated when only some of the values are NA.
slope <- 0
for (i in 1:length(names(y))) {
  if (all(is.na(y[, i])) == FALSE) {
    m <- lm(y[, i] ~ x)
    slope <- c(slope, m$coefficients[2])
  }
}
However, I still cannot figure out a way to maintain the positional information of all y[,i], so that my final vector output would look something like this:
slope
1 NA
2 9.362637e-01
3 8.461538e-01
4 3.450549e-01
5 6.593407e-01
Any help will be much appreciated.

sapply over the non-x columns of DF, returning the coefficients if there are any non-NAs in the dependent variable (y) and returning NAs otherwise:
t(sapply(DF[-6], function(y) if (any(!is.na(y))) coef(lm(y ~ x, DF)) else c(NA, NA)))
This gives the following where column 1 is the intercepts and column 2 is the slopes:
[,1] [,2]
y1 NA NA
y2 82.00000 3.300000
y3 87.50000 -1.500000
y4 84.00000 3.300000
y5 63.85714 2.928571
If only the slopes are needed then:
matrix(sapply(DF[-6], function(y) if (any(!is.na(y))) coef(lm(y ~ x, DF))[2] else NA))
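Since the question also asks for r2 and p-values, the same pattern extends naturally. This is a sketch (not from the original answer), assuming every non-empty column has at least two non-NA values so that summary() returns a full coefficient row for the slope:
t(sapply(DF[-6], function(y) {
  if (any(!is.na(y))) {
    s <- summary(lm(y ~ x, DF))
    c(intercept = s$coefficients[1, 1],
      slope     = s$coefficients[2, 1],
      r.squared = s$r.squared,
      p.value   = s$coefficients[2, 4])  # p-value of the slope
  } else {
    c(intercept = NA, slope = NA, r.squared = NA, p.value = NA)
  }
}))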

# This is for the slope only.
nn <- lapply(paste0("y", 1:5), function(i) {
  if (all(is.na(y[[i]])) == FALSE) {
    bb <- lm(y[[i]] ~ x, data = y)
    return(bb[[1]][2])
  } else {
    return(NA)
  }
})
do.call(rbind, nn)
x
[1,] NA
[2,] 3.300000
[3,] -1.500000
[4,] 3.300000
[5,] 2.928571

Related

Adding data by row into an empty matrix and handling missing data

I have an empty matrix with a certain number of columns that I'm trying to fill row by row with the output vectors of a for loop. However, some of the outputs are not the same length as the number of columns in my matrix, and I just want to fill those "empty spaces" with NAs.
For example:
matrix.names <- c("x1", "x2", "x3", "x4", "y1", "y2", "y3", "y4", "z1", "z2", "z3", "z4")
my.matrix <- matrix(ncol = length(matrix.names))
colnames(my.matrix) <- matrix.names
This would be the output from one iteration:
x <- c(1,2)
y <- c(4,2,1,5)
z <- c(1)
Where I would want it in the matrix like this:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
The output from the next iteration would be, for example:
x <- c(1,1,1,1)
y <- c(0,4)
z <- c(4,1,3)
And added as a new row in the matrix:
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
[2,] 1 1 1 1 0 4 NA NA 4 1 3 NA
It's not really a concern if I have a 0; it's just where there is no data. Also, the data is saved in such a way that whatever is there is listed in the row first, followed by NAs in the empty slots. In other words, I'm not worried about an NA popping up first.
Also, is such a thing better handled in data frames rather than matrices?
Not the most efficient answer, just a try.
The logic: extend each vector's length to 4 (an exception arises if x/y/z is already of length 4; see the note below), and therefore extract only the first 4 elements while rbind-ing.
x[length(x)+1:4] <- NA
y[length(y)+1:4] <- NA
z[length(z)+1:4] <- NA
my.matrix <- rbind(my.matrix,c(x[1:4],y[1:4],z[1:4]))
Note: the exception I mentioned above looks like this:
> x <- c(1,1,1,1)
> x
[1] 1 1 1 1
> x[length(x)+1:4] <- NA
> x
[1] 1 1 1 1 NA NA NA NA # therefore I extract only the first four
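As an aside (a sketch, not from the original answer; pad4 is a hypothetical helper), assigning to length() both pads shorter vectors with NA and truncates longer ones, so the exception above disappears:
pad4 <- function(v) { length(v) <- 4; v }  # pads with NA, truncates past 4
my.matrix <- rbind(my.matrix, c(pad4(x), pad4(y), pad4(z)))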
Here is an option to do this programmatically: stack the vectors into a long data frame, build the column names ("x1", "x2", ...) from each value's position within its group, and assign the values by matching those names against the matrix's column names.
d1 <- stack(mget(c("x", "y", "z")))[2:1]
nm <- with(d1, paste0(ind, ave(seq_along(ind),ind, FUN = seq_along)))
my.matrix[,match(nm,colnames(my.matrix), nomatch = 0)] <- d1$values
my.matrix
# x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
#[1,] 1 2 NA NA 4 2 1 5 1 NA NA NA
Or another option is stri_list2matrix from stringi
library(stringi)
m1 <- as.numeric(stri_list2matrix(list(x,y, z)))
Change the 'x', 'y', 'z' values
m2 <- as.numeric(stri_list2matrix(list(x,y, z)))
rbind(m1, m2)
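For completeness, a minimal end-to-end run of the stringi route on the question's two iterations (note that stri_list2matrix pads each list element to the length of the longest one, so this relies on at least one vector having the full length of 4 per iteration):
library(stringi)
x <- c(1, 2); y <- c(4, 2, 1, 5); z <- c(1)
m1 <- as.numeric(stri_list2matrix(list(x, y, z)))
x <- c(1, 1, 1, 1); y <- c(0, 4); z <- c(4, 1, 3)
m2 <- as.numeric(stri_list2matrix(list(x, y, z)))
rbind(m1, m2)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# m1    1    2   NA   NA    4    2    1    5    1    NA    NA    NA
# m2    1    1    1    1    0    4   NA   NA    4     1     3    NA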

Deleting all variables with over 30% missing values

I found this function to detect proportions of missing values for each column in any given dataframe:
propmiss <- function(dataframe) {
  lapply(dataframe, function(x) {
    data.frame(nmiss = sum(is.na(x)), n = length(x),
               propmiss = sum(is.na(x)) / length(x))
  })
}
I assign it to a variable like this:
propmissdf <- propmiss(df)
Then I loop through the dataframe to NULL variables in my data like this:
for (i in (1:length(df))) {
  var = names(df)[i]
  if ((propmissdf[[var]][[3]]) > 0.3) { # index 3 is the proportion inside propmissdf
    df[var] <- NULL
  }
}
This gives me an error:
Error in if ((propmissdf[[var]][[3]]) > 0.3) { : argument is of length zero
But it works, somehow. It gets rid of several variables with missing-value proportions greater than 0.3, but if I run the for loop again it gets rid of more, and after 3 or 4 more runs it gets rid of all of them. Why is this happening? Please feel free to correct my problem, or to come up with a better way to remove variables with over 30% NAs.
The loop misbehaves because deleting a column shifts the positions of all the columns after it while `i` keeps advancing, so some columns are skipped on every pass (which is why repeated runs keep removing more), and eventually the loop indexes past the end of the shrunken data frame. Instead of a loop, you can use something like this:
df <- df[colSums(is.na(df))/nrow(df) < .3]
colSums(is.na(df)) would calculate how many NA values there are in each column.
Divide that output by the number of rows in the data.frame to get the proportion.
Use < .3 to create a logical comparison that can be used to subset the relevant columns.
Sample data and example:
set.seed(2)
df <- data.frame(matrix(sample(c(NA, 1:4), 20, TRUE), nrow = 4))
df
# X1 X2 X3 X4 X5
# 1 NA 4 2 3 4
# 2 3 4 2 NA 1
# 3 2 NA 2 2 2
# 4 NA 4 1 4 NA
colSums(is.na(df))/nrow(df)
# X1 X2 X3 X4 X5
# 0.50 0.25 0.00 0.25 0.25
df[colSums(is.na(df))/nrow(df) < .3]
# X2 X3 X4 X5
# 1 4 2 3 4
# 2 4 2 NA 1
# 3 NA 2 2 2
# 4 4 1 4 NA
For reference, here's a quick timing comparison:
set.seed(1)
df <- data.frame(matrix(sample(c(NA, 1:4), 4000, TRUE), ncol = 1000))
akfun <- function() {
  i1 <- sapply(df, function(x) {
    pr <- prop.table(table(factor(is.na(x), levels = c(TRUE, FALSE))))
    pr[as.logical(names(pr))] < 0.3
  })
  df[i1]
}
amfun <- function() df[colSums(is.na(df))/nrow(df) < .3]
identical(amfun(), akfun())
# [1] TRUE
system.time(akfun())
# user system elapsed
# 0.172 0.000 0.173
system.time(amfun())
# user system elapsed
# 0.000 0.000 0.001
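As an aside, an equivalent shorthand (not shown in the original answers) uses colMeans, since the mean of a logical vector is the proportion of TRUEs:
df[colMeans(is.na(df)) < 0.3]  # same subset as the colSums/nrow version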
We can loop over the columns with sapply, get the count of NA values with table, use prop.table to find the proportion, and create a logical vector.
i1 <- sapply(df, function(x) {
  pr <- prop.table(table(factor(is.na(x), levels = c(TRUE, FALSE))))
  pr[as.logical(names(pr))] < 0.3
})
This vector can be used for subsetting the columns.
df[i1]
If we need to remove the columns from df directly:
df[!i1] <- list(NULL) # contributed by @Ananda Mahto
df
# X2 X3 X4 X5
#1 4 2 3 4
#2 4 2 NA 1
#3 NA 2 2 2
#4 4 1 4 NA
NOTE: df taken from @Ananda Mahto's post

t.test applied in pairs to more than two samples in R

I have 44 samples where I only have each sample's size, mean, and standard deviation. I asked about the possibility of doing a t.test, and some of you returned this answer:
T.test <- function(n, mean, sd) {
s <- sum((n - 1) * sd^2) / (sum(n) - 2) # weighted variance
t <- sqrt(prod(n) / sum(n)) * (diff(mean) / sqrt(s)) # t statistic
df <- sum(n) - 2 # degrees of freedom
p <- (1 - pt(abs(t), df)) * 2 # p value
c(t = t, p = p)
}
dat <- data.frame(mean = c(6.62, 6.31), sd = c(.52, .49), n = c(10, 12))
# mean sd n
# 1 6.62 0.52 10
# 2 6.31 0.49 12
T.test(dat$n, dat$mean, dat$sd)
# t p
# -1.4373111 0.1660915
However, I would like to perform a t.test analysis on every single pair of samples. As I have 44 samples, doing this by hand would be a very hard task.
Imagine I have 5 samples:
# mean sd n
# 1 6.62 0.52 10
# 2 6.31 0.49 12
# 3 5.95 0.76 34
# 4 5.67 0.56 23
# 5 6.12 0.35 16
I would like to perform a t.test between 1-2, 1-3, 1-4, 1-5, 2-3, 2-4, 2-5, 3-4, 3-5, 4-5 and get a table with all the results obtained at the same time.
The table would have the sample names in the first row and the first column, so half of the table would be redundant. Both results (t statistic and p-value) must appear. It should be something similar to this:
# 1 2 3 4 5
# 1 - test1 p-value1 test2 p-value2 test3 p-value3 test4 p-value4
# 2 - - test5 p-value5 test6 p-value6 test7 p-value7
# 3 - - - test8 p-value8 test9 p-value9
# 4 - - - - test0 p-value0
# 5 - - - - -
Can anybody show me R code to obtain what I request automatically, using the data written above? I could then adapt it to my full set of samples.
We can use the pairwise.t.test function, which also provides different p-value adjustments for pairwise comparisons. This is needed because, among the many t.tests, you will encounter significant differences just by chance.
We first set up the original data.frame and a data.frame to be filled.
df = data.frame(n = c(10,12,34,23,16), mean = c(6.62,6.31,5.95,5.67,6.12), sd = c(0.52,0.49,0.76,0.56,0.35))
sample_distributions = data.frame(value = numeric(0), sample = integer(0))
We then use the values in df to draw normal samples with the provided parameters and append them to sample_distributions.
for (i in 1:nrow(df)) {
  values = rnorm(df$n[i], df$mean[i], df$sd[i])
  sample = rep(i, length(values))
  sample_distributions = rbind(sample_distributions, data.frame(values, sample))
}
Finally, we use these simulated distributions to perform the pairwise t.tests (the draws are random, so a set.seed call beforehand would make the result reproducible).
pairwise.t.test(x = sample_distributions$values, g = sample_distributions$sample, p.adjust.method = "bonferroni")
which yields:
Pairwise comparisons using t tests with pooled SD
data: sample_distributions$values and sample_distributions$sample
1 2 3 4
2 1.0000 - - -
3 0.0051 0.1524 - -
4 0.0099 0.2309 1.0000 -
5 0.9955 1.0000 0.4172 0.6055
P value adjustment method: bonferroni
We could use outer to do the T.test on all the combinations of rows.
res <- outer(1:nrow(dat), 1:nrow(dat), FUN = Vectorize(function(i, j) {
  x1 <- dat[c(i, j), ]
  T.test(x1$n, x1$mean, x1$sd)[[2]]
}))
If we need only the upper triangular p-values, we can assign the lower.tri elements to NA.
res[lower.tri(res, diag=TRUE)] <- NA
res
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA 0.1660915 0.01270188 7.317558e-05 0.007149738
#[2,] NA NA 0.13401244 2.075498e-03 0.241424622
#[3,] NA NA NA 1.368773e-01 0.399642479
#[4,] NA NA NA NA 0.007203030
#[5,] NA NA NA NA NA
This could also be done using combn to return a vector of p-values:
v1 <- combn(1:nrow(dat), 2, FUN = function(i) {
  x1 <- dat[i, ]
  T.test(x1$n, x1$mean, x1$sd)
})[2, ]
If we need the matrix output, we can create a matrix with NA values
res2 <- matrix(NA, 5, 5)
then fill the elements in the matrix based on the logical index returned from lower.tri.
res2[lower.tri(res2, diag=FALSE)] <- v1
and transpose (t) to return the same output as in outer.
t(res2)
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA 0.1660915 0.01270188 7.317558e-05 0.007149738
#[2,] NA NA 0.13401244 2.075498e-03 0.241424622
#[3,] NA NA NA 1.368773e-01 0.399642479
#[4,] NA NA NA NA 0.007203030
#[5,] NA NA NA NA NA
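If both the t statistics and the p-values are wanted in one table, as the question's layout suggests, one possibility (a sketch, not from the original answers) is to keep both rows of the combn output and store t below the diagonal and p above it:
tp <- combn(1:nrow(dat), 2, FUN = function(i) {
  x1 <- dat[i, ]
  T.test(x1$n, x1$mean, x1$sd)
})
res3 <- matrix(NA, nrow(dat), nrow(dat))
res3[lower.tri(res3)] <- tp[2, ]  # p-values; combn's pair order matches lower.tri's column-major order
res3 <- t(res3)                   # move the p-values above the diagonal
res3[lower.tri(res3)] <- tp[1, ]  # t statistics below the diagonal
res3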
data
dat <- structure(list(mean = c(6.62, 6.31, 5.95, 5.67, 6.12),
                      sd = c(0.52, 0.49, 0.76, 0.56, 0.35),
                      n = c(10L, 12L, 34L, 23L, 16L)),
                 .Names = c("mean", "sd", "n"), class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5"))

Cramer's V with missing values gives different results

My question concerns the calculation of Cramer's V to detect correlation between categorical variables. I have a dataset with missing values, but I created a fake dataset for illustration with two variables a and b, both containing missing values.
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
df<-cbind(a2,b2)
The assocstats function gives me the result for Cramer's V:
require(vcd)
> tab <-table(a,b)
> assocstats(tab)
X^2 df P(> X^2)
Likelihood Ratio 1.7261 4 0.78597
Pearson 1.3333 4 0.85570
Phi-Coefficient : 0.408
Contingency Coeff.: 0.378
Cramer's V : 0.289
Now I want to drop the NA's from the levels
a[a==""]<-NA
a3 <- droplevels(a)
levels(a3)
tab <-table(a,b)
assocstats(tab)
But every time I remove the NA's, the result looks like this:
X^2 df P(> X^2)
Likelihood Ratio 0.13844 2 0.93312
Pearson NaN 2 NaN
Phi-Coefficient : NaN
Contingency Coeff.: NaN
Cramer's V : NaN
Also, because I have a large dataset, I would like to calculate a matrix of Cramer's V results. I found this code here on Stack Overflow and it seems to work...
get.V <- function(y) {
  col.y <- ncol(y)
  V <- matrix(ncol = col.y, nrow = col.y)
  for (i in 1:col.y) {
    for (j in 1:col.y) {
      V[i, j] <- assocstats(table(y[, i], y[, j]))$cramer
    }
  }
  return(V)
}
get.V(tab)
However, the result is different from the one the assocstats function gave:
[,1] [,2] [,3]
[1,] 1.0 0.5 1
[2,] 0.5 1.0 1
[3,] 1.0 1.0 1
This cannot be right, because I get this result every time, even when changing the number of observations... What is wrong with this code?
Conclusion: I don't know which of the results is right. I have a large dataset with a lot of NA's in it. The first assocstats result and the code give different results, although there is no big difference, because the code only creates a matrix. The second assocstats call gives only NaN. I can't detect any errors... Can somebody help me?
You don't have to replace the "" with NA if you are using factors: any value that you don't define in levels will be converted to NA by factor.
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
a
# [1] M   F F   M F F
# Levels:  F M
a2
# [1] Male <NA> Female Female <NA> Male Female Female
# Levels: Male Female
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
(df <- cbind(a2,b2))
# a2 b2
# [1,] 1 1
# [2,] NA 1
# [3,] 2 NA
# [4,] 2 1
# [5,] NA 2
# [6,] 1 2
# [7,] 2 2
# [8,] 2 1
Above, you're creating a matrix which loses all the labels that you created with factor. I think you want a data frame:
(df <- data.frame(a2,b2))
# a2 b2
# 1 Male yes
# 2 <NA> yes
# 3 Female <NA>
# 4 Female yes
# 5 <NA> no
# 6 Male no
# 7 Female no
# 8 Female yes
require('vcd')
(tab <- table(a2,b2, useNA = 'ifany'))
# b2
# a2 yes no <NA>
# Male 1 1 0
# Female 2 1 1
# <NA> 1 1 0
(tab <- table(a2,b2))
# b2
# a2 yes no
# Male 1 1
# Female 2 1
You need to explicitly tell table if you want to see NA values in the table. Otherwise it drops them by default, so you are already "excluding" them when you use assocstats:
assocstats(tab)
# X^2 df P(> X^2)
# Likelihood Ratio 0.13844 1 0.70983
# Pearson 0.13889 1 0.70939
#
# Phi-Coefficient : 0.167
# Contingency Coeff.: 0.164
# Cramer's V : 0.167
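For reference (not part of the original answer), Cramer's V for an r x c table is sqrt(X^2 / (n * (min(r, c) - 1))), with X^2 the Pearson chi-squared statistic and n the table total. A minimal hand check against the 2x2 table above:
X2 <- unname(chisq.test(tab, correct = FALSE)$statistic)  # no Yates correction, matching assocstats' Pearson X^2
n  <- sum(tab)
k  <- min(dim(tab))
sqrt(X2 / (n * (k - 1)))
# [1] 0.1666667   (agrees with Cramer's V from assocstats(tab))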
For get.V just pass the data frame or matrix, not the table:
get.V <- function(y) {
col.y <- ncol(y)
V <- matrix(ncol=col.y,nrow=col.y)
for(i in 1:col.y){
for(j in 1:col.y){
V[i,j] <- assocstats(table(y[,i],y[,j]))$cramer
}
}
return(V)
}
get.V(df)
# [,1] [,2]
# [1,] 1.0000000 0.1666667
# [2,] 0.1666667 1.0000000

Constructing derivative of a conditional density

I am using npcdens from the np package to construct a conditional density of y on covariates x. However, I need the derivative of the log of this density with respect to y. Is there some way in R to get this?
bw <- npcdensbw(formula=y ~ x1+x2+x3)
fhat <- npcdens(bws=bw,gradients=TRUE)
grad.fhat <- gradients(npcdens(bws=bw,gradients=TRUE))
which returns the gradients with respect to x1, x2 and x3.
Can we use this example dataset?
dta = data.frame(expand.grid(x1=1:5,x2=2:6,x3=5:10))
dta$y = with(dta,x1+2*x2 + 3*x3^2)
head(dta)
x1 x2 x3 y
1 1 2 5 80
2 2 2 5 81
3 3 2 5 82
4 4 2 5 83
5 5 2 5 84
6 1 3 5 82
Here, y is the value of the "density". Estimate a conditional bandwidth object:
bw <- npcdensbw(formula = y ~ x1+x2+x3,data=dta)
and look at the gradients
head(gradients(npcdens(bws=bw,gradients=TRUE)))
[,1] [,2] [,3]
[1,] -2.024422e-15 -2.048994e-50 -1.227563e-294
[2,] -1.444541e-15 -1.994174e-50 -1.604693e-294
[3,] -1.017979e-31 -1.201719e-50 -1.743784e-294
[4,] 1.444541e-15 -6.753912e-64 -1.604693e-294
[5,] 2.024422e-15 1.201719e-50 -1.227563e-294
[6,] -2.024422e-15 -3.250713e-50 -1.227563e-294
What do you mean by "derivative with respect to y"? This is a function g(x1, x2, x3), so you can only take derivatives w.r.t. those 3 dimensions. Concerning the "log of y" part of your question, could this be it?
bw <- npcdensbw(formula = log(y) ~ x1 + x2 + x3,data=dta)
I've never used this package, so these are the thoughts of a non-practitioner. I guess you looked at the examples in help(npcdensbw)?
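If a derivative with respect to y itself is what's needed (a conditional density f(y|x) does vary in y), a crude finite-difference sketch could be built from the model's predictions. This assumes np's predict method for npcdens objects accepts a newdata frame holding both y and the covariates; check help(npcdens) before relying on it:
h <- 1e-2                      # step size in y (assumption: small relative to y's scale)
up <- transform(dta, y = y + h)
dn <- transform(dta, y = y - h)
# central difference of log f(y | x1, x2, x3) in y
dlogf.dy <- (log(predict(fhat, newdata = up)) -
             log(predict(fhat, newdata = dn))) / (2 * h)
head(dlogf.dy)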
