cor.test into data.frame in R

consider the following example:
require(MuMIn)
data(Cement)
d <- data.frame(Cement)
idx <- seq(11,13)
cor1 <- list()
for (i in 1:length(idx)){
  d2 <- d[1:idx[i],]
  cor1[[i]] <- cor.test(d2$X1, d2$X2, method = "pearson")
}
out <- lapply(cor1, function(x) c(x$estimate, x$conf.int, x$p.value))
Here I calculate the correlation for a dataset within an iteration loop.
I now want to generate one data.frame made up of the values in the list 'out'. I try using
df <- do.call(rbind.data.frame, out)
but the result does not seem right:
> df
c.0.129614123011664..0.195326511912326..0.228579470307565.
1 0.1296141
2 0.1953265
3 0.2285795
c..0.509907346173941...0.426370467476045...0.368861726657293.
1 -0.5099073
2 -0.4263705
3 -0.3688617
c.0.676861607564929..0.691690831088494..0.692365536706126.
1 0.6768616
2 0.6916908
3 0.6923655
c.0.704071702633775..0.542941653020805..0.452566184329491.
1 0.7040717
2 0.5429417
3 0.4525662
This is not what I am after.
How can I generate a data.frame whose first column indicates which list element the cor.test was calculated for (i.e. 1 to 3 in this case), followed by columns for $estimate, $conf.int and $p.value, resulting in a five-column data.frame?

Is this what you're trying to do? Your question is a bit hard to understand. Is a column of indices from the list really necessary? The whole first column will be exactly the same as the row names (which appear on the left-hand side).
> D <- data.frame(cbind(index = seq(length(out)), do.call(rbind, out)))
> names(D)[2:ncol(D)] <- c('estimate', paste0('conf.int', 1:2), 'p.value')
> D
index estimate conf.int1 conf.int2 p.value
1 1 0.1296141 -0.5099073 0.6768616 0.7040717
2 2 0.1953265 -0.4263705 0.6916908 0.5429417
3 3 0.2285795 -0.3688617 0.6923655 0.4525662
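As a side note, the construction and the renaming could also be collapsed into one step with setNames() (just a sketch of the same idea, keeping the index column):
> D <- setNames(data.frame(index = seq_along(out), do.call(rbind, out)),
+               c('index', 'estimate', paste0('conf.int', 1:2), 'p.value'))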

It's not entirely clear what you're asking ... you already have such a data frame, just without reasonable column names. You can simplify your code to:
ctests <- lapply(idx, function(x) cor.test(d[1:x,"X1"], d[1:x, "X2"]))
ctests <- lapply(ctests, "[", c("estimate", "conf.int", "p.value"))
as.data.frame(do.call(rbind, lapply(ctests, unlist)))
# estimate.cor conf.int1 conf.int2 p.value
# 1 0.1296141 -0.5099073 0.6768616 0.7040717
# 2 0.1953265 -0.4263705 0.6916908 0.5429417
# 3 0.2285795 -0.3688617 0.6923655 0.4525662
Is this what you need?
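If the index column from the question is still wanted, it can be bolted on afterwards (a small addition to the code above; res is just a temporary name here):
res <- as.data.frame(do.call(rbind, lapply(ctests, unlist)))
cbind(index = seq_len(nrow(res)), res)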

Related

Loop rename variables in R with assign()

I am trying to rename a variable over several data frames, but assign won't work. Here is the code I am trying:
assign(colnames(eval(as.name(DataFrameX)))[[3]], "<- NewName")
# The idea is, go through every dataset, and change the name of column 3 to
# "NewName" in all of them
This won't return any error (all other versions I could think of returned some kind of error), but it doesn't change the variable name either.
I am using a loop to create several data frames, each with different variables, and now I need to rename some of those variables so that the data frames can be merged into one at a later stage. All of that works except for the renaming. If I type the names of the data frame and variables myself in a regular call such as colnames(DF)[[3]] <- "NewName", it works, but when I try to use assign so that it can be done in a loop, it doesn't do anything.
Here is what you can do with a loop over all data frames in your environment. Since you select only the objects that are data frames, there is no risk of touching any other variable. The key point is that you must assign the modified data frame back to its name inside the loop.
df1 <- data.frame(q=1,w=2,e=3)
df2 <- data.frame(q=1,w=2,e=3)
df3 <- data.frame(q=1,w=2,e=3)
# > df1
# q w e
# 1 1 2 3
# > df2
# q w e
# 1 1 2 3
# > df3
# q w e
# 1 1 2 3
DFs=names(which(sapply(.GlobalEnv, is.data.frame)))
for (i in 1:length(DFs)){
  df <- get(DFs[i])
  colnames(df)[3] <- "newName"
  assign(DFs[i], df)
}
# > df1
# q w newName
# 1 1 2 3
# > df2
# q w newName
# 1 1 2 3
# > df3
# q w newName
# 1 1 2 3
We could use eapply() to apply setnames() from the data.table package to every data.frame in your global environment.
library(data.table)
eapply(.GlobalEnv, function(x) if (is.data.frame(x)) setnames(x, 3, "NewName"))
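If you are free to build the data frames as a list in the first place, a cleaner alternative (just a sketch of the list-based approach, using hypothetical names) is to keep them together and rename with lapply:
dfs <- list(df1 = df1, df2 = df2, df3 = df3)
dfs <- lapply(dfs, function(d) { colnames(d)[3] <- "NewName"; d })
# access the renamed frames as dfs$df1, dfs$df2, dfs$df3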

Named arrays, dataframes and matrices

If I split my data matrix into rows according to class labels in another vector y like this, the result is something with 'names' like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in the loop so I can do that?
labels = names(X_split)
for (k in labels) {
  # How do I get X_split[k] as a matrix?
  sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
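For example, the sapply version would be:
sapply(X_split, sum)
# [1] 16 20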
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
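If per-group column sums are wanted instead of a single total per group, base R's rowsum() also works directly on the matrix (an addition, not part of the original answers):
rowsum(X, y)
#   [,1] [,2]
# 1    4   12
# 3    6   14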

Gnu R: Rename variable in loop

I would like to create a loop in order to create 15 crosstables with one data.frame (var1), which consists of 15 variables, and another variable (var2); see the data, which can be downloaded here.
The code is now able to give results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec=",")
var2 <- sample(1:43)
i <- 1
while(i <= ncol(var1)) {
  mytable[[i]] <- table(var2, var1[,i])
  assocstats(mytable[[i]])
  print(mytable[[i]])
  i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a set of related objects is actively discouraged in R. Collecting them all in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[ , c(2, 8:11)] ##OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ##store output in a list
##a `for` loop looks simpler than `while`
for(i in 1:ncol(var1)){
  mytable[[i]] <- table(var2, var1[ , i])
  myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
1 4 2
2 6 6
3 1 1
4 2 3
5 5 2
> myassoc[[2]]
X^2 df P(> X^2)
Likelihood Ratio 1.7079 4 0.78928
Pearson 1.6786 4 0.79460
Phi-Coefficient : NA
Contingency Coeff.: 0.223
Cramer's V : 0.229
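If numeric positions feel error-prone, the list elements could also be named after the columns of var1 (a small optional addition), so tables can be looked up by variable name:
names(mytable) <- names(myassoc) <- names(var1)
mytable[["vs"]]  # same table as mytable[[2]]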

How to row-wise subtract a vector holding the column means of a data frame (df) from df?

Sorry for the confusing title ... here is what I want to do with a possible solution:
> df <- data.frame(a=c(1,2,3),b=c(4,5,6))
> v <- colMeans(df)
> df
a b
1 1 4
2 2 5
3 3 6
> v
a b
2 5
> t(t(df)-v)
a b
[1,] -1 -1
[2,] 0 0
[3,] 1 1
But the data frame will have named columns and rows and be quite large, which is why I am not comfortable with this solution. I would like to know if there is a programmatic one that (of course) does not resort to loops and has no need for clumsy double transpositions (maybe it even fits neatly into a single line).
You want to mean-correct all columns in your data frame?
df <- scale(df, center=TRUE, scale=FALSE)
If there are columns that aren't numeric (factors and character) then you'll have to test for them:
numeric <- sapply(df, is.numeric)
df[numeric] <- scale(df[numeric], center=TRUE, scale=FALSE)
Note that this converts your df into a matrix as part of the scaling. If you don't want the conversion to happen, you could also do:
df[] <- lapply(df, function(x) x - mean(x))
Another option:
sweep(df, 2, v)
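sweep() subtracts by default; spelling out the means and the function makes the intent explicit and is equivalent to the call above:
sweep(df, 2, colMeans(df), FUN = "-")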
Building on the answer from Hong Ooi, you can obtain a data.frame directly using:
df <- data.frame(scale(df, center=TRUE, scale=FALSE))
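Whichever variant you use, a quick sanity check is that the column means of the centered result are zero:
round(colMeans(df), 10)
# a b
# 0 0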

Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
  mergedDf <- merge(naDf, fillDf, by=mergeCols)
  for (col in fillCols){
    colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
    colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
    k <- which(is.na(colWithNas))
    colWithNas[k] <- colWithOutNas[k]
    mergedDf[col] <- colWithNas
    mergedDf[[paste(col, "x", sep=".")]] <- NULL
    mergedDf[[paste(col, "y", sep=".")]] <- NULL
  }
  return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had the odd feeling that someone has probably solved this problem before me, and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2: I realized I had logic flaws in my example, which I have fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
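For completeness, a similar fill can be sketched with dplyr (not part of the original answers; assumes the dplyr package is available and uses a left join plus coalesce()):
library(dplyr)
outDF <- naDf %>%
  left_join(fillDf, by = c("a", "b"), suffix = c("", ".fill")) %>%
  mutate(f = coalesce(f, f.fill), g = coalesce(g, g.fill)) %>%
  select(a, b, f, g)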
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
  mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
  dataCols <- setdiff(names(naDf), mergeCols)
  # loop over all columns we didn't merge by
  for(col in dataCols) {
    rows <- is.na(mergedDf[,col])
    # skip this column if it doesn't contain any NAs
    if(!any(rows)) next
    rows <- which(rows)
    # replace NAs with values from fillDf
    mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
  }
  # don't return ".fill" columns
  mergedDf[,names(naDf)]
}
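Usage mirrors the original function, minus the fillCols argument, e.g.:
fillNaDf.ju(naDf, fillDf, mergeCols = c("a", "b"))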
My preference would be to pull out the code from merge that does the matching and do it myself, so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it; I thought I could fill it in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names were faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
  fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
  naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
  na.ind <- is.na(naDf[,fillCols])
  fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
  naX <- naDf[,fillCols]
  fillX <- fillDf[,fillCols]
  naX[na.ind] <- fillX[fill.ind]
  naDf[,colnames(naX)] <- naX
  naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
  fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
  naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
  m <- match(naB, fillB)
  for(col in fillCols) {
    fix <- which(is.na(naDf[,col]))
    naDf[fix, col] <- fillDf[m[fix],col]
  }
  naDf
}
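As a rough sanity check on the question's test data (every (a, b) combination in naDf also occurs in fillDf, so no NAs should remain in f and g after filling):
filled <- fillNaDf2(naDf, fillDf, mergeCols = c("a", "b"), fillCols = c("f", "g"))
colSums(is.na(filled[, c("f", "g")]))
# f g
# 0 0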
