If I have an input file with 7 columns (or an arbitrary number of columns), each containing a number of values, how do I run correlations for all unique pairs of columns (where AB and BA count as the same pair) in R, without having to call cor.test(column$A, column$B) by hand for every possible pair?
Example data:
A B C D
1 2 2 3
3 2 2 1
5 2 4 3
5 2 3 3
In this case A, B, C, D are different columns, and I would want to run all possible correlations for unique pairs, where AB and BA count as the same pair, just as DA and AD would be the same pair.
You can try:
cor(df, use = "complete.obs", method = "kendall")  # or whichever method fits you
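Note that cor() returns the full symmetric matrix, so every pair shows up twice. If you want one row per unique pair (AB but not BA), a small sketch using upper.tri() on that matrix:
cm <- cor(df, use = "complete.obs")
ut <- upper.tri(cm)  # TRUE above the diagonal: each unordered pair exactly once
data.frame(var1 = rownames(cm)[row(cm)[ut]],
           var2 = colnames(cm)[col(cm)[ut]],
           cor  = cm[ut])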
or:
# this also gives significance levels; note that rcorr() expects a matrix, not a data frame
library(Hmisc)
rcorr(as.matrix(df), type = "pearson")  # type can be "pearson" or "spearman"
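As a usage note, rcorr() returns a list: the estimates are in the r component and the p-values in P.
res <- rcorr(as.matrix(df))
res$r  # matrix of correlation estimates
res$P  # matrix of p-values (NA on the diagonal)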
If this doesn't work, you can create a list of column vectors and loop over all unique pairs yourself; a sketch is below.
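A minimal sketch (pairwise_cor_test is just an illustrative name, and it assumes df contains only numeric columns): combn() enumerates every unique column pair, so AB and BA are visited once.
pairwise_cor_test <- function(df, ...) {
  pairs <- combn(names(df), 2)  # all unique column pairs
  res <- apply(pairs, 2, function(p) {
    ct <- cor.test(df[[p[1]]], df[[p[2]]], ...)  # extra args passed through, e.g. method
    c(estimate = unname(ct$estimate), p.value = ct$p.value)
  })
  data.frame(var1 = pairs[1, ], var2 = pairs[2, ], t(res))
}
pairwise_cor_test(df)                      # Pearson by default
pairwise_cor_test(df, method = "spearman") # or any method cor.test() accepts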
Hope this helps
For example: a data frame "housing" has a column "street" with different street names as levels.
I want to return a data frame with the count of houses on each street (level), i.e. the number of repetitions.
What functions do I use in R?
This should help:
library(dplyr)
housing %>% group_by(street) %>% summarise(Count=n())
This can be done in multiple ways, for instance with base R using table():
table(housing$street)
It can also be done through dplyr, as illustrated by Duck.
Another option (my preference) is using data.table.
library(data.table)
setDT(housing)
housing[, .N, by = street]
summary() on a factor shows only the most frequent levels (100 by default, controlled by its maxsum argument). If there are more, try:
table(housing$street)
For example, let's generate one hundred one-letter street names and summarise them with table.
set.seed(1234)
housing <- data.frame(street = sample(letters, size = 100, replace = TRUE))
x <- table(housing$street)
x
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 1 3 5 6 4 6 2 6 5 3 1 3 1 2 5 5 4 1 5 5 3 7 4 5 3 5
As per the OP's comment: to use the result in further analyses, store it in a variable, here x. The variable's class is "table", and in base R it works with most functions like a named vector. For example, to find the most frequent street name, use which.max():
which.max(x)
# v
# 22
The result says that the 22nd position in x holds the maximum value, and that street is called v.
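If you want the full ranking rather than just the winner, sorting the table also works, since a table behaves like a named vector:
sort(x, decreasing = TRUE)        # all street counts, most frequent first
head(sort(x, decreasing = TRUE))  # just the top few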
Let's say I have a vector of integers 1:6
w <- 1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly), but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups), but I could not locate any that handle more than 2 groups.
I've tried two approaches to this:
1. Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something there for this with melt()).
2. Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or removing duplicated rows within groups.
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat <- permutations(6, 6, set = TRUE, repeats.allowed = FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Does anyone have ideas on how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)  # column indices of the three groups
get.col <- function(x, j) x[, j]
# TRUE for rows whose group columns appear in increasing order
is.ordered <- function(x) !colSums(diff(t(x)) < 0)
# a permutation is valid when every group's columns are ordered
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))
dat <- dat[is.valid, ]
nrow(dat)
# [1] 90
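If filtering 720 permutations feels wasteful, the 90 rows can also be built directly: pick group 1 with combn(), pick group 2 from the remaining integers, and group 3 is whatever is left. A sketch hard-coded to 3 groups of size 2 (multinom_groups is an illustrative name):
multinom_groups <- function(w) {
  out <- list()
  for (g1 in as.data.frame(combn(w, 2))) {       # 15 choices for group 1
    rest <- setdiff(w, g1)
    for (g2 in as.data.frame(combn(rest, 2))) {  # 6 choices for group 2
      out[[length(out) + 1]] <- c(g1, g2, setdiff(rest, g2))
    }
  }
  do.call(rbind, out)
}
nrow(multinom_groups(1:6))
# [1] 90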
I have a data.frame that can contain N columns (N defined at runtime), and I want to get the rows that satisfy N-1 conditions; in other words, only the rows with specific values in the first N-1 columns.
For instance if I have a data frame with four columns (A,B,C,D) and five rows:
A B C D
1 2 3 4
9 9 9 9
1 2 9 5
4 3 2 1
1 2 3 8
I would get all the rows with A==1 & B==2 & C==3, i.e:
A B C D
1 2 3 4
1 2 3 8
But as said, the data frame can have any number of rows and columns (defined at runtime), and the values of the conditions may change.
I implemented this function (simplified):
getRows <- function(dataFrame, values) {
  conditions <- rep(TRUE, dim(dataFrame)[1])
  for (k in 1:length(values)) {
    conditions <- conditions & (dataFrame[, k] == values[k])
  }
  return(dataFrame[conditions, ])
}
Of course, this assumes the values in the values vector are ordered to match the column order of the data frame, and that the length of the vector is N-1.
The function works, but I have the feeling it is not really efficient to build the logical vector and evaluate boolean expressions this way, especially if the data frame contains a lot of data.
Another solution that I found is:
getRows <- function(dataFrame, values) {
  tmp <- dataFrame
  for (k in 1:length(values)) {
    tmp <- tmp[tmp[, k] == values[k], ]
  }
  return(tmp)
}
Basically this 'reduces' the data frame by filtering out the rows that do not satisfy each condition. But I think this is even worse, because it creates a new data frame object for each condition (always smaller, admittedly, but still).
So my question is: is there a method to do that more efficiently?
One possibility:
# if you are only checking for equalities
f <- function(df, values){
  # values must be a list with the column names of df as names
  # and the values to match as elements
  y <- paste(names(values), unlist(values), sep = "==", collapse = " & ")
  return(df[eval(parse(text = y), envir = df), ])
}
l <- as.vector(1:3, "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
# you can also use other comparison operators
f <- function(df, values){
  # values must be a list with the column names of df as names
  # and the conditions (operator plus value) as elements
  y <- paste(names(values), unlist(values), collapse = " & ")
  return(df[eval(parse(text = y), envir = df), ])
}
l <- as.vector(paste0(c("==", "<=", "=="), 1:3), "list")
names(l) <- colnames(df)[-ncol(df)]
f(df, l)
A B C D
1 1 2 3 4
5 1 2 3 8
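If you would rather avoid eval(parse(...)) for the equality-only case, the same logical index can be built with Reduce() and Map(); a sketch (g is just an illustrative name):
g <- function(df, values) {
  # values: a named list, e.g. list(A = 1, B = 2, C = 3)
  keep <- Reduce(`&`, Map(`==`, df[names(values)], values))
  df[keep, , drop = FALSE]
}
g(df, list(A = 1, B = 2, C = 3))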
Sometimes matrices are quicker than data.frames to operate on, so something along the lines of:
mat <- t(as.matrix(df[-ncol(df)]))
boolMat <- (mat == values)  # if necessary, use match() to reorder values to match the columns of df
ind <- colSums(boolMat) == nrow(boolMat)
df[ind, ]
The idea is that values will get recycled along the columns of the matrix (which are the rows of the data frame). colSums is meant to be quicker than an apply, so the final line should be somewhat optimised compared to apply(boolMat, 2, all).
The optimal solutions will depend on the size and proportions of the data; whether the entries are all integers; and maybe what proportion of matches you get in the data. So as #droopy mentions, you'll need to benchmark. My approach involves creating a copy of the data, so if your data is already approaching memory limits, then it might struggle - but maybe then you could generate your data in matrix rather than data.frame format to save the duplication.
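A small benchmarking sketch with the microbenchmark package, assuming df, values, l and the functions above are in scope:
library(microbenchmark)
microbenchmark(
  loop   = getRows(df, values),   # OP's loop over conditions
  parse  = f(df, l),              # eval(parse(...)) approach
  matrix = {                      # matrix/colSums approach
    mat <- t(as.matrix(df[-ncol(df)]))
    df[colSums(mat == values) == nrow(mat), ]
  },
  times = 100
)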
I'm an enthusiastic R newbie who needs some help! :)
I have a data frame that looks like this:
id<-c(100,200,300,400)
a<-c(1,1,0,1)
b<-c(1,0,1,0)
c<-c(0,0,1,1)
y=data.frame(id=id,a=a,b=b,c=c)
Where id is a unique identifier (e.g. a person) and a, b and c are dummy variables for whether the person has this feature or not (as always, 1 = TRUE).
I want R to create a matrix or data frame with the variables a, b and c as both the column names and the row names. For the values of the matrix, R will have to calculate the number of identifiers that have each feature, or each combination of features.
So, for example, IDs 100, 200 and 400 have feature a, so in the diagonal of the matrix where a and a cross, R will input 3. Only ID 100 has both features a and b, hence R will input 1 where a and b cross, and so forth.
The resulting data frame will have to look like this:
l<-c("","a","b","c")
m<-c("a",3,1,1)
n<-c("b",1,2,1)
o<-c("c",1,1,2)
result<-matrix(c(l,m,n,o),nrow=4,ncol=4)
As my data set has 10 variables and hundreds of observations, I will have to automate the whole process.
Your help will be greatly appreciated.
Thanks a lot!
With base R:
crossprod(as.matrix(y[,-1]))
# a b c
# a 3 1 1
# b 1 2 1
# c 1 1 2
This is called an adjacency matrix. You can do this pretty easily with the qdap package:
library(qdap)
adjmat(y[,-1])$adjacency
## a b c
## a 3 1 1
## b 1 2 1
## c 1 1 2
It throws a warning because you're feeding it a data frame. That's not a big deal and can be ignored. Also notice I dropped the first column (IDs) with negative indexing: y[, -1].
Note that because you started out with a Boolean matrix you could have gotten there with:
Y <- as.matrix(y[,-1])
t(Y) %*% Y
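As a side note, crossprod(Y) is the idiomatic (and usually faster) way to write t(Y) %*% Y, which is why the two answers agree:
all.equal(crossprod(Y), t(Y) %*% Y)
# [1] TRUE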
I have done a lot of googling, but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group ID (the data have 3 groups: A, B, C), while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and get the column means:
dt <- read.table(file_name, header = TRUE)  # gives warnings
apply(dt, 2, mean)  # gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the column-wise mean for each group. Any help?
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean)  # works because data.frames are lists; drop the non-numeric Tag column first
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
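For completeness, base R's aggregate() does the same split/apply/combine in one call; with the sample data all of these approaches should agree (means worked out by hand from the five rows):
aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)
#   Tag v1 v2 v3
# 1   A  5  2  5
# 2   B  1  2  2
# 3   C  3  3  1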