I have a huge dataset and created a large correlation matrix. My goal is to clean this up and create a new data frame containing every correlation whose absolute value is greater than 0.25, with the variable names included.
For example, given the data set below, how would I use a doubly nested loop over the rows and columns of the correlation table?
a <- rnorm(10, 0 ,1)
b <- rnorm(10,1,1.5)
c <- rnorm(10,1.5,2)
d <- rnorm(10,-0.5,1)
e <- rnorm(10,-2,1)
matrix <- data.frame(a,b,c,d,e)
cor(matrix)
(Notice that there is redundancy in the matrix. You only need to inspect the first 5
columns, and you don't need to inspect all rows. If I'm looking at column 3, for example, I
only need to start looking at row 4, after the correlation = 1.)
Thank you
Is your ultimate goal to create a 5x5 matrix in which every correlation with an absolute value below 0.25 is set to zero? This can be done on the correlation matrix itself via m <- cor(matrix); m[abs(m) < 0.25] <- 0. If your goal is simply to loop over the rows and columns, this can be done via:
m <- cor(matrix)
for (row in rownames(m)){
for (col in colnames(m)){
#your code here
#operating on m[row,col]
}
}
To avoid redundancy:
for (row in rownames(m)[1:(length(rownames(m))-1)]){
for (col in colnames(m)[(which(colnames(m) == row)+1):length(colnames(m))]){
#your code here
#operating on m[row,col]
print(m[row,col])
}
}
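If the end goal is the data frame from the question rather than printed values, the same upper-triangle idea can be written without an explicit loop. This is only a sketch in base R: the data, the seed and the names dat, idx and strong_pairs are made up for illustration, and 0.25 is the cutoff from the question.

```r
# Illustrative data (hypothetical, not the asker's real data)
set.seed(42)
dat <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))
m <- cor(dat)

# upper.tri() keeps each pair once and skips the diagonal;
# arr.ind = TRUE returns the (row, col) position of every hit
idx <- which(upper.tri(m) & abs(m) > 0.25, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(m)[idx[, "row"]],
  var2 = colnames(m)[idx[, "col"]],
  r    = m[idx]
)
strong_pairs
```

Each row of strong_pairs is one pair of variables with its correlation, so a-b appears but b-a does not.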
I'd suggest using the corrr package, in conjunction with tidyr and dplyr.
This allows you to generate a correlation data frame rather than a matrix and remove the duplicate values (where for example a-b is the same as b-a) using the shave function. You can then rearrange by pivoting, remove the NA values (from the diagonal, e.g. a-a, and from the shaved triangle) and filter for absolute values greater than 0.25.
library(dplyr)
library(tidyr)
library(magrittr) # for the pipe %>% or just use library(tidyverse) instead of all 3
library(corrr)
# for reproducible values
set.seed(1001)
# no need to make a data frame from vectors
# and don't call it matrix, that's a function name
mydata <- data.frame(a = rnorm(10, 0 ,1),
b = rnorm(10, 1, 1.5),
c = rnorm(10, 1.5, 2),
d = rnorm(10, -0.5, 1),
e = rnorm(10, -2, 1))
mydata %>%
correlate() %>%
shave() %>%
pivot_longer(2:6) %>%
na.omit() %>%
filter(abs(value) > 0.25)
Result:
# A tibble: 4 x 3
term name value
<chr> <chr> <dbl>
1 c b -0.296
2 d b 0.357
3 e a -0.440
4 e d -0.280
I would like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and two simple functions, func1 and func2.
The data frame col_mod contains two sets of modifications that should be applied to the data frame. Each of these modifications should be an addition to the data frame, rather than replacements of the specified columns.
In col_mod row 1, we see that column a should be modified using func1, and in row 2, we see that column c should be modified using func2. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I obtain my desired result, but I would like to do so without hard coding each modification individually. Is there any way to use something like purrr::map or anything similar?
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
b = 6:10,
c = 11:15)
## functions
func1 <- function(x) {x + 2}
func2 <- function(x) {x - 4}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
"func" = c("func1", "func2"),
stringsAsFactors = FALSE)
## desired end result
dat %>%
mutate("a_new" = func1(a),
"c_new" = func2(c))
edit: if it is easier to store the modifications in a list, as shown below, a solution using that would be fine as well, as I am able to store the modifications in either a data frame or list.
col_mod <- list("set1" = list("a", "func1"),
"set2" = list("c", "func2"))
We can do this with the help of Map, using match.fun to look up each function by its name:
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
dat[col_mod$col], col_mod$func)
dat
# a b c a_new c_new
#1 1 6 11 3 7
#2 2 7 12 4 8
#3 3 8 13 5 9
#4 4 9 14 6 10
#5 5 10 15 7 11
Using col_mod as dataframe.
col_mod <- data.frame("col" = c("a", "c"),"func" = c("func1", "func2"))
We can use the tidyverse approach to do this
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
imap_dfc(deframe(col_mod), ~ dat %>%
transmute(!! str_c(.y, "_new") := match.fun(.x)(!! rlang::sym(.y)))) %>%
bind_cols(dat, .)
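For the list form of col_mod from the question's edit, a plain loop is probably the simplest thing that works. This is a rough sketch rather than a tidyverse solution. It assumes each element of the list is list(column name, function name) in that order, and it uses get(..., mode = "function") to look the function up by name.

```r
dat <- data.frame(a = 1:5, b = 6:10, c = 11:15)
func1 <- function(x) x + 2
func2 <- function(x) x - 4

# list form of the modifications, as in the question's edit
col_mod <- list(set1 = list("a", "func1"),
                set2 = list("c", "func2"))

for (mod in col_mod) {
  col <- mod[[1]]                          # column to read
  fn  <- get(mod[[2]], mode = "function")  # function looked up by name
  dat[[paste0(col, "_new")]] <- fn(dat[[col]])  # add, do not replace
}
dat
```

The original columns are left untouched and a_new and c_new are appended, which matches the desired result in the question.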
Using this very simple data example below, my goal would be to sample all 3 of A and only sample 5 out of 7 of B.
id group
1 A
2 A
3 A
4 B
5 B
6 B
7 B
8 B
9 B
10 B
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))
Now, normally it'd just be a case of using sample_n from dplyr such that the code would be along the lines of
sel_5 <- ex_df %>%
group_by(group) %>%
sample_n(5)
Except this gives the error (for obvious reasons)
Error: size must be less or equal than 2 (size of data), set
replace = TRUE to use sampling with replacement
but sampling with replacement isn't an option. Is there any way that I might be able to set the sample_n size to be the minimum of 5 or the size of the group?
Or maybe another function that I'm unaware of that would be capable of this?
I've had the same problem, and here's what I did.
library(dplyr)
split_up <- split(ex_df, f = ex_df$group)
#split original dataframe into a list of dataframes for each unique group
sel_5 <- lapply(split_up, function(x) {x %>% sample_n(min(5, nrow(x)))})
#on each dataframe, subsample to 5 or to the number of rows if there are fewer than 5
sel_5 <- do.call("rbind", sel_5)
#bind it back up!
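The same "minimum of 5 or the group size" idea can also be sketched in base R by sampling row indices per group. The names rows and sel_5_base are made up here, and the seed is only there to make the draw repeatable.

```r
ex_df <- data.frame(id = 1:10, group = c(rep("A", 3), rep("B", 7)))

set.seed(1)
# split the row numbers by group, then draw min(5, group size) of them
# without replacement; sample.int() avoids the sample() surprise when a
# group has a single row
rows <- unlist(lapply(split(seq_len(nrow(ex_df)), ex_df$group), function(i) {
  i[sample.int(length(i), size = min(5, length(i)))]
}))

sel_5_base <- ex_df[rows, ]
table(sel_5_base$group)
```

If you are on dplyr 1.0.0 or later, I believe group_by(group) %>% slice_sample(n = 5) behaves the same way, since it truncates n to the group size when sampling without replacement, but check the documentation for your version.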
I'm an R newbie and wondering if people could offer me a little bit of advice as to how I can process some data I have.
I have a data frame containing a list of samples with observed changes in genes (example below)
Dataframe1:
Sample Gene Alteration
1 A -1
1 B -1
1 C -1
1 D 1
2 B 1
2 E -1 ...
I also have a data frame containing a list of genes that I am interested in (example below)
Dataframe2:
Gene
B
D
E
I want to calculate how many samples have a -1 alteration for each gene in dataframe2, with an ideal output looking something like:
Dataframe3:
Gene Alteration Sum
B -1 23
D -1 2
E -1 18
I'm really stuck as to where to start. I've found a lot of information on sum etc., but I can't work out how to combine the two data frames and use sum.
Any advice or just functions that I could try would be hugely appreciated.
Step 1: Select the genes of interest from dataframe1:
set.seed(11)
dataframe1 = data.frame(Sample = rep(c(1,2), each = 5),
Gene = rep(c("A", "B", "C", "D","E"),2),
Alteration = sample(c(-1, 1), 10, prob = c(0.7, 0.3), replace = TRUE))
dataframe2 <- data.frame(Gene = c("B", "D", "E"))
# Select the genes of interest
dataframe1 <- dataframe1[dataframe1$Gene %in% dataframe2$Gene, ]
Step 2: Calculate the sum of -1's
We can use the dplyr library to compute the sum per group:
library(dplyr)
dataframe1 %>%
group_by(Gene) %>%
summarise(Sum = sum(Alteration == -1))
Note that when we have a boolean vector (vector containing TRUE's and FALSE's) the sum of this vector gives the number of TRUE's.
Good luck!
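To make the note about summing a logical vector concrete, here is a tiny base R illustration. The vectors alteration and gene are invented toy values, not the asker's data.

```r
alteration <- c(-1, 1, -1, -1, 1)
gene       <- c("B", "D", "B", "E", "E")

# the comparison gives TRUE/FALSE; sum() treats TRUE as 1 and FALSE as 0
is_loss <- alteration == -1
n_loss  <- sum(is_loss)

# the same count per gene, without dplyr
loss_per_gene <- tapply(is_loss, gene, sum)
loss_per_gene
```

A gene that is present but never has a -1 (D in this toy data) gets a count of 0 rather than being dropped.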
Or with dplyr, just try
dft2 %>%
inner_join(dft1) %>%
group_by(Gene, Alteration) %>%
summarise( cnt = n()) %>%
filter(Alteration == -1)
where dft1 is the first dataframe and dft2 is the second dataframe
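In case dft2 has entries not found in dft1 and you want to keep them in the output, change the inner_join to left_join. Those rows will have Alteration = NA, so the final filter would also need to keep them, e.g. filter(Alteration == -1 | is.na(Alteration)).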
You can use the function ddply from the package plyr.
library(plyr)
Dataframe3 <- ddply(Dataframe1, c('Gene', 'Alteration'), summarise, Sum = length(Alteration))
QUESTION: Using R, how would you create values in column B prefixed with a constant "1" + n 0's where n is the value in each row in column A?
#R CODE EXAMPLE
df <- as.data.frame(1:3);colnames(df)[1] <- "A";
print(df);
# A
# 1
# 2
# 3
preFixedValue <- 1; repeatedValue <- 0;
#pseudo code: create values in column B with n 0's prefixed with 1
df <- cbind(df,paste(rep(c(preFixedValue,repeatedValue), times = c(1,df[1:nrow(df),])),collapse = ""));
#expected/desired result
# A B
# 1 10
# 2 100
# 3 1000
USE CASE: Real data contains hundreds of rows in column A with random integers, not just three sequential int's as shown in the code above.
Below is an example using Excel to demonstrate what I want to do in R.
The rowwise() function in dplyr lets you make variables from column values in each row.
library(dplyr)
df <- data.frame(A = 1:3, B = NA)
preFixedValue <- 1; repeatedValue <- 0;
df <- df %>%
rowwise() %>%
mutate(B = as.numeric(paste0(c(preFixedValue, rep(repeatedValue, A)), collapse = "")))
For flexibility in choosing the prefixed and repeated values, and for simplicity of the syntax (one single line), str_pad can be used. The width has to be the length of the prefix plus the number of repeats, and pad has to be a single character:
library(stringr)
df$B <- str_pad(as.character(preFixedValue), width = nchar(preFixedValue) + df$A, pad = as.character(repeatedValue), side = "right")
Would something like this work?
B<-10^(df$A)
df<-cbind(df,B)
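10^A only works when the prefix is 1 and the repeated value is 0. A base R sketch that keeps both values configurable is to build the string with strrep, which is vectorised over the number of repeats. B is kept as character here on purpose, since very long runs of zeros would not survive conversion to numeric exactly.

```r
df <- data.frame(A = 1:3)
preFixedValue <- 1
repeatedValue <- 0

# strrep() repeats the value A times for each row; paste0() adds the prefix
df$B <- paste0(preFixedValue, strrep(repeatedValue, df$A))
df
```

Wrap the result in as.numeric() if you need numbers and the values of A are small.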
After using the G.test on all rows of my data subset
apply(datamixG +1 , 1, G.test)
I get an output for each row that looks like this
[[1]]
G-test for given probabilities
data:  newX[, i]
G = 3.9624, df = 1, p-value = 0.04653
I have 46 rows. I need to sum the df and G-values. Is there a way to have R report the G-values differently and/or sum all of the G-values and df?
I'll assume you're using the G.test function from the RVAideMemoire package:
# Sample data (always a good idea to post!)
dat <- matrix(1:4, nrow=2)
library(RVAideMemoire)
tests <- apply(dat, 1, G.test)
You can use unlist and lapply to extract a single value from each element in a list and to return a vector of the results:
dfs <- unlist(lapply(tests, "[[", "parameter"))
dfs
# df df
# 1 1
sum(dfs)
# [1] 2
Gs <- unlist(lapply(tests, "[[", "statistic"))
Gs
# G G
# 1.0464963 0.6795961
sum(Gs)
# [1] 1.726092
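If you would rather have everything in one table before summing, the same extraction can be sketched with vapply. To keep this runnable without extra packages I am using chisq.test from base R as a stand-in for G.test. That is an assumption on my part: both return an "htest" object with statistic and parameter elements, so the extraction should carry over, but the statistic itself is not the G statistic. The counts in dat are made up.

```r
# made-up counts: one test per row
dat <- matrix(c(12,  8,
                15,  5,
                 9, 11), ncol = 2, byrow = TRUE)

# stand-in for G.test: goodness-of-fit test on each row
tests <- lapply(seq_len(nrow(dat)), function(i) chisq.test(dat[i, ]))

# pull one number per test; vapply() checks each result is a single numeric
res <- data.frame(
  row       = seq_along(tests),
  statistic = vapply(tests, function(t) unname(t$statistic), numeric(1)),
  df        = vapply(tests, function(t) unname(t$parameter), numeric(1)),
  p.value   = vapply(tests, function(t) t$p.value, numeric(1))
)

total_stat <- sum(res$statistic)
total_df   <- sum(res$df)
res
```

With G.test from RVAideMemoire you would swap chisq.test for G.test in the lapply call and leave the rest as it is.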