R: Running row-wise operations between data frames

I'd like to run a statistical test, row-by-matching-row, between two data frames gex and mxy. The catch is that I need to run it several times, each time using a different column from gex, yielding a different vector of test results for each run.
Here is what I have so far (using example values), after much help from @kristang.
gex <- data.frame("sample" = c(987,7829,15056,15058,15072),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
"TCGA-AA-3663-11" = runif(5, -1, 1),
"TCGA-AD-6901-01" = runif(5, -1, 1),
"TCGA-AZ-2511-01" = runif(5, -1, 1),
"TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
## BELOW IS FOR DIAGNOSTICS
write.table(mxy, file = "mxy.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
write.table(gex, file = "gex.csv",
row.names=FALSE, col.names=TRUE, sep=",", quote=F)
## ABOVE IS FOR DIAGNOSTICS
for(i in seq(nrow(mxy)))
for(colName in listx){
zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
mxy[i,])
## BELOW IS FOR DIAGNOSTICS
write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
row.names=FALSE,col.names=FALSE,sep=",",quote=F)
write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
row.names=FALSE,col.names=FALSE,sep=",",quote=F)
## ABOVE IS FOR DIAGNOSTICS
geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL,
stringsAsFactors = FALSE)
write.csv(geneexptest, file = paste0(colName, ".csv"),
row.names=FALSE, col.names=FALSE, sep=",", quote=F)
}
The problem is that, while it runs and creates the correct number of output files with the correct number of rows, it does not yield correct z-scores. I want it to calculate:
((Value from row z & given column of gex) - (Mean of values in row z across mxy)) / (Standard deviation of values in row z across mxy)
Then move on to the next row, and so on, filling in the first vector. THEN, I want it to calculate the same thing using the next column of gex, filling in a separate vector. I hope this makes sense.
I have a separate script which runs the same test using a pre-determined column vs the other data frame. The relevant for loop from that script looks like this:
for(i in seq_along(mxy)){
zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}

I think there may be a typo in your code: you say you want the "Mean of values in row z across mxy", but you are using mean(mxy[,i]), which selects the i'th column, not the i'th row. I rewrote this section with for loops for clarity (I'm not sure why you were using lapply).
# a function for calculating the z-score
zScore <- function(x,y)(x - mean(y,na.rm=T))/sd(y,na.rm=T)
for(i in seq(nrow(mxy))) # note that length(mxy) is actually the number of columns in mxy
  for(colName in listx){
    # gex value in row i of column colName, against row i of mxy;
    # unlist() turns the one-row data frame into a plain numeric vector
    zvalues <- zScore(gex[i,colName],
                      unlist(mxy[i,]))
    geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL,
                              stringsAsFactors = FALSE)
    # the first row creates the file, later rows are appended to it
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE, quote=F,
                sep = ",", dec = ".", append=(i > 1))
  }
An alternative that does not rely on append:
for(colName in listx){
  geneexptest <- NULL
  for(i in seq(nrow(mxy))) {
    # gex value in row i of column colName, against row i of mxy
    zvalues <- zScore(gex[i,colName],
                      unlist(mxy[i,]))
    geneexptest <- rbind(geneexptest,
                         data.frame(gex$sample[i], zvalues, row.names = NULL,
                                    stringsAsFactors = FALSE))
  }
  # write the complete data frame for this column in one call
  write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
              row.names=FALSE, col.names=FALSE, quote=F,
              sep = ",", dec = ".")
}
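To make the row-versus-column point concrete, here is a small sketch on a made-up data frame (the name mxy_toy and its values are illustrative only, not taken from the question). It also shows why it helps to pass a row through unlist() first: mean() applied directly to a one-row data frame returns NA with a warning.

```r
# Toy data frame (hypothetical values): 2 rows, 3 columns
mxy_toy <- data.frame(a = c(1, 4), b = c(2, 6), c = c(3, 8))

n_cols <- length(mxy_toy)  # length() of a data frame counts its columns
n_rows <- nrow(mxy_toy)    # nrow() counts its rows

row1 <- unlist(mxy_toy[1, ])  # first ROW as a numeric vector: 1 2 3
col1 <- mxy_toy[, 1]          # first COLUMN: 1 4

# A one-row data frame is still a data frame, not a numeric vector,
# so mean() gives NA (with a warning, suppressed here)
na_mean <- suppressWarnings(mean(mxy_toy[1, ]))

# z-score of the single value 4 against row 1: (4 - 2) / 1
z <- (4 - mean(row1)) / sd(row1)
```

So seq(nrow(mxy)) walks over rows, while seq_along(mxy) or seq(length(mxy)) walks over columns.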

Related

Combining multiple data sets in R

I am a complete beginner in R and in programming generally. Right now I am trying to process hundreds of comma-separated data files using R. For time series analyses, I need to concatenate the data sets sequentially. Unfortunately, the data files do not have a designated timestamp column, and they start with some header lines. I am therefore parsing the file creation time from the second line of each data file and adding time steps based on the sampling frequency, which can be found in the third line of the file. The sampling frequency varies from file to file, and it can also be identified from regex patterns in the filename. The first three header lines look like this:
SPU1 Monitor Data File
SPU Data Filename = 06Aug2021 ,07 -08 -28,s1c1h17.txt
Sample Frequency = 1
Or
SPU1 Traffic Data File
SPU Data Filename = 05Aug2021 ,02 -48 -14,s1c1p2311.txt
Sample Frequency = 20
I have tried a for loop as well as lapply. With the for loop, the script only runs once. With lapply, I get the following message. What am I doing wrong?
Error in file(file, "rt") : invalid 'description' argument
In addition: Warning messages:
1: In n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, :
file doesn't exist
2: In n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE, :
file doesn't exist
Called from: file(file, "rt")
Here is the code I am trying:
setwd("C:/Users/rottweiller/Desktop/Practicing R")
filenames <- list.files(path="C:/Users/rottweiller/Desktop/Practicing R", pattern="c1h|c1p", full.names=FALSE)
library(reader)
library(readr)
library(tidyverse)
AddTS <- function(filenames){
#frq1 <- parse_number(n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2))
frq1 <- as.integer(gsub("\\D", "", n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2)))
TL1 <- n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE, n = 1, skip = 1)
SUTC1 <- lubridate::parse_date_time(gsub("\\s-|\\s", "",
stringr::str_extract(TL1, "[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}")), orders = "dmYHMS")
C1 <- as.data.frame(read.delim(filenames[i], header = FALSE, sep = ",", skip = 79))
C1[] <- lapply(C1, function(j) if(is.numeric(j)) ifelse(is.infinite(j), 0, j) else j)
TS1 <- SUTC1 + (1/frq1)*seq_len(nrow(C1))
Card1 <- cbind(TS1, C1)
}
combined <- dplyr::bind_rows(lapply(filenames, AddTS))
Or
for(i in 1:length(filenames)){
frq1 <- parse_number(n.readLines(paste(filenames[i], sep = ","), header = FALSE, n = 1, skip = 2), trim_ws = TRUE)
TL1 <- n.readLines(paste(filenames[i], sep = ",|\\s|-"), header = FALSE, n = 1, skip = 1)
SUTC1 <- lubridate::parse_date_time(gsub("\\s-|\\s", "",
stringr::str_extract(TL1, "[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}")),
orders = "dmYHMS")
C1 <- as.data.frame(read.delim(filenames[i], header = FALSE, sep = ",", skip = 79))
C1[] <- lapply(C1, function(j) if(is.numeric(j)) ifelse(is.infinite(j), 0, j) else j)
TS1 <- SUTC1 + (1/frq1)*seq_len(nrow(C1))
Card1 <- cbind(TS1, C1)
}
It's a good starting step that you already know about regular expressions and recent R libraries.
You could do something like this:
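purrr::map_dfr(filenames, function(f) {
  # read the whole file once; f is a single file name
  lines <- readLines(f)
  # sampling frequency: the trailing digits of the third header line
  frq <- lines[3] %>%
    str_replace(".*?(\\d*)$", "\\1") %>%
    as.integer()
  # start time: the date and time embedded in the second header line
  SUTC <- lines[2] %>%
    stringr::str_extract("[SPU Data Filename = ]?\\d{2}\\D{3}\\d{4}\\s\\,\\d{2}\\s-\\d{2}\\s-\\d{2}") %>%
    lubridate::parse_date_time(orders = "dmYHMS")
  # data block: everything from two lines after the "end of text" marker
  C <- lines[(which(lines == "end of text") + 2):length(lines)] %>%
    textConnection() %>%
    read.delim(header = FALSE, sep = ",") %>%
    # replace infinite values in numeric columns with 0, as in the question
    mutate(across(where(is.numeric), ~ ifelse(is.infinite(.x), 0, .x)))
  # one timestamp per data row, spaced 1/frq seconds apart
  TS <- SUTC + seq_len(nrow(C)) / frq
  bind_cols(file = f, TS = TS, C)
})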
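One likely cause of the original error is that AddTS indexes filenames[i] even though lapply already hands the function one file name at a time, so i is either undefined or left over from an earlier loop. Below is a minimal base-R sketch of the same idea under stated assumptions: read_with_timestamps is a hypothetical helper, the two temporary files are made-up stand-ins with only three header lines (the real files apparently need skip = 79), and the month is looked up in month.abb so that parsing does not depend on the locale.

```r
# Hypothetical helper: reads ONE file and prepends a timestamp column.
# `skip` = number of header lines before the data (an assumption here).
read_with_timestamps <- function(path, skip = 3) {
  header <- readLines(path, n = 3)
  # third line, e.g. "Sample Frequency = 20"
  freq <- as.integer(gsub("\\D", "", header[3]))
  # second line, e.g. "SPU Data Filename = 05Aug2021 ,02 -48 -14,..."
  pattern <- "(\\d{2})(\\D{3})(\\d{4})\\s*,(\\d{2})\\s*-(\\d{2})\\s*-(\\d{2})"
  m <- regmatches(header[2], regexec(pattern, header[2], perl = TRUE))[[1]]
  start <- ISOdatetime(as.integer(m[4]), match(m[3], month.abb),
                       as.integer(m[2]), as.integer(m[5]),
                       as.integer(m[6]), as.integer(m[7]), tz = "UTC")
  dat <- read.csv(path, header = FALSE, skip = skip)
  # one timestamp per row, spaced 1/freq seconds apart
  data.frame(TS = start + seq_len(nrow(dat)) / freq, dat)
}

# Two made-up files that mimic the header layout from the question
f1 <- tempfile(fileext = ".txt")
writeLines(c("SPU1 Monitor Data File",
             "SPU Data Filename = 06Aug2021 ,07 -08 -28,s1c1h17.txt",
             "Sample Frequency = 1",
             "1.5,2.5", "3.5,4.5"), f1)
f2 <- tempfile(fileext = ".txt")
writeLines(c("SPU1 Traffic Data File",
             "SPU Data Filename = 05Aug2021 ,02 -48 -14,s1c1p2311.txt",
             "Sample Frequency = 20",
             "5.5,6.5", "7.5,8.5", "9.5,10.5"), f2)

# lapply passes each path to the helper; rbind once at the end
combined <- do.call(rbind, lapply(c(f1, f2), read_with_timestamps))
```

The key design choice is that the helper takes a single path and never refers to a loop index, so it behaves the same under lapply, a for loop, or purrr::map_dfr.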

How do you store the result of each iteration of a for loop into its own matrix?

I have a for loop that takes each sample file on a list, creates a matrix for that sample, and then stores it into one big list of all the sample matrices.
Here is what I have done so far:
# load in data ------------------------------------------------------------------
filePaths = getGEOSuppFiles("GSE124395")
tarF <- list.files(path = "./GSE124395/", pattern = "*.tar", full.names = TRUE)
untar(tarF, exdir = "./GSE124395/")
gzipF <- list.files(path = "./GSE124395/", pattern = "*.gz", full.names = TRUE)
ldply(.data = gzipF, .fun = gunzip)
#running test loop -------------------------------------------------------------
testlist <- c("./GSE124395//GSM3531672_P301_3_CRYOMIXED11.coutt.csv",
"./GSE124395//GSM3531673_P301_4_CRYOMIXED12.coutt.csv",
"./GSE124395//GSM3531674_P301_5_HEP1_1_5.coutt.csv")
LoopList_test <- list()
for (i in 1:length(testlist)){
matrix_test <- read.delim(file =testlist[i])
matrix_test <- data.frame(matrix_test[,-1], row.names=matrix_test[,1])
matrix_test <- as.matrix(matrix_test) #<- makes the excel file into a matrix
colname_test <- read.delim(file =testlist[i])
colname_test <- read.table(file = './GSE124395//GSE124395_celseq_barcodes.192.txt', header = FALSE, row.names = 1)
colname_test <- data.frame(colname_test[,-1], col=colname_test[,1])
colname_test <- as.matrix(colname_test)
colnames(matrix_test) <- colname_test[,1]
LoopList_test[[i]]<-matrix_test
}
This is the output:
[Screenshot: part of the output in the one big list]
I would like the loop to store the result of each iteration into its own matrix, so I have multiple matrices instead of one giant list of matrices, if that makes sense. I think it involves either splitting this one giant list into sublists, or storing the results of the loop into a matrix/array/vector instead of a list, or somehow having it store each iteration into its own variable within the loop. I'm not sure how to go about doing any of those.
Thanks for reading!
UPDATE:
So the whole point of this was to create matrices to then combine them into one matrix. Then turn this one matrix into a Seurat object which I could then perform clustering on.
Here is what I have done so far: I wrote a separate loop for each group within the dataset and added whatever information I needed. It turns out that the function I think I need takes a list as input, so keeping the results in a list works for me. Here's the code I have settled on for the moment:
mylist<-list.files(path = "./GSE124395/", pattern = "\\.csv$",full.names = TRUE)
LoopList <- list()
for (i in 1:30){
matrix_input <- read.delim(file =mylist[i])
matrix_input <- data.frame(matrix_input[,-1], row.names=matrix_input[,1])
matrix_input <- as.matrix(matrix_input) #<- makes the excel file into a matrix
colname_input <- read.delim(file =mylist[i])
colname_input <- read.table(file = './GSE124395//GSE124395_celseq_barcodes.192.txt', header = FALSE, row.names = 1)
colname_input <- data.frame(colname_input[,-1], col=colname_input[,1])
colname_input <- as.matrix(colname_input)
colnames(matrix_input) <- colname_input[,1]
colnames(matrix_input) <- paste(colnames(matrix_input), "Colorectal_Metastasis", sep = "_")
P301_pdat <- data.frame("samples" = colnames(matrix_input), "treatment" = "Colorectal_Metastasis")
sobj <- CreateSeuratObject(counts = matrix_input, min.cells = 0, min.features = 1,
project = "Patient301_Colorectal_Metastasis")
LoopList[[i]]<-sobj
#LoopList <- assign(paste0("Patient301", i), sobj )
}
# P304 loop -------------------------------------------------------------------------
for (i in 31:56){
matrix_input <- read.delim(file =mylist[i])
matrix_input <- data.frame(matrix_input[,-1], row.names=matrix_input[,1])
matrix_input <- as.matrix(matrix_input) #<- makes the excel file into a matrix
colname_input <- read.delim(file =mylist[i])
colname_input <- read.table(file = './GSE124395//GSE124395_celseq_barcodes.192.txt', header = FALSE, row.names = 1)
colname_input <- data.frame(colname_input[,-1], col=colname_input[,1])
colname_input <- as.matrix(colname_input)
colnames(matrix_input) <- colname_input[,1]
colnames(matrix_input) <- paste(colnames(matrix_input), "Colorectal_Metastasis", sep = "_")
P304_pdat <- data.frame("samples" = colnames(matrix_input), "treatment" = "Colorectal_Metastasis")
sobj <- CreateSeuratObject(counts = matrix_input, min.cells = 0, min.features = 1,
project = "Patient304_Colorectal_Metastasis")
LoopList[[i]]<-sobj
}
and so on. Then, following https://satijalab.org/seurat/articles/integration_large_datasets.html
sobj.list <- SplitObject(LoopList, split.by = "orig.ident")
joined <- lapply(X = LoopList, FUN = function(x) {
x <- NormalizeData(x, verbose = FALSE)
x <- FindVariableFeatures(x, verbose = FALSE)
})
features <- SelectIntegrationFeatures(object.list = joined)
joined <- lapply(X = joined, FUN = function(x) {
x <- ScaleData(x, features = features, verbose = FALSE)
x <- RunPCA(x, features = features, verbose = FALSE)
})
anchors <- FindIntegrationAnchors(object.list = joined, reduction = "rpca",
dims = 1:50)
joined.integrated <- IntegrateData(anchorset = anchors, dims = 1:50)
joined.integrated <- ScaleData(joined.integrated, verbose = FALSE)
joined.integrated <- RunPCA(joined.integrated, verbose = FALSE)
joined.integrated <- RunUMAP(joined.integrated, dims = 1:50)
DimPlot(joined.integrated, group.by = "orig.ident")
DimPlot(joined.integrated, reduction = "umap", split.by = "treatment")
I don't know for sure whether this works, but I thought I would update this question to reflect what I've learned so far. I guess the lesson I've learned is to check whether there is a function that takes a list as input.
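For the original question of getting at each matrix individually, one option is to keep the list but give its elements names. The sketch below uses made-up matrices and hypothetical sample names, not the GSE124395 files, and shows three things: access by name, combining everything with a single cbind, and (only if separate variables are really wanted) list2env.

```r
# Toy stand-ins for the per-sample count matrices (hypothetical data)
sample_names <- c("sampleA", "sampleB", "sampleC")
mats <- lapply(seq_along(sample_names), function(k) {
  matrix(k, nrow = 2, ncol = 3,
         dimnames = list(c("gene1", "gene2"),
                         paste0("cell", 1:3, "_", sample_names[k])))
})
names(mats) <- sample_names

# Each matrix is reachable by name without leaving the list
second <- mats[["sampleB"]]

# One combined matrix, built with a single cbind over the whole list
combined <- do.call(cbind, mats)

# If separate variables are really needed, copy the list elements into
# an environment (a fresh one here, to avoid cluttering the workspace)
env <- list2env(mats, envir = new.env())
third <- get("sampleC", envir = env)
```

A named list tends to be easier to work with than many separate variables, because functions such as lapply can still process all samples in one call.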

Efficiently populating rows given possible values for each variable in R

I have a dataframe with 42 variables, each of which has different possible values. I am aiming to create a much larger dataframe which contains a row for each possible combination of values for each of the variables.
This will be millions of rows long and too large to hold in RAM. I have therefore been trying to make a script which appends each possible combination to an existing file. The following code works, but too slowly to be practical: it takes just under 5 minutes to run on my machine, even though it includes only 5 of the variables.
V1 <- c(seq(0, 30, 1), NA)
V2 <- c(seq(20, 55, 1), NA)
V3 <- c(0, 1, NA)
V4 <- c(seq(1, 16, 1), NA)
V5 <- c(seq(15, 170, 1), NA)
df_empty <- data.frame(V1 = NA, V2 = NA, V3 = NA, V4 = NA)
write.csv(df_empty, "table_out.csv", row.names = FALSE)
start <- Sys.time()
for(v1 in 1:length(V1)){
V1_val <- V1[v1]
for(v2 in 1:length(V2)){
V2_val <- V2[v2]
for(v3 in 1:length(V3)){
V3_val <- V3[v3]
for(v4 in 1:length(V4)){
V4_val <- V4[v4]
row <- cbind(V1_val, V2_val, V3_val, V4_val)
write.table(as.matrix(row), file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE,col.names = FALSE, row.names = FALSE)
}
}
}
}
print(abs(Sys.time() - start)) # 4.8 minutes
print(paste(nrow(read.csv("table_out.csv")), "rows in file"))
I have tested using data.table::fwrite() but this failed to be any faster than write.table(as.matrix(x))
I'm sure the issue I have is with using so many for loops but am unsure how to translate this into a more efficient approach.
Thanks
I guess you can try the following code to generate all combinations (the trailing $ in the pattern keeps helper variables such as V1_val out of the grid):
M <- as.matrix(do.call(expand.grid,mget(x = ls(pattern = "^V\\d+$"))))
and then you can save M to your designated file, e.g.,
write.table(M, file = "table_out.csv", sep = ",", append = TRUE, quote = FALSE,col.names = FALSE, row.names = FALSE)
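Since the question says the full table will not fit in RAM, one variation (my assumption, not part of the answer above) is to build the grid in chunks: loop over the values of one variable, call expand.grid on the remaining ones, and append each chunk to the file. A minimal sketch with shortened, made-up value vectors and a temporary output file:

```r
# Shortened, hypothetical value vectors
V1 <- c(0:3, NA)
V2 <- c(20:22, NA)
V3 <- c(0, 1, NA)

out <- tempfile(fileext = ".csv")
writeLines("V1,V2,V3", out)  # header line, written once

# One chunk per value of V1: only length(V2) * length(V3) rows are
# held in memory at any time
for (v1 in V1) {
  chunk <- expand.grid(V1 = v1, V2 = V2, V3 = V3)
  write.table(chunk, file = out, sep = ",", append = TRUE,
              quote = FALSE, col.names = FALSE, row.names = FALSE)
}

n_written <- nrow(read.csv(out))
```

This replaces one write per row with one write per chunk, which is where most of the time in the nested-loop version is likely to go. With 42 variables the outer loop could run over combinations of the first few variables rather than over a single one.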

R redefine a string to argument

following on from some help earlier I think all I need for this to work is a way to define the variable dimxST below as not a string as I need that to point to the data frame....
cpkstudy <- function(x,y){
dxST <- paste(x,"$",y, sep = "")
dLSL <- paste(y, "LSL", sep = "")
dUSL <- paste(y, "USL", sep = "")
dTar <- paste(y, "Target", sep = "")
dimxST <-
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
cpkstudy("cam1","D1")
link to the previous post
This is a different direction, and you may find the learning curve a bit steeper, but it's a lot more powerful. Instead of passing everything in as strings, we pass them without quotes, and use the rlang package to figure out where to evaluate D1.
# the same dummy data frame from Katia's answer
cam1 <- data.frame(D1 = rnorm(10),
D2 = rnorm(10))
PivSpecs <- list(D1LSL = 740, D1USL = 760, D1Target = 750)
library(rlang)
cpkstudy <- function(df, y){
quo_y <- enquo(y)
dLSL <- paste0(quo_name(quo_y), "LSL")
dUSL <- paste0(quo_name(quo_y), "USL")
dTar <- paste0(quo_name(quo_y), "Target")
dimxST <- eval_tidy(quo_y, data = df)
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
print(dimxST)
print (paste("dimLSL=", dimLSL) )
print (paste("dimUSL=", dimUSL) )
print (paste("dimTar=", dimTar) )
# ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
# alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
# notice that I am not quoting cam1 and D1
cpkstudy(cam1, D1)
If you want to learn more about this, I would suggest looking at https://dplyr.tidyverse.org/articles/programming.html as an overview (the dplyr package imports some of the functions used in rlang), and http://rlang.r-lib.org/index.html for a more complete list of all the functions and examples.
You can use function get() to get object value from its string representation. In this solution I did not evaluate ss.study.ca() function itself, since I do not have your real-case input data, instead I just print the values that would go there:
cpkstudy <- function(x,y){
#dxST <- paste0(x,"$",y)
dLSL <- paste0(y, "LSL")
dUSL <- paste0(y, "USL")
dTar <- paste0(y, "Target")
dimxST <- get(x)[,y]
print(dimxST)
dimLSL <- PivSpecs[[dLSL]]
dimUSL <- PivSpecs[[dUSL]]
dimTar <- PivSpecs[[dTar]]
print (paste("dimLSL=", dimLSL) )
print (paste("dimUSL=", dimUSL) )
print (paste("dimTar=", dimTar) )
#ss.study.ca(dimxST, LSL = dimLSL, USL = dimUSL, Target = dimTar,
# alpha = 0.05, f.na.rm = TRUE, f.main = "Six Sigma Study")
}
# create some dummy dataframe to test with this example
cam1 <- data.frame(D1 = rnorm(10),
D2 = rnorm(10))
# define a list that will be used within a function
PivSpecs <- list(D1LSL = 740, D1USL = 760, D1Target = 750)
#test function
cpkstudy("cam1","D1")
#[1] 1.82120625 -0.08857998 -0.08452232 -0.43263828 0.17974556 -0.91141414
# -2.30595203 -1.24014396 -1.83814577 -0.24812598
#[1] "dimLSL= 740"
#[1] "dimUSL= 760"
#[1] "dimTar= 750"
I also changed your paste() calls to paste0(), which uses sep = "" by default.
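A third option that avoids both get() and rlang is to pass the data frame itself and only the column name as a string, then extract the column with [[ ]]. This is a minimal sketch: lookup_inputs is a hypothetical helper name, the data are made up, and, as in the answers above, ss.study.ca() is not called; the function only collects the values that would be passed to it.

```r
# Made-up data in the same shape as the examples above
cam1 <- data.frame(D1 = c(749, 751, 750), D2 = c(1, 2, 3))
PivSpecs <- list(D1LSL = 740, D1USL = 760, D1Target = 750)

# Hypothetical helper: df is the data frame itself, y a column name
lookup_inputs <- function(df, y, specs = PivSpecs) {
  list(
    x      = df[[y]],                       # column picked by name
    LSL    = specs[[paste0(y, "LSL")]],
    USL    = specs[[paste0(y, "USL")]],
    Target = specs[[paste0(y, "Target")]]
  )
}

inputs <- lookup_inputs(cam1, "D1")
```

Passing the object rather than its name means the function does not depend on which environment a variable called "cam1" happens to live in.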

R: Appending to a data frame in a for loop

I have this loop that writes multiple CSV files, each built up by appending the results of successive iterations. As you can see below, the loop runs a statistical function (zScore) across each row of a subset of gex against mxy, writes the result for each row, then moves on to the next subset of gex.
My question is, instead of writing the appended result as a csv file, is there a way that I can just build a dataframe within the loop that looks the same?
Thank you for your kind help.
gex <- data.frame("sample" = c("BIX","HEF","TUR","ZOP","VAG"),
"TCGA-F4-6703-01" = runif(5, -1, 1),
"TCGA-DM-A28E-01" = runif(5, -1, 1),
"TCGA-AY-6197-01" = runif(5, -1, 1),
"TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))
listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")
mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
"TCGA-AA-3663-11" = runif(5, -1, 1),
"TCGA-AD-6901-01" = runif(5, -1, 1),
"TCGA-AZ-2511-01" = runif(5, -1, 1),
"TCGA-A6-A567-01" = runif(5, -1, 1))
colnames(mxy) <- gsub("[.]", "_",colnames(mxy))
zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))
for(i in seq(nrow(mxy))){
for(colName in listx){
zvalues <- zScore(gex[i,colName],
mxy[i,])
geneexptest <- data.frame(gex$sample[i], zvalues, row.names = NULL,
stringsAsFactors = TRUE)
write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
row.names=FALSE, col.names=FALSE, quote=F,
sep = ",", dec = ".", append=(i > 1))
}
}
In your posted code you have one csv file for each element of listx, and you are writing a number of lines one-by-one into each of these files. Instead, you could create a data frame for each element of listx and write each out with a single call to write.table.
dfs <- lapply(listx, function(colName) {
do.call(rbind, lapply(seq(nrow(mxy)), function(i) {
zvalues <- zScore(gex[i,colName], mxy[i,])
data.frame(gex$sample[i], zvalues, row.names = NULL, stringsAsFactors = TRUE)
}))
})
dfs
# [[1]]
# gex.sample.i. zvalues
# 1 BIX 1.1105593
# 2 HEF 0.5451948
# 3 TUR -1.4060388
# 4 ZOP -1.4218218
# 5 VAG 0.2780513
#
# [[2]]
# gex.sample.i. zvalues
# 1 BIX 2.0607386
# 2 HEF 1.6703912
# 3 TUR 1.3249181
# 4 ZOP 0.8865058
# 5 VAG 1.5289732
Now you can output the full data frame for each column using write.table.
Combining all the data frames together in a single call to rbind will be much more efficient than calling rbind at each loop iteration; see Circle 2 of The R Inferno for more details.
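If a single data frame for all columns is acceptable, the inner loop over rows can also be dropped, since the row means and row standard deviations of mxy can be computed once. This is a sketch on made-up data (gex_toy, mxy_toy and the column names are illustrative) and assumes the intended statistic is (value - row mean) / row standard deviation; the extra column named column records which gex column each z-value came from.

```r
# Made-up data in the same shape as the question's
gex_toy <- data.frame(sample = c("BIX", "HEF", "TUR"),
                      colA = c(0.5, -0.2, 0.1),
                      colB = c(0.9, 0.3, -0.4))
mxy_toy <- data.frame(m1 = c(0, 1, 2), m2 = c(1, 2, 3), m3 = c(2, 3, 4))
cols <- c("colA", "colB")

# Per-row statistics of mxy, computed once
row_mean <- unname(rowMeans(mxy_toy))      # 1 2 3
row_sd   <- unname(apply(mxy_toy, 1, sd))  # 1 1 1

# One small data frame per gex column, then a single rbind
pieces <- lapply(cols, function(colName) {
  data.frame(sample = gex_toy$sample,
             column = colName,
             zvalue = (gex_toy[[colName]] - row_mean) / row_sd)
})
result <- do.call(rbind, pieces)
```

split(result, result$column) would recover one data frame per column if separate outputs are still needed.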

Resources