R code: Dynamic Variables in a loop - r

I'm facing a challenge in R. I'm writing code that wraps another program, written in C++, called MHX.
MHX is used for chemical data analysis: you input some concentrations, etc. The integration between R and MHX works fine: I write my MHX code definitions with cat(CODE HERE) and then call a bash command to run MHX from the terminal.
The results from MHX are tab-delimited data tables that I can read in R without a problem. The problem is that I use R to simulate a large number of MHX calculations using loops.
Hence the need for dynamic variables, and here is where I'm stuck. Let me give you more information with examples of my R code:
for (i in 1:100) {
fin <- file.create("input/ex1") #MHX input file
fout <- file.create("output/ex1.out") #MHX output file
FNM <- paste0("table_data/pH", i, ".txt") #filename used inside MHX definition
file.create(FNM) #this is used to create FNM table in R
fXY <- file.create(paste0("table_data/ECOMXY", i, ".txt"))
ifelse (HERE SOME MATHEMATICAL DEFINITIONS OF SOME VARIABLES)
ksource(MHXCode) #THIS CALLS MY MHX CODE which is inside another R code called `MHXCode` using a custom function KSOURCE. No problem here.
Up to here I don't have major problems. Now I need to setup the dynamic variables:
First I am creating variables PHL1 to PHL100
assign(paste("PHL", i, sep=""), read.table(paste0("table_data/pH", i, ".txt") ,skip=0, sep="\t", head=TRUE, na.strings = "-Inf"))
Each PHL table contains two rows and about 20 columns. Now I am interested in creating data frames from the second row of each column. Take for example column number 1, which is called EMF; ideally I need to do the following for all tables from PHL1 to PHL100, which is very tedious:
EMFT <- cbind(PHL1$EMF[2], PHL2$EMF[2], PHL3$EMF[2], PHL4$EMF[2], PHL5$EMF[2], PHL6$EMF[2],PHL7$EMF[2], PHL8$EMF[2], PHL9$EMF[2], PHL10$EMF[2], ....... etc up to PHL100! )
I tried many things to achieve the above, but I was not successful, including:
XX <- assign(paste0("PHL", i, "$EMF[2]"), cat(paste0("PHL", i, "$EMF[2]")))
I will need to do the same for other variables in order to be able to create some complicated plots. I hope anyone would be able to help.
I must mention that the main problem with assign is that I get quoted variable names, so I cannot return their values. As for cat, you cannot use it to return a value; you get NULL in the example above. Simply put, I am stuck!
Please help.

Thanks to Justin, who gave me a clue to answer my question. Here is what I have done:
files <- list.files(path="table_data", pattern=".dat", full.names=T); files
FRM <- NULL
for (f in files){
dat <- read.table(f, skip=0, header=TRUE, sep="\t", na.strings="",quote="", colClasses="character")[2,]
Note that the [2, ] index keeps only row number 2 of each table, after the header has been read, which is exactly what I was looking for.
Now I can bind it all in one table for my plots.
FRM <- rbind(FRM, dat)
}
This is a short answer and I think it is neat, sorted!
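The same idea can be sketched without assign() by keeping the tables in a named list. This is only an illustration under assumptions: the demo folder, the two-column tables, and the names `demo_dir`, `PHL`, `EMFT` and `FRM` below are made up for the example and are not the real MHX output.

```r
# Hypothetical demo data: three small tab-delimited tables in a temporary folder
demo_dir <- file.path(tempdir(), "table_data_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  tab <- data.frame(EMF = c(i, i * 10), pH = c(7, 7 + i / 10))
  write.table(tab, file.path(demo_dir, paste0("pH", i, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Build the file names explicitly so they stay in numeric order (pH1, pH2, ...)
files <- file.path(demo_dir, paste0("pH", 1:3, ".txt"))

# One list element per table instead of one variable per table
PHL <- lapply(files, read.table, header = TRUE, sep = "\t")
names(PHL) <- paste0("PHL", seq_along(PHL))

# Second row of the EMF column from every table
EMFT <- sapply(PHL, function(tab) tab$EMF[2])

# Second row of every table, stacked into one data frame
FRM <- do.call(rbind, lapply(PHL, function(tab) tab[2, ]))
```

Individual tables remain reachable as `PHL[["PHL2"]]` or `PHL$PHL2`, so nothing is lost compared with separate variables.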

Related

R- how to fix memory issues in R?

In all honesty, I'm not quite sure what the issue is, but I've had a similar issue in R in the past. I've written code to extract the variables I want from .dat files (specifically the current panel survey). I have CSV files that contain the positions of each variable by year (positions change by year). For example, HRFS12M1 is at 1173-1174 in 2010 and at 1223-1224 in 2019 (this part of the code is not shown, but it works, so I didn't include it). So I have two separate directories, one with the positions and one with the .dat files. I first loop through the position files and create data frames with the positions for each year (2010-2019). After the position data frames are generated, I run the code below to collect the variables I want into one large merged data frame. The code works as intended when I select 4 or fewer variables in varList. However, the moment I try to use more variables, the data frame starts to contain values that don't belong to those columns. Does anyone know why it's doing this? I've tried several different variables to confirm that the problem is not the position files but the number of variables.
#Loop through list of .dat files (lst2 contains name of files example:2010dec.dat)
for(i in 1:length(lst2)){
#Import the data cps data set
temp_cps<-readLines(lst2[i])
#Get the positions of the relevant year
temp_pos<- get(paste("Year", i, sep = "."))
#List of Variables we are looking at (can't use more than 4)
varList=c("HRYEAR4","GESTFIPS","HESP1","HRFS12M1")
#Get positions only for the variables selected
temp_pos=temp_pos[grep(paste(varList, collapse="|"), temp_pos$Variable),]
#Create the dataframe
df<-NULL
for(j in 1:length(varList)){
df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j]))
}
df<-as.data.frame(df)
names(df)<-varList
assign(paste("CPS", i, sep = "."), df)
}
#AutoMate appending each year
for (k in 1:(length(lst2)-1)){
if(k==1){
CPS1 <- get(paste("CPS", k, sep = "."))
CPS2 <- get(paste("CPS", k+1, sep = "."))
#Append to keep only rows of second data set
merged_data=rbind(CPS1,CPS2)
}
else{
CPS_C <- get(paste("CPS", k+1, sep = "."))
merged_data=rbind(merged_data,CPS_C)
}
if(k==length(lst2)-1){
#Clear the workspace, keeping only merged_data
rm(list=setdiff(ls(), "merged_data"))
}
}
This is what it looks like before it breaks: [image not included]
This is what happens after adding more than 4 variables: [image not included]
I think I figured it out, though I need to run a few extra variables to confirm. The program currently won't work if my list of variables is not ordered by position. For example, if "HRYEAR4" is 82-84 and "GESTFIPS" is 93-94, then the program fails if I put GESTFIPS before HRYEAR4 in varList. However, if HRYEAR4 comes first, the program runs as intended. Does anyone have a quick idea how to replace the line df<-cbind(df,substr(temp_cps,temp_pos$Pos1[j],temp_pos$Pos2[j])) to make it more dynamic and avoid this issue? If not, it's not a big deal; for the moment I'll just put them in order and see if I can find a better solution in the future. Thanks to anyone who tried to help.
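One possible way to remove the ordering requirement is to look up each variable's row with match() instead of grep(). Subsetting with grep() returns rows in the order of the position table, not in the order of varList, and a regular-expression pattern can also pick up longer names that merely contain one of the requested names. The sketch below uses made-up records and positions (not real CPS layouts); only the column names Variable, Pos1 and Pos2 are taken from the question.

```r
# Hypothetical fixed-width records and position table (not real CPS layouts)
temp_cps <- c("2010AL1", "2011NY2")
temp_pos <- data.frame(
  Variable = c("HRYEAR4", "GESTFIPS", "HESP1"),
  Pos1 = c(1, 5, 7),
  Pos2 = c(4, 6, 7),
  stringsAsFactors = FALSE
)

# Deliberately NOT in positional order
varList <- c("HESP1", "HRYEAR4")

# Exact lookup: idx[j] is the row of temp_pos that describes varList[j]
idx <- match(varList, temp_pos$Variable)
stopifnot(!anyNA(idx))  # fail early if a requested variable has no position

# Cut each variable out of every record, in varList order
cols <- lapply(idx, function(r) substr(temp_cps, temp_pos$Pos1[r], temp_pos$Pos2[r]))
df <- as.data.frame(setNames(cols, varList), stringsAsFactors = FALSE)
```

Because the row index comes from varList itself, the column names and the extracted values stay aligned however the list is ordered.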

R function doesn't recognize variable

I am not very familiar with loops in R, and am having a hard time stating a variable such that it is recognized by a function, DESeqDataSetFromMatrix.
pls is a table of integers. metaData is a data frame containing sample IDs and conditions corresponding to pls. I verified that the steps below run error-free when I use the individual elements of cond directly.
I reviewed relevant posts on referencing variables in R:
How to reference variable names in a for loop in R?
How to reference a variable in a for loop?
Based on these posts, I modified i in line 3 with single brackets, double brackets and "as.name". No luck. DESeqDataSetFromMatrix is reading the literal text after ~ and spits out an error.
cond=c("wt","dhx","mpp","taz")
for(i in cond){
dds <- DESeqDataSetFromMatrix(countData=pls,colData=metaData,design=~i, tidy = TRUE)
"sizeFactors"(dds) <- 1
paste0("PLS",i)<-DESeq(dds)
pdf <- paste(i,"-PLS_MA.pdf",sep="")
tsv <- paste(i,"-PLS.tsv",sep="")
pdf(file=pdf,paper = "a4r", width = 0, height = 0)
plotMA(paste0("PLS",i),ylim=c(-10,10))
dev.off()
write.table(results(paste0("PLS",i)),file = tsv,quote=FALSE, sep='\t', col.names = NA)
}
With brackets, I get an "unexpected symbol" error.
With i alone, DESeqDataSetFromMatrix tries to read "i" from my metaData columns.
Is R just not capable of reading variables in some situations? Generally speaking, is it better to write loops outside of R in a more straightforward language, then push as standalone commands? Thanks for the help—I hope there is an easy fix.
For anyone else who may be having trouble looping with DESeq2 functions, comments above addressed my issue.
Correct input:
dds <- DESeqDataSetFromMatrix(countData=pls,colData=metaData,design=as.formula(paste0("~", i)), tidy = TRUE)
as.formula worked well with all DESeq functions that I tested.
reformulate(i) also worked well in most situations.
Thanks, everyone for the help!
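The formula-building step can be tried on its own without DESeq2. The sketch below is an assumption-laden stand-in: it uses lm() on the built-in mtcars data instead of DESeqDataSetFromMatrix, and the names `fits` and `FIT_` are invented for the example. It also stores each result in a named list, since an assignment of the form paste0("PLS", i) <- value is not valid R.

```r
# Stand-in for the DESeq2 loop: build a formula from a string in each iteration
cond <- c("wt", "hp")          # column names of the built-in mtcars data
fits <- list()                 # results go in a named list, not in pasted variable names

for (i in cond) {
  f <- as.formula(paste0("mpg ~ ", i))    # "mpg ~ wt", then "mpg ~ hp"
  fits[[paste0("FIT_", i)]] <- lm(f, data = mtcars)
}

# reformulate() builds a one-sided formula like the design argument, here ~wt
one_sided <- reformulate("wt")
```

Each fitted object can then be retrieved by name, for example `fits[["FIT_wt"]]`, and passed on to plotting or export functions.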

Saving output of for-loop for every iteration

I am currently working on an imputation project where I need to evaluate my imputation methods. I have an incomplete data frame with NAs, from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases, which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the data frame containing the complete cases. The data frame with the generated NAs gets stored in the object "result", as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop which generates the NAs in another loop which contains the replicate() command and counts from 1:100 and saves these 100 replicated data frames but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
dat = comp_cas[,i]
missRate = Z32_miss_item$miss_per_item[i]
cat (i, " ", paste0(dat, collapse=",") ," ", missRate, "!\n")
df <- data.frame("res"= GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
colnames(df) = gsub("res", paste0("Var", i), colnames(df))
result = cbind(result, df)
}
result = result[,-1]
I expect every data frame of the 100 runs to be saved in a separate .rda file in my project folder.
Also, is imputation, and evaluating how well it fits, beginner stuff in R? At what level of proficiency am I, judging by the code I posted?
It is difficult to guess exactly what you are doing without some dummy data, but it is fine to have loops within loops and to save data frames. Firstly, I would avoid the replicate function here, as its syntax is unusual, and just stick with plain loops. Secondly, make sure the nested loops use different index variables (i.e. for(i ... should be surrounded by, say, for(j ...), since a for loop in R does not get its own scope and the inner loop would overwrite the outer loop's index. Finally, use saveRDS rather than save, so that each object (data frame) is stored in its own .rds file and can be read back under any name with readRDS. The save function stores objects together with their names and is better suited to saving several objects (or, via save.image, your whole workspace) so that you can pick up where you left off.
fun <- function(i){
df <- data.frame(x=rnorm(5))
names(df) <- paste0("x",i)
df
}
for(j in 1:100){
res <- data.frame(id=1:5)
for(i in 1:10){
res <- cbind(res, fun(i))
}
saveRDS(res, sprintf("replication_%s.rds",j))
}
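Reading the saved replications back in can be sketched as follows. The folder name `replications_demo`, the use of tempdir(), and the reduced count of three replications are assumptions made only to keep the example small and self-contained.

```r
# Write a few replications to a temporary folder (3 instead of 100 for brevity)
out_dir <- file.path(tempdir(), "replications_demo")
dir.create(out_dir, showWarnings = FALSE)

for (j in 1:3) {
  res <- data.frame(id = 1:5, x = rnorm(5))
  saveRDS(res, file.path(out_dir, sprintf("replication_%s.rds", j)))
}

# Read every .rds file back into one list, one data frame per replication
rds_files <- list.files(out_dir, pattern = "\\.rds$", full.names = TRUE)
replications <- lapply(rds_files, readRDS)
names(replications) <- basename(rds_files)
```

Keeping the replications in a list makes it straightforward to run the same evaluation over all of them with lapply() or sapply().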

R - Creating subsets of several datasets in a loop

I have a quite big number of quite heavy datasets. I would like to extract a subset out of each of them and save it into different csv files (one for each dataset). These are the commands I would like to loop for all the files I have in the folder:
df <-read.csv("1985.csv",header=FALSE,stringsAsFactors=TRUE,sep="\t")
df_short <- df[df$V6=="OPP", ]
write.csv(df_short, file = "OPP_1985.csv",row.names=FALSE)
rm(df)
rm(df_short)
This is probably a very noob question, but I am struggling to understand how to do it, so I would appreciate a lot help with this!
EDIT:
Following @SimonShine's suggestion, I have run this code and it works!
You don't specify if you are trying to collect the subsets into one dataset, or if you are trying to make one file per subset. You refer to OPP_1985 that appears out of scope for the code you wrote. Did you mean to refer to df_short?
You could start by abstracting what you want to do with one datafile into a function, e.g.:
extract_and_save_from_dataset <- function(csvfile) {
df <- read.csv(csvfile, header=F, stringsAsFactors=T, sep="\t")
df_short <- df[df$V6 == "OPP",]
# Replace only the trailing ".csv" extension (the dot is escaped so it is matched literally)
csvfile_short <- sub("\\.csv$", "_short.csv", csvfile)
write.csv(df_short, file=csvfile_short, row.names=FALSE)
}
Assuming you have a collection of dataset filenames, you could apply this function multiple times:
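# csvfiles <- c("1985.csv", "1986.csv", ...)
# full.names=TRUE returns complete paths, so the files are found from any working directory
csvfiles <- list.files("/path/to/my/csvfiles", pattern="\\.csv$", full.names=TRUE)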
for (csvfile in csvfiles) {
extract_and_save_from_dataset(csvfile)
}
The data.table approach is probably the fastest option, especially if you have a large dataset. The function fwrite {data.table} writes in parallel using multiple CPU threads, which makes it extremely fast.
Here is how you can divide your original data into subgroups defined by the values of df$V6 and save each subset into a separate .csv file.
library(data.table)
# setDT() converts df to a data.table by reference; fwrite() is then called once per V6 group
setDT(df)[, fwrite(.SD, paste0("output_", V6, ".csv")), by = V6, .SDcols = names(df)]
PS: The files will be named output_*.csv, where * is the corresponding V6 value.
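For comparison, the same one-file-per-group idea can be sketched in base R with split(), without data.table. The toy data frame, the values "GOV" and "IND", and the temporary output folder below are assumptions for illustration; only the column name V6 and the value "OPP" come from the question.

```r
# Toy data standing in for one of the large datasets
df <- data.frame(
  V1 = 1:6,
  V6 = c("OPP", "GOV", "OPP", "IND", "GOV", "OPP"),
  stringsAsFactors = FALSE
)

out_dir <- file.path(tempdir(), "subsets_demo")
dir.create(out_dir, showWarnings = FALSE)

# split() returns a named list with one data frame per distinct V6 value
groups <- split(df, df$V6)

# Write each group to its own file, named after the group
for (g in names(groups)) {
  write.csv(groups[[g]],
            file.path(out_dir, paste0("output_", g, ".csv")),
            row.names = FALSE)
}

written <- sort(list.files(out_dir))
```

This is likely slower than fwrite on very large data, but it needs no extra packages.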

nested for loop to create histograms named according to list

I'm new to R and need to create a bunch of histograms that are named according to the population they came from. When I try running the loop without the "names" part, it works fine. The code below loops through the list of names and applies them in order, but I end up with 3,364 versions of the same exact histogram. If anyone has any suggestions, I'd really appreciate it.
popFiles <- list.files(pattern = "*.txt") # generates a list of the files I'm working with
popTables <- lapply(popFiles, read.table, header=TRUE, na.strings="NA")
popNames <- read.table(file.path("Path to file containing names", "popNamesR.txt"), header=FALSE)
popNames <- as.matrix(popNames)
name <- NULL
table <- c(1:58)
for (table in popTables){
for (name in popNames){
pVals <- table$p
hist(pVals, breaks=20, xlab="P-val", main=name)
}
}
Try making a distinct iterator, and use that, rather than iterating over the table list itself. It's just easier to see what's going on. For example:
pdf("Myhistograms.pdf")
for(i in 1:length(popTables)){
table = popTables[[i]]
name = popNames[i]
pVals = table$p
hist(pVals, breaks=20, xlab="P-val", main=name)
}
dev.off()
In this case, your problem is that name and table are actually linked, but you have two for loops, so every combination of table and name is generated (58 tables x 58 names = 3,364 histograms).
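A self-contained version of the single-index loop, using simulated p-values in place of the real population files, might look like this. The three fake tables, the names popA to popC, the output path in tempdir(), and the `titles` bookkeeping vector are all assumptions added for the illustration.

```r
set.seed(1)

# Simulated stand-ins for the real population tables and their names
popTables <- lapply(1:3, function(k) data.frame(p = runif(50)))
popNames <- c("popA", "popB", "popC")

out_file <- file.path(tempdir(), "histograms_sketch.pdf")
pdf(out_file)

titles <- character(0)  # records which title was used for each plot
for (i in seq_along(popTables)) {
  # The same index i picks the table and its matching name
  hist(popTables[[i]]$p, breaks = 20, xlab = "P-val", main = popNames[i])
  titles <- c(titles, popNames[i])
}

dev.off()
```

With one shared index there is exactly one histogram per table, each titled with its own population name.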
