write.table inside a function applied to a list of data frames overwrites outputs - r

I have almost finished a messy script that applies several statistical methods/tests to 11 data frames from different watersheds, with physico-chemical parameters as variables. I reached my goal, but I need to make the code functional.
To start, I wrote a function that computes the correlation and saves the results as .txt tables and .pdf images.
It works great when I run the function on one data frame at a time (for that you have to import each data frame separately using read.table, which is not shown in the code below).
Since I want it functional, I made a list of the 11 data frames and used lapply to run the function on each one. It works in the sense that it gives me one list (corr) containing the correlation results for each data frame.
Here come the issues:
The list corr looks like it holds values instead of data frames, so I don't know how to access or save them (see the corr list in the Environment/Data window). Well, up to here, at least it looks like the correlation results exist somewhere.
The second problem appears when I run corr<-lapply(PQ_data, cor_PQ). The function has lines that save the outputs as tables (.txt) and images (.pdf) using part of the name of the original data frame (e.g. the first element of PQ_data is "AgIX_E_PQ", so the table and plot from cor_PQ(PQ_data[["AgIX_E_PQ"]]) should be named "mCorAgIX_E_PQ.txt" and "CorAgIX_E_PQ.pdf" respectively), but I get just one output of each kind (mCorX[[i]].txt and CorX[[i]].pdf), containing the correlation result for the last data frame. That is, the tables and images for every data frame are overwritten into these generic mCorX[[i]].txt and CorX[[i]].pdf files.
I guess I have to define 'i' or something to avoid this. Should I define the cor_PQ function for PQ_data instead of X?
If anyone can see where I'm going wrong, I would appreciate any help.
My data: PQ_data (save it in your workspace and adjust setwd accordingly).
My code:
rm(list=ls(all=TRUE))
cat("\014")
setwd("C:/Users/Sol/Documents/ProyectoTítulo/CalidadAgua/Matrices/Regs") #my workspace
PQ_files<-list.files(path="C:/Users/Sol/Documents/ProyectoTítulo/CalidadAgua/Matrices/Regs",
pattern="\\_PQ.txt") #my list of 14 dataframes in my workspace.
PQ_data<-lapply(PQ_files, read.table) #read tables of the 14 dataframes in the list.
names(PQ_data)<-gsub("\\_PQ.txt","", PQ_files) #name the 14 dataframes with their original names.
#FUNCTION TO COMPUTE CORRELATIONS, SAVE TABLES AND PLOTS.
cor_PQ<-function(X) {
  corPQ<-cor(X, use="pairwise.complete.obs")
  outputname.txt<-paste0("mCor",deparse(substitute(X)),".txt")
  write.table(corPQ, file=outputname.txt)
  outputname.pdf<-paste0("Cor",deparse(substitute(X)),".pdf")
  pdf(outputname.pdf)
  plot(X)
  dev.off()
  return(corPQ)
}
corr<-lapply(PQ_data, cor_PQ)
After this, as I said, I get a list called "corr" with 11 elements containing the correlation results for each data frame in my list (PQ_data), but I can't access them as tables when I open the "corr" list in my Environment/Data window (they don't show the blue R arrow to expand the element).
And I get only 2 output files, called mCorX[[i]].txt and CorX[[i]].pdf, showing only the correlation result for the last data frame, because the write.table and pdf calls overwrite the results of the 10 previous calculations.
Again, I would appreciate any help. I really need a push to get the idea.
Thanks!!!

lapply doesn't pass the names of the list elements to the function, only the elements themselves. Inside lapply each call looks like FUN(X[[i]]), so deparse(substitute(X)) returns the string "X[[i]]" for every element instead of the data frame's name. That is why the function works when called on an individual data frame but not on the list: every output file gets the same name, each one overwrites the previous one, and in the end you are left with a single file holding the result for the last element of the list. You can use the function below, which takes the name as a second parameter and uses it to name the files.
cor_PQ<-function(X, Y) {
  corPQ<-cor(X, use="pairwise.complete.obs")
  outputname.txt<-paste0("mCor",Y,".txt")
  write.table(corPQ, file= outputname.txt)
  outputname.pdf<-paste0("Cor",Y,".pdf")
  pdf(outputname.pdf)
  plot(X)
  dev.off()
  return(corPQ)
}
Now use Map to apply the same function.
Map(cor_PQ, PQ_data, names(PQ_data))
We can also use imap from purrr to apply this function.
purrr::imap(PQ_data, cor_PQ)
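As a small self-contained check of the Map approach, here is a sketch that runs without the original files. PQ_demo, its column names, and the dir argument are made up for illustration, and the output goes to a temporary folder; the correlation results come back as a named list of matrices, which is why the Environment pane shows them as values rather than data frames.

```r
# Hypothetical stand-in for PQ_data: a named list of numeric data frames.
PQ_demo <- list(
  siteA = data.frame(pH = c(7.1, 6.8, 7.4, 7.0), temp = c(12, 15, 11, 14)),
  siteB = data.frame(pH = c(6.5, 6.9, 7.2, 6.7), temp = c(10, 13, 16, 12))
)

# Assumption: write to a temporary folder instead of the working directory.
out_dir <- tempdir()

cor_PQ <- function(X, Y, dir = out_dir) {
  corPQ <- cor(X, use = "pairwise.complete.obs")
  # Y is the element name, so every data frame gets its own files.
  write.table(corPQ, file = file.path(dir, paste0("mCor", Y, ".txt")))
  pdf(file.path(dir, paste0("Cor", Y, ".pdf")))
  plot(X)
  dev.off()
  corPQ
}

# Map walks over the data frames and their names in parallel.
corr <- Map(cor_PQ, PQ_demo, names(PQ_demo))

# Each element is a correlation matrix; access it by name or position.
corr[["siteA"]]

# Optional: convert the matrices to data frames if that is more convenient.
corr_df <- lapply(corr, as.data.frame)

# One .txt and one .pdf per data frame should now exist.
list.files(out_dir, pattern = "^m?Corsite")
```

The result list keeps the names of the input list, so corr$siteA and corr$siteB line up with the files written to disk.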

Related

How do I use a for loop to open .ncdf files and average a matrix variable that has different values over all the files? (Using R Programming)

I'm currently trying to compute an averaged matrix from all the matrix values of a specific air quality variable (ColumnAmountNO2TropCloudScreened) stored in different .ncdf4 files. The only way I was able to do it was by listing all the files, opening them with lapply, creating a separate NO2 variable for every .ncdf file, and then applying abind to all of the variables. Even though it worked, it took me a lot of time to type in different names for the NO2 variables (NO2_1, NO2_2, NO2_3, etc.) and the index of each file in the original list ([[1]], [[2]], [[3]], etc.).
I am trying to write code that is smarter and easier than typing in a bunch of numbers. I have all the original .ncdf4 files listed, and I am trying to loop over the files to open them and get the 'ColumnAmountNO2TropCloudScreened' matrix from each, so that I can then average them. However, I am having no luck. Would someone know what is wrong with this code or with my reasoning? Thanks.
The code I'm trying is as follows:
# Load libraries
library(ncdf4)
library(abind)
library(plot.matrix)
# Set working directory
setwd("~/researchdatasets/2020")
# Declare data frame
df=NULL
# List all files in one file
files1= list.files(pattern='\\.nc4$',full.names=FALSE)
# Loop to open files, get NO2 variables
for(i in seq(along=files1)) {
nc_data = nc_open(files1[i])
NO2_var<-ncvar_get(nc_data,'ColumnAmountNO2TropCloudScreened')
nc_close(nc_data)
}
# Average variables
list_NO2= apply(abind::abind(NO2_var,along=3),1:2,mean,na.rm=TRUE)
NCO's ncra averages variables across all input files with, e.g.,
ncra in*.nc out.nc
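If you would rather stay in R, one possible sketch is to collect one matrix per file in a list instead of overwriting NO2_var on every iteration, then stack the list into a 3-D array and average over the third dimension. The example below uses small made-up matrices in place of the ncvar_get() results so that it runs without any NetCDF files; read_no2 is a hypothetical helper (shown commented out) marking where the ncdf4 calls from the question would go, and the sketch assumes every file yields a matrix of the same dimensions.

```r
# Hypothetical helper for real data (not run here):
# read_no2 <- function(f) {
#   nc <- ncdf4::nc_open(f)
#   on.exit(ncdf4::nc_close(nc))
#   ncdf4::ncvar_get(nc, "ColumnAmountNO2TropCloudScreened")
# }
# NO2_list <- lapply(files1, read_no2)

# Made-up stand-in: one matrix per "file".
NO2_list <- list(
  matrix(c(1, 2, 3, 4), nrow = 2),
  matrix(c(3, NA, 5, 6), nrow = 2)
)

# Stack the matrices along a third dimension (rows x cols x files).
NO2_stack <- array(
  unlist(NO2_list),
  dim = c(dim(NO2_list[[1]]), length(NO2_list))
)

# Cell-wise mean across files, ignoring missing values.
NO2_mean <- apply(NO2_stack, 1:2, mean, na.rm = TRUE)
NO2_mean
```

Keeping the per-file matrices in a list removes the need for numbered variables such as NO2_1, NO2_2 and so on.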

paste input name between words to save it using write.table

I'm a complete newbie in R and have been teaching myself for a few weeks because of my degree work.
I'm almost done with the statistical analysis I need, but it is done through ugly, messy code, that is, repeating a lot of code for several data frames to apply different statistical tests, save results, etc.
Now, out of personal interest, I want to write this better, but I'm totally stuck and really need a push to get the idea, please.
For example, I want to create a function that computes the correlation for all the data tables I'm using and saves those results as tables, using the input name as part of the output name.
I mean, if we had the iris data but measured in different seasons, e.g. iris_fall, iris_winter, iris_spring and iris_summer, then after applying cor(X) to each one I want to save the results as tables called "mCoriris_fall.txt", "mCoriris_winter.txt", "mCoriris_spring.txt" and "mCoriris_summer.txt" respectively.
My useless code for now is:
cor_PQ<-function(X) {
cor_PQ<-cor(X, use="pairwise.complete.obs")
return(cor_PQ)
}
savecor<-function(t) {
outputname<-(paste0("mCor",t)) #HOW DO I CALL THE NAME OF THE INPUT? t is cor_PQ result matrix.
savecor<-write.table(t, file=paste0(outputname,".txt"))
return(savecor)
}
cor_PQ(Iris_fall)
I expect to get the cor result and save it as a table in my workspace, using the input name as part of the output name.
I'm aware these are 2 separate functions and that the code to write the table should be inside the function that calls cor(X), but I can't understand how.
I have been reading a lot but I just can't fit it all in my head.
Thanks to anyone who can help me.
Regards.
UNTIL HERE IT HAS BEEN SOLVED...
But after making a list of my 14 data frames to apply cor and other methods, the write.table function overwrites the 14 cor results in 1 single file. This is my code.
PQ_files<-list.files(path="C:/Users/Sol/Documents/ProyectoTítulo/CalidadAgua/Matrices/Regs",pattern="\\_PQ.txt")
PQ_data<-lapply(PQ_files, read.table)
names(PQ_data)<-gsub("\\_PQ.txt","", PQ_files)
PQ_data
cor_PQ<-function(X) {
  cor_PQ<-cor(X, use="pairwise.complete.obs")
  outputname.txt<-paste0("mCor",deparse(substitute(X)),".txt")
  write.table(cor_PQ, file=outputname.txt)
  outputname.pdf<-paste0("Cor",deparse(substitute(X)),".pdf")
  pdf(outputname.pdf)
  plot(X)
  dev.off()
  return(cor_PQ)
}
for (i in seq_along(PQ_data)){
  Correlaciones<-lapply(PQ_data,cor_PQ)
}
Correlaciones
In sum: it seems to work almost fine, except that write.table and plot(X) overwrite the outputs from the 14 data frames in my PQ_data with the names mCorX[[i]] and CorX[[i]], respectively.
Should I define [i] somehow so that each result gets the right name?
Also, when I run Correlaciones at the end, I can see the cor results for the 14 data frames in one single object, but I don't know how to split them correctly.
I guess I'm almost there.
THANKS AGAIN!
You can combine the two functions and use deparse(substitute(X)) to get the input name as a string:
cor_PQ <- function(X) {
  corPQ <- cor(X, use="pairwise.complete.obs")
  outputname <- paste0("mCor", deparse(substitute(X)), ".txt")
  write.table(corPQ, file=outputname)  # write the correlation matrix itself
  return(corPQ)
}
and then call
cor_PQ(Iris_fall)
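To see what deparse(substitute(X)) actually returns, here is a minimal runnable sketch. It assumes the built-in iris data as a stand-in for iris_fall and writes to a temporary folder; the dir argument is added only for this illustration. The last lines show the caveat that matters for the follow-up: when the function is called through lapply or sapply, the captured expression is X[[i]] rather than the name of the data frame.

```r
# Assumption: the numeric columns of the built-in iris data stand in for iris_fall.
iris_fall <- iris[, 1:4]
out_dir <- tempdir()

cor_PQ <- function(X, dir = out_dir) {
  # Capture the expression the caller typed for X, as a string.
  input_name <- deparse(substitute(X))
  corPQ <- cor(X, use = "pairwise.complete.obs")
  write.table(corPQ, file = file.path(dir, paste0("mCor", input_name, ".txt")))
  corPQ
}

res <- cor_PQ(iris_fall)
file.exists(file.path(out_dir, "mCoriris_fall.txt"))  # TRUE

# Caveat: inside lapply/sapply the argument expression is always X[[i]].
seen <- sapply(list(a = 1, b = 2), function(X) deparse(substitute(X)))
unname(seen)  # "X[[i]]" "X[[i]]"
```

So this trick is fine for direct calls on a named object, but for a list of data frames the name has to be passed in explicitly.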

I am trying to create an individual document term matrix for each of the rows in my dataset. I have applied the following code to each

I am trying to build a separate document term matrix for each individual row in a csv file. I have successfully read the csv file into RStudio using the read.csv command. As far as I could figure out, the first step in creating a document term matrix with the tm package is to create an individual corpus for each row of the file, and to try to achieve this I wrote the following code.
for(i in 1:no_row)
{
data$TextCorpus[i]<-Corpus(VectorSource(data$Text[i]))
#print(data$TextCorpus[i])
}
Here no_row equals the number of rows in the data (obtained with no_row<-nrow(data)), data$TextCorpus is a column I created to store the corpus's created by the loop, and data$Text is the column holding the text used to create each individual corpus.
I expected this to produce a corpus for each individual row. However, when I apply the class() function to the data$TextCorpus column, it says the column is a list, and this prevents me from applying tm_map functions to individual rows of the column. Furthermore, applying the as.Corpus function or any similar function to the column has no effect: the data$TextCorpus column is still classified as a list. Does anybody know how to fix this problem? Any help is greatly appreciated.
P.S. If corpus's isn't the plural of corpus, please feel free to correct me in your response.

print variable names in my own function r

I want to create a function that creates new data frames using some variables from other data frames. For that I think I need to print the variable names in my own function somehow.
The variables come from two data frames (asd and tetracam) which have six variables in common, the bands "w530", "w550", "w570", "670", "w700" and "w800". So, I want to create six data frames, one for each band. One by one I could write it like this:
# Band w530
w530<-data.frame(tetracam$filename,tetracam$time,tetracam$type,tetracam$w530,asd$w530)
names(w530)<-c("filename","time","type","tetracam","asd")
w530<-w530[order(w530$time),]
It works fine, but I'd like to do it as a function so that I can run it for all bands. I thought I would have to replace every w530 in the code above with a dynamic object, and I thought of using one of the apply family functions. So, I first created a vector with the names of my common variables:
bands<-c("w530","w550","w570","670","w700","w800")
Then I tried several approaches, for example using cat or sprintf to fill in my function with the strings from that vector. But it didn't work. Actually, I'm not sure which apply family function I would use, or whether any of them can be used in this case:
my.fun<- function(band){
sprintf("%s<-data.frame(tetracam$filename,tetracam$time,tetracam$type,asd$%s,tetracam$%s)",band,band,band)
sprintf("names(%s)<-c('filename','time','type','asd','tetracam')",band)
sprintf("%s[order(%s$time),]",band,band)
}
Any help is appreciated.
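The trick is to access the data frame column by name with the df[varName] idiom, and to store each result in a list so that it is not overwritten on the next iteration.
fun1 <- function(band, tetracam, asd){
  df<-data.frame(tetracam$filename,tetracam$time,tetracam$type,tetracam[band],asd[band])
  names(df)<-c("filename","time","type","tetracam","asd")
  df<-df[order(df$time),]
  return(df)
}
band_dfs <- list()
for (band in bands){
  # keep one data frame per band, indexed by the band name
  band_dfs[[band]] <- fun1(band, tetracam, asd)
}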
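The same idea can be written with lapply, which returns the per-band data frames as a list. The sketch below is self-contained: the tetracam and asd data frames, their values, and the helper name make_band_df are invented for illustration and only include two bands.

```r
# Made-up stand-ins for the two source data frames.
tetracam <- data.frame(
  filename = c("a.tif", "b.tif", "c.tif"),
  time     = c(3, 1, 2),
  type     = "leaf",
  w530     = c(0.11, 0.12, 0.13),
  w550     = c(0.21, 0.22, 0.23)
)
asd <- data.frame(
  w530 = c(0.10, 0.14, 0.12),
  w550 = c(0.20, 0.24, 0.22)
)
bands <- c("w530", "w550")

# Hypothetical helper: build one data frame for a band given as a string.
make_band_df <- function(band, tetracam, asd) {
  df <- data.frame(
    filename = tetracam$filename,
    time     = tetracam$time,
    type     = tetracam$type,
    tetracam = tetracam[[band]],  # column selected by name
    asd      = asd[[band]]
  )
  df[order(df$time), ]
}

# One data frame per band, named after the band.
band_dfs <- setNames(
  lapply(bands, make_band_df, tetracam = tetracam, asd = asd),
  bands
)
band_dfs$w530
```

Each band is then available as band_dfs[["w530"]], band_dfs[["w550"]] and so on, without creating separate top-level variables.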

r create and address variable in for loop

I have multiple csv files in one folder. I want to load each csv file in this folder into a separate data frame. Next, I want to extract certain elements from each data frame into a matrix and calculate the mean of all these matrices.
setwd("D:\\data")
group_1<-list.files()
a<-length(group_1)
mferg_mean<-data.frame
for(i in 1:a)
{
assign(paste0("mferg_",i),read.csv(group_1[i],header=FALSE,sep=";",quote="",dec=",",col.names=1:90))
}
As there are 11 csv-files in the folder I now have the data frames
mferg_1
to
mferg_11
How can I address each data frame in this loop? As mentioned, I want to extract certain elements from each data frame to a matrix. I would imagine it something like this:
assign(paste0("mferg_matrix_",i),mferg_i[1:5,1:10])
But this obviously does not work because R does not recognize mferg_i in the loop. How can I address this data frame?
You probably shouldn't be using assign for this in the first place. Working with a bunch of separate data.frames in R is a mess, but working with a list of data.frames is much easier. Try reading your data with
group_1<-list.files()
mferg <- lapply(group_1, function(filename) {
  read.csv(filename,header=FALSE,sep=";",quote="",dec=",",col.names=1:90)
})
and you can get each value with mferg[[1]], mferg[[2]], etc. And then you can create a list of extractions with
mferg_matrix <- lapply(mferg, function(x) x[1:5, 1:10])
This is the more R-like way to do things.
But technically you can use get to retrieve values like you use assign to create them. For example
assign(paste0("mferg_matrix_",i),get(paste0("mferg_",i))[1:5,1:10])
but again, this is probably not a smart strategy in the long run.
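For the last step in the question, averaging the extracted matrices, one option with the list approach is to sum them with Reduce and divide by their number. This is only a sketch: the small data frames below are made up in place of the csv files, the extraction is shrunk to 2 x 2, and the approach assumes all extractions have the same dimensions and contain no missing values.

```r
# Made-up stand-in for the list of data frames read from the csv files.
mferg <- list(
  data.frame(V1 = c(1, 2), V2 = c(3, 4)),
  data.frame(V1 = c(3, 4), V2 = c(5, 6)),
  data.frame(V1 = c(5, 6), V2 = c(7, 8))
)

# Extract the same block from every data frame and convert it to a matrix.
mferg_matrix <- lapply(mferg, function(x) as.matrix(x[1:2, 1:2]))

# Element-wise mean: add all matrices together, then divide by their count.
mferg_mean <- Reduce(`+`, mferg_matrix) / length(mferg_matrix)
mferg_mean
```

Because everything lives in lists, no numbered variables such as mferg_1 or mferg_matrix_1 are needed.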
