nested for loops in R to parse csv files? - r

Edit: I've corrected the typo in the code (copy and paste error). I can't add an example of the csv files, as it's too complex to model in a simple example (I tried...).
I've spent hours looking through similarly titled questions to solve a for loop problem in R, and have tried a lot of different approaches, but I'm having no luck.
I have many different csv files, each of which has a set of 10 separate strings (variables) identifying a specific row (e.g., names = c("Delta values", "Scream factor", "nightmare mode")). Two rows below each such string, I need the max value of that row of data. I can write a loop that scans the csv files for a single one of these strings using the following:
Test files:
test1.csv, test2.csv, test3.csv, test4.csv
names<-list.files(pattern=".csv")
DF <- NULL
for (i in names){
  dat <- read.csv(i, header=FALSE, stringsAsFactors=FALSE)
  index <- which(dat=="Delta values", arr.ind=TRUE)
  row=as.numeric(rownames(dat)[index[1]])
  aver=dat[row+2,]
  p=max(na.omit(as.numeric(aver)))
  DF=rbind(DF, p)
  colnames(DF)=dat[index]
}
However, my problem comes in trying to generalize it. I want a data frame with one row per file, labeled with the name of the file the values came from (not "p"), and one column per variable. That means looping over the files and, for each file, retrieving each of the remaining variables and adding it to the same data frame.
I'm pretty sure I need a nested loop over the strings I want to look up, computing "p" for each one, but I can't find any good examples of how to loop this way and append the new variables to the growing data frame while keeping each row matched to its file.
please help!
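One way the nested loop described above could look is sketched below. It is only a sketch under assumptions: the toy csv files written to a temporary folder, the helper name max_below_label, and the two labels are all made up for illustration, and the real files may need different handling. The outer loop walks the files, the inner loop walks the strings, and the result data frame is created up front with one row per file so nothing has to be appended with rbind.

```r
# Hypothetical helper: find the first cell equal to `label`, then return the
# max of the numeric values in the row `offset` rows below it (NA if absent).
max_below_label <- function(dat, label, offset = 2) {
  hit <- which(dat == label, arr.ind = TRUE)
  if (nrow(hit) == 0) return(NA_real_)
  vals <- suppressWarnings(as.numeric(unlist(dat[hit[1, "row"] + offset, ])))
  vals <- vals[!is.na(vals)]
  if (length(vals) == 0) NA_real_ else max(vals)
}

# Toy input files (assumed layout: label row, a header-like row, then data).
tmp <- tempfile("csvdemo")
dir.create(tmp)
writeLines(c("Delta values,,", "x,y,z", "1,5,3",
             "Scream factor,,", "x,y,z", "7,2,4"),
           file.path(tmp, "test1.csv"))
writeLines(c("Delta values,,", "x,y,z", "10,2,8",
             "Scream factor,,", "x,y,z", "0.5,1.5,1"),
           file.path(tmp, "test2.csv"))

files  <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
labels <- c("Delta values", "Scream factor")

# One row per file, one (initially empty) column per label.
result <- data.frame(file = basename(files), stringsAsFactors = FALSE)
for (lab in labels) {
  result[[lab]] <- NA_real_
}

# Outer loop over files, inner loop over the strings to look up.
for (i in seq_along(files)) {
  dat <- read.csv(files[i], header = FALSE, stringsAsFactors = FALSE)
  for (lab in labels) {
    result[i, lab] <- max_below_label(dat, lab)
  }
}
result
```

Creating the data frame before the loops keeps each row tied to its file by position, which avoids the row-name problem that comes from growing DF with rbind.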

Related

How to merge a set of lists into a single data frame

I am new to R and coding in general, so please bear with me.
I have a spreadsheet that has 7 sheets. Six of these sheets are formatted in the same way, and I am skipping the one that is not.
The code I have is thus:
lst <- lapply(2:7,
function(i) read_excel("CONFIDENTIAL Ratio 062018.xlsx", sheet = i)
)
This code was taken from this post: How to import multiple xlsx sheets in R
So far so good: the code works and I have a large list with 6 sub-lists that appears to represent all of my data.
It is at this point that I get stuck. Being so new, I do not understand lists yet, and I really need the lists to be merged into one single data frame that looks and feels like the source data (so columns and rows).
I cannot work out how to get from a list to a single data frame. I've tried using rbind and other suggestions from here, but they all seem to either fail or only partially work, and I end up with a data frame that looks like a list, etc.
If each sheet has the same number of columns (ncol) and the same names (colnames), then this will work. It needs the dplyr package.
library(dplyr)
my_dataframe <- bind_rows(lst)
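For comparison, here is a base R sketch of the same stacking step that does not need dplyr. The two small data frames and the sheet names are invented stand-ins for the imported sheets, and it assumes every list element is a data frame with identical column names.

```r
# Invented stand-ins for two imported sheets with identical columns.
lst <- list(
  data.frame(region = c("A", "B"), ratio = c(0.5, 0.7)),
  data.frame(region = c("C", "D"), ratio = c(0.2, 0.9))
)
names(lst) <- c("Sheet2", "Sheet3")  # hypothetical sheet labels

# Stack all list elements row-wise into a single data frame.
combined <- do.call(rbind, lst)
rownames(combined) <- NULL

# Optionally record which sheet each row came from.
combined$sheet <- rep(names(lst), vapply(lst, nrow, integer(1)))
combined
```

If the sheets do not share exactly the same columns, rbind will stop with an error, which is one reason bind_rows can be more convenient.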

Change a date column in multiple data frames with one function

I know there are several questions regarding the "apply one function to multiple data frames" issue. However, I couldn't find a solution to my problem, although I think I got close to it using a solution from this question:
Same function over multiple data frames in R
I have 12 data frames with 4 columns each. The second column contains the date as an integer (e.g. 20161014, so %Y%m%d).
To get it into 2016-10-14 I used
TX_SOUID100758.txt[,2]<-as.Date(as.character(TX_SOUID100758.txt[,2]), "%Y%m%d")
Since I want to apply this function on all 15 data frames I tried
zch_filelist <- list.files(path=path, pattern="*.txt")
for (file in zch_filelist){
assign(file, read.csv(paste(path, file, sep=''),na.strings = -9999))
}
lapply(zch_filelist, function(x) (as.Date(as.character(x[2]), "%Y%m%d")))
I used the previously created list of file names when I imported the files into R.
However, it is not working. I guess the mistake is in the indexing inside the as.Date call.
Any help is greatly appreciated.
Thanks!
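A likely reason the lapply call fails is that zch_filelist holds the file names (character strings), not the data frames, so x[2] indexes a string rather than a column. A minimal sketch of one way around this is to keep the data frames themselves in a list and convert the second column inside lapply. The two toy data frames and their column names below are invented for illustration; with real files, the list would come from something like lapply over the file names with read.csv.

```r
# Invented stand-ins for the imported files; column 2 holds dates as integers.
frames <- list(
  a = data.frame(id = 1:2, date = c(20161014L, 20161015L), value = c(3.1, 2.7)),
  b = data.frame(id = 1:2, date = c(20170101L, 20170102L), value = c(0.4, 1.9))
)

# Convert column 2 in every data frame and return the modified data frame.
frames <- lapply(frames, function(df) {
  df[[2]] <- as.Date(as.character(df[[2]]), format = "%Y%m%d")
  df
})

str(frames$a)
```

Note that the function has to return df; returning only the converted column would throw away the other three columns.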

r create and address variable in for loop

I have multiple csv-files in one folder. I want to load each csv-file in this folder into one separate data frame. Next, I want to extract certain elements from this data frame into a matrix and calculate the mean of all these matrices.
setwd("D:\\data")
group_1<-list.files()
a<-length(group_1)
mferg_mean<-data.frame
for(i in 1:a)
{
assign(paste0("mferg_",i),read.csv(group_1[i],header=FALSE,sep=";",quote="",dec=",",col.names=1:90))
}
As there are 11 csv-files in the folder I now have the data frames
mferg_1
to
mferg_11
How can I address each data frame in this loop? As mentioned, I want to extract certain elements from each data frame to a matrix. I would imagine it something like this:
assign(paste0("mferg_matrix_",i),mferg_i[1:5,1:10])
But this obviously does not work because R does not recognize mferg_i in the loop. How can I address this data frame?
You probably shouldn't be using assign for this in the first place. Working with a bunch of different data.frames in R is a mess, but working with a list of data.frames is much easier. Try reading your data with
group_1<-list.files()
mferg <- lapply(group_1, function(filename) {
  read.csv(filename,header=FALSE,sep=";",quote="",dec=",",col.names=1:90)
})
and you get each value with mferg[[1]], mferg[[2]], etc. And then you can create a list of extractions with
mferg_matrix <- lapply(mferg, function(x) x[1:5, 1:10])
This is the more R-like way to do things.
But technically you can use get to retrieve values like you use assign to create them. For example
assign(paste0("mferg_matrix_",i),get(paste0("mferg_",i))[1:5,1:10])
but again, this is probably not a smart strategy in the long run.
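To finish the task from the question (the mean of all the extracted matrices), one possible sketch is to add the matrices element-wise with Reduce and divide by how many there are. The two toy data frames and the smaller 1:2, 1:3 extraction are assumptions for illustration only, and the approach assumes every extraction is numeric and has the same dimensions.

```r
# Toy stand-ins for the data frames read from the csv files.
mferg <- list(
  as.data.frame(matrix(1:20, nrow = 4)),
  as.data.frame(matrix(21:40, nrow = 4))
)

# Extract the same block from each data frame as a numeric matrix.
mferg_matrix <- lapply(mferg, function(x) as.matrix(x[1:2, 1:3]))

# Element-wise sum of all matrices, divided by the number of matrices.
mferg_mean <- Reduce(`+`, mferg_matrix) / length(mferg_matrix)
mferg_mean
```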

Changing hundreds of column names simultaneously in R

I have a data frame with hundreds of columns whose names I want to change. I'm very new to R, so it's rather easy to think through the logic of this, but I simply can't find a relevant example online.
The closest I could sort of get was this:
projectFileAllCombinedNames <- for (i in 1:200){names(projectFileAllCombined)[i+1] <-variableNames[i]}
Basically, starting at the second column of projectFileAllCombined, I want to loop through the columns in the dataframe and assign them the data values in the second data frame. I was able to change one column name manually with this code:
colnames(projectFileAllCombined)[2]<-"newColumnName"
but I can't possibly do that for hundreds of columns. I've spent multiple hours on this and can't crack it with any number of Google searches on "change multiple columns in r" or "change column names in r". The best I can find online is examples where people change a few columns with a c() function and I get how that works, but that still seems to require typing out all the column names as parameters to the function, unless there is a way to just pass the "variableNames" file into that c() function, but I don't know of one.
Will
colnames(projectFileAllCombined)[-1] <- variableNames
not suffice?
This assumes the ordering of columns in projectFileAllCombined is the same as the ordering of the new variable names in variableNames, and that
length(variableNames) == (ncol(projectFileAllCombined) - 1)
The key point here is that the replacement function 'colnames<-'() is vectorised and can replace any number of column names in a single call if passed a vector of replacement values.
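A small worked example of the vectorised replacement, using an invented four-column data frame and an invented character vector of new names (both are assumptions; the real variableNames may first need to be pulled out of its data frame as a character vector):

```r
# Invented stand-ins for the real objects.
projectFileAllCombined <- data.frame(id = 1:3, V1 = 4:6, V2 = 7:9, V3 = 10:12)
variableNames <- c("height", "weight", "age")

# Guard: one new name for every column except the first.
stopifnot(length(variableNames) == ncol(projectFileAllCombined) - 1)

# Replace every column name except the first in a single call.
colnames(projectFileAllCombined)[-1] <- variableNames
colnames(projectFileAllCombined)
```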

Executing for loop in R

I am pretty new to R and have a couple of questions about a loop I am attempting to execute. I will try to explain myself as best as possible regarding what I wish the loop to do.
for(i in c(1988:1999,2000:2006)){
yearerrors=NULL
binding=do.call("rbind.fill",x[grep(names(x), pattern ="1988.* 4._ data=")])
cmeans=lapply(binding[,2:ncol(binding)],mean)
datcmeans=as.data.frame(cmeans)
finvec=datcmeans[1,]
kk=0
result=RMSE2(yields[(kk+1):(kk+ncol(binding))],finvec)
kk=kk+ncol(binding)
yearerrors=c(result)
}
yearerrors
First I wish for the loop to iterate over file names of data.
Specifically over the years 1988-2006 in the place where 1988 is
placed right now in the binding statement. x is a list of data files
inputted into R and the 1988 is part of the file name. So, I have
file names starting with 1988,1989,...,2006.
yields is a numeric vector and I would like to input the indices of
the vector into the function RMSE2 as indicated in the loop. For
example, over the first iteration I wish for the indices 1 to the
number of columns in binding to be used. Then for the next iteration
I want the first index to be 1 more than what the previous iteration
ended with and continue to a number equal to the number of columns in the next binding
statement. I just don't know if what I have written will accomplish
this.
Finally, I wish to store each of these results in the vector
yearerrors and then access this vector afterwards.
Thanks so much in advance!
OK, there's a heck of a lot of guesswork here because the structure of your data is extremely unclear and I have no idea what the RMSE2 function is (you've given no detail). Based on your question the other day, I'm going to assume that your data is in .csv files. I'm going to have a stab at your problem.
I would start by building the combined dataframe while reading the files in, not doing one then the other. Like so:
#Set your working directory to the folder containing the .csv files
#I'm assuming they're all in the form "YEAR.something.csv" based on your pattern matching
library(plyr) #mdply comes from the plyr package
filenames <- list.files(".", pattern="\\.csv$") #if you only want to match a specific year then add it to the pattern match
years <- gsub("([0-9]+).*", "\\1", filenames)
df <- mdply(filenames, read.csv)
df$year <- as.numeric(years[df$X1]) #Adds the year
#Your column mean dataframe didn't work for me
cmeans <- as.data.frame(t(colMeans(df[,2:ncol(df)])))
It then gets difficult to know what you're trying to achieve. Since your datcmeans is a one-row data.frame, datcmeans[1,] doesn't change anything. So if one row from a data frame (or a numeric vector) is the argument required by your RMSE2 function, you can just pass it datcmeans (cmeans in my example).
Your code from then on is pretty much indecipherable to me. Without knowing what yields looks like, or how RMSE2 works, it's pretty much impossible to help more.
If you're going to do a loop here, I'll say that setting kk=kk+ncol(binding) at the end of each iteration is not going to help you: since kk=0 is set inside the loop, kk is reset on every pass and never carries over, so every iteration uses indices 1 to ncol(binding), which is, I'm guessing, not what you want. Here's my guess at what you need here (assuming looping is required).
yearerrors=vector("numeric", ncol(df)) #Create empty vector ahead of loop
for(i in 1:ncol(df)) {
yearerrors[i] <- RMSE2(yields[i:ncol(df)], finvec)
}
yearerrors
I honestly can't imagine a function that would work like this, but it seems the most logical adaptation of your code.
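For the indexing the question describes (each iteration starting one past where the previous one ended), a sketch of the general pattern is to initialise the offset and the result vector before the loop and update both inside it. Everything below is a stand-in: rmse is a hypothetical replacement for RMSE2, and preds and yields are invented toy data, so this only illustrates the bookkeeping, not the real computation.

```r
# Hypothetical stand-in for RMSE2: root mean squared error of two vectors.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Invented per-year values with different lengths, and a matching yields vector.
preds  <- list("1988" = c(1, 2, 3), "1989" = c(4, 5))
yields <- c(1, 2, 4, 4, 7)

# Preallocate the results and start the offset at 0 BEFORE the loop.
yearerrors <- numeric(length(preds))
names(yearerrors) <- names(preds)
kk <- 0

for (j in seq_along(preds)) {
  n <- length(preds[[j]])
  # Use the next n elements of yields, then advance the offset.
  yearerrors[j] <- rmse(yields[(kk + 1):(kk + n)], preds[[j]])
  kk <- kk + n
}
yearerrors
```

Because kk and yearerrors live outside the loop, the offset carries over between iterations and each year's result lands in its own slot instead of overwriting the previous one.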
