Custom R function returning weird output - r

So I'm trying to create a list of lists of data frames, basically for the purposes of passing them to multiple cores via mclapply. But that's not the part I'm having trouble with. I wrote a function to create a list of smaller data frames from one large data frame, and then applied it sequentially to break a large data frame down into a list of lists of small data frames. The problem is that when the function is called the second time (via lapply to the first list of data frames), it's adding extra small data frames to each list of data frames in the larger list. I have no idea why. I don't think it's the lapply, since when I ran the function manually on one frame from the first list it also did work. Here's the code:
create_frame_list<-function(mydata,mystep,elnames){
datalim<-dim(mydata)[1]
mylist<-list()
init<-1
top<-mystep
i<-1
repeat{
if(top < datalim){
mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:top,]))
}
else {
mylist[[i]]<-assign(paste(elnames,as.character(i),sep=""),data.frame(mydata[init:datalim,]))
}
if(top > datalim){break}
i<-i+1
init<-top+1
top<-top+mystep
}
return(mylist)
}
test_data<-data.frame(replicate(10,sample(0:1,1000,rep=TRUE)))
#Create the first list of data frames, works fine
master_list<-create_frame_list(test_data,300,"bd")
#check the dimensions of the data frames created, they are correct
lapply(master_list,dim)
#create a list of lists of data frames, doesn't work right
list_list<-lapply(master_list,create_frame_list,50,"children")
#check the dimensions of the data frames in the various lists. The function when called again is making extra data frames of length 2 for no reason I can see
lapply(list_list,lapply,dim)
So that's it. Any help is appreciated as always.

Okay, so your code only has one small bug, but there are definitely better ways of doing this. Your code doesn't work when the number of rows is an exact multiple of step. This has to do with the position of your break. Here is a fix:
create_frame_list<-function(mydata,mystep,elnames){
datalim<-dim(mydata)[1]
mylist<-list()
init<-1
top<-mystep
i<-1
repeat{
if(top < datalim)
# mylist[[i]]<-assign(paste0(elnames,as.character(i)),data.frame(mydata[init:top,]))
mylist[[i]]<-mydata[init:top,]
else
mylist[[i]]<-mydata[init:datalim,]
# if(top > datalim) break
i<-i+1
init<-top+1
top<-top+mystep
if(init > datalim) break
}
return(mylist)
}
The main fix was to move the if and make it reliant on init, and not top.
You'll note that I cleaned up your code, and removed the assign statments. One good rule of thumb is: if you think you need to use assign or get, you're doing it wrong. In your case, the assign was completely redundant, and did not assign the names in the way you wanted.
If you're looking for a better way to do this, here is one option:
n<-nrow(test_data)
step<-300
split.var<-rep(1:ceiling(n/step),each=step,length.out=n)
master_list<-split(test_data,split.var)
names(master_list)<-paste0('bd',seq_along(master_list))
# If you didn't care about the order of the rows you could just do
# split(test_data,seq(ceiling(n/step)))
If you want to get fancy, you could do something like:
special.split<-function(data,step)
split(data,rep(1:ceiling(nrow(data)/step),each=step,length.out=nrow(data)))
lapply(special.split(test_data,300),special.split,step=50)
And that would do everything in one step.

Related

Lapply on a list of a list

This is all about a code in R.
I have seperated a big data file "All_data.csv" in smaller data of individuals in a particular year.
So the list looks like this:
All individuals is a list of 25 individuals. If you would then take the first element of that list you get:
Indivudual 1: dataframe_year1, dataframe_year2.
If you take the second element you get for instance:
Individual 2: dataframe_year1, dataframe_year2, dataframe_year3.
etc. so the lists in the lists differ in their length.
Now I want to do a (analysis) function on the dataframes, I do not need to store the output again in the list per se.
I solved it with doing an lapply on the list All_data, with a function defined by myself which also calls lapply again and then my analysis function. But I was wondering if there was another way. Because it seems a bit inefficient to do.
split <- function (All_data)
{
#function that splits files by date and individual
#returns list of individuals and within that list is another list of dataframes. Called All_individuals
}
Make_analysis <- function (All_individuals)
{
Listfiles <- split (All_individuals)
HRE <- lapply (Listfiles, Doall)
}
Analysis <- function (files)
{
...
}
function calls:
lapply (All_data, Make_analysis)
Could anyone help?
Also is this the best way to go if I would want to parallise the analysis with RSlurm to run it on a HPC? Then I could change lapply with slurm map right?
My function in itself works but it seems very inefficient. Would like some tips on how to make code more efficient. Also on how to parallise it with Rslurm.

Renaming Columns with index with a For Loop in R

I am writing this post to ask for some advice for looping code to rename columns by index.
I have a data set that has scale item columns positioned next to each other. Unfortunately, they are oddly named.
I want to re-name each column in this format: SimRac1, SimRac2, SimRac3.... and so on. I know the location of the columns (Columns number 30 to 37). I know these scale items are ordered in such a way that they can be named and numbered in increased order from left to right.
The code I currently have works, but is not efficient. There are other scales, in different locations, that also need to be renamed in a similar fashion. This would result in dozens of code rows.
See below code.
names(Total)[30] <- "SimRac1"
names(Total)[31] <- "SimRac2"
names(Total)[32] <- "SimRac3"
names(Total)[33] <- "SimRac4"
names(Total)[34] <- "SimRac5"
names(Total)[35] <- "SimRac6"
names(Total)[36] <- "SimRac7"
names(Total)[37] <- "SimRac8"
I want to loop this code so that I only have a chunk of code that does the work.
I was thinking perhaps a "for loop" would help.
Hence, the below code
for (i in Total[,30:37]){
names(Total)[i] <- "SimRac(1:8)"
}
This, unfortunately does not work. This chunk of code runs without error, but it doesn't do anything.
Do advice.
In the OP's code, "SimRac(1:8)" is a constant. To have dynamic names, use paste0.
We do not need a loop here. We can use a vectorized function to create the names, then assign the names to a subset of names(Total)
names(Total)[30:37]<-paste0('SimRac', 1:8)

For loop to create multiple empty data frames gives error

I wrote a for loop to create empty multiple data frames, using a vector of names, but even though it seemed really easy at start I got an error message : Error in ID_names[i] <- data.frame() : replacement has length zero
To be more specific I' ll provide you with a reproducable example:
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
for(i in 1:length(ID_names){
ID_names[i] <- data.frame()
}
Do you have any idea why this is wrong? I would like to ask you not only provide a solution, but specify me why this for loop is wrong in order to avoid such kind of mistakes in the future.
You are trying to store a dataframe in one element of a vector (ID_names[i]) which is not possible. You might want to create a list of empty dataframes and assign names to it which can be done using replicate.
ID_names <- c("Athens","Rome","Barcelona","London","Paris","Madrid")
list_data <- setNames(replicate(length(ID_names), data.frame()), ID_names)
However, very rarely such initialisation of empty dataframes will be useful. It ends up creating more confusion down the road. Depending on your actual use case there might be other better ways to handle this.

How do I change column names in list of data frames inside a function?

I know that the answer to "how to change names in a list of data frames" has been answered multiple times. However, I'm stuck trying to generate a function that can take any list as an argument and change all of the column names of all of the data frames in the list. I am working with a large number of .csv files, all of which will have the same 3 column names. I'm importing the files in groups as follows:
# Get a group of drying data data files, remove 1st column
files <- list.files('Mang_Run1', pattern = '*.csv', full = TRUE)
mr1 <- lapply(files, read.csv, skip = 1, header = TRUE, colClasses = c("NULL", NA, NA, NA))
I will have 6 such file groups. If I run the following code on a single list, the names of the columns in each data frame within the specified list will be changed correctly.
for (i in seq_along(mr1)) {
names(mr1[[i]]) <- c('Date_Time', 'Temp_F', 'RH')
}
However, if I try to generalize the function (see code below) to take any list as an argument, it does not work correctly.
nameChange <- function(ls) {
for (i in seq_along(ls)) {
names(ls[[i]]) <- c('Date_Time', 'Temp_F', 'RH')
}
return(ls)
}
When I call nameChange on mr1 (list generated from above), it prints the entire contents of the list to the console and does not change the names of the columns in the data frames within the list. I'm clearly missing something fundamental about the inner workings of R here. I've tried the above function with and without return, and have made several modifications to the code, none of which have proven successful. I'd greatly appreciate any help, and would really like to understand the 'why' behind the problem as well. I've had considerable trouble in the past handling functions that take lists as arguments.
Thanks very much in advance for any constructive input.
I think this might be a very simple fix:
First, generalize the function you are using to rename the columns. This only needs to work on one dataframe at a time.
renameFunction<-function(x,someNames){
names(x) <- someNames
return(x)
}
Now we need to define the names we want to change each column name to.
someNames <- c('Date_Time', 'Temp_F', 'RH')
Then we call the new function and apply it to every element of the "mr1" list.
lapply(mr1, renameFunction, someNames)
I may have gotten some of the details wrong with regards to your exact sitiuation, but I've used this method before to solve similar issues. Since you were able to get it to work on the specific case, I'm pretty sure this will generalize readily using lapply

r create and address variable in for loop

I have multiple csv-files in one folder. I want to load each csv-file in this folder into one separate data frame. Next, I want to extract certain elements from this data frame into a matrix and calculate the mean of all these matrixes.
setwd("D:\\data")
group_1<-list.files()
a<-length(group_1)
mferg_mean<-data.frame
for(i in 1:a)
{
assign(paste0("mferg_",i),read.csv(group_1[i],header=FALSE,sep=";",quote="",dec=",",col.names=1:90))
}
As there are 11 csv-files in the folder I now have the data frames
mferg_1
to
mferg_11
How can I address each data frame in this loop? As mentioned, I want to extract certain elements from each data frame to a matrix. I would imagine it something like this:
assign(paste0("mferg_matrix_",i),mferg_i[1:5,1:10])
But this obviously does not work because R does not recognize mferg_i in the loop. How can I address this data frame?
This is not something you should probably be using assign for in the first place. Working with a bunch of different data.frames in R is a mess, but working with a list of data.frames is much easier. Try reading your data with
group_1<-list.files()
mferg <- lapply(group_1, function(filename) {
read.csv(filename,header=FALSE,sep=";",quote="",dec=",",col.names=1:90))
})
and you get each each value with mferg[[1]], mferg[[1]], etc. And then you can create a list of extractions with
mferg_matrix <- lapply(mferg, function(x) x[1:5, 1:10])
This is the more R-like way to do things.
But technically you can use get to retrieve values like you use assign to create them. For example
assign(paste0("mferg_matrix_",i),get(paste0("mferg_",i))[1:5,1:10])
but again, this is probably not a smart strategy in the long run.

Resources