How to dynamically create and name data frames in a for loop - r

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NamesProper (which isa factor variable)
All I want the loop to do is each iteration, generate a new data frame with the name "NamesList[name]" and I want that data frame to contain a subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple I just can;t figure out how to get r to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.

The advice to use assign for this purpose is technically feasible, but incorrect in the sense that it is widely deprecated by experienced users of R. Instead what should be done is to create a single list with named elements each of which contains the data from a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.

Related

How do I apply the same changes on multiple data frames in R?

I have a file (named subdatlob) containing a list of data frames (4dfs namely 1,2,3, and 4). For each data frame, I want to implement the following
tri_I=as.triangle(subdatlob[["I"]],origin="AY",dev="DY",value="paid")
triLoB_I = incr2cum(tri_I)
for I = 1,2,3,4 or more generally, for each data frame I in the given list.
How do I do this? I will also be doing this step for a list containing 1,000,000+ data frames.
This inquiry involves applying a function to every data frame and naming the necessary variables for the computation.
#shs's suggestion
lapply(subdatlob, \(x) incr2cumc(as.triangle(x,origin="AY",dev="DY",value="paid")))
worked for me and for my larger list containing more data frames.

R for loop: creating data frames using split?

I have data that I want to separate by date, I have managed to do this manually through:
tsssplit <- split(tss, tss$created_at)
and then creating dataframes for each list which I then use.
t1 <- tsssplit[[1]]
t2 <- tsssplit[[2]]
But I don't know how many splits I will need, as sometimes the og data frame may may have 6 dates to split up by, and sometimes it may have 5, etc. So I want to create a for loop.
Within the for loop, I want to incorporate this code, which connects to a function:
bscore3 <- score.sentiment(t3$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
Then I want to be able to create a new data frame that has the scores for each list.
So essentially I want the for loop to:
split the data into lists using split
split each list into a separate data frames for each different day
Come out with a score for each data frame
Put that into a new data frame
It doesn't have to be exactly like this as long as I can come up with a visualisation of the scores at the end.
Thanks!
It is not recommended to create separate dataframes in the global environment, they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating list of dataframes. You can then iterate over each dataframe in the list and apply the function on each one of them.
Using by this would look like as :
by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result
result

Separating data frame based on column values

I am having a bit of trouble with trying to script a code in R so that it separates a data frame based on the character in a data frame column without manually specifying a subset command. Below is the script for reproduction in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially what I am trying is to separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store them in new and different data frames. I have been trying to use the following command
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores them in lists, and I don't know how to extract the lists individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible as I need the script to be dynamic and relative) or does anyone know how to use the last command shown above to separate the lists into individual data frames? Also if possible, is it possible to name the data frame after the model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
Model Receptor.name X(m.) Y(m.) Z(m.) nox PM10 PM2.5
a Model_A R1 358723 171704 1 36.8185 4.02227 1.38895
b Model_A R2 358723 171704 2.6 36.4473 4.01161 1.37479
c Model_A R3 358723 171704 5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
Why a list? Lists allow you the use of lapply - a looping function that applies an operation over every list element. A quick example: Let's say you'd want to get a frequency table for both PM variables in your data for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and apply over all dataframes.
cleanDF <- function(mydf) {
if( all(!c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in%
names(mydf))) stop("Check data frame names")
condition <- mydf[, 'AlterPair_B'] >= 4
mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it defines the condition of 4 or more 'AlterPair_B'. Lastly, subset the two target columns by that condition. I used a list called 'big_list' that represents all of the data frames.
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result

Name list of data frames from data frame

I usually read a bunch of .csv files into a list of data frames and name it manually doing.
#...code for creating the list named "datos" with files from library
# Naming the columns of the data frames
names(datos$v1r1)<-c("estado","tiempo","x1","x2","y1","y2")
names(datos$v1r2)<-c(...)
names(datos$v1r3)<-c(...)
I want to do this renaming operation automatically. To do so, I created a data frame with the names I want for each of the data frames in my datos list.
Here is how I generate this data frame:
pru<-rbind(c("UT","TR","UT+","TR+"),
c("UT","TR","UT+","TR+"),
c("TR","UT","TR+","UT+"),
c("TR","UT","TR+","UT+"))
vec<-paste("v1r",seq(1,20,1),sep="")
tor<-paste("v1s",seq(1,20,1),sep="")
nombres<-do.call("rbind", replicate(10, pru, simplify = FALSE))
nombres_df<-data.frame(corrida=c(vec,tor),nombres)
Because nombres_df$corrida[1] is v1r1, I have to name the datos$v1r1 columns ("estado","tiempo", nombres_df[1,2:5]), and so on for the other 40 elements.
I want to do this renaming automatically. I was thinking I could use something that uses regular expressions.
Just for the record, I don't know why but the order of the list of data frames is not the same as the 1:20 sequence (by this I mean 10 comes before 2,3,4...)
Here's a toy example of a list with a similar structure but fewer and shorter data frames.
toy<-list(a=replicate(6,1:5),b=replicate(6,10:14))
You have a data frame where variable corridas is the name of the data frame to be renamed and the remaining columns are the desired variable names for that data frame. You could use a loop to do all the renaming operations:
for (i in seq_len(nrow(nombres_df))) {
names(datos[[nombres_df$corridas[i]]]) <- c("estado","tiempo",nombres_df[i,2:length(nombres_df)])
}

Resources