How do I apply the same changes on multiple data frames in R?

I have an object (named subdatlob) containing a list of data frames (four data frames, named 1, 2, 3, and 4). For each data frame, I want to run the following:
tri_I=as.triangle(subdatlob[["I"]],origin="AY",dev="DY",value="paid")
triLoB_I = incr2cum(tri_I)
for I = 1,2,3,4 or more generally, for each data frame I in the given list.
How do I do this? I will also be doing this step for a list containing 1,000,000+ data frames.
This inquiry involves applying a function to every data frame and naming the necessary variables for the computation.

shs's suggestion,
lapply(subdatlob, \(x) incr2cum(as.triangle(x, origin="AY", dev="DY", value="paid")))
worked for me, including for my larger list containing more data frames.
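As a rough illustration of the same pattern without the ChainLadder dependency, the sketch below applies one function to every element of a named list. The toy data, `make_df`, and `to_cumulative` are hypothetical stand-ins: `to_cumulative` only mimics the pivot-then-accumulate idea in base R and is not a replacement for `as.triangle()` and `incr2cum()`.

```r
# Toy stand-in for subdatlob: a named list of long-format data frames.
# make_df() and its values are made up for illustration.
make_df <- function(scale) {
  data.frame(AY   = c(2001, 2001, 2002, 2002),
             DY   = c(1, 2, 1, 2),
             paid = scale * c(10, 5, 20, 8))
}
toy_list <- list(`1` = make_df(1), `2` = make_df(2))

# Base R stand-in for as.triangle() followed by incr2cum():
# pivot to an AY x DY table, then accumulate along each row.
to_cumulative <- function(df) {
  incr <- xtabs(paid ~ AY + DY, data = df)
  t(apply(incr, 1, cumsum))
}

# lapply() returns a list of the same length that keeps the input names,
# so there is no need to create separately named variables.
cum_list <- lapply(toy_list, to_cumulative)
names(cum_list)
cum_list[["2"]]
```

Because the result is itself a named list, each cumulative table can be retrieved with `cum_list[["1"]]`, `cum_list[["2"]]`, and so on, however many data frames the list holds.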

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do is each iteration, generate a new data frame with the name "NamesList[name]" and I want that data frame to contain a subset of the main data frame where NameProper corresponds to the name in the list for that iteration.
This seems like it should be simple, but I just can't figure out how to get R to dynamically generate data frames with different names for each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but incorrect in the sense that it is widely deprecated by experienced users of R. Instead what should be done is to create a single list with named elements each of which contains the data from a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
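A minimal sketch of the list-based approach on made-up data may help. `DigiNONA_toy` and its values are hypothetical, and only two of the question's columns are included. `split()` names the list elements after the factor levels, so per-respondent statistics can be computed in one pass.

```r
# Hypothetical miniature of DigiNONA with two of the original columns.
DigiNONA_toy <- data.frame(
  NameProper = factor(c("Ana", "Ben", "Ana", "Cy")),
  AScore     = c(3, 5, 4, 2)
)

# One list element per respondent, named after the factor levels.
by_person <- split(DigiNONA_toy, DigiNONA_toy$NameProper)
names(by_person)

# A per-respondent statistic, computed over the whole list at once.
mean_scores <- sapply(by_person, function(d) mean(d$AScore))
mean_scores
```

Individual subsets remain available by name, for example `by_person[["Ana"]]`.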

Merging data frames into another dataframe

I'm working with R statistics. I'm trying to make a data frame that merges other three data frames. Those three data frames have different column names & different row numbers (they don't have row names).
I tried originally to do:
Namenewdf <- data.frame(dataframe1, dataframe2, dataframe3)
R marked an error because of differing number of rows.
Then I tried with the merge function but it also didn't work.
How do I merge the data frames so that the resulting data frames include the original information of the data frames used as arguments, not filling the 'void' rows from the data frames that have fewer rows?
library(rowr)
finaldataframe<-cbind.fill(dataframe1,dataframe2, dataframe3,fill = NA)
finaldataframe[is.na(finaldataframe)]<-""
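If installing an extra package is not an option, a similar result can be sketched in base R by padding each data frame with NA rows before binding the columns. `pad_rows` and the toy data frames below are hypothetical names introduced for this example. It is a sketch of the idea, not a drop-in equivalent of `cbind.fill`.

```r
# Hypothetical inputs with different column names and row counts.
d1 <- data.frame(x = 1:3)
d2 <- data.frame(y = "a", stringsAsFactors = FALSE)
d3 <- data.frame(z = c(1.5, 2.5), w = c(TRUE, FALSE))

# Indexing a data frame past its last row yields rows filled with NA,
# which pads every input to the same height.
pad_rows <- function(df, n) {
  out <- df[seq_len(n), , drop = FALSE]
  rownames(out) <- NULL
  out
}

dfs <- list(d1, d2, d3)
n_max <- max(sapply(dfs, nrow))
combined <- do.call(cbind, lapply(dfs, pad_rows, n = n_max))
combined
```

The shorter data frames end up with NA in their missing rows, which can then be replaced as in the lines above if blank strings are preferred.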

How do I loop through multiple Data Frames in r to create a vector?

This is the code I am currently using to move data from multiple data frames into a time-ordered vector which I then perform analysis on and graph:
TotalLoans <- c(
sum(as.numeric(HCD2001$loans_all)), sum(as.numeric(HCD2002$loans_all)),
sum(as.numeric(HCD2003$loans_all)), sum(as.numeric(HCD2004$loans_all)),
sum(as.numeric(HCD2005$loans_all)), sum(as.numeric(HCD2006$loans_all)),
sum(as.numeric(HCD2007$loans_all)), sum(as.numeric(HCD2008$loans_all)),
sum(as.numeric(HCD2009$loans_all)), sum(as.numeric(HCD2010$loans_all)),
sum(as.numeric(HCD2011$loans_all)), sum(as.numeric(HCD2012$loans_all)),
sum(as.numeric(HCD2013$loans_all)), sum(as.numeric(HCD2014$loans_all)),
sum(as.numeric(HCD2015$loans_all)), sum(as.numeric(HCD2016$loans_all))
)
I do this four more times with similar data frames that also are similarly formatted as:
Varname$year
Is there a way to loop through these 16 data frames, select an individual column, perform a function on it, and put it into a vector? This is what I have tried so far:
AllList <- list(HCD2001, HCD2002, HCD2003, HCD2004, HCD2005, HCD2006, HCD2007, HCD2008, HCD2009, HCD2010, HCD2011, HCD2012, HCD2013, HCD2014, HCD2015, HCD2016)
TotalLoans <- lapply(AllList,
function(df){
sum(as.numeric(df$loans_all))
return(df)
}
)
However, it returns a Large List with every column from the data frames. All the other posts related to this were for modifying data frames, not creating a new vector with modified values of the data frames.
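The attempt above returns whole data frames because the function ends with `return(df)`, which discards the computed sum. A minimal sketch of one way around this, on made-up data, is to let the sum be the function's value and use `sapply()` so the result simplifies to a vector. `HCD_toy` and its numbers are hypothetical.

```r
# Hypothetical stand-in for the HCD2001 ... HCD2016 data frames.
HCD_toy <- list(
  HCD2001 = data.frame(loans_all = c("10", "20"), stringsAsFactors = FALSE),
  HCD2002 = data.frame(loans_all = c("5", "7", "8"), stringsAsFactors = FALSE)
)

# The last expression is the return value, so each call yields one number;
# sapply() then simplifies the results into a named numeric vector.
TotalLoans <- sapply(HCD_toy, function(df) sum(as.numeric(df$loans_all)))
TotalLoans
```

With the real objects already in the workspace, the list could presumably be built with `mget(paste0("HCD", 2001:2016))` instead of typing out all sixteen names.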

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and apply over all dataframes.
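cleanDF <- function(mydf) {
  # Stop unless every required column is present
  needed <- c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name')
  if (!all(needed %in% names(mydf))) stop("Check data frame names")
  # Keep rows with a tie value of 4 or more; which() also drops NA values
  condition <- which(mydf[, 'AlterPair_B'] >= 4)
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}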
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it defines the condition that 'AlterPair_B' is 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' that represents all of the data frames.
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the file names into a character vector. You initialize an empty result object with NULL. You then read all your files in a loop, do the calculations, and rbind the results onto the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result
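Both answers stack everything into one data frame, whereas the question asks for a new list of cleaned data frames. The sketch below shows one way that could look, keeping one cleaned element per file. The temporary folder, file names, and values are made up so the example runs on its own. The column names are taken from the question.

```r
# Write two small hypothetical edgelist files to a temporary folder.
csv_dir <- file.path(tempdir(), "alterpair_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(Alter.1.Name = c("A", "B"),
                     Alter.2.Name = c("B", "C"),
                     AlterPair_B  = c(4, 2)),
          file.path(csv_dir, "r1.csv"), row.names = FALSE)
write.csv(data.frame(Alter.1.Name = c("D", "E", "F"),
                     Alter.2.Name = c("E", "F", "D"),
                     AlterPair_B  = c(4, 4, 1)),
          file.path(csv_dir, "r2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be passed straight to read.csv().
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read and reduce each file, keeping one cleaned data frame per file.
cleaned <- lapply(files, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d[which(d$AlterPair_B >= 4), c("Alter.1.Name", "Alter.2.Name")]
})
names(cleaned) <- basename(files)
cleaned[["r2.csv"]]
```

Naming the list after the files keeps each respondent's network identifiable. `do.call(rbind, cleaned)` is still available if a single stacked data frame is wanted later.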

Name list of data frames from data frame

I usually read a bunch of .csv files into a list of data frames and name their columns manually, like this:
#...code for creating the list named "datos" with files from library
# Naming the columns of the data frames
names(datos$v1r1)<-c("estado","tiempo","x1","x2","y1","y2")
names(datos$v1r2)<-c(...)
names(datos$v1r3)<-c(...)
I want to do this renaming operation automatically. To do so, I created a data frame with the names I want for each of the data frames in my datos list.
Here is how I generate this data frame:
pru<-rbind(c("UT","TR","UT+","TR+"),
c("UT","TR","UT+","TR+"),
c("TR","UT","TR+","UT+"),
c("TR","UT","TR+","UT+"))
vec<-paste("v1r",seq(1,20,1),sep="")
tor<-paste("v1s",seq(1,20,1),sep="")
nombres<-do.call("rbind", replicate(10, pru, simplify = FALSE))
nombres_df<-data.frame(corrida=c(vec,tor),nombres)
Because nombres_df$corrida[1] is v1r1, I have to name the datos$v1r1 columns ("estado","tiempo", nombres_df[1,2:5]), and so on for the other 40 elements.
I want to do this renaming automatically. I was thinking I could use something that uses regular expressions.
Just for the record, I don't know why but the order of the list of data frames is not the same as the 1:20 sequence (by this I mean 10 comes before 2,3,4...)
Here's a toy example of a list with a similar structure but fewer and shorter data frames.
toy<-list(a=replicate(6,1:5),b=replicate(6,10:14))
You have a data frame where the variable corrida is the name of the data frame to be renamed and the remaining columns are the desired variable names for that data frame. You could use a loop to do all the renaming operations:
for (i in seq_len(nrow(nombres_df))) {
  nm <- as.character(nombres_df$corrida[i])  # look up the data frame by name
  names(datos[[nm]]) <- c("estado", "tiempo", as.character(unlist(nombres_df[i, -1])))
}
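Because the lookup is by name rather than by position, the order of the list (for example, 10 sorting before 2) should not matter. The sketch below checks this on toy data whose order deliberately differs between the list and the names table. `datos_toy`, `nombres_toy`, and the column labels `n1` to `n4` are hypothetical.

```r
# Toy list whose order differs from the order of the names table.
datos_toy <- list(
  v1r10 = as.data.frame(replicate(6, 1:5)),
  v1r2  = as.data.frame(replicate(6, 10:14))
)
nombres_toy <- data.frame(
  corrida = c("v1r2", "v1r10"),
  n1 = c("UT", "TR"), n2 = c("TR", "UT"),
  n3 = c("UT+", "TR+"), n4 = c("TR+", "UT+"),
  stringsAsFactors = FALSE
)

# For each data frame, find its row in the table by name, then rename.
for (nm in names(datos_toy)) {
  fila <- match(nm, nombres_toy$corrida)
  names(datos_toy[[nm]]) <- c("estado", "tiempo",
                              as.character(unlist(nombres_toy[fila, -1])))
}
names(datos_toy[["v1r10"]])
```

Looping over `names(datos_toy)` instead of the rows of the table also skips any table entries that have no matching data frame.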
