Count the number of data frames beginning with prefix in R - r

I have a collection of data frames that I have generated in R. I need to count the number of data frames whose names begin with "entry_". I'd like to generate a number to then use for a function that rbinds all of these data frames and these data frames only.
So far, I have tried using grep to identify the data frames, however, this just returns where they are indexed in my object list (e.g., 16:19 --- objects 16-19 begin with "entry_"):
count_entry <- (grep("entry_", objects()))
Eventually I would like to rbind all of these data frames like so:
list.make <- function() {
sapply(paste('entry_', seq(1:25), sep=''), get, environment(), simplify = FALSE)
}
all.entries <- list.make()
final.data <- rbind.fill(all.entries)
I don't want to have to enter the sequence manually every time (for example (1:25) in the code above), which is why I'm hoping to be able to automatically count the data frames beginning with "entry_".
If anyone has any ideas of how to solve this, or how to go about this in a better way, I'm all ears!

Per comment by docendo: The ls function will list objects in an environment that match a regex pattern. You can then use mget to retrieve those objects as a list:
mylist <- mget(ls(pattern = "^entry_"))
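That will then work with rbind.fill. You can then remove the original objects using something similar: rm(list = ls(pattern = "^entry_")). Note that rm needs the names passed through its list argument. If you only need the count, length(ls(pattern = "^entry_")) gives it.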
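As a minimal sketch of the approach above, the snippet below counts the matching objects and binds them in one go. The toy data frames, the environment env, and the use of base do.call(rbind, ...) in place of plyr's rbind.fill are assumptions made so the example runs on its own; with columns that differ between frames, rbind.fill would still be the better fit.

```r
# Toy objects standing in for the real entry_* data frames (hypothetical data).
# A separate environment keeps the example from touching the global workspace.
env <- new.env()
assign("entry_1", data.frame(id = 1:2, value = c(10, 20)), envir = env)
assign("entry_2", data.frame(id = 3:4, value = c(30, 40)), envir = env)
assign("other",   data.frame(id = 9L,  value = 99),        envir = env)

# Names of every object whose name starts with "entry_".
entry_names <- ls(envir = env, pattern = "^entry_")

# The count the question asks for.
n_entries <- length(entry_names)

# Fetch the matching objects as a named list and stack them row-wise.
entry_list <- mget(entry_names, envir = env)
final_data <- do.call(rbind, entry_list)
```

In an interactive session the envir arguments can be dropped, in which case ls and mget look in the calling environment.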

Related

R for loop: creating data frames using split?

I have data that I want to separate by date, I have managed to do this manually through:
tsssplit <- split(tss, tss$created_at)
and then creating dataframes for each list which I then use.
t1 <- tsssplit[[1]]
t2 <- tsssplit[[2]]
But I don't know how many splits I will need, as sometimes the original data frame may have 6 dates to split up by, and sometimes it may have 5, etc. So I want to create a for loop.
Within the for loop, I want to incorporate this code, which connects to a function:
bscore3 <- score.sentiment(t3$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
Then I want to be able to create a new data frame that has the scores for each list.
So essentially I want the for loop to:
split the data into lists using split
split each list into a separate data frame for each different day
Come out with a score for each data frame
Put that into a new data frame
It doesn't have to be exactly like this as long as I can come up with a visualisation of the scores at the end.
Thanks!
It is not recommended to create separate dataframes in the global environment because they are difficult to keep track of. Put them in a list instead. You have started off well by using split and creating a list of dataframes. You can then iterate over each dataframe in the list and apply the function to each one of them.
Using by, this would look like:
by(tss, tss$created_at, function(x) {
bscore3 <- score.sentiment(x$cleaned_text,pos.words,neg.words,.progress='text')
score3 <- as.integer(bscore3$score[[1]])
return(score3)
}) -> result
result
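To end up with one data frame of scores rather than a by object, a sketch along these lines may help. score.sentiment and the word lists are not available here, so toy_score is a hypothetical stand-in, and the small tss data frame is invented purely for illustration.

```r
# Invented sample data with the same column names as in the question.
tss <- data.frame(
  created_at   = c("2020-01-01", "2020-01-01", "2020-01-02"),
  cleaned_text = c("good day", "bad day", "good good"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for score.sentiment(): counts texts containing
# "good" minus texts containing "bad".
toy_score <- function(text) {
  as.numeric(sum(grepl("good", text)) - sum(grepl("bad", text)))
}

# One data frame per date, however many dates there are.
pieces <- split(tss, tss$created_at)

# One score per piece, collected into a single data frame.
scores <- data.frame(
  created_at = names(pieces),
  score      = vapply(pieces, function(d) toy_score(d$cleaned_text), numeric(1)),
  row.names  = NULL
)
```

The scores data frame can then be passed to a plotting function for the visualisation.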

How to bind several data frames obtained from web scraping, using a for loop?

So I have vector that is basically a list of species such as:
list_species<-c("Pomphorhynchus laevis","Profilicollis altmani","Leptorhynchoides thecatus","Mayarhynchus karlae","Oligacanthorhynchus tortuosa","Pseudoacanthocephalus toshimai","Corynosoma australe")
And I have this function, which mines data on several specimens for each of those species:
library(bold)
df<-bold_seqspec(name_of_species, format = "tsv")
I want to use the bold_seqspec function to create one data frame for each of the elements in list_species, so far I tried like this:
for (name_of_species in list_species){
df<-bold_seqspec(name_of_species, format = "tsv")
joined_dfs<-rbind(df)
}
What I wanted was a single data frame that combines all the data frames downloaded for each species name in list_species.
But what I'm getting is a data frame with one observation only, so something must be wrong in the code.
Since you want to apply this for multiple species, you need to loop over them.
You can use purrr's map functions.
joined_dfs <- purrr::map_df(list_species, bold::bold_seqspec)
Try
do.call(rbind, lapply(list_species, bold_seqspec, format = "tsv"))
Explanation: lapply(list_species, bold_seqspec, format = "tsv") loops through list_species and applies bold_seqspec to every element with argument format = "tsv". The return object is a list of bold_seqspec return objects; assuming they are data.frames you can then row-bind them with do.call(rbind, ...), producing a single data.frame.
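The loop in the question overwrites joined_dfs on every pass because rbind(df) has nothing to bind df to. If a for loop is preferred, one possible sketch is to collect each result in a list and bind once at the end. fetch_species below is a hypothetical stand-in for bold_seqspec so that the example runs without network access.

```r
list_species <- c("Pomphorhynchus laevis",
                  "Profilicollis altmani",
                  "Corynosoma australe")

# Hypothetical stand-in for bold_seqspec(): returns a one-row data frame.
fetch_species <- function(name) {
  data.frame(species = name, name_length = nchar(name),
             stringsAsFactors = FALSE)
}

# Pre-allocate a list with one slot per species.
results <- vector("list", length(list_species))

for (i in seq_along(list_species)) {
  # Store each download in its own slot instead of overwriting one variable.
  results[[i]] <- fetch_species(list_species[i])
}

# Bind all collected data frames in a single step.
joined_dfs <- do.call(rbind, results)
```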

Altering dataframes stored within a list

I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified. Try creating a custom function and apply over all dataframes.
cleanDF <- function(mydf) {
# Stop if any of the required columns is missing
if( !all(c('AlterPair_B', 'Alter.1.Name', 'Alter.2.Name') %in%
names(mydf))) stop("Check data frame names")
condition <- mydf[, 'AlterPair_B'] >= 4
mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are there. Then it defines the condition of 4 or more 'AlterPair_B'. Lastly, subset the two target columns by that condition. I used a list called 'big_list' that represents all of the data frames.
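Since the question asks for a new list of cleaned data frames rather than one combined frame, here is a small sketch of that variant. The function name keep_strong_ties, the min_tie argument, and the toy respondents are assumptions for illustration; only the three column names come from the question.

```r
# Keep rows whose tie value is at least min_tie and return only the two
# name columns. Stops if a required column is missing.
keep_strong_ties <- function(df, min_tie = 4) {
  needed <- c("Alter.1.Name", "Alter.2.Name", "AlterPair_B")
  stopifnot(all(needed %in% names(df)))
  df[df$AlterPair_B >= min_tie, c("Alter.1.Name", "Alter.2.Name")]
}

# Toy stand-ins for two respondents' raw edgelists.
raw_list <- list(
  r1 = data.frame(Alter.1.Name = c("Ann", "Ann", "Bob"),
                  Alter.2.Name = c("Bob", "Cy", "Cy"),
                  AlterPair_B  = c(4, 2, 5)),
  r2 = data.frame(Alter.1.Name = "Di",
                  Alter.2.Name = "Ed",
                  AlterPair_B  = 1)
)

# lapply returns a list of the same length, one cleaned frame per respondent.
clean_list <- lapply(raw_list, keep_strong_ties)
```

Keeping the results as a list preserves the per-respondent structure; wrapping the call in do.call(rbind, ...) would merge them instead.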
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <-"C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filenames in list_of_csv_files) {
input <- read.csv(paste0(path,filenames), header=TRUE, stringsAsFactors=FALSE)
#Do your calculations
input_with_calculations <- input
result <- rbind(result,input_with_calculations)
}
result
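For comparison, the same read-and-combine step can be sketched without growing result inside the loop. The temporary folder and the two small CSV files are created here only so the example runs on its own; in practice csv_dir would point at the real folder.

```r
# Create a throwaway folder with two small CSV files (illustrative data).
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(a = 1:2, b = c("x", "y")),
          file.path(csv_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(a = 3L, b = "z"),
          file.path(csv_dir, "two.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths, so no paste0() is needed;
# the pattern restricts the listing to files ending in .csv.
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, then bind the list once.
frames <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
result <- do.call(rbind, frames)
```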

R : how to append data frames in a list?

I am trying to produce data frames using a for loop.
How can I append these data frames to a list and then check whether any frame is empty?
I would like to remove the empty data frames (those with no rows) from the list.
Any help is appreciated.
You should use lapply here without using a for loop. The advantages are:
You want to create a list of data.frames, and lapply creates a list.
You do the job once, with no need for two loops.
Something like:
lapply(seq_len(nbr_df),function(x)
{
## code to create your data.frame dt
## dt = data.frame(...)
if(nrow(dt)>0) dt
})
second option: data.frames already created in separate variables:
We assume that your variable have a certain pattern, say patt:
lapply(mget(ls(pattern=patt)),function(x)if(nrow(x)>0)x)
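One thing to be aware of: an if without an else returns NULL when the condition is false, so the lapply calls above leave a NULL entry wherever a data frame was empty. A possible way to drop empty frames outright is base R's Filter, sketched here on a made-up list named candidates.

```r
# Made-up list containing one empty data frame.
candidates <- list(
  a = data.frame(x = 1:3),
  b = data.frame(x = integer(0)),   # zero rows
  c = data.frame(x = 7L, y = "k")
)

# Row count of each element, useful for checking which frames are empty.
row_counts <- vapply(candidates, nrow, integer(1))

# Keep only the data frames that have at least one row.
non_empty <- Filter(function(d) nrow(d) > 0, candidates)
```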
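To append to a list you can do:
Your_list <- list()
for (i in seq_len(numbOfPosibleDF)) {
k <- data.frame() # replace with the code that builds the i-th data frame
if (nrow(k) != 0) {
# use [[ ]] and a quoted prefix so the whole data frame is stored under one name
Your_list[[paste0("df", i)]] <- k
}
}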
I would just add valid data frames to the list instead of removing them afterwards. If you want or need to use a for-loop (instead of lapply function), you may use following:
# init list
list.of.df <- list()
# start your loop to
# create data frame etc.
# ....
df <- data.frame(1,2)
# add to list
if (!is.null(df) && nrow(df)>0) list.of.df[[length(list.of.df)+1]] <- df
# ... end of loop here.
For the benefit of anyone finding this otherwise dead-end page by its title, the way to concatenate consistently formatted data.frames that are items of a list is with plyr:
rbind.fill.matrix(lst)
I would like to give a better picture of the scenario :
the frames may or may not have same number of columns/rows.
the data frames are dynamically produced using a for loop.
the frames have all data types: numeric , factor, character.

Storing multiple data frames into one data structure - R

Is it possible to store multiple data frames in one data structure and later process them one data frame at a time? For example:
df1 <- data.frame(c(1,2,3), c(4,5,6))
df2 <- data.frame(c(11,22,33), c(44,55,66))
.. then I would like to have them added in a data structure, such that I can loop through that data structure retrieving each data frame one at a time and process it, something like
for ( iterate through the data structure) # this gives df1, then df2
{
write data frame to a file
}
I cannot find any such data structure in R. Can anyone point me to any code that illustrates the same functionality?
Just put the data.frames in a list. A plus is that a list works really well with apply style loops. For example, if you want to save the data.frame's, you can use mapply:
l = list(df1, df2)
mapply(write.table, x = l, file = c("df1.txt", "df2.txt"))
If you like apply style loops (and you will, trust me :)) please take a look at the epic plyr package. It might not be the fastest package (look at data.table for speed), but it drips with syntactic sugar.
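If an explicit loop is closer to what the question describes, a sketch like the following walks a named list and writes each data frame to its own file. Writing to tempdir() and the file naming scheme are assumptions for the sake of a runnable example.

```r
df1 <- data.frame(c(1, 2, 3), c(4, 5, 6))
df2 <- data.frame(c(11, 22, 33), c(44, 55, 66))

# A named list lets the loop reuse the names for the output files.
frames <- list(df1 = df1, df2 = df2)

# Assumption: write into the session's temporary directory.
out_dir <- tempdir()

for (nm in names(frames)) {
  # frames[[nm]] yields df1 on the first pass, then df2.
  out_file <- file.path(out_dir, paste0(nm, ".csv"))
  write.csv(frames[[nm]], out_file, row.names = FALSE)
}
```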
Lists can be used to hold almost anything, including data.frames:
## Versatility of lists
l <- list(file(), new.env(), data.frame(a=1:4))
For writing out multiple data objects stored in a list, lapply() is your friend:
ll <- list(df1=df1, df2=df2)
## Write out as *.csv files
lapply(names(ll), function(X) write.csv(ll[[X]], file=paste0(X, ".csv")))
## Save in *.Rdata files
lapply(names(ll), function(X) {
assign(X, ll[[X]])
save(list=X, file=paste0(X, ".Rdata"))
})
What you are looking for is a list.
You can use a function like lapply to treat each of your data frames in the same manner separately. However, there might be cases where you need to pass your list of data frames to a function that handles the data frames in relation to each other. In this case lapply doesn't help you.
That's why it is important to note how you can access and iterate the data frames in your list. It's done like this:
mylist[[data frame]][row,column]
Note the double brackets around your data frame index.
So for your example it would be
df1 <- data.frame(c(1,2,3), c(4,5,6))
df2 <- data.frame(c(11,22,33), c(44,55,66))
mylist<-list(df1,df2)
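mylist[[1]][1,2] would return 4, whereas mylist[1][1,2] fails with an "incorrect number of dimensions" error, because single brackets return a list of length one rather than the data frame itself. It took a while for me to find this, so I thought it might be helpful to post here.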
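The difference between the two bracket forms can be checked directly. This is a small sketch using the same df1 and df2; the variable names inner, outer, cell, and single_bracket_result are only for illustration.

```r
df1 <- data.frame(c(1, 2, 3), c(4, 5, 6))
df2 <- data.frame(c(11, 22, 33), c(44, 55, 66))
mylist <- list(df1, df2)

# Double brackets extract the element itself: a data frame.
inner <- mylist[[1]]

# Single brackets return a sub-list of length one that contains the data frame.
outer <- mylist[1]

# Row 1, column 2 of the first data frame.
cell <- mylist[[1]][1, 2]

# Matrix-style indexing on the sub-list is an error; capture its message
# instead of letting it stop the script.
single_bracket_result <- tryCatch(
  mylist[1][1, 2],
  error = function(e) conditionMessage(e)
)
```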

Resources