I am trying to isolate one point at the same location (same column and row) from 1000 data frames. Each data frame has the same 8 columns with varying numbers of rows (at least one), and for now I only need the point from the first row. These data frames are stored in a list created with lapply. Here is how I did that:
list <- list.files(pattern=".aei")
files <- lapply(list, read.table, ...)
Now, I need to isolate points from each data frame in Row 1 and Column 2. I was able to do this for one data frame with the following code:
a <- data.frame(files[1])[1,2]
However, I can't get this to work for all 1000 files. I've tried several pieces of code, such as:
all <- data.frame(files[1:999])[1,2]
all<- lapply(files data.frame)[1,2]
all<- lapply(files, data.frame[1,2])
and even two different for loops:
for(i in files [[1:999]]) {
list(files[1:999])[1,2]
}
for(i in files [[1:999]]) {
data.frame(files[1:999])[1,2]
}
Are any of these methods on the right track, or are they completely wrong? I've been stuck on this for a while and seem to have hit a complete dead end regarding any other ideas. Please let me know of any suggestions you may have!
We can use an anonymous function (lambda function) to extract the element:
lapply(files, function(x) x[1,2])
read.table already returns a data.frame, so there is no need to wrap it in data.frame().
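If a plain vector of the 1000 extracted values is more convenient than a list, vapply does the same job; this is just a sketch, assuming every file really has at least one row and a numeric second column:
# one value per file; assumes column 2 is numeric in every file
vals <- vapply(files, function(x) x[1, 2], numeric(1))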
I am trying to write some kind of loop function that will allow me to apply the same set of code to dozens of data frames that are stored in one list. Each data frame has the same number of columns and identical headers for each column, though the number of rows varies across data frames.
This data comes from an egocentric social network study where I collected ego-network data in edgelist format from dozens of different respondents. The data collection software that I use stores the data from each interview in its own .csv file. Here is an image of the raw data for a specific data frame (image of raw data).
For my purposes, I only need to use data from the fourth, sixth, and seventh columns. Furthermore, I only need rows of data where the last column has values of 4, at which point the final column can be deleted entirely. The end result is a two-column data frame that represents relationships among pairs of people.
After reading in the data and storing it as an object, I ran the following code:
x100291 = `100291AlterPair.csv` #new object based on raw data
foc.altername = x100291$Alter.1.Name
altername = x100291$Alter.2.Name
tievalue = x100291$AlterPair_B
tie = tievalue
tie[(tie<4)] = NA
egonet.name = data.frame(foc.altername, altername, tievalue)
depleted.name = cbind(tie,egonet.name)
depleted.name = depleted.name[is.na(depleted.name[,1]) == F,]
dep.ego.name = data.frame(depleted.name$foc.altername, depleted.name$altername)
This produced the following data frame (image of final data). This is ultimately what I want.
Now I know that I could cut-and-paste this same set of code 100+ times and manually alter the file names, but I would prefer not to do that. Instead, I have stored all of my raw .csv files as data frames in a single list. I suspect that I can apply the same code across all of the data frames by using one of the apply commands, but I cannot figure it out.
Does anyone have any suggestions for how I might apply this basic code to a list of data frames so that I end up with a new list containing cleaned and reduced versions of the data?
Many thanks!
The logic can be simplified: create a custom function and apply it over all the data frames.
cleanDF <- function(mydf) {
  # Stop unless all of the required columns are present
  if (!all(c("AlterPair_B", "Alter.1.Name", "Alter.2.Name") %in%
           names(mydf))) stop("Check data frame names")
  # Keep rows with tie values of 4 or more, and only the two name columns
  condition <- mydf[, "AlterPair_B"] >= 4
  mydf[condition, c("Alter.1.Name", "Alter.2.Name")]
}
big_list <- lapply(all_my_files, read.csv) #read in all data frames
result <- do.call('rbind', lapply(big_list, cleanDF))
The custom function cleanDF first checks that all the relevant column names are present. Then it builds the condition: 'AlterPair_B' values of 4 or more. Lastly, it subsets the two target columns by that condition. I used a list called 'big_list' to represent all of the data frames.
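Note that the question actually asks for a new list of cleaned data frames rather than one combined table; if that is what you want, keep the lapply result and skip the do.call step:
cleaned_list <- lapply(big_list, cleanDF)  # a list of cleaned data frames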
You haven't provided a reproducible example so it's hard to solve your problem. However, I don't want your questions to remain unanswered. It is true that using lapply would be a fast solution, usually preferable to a loop. However, since you mentioned being a beginner, here's how to do that with a loop, which is easier to understand.
You need to put all your csv files in a single folder with nothing else. Then, you read the filenames and put them in a list. You initialize an empty result object with NULL. You then read all your files in a loop, do calculations and rbind the results in the result object.
path <- "C:/temp/csv/"
list_of_csv_files <- list.files(path)
result <- NULL
for (filename in list_of_csv_files) {
  input <- read.csv(paste0(path, filename), header = TRUE, stringsAsFactors = FALSE)
  # Do your calculations here
  input_with_calculations <- input
  # Append this file's results to the running total
  result <- rbind(result, input_with_calculations)
}
result
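For comparison, the lapply version mentioned above would look roughly like this, under the same assumptions about the folder (a sketch, with the calculation step still left to you):
# read each file, then bind the per-file results together
pieces <- lapply(list_of_csv_files, function(f)
  read.csv(paste0(path, f), header = TRUE, stringsAsFactors = FALSE))
result <- do.call(rbind, pieces)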
I cannot for the life of me figure out the simple error in my for loop. It is meant to perform the same analysis over multiple data frames and, on each iteration, output a new data frame named with the variable used plus some extra string to identify it.
Here is my code:
john and jane are two data frames among many that I am hoping to loop over and compare against bcm to find duplicate rows.
x <- list(john,jane)
for (i in x) {
test <- rbind(bcm,i)
test$dups <- duplicated(test$Full.Name,fromLast=T)
test$dups2 <- duplicated(test$Full.Name)
test <- test[which(test$dups==T | test$dups2==T),]
newname <- paste("dupl",i,sep=".")
assign(newname, test)
}
So far, I can either get the naming to work correctly (but without including the data from x) or get the loop to complete correctly (but without naming the new data frames correctly).
Intended Result: I am hoping to create new data frames dupl.john and dupl.jane to show which rows are duplicated in comparison to bcm.
I understand that lapply() might be better to use and am very open to that form of solution. I could not figure out how to use it to solve my problem, so I turned to the more familiar for loop.
EDIT:
Sorry if I'm not being more clear. I have about 13 data frames in total that I want to run the same analysis over to find the duplicate rows in $Full.Name. I could do the first 4 lines of my loop and then dupl.john <- test 13 times (for each data frame), but I am purposely trying to write a for loop or lapply() to gain more knowledge in R and because I'm sure it is more efficient.
If I understand your intended result correctly, match_df from plyr could be an option.
library(plyr)
dupl.john <- match_df(john, bcm)
dupl.jane <- match_df(jane, bcm)
dupl.john and dupl.jane will both be data frames, each holding the rows that appear both in that data frame and in bcm. Is this what you are trying to achieve?
EDITED after the first comment
library(plyr)
l <- list(john, jane)
res <- lapply(l, function(x) {match_df(x, bcm, on = "Full.Name")} )
dupl.john <- res[[1]]
dupl.jane <- res[[2]]
Now, res will have a list of the data frames with the matches, based on the column "Full.Name".
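If you want the results back as separate dupl.john and dupl.jane objects without typing each one out, here is a sketch building on the code above (list2env pushes the renamed list entries into the workspace):
l <- list(john = john, jane = jane)
res <- lapply(l, function(x) match_df(x, bcm, on = "Full.Name"))
# prefix each name with "dupl." and create the objects in the global environment
list2env(setNames(res, paste0("dupl.", names(res))), envir = .GlobalEnv)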
I have several files in one folder that I read into R as a list. Each element in the list is a data frame, and I need to remove a random consecutive 6000 rows from each data frame. I could unlist the data frames and pull out the rows, but ideally I would like to keep everything in the list and just go through each element and remove the rows I need. I thought a for loop or an apply function would work, but the individual elements don't seem to be recognized as data frames when they're in the list.
Here is what I have so far
files <- list.files('file location')
fs <- lapply(files, read.table, sep=',',skip=3,header=TRUE)
##separates the list into individual data frames
for (i in seq(fs))
assign(paste("df", i, sep = ""), fs[[i]])
##selects a random 6000 rows to remove from a dataframe
n <- nrow(df1)
samp <- sample(1:(n-6000),1)
rmvd <- df1[-seq(from = samp, to =samp+5999),]
I would either like to apply the last part to each dataframe individually and put those back into a list or be able to apply it to the list. I want it in a list in the end because it will be easier to write each dataframe to its own csv file.
If you stick with the list of data.frames, fs, instead of assigning them, you can do something like
lapply(fs, function(x) x[-(sample(nrow(x)-6000,1)+0:5999), ])
If n = nrow(x) is ever 6000 or fewer, you're in trouble, of course: there is no valid starting row to sample.
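Since you mention wanting to write each data frame to its own csv file at the end, here is a sketch of that last step (the output file names are made up for illustration):
trimmed <- lapply(fs, function(x) x[-(sample(nrow(x)-6000,1)+0:5999), ])
# write each trimmed data frame to its own file
for (i in seq_along(trimmed)) {
  write.csv(trimmed[[i]], file = paste0("trimmed_", i, ".csv"), row.names = FALSE)
}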
I have a collection of data frames that I have generated in R. I need to count the number of data frames whose names begin with "entry_". I'd like to generate a number to then use for a function that rbinds all of these data frames and these data frames only.
So far, I have tried using grep to identify the data frames, however, this just returns where they are indexed in my object list (e.g., 16:19 --- objects 16-19 begin with "entry_"):
count_entry <- (grep("entry_", objects()))
Eventually I would like to rbind all of these data frames like so:
list.make <- function() {
sapply(paste('entry_', seq(1:25), sep=''), get, environment(), simplify = FALSE)
}
all.entries <- list.make()
final.data <- rbind.fill(all.entries)
I don't want to have to enter the sequence manually every time (for example (1:25) in the code above), which is why I'm hoping to be able to automatically count the data frames beginning with "entry_".
If anyone has any ideas of how to solve this, or how to go about this in a better way, I'm all ears!
Per comment by docendo: The ls function will list objects in an environment that match a regex pattern. You can then use mget to retrieve those objects as a list:
mylist <- mget(ls(pattern = "^entry_"))
That will then work with rbind.fill. You can then remove the original objects in a similar way: rm(list = ls(pattern = "^entry_")).
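Putting it together, a sketch: length(mylist) gives the count you were after, and the same list feeds straight into rbind.fill (from plyr, as in your own code):
library(plyr)
mylist <- mget(ls(pattern = "^entry_"))
length(mylist)  # how many entry_ data frames exist
final.data <- rbind.fill(mylist)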
I have to write an R script that loads different numbers of files at different times. The files are loaded into data frames, certain columns are extracted, and those columns are then merged with the cbind function. My problem is that the number of files varies from run to run, so there may be 3 vectors to cbind at one time and 5 at another. How can I give cbind a varying number of vectors so that it doesn't throw errors when it doesn't get all of them? That is what happens when I hard-code a fixed number.
raw1 <- read.table()
raw2 <- read.table()
vec1 <- raw1[,2]
vec2 <- raw2[,2]
cbind(vec1,vec2,vec3)
I know I'd be better off writing something interactive, such as a tcltk dialog plus some kind of loop. Maybe you could give me an idea of how an effective loop could be structured.
You can store the data frames in a list and then cbind them using do.call(). This is a good way to cbind lists of arbitrary length.
# 'filenames' holds the names of the files you want to read; pass any
# additional parameters that read.table needs
datalist <- lapply(filenames, function(i) read.table(i)[, 2])
# Then cbind all the entries of datalist
do.call(cbind, datalist)
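A hypothetical end-to-end usage, assuming the files sit in the working directory and end in .txt (that pattern is just an illustration; adjust it to your files):
filenames <- list.files(pattern = "\\.txt$")  # hypothetical file pattern
datalist <- lapply(filenames, function(f) read.table(f, header = TRUE)[, 2])
mat <- do.call(cbind, datalist)  # one column per file, however many there are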