I have read multiple questionnaire files into data frames in R. Now I want to create new data frames based on them, but containing only specific rows, by looping over all of them. The loop appears to work fine. However, the selection of the rows does not work. When I try selecting with simple square brackets, I get the error "incorrect number of dimensions". I also tried subset(), but I don't seem to be able to set the subset correctly.
Here is what I have so far:
for (i in 1:length(subjectlist)) {
p[i] <- paste("path",subjectlist[i],sep="")
files <- list.files(path=p,full.names = T,include.dirs = T)
assign(paste("subject_",i,sep=""),read.csv(paste("path",subjectlist[i],".csv",sep=""),header=T,stringsAsFactors = T,row.names=NULL))
assign(paste("subject_",i,"_t",sep=""),sapply(paste("subject_",i,sep=""),[c((3:22),(44:63),(93:112),(140:159),(180:199),(227:246)),]))
}
Here's some code that tries to abstract away the details and do what it seems like you're trying to do. If you just want to read in a bunch of files and then select certain rows, I think you can avoid the assign functions and just use sapply to read all the data frames into a list. Let me know if this helps:
# Get the names of files we want to read in
files = list.files([arguments])
df.list = sapply(files, function(file) {
# Read in a csv file from the files vector
df = read.csv(file, header=TRUE, stringsAsFactors=FALSE)
# Add a column telling us the name of the csv file that the data came from
df$SourceFile = file
# Select only the rows we want
df = df[c(3:22,44:63,93:112,140:159,180:199,227:246), ]
}, simplify=FALSE)
If you now want to combine all the data frames into a single data frame, you can do the following (the SourceFile column tells you which file each row originally came from):
# Combine all the files into a single data frame
allDFs = do.call(rbind, df.list)
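To make the approach above concrete, here is a small self-contained sketch. The temporary directory, the subject_N.csv file names, the made-up columns and the short rows_to_keep vector are all assumptions for illustration, standing in for your real files and row ranges:

```r
# Write three small made-up csv files to a temporary directory
tmp <- tempfile("subjects_")
dir.create(tmp)
for (s in 1:3) {
  write.csv(data.frame(item = 1:10, score = s * (1:10)),
            file.path(tmp, paste0("subject_", s, ".csv")),
            row.names = FALSE)
}

# Full paths of the files to read
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Stand-in for the real row ranges (3:22, 44:63, ...)
rows_to_keep <- c(2:4, 8:9)

# Read every file into a named list, keeping only the wanted rows
df.list <- sapply(files, function(file) {
  df <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
  df$SourceFile <- basename(file)
  df[rows_to_keep, ]
}, simplify = FALSE)

# One data frame with all selected rows
allDFs <- do.call(rbind, df.list)
```

Each element of df.list is a regular data frame, so df.list[[1]] can be indexed with square brackets without the "incorrect number of dimensions" error.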
I have a dataframe that contains the names of a bunch of .CSV files. It looks how it does in the snippet below:
What I'm trying to do is read each of these .CSVs into a dataframe and append the results. Specifically, I want to create three different dataframes based on what's in the file names:
Create a dataframe with all results from .CSV files with -callers- in the file name
Create a dataframe with all results from .CSV files with -results- in the file name
Create a dataframe with all results from .CSV files with -script_results- in the file name
The command to actually convert the .CSV file into a dataframe looks like this if I were using the first .CSV in the dataframe below:
data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")
But what I'm trying to do is:
Iterate ALL the .csv files under Key using the s3read_using function
Put them in three separate dataframes based on the file names as listed above
Key
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv
08182020-testpilot-iowa-callers-08-18-2020-374839084.csv
08182020-testpilot-maine-callers-08-18-2020-396935866.csv
08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv
08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv
08182020-testpilot-michigan-callers-08-18-2020-367670258.csv
08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv
08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv
08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv
08182020-testpilot-nevada-callers-08-18-2020-782329503.csv
08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv
08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv
08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv
08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv
08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv
08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv
08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv
08182020-testpilot-virginia-callers-08-18-2020-027531377.csv
08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv
08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
Create 3 empty dataframes. You will probably also need to give them column names matching the column names of the files you want to append:
results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
Then iterate over file_name and read each file into the data object. Depending on which pattern ("-results-", "-script_results-" or "-callers-") is contained in the name of each file, it will be appended to the correct dataframe:
for (file in file_name) {
data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))
# grepl() takes the pattern first, then the string to search
if (grepl("-results-", file)) { results <- rbind(results, data)}
if (grepl("-script_results-", file)) { script_results <- rbind(script_results, data)}
if (grepl("-callers-", file)) { callers <- rbind(callers, data)}
}
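Here is a quick way to check the routing logic without touching S3. This is only a sketch: read_stub is a hypothetical stand-in for aws.s3::s3read_using(read.csv, ...), and the four short file names are made up to mimic the Key column:

```r
# Made-up keys that mimic the naming scheme in the question
file_name <- c("x-arizona-results-1.csv",
               "x-arizona-script_results-2.csv",
               "x-iowa-callers-3.csv",
               "x-michigan-follow-ups-4.csv")

# Hypothetical stand-in for the S3 read: returns a one-row data frame
read_stub <- function(file) {
  data.frame(Key = file, rows = 1L, stringsAsFactors = FALSE)
}

results <- data.frame()
script_results <- data.frame()
callers <- data.frame()

for (file in file_name) {
  data <- read_stub(file)
  # fixed = TRUE treats the pattern as a literal string, not a regex
  if (grepl("-results-", file, fixed = TRUE)) results <- rbind(results, data)
  if (grepl("-script_results-", file, fixed = TRUE)) script_results <- rbind(script_results, data)
  if (grepl("-callers-", file, fixed = TRUE)) callers <- rbind(callers, data)
}
```

Note that "-results-" does not match the script_results files, because those contain "_results-" rather than "-results-". The follow-ups file matches none of the patterns and is simply skipped.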
As an alternative to @JohnFranchak's recommendation for map_dfr (which likely works just fine), the method that I referenced in comments would look something like this:
alldat <- lapply(setNames(nm = dat$file_name),
function(obj) aws.s3::s3read_using(read.csv, object = obj))
callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])
The do.call(rbind, ...) part is analogous to dplyr::bind_rows and data.table::rbindlist in that it accepts a list of frames, and the result is a single frame. Some differences:
do.call(rbind, ...) requires every frame to have the same set of column names. For data frames, rbind matches columns by name, so a different column order is tolerated, but a missing or extra column is an error. It's not hard to enforce this externally (e.g., adding missing columns), but it's not automatic.
data.table::rbindlist will complain about missing columns or a different column order, but it has fill= and use.names= arguments that can be set to TRUE to handle both.
dplyr::bind_rows will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
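As a rough illustration of enforcing matching columns externally before do.call(rbind, ...), here is a base-R sketch. The helper align_columns and the two small frames are hypothetical, made up for this example:

```r
# Two made-up frames with different column sets and orders
a <- data.frame(id = 1:2, x = c(0.1, 0.2), stringsAsFactors = FALSE)
b <- data.frame(x = 0.3, id = 3L, note = "late", stringsAsFactors = FALSE)
frames <- list(a = a, b = b)

# Hypothetical helper: add any missing columns as NA, then put the
# columns in one fixed order
align_columns <- function(df, cols) {
  for (missing_col in setdiff(cols, names(df))) {
    df[[missing_col]] <- NA
  }
  df[, cols, drop = FALSE]
}

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(frames, names)))

# Align every frame, then row-bind them
combined <- do.call(rbind, lapply(frames, align_columns, cols = all_cols))
```

Calling do.call(rbind, frames) directly on these two frames would stop with an error, because the number of columns differs.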
Lastly, my use of setNames(nm=..) is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name, but I've found that with two separate objects, it is easy to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls produce effectively the same named list:
lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)
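To see that both forms give the same named list, here is a tiny check. fake_read is a hypothetical stand-in for the S3 read, and the three file names are made up:

```r
# Made-up file names following the same naming scheme
file_names <- c("a-callers-1.csv", "a-results-1.csv", "a-script_results-1.csv")

# Hypothetical reader so the example needs no S3 access
fake_read <- function(obj) data.frame(source = obj, n = nchar(obj))

# Both calls return a list of frames named by file name
x1 <- lapply(setNames(nm = file_names), fake_read)
x2 <- sapply(file_names, fake_read, simplify = FALSE)

same <- identical(x1, x2)

# The names can then drive the grouping
results_only <- names(x1)[grepl("-results-", names(x1))]
```

Because the names travel with the frames, subsetting or reordering the list keeps each frame attached to its file name.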
Rewrote in an attempt to simplify my problem statement.
I am using R V1.3.959 and am relatively new to R overall. I have a custom Excel form, which means the objects are in various cells in Excel and the variable is also in some cell. I have over 1000 of these forms as product specs. I read in only one file and created a function called tidy.form to pull the data out and then cbind it into a new file, as below.
read_customer_file = "C:/Users/..../FABRIC TECHNICAL SUBMISSION AGREEMENT J123abd.xlsx"
product_tech <- read_excel(read_customer_file, sheet = "Form") %>% clean_names()
#function for make form tidy
form.extract <- function(tidy.form) {
#extract the object / data point looking for but with entire column
fabric.supplier.name <- product_tech[c( 0,5)]
#extract the specific row in the column with the data point desired
fabric.supplier.name <- slice(fabric.supplier.name, 3,0)
#rename column to correct variable
colnames(fabric.supplier.name)[colnames(fabric.supplier.name) == "x5"] <- "fabric.supplier.name"
combine <- cbind(date, fabric.supplier.name, address)
return(combine)
}
Now I need a way to read in all of the xlsx files from a directory and do the same thing for each.
I figured out how to read the file names in through:
files <- list.files(path="C:/Users/me/productspecfolder", pattern="*.xlsx", full.names=TRUE, recursive=FALSE)
However I am stuck at how to loop / lapply through my list.files and apply the function tidy.form to each.
Any help would be so much appreciated!
I have a vector of file paths called dfs, and I want to create a dataframe from each of those files and bind them together into one huge dataframe, so I did something like this:
for (df in dfs){
clean_df <- bind_rows(as.data.table(read.delim(df, header=T, sep="|")))
return(clean_df)
}
but only the last item in the dataframe is being returned. How do I fix this?
I'm not sure about your file format, so I'll take common .csv as an example. Replace the a * i part with actually reading all the different files, instead of just generating mockup data.
files = list()
for (i in 1:10) {
a = read.csv('test.csv', header = FALSE)
a = a * i
files[[i]] = a
}
full_frame = data.frame(data.table::rbindlist(files))
The problem is that you can only pass one file at a time to the function read.delim(). So the solution would be to use a function like lapply() to read in each file specified in your df.
Here's an example, and you can find other answers to your question here.
library(tidyverse)
df <- c("file1.txt","file2.txt")
all.files <- lapply(df,function(i){read.delim(i, header=T, sep="|")})
clean_df <- bind_rows(all.files)
(clean_df)
Note that you don't need the function return(); putting clean_df in parentheses prompts R to print the variable.
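If you would rather keep a for loop, the key change is to store each file in its own list slot instead of overwriting a single variable on every pass. A minimal base-R sketch, assuming two small pipe-delimited files that are created here purely for illustration:

```r
# Create two small pipe-delimited files (made-up data) in a temp directory
tmp <- tempfile("delim_")
dir.create(tmp)
dfs <- file.path(tmp, c("file1.txt", "file2.txt"))
write.table(data.frame(id = 1:2, val = c("a", "b")), dfs[1],
            sep = "|", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 3:4, val = c("c", "d")), dfs[2],
            sep = "|", row.names = FALSE, quote = FALSE)

# Collect each file in its own list slot
pieces <- vector("list", length(dfs))
for (k in seq_along(dfs)) {
  pieces[[k]] <- read.delim(dfs[k], header = TRUE, sep = "|")
}

# Bind all pieces into one data frame after the loop
clean_df <- do.call(rbind, pieces)
```

do.call(rbind, pieces) is used here only to avoid extra packages; bind_rows(pieces) or data.table::rbindlist(pieces) would do the same job.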
I am new to R and I am practicing writing R functions. I have 100 separate csv
data files stored in my directory, and each is labeled by its id, e.g. "1" to "100".
I would like to write a function that reads some selected files into R, calculates the
number of complete cases in each data file, and arranges the results into a data frame.
Below is the function that I wrote. First I read all file names into "dat". Then, using
the rbind function, I read the selected files I want into a data.frame. Lastly, I compute
the number of complete cases using sum(complete.cases()). This seems straightforward but
the function does not work. I suspect there is something wrong with the index but
have not figured out what. I searched through various topics but could not find a useful
answer. Many thanks!
complete = function(directory,id) {
dat = list.files(directory, full.name=T)
dat.em = data.frame()
for (i in id) {
dat.ful= rbind(dat.em, read.csv(dat[i]))
obs = numeric()
obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i,]))
}
data.frame(ID = id, count = obs)
}
complete("envi",c(1,3,5))
I get an error and a warning message:
Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5
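One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs only keeps the value from the last iteration. Because that value is assigned to obs[i], obs is padded with NA up to length i (5 in your example), while id has length 3, which is why data.frame() reports differing numbers of rows: 3, 5.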
Another issue is that the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame being read in that iteration of the loop. This won't cause an error, but you don't actually need to store the previous data frames, since you're just checking the number of complete cases for each data frame you read in.
Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {
# Read each csv file in turn and store its name and number of complete cases
# in a list
obs.list = lapply(files, function(x) {
dat = read.csv(paste0(directory,"/", x))
data.frame(fileName=x, count=sum(complete.cases(dat)))
})
# Return a data frame with the number of complete cases for each file
return(do.call(rbind, obs.list))
}
Then, to run the function, you need to give it a directory and a list of file names. For example, to read all csv files in the current working directory, you can do this:
filesToRead = list.files(pattern="\\.csv$")
complete(getwd(), filesToRead)
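Here is a small end-to-end check of the function, as a sketch only. The temporary directory and the two tiny csv files with a few NA values are made up for this example, and file.path() is used in place of paste0():

```r
# Same idea as above: one row per file with its complete-case count
complete <- function(directory, files) {
  obs.list <- lapply(files, function(x) {
    dat <- read.csv(file.path(directory, x))
    data.frame(fileName = x, count = sum(complete.cases(dat)))
  })
  do.call(rbind, obs.list)
}

# Made-up test data: file 1 has one complete row, file 2 has two
tmp <- tempfile("envi_")
dir.create(tmp)
write.csv(data.frame(ID = 1, a = c(1, NA, 3), b = c(4, 5, NA)),
          file.path(tmp, "1.csv"), row.names = FALSE)
write.csv(data.frame(ID = 2, a = c(1, 2), b = c(3, 4)),
          file.path(tmp, "2.csv"), row.names = FALSE)

res <- complete(tmp, c("1.csv", "2.csv"))
```

The result has exactly one row per requested file, so the "differing number of rows" error cannot occur.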
I would like to know how I can solve the following problem using higher-order functions like ddply, ldply, and dlply, and avoid using problematic for loops.
The problem:
I have a .csv file representing a dataset loaded into a data.frame, with each row containing the path to a directory where more information is stored in files. I want to use the directory information in the data.frame to open the files ("file1.txt", "file2.txt") in that directory, merge them, then combine the merged files from each entry into one large dataframe.
something like this:
df =
entryName,dir
1,/home/guest/data/entry1
2,/home/guest/data/entry2
3,/home/guest/data/entry3
4,/home/guest/data/entry4
What I would like to do is apply a function to the dataframe that takes the directory,
appends a couple of file names ("file1.txt", "file2.txt"), then merges the two files together based on a given field.
for example file1.txt could be:
entry,subEntry,value
1,A,2
1,B,3
1,C,4
1,D,5
1,E,3
1,F,3
for example file2.txt could be:
entry,subEntry,value
1,A,8
1,B,7
1,C,8
1,D,9
1,E,8
1,F,7
the output would look something like this:
entryName,subEntry,valueFromFile1,valueFromFile2
1,A,2,8
1,B,3,7
1,C,4,8
1,D,5,9
1,E,3,8
1,F,3,7
2,A,4,8
2,B,5,9
2,C,6,7
2,D,3,7
2,E,6,8
2,F,5,9
Right now I am using a for loop, but for obvious reasons would like to use a higher order function. Here is what I have so far:
allCombined <- data.frame()
df <- read.csv(file="allDataEntries.csv",header=TRUE)
numberOfEntries <- dim(df)[1]
for(i in 1:numberOfEntries){
dir <- df$dir[i]
file1String <- paste(dir,"/file1.txt",sep='')
file2String <- paste(dir,"/file2.txt",sep='')
file1.df <- read.csv(file=file1String,header=TRUE)
file2.df <- read.csv(file=file2String,header=TRUE)
localMerged <- merge(file1.df,file2.df, by="value")
allCombined <- rbind(allCombined,localMerged)
}
#rest of my analysis...
Here is one way to do it. The idea is to create a list with contents of all the files, and then use Reduce to merge them sequentially using the common columns entry and subEntry.
# READ DIRECTORIES, FILES AND ENTRIES
dirs <- read.csv(file = "allDataEntries.csv", header = TRUE, as.is = TRUE)$dir
files <- as.vector(outer(dirs, c('file1.txt', 'file2.txt'), 'file.path'))
entries <- lapply(files, 'read.csv', header = TRUE)
# APPLY CUSTOM MERGE FUNCTION TO COMBINE ENTRIES
merge_by <- function(x, y){
merge(x, y, by = c('entry', 'subEntry'))
}
Reduce('merge_by', entries)
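If the files in different directories carry different entry values, an inner join across all of them would find no common rows, so a per-directory variant may be closer to the desired output. The following is a sketch under that assumption: it merges each directory's two files first and then stacks the results. The temporary directory tree, the made-up values and the helper merge_dir are hypothetical:

```r
# Build a throwaway directory tree with made-up data
root <- tempfile("entries_")
dir.create(root)
dirs <- file.path(root, c("entry1", "entry2"))
for (k in seq_along(dirs)) {
  dir.create(dirs[k])
  sub <- c("A", "B", "C")
  write.csv(data.frame(entry = k, subEntry = sub, value = k + 1:3),
            file.path(dirs[k], "file1.txt"), row.names = FALSE)
  write.csv(data.frame(entry = k, subEntry = sub, value = 10 * k + 1:3),
            file.path(dirs[k], "file2.txt"), row.names = FALSE)
}

# Hypothetical helper: merge the two files of one directory
merge_dir <- function(d) {
  f1 <- read.csv(file.path(d, "file1.txt"), header = TRUE)
  f2 <- read.csv(file.path(d, "file2.txt"), header = TRUE)
  merge(f1, f2, by = c("entry", "subEntry"),
        suffixes = c("FromFile1", "FromFile2"))
}

# Merge within each directory, then stack the per-directory results
allCombined <- do.call(rbind, lapply(dirs, merge_dir))
```

The suffixes argument turns the two value columns into valueFromFile1 and valueFromFile2, matching the column names in the desired output.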
I've not tested this, but it seems like it should work. The anonymous function takes a single row from df (passed in as DF), reads in the two associated files, and merges them together by value. ddply then takes these data frames and makes a single one out of them by rbinding (since the requested output is a data frame). It does assume entryName is not repeated in df. If it is, you can add a column with a unique row identifier and group over that instead.
library(plyr)
ddply(df, .(entryName), function(DF) {
# DF is the one-row subset of df for the current entryName
dir <- DF$dir
file1String <- paste(dir,"/file1.txt",sep='')
file2String <- paste(dir,"/file2.txt",sep='')
file1.df <- read.csv(file=file1String,header=TRUE)
file2.df <- read.csv(file=file2String,header=TRUE)
merge(file1.df,file2.df, by="value")
})