I am trying to count records across all of my files cumulatively, but for some reason the code counts only the last file and uses that number for the rest of the analysis. How can I change this code to include the counts and unique counts from all files (there are 51 files)?
#Move all files to one list
file_list <- list.files(pattern="Dataset 2.*txt")
# Read files
for (i in 1:length(file_list)){
file <- read.table(file_list[i], header=TRUE, sep=",")
out.file <- rbind(file)
}
# Count total number phone call records
count_PHONECALLRECORDS <- length(out.file$CALLER_ID)
#Count number unique caller id's
count_CALLERID <- length(unique(out.file$CALLER_ID))
Here's the correction you need:
# Read files
out.file <- NULL
for (i in 1:length(file_list)){
file <- read.table(file_list[i], header=TRUE, sep=",")
out.file <- rbind(out.file, file)
}
Note that growing the data this way (rbind-ing the result to itself on every iteration) is not efficient, but assuming you are a beginner, I'd say don't worry about it until you have to.
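If the loop ever does become a bottleneck, one common alternative is to read every file into a list and bind once at the end. The sketch below is only an illustration: it writes two small made-up files to a temporary directory (the file names and contents are invented for the example) so that it can run on its own.

```r
# Create two small example files in a temporary directory (made-up data)
tmp_dir <- tempfile("calls_")
dir.create(tmp_dir)
write.csv(data.frame(CALLER_ID = c("A1", "B2", "A1")),
          file.path(tmp_dir, "Dataset 2 part1.txt"), row.names = FALSE)
write.csv(data.frame(CALLER_ID = c("B2", "C3")),
          file.path(tmp_dir, "Dataset 2 part2.txt"), row.names = FALSE)

# List the files, read each one into a list, then bind once
file_list <- list.files(tmp_dir, pattern = "Dataset 2.*txt", full.names = TRUE)
df_list <- lapply(file_list, read.table, header = TRUE, sep = ",")
out.file <- do.call(rbind, df_list)

# Counts over all files combined
count_PHONECALLRECORDS <- nrow(out.file)
count_CALLERID <- length(unique(out.file$CALLER_ID))
```

Because the combined data frame exists after the single rbind, any later step (such as a contingency table) also sees every file.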
You should move the counting code to the loop and initialize the counting variables before the loop:
setwd("~/Desktop/GEOG Research/Jordan/compression")
library(plyr)
library(rlang)
library(dplyr)
# Move all files to one list
file_list <- list.files(pattern="Dataset 2.*txt")
# Read files
count_PHONECALLRECORDS <- 0
count_CALLERID <- 0
for (i in 1:length(file_list)){
file <- read.table(file_list[i], header=TRUE, sep=",")
out.file <- rbind(file)
# Count total number phone call records
count_PHONECALLRECORDS <- count_PHONECALLRECORDS + length(out.file$CALLER_ID)
# Count number unique caller id's (note: a caller that appears in
# several files is counted once per file by this running sum)
count_CALLERID <- count_CALLERID + length(unique(out.file$CALLER_ID))
}
# Construct contingency matrix
tb_1 <- with(out.file, table(CALLEE_PREFIX, CALLER = substr(CALLER_ID, 0, 1)))
colnames(tb_1) <- c("Refugee Caller", "Non-Refugee Caller")
rownames(tb_1) <- c("Refugee Callee", "Non-Refugee Callee", "Unknown Callee")
tb_1
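One caveat with running totals like these: adding up per-file unique counts is not the same as counting unique callers across all files, because a caller who appears in several files is counted once per file. A small sketch with made-up caller IDs shows the difference.

```r
# Made-up caller IDs from two files; "B2" appears in both
file1_ids <- c("A1", "B2", "A1")
file2_ids <- c("B2", "C3")

# Running sum of the per-file unique counts
sum_of_uniques <- length(unique(file1_ids)) + length(unique(file2_ids))

# Unique count over the combined data
overall_unique <- length(unique(c(file1_ids, file2_ids)))

print(c(sum_of_uniques = sum_of_uniques, overall_unique = overall_unique))
```

Likewise, out.file in that loop still holds only the most recently read file, so the contingency table is built from the last file unless the data frames are accumulated.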
I'm trying to write a function called complete that takes a file directory (which has csv files titled 1-332) and the number of a file, and prints the number of rows without NA in the sulfate or nitrate columns. I am trying to use mutate to add a column titled nobs which is 1 if neither column is NA, and then take the sum of nobs for my answer, but I get an error message that the object nob is not found. How can I fix this? The specific file directory in question is downloaded within this block of code.
library(tidyverse)
if(!file.exists("rprog-data-specdata.zip")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip",temp)
unzip(temp)
unlink(temp)
}
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
#create an empty data frame
dat <- data.frame()
for(i in id){
dat <- rbind(dat, read.csv(files_full[i]))
}
mutate(dat, nob = ifelse(!is.na(dat$sulfate) & !is.na(dat$nitrate), 1, 0))
x <- summarise(dat, sum = sum(nob))
return(x)
}
When one runs the following code, nobs should be 117, but I get an error message instead:
complete("specdata", 1)
Error: object 'nob' not found
I think the function below should get what you need. Rather than a loop, I prefer map (or apply) in this setting. It's difficult to say where your code went wrong without the error message or an example I can run on my machine, however.
Happy Coding,
Daniel
library(tidyverse)
complete <- function(directory, id = 1:332){
#create a list of files
files_full <- list.files(directory, full.names = TRUE)
# cycle over each file to get the number of nonmissing rows
purrr::map_int(
files_full[id], # only the files requested via id
~ read.csv(.x) %>% # read in datafile
dplyr::select(sulfate, nitrate) %>% # select two columns of interest
tidyr::drop_na() %>% # drop missing observations
nrow() # get the number of rows with no missing data
) %>%
sum() # sum the total number of rows not missing among all files
}
As mentioned, avoid building objects in a loop. Instead, consider building a list of data frames from each csv then call rbind once. In fact, even consider base R (i.e., tinyverse) for all your needs:
complete <- function(directory, id = 1:332){
# create a list of files
files_full <- list.files(directory, full.names = TRUE)
# create a list of data frames
df_list <- lapply(files_full[id], read.csv)
# build a single data frame with nob column
dat <- transform(do.call(rbind, df_list),
nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0)
)
return(sum(dat$nob))
}
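On the "object 'nob' not found" error itself: a likely cause, judging from the posted code, is that the result of mutate() is never assigned, so dat has no nob column by the time summarise() runs. mutate(), like base transform(), returns a modified copy rather than changing the data frame in place. The sketch below uses transform() with made-up values so that it runs without extra packages; the same assign-back applies to mutate().

```r
# Made-up data: only the first row is complete
dat <- data.frame(sulfate = c(1.2, NA, 3.4), nitrate = c(0.5, 0.7, NA))

# The result is computed but not stored, so dat itself is unchanged
invisible(transform(dat, nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0)))
has_nob_before <- "nob" %in% names(dat)

# Assigning the result back keeps the new column
dat <- transform(dat, nob = ifelse(!is.na(sulfate) & !is.na(nitrate), 1, 0))
has_nob_after <- "nob" %in% names(dat)

# Now the column can be summed
total <- sum(dat$nob)
```

In the original function this would correspond to writing dat <- mutate(dat, nob = ...) before calling summarise().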
A folder has dozens of csv files. Each csv file is named with just an id ranging from 1 - 332. Each file contains two columns "sulfate" and "nitrate" with numeric values of pollution level. I want to create a table that lists ids (file names as 'id') in one column, and number of complete cases (as 'nobs') in that file in another column.
Please suggest modification to the code below (or something totally new is fine)
complete <- function(directory, id = 1:332) {
csvfiles <- dir(directory, "*\\.csv$", full.names = TRUE)
data <- lapply(csvfiles[id], read.csv)
for (filedata in data) {
d <- filedata[["sulfate"]]
d <- d[complete.cases(d)] # remove NA values
d1 <- filedata[["nitrate"]]
d1<- d1[complete.cases(d1)]
}
paste(id, (length(d)+length(d1)))
}
Currently the above code just binds the id numbers with the total of complete cases across all the files in that id-range.
Some suggested modifications:
You can read in and process each csv file within the same function. Use cbind to add the 2 columns that you require, then row-bind all the data.frames into 1 data.frame.
complete <- function(directory, id = 1:332) {
lsData <- lapply(id, function(n) {
df <- read.csv(paste0(directory, "/", n, ".csv"))
cbind(id=n, df, nobs=nrow(df[complete.cases(df),,drop=FALSE]))
})
do.call(rbind, lsData)
}
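If the goal is a table with exactly one row per file (id and nobs only), a variation on the same idea is to return a one-row data frame per file instead of binding nobs onto every data row. This is a sketch under assumptions: the function name complete_summary is made up here, and it assumes files named like 1.csv, as in the code above. The demo files are invented so the example runs on its own.

```r
# Hypothetical variant: one summary row per file
complete_summary <- function(directory, id = 1:332) {
  rows <- lapply(id, function(n) {
    df <- read.csv(file.path(directory, paste0(n, ".csv")))
    # complete.cases() is TRUE for rows with no NA in any column
    data.frame(id = n, nobs = sum(complete.cases(df)))
  })
  do.call(rbind, rows)
}

# Demo with two small made-up files in a temporary directory
tmp_dir <- tempfile("spec_")
dir.create(tmp_dir)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 5, NA)),
          file.path(tmp_dir, "1.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, 2), nitrate = c(3, 4)),
          file.path(tmp_dir, "2.csv"), row.names = FALSE)

result <- complete_summary(tmp_dir, 1:2)
print(result)
```

If the files contain columns other than sulfate and nitrate, restricting to those two columns before calling complete.cases() may be what is wanted.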
After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, mean column 2 and mean column 4 to a data frame.
I have written the following function:
moist.each.mean <- function() {
library("tcltk")
directory <- tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame(timestamp=character(), humidity=numeric(), temp=numeric())
for(i in 1:length(filelist)){
file.in[[i]] <- read.csv(filelist[i], header=F)
if (nrow(file.in[[i]]<3)){
print("discard")
} else {
newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1))
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
}
But I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest to use (l)apply... Here's my take:
getMeans <- function(fpath,runfct,
target_cols = c(2),
sep=",",
dec=".",
header = T,
min_obs_threshold = 3){
f <- list.files(fpath)
fcsv <- f[grepl("\\.csv$",f)] # backslash must be doubled inside an R string
fcsv <- file.path(fpath,fcsv)
csv_list <- lapply(fcsv,read.table,sep = sep,
dec = dec, header = header)
csv_rows <- sapply(csv_list,nrow)
rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
# drop = FALSE keeps a data frame even when only one column is selected
lapply(rel_csv_list,function(x) colMeans(x[,target_cols,drop=FALSE]))
}
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist=list.files(path=directory, pattern="csv")
# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
file.in = read.csv(f, header=FALSE, stringsAsFactors=FALSE)
# Skip files with fewer than 3 rows of data (print() returns a string,
# which is filtered out when the list is bound below)
if (nrow(file.in) < 3) {
print(paste("Discard", f))
# Otherwise, capture file timestamp and summarise data frame
} else {
data.frame(timestamp=substr(f, 7, 23),
humidity=round(mean(file.in$V2),1),
temp=round(mean(file.in$V4),1))
}
})
# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
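To see why the distinction matters, note that c() coerces mixed inputs to a single type, so the numeric means become strings. A small sketch with made-up values:

```r
# c() with a string and numbers yields a character vector
newrow_vec <- c("20150101 10_00_00", round(45.67, 1), round(21.34, 1))
vec_is_character <- is.character(newrow_vec)

# data.frame() keeps each column's own type
newrow_df <- data.frame(timestamp = "20150101 10_00_00",
                        humidity = round(45.67, 1),
                        temp = round(21.34, 1))
humidity_is_numeric <- is.numeric(newrow_df$humidity)

# Row-binding data frames keeps the column names and types
mdf <- rbind(data.frame(), newrow_df)
str(mdf)
```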
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <- matrix(
c(filetitles[[i]],
round(mean(file.in$V2, na.rm = T), 1),
round(mean(file.in$V4, na.rm = T), 1)),
nrow = 1, ncol = 3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
The only thing is that I did not get it to work as a function, because then I would only have one row in mdf containing the last file's data. Somehow it did not add rows but overwrote row 1 with each iteration. Using it without a function wrapper worked fine, though...
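One possible reason the function wrapper appeared not to work (a guess, since the wrapped version of this fixed code is not shown): variables assigned inside a function are local to it, and a function returns the value of its last expression. In the originally posted function the last expression is the names(mdf) <- ... assignment, so the data frame itself is never returned. A minimal sketch with made-up function names:

```r
# Last expression is an assignment: its value is the assigned names vector
build_without_return <- function() {
  mdf <- data.frame(a = 1:2, b = 3:4)
  names(mdf) <- c("x", "y")
}

# Ending with mdf returns the data frame itself
build_with_return <- function() {
  mdf <- data.frame(a = 1:2, b = 3:4)
  names(mdf) <- c("x", "y")
  mdf
}

res1 <- build_without_return()
res2 <- build_with_return()
str(res1)
str(res2)
```

With the second form, the caller would store the result, for example mdf <- build_with_return().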
I am gathering data from a fairly large Excel file (~200k rows, 35 columns). The data is broken up into sections: each has a section name, then a cycle count, then the data starts on the next row, and a blank line marks the end of the section. Here is my function that gets the data. The file parameter holds all the files in the directory ending in .csv that have to be parsed, and name is the name of the section of data that you want to get. The function works, but it runs at a snail's pace: processing 10k lines takes about 4 minutes.
getData1 <- function(file,name) {
for(i in 1:length(file)) {
dat <- c()
lines <- readLines(file[i])
indx <- grep(name, lines) #row number for anything with search term in it
counter <- 3
dat <- c(read.table(text=lines[(indx+2)],
sep=",", header=FALSE, stringsAsFactors=FALSE, check.names=FALSE))
while(dat[counter-2] != "\t") {
dat <- c(dat,read.table(text=lines[(indx+counter)], #read only one line per loop
sep=",", header=FALSE, stringsAsFactors=FALSE, check.names=FALSE))
counter <- counter + 1
}
return(dat)
}
}
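Much of the time in a function like this probably goes into calling read.table once per line and growing dat with c(). A possible alternative, sketched here under assumptions about the layout (section name on one line, cycle count on the next, data starting two lines below the name, and an empty line ending the section), is to locate the end of the section first and parse the whole block in one call. The function name read_section and the sample lines are made up for illustration.

```r
# Hypothetical helper: parse one named section in a single read.table call
read_section <- function(lines, name) {
  header <- grep(name, lines, fixed = TRUE)[1]   # first line containing the name
  start <- header + 2                            # skip the name and cycle-count lines
  blank <- which(!nzchar(trimws(lines)))         # indices of empty lines
  blank_after <- blank[blank >= start]
  # Section ends just before the next blank line, or at the end of the file
  end <- if (length(blank_after) > 0) blank_after[1] - 1 else length(lines)
  read.table(text = lines[start:end], sep = ",", header = FALSE,
             stringsAsFactors = FALSE)
}

# Made-up input in the assumed layout
lines <- c("SectionA", "cycle,1", "1,2,3", "4,5,6", "",
           "SectionB", "cycle,2", "7,8,9", "")
sec_a <- read_section(lines, "SectionA")
sec_b <- read_section(lines, "SectionB")
```

For real files, lines would come from readLines(file[i]) as in the function above, and the results for each file could be collected with lapply instead of returning inside the loop.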
I have a directory containing a large number of csv files. I would like to load the data into R and apply a function to every possible pair combination of csv files in the directory, then write the output to file.
The function that I would like to apply is matchpt() from the Biobase package, which compares locations between two data frames.
Here is an example of what I would like to do (although I have many more files than this):
Three files in directory: A, B and C
Perform matchpt on each pairwise combination:
nn1 = matchpt(A,B)
nn2 = matchpt(A,C)
nn3 = matchpt(B,C)
Write nn1, nn2 and nn3 to csv file.
I have not been able to find any solutions for this yet and would appreciate any suggestions. I am really not sure where to go from here but I am assuming that some sort of nested for loop is required to somehow cycle sequentially through all pairwise combinations of files. Below is a beginning at something but this only compares the first file with all the others in the directory so does not work!
library("Biobase")
# create two lists of identical filenames stored in the directory:
filenames1 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
filenames2 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
for(i in 1:length(filenames2)){
# load the first data frame in list 1
df1 <- lapply(filenames1[1], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 <- data.frame(df1)
# load a second data frame from list 2
df2 <- lapply(filenames2[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 <- data.frame(df2)
# isolate the relevant columns from within the two data frames
dat1 <- as.matrix(df1[, c("lat", "long")])
dat2 <- as.matrix(df2[, c("lat", "long")])
# run the matchpt function on the two data frames
nn <- matchpt(dat1, dat2)
#Extract the unique id code in the two filenames (for naming the output file)
file1 = filenames1[1]
code1 = strsplit(file1,"_")[[1]][1]
file2 = filenames2[i]
code2 = strsplit(file2,"_")[[1]][1]
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_nn.csv", sep="")
write.csv(nn, file=outfile, row.names=FALSE)
}
Any suggestions on how to solve this problem would be greatly appreciated. Many thanks!
You could do something like:
out <- combn( list.files(), 2, FUN=matchpt )
write.table( do.call( rbind, out ), file='output.csv', sep=',' )
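This assumes that the result has the same structure each time, so that the rbind-ing makes sense. Note that combn passes each pair to FUN as a single vector of length 2 (here, two file names), so matchpt would need a small wrapper that reads the two files and calls it with two arguments.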
You could also write your own function to pass to combn that takes the 2 file names, runs matchpt and then appends the results to the csv file. Remember that if you pass an open filehandle to write.table then it will append to the file instead of overwriting what is there.
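A sketch of the custom-function route, assuming nothing about matchpt itself: compare_pair below is a made-up stand-in that only records which two files it was given, and the file names are dummies. In real use its body would read both files and run the comparison.

```r
# Hypothetical stand-in for the real comparison; combn hands it one
# character vector of length 2 containing the two file names
compare_pair <- function(pair) {
  data.frame(file1 = pair[1], file2 = pair[2], stringsAsFactors = FALSE)
}

# Dummy file names
filenames <- c("A.csv", "B.csv", "C.csv")

# simplify = FALSE returns a list with one element per unordered pair
out <- combn(filenames, 2, FUN = compare_pair, simplify = FALSE)

# Bind the per-pair results into one data frame
result <- do.call(rbind, out)
print(result)
```

Because combn generates each unordered pair once, A with B is compared but B with A is not, and no file is compared with itself.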
Try this example:
#dummy filenames
filenames <- paste0("file_",1:5,".txt")
#loop through unique combination
for(i in 1:(length(filenames)-1))
for(j in (i+1):length(filenames))
{
flush.console()
print(paste("i=",i,"j=",j,"|","file1=",filenames[i],"file2=",filenames[j]))
}
In response to my question I seem to have found a solution. The below uses a for loop to perform every pairwise combination of files in a common directory (this seems to work and gives EVERY combination of files i.e. A & B and B & A):
# create a list of filenames
filenames = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
# For loop to compare the files
for(i in 1:length(filenames)){
# load the first data frame in the list
df1 = lapply(filenames[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 = data.frame(df1)
file1 = filenames[i]
code1 = strsplit(file1,"_")[[1]][1] # extract unique id code of file (in case where the id comes before an underscore)
# isolate the columns of interest within the first data frame
d1 <- as.matrix(df1[, c("lat_UTM", "long_UTM")])
# load the comparison file
for (j in 1:length(filenames)){
# load the second data frame in the list
df2 = lapply(filenames[j], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 = data.frame(df2)
file2 = filenames[j]
code2 = strsplit(file2,"_")[[1]][1] # extract unique id code of file 2
# isolate the columns of interest within the second data frame
d2 <- as.matrix(df2[, c("lat_UTM", "long_UTM")])
# run the comparison function on the two data frames (in this case matchpt)
out <- matchpt(d1, d2)
# Merge the unique id code in the two filenames (for naming the output file)
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_out.csv", sep="")
# write the result to file
write.csv(out, file=outfile, row.names=FALSE)
}
}