Issues using the sapply function in R

I have multiple files (from different days) that all contain the same information, and I want to be able to combine and analyse them both together and separately. For example, I would like to take the average of one of the columns for each day and write the result to a new file, as well as the average across all files (or a week's worth of files), and write that result to the same output file. I've been trying several methods, but none quite work.
The code below, I think, combines the files fine, but the data (separated with a " ") are being joined together with a \t, so when I try to read the $name column it doesn't exist. How do I fix this so the headers and data remain separated and can be read individually?
data <- sapply(datafiles, function(x) read.table(file = paste0(x),
                                                 fill = TRUE, header = TRUE, sep = " ",
                                                 stringsAsFactors = TRUE, quote = ""))
#separating STARLINK (SL) satellites from entire list
SL <- data[grepl("^SLINK", data$name), ]
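For what it's worth, the missing $name column is usually a sign that sapply() has simplified the per-file data frames into a matrix; lapply() keeps each file as a proper data frame. A minimal sketch (not the original code), reusing the datafiles vector from the question:
# lapply() returns a list of data frames instead of a simplified matrix
data_list <- lapply(datafiles, function(x) read.table(x, fill = TRUE, header = TRUE,
                                                      sep = " ", stringsAsFactors = TRUE,
                                                      quote = ""))
# Stack them into one data frame (assumes all files share the same columns)
data <- do.call(rbind, data_list)
# separating STARLINK (SL) satellites from the combined data
SL <- data[grepl("^SLINK", data$name), ]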

Related

How to get averages of values from multiple data sets written to one output file

So, I've used R for a few years, but I still find thinking around coding issues pretty hard, so I'd really appreciate it if any explanations can assume as little as possible!
Basically, I have lots of data files that correspond to different dates. I would like to write some sort of loop where each day's data file is read in, some analysis takes place (e.g. the mean of one of the columns), and the output goes to a separate file labelled by the date/name of the file. (The date isn't currently part of the data file, so I haven't figured out how to include that in the code yet.)
To complicate things, I need to pull out subsets from the data file to analyse separately. I've figured out how to do this and get the separate means already; I just don't know how to incorporate the loop.
# separating LINK (SL) satellites from entire list
SL <- data[grepl("^LINK", data$name), ]
# separating non-SL sat. from entire list
nonSL <- data[!grepl("^LINK", data$name), ]

analyse <- function(filenames) {
  # mean mag. for satellites in data frame
  meansat <- print(mean(data[, 2]))
  # mean mag. for LINK sat. in data frame
  meanSLsat <- print(mean(SL[, 2]))
  # mean mag. non-SL sat. in data frame
  meannonSLsat <- print(mean(nonSL[, 2]))
  means <- c(meansat, meanSLsat, meannonSLsat)
}
# looping in data files
filenames <- list.files(path = "Data")
for (f in filenames) {
  print(f)
  allmeans <- analyse(f)
}
write.table(allmeans, file = "outputloop.txt", col.names = "Mean Magnitude", quote = FALSE, row.names = FALSE)
This is what I have so far, but it's not working and I don't understand why. These are feeble attempts at a loop, but I have no idea where to put the loop (or in what order) when I then need to separate out the subclasses, so any help would be really appreciated! Thank you in advance!
Try:
for (f in filenames) {
  allmeans <- analyse(f)
  # Create a different output filename for each file analysed
  file_out <- paste(f, "_output.txt", sep = "")
  write.table(allmeans, file = file_out, col.names = "Mean Magnitude", quote = FALSE, row.names = FALSE)
}
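For the loop to produce per-file means, analyse() would also need to read the file it is given rather than rely on a data frame built beforehand. Purely as a sketch (the read.table settings, the "^LINK" pattern, and the magnitude column position are taken from the question and may need adjusting):
analyse <- function(filename) {
  # Read one day's file with the same settings as in the question
  day   <- read.table(filename, fill = TRUE, header = TRUE, sep = " ", quote = "")
  SL    <- day[grepl("^LINK", day$name), ]    # LINK satellites
  nonSL <- day[!grepl("^LINK", day$name), ]   # everything else
  # Mean magnitude (assumed to be column 2) for all, LINK, and non-LINK satellites
  c(mean(day[, 2]), mean(SL[, 2]), mean(nonSL[, 2]))
}

# full.names = TRUE keeps the "Data/" prefix so read.table can find each file
filenames <- list.files(path = "Data", full.names = TRUE)
for (f in filenames) {
  file_out <- paste(f, "_output.txt", sep = "")
  write.table(analyse(f), file = file_out,
              col.names = "Mean Magnitude", quote = FALSE, row.names = FALSE)
}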

Changing the name of a dataframe inside of a dataframe

I am working with a folder of csv files. These files were imported using the following code:
data_frame <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data <- lapply(data_frame, read.csv)
names(csv_data) <- gsub(".csv", "",
                        list.files("path", pattern = ".csv", all.files = TRUE, full.names = FALSE),
                        fixed = TRUE)
After this runs, each data frame in the list is named after its csv file. Since I have over 3000 csv files, I was wondering how to change those names so I can keep track of them better.
For example, instead of 'City, State, US', it will generate 'City-State-US'.
I apologize if this has already been asked, but I cannot find anything that could help.
So, if I understand your question correctly, you have CSV files named "City, State, US.csv" (so "Chicago, IL, USA.csv", etc.), you are reading them into R and storing them in a list where the list element name is the CSV name, and you want to make some changes to that element name?
You can access the names of the list items using names(csv_data), as you did above, and then transform them however you like and write them back in place.
For instance, the example you gave:
names(csv_data) <- gsub(", ", "-", names(csv_data), fixed = TRUE)
This should do what you need. If you need to do something else, just change the gsub parameters or function to something else - the key is that you can extract and write back the list item names in one shot.
You are already sort of doing this with the third line, where you name the items - you could even apply the transformation before you assign the names.
Edit: Also, a quick note: You are already storing the output of list.files in the data_frame variable - you could just reuse that variable in the third line instead of calling list.files again.
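A small sketch of both points combined (same variables as in the question, not tested against your files):
data_frame <- list.files("path", pattern = ".csv", all.files = TRUE, full.names = TRUE)
csv_data   <- lapply(data_frame, read.csv)
# Reuse data_frame instead of calling list.files() again: basename() drops the
# directory part, sub() strips the extension, and gsub() swaps ", " for "-"
names(csv_data) <- gsub(", ", "-",
                        sub("\\.csv$", "", basename(data_frame)),
                        fixed = TRUE)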

How to merge many databases in R?

I have this huge database from a telescope at the institute where I'm currently working. The telescope saves a file every single day; it records values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for a single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in a single file and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 formats of database: one has a sort of header that takes up the first 5 lines of the file, the other doesn't.
Every month has its own folder containing its days, and these folders are saved under the year they belong to, so for 10 years I'm talking about more than 3000 files - and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is way more than I've handled before, and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that can list all the file locations in a specific folder; can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search in those locations to give you just the files you need. You mentioned an excel file (let's call it paths.csv - has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using Notepad or something similar, it should work for you too. I'd need more information on how the files with different headers can be distinguished from one another - do they follow a naming convention, or do they have different extensions (which appears to be the case)? Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname  # Add file name as a variable - to use for sorting the big dataset
  dat                     # Return the data.table explicitly
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = TRUE, fill = TRUE)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
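A rough sketch of that wrapping, assuming the all_directories table and read_func from above:
# Read and combine every ".sn1" file found in one directory
read_directory <- function(path) {
  fnames <- dir(path = path, pattern = ".sn1")
  fnames <- paste(path, fnames, sep = "/")
  rbindlist(lapply(fnames, read_func), use.names = TRUE, fill = TRUE)
}

# One combined data.table per directory listed in paths.csv
dir_list <- lapply(all_directories$paths, read_directory)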
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names.
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = FALSE)$V1)

# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste(".sn", q, sep = "")), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))

# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
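If one pass over all the files is too much at once, the chunking mentioned above could look roughly like this (the chunk size of 500 is arbitrary):
# Split the file list into chunks, read each chunk, then bind the partial results
chunks   <- split(file_names, ceiling(seq_along(file_names) / 500))
partials <- lapply(chunks, function(ch) rbindlist(lapply(ch, read_func),
                                                  use.names = TRUE, fill = TRUE))
dat <- rbindlist(partials, use.names = TRUE, fill = TRUE)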

CSV with multiple datasets / different numbers of columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV files. Mine are from SAP BusinessObjects and present challenges different from those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
1) get the line numbers of the header rows by counting the number of columns per line
2) get the headers by reading every line index held in the vector described above
3) read the data by calculating skip and nrows between datasets from the header-line indices above
4) give the read data the column names from the read header
I need help getting on the right track to avoid loops and make the code more readable/compact when reading the headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. The structure is different for each dataset I export. For the current example I will suppose there are five tables included in the CSV.
To give you an idea, here is some fictitious sample data with line numbers. Separators and quotes have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV"))
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)
names(data) <- unlist(header)
As in the intro of this post, I have made a couple of functions which make it easier to get the headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe header names (no æøå, %, etc.).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for (i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i] - 1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
What I struggle with is how to read in all the datasets in one go. The last set in particular is hard to wrap my head around if I write a common function, since for it I only know the line number of the header and not the number of lines of data that follow.
Also, what is the best data structure for data laid out like this? The data in the subtables are all related to each other (they can be used to normalise parts of the data). I understand that I must do manual work for each CSV I read, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all the possible solutions in this particular domain.
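For reference, one way the "read everything in one go" idea could be sketched, assuming the filepath, iHeaders, fieldVector, and Headers objects defined above; the end of the file acts as the boundary for the last dataset, which sidesteps the unknown-length problem:
# One boundary past the last line so the final dataset also has an end marker
bounds <- c(iHeaders, length(fieldVector) + 1)

read_block <- function(i) {
  blk <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = bounds[i], nrows = bounds[i + 1] - bounds[i] - 1)
  names(blk) <- Headers[[i]]
  blk
}

# A plain list of data frames, one per subtable, keeps the related tables together
datasets <- lapply(seq_along(iHeaders), read_block)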

Reading a folder of csv files with specific name endings followed by filtering with dplyr and combining output into single dataframe

I have a folder containing 13000 csv files.
Problem 1: I need to read all the files ending with -postings.csv. All the -postings.csv files have the same number of variables and the same format.
So far I have the following
name_post = list.files(pattern="-postings.csv")
for (i in 1:length(name_post)) assign(name_post[i], read.csv(name_post[i], header=TRUE))
This creates around 600 dataframes.
Problem 2: I need to filter the 600-dataframe output through the following rules:
1) column_name1 != "" (remove all empty rows)
2) column_name2 ==124 (only keep rows with values equal to 124)
So far I have only done this on a single file, but I need a way to do it for all 600 dataframes. (I use filter, which is part of the dplyr package, but I am open to other solutions.)
filter(random_name-postings.csv, column_name1 != "", column_name2 == 124)
Problem 3: I need to combine the filtering output from problem 2 into a single dataframe.
I have not done this since I have issues solving problem 2.
Any help is much appreciated :)
Rather than working with the data frames as 600 separate variables, which isn't a good idea, you can combine them into one data frame as soon as you read them in. map_df from the purrr package is a good way to do this.
library(purrr)
name_post = list.files(pattern="-postings.csv")
combined = map_df(name_post, read.csv, header = TRUE)
After that, you can perform your filtering on the combined dataset.
library(dplyr)
combined_filtered = combined %>%
  filter(column_name1 != "", column_name2 == 124)
Note that if you want to know which file each row originally came from, you could turn name_post into a named vector and use .id = "filename", which would add a filename column to your output.
names(name_post) = name_post
combined = map_df(name_post, read.csv, header = TRUE, .id = "filename")
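That filename column then behaves like any other column; for instance, you could check how many rows each file contributed after filtering (just an illustration):
combined %>%
  filter(column_name1 != "", column_name2 == 124) %>%
  count(filename)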
