Merging 1000+ .CSV files and frequency analysis in R

So I'm back with an even more adventurous approach to manipulating my thousands of .CSV files with R. I can import, merge every ten files, rename column headers, save a new .CSV, etc., but the result is still too cumbersome to manipulate analytically.
What I need is this: every 10 files put into a matrix OR merged into a single file (see below for an example file). The columns are Frequency and Channel A (and later Channel B); simply F, A, and B, and the F values are the same for every file (hence I was thinking matrix). In the end I'll end up with headers
| *F* | *A1* | *B1* | *A2* | *B2* | *A3* | *B3* |
etc... to 10.
Inside the matrix/bind_cols loop, is it possible, before write.csv, to apply some math functions to the values A1-A10? A few new columns with the average (mean) for each Frequency, for instance. I need others too, but I'll sort that out myself.
+-----------+-----------+
| Frequency | Channel A |
| (MHz)     | (dBV)     |
+-----------+-----------+
0.00000000,-27.85117000
0.00007629,-28.93283000
0.00015259,-32.89576000
0.00022888,-43.54568000
---
Continued...
---
19.99977312,-60.59710000
19.99984941,-48.58142000
19.99992571,-43.29094000
Thanks for your time. I know I've spent too long debugging, and now I'm looking for a more elegant method.
PS: How's my formatting? Table and .CSV style blunder!

Tough to answer without a better example of what each file looks like, and what you want your output to be, as well as some example code.
Are the files small enough that you can load all 1000 at once?
Something like the following is where I would start if it were me.
library(data.table)

# match only files ending in ".csv" (note the escaped dot)
filenames <- list.files(pattern = "\\.csv$")

# read each file into one element of a list
list_data <- vector(mode = "list", length = length(filenames))
for (i in seq_along(filenames)) {
  list_data[[i]] <- fread(filenames[i])
}
dat <- rbindlist(list_data, use.names = TRUE, fill = TRUE)
After that, you can use all of the useful data.table features.
dat[, .(meanA = mean(A), stdevA = sd(A)), by = Frequency]
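If you do want the wide F | A1 | B1 | A2 | B2 | ... layout from the question, here is a sketch (untested, and it assumes each file ends up with columns named Frequency, A, and B):

# re-bind with an id column recording which file each row came from
dat <- rbindlist(list_data, use.names = TRUE, fill = TRUE, idcol = "file")

# one row per Frequency, one A (and B) column per file: A_1, B_1, A_2, ...
wide <- dcast(dat, Frequency ~ file, value.var = c("A", "B"))

# row-wise statistics across the A columns, e.g. the mean at each Frequency
acols <- grep("^A_", names(wide), value = TRUE)
wide[, meanA := rowMeans(.SD), .SDcols = acols]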

Related

Applying a function to each row of a dataframe, merging data and writing to a csv file in R

Goal
I have a dataframe in R. My goal is, for each row of the dataframe, to perform an API call that retrieves data in csv format, then merge the two dataframes and append the newly formatted row to a csv file.
I've been using the apply function to loop over each row of the original df, but I've been struggling to merge the two resulting dfs and append them to the file.
First, the row input to my get_temp function seems to be a character vector. The response from the API call seems to be automatically converted to a dataframe by the read_csv function. My goal is just to merge the two side by side. I guess I would need to convert the row vector to a dataframe, but I'm not sure this is the right approach, and I have been struggling to do that.
Here's the bit of relevant code I wrote to try to achieve that:
get_temp <- function(row) {
  uri <- build_uri(latitude, longitude, start_date)
  response <- read_csv(uri)
  merged <- merge(response, row, by = 0)
  write.table(merged, file = "temp_and_cases.csv", append = TRUE, sep = ",")
}
row is the current row from the cases_by_region df.
response is the result of the API call.
cases_by_region was obtained via cases_by_region <- read_csv("cases_by_region.csv") and looks like this:
State   | admin2  | province_state | lat      | long      | combined_key         | day        | number_of_cases | total_cases | Abbreviation
Alabama | Autauga | Alabama        | 32.53953 | -86.64408 | Autauga, Alabama, US | 2020-04-03 | 12              | 72          | AL
The response from the API is a csv that looks like this:
Address,Date time,Minimum Temperature,Maximum Temperature,Temperature,Dew Point,Relative Humidity,Heat Index,Wind Speed,Wind Gust,Wind Direction,Wind Chill,Precipitation,Precipitation Cover,Snow Depth,Visibility,Cloud Cover,Sea Level Pressure,Weather Type,Latitude,Longitude,Resolved Address,Name,Info,Conditions
"32.53953,-86.64408","04/03/2020",7.5,25.2,15.7,8.4,67.85,,8.7,,35.79,,0,0,,15.8,7.1,1016.3,"",32.53953,-86.64408,"32.53953,-86.64408","","","Clear"
So far the code I wrote does append to the csv file, but not at all in the expected way (strange header merging and no row values...):
"Row.names","Address","Date time","Minimum Temperature","Maximum Temperature","Temperature","Dew Point","Relative Humidity","Heat Index","Wind Speed","Wind Gust","Wind Direction","Wind Chill","Precipitation","Precipitation Cover","Snow Depth","Visibility","Cloud Cover","Sea Level Pressure","Weather Type","Latitude","Longitude","Resolved Address","Name","Info","Conditions","y"
This is the part that applies the function get_temp to every row of cases_by_region:
temps <- apply(cases_by_region, 1, get_temp)
How can I merge row and the response from the API side by side, and then append each created line to the temp_and_cases.csv file in the right csv format? (An extra issue is that the column names should be written to the csv file only on the first API call.)
Try using this:
get_temp <- function(row) {
  uri <- build_uri(row$latitude, row$longitude, row$start_date)
  response <- readr::read_csv(uri)
  cbind(row, response)
}

temps <- do.call(rbind,
                 lapply(split(cases_by_region, seq(nrow(cases_by_region))),
                        get_temp))
write.csv(temps, "temp_and_cases.csv", row.names = FALSE)
Using split we break cases_by_region into one-row dataframes, pass each row to the get_temp function, cbind the row and the response, and finally combine everything into one dataframe. We then write the csv only once, at the end, for the entire file.
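If you really do need to append inside the loop instead (say, for very long runs you don't want to lose), one base-R pattern, sketched here with a hypothetical helper name, writes the header only when the file doesn't exist yet:

append_row <- function(df, path = "temp_and_cases.csv") {
  write.table(df, file = path, sep = ",",
              append = file.exists(path),      # append after the first call
              col.names = !file.exists(path),  # header only on the first call
              row.names = FALSE)
}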

Putting Data Frames into Complex Hierarchical Lists in a For-Loop (R)

I have created a list hierarchy, and now I'm trying to put data frames into those lists, but I'm not sure of the syntax that allows me to put a data frame into a list.
Essentially, what I'm trying to achieve is this:
PlanetList[["Mars"]][["Mars vs Venus"]]["PlanetDataFrame"] <- CurrentDataFrame
That goes well, provided I have run the loop once so that it fails but still assigns a data frame to the CurrentDataFrame variable for me to mess about with.
But in my for-loop like this:
PlanetList[[Planets[i]]][[vsList]]["PlanetDataFrame"] <- CurrentDataFrame
It doesn't assign my data frame to the correct sublist; in fact, from what I can tell, it's making a new one and writing into that.
I can't work out what to change, but I'm assuming the above line of code is the crux of my problem.
Here's the full for-loop:
for (i in 1:length(Planets)) {
  for (j in 1:length(Planets)) {

    # here I create the 'X vs Y' (e.g. Mars vs Venus) variable to
    # direct my newly retrieved data frame into the correct sublist
    # later on
    vsList <- paste0(Planets[i], ' vs ', Planets[j])

    # obviously - file path. All good here.
    FilePath <- file.path(paste0(SourceDir, '/', Planets[i]))

    # get my data frame full of exciting numbers
    PlanetDataFrame <- read_delim(file.path(paste0(FilePath, "Correlations_file_for_", Planets[i], '_vs_', Planets[j], '_threshold_0.0.txt')),
                                  ' ', trim_ws = TRUE, col_names = FALSE)

    # put that data frame in a corresponding list (or not)
    PlanetList[[Planets[i]]][[vsList]]["PlanetDataFrame"] <- PlanetDataFrame
  }
}
Additional Issue: Not only is this creating a new list entirely and ignoring those I have already made, but I also get the following warning:
number of items to replace is not a multiple of replacement length
I made this list hierarchy using the following:
Planets = c('Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune')
DataSets <- c('PlanetDataFrame', 'PlanetMatrix', 'PlanetTable')
PlanetList <- lapply(Planets, function(el_PlanetList_outer) {
  ## create a list with |Planets| elements
  ## this we do by first creating an inner list vector(.)
  ## and then repeating it |Planets| times
  ret <- rep(tibble(setNames(vector("list", length(DataSets)),
                             DataSets)),
             length(Planets))
  setNames(ret,  ## ret has no proper names yet
           paste(el_PlanetList_outer, " vs ", Planets))
})
names(PlanetList) <- Planets ## could have been done with setNames as well
str(PlanetList)
I thought that specifying 'tibble' rather than 'list' (6th line of text down in the above block) would allow me to put a dataframe in there?
I apologise as this may have now turned this question into a bit of a hydra which wasn't my original intent.
So, in the hope that all the information required to help me is presented above, and at the risk of wasting a few hundred more bits:
All I'm trying to achieve is to retrieve a file like this:
Correlations_file_for_Saturn_vs_Mercury_threshold_0.0.txt
And then pop it neatly into its correct list / sublist:
PlanetList > Saturn > Saturn vs Mercury > PlanetDataFrame
An abbreviated view of my file hierarchy showing the first planet:
Planet List -
|
|-- Mercury -
| |
| |-- Mercury vs Venus -
| | |
| | |--PlanetDataFrame
| | |
| | |--PlanetMatrix
| | |
| | |--PlanetTable
| | |
... ... ...
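A note on the likely culprits, sketched from the code above rather than tested: the sublist names never match, because paste(el_PlanetList_outer, " vs ", Planets) uses the default sep = " " and produces "Mars  vs  Venus" (double spaces), while the loop builds paste0(Planets[i], ' vs ', Planets[j]), i.e. "Mars vs Venus", so assigning by that name creates a brand-new element. Also, the single-bracket ["PlanetDataFrame"] assignment treats the data frame as a list of columns, which is where the "number of items to replace" warning comes from. The corrected assignment would look like:

# build the sublist names identically in both places (no extra spaces)
vsList <- paste0(Planets[i], ' vs ', Planets[j])

# use [[ ]] throughout: [ ] on a list assigns into a one-element sublist,
# [[ ]] assigns the element itself
PlanetList[[Planets[i]]][[vsList]][["PlanetDataFrame"]] <- PlanetDataFrame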

Importing Multiple Excel Spreadsheets into R

I have an Excel file with hundreds of spreadsheets.
I have read a few postings on Stack Overflow explaining how to import them into R using the read.xl packages and so on...
But I need to do something extra for this file. Each spreadsheet has 2 header rows at the top, and the first header row contains a 5-digit number that I need to extract and insert into the table.
For example, the header has 11111 ABC Corp. with its dataset below it.
It should look like this:
11111 ABC Corp.
Product# | Description | Quantity Order | Price | Unit Price
Here, I want to import the data as below:
ID# | Product # | Description | Quantity Order | Price | Unit Price
11111 | 2813A | Whatever | 100
11111 | 2222B
11111 | 7721CD
So as you see above, the five-digit number should be copied to the first column of the table for each spreadsheet. Each spreadsheet has a different five-digit number to be copied into its table.
I was thinking that if I have a way to extract the first five digits, then I can probably do this using a loop.
So: 1. Extract the first five digits.
2. Design a loop by which I can insert them into the first column and import into R.
What are good functions I can use?
Thank you.
R is a great tool for so, so, so many things! In this specific case, I would manipulate the data in Excel, and then import one large merged range into R. I always believe in using the right tool for the specific task you are tackling. So, start by downloading and installing the add-in from here:
https://www.rondebruin.nl/win/addins/rdbmerge.htm
So, merge all worksheets (hundreds) into one massive worksheet. Set First Cell to A2, through the last cell on the worksheet. Once all those hundreds of sheets are merged into one sheet, save it as a CSV and import it into R:
mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
The key to iteration is to solve it for one and then apply it to all. Once you've figured out how to do it for one sheet, the rest is easy.
Here is my guess based on your description of your files.
library(readxl) # to read excel files
library(readr) # for type_convert
fname <- "test.xlsx"
## get sheet names
sheets <- excel_sheets(fname)
## function to process a single sheet
processSheet <- function(sheet, file) {
  all <- read_excel(file, sheet)   # read all data
  id <- na.omit(names(all))        # extract the ID
  names(all) <- unlist(all[1, ])   # make the first row the names
  all <- all[-1, ]                 # get rid of the first row
  data.frame(ID = id,              # add id column
             type_convert(all))    # convert to appropriate column types
}

## apply the function to each sheet, collecting the results into a
## data.frame
test.data <- do.call(rbind,
                     lapply(sheets, processSheet, file = fname))
You could of course use something other than readxl to read the Excel files. Something that can read a specific range would make the re-arranging of the data easier. The reason I went with readxl is that I've found it "just works", whereas others depend on Java or Perl and tend to break more often in my experience.
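For what it's worth, newer readxl versions do accept a range argument; a sketch, where the cell references and column names are made up for illustration:

library(readxl)

# the ID lives in the first header cell, e.g. "11111 ABC Corp."
hdr <- read_excel(fname, sheet = sheets[1], range = "A1", col_names = FALSE)
id  <- sub("^(\\d{5}).*", "\\1", as.character(hdr[[1]][1]))  # keep the leading five digits

# the data block sits below the two header rows (range is illustrative)
dat <- read_excel(fname, sheet = sheets[1], range = "A3:E100",
                  col_names = c("Product", "Description",
                                "QuantityOrder", "Price", "UnitPrice"))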

Reading, merging & sorting .csv files

I'm very new to R but I do program. I'm probably just getting fed up with my own progress at this stage, so here's my issue:
Lots of large (6MB) .csv files with spectrum data that I need to analyse afterwards. I'm trying to read in the data - two columns of Frequency and Voltage (V as dB values), 500,000 data points per file. I would like to "merge" the data from the 2nd column into a new data set for every 10 files.
E.g.: 10 files, ten Frequency columns (all the same for each, so they can be ignored for the moment) and ten Voltage columns. Take the data from the Voltage in the 2nd column and merge it into a data set. If I have 10 files, I end up with one data set; 100 files, 10 data sets. Hopefully in the end each data set will have 11 columns: | Frequency | V1 | V2 | ... | V10 |. It would be nice to do an Index-Match on each file, but I'm not sure my PC will be up to it until I upgrade its resources.
This might seem quite convoluted; all suggestions welcome. Memory seems to be an issue when trying to sort through 1200 .csv files, or even just reading 100 of them. Thanks for your time!
I haven't tested this since I obviously don't have your data, but something like the code below should work. Basically, you create a vector of all the file names and then read, combine, and write 10 of them at a time.
library(reshape2)
library(dplyr)

# Get the names of all the csv files
files = list.files(pattern = "\\.csv$")

# Read, combine, and save ten files at a time in each iteration of the loop
for (i in unique((1:length(files) - 1) %/% 10)) {
  # Read ten files at a time into a list
  dat = lapply(files[(1:length(files) - 1) %/% 10 == i], function(f) {
    d = read.csv(f, header = TRUE, stringsAsFactors = FALSE)
    # Add file name as a column
    d$file = gsub("(.*)\\.csv$", "\\1", f)
    return(d)
  })
  # Combine the ten files into a single data frame
  dat = bind_rows(dat)
  # Reshape from long to wide format
  dat = dcast(dat, Frequency ~ file, value.var = "Voltage")
  # Write to csv
  write.csv(dat, paste0("Files_", i, ".csv"), row.names = FALSE)
}
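Each Files_i.csv will then hold one Frequency column plus one voltage column per file in that batch, i.e. the | Frequency | V1 | ... | V10 | layout described above, with the source file names as column headers.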
On the other hand, if you want to just combine them all into a single file in long format, which will make analysis easier (if you have enough memory of course):
# Read all files into a list
dat = lapply(files, function(f) {
  d = read.csv(f, header = TRUE, stringsAsFactors = FALSE)
  # Add file name as a column
  d$file = gsub("(.*)\\.csv$", "\\1", f)
  return(d)
})

# Combine into a single data frame
dat = bind_rows(dat)

# Save to csv
write.csv(dat, "All_files_combined.csv", row.names = FALSE)
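Once everything is in long format, per-frequency summaries (like the averages asked about above) are a single grouped operation; a sketch assuming the voltage column is literally named Voltage:

# mean and spread of the voltage at each frequency, across all files
stats = dat %>%
  group_by(Frequency) %>%
  summarise(meanV = mean(Voltage), sdV = sd(Voltage))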

Automate script for many genes with different values

I am interested in making my R script work automatically for other sets of parameters. For example:
          gene_name   start_x   end_y
file1 ->  gene1       100       200
file2 ->  gene2       150       270
My script does a trivial job, just for learning purposes. It should take the information about gene1, find a sum, and write it into a file; then it should take the information for the next gene, gene2, find its sum, and write it into a new file, and so on. Let's say I would like to name the files according to the gene names:
file_gene1.txt  # this file holds the sum of start_x + end_y for gene1
file_gene2.txt  # this file holds the sum of start_x + end_y for gene2
etc. for the rest of the 700 genes (obviously it's too much manual work to take file1, write the file name, and plug the start and end values into the already existing script).
I guess the idea is clear. I have never done this type of thing, and I guess it's very trivial, but I would appreciate it if anyone could tell me the proper name for this process so I can search and learn online how to do it.
P.S.: I think in Python I would just make a list of genes and related x/y values, then loop and select the required info, but I still don't know how I would use the gene names as file names automatically.
EDIT:
I have to supply the info about a gene's location, i.e. its start and end, which are X and Y respectively.
x = 100  # assign x to the value for the related gene
y = 150  # assign y to the value for the related gene
a = tbl[which(tbl[, 'middle'] >= x & tbl[, 'middle'] < y), ]  # for each new gene this info changes accordingly
write.table(a, file = 'gene1.txt')  # here I would need a changing file name
my thoughts:
Maybe I need to generate a file which contains all 700 gene names and the related X and Y values.
Then I read the first line of this file and supply it to my script (for the variables a, x, and y).
Then, when my computation is over, I write the results to a file named after the gene that was used to generate them.
Is it clearer now?
P.S.: I Googled it, but probably because I don't know the topic I can't find anything relevant. Just give me an idea of where I can search; I would like to learn this programming step anyway.
I guess you are looking to read all the files present in a folder (assuming all your gene files were written into a single folder by your older script). In that case you can use something like:
directory <- "C://User//Downloads//R//data"
file <- list.files(directory, full.names = TRUE)
Then access each filename using file[i] and do whatever is needed (naming the file with paste("gene", file[i], sep = "_") or reading it with read.csv(file[i])).
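To tie this back to the naming question, a minimal sketch, assuming a genes data frame holding the gene_name, start_x, and end_y columns from the example table (and the tbl from the question):

# one pass per gene: compute and write a file named after the gene
for (k in seq_len(nrow(genes))) {
  x <- genes$start_x[k]
  y <- genes$end_y[k]
  a <- tbl[which(tbl[, 'middle'] >= x & tbl[, 'middle'] < y), ]
  write.table(a, file = paste0("file_", genes$gene_name[k], ".txt"))
}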
I would divide your problem into two parts. (Sample data for a reproducible example is provided below.)
library(data.table) # v1.9.7 (devel version)
# go here for install instructions
# https://github.com/Rdatatable/data.table/wiki/Installation
1st: Apply your functions to your data by gene
output <- dt[, .(f1 = sum(start_x, end_y),
                 f2 = start_x - end_y,
                 f3 = start_x * end_y,
                 f7 = start_x / end_y),
             by = .(gene)]
2nd: Split your data frame by gene and save it in separate files
output[, fwrite(.SD, file = sprintf("%s.csv", unique(gene))),
       by = .(gene)]
Later on, you can bind the multiple files into one single data frame if you like:
# Get a list of all `.csv` files in your folder
filenames <- list.files("C:/your/folder", pattern = "\\.csv$", full.names = TRUE)

# Load and bind all data sets
data <- rbindlist(lapply(filenames, fread))
P.S.: note that fwrite is still in the development version of data.table as of today (12 May 2016).
data for reproducible example:
dt <- data.table(gene = c('id1','id2','id3','id4','id5','id6','id7','id8','id9','id10'),
                 start_x = c(1:10),
                 end_y = c(20:29))
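As a quick check on this sample, the id1 row of output works out to f1 = 21, f2 = -19, f3 = 20, f7 = 0.05, and the split step writes ten files, id1.csv through id10.csv.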
