I want to import JSON data from a fitness tracker in order to run some analysis on it. The individual JSON files are quite large, while I am only interested in specific numbers per training session (each JSON file is one training session).
I managed to read in the names of the files and to grab the interesting content out of them. Unfortunately, my code obviously does not work correctly when one or more pieces of information are missing from some of the JSON files (e.g. distance is not available because it was an indoor training session).
I stored all json files with training sessions in a folder (=path in the code) and asked R to get a list of the files in that folder:
json_files<- list.files(path,pattern = ".json",full.names = TRUE) #this is the list of files
jlist<-as.list(json_files)
Then I wrote this function to get the data I'm interested in from each single file (reading in all the content of every file at once exceeded my available RAM):
importPFData <- function(x) {
  testimport <- fromJSON(x)  # fromJSON() from the jsonlite package
  sport <- testimport$exercises$sport
  starttimesession <- testimport$exercises$startTime
  endtimesession <- testimport$exercises$stopTime
  distance <- testimport$exercises$distance
  durationsport <- testimport$exercises$duration
  maxHRsession <- testimport$exercises$heartRate$max
  minHRsession <- testimport$exercises$heartRate$min
  avgHRsession <- testimport$exercises$heartRate$avg
  calories <- testimport$exercises$kiloCalories
  VO2max_overall <- testimport$physicalInformationSnapshot$vo2Max
  return(c(starttimesession, endtimesession, sport, distance, durationsport,
           maxHRsession, minHRsession, avgHRsession, calories, VO2max_overall))
}
Next I applied this function to all elements of my list of files:
dataTest<-sapply(jlist, importPFData)
I receive a list with one entry per file, as expected. Unfortunately not all of the data was available in every file, which results in some entries having 7 elements, others having 8, 9 or 10.
I struggle with getting this into a proper data frame, as the missing information is not shown as NA or 0; it's just left out.
Is there an easy way to make the function above insert NA when a piece of information is not found in an individual JSON file (e.g. distance not available --> NA for distance for this single entry)?
Example (csv) of the content of a file with 10 entries:
"","c..2013.01.06T08.52.38.000....2013.01.06T09.52.46.600....RUNNING..."
"1","2013-01-06T08:52:38.000"
"2","2013-01-06T09:52:46.600"
"3","RUNNING"
"4","6890"
"5","PT3608.600S"
"6","234"
"7","94"
"8","139"
"9","700"
"10","48"
Example (csv) of the content of a file with only 7 entries (entries won't match up with Example 1):
"","c..2014.01.22T18.38.30.000....2014.01.22T18.38.32.000....RUNNING..."
"1","2014-01-22T18:38:30.000"
"2","2014-01-22T18:38:32.000"
"3","RUNNING"
"4","0"
"5","PT2S"
"6","0"
"7","46"
When I ran a security report through the Office 365 Admin Email Explorer to obtain detailed information about emails and their respective attack types, I downloaded the .csv file and manually used Microsoft Excel to filter rows by exact email subject and save each set to its own .csv file. Creating the individual CSV files took a long time, since there were quite a lot of emails with the same or differing subject titles.
Downloaded the .csv file from the Office 365 Admin portal with a date range of 7 days into the past.
Imported into R using the R command below:
Office_365_Report_CSV = "C:/Users/absnd/Documents/2022-11-18office365latestquarantine.csv"
Loaded the data.table package:
require(data.table)
Created a new variable to convert the data into a data-frame.
quarantine_data = fread(Office_365_Report_CSV, sep = ",", header = TRUE, check.names = FALSE)
Pulled the columns needed for filtering from the data frame:
Quarantine_Columns = quarantine_data[,c("Email date (UTC)","Recipients","Subject","Sender","Sender IP","Sender domain","Delivery action","Latest delivery location","Original delivery location","Internet message ID","Network message ID","Mail language","Original recipients","Additional actions","Threats","File threats","File hash","Detection technologies","Alert ID","Final system override","Tenant system override(s)","User system override(s)","Directionality","URLs","Sender tags","Recipient tags","Exchange transport rule","Connector","Context" )]
Steps Needed to be done (I am not sure where to go from here):
-I would like R to write individual .csv files containing rows that share the same "Subject" value, each with all of the column data from step 5.
Sub-step - ex. 1: if a row's value in the "Threats" column is "Phish", generate a file named "YYYY-MM-DD Phishing <number increment +1>.csv".
Sub-step - ex. 2: if a row's value in the "Threats" column is "Phish, Spam", generate a CSV file named "YYYY-MM-DD Phishing and Spam <number increment +1>.csv".
Step 6 and onward would group rows with the same "Subject" value and save them into a single file named according to the if-conditions in the sub-steps above.
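The grouping itself can be sketched with base R's `split()`: rows are split by Subject and each group written to its own file. A minimal sketch under assumed naming rules; the toy data frame, `threat_label()`, and the counter scheme are all illustrative, not the real report columns:

```r
# Toy stand-in for Quarantine_Columns, with only the columns the naming needs
Quarantine_Columns <- data.frame(
  Subject = c("Invoice due", "Invoice due", "Win a prize"),
  Threats = c("Phish", "Phish", "Phish, Spam")
)

# Map the Threats value to the label used in the file name (assumed rules)
threat_label <- function(threats) {
  if (threats == "Phish") "Phishing"
  else if (threats == "Phish, Spam") "Phishing and Spam"
  else gsub("[^A-Za-z0-9 ]", "", threats)  # fallback: strip unsafe characters
}

# One CSV per distinct Subject, with an incrementing counter in the name
by_subject <- split(Quarantine_Columns, Quarantine_Columns$Subject)
i <- 0
for (grp in by_subject) {
  i <- i + 1
  fname <- sprintf("%s %s %d.csv", format(Sys.Date(), "%Y-%m-%d"),
                   threat_label(grp$Threats[1]), i)
  write.csv(grp, fname, row.names = FALSE)
}
```

`split()` also works on a data.table, and `write.csv` can be swapped for `data.table::fwrite` since the package is already loaded above.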
First of all, you are looking to do this in R; RStudio is just an IDE that makes using R easier.
If you save your data frames in a list, and then set a vector of the names of the files that you want to give each of those files, you can then use purrr::walk2() to iterate through the saving. Some reproducible code as an example:
library(purrr)
library(glue)
library(readr)
mydfs_l <- list(mtcars, iris)
file_names <- c("mtcars_file", "iris_file")
walk2(mydfs_l, file_names, function(x, y) {
  write_excel_csv(x, glue("mypath/to/files/{y}.csv"))
})
I have >1000 audio files in one directory that I have been tagging for bird calls and entering the data into excel.
Once I read in this data, I filtered out covariates of interest and made separate data frames. Basically, I have about 45 files I want to analyse separately with read_wav.
I'm not sure how to create a for loop to look into a certain directory '/all_SMP' and pull out these 45 files from a list of >1000.
I've created a list, "list.clean"; however, my current for loop only lists every file in that folder (>1000 files):
for (i in c(list.clean)) {
  raw.path <- paste0("../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP")
  wav.list <- list.files(path = "../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP",
                         pattern = "*.wav",
                         recursive = TRUE)
}
I'm quite a novice with R as I'm sure you can tell.
I want to read_wav on the 45 files and use the 'analyze' function on each file
audio1 <- analyze(sample1, samplingRate = 24000, cutFreq = c(800, 8000))
Hope this makes sense!
Cheers
In list.files the parameter pattern is a regular expression, so in your case it should be pattern = ".*\\.wav$" (and it's case-sensitive).
Alternatively and easier, you can use fs::dir_ls(glob = "*.wav").
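To restrict the loop to the 45 tagged files, match the basenames of the listing against list.clean rather than looping over it. A self-contained sketch with toy paths (the directory contents and list.clean values are made up):

```r
# Toy stand-in for the real directory listing
wav.list <- c("all_AR_SMP/siteA/rec001.wav",
              "all_AR_SMP/siteA/rec002.wav",
              "all_AR_SMP/siteB/rec003.wav")
list.clean <- c("rec001.wav", "rec003.wav")  # the files tagged in Excel

# Keep only the files whose names appear in list.clean
targets <- wav.list[basename(wav.list) %in% list.clean]
targets
#> [1] "all_AR_SMP/siteA/rec001.wav" "all_AR_SMP/siteB/rec003.wav"

# With the real data this becomes:
# wav.list <- list.files("../02_Working/SMP_SM4/SMP_15sec/all_AR_SMP",
#                        pattern = ".*\\.wav$", recursive = TRUE, full.names = TRUE)
# results <- lapply(targets, function(f)
#   analyze(f, samplingRate = 24000, cutFreq = c(800, 8000)))
```

The `analyze()` call in the comment just reuses the parameters from the question; check the soundgen docs for how it expects its input.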
My company documents summaries of policies/services for each client in a pdf file. These files are combined into a large dataset each year: one row per client, with columns for the variables in the client's document. There are a couple thousand of these files, each with approximately 20-30 variables. I want to automate this process by creating a data.frame with one row per client and then pulling the variables for each client from their pdf document. I'm able to create a list or data.frame of all the clients from the pdf filenames in a directory, but I don't know how to write a loop that pulls each variable I need from each document. I currently have two different methods I can't decide between, and I also need help with a loop that grabs the variables I need for each client document. My code and links to two mock files are provided below. Any help would be appreciated!
Files: Client 1 and Client 2
Method 1: pdftools
The benefit of the first method is that it extracts the entire pdf into a vector, with each page in a separate element. This makes it easier for me to pull strings/variables. However, I don't know how to loop it to pull the information from each client and place it appropriately in a column for each client.
library(pdftools)
library(stringr)
Files <- list.files(path="...", pattern=".pdf")
Text <- lapply(Files, pdf_text)  # one element per file; each page is its own element
FR <- sapply(Text, function(x)
  str_match(x[1], "\\$\\d+\\s+Financial Reporting")[1]) #Extract the first variable
Method 2:
The benefit of this approach is it automatically creates a database for each of the client documents with file name as a row, and the each pdf in a variable. The downside is an entire pdf in a variable makes it more difficult to match and extract strings compared to having each page in its own element. I don't know how to write a loop that will extract variables for each client and place them in their respective column.
library(readtext)
DF <- readtext("directory pathway/*.pdf")
DF <- DF %>% mutate(FR =
  str_match(text, "\\$\\d+\\s+Financial Reporting"))
Here's a basic framework that I think solves your problem using your proposed Method 1.
library(pdftools)
library(stringr)
Files <- list.files(path="pdfs/", pattern=".pdf")
lf <- length(Files)
client_df <- data.frame(client = rep(NA, lf), fr = rep(NA, lf))
for (i in 1:lf) {
  # extract the text from the pdf
  f <- pdf_text(paste0("pdfs/", Files[i]))
  # remove commas from numbers
  f <- gsub(',', '', f)
  # extract variables
  client_name <- str_match(f[1], "Client\\s+\\d+")[[1]]
  fr <- as.numeric(str_match(f[1], "\\$(\\d+)\\s+Financial Reporting")[[2]])
  # add variables to your dataframe
  client_df$client[i] <- client_name
  client_df$fr[i] <- fr
}
I removed commas from the text under the assumption that any numeric variables you extract you'll want to use as numbers in some analysis. This removes all commas though, so if those are important in other areas you'll have to rethink that.
Also note that I put the sample PDFs into a directory called 'pdfs'.
I would imagine that with a little creative regex you can extract anything else that would be useful. Using this method makes it easy to scrape the data if the elements of interest will always be on the same pages across all documents. (Note the index on f in the str_match lines.) Hope this helps!
this is my first time working with XML data, and I'd appreciate any help/advice that you can offer!
I'm working on pulling some data that is stored on AWS in a collection of XML files. I have an index file that contains a list of the ~200,000 URLs where the XML files are hosted. I'm currently using the XML package in R to loop through each URL and pull the data from the node that I'm interested in. This is working fine, but with so many URLs, the loop takes around 12 hours to finish.
Here's a simplified version of my code. The index file contains the list of URLs. The parsed XML files aren't very large (stored as dat in this example...R tells me they're 432 bytes). I've put NodeOfInterest in as a placeholder for the spot where I'd normally list the XML tag that I'd like to pull data from.
for (i in 1:200000) {
  ## create URL based on the index file
  url <- paste('http://s3.amazonaws.com/', index[i, 9], '_public.xml', sep = "")
  ## load entire XML file
  dat <- xmlTreeParse(url, useInternal = TRUE)
  ## find nodes for the tag I'm interested in
  nodes <- getNodeSet(dat, "//x:NodeOfInterest", "x")
  if (length(nodes) > 0 && exists("dat")) {
    dat2 <- xmlToDataFrame(nodes)                ## create data table from nodes
    compiled_data <- rbind(compiled_data, dat2)  ## append to master file
    rm(dat2)
  }
  print(i)
}
It seems like there must be a more efficient way to pull this data. I think the longest step (by far) is loading the XML into memory, but I haven't found anything out there that suggests another option. Any advice???
Thanks in advance!
If parsing the XML into a tree is your bottleneck (in xmlTreeParse), maybe use a streaming interface like SAX, which allows you to process only those elements that are useful for your application. I haven't used it, but the package xml2 is built on top of libxml2, which provides SAX ability.
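Independent of the parser, the loop above can also be sped up by not growing compiled_data with rbind() on every iteration: collect the per-URL pieces in a list and bind once at the end. A sketch with a stub in place of the download/parse step (`fetch_one()` here just fabricates data; in the real code it would be the xmlTreeParse + getNodeSet + xmlToDataFrame sequence):

```r
# Stub for the real download-and-parse step; returns NULL when the node of
# interest is "missing", like the if-branch in the question's loop
fetch_one <- function(i) {
  if (i %% 2 == 0) return(NULL)   # simulate "NodeOfInterest not found"
  data.frame(id = i, value = i * 10)
}

n <- 10
pieces <- lapply(1:n, fetch_one)        # collect results; no rbind per iteration
compiled_data <- do.call(rbind, pieces) # bind once; NULL entries are dropped
nrow(compiled_data)
#> [1] 5
```

With the real fetch step, `lapply` could be swapped for `parallel::mclapply` to overlap the downloads, which are likely a bigger cost here than the parsing.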
I'm writing a script to plot data from multiple files. Each file is named using the same format, where strings between “.” give some info on what is in the file. For example, SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv.
These data will be from multiple sites, so SITE, or WDSD_30, or any other string, may differ depending on where the data is from, though its position in the file name will always indicate a specific feature such as location or measurement.
So far I have each file read into R and saved as a data frame named the same as the file. I'd like to get something like the following to work: if there is a data frame in the global environment that contains WDSD_30, then plot a specific column from that data frame. The column will always have the same name, so I could write plot(WDSD_30$meas), and no matter what site's files were loaded in the global environment, the script would find the WDSD_30 file and plot the meas variable. My goal for this script is to be able to point it to any folder containing files from a particular site, and no matter what the site, the script will be able to read in the data and find files containing the variables I'm interested in plotting.
A colleague suggested I try using strsplit() to break up the file name and extract the element I want to use, then use that to rename the data frame containing that element. I'm stuck on how exactly to do this or whether this is the best approach.
Here's what I have so far:
site.files <- basename(list.files(pattern = ".csv", recursive = TRUE, full.names = FALSE))
sfsplit <- lapply(site.files, function(x) strsplit(x, ".", fixed = TRUE)[[1]])
for (i in 1:length(site.files)) assign(site.files[i], read.csv(site.files[i]))
for (i in 1:length(site.files)) {
  if (grepl("PARQL", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.PARQL, sfsplit[[i]][10])
  } else if (grepl("IRBT", sfsplit[[i]][10])) {
    assign(data.frame.getting.named.IRBT, sfsplit[[i]][10])
  }
}
...and so on for each data frame I'd like to eventually plot from. Is this a good approach, or is there some better way? I'm also unclear on how to refer to the objects I made up for this example, data.frame.getting.named.xxxx, without using the entire filename as it was read into R. Is there something like data.frame[1] to generically refer to the first data frame in the global environment?
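A sketch of a list-based alternative to assign(): extract the measurement token with strsplit() and use it as the name of a list element, so the data frame can be looked up by token. The 10th-field position is taken from the code above; the example file name is the one from the question:

```r
# Extract the measurement token (assumed to be the 10th dot-separated field)
fname <- "SITE.TT.AF.000.52.000.001.002.003.WDSD_30.csv"
token <- strsplit(fname, ".", fixed = TRUE)[[1]][10]
token
#> [1] "WDSD_30"

# With the real files, one named list replaces many free-floating data frames:
# site.files <- list.files(pattern = ".csv", recursive = TRUE)
# tokens <- sapply(site.files, function(x)
#   strsplit(basename(x), ".", fixed = TRUE)[[1]][10])
# site.data <- setNames(lapply(site.files, read.csv), tokens)
# if ("WDSD_30" %in% names(site.data)) plot(site.data[["WDSD_30"]]$meas)
```

Indexing the list by name (`site.data[["WDSD_30"]]`) then works regardless of which site's files are loaded, which avoids needing to refer to "the first data frame in the global environment" at all.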