How to select for certain data in a .txt file - r

I have a .txt file imported from a weather station, produced by some pretty advanced code, and I need to sort the lines based on one piece of content within each line. Here are a few lines:
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
I basically need to be able to group together all lines with a $GPGGA, and do the same for $GPGLL, $GPVTG, and I believe 6 other types of entries that repeat. group_by() does not work, nor do select() or sort(), for obvious reasons. The data is clearly not in any organized table format, which makes this very difficult for me. How do I do this?
Here's the code I used to import the original file (I replaced my actual username with "my username"):
filefolder <- "C:\\Users\\my username\\Downloads\\"
Weather_data <- paste(filefolder, "Jul_13_2021_Weatherstation_Test_File.txt", sep = "")
Weather_data <- read.delim(Weather_data)
And here's what I have so far in my attempt:
Screenshot of what I have so far: https://i.stack.imgur.com/FSlzf.png

As you say there is no organisation in the table. I would suggest doing something with regular expressions:
df <- data.frame(text = c("13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68",
                          "13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72",
                          "13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E"))

library(dplyr)

df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_by(Entry)
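If you need each sentence type in its own data frame rather than one grouped frame, a minimal follow-up sketch (my own addition, assuming the same df and Entry column as above) could use dplyr::group_split() or base split():

df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_split(Entry)   # a list of data frames, one per entry type

# or, keeping the entry names on the list elements:
by_entry <- split(df, gsub(".*\\$([A-Z]+),.*", "\\1", df$text))
by_entry$GPGGA   # all $GPGGA lines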

Related

read a pdf-file into R without header/contents

I want to import multiple PDF files into R, but each page has 4 columns, a header/footer line, and a table of contents.
For the purpose of text mining I want to remove them from my file or character vector.
Right now I am using two functions to read in the files. The first one is pdf_text, because it keeps the pages but can't deal with the 4 columns. The second one is extract_text; on its own it doesn't keep the pages, but it can deal with the column structure (and copes decently with the tables that occur).
But neither of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal, but otherwise I had some problems with the data structures. Here is a working example:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)

files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)

createTibble <- function(){
  tibble_together <- NULL
  # for all files
  for(i in 1:length(files_name)){
    page_nr <- length(reports_list[[i]])
    tib <- tibble(report = rep(files_name[i], page_nr),
                  page = 1:page_nr,
                  text = gsub("\r\n", " ",
                              extract_text(files_name[[i]], pages = 1:page_nr)))
    tibble_together <- rbind(tibble_together, tib)
  }
  return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
library(tidytext)   # needed for unnest_tokens()
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
# e.g. this part contains the table of contents, which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks for your help in advance.
PS: it's my first question, so if you need something, let me know.
And I know that the function createTibble probably isn't optimal, but that's not my primary concern.
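One heuristic worth trying for the table-of-contents problem (my own sketch, not taken from the post): drop lines that look like ToC entries, i.e. heading text followed by leader dots and a page number, in place of the plain gsub("\r\n", " ", ...) step inside createTibble:

library(stringr)

# Assumption: ToC entries end in a run of leader dots followed by a page
# number, e.g. "Sustainability strategy .......... 12".
drop_toc_lines <- function(page_text) {
  lines <- str_split(page_text, "\r\n|\n")[[1]]
  keep  <- !str_detect(lines, "\\.{3,}\\s*\\d+\\s*$")
  paste(lines[keep], collapse = " ")
}

# e.g. text = sapply(extract_text(files_name[[i]], pages = 1:page_nr), drop_toc_lines)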

How to adjust read_excel so that the thousand separator is not changed to a decimal point?

Currently, I use the following code to store Excel files (which are stored in a folder on my PC) in a list.
decrease_names <- list.files("4_large_decreases", pattern = ".xlsx", full.names = TRUE)
decrease_list <- sapply(decrease_names, read_excel, simplify = FALSE)
After that, I combine the dataframes into one object by using the following code.
decrease <- decrease_list %>%
  keep(function(x) nrow(x) > 0) %>%
  bind_rows()
The problem I have is that the Excel files stored in the folder contain decimal points (".") as well as thousand separators (","). I think R (and read_excel() in particular) converts the thousand separators into decimal points, which results in incorrect data.
Although I know that I can remove the thousand separators in Excel first, this would result in a lot of manual work and hence I am interested in a solution that recognises the thousand separator and keeps it intact (or removes it, the goal is to keep the nature of the data correct).
EDIT: as @dario suggested, I add a snippet of a tibble that is stored in decrease_list after I run the code. The snippet looks like this:
Raised Avg. change
526.000 2.04
186.000 3.24
...
In the column Raised, the "." used to be a "," but has become a ".". The "." in Avg. change was already a ".".
Assuming that each Excel file contains data in the same format, we can apply the following code:
library(tidyverse)
library(readxl)

decrease_names <- list.files("4_large_decreases", pattern = ".xlsx", full.names = TRUE)

# 10 columns as written in your comment
decrease_list <- sapply(decrease_names, readxl::read_excel, col_types = rep("text", 10L), simplify = FALSE)

# Not tested
decrease <- decrease_list %>%
  keep(function(x) nrow(x) > 0) %>%
  bind_rows() %>%
  mutate(across(where(is.character), ~ as.numeric(gsub("\\,", "", .x))))
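As a quick, illustrative check of what that last mutate() step does to a single value (my own example, not from the answer):

as.numeric(gsub("\\,", "", "526,000"))
#> [1] 526000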

How to create a string that can be used as LHS and assigned a value to?

I feel stupid for asking such a simple question, but I am banging my head against the wall.
Why does paste0() create a string that cannot be interpreted as the name of a new object? Is there a different way of creating the LHS that would be better?
As input I have a dataframe. As an output I want to have a new filtered dataframe. This works fine as long as I manually type all the code. However, I am trying to reduce repetition, and therefore I want to create a function that does the same thing, but then it is not working anymore.
library(magrittr)
library(dplyr)   # filter() is used below

df <- data.frame(
  var_a = round(runif(20), digits = 1),
  var_b = sample(letters, 20)
)

### Find duplicates
df$duplicate_num <- duplicated(df$var_a)
df$duplicate_txt <- duplicated(df$var_b)
df # a check

### Create two lists of duplicates
list_of_duplicate_num <-
  df %>%
  filter(duplicate_num)
list_of_duplicate_num # a check

list_of_duplicate_txt <-
  df %>%
  filter(duplicate_txt)
list_of_duplicate_txt # a check
So far everything works as expected.
I would like to simplify the code and make this to a function that takes the arguments "num" or "txt". But I am having problems with creating the LHS.
The below should, in my mind, do the same as the code above.
paste0("list_of_duplicate_", "num") <-
df %>%
filter(duplicate_num)
I do get an error message:
Error in paste0("list_of_duplicate_", "num") <- df %>%
filter(duplicate_num) :
target of assignment expands to non-language object
My goal is to create a function with something like this:
make_list_of_duplicates <- function(criteria = "num") {
  paste0("list_of_duplicate_", criteria) <-
    df %>%
    filter(paste0("duplicate_", criteria))
  paste0("list_of_duplicate_", criteria) # a check
}
### Create two lists of duplicates
make_list_of_duplicates("num")
make_list_of_duplicates("txt")
and then continue with some joins etc.
I have been looking at tidy evaluation, assignments, rlang::enexpr(), base::substitute(), get(), mget() and many other things, but after two days of reading and trial and error, I am convinced that there must be another direction to look in that I am not seeing.
I am running MS Open R 4.0.2.
I am grateful for any suggestions.
Sincerely,
Eero
I found the solution to my question when I understood that it was a case of indirection. Because I was on the wrong track, I created lots of complications and made it more difficult than necessary. Thanks to @r2evans, who pointed me in the right direction. I have in the meantime decided that I will use loops instead of functions, but here is the working function:
## Example of using paste inside a function to refer to an object.
library(magrittr)
library(dplyr)

df <- data.frame(
  var_a = round(runif(20), digits = 1),
  var_b = sample(letters, 20)
)

# Find duplicates
df$duplicate_num <- duplicated(df$var_a)
df$duplicate_txt <- duplicated(df$var_b)

# SEE https://dplyr.tidyverse.org/articles/programming.html#indirection-2
make_list_of_duplicates_f2 <- function(criteria = "num") {
  df %>%
    filter(.data[[paste0("duplicate_", {{criteria}})]])
}

# Create two lists of duplicates
list_of_duplicates_f2_num <-
  make_list_of_duplicates_f2("num")
list_of_duplicates_f2_txt <-
  make_list_of_duplicates_f2("txt")

R - Irregular metadata; create df from large single column

The title doesn't really do my question justice, because there are probably a few ways to skin this cat. But I picked one approach and went with it. This is what I'm working with:
I've pulled all the metadata for a particular study in the NCBI database using the "Send to:" option on their interface and downloading a .txt file.
In total, I have ~23k samples, each with up to 609 unique questions and answers from a questionnaire totaling 8M+ obs of 1 variable when read as a .csv. To my dismay, the metadata are irregular. Some samples have 140 associated key/value pairs. Others have 492. I've included a header of a sample below.
1: qiita_sid_10317:10317.BLANK1.6H.GUELPH
Identifiers: BioSample: SAMEA4790059; SRA: ERS2609990
Organism: metagenome
Attributes:
/Alias="qiita_sid_10317:10317.BLANK1.6H.GUELPH"
/description="American Gut control"
/ENA checklist="ERC000011"
/INSDC center alias="UCSDMI"
/INSDC center name="University of California San Diego Microbiome Initiative"
/INSDC first public="2018-07-13T17:03:10Z"
/INSDC last update="2018-07-13T14:50:03Z"
/INSDC status="public"
/SRA accession="ERS2609990"
I've tried (including but not limited to):
Read .txt file (adding a delimiter hasn't made a difference, am I missing something here?)
I've tried reading the data using various delimiters
I've even removed the header data in Sublime Text, leaving only "Attributes:" and the "/"-delimited key/value pairs in order to mess with the column that way
I've split the column and found all unique values in col1 to maybe create a df from scratch, etc. etc.
Can't seem to get past the cleaning steps:
library(splitstackshape)   # for cSplit()

samples <- read.csv("~/biosample_result_full.txt")
samples_split <- cSplit(samples, splitCols = sample$Colname, sep = "=")
samples_split$Attributes_1 <- gsub(" ", "_", samples_split$Attributes_1)
questions <- unique(samples_split$Attributes_1)
Ideally, each sample and associated metadata would be transformed into rows, with each "Attribute"/question as the column name.
Any help is greatly appreciated.
I see that the website you've linked to allows for the option to export data to XML. I strongly suggest doing so: R can handle/parse XML files very efficiently.
When I download the first three results from that site to a file biosample_result.xml, it's easy to process using the xml2 package:
library( xml2 )
library( magrittr )

doc <- read_xml( "./biosample_result.xml" )

# get all BioSample nodes
BioSample.Nodes <- xml_find_all( doc, "//BioSample" )

# build a data.frame
data.frame(
  sample_name = xml_find_first( BioSample.Nodes, ".//Id[@db='SRA']" ) %>% xml_text(),
  stringsAsFactors = FALSE )
# sample_name
# 1 ERS2609990
# 2 ERS2609989
# 3 ERS2609988
So if you can use the XML, you will just have to use the right xpath syntax to get the data/nodes you need into the columns you want...
In the example above, I extracted (from each BioSample node) the first Id node whose attribute db equals SRA, and stored the result in the column sample_name.
Still assuming you can use the XML data:
If you are looking to get all attributes into one df, you need the functions from purrr, so just load the entire tidyverse:
library( tidyverse )

df <- xml_find_all( doc, "//BioSample" ) %>%
  map_df(~{
    set_names(
      xml_find_all(.x, ".//Attribute") %>% xml_text(),
      xml_find_all(.x, ".//Attribute") %>% xml_attr( "attribute_name" )
    ) %>%
      as.list() %>%
      flatten_df()
  })
will result in a df with one row per BioSample and one column per attribute.

Wrap long URL line in R Markdown

I've read the various posts on this, but I still haven't found a solution. Here's some example code:
library(dplyr)
library(lubridate)

urlfile <- 'https://raw.githubusercontent.com/blakeobeans/Predicting-Service-Calls/master/Data/nc.csv'
dates <- read.csv(urlfile, header = FALSE)
dates$V1 <- mdy(dates$V1)

dates <- dates %>%
  rename("data.time" = V1) %>%
  filter("2017-10-01" >= data.time & data.time >= "2017-06-01") %>%
  group_by(data.time) %>%
  summarise(n = n())
When I output to the PDF, the long URL line runs off the edge of the code block.
The same thing happens if I have comments in the code that run past the grey bar.
I've tried using the following line of code at the beginning:
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
But that doesn't help.
I had a similar problem when putting a package on CRAN (they give a note if an Rd file line exceeds 90 characters: "NOTE: lines wider than 90 characters"). One of the arguments to my function was a URL to a GitHub dataset. The solution was to split the URL into separate parts. For example:
urlRemote <- "https://raw.githubusercontent.com/"
pathGithub <- "blakeobeans/Predicting-Service-Calls/master/Data/"
fileName <- "nc.csv"
And you can use it in your code like this:
paste0(urlRemote, pathGithub, fileName) %>%
  read.csv(header = FALSE)
This solution has an advantage when you want to use multiple files from the same repository as you can use paste0(urlRemote, pathGithub, fileName1), paste0(urlRemote, pathGithub, fileName2), etc.
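For example, a short sketch of reading two files from the same repository this way (fileName2 is a hypothetical second file name, not from the original post):

fileName2 <- "nc2.csv"   # hypothetical second file in the same repo

dates  <- paste0(urlRemote, pathGithub, fileName)  %>% read.csv(header = FALSE)
dates2 <- paste0(urlRemote, pathGithub, fileName2) %>% read.csv(header = FALSE)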
