How to edit an R markdown YAML header programmatically? - r

Take for example this Rmd file - https://github.com/rstudio/learnr/blob/master/inst/tutorials/ex-setup-r/ex-setup-r.Rmd
The YAML header has this -
output:
learnr::tutorial:
progressive: true
allow_skip: true
I would like to change this to -
output:
ioslides_presentation:
widescreen: true
Is there a way to make this edit programmatically i.e. Can I write some function that takes an Rmd file as input, edits the YAML header, and produces a new Rmd file?
Thanks!

I think a quick function could do this.
change_yaml_matter <- function(input_file, ..., output_file) {
input_lines <- readLines(input_file)
delimiters <- grep("^---\\s*$", input_lines)
if (!length(delimiters)) {
stop("unable to find yaml delimiters")
} else if (length(delimiters) == 1L) {
if (delimiters[1] == 1L) {
stop("cannot find second delimiter, first is on line 1")
} else {
# found just one set, assume it is *closing* the yaml matter;
# fake a preceding line of delimiter
delimiters <- c(0L, delimiters[1])
}
}
delimiters <- delimiters[1:2]
yaml_list <- yaml::yaml.load(input_lines[ (delimiters[1]+1):(delimiters[2]-1) ])
dots <- list(...)
yaml_list <- c(yaml_list[ setdiff(names(yaml_list), names(dots)) ], dots)
output_lines <- c(
if (delimiters[1] > 0) input_lines[1:(delimiters[1])],
strsplit(yaml::as.yaml(yaml_list), "\n")[[1]],
input_lines[ -(1:(delimiters[2]-1)) ]
)
if (missing(output_file)) {
return(output_lines)
} else {
writeLines(output_lines, con = output_file)
return(invisible(output_lines))
}
}
Where ... is whatever you want it to be. Meaning: if you want to replace the output: component of the yaml matter, then you give a named list as output=list(...).
If I use the rmarkdown document I used in a previous answer, then unchanged, it looks like this:
readLines("~/StackOverflow/1883604/62095186.Rmd")
# [1] "---"
# [2] "title: Hello"
# [3] "output: html_document"
# [4] "params:"
# [5] " intab: TRUE"
# [6] "---"
# [7] ""
# [8] "# Headline 1"
# [9] ""
# [10] "## Headline 2 `r if (params$intab) \"{.tabset}\"`"
# [11] ""
# [12] "### Headline 3 in a tab"
# [13] ""
# [14] "### Headline 4 in a tab"
# [15] ""
# [16] "### Headline 5 in a tab"
# [17] ""
# [18] ""
And to change the output portion, I add a nested named list as:
change_yaml_matter("~/StackOverflow/1883604/62095186.Rmd",
output=list(ioslides_presentation=list(widescreen=TRUE)))
# [1] "---"
# [2] "title: Hello"
# [3] "params:"
# [4] " intab: yes"
# [5] "output:"
# [6] " ioslides_presentation:"
# [7] " widescreen: yes"
# [8] "---"
# [9] ""
# [10] "# Headline 1"
# [11] ""
# [12] "## Headline 2 `r if (params$intab) \"{.tabset}\"`"
# [13] ""
# [14] "### Headline 3 in a tab"
# [15] ""
# [16] "### Headline 4 in a tab"
# [17] ""
# [18] "### Headline 5 in a tab"
# [19] ""
# [20] ""
You can change just about any portion of the yaml matter. (The only things you cannot change, I suspect, are if you happen to have yaml parameters named input_file or output_file. If you actually have Rmd files with those yaml top-level parameters, then you can easily rename the named arguments here to be something else, such as Mxyzptlk and something else ... you're unlikely to see those in production.)
Notes:
This did not save anything to a file, you have to do that yourself. Add output_file="path/to/new.RMd" to your call, and it will write a new file.
When you do include output_file= in the arguments, if you choose to not catch the return value, it will appear to return nothing. This is due to invisible in my return; if you really want to see and save, either capture to a variable and look at that, or wrap the function call in parens, as in (change_yaml_matter(...)).
The trick for YAML is to know that yaml:: will treat every top-level as the named element of a list, and its contents are recursively lists in the same manner. For instance,
str(yaml::yaml.load("
---
top1:
level2a:
level3a: 123
level3b: 456
level2b: 789
top2: quux
---"))
# List of 2
# $ top1:List of 2
# ..$ level2a:List of 2
# .. ..$ level3a: int 123
# .. ..$ level3b: int 456
# ..$ level2b: int 789
# $ top2: chr "quux"
To assign new values, just provide nested named lists.

I modified it slightly.
With the version given, if an existing yaml element is changed, it is moved to the end of the yaml header.
With my modification, existing elements with changed values will keep their position in the header.
change_yaml_matter <- function(input_file, ..., output_file) {
input_lines <- readLines(input_file)
delimiters <- grep("^---\\s*$", input_lines)
if (!length(delimiters)) {
stop("unable to find yaml delimiters")
} else if (length(delimiters) == 1L) {
if (delimiters[1] == 1L) {
stop("cannot find second delimiter, first is on line 1")
} else {
# found just one set, assume it is *closing* the yaml matter;
# fake a preceding line of delimiter
delimiters <- c(0L, delimiters[1])
}
}
delimiters <- delimiters[1:2]
yaml_list <- yaml::yaml.load(
input_lines[ (delimiters[1]+1):(delimiters[2]-1) ])
dots <- list(...)
for (element_name in names(dots)){
if(element_name %in% names(yaml_list)) {
yaml_list[element_name] <- dots[element_name]
} else {
yaml_list <- c(yaml_list,dots[element_name])
}
}
output_lines <- c(
if (delimiters[1] > 0) input_lines[1:(delimiters[1])],
strsplit(yaml::as.yaml(yaml_list), "\n")[[1]],
input_lines[ -(1:(delimiters[2]-1)) ]
)
if (missing(output_file)) {
return(output_lines)
} else {
writeLines(output_lines, con = output_file)
return(invisible(output_lines))
}
}

Related

Apply mutate to data frame in R: iterates one extra step and causes error

I wonder why my get_http_status function iterates once more than necessary causing an exception
I have a data frame like:
> str(df5)
'data.frame': 10 obs. of 3 variables:
$ text : chr "\n" "\n" "\n" "\n" ...
$ enlace: chr "//www.blogger.com| __truncated__ ...
$ Freq : int 1 1 1 1 1 1 1 1 1 r code here
I'm trying to get the http status code for each "enlace"
Using this function:
get_http_status <- function(url){
if (!is.null(url)){
Sys.sleep(3)
print(url)
ret <- HEAD(url)
return(ret$status_code)
}
return("")
}
df44 <- mutate(df5, status = get_http_status(enlace))
but keeps trowing the error:
** Error in parse_url(url) : length(url) == 1 is not TRUE**
i can warp the function with try/catch and it works, but i don't know why the error is happening in first place.
get_http_status_2 <- function(url){
tryCatch(
expr = {
Sys.sleep(3)
print(url)
ret <- HEAD(url)
return(ret$status_code)
},
error = function(e){
return("")
}
)
}
The content of the df5$enlace is:
> df5$enlace
[1] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3"
[2] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1"
[3] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=BlogSearch&widgetId=BlogSearch1&action=editWidget&sectionId=sidebar-right-1"
[4] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1"
[5] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=PageList&widgetId=PageList1&action=editWidget&sectionId=crosscol"
[6] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Text&widgetId=Text1&action=editWidget&sectionId=sidebar-right-1"
[7] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Text&widgetId=Text2&action=editWidget&sectionId=sidebar-right-1"
[8] "http://5d4a.wordpress.com/2010/08/02/smashing-the-stack-in-2010/"
[9] "http://advancedwindowsdebugging.com/ch06.pdf"
[10] "http://beej.us/guide/
I think it iterate one time more because the result of the function is:
> df44 <- mutate(df5, status = get_http_status(enlace))
[1] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Attribution&widgetId=Attribution1&action=editWidget&sectionId=footer-3"
[2] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=BlogArchive&widgetId=BlogArchive1&action=editWidget&sectionId=sidebar-right-1"
[3] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=BlogSearch&widgetId=BlogSearch1&action=editWidget&sectionId=sidebar-right-1"
[4] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Followers&widgetId=Followers1&action=editWidget&sectionId=sidebar-right-1"
[5] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=PageList&widgetId=PageList1&action=editWidget&sectionId=crosscol"
[6] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Text&widgetId=Text1&action=editWidget&sectionId=sidebar-right-1"
[7] "//www.blogger.com/rearrange?blogID=4514563088285989046&widgetType=Text&widgetId=Text2&action=editWidget&sectionId=sidebar-right-1"
[8] "http://5d4a.wordpress.com/2010/08/02/smashing-the-stack-in-2010/"
[9] "http://advancedwindowsdebugging.com/ch06.pdf"
[10] "http://beej.us/guide/bgc/"
Error in parse_url(url) : length(url) == 1 is not TRUE
Since your function contains a function that is not vectored, use the apply family of higher order function to iterate over your vector.
Below, get_http_status will be called on each element of df$enlace.
For each call a length one character vector is expected as the return, character(1):
vapply(df5$enlace, get_http_status, character(1))

Recursive indexing of lists with variable index value per recursion step

Puh... even trying to frame the title properly already gives me a headache.
I have a config.yml with nested values and I would like to define an indexing function get_config() that accepts "path-like" value strings.
The "path entities" of the value string match the nested entity structure of the config file. Based on the path-like value the function should then go and grab the corresponding hierarchy entity (either "branches" or "leaves") from the config file.
Example
Suppose this is the structure of the config.yml:
default:
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
Here's a parsed version for you to play around with:
x <- yaml::yaml.load(
'default:
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]'
)
Accessing top-level entities is easy with config::get(value):
config::get("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
config::get("column_orders")
# [1] "hello" "world"
But I would also like to access deeper entities, e.g. column_names: col_id.
In pseudo code:
config::get("column_names:col_id")
or
config::get("column_orders/data_structure_a")
The best I could come up with so far: relying on unlist()
get_config <- function(value, sep = ":") {
if (value %>% stringr::str_detect(sep)) {
value <- value %>% stringr::str_replace(sep, ".")
configs <- config::get() %>% unlist()
configs[value]
} else {
config::get(value)
}
}
get_config("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_config("column_names:col_id")
# column_names.col_id
# "id"
Though not elegant, it works for most use cases, but fails for unnamed list entities in the config file
get_config("column_orders:data_structure_a")
# <NA>
# NA
as my indexing approach doesn't play well with the result of unlist() on unnamed lists:
config::get() %>% unlist()
# column_names.col_id column_names.col_value
# "id" "value"
# column_orders.data_structure_a1 column_orders.data_structure_a2
# "column_names/col_id" "column_names/col_value"
# column_orders.data_structure_b1 column_orders.data_structure_b2
# "column_names/col_value" "column_names/col_id"
Thus, I'd like to "go recursive" but my brain says: "no way, dude"
Due diligence
This solution comes close (I guess).
But I keep thinking that I need something like purrr::map2_if() or purrr::pmap_if() (which AFAIK don't exist) instead of purrr::map_if(), as I need to not only traverse the list behind config::get() recursively, but also a listified version of value (e.g. via stringr::str_split(value, sep) %>% unlist() %>% as.list())?
You could also use purrr::pluck to index into a nested list by name if that is what you are after:
x <- yaml::yaml.load('
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
nested_list:
element_1:
element_2:
value: "hello world"
')
purrr::pluck(x, "column_names", "col_id")
#> [1] "id"
purrr::pluck(x, "column_names")
#> $col_id
#> [1] "id"
#>
#> $col_value
#> [1] "value"
purrr::pluck(x, "column_orders", "data_structure_a")
#> [1] "column_names/col_id" "column_names/col_value"
purrr::pluck(x, "column_names", "col_notthere")
#> NULL
I came up with a solution based on Recall().
However, while digging up the internet in trying to get here, I recall having read somewhere that Recall() is generally not a very (memory) efficient way of doing recursion in R? Also would appreciate additional hints on how to do recursion the tidy way with purrr and friends.
Config file content
Being able to call get_config() implies that you have a config.yml file with above content in your project's root directory given by here::here(), but you can test get_list_element_recursively() with this workaround:
x <- yaml::yaml.load('
column_names:
col_id: "id"
col_value: "value"
column_orders:
data_structure_a: [
column_names/col_id,
column_names/col_value
]
data_structure_b: [
column_names/col_value,
column_names/col_id
]
nested_list:
element_1:
element_2:
value: "hello world"
')
Function defs
get_config <- function(value, sep = "/") {
get_list_element_recursively(
config::get(),
stringr::str_split(value, sep, simplify = TRUE)
)
}
get_list_element_recursively <- function(
lst,
el,
.el_trace = el,
.level_trace = 1
) {
# Reached leaf:
if (!is.list(lst)) {
return(lst)
}
# Element not in list:
if (!(el[1] %in% names(lst))) {
message("Current list branch:")
# print(lst)
message(str(lst))
message("Trace of indexing vec (last element is invalid):")
message(stringr::str_c(.el_trace[.level_trace], collapse = "/"))
stop(stringr::str_glue("No such element in list: {el[1]}"))
}
lst <- lst[[ el[1] ]]
if (!is.na(el[2])) {
# Continue if there are additional elements in `el` vec
Recall(lst, el[-1], .el_trace, .level_trace = 1:(.level_trace + 1))
} else {
# Otherwise return last indexing result:
lst
}
}
Testing get_config()
get_config("column_names")
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_config("column_names/col_id")
# [1] "id"
get_config("column_names/col_nonexisting")
# Current list branch:
# List of 6
# $ col_id : chr "id"
# $ col_value : chr "value"
#
# Trace of indexing vec (last element is invalid):
# column_names/col_nonexisting
# Error in get_list_element_recursively(config::get(), stringr::str_split(value, :
# No such element in list: col_nonexisting
get_config("column_orders")
# $data_structure_a
# [1] "column_names/col_id" "column_names/col_value"
#
# $data_structure_b
# [1] "column_names/col_value" "column_names/col_id"
get_config("column_orders/data_structure_a")
# [1] "column_names/col_id" "column_names/col_value"
Testing get_list_element_recursively()
get_list_element_recursively(x, c("column_names"))
# $col_id
# [1] "id"
#
# $col_value
# [1] "value"
get_list_element_recursively(x, c("column_names", "col_id"))
# [1] "id"
get_list_element_recursively(x, c("column_names", "col_notthere"))
# Current list branch:
# List of 2
# $ col_id : chr "id"
# $ col_value: chr "value"
#
# Trace of indexing vec (last element is invalid):
# column_names/col_notthere
# Error in get_list_element_recursively(x$default, c("column_names", "col_notthere")) :
# No such element in list: col_notthere

Cannot iterate through a list while I iterate through a list of lists

The input is a list of lists. Please see below. The file names is a list containing as many names as there are lists in the list (name1, name2, name3).
Each name is appended to the path: path/name1 - path/name2 - path/name3
The program iterates through the list containing the paths as it iterates through the list of lists and prints the paths with their file names. I would expect for the output to be path/name1 - path/name2 - path/name3. However I get the output below. Please see OUTPUT after INPUT
INPUT
[[1]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" "JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
OUTPUT
I would expect for the output to be path/nam1 - path/name2 - path/name3
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name2.tsv",
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name3.tsv".
However I get the output below:
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/name1.tsv"
I cannot understand why I cannot iterate through the list of paths with the file name while iterating through the list of lists. I hope this helps to clarify the problem. Could anyone help with this?
I have analyzed each statement using printing and every thing works fine except for the output of the code below
for (i in 1:length(lc)) {
for (j in 1:length(lc[[i]])) { # fetch and read files
if (j==1) {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFile, header = TRUE, sep = '/'), error=function(e) NULL)
newFile<- tryCatch(newFile, error=function(e) data.frame())
print(tmpFn[i])
} else {
newFile <- paste(dataFnsDir, lc[[i]][j], sep="/")
newFile <- tryCatch(read.delim(newFilei, header = TRUE, sep = '/'), error=function(e) NULL)
newFile <- tryCatch(newFile, error=function(e) data.frame())
newFile <- dplyr::bind_rows(newFile, newFile)
print(tmpFn[i])
}
}
}
There's no need to use nested loop. try this:
# sample data
dataFnsDir <- "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/"
lc <- list()
lc[[1]] <- c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt","160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
, "JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt" ,"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
)
lc[[2]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" , "160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
)
lc[[3]] <- c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
# actual code
lc.path.v <- paste0(dataFnsDir,unlist(lc))
# maybe this is what you want?
lc.path.v
#> [1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
#> [5] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [6] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
#> [7] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
#> [8] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
#> [9] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
If you want to read all of them and combine them together, try this(it may not work because I don't know what the data looks like):
lc.alldf <- lapply(lc.path, read.delim, header = TRUE, sep = "/")
lc.onedf <- dplyr::bind_rows(lc.alldf)
Edit:
code improved, thanks! #Onyambu
If I understand correctly, the OP wants to create 3 new files each from the file names given as character vectors in each list element.
The main issue with OP's code is that newFile is overwritten in each iteration of the nested loops.
Here is what I would with my preferred tools (untested):
library(data.table) # for fread() and rbindlist()
library(magrittr) # use piping for clarity
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
lapply(filenames, fread) %>%
rbindlist()
}
)
This will return a list of three dataframes (data.tables).
I do not have the OP's input files available but we can simulate the effect for demonstration. If we remove the second call to lapply() we will get a list of 3 elements each containing a character vector of file names with the path prepended.
lapply(
lc,
function(x) {
filenames <- file.path(dataFnsDir, x)
print(filenames)
}
)
[[1]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[4] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
[[2]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
[3] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
[[3]]
[1] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt"
[2] "/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA/160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
Data
dataFnsDir <-"/home/giuseppa/Development/glycoPipeApp/OUT/openMS/INPUT_DATA"
lc <- list(
c("150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt",
"JF_160426_Dep2Plas_tryp_Gpep_SIDtarg-(06)_PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt" ,
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt",
"JF_160426_Dep2Plas_ctryp_Gpep_SIDtargFULL__PSMs.txt"
),
c(
"150413_JF_GPeps_SIDtarg_GPstdMix_Tryp_2runs_v3_PSMs.txt",
"160824_JF_udep_tryp_Hi_SIDdda_FULL_NewParse-(05)_PSMs.txt"
)
)

Importing data from a PDF to HTML using R

Is there a way to import data from a .pdf file into HTML format using R?
I tried with the following code:
library(tm)
filename = "file.pdf"
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),language = "en",id = "id1")
head(doc)
Output in HTML displays as:
## $content
## [1] " sample data"
## [2] ""
## [3] " records"
## [4] ""
## [5] " 31 July 2017"
## [6] ""
## [7] ""
## [8] "R Markdown setup
## [9] ""
## [10] ""
## [11] "R Markdown"
## [12] ""
## [13] "This is an R Markdown document. Markdown is a simple formatting syntax for"
## [14] "authoring HTML, PDF, and MS Word documents. For more details on using R"
## [15] "Markdown see http://rmarkdown.rstudio.com."
## [16] "When you click the Knit button a document will be generated that includes"
## [17] "both content as well as the output of any embedded R code chunks within the"
## [18] "document. You can embed an R code chunk like this:"
## [19] "{r cars} summary(cars)"
Please help!
I downloaded the pdf file available here : https://fie.org/competition/2022/152/results/pools/pdf?lang=en
With the following code, I have been able to convert the PDF file to a html file :
library(RDCOMClient)
path_PDF <- "C:\\pdf_with_table.pdf"
path_Html <- "C:\\temp.html"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
ConfirmConversions = FALSE)
doc$SaveAs2(path_Html, FileFormat = 9) # saves to html
From my point of view, it would be more straightforward to extract the tables directly from the PDF or to convert the PDF to a word file and extract the tables from the word file.

Replace value table with condition in R

I have list of dataset :
> data1
[1] /index.php/search?
[2] /tabel/graphic1_.php?
[3] /mod/Layout/variableView2.php?
[4] /table/tblmon-frameee.php?
and a table:
> tes
[1] http://aladdine/index.php/search?
[2] http://aladdine/mod/params/returnParams.php
[3] http://aladdine/mod/Layout/variableView2.php
[4] http://aladdine/index.php/bos/index?
[5] http://aladdine/index.php/Bos
I want to change the value of the test table with an index on dataset which has a matching string values in the dataset.
I have tried this code:
for(i in 1:length(dataset)){
p = data[i]
for(j in 1:length(tes)){
t = tes [j]
if(grepl(p, t)){
tes[j]=i
}
else tes[j] = "-"
}
}
My expectation result like this,
> tes
[1] 1
[2] -
[3] 3
[4] -
[5] -
But, I always get warning message invalid factor level, NA generated. Why?
Thanks before.
The following code does not do exactly what you need, but effectively it should give you the same information.
data1<-c('/index.php/search?',
'/tabel/graphic1_.php?',
'/mod/Layout/variableView2.php?',
'/table/tblmon-frameee.php?')
tes<-c('http://aladdine/index.php/search?',
'http://aladdine/mod/params/returnParams.php',
'http://aladdine/mod/Layout/variableView2.php',
'http://aladdine/index.php/bos/index?',
'http://aladdine/index.php/Bos')
> lapply(data1,FUN = function(x) which(grepl(x,tes)))
[[1]]
[1] 1
[[2]]
integer(0)
[[3]]
[1] 3
[[4]]
integer(0)
For example, the first output in [[1]] tells which element in "tes" match the first element in "data1" etc...
Probably not the fastest one as i use for loop in this code but hope this provides a solution:
require(data.table)
data1<-c("/index.php/search?","/tabel/graphic1_.php?","/mod/Layout/variableView2.php?","/table/tblmon-frameee.php?")
tes<-c("http://aladdine/index.php/search?","http://aladdine/mod/params/returnParams.php" ,"http://aladdine/mod/Layout/variableView2.php","http://aladdine/index.php/bos/index?","http://aladdine/index.php/Bos")
d<-data.table(d=data1,t=tes)
d$id<-seq(1:nrow(d))
for (i in 1:nrow(d))
{
d$index[i]<-lapply(data1,FUN=function(x) {ifelse(length(grep(x,tes[i]))>0,d$id[i],"-")})[i]
}

Resources