R: Looping through a dataframe, creating a new one with additional content

Caution: I'm quite new to R, but I really would like to do this in R instead of Java.
My csv-file (Swedish redlist for species 2020 ) looks like this:
id,svenskt,latin,Organismgrupp,Kategori,Observationer,Landskapstyp,status_abbrev,Rodlistekriterium
249012,,Abia candens,stekel,Art,3,"Jordbrukslandskap (J) - Stor betydelse, Skog (S) - Har betydelse",DD,
249014,,Abia lonicerae,stekel,Art,2,Skog (S) - Stor betydelse,DD,
261452,,Abia nitens,stekel,Art,0,Jordbrukslandskap (J) - Stor betydelse,DD,
The whole csv-file can be downloaded from SLU by pressing the 'skapa csv-fil' (create csv file) button.
The only columns of interest to me are the 'id' and 'status_abbrev' columns.
I would like to use those columns to update my db-table, doing something like this:
sql<- paste("update redlist SET status_abbrev='",abbrev,"' ","where id=",id,sep="")
I read the csv-file with these commands:
library(dplyr)
redlist <- read.csv("rodlistade_arter_tampered_2.csv",header=TRUE);
dat <- select(redlist,'id', 'status_abbrev')
The result of those three lines: redlist is a dataframe containing the csv with header, and dat is a dataframe containing a subset of redlist (id and status_abbrev).
But which library would be best for iterating through the 'dat' dataframe to do something like this?
Iterating, picking out abbrev and id, and creating the string below for each row (in the end I would like to write these strings to an sql batch file and update the roughly 5,660 records):
sql<- paste("update redlist SET status_abbrev='",abbrev,"' ","where id=",id,sep="")
so that my resulting string would be like this (then iterating through the whole file):
update redlist SET status_abbrev=DD where id=249012
(screenshot of redlist and dat omitted)
best, i

Using dplyr::mutate() and glue::glue(), you can create the strings like this:
library(tidyverse)
library(glue)
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
str <- 'id,svenskt,latin,Organismgrupp,Kategori,Observationer,Landskapstyp,status_abbrev,Rodlistekriterium
249012,,Abia candens,stekel,Art,3,"Jordbrukslandskap (J) - Stor betydelse, Skog (S) - Har betydelse",DD,
249014,,Abia lonicerae,stekel,Art,2,Skog (S) - Stor betydelse,DD,
261452,,Abia nitens,stekel,Art,0,Jordbrukslandskap (J) - Stor betydelse,DD,'
df <- read_csv(str)
df2 <- df %>%
mutate(sql_string = glue("update redlist SET status_abbrev='{status_abbrev}' where id={id}"))
df2
#> # A tibble: 3 x 10
#> id svenskt latin Organismgrupp Kategori Observationer Landskapstyp
#> <dbl> <lgl> <chr> <chr> <chr> <dbl> <chr>
#> 1 249012 NA Abia… stekel Art 3 Jordbruksla…
#> 2 249014 NA Abia… stekel Art 2 Skog (S) - …
#> 3 261452 NA Abia… stekel Art 0 Jordbruksla…
#> # … with 3 more variables: status_abbrev <chr>, Rodlistekriterium <lgl>,
#> # sql_string <glue>
df2 %>% pull(sql_string)
#> update redlist SET status_abbrev='DD' where id=249012
#> update redlist SET status_abbrev='DD' where id=249014
#> update redlist SET status_abbrev='DD' where id=261452
Created on 2020-07-27 by the reprex package (v0.3.0)
Is this what you are looking for?
For database integration, have a look at DBI.
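If you do want the batch file mentioned in the question, you can write the generated statements straight out with writeLines(). And if you go the DBI route, a parameterized update avoids building SQL strings by hand altogether. A minimal sketch, assuming an SQLite database via RSQLite purely for illustration (the file and table names are placeholders):
library(DBI)
# write the generated statements to an sql batch file
writeLines(df2$sql_string, "update_redlist.sql")
# or let DBI handle the quoting with a parameterized, vectorized update
con <- dbConnect(RSQLite::SQLite(), "redlist.db")
dbExecute(con,
          "UPDATE redlist SET status_abbrev = ? WHERE id = ?",
          params = list(df2$status_abbrev, df2$id))
dbDisconnect(con)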

Related

Web scraping an interactive chart using R

I'm new to web scraping and am trying to scrape the data from this interactive chart using R so that all the series are displayed in a single table: https://www.e61.in/spendtracker
I've used developer tools in chrome (inspect - network - fetch/XHR) but cannot find the data points.
I would be highly appreciative if someone could take a quick look and let me know: a) whether the data points are stored on the page somewhere, b) if possible, how they identified the right file, and c) whether it is a reasonably straightforward task to then generate a table?
Continuing from that iframe url -
Before switching to R & rvest you should check the actual page source and perhaps run it through some beautifier. You'll see a Plotly.newPlot() call; check how it gets the array of data series as its 2nd parameter. One option would be to extract that piece of JavaScript with a regex, parse it as JSON and work from there.
Perhaps something like this:
library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(jsonlite)
library(purrr)
library(ggplot2)
url <- "https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html"
html <- read_html(url)
# extract the last <script> tag, which contains the Plotly.newPlot() call and the data series
plotly_js <- html %>%
html_element("script:last-of-type") %>%
html_text()
# extract the array from the js string, using \Q and \E so the special chars need no escaping
p_dataseries <- str_extract(plotly_js, '\\Q[{"connectgaps"\\E.*?\\Q"type":"scatter"}]\\E' )
# parse extracted string
ds_j <- fromJSON(p_dataseries,simplifyVector = FALSE)
# extract data, result will be in long format
df <- map_df(ds_j, `[`, c("name", "x", "y")) %>%
unnest(c(x,y)) %>%
mutate(date = as.POSIXct(x))
str(df)
#> tibble [2,346 × 4] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:2346] "Total" "Total" "Total" "Total" ...
#> $ x : chr [1:2346] "2020-01-12T00:00:00" "2020-01-19T00:00:00" "2020-01-26T00:00:00" "2020-02-02T00:00:00" ...
#> $ y : num [1:2346] 100 100.1 100.7 99.3 97.8 ...
#> $ date: POSIXct[1:2346], format: "2020-01-12" "2020-01-19" ...
head(df)
#> # A tibble: 6 × 4
#> name x y date
#> <chr> <chr> <dbl> <dttm>
#> 1 Total 2020-01-12T00:00:00 100 2020-01-12 00:00:00
#> 2 Total 2020-01-19T00:00:00 100. 2020-01-19 00:00:00
#> 3 Total 2020-01-26T00:00:00 101. 2020-01-26 00:00:00
#> 4 Total 2020-02-02T00:00:00 99.3 2020-02-02 00:00:00
#> 5 Total 2020-02-09T00:00:00 97.8 2020-02-09 00:00:00
#> 6 Total 2020-02-16T00:00:00 100. 2020-02-16 00:00:00
p <- df %>%
ggplot(aes(x = date, y = y, color = name)) +
geom_path() +
theme_minimal()
p
Created on 2022-09-27 with reprex v2.0.2
You're trying to scrape the wrong URL - the one you've provided uses an iframe with the chart. You should take a close look at the source code of this page instead (the iframe source): https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html
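If you want to locate that iframe source programmatically rather than by reading the page source, a small rvest sketch like this may help, assuming the iframe actually appears in the static HTML of the outer page (on some sites it is injected by JavaScript, in which case the browser's network tab is the way to find it):
library(rvest)
outer <- read_html("https://www.e61.in/spendtracker")
# pull the src attribute of the first iframe, if one is present in the static HTML
iframe_src <- outer %>%
  html_element("iframe") %>%
  html_attr("src")
iframe_src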

Trying to generate ASV table from phyloseq

I recognize most people have the opposite problem. But I'm trying to create an ASV table, with column names as "identified OTUs" (i.e. the column name is drawn from the taxonomy information in GlobalPatterns@tax_table, rather than just being the assigned OTU code that's encoded in GlobalPatterns@otu_table), and row names as sample name.
I also want to append the metadata to the end of the ASV table, to allow for analysis based on said metadata.
I managed to generate a table without the taxonomic information with this code, using GlobalPatterns for reproducibility:
data(GlobalPatterns)
asv.matrix <- as.matrix(GlobalPatterns@otu_table@.Data)
asv <- data.frame(t(asv.matrix)) # transposing to make sample name the row name
meta.df <- as.data.frame(GlobalPatterns@sam_data)
asv.full <- data.frame(asv, meta.df)
write.csv(asv.full, file = "full_asv.csv", quote = FALSE)
However, I can't figure out how to force taxonomy information into the column names, which makes the ASV table functionally useless for analysis.
EDIT:
My preferred format is as below (abbreviated, with faked metadata appended). I tried to make a table and failed, so here is a fake code chunk.
Sample-ID / Species1 / Species2 / ...etc... / Metadata1 / Metadata2 /...etc... /
--------- / -------- / -------- / --------- / --------- / --------- /--------- /
Sample1 / 1 / 5 / ...etc... / lake / summer /...etc... /
Sample2 / 4 / 0 / ...etc... / bog / spring /...etc... /
I think you're looking for the phyloseq::psmelt function, which combines the otu_table, tax_table and sample_data tables into a single, long format table that is suitable for analysis.
One way of dealing with unresolved taxonomy is to assign the highest known taxonomy to any unresolved level. You can use the name_na_taxa function from the fantaxtic package for this, prior to using psmelt.
EDIT
After seeing your updated post, I understand a bit better what you want. You can take the output from psmelt and pivot this into a semi-wide format; see the code chunk below.
require("phyloseq")
require("fantaxtic")
require("tidyverse")
# Load data
data(GlobalPatterns)
# Generate (unique) species names using fantaxtic
ps <- name_na_taxa(GlobalPatterns)
ps <- label_duplicate_taxa(ps, tax_level = "Species", asv_as_id = T)
# Convert to long data format
ps_long <- psmelt(ps)
# Convert to semi-wide data format where each column has a taxon name
# and contains the abundance in each sample
meta_vars <- sample_variables(ps)
ps_wide <- ps_long %>%
select(all_of(meta_vars), Species, Abundance) %>%
pivot_wider(names_from = Species,
values_from = Abundance)
# Inspect the final table
head(ps_wide)
#> # A tibble: 6 x 19,223
#> X.SampleID Primer Final_Barcode Barcode_truncate~ Barcode_full_le~ SampleType
#> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 AQC4cm ILBC_17 ACAGCT AGCTGT CAAGCTAGCTG Freshwate~
#> 2 LMEpi24M ILBC_13 ACACTG CAGTGT CATGAACAGTG Freshwater
#> 3 AQC7cm ILBC_18 ACAGTG CACTGT ATGAAGCACTG Freshwate~
#> 4 AQC1cm ILBC_16 ACAGCA TGCTGT GACCACTGCTG Freshwate~
#> 5 M31Tong ILBC_10 ACACGA TCGTGT TGTGGCTCGTG Tongue
#> 6 M11Fcsw ILBC_05 AAGCTG CAGCTT CGACTGCAGCT Feces
#> # ... with 19,217 more variables: Description <fct>,
#> # `Unknown Stramenopiles (Order) 549656` <dbl>,
#> # `Unknown Dolichospermum (Genus) 279599` <dbl>,
#> # `Unknown Neisseria (Genus) 360229` <dbl>,
#> # `Unknown Bacteroides (Genus) 331820` <dbl>,
#> # `Haemophilusparainfluenzae 94166` <dbl>,
#> # `Unknown ACK-M1 (Family) 329744` <dbl>, ...
Created on 2022-09-26 by the reprex package (v2.0.1)
Note that this will potentially lead to a table with thousands of columns (about 20k in the case of GlobalPatterns), which might be hard to work with.
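If you still want the CSV from your original code, ps_wide can be written out directly once it exists (a minimal sketch; the file name is a placeholder):
write.csv(ps_wide, file = "full_asv_wide.csv", row.names = FALSE)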

Add row including each variable value to existing dataframe

I have a data frame with 80 existing rows and 6 variables; they are:
Row_ID
CatName
CatAge
Request
Friends
ID,
and I need to add some outliers to this generated dataset by appending a row of specific data to the end.
I attempted the following but it does not work. Any tips on how to get this to work?
```{r, create row 1, echo=TRUE,include=TRUE}
Cat_dataframe %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
```
Your command looks pretty good to me:
library(tidyverse)
df <- tribble(~"Row_ID", ~"CatName", ~"CatAge", ~"Request", ~"Friends", ~"ID",
"1", "name1", "31", "request1", "2", "051245")
df %>%
add_row(Row_ID = "30",CatName = "Carla",CatAge="30",Request="30",Friends="8",ID="500000")
#> # A tibble: 2 × 6
#> Row_ID CatName CatAge Request Friends ID
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 name1 31 request1 2 051245
#> 2 30 Carla 30 30 8 500000
Created on 2022-04-03 by the reprex package (v2.0.1)
You may have an issue with your chunk title (i.e. try {r create_row_1, echo=TRUE, include=TRUE} instead of {r, create row 1, echo=TRUE,include=TRUE}) and you may have an issue with different data types, e.g. if "CatAge" is an integer in your original dataframe and a character string in your add_row() command (age=31 and age="31" are different types).
If you edit your original question to include the error message(s) you're getting, it will very likely make it easier to troubleshoot your problem.
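To illustrate the type issue mentioned above: if CatAge were numeric in the existing dataframe, supplying it as a string in add_row() would fail, while matching the existing type works. A small sketch with made-up values (the numbers are placeholders):
library(tidyverse)
df_num <- tibble(Row_ID = 1L, CatName = "name1", CatAge = 31L)
# fails: CatAge is integer in df_num but supplied as character here,
# so the row bind errors because <integer> and <character> can't be combined
# df_num %>% add_row(Row_ID = 30L, CatName = "Carla", CatAge = "30")
# works: the new value matches the existing column type
df_num %>% add_row(Row_ID = 30L, CatName = "Carla", CatAge = 30L)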

How to decode base64 strings in a vectorized way within dplyr::mutate?

I have a tibble which contains a column of base64-encoded strings like so:
mytib <- tibble(encoded_var = c("VGVzdGluZ3Rlc3Rpbmc=", "QW5vdGhlcnRlc3Q="))
When I try to decode it with base64::base64decode
mytib %>%
mutate(decoded_var = base64decode(encoded_var))
I receive an error:
Error in `mutate()`:
! Problem while computing `decoded_var = base64decode(encoded_var)`.
x `decoded_var` must be size 2 or 1, not 25.
I'm looking to have a tibble with a column of decoded, human-readable base64 strings. I'd also like to do that using the mutate tidyverse syntax. How can I achieve that?
Update: The tibble should look like this
# A tibble: 2 × 2
encoded_var decoded_var
<chr> <chr>
1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
2 QW5vdGhlcnRlc3Q= Anothertest
base64enc::base64decode produces a raw vector, so you need to carry out the conversion rowwise and wrap the result with rawToChar:
mytib %>%
rowwise() %>%
mutate(decoded_var = rawToChar(base64decode(encoded_var)))
#> # A tibble: 2 x 2
#> # Rowwise:
#> encoded_var decoded_var
#> <chr> <chr>
#> 1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
#> 2 QW5vdGhlcnRlc3Q= Anothertest
The problem is that the caTools::base64decode function only works on one string at a time, because a single string could contain several values. If you always have a single character value in your variable, then you can vectorize it:
library(tidyverse)
mytib <- tibble(encoded_var = c("VGVzdGluZ3Rlc3Rpbmc=", "QW5vdGhlcnRlc3Q="))
mytib %>%
mutate(decoded_var = Vectorize(caTools::base64decode)(encoded_var, "character"))
#> # A tibble: 2 × 2
#> encoded_var decoded_var
#> <chr> <chr>
#> 1 VGVzdGluZ3Rlc3Rpbmc= Testingtesting
#> 2 QW5vdGhlcnRlc3Q= Anothertest
Created on 2022-03-14 by the reprex package (v2.0.1)
EDITED TO ADD: Actually, there are (at least) four different packages that provide base64decode functions. I used caTools. There are also versions in the processx, xfun and base64enc packages. (The one in xfun is actually named base64_decode.) This is why it's important to show reproducible code here on StackOverflow. The reprex package makes this very easy.
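For completeness, here is one more vectorized take that avoids rowwise() and Vectorize() by mapping over the elements explicitly with purrr; a sketch using base64enc, whose base64decode() returns a raw vector that rawToChar() turns back into text:
library(tidyverse)
mytib <- tibble(encoded_var = c("VGVzdGluZ3Rlc3Rpbmc=", "QW5vdGhlcnRlc3Q="))
mytib %>%
  mutate(decoded_var = map_chr(encoded_var,
                               ~ rawToChar(base64enc::base64decode(.x))))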

How can I use map* and mutate to convert a list into a set of additional columns?

I have tried probably hundreds of permutations of this code for literally days to try to get a function that will do what I want, and I have finally given up. It feels like it should definitely be doable and I am so close!
I have tried to get back to the nub of things here with my reprex below.
Basically I have a single-row dataframe, with a column containing a list of strings ("concepts"). I want to create an additional column for each of those strings, using mutate, ideally with the column taking its name from the string, and then to populate the column with the results of a function call (it doesn't matter which function for now - I just need the infrastructure of the function to work).
I feel, as usual, like I must be missing something obvious... maybe just a syntax error.
I also wonder whether I need purrr::map at all; maybe a simpler vectorised mapping would work fine.
I feel like the fact that new columns are named ..1 rather than the concept name is a bit of a clue as to what is wrong.
I can create the data frame I want by calling each concept manually (see end of reprex) but since the list of concepts is different for different data frames, I want to functionalise this using pipes and tidyverse techniques rather than do it manually.
I've read the following questions to find help:
How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
How to mutate multiple columns with dynamic variable using purrr:map function?
(R) Cleaner way to use map() with list-columns
Add multiple output variables using purrr and a predefined function
Creating new variables with purrr (how does one go about that?)
How to compute multiple new columns in a R dataframe with dynamic names
but none of those has quite helped me crack the problem I'm experiencing. [edit: added the last question to that list, which may be the technique I need]
# load packages -----------------------------------------------------------
library(rlang)
library(dplyr)
library(tidyr)
library(magrittr)
library(purrr)
library(nomisr)
# set up initial list of tibbles ------------------------------------------
df <- list(
district_population = tibble(
dataset_title = "Population estimates - local authority based by single year",
dataset_id = "NM_2002_1"
),
jsa_claimants = tibble(
dataset_title = "Jobseeker\'s Allowance with rates and proportions",
dataset_id = "NM_1_1"
)
)
# just use the first tibble for now, for testing --------------------------
# ideally I want to map across dfs through a list -------------------------
df <- df[[1]]
# nitty gritty functions --------------------------------------------------
get_concept_list <- function(df) {
dataset_id <- pluck(df, "dataset_id")
nomis_overview(id = dataset_id,
select = c("dimensions", "codes")) %>%
pluck("value", 1, "dimension") %>%
filter(!concept == "geography") %>%
pull("concept")
}
# get_concept_list() returns the strings I need:
get_concept_list(df)
#> [1] "time" "gender" "c_age" "measures"
# Here is a list of examples of types of map* that do various things,
# none of which is what I need it to do
# I'm using toupper() here for simplicity - ultimately I will use
# get_concept_info() to populate the new columns
# this creates four new tibbles
get_concept_list(df) %>%
map(~ mutate(df, {{.x}} := toupper(.x)))
#> [[1]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#>
#> [[2]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 GENDER
#>
#> [[3]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 C_AGE
#>
#> [[4]]
#> # A tibble: 1 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this throws an error
get_concept_list(df) %>%
map_chr(~ mutate(df, {{.x}} := toupper(.x)))
#> Error: Result 1 must be a single string, not a vector of class `tbl_df/tbl/data.frame` and of length 3
# this creates three extra rows in the tibble
get_concept_list(df) %>%
map_df(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 4 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#> 2 Population estimates - local authority based by single year NM_2002_1 GENDER
#> 3 Population estimates - local authority based by single year NM_2002_1 C_AGE
#> 4 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this does the same as map_df
get_concept_list(df) %>%
map_dfr(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 4 x 3
#> dataset_title dataset_id ..1
#> <chr> <chr> <chr>
#> 1 Population estimates - local authority based by single year NM_2002_1 TIME
#> 2 Population estimates - local authority based by single year NM_2002_1 GENDER
#> 3 Population estimates - local authority based by single year NM_2002_1 C_AGE
#> 4 Population estimates - local authority based by single year NM_2002_1 MEASUR~
# this creates a single tibble 12 columns wide
get_concept_list(df) %>%
map_dfc(~ mutate(df, {{.x}} := toupper(.x)))
#> # A tibble: 1 x 12
#> dataset_title dataset_id ..1 dataset_title1 dataset_id1 ..11 dataset_title2
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Population e~ NM_2002_1 TIME Population es~ NM_2002_1 GEND~ Population es~
#> # ... with 5 more variables: dataset_id2 <chr>, ..12 <chr>,
#> # dataset_title3 <chr>, dataset_id3 <chr>, ..13 <chr>
# function to get info on each concept (except geography) -----------------
# this is the function I want to use eventually to populate my new columns
get_concept_info <- function(df, concept_name) {
dataset_id <- pluck(df, "dataset_id")
nomis_overview(id = dataset_id) %>%
filter(name == "dimensions") %>%
pluck("value", 1, "dimension") %>%
filter(concept == concept_name) %>%
pluck("codes.code", 1) %>%
select(name, value) %>%
nest(data = everything()) %>%
as.list() %>%
pluck("data")
}
# individual mutate works, for comparison ---------------------------------
# I can create the kind of table I want manually using a line like the one below
# df %>% map(~ mutate(., measures = get_concept_info(., concept_name = "measures")))
df %>% mutate(., measures = get_concept_info(df, "measures"))
#> # A tibble: 1 x 3
#> dataset_title dataset_id measures
#> <chr> <chr> <list>
#> 1 Population estimates - local authority based by sin~ NM_2002_1 <tibble [2 x ~
Created on 2020-02-10 by the reprex package (v0.3.0)
Using !! and := lets you dynamically name columns. Then, we can reduce the list output of map() with reduce(), which left_join()s all the dataframes in the list by the dataset title and id columns.
df_2 <-
map(get_concept_list(df),
~ mutate(df,
!!.x := get_concept_info(df, .x))) %>%
reduce(left_join, by = c("dataset_title", "dataset_id"))
df_2
# A tibble: 1 x 6
dataset_title dataset_id time gender c_age measures
<chr> <chr> <list<df[,2]>> <list<df[,2]>> <list<df[,2]>> <list<df[,2]>>
1 Population estimates - local authority based by single year NM_2002_1 [28 x 2] [3 x 2] [121 x 2] [2 x 2]
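Since get_concept_info() needs a live nomisr API call, here is a self-contained sketch of the same !! / := / reduce() pattern using the toupper() stand-in from the question, so the dynamic column naming can be seen without the API (the column contents are just the upper-cased concept names):
library(dplyr)
library(purrr)
df <- tibble(
  dataset_title = "Population estimates - local authority based by single year",
  dataset_id = "NM_2002_1"
)
concepts <- c("time", "gender", "c_age", "measures")
map(concepts, ~ mutate(df, !!.x := toupper(.x))) %>%
  reduce(left_join, by = c("dataset_title", "dataset_id"))
# A tibble: 1 x 6, with new columns time, gender, c_age and measures
# containing "TIME", "GENDER", "C_AGE" and "MEASURES"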
