Using str_extract_all and unnest but losing rows from NA - r

I'm using str_extract() and str_extract_all() to do some look around regex. There are zero, one, or multiple results, so I want to unnest() the multiple results into multiple rows. The unnest does not give all rows in the output, because of the character(0) in ab_all (I'm assuming).
library(tidyverse)
my_tbl <- tibble(clmn = c("abcd", "abef, abgh", "xkcd"))
ab_tbl <- my_tbl %>%
mutate(ab = str_extract(clmn, "(?<=ab)[:alpha:]*\\b"),
ab_all = str_extract_all(clmn, "(?<=ab)[:alpha:]*\\b"),
cd = str_extract(clmn, "[:alpha:]*(?=cd)"))
ab_tbl %>% unnest(ab_all, .drop = FALSE)
# A tibble: 3 x 4
clmn ab cd ab_all
<chr> <chr> <chr> <chr>
1 abcd cd ab cd
2 abef, abgh ef NA ef
3 abef, abgh ef NA gh
Edit: Expected output:
# A tibble: 4 x 4
clmn ab cd ab_all
<chr> <chr> <chr> <chr>
1 abcd cd ab cd
2 abef, abgh ef NA ef
3 abef, abgh ef NA gh
4 xkcd NA xk NA
The row with xkcd is missing from the output. Is this caused by str_extract_all or by unnest, or should I change my approach?

Maybe we can change the list elements of length 0 to NA and then do the unnest:
library(tidyverse)
ab_tbl %>%
mutate(ab_all = map(ab_all, ~ if(length(.x) ==0) NA_character_ else .x)) %>%
unnest(ab_all)
NOTE: Assuming that the patterns in str_extract are correct
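With tidyr 1.0.0 or later, another option worth trying is the keep_empty argument of unnest(), which keeps one row of NA for each zero-length list element instead of dropping it. A minimal sketch, assuming that tidyr version and the same patterns as in the question:

```r
library(dplyr)
library(stringr)
library(tidyr)

my_tbl <- tibble(clmn = c("abcd", "abef, abgh", "xkcd"))

res <- my_tbl %>%
  # list-column: "xkcd" has no match, so its element is character(0)
  mutate(ab_all = str_extract_all(clmn, "(?<=ab)[:alpha:]*\\b")) %>%
  # keep_empty = TRUE turns the zero-length element into a single NA row
  unnest(ab_all, keep_empty = TRUE)

res
```

This avoids the explicit map() step, at the cost of requiring the newer unnest() interface (the .drop argument used in the question belongs to the older one).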

Related

How to Subset all nested lists from a specific list with the same names in R

Raw Data is an XML file:
https://drive.google.com/file/d/1WOylIDRVDSicDjPZkDL0FyoyESfHIE_9/view?usp=sharing
library(XML)
doc <- xmlParse("./p48cierre_01-01-2019.xml")
docList <- xmlToList(doc)
mylist_SeriesTemporales <- sapply(docList, '[', 'SeriesTemporales')
$IdentificacionMensaje.NA
[1] NA
$VersionMensaje.NA
[1] NA
$TipoMensaje.NA
[1] NA
$TipoProceso.NA
[1] NA
$TipoClasificacion.NA
[1] NA
I get all NAs for SeriesTemporales as well; the output above shows the same kind of result for the other elements.
I want to convert all SeriesTemporales lists into a single data frame. Please help me out.
Expected Output:
> xmlDataOut
# A tibble: 30,000 x 7
`Periodo.IntervaloTiempo.Attribute~ `Periodo.Resolucion.Attribute:~ `UnidadMedida.Attribute:~ `UPSalida.Attribute:v` `UPEntrada.Attribute:~ `TipoNegocio.Attribute~ `Periodo.Intervalo.Pos.Attribut~
<chr> <chr> <chr> <lgl> <chr> <chr> <dbl>
1 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 1
2 2021-04-20T22:00Z/2021-04-21T22:00Z PT60M MWH NA ZERBI Z21 10
I cannot see how SeriesTemporales can be converted into a data frame when its elements have different lengths. However, you can extract all SeriesTemporales elements into another list, say l2, simply by doing this:
l2 <- docList[names(docList) == 'SeriesTemporales']
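To illustrate why indexing inside each element returns NA while filtering on names() works, here is a small sketch on a made-up list (doc_list and its contents are hypothetical stand-ins, not the real XML). lapply() is used instead of sapply() only to keep the result a plain list:

```r
# Hypothetical stand-in for the parsed XML list
doc_list <- list(
  IdentificacionMensaje = "id-1",
  SeriesTemporales = list(Identificacion = "STP0"),
  SeriesTemporales = list(Identificacion = "STP1")
)

# Looks for a child called "SeriesTemporales" INSIDE every element:
# no element has one, so the results are NA / NULL
inner <- lapply(doc_list, `[`, "SeriesTemporales")

# Selects the top-level elements that are themselves named "SeriesTemporales"
outer <- doc_list[names(doc_list) == "SeriesTemporales"]

length(outer)
```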
Taking the first element of each item in l2 and binding the results into a data frame gives:
library(purrr)
map_df(l2, ~.x[1])
# A tibble: 1,256 x 1
IdentificacionSeriesTemporales
<chr>
1 STP0
2 STP1
3 STP2
4 STP3
5 STP4
6 STP5
7 STP6
8 STP7
9 STP8
10 STP9
# ... with 1,246 more rows
But the third element gives this:
map_df(l2, ~.x[3])
# A tibble: 2,512 x 2
UPSalida UPEntrada
<chr> <chr>
1 LUBAC01 NA
2 NES NA
3 FUSIC01 NA
4 NES NA
5 NA ECEGRG
6 NA NES
7 NA HYGESTE
8 NA NES
9 GNRAC01 NA
10 NES NA
# ... with 2,502 more rows
This seems like a straightforward XML document to parse. The only catch is that the information is stored in the nodes' attributes and not in the nodes themselves.
Here is an xml2 solution.
See the code comments for an explanation.
library(xml2)
library(dplyr)
page <- read_xml("p48cierre_01-01-2019.xml")
#check for namespace
xml_ns(page)
#strip namespace
xml_ns_strip(page)
#find all SeriesTemporales nodes
seriesT <- page %>% xml_find_all(".//SeriesTemporales")
#get the requested information from each parent node
#find the correct sub-node and attribute
#assuming only one sub-node per parent
Intervalo <- seriesT %>% xml_find_first(".//IntervaloTiempo") %>% xml_attr("v")
Resolution <- seriesT %>% xml_find_first(".//Resolucion") %>% xml_attr("v")
UnidadMedida <- seriesT %>% xml_find_first(".//UnidadMedida") %>% xml_attr("v")
UPSalida <- seriesT %>% xml_find_first(".//UPSalida") %>% xml_attr("v")
UPEntrada <- seriesT %>% xml_find_first(".//UPEntrada") %>% xml_attr("v")
TipoNegocio <- seriesT %>% xml_find_first(".//TipoNegocio") %>% xml_attr("v")
#combine into a final answer
head(data.frame(Intervalo, Resolution, UnidadMedida, UPSalida, UPEntrada, TipoNegocio))
I am not sure about your request for the "Pos" node: there are 24 per parent node, so they do not fit conveniently in a single data.frame. If you are only looking for the first one, follow the format above; if not, that is probably best asked as another question.
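For readers without the original file, the attribute-extraction pattern can be tried on an inline document. The XML below is a made-up fragment that only mimics the element names discussed here; its structure and values are assumptions, not the real file:

```r
library(xml2)

# Hypothetical miniature document; values live in the "v" attribute
sample_xml <- '<Mensaje>
  <SeriesTemporales>
    <UPSalida v="LUBAC01"/>
    <Periodo><Resolucion v="PT60M"/></Periodo>
  </SeriesTemporales>
  <SeriesTemporales>
    <UPEntrada v="ECEGRG"/>
    <Periodo><Resolucion v="PT60M"/></Periodo>
  </SeriesTemporales>
</Mensaje>'

page <- read_xml(sample_xml)
series <- xml_find_all(page, ".//SeriesTemporales")

# xml_find_first() returns one result per parent node, so a parent
# without the sub-node yields NA rather than shortening the vector
out <- data.frame(
  UPSalida   = xml_attr(xml_find_first(series, ".//UPSalida"), "v"),
  UPEntrada  = xml_attr(xml_find_first(series, ".//UPEntrada"), "v"),
  Resolucion = xml_attr(xml_find_first(series, ".//Resolucion"), "v"),
  stringsAsFactors = FALSE
)
out
```

Because every column has one entry per SeriesTemporales node, the vectors line up even when a sub-node is absent.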

Paste colnames by sequence

Hi, and happy new year to all.
I have a tricky task (in my opinion) and I cannot find a way to solve it.
Please see the following toy data. The original dataset has hundreds of columns and rows.
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan"),
US=c(8,2,NA,7),
UK=c(5,4,1,7))
I want to create a new column called "origin" that pastes together the column names of the non-NA cells in each row, separated by "|" and ordered by the corresponding values: higher values should be pasted first. For equal values (like Zlatan), the order isn't relevant; the output for Zlatan could be US|UK or UK|US.
This is the desired output:
I spent some hours trying to solve it, but no approach worked. Maybe it makes sense to convert the values with as.factor...
Help is much appreciated. Thank you in advance!
Here's a dplyr approach. First, we can use rowwise to work on individual rows independently. Next, we can use c_across which allows us to select values from that row only. We can subset a vector of c("US","UK") based on whether the US and UK columns are not NA.
paste with collapse = "|" allows us to put the values together with the separator. I added a row to see what would happen if they are both NA.
library(dplyr)
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK")[rev(order(c_across(US:UK), na.last = NA))], collapse = "|"))
# A tibble: 5 x 4
# Rowwise:
name US UK origin
<chr> <dbl> <dbl> <chr>
1 Amber 8 5 "US|UK"
2 Thomas 2 4 "UK|US"
3 Stefan NA 1 "UK"
4 Zlatan 7 7 "UK|US"
5 Bob NA NA ""
This is also trivially expanded to more columns:
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK","AUS")[rev(order(c_across(US:AUS), na.last = NA))], collapse = "|"))
# A tibble: 5 x 5
# Rowwise:
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Thomas 2 4 2 UK|AUS|US
3 Stefan NA 1 NA UK
4 Zlatan 7 7 NA UK|US
5 Bob NA NA 1 AUS
Or with tidyselect assistance to perform all columns but name:
test %>%
rowwise() %>%
mutate(origin = paste(names(across(-name))[rev(order(c_across(-name), na.last = NA))], collapse = "|"))
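To see what the rev(order(..., na.last = NA)) indexing in the code above does, it can help to run it on a single row in base R. The vector below is an illustrative example, not taken from the data:

```r
# One row of values, named by column
vals <- c(US = 2, UK = 4, AUS = NA)

# order() gives positions from smallest to largest;
# na.last = NA drops the NA entry entirely
idx <- order(vals, na.last = NA)

# reverse to get largest first, then look up the column names
ranked <- names(vals)[rev(idx)]

origin <- paste(ranked, collapse = "|")
origin
```

order(vals, decreasing = TRUE, na.last = NA) would avoid the rev() call; the two can differ only in how ties are ordered, which the question says does not matter.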
Another possibility with tidyverse. It is longer than the other two solutions, but it should work directly with a dataframe with as many columns as you need.
I changed the dataframe to long format, filtered out NAs, grouped by name, summarized using paste, and joined with the original dataframe to get the original columns (and rows with all NAs).
library(tidyverse)
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
# change to long format
tidyr::pivot_longer(cols=-name, names_to = "country", values_to = "value") %>%
# remove rows with NA
dplyr::filter(!is.na(value)) %>%
# group by name and sort
dplyr::group_by(name) %>% dplyr::arrange(-value) %>%
# create summary of countries for each name in column 'origin'
dplyr::summarise(origin=paste(country, collapse = "|")) %>%
# join with original data frame to include original columns (and names with only NA) and change NA to '' in origin
dplyr::right_join(test, by='name') %>% dplyr::mutate(origin=ifelse(is.na(origin), '', origin)) %>%
# move origin column to end
dplyr::relocate(origin, .after = last_col())
Result
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Bob NA NA 1 AUS
3 Stefan NA 1 NA UK
4 Thomas 2 4 2 UK|US|AUS
5 Zlatan 7 7 NA US|UK
Here's a different tidyverse solution using case_when:
library(tidyverse)
data <- data.frame(
  name = c("Amber","Thomas","Stefan","Zlatan"),
  US = c(8,2,NA,7),
  UK = c(5,4,1,7))
data <- data %>% mutate(origin = case_when( US > UK ~ "US|UK",
UK >= US ~ "UK|US",
is.na(UK) & !is.na(US) ~ "US",
is.na(US) & !is.na(UK) ~ "UK"))
data
#> name US UK origin
#> 1 Amber 8 5 US|UK
#> 2 Thomas 2 4 UK|US
#> 3 Stefan NA 1 UK
#> 4 Zlatan 7 7 UK|US
Created on 2021-01-06 by the reprex package (v0.3.0)

How to separate rows into columns based on variable number of pattern matches per row

I have a dataframe like this:
df <- data.frame(
id = c("A","B"),
date = c("31/07/2019", "31/07/2020"),
x = c('random stuff "A":88876, more stuff',
'something, "A":1234, more "A":456, random "A":32078, more'),
stringsAsFactors = F
)
I'd like to create as many new columns as there are matches to a pattern; the pattern is (?<="A":)\\d+(?=,), i.e., "match the number if you see the string "A": on the left and a comma , on the right".
The problems: (i) the number of matches may vary from row to row and (ii) the maximum number of new columns is not known in advance.
What I've done so far is this:
df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) may solve the problem of unknown number of new columns, I get a warning:
`Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x, :
replacement element 2 has 3 rows to replace 2 rows`
and the assignment of the values is clearly incorrect:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 1234 88876
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876 456 88876
The correct output would be this:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 NA NA
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 1234 456 32078
Any idea?
Here's a somewhat pedestrian stringr solution:
library(stringr)
library(dplyr)
matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols <- max(sapply(matches, length))
matches %>%
lapply(function(y) c(y, rep(NA, ncols - length(y)))) %>%
do.call(rbind, .) %>%
data.frame() %>%
setNames(paste0("A", seq(ncols))) %>%
cbind(df, .) %>%
tibble()
#> # A tibble: 2 x 6
#> id date x A1 A2 A3
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A 31/07/20~ "random stuff \"A\":88876, more stuff" 88876 <NA> <NA>
#> 2 B 31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234 456 32078
Created on 2020-07-06 by the reprex package (v0.3.0)
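A shorter variant may be possible with the simplify argument of str_extract_all(), which returns a character matrix with one row per input string and pads shorter rows with empty strings. A sketch under that assumption, converting the padding to NA afterwards:

```r
library(stringr)

df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)

# simplify = TRUE: character matrix, one row per string, padded with ""
m <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)', simplify = TRUE)
m[m == ""] <- NA                                  # padding -> NA
colnames(m) <- paste0("A", seq_len(ncol(m)))      # A1, A2, ...

res <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
res[, c("id", "A1", "A2", "A3")]
```

The number of columns follows from the widest row, so it does not need to be known in advance.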

Removing NA's in tibble after using pivot_wider

I am trying to create a table from a data set that takes two factors from a variable, pivots them wider, and lines them up in a single row. Unfortunately, I either keep producing two separate lists, or I get this:
dput(head(test1, 5))
# Edited section:
test1 <- df %>% # Code used to create the table below
select(`Incident ID`,`Device Time`, Description, `Elapsed Time`) %>%
filter(Description == "CPR Stopped" | Description == "CPR Started") %>%
mutate(Index = c(1:16)) %>%
pivot_wider(names_from = Description,
values_from = c(`Elapsed Time`, `Device Time`)) %>%
filter(!is.na(test1))
dput(head(test1, 5))
`Incident ID` Index `Elapsed Time_CPR Started` `Elapsed Time_CPR Stopped` `Device Time_CPR Started` `Device Time_CPR Stopped`
<chr> <int> <time> <time> <time> <time>
1 F190158585 1 01'03" NA 18'37" NA
2 F190158585 2 NA 01'08" NA 18'42"
3 F190158585 3 01'34" NA 19'08" NA
4 F190158585 4 NA 03'47" NA 21'22"
5 F190158585 5 04'00" NA 21'35" NA
I am trying to get a table that looks like this:
df <- data.frame(Index = c(1:3),
CPR_Started = c("00:01:00", "00:02:03", "00:05:46"),
CPR_Stopped = c("00:01:53", "00:04:30", "00:08:00"))
print(df)
Index CPR_Started CPR_Stopped
1 1 00:01:00 00:01:53
2 2 00:02:03 00:04:30
3 3 00:05:46 00:08:00
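One possible cause of the staggered NAs is that Index is unique for every row, so pivot_wider() has nothing to pair a "CPR Started" row with its "CPR Stopped" row. A minimal sketch on made-up data (the tibble events and its column names are hypothetical), numbering the events within each Description so that the n-th start and the n-th stop share an index:

```r
library(dplyr)
library(tidyr)

# Hypothetical event log: alternating start/stop events for one incident
events <- tibble(
  incident    = "F190158585",
  description = rep(c("CPR Started", "CPR Stopped"), times = 3),
  elapsed     = c("00:01:00", "00:01:53", "00:02:03",
                  "00:04:30", "00:05:46", "00:08:00")
)

paired <- events %>%
  group_by(incident, description) %>%
  mutate(Index = row_number()) %>%   # 1, 2, 3 within each event type
  ungroup() %>%
  pivot_wider(names_from = description, values_from = elapsed)

paired
```

This assumes starts and stops strictly alternate; if an event is missing, the pairs would shift and need a different matching rule.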

Using any() or all() with is.na() over multiple columns

I'd like to drop rows from my dataset that are all NAs (AKA keep rows with any non-NAs) for a list of columns. How could I update this code so that x & y are supplied as a vector? This would enable me to flexibly add and drop columns for inspection.
library(dplyr)
ds <-
tibble(
id = c(1:4),
x = c(NA, 1, NA, 4),
y = c(NA, NA , 3, 4)
)
ds %>%
rowwise() %>%
filter(
any(
!is.na(x),
!is.na(y)
)
) %>%
ungroup()
I'm trying to write something like any(!is.na(c(x,y))) but I'm not sure how to supply multiple arguments to is.na().
We can use filter_at with any_vars
ds %>%
filter_at(vars(x:y), any_vars(!is.na(.)))
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 2 1 NA
#2 3 NA 3
#3 4 4 4
Update (Feb 7, 2022):
In newer versions of dplyr (as @GitHunter0 suggested), we can use if_all/if_any or across:
ds %>%
filter(if_any(x:y, complete.cases))
# A tibble: 3 × 3
id x y
<int> <dbl> <dbl>
1 2 1 NA
2 3 NA 3
3 4 4 4
You can also use ds %>% filter(!if_all(x:y, is.na)).
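Since the question asks for the columns to be supplied as a vector, here is a sketch that combines if_any() with all_of() on a character vector. It assumes dplyr 1.0.4 or later (where if_any() is available); cols is an illustrative name:

```r
library(dplyr)

ds <- tibble(
  id = 1:4,
  x = c(NA, 1, NA, 4),
  y = c(NA, NA, 3, 4)
)

# Columns to inspect, kept in a plain character vector
cols <- c("x", "y")

# Keep rows where at least one of the listed columns is not NA
kept <- ds %>%
  filter(if_any(all_of(cols), ~ !is.na(.x)))

# Base R equivalent, for comparison
kept_base <- ds[rowSums(!is.na(ds[cols])) > 0, ]

kept
```

Adding or dropping a column for inspection then only requires editing cols.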

Resources