Fastest way to pad a dataframe with uneven columns - r

Similar to this previous question I'm trying to transform a vector into a dataframe in R. I use this trick where I turn it into a matrix and then data frame, but the issue is that some rows potentially have a different number of columns, which throws out my data frame. There can be an arbitrary number of values per row (i.e. not necessarily 3 columns as in the examples), so I check first to work out how many columns I need.
For example, given the example data below, I get a neat data frame.
example <- c(
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c
That's all well and good until I get a vector that has an extra value in one row (i.e. requires an extra column). For example:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))
data.frame(matrix(example, ncol = ncols[1], byrow = T))
# X1 X2 X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4 col-c col-a col-b
Whereas, what I really want is:
# X1 X2 X3 X4
# 1 col-a col-b col-c NA
# 2 col-a col-b col-c WATCH OUT!
# 3 col-a col-b col-c NA
I could deal with this with a double for loop after checking that there is an uneven number of elements between first column elements, but that's not even going to be close to the best option surely.
The additional complication is that the "extra" column could potentially be anywhere, not necessarily the last column.
Edit: The column ordering is actually arbitrary, so there's no reason why the extra column has to be in the middle; it could be appended at the end. That is one option I considered: pull it out and append it after padding with NA. The values that belong in the same column also share a delimited prefix, so it's clear where they belong. I have updated the example below.
Here is some more realistic example data and desired output:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
# Desired output
X1 X2 X3 X4
1 name:start date:a NA value:b
2 name:start date:c desc:WATCH OUT! value:d
3 name:start date:e NA value:f
What would be the fastest way to process this?
Thanks in advance!
EDIT: the "blocks" that turn into rows are well defined, so the start and end of a block are clear and finding the size of a block isn't hard, hence my diff(grep(...)) command earlier (could also use dist() for similar result). The WATCH OUT! Text can be arbitrary though, so it's not as simple as searching for WATCH OUT!.

I am not sure whether output in this format is useful:
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:') %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = value)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 start a b <NA>
#> 2 2 start c d WATCH OUT!
#> 3 3 start e f <NA>
OR perhaps this?
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
separate(dummy, into=c("name", 'value'), sep = '\\:', remove = F) %>%
mutate(rowid = cumsum(name == first(name))) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> rowid name date value desc
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 name:start date:a value:b <NA>
#> 2 2 name:start date:c value:d desc:WATCH OUT!
#> 3 3 name:start date:e value:f <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)
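For comparison, the same key-based reshaping can be sketched in base R. This is only a minimal sketch, not a benchmarked solution: it assumes every element has the form key:value and that each block starts with the first key, and the names `key`, `block`, `cols`, `out` and `padded` are made up for illustration.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Assumption: the text before the first ":" identifies the column
key <- sub(":.*$", "", example)

# A new block (row) starts whenever the first key reappears
block <- cumsum(key == key[1])

# One column per distinct key, in order of first appearance
cols <- unique(key)

# Pre-fill with NA, then place each element at its (row, column) position
out <- matrix(NA_character_, nrow = max(block), ncol = length(cols),
              dimnames = list(NULL, cols))
out[cbind(block, match(key, cols))] <- example

padded <- as.data.frame(out)
padded
```

Columns appear in order of first appearance, so the extra `desc` column ends up last, which the question's edit says is acceptable.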
For your first example, you could do:
example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")
library(tidyverse)
example %>% as.data.frame() %>% setNames('dummy') %>%
group_by(rowid = cumsum(dummy == first(dummy))) %>%
mutate(name = paste0('X', row_number())) %>%
pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#> # Groups: rowid [3]
#> rowid X1 X2 X3 X4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 col-a col-b col-c <NA>
#> 2 2 col-a col-b col-c WATCH OUT!
#> 3 3 col-a col-b col-c <NA>
Created on 2021-05-30 by the reprex package (v2.0.0)

Is this useful?
library(tidyverse)
library(rebus)
#>
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#>
#> regex
#> The following object is masked from 'package:ggplot2':
#>
#> alpha
example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")
example_dirty <- example #i will use it at the end of the script for replacing
custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.')
alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
as.character()
replacement <- which(alien_text_index == 'FALSE') %>%
`/`(., 3) %>% #in this case every three rows the repetition should start over.
round() #round for getting an index to modify
example <- str_match(example , pattern = custom_pattern) %>% keep(~!is.na(.))
df <- c('name:.*', 'date:.', 'value:.') %>%
map(~example[str_detect(example, .x)]) %>% reduce(bind_cols) %>%
mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3
# iterate over every replacement index, not only the last one
for (i in seq_along(replacement)) {
df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}
df
#> # A tibble: 3 x 4
#> ...1 ...2 ...3 ..4
#> <chr> <chr> <chr> <chr>
#> 1 name:start date:a value:b ""
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""
Created on 2021-05-29 by the reprex package (v2.0.0)

Related

Places after decimal points discarded when extracting numbers from strings

I'd like to extract weight values from strings with the unit and the time of measurement using tidyverse.
My dataset is like as below:
df <- tibble(ID = c("A","B","C"),
Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
------
A tibble: 3 × 2
ID Measurement
<chr> <chr>
1 A 45kg^20221120
2 B 51.5kg^20221015
3 C 66.05kg^20221020
I use stringr in the tidyverse package with regular expressions.
library(tidyverse)
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))
----------
A tibble: 3 × 3
ID Measurement Weight
<chr> <chr> <dbl>
1 A 45kg^20221120 45
2 B 51.5kg^20221015 51.5
3 C 66.05kg^20221020 66.0
The second decimal place of C (.05) isn't extracted.
What's wrong with my code?
Any answers or comments are welcome.
Thanks.
Yes, it was extracted; the tibble print method just rounds it to 66.0 for display.
You can see the full value if you convert the result to a data.frame or open it with View().
Solution
Check here
Check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
as.data.frame()
Output
#> ID Measurement Weight
#> 1 A 45kg^20221120 45.00
#> 2 B 51.5kg^20221015 51.50
#> 3 C 66.05kg^20221020 66.05
Or check this
df %>%
mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>%
View()
You could try to pull all the data out of the string at once with extract:
library(tidyverse)
df <- tibble(ID = c("A","B","C"),
Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))
df |>
extract(col = Weight,
into = c("weight", "unit", "date"),
regex = "(.*)(kg)\\^(.*$)",
remove = TRUE,
convert = TRUE) |>
mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#> ID weight unit date
#> <chr> <dbl> <chr> <date>
#> 1 A 45 kg 2022-11-20
#> 2 B 51.5 kg 2022-10-15
#> 3 C 66.0 kg 2022-10-20
Note that, as stated in the comments, the .05 is just not printing, but is present in the data.
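To check that the rounding is display-only, here is a minimal base R sketch. It swaps in `sub()` for the stringr call and assumes the unit is always kg; the names `measurement` and `weight` are illustrative.

```r
measurement <- c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020")

# Drop everything from "kg" onward, then convert the remainder to numeric
weight <- as.numeric(sub("kg.*$", "", measurement))

# The stored value keeps both decimal places
stopifnot(isTRUE(all.equal(weight[3], 66.05)))

# Formatting with at least two decimals shows them explicitly
format(weight, nsmall = 2)
```

In tibble output, the number of significant digits shown is governed by the `pillar.sigfig` option (3 by default), so raising it, e.g. `options(pillar.sigfig = 5)`, should also reveal the hidden digit.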

Expanding mean over time per subgroup in dataframe

Still quite new to R, so trying to figure out what I am doing wrong in the following explanation.
I am trying to calculate the expanding mean over time per subgroup for a dataframe. My code works when there is only a single subgroup in the dataframe, but starts to break when multiple subgroups are available within the dataframe.
Apologies if I have overlooked something, but I can't figure out where exactly my code is incorrect. My hunch is that I am not setting the width correctly, but I have not been able to figure out how to change width to a dynamically expanding window over time per subgroup.
See my data below;
sample file
See my code below;
library(ggplot2)
library(zoo)
library(RcppRoll)
library(dplyr)
x <- read.csv("stackoverflow.csv")
x$datatime <- as.POSIXlt(x$datatime,format="%m/%d/%Y %H:%M",tz=Sys.timezone())
x$Event <- as.factor(x$Event)
x2 <- arrange(x,x$Event,x$datatime) %>%
group_by(x$Event) %>%
mutate(ma=rollapply(data = x$Actual, width=seq_along(x$Actual), FUN=mean,
partial=TRUE, fill=NA,
align = "right"))
Any help is very much appreciated!
Thanks
EDIT:
A fix has been found! Thanks to all the useful feedback.
The working code is;
x <-
arrange(x,x$Event,x$datatime) %>%
group_by(Event) %>%
mutate(ma=rollapply(data = Actual,
width=seq_along(Actual),
FUN=mean,
partial=TRUE,
fill=NA,
align = "right"))
I think the problem here is that you’re using x$ to extract columns from the original data in mutate(), rather than using the column name directly to refer to the column in the grouped slice.
In dplyr verbs you can (and in case of grouped operations, must) refer to the columns directly.
The solution is to just remove all x$ references from your code in dplyr functions.
Here’s a small example that illustrates what’s going on:
library(dplyr, warn.conflicts = FALSE)
tbl <- tibble(g = c(1, 1, 2, 2, 2), x = 1:5)
tbl
#> # A tibble: 5 x 2
#> g x
#> <dbl> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 2 5
tbl %>%
group_by(g) %>%
mutate(y = cumsum(tbl$x))
#> Error in `mutate_cols()`:
#> ! Problem with `mutate()` column `y`.
#> i `y = cumsum(tbl$x)`.
#> i `y` must be size 2 or 1, not 5.
#> i The error occurred in group 1: g = 1.
And how to fix it:
tbl %>%
group_by(g) %>%
mutate(y = cumsum(x))
#> # A tibble: 5 x 3
#> # Groups: g [2]
#> g x y
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 1 2 3
#> 3 2 3 3
#> 4 2 4 7
#> 5 2 5 12
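An expanding mean is just the running sum divided by the running count, so the per-group idea can also be sketched in base R with `ave()`. The data below is a hypothetical stand-in, since the question's CSV is not available; the sketch assumes rows are already sorted by time within each event.

```r
# Hypothetical stand-in for the question's data
x <- data.frame(
  Event  = c("a", "a", "a", "b", "b"),
  Actual = c(2, 4, 6, 10, 20)
)

# Expanding mean = running sum / running count, computed separately per Event
x$ma <- ave(x$Actual, x$Event, FUN = function(v) cumsum(v) / seq_along(v))
x
```

Within dplyr, `group_by(Event) %>% mutate(ma = cummean(Actual))` should give the same result without needing rollapply().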

Modify a vector based on a vector of regular expressions (regex) using (if possible) a functional approach

I have a dataframe with some columns that I want to modify depending on whether they match some patterns included in a vector with regular expressions
library(fuzzyjoin)
library(tidyverse)
(df <- tribble(~a,
"GUA-ABC",
"REF-CDE",
"ACC.S93",
"ACC.ATN"))
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC
#> 2 REF-CDE
#> 3 ACC.S93
#> 4 ACC.ATN
Depending on the pattern, I want to append some text. For example, for strings that contain GUA-, append "GUA001" to the end of the string, joined by a dot, and for those that contain REF-, append "GUA002" in the same way, to obtain the following:
# This is the resulting data.frame I need
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC.GUA001
#> 2 REF-CDE.GUA002
#> 3 ACC.S93
#> 4 ACC.ATN
I have thought of some approaches.
Approach # 1
# list of patterns to search
patterns <- c("\\b^GUA\\b", "\\b^REF\\b")
# Create a named list for recoding
model_key <- list("\\b^GUA\\b" = "GUA001",
"\\b^REF\\b" = "GUA002")
# Create a data.frame of regexs
(k <- tibble(regex = patterns))
#> # A tibble: 2 x 1
#> regex
#> <chr>
#> 1 "\\b^GUA\\b"
#> 2 "\\b^REF\\b"
# perform a regex_left_join to identify the pattern
df %>%
regex_left_join(k, by = c(a = "regex")) %>%
mutate(
across(regex, recode, !!!model_key),
a = case_when(
!is.na(regex) ~ str_c(a, regex, sep = "."),
TRUE ~ a)
) %>% select(-regex)
#> # A tibble: 4 x 1
#> a
#> <chr>
#> 1 GUA-ABC.GUA001
#> 2 REF-CDE.GUA002
#> 3 ACC.S93
#> 4 ACC.ATN
Why is this approach not optimal? The original data frame has millions of rows and fuzzyjoin::regex_left_join takes too long to do this.
Approach # 2
patron <- c("GUA001" = "\\b^GUA\\b", "GUA002" = "\\b^REF\\b")
newtex <- c("GUA001", "GUA002")
pegar <- function(string, pattern, text_to_paste) {
if_else(condition = str_detect(string, pattern),
true = str_c(string, text_to_paste, sep = "."),
false = string)
}
map2_dfr(.x = patron, .y = newtex, ~ pegar(string = df$a,
pattern = .x,
text_to_paste = .y))
#> # A tibble: 4 x 2
#> GUA001 GUA002
#> <chr> <chr>
#> 1 GUA-ABC.GUA001 GUA-ABC
#> 2 REF-CDE REF-CDE.GUA002
#> 3 ACC.S93 ACC.S93
#> 4 ACC.ATN ACC.ATN
Created on 2021-05-20 by the reprex package (v2.0.0)
With approach # 2 I can't get a single column.
As a side note, using str_replace_all and using a named vector to replace some of the values within the string has not seemed like a good alternative at the moment.
Is there a way to do this more optimally?
One option utilizing stringr and purrr could be:
imap_dfr(model_key,
~ df %>%
filter(str_detect(a, .y)) %>%
mutate(a = str_c(a, .x, sep = "."))) %>%
bind_rows(df %>%
filter(str_detect(a, str_c(names(model_key), collapse = "|"), negate = TRUE)))
a
<chr>
1 GUA-ABC.GUA001
2 REF-CDE.GUA002
3 ACC.S93
4 ACC.ATN
What about a boring old loop?
## make df millions of rows
df <- df[rep(1:4,1e6),]
system.time({
val <- c("GUA\\-", "REF\\-", "ACC\\.", "QQQ\\.")
rpl <- c("GUA001", "GUA002", "ACC001", "QQQ001")
for(i in seq_along(val)) {
sel <- grepl(val[i], df$a)
df$a[sel] <- paste(df$a[sel], rpl[i], sep=".")
}
})
## user system elapsed
## 2.14 0.03 2.17
2 seconds to complete
df
## A tibble: 4,000,000 x 1
# a
# <chr>
# 1 GUA-ABC.GUA001
# 2 REF-CDE.GUA002
# 3 ACC.S93.ACC001
# 4 ACC.ATN.ACC001
# ...
If the functional approach is absolutely necessary, you can squish it into a Reduce function:
Reduce(
function(str, args) {
sel <- grepl(args[1], str)
str[sel] <- paste(str[sel], args[2], sep=".")
str
},
Map(c, val, rpl), init = df$a
)
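As a small self-contained check, the same Reduce idea can be run on the question's original four values with only the two patterns the question asks for. This is a sketch: the anchored patterns `^GUA-` and `^REF-` are an assumption about what should match, and it folds over the pattern indices rather than over `Map(c, val, rpl)`.

```r
a   <- c("GUA-ABC", "REF-CDE", "ACC.S93", "ACC.ATN")
val <- c("^GUA-", "^REF-")      # patterns to look for (assumed anchoring)
rpl <- c("GUA001", "GUA002")    # text appended on a match

# Walk over the pattern indices, carrying the updated vector along
out <- Reduce(
  function(s, i) {
    sel <- grepl(val[i], s)
    s[sel] <- paste(s[sel], rpl[i], sep = ".")
    s
  },
  seq_along(val),
  a  # initial value of the accumulator
)
out
```

Because each pattern is anchored at the start of the string, a suffix appended in an earlier step cannot trigger a later pattern.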

R: dplyr and row_number() does not enumerate as expected

I want to enumerate each record of a dataframe/tibble that results from a grouping. The index should follow a defined order. If I use row_number(), it enumerates within each group, but I want it to enumerate without considering the earlier grouping.
Here is an example. To make it simple I used the most minimal dataframe:
library(dplyr)
df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
, x2 = rep(letters[1:2], 2)
, y = floor(abs(rnorm(4)*10))
)
df0
# x1 x2 y
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
Now, I group this table:
df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))
This gives me an object of class tibble:
# A tibble: 4 x 3
# Groups: x1 [?]
# x1 x2 y
# <fct> <fct> <dbl>
# 1 A a 12
# 2 A b 24
# 3 B a 0
# 4 B b 12
I want to add a row number to this table using row_number():
df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# A tibble: 4 x 4
# Groups: x1 [2]
# x1 x2 y index
# <fct> <fct> <dbl> <int>
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 1
# 4 B a 0 2
row_number() enumerates within the former grouping. This was not my intention. It can be avoided by converting the tibble to a data frame first:
df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
df2
# x1 x2 y index
# 1 A b 24 1
# 2 A a 12 2
# 3 B b 12 3
# 4 B a 0 4
My question is: is this behaviour intended?
If yes: is it not very dangerous to incorporate former data processing into tibble? Which type of processing is incorporated?
At the moment I will convert tibble into dataframe to avoid this kind of unexpected results.
To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.
Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
For example, you can group by x1 and x2, summarize y, and create a row number, which will give you the rows according to x1 (summarise removed a layer of grouping, i.e. drops the x2 grouping). Then ungrouping allows you to get row numbers based on the entire data frame.
library(dplyr)
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(group_row = row_number()) %>%
ungroup() %>%
mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#> x1 x2 y group_row all_df_row
#> <fct> <fct> <dbl> <int> <int>
#> 1 A a 12 1 1
#> 2 A b 2 2 2
#> 3 B a 10 1 3
#> 4 B b 23 2 4
A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1) with mutate. Again, here I ungroup to show the shares of the entire data frame instead.
df0 %>%
group_by(x1, x2) %>%
summarise(y = sum(y)) %>%
mutate(share_in_group = y / sum(y)) %>%
ungroup() %>%
mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#> x1 x2 y share_in_group share_all_df
#> <fct> <fct> <dbl> <dbl> <dbl>
#> 1 A a 12 0.857 0.255
#> 2 A b 2 0.143 0.0426
#> 3 B a 10 0.303 0.213
#> 4 B b 23 0.697 0.489
Created on 2018-10-11 by the reprex package (v0.2.1)
As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.
However one additional tip is that if you are just going to call ungroup() after summarize() you might as well use summarize(.groups = "drop") which will return an ungrouped tibble and save you a line of code.
library(tidyverse)
df0 <- data.frame(
x1 = rep(LETTERS[1:2], each = 2),
x2 = rep(letters[1:2], 2),
y = floor(abs(rnorm(4) * 10))
)
df0 %>%
group_by(x1,x2) %>%
summarize(y=sum(y), .groups = "drop") %>%
arrange(desc(y)) %>%
mutate(index = row_number())
#> # A tibble: 4 x 4
#> x1 x2 y index
#> <chr> <chr> <dbl> <int>
#> 1 A b 8 1
#> 2 A a 2 2
#> 3 B a 2 3
#> 4 B b 1 4
Created on 2022-02-06 by the reprex package (v2.0.1)
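To see which grouping survives summarise(), it can help to inspect the result with group_vars(). The sketch below uses fixed values instead of rnorm() so that it is reproducible, and it assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Fixed values instead of rnorm() so the result is reproducible
df0 <- data.frame(x1 = c("A", "A", "B", "B"),
                  x2 = c("a", "b", "a", "b"),
                  y  = c(12, 24, 0, 12))

# summarise() peels off the last grouping variable (x2); x1 stays
df1 <- df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y), .groups = "drop_last")

# Shows the grouping variables that are still active
group_vars(df1)

# After ungroup(), row_number() counts across the whole table
df2 <- df1 %>%
  ungroup() %>%
  arrange(desc(y)) %>%
  mutate(index = row_number())
df2
```

Here `.groups = "drop_last"` only makes the default behaviour explicit; `"drop"` would remove the need for the ungroup() call.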

tidyverse - preferred way to turn a named vector into a data.frame/tibble

Using the tidyverse a lot, I often face the challenge of turning named vectors into a data.frame/tibble with the columns being the names of the vector.
What is the preferred/tidyversey way of doing this?
EDIT: This is related to: this and this github-issue
So I want:
require(tidyverse)
vec <- c("a" = 1, "b" = 2)
to become this:
# A tibble: 1 × 2
a b
<dbl> <dbl>
1 1 2
I can do this via e.g.:
vec %>% enframe %>% spread(name, value)
vec %>% t %>% as_tibble
Usecase example:
require(tidyverse)
require(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
Which gives
# A tibble: 2 × 3
a b c
<chr> <chr> <chr>
1 1 2 <NA>
2 1 <NA> 3
This is now directly supported using bind_rows (introduced in dplyr 0.7.0):
library(tidyverse)
vec <- c("a" = 1, "b" = 2)
bind_rows(vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
This quote from https://cran.r-project.org/web/packages/dplyr/news.html explains the change:
bind_rows() and bind_cols() now accept vectors. They are treated as rows by the former and columns by the latter. Rows require inner names like c(col1 = 1, col2 = 2), while columns require outer names: col1 = c(1, 2). Lists are still treated as data frames but can be spliced explicitly with !!!, e.g. bind_rows(!!! x) (#1676).
With this change, it means that the following line in the use case example:
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(~t(.) %>% as_tibble)
can be rewritten as
txt %>% map(read_xml) %>% map(xml_attrs) %>% map_df(bind_rows)
which is also equivalent to
txt %>% map(read_xml) %>% map(xml_attrs) %>% { bind_rows(!!! .) }
The equivalence of the different approaches is demonstrated in the following example:
library(tidyverse)
library(rvest)
txt <- c('<node a="1" b="2"></node>',
'<node a="1" c="3"></node>')
temp <- txt %>% map(read_xml) %>% map(xml_attrs)
# x, y, and z are identical
x <- temp %>% map_df(~t(.) %>% as_tibble)
y <- temp %>% map_df(bind_rows)
z <- bind_rows(!!! temp)
identical(x, y)
#> [1] TRUE
identical(y, z)
#> [1] TRUE
z
#> # A tibble: 2 x 3
#> a b c
#> <chr> <chr> <chr>
#> 1 1 2 <NA>
#> 2 1 <NA> 3
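To make explicit what row-binding named vectors involves, here is a rough base R sketch of the same alignment: take the union of all names, index each vector by it, and let missing names become NA. It illustrates the idea only and is not how bind_rows() is implemented; `rows`, `cols`, `aligned` and `out` are illustrative names.

```r
# Two named character vectors with partly different names,
# mirroring the attribute vectors in the use case
rows <- list(c(a = "1", b = "2"),
             c(a = "1", c = "3"))

# Union of all names, in order of first appearance
cols <- unique(unlist(lapply(rows, names)))

# Index each vector by the full set of names; missing names yield NA
aligned <- lapply(rows, function(r) unname(r[cols]))

# Stack the aligned vectors as rows and restore the column names
out <- as.data.frame(do.call(rbind, aligned))
names(out) <- cols
out
```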
The idiomatic way would be to splice the vector with !!! within a tibble() call so the named vector elements become column definitions:
library(tibble)
vec <- c("a" = 1, "b" = 2)
tibble(!!!vec)
#> # A tibble: 1 x 2
#> a b
#> <dbl> <dbl>
#> 1 1 2
Created on 2019-09-14 by the reprex package (v0.3.0)
This works for me: c("a" = 1, "b" = 2) %>% t() %>% tbl_df()
Interestingly you can use the as_tibble() method for lists to do this in one call. Note that this isn't best practice since this isn't an exported method.
tibble:::as_tibble.list(vec)
as_tibble(as.list(c(a=1, b=2)))
