I have a data table test:
id  key
 1  2365
 1  2365
 1  3709
 2  6734
 2  1908
 2  4523
I want to aggregate unique key values by id into vector using data.table package.
Expected output:
id  key_array
 1  "2365", "3709"
 2  "6734", "1908", "4523"
So, this should work like array_agg sql function.
I tried:
res <- test[, list(key_array = paste(unique(key), collapse = ", ")), by = "id"]
but that gives me just a string. I need to be able to find the length of each vector and operate on its individual elements (for example, find the intersection of two vectors).
1. Base R
This is an aggregate() one-liner.
x <- 'id key
1 2365
1 2365
1 3709
2 6734
2 1908
2 4523'
test <- read.table(textConnection(x), header = TRUE)
aggregate(key ~ id, test, \(x) c(unique(x)))
#> id key
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
But if user Chris's comment is right, then the correct solution is as follows.
aggregate(key ~ id, test, \(x) paste(unique(x), collapse = ", "))
Note that both c(unique(x)) and as.character(c(unique(x))) output a list column, so the latter solution is the right one anyway.
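To see the structure, a quick check on the aggregate() result (reusing the test data built above; res_base is just a name for this sketch):
res_base <- aggregate(key ~ id, test, \(x) c(unique(x)))
str(res_base$key)
# key comes back as a list column: res_base$key[[1]] holds the unique keys
# for id 1 and res_base$key[[2]] those for id 2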
2. Package data.table
Once again a one-liner.
The output is a list column, with each list member a character vector. To keep the keys as integers, use
list(unique(key))
instead.
suppressPackageStartupMessages(library(data.table))
res <- setDT(test)[, .(key_array = list(as.character(unique(key)))), by = id]
res
#> id key_array
#> 1: 1 2365,3709
#> 2: 2 6734,1908,4523
str(res)
#> Classes 'data.table' and 'data.frame': 2 obs. of 2 variables:
#> $ id : int 1 2
#> $ key_array:List of 2
#> ..$ : chr "2365" "3709"
#> ..$ : chr "6734" "1908" "4523"
#> - attr(*, ".internal.selfref")=<externalptr>
Created on 2022-06-14 by the reprex package (v2.0.1)
Then, to access the vectors, use two extractions: one to extract the column and another to extract the individual vectors.
res$key_array[[1]]
#> [1] "2365" "3709"
res$key_array[[2]]
#> [1] "6734" "1908" "4523"
Created on 2022-06-14 by the reprex package (v2.0.1)
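Since the goal is to work with the vectors (length, intersection, etc.), a couple of quick examples on res from above:
lengths(res$key_array)          # number of unique keys per id
#> [1] 2 3
intersect(res$key_array[[1]], res$key_array[[2]])   # keys shared by id 1 and id 2
#> character(0)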
3. dplyr solution
Group by id and collapse the unique values into a single string.
suppressPackageStartupMessages(library(dplyr))
test %>%
group_by(id) %>%
summarise(key_array = paste(unique(key), collapse = ", "))
#> # A tibble: 2 × 2
#> id key_array
#> <int> <chr>
#> 1 1 2365, 3709
#> 2 2 6734, 1908, 4523
Created on 2022-06-14 by the reprex package (v2.0.1)
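If you want a list column on the dplyr side as well (so that lengths() and intersect() work as in the data.table solution), a sketch: wrap unique(key) in list() instead of paste().
test %>%
  group_by(id) %>%
  summarise(key_array = list(as.character(unique(key))))
# key_array is now a list column, one character vector per id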
Related
Here is the given code:
library(RSQLite)
library(DBI)
sqcon<-dbConnect(dbDriver("SQLite"), "data/sqlite.db")
events <- read_csv("events_log.csv")
sqevents <- copy_to(sqcon, events)
sqevents
The sqevents dataframe is like this:
## # Source: table<events> [?? x 9]
## # Database: sqlite 3.35.5 [C:\Users\James\Documents\Work\2021 Sem2\Stats
## # 369\lab4\Data\sqlite.db]
## uuid timestamp session_id group action checkin page_id n_results
## <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 00000736167~ 2.02e13 78245c2c3f~ b searchR~ NA cbeb66d1~ 5
## 2 00000c69fe3~ 2.02e13 c559c3be98~ a searchR~ NA eb658e87~ 10
## 3 00003bfdab7~ 2.02e13 760bf89817~ a checkin 30 f99a9fc1~ NA
## 4 0000465cd7c~ 2.02e13 fb905603d3~ a checkin 60 e5626962~ NA
## 5 000050cbb4e~ 2.02e13 c2bf5e5172~ a checkin 30 787dd6a4~ NA
## 6 0000a6af2ba~ 2.02e13 f6840a9614~ a checkin 180 6fb7b9ea~ NA
## 7 0000cd61e11~ 2.02e13 51f4d3b6a8~ a checkin 240 8ad97e7c~ NA
## 8 000104fe220~ 2.02e13 485eabe537~ b searchR~ NA 4da9a642~ 15
## 9 00012e37b74~ 2.02e13 91174a537d~ a checkin 180 dfdff179~ NA
## 10 000145fbe69~ 2.02e13 a795756dba~ b checkin 150 ec0bad00~ NA
## # ... with more rows, and 1 more variable: result_position <dbl>
I want to find the clickthrough rate which is the proportion of session_id that have action=="visitPage"
My current code is this:
sqevents %>%
  group_by(session_id) %>%
  summarise(clickthrough = sum(action == "visitPage")) %>%
  filter(clickthrough == "0") %>%
  collect()
However this doesn't return anything:
## # A tibble: 0 x 2
## # ... with 2 variables: session_id <chr>, clickthrough <lgl>
What did I do wrong? And how do I fix this?
Perhaps we need to unquote the "0", since the previous step with sum() returns a numeric summarised output. Also, if there are NA elements, specify na.rm = TRUE in sum(); otherwise any missing value in the column makes the sum NA, because na.rm = FALSE by default.
library(dplyr)
sqevents %>%
group_by(session_id) %>%
summarise(clickthrough = sum(action=="visitPage", na.rm = TRUE)) %>%
filter(clickthrough == 0) %>%
collect()
Also, another possibility is that there is at least one 'visitPage' for each 'session_id', in which case the filter step returns 0 rows.
From your description, "[...] which is the proportion of session_id that have action=="visitPage" [...]", you might run into an error further down the pipe using sum(). A nice way to calculate the proportion you described is this:
library(dplyr)
sqevents %>%
dplyr::group_by(session_id) %>%
# check if a session has at least one "visitPage" (true or false = 1 or 0)
dplyr::summarise(yn = any(action == "visitPage")) %>%
# build a mean from that to get the proportion
dplyr::summarise(prop = mean(yn))
# and collect if you like
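As a quick sanity check, the same pipeline on a small local tibble (hypothetical toy data, not the SQLite table):
toy <- dplyr::tibble(
  session_id = c("s1", "s1", "s2", "s3"),
  action     = c("visitPage", "checkin", "checkin", "visitPage")
)
toy %>%
  dplyr::group_by(session_id) %>%
  dplyr::summarise(yn = any(action == "visitPage")) %>%
  dplyr::summarise(prop = mean(yn))
#> # A tibble: 1 × 1
#>    prop
#>   <dbl>
#> 1 0.667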
The content in the column appears as follows: $1,521+ 2 bds. I want to extract 1521 and put it in a new column. I know this can be done in Alteryx using regex; can I do it in R?
How about the following?
library(tidyverse)
x <- '$1,521+ 2 bds'
parse_number(x)
For example:
library(tidyverse)
#generate some data
tbl <- tibble(string = str_c('$', as.character(seq(1521, 1541, 1)), '+', ' 2bds'))
new_col <-
tbl$string %>%
str_split('\\+',simplify = TRUE) %>%
`[`(, 1) %>%
str_sub(2, -1) #get rid of '$' at the start
mutate(tbl, number = new_col)
#> # A tibble: 21 x 2
#> string number
#> <chr> <chr>
#> 1 $1521+ 2bds 1521
#> 2 $1522+ 2bds 1522
#> 3 $1523+ 2bds 1523
#> 4 $1524+ 2bds 1524
#> 5 $1525+ 2bds 1525
#> 6 $1526+ 2bds 1526
#> 7 $1527+ 2bds 1527
#> 8 $1528+ 2bds 1528
#> 9 $1529+ 2bds 1529
#> 10 $1530+ 2bds 1530
#> # … with 11 more rows
Created on 2021-06-12 by the reprex package (v2.0.0)
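If you go the parse_number() route, putting the result straight into a new column is a one-liner with mutate(); a minimal sketch, assuming a hypothetical column named price:
library(tidyverse)
df <- tibble(price = c('$1,521+ 2 bds', '$1,650+ 3 bds'))
df %>% mutate(price_num = parse_number(price))
#> # A tibble: 2 × 2
#>   price         price_num
#>   <chr>             <dbl>
#> 1 $1,521+ 2 bds      1521
#> 2 $1,650+ 3 bds      1650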
We can use sub from base R
as.numeric( sub("\\$(\\d+),(\\d+).*", "\\1\\2", x))
#[1] 1521
data
x <- '$1,521+ 2 bds'
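The same sub() call works on a whole column; a sketch with a made-up rent column (note the pattern assumes the number always contains a comma):
dat <- data.frame(rent = c('$1,521+ 2 bds', '$1,650+ 3 bds'))
dat$rent_num <- as.numeric(sub("\\$(\\d+),(\\d+).*", "\\1\\2", dat$rent))
dat$rent_num
#> [1] 1521 1650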
I read about the collapse package recently and tried to translate the following data.table code to collapse to see if it's faster in real world examples.
Here's my data.table code:
library(data.table)
library(nycflights13)
data("flights")
flights_DT <- as.data.table(flights)
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
flights_DT[
j = list(agg_val_var = sum(abs(get(val_var)), na.rm = TRUE)),
keyby = c(id_var, by)
][
i = order(-agg_val_var),
j = list(value_share = cumsum(agg_val_var)/sum(agg_val_var)),
keyby = by
][
j = .SD[2L],
keyby = by
][
order(-value_share)
]
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-11 by the reprex package (v1.0.0)
I managed to translate the first data.table call, but struggled later on.
It would be great to see how collapse would be used to handle this use case.
So, the first thing I'd like to note is that collapse is not, and probably never will be, a full-blown split-apply-combine computing tool like dplyr or data.table. Its focus is not on optimally executing arbitrary code expressions by groups, but on providing advanced and highly efficient grouped, weighted, time-series and panel data computations through the broad range of C++ based statistical and data transformation functions it provides. I refer you to the vignette on collapse and data.table for further clarity on these points as well as integration examples.
Accordingly, I think it only makes sense to translate data.table code to collapse if (1) you've come up with an arcane expression in data.table to do something statistically complex that it is not good at (such as weighted aggregation, computing quantiles or the mode by groups, lagging / differencing an irregular panel, grouped centering or linear / polynomial fitting), (2) you actually don't need the data.table object but would much rather work with vectors / matrices / data.frames / tibbles, (3) you want to write a statistical program and would much prefer standard evaluation programming over NS eval and data.table syntax, or (4) collapse is indeed substantially faster for your specific application.
Now to the specific code you have provided. It mixes standard and non-standard evaluation (e.g. through the use of get()), which is something collapse is not very good at. I'll give you 3 solutions ranging from full NS eval to full standard eval base R style programming.
library(data.table)
library(nycflights13)
library(magrittr)
library(collapse)
data("flights")
flights_DT <- as.data.table(flights)
# Defining a function for the second aggregation
myFUN <- function(x) (cumsum(x[1:2])/sum(x))[2L]
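# myFUN expects the group's values sorted in decreasing order: it returns the
# combined share of the two largest values in the group total, i.e. the same
# quantity as the cumsum()/sum() step followed by .SD[2L] in the data.table code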
# Solution 1: Non-standard evaluation
flights_DT %>%
fgroup_by(carrier, month, day) %>%
fsummarise(agg_val_var = fsum(abs(arr_delay))) %>%
roworder(month, day, -agg_val_var, na.last = NA) %>%
fgroup_by(month, day) %>%
fsummarise(value_share = myFUN(agg_val_var)) %>%
roworder(-value_share)
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note the use of na.last = NA, which actually removes cases where agg_val_var is missing. This is needed here because fsum(NA) is NA and not 0 like sum(NA, na.rm = TRUE). Now the hybrid example, which is probably closest to the code you provided:
val_var <- "arr_delay"
id_var <- "carrier"
by <- c("month", "day")
# Solution 2: Hybrid approach with standard eval and magrittr pipes
flights_DT %>%
get_vars(c(id_var, val_var, by)) %>%
ftransformv(val_var, abs) %>%
collapv(c(id_var, by), fsum) %>%
get_vars(c(by, val_var)) %>%
roworderv(decreasing = c(FALSE, FALSE, TRUE), na.last = NA) %>%
collapv(by, myFUN) %>%
roworderv(val_var, decreasing = TRUE) %>%
frename(replace, names(.) == val_var, "value_share")
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
Note here that I used frename at the end to give the result column the name you wanted, as you cannot mix standard and non-standard eval in the same function in collapse. Finally, a big advantage of collapse is that you can use it for pretty low-level programming:
# Solution 3: Programming
data <- get_vars(flights_DT, c(id_var, val_var, by))
data[[val_var]] <- abs(.subset2(data, val_var))
g <- GRP(data, c(id_var, by))
data <- add_vars(get_vars(g$groups, by),
fsum(get_vars(data, val_var), g, use.g.names = FALSE))
data <- roworderv(data, decreasing = c(FALSE, FALSE, TRUE), na.last = NA)
g <- GRP(data, by)
data <- add_vars(g$groups, list(value_share = BY(.subset2(data, val_var), g, myFUN, use.g.names = FALSE)))
data <- roworderv(data, "value_share", decreasing = TRUE)
data
#> month day value_share
#> 1: 10 3 0.5263012
#> 2: 1 24 0.5045664
#> 3: 1 20 0.4885145
#> 4: 10 17 0.4870692
#> 5: 3 6 0.4867606
#> ---
#> 361: 5 4 0.3220295
#> 362: 6 15 0.3205974
#> 363: 1 28 0.3197260
#> 364: 11 25 0.3161550
#> 365: 6 14 0.3128286
Created on 2021-03-12 by the reprex package (v0.3.0)
I refer you to the blog post on programming with collapse for a more interesting example on how this can benefit the development of statistical codes.
Now for the evaluation, I wrapped these solutions in functions, where DT() is the data.table code you provided, run with 2 threads on a Windows machine. This checks equality:
all_obj_equal(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> TRUE
Now the benchmark:
library(microbenchmark)
microbenchmark(DT(), clp_NSE(), clp_Hybrid(), clp_Prog())
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> DT() 85.81079 87.80887 91.82032 89.47025 92.54601 132.26073 100 b
#> clp_NSE() 13.47535 14.15744 15.99264 14.80606 16.29140 28.16895 100 a
#> clp_Hybrid() 13.79843 14.23508 16.61606 15.00196 16.83604 32.94648 100 a
#> clp_Prog() 13.71320 14.17283 16.16281 14.94395 16.16935 39.24706 100 a
If you care about these milliseconds, feel free to optimize, but for 340,000 obs all solutions are bloody fast.
I want to remove the date part from the first column, but can't do it for the whole dataset. Can someone advise, please?
Example of dataset:
You can use the sub() function, replacing ^[^[:alpha:]]+ (a regular expression matching all non-alphabetic characters at the beginning of the string) with "", i.e. an empty string.
sub("^[^[:alpha:]]+", "", data)
Example
data <- data.frame(
good_column = 1:4,
bad_column = c("13/1/2000pasta", "14/01/2000flour", "15/1/2000aluminium foil", "15/1/2000soap"))
data
#> good_column bad_column
#> 1 1 13/1/2000pasta
#> 2 2 14/01/2000flour
#> 3 3 15/1/2000aluminium foil
#> 4 4 15/1/2000soap
data$bad_column <- sub("^[^[:alpha:]]+", "", data$bad_column)
data
#> good_column bad_column
#> 1 1 pasta
#> 2 2 flour
#> 3 3 aluminium foil
#> 4 4 soap
Created on 2020-07-29 by the reprex package (v0.3.0)
I need to read a .txt file from a URL, but would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:
"
... (number of rows)
... (number of rows)
... (number of rows)
DATE VALUE
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
"
I would like to skip all rows until the row with "DATE // VALUE" and start importing the data from this line onwards (including "DATE // VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!
Best,
c.
Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(tidyverse)
library(stringr)
df <- data_frame(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted out the info you want from the string using stringr::str_extract and regexes.
Combining fread with unix tools:
> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
DATE VALUE
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
---
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
>
Using:
file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
'https://fred.stlouisfed.org/data/DGS10.txt',
'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1
# option 1
l <- Map(function(x,y) read.table(text = x, skip = y), x = text.list, y = skip.rows)
# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))
will get you a list of data.frames (option 1) or data.tables (option 2).
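If you would rather have one combined table than a list, a possible follow-up (assuming the files share the same column names; the id column name series is made up):
combined <- data.table::rbindlist(l, idcol = "series")
# 'series' is the position of the source file in file.names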