BigQuery: Transform data in multiple columns into row format

Suppose I have the following table in BigQuery:
SELECT "Desktop" AS Device, 24 AS col1, 9 AS col2, 28 AS col3, 7 AS col4, 98 AS col5, 77 AS col6, 59 AS col7 UNION ALL
SELECT "Mobile" AS Device, 8 AS col1, 43 AS col2, 75 AS col3, 44 AS col4, 38 AS col5, 31 AS col6, 46 AS col7 UNION ALL
SELECT "Tablet" AS Device, 7 AS col1, 9 AS col2, 34 AS col3, 86 AS col4, 62 AS col5, 69 AS col6, 74 AS col7
The real table could have around 100 columns.
I'd like to transform it so that the resulting table looks like this:
SELECT "Desktop" AS Device, 24 AS Nr UNION ALL
SELECT "Desktop" AS Device, 9 AS Nr UNION ALL
SELECT "Desktop" AS Device, 28 AS Nr UNION ALL
SELECT "Desktop" AS Device, 7 AS Nr UNION ALL
SELECT "Desktop" AS Device, 98 AS Nr UNION ALL
SELECT "Desktop" AS Device, 77 AS Nr UNION ALL
SELECT "Desktop" AS Device, 59 AS Nr UNION ALL
SELECT "Mobile" AS Device, 8 AS Nr UNION ALL
SELECT "Mobile" AS Device, 43 AS Nr UNION ALL
SELECT "Mobile" AS Device, 75 AS Nr UNION ALL
etc.
Does anyone know how this can be achieved?

The following is for BigQuery Standard SQL. A bonus of this approach is that it does not depend on the number or names of the columns to be unpivoted:
#standardSQL
WITH raw AS (
SELECT "Desktop" AS Device, 24 AS col1, 9 AS col2, 28 AS col3, 7 AS col4, 98 AS col5, 77 AS col6, 59 AS col7 UNION ALL
SELECT "Mobile" AS Device, 8 AS col1, 43 AS col2, 75 AS col3, 44 AS col4, 38 AS col5, 31 AS col6, 46 AS col7 UNION ALL
SELECT "Tablet" AS Device, 7 AS col1, 9 AS col2, 34 AS col3, 86 AS col4, 62 AS col5, 69 AS col6, 74 AS col7
)
SELECT Device, Nr FROM raw t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(Device) FROM UNNEST([t]))), r'":([^,}]*)')) Nr
Update for the OP's comment ("I totally forgot to include in the requirements that the column names should also be added as a separate column"):
#standardSQL
SELECT Device, SPLIT(pair, ':')[OFFSET(0)] AS col, SPLIT(pair, ':')[OFFSET(1)] AS Nr
FROM raw t,
UNNEST(SPLIT(REGEXP_REPLACE(TO_JSON_STRING((SELECT AS STRUCT * EXCEPT(Device) FROM UNNEST([t]))), r'["{}]', ''))) pair
Applied to the same sample data, the result now looks like this:
Row Device col Nr
1 Desktop col1 24
2 Desktop col2 9
3 Desktop col3 28
4 Desktop col4 7
5 Desktop col5 98
6 Desktop col6 77
7 Desktop col7 59
8 Mobile col1 8
9 Mobile col2 43
10 Mobile col3 75
11 Mobile col4 44
12 Mobile col5 38
13 Mobile col6 31
14 Mobile col7 46
15 Tablet col1 7
16 Tablet col2 9
17 Tablet col3 34
18 Tablet col4 86
19 Tablet col5 62
20 Tablet col6 69
21 Tablet col7 74
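To see why this works without listing the columns, it can help to replay the string handling outside BigQuery. The following is a rough Python sketch, not BigQuery code: it assumes TO_JSON_STRING produces compact JSON such as {"col1":24,"col2":9}, and the function name unpivot_row is made up for illustration.

```python
import json
import re

def unpivot_row(row, key="Device"):
    """Mimic the answer's string handling for a single row (a dict)."""
    # SELECT AS STRUCT * EXCEPT(Device): drop the key column.
    rest = {k: v for k, v in row.items() if k != key}
    # Assumed shape of TO_JSON_STRING output: compact JSON, no spaces.
    text = json.dumps(rest, separators=(",", ":"))
    # REGEXP_REPLACE(..., r'["{}]', ''): strip quotes and braces.
    stripped = re.sub(r'["{}]', "", text)
    # SPLIT on ',' gives "name:value" pairs; SPLIT on ':' separates them.
    return [(row[key], *pair.split(":")) for pair in stripped.split(",")]

rows = unpivot_row({"Device": "Desktop", "col1": 24, "col2": 9, "col3": 28})
print(rows)
# [('Desktop', 'col1', '24'), ('Desktop', 'col2', '9'), ('Desktop', 'col3', '28')]
```

Two things follow from the sketch: Nr comes back as a string, so a CAST may be needed downstream, and values that themselves contain commas, colons, quotes or braces would break the split.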

You can turn the number columns into an ARRAY and use UNNEST:
with raw as (
SELECT "Desktop" AS Device, 24 AS col1, 9 AS col2, 28 AS col3, 7 AS col4, 98 AS col5, 77 AS col6, 59 AS col7 UNION ALL
SELECT "Mobile" AS Device, 8 AS col1, 43 AS col2, 75 AS col3, 44 AS col4, 38 AS col5, 31 AS col6, 46 AS col7 UNION ALL
SELECT "Tablet" AS Device, 7 AS col1, 9 AS col2, 34 AS col3, 86 AS col4, 62 AS col5, 69 AS col6, 74 AS col7
)
select Device, Nr
from raw
left join UNNEST ([col1, col2, col3,col4,col5,col6,col7]) Nr

Related

Delete rows by dplyr but leave rownames indexes

Consider the following data:
df1 <-data.frame('col_1'=rnorm(100),'col_2'=runif(100),'col_3'=rexp(100))
head(df1)
col_1 col_2 col_3
1 1.1626853 0.7081688 0.1356186
2 -0.5859245 0.8679017 0.4680558
3 1.7854650 0.4107538 0.5867553
4 -1.3325937 0.3032165 0.4111656
5 -0.4465668 0.8882200 3.4235329
6 0.5696061 0.4715614 1.0981746
Now I want to filter my data:
df1 %>%
  filter(col_1 > 0)
However, I lose the original row numbering: the new data frame has rows numbered 1 to 49, and I want to keep the old indices with only the filtered-out rows removed. Is there a way to do this?
The best approach is to create a new column holding the row index, because tibbles don't support row names.
library(dplyr)
df1 %>%
mutate(row = row_number()) %>%
filter(col_1 > 0)
To keep the original row index as row names, try this:
library(tidyverse)
#Data
df1 <-data.frame('col_1'=rnorm(100),'col_2'=runif(100),'col_3'=rexp(100))
#Code
new <- df1 %>% rownames_to_column('id') %>%
filter(col_1>0) %>%
column_to_rownames('id')
Output:
col_1 col_2 col_3
1 0.44582154 0.485113710 1.12780556
9 0.91338077 0.028025045 0.03392986
12 0.39850519 0.693677593 0.08575707
15 1.31992767 0.875082565 1.69923642
18 1.01032450 0.874306072 0.07470948
19 0.21004100 0.489900673 0.06544119
20 1.83231058 0.777010624 1.04503362
23 1.76636414 0.932134284 0.89963322
24 0.14665427 0.453811105 1.69614288
27 0.95768915 0.540466270 2.08754680
28 2.12894656 0.265205677 1.26068462
29 1.20613178 0.590121360 0.69933346
31 0.17498536 0.003435992 0.90773187
33 1.09692125 0.321649196 3.08840026
35 0.71434379 0.592343229 1.51961595
36 2.18998179 0.288959794 0.86319077
37 0.24424922 0.129267751 0.01765732
39 1.10932154 0.515400529 0.34381840
40 1.62120910 0.843270861 1.22549044
42 0.61201364 0.299831635 0.24302644
43 0.69583869 0.621354113 1.71074969
50 0.12516294 0.337942860 0.13970981
51 0.55032446 0.204976125 0.58245053
52 1.24819371 0.796629076 0.36528538
53 0.78363419 0.321154495 0.09472414
55 0.98528573 0.626797295 0.36268645
56 0.82932405 0.404080363 0.18517625
60 0.65893951 0.441280360 0.15770949
62 0.23747401 0.498418489 0.32947354
67 2.05117816 0.702286040 2.04353073
68 0.46038166 0.455878959 0.78142526
69 0.85814858 0.167027385 0.77806710
73 0.36265229 0.836850527 0.08689737
74 1.75032050 0.918432489 2.44187445
80 1.84781396 0.064257761 1.31418005
82 0.69448019 0.664345881 0.22248944
84 1.43213456 0.172975017 1.02372291
86 0.05623400 0.436021922 0.67705170
87 0.50485963 0.791348607 0.32379094
90 0.08281623 0.608697963 0.87405171
91 0.15252262 0.026808318 0.28446487
92 0.13104612 0.649343508 1.19998877
95 2.47542034 0.071355988 0.78619673
97 0.42994024 0.616706005 0.68963918
98 1.42811745 0.642106243 0.99258297
99 0.27834373 0.310252127 0.71026805
100 0.98552422 0.073099646 0.21789834
Using dplyr, we can use slice
library(dplyr)
df1 %>%
mutate(row = row_number()) %>%
slice(which(col_1 > 0))

add number spaced apart in single cell in r

I would like to sum the space-separated numbers in col3 of the data below.
I have tried using gsub to insert a + between the numbers and then evaluate the result in R.
I have also tried using separate and then summing across the resulting columns.
train <- data.table(col1=c(rep('a0001',4),rep('b0002',4)), col2=c(seq(1,4,1),seq(1,4,1)), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""))
I would like the results to be a sum of the number in col3 by row like in col4
result <- data.frame(col1=c(rep('a0001',4),rep('b0002',4)), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""), col4=c(rep(2416,4),rep(18850,4)))
Grouped by 'col1', we can split by the space, unlist, convert to numeric, get the sum, and assign (:=) it to create a new column:
train[, col4 := sum(as.numeric(unlist(strsplit(col3, ' '))), na.rm = TRUE), col1]
Or another option is scan
train[, col4 := sum(scan(text = col3, what = numeric(), quiet = TRUE)), col1]
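As a cross-check of the expected totals, here is the same split-and-sum logic as a small Python sketch (not data.table code; the helper name sum_by_group is made up for illustration).

```python
from collections import defaultdict

def sum_by_group(groups, cells):
    """Sum the space-separated numbers in `cells`, grouped by `groups`."""
    totals = defaultdict(float)
    for group, cell in zip(groups, cells):
        # str.split() with no argument returns [] for an empty cell,
        # so blank cells simply contribute nothing to the total.
        totals[group] += sum(float(tok) for tok in cell.split())
    return dict(totals)

col1 = ["a0001"] * 4 + ["b0002"] * 4
col3 = ["12 43 543 1232 43 543", "", "", "",
        "15 24 85 64 85 25 46", "", "658 1568 12 584 15684", ""]
print(sum_by_group(col1, col3))
# {'a0001': 2416.0, 'b0002': 18850.0}
```

The totals match the col4 values given in the question.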

Create multiple percentage columns based on existing columns in R

I want to create multiple columns that show the percentage of each element of col2, col3 and Total. The code I came up with only pastes the percentage into the existing columns instead of putting it in new columns.
I have searched Stack Overflow and Google but have not found the answer I was looking for.
Sample data :
data <- data.table(col1= c("A", "B", "C"),
col2= c(43,23,19),
col3= c(102,230,149))
data <- data[, Total := col2 + col3]
data <- janitor::adorn_totals(data)
Output :
col1 col2 col3 Total
A 43 102 145
B 23 230 253
C 19 149 168
Total 85 481 566
My percentage function :
add_percent <- function(dt, col_no_percent, col_percent) {
  dt <- dt[
    , c(.SD[, col_no_percent, with = FALSE],
        lapply(.SD[, col_percent, with = FALSE], function(x) {
          paste0(x, format(round(x / sum(x) * 100 * 2, 1), nsmall = 1, decimal.mark = "."))
        }))
  ]
}
Data output with my function:
data <- add_percent(data, "col1", c("col2", "col3", "Total"))
col1 col2 col3 Total
A 43 50.6 102 21.2 145 25.6
B 23 27.1 230 47.8 253 44.7
C 19 22.4 149 31.0 168 29.7
Total 85 100.0 481 100.0 566 100.0
Data output I want :
col1 col2 col3 Total col2.x col3.x Total.x
A 43 102 145 50.6 21.2 25.6
B 23 230 253 27.1 47.8 44.7
C 19 149 168 22.4 31.0 29.7
Total 85 481 566 100.0 100.0 100.0
It is possible that my data will contain way more columns, so all the new columns will have to be created "automatically". So I would like to know how to generate those columns based on my percent function or even a more efficient way if possible.
Thank you.
Initial data. Note that I removed the janitor step; that part is done at the end.
data <- data.table(col1= c("A", "B", "C"),
col2= c(43,23,19),
col3= c(102,230,149))
data <- data[, Total := col2 + col3]
Add percent columns for all numeric columns and add "Total" row
cols <- names(data)[sapply(data, is.numeric)]
data[, paste0(cols, '_pct') := lapply(.SD, function(x) 100*x/sum(x)),
     .SDcols = cols]
janitor::adorn_totals(data)
# col1 col2 col3 Total col2_pct col3_pct Total_pct
# A 43 102 145 50.58824 21.20582 25.61837
# B 23 230 253 27.05882 47.81705 44.69965
# C 19 149 168 22.35294 30.97713 29.68198
# Total 85 481 566 100.00000 100.00000 100.00000
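The key point is the order of operations: the percentages are computed before the Total row exists, so sum(x) is the true column total. A plain Python sketch of the same idea (not data.table code; add_pct_columns and the dict-of-lists layout are assumptions made for illustration):

```python
def add_pct_columns(table):
    """table: dict of column name -> list. Adds '<name>_pct' per numeric column."""
    out = dict(table)
    for name, values in table.items():
        # Only numeric columns get a percentage column, like sapply(is.numeric).
        if all(isinstance(v, (int, float)) for v in values):
            total = sum(values)
            out[name + "_pct"] = [round(100 * v / total, 1) for v in values]
    return out

data = {"col1": ["A", "B", "C"], "col2": [43, 23, 19], "col3": [102, 230, 149]}
data["Total"] = [a + b for a, b in zip(data["col2"], data["col3"])]
result = add_pct_columns(data)
print(result["col2_pct"], result["Total_pct"])
# [50.6, 27.1, 22.4] [25.6, 44.7, 29.7]
```

If the Total row were already in the data, every column sum would be doubled, which is presumably why the function in the question multiplies by 2.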
I know it is a data.table question, but dplyr has a really nice way of doing this. So just to add it as one possible answer.
library(dplyr)
# this is your function (slightly changed)
as_perc <- function(x) {
paste0(format(100 * (round(x/ sum(x), 2)), nsmall = 1, decimal.mark = "."), "%")
}
data %>%
mutate_if(is.numeric, .funs = list(perc = ~ as_perc(.)))
col1 col2 col3 Total col2_perc col3_perc Total_perc
1 A 43 102 145 51.0% 21.0% 26.0%
2 B 23 230 253 27.0% 48.0% 45.0%
3 C 19 149 168 22.0% 31.0% 30.0%

how can I merge columns from 2 tables with different number of rows using SQLite?

I have 2 tables, 'tab1' and 'tab2':
tab1 is:
col1
25.0
30.0
31.0
25.0
tab2 is:
col1 col2
25.0 11.0
30.0 99.0
31.0 57.0
I want to get the following merged table result by matching the col1 values in tab1 with col1 in tab2 (thus filling in using col2 values from tab2):
col1 col2
25.0 11.0
30.0 99.0
31.0 57.0
25.0 11.0
I am using this sqlite code currently:
INSERT INTO `merged_table1`
SELECT * FROM tab1 LEFT JOIN tab2
ON tab1.col1 = tab2.col1;
However, the result is not correct (giving an extra column):
25 25 11
30 30 99
31 31 57
25 25 11
If the columns actually have the same name, you can do the join using the USING clause, which automatically removes the duplicate column:
INSERT ...
SELECT *
FROM tab1
LEFT JOIN tab2 USING (col1);
Otherwise, just tell the database which columns you want:
INSERT ...
SELECT tab1.col1,
tab2.col2
FROM tab1
LEFT JOIN tab2 ON tab1.col1 = tab2.col1;
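Both variants can be tried quickly against an in-memory database. This is a small sketch using Python's built-in sqlite3 module; the REAL column types are an assumption, since the question does not show the table definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab1 (col1 REAL);
    CREATE TABLE tab2 (col1 REAL, col2 REAL);
    INSERT INTO tab1 VALUES (25.0), (30.0), (31.0), (25.0);
    INSERT INTO tab2 VALUES (25.0, 11.0), (30.0, 99.0), (31.0, 57.0);
""")

# ON keeps both col1 columns under SELECT *, which is the extra column
# the question ran into.
with_on = conn.execute(
    "SELECT * FROM tab1 LEFT JOIN tab2 ON tab1.col1 = tab2.col1"
).fetchall()

# USING (col1) merges the shared column, so SELECT * returns two columns.
with_using = conn.execute(
    "SELECT * FROM tab1 LEFT JOIN tab2 USING (col1)"
).fetchall()

print(len(with_on[0]), len(with_using[0]))  # 3 2
print(sorted(with_using))
```

Listing the columns explicitly is the more robust choice if the target table's layout matters, because the INSERT then no longer depends on what SELECT * happens to expand to.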

r count cells with missing values across each row [duplicate]

I have a dataframe as shown below
Id Date Col1 Col2 Col3 Col4
30 2012-03-31 A42.2 20.46 NA
36 1996-11-15 NA V73 55
96 2010-02-07 X48 Z16 13
40 2010-03-18 AD14 20.12 36
69 2012-02-21 22.45
11 2013-07-03 81 V017 TCG11
22 2001-06-01 67
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12
34 2014-03-10 82.12 N72.22 V45.44
I am trying to count the number of NA or Empty cells across each row and the final expected output is as follows
Id Date Col1 Col2 Col3 Col4 MissCount
30 2012-03-31 A42.2 20.46 NA 2
36 1996-11-15 NA V73 55 2
96 2010-02-07 X48 Z16 13 1
40 2010-03-18 AD14 20.12 36 1
69 2012-02-21 22.45 3
11 2013-07-03 81 V017 TCG11 1
22 2001-06-01 67 3
83 2005-03-16 80.45 V22.15 46.52 X29.11 0
92 2012-02-12 4
34 2014-03-10 82.12 N72.22 V45.44 1
The last column MissCount will store the number of NAs or empty cells for each row. Any help is much appreciated.
The one-liner
rowSums(is.na(df) | df == "")
given by @DavidArenburg in his comment is definitely the way to go, assuming that you don't mind checking every column in the data frame. If you really only want to check Col1 through Col4, then using an apply function might make more sense.
apply(df, 1, function(x) {
  sum(is.na(x[c("Col1", "Col2", "Col3", "Col4")])) +
    sum(x[c("Col1", "Col2", "Col3", "Col4")] == "", na.rm=TRUE)
})
Edit: Shortened code
apply(df[c("Col1", "Col2", "Col3", "Col4")], 1, function(x) {
  sum(is.na(x)) +
    sum(x == "", na.rm=TRUE)
})
or, if the data columns are positioned exactly as in the example data:
apply(df[3:6], 1, function(x) {
  sum(is.na(x)) +
    sum(x == "", na.rm=TRUE)
})
This should do it.
yourframe$MissCount = rowSums(is.na(yourframe) | yourframe == "" | yourframe == " ")
You can use by_row, which was originally part of purrr and has since moved to the purrrlyr package:
library(purrrlyr)
#sample data frame
x <- data.frame(A1=c(1,NA,3,NA),
A2=c("A","B"," ","C"),
A3=c(" "," ",NA,"t"))
Here you apply a function to each row. You can edit it to match your condition and use whatever function you want.
In the following example, I counted empty or NA entries in each row by using sum(...):
by_row(x, function(y) sum(y==" "| (is.na(y))),
.to="MissCount",
.collate = "cols"
)
You will get:
# A tibble: 4 x 4
A1 A2 A3 MissCount
<dbl> <fctr> <fctr> <int>
1 1 A 1
2 NA B 2
3 3 NA 2
4 NA C t 1
We can use
Reduce(`+`, lapply(df, function(x) is.na(x)|!nzchar(as.character(x))))
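All of the answers above apply the same per-row rule: a cell counts as missing if it is NA or blank. For comparison, here is that rule as a tiny Python sketch, with None standing in for NA (the function name miss_count and the choice of blank strings are assumptions for illustration).

```python
def miss_count(row, blanks=("", " ")):
    """Count cells in one row that are None (NA) or a blank string."""
    return sum(1 for cell in row if cell is None or cell in blanks)

# Same sample data as the by_row example, written row by row.
rows = [
    (1, "A", " "),
    (None, "B", " "),
    (3, " ", None),
    (None, "C", "t"),
]
print([miss_count(r) for r in rows])
# [1, 2, 2, 1]
```

The counts agree with the MissCount column shown in the by_row output.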

Resources