Split columns that are dictionaries

I have a data frame with about 200,000 rows and columns like these:
|ID|dictionary column 1|dictionary column 2|
|-|-|-|
|1|{""1720100"":4,""1720101"":3}|{""1720100"":5,""1720101"":1,""1720102"":2}|
|2|{""1720100"":4}|{""1720100"":4,""1720101"":2}|
|...|...|...|
The output table I would like to get is:
|ID|col_a|col_b|col_c|col_d|
|-|-|-|-|-|
|1|1720100|4|1720101|5|
|1|1720101|3|1720102|1|
|1|NA|NA|1720103|2|
|2|1720100|4|1720101|4|
|2|NA|NA|1720102|2|
|...|...|...|...|...|
I also suspect it would help to divide the data frame into several chunks before splitting the columns, to reduce the time needed for the calculation. Could anyone help me with this?

It looks like you want to extract JSON from the columns, which the jsonlite package can do. Since there is JSON in two columns, first put the data into longer form and parse it, then pivot again to get the desired final format. The final select just reorders the columns by the number contained in each column name.
library(tidyverse)
library(jsonlite)
df %>%
  pivot_longer(cols = -ID) %>%
  mutate(json_parsed = map(value, ~fromJSON(sprintf("[%s]", .), flatten = TRUE))) %>%
  unnest(json_parsed) %>%
  pivot_longer(cols = -c(ID, name, value), names_to = "n", values_to = "v") %>%
  pivot_wider(id_cols = ID, values_from = c(n, v), values_fn = list) %>%
  unnest(cols = -ID) %>%
  select(ID, order(parse_number(names(.)[-1])) + 1)
Output
ID n_dictionary_column_1 v_dictionary_column_1 n_dictionary_column_2 v_dictionary_column_2
<dbl> <chr> <int> <chr> <int>
1 1 1720100 4 1720100 5
2 1 1720101 3 1720101 1
3 1 1720102 NA 1720102 2
4 2 1720100 4 1720100 4
5 2 1720101 NA 1720101 2
6 2 1720102 NA 1720102 NA
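
The question also asks about working in chunks, which the answer above does not cover. The following is only a sketch of one way to do that, assuming the cells hold plain JSON objects with single (not doubled) quotes and that the columns are named dictionary_column_1 and dictionary_column_2. The helper parse_chunk and the value of chunk_size are hypothetical names chosen for illustration, and whether chunking actually saves time has not been measured here; it mainly bounds how much is parsed at once.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(jsonlite)

# Small stand-in for the real 200,000-row data frame (assumed layout).
df <- tibble(
  ID = c(1, 2),
  dictionary_column_1 = c('{"1720100":4,"1720101":3}', '{"1720100":4}'),
  dictionary_column_2 = c('{"1720100":5,"1720101":1,"1720102":2}',
                          '{"1720100":4,"1720101":2}')
)

# Hypothetical helper: turn the JSON cells of one chunk into long key/value rows.
parse_chunk <- function(chunk) {
  chunk %>%
    pivot_longer(-ID, names_to = "source", values_to = "json") %>%
    mutate(parsed = map(json, function(txt) {
      x <- fromJSON(txt)  # a JSON object becomes a named list
      tibble(key = names(x), value = as.integer(unlist(x)))
    })) %>%
    select(-json) %>%
    unnest(parsed)
}

chunk_size <- 1  # deliberately tiny for the demo; use something larger in practice
chunks <- split(df, (seq_len(nrow(df)) - 1) %/% chunk_size)
long <- bind_rows(lapply(chunks, parse_chunk))
```

The result, long, has one row per ID, source column, and key, which can then be pivoted to whatever wide layout is needed.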

Related

How do I reverse gather function in R?

I used the gather function from tidyr to change the data from wide to long format. After that I removed the NA values. How can I reverse this operation?
first.df <- data.frame(a = c(1,NA,NA),b=c(2,3,NA),c=c(4,5,6))
second.df <- first.df %>% gather("x","y") %>% na.omit %>% `rownames<-`( NULL )
second.df:
|x|y|
|-|-|
|a|1|
|b|2|
|b|3|
|c|4|
|c|5|
|c|6|
How can I reverse this with spread or another function to get back to first.df, shown below?
|a|b|c|
|-|-|-|
|1|2|4|
|NA|3|5|
|NA|NA|6|

This requires a sequence column (a row index within each value of x); after that, dcast, pivot_wider, or spread can do the reshaping. Here rowid comes from data.table.
library(dplyr)
library(tidyr)
library(data.table)
second.df %>%
mutate(rn = rowid(x)) %>%
pivot_wider(names_from = x, values_from = y) %>%
select(-rn)
-output
# A tibble: 3 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 4
2 NA 3 5
3 NA NA 6
Or with dcast
library(data.table)
dcast(second.df, rowid(x) ~ x, value.var = 'y')[,-1]
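
If loading data.table only for rowid is not wanted, the same sequence column can be built with dplyr alone. This is a sketch of that variant, not part of the original answer; the name restored is chosen here for illustration. Note that na.omit discards the original positions of the NAs, so this only recovers first.df when the NAs sit at the bottom of each column, as they do in the example.

```r
library(dplyr)
library(tidyr)

first.df <- data.frame(a = c(1, NA, NA), b = c(2, 3, NA), c = c(4, 5, 6))
second.df <- first.df %>% gather("x", "y") %>% na.omit() %>% `rownames<-`(NULL)

restored <- second.df %>%
  group_by(x) %>%
  mutate(rn = row_number()) %>%   # per-group index, same role as rowid(x)
  ungroup() %>%
  pivot_wider(names_from = x, values_from = y) %>%
  select(-rn)
```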

R Rowise operation on partially similar columns

I have a df with multiple instances of each column. I'd like to "concatenate" the values in each row across each group of columns, preferably with dplyr if possible.
Example:
|IDs|p001_i1|p001_i2|p501_i1|p501_i2|p501_i3|
|-|-|-|-|-|-|
|AA|NA|NA|1|NA|NA|
|AB|5|NA|NA|NA|NA|
|AC|NA|10|NA|NA|2|
Here, for example, I'd like to group the columns whose names start the same way and report their non-missing value for each row. The prefixes here are "p001" and "p501", but there can be many more, with a different number of instances for each "pxxx" (p001 has 2 columns, p501 has 3).
The final result would be:
|IDs|p001|p501|
|-|-|-|
|AA|NA|1|
|AB|5|NA|
|AC|10|2|
If there are multiple values in a row, it could, for example, take the mean or the value of the latest field (i.e. priority i3 > i2 > i1).
I've looked at across(), c_across() and pmap(), but I can't see how to implement that on each "group" of columns. Thanks!

Here is a tidyverse option
library(tidyverse)
df %>%
pivot_longer(
-IDs, names_pattern = c("(.+)_(.+)"), names_to = c("name", NA)) %>%
group_by(IDs, name) %>%
summarise(value = mean(value, na.rm = TRUE), .groups = "drop") %>%
pivot_wider()
## A tibble: 3 × 3
# IDs p001 p501
# <chr> <dbl> <dbl>
#1 AA NaN 1
#2 AB 5 NaN
#3 AC 10 2
Explanation: Reshape from wide to long (and only retain the first part of the wide column names "p001", "p501" through using names_pattern and names_to), group by IDs and name, then calculate the mean (ignoring NAs) and reshape again from long to wide.
Sample data
df <-read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header =T)

Another tidyverse option, which keeps the latest non-missing value per group. Note that, as written, it leaves two separate rows for AC:
df %>%
pivot_longer(-1) %>%
group_by(IDs, name_prefix = str_extract(name, "^.+?(?=_)")) %>%
filter(!is.na(value)) %>%
slice_tail(n = 1) %>%
ungroup() %>%
pivot_wider(names_from = name_prefix, values_from = value, values_fill = NA) %>%
select(-name)
# A tibble: 4 × 3
IDs p501 p001
<chr> <int> <int>
1 AA 1 NA
2 AB NA 5
3 AC NA 10
4 AC 2 NA
Also, you can check this out:
How to get value of last non-NA column
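
The duplicated AC row above appears because the name column is still present when pivot_wider runs, so it is treated as an identifier. A possible adjustment, offered as a sketch rather than the answerer's own code, is to drop name before pivoting. It assumes the sample data from the first answer and relies on pivot_longer keeping the i1, i2, i3 column order so that slice_tail picks the latest field; latest and prefix are names chosen here.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- read.table(text = "IDs p001_i1 p001_i2 p501_i1 p501_i2 p501_i3
AA NA NA 1 NA NA
AB 5 NA NA NA NA
AC NA 10 NA NA 2", header = TRUE)

latest <- df %>%
  pivot_longer(-IDs) %>%
  mutate(prefix = str_extract(name, "^[^_]+")) %>%  # "p001", "p501"
  filter(!is.na(value)) %>%
  group_by(IDs, prefix) %>%
  slice_tail(n = 1) %>%     # last non-missing field in each group
  ungroup() %>%
  select(-name) %>%         # drop before pivoting so AC collapses to one row
  pivot_wider(names_from = prefix, values_from = value)
```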

R Extract specific text from column into multiple columns [duplicate]

I have a dataframe exported from the web with this format
id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12}
I would like to extract ONLY the numbers (ignore curlys and commas) into multiple columns like this
id val_1 val_2 val_3 val_4 val_5
1 7 12 58 1
2 1 2 5 7
3 15 12
Even though the largest number of values in any row is 4, I always want the columns to go up to val_5.
Thanks!

We could use str_extract_all for this:
library(dplyr)
library(stringr)
df %>%
  mutate(vals = str_extract_all(vals, '\\d+', simplify = TRUE))
or, as @akrun suggests in the comments,
df %>%
  mutate(vals = str_extract_all(vals, '\\d+', simplify = TRUE)) %>%
  do.call(data.frame, .)
id vals.1 vals.2 vals.3 vals.4
1 1 7 12 58 1
2 2 1 2 5 7
3 3 15 12 <NA> <NA>
data:
df <- structure(list(id = 1:3, vals = c("{7,12,58,1}", "{1,2,5,7}",
"{15,12}")), class = "data.frame", row.names = c(NA, -3L))

Another possible tidyverse option: remove the curly brackets, separate the rows on the commas, then pivot to wide form. After that, we can create the additional column (using add_column from tibble) by taking the largest number in the column names (4 in this case) and adding one, which gives val_5.
library(tidyverse)
df %>%
mutate(vals = str_replace_all(vals, "\\{|\\}", "")) %>%
separate_rows(vals, sep=",") %>%
group_by(id) %>%
mutate(ind = row_number()) %>%
pivot_wider(names_from = ind, values_from = vals, names_prefix = "val_") %>%
add_column(!!(paste0("val_", parse_number(names(.)[ncol(.)])+1)) := NA)
Output
id val_1 val_2 val_3 val_4 val_5
1 1 7 12 58 1 NA
2 2 1 2 5 7 NA
3 3 15 12 <NA> <NA> NA
Data
df <- read.table(text = "id vals
1 {7,12,58,1}
2 {1,2,5,7}
3 {15,12} ", header = T)

Using data.table
library(data.table)
library(stringi)
result <- setDT(df)[, stri_match_all_regex(vals, '\\d+')[[1]], by=.(id)]
result[, item:=paste('val', 1:.N, sep='_'), by=.(id)] # defines column names
dcast(result, id~item, value.var = 'V1') # convert from long to wide
## id val_1 val_2 val_3 val_4
## 1: 1 7 12 58 1
## 2: 2 1 2 5 7
## 3: 3 15 12 <NA> <NA>

is there an R code for the following data wrangling and transformation

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to build a new data set from dat1. The code should automatically create n columns named s02__0, s02__1, s02__2, s02__3, s02__4 (here n == 5) and, for each id in dat1, allocate its s02 values to s02__0 through s02__4 in order. Each output row is uniquely identified within its id by a second identifier, id_2, which numbers the rows created for that id. If an id has fewer s02 values than there are columns in the row, the remaining cells should be filled with ##N/A##. If it has more than n values, a new row with the next id_2 is created to hold the extra values, and every blank cell is again filled with ##N/A##.
From the dataset above, I would wish to have the following output
# NA below stands for the ##N/A## placeholder described above
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
s02__2<-c(3,3,3,NA,3,8,13,18,23,28)
s02__3<-c(4,4,4,NA,4,9,14,19,24,29)
s02__4<-c(NA,5,5,NA,5,10,15,20,25,NA)
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)

This can produce what you want:
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
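
The answer above splits on s02<=5 and uses only ids 1 to 3, so it does not cover id 4 with its 29 values. A more general variant, sketched here under the assumption that the values of each id should simply be laid out n per row in their existing order, derives the row and column from each value's position within its id. The names pos and slot are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Full data from the question, written compactly.
id  <- c(rep(1, 4), rep(2, 5), rep(3, 7), rep(4, 29))
s02 <- c(1:4, 1:5, 1:7, 1:29)
dat1 <- data.frame(id, s02)

n <- 5  # number of s02__ columns per row

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(pos  = row_number() - 1,              # 0-based position within the id
         id_2 = pos %/% n + 1,                 # which output row the value lands in
         slot = paste0("s02__", pos %% n)) %>% # which column it lands in
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = slot, values_from = s02)
```

Empty cells come out as NA. If the literal text ##N/A## is required, the s02__ columns would have to be converted to character and the NAs replaced afterwards.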

Removing NA's in tibble after using pivot_wider

I am trying to create a table from a data set that takes two factors from a variable, pivots them wider, and lines them up in a single row. Unfortunately, I either keep producing two separate lists, or I get this:
dput(head(test1, 5))
# Edited section:
test1 <- df %>% # Code used to create the table below
select(`Incident ID`,`Device Time`, Description, `Elapsed Time`) %>%
filter(Description == "CPR Stopped" | Description == "CPR Started") %>%
mutate(Index = c(1:16)) %>%
pivot_wider(names_from = Description,
values_from = c(`Elapsed Time`, `Device Time`)) %>%
filter(!is.na(test1))
dput(head(test1, 5))
`Incident ID` Index `Elapsed Time_CPR Started` `Elapsed Time_CPR Stopped` `Device Time_CPR Started` `Device Time_CPR Stopped`
<chr> <int> <time> <time> <time> <time>
1 F190158585 1 01'03" NA 18'37" NA
2 F190158585 2 NA 01'08" NA 18'42"
3 F190158585 3 01'34" NA 19'08" NA
4 F190158585 4 NA 03'47" NA 21'22"
5 F190158585 5 04'00" NA 21'35" NA
I am trying to get a table that looks like this:
df <- data.frame(Index = c(1:3),
CPR_Started = c("00:01:00", "00:02:03", "00:05:46"),
CPR_Stopped = c("00:01:53", "00:04:30", "00:08:00"))
print(df)
Index CPR_Started CPR_Stopped
1 1 00:01:00 00:01:53
2 2 00:02:03 00:04:30
3 3 00:05:46 00:08:00
