regex (in gathering multiple sets of columns with tidyr) - r

Inspired by Hadley's nifty gather approach in this answer, I tried to use tidyr's gather() and spread() in combination with a regular expression (regex), but I seem to be getting the regex wrong.
I studied several regex questions (this one, this one, and regex101.com). I also tried to circumvent the regex by using starts_with(), ends_with() and matches(), inspired by this question, but with no luck.
I am asking here in the hope that someone can show me where I go wrong so I can solve it, preferably using the select helpers from tidyselect.
I need two regex groups: one capturing everything up to the last . and one capturing what comes after the last .. I made the two examples below: one where my code works and one where I am stuck.
First, the example that works.
# install.packages(c("tidyverse"), dependencies = TRUE)
require(tidyverse)
The first data set, which works, looks like this,
myData1 <- tibble(
  id = 1:10,
  Wage.1997.1 = c(NA, 32:38, NA, NA),
  Wage.1997.2 = c(NA, 12:18, NA, NA),
  Wage.1998.1 = c(NA, 42:48, NA, NA),
  Wage.1998.2 = c(NA, 2:8, NA, NA),
  Wage.1998.3 = c(NA, 42:48, NA, NA),
  Job.Type.1997.1 = NA,
  Job.Type.1997.2 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.2 = c(NA, rep(c('A', 'B'), 4), NA)
)
and this is how I gather() it,
myData1 %>% gather(key, value, -id) %>%
  extract(col = key, into = c("variable", "id.job"), regex = "(.*?\\..*?)\\.(.)$") %>%
  spread(variable, value)
#> # A tibble: 30 x 6
#> id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 <NA> <NA> <NA> <NA>
#> 2 1 2 <NA> <NA> <NA> <NA>
#> 3 1 3 <NA> <NA> <NA> <NA>
#> 4 2 1 <NA> A 32 42
#> 5 2 2 A A 12 2
#> 6 2 3 <NA> <NA> <NA> 42
#> 7 3 1 <NA> B 33 43
#> 8 3 2 B B 13 3
#> 9 3 3 <NA> <NA> <NA> 43
#> 10 4 1 <NA> A 34 44
#> # ... with 20 more rows
It works; I suspect I am overdoing it with the regex, but it works. However, my real data can have either one or two digits at the end.
The second data set, where I get stuck, looks like this,
myData2 <- tibble(
  id = 1:10,
  Wage.1997.1 = c(NA, 32:38, NA, NA),
  Wage.1997.12 = c(NA, 12:18, NA, NA),
  Wage.1998.1 = c(NA, 42:48, NA, NA),
  Wage.1998.12 = c(NA, 2:8, NA, NA),
  Wage.1998.13 = c(NA, 42:48, NA, NA),
  Job.Type.1997.1 = NA,
  Job.Type.1997.12 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
  Job.Type.1998.12 = c(NA, rep(c('A', 'B'), 4), NA)
)
Now, this is where I use (0[0-1]|1[0-9])$ for the second group; I also tried things like \d{1}|\d{2}, but that did not work either.
myData2 %>% gather(key, value, -id) %>%
  extract(col = key, into = c("variable", "id.job"),
          regex = "(.*?\\..*?)\\.(0[0-1]|1[0-9])$") %>%
  spread(variable, value)
The expected output would be something like this,
#> # A tibble: 30 x 6
#> id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 <NA> <NA> <NA> <NA>
#> 2 1 12 <NA> <NA> <NA> <NA>
#> 3 1 13 <NA> <NA> <NA> <NA>
#> 4 2 1 <NA> A 32 42
#> 5 2 12 A A 12 2
#> 6 2 13 <NA> <NA> <NA> 42
#> 7 3 1 <NA> B 33 43
#> 8 3 12 B B 13 3
#> 9 3 13 <NA> <NA> <NA> 43
#> 10 4 1 <NA> A 34 44
#> # ... with 20 more rows
A simple solution à la this question, using select helpers like starts_with(), ends_with(), matches(), etc., would be appreciated.

We can change the regex in extract to match characters and capture them as a group ((.*)) from the start (^) of the string, followed by a dot (\\.) and one or more characters that are not a dot, captured as a group (([^.]+)), until the end ($) of the string:
myData2 %>%
  gather(key, value, -id) %>%
  extract(col = key, into = c("variable", "id.job"), "^(.*)\\.([^.]+)$") %>%
  spread(variable, value)
# A tibble: 30 x 6
# id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
# * <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 1 <NA> <NA> <NA> <NA>
# 2 1 12 <NA> <NA> <NA> <NA>
# 3 1 13 <NA> <NA> <NA> <NA>
# 4 2 1 <NA> A 32 42
# 5 2 12 A A 12 2
# 6 2 13 <NA> <NA> <NA> 42
# 7 3 1 <NA> B 33 43
# 8 3 12 B B 13 3
# 9 3 13 <NA> <NA> <NA> 43
#10 4 1 <NA> A 34 44
# ... with 20 more rows
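As a side note (not from the original answer): on tidyr 1.0.0 or later, the gather()/extract()/spread() chain can usually be collapsed into a single pivot_longer() call that reuses the same regex via names_pattern. A minimal sketch, assuming myData2 as defined above:
library(tidyr)
myData2 %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "id.job"),   # first capture becomes the output column name, second the key column
    names_pattern = "^(.*)\\.([^.]+)$"  # the same "split on the last dot" regex
  )
A nice side effect is that the Wage columns stay numeric instead of being coerced to character, as happens in the gather()/spread() output above.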

Related

R Lag Variable And Skip Value Between

DATA = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,4,4),
                  SCORE = c(6,4,8,10,9,0,2,3,3,7),
                  CLASS = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'B'),
                  WANT = c(NA, NA, 2, NA, NA, -10, NA, NA, NA, NA))
I have DATA and wish to create 'WANT', which is calculated by:
For each STUDENT, compute SCORE(CLASS = C) - SCORE(CLASS = A) and place it on the CLASS = C row.
EX: SCORE(STUDENT = 1, CLASS = C) - SCORE(STUDENT = 1, CLASS = A) = 8 - 6 = 2
Assuming at most one 'C' and one 'A' CLASS per 'STUDENT', just subset the 'SCORE' where the CLASS value is 'C' or 'A', do the subtraction, and assign the value only to the position where CLASS is 'C', making all other positions NA (after grouping by 'STUDENT').
library(dplyr)
DATA <- DATA %>%
  group_by(STUDENT) %>%
  mutate(WANT2 = (SCORE[CLASS == 'C'][1] - SCORE[CLASS == 'A'][1]) *
           NA^(CLASS != "C")) %>%
  ungroup
-output
# A tibble: 10 × 5
STUDENT SCORE CLASS WANT WANT2
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 6 A NA NA
2 1 4 B NA NA
3 1 8 C 2 2
4 2 10 A NA NA
5 2 9 B NA NA
6 2 0 C -10 -10
7 3 2 B NA NA
8 3 3 C NA NA
9 4 3 A NA NA
10 4 7 B NA NA
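In case the NA^(CLASS != "C") term looks cryptic: it relies on NA^0 being 1 in R, so the term is 1 on the 'C' row and NA everywhere else, and the multiplication blanks out every other row. A more explicit sketch of the same idea, under the same at-most-one-'C'/'A' assumption (WANT3 is just an illustrative column name):
DATA %>%
  group_by(STUDENT) %>%
  mutate(WANT3 = if_else(CLASS == "C",
                         SCORE[match("C", CLASS)] - SCORE[match("A", CLASS)],  # first C minus first A in the group
                         NA_real_)) %>%
  ungroup()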
Here is a solution that organizes the data in a wider format first and then reshapes it back to a longer format below. This solution works regardless of the order of the "CLASS" column (for instance, if there is a case in which the CLASS order is CBA or BCA instead of ABC, this solution will still work).
Solution
library(dplyr)
library(tidyr)
wider <- DATA %>% select(-WANT) %>%
  pivot_wider(names_from = "CLASS", values_from = "SCORE") %>%
  rowwise() %>%
  mutate(WANT = C - A) %>%
  ungroup()
output wider
# A tibble: 4 × 5
STUDENT A B C WANT
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 4 8 2
2 2 10 9 0 -10
3 3 NA 2 3 NA
4 4 3 7 NA NA
If you really want output like your example, then we can reorganize the wider data this way:
Reorganizing wider to long format
wider %>%
  pivot_longer(A:C, values_to = "SCORE", names_to = "CLASS") %>%
  relocate(WANT, .after = SCORE) %>%
  mutate(WANT = if_else(CLASS == "C", WANT, NA_real_))
Final Output
# A tibble: 12 × 4
STUDENT CLASS SCORE WANT
<dbl> <chr> <dbl> <dbl>
1 1 A 6 NA
2 1 B 4 NA
3 1 C 8 2
4 2 A 10 NA
5 2 B 9 NA
6 2 C 0 -10
7 3 A NA NA
8 3 B 2 NA
9 3 C 3 NA
10 4 A 3 NA
11 4 B 7 NA
12 4 C NA NA

Splitting the data in brackets in R

Hi, I have a dataset that looks like this:
PTNUM   AGE1_2    AGE2_3          AGE3_2
12345   (23,35)   NA              NA
12346   NA        (23,28,34,44)   (45,50)
12347   (17,22)   NA              (38,45)
I would like to have the output looking like this
PTNUM   AGE1_1   AGE1_2   AGE2_2   AGE2_3   AGE3_3   AGE3_2
12345   23       35       NA       NA       NA       NA
12346   NA       NA       23       28       NA       NA
12346   NA       NA       34       44       45       50
12347   17       22       NA       NA       38       45
I tried this code in R, just to try splitting AGE1_2 into AGE1_1 and AGE1_2, but it resulted in all the rows of AGE1_1 and AGE1_2 being NAs.
ZX_1_2 <- extract(ZX, AGE1_2, into = c('AGE1_1', 'AGE1_2'),
regex = "(.?) \((.?)\)")
Could someone help me get the expected result?
We could use
library(purrr)
library(tidyr)
library(stringr)
# split each AGE column into two numeric columns (note: the data object is ZX, as defined below)
map_dfc(names(ZX)[-1], ~ ZX %>%
          select(all_of(.x)) %>%
          extract(1, into = str_c(names(.), "_", 1:2),
                  "\\((\\d+),(\\d+)\\)", convert = TRUE)) %>%
  bind_cols(ZX['PTNUM'], .)
-output
PTNUM AGE1_2_1 AGE1_2_2 AGE2_3_1 AGE2_3_2 AGE3_2_1 AGE3_2_2
1 12345 23 35 NA NA NA NA
2 12346 NA NA 34 40 45 50
3 12347 17 22 NA NA 38 45
Or another option is
ZX %>%
  mutate(across(starts_with('AGE'),
                ~ read.csv(text = str_remove_all(.x, "\\(|\\)"),
                           header = FALSE, fill = TRUE))) %>%
  unpack(where(is.data.frame), names_sep = "_")
-output
# A tibble: 3 × 7
PTNUM AGE1_2_V1 AGE1_2_V2 AGE2_3_V1 AGE2_3_V2 AGE3_2_V1 AGE3_2_V2
<int> <int> <int> <int> <int> <int> <int>
1 12345 23 35 NA NA NA NA
2 12346 NA NA 34 40 45 50
3 12347 17 22 NA NA 38 45
For the updated data
library(data.table)   # for rowid()
ZX2 %>%
  pivot_longer(cols = starts_with("AGE")) %>%                # long format: one row per PTNUM/AGE column
  mutate(value = str_remove_all(value, "\\(|\\)")) %>%       # strip the brackets
  separate_rows(value, sep = ",") %>%                        # one row per individual age
  group_by(PTNUM, name) %>%
  mutate(rn = as.integer(gl(n(), 2, n()))) %>%               # pair index: every two ages become one output row
  ungroup %>%
  mutate(rn2 = rowid(PTNUM, name, rn)) %>%                   # position within the pair (1 or 2)
  unite(name, name, rn2) %>%                                 # e.g. AGE2_3_1, AGE2_3_2
  pivot_wider(names_from = name, values_from = value) %>%
  select(-rn) %>%
  group_by(PTNUM) %>%
  mutate(across(everything(), ~ .x[order(!is.na(.x))])) %>%  # within each PTNUM, move NAs before non-NA values
  ungroup
-output
# A tibble: 4 × 7
PTNUM AGE1_2_1 AGE1_2_2 AGE2_3_1 AGE3_2_1 AGE2_3_2 AGE3_2_2
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 12345 23 35 <NA> <NA> <NA> <NA>
2 12346 <NA> <NA> 23 <NA> 28 <NA>
3 12346 <NA> <NA> 34 45 44 50
4 12347 17 22 <NA> 38 <NA> 45
data
ZX <- structure(list(PTNUM = 12345:12347, AGE1_2 = c("(23,35)", NA,
"(17,22)"), AGE2_3 = c(NA, "(34,40)", NA), AGE3_2 = c(NA, "(45,50)",
"(38,45)")), class = "data.frame", row.names = c(NA, -3L))
ZX2 <- structure(list(PTNUM = 12345:12347, AGE1_2 = c("(23,35)", NA,
"(17,22)"), AGE2_3 = c(NA, "(23,28,34,44)", NA), AGE3_2 = c(NA,
"(45,50)", "(38,45)")), class = "data.frame", row.names = c(NA,
-3L))
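A likely reason the extract() attempt in the question returns only NAs (a side note, not from the answers above): in an R string literal the backslashes must be doubled (\\( rather than \(), so the pattern as posted is probably not what R actually sees, and the lazy single-character groups (.?) would not capture two-digit ages in any case. A corrected sketch for the simple two-value column, reusing the regex from the first answer:
library(tidyr)
ZX_1_2 <- extract(ZX, AGE1_2, into = c("AGE1_1", "AGE1_2"),
                  regex = "\\((\\d+),(\\d+)\\)", convert = TRUE)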

How to conditionally fill in rows with NAs depending on an evaluation of the first two columns in a dataframe using R?

I have a dataframe like this:
df <- data.frame(
  Metric = c('WI', NA, 'MN', NA, 'CO', NA),
  Eval = c('WI', NA, 'AK', NA, 'CO', NA),
  colA = c(30, 'ABC', 45, 'DEF', 2, 'XYZ'),
  colB = c(25, 'BEC', 23, 'FED', 50, 'HIJ')
)
I'd like to evaluate whether the Metric column and the Eval column are equal, and, if they are, fill everything to the right of the Metric column with NAs so the result would look like this:
df_desired <- data.frame(
  Metric = c('WI', NA, 'MN', NA, 'CO', NA),
  Eval = c(NA, NA, 'AK', NA, NA, NA),
  colA = c(NA, 'ABC', 45, 'DEF', NA, 'XYZ'),
  colB = c(NA, 'BEC', 23, 'FED', NA, 'HIJ')
)
What's the best way to do this using R, ideally with a tidyverse function? I tried using mutate/across, but defining the conditional here is throwing me off.
Create a logical vector and do the assignment based on row/column indices or names (base R is more direct here):
i1 <- with(df, Metric == Eval & !is.na(Metric) & !is.na(Eval))
df[i1, -1] <- NA
-output
> df
Metric Eval colA colB
1 WI <NA> <NA> <NA>
2 <NA> <NA> ABC BEC
3 MN AK 45 23
4 <NA> <NA> DEF FED
5 CO <NA> <NA> <NA>
6 <NA> <NA> XYZ HIJ
Or with dplyr: create a logical column ('i1'), loop across the columns 'Eval' to 'colB', use case_when/ifelse/if_else/replace to change the values to NA based on 'i1', and remove the temporary column by assigning NULL:
library(dplyr)
df %>%
  mutate(i1 = Metric == Eval,
         across(Eval:colB, ~ case_when(i1 ~ NA_character_, TRUE ~ .)),
         i1 = NULL)
-output
Metric Eval colA colB
1 WI <NA> <NA> <NA>
2 <NA> <NA> ABC BEC
3 MN AK 45 23
4 <NA> <NA> DEF FED
5 CO <NA> <NA> <NA>
6 <NA> <NA> XYZ HIJ
Use mutate() and if_else() to conditionally replace values:
df |>
  mutate(colA = if_else(Metric == Eval, NA_character_, colA, missing = colA))
#> Metric Eval colA colB
#> 1 WI WI <NA> 25
#> 2 <NA> <NA> ABC BEC
#> 3 MN AK 45 23
#> 4 <NA> <NA> DEF FED
#> 5 CO CO <NA> 50
#> 6 <NA> <NA> XYZ HIJ
Note that we can't just use NA; we have to match it to the existing column type. In your example colA and colB are character vectors, so it's NA_character_. And we have to specify missing to handle the NA == NA case.
To generalise this across multiple columns, use across() and wrap the if_else() in an anonymous function:
df |>
  mutate(across(Eval:colB, ~if_else(Metric == Eval, NA_character_, ., missing = .)))
#> Metric Eval colA colB
#> 1 WI <NA> <NA> <NA>
#> 2 <NA> <NA> ABC BEC
#> 3 MN AK 45 23
#> 4 <NA> <NA> DEF FED
#> 5 CO <NA> <NA> <NA>
#> 6 <NA> <NA> XYZ HIJ
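A small addendum (an assumption about newer package versions, not from the answers above): on dplyr 1.1.0 or later, if_else() is built on vctrs and a bare NA is usually accepted, so the typed NA_character_ is no longer strictly needed:
# sketch, assuming dplyr >= 1.1.0; missing = .x still handles the NA == NA rows
df |>
  mutate(across(Eval:colB, ~ if_else(Metric == Eval, NA, .x, missing = .x)))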

how to find out how many respondents have at least 3 missing responses?

I am just starting to learn R (in RStudio). I have a data set with variables v1 to v6 that represent different groups and contain the values 0 and 1, representing the answers no and yes.
So my question is: how many respondents have at least 3 missing responses across questions v1 to v6?
Here is a solution in dplyr (part of the tidyverse), where the final output gives you a tibble with the number of missing responses for each individual.
library(tidyverse)
# Random number
set.seed(4)
# Make some example data, I assume it looks something like this
data = tibble(
  v1 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v2 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v3 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v4 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v5 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  v6 = sample(x = c("no", "yes", NA), size = 100, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  id = 1:100
)
data
#> # A tibble: 100 x 7
#> v1 v2 v3 v4 v5 v6 id
#> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 no yes no <NA> <NA> no 1
#> 2 yes no no no yes no 2
#> 3 yes yes no yes yes yes 3
#> 4 yes no yes yes no yes 4
#> 5 <NA> no yes <NA> yes yes 5
#> 6 yes no yes <NA> no <NA> 6
#> 7 no no no <NA> <NA> yes 7
#> 8 <NA> yes no <NA> <NA> yes 8
#> 9 <NA> <NA> no yes yes no 9
#> 10 yes <NA> yes <NA> yes yes 10
#> # ... with 90 more rows
# We then pivot the data into a long format
long_data = data %>%
  pivot_longer(cols = starts_with("v"), names_to = "group", values_to = "response")
long_data
#> # A tibble: 600 x 3
#> id group response
#> <int> <chr> <chr>
#> 1 1 v1 no
#> 2 1 v2 yes
#> 3 1 v3 no
#> 4 1 v4 <NA>
#> 5 1 v5 <NA>
#> 6 1 v6 no
#> 7 2 v1 yes
#> 8 2 v2 no
#> 9 2 v3 no
#> 10 2 v4 no
#> # ... with 590 more rows
# We then count the number of missing values for each individual and keep those with at least 3 (n > 2)
long_data %>%
  filter(is.na(response)) %>%
  group_by(id) %>%
  tally() %>%
  filter(n > 2)
#> # A tibble: 9 x 2
#> id n
#> <int> <int>
#> 1 8 3
#> 2 14 3
#> 3 19 3
#> 4 26 3
#> 5 36 3
#> 6 41 3
#> 7 49 3
#> 8 84 4
#> 9 90 3
Created on 2021-10-07 by the reprex package (v0.3.0)
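To boil this down to the single number the question asks for (a small addition, not in the original answer), count the rows of that result:
long_data %>%
  filter(is.na(response)) %>%
  count(id) %>%      # one row per id, with the number of NAs in column n
  filter(n >= 3) %>%
  nrow()
#> [1] 9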
You can try counting the values by row of your data.frame.
Copy, paste, and run all of the following code, seed included:
#1- Simulation data
set.seed(1)
values = c(0, 1, NA)
df = data.frame(
  v1 = sample(values, 10, TRUE),
  v2 = sample(values, 10, TRUE),
  v3 = sample(values, 10, TRUE),
  v4 = sample(values, 10, TRUE),
  v5 = sample(values, 10, TRUE),
  v6 = sample(values, 10, TRUE)
)
#2- Number of each value by row (restricted to the v1-v6 columns so the newly added count columns are not themselves counted)
#Number of NA values by row
df$nbNA = apply(df[paste0("v", 1:6)], 1, function(x) sum(is.na(x)))
#Number of 0 values by row
df$nb0 = apply(df[paste0("v", 1:6)], 1, function(x) sum(x == 0, na.rm = TRUE))
#Number of 1 values by row
df$nb1 = apply(df[paste0("v", 1:6)], 1, function(x) sum(x == 1, na.rm = TRUE))
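The question itself (how many respondents have at least 3 missing responses) can then be answered from the nbNA column; a one-line sketch:
#3- Number of respondents with at least 3 missing responses
sum(df$nbNA >= 3)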

R function for grouping rows based on patterns across columns?

I would like to group rows of a dataframe based on the pattern of each row across columns. Here is a very simple example.
df <- data.frame("gene" = 1:5,
"stg 1" = c("up", "up", NA, NA, NA),
"stg 2" = c("up", "up", NA, NA, NA),
"stg 3" = c("up", "up", NA, NA, NA),
"stg 4" = c("down", "down", "up", "up", NA))
> df
gene stg.1 stg.2 stg.3 stg.4
1 1 up up up down
2 2 up up up down
3 3 <NA> <NA> <NA> up
4 4 <NA> <NA> <NA> up
5 5 <NA> <NA> <NA> <NA>
In this case, gene 1 and 2 would be grouped, and genes 3 and 4 would be grouped. I would like the names of the genes in each pattern group, and what the pattern is for that group. I hope that is clear. Thanks in advance!
Try this approach: create a variable that collects the values across the stg columns using c_across() and toString(). After that, convert it to a factor and add the prefix Group.. Here is the code using tidyverse functions:
library(tidyverse)
#Code
dfnew <- df %>% group_by(gene) %>%
  mutate(Var = toString(c_across(stg.1:stg.4))) %>%
  ungroup() %>%
  mutate(Var = paste0('Group.', as.numeric(factor(Var, levels = unique(Var), ordered = T))))
Output:
# A tibble: 5 x 6
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down Group.1
2 2 up up up down Group.1
3 3 NA NA NA up Group.2
4 4 NA NA NA up Group.2
5 5 NA NA NA NA Group.3
If you only need a pattern, try this:
#Code 2
dfnew <- df %>% group_by(gene) %>%
  mutate(Var = toString(c_across(stg.1:stg.4)))
Output:
# A tibble: 5 x 6
# Groups: gene [5]
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down up, up, up, down
2 2 up up up down up, up, up, down
3 3 NA NA NA up NA, NA, NA, up
4 4 NA NA NA up NA, NA, NA, up
5 5 NA NA NA NA NA, NA, NA, NA
We can do this in a vectorized way with unite
library(dplyr)
library(tidyr)
df %>%
  unite(grp, starts_with('stg'), na.rm = TRUE, remove = FALSE) %>%
  mutate(grp = match(grp, unique(grp)))
# gene grp stg.1 stg.2 stg.3 stg.4
#1 1 1 up up up down
#2 2 1 up up up down
#3 3 2 <NA> <NA> <NA> up
#4 4 2 <NA> <NA> <NA> up
#5 5 3 <NA> <NA> <NA> <NA>
Though not specifically asked for, a data.table solution goes as follows:
library(data.table)
setDT(df)
df[, group := paste0(stg.1, stg.2, stg.3, stg.4), by = gene][, group := match(group, unique(group))]
> df
gene stg.1 stg.2 stg.3 stg.4 group
1: 1 up up up down 1
2: 2 up up up down 1
3: 3 <NA> <NA> <NA> up 2
4: 4 <NA> <NA> <NA> up 2
5: 5 <NA> <NA> <NA> <NA> 3
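Since the question also asks for the gene names in each pattern group, here is a small follow-up sketch (not part of the answers above) that builds on the unite() idea; the column names pattern, genes and n_genes are just illustrative:
library(dplyr)
library(tidyr)
df %>%
  unite(pattern, starts_with("stg"), sep = ", ", remove = FALSE) %>%  # NAs stay as the literal text "NA"
  group_by(pattern) %>%
  summarise(genes = toString(gene), n_genes = n())
Leaving na.rm at its default (FALSE) keeps the NA positions inside the pattern string, so rows that share the same non-NA values at different stages are not collapsed into one group.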
