R function for grouping rows based on patterns across columns?

I would like to group rows of a dataframe based on the pattern of each row across columns. Here is a very simple example.
df <- data.frame("gene" = 1:5,
"stg 1" = c("up", "up", NA, NA, NA),
"stg 2" = c("up", "up", NA, NA, NA),
"stg 3" = c("up", "up", NA, NA, NA),
"stg 4" = c("down", "down", "up", "up", NA))
> df
gene stg.1 stg.2 stg.3 stg.4
1 1 up up up down
2 2 up up up down
3 3 <NA> <NA> <NA> up
4 4 <NA> <NA> <NA> up
5 5 <NA> <NA> <NA> <NA>
In this case, genes 1 and 2 would be grouped together, and genes 3 and 4 would be grouped together. I would like the names of the genes in each pattern group, and what the pattern is for that group. I hope that is clear. Thanks in advance!

Try this approach. Create a variable that collects the values across the columns of each row using c_across() and toString(). After that, convert it to a factor and relabel the levels with the prefix Group. Here is the code using tidyverse functions:
library(tidyverse)
#Code
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4))) %>%
ungroup() %>%
mutate(Var=paste0('Group.',as.numeric(factor(Var,levels = unique(Var),ordered = T))))
Output:
# A tibble: 5 x 6
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down Group.1
2 2 up up up down Group.1
3 3 NA NA NA up Group.2
4 4 NA NA NA up Group.2
5 5 NA NA NA NA Group.3
If you only need the pattern itself, try this:
#Code 2
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4)))
Output:
# A tibble: 5 x 6
# Groups: gene [5]
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down up, up, up, down
2 2 up up up down up, up, up, down
3 3 NA NA NA up NA, NA, NA, up
4 4 NA NA NA up NA, NA, NA, up
5 5 NA NA NA NA NA, NA, NA, NA
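As a side note, c_across() is designed to be used with rowwise(); group_by(gene) only works here because each gene happens to be its own row. A minimal rowwise() sketch of the same idea (column names as in the question):
library(dplyr)
# Row-wise variant: paste the stage values of each row into one pattern string
dfnew <- df %>%
  rowwise() %>%
  mutate(Var = toString(c_across(stg.1:stg.4))) %>%
  ungroup()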

We can do this in a vectorized way with unite
library(dplyr)
library(tidyr)
df %>%
unite(grp, starts_with('stg'), na.rm = TRUE, remove = FALSE) %>%
mutate(grp = match(grp, unique(grp)))
# gene grp stg.1 stg.2 stg.3 stg.4
#1 1 1 up up up down
#2 2 1 up up up down
#3 3 2 <NA> <NA> <NA> up
#4 4 2 <NA> <NA> <NA> up
#5 5 3 <NA> <NA> <NA> <NA>
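If you also want the gene names listed per pattern, as the question asks, here is a small sketch building on the same unite() idea (the pattern and genes column names are just illustrative):
library(dplyr)
library(tidyr)
# One row per distinct pattern, listing the genes that share it
df %>%
  unite(pattern, starts_with('stg'), sep = ', ', remove = FALSE) %>%
  group_by(pattern) %>%
  summarise(genes = toString(gene), .groups = 'drop')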

Though not specifically asked for, a data.table solution goes as follows:
library(data.table)
setDT(df)
df[, group := paste0(stg.1, stg.2, stg.3, stg.4), by = gene][, group := match(group, unique(group))]
> df
gene stg.1 stg.2 stg.3 stg.4 group
1: 1 up up up down 1
2: 2 up up up down 1
3: 3 <NA> <NA> <NA> up 2
4: 4 <NA> <NA> <NA> up 2
5: 5 <NA> <NA> <NA> <NA> 3

Related

Generating a new variable if any of the conditions are met without listing all variables in R

I would like to generate a variable called outcome that is 1 if any of the columns in the dataset below contain any form of consent response, and 0 otherwise. However, I do not want to list all the variables in my code.
I have tried the following code:
vars<-c("a1","a2","a3","a4")
dat<-dat%>%
mutate(outcome = case_when(if_any(vars, ~ .x == "consented now"|
"consented later") ~ 1))
The dataset:
dat1 <- tibble(
a1 = c("consented now", NA, NA, NA),
a2= c("", "Refused", NA, NA),
a3= c(NA, "consented now", NA, NA),
a4= c(NA, NA, NA, "consented later"))
A base variant using paste with do.call and grepl might be:
dat1$outcome <- +grepl("consented", do.call(paste, dat1))
dat1
# a1 a2 a3 a4 outcome
#1 consented now <NA> <NA> 1
#2 <NA> Refused consented now <NA> 1
#3 <NA> <NA> <NA> <NA> 0
#4 <NA> <NA> <NA> consented later 1
Or using rowSums and sapply.
dat1$outcome <- +(rowSums(sapply(dat1, grepl, pattern="consented")) > 0)
You don't need case_when; if_any with grepl is enough:
dat1 %>%
mutate(outcome = +if_any(a1:a4, ~ grepl("consented", .x)))
Output:
# A tibble: 4 × 5
# a1 a2 a3 a4 outcome
# <chr> <chr> <chr> <chr> <int>
#1 consented now "" NA NA 1
#2 NA "Refused" consented now NA 1
#3 NA NA NA NA 0
#4 NA NA NA consented later 1
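For completeness, the OP's original if_any() idea can also be made to work by wrapping the character vector of names in all_of() and comparing with %in% instead of |. A sketch, assuming exact matches on the two consent strings:
library(dplyr)
vars <- c("a1", "a2", "a3", "a4")
# 1 if any of the listed columns holds one of the consent values, else 0
dat1 %>%
  mutate(outcome = as.integer(
    if_any(all_of(vars), ~ .x %in% c("consented now", "consented later"))
  ))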

Filter some rows of a data frame depending on the values of other rows

I would like to filter some rows of a dataframe based on the values of other rows and I don't know how to proceed. Below is an example of what I want to do.
data=data.frame(ID=c("ID1",NA, NA, "ID2", NA, "ID3", NA, NA),
l2=c(NA,9,4,NA,5,NA,6,8),
var3=c("aa", NA, NA, "bc",NA, "cc", NA, NA),
var4=c(NA,"yes","no",NA,"yes",NA,"yes","no"))
> data
ID l2 var3 var4
1 ID1 NA aa <NA>
2 <NA> 9 <NA> yes
3 <NA> 4 <NA> no
4 ID2 NA bc <NA>
5 <NA> 5 <NA> yes
6 ID3 NA cc <NA>
7 <NA> 6 <NA> yes
8 <NA> 8 <NA> no
On this dataframe I would like to select the rows using the ID variable if the rows following this ID (until the next one) have at least one value < 7 for the l2 variable AND a "yes" for the var4 variable.
Following this rule I would expect the output below. This is just an example; I have many more rows and many more variables. If someone has a solution with dplyr, that would be perfect.
> output=data.frame(ID=c("ID2","ID3"), l2=c(NA,NA), var3=c("bc","cc"), var4=c(NA,NA))
> output
ID l2 var3 var4
1 ID2 NA bc NA
2 ID3 NA cc NA
You could do:
a) if it is ok for the conditions on l2 and var4 to be satisfied in different rows:
library(tidyverse)
data %>%
fill(ID) %>%
group_by(ID) %>%
filter((any(l2 < 7) & any(var4 == "yes")) & row_number() == 1) %>%
ungroup()
# A tibble: 3 x 4
ID l2 var3 var4
<chr> <dbl> <chr> <chr>
1 ID1 NA aa NA
2 ID2 NA bc NA
3 ID3 NA cc NA
b) if the conditions for l2 and var4 have to be satisfied in the same row (this could probably be simplified a bit, but I am being a bit more verbose here for illustrative purposes):
data %>%
fill(ID) %>%
group_by(ID) %>%
filter((l2 < 7 & var4 == "yes") | row_number() == 1) %>%
filter(n() > 1 & row_number() == 1) %>%
ungroup()
# A tibble: 2 x 4
ID l2 var3 var4
<chr> <dbl> <chr> <chr>
1 ID2 NA bc NA
2 ID3 NA cc NA
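A variation on the same idea, shown here only as a sketch, builds the ID blocks with cumsum() instead of fill() and keeps the ID row of each qualifying block:
library(dplyr)
data %>%
  group_by(grp = cumsum(!is.na(ID))) %>%                 # one group per ID block
  filter(any(l2 < 7 & var4 == "yes", na.rm = TRUE)) %>%  # same-row condition, as in b)
  slice(1) %>%                                           # keep the ID row of the block
  ungroup() %>%
  select(-grp)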

R: fill in cells with values from different rows

I’m trying to fill NAs in a row with values from a different row. These rows are “linked” by a case number. I want to write a loop that goes through the entire data frame and does this, but I think I don’t grasp the R language well enough yet. Can anybody help me?
The data frame:
CASE <- c(1, 2, 3, 4, 5, 6)
SERIAL <-c("AB",NA, NA, "CD", NA, NA)
REF <- c(NA, 1, 1, NA, 4, 4)
PA <- c(4, NA, NA, 2, NA, NA)
PE <- c(NA, 2, NA, NA, 1, NA)
PE2 <- c(NA, NA, 3, NA, NA, 3)
df <- data.frame (CASE, SERIAL, REF, PA, PE, PE2)
  CASE SERIAL REF PA PE PE2
1    1     AB  NA  4 NA  NA
2    2   <NA>   1 NA  2  NA
3    3   <NA>   1 NA NA   3
4    4     CD  NA  2 NA  NA
5    5   <NA>   4 NA  1  NA
6    6   <NA>   4 NA NA   3
In the row CASE = 1, I want to fill in the empty PE and PE2 with the values from the rows below that reference this line (by REF = 1). In the row CASE = 4, I want to fill in the empty PE and PE2 with the values from the rows below that reference this line (by REF = 4). The rows with no serial number only serve to fill rows 1 and 4, so to speak. There is no way to collect the data directly into the corresponding rows. I tried this for loop, but I don't know how to reference the values correctly:
for (i in 1:dim(df)[1]) {
  if (data$SERIAL[i] == NA) {
    [data$CASE[data$REF[i]], PE] <- data$PE[i]
    [data$CASE[data$REF[i]], PE2] <- data$PE2[i]
  }
}
Expected output:
CASE SERIAL REF PA PE PE2
1 1 AB NA 4 2 3
2 2 <NA> 1 NA 2 NA
3 3 <NA> 1 NA NA 3
4 4 CD NA 2 1 3
5 5 <NA> 4 NA 1 NA
6 6 <NA> 4 NA NA 3
This is a dplyr solution rather than a loop, but perhaps it will work:
library(dplyr)
df %>%
mutate(REF = ifelse(is.na(REF), CASE, REF)) %>%
group_by(REF) %>%
summarise(SERIAL = first(SERIAL),
across(c(PA, PE, PE2), ~sum(.x, na.rm=TRUE))) %>%
rename("CASE" = "REF")
# # A tibble: 2 x 5
# CASE SERIAL PA PE PE2
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 AB 4 2 3
# 2 4 CD 2 1 3
A base R alternative using subset():
withSerial = subset(df, !is.na(SERIAL))
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 NA NA
#4 4 CD NA 2 NA NA
noSerialwithRef = subset(df, is.na(SERIAL) & !is.na(REF))
noSerialwithRef
# CASE SERIAL REF PA PE PE2
#2 2 <NA> 1 NA 2 NA
#3 3 <NA> 1 NA NA 3
#5 5 <NA> 4 NA 1 NA
#6 6 <NA> 4 NA NA 3
withSerial$PE = subset(noSerialwithRef, !is.na(PE))$PE
withSerial$PE2 = subset(noSerialwithRef, !is.na(PE2))$PE2
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 2 3
#4 4 CD NA 2 1 3
Update: Added library(tidyr) thanks to Martin Gal and added alternative code suggested by Martin Gal:
Here is another dplyr way:
fill SERIAL
use lead() on the grouped columns
keep only the first row of each group with slice(1)
library(dplyr)
library(tidyr)
df %>%
fill(SERIAL, .direction = "down") %>%
group_by(SERIAL) %>%
mutate(PE = lead(PE),
PE2 = lead(PE2,2)) %>%
slice(1)
# Alternative and better (suggested by Martin Gal):
df %>% fill(-c(CASE, SERIAL), .direction = "up") %>% drop_na()
CASE SERIAL REF PA PE PE2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 AB NA 4 2 3
2 4 CD NA 2 1 3
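If the goal is instead to keep all six rows and only fill in the SERIAL rows, as in the expected output, one possible sketch (assuming each block contributes exactly one non-NA value per PE/PE2 column, so max() simply picks it out) is:
library(dplyr)
df %>%
  mutate(block = ifelse(is.na(REF), CASE, REF)) %>%   # link each row to its CASE
  group_by(block) %>%
  mutate(PE  = ifelse(!is.na(SERIAL), max(PE,  na.rm = TRUE), PE),
         PE2 = ifelse(!is.na(SERIAL), max(PE2, na.rm = TRUE), PE2)) %>%
  ungroup() %>%
  select(-block)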

How to calculate row differences in R when it's not in sequence

I have a data frame like this:
name count
a 3
a 5
a 8
b 2
a 9
b 7
I want to calculate the row differences grouped by name, so my code is:
data %>% group_by(name) %>% mutate(last_count = lag(count), diff = count - last_count)
However, I get a result like the table below:
name count last_count diff
a 3 NA NA
a 5 3 2
a 8 5 3
b 2 NA NA
a 9 8 1
b 7 2 5
But what I want should look like this:
name count last_count diff
a 3 NA NA
a 5 3 2
a 8 5 3
b 2 NA NA
a 9 NA NA
b 7 NA NA
Thanks in advance to whoever can help me fix it!
Does this work:
> library(dplyr)
> df %>% mutate(last_count = case_when(name == lag(name) ~ lag(count), TRUE ~ NA_real_),
diff = case_when(name == lag(name) ~ count - lag(count), TRUE ~ NA_real_))
# A tibble: 6 x 4
name count last_count diff
<chr> <dbl> <dbl> <dbl>
1 a 3 NA NA
2 a 5 3 2
3 a 8 5 3
4 b 2 NA NA
5 a 9 NA NA
6 b 7 NA NA
>
We could use rleid to create a grouping column based on adjacent matching values in the 'name' column and then apply the diff.
library(dplyr)
library(data.table)
data %>%
group_by(grp = rleid(name)) %>%
mutate(last_count = lag(count), diff = count - last_count) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 6 x 4
# name count last_count diff
# <chr> <int> <int> <int>
#1 a 3 NA NA
#2 a 5 3 2
#3 a 8 5 3
#4 b 2 NA NA
#5 a 9 NA NA
#6 b 7 NA NA
Or using base R with ave and rle:
data$diff <- with(data, ave(count, with(rle(name),
  rep(seq_along(values), lengths)), FUN = function(x) c(NA, diff(x))))
data
data <- structure(list(name = c("a", "a", "a", "b", "a", "b"), count = c(3L,
5L, 8L, 2L, 9L, 7L)), class = "data.frame", row.names = c(NA,
-6L))
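The rleid() grouping can also be reproduced with plain dplyr, for example via cumsum() on changes in name; a sketch:
library(dplyr)
data %>%
  group_by(grp = cumsum(name != lag(name, default = first(name)))) %>%
  mutate(last_count = lag(count), diff = count - last_count) %>%
  ungroup() %>%
  select(-grp)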

regex (in gathering multiple sets of columns with tidyr)

Inspired by hadley's nifty gather approach in this answer, I tried to use tidyr's gather() and spread() in combination with a regular expression (regex), but I seem to be getting the regex wrong.
I did study several regex questions; this one, this one, and also regex101.com. I tried to circumvent the regex by using starts_with(), ends_with() and matches(), inspired by this question, but with no luck.
I am asking here in the hope that someone can show me where I am getting it wrong so I can solve it, preferably using the select helpers from tidyselect.
I need to capture two regex groups: one up to the last dot and one consisting of what comes after the last dot. I made two examples below, one where my code is working and one where I am stuck.
First, the example that is working.
# install.packages(c("tidyverse"), dependencies = TRUE)
require(tidyverse)
The first data set, which works, looks like this:
myData1 <- tibble(
id = 1:10,
Wage.1997.1 = c(NA, 32:38, NA, NA),
Wage.1997.2 = c(NA, 12:18, NA, NA),
Wage.1998.1 = c(NA, 42:48, NA, NA),
Wage.1998.2 = c(NA, 2:8, NA, NA),
Wage.1998.3 = c(NA, 42:48, NA, NA),
Job.Type.1997.1 = NA,
Job.Type.1997.2 = c(NA, rep(c('A', 'B'), 4), NA),
Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
Job.Type.1998.2 = c(NA, rep(c('A', 'B'), 4), NA)
)
and this is how I gather() it,
myData1 %>% gather(key, value, -id) %>%
extract(col = key, into = c("variable", "id.job"), regex = "(.*?\\..*?)\\.(.)$") %>%
spread(variable, value)
#> # A tibble: 30 x 6
#> id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 <NA> <NA> <NA> <NA>
#> 2 1 2 <NA> <NA> <NA> <NA>
#> 3 1 3 <NA> <NA> <NA> <NA>
#> 4 2 1 <NA> A 32 42
#> 5 2 2 A A 12 2
#> 6 2 3 <NA> <NA> <NA> 42
#> 7 3 1 <NA> B 33 43
#> 8 3 2 B B 13 3
#> 9 3 3 <NA> <NA> <NA> 43
#> 10 4 1 <NA> A 34 44
#> # ... with 20 more rows
It works; I suspect I am overdoing it with the regex, but it works. However, my real data can have either one or two digits at the end.
The second data set, where I get stuck:
myData2 <- tibble(
id = 1:10,
Wage.1997.1 = c(NA, 32:38, NA, NA),
Wage.1997.12 = c(NA, 12:18, NA, NA),
Wage.1998.1 = c(NA, 42:48, NA, NA),
Wage.1998.12 = c(NA, 2:8, NA, NA),
Wage.1998.13 = c(NA, 42:48, NA, NA),
Job.Type.1997.1 = NA,
Job.Type.1997.12 = c(NA, rep(c('A', 'B'), 4), NA),
Job.Type.1998.1 = c(NA, rep(c('A', 'B'), 4), NA),
Job.Type.1998.12 = c(NA, rep(c('A', 'B'), 4), NA)
)
Now, this is where I use (0[0-1]|1[0-9])$ for the second group. I also tried things like \d{1}|\d{2}, but that did not work either.
myData2 %>% gather(key, value, -id) %>%
extract(col = key, into = c("variable", "id.job"),
regex = "(.*?\\..*?)\\.(0[0-1]|1[0-9])$") %>%
spread(variable, value)
The expected output would be something like this,
#> # A tibble: 30 x 6
#> id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 <NA> <NA> <NA> <NA>
#> 2 1 12 <NA> <NA> <NA> <NA>
#> 3 1 13 <NA> <NA> <NA> <NA>
#> 4 2 1 <NA> A 32 42
#> 5 2 12 A A 12 2
#> 6 2 13 <NA> <NA> <NA> 42
#> 7 3 1 <NA> B 33 43
#> 8 3 12 B B 13 3
#> 9 3 13 <NA> <NA> <NA> 43
#> 10 4 1 <NA> A 34 44
#> # ... with 20 more rows
A simple solution à la this question, using select helpers like starts_with(), ends_with(), matches(), etc., would be appreciated.
We can change the regex in extract to capture everything from the start (^) of the string up to the last dot as one group ((.*)), followed by a dot (\\.), and then capture one or more non-dot characters (([^.]+)) until the end ($) of the string as the second group.
myData2 %>%
gather(key, value, -id) %>%
extract(col = key, into = c("variable", "id.job"), "^(.*)\\.([^.]+)$") %>%
spread(variable, value)
# A tibble: 30 x 6
# id id.job Job.Type.1997 Job.Type.1998 Wage.1997 Wage.1998
# * <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 1 <NA> <NA> <NA> <NA>
# 2 1 12 <NA> <NA> <NA> <NA>
# 3 1 13 <NA> <NA> <NA> <NA>
# 4 2 1 <NA> A 32 42
# 5 2 12 A A 12 2
# 6 2 13 <NA> <NA> <NA> 42
# 7 3 1 <NA> B 33 43
# 8 3 12 B B 13 3
# 9 3 13 <NA> <NA> <NA> 43
#10 4 1 <NA> A 34 44
# ... with 20 more rows
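Since gather() and spread() are superseded, the same reshape can also be written with pivot_longer()/pivot_wider() and the same regex via names_pattern; a sketch (values_transform coerces the mixed numeric/character columns to character so they can share one value column):
library(tidyr)
myData2 %>%
  pivot_longer(
    cols = -id,
    names_to = c("variable", "id.job"),
    names_pattern = "^(.*)\\.([^.]+)$",
    values_transform = list(value = as.character)
  ) %>%
  pivot_wider(names_from = variable, values_from = value)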
