Wide to long, combining columns in pairs but keeping ID column - R

I have a dataframe of the following type
ID case1 case2 case3 case4
1  A     B     C     D
2  B     A
3  E     F
4  G     C     A
5  T
I need to change its format to a long shape, similar to the one below:
ID col1 col2
1 A B
1 A C
1 A D
1 B C
1 B D
1 C D
2 B A
3 E F
4 G C
4 G A
4 C A
5 T
As you can see, I need to maintain the ID and ignore empty columns. There are some cases like T that need to remain in the dataset, but without a col2.
I am honestly not sure how to approach this, so that is why there are no examples of what I have tried.

You can get the data in long format and create all combinations of values for each ID when that ID has more than one row.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -ID, values_drop_na = TRUE) %>%
  group_by(ID) %>%
  summarise(value = if (n() > 1)
              list(setNames(as.data.frame(t(combn(value, 2))),
                            c('col1', 'col2')))
            else list(data.frame(col1 = value[1], col2 = NA_character_))) %>%
  unnest(value)
# A tibble: 12 x 3
# ID col1 col2
# <int> <chr> <chr>
# 1 1 A B
# 2 1 A C
# 3 1 A D
# 4 1 B C
# 5 1 B D
# 6 1 C D
# 7 2 B A
# 8 3 E F
# 9 4 G C
#10 4 G A
#11 4 C A
#12 5 T NA
data
df <- structure(list(ID = 1:5, case1 = c("A", "B", "E", "G", "T"),
case2 = c("B", "A", "F", "C", NA), case3 = c("C", NA, NA,
"A", NA), case4 = c("D", NA, NA, NA, NA)),
class = "data.frame", row.names = c(NA, -5L))

Related

how to add a column to identify specific combination of values in R?

I have a database with several columns (>20), and 2 of these columns have the subject names. I would like to add another column containing a number that identifies the combination of the two subjects.
Here is an example with only the 2 columns of names (I don't include the others for convenience):
ID1 ID2
A B
A C
A B
B C
A B
B A
C B
And here is what I would like to create:
ID1 ID2 CODE
A B 1
A C 2
A B 1
B C 3
A B 1
B A 1
C B 3
I am kind of new to R and I think it can be done with stringr, but I am not sure how.
Thanks for the help!
Simo
df$CODE <- as.integer(
  factor(
    apply(df, 1, function(x) paste0(sort(x), collapse = ""))
  )
)
# ID1 ID2 CODE
# 1 A B 1
# 2 A C 2
# 3 A B 1
# 4 B C 3
# 5 A B 1
# 6 B A 1
# 7 C B 3
Data
df <- data.frame(
ID1 = c("A", "A", "A", "B", "A", "B", "C"),
ID2 = c("B", "C", "B", "C", "B", "A", "B")
)
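If the subject names can be longer than one character, pasting without a separator can collide (e.g. "AB" + "C" equals "A" + "BC"), so a minimal variant of the apply() approach above adds one:
df$CODE <- as.integer(
  factor(
    apply(df[c("ID1", "ID2")], 1, function(x) paste(sort(x), collapse = "_"))
  )
)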
Try this:
library(dplyr)
#Code
new <- df %>%
  rowwise() %>%
  mutate(Var = paste0(sort(c(ID1, ID2)), collapse = '')) %>%
  group_by(Var) %>%
  mutate(CODE = cur_group_id()) %>%
  ungroup() %>%
  select(-Var)
Output:
# A tibble: 7 x 3
ID1 ID2 CODE
<chr> <chr> <int>
1 A B 1
2 A C 2
3 A B 1
4 B C 3
5 A B 1
6 B A 1
7 C B 3
Some data used:
#Data
df <- structure(list(ID1 = c("A", "A", "A", "B", "A", "B", "C"), ID2 = c("B",
"C", "B", "C", "B", "A", "B")), class = "data.frame", row.names = c(NA,
-7L))
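For exactly two columns, a vectorized base R sketch avoids the row-wise pass (key and the underscore separator are just illustrative choices; pmin()/pmax() compare the names alphabetically):
key <- paste(pmin(df$ID1, df$ID2), pmax(df$ID1, df$ID2), sep = "_")
df$CODE <- as.integer(factor(key, levels = unique(key)))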

Removing a repeated value in a row

I have two columns in a data frame that may or may not have copied values in them. If the second column has the same value as the first column, I would like to replace that value with a NULL value or a string indicating the value has been replaced. If the values are different, I want to keep both of those values. For example:
I want to take this
col_1 col_2
a a
a b
b d
c c
c d
c c
a a
And turn this into:
col_1 col_2
a NULL
a b
b d
c NULL
c d
c NULL
a NULL
How can I do that?
You can also try:
#Code
df$col_2 <- ifelse(df$col_2==df$col_1,'NULL',df$col_2)
Output:
df
col_1 col_2
1 a NULL
2 a b
3 b d
4 c NULL
5 c d
Some data used:
#Data
df <- structure(list(col_1 = c("a", "a", "b", "c", "c"), col_2 = c("a",
"b", "d", "c", "d")), class = "data.frame", row.names = c(NA,
-5L))
Another option, using base R replacement syntax:
#Code2
df$col_2[df$col_2 == df$col_1] <- 'NULL'
Same output.
Applying the same ifelse() approach to the full 7-row data from the question, we get this:
df
col_1 col_2
1 a NULL
2 a b
3 b d
4 c NULL
5 c d
6 c NULL
7 a NULL
By NULL value, I assume you need NA; if you need the actual string "NULL", use 'NULL' in place of NA_character_, as in Duck's answer.
library(dplyr)
df %>%
mutate(col_2 = case_when(col_1 == col_2 ~ NA_character_, TRUE ~ col_2))
# A tibble: 5 x 2
# Rowwise:
col_1 col_2
<chr> <chr>
1 a NA
2 a b
3 b d
4 c NA
5 c d
Based on new input:
df %>% mutate(col_2 = case_when(col_1 == col_2 ~ NA_character_, TRUE ~ col_2))
# A tibble: 7 x 2
# Rowwise:
col_1 col_2
<chr> <chr>
1 a NA
2 a b
3 b d
4 c NA
5 c d
6 c NA
7 a NA
Data used:
df
# A tibble: 7 x 2
col_1 col_2
<chr> <chr>
1 a a
2 a b
3 b d
4 c c
5 c d
6 c c
7 a a
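For reference, the same case_when() call with the literal string instead of NA, as mentioned above (a minimal sketch):
df %>%
  mutate(col_2 = case_when(col_1 == col_2 ~ 'NULL', TRUE ~ col_2))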
We can use data.table methods, which are fast and efficient:
library(data.table)
setDT(df)[col_1 == col_2, col_2 := 'NULL']
-output
df
# col_1 col_2
#1: a NULL
#2: a b
#3: b d
#4: c NULL
#5: c d
data
df <- structure(list(col_1 = c("a", "a", "b", "c", "c"), col_2 = c("a",
"b", "d", "c", "d")), class = "data.frame", row.names = c(NA,
-5L))
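If NA is preferred over the literal string "NULL", the same data.table update can assign NA_character_ instead (a minimal sketch):
setDT(df)[col_1 == col_2, col_2 := NA_character_]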

Count number of elements for each row in a matrix [duplicate]

Hello, I have a matrix such as:
COL1 COL2 COL3
A "A" "B" NA
B "B" "B" "C"
C NA NA NA
D "B" "B" "B"
E NA NA "C"
F "A" "A" "C"
and I would like, for each row (A, B, C, D, etc.), to get the number of letters that are A or B.
For example:
Nb
A 2
B 2
C 0
D 3
E 0
F 2
Does someone have an idea?
Another way is to use sapply():
df$n <- sapply(1:nrow(df), function(i) sum((df[i,] %in% c('A', 'B'))))
# COL1 COL2 COL3 n
# A A B <NA> 2
# B B B C 2
# C <NA> <NA> <NA> 0
# D B B B 3
# E <NA> <NA> C 0
# F A A C 2
You can achieve the same output by using purrr::map_dbl as well. Just replace sapply with map_dbl.
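A minimal sketch of that map_dbl() variant, assuming the df from the other answers' data:
library(purrr)
df$n <- map_dbl(1:nrow(df), ~ sum(df[.x, ] %in% c('A', 'B')))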
You can try a base R solution with apply():
#Base R
df$Var <- apply(df,1,function(x) length(which(!is.na(x) & x %in% c('A','B'))))
Output:
COL1 COL2 COL3 Var
A A B <NA> 2
B B B C 2
C <NA> <NA> <NA> 0
D B B B 3
E <NA> <NA> C 0
F A A C 2
Some data used:
#Data
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
Or, if you are curious about a tidyverse approach:
library(tidyverse)
#Code
df %>%
  mutate(id = 1:n()) %>%
  left_join(df %>%
              mutate(id = 1:n()) %>%
              pivot_longer(cols = -id) %>%
              filter(value %in% c('A', 'B')) %>%
              group_by(id) %>%
              summarise(Var = n())) %>%
  ungroup() %>%
  replace(is.na(.), 0) %>%
  select(-id)
Output:
COL1 COL2 COL3 Var
1 A B 0 2
2 B B C 2
3 0 0 0 0
4 B B B 3
5 0 0 C 0
6 A A C 2
library(dplyr)
df <- structure(list(COL1 = c("A", "B", NA, "B", NA, "A"), COL2 = c("B",
"B", NA, "B", NA, "A"), COL3 = c(NA, "C", NA, "B", "C", "C")), row.names = c("A",
"B", "C", "D", "E", "F"), class = "data.frame")
df %>%
  rowwise() %>%
  mutate(sumVar = across(c(COL1:COL3), ~ ifelse(. %in% c("A", "B"), 1, 0)) %>% sum)
# A tibble: 6 x 4
# Rowwise:
COL1 COL2 COL3 sumVar
<chr> <chr> <chr> <dbl>
1 A B NA 2
2 B B C 2
3 NA NA NA 0
4 B B B 3
5 NA NA C 0
6 A A C 2
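A compact base R alternative for the same count, shown only as a sketch (it assumes the df from the data above and writes the result to a new Nb column):
df$Nb <- rowSums(df[c("COL1", "COL2", "COL3")] == "A" |
                 df[c("COL1", "COL2", "COL3")] == "B", na.rm = TRUE)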

Extract non-missing elements by rows and stack them

I have a data frame like this
df <- data.frame(id = 1:4,
V1 = c("A", NA, "C", NA),
V2 = c(NA, NA, NA, "E"),
V3 = c(NA, "B", NA, "F"),
V4 = c(NA, NA, "D", NA), stringsAsFactors = F)
# id V1 V2 V3 V4
# 1 1 A <NA> <NA> <NA>
# 2 2 <NA> <NA> B <NA>
# 3 3 C <NA> <NA> D
# 4 4 <NA> E F <NA>
How can I extract non-missing elements by rows and stack them into a column? My expected output is:
# id value
# 1 1 A
# 2 2 B
# 3 3 C
# 4 3 D
# 5 4 E
# 6 4 F
Try pivot_longer() or unite() + separate_rows().
library(tidyr)
library(dplyr)
# Method 1
df %>%
pivot_longer(-id, values_drop_na = T) %>%
select(-name)
# Method 2
df %>%
unite(value, -id, na.rm = T) %>%
separate_rows(value)
# # A tibble: 6 x 2
# id value
# <int> <chr>
# 1 1 A
# 2 2 B
# 3 3 C
# 4 3 D
# 5 4 E
# 6 4 F
You can use dplyr and tidyr:
df %>%
tidyr::gather(-id, key = "key", value = "value") %>%
dplyr::filter(!is.na(value))
id key value
1 1 V1 A
2 3 V1 C
3 4 V2 E
4 2 V3 B
5 4 V3 F
6 3 V4 D
One base R solution could be:
na.omit(data.frame(df[1], stack(df[-1])[1]))
id values
1 1 A
3 3 C
8 4 E
10 2 B
12 4 F
15 3 D
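If the rows should follow the id order of the expected output, a small follow-up sort helps (a sketch; out is just an illustrative name):
out <- na.omit(data.frame(df[1], stack(df[-1])[1]))
out[order(out$id), ]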
How about combining complete.cases() with the reshape2 library?
library(reshape2)
df.temp <- melt(df, id.vars = "id")
df.temp[complete.cases(df.temp),-2]
results in
id value
1 1 A
3 3 C
8 4 E
10 2 B
12 4 F
15 3 D
pivot_longer then filter
library(tidyverse)
df <- data.frame(id = 1:4,
V1 = c("A", NA, "C", NA),
V2 = c(NA, NA, NA, "E"),
V3 = c(NA, "B", NA, "F"),
V4 = c(NA, NA, "D", NA), stringsAsFactors = FALSE)
df %>% pivot_longer(-id, names_to = "name", values_to = "value") %>%
filter(!is.na(value)) %>%
select(-name)
#> # A tibble: 6 x 2
#> id value
#> <int> <chr>
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 3 D
#> 5 4 E
#> 6 4 F
Created on 2020-03-02 by the reprex package (v0.3.0)

Remove NA in front of one specific string but leave in front of another specific string, by group

I have this data frame:
df <- data.frame(
id = rep(1:4, each = 4),
status = c(
NA, "a", "c", "a",
NA, "b", "c", "c",
NA, NA, "a", "c",
NA, NA, "b", "b"),
stringsAsFactors = FALSE)
For each group (id), I aim to remove the rows with one or more leading NAs in front of an "a" (in the column "status"), but not in front of a "b".
The final data frame should look like this:
structure(list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L),
status = c("a", "c", "a", NA, "b", "c", "c", "a", "c", NA, NA, "b", "b")),
.Names = c("id", "status"), row.names = c(NA, -13L), class = "data.frame")
How do I do that?
Edit: alternatively, how would I do it to preserve other variables in the data frame such as the variable otherVar in the following example:
df2 <- data.frame(
id = rep(1:4, each = 4),
status = c(
NA, "a", "c", "a",
NA, "b", "c", "c",
NA, NA, "a", "c",
NA, NA, "b", "b"),
otherVar = letters[1:16],
stringsAsFactors = FALSE)
We can group by 'id', summarise the 'status' by pasting the elements together, then use gsub to remove the NA values before the 'a', and convert back to 'long' format with separate_rows.
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  summarise(status = gsub("(NA, ){1,}(?=a)", "", toString(status),
                          perl = TRUE)) %>%
  separate_rows(status, convert = TRUE)
# A tibble: 13 x 2
# id status
# <int> <chr>
# 1 1 a
# 2 1 c
# 3 1 a
# 4 2 NA
# 5 2 b
# 6 2 c
# 7 2 c
# 8 3 a
# 9 3 c
#10 4 NA
#11 4 NA
#12 4 b
#13 4 b
Or using data.table with the same methodology
library(data.table)
out1 <- setDT(df)[, strsplit(gsub("(NA, ){1,}(?=a)", "",
toString(status), perl = TRUE), ", "), id]
setnames(out1, 'V1', "status")[]
# id status
# 1: 1 a
# 2: 1 c
# 3: 1 a
# 4: 2 NA
# 5: 2 b
# 6: 2 c
# 7: 2 c
# 8: 3 a
# 9: 3 c
#10: 4 NA
#11: 4 NA
#12: 4 b
#13: 4 b
Update
For the updated dataset 'df2'
i1 <- setDT(df2)[, .I[seq(which(c(diff((status %in% "a") +
rleid(is.na(status))) > 1), FALSE))] , id]$V1
df2[-i1]
# id status otherVar
# 1: 1 a b
# 2: 1 c c
# 3: 1 a d
# 4: 2 NA e
# 5: 2 b f
# 6: 2 c g
# 7: 2 c h
# 8: 3 a k
# 9: 3 c l
#10: 4 NA m
#11: 4 NA n
#12: 4 b o
#13: 4 b p
Using zoo's na.locf() together with is.na(); note that this assumes your data is ordered.
library(zoo)
df[!(na.locf(df$status, fromLast = TRUE) == 'a' & is.na(df$status)), ]
id status
2 1 a
3 1 c
4 1 a
5 2 <NA>
6 2 b
7 2 c
8 2 c
11 3 a
12 3 c
13 4 <NA>
14 4 <NA>
15 4 b
16 4 b
Here's a dplyr solution and a less pretty base translation:
dplyr
library(dplyr)
df %>% group_by(id) %>%
filter(status[!is.na(status)][1]!="a" | !is.na(status))
# # A tibble: 13 x 2
# # Groups: id [4]
# id status
# <int> <chr>
# 1 1 a
# 2 1 c
# 3 1 a
# 4 2 <NA>
# 5 2 b
# 6 2 c
# 7 2 c
# 8 3 a
# 9 3 c
# 10 4 <NA>
# 11 4 <NA>
# 12 4 b
# 13 4 b
base
do.call(rbind,
lapply(split(df,df$id),
function(x) x[x$status[!is.na(x$status)][1]!="a" | !is.na(x$status),]))
# id status
# 1.2 1 a
# 1.3 1 c
# 1.4 1 a
# 2.5 2 <NA>
# 2.6 2 b
# 2.7 2 c
# 2.8 2 c
# 3.11 3 a
# 3.12 3 c
# 4.13 4 <NA>
# 4.14 4 <NA>
# 4.15 4 b
# 4.16 4 b
Note
This will fail if not all NAs are leading, because it will remove every NA from any group whose first non-NA value is "a".
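A hedged variant that only drops leading NAs, and therefore keeps NAs that appear after the first non-NA value (a sketch, assuming dplyr is loaded; it also preserves extra columns such as otherVar in df2):
df %>%
  group_by(id) %>%
  filter(!(is.na(status) &
             cumsum(!is.na(status)) == 0 &
             status[!is.na(status)][1] %in% "a")) %>%
  ungroup()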
