How to split a column into several columns, keeping only unique values, using tidyr

I'm working on a data frame with a column like this:
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
I would like to separate this column into 3 columns like this:
seg1 seg2 seg3
1 a b c
2 a b <NA>
3 d a b
4 f <NA> <NA>
The catch is that when I split each row by ";", I need to keep only the unique values in that row.

Here's a tidyverse approach. We split the character in A, keep only the unique values, paste the result back together and separate into three columns:
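library(tidyverse)
df %>%
  # map_chr() returns a character vector, so A stays a plain character column
  mutate(A = map_chr(strsplit(as.character(A), ";"),
                     .f = ~ paste(unique(.x), collapse = ";"))) %>%
  # fill = "right" pads rows with fewer than three pieces with NA, without a warning
  separate(A, into = c("seg1", "seg2", "seg3"), sep = ";", fill = "right")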
Which gives:
# seg1 seg2 seg3
#1 a b c
#2 a b <NA>
#3 d a b
#4 f <NA> <NA>

library(stringr)
A <- c("a;b;c","a;a;b","d;a;b","f;f;f")
df <- data.frame(A)
df <- str_split_fixed(df$A, ";", 3)
df <- apply(X = df,
            MARGIN = 1,
            FUN = function(x) {
              # keep the first occurrence of each value, then pad with NA
              x[!duplicated(x)][1:ncol(df)]
            })
df <- t(df)
df <- as.data.frame(df)
names(df) <- c("seg1", "seg2", "seg3")
df
# seg1 seg2 seg3
# 1 a b c
# 2 a b <NA>
# 3 d a b
# 4 f <NA> <NA>
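
Both answers follow the same three steps: split on ";", drop repeated values within each row, and pad short rows with NA. Here is a minimal base-R sketch of that idea. The helper name split_unique() and its n argument are made up for illustration and do not come from any package:

```r
# Hypothetical helper: split each string, drop repeats, pad with NA to n columns.
split_unique <- function(x, sep = ";", n = 3) {
  # one character vector of unique pieces per input string
  parts <- lapply(strsplit(as.character(x), sep, fixed = TRUE), unique)
  # `length<-` pads with NA (or truncates) so every row has exactly n entries
  padded <- lapply(parts, `length<-`, n)
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("seg", seq_len(n))
  out
}

A <- c("a;b;c", "a;a;b", "d;a;b", "f;f;f")
res <- split_unique(A)
res
```

Note that `length<-` also truncates, so a row with more than n unique values would silently lose the extra ones.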

Related

How to do a for loop with case_when

I'm a beginner with R and I'm trying to do a for-loop to recode many variables: when "test" modality is missing, then have "test.v1" modality. It looked very easy to do, but I can't get it:
VEC_1 <- c("test1","test2","test3","test4","test5","test6","test7","test8","test9")
VEC_2 <- c("test1.v1","test2.v1","test3.v1","test4.v1","test5.v1","test6.v1","test7.v1","test8.v1","test9.v1")
for (i in 1:(min(length(VEC_1), length(VEC_2)))){
df2 <- df1 %>%
mutate(
VEC_1[i] = case_when(
is.na(VEC_1[i]) & !is.na(VEC_2[i]) ~ VEC_2[i],
TRUE ~ VEC_1[i])
)
}
I have this error
Unexpected error : '=' in:
" mutate(
VEC_1[i] ="
Does anyone have an idea ?
EDIT: df1 is like :
test1 <- c("A","B","A","A",NA,"B","A",NA,"A")
test1.v1 <- c("B",NA,"B","B","A","B","B",NA,"A")
test2 <- c("B","B","B","B",NA,"C","C","C","C")
test2.v1 <- c("C",NA,"A","A","B","B","C",NA,"C")
test3 <- c("A","B","B","B",NA,"C","C",NA,"C")
test3.v1 <- c("B","A","B",NA,"A","A","A","A",NA)
test4 <- c(NA,"B","B","A",NA,"B","A",NA,"A")
test4.v1 <- c("B","B","B","A","A","B","B","B","B")
df1 <- data.frame(test1,test1.v1,test2,test2.v1,test3,test3.v1,test4,test4.v1)
Based on the example data.frame df1, I'm wondering if you might try putting your data into long form, then grouping by row number and test, then substituting missing values.
library(tidyverse)
df1 %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn, names_to = c("test", "mode"), names_pattern = "test(\\d+)([.v1]*)") %>%
group_by(rn, test) %>%
mutate(value = ifelse(mode == "" & is.na(value), value[mode == ".v1"], value)) %>%
pivot_wider(id_cols = rn, names_from = c(test, mode), values_from = value, names_prefix = "test", names_sep = "")
Output
rn test1 test1.v1 test2 test2.v1 test3 test3.v1 test4 test4.v1
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B B C A B B B
2 2 B NA B NA B A B B
3 3 A B B A B B B B
4 4 A B B A B NA A A
5 5 A A B B A A A A
6 6 B B C B C A B B
7 7 A B C C C A A B
8 8 NA NA C NA A A B B
9 9 A A C C C NA A B
I'm assuming VEC_1 and VEC_2 are the same length here. You don't need a for-loop: mutate operates on the whole column of the data frame at once, so it effectively does the looping for you. I have rewritten your code like this and changed the data for testing purposes:
library(dplyr)
VEC_1 <- c("test1","test2","test3","test4",NA,"test6","test7",NA,"test9")
VEC_2 <- c("test1.v1",NA,"test3.v1","test4.v1","test5.v1","test6.v1","test7.v1",NA,"test9.v1")
df <- data.frame(VEC_1,VEC_2)
df %>% mutate(VEC_1 = if_else(is.na(VEC_1) & !is.na(VEC_2),VEC_2,VEC_1))
which is equivalent to
df <- data.frame(VEC_1,VEC_2)
for (i in seq_len(nrow(df))){
  # && is the scalar "and", which is what if() expects
  if(is.na(df$VEC_1[i]) && !is.na(df$VEC_2[i])){
    df$VEC_1[i] <- df$VEC_2[i]
  }
}
Output:
> df
VEC_1 VEC_2
1 test1 test1.v1
2 test2 <NA>
3 test3 test3.v1
4 test4 test4.v1
5 test5.v1 test5.v1
6 test6 test6.v1
7 test7 test7.v1
8 <NA> <NA>
9 test9 test9.v1
I also changed from case_when to if_else, because you only check one condition.
?if_else
if_else(condition, true, false, missing = NULL)
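
Because the rule is simply "take VEC_1, and fall back to VEC_2 where VEC_1 is missing", dplyr's coalesce() may be an even shorter way to write it. A small sketch on the same test vectors (stringsAsFactors = FALSE is added here only so the columns are character in older R versions):

```r
library(dplyr)

VEC_1 <- c("test1", "test2", "test3", "test4", NA, "test6", "test7", NA, "test9")
VEC_2 <- c("test1.v1", NA, "test3.v1", "test4.v1", "test5.v1",
           "test6.v1", "test7.v1", NA, "test9.v1")
df <- data.frame(VEC_1, VEC_2, stringsAsFactors = FALSE)

# coalesce() returns, position by position, the first non-missing value
res <- df %>% mutate(VEC_1 = coalesce(VEC_1, VEC_2))
res$VEC_1
```

Rows where both columns are NA stay NA, just as with the if_else version.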
ifelse/if_else is a good solution unless you have multiple conditions, in which case case_when is your friend. A loop isn't necessary:
library(dplyr)
VEC_1 <- c("test1","test2",NA,NA)
VEC_2 <- c("test1.v1",NA,NA,"test4.v1")
df <- tibble(VEC_1, VEC_2)
df %>%
mutate(VEC_1 = case_when(
is.na(VEC_2) ~ VEC_1,
is.na(VEC_1) ~ VEC_2,
TRUE ~ VEC_1)
)
Input (df):
# A tibble: 4 × 2
VEC_1 VEC_2
<chr> <chr>
1 test1 test1.v1
2 test2 NA
3 NA NA
4 NA test4.v1
Output:
# A tibble: 4 × 2
VEC_1 VEC_2
<chr> <chr>
1 test1 test1.v1
2 test2 NA
3 NA NA
4 test4.v1 test4.v1
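
If VEC_1 and VEC_2 hold column names, as in the question, the original loop fails for two reasons. The left-hand side of an argument in mutate() must be a plain name, so `VEC_1[i] = ...` is a syntax error. And `VEC_1[i]` on the right-hand side is just a string, not the column it names. A minimal base-R sketch that keeps the loop and looks columns up with `[[` instead, using a shortened version of df1 made up for this example:

```r
# Shortened, made-up version of df1 with two test/test.v1 pairs
test1    <- c("A", NA, "A")
test1.v1 <- c("B", "B", NA)
test2    <- c(NA, "C", NA)
test2.v1 <- c("C", "A", NA)
df1 <- data.frame(test1, test1.v1, test2, test2.v1, stringsAsFactors = FALSE)

VEC_1 <- c("test1", "test2")
VEC_2 <- paste0(VEC_1, ".v1")

df2 <- df1
for (i in seq_along(VEC_1)) {
  main <- df2[[VEC_1[i]]]  # column named by the i-th entry of VEC_1
  alt  <- df2[[VEC_2[i]]]  # its ".v1" counterpart
  # fill missing values in the main column from the .v1 column
  df2[[VEC_1[i]]] <- ifelse(is.na(main), alt, main)
}
df2
```

Assigning to df2 inside the loop, rather than rebuilding it from df1 on each pass, is what lets the changes from every iteration accumulate.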

R with dplyr rename, avoid error if column doesn't exist AND create new column with NAs

We are looking to rename columns in a dataframe in R, however the columns may be missing and this throws an error:
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
my_df %>% dplyr::rename(aa = a, bb = b, cc = c)
Error: Can't rename columns that don't exist.
x Column `c` doesn't exist.
Our desired output is this, which creates a new column with NA values if the original column does not exist:
> my_df
aa bb c
1 1 4 NA
2 2 5 NA
3 3 6 NA
A possible solution:
library(tidyverse)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
cols <- c(a = NA_real_, b = NA_real_, c = NA_real_)
my_df %>% add_column(!!!cols[!names(cols) %in% names(.)]) %>%
rename(aa = a, bb = b, cc = c)
#> aa bb cc
#> 1 1 4 NA
#> 2 2 5 NA
#> 3 3 6 NA
You can use a named vector with any_of() to rename that won't error on missing variables. I'm uncertain of a dplyr way to then create the missing vars but it's easy enough in base R.
library(dplyr)
cols <- c(aa = "a", bb = "b", cc = "c")
my_df %>%
rename(any_of(cols)) %>%
`[<-`(., , setdiff(names(cols), names(.)), NA)
aa bb cc
1 1 4 NA
2 2 5 NA
3 3 6 NA
Here is a solution using the data.table function setnames. I've added a second "missing" column "d" to demonstrate generality.
library(tidyverse)
library(data.table)
my_df <- data.frame(a = c(1,2,3), b = c(4,5,6))
curr <- names(my_df)
cols <- data.frame(new=c("aa","bb","cc","dd"), old = c("a", "b", "c","d")) %>%
mutate(exist = old %in% curr)
foo <- filter(cols, exist)
bar <- filter(cols, !exist)
setnames(my_df, old = foo$old, new = foo$new)
my_df[, bar$old] <- NA
my_df
#> my_df
# aa bb c d
#1 1 4 NA NA
#2 2 5 NA NA
#3 3 6 NA NA
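
For comparison, the whole "rename what exists, add what is missing" step can be sketched in base R only. The helper rename_or_add() below is hypothetical (not part of dplyr or data.table). It takes the same named-vector layout as the any_of() answer, c(new_name = "old_name"), and adds missing columns under their new names:

```r
# Hypothetical helper: `map` is a named character vector c(new_name = "old_name").
rename_or_add <- function(df, map) {
  # mappings whose old name is actually present in df
  present <- map[map %in% names(df)]
  # rename those columns in place
  names(df)[match(present, names(df))] <- names(present)
  # every new name still absent becomes an all-NA column
  for (nm in setdiff(names(map), names(df))) {
    df[[nm]] <- NA
  }
  df
}

my_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
res <- rename_or_add(my_df, c(aa = "a", bb = "b", cc = "c"))
res
```

The added column is a logical NA column. Use NA_real_ or NA_character_ in the assignment if a specific type matters downstream.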

Extract character list values from data.frame rows and reshape data

I have a variable x with character lists in each row:
dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
stringsAsFactors = FALSE)
I would like to reshape the data so that each row is a unique (id, x) pair such as:
dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)),
x = c('f','o','a','b','r','a','b'))
> dat2
id x
1 a f
2 a o
3 b a
4 b b
5 b r
6 c a
7 c b
I've attempted to do this by splitting the character lists and keeping only the unique list values in each row:
dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)
> dat
id x
1 a f, o
3 b a, b, r
5 c a, b
However, I'm not sure how to proceed with converting the row lists into individual row entries.
How would I accomplish this? Or is there a more efficient way of converting a list of strings to reshape the data as described?
You can use tidytext::unnest_tokens:
library(tidytext)
library(dplyr)
dat %>%
unnest_tokens(x1, x) %>%
distinct()
id x1
1 a f
2 a o
3 b b
4 b a
5 b r
6 c b
7 c a
A base R method with two lines is
#get list of X potential vars
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))
This returns
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
If you don't want to write out the variable names yourself, you can use setNames.
setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
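
The result above keeps values in order of first appearance, while the desired dat2 lists them sorted within each id. A small sketch, assuming sorted order is wanted, that extends the same base R approach with order() and resets the row names:

```r
dat <- data.frame(id = c(rep("a", 2), rep("b", 2), "c"),
                  x = c("f,o", "f,o,o", "b,a,a,r", "b,a,r", "b,a"),
                  stringsAsFactors = FALSE)

# split each string into its pieces
parts <- strsplit(dat$x, ",", fixed = TRUE)

# one row per (id, piece), then drop duplicate pairs
long <- unique(data.frame(id = rep(dat$id, lengths(parts)),
                          x = unlist(parts),
                          stringsAsFactors = FALSE))

# sort by id, then by value, and renumber the rows
long <- long[order(long$id, long$x), ]
rownames(long) <- NULL
long
```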
We could use separate_rows
library(tidyverse)
dat %>%
separate_rows(x) %>%
distinct()
# id x
#1 a f
#2 a o
#3 b b
#4 b a
#5 b r
#6 c b
#7 c a
Another option is splitstackshape::cSplit, which splits the x column into multiple columns. gather and filter then reshape the result into the desired output.
library(tidyverse)
library(splitstackshape)
dat %>% cSplit("x", sep=",") %>%
mutate_if(is.factor, as.character) %>%
gather(key, value, -id) %>%
filter(!is.na(value)) %>%
select(-key) %>% unique()
# id value
# 1 a f
# 3 b b
# 5 c b
# 6 a o
# 8 b a
# 10 c a
# 13 b r
Base solution:
temp <- do.call(rbind, apply( dat, 1,
function(z){ data.frame(
id=z[1],
x = scan(text=z['x'], what="",sep=","),
stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
temp[!duplicated(temp),]
#------
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
To get rid of all the messages and warnings:
temp <- do.call(rbind, apply( dat, 1,
function(z){ suppressWarnings(data.frame(id=z[1],
x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
)} ) )
temp[!duplicated(temp),]

selecting values of one dataframe based on partial string in another dataframe

I have two dataframes (DF1 and DF2)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
DF1
parties
A, B
C
A
C, D
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF2
party party.number
A 1
B 2
C 3
D 4
E 5
F 6
G 7
H 8
I 9
J 10
The desired result should be an additional column in DF1 which contains the party numbers taken from DF2 for each row in DF1.
Desired result (based on DF1):
parties party.numbers
A, B 1, 2
C 3
A 1
C, D 3, 4
I strongly suspect that the answer involves something like str_match(DF1$parties, DF2$party.number) or a similar regular expression, but I can't figure out how to put two (or more) party numbers into the same row (DF2$party.numbers).
One option is gsubfn: match each upper-case letter and replace it using a key/value list built from DF2 (party as the key, party.number as the value).
library(gsubfn)
DF1$party.numbers <- gsubfn("[A-Z]", setNames(as.list(DF2$party.number),
DF2$party), as.character(DF1$parties))
DF1
# parties party.numbers
#1 A, B 1, 2
#2 C 3
#3 A 1
#4 C, D 3, 4
An alternative solution using tidyverse. You can reshape DF1 to have one string per row, then join DF2 and then reshape back to your initial form:
library(tidyverse)
DF1 <- as.data.frame(c("A, B","C","A","C, D"))
names(DF1) <- c("parties")
B <- as.data.frame(c(LETTERS[1:10]))
C <- as.data.frame(1:10)
DF2 <- bind_cols(B,C)
names(DF2) <- c("party","party.number")
DF1 %>%
group_by(id = row_number()) %>%
separate_rows(parties) %>%
left_join(DF2, by=c("parties"="party")) %>%
summarise(parties = paste(parties, collapse = ", "),
party.numbers = paste(party.number, collapse = ", ")) %>%
select(-id)
# # A tibble: 4 x 2
# parties party.numbers
# <chr> <chr>
# 1 A, B 1, 2
# 2 C 3
# 3 A 1
# 4 C, D 3, 4
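
The same lookup can also be sketched in base R without extra packages: build a named vector from DF2, split each parties string, and index the vector by name. This assumes the parties are always separated by a comma and optional spaces:

```r
DF1 <- data.frame(parties = c("A, B", "C", "A", "C, D"), stringsAsFactors = FALSE)
DF2 <- data.frame(party = LETTERS[1:10], party.number = 1:10,
                  stringsAsFactors = FALSE)

# named lookup vector: names are parties, values are party numbers
lookup <- setNames(DF2$party.number, DF2$party)

DF1$party.numbers <- vapply(
  strsplit(DF1$parties, ",\\s*"),
  # index the lookup by party name and glue the numbers back together
  function(p) paste(lookup[p], collapse = ", "),
  character(1)
)
DF1
```

A party that does not appear in DF2 would show up as "NA" in the pasted string, so check for unmatched names first if that can happen in your data.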

separate() in tidyr with NA

I have a question related to separate() in the tidyr package. When there is no NA in a data frame, separate() works. I have been using this function a lot. But, today I had a case in which there were NAs in a data frame. separate() returned an error message. I could be very silly. But, I wonder if tidyr may not be designed for this kind of data cleaning. Or is there any way separate() can work with NAs? Thank you very much for taking your time.
Here is an updated sample based on the comments. Say I want to separate characters in y and create new columns. If I remove the row with NA, separate() will work. But, I do not want to delete the row, what could I do?
x <- c("a-1","b-2","c-3")
y <- c("d-4","e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x,y,z, stringsAsFactors = FALSE)
ana <- foo %>%
separate(y, c("part1", "part2"))
# > foo
# x y z
# 1 a-1 d-4 f-6
# 2 b-2 e-5 g-7
# 3 c-3 <NA> h-8
# > ana <- foo %>%
# + separate(y, c("part1", "part2"))
# Error: Values not split into 2 pieces at 3
One way would be:
res <- foo %>%
mutate(y=ifelse(is.na(y), paste0(NA,"-", NA), y)) %>%
separate(y, c('part1', 'part2'))
res[res=='NA'] <- NA
res
# x part1 part2 z
#1 a-1 d 4 f-6
#2 b-2 e 5 g-7
#3 c-3 <NA> <NA> h-8
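You can use the extra option in separate(), which controls what happens when a value has more pieces than there are columns. As the last row of each result shows, NA values pass through as NA.
Here's an example from Hadley's GitHub issue page: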
> df <- data.frame(x = c("a", "a b", "a b c", NA))
> df
x
1 a
2 a b
3 a b c
4 <NA>
> df %>% separate(x, c("a", "b"), extra = "merge")
a b
1 a <NA>
2 a b
3 a b c
4 <NA> <NA>
> df %>% separate(x, c("a", "b"), extra = "drop")
a b
1 a <NA>
2 a b
3 a b
4 <NA> <NA>
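
If you would rather not depend on how separate() treats missing values in your tidyr version, the same split can be sketched in base R. strsplit() returns NA for an NA input, and `length<-` pads that to two pieces. The column names part1 and part2 follow the question:

```r
x <- c("a-1", "b-2", "c-3")
y <- c("d-4", "e-5", NA)
z <- c("f-6", "g-7", "h-8")
foo <- data.frame(x, y, z, stringsAsFactors = FALSE)

# split on "-" and pad every result to exactly two pieces (NA stays NA)
parts <- lapply(strsplit(foo$y, "-", fixed = TRUE), `length<-`, 2)
m <- do.call(rbind, parts)

res <- data.frame(x = foo$x, part1 = m[, 1], part2 = m[, 2], z = foo$z,
                  stringsAsFactors = FALSE)
res
```

As with separate(), the new columns are character, so convert part2 with as.integer() if you need numbers.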
