Separating mixed values and generating new columns in tidyverse [duplicate] - r

This question already has answers here:
Separate a column into multiple columns using tidyr::separate with sep=""
(2 answers)
Closed 3 years ago.
A sample of my data is as follows:
df1 <- read.table(text = "var Time
12O 12
13O 11
22B 45
33Z 22
21L 2
11M 13", header = TRUE)
I want to separate the values in column "var" to get the following data:
df2 <- read.table(text = " Group1 Group2 Group3
1 2 O
1 3 O
2 2 B
3 3 Z
2 1 L
1 1 M", header = TRUE)
I tried the following codes:
df2 <- df1 %>% separate(var, into = c('Group1', 'Group2','Group3'), sep = 1)
I get an error. I have searched for the cause of the error, but without success.
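A minimal sketch of one likely fix (assuming a reasonably recent tidyr): separate() also accepts numeric split positions, so passing the two positions c(1, 2) should produce the three pieces named in into; the Time column is kept alongside them.
library(dplyr)
library(tidyr)

df1 %>%
  mutate(var = as.character(var)) %>%   # in case var was read in as a factor
  separate(var, into = c("Group1", "Group2", "Group3"), sep = c(1, 2))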

If you want to retain the original column, you can use str_split_fixed from the stringr package and cbind the result to your existing data frame:
cbind(df1, str_split_fixed(as.character(df1$var),"", n = 3))
var Time 1 2 3
1 12O 12 1 2 O
2 13O 11 1 3 O
3 22B 45 2 2 B
4 33Z 22 3 3 Z
5 21L 2 2 1 L
6 11M 13 1 1 M

A possible base/stringr solution:
res <- as.data.frame(do.call(rbind, strsplit(
  stringr::str_replace_all(df1$var, "([0-9])([0-9])([A-Z])", "\\1 \\2 \\3"),
  " ")))
names(res)<-paste0("Group",1:ncol(res))
cbind(df1["Time"],res)
Time Group1 Group2 Group3
1 12 1 2 O
2 11 1 3 O
3 45 2 2 B
4 22 3 3 Z
5 2 2 1 L
6 13 1 1 M

As far as I can tell (see: Separate outputs empty separator error for each row independently), this cannot be done with tidyr separate(). A possibility is str_split() from stringr or strsplit() from base R.
So, using str_split():
df1 %>%
mutate(var = str_split(var, pattern = "")) %>%
unnest() %>%
group_by(Time) %>%
mutate(val = var,
var = paste0("Group", row_number())) %>%
spread(var, val) %>%
ungroup()
Time Group1 Group2 Group3
<int> <chr> <chr> <chr>
1 2 2 1 L
2 11 1 3 O
3 12 1 2 O
4 13 1 1 M
5 22 3 3 Z
6 45 2 2 B
Using strsplit():
df1 %>%
mutate(var = strsplit(as.character(var), split = "", fixed = TRUE)) %>%
unnest() %>%
group_by(Time) %>%
mutate(val = var,
var = paste0("Group", row_number())) %>%
spread(var, val) %>%
ungroup()
To get the new columns with an appropriate class (character, integer, etc.), you can add convert = TRUE to spread().
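The same idea with the newer tidyr verbs, as a sketch only: unnest() now takes an explicit cols argument, and spread() is superseded by pivot_wider().
df1 %>%
  mutate(var = str_split(as.character(var), pattern = "")) %>%
  unnest(cols = var) %>%
  group_by(Time) %>%
  mutate(val = var,
         var = paste0("Group", row_number())) %>%
  pivot_wider(names_from = var, values_from = val) %>%
  ungroup()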

Related

apply function or loop within mutate

Let's say I have a data frame, and I would like to mutate new columns by subtracting each pair of existing columns. The matching columns follow a rule: in the code below, the first component of each subtraction always has the prefix base_g00 and the second component always has the prefix allow_m00. Also, the first component's id runs from 27 to 43 and the second component's id runs from 20 to 36, i.e. it can be read as (1st id - 7). For the following code, can I use an apply function or a loop within mutate to make it simpler? Thanks so much for any suggestions in advance!
pred_error<-y07_13%>%mutate(annual_util_1=base_g0027-allow_m0020,
annual_util_2=base_g0028-allow_m0021,
annual_util_3=base_g0029-allow_m0022,
annual_util_4=base_g0030-allow_m0023,
annual_util_5=base_g0031-allow_m0024,
annual_util_6=base_g0032-allow_m0025,
annual_util_7=base_g0033-allow_m0026,
annual_util_8=base_g0034-allow_m0027,
annual_util_9=base_g0035-allow_m0028,
annual_util_10=base_g0036-allow_m0029,
annual_util_11=base_g0037-allow_m0030,
annual_util_12=base_g0038-allow_m0031,
annual_util_13=base_g0039-allow_m0032,
annual_util_14=base_g0040-allow_m0033,
annual_util_15=base_g0041-allow_m0034,
annual_util_16=base_g0042-allow_m0035,
annual_util_17=base_g0043-allow_m0036)
I think a more idiomatic tidyverse approach would be to reshape your data so those column groups are encoded as a variable instead of as separate columns which have the same semantic meaning.
For instance,
library(dplyr); library(tidyr); library(stringr)
y07_13 <- tibble(allow_m0021 = 1:5,
allow_m0022 = 2:6,
allow_m0023 = 11:15,
base_g0028 = 5,
base_g0029 = 3:7,
base_g0030 = 100)
y07_13 %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
mutate(type = str_extract(name, "allow_m|base_g"),
num = str_remove(name, type) %>% as.numeric(),
group = num - if_else(type == "allow_m", 20, 27)) %>%
select(row, type, group, value) %>%
pivot_wider(names_from = type, values_from = value) %>%
mutate(annual_util = base_g - allow_m)
Result
# A tibble: 15 x 5
row group allow_m base_g annual_util
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 5 4
2 1 2 2 3 1
3 1 3 11 100 89
4 2 1 2 5 3
5 2 2 3 4 1
6 2 3 12 100 88
7 3 1 3 5 2
8 3 2 4 5 1
9 3 3 13 100 87
10 4 1 4 5 1
11 4 2 5 6 1
12 4 3 14 100 86
13 5 1 5 5 0
14 5 2 6 7 1
15 5 3 15 100 85
Here is a vectorised base R approach -
base_cols <- paste0("base_g00", 27:43)
allow_cols <- paste0("allow_m00", 20:36)
new_cols <- paste0("annual_util", 1:17)
y07_13[new_cols] <- y07_13[base_cols] - y07_13[allow_cols]
y07_13
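A quick toy check (a hypothetical two-pair example) that data frame subtraction lines the column pairs up by position, which is what the assignment above relies on:
toy <- data.frame(base_g0027 = 1:3, base_g0028 = 4:6,
                  allow_m0020 = 0:2, allow_m0021 = 1:3)
toy[paste0("annual_util", 1:2)] <-
  toy[paste0("base_g00", 27:28)] - toy[paste0("allow_m00", 20:21)]
toy
#   base_g0027 base_g0028 allow_m0020 allow_m0021 annual_util1 annual_util2
# 1          1          4           0           1            1            3
# 2          2          5           1           2            1            3
# 3          3          6           2           3            1            3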

Counting all unique strings in a data frame containing strings and numeric values

I have a number of large data frames that contain the occasional string value, and I would like to know what the unique string values are (ignoring the numeric values) and, if possible, count these strings.
df <- data.frame(1:16)
df$A <- c("Name",0,0,0,0,0,12,12,0,14,NA_real_,14,NA_real_,NA_real_,16,16)
df$B <- c(10,0,"test",0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
X1.16 A B C
1 1 Name 10 10
2 2 0 0 12
3 3 0 test 14
4 4 0 0 16
5 5 0 12 10
6 6 0 12 12
7 7 12 12 14
8 8 12 12 16
9 9 0 0 10
10 10 14 14 12
11 11 <NA> <NA> 14
12 12 14 14 16
13 13 <NA> 16 10
14 14 <NA> 16 12
15 15 16 16 14
16 16 16 16 16
I know I can use the count function in dplyr, but I have too many unique numeric values, so this is not a great solution. In the code below I was able to filter my data to retain only rows that contain an alphabetical character (although this isn't a solution either).
df %>% filter_all(any_vars(str_detect(., pattern = "[:alpha:]")))
X1.16 A B C
1 1 Name 10 10
2 3 0 test 14
My desired output would be something to the effect of:
Variable n
"Name" 1
"test" 1
You can get the string values with grep and count them using table:
stack(table(grep('[[:alpha:]]', unlist(df), value = TRUE)))[2:1]
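The same steps broken out with named intermediates (a sketch, purely for readability; the names vals, strs and counts are arbitrary):
vals   <- unlist(df)                               # flatten the data frame into one character vector
strs   <- grep('[[:alpha:]]', vals, value = TRUE)  # keep only entries containing a letter
counts <- table(strs)                              # tabulate each unique string
stack(counts)[2:1]                                 # as a data frame: string first, then its count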
If you want a tidyverse answer, you can get the data into long format, keep only the rows that contain characters, and count them.
library(dplyr)
df %>%
mutate(across(.fns = as.character)) %>%
tidyr::pivot_longer(cols = everything()) %>%
filter(grepl('[[:alpha:]]', value)) %>%
count(value)
# value n
# <chr> <int>
#1 Name 1
#2 test 1
@Ronak and @akrun above beat me to the punch; my solution is very similar, with an extension if you want a count within columns.
# Coerce to tibble for ease of reading
df <- df %>%
as_tibble() %>%
mutate(across(.fns = as.character))
df %>%
pivot_longer(cols = everything()) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(Variable, sort = TRUE)
# A tibble: 2 x 2
Variable n
<chr> <int>
1 Name 1
2 test 1
# str_subset() is a convenient wrapper around x[str_detect(x, pattern)]
Add some extra words to test
# Test on extra word counts - replace 12 and 14 with words
df2 <- df
df2[df2 == 12] <- 'Name'
df2[df2 == 14] <- 'test'
df2
df2 %>%
pivot_longer(cols = everything()) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(Variable, sort = TRUE)
# A tibble: 2 x 2
Variable n
<chr> <int>
1 Name 12
2 test 10
If you want counts by column
df2 %>%
select(-1) %>%
pivot_longer(everything(), names_to = 'col') %>%
group_by(col) %>%
summarise(Variable = str_subset(value, "[:alpha:]")) %>%
count(col, Variable)
# A tibble: 6 x 3
# Groups: col [3]
col Variable n
<chr> <chr> <int>
1 A Name 3
2 A test 2
3 B Name 4
4 B test 3
5 C Name 4
6 C test 4
We can use filter with across
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
df %>%
select(-1) %>%
mutate(across(everything(), as.character)) %>%
filter(across(everything(), ~ str_detect(., '[:alpha:]')) %>% reduce(`|`)) %>%
pivot_longer(everything()) %>%
filter(str_detect(value, '[:alpha:]')) %>%
count(value)
# A tibble: 2 x 2
# value n
# <chr> <int>
#1 Name 1
#2 test 1

How to reduce factor levels depending on other attribute?

I have a data frame of two columns, id and result, and I want to assign factor levels to result depending on id, so that for id "1", result c("a","b","c","d") will have factor levels 1,2,3,4.
For id "2", result c("22","23","24") will have factor levels 1,2,3.
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
I tried to group them with split, but the result is a list instead of a data frame, which causes a length problem for modeling. Can you help, please?
Though the question was closed as a duplicate by user @Ronak Shah, I don't believe it is the same question.
After numbering the rows by group, the new column must be coerced to class "factor".
library(dplyr)
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
df <- data.frame(id, result)
df %>%
group_by(id) %>%
mutate(fac = row_number()) %>%
ungroup() %>%
mutate(fac = factor(fac))
# A tibble: 7 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 23 2
#7 2 24 3
Edit.
If there are repeated values in result, coerce with factor and then as.integer to get the numbers, then coerce those numbers to factor.
id2 <- c(1,1,1,1,2,2,2,2)
result2 <- c("a","b","c","d","22", "22","23","24")
df2 <- data.frame(id = id2, result = result2)
df2 %>%
group_by(id) %>%
mutate(fac = as.integer(factor(result))) %>%
ungroup() %>%
mutate(fac = factor(fac))
# A tibble: 8 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
After grouping by id, we can use match with unique to assign a unique number to each result. Using @Rui Barradas' data frame df2:
library(dplyr)
df2 %>%
group_by(id) %>%
mutate(ans = match(result, unique(result))) %>%
ungroup %>%
mutate(ans = factor(ans))
# id result ans
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3

dplyr collapse 'tail' rows into larger groups

library(tidyverse)
df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
How do I make dplyr look at this data frame df and collapse all these occurrences of 2 into a single summed group, and collapse all the occurrences of 1 into a single summed group, while also keeping the rest of the data frame?
Turn this:
# A tibble: 20 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 2
11 11 2
12 12 2
13 13 2
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
into this:
# A tibble: 5 x 2
a b
<fct> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
[Edit] - I fixed the example data. Sorry about that.
We group by a manufactured sortkey to maintain the sort order. We used the fact that b is in descending order in the input; if that is not the case in your actual data, replace sortkey = -b with the more general sortkey = data.table::rleid(b) or the longer sortkey = cumsum(coalesce(b != lag(b), FALSE)).
We also convert b to the group names, giving a new a. It wasn't clear which groups are to be converted to grp... form: hard-coded 1 and 2? Any group with more than one row? Groups at the end with more than one row? At any rate, it would be easy enough to change the condition in the if_else once that is clarified.
Finally perform the summation and then remove the sortkey.
df %>%
group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
summarize(b = sum(b)) %>%
ungroup %>%
select(-sortkey)
giving:
# A tibble: 5 x 2
a b
<chr> <int>
1 50 50
2 20 20
3 13 13
4 grp2 20
5 grp1 7
Here's a way. I have converted a from factor to character to make things easier. You can convert it back to factor if you want. Also your test data was a bit wrong.
df <- tibble(a = as.character(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>%
mutate(
a = case_when(
b == 1 ~ "grp1",
b == 2 ~ "grp2",
TRUE ~ a
)
) %>%
group_by(a) %>%
summarise(b = sum(b))
# A tibble: 5 x 2
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp1 7
5 grp2 20
This approach gives you the desired group names and doesn't require you to know in advance how many such cases there are (e.g. it would create grp3, grp4, ... depending on the numbers in b).
library(dplyr)
df %>%
mutate(
grp = as.numeric(lag(df$b) != df$b),
grp = cumsum(ifelse(is.na(grp), 0, grp))
) %>% group_by(grp) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)
Output:
a b
<chr> <dbl>
1 1 50
2 2 20
3 3 13
4 grp2 20
5 grp1 7
Note that the code could also be condensed, but in my opinion that leads to a certain lack of readability:
df %>%
group_by(grp = cumsum(ifelse(is.na(as.numeric(lag(df$b) != df$b)), 0, as.numeric(lag(df$b) != df$b)))) %>%
mutate(
a = ifelse(n() > 1, paste0("grp", b), a),
b = sum(b)
) %>% ungroup() %>% distinct(a, b)

Finding the maximum value of the currently mutating variable in dplyr

While trying to work out this question (Identify duplicates of one value with different values in another column), I felt the solution was close, but I couldn't get there because in the code below dplyr's mutate refers to the pre-mutation maximum when I use max(ID), not the value as it is being updated (i.e. it does not work recursively).
The objective is to assign a new unique ID value to the rows where the current Address does not match the previous Address for the same ID.
The code I tried:
df <- read.table(text = 'ID Address
1 X
1 X
1 Y
2 Z
2 Z
3 A
3 B
4 C
4 D
4 E
5 F
5 F
5 F
', header= T, stringsAsFactors = F)
df %>% group_by(ID) %>% mutate(flag = ifelse(lag(Address)==Address,F,T)) %>%
mutate(flag = ifelse(is.na(flag),F,flag)) %>% ungroup() %>%
mutate(newID = ifelse(flag | is.na(flag), max(ID)+1,ID))%>%
select(ID = newID,Address)
Received Output:
# A tibble: 13 x 2
ID Address
<dbl> <chr>
1 1 X
2 1 X
3 6 Y
4 2 Z
5 2 Z
6 3 A
7 6 B
8 4 C
9 6 D
10 6 E
11 5 F
12 5 F
13 5 F
Expected Output:
ID Address
1 X
1 X
6 Y
2 Z
2 Z
3 A
7 B
4 C
8 D
9 E
5 F
5 F
5 F
Any help would be appreciated!
Edit:
Ideal code: where I should have been able to use newID, the variable currently being mutated:
df %>% group_by(ID) %>% mutate(flag = ifelse(lag(Address)==Address,F,T)) %>%
  mutate(flag = ifelse(is.na(flag),F,flag)) %>% ungroup() %>%
  mutate(newID = ifelse(flag | is.na(flag), max(newID)+1,ID))%>%
  select(ID = newID,Address)
One problem is max(ID) + 1, which gives a constant value; the second problem is ifelse itself, which requires 'yes' and 'no' vectors of equal length. In the solution below, we replace max(ID) + 1 with max(ID) + seq_len(sum(flag)) and use replace instead of ifelse.
df %>%
group_by(ID) %>%
mutate(flag = lag(Address, default = Address[1]) != Address) %>%
ungroup() %>%
mutate(newID = replace(ID, flag, max(ID) + seq_len(sum(flag))))%>%
select(ID = newID,Address)
# A tibble: 13 x 2
# ID Address
# <dbl> <chr>
# 1 1 X
# 2 1 X
# 3 6 Y
# 4 2 Z
# 5 2 Z
# 6 3 A
# 7 7 B
# 8 4 C
# 9 8 D
#10 9 E
#11 5 F
#12 5 F
#13 5 F
In addition, the two ifelse statements used to create 'flag' can be replaced by a single statement, as shown below.
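For reference, that single statement as it appears in the pipeline above (the first row of each group has no previous Address, so default = Address[1] makes the comparison FALSE there):
df %>%
  group_by(ID) %>%
  mutate(flag = lag(Address, default = Address[1]) != Address) %>%
  ungroup()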
