I have two tables that I'm trying to join in a particular way. One is a simple tibble that provides a HEX color and its category that it is associated with:
library(tibble)
library(dplyr)
colors <- tibble(Category = c("A", "B", "C", "D"),
Colors = c("#0079c0", "#cc9900", "#252525", "#c5120e"))
# A tibble: 4 × 2
Category Colours
<chr> <chr>
1 A #0079c0
2 B #cc9900
3 C #252525
4 D #c5120e
I have another tibble that lists the categories both as rows and columns, and those appear in a specific way:
Main_Table <- tibble(Category = c("A", "B", "C", "D"),
A = c(NA, "A", NA, NA),
B = c(NA, NA, NA, NA),
C = c(NA, "C", NA, NA),
D = c("D", "D", NA, NA))
# A tibble: 4 × 5
Category A B C D
<chr> <chr> <lgl> <chr> <chr>
1 A <NA> NA <NA> D
2 B A NA C D
3 C <NA> NA <NA> <NA>
4 D <NA> NA <NA> <NA>
I want to join the color into the main table based on whether its corresponding category is present under the variable that bears its name. For example, let's say that if I want category D's color to be included, I'd end up with the below:
Main_Table_Goal <- tibble(Category = c("A", "B", "C", "D"),
A = c(NA, "A", NA, NA),
B = c(NA, NA, NA, NA),
C = c(NA, "C", NA, NA),
D = c("D", "D", NA, NA),
color = c("#c5120e", "#c5120e", NA, NA))
# A tibble: 4 × 6
Category A B C D color
<chr> <chr> <lgl> <chr> <chr> <chr>
1 A <NA> NA <NA> D #c5120e
2 B A NA C D #c5120e
3 C <NA> NA <NA> <NA> <NA>
4 D <NA> NA <NA> <NA> <NA>
How do I achieve this using dplyr? I've been trying with *_join and other tricks, but I've not gotten anywhere.
EDIT: I should have mentioned that I'd like to eventually include this in a function, so ideally the code can be flexible to accommodate any number of categories.
Here is an option using match
Main_Table %>%
mutate(color = colors$Colors[match(D, colors$Category)])
# A tibble: 4 × 6
# Category A B C D color
# <chr> <chr> <lgl> <chr> <chr> <chr>
#1 A <NA> NA <NA> D #c5120e
#2 B A NA C D #c5120e
#3 C <NA> NA <NA> <NA> <NA>
#4 D <NA> NA <NA> <NA> <NA>
I am not sure how many categories you have in your data. But if you have only four (i.e., A, B, C, and D), the following would be one way. I wanted to work with one data frame. So I initially merged the two data frames. I converted B in logical to character since I wanted to use mutate_at(). Then, I replaced the four categories with the four colors. Finally, I removed Colors and converted B to logical.
library(dplyr)
left_join(Main_Table, colors) %>%
mutate(B = as.character(B)) %>%
mutate_at(vars(A:D),
funs(color = recode(., A = Colors[1],
B = Colors[2],
C = Colors[3],
D = Colors[4]))) %>%
select(-Colors) %>%
mutate(B = as.logical(B))
Given akrun's idea, you can do the following. As long as you can tell how many category you have, you just specify the columns in vars(). If all columns are in character, no need to convert logical to character.
left_join(Main_Table, colors) %>%
mutate(B = as.character(B)) %>%
mutate_at(vars(A:D),funs(color = Colors[match(., Category)])) %>%
select(-Colors) %>%
mutate(B = as.logical(B))
# Category A B C D A_color B_color C_color D_color
# <chr> <chr> <lgl> <chr> <chr> <chr> <chr> <chr> <chr>
#1 A <NA> NA <NA> D <NA> <NA> <NA> #c5120e
#2 B A NA C D #0079c0 <NA> #252525 #c5120e
#3 C <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
#4 D <NA> NA <NA> <NA> <NA> <NA> <NA> <NA>
This is a dynamic solution where you set the color parameter once at the top:
target_category <- 'D' # set color
target_category_table <- Main_Table %>%
select_(target_category) %>%
left_join(colors %>%
filter(Category == target_category) %>%
setNames(c(target_category, 'color')))
goal_table <- Main_Table %>%
bind_cols(select(target_category_table, color))
goal_table
Result:
# A tibble: 4 × 6
Category A B C D color
<chr> <chr> <lgl> <chr> <chr> <chr>
1 A <NA> NA <NA> D #c5120e
2 B A NA C D #c5120e
3 C <NA> NA <NA> <NA> <NA>
4 D <NA> NA <NA> <NA> <NA>
Related
I have this type of data:
df <- data.frame(
w1 = c("A", "B", "C", "E", "F", "G"),
w2 = c("B", "G", "C", "D", "E", "V"),
w3 = c("D", "S", "O", "F", NA, "N"),
w4 = c("E", "U", NA, "T", NA, NA),
w5 = c("C", NA, NA, NA, NA, NA)
)
I need to iterate through column pairs to rolling-paste the separate strings into bigrams. Note that in the actual data the strings are of variable character length and character type.
I've tried this but it fails:
df[, paste0("bigr_", 1:4, "_", 2:5)] <- lapply(df[, 1:5],
function(x) paste(x[i], x[i+1], sep = " "))
The expected output is:
w1 w2 w3 w4 w5 bigr_1_2 bigr_2_3 bigr_3_4 bigr_4_5
1 A B D E C A B B D D E E C
2 B G S U <NA> B G G S S U <NA>
3 C C O <NA> <NA> C C C O <NA> <NA>
4 E D F T <NA> E D D F F T <NA>
5 F E <NA> <NA> <NA> F E <NA> <NA> <NA>
6 G V N <NA> <NA> G V V N <NA> <NA>
I'd be most interested in a dplyr solution but am open and grateful for other solutions as well.
As you said you're most interested in a dplyr solution, this can be achieved using mutate() and across(). You can alter the function applied to each column if this doesn't achieve the exact desired output.
df %>%
mutate(
across(
# For the first four columns (i.e. has number 1-4 in column name)
matches("[1-4]"),
# Apply custom function
function(col) {
# Paste together
paste(
col, # the data in the current column
cur_data()[[which(names(cur_data()) == cur_column())+1]], # and the data in the next column along
sep = " "
)
},
.names = "{gsub(pattern = 'w', replacement = 'bigr_', {col})}" # alter name of new cols (replace 'w' with 'bigr_')
)
) %>%
# EDIT: added to rename columns to match desired output
rename_with(.cols = matches("bigr"),
.fn = function(colname) {
paste0(colname, "_", as.numeric(gsub(pattern = "bigr_", replacement = "", colname))+1)
})
df <- data.frame(
w1 = c("A", "B", "C", "E", "F", "G"),
w2 = c("B", "G", "C", "D", "E", "V"),
w3 = c("D", "S", "O", "F", NA, "N"),
w4 = c("E", "U", NA, "T", NA, NA),
w5 = c("C", NA, NA, NA, NA, NA)
)
library(tidyverse)
library(janitor)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(-rn, values_drop_na = TRUE) %>%
group_by(rn) %>%
mutate(bigr = paste0(value, "_", lead(value))) %>%
mutate(bigr = if_else(str_detect(bigr, "_NA"), NA_character_, bigr)) %>%
pivot_wider(rn, names_from = c(name), values_from = c(value, bigr)) %>%
remove_empty("cols") %>%
ungroup() %>%
select(-rn) %>%
rename_with(~str_remove(string = ., "value_")) %>%
rename_with(~str_replace(., "(_w)(\\d+)", "_\\2"))
#> # A tibble: 6 × 9
#> w1 w2 w3 w4 w5 bigr_1 bigr_2 bigr_3 bigr_4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A B D E C A_B B_D D_E E_C
#> 2 B G S U <NA> B_G G_S S_U <NA>
#> 3 C C O <NA> <NA> C_C C_O <NA> <NA>
#> 4 E D F T <NA> E_D D_F F_T <NA>
#> 5 F E <NA> <NA> <NA> F_E <NA> <NA> <NA>
#> 6 G V N <NA> <NA> G_V V_N <NA> <NA>
Created on 2022-04-26 by the reprex package (v2.0.1)
As long as you don't have a string that is NA, you could try:
df %>%
mutate(across(-1,
~ paste(get(paste0("w", match(cur_column(), names(cur_data())) - 1)), .),
.names = 'bigr_{paste0("w", match(.col, names(cur_data())) - 1)}_{.col}')) %>%
mutate(across(starts_with("bigr"),
~ if_else(str_count(., "NA") != 0, NA_character_, .)))
w1 w2 w3 w4 w5 bigr_w1_w2 bigr_w2_w3 bigr_w3_w4 bigr_w4_w5
1 A B D E C A B B D D E E C
2 B G S U <NA> B G G S S U <NA>
3 C C O <NA> <NA> C C C O <NA> <NA>
4 E D F T <NA> E D D F F T <NA>
5 F E <NA> <NA> <NA> F E <NA> <NA> <NA>
6 G V N <NA> <NA> G V V N <NA> <NA>
As you are open to non-dplyr solutions, we can do it in base R by modifying your original code:
df[, paste0("bigr_", 1:4, "_", 2:5)] <- mapply(paste, df[, 1:4], df[, 2:5])
# as NA is coerced to character, we need to find those positions and correct
x <- which(is.na(df[, 1:4]) | is.na(df[, 2:5]), arr.ind = TRUE)
x[, 2] <- x[, 2] + 5
df[x] <- NA
df
# w1 w2 w3 w4 w5 bigr_1_2 bigr_2_3 bigr_3_4 bigr_4_5
# 1 A B D E C A B B D D E E C
# 2 B G S U <NA> B G G S S U <NA>
# 3 C C O <NA> <NA> C C C O <NA> <NA>
# 4 E D F T <NA> E D D F F T <NA>
# 5 F E <NA> <NA> <NA> F E <NA> <NA> <NA>
# 6 G V N <NA> <NA> G V V N <NA> <NA>
We can use the tidytext package as follows:
df %>%
rowid_to_column() %>%
unite(col, -rowid, sep = ' ') %>%
tidytext::unnest_ngrams(value, 'col', 2, to_lower = FALSE) %>%
group_by(rowid) %>%
mutate(name = row_number()) %>%
pivot_wider(rowid, names_prefix = 'bgram_')
# A tibble: 6 x 5
# Groups: rowid [6]
rowid bgram_1 bgram_2 bgram_3 bgram_4
<int> <chr> <chr> <chr> <chr>
1 1 A B B D D E E C
2 2 B G G S S U U NA
3 3 C C C O O NA NA NA
4 4 E D D F F T T NA
5 5 F E E NA NA NA NA NA
6 6 G V V N N NA NA NA
using data.table
df[, (paste("bigr", 1:4, 2:5, sep = "_")) := Map(function(x, y) ifelse(is.na(x) | is.na(y), NA, paste(x, y)), .SD[, 1:4], .SD[, 2:5])]
I was having the same problem as How to find last column with value (for each row) in R?, except I have rows with no value (entire row of NA). The sample provided in said post did not have an entire row of NAs.
I was wondering how I should modify the following? I do not want to remove those rows with all NAs because they will be useful in later analysis.
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
tail(na.omit(tmp), 1)}) %>%
ungroup
Thanks a lot in advance!
If all the elements in the rows are empty, then a general solution would be to create condition to return NA for those rows
library(dplyr)
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
if(all(is.na(tmp))) NA_character_ else
tail(na.omit(tmp), 1)}) %>%
ungroup
-output
# A tibble: 4 × 5
id m_1 m_2 m_3 m
<dbl> <chr> <chr> <chr> <chr>
1 1 a e i i
2 2 b <NA> <NA> b
3 3 <NA> <NA> <NA> <NA>
4 4 d h l l
If the OP wants to return only the last single non-NA element, we may also add an index [1] to extract, which automatically return NA when there are no elements
df %>%
rowwise %>%
mutate(m = {tmp <- c_across(starts_with('m'))
tail(na.omit(tmp), 1)[1]}) %>%
ungroup
# A tibble: 4 × 5
id m_1 m_2 m_3 m
<dbl> <chr> <chr> <chr> <chr>
1 1 a e i i
2 2 b <NA> <NA> b
3 3 <NA> <NA> <NA> <NA>
4 4 d h l l
data
df <- structure(list(id = c(1, 2, 3, 4), m_1 = c("a", "b", NA, "d"),
m_2 = c("e", NA, NA, "h"), m_3 = c("i", NA, NA, "l")), row.names = c(NA,
-4L), class = "data.frame")
Using data from #akrun (many thanks) we could do maybe this way:
'\\b[^,]+$' is a regular expression:
\\ ... means escape (in other words do not match) this is R special in other languages it is only one \
\\b... The metacharacter \b is an anchor like ^ and $ sign. It matches at a position that is called a “word boundary”. This match is zero-length.
[^,]+... stands for character class, here special with the ^caret: One character that is not ,. The + means here one or more ,
$ ... means end of string or end of line depending on multiline mode.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(across(starts_with("m"), ~case_when(!is.na(.) ~ cur_column()), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ', ') %>%
mutate(New_Col = str_extract(New_Col, '\\b[^,]+$'))
id m_1 m_2 m_3 New_Col
1 1 a e i m_3
2 2 b <NA> <NA> m_1
3 3 <NA> <NA> <NA> <NA>
4 4 d h l m_3
library(tidyverse)
df <- data.frame(id = c(1, 2, 3, 4), m_1 = c("a", NA, "c", "d"), m_2 = c("e", NA, "g", "h"), m_3 = c("i", NA, NA, "l"))
df %>%
rowwise() %>%
mutate(
nms = list(str_subset(names(df), "^m")),
m = c_across(starts_with("m")) %>%
{
ifelse(test = all(is.na(.)),
yes = NA,
no = nms[which(. == tail(na.omit(.), 1))]
)
}
) %>%
select(-nms)
#> # A tibble: 4 × 5
#> # Rowwise:
#> id m_1 m_2 m_3 m
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 a e i m_3
#> 2 2 <NA> <NA> <NA> <NA>
#> 3 3 c g <NA> m_2
#> 4 4 d h l m_3
# only the value no the column name
df %>%
rowwise() %>%
mutate(
m = c_across(starts_with("m")) %>%
{
ifelse(test = all(is.na(.)),
yes = NA,
no = tail(na.omit(.), 1)
)
}
)
#> # A tibble: 4 × 5
#> # Rowwise:
#> id m_1 m_2 m_3 m
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 a e i i
#> 2 2 <NA> <NA> <NA> <NA>
#> 3 3 c g <NA> g
#> 4 4 d h l l
Created on 2022-01-01 by the reprex package (v2.0.1)
I have seen lots of posts on how to remove rows if user specified columns contain a certain string.
I want to do the reverse and generalise it. I want to remove every column if any row in that column contains a certain string. (To compare with Excel, I would find all cells containing a given string and then delete every column.)
How can I do this? I was thinking with dplyr and filter, but I have to specify columns I think, or at least the way I would know how to approach it. But I have 300 odd columns and almost 4000 rows.
EDIT: Here is a sample of my dataframe.
# A tibble: 6 x 310
ISIN AU000KFWHAC9 AU3CB0243657 AU3CB0256162 AU3CB0260321 AU3CB0265239 AU3CB0283190 AU3SG0001928 AU3SG0002371
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Timestamp MID_PRICE Mid Price Cl~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~ Mid Price C~
2 41275 Invalid RIC. NA NA Invalid RIC. NA Invalid RIC. NA NA
3 41276 NA NA NA NA NA NA NA NA
4 41277 NA NA NA NA 3 NA NA NA
5 41278 NA NA NA NA NA NA NA NA
6 41279 5 NA 4 NA NA NA NA NA
So as you can see, the dataframe is full of lots of NA's. I am unsure if this will affect some functions' ability.
With a dataframe of:
> df <- data.frame(a=c("a", "b", "c"), b=c("bad string", "d", "e"), c=c("f", "g", "h"))
> df
a b c
1 a bad string f
2 b d g
3 c e h
>
Use colSums:
> df[, !colSums(df == "bad string")]
a c
1 a f
2 b g
3 c h
>
Only keep columns where colSums is 0.
You can grep your search:
dat[,-grep("Invalid", dat)]
ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
1 Timestamp MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~
2 41275 <NA> <NA> <NA> <NA> <NA>
3 41276 <NA> <NA> <NA> <NA> <NA>
4 41277 <NA> <NA> 3 <NA> <NA>
5 41278 <NA> <NA> <NA> <NA> <NA>
6 41279 <NA> 4 <NA> <NA> <NA>
Data:
dat <- structure(list(ISIN = c("Timestamp", "41275", "41276", "41277",
"41278", "41279"), AU000KFWHAC9 = c("MID_PRICE", "Invalid_RIC.",
NA, NA, NA, "5"), AU3CB0243657 = c("MidPriceC~", NA, NA, NA,
NA, NA), AU3CB0256162 = c("MidPriceC~", NA, NA, NA, NA, "4"),
AU3CB0260321 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA,
NA), AU3CB0265239 = c("MidPriceC~", NA, NA, "3", NA, NA),
AU3CB0283190 = c("MidPriceC~", "Invalid_RIC.", NA, NA, NA,
NA), AU3SG0001928 = c("MidPriceC~", NA, NA, NA, NA, NA),
AU3SG0002371 = c("MidPriceC~", NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
A solution using dplyr. We can use select and where to apply a function to check if a column contains a certain string or not. dat is from Andre Wildberg's answer.
library(dplyr)
dat2 <- dat %>%
select(where(function(x) all(!grepl("Invalid", x))))
dat2
# ISIN AU3CB0243657 AU3CB0256162 AU3CB0265239 AU3SG0001928 AU3SG0002371
# 1 Timestamp MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~ MidPriceC~
# 2 41275 <NA> <NA> <NA> <NA> <NA>
# 3 41276 <NA> <NA> <NA> <NA> <NA>
# 4 41277 <NA> <NA> 3 <NA> <NA>
# 5 41278 <NA> <NA> <NA> <NA> <NA>
# 6 41279 <NA> 4 <NA> <NA> <NA>
Given the following example dataset:
df <- structure(list(Id = 1:10,
Department = c("A", "B", "A", "C",
"A", "B", "B", "C", "D", "A"),
Q1 = c("US", NA, NA, "US",
NA, "US", NA, "US", NA, "US"),
Q2 = c("Comp B", NA, NA,
"Comp B", "Comp B", NA, "Comp B", NA, "Comp B", "Comp B"),
Q3 = c(NA, NA, NA, NA, NA, NA, "Comp C", NA, NA, NA),
Q4 = c(NA, "Comp D", NA, "Comp D", NA, NA, NA, NA, "Comp D", NA),
Sales = c(10, 23, 12, 5, 5, 76, 236, 4, 3, 10)),
row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
Is there a way to replace all non NA values in columns Q2:Q4 with, for instance, the word "Competitor" all at once? I know how to do string_replace on individual columns but with over 100 columns, with different words to be replaced in each, I'm hoping there is a quicker way. I tried messing around with various versions of mutate(across(Q2:Q4, ~str_replace(.x, !is.na, "Competitor"))), which I modelled after mutate(across(Q2:Q4, ~replace_na(.x, 0))) but that didn't work. I'm still not clear on the syntax on across except for the most simple operations and don't even know if it is applicable here.
Thanks!
str_replace is for replacing substring. The second argument with is.na is not be called i.e is.na is a function. We could use replace to replace the entire non-NA element
library(dplyr)
df1 <- df %>%
mutate(across(Q2:Q4, ~ replace(., !is.na(.), "Competitor")))
-output
# A tibble: 10 x 7
Id Department Q1 Q2 Q3 Q4 Sales
<int> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 A US Competitor <NA> <NA> 10
2 2 B <NA> <NA> <NA> Competitor 23
3 3 A <NA> <NA> <NA> <NA> 12
4 4 C US Competitor <NA> Competitor 5
5 5 A <NA> Competitor <NA> <NA> 5
6 6 B US <NA> <NA> <NA> 76
7 7 B <NA> Competitor Competitor <NA> 236
8 8 C US <NA> <NA> <NA> 4
9 9 D <NA> Competitor <NA> Competitor 3
10 10 A US Competitor <NA> <NA> 10
Or in base R
nm1 <- grep("^Q[2-4]$", names(df), value = TRUE)
df[nm1][!is.na(df[nm1])] <- "Competitor"
Here is another option:
library(dplyr)
library(purrr)
df %>%
mutate(pmap_df(select(df, Q2:Q4), ~ replace(c(...), !is.na(c(...)), "Competitor")))
# A tibble: 10 x 7
Id Department Q1 Q2 Q3 Q4 Sales
<int> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 A US Competitor NA NA 10
2 2 B NA NA NA Competitor 23
3 3 A NA NA NA NA 12
4 4 C US Competitor NA Competitor 5
5 5 A NA Competitor NA NA 5
6 6 B US NA NA NA 76
7 7 B NA Competitor Competitor NA 236
8 8 C US NA NA NA 4
9 9 D NA Competitor NA Competitor 3
10 10 A US Competitor NA NA 10
Suppose the following data structure:
structure(list(`1.a` = c("a", NA, "a"), `1.b` = c("b", "b", NA
), `2` = c("ba", "ba", "ab"), `3.a` = c("a", "a", NA), `3.b` = c("b",
NA, "b")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
# A tibble: 3 x 5
`1.a` `1.b` `2` `3.a` `3.b`
<chr> <chr> <chr> <chr> <chr>
1 a b ba a b
2 NA b ba a NA
3 a NA ab NA b
Now, for the columns with .a etc., I want to create a column named X (the part/number before the dot) in front of the .a columns and paste the cell values together, keeping the "order". Result I want:
# A tibble: 3 x 7
`1` `1.a` `1.b` `2` `3` `3.a` `3.b`
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 ab a b ab ab a b
2 b NA b a a a NA
3 a a NA b b NA b
A base R option
do.call(
cbind,
unname(
lapply(
split.default(df, gsub("\\..*", "", names(df))),
function(x) {
if (length(x) > 1) {
cbind(
setNames(
data.frame(
apply(x, 1, function(v) paste0(na.omit(unlist(v)), collapse = ""))
),
unique(gsub("\\..*", "", names(x)))
),
x
)
} else {
x
}
}
)
)
)
gives
1 1.a 1.b 2 3 3.a 3.b
1 ab a b ba ab a b
2 b <NA> b ba a a <NA>
3 a a <NA> ab b <NA> b