How can I reduce this code with the DRY principle? - r

I am practicing the DRY principle in my R code and I have reached a point where I cannot reduce the number of lines any further. The code is very repetitive and I would like your help.
Here is a reproducible example:
library(tidyverse)
library(magrittr) # provides the %<>% assignment pipe used below
set.seed(2023)
# first, I generate the data
data <- as.data.frame(cbind(
replicate(10, sample(0:1, 7, replace = TRUE)),
replicate(10, sample(30:100, 7, replace = TRUE))
))
names(data) <- c(sprintf("var1_%02d", 1:10), sprintf("var2_%02d", 1:10))
data
# var1_01 var1_02 var1_03 var1_04 var1_05 var1_06 var1_07 var1_08 var1_09 var1_10 var2_01 var2_02 var2_03 var2_04 var2_05 var2_06 var2_07 var2_08 var2_09 var2_10
# 1 0 1 0 0 0 0 0 0 0 0 61 72 74 58 85 93 85 46 99 55
# 2 1 1 0 1 0 0 0 1 1 0 66 56 91 72 77 53 61 34 57 43
# 3 0 0 1 1 1 1 0 1 1 1 71 89 49 99 38 84 53 41 95 64
# 4 0 0 0 0 1 0 1 1 1 1 50 91 83 61 81 41 71 83 96 81
# 5 1 0 1 1 1 1 1 1 0 1 41 61 79 67 96 98 97 60 36 90
# 6 0 0 0 1 1 1 1 1 1 1 60 93 39 86 53 82 69 39 67 54
# 7 1 0 0 0 1 0 0 1 1 0 57 96 82 47 95 41 100 53 98 45
This is the code I want to reduce:
data %<>%
mutate(var3_01 = case_when(var1_01 == 1 ~ var2_01 + 0, TRUE ~ 0),
var3_02 = case_when(var1_02 == 1 ~ var2_02 + 0, TRUE ~ 0),
var3_03 = case_when(var1_03 == 1 ~ var2_03 + 0, TRUE ~ 0),
var3_04 = case_when(var1_04 == 1 ~ var2_04 + 0, TRUE ~ 0),
var3_05 = case_when(var1_05 == 1 ~ var2_05 + 0, TRUE ~ 0),
var3_06 = case_when(var1_06 == 1 ~ var2_06 + 0, TRUE ~ 0),
var3_07 = case_when(var1_07 == 1 ~ var2_07 + 0, TRUE ~ 0),
var3_08 = case_when(var1_08 == 1 ~ var2_08 + 0, TRUE ~ 0),
var3_09 = case_when(var1_09 == 1 ~ var2_09 + 0, TRUE ~ 0),
var3_10 = case_when(var1_10 == 1 ~ var2_10 + 0, TRUE ~ 0))
The goal is that, for each row, var3_* takes the value of var2_* when var1_* == 1 and 0 otherwise. However, I have not been able to write this in a shorter form (tidyverse or base R, either is fine). I tried this:
numbers <- c(paste0("0", 1:5))
data %<>%
mutate(across(starts_with("var1_"), ~ifelse(isTRUE(.x==1), .x:=data[, 6:10], 0), .names="var3_{numbers}"))
But this code does not generate the same result as the extended version. I appreciate any suggestion!

Staying within tidyverse
You can use across together with get and cur_column() inside ifelse to avoid the repetition.
cols <- names(data)[1:10]
data |>
  mutate(across(all_of(cols), \(x) {
    ifelse(x == 1, get(sub("var1", "var2", cur_column())), 0)
  }, .names = "{sub('var1', 'var3', .col)}"))
var1_01 var1_02 var1_03 var1_04 var1_05 var1_06 var1_07 var1_08 var1_09 var1_10 var2_01 var2_02 var2_03 var2_04
1 0 0 1 1 1 0 0 1 1 1 31 74 42 60
2 0 1 0 0 1 0 1 0 1 1 92 63 57 98
3 1 1 0 1 0 0 0 1 1 0 53 89 64 42
4 0 1 0 0 0 1 0 1 1 1 55 37 41 97
5 0 0 0 0 1 1 0 0 0 1 47 87 56 60
6 0 0 1 0 1 0 0 0 0 1 99 73 79 31
7 1 0 0 1 0 0 0 1 1 0 61 44 52 90
var2_05 var2_06 var2_07 var2_08 var2_09 var2_10 var3_01 var3_02 var3_03 var3_04 var3_05 var3_06 var3_07 var3_08
1 60 55 57 67 97 40 0 0 42 60 60 0 0 67
2 97 78 74 30 90 49 0 63 0 0 97 0 74 0
3 77 43 52 84 43 78 53 89 0 42 0 0 0 84
4 95 94 65 86 32 82 0 37 0 0 0 94 0 86
5 47 65 100 70 91 40 0 0 0 0 47 65 0 0
6 93 77 92 57 76 93 0 0 79 0 93 0 0 0
7 46 100 74 35 38 56 61 0 0 90 0 0 0 35
var3_09 var3_10
1 97 40
2 90 49
3 43 0
4 32 82
5 0 40
6 0 93
7 38 0

Solution 1
Multiplying across(contains("var1")) and across(contains("var2")) together:
sol1 <- data %>%
mutate(across(contains("var1"), ~ .x == 1, .names = "{sub('var1', 'var3', .col)}") *
across(contains("var2")))
Solution 2
Using cur_column() inside across to match var1_* and var2_*:
sol2 <- data %>%
mutate(across(contains("var1"),
~ ifelse(.x == 1, pick(everything())[[sub('var1', 'var2', cur_column())]], 0),
.names = "{sub('var1', 'var3', .col)}"))
Solution 3
Assuming the var1_* and var2_* columns appear in the same order, you can use across to select and rename the columns, convert each subset to a matrix, and pass both to ifelse.
sol3 <- data %>%
mutate(as_tibble(
ifelse(as.matrix(across(contains("var1"), .names = "{sub('var1', 'var3', .col)}")) == 1,
as.matrix(across(contains("var2"))), 0)
))
all.equal(sol1, sol2)
# [1] TRUE
all.equal(sol2, sol3)
# [1] TRUE
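The pairing of var1_* with var2_* can also be made explicit by name in base R, which avoids relying on column order. The following is a minimal sketch on made-up toy data (the data frame `df` and its values are assumptions for illustration, not the question's data):

```r
# Toy data with the same naming scheme as the question (values are made up)
df <- data.frame(
  var1_01 = c(0, 1, 1), var1_02 = c(1, 0, 1),
  var2_01 = c(61, 66, 71), var2_02 = c(72, 56, 89)
)

v1 <- grep("^var1_", names(df), value = TRUE)  # indicator columns
v2 <- sub("^var1", "var2", v1)                 # matching value columns, paired by suffix
v3 <- sub("^var1", "var3", v1)                 # names for the new columns

# Map walks both column lists in parallel; ifelse keeps var2 where var1 == 1
df[v3] <- Map(function(flag, value) ifelse(flag == 1, value, 0), df[v1], df[v2])
df
```

Because v2 is derived from v1 with sub, each indicator column is matched to the value column with the same suffix even if the columns are shuffled.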

If you want to maintain the wide data format, I would use matrices. Something like this:
set.seed(2023)
DF <- do.call(data.frame,
              c(setNames(replicate(10, sample(0:1, 7, replace = TRUE), simplify = FALSE),
                         sprintf("var1_%02d", 1:10)),
                setNames(replicate(10, sample(30:100, 7, replace = TRUE), simplify = FALSE),
                         sprintf("var2_%02d", 1:10))))
foo <- function(a, b) {
a <- as.matrix(a)
b <- as.matrix(b)
b[a == 0] <- 0
colnames(b) <- gsub("var2", "var3", colnames(b))
as.data.frame(b)
}
DF <- cbind(DF, foo(DF[, grepl("var1", names(DF))],
DF[, grepl("var2", names(DF))]))
all.equal(data, DF)
#[1] TRUE

I would recommend first making your data tidy (i.e. longer and less wide):
data_long <- data %>%
mutate(.before = 1, record_id = row_number()) %>%
pivot_longer(cols = -record_id, names_to = c("var", "j"), names_sep = "_", values_to = "value") %>%
pivot_wider(names_from = var, values_from = value)
Which gives the following:
# A tibble: 70 x 4
record_id j var1 var2
<int> <chr> <int> <int>
1 1 01 1 78
2 1 02 0 33
3 1 03 1 77
4 1 04 0 80
5 1 05 0 100
6 1 06 1 36
7 1 07 0 91
8 1 08 0 39
9 1 09 0 65
10 1 10 0 88
Then the implementation is straightforward:
data_long = data_long %>% mutate(var3 = if_else(var1==1L, var2, 0L))
Finally, if you really need the output to be wide:
data_long %>%
pivot_wider(id_cols = record_id, names_from = j, values_from = var1:var3)

Related

Interaction terms loop in R

Here is a small example of data. Imagine I have many more covariates than this.
install.packages("mltools")
library(mltools)
library(dplyr)
set.seed(1234)
data <- tibble::tibble( # tibble::data_frame() is deprecated
age = round(runif(60, min = 48, max = 90)),
gender = sample(c(0,1), replace=TRUE, size=60),
weight = round(runif(60, min = 100, max = 300)),
group = sample(letters[1:4], size = 60, replace = TRUE))
one_hot <- data[,c("group")] %>%
glmnet::makeX() %>%
data.frame()
data$group <- NULL
data <- cbind(data, one_hot)
I want to create a data.frame containing the interaction of each group column (groupa, groupb, groupc, groupd) with every other variable (age, gender, weight):
groupa * age
groupa * gender
groupa * weight
Same for the groupb, groupc, and groupd.
I've seen many questions about all possible interaction generators.
But I haven't seen any that show interaction with one column and the rest.
Hope this question was clear enough.
Thanks.
I am sure there is a more elegant solution, but you could try writing your own function that does the interaction then use apply to go over the columns and do.call to combine everything:
intfun <- function(var){
data %>%
mutate(across(starts_with("group"),~.*{{var}})) %>%
select(starts_with("group"))
}
int_terms <- cbind(data,
do.call(cbind, apply(data[,1:3], 2, function(x) intfun(x))))
Output (note not all columns presented here):
# > head(int_terms)
# age gender weight groupa groupb groupc groupd age.groupa age.groupb age.groupc age.groupd gender.groupa gender.groupb gender.groupc gender.groupd weight.groupa
# 1 88 33 113 0 1 0 0 0 88 0 0 0 33 0 0 0
# 2 49 33 213 1 0 0 0 49 0 0 0 33 0 0 0 213
# 3 83 33 152 1 0 0 0 83 0 0 0 33 0 0 0 152
# 4 75 33 101 0 1 0 0 0 75 0 0 0 33 0 0 0
# 5 61 33 218 0 1 0 0 0 61 0 0 0 33 0 0 0
# 6 79 33 204 1 0 0 0 79 0 0 0 33 0 0 0 204
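The same idea can be sketched in base R by listing every (covariate, group) pair and multiplying the two columns. This is only an illustration on made-up toy data; the object names `dat`, `covars`, `groups` and `pairs` are assumptions, not part of the answer above:

```r
# Toy data; column names mirror the question, values are made up
dat <- data.frame(
  age = c(50, 60, 70), gender = c(0, 1, 1), weight = c(150, 200, 250),
  groupa = c(1, 0, 0), groupb = c(0, 1, 1)
)

covars <- c("age", "gender", "weight")
groups <- grep("^group", names(dat), value = TRUE)

# One row per (covariate, group) pair
pairs <- expand.grid(covar = covars, group = groups, stringsAsFactors = FALSE)

# Multiply each pair of columns and name the result "covar.group"
int_terms <- Map(function(cv, g) dat[[cv]] * dat[[g]], pairs$covar, pairs$group)
names(int_terms) <- paste(pairs$covar, pairs$group, sep = ".")

out <- cbind(dat, as.data.frame(int_terms))
out
```

Adding more covariates only requires extending `covars`; the group columns are picked up by their name prefix.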

R: table frequencies of letters in string based on Alphabet

I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to count how many times each letter (a-z) appears across the list, and then tabulate the counts by position.
For example, if my word list is: words <- c("swims", "seems", "gills", "draws", "which", "water")
then the result table should look like this:
letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...continued until z
All words are of same length (5).
What I have so far is:
alphabet <- letters[1:26]
words.df <- data.frame("Words" = words)
words.df <- words.df %>% mutate("First_place" = substr(words.df$Words,1,1))
words.df <- words.df %>% mutate("Second_place" = substr(words.df$Words,2,2))
words.df <- words.df %>% mutate("Third_place" = substr(words.df$Words,3,3))
words.df <- words.df %>% mutate("Fourth_place" = substr(words.df$Words,4,4))
words.df <- words.df %>% mutate("Fifth_place" = substr(words.df$Words,5,5))
x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))
x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))
x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))
x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))
x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))
My code is not efficient and gives a separate table for each letter position. All help will be appreciated.
In base R, use table:
table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
pos
let 1 2 3 4 5
a 0 1 1 0 0
c 0 0 0 1 0
d 1 0 0 0 0
e 0 1 1 1 0
g 1 0 0 0 0
h 0 1 0 0 1
i 0 1 2 0 0
l 0 0 1 1 0
m 0 0 0 2 0
r 0 1 0 0 1
s 2 0 0 0 4
t 0 0 1 0 0
w 2 1 0 1 0
Note that if you need all the values from a-z then use
table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
Also to get a dataframe you could do:
d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
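As a quick check, the table call can be run on the six-word example from the question and compared with the expected rows. This is a small self-contained sketch; the names `chars`, `pos` and `freq` are only for illustration:

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

chars <- unlist(strsplit(words, ""))   # every letter of every word, in order
pos   <- sequence(nchar(words))        # position of each letter within its word

# factor(..., levels = letters) keeps rows for letters that never occur
freq <- table(letter = factor(chars, levels = letters), position = pos)

freq[c("a", "e", "s"), ]
```

Here "s" occurs twice in the first position and four times in the fifth, which agrees with the output shown above.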
Here is a tidyverse solution using dplyr, purrr, and tidyr:
strsplit(words.df$Words, "") %>%
map_dfr(~setNames(.x, seq_along(.x))) %>%
pivot_longer(everything(),
values_drop_na = T,
names_to = "pos",
values_to = "letter") %>%
count(pos, letter) %>%
pivot_wider(names_from = pos,
names_glue = "pos{pos}",
id_cols = letter,
values_from = n,
values_fill = 0L)
Output (from a longer word list than the six-word example, hence positions up to 11)
letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1 a 65 127 88 38 28 17 14 5 3 0 0
2 b 58 4 7 9 2 4 2 0 1 0 0
3 c 83 14 45 37 20 19 8 3 2 0 0
4 C 2 0 0 0 0 0 0 0 0 0 0
5 d 43 8 33 47 21 22 9 3 1 1 0
6 e 45 156 81 132 114 69 48 23 14 2 2
7 f 54 11 18 10 5 2 1 0 0 0 0
8 g 23 7 27 21 15 8 7 1 0 0 0
9 h 38 56 6 28 21 10 3 3 1 1 0
10 i 25 106 51 58 38 28 8 4 1 0 0
11 j 6 0 2 2 0 0 0 0 0 0 0
12 k 9 1 6 22 12 0 0 0 0 0 0
13 l 45 41 54 54 36 9 7 6 0 2 0
14 m 45 8 31 19 8 8 4 2 0 0 0
15 n 23 42 75 53 34 41 16 16 4 2 0
16 o 28 167 76 41 38 9 11 2 1 0 0
17 p 72 20 34 30 8 3 1 1 1 0 0
18 q 7 2 1 0 0 0 0 0 0 0 0
19 r 46 74 92 59 56 45 12 9 1 1 0
20 s 119 8 67 35 31 22 18 4 1 0 0
21 t 65 30 73 83 57 42 31 9 6 3 1
22 u 12 66 39 36 20 7 7 2 0 0 0
23 v 8 7 20 12 5 5 1 0 0 0 0
24 w 53 8 13 10 2 3 0 1 0 0 0
25 y 6 4 16 15 17 15 10 5 6 1 1
26 x 0 12 5 0 0 0 0 0 0 0 0
27 z 0 0 1 0 0 0 1 1 0 0 0

Giving a conditional onto a function that performs for factors

This question ties into this one here, and at @akrun's request I'm asking it as a separate follow-up.
Essentially, if I insert a dataframe within the following conditional:
if(length(weight) > 0) {weight %>%
select(where(negate(is.numeric))) %>%
map_dfc(~ model.matrix(~ .x -1) %>%
as_tibble) %>%
rename_all(~ str_remove(., "\\.x")) %>%
bind_rows(weight, .)
}
Assignment:
# Following @akrun's suggestion for turning numeric columns into factors:
# i1 <- sapply(weight, is.numeric); df[i1] <- lapply(weight[i1], factor)
# and then use Filter(function(x) is.factor(x) | is.character(x), weight)
test = function(data) {
x = as.data.frame(Reduce(cbind, lapply(data, function(col) model.matrix(~ . -1, data = data.frame(col)))))
setNames(x, sub(pattern = "^col", replacement = "", names(x)))
}
test(weight)
#Missing column names
1 64 57 8 1 0 0 1 0
2 71 59 10 1 0 0 1 0
3 53 49 6 1 0 0 1 0
4 67 62 11 1 0 0 1 0
5 55 51 8 0 0 1 1 0
6 58 50 7 0 0 1 1 0
7 77 55 10 0 0 1 0 1
8 57 48 9 0 0 1 0 1
9 56 42 10 0 1 0 0 1
10 51 42 6 0 1 0 0 1
11 76 61 12 0 1 0 0 1
12 68 57 9 0 1 0 0 1
Then if weight has factors, it splits each factor column into one indicator column per level, with 1 where that level appeared and 0 elsewhere.
However, if I input a numeric-only dataframe, it returns character(0). The question is how to add a conditional to the function so that a numeric-only dataframe x is returned as it is, while a dataframe containing factors returns the requested output.
The reason I ask is that I want to use this inside another function that will process many dataframes, some numeric only and others including factors. In that case, I can refer to the dataframe as x within the function.
My edit of the function:
fact_col <- function(x){
if(length(x) > 0) {
weight_sub <- x %>%
select(where(is.factor))
weight_sub %>%
map_dfc(~ model.matrix(~ .x -1) %>%
as_tibble) %>%
rename_all(~ str_remove(., "\\.x")) %>%
bind_cols(weight_sub, .) -> x
x<- x%>% select(!where(is.factor))
x<- data.frame(sapply(x, as.numeric))
}}
expected output:
#when x is numeric
function(x) { ... }
Richness pat
1 20 1
2 17 2
3 18 3
4 19 4
5 11 5
6 15 6
7 17 7
8 15 8
9 15 9
10 9 10
11 13 11
12 14 12
#when x is a factor
function(x) { ... }
wgt hgt age id sex black brown white female male
1 64 57 8 black female 1 0 0 1 0
2 71 59 10 black female 1 0 0 1 0
3 53 49 6 black female 1 0 0 1 0
4 67 62 11 black female 1 0 0 1 0
5 55 51 8 white female 0 0 1 1 0
6 58 50 7 white female 0 0 1 1 0
7 77 55 10 white male 0 0 1 0 1
8 57 48 9 white male 0 0 1 0 1
9 56 42 10 brown male 0 1 0 0 1
10 51 42 6 brown male 0 1 0 0 1
11 76 61 12 brown male 0 1 0 0 1
12 68 57 9 brown male 0 1 0 0 1
reproducible code:
structure(list(wgt = c(64L, 71L, 53L, 67L, 55L, 58L, 77L, 57L,
56L, 51L, 76L, 68L), hgt = c(57L, 59L, 49L, 62L, 51L, 50L, 55L,
48L, 42L, 42L, 61L, 57L), age = c(8L, 10L, 6L, 11L, 8L, 7L, 10L,
9L, 10L, 6L, 12L, 9L), id = structure(c(1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 2L), .Label = c("black", "brown", "white"
), class = "factor"), sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("female", "male"), class = "factor")), class = "data.frame", row.names = c(NA,
-12L))
One option is to split the code before the if: first select the factor columns into a new object ('weight_sub'), then check the length of 'weight_sub'. If it is greater than 0, run the model.matrix step and assign the result back to 'weight'.
weight_sub <- weight %>%
select(where(is.factor))
if(length(weight_sub) > 0) {
weight_sub %>%
map_dfc(~ model.matrix(~ .x -1) %>%
as_tibble) %>%
rename_all(~ str_remove(., "\\.x")) %>%
bind_cols(weight, .) -> weight
}
-output
# wgt hgt age id sex black brown white female male
#1 64 57 8 black female 1 0 0 1 0
#2 71 59 10 black female 1 0 0 1 0
#3 53 49 6 black female 1 0 0 1 0
#4 67 62 11 black female 1 0 0 1 0
#5 55 51 8 white female 0 0 1 1 0
#6 58 50 7 white female 0 0 1 1 0
#7 77 55 10 white male 0 0 1 0 1
#8 57 48 9 white male 0 0 1 0 1
#9 56 42 10 brown male 0 1 0 0 1
#10 51 42 6 brown male 0 1 0 0 1
#11 76 61 12 brown male 0 1 0 0 1
#12 68 57 9 brown male 0 1 0 0 1
As a negative test, do the same while selecting character columns instead:
weight_sub <- weight %>%
select(where(is.character))
if(length(weight_sub) > 0) {
weight_sub %>%
map_dfc(~ model.matrix(~ .x -1) %>%
as_tibble) %>%
rename_all(~ str_remove(., "\\.x")) %>%
bind_cols(weight, .) -> weight
}
No output as the if condition returns FALSE, thus the 'weight' dataset remains the same without adding any new columns
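Since the OP wants this inside a function, the check can be wrapped up as follows. This is a base R sketch under stated assumptions: the helper name `expand_factors` and the toy data frames `num` and `mix` are hypothetical, and factor columns are assumed to contain no NA values.

```r
# Hypothetical helper: expand factor columns into 0/1 indicator columns,
# returning the input unchanged when it has no factor columns.
expand_factors <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  if (!any(is_fac)) {
    return(df)                                 # numeric-only input: nothing to do
  }
  dummies <- lapply(names(df)[is_fac], function(nm) {
    f <- df[[nm]]
    m <- model.matrix(~ f - 1)                 # one column per factor level
    colnames(m) <- sub("^f", "", colnames(m))  # drop the "f" prefix from the formula
    as.data.frame(m)
  })
  cbind(df, do.call(cbind, dummies))
}

# Toy inputs (made-up values)
num <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
mix <- data.frame(wgt = c(64, 55, 56), id = factor(c("black", "white", "brown")))

res_num <- expand_factors(num)  # returned as it is
res_mix <- expand_factors(mix)  # gains columns black, brown, white
```

The early return keeps numeric-only data frames untouched, so the function can be applied to a mixed collection of data frames without a separate check at each call site.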
In the update, the OP also passes numeric columns into model.matrix. For a numeric column it just returns that single column (we are looping over columns with map), named .x after the model.matrix formula. str_remove inside rename_all then strips the .x, leaving a blank column name, which _dfc fills with a default name. To prevent that, we can add an if condition beforehand that appends the original column name as a suffix when the output has one column and is numeric:
weight %>%
imap_dfc(~ {
nm1 <- .y
tmp <- model.matrix(~ .x - 1) %>%
as_tibble
if(ncol(tmp) == 1 && class(tmp[[1]]) == 'numeric') {
names(tmp) <- paste0(names(tmp), nm1)
}
tmp
}) %>%
rename_all(~ str_remove(., "\\.x"))
-output
# A tibble: 12 x 8
# wgt hgt age black brown white female male
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 64 57 8 1 0 0 1 0
# 2 71 59 10 1 0 0 1 0
# 3 53 49 6 1 0 0 1 0
# 4 67 62 11 1 0 0 1 0
# 5 55 51 8 0 0 1 1 0
# 6 58 50 7 0 0 1 1 0
# 7 77 55 10 0 0 1 0 1
# 8 57 48 9 0 0 1 0 1
# 9 56 42 10 0 1 0 0 1
#10 51 42 6 0 1 0 0 1
#11 76 61 12 0 1 0 0 1
#12 68 57 9 0 1 0 0 1
Or we do this with Map in base R
out <- do.call(cbind, unname(Map(function(x, y) {
tmp <- as.data.frame(model.matrix(~x -1))
if(ncol(tmp) == 1 && class(tmp[[1]]) == 'numeric') {
names(tmp) <- paste0(names(tmp), y)}
tmp
}, weight, names(weight))))
names(out) <- sub('^x', '', names(out))
out
# wgt hgt age black brown white female male
#1 64 57 8 1 0 0 1 0
#2 71 59 10 1 0 0 1 0
#3 53 49 6 1 0 0 1 0
#4 67 62 11 1 0 0 1 0
#5 55 51 8 0 0 1 1 0
#6 58 50 7 0 0 1 1 0
#7 77 55 10 0 0 1 0 1
#8 57 48 9 0 0 1 0 1
#9 56 42 10 0 1 0 0 1
#10 51 42 6 0 1 0 0 1
#11 76 61 12 0 1 0 0 1
#12 68 57 9 0 1 0 0 1

Shift rows values according to a vector

data.tab <- read.table(text = "
0 0 344 34 93 76 190 78 0
0 0 350 16 45 22 21 0 0
0 0 366 11 87 65 71 0 0
0 0 780 28 46 33 30 0 0
0 0 997 55 98 65 12 0 0
0 0 402 30 11 18 198 0 0")
I want to shift the non-zero values of each row to the right or to the left according to the values of this vector:
vect <- c(-2, 1, 0, 2, 2, 2, 2, 2, -2)
The first row will be shifted by two columns to the left to obtain:
344 34 93 76 190 78 0 0 0
The second row will be shifted by one column to the right to obtain:
0 0 0 350 16 45 22 21 0
In my previous post someone suggested the mutate function of the dplyr package, which works well but requires a single integer as the shift.
Here is a tidyverse solution using map2 to apply a shift value to the corresponding row of your dataset:
data.tab <- read.table(text = "
0 0 344 34 93 76 190 78 0
0 0 350 16 45 22 21 0 0
0 0 366 11 87 65 71 0 0
0 0 780 28 46 33 30 0 0
0 0 997 55 98 65 12 0 0
0 0 402 30 11 18 198 0 0")
library(tidyverse)
# for reproducibility
set.seed(15)
data.tab %>%
group_by(id = row_number()) %>% # group by row id
nest() %>% # nest data
mutate(shift = sample(-2:2, nrow(data.tab), replace=TRUE), # create and add the shift values for each row
d = map2(shift, data, ~if(.x >= 0) lag(.y, .x, default = 0L) else lead(.y, abs(.x), default = 0L))) %>% # apply shift value to corresponding rows
unnest(d) %>% # unnest shifted data
select(-id, -data) # remove unnecessary columns
# # A tibble: 6 x 10
# shift V1 V2 V3 V4 V5 V6 V7 V8 V9
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 0 0 344 34 93 76 190 78
# 2 -2 350 16 45 22 21 0 0 0 0
# 3 2 0 0 0 0 366 11 87 65 71
# 4 1 0 0 0 780 28 46 33 30 0
# 5 -1 0 997 55 98 65 12 0 0 0
# 6 2 0 0 0 0 402 30 11 18 198
The key point is to use lag if the shift value is positive, otherwise use lead with the absolute shift value.
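The lag/lead rule can also be written without dplyr. The following base R sketch uses a hypothetical helper `shift_row` and applies it to the first two rows of the question's data, with one shift value per row (an assumption, since the question's vector has more entries than there are rows). Values pushed past either edge are dropped.

```r
# Hypothetical helper: shift a vector by k places, padding with `fill`.
# k > 0 shifts right (like lag), k < 0 shifts left (like lead).
shift_row <- function(x, k, fill = 0) {
  n <- length(x)
  k <- max(min(k, n), -n)                 # clamp so the indexing stays valid
  if (k >= 0) {
    c(rep(fill, k), x[seq_len(n - k)])
  } else {
    c(x[seq_len(n + k) - k], rep(fill, -k))
  }
}

m <- rbind(
  c(0, 0, 344, 34, 93, 76, 190, 78, 0),
  c(0, 0, 350, 16, 45, 22, 21, 0, 0)
)
vect <- c(-2, 1)                          # one shift per row

# split() yields one vector per row; mapply pairs each row with its shift
shifted <- t(mapply(shift_row, split(m, row(m)), vect))
shifted
```

The two resulting rows match the expected rows given in the question.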

How to combine rows into one row in TermDocumentMatrix?

I am trying to combine rows into one row in a TermDocumentMatrix
(I know each row represents one word).
e.g. cabin, staff -> crews
Because 'cabin', 'staff' and 'crew' mean the same thing,
I am trying to combine the rows that represent 'cabin' and 'staff'
into the one row that represents 'crews'.
But it doesn't work at all.
R says: argument "weighting" is missing, with no default
The code I typed is below:
r=GET('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
base_url=('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
h<-read_html(base_url)
all.reviews = c()
for (i in 1:10){
print(i)
url = paste(base_url, 'page/', i, '/', sep="")
r = GET(url)
h = read_html(r)
comment_area = html_nodes(h, '.tc_mobile')
comments= html_nodes(comment_area, '.text_content')
reviews = html_text(comments)
all.reviews=c(all.reviews, reviews)}
cps <- Corpus(VectorSource(all.reviews))
cps <- tm_map(cps, content_transformer(tolower))
cps <- tm_map(cps, content_transformer(stripWhitespace))
cps <- tm_map(cps, content_transformer(removePunctuation))
cps <- tm_map(cps, content_transformer(removeNumbers))
cps <- tm_map(cps, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(cps, control=list(
wordLengths=c(3, 20),
weighting=weightTf))
rows.cabin = grep('cabin|staff', row.names(tdm))
rows.cabin
# [1] 235 1594
count.cabin = as.array(rollup(tdm[rows.cabin,], 1))
count.cabin
#Docs
#Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#1 0 1 1 0 0 2 2 0 0 1 1 0 4 0 1 0 1 0 2 1 0 0 1 3 1 4 2 0 3 0 1 1 4 0 0 2 1 0 0 2 1 0 2 1 3 3 1
#Docs
#Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
#1 0 1 0 1 2 3 2 2 1 1 0 2 0 0 0 0 0 2 0 1 0 0 4 0 2 2 1 3 1 1 1 1 0 0 0 5 3 0 2 1 0 1 0 0
#Docs
#Terms 92 93 94 95 96 97 98 99 100
#1 1 5 2 1 0 0 0 1 0
row.crews = grep('crews', row.names(tdm))
row.crews
#[1] 408
tdm[row.crews,] = count.cabin
rows.cabin = setdiff(rows.cabin, row.crews) # ok
tdm = tdm[-rows.cabin,] # ok
dtm = as.DocumentTermMatrix(tdm)
# Error in .TermDocumentMatrix(t(x), weighting) :
# argument "weighting" is missing, with no default
Maybe this is not the right approach to combining rows in a TermDocumentMatrix.
Please fix this code or suggest a better approach to solve this problem.
Thanks in advance.
Hmm, I wonder why you stick to your approach, which obviously does not work, instead of just copying, pasting and adjusting* my suggestion from here?
library(tm)
library(httr)
library(rvest)
library(slam)
# [...] # your code
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# cabin 0 0 0 0 0 1 1 0 0 1 0 0 3 0 0
# crew 0 0 0 1 1 1 1 0 2 1 0 1 0 2 0
# crews 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# staff 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1
dict <- list(
"CREW" = grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE, value = TRUE)
)
terms <- Terms(tdm)
for (x in seq_along(dict))
terms[terms %in% dict[[x]] ] <- names(dict)[x]
tdm <- slam::rollup(tdm, 1, terms, sum)
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# CREW 0 1 1 1 1 3 3 0 2 2 1 1 4 2 1
*I only adjusted the line inside the dict definition...
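To see what the relabel-then-sum step does without scraping any reviews, here is a sketch on a small dense matrix using base R's rowsum. This is an analogue for illustration, not the slam::rollup call itself; the counts are made up, and for a real sparse TermDocumentMatrix the rollup approach above is the one to use.

```r
# Toy term-by-document counts (made-up numbers)
m <- matrix(
  c(0, 1, 2,
    1, 0, 1,
    0, 0, 3,
    4, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("cabin", "crew", "staff", "seat"),
                  Docs = as.character(1:3))
)

terms <- rownames(m)
terms[terms %in% c("cabin", "staff", "crew")] <- "CREW"  # relabel the synonyms

# rowsum adds up all rows that share the same label
combined <- rowsum(m, group = terms)
combined
```

Rows that keep a unique label (here "seat") pass through unchanged, while the three synonym rows collapse into a single "CREW" row.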
