I'm trying to plot a nonlinear decision boundary that is supposed to look like this:
I have fitted a regularized nonlinear logistic regression of the form:
This is an extract of my data:
ones test1 test2 use
1 1 0.051267 0.69956 1
2 1 -0.092742 0.68494 1
3 1 -0.213710 0.69225 1
4 1 -0.375000 0.50219 1
5 1 -0.513250 0.46564 1
6 1 -0.524770 0.20980 1
These are the parameters I have calculated using the optim() function:
[1] 0.377980476 -0.085951551 0.445140731
[4] -1.953080687 -0.506554404 -0.330330236
[7] 0.414649938 0.270281786 0.183804530
[10] -0.155359467 -0.753665545 0.351880543
[13] 0.238052214 0.619714119 -0.582420943
[16] 0.150625144 0.266319363 -0.331130949
[19] 0.177759335 -0.005402135 -0.124253913
[22] 0.085607070 0.580258782 0.973785263
[25] 0.387313615 0.237754576 -0.011198804
[28] -0.514447404
I'm still new to R and don't really know how to tackle this problem. Can anybody help me out, please?
ones test1 test2 use
1 1 0.0512670 0.699560 1
2 1 -0.0927420 0.684940 1
3 1 -0.2137100 0.692250 1
4 1 -0.3750000 0.502190 1
5 1 -0.5132500 0.465640 1
6 1 -0.5247700 0.209800 1
7 1 -0.3980400 0.034357 1
8 1 -0.3058800 -0.192250 1
9 1 0.0167050 -0.404240 1
10 1 0.1319100 -0.513890 1
11 1 0.3853700 -0.565060 1
12 1 0.5293800 -0.521200 1
13 1 0.6388200 -0.243420 1
14 1 0.7367500 -0.184940 1
15 1 0.5466600 0.487570 1
16 1 0.3220000 0.582600 1
17 1 0.1664700 0.538740 1
18 1 -0.0466590 0.816520 1
19 1 -0.1733900 0.699560 1
20 1 -0.4786900 0.633770 1
21 1 -0.6054100 0.597220 1
22 1 -0.6284600 0.334060 1
23 1 -0.5938900 0.005117 1
24 1 -0.4210800 -0.272660 1
25 1 -0.1157800 -0.396930 1
26 1 0.2010400 -0.601610 1
27 1 0.4660100 -0.535820 1
28 1 0.6733900 -0.535820 1
29 1 -0.1388200 0.546050 1
30 1 -0.2943500 0.779970 1
31 1 -0.2655500 0.962720 1
32 1 -0.1618700 0.801900 1
33 1 -0.1733900 0.648390 1
34 1 -0.2828300 0.472950 1
35 1 -0.3634800 0.312130 1
36 1 -0.3001200 0.027047 1
37 1 -0.2367500 -0.214180 1
38 1 -0.0639400 -0.184940 1
39 1 0.0627880 -0.163010 1
40 1 0.2298400 -0.411550 1
41 1 0.2932000 -0.228800 1
42 1 0.4832900 -0.184940 1
43 1 0.6445900 -0.141080 1
44 1 0.4602500 0.012427 1
45 1 0.6273000 0.158630 1
46 1 0.5754600 0.268270 1
47 1 0.7252300 0.443710 1
48 1 0.2240800 0.524120 1
49 1 0.4429700 0.670320 1
50 1 0.3220000 0.692250 1
51 1 0.1376700 0.575290 1
52 1 -0.0063364 0.399850 1
53 1 -0.0927420 0.553360 1
54 1 -0.2079500 0.355990 1
55 1 -0.2079500 0.173250 1
56 1 -0.4383600 0.217110 1
57 1 -0.2194700 -0.016813 1
58 1 -0.1388200 -0.272660 1
59 1 0.1837600 0.933480 0
60 1 0.2240800 0.779970 0
61 1 0.2989600 0.619150 0
62 1 0.5063400 0.758040 0
63 1 0.6157800 0.728800 0
64 1 0.6042600 0.597220 0
65 1 0.7655500 0.502190 0
66 1 0.9268400 0.363300 0
67 1 0.8231600 0.275580 0
68 1 0.9614100 0.085526 0
69 1 0.9383600 0.012427 0
70 1 0.8634800 -0.082602 0
71 1 0.8980400 -0.206870 0
72 1 0.8519600 -0.367690 0
73 1 0.8289200 -0.521200 0
74 1 0.7943500 -0.557750 0
75 1 0.5927400 -0.740500 0
76 1 0.5178600 -0.594300 0
77 1 0.4660100 -0.418860 0
78 1 0.3508100 -0.579680 0
79 1 0.2874400 -0.769740 0
80 1 0.0858290 -0.755120 0
81 1 0.1491900 -0.579680 0
82 1 -0.1330600 -0.448100 0
83 1 -0.4095600 -0.411550 0
84 1 -0.3922800 -0.258040 0
85 1 -0.7436600 -0.258040 0
86 1 -0.6975800 0.041667 0
87 1 -0.7551800 0.290200 0
88 1 -0.6975800 0.684940 0
89 1 -0.4038000 0.706870 0
90 1 -0.3807600 0.918860 0
91 1 -0.5074900 0.904240 0
92 1 -0.5478100 0.706870 0
93 1 0.1031100 0.779970 0
94 1 0.0570280 0.918860 0
95 1 -0.1042600 0.991960 0
96 1 -0.0812210 1.108900 0
97 1 0.2874400 1.087000 0
98 1 0.3968900 0.823830 0
99 1 0.6388200 0.889620 0
100 1 0.8231600 0.663010 0
101 1 0.6733900 0.641080 0
102 1 1.0709000 0.100150 0
103 1 -0.0466590 -0.579680 0
104 1 -0.2367500 -0.638160 0
105 1 -0.1503500 -0.367690 0
106 1 -0.4902100 -0.301900 0
107 1 -0.4671700 -0.133770 0
108 1 -0.2885900 -0.060673 0
109 1 -0.6111800 -0.067982 0
110 1 -0.6630200 -0.214180 0
111 1 -0.5996500 -0.418860 0
112 1 -0.7263800 -0.082602 0
113 1 -0.8300700 0.312130 0
114 1 -0.7206200 0.538740 0
115 1 -0.5938900 0.494880 0
116 1 -0.4844500 0.999270 0
117 1 -0.0063364 0.999270 0
118 1 0.6326500 -0.030612 0
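One general way to draw such a boundary is to evaluate the fitted linear predictor on a grid and draw its zero contour. The sketch below is only illustrative: it assumes the 28 parameters belong to all polynomial terms of test1 and test2 up to total degree 6 (1 + 2 + ... + 7 = 28 terms), and the term order inside the hypothetical helper map_feature() is a guess that must be changed to match however the design matrix was actually built.

```r
# Hypothetical feature map: all terms test1^(i-j) * test2^j for i = 0..6.
# ASSUMPTION: this ordering must match the one used when fitting theta.
map_feature <- function(x1, x2, degree = 6) {
  cols <- list()
  for (i in 0:degree) {
    for (j in 0:i) {
      cols[[length(cols) + 1]] <- x1^(i - j) * x2^j
    }
  }
  do.call(cbind, cols)
}

# Parameters as printed in the question
theta <- c(0.377980476, -0.085951551, 0.445140731, -1.953080687,
           -0.506554404, -0.330330236, 0.414649938, 0.270281786,
           0.183804530, -0.155359467, -0.753665545, 0.351880543,
           0.238052214, 0.619714119, -0.582420943, 0.150625144,
           0.266319363, -0.331130949, 0.177759335, -0.005402135,
           -0.124253913, 0.085607070, 0.580258782, 0.973785263,
           0.387313615, 0.237754576, -0.011198804, -0.514447404)

# Evaluate the linear predictor on a grid covering the data range
u <- seq(-1, 1.2, length.out = 100)
v <- seq(-1, 1.2, length.out = 100)
grid <- expand.grid(test1 = u, test2 = v)
z <- matrix(map_feature(grid$test1, grid$test2) %*% theta, nrow = length(u))

# The decision boundary is where the linear predictor equals 0 (p = 0.5)
contour(u, v, z, levels = 0, drawlabels = FALSE,
        xlab = "test1", ylab = "test2")
```

The data points can then be overlaid with points(), coloured by use.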
Though not an ideal answer, you can use an SVM model to visualize this (it gives ~0.83 in-sample error):
require(e1071)
data = data[, c("use", "test1", "test2")]
data$use = factor(data$use)  # svm() only does classification when the response is a factor
fit = svm(use ~ ., data = data)
plot(fit, data = data)
Using a simple transformation we can try to get a linearly separable dataset:
data2 = data.frame(
y = factor(data[, "use"]),
x1 = data[, "test1"]^2,
x2 = data[, "test2"]^2 )
require(MASS)
fit = glm(y ~ x2 + x1, data = data2, family = binomial(link = "logit"))
plot(x2 ~ x1, data = data2, bg = as.numeric(y) + 1, pch = 21, main = "Logistic regression on Y ~ X1 + X2")
abline(-fit$coefficients[1]/fit$coefficients[2], -fit$coefficients[3]/fit$coefficients[2], col = 'blue', lwd = 2)
Which gives you this (~ 0.73 in-sample error):
So now you have
Y = w0 + w1 * test1^2 + w2 * test2^2
Which you can use to isolate test2 = f(test1) and plot the non-linear boundary.
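As a rough sketch of that last step: setting Y = 0 and solving for test2 gives test2 = ±sqrt(-(w0 + w1 * test1^2) / w2), defined only where the expression under the root is non-negative. The helper boundary_test2() and the coefficient values below are hypothetical and only illustrate the algebra; with a real fit you would pass the three coefficients in the order intercept, test1^2 term, test2^2 term.

```r
# Solve w0 + w1 * test1^2 + w2 * test2^2 = 0 for the upper branch of test2.
# Returns NA where no real solution exists.
boundary_test2 <- function(test1, w) {
  val <- -(w[1] + w[2] * test1^2) / w[3]
  out <- rep(NA_real_, length(val))
  ok <- val >= 0
  out[ok] <- sqrt(val[ok])
  out
}

w <- c(3, -5, -5)  # illustrative values only, NOT fitted from the data
test1 <- seq(-1, 1, length.out = 201)
upper <- boundary_test2(test1, w)

# Draw both branches of the boundary in the original (test1, test2) space
plot(test1, upper, type = "l", col = "blue", lwd = 2,
     ylim = c(-1, 1), xlab = "test1", ylab = "test2")
lines(test1, -upper, col = "blue", lwd = 2)
```

With equal coefficients on both squared terms, as in this toy example, the boundary is a circle; unequal coefficients give an ellipse.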
set  inst  ind  color_Blue
1    0     0    70
1    0     1    60
1    0     2    50
1    1     0    30
1    1     1    20
1    1     2    66
2    0     0    35
2    0     1    22
2    0     2    28
2    1     0    90
2    1     1    47
2    1     2    23
I have a data frame that looks like the one above, and I want to convert it to:
ind  set  inst_0  inst_1  inst_2
0    1    70      60      50
1    1    30      20      66
2    1    35      22      28
0    2    90      47      23
1    2    ..      ..      ..
2    2    ..      ..      ..
How can I do this transformation? I would appreciate any suggestions. Thank you so much!
I have tried some things, but they did not really work. The reshaping has to depend on the information in two columns, and that is what confuses me.
data.table
df <- read.table(text = "set inst ind color_Blue
1 0 0 70
1 0 1 60
1 0 2 50
1 1 0 30
1 1 1 20
1 1 2 66
2 0 0 35
2 0 1 22
2 0 2 28
2 1 0 90
2 1 1 47
2 1 2 23", header = T)
library(data.table)
dcast(
data = setDT(df),
formula = inst + set ~ paste0("inst_", ind),
value.var = "color_Blue"
)
#> inst set inst_0 inst_1 inst_2
#> 1: 0 1 70 60 50
#> 2: 0 2 35 22 28
#> 3: 1 1 30 20 66
#> 4: 1 2 90 47 23
Created on 2023-01-19 with reprex v2.0.2
You can use pivot_wider() from tidyr for reshaping.
library(tidyr)
df %>%
pivot_wider(names_from = ind, values_from = color_Blue, names_prefix = 'inst_')
# # A tibble: 4 × 5
# set inst inst_0 inst_1 inst_2
# <int> <int> <int> <int> <int>
# 1 1 0 70 60 50
# 2 1 1 30 20 66
# 3 2 0 35 22 28
# 4 2 1 90 47 23
Data
df <- read.table(text = "
set inst ind color_Blue
1 0 0 70
1 0 1 60
1 0 2 50
1 1 0 30
1 1 1 20
1 1 2 66
2 0 0 35
2 0 1 22
2 0 2 28
2 1 0 90
2 1 1 47
2 1 2 23", header = TRUE)
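For completeness, a base R sketch of the same reshaping with stats::reshape(), in case neither data.table nor tidyr is available. It reuses the df defined above; the renaming step is only there to reproduce the inst_ prefix used in the other answers.

```r
df <- read.table(text = "set inst ind color_Blue
1 0 0 70
1 0 1 60
1 0 2 50
1 1 0 30
1 1 1 20
1 1 2 66
2 0 0 35
2 0 1 22
2 0 2 28
2 1 0 90
2 1 1 47
2 1 2 23", header = TRUE)

# One row per (set, inst); one column per value of ind
wide <- reshape(df, idvar = c("set", "inst"), timevar = "ind",
                v.names = "color_Blue", direction = "wide", sep = "_")

# reshape() names the new columns color_Blue_0, color_Blue_1, ...
names(wide) <- sub("^color_Blue_", "inst_", names(wide))
rownames(wide) <- NULL
wide
```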
I have the following data frame:
> resident
X LOS Age Meds MHealth DietRest ReligAff NmChores Employed EdLevel Courses
1 R1 27 35 2 1 3 2 2 0 2 1
2 R2 56 43 0 0 0 1 3 1 3 2
3 R3 101 41 1 1 0 0 2 2 2 3
4 R4 19 54 3 2 4 3 1 0 1 0
5 R5 34 29 0 0 0 2 3 0 2 1
6 R6 78 46 2 0 2 1 2 1 3 2
7 R7 134 51 3 2 4 0 1 1 3 2
8 R8 112 38 0 1 1 4 2 1 2 3
9 R9 83 61 3 1 3 2 2 0 4 3
10 R10 9 50 2 0 2 1 1 2 2 0
11 R11 67 23 0 1 0 0 2 0 3 1
12 R12 30 47 2 2 0 3 2 0 4 0
13 R13 95 65 4 1 4 2 2 0 3 2
14 R14 165 63 5 2 4 1 1 0 2 2
15 R15 29 40 0 1 0 0 3 2 5 0
16 R16 44 33 2 2 1 0 2 0 3 1
17 R17 36 48 2 1 0 3 2 0 1 1
18 R18 58 57 3 0 2 1 1 1 2 1
19 R19 116 39 0 1 0 2 2 1 3 1
20 R20 73 44 1 0 0 2 1 0 4 2
21 R21 79 30 3 2 3 3 1 0 2 1
22 R22 39 41 0 0 0 0 3 2 2 2
23 R23 18 50 2 1 2 1 1 1 3 0
24 R24 60 35 1 0 0 0 2 1 4 2
25 R25 106 48 3 2 3 2 2 0 2 2
26 R26 46 31 2 1 0 0 1 1 3 1
27 R27 52 59 2 0 1 1 3 2 2 1
28 R28 28 62 6 0 4 2 1 0 5 1
29 R29 79 45 4 2 3 3 2 1 3 2
30 R30 24 42 1 1 1 0 1 0 2 1
31 R31 123 36 3 1 0 2 2 1 3 4
32 R32 11 49 2 0 2 1 2 0 1 0
33 R33 95 26 1 1 0 1 3 0 3 4
34 R34 61 24 0 0 0 2 2 1 2 1
35 R35 88 63 2 1 0 1 1 1 4 2
36 R36 64 38 1 2 1 4 1 1 2 3
37 R37 99 40 2 0 0 1 3 2 4 1
>
LOS = length of stay
I am trying to go through the data frame and create a new column of zeros and ones indicating whether the resident is completing an average of one course every thirty days. How would I go about doing this? As I understand it, someone who has been there between thirty and fifty-nine days and has completed at least one course should receive a one; someone who has been there between sixty and eighty-nine days and has finished at least two courses should receive a one; and so forth. Everyone else should receive a zero. How would I create a function that does this and adds a 1 or 0 to a new vector for each resident?
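One possible sketch of the rule described above: the number of courses required is the number of complete 30-day blocks in LOS, which integer division gives directly, so no explicit ranges are needed. The function name on_track() and the column name OnTrack are made up for illustration, and the treatment of residents with fewer than 30 days (zero courses required, so they get a 1) is an assumption, since the question does not say what should happen to them. Only four rows of the data are reproduced here.

```r
# Subset of the resident data from the question
resident <- data.frame(X = c("R1", "R2", "R3", "R7"),
                       LOS = c(27, 56, 101, 134),
                       Courses = c(1, 2, 3, 2))

# 1 if the resident has completed at least one course per full 30 days
on_track <- function(los, courses) {
  required <- los %/% 30          # 30-59 days -> 1, 60-89 days -> 2, ...
  as.integer(courses >= required)
}

resident$OnTrack <- on_track(resident$LOS, resident$Courses)
resident
```

Because the comparison is vectorised, this works on the whole data frame at once, without a loop.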
I have two dataframes which look similar to this:
>health
ID Stroke Diab MI Age Sex
1 1 0 0 0 65 M
2 2 0 0 0 66 M
3 3 1 0 0 78 F
4 4 0 0 0 55 M
5 5 0 0 0 67 M
6 6 1 1 1 66 M
7 7 0 0 0 79 F
8 8 0 0 0 54 M
9 9 0 0 0 65 F
10 10 1 1 1 78 F
>Asthma
ID Smoker Smoking_Status
1 12 2 0
2 15 0 1
3 24 1 0
4 2 2 1
5 8 2 0
6 53 1 1
7 10 0 0
8 32 0 0
9 1 0 0
10 5 1 1
This is the code that I used to produce these example tables:
health <- data.frame(ID=c(1,2,3,4,5,6,7,8,9,10), Stroke = factor(c(0,0,1,0,0,1,0,0,0,1)),
Diab = factor(c(0,0,0,0,0,1,0,0,0,1)), MI = factor(c(0,0,0,0,0,1,0,0,0,1)),
Age = factor(c(65,66,78,55,67,66,79,54,65,78)),
Sex = factor(c("M","M","F","M","M","M","F","M","F","F")))
Asthma <- data.frame(ID=c(12,15,24,2,8,53,10,32,1,5), Smoker = factor(c(2,0,1,2,2,1,0,0,0,1)),
Smoking_Status = factor(c(0,1,0,1,0,1,0,0,0,1)))
My question is: how can I add a column to the health data frame that takes the value 1 if the ID appears in the Asthma data frame (and 0 otherwise)?
This is my expected outcome:
ID Asthma Stroke Diab MI Age Sex
1 1 1 0 0 0 65 M
2 2 1 0 0 0 66 M
3 3 0 1 0 0 78 F
4 4 0 0 0 0 55 M
5 5 1 0 0 0 67 M
6 6 0 1 1 1 66 M
7 7 0 0 0 0 79 F
8 8 1 0 0 0 54 M
9 9 0 0 0 0 65 F
10 10 1 1 1 1 78 F
One of many possible ways:
health$asthma =match(x = health$ID,table = Asthma$ID,nomatch = 0)
health$asthma = replace(x = health$asthma,list = which(health$asthma>0),values = 1)
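The two steps above can also be collapsed into a single membership test in base R. This is a minimal sketch using only the ID columns from the question; %in% returns TRUE/FALSE, and as.integer() turns that into 1/0.

```r
# Only the ID columns matter for the membership test
health <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
Asthma <- data.frame(ID = c(12, 15, 24, 2, 8, 53, 10, 32, 1, 5))

# TRUE where the health ID also occurs in Asthma, converted to 1/0
health$asthma <- as.integer(health$ID %in% Asthma$ID)
health
```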
Using data.table:
library(data.table)
health = as.data.table(x = health)
Asthma = as.data.table(x = Asthma)
# start with a column of zeros, then set matching rows to 1 by reference
health[,`:=`(asthma = numeric(nrow(health)))]
set(x = health,i = which(health$ID %in% Asthma$ID),j = "asthma",value = 1)
#> health
# ID Stroke Diab MI Age Sex asthma
# 1: 1 0 0 0 65 M 1
# 2: 2 0 0 0 66 M 1
# 3: 3 1 0 0 78 F 0
# 4: 4 0 0 0 55 M 0
# 5: 5 0 0 0 67 M 1
# 6: 6 1 1 1 66 M 0
# 7: 7 0 0 0 79 F 0
# 8: 8 0 0 0 54 M 1
# 9: 9 0 0 0 65 F 0
#10: 10 1 1 1 78 F 1
You can do this in one line using the data.table package:
> data.table::setDT(health)[,ind:=ifelse(ID %in% Asthma$ID,1,0)]
> health
ID Stroke Diab MI Age Sex ind
1: 1 0 0 0 65 M 1
2: 2 0 0 0 66 M 1
3: 3 1 0 0 78 F 0
4: 4 0 0 0 55 M 0
5: 5 0 0 0 67 M 1
6: 6 1 1 1 66 M 0
7: 7 0 0 0 79 F 0
8: 8 0 0 0 54 M 1
9: 9 0 0 0 65 F 0
10: 10 1 1 1 78 F 1
I have a data frame dfSub containing hourly energy-use data with a number of parameters. I need to group the data by hour, i.e. for each hour get all the energy values from the data frame. As a result I expect a data frame with 24 columns, one per hour, whose rows are filled with the energy values.
The hour takes the values 1:24 and is stored in dfSub$hr.
The heat is stored in dfSub$heat.
I wrote a for loop and tried to collect the results with cbind, but it does not work; the error message is about differing numbers of rows and columns.
I can print the results and see them on screen, but I can't save them in d (a data frame).
Here is the code:
d = NULL
for (i in 1:24) {
subh= subset(dfSub$heat, dfSub$hr == i)
print(subh)
d = cbind(d, as.data.frame(subh))
}
The append function is not applicable, since I don't know in advance how many heat values there are for each hour.
Any help is appreciated.
Part of dfSub
hr wk month dyid wend t heat
1 2 1 1 0 -9.00 81
2 2 1 1 0 -8.30 61
3 2 1 1 0 -7.80 53
4 2 1 1 0 -7.00 51
5 2 1 1 0 -7.00 30
6 2 1 1 0 -6.90 31
7 2 1 1 0 -7.10 51
8 2 1 1 0 -6.50 90
9 2 1 1 0 -8.90 114
10 2 1 1 0 -9.90 110
11 2 1 1 0 -11.70 126
12 2 1 1 0 -9.70 113
13 2 1 1 0 -11.60 104
14 2 1 1 0 -10.00 107
15 2 1 1 0 -10.20 117
16 2 1 1 0 -9.00 90
17 2 1 1 0 -8.00 114
18 2 1 1 0 -7.80 83
19 2 1 1 0 -8.10 82
20 2 1 1 0 -8.20 61
21 2 1 1 0 -8.80 34
22 2 1 1 0 -9.10 52
23 2 1 1 0 -10.10 41
24 2 1 1 0 -8.80 52
1 2 1 2 0 -8.70 44
2 2 1 2 0 -8.40 50
3 2 1 2 0 -8.10 33
4 2 1 2 0 -7.70 41
5 2 1 2 0 -7.80 33
6 2 1 2 0 -7.50 43
7 2 1 2 0 -7.30 40
8 2 1 2 0 -7.10 8
The expected output is:
hr1 hr2 hr3 hr4..... hr24
81 61 53 51 ..... 52
44 50 33 41
A for loop can be avoided in this case. One option is to use tidyr::spread to convert your hourly data to wide format.
library(tidyverse)
df %>% select(-t, -wend) %>%
mutate(hr = sprintf("hr%02d",hr)) %>%
spread(hr, heat)
Result:
# wk month dyid hr01 hr02 hr03 hr04 hr05 hr06 hr07 hr08 hr09 hr10 hr11 hr12 hr13 hr14 hr15 hr16 hr17 hr18 hr19 hr20 hr21 hr22 hr23 hr24
# 1 2 1 1 81 61 53 51 30 31 51 90 114 110 126 113 104 107 117 90 114 83 82 61 34 52 41 52
# 2 2 1 2 44 50 33 41 33 43 40 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Data:
df <- read.table(text =
"hr wk month dyid wend t heat
1 2 1 1 0 -9.00 81
2 2 1 1 0 -8.30 61
3 2 1 1 0 -7.80 53
4 2 1 1 0 -7.00 51
5 2 1 1 0 -7.00 30
6 2 1 1 0 -6.90 31
7 2 1 1 0 -7.10 51
8 2 1 1 0 -6.50 90
9 2 1 1 0 -8.90 114
10 2 1 1 0 -9.90 110
11 2 1 1 0 -11.70 126
12 2 1 1 0 -9.70 113
13 2 1 1 0 -11.60 104
14 2 1 1 0 -10.00 107
15 2 1 1 0 -10.20 117
16 2 1 1 0 -9.00 90
17 2 1 1 0 -8.00 114
18 2 1 1 0 -7.80 83
19 2 1 1 0 -8.10 82
20 2 1 1 0 -8.20 61
21 2 1 1 0 -8.80 34
22 2 1 1 0 -9.10 52
23 2 1 1 0 -10.10 41
24 2 1 1 0 -8.80 52
1 2 1 2 0 -8.70 44
2 2 1 2 0 -8.40 50
3 2 1 2 0 -8.10 33
4 2 1 2 0 -7.70 41
5 2 1 2 0 -7.80 33
6 2 1 2 0 -7.50 43
7 2 1 2 0 -7.30 40
8 2 1 2 0 -7.10 8",
header = TRUE, stringsAsFactors = FALSE)
With tidyr:
> df<-read.fwf(textConnection(
+ "hr,wk,month,dyid,wend,t,heat
+ 1 2 1 1 0 -9.00 81
+ 2 2 1 1 0 -8.30 61
+ 3 2 1 1 0 -7.80 53
+ 4 2 1 1 0 -7.00 51
+ 5 2 1 1 0 -7.00 30
+ 6 2 1 1 0 -6.90 31
+ 7 2 1 1 0 -7.10 51
+ 8 2 1 1 0 -6.50 90
+ 9 2 1 1 0 -8.90 114
+ 10 2 1 1 0 -9.90 110
+ 11 2 1 1 0 -11.70 126
+ 12 2 1 1 0 -9.70 113
+ 13 2 1 1 0 -11.60 104
+ 14 2 1 1 0 -10.00 107
+ 15 2 1 1 0 -10.20 117
+ 16 2 1 1 0 -9.00 90
+ 17 2 1 1 0 -8.00 114
+ 18 2 1 1 0 -7.80 83
+ 19 2 1 1 0 -8.10 82
+ 20 2 1 1 0 -8.20 61
+ 21 2 1 1 0 -8.80 34
+ 22 2 1 1 0 -9.10 52
+ 23 2 1 1 0 -10.10 41
+ 24 2 1 1 0 -8.80 52
+ 1 2 1 2 0 -8.70 44
+ 2 2 1 2 0 -8.40 50
+ 3 2 1 2 0 -8.10 33
+ 4 2 1 2 0 -7.70 41
+ 5 2 1 2 0 -7.80 33
+ 6 2 1 2 0 -7.50 43
+ 7 2 1 2 0 -7.30 40
+ 8 2 1 2 0 -7.10 8"
+ ),header=TRUE,sep=",",widths=c(5,3,6,5,5,7,5))
>
> library(tidyr)
> library(dplyr)  # select() comes from dplyr, not tidyr
> df1 <- select(df,dyid,hr,heat)
> df2 <- spread(df1,hr,heat)
> colnames(df2)[2:ncol(df2)] <- paste0("hr",colnames(df2)[2:ncol(df2)])
> df2
dyid hr1 hr2 hr3 hr4 hr5 hr6 hr7 hr8 hr9 hr10 hr11 hr12 hr13 hr14 hr15 hr16 hr17 hr18 hr19 hr20 hr21 hr22 hr23 hr24
1 1 81 61 53 51 30 31 51 90 114 110 126 113 104 107 117 90 114 83 82 61 34 52 41 52
2 2 44 50 33 41 33 43 40 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
>
I found a solution that solved my task here: Append data frames together in a for loop
It uses an empty list and combines the pieces into a data frame afterwards:
datalist = list()
for (i in 1:24) {
subh= subset(dfSub$heat, dfSub$hr == i)
datalist[[i]] = subh
}
big_data = do.call(rbind, datalist)
Both cbind and rbind work.
Thanks, everyone, for the help :)
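One caveat with binding the list directly: when the hours have different numbers of readings (as in the sample, where hours 9 to 24 have one value fewer than hours 1 to 8), cbind and rbind recycle the shorter vectors instead of leaving gaps. A hedged sketch of one way around this is to pad every vector with NA to a common length first. The toy dfSub below is a small stand-in that reuses a few heat values from the question, not the real data.

```r
# Toy stand-in for dfSub: hours 1 and 2 have two readings, hour 3 has one
dfSub <- data.frame(hr = c(1, 2, 3, 1, 2),
                    heat = c(81, 61, 53, 44, 50))

# One vector of heat values per hour
datalist <- split(dfSub$heat, dfSub$hr)

# Extending a vector with `length<-` pads it with NA
n <- max(lengths(datalist))
padded <- lapply(datalist, function(v) { length(v) <- n; v })

# Now every column has the same length and nothing is recycled
d <- as.data.frame(do.call(cbind, padded))
names(d) <- paste0("hr", names(datalist))
d
```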
I am trying to combine rows into one row in a TermDocumentMatrix
(I know that each row represents one word).
e.g. cabin, staff -> crews
Because 'cabin', 'staff' and 'crew' mean the same thing,
I am trying to combine the rows which represent 'cabin' and 'staff'
into one row which represents 'crew'.
But it doesn't work at all.
R says: argument "weighting" is missing, with no default
The code I typed is below:
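library(httr)   # GET()
library(rvest)  # read_html(), html_nodes(), html_text()
library(tm)     # Corpus(), tm_map(), TermDocumentMatrix()
library(slam)   # rollup()
r=GET('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
base_url=('http://www.airlinequality.com/airline-reviews/cathay-pacific-airways/')
h<-read_html(base_url)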
all.reviews = c()
for (i in 1:10){
print(i)
url = paste(base_url, 'page/', i, '/', sep="")
r = GET(url)
h = read_html(r)
comment_area = html_nodes(h, '.tc_mobile')
comments= html_nodes(comment_area, '.text_content')
reviews = html_text(comments)
all.reviews=c(all.reviews, reviews)}
cps <- Corpus(VectorSource(all.reviews))
cps <- tm_map(cps, content_transformer(tolower))
cps <- tm_map(cps, content_transformer(stripWhitespace))
cps <- tm_map(cps, content_transformer(removePunctuation))
cps <- tm_map(cps, content_transformer(removeNumbers))
cps <- tm_map(cps, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(cps, control=list(
wordLengths=c(3, 20),
weighting=weightTf))
rows.cabin = grep('cabin|staff', row.names(tdm))
rows.cabin
# [1] 235 1594
count.cabin = as.array(rollup(tdm[rows.cabin,], 1))
count.cabin
#Docs
#Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#1 0 1 1 0 0 2 2 0 0 1 1 0 4 0 1 0 1 0 2 1 0 0 1 3 1 4 2 0 3 0 1 1 4 0 0 2 1 0 0 2 1 0 2 1 3 3 1
#Docs
#Terms 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91
#1 0 1 0 1 2 3 2 2 1 1 0 2 0 0 0 0 0 2 0 1 0 0 4 0 2 2 1 3 1 1 1 1 0 0 0 5 3 0 2 1 0 1 0 0
#Docs
#Terms 92 93 94 95 96 97 98 99 100
#1 1 5 2 1 0 0 0 1 0
row.crews = grep('crews', row.names(tdm))
row.crews
#[1] 408
tdm[row.crews,] = count.cabin
rows.cabin = setdiff(rows.cabin, row.crews) # ok
tdm = tdm[-rows.cabin,] # ok
dtm = as.DocumentTermMatrix(tdm)
# Error in .TermDocumentMatrix(t(x), weighting) :
# argument "weighting" is missing, with no default
Maybe this is not the right approach to combining rows in a TermDocumentMatrix.
Please fix this code or suggest a better approach to solve this problem.
Thanks in advance.
Hmm I wonder why you stick to your approach, which obviously does not work, instead of just copying+pasting+adjusting* my suggestion from here?
library(tm)
library(httr)
library(rvest)
library(slam)
# [...] # your code
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# cabin 0 0 0 0 0 1 1 0 0 1 0 0 3 0 0
# crew 0 0 0 1 1 1 1 0 2 1 0 1 0 2 0
# crews 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# staff 0 1 1 0 0 1 1 0 0 0 1 0 1 0 1
dict <- list(
"CREW" = grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE, value = TRUE)
)
terms <- Terms(tdm)
for (x in seq_along(dict))
terms[terms %in% dict[[x]] ] <- names(dict)[x]
tdm <- slam::rollup(tdm, 1, terms, sum)
inspect(tdm[grep("cabin|staff|crew", Terms(tdm), ignore.case=TRUE), 1:15])
# Docs
# Terms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# CREW 0 1 1 1 1 3 3 0 2 2 1 1 4 2 1
*I only adjusted the line inside the dict definition...
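To see what the relabel-then-sum step does without scraping anything, here is a small base R sketch on a dense matrix. It uses rowsum() as a stand-in for slam::rollup(), so it is an analogue of the technique rather than the tm/slam code itself. The counts for cabin, crew, crews and staff are the first five documents from the output above; the "seat" row is made up to show that unrelated terms are left alone.

```r
# Term-by-document counts (first five documents); "seat" is hypothetical
m <- rbind(cabin = c(0, 0, 0, 0, 0),
           crew  = c(0, 0, 0, 1, 1),
           crews = c(0, 0, 0, 0, 0),
           staff = c(0, 1, 1, 0, 0),
           seat  = c(1, 0, 2, 0, 0))
colnames(m) <- 1:5

# Give every synonym the same label, then sum rows that share a label
terms <- rownames(m)
terms[grepl("cabin|staff|crew", terms, ignore.case = TRUE)] <- "CREW"
merged <- rowsum(m, group = terms)
merged
```

The CREW row of merged matches the first five columns of the rolled-up result shown above.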