Importing a DAT file into R with uneven columns

I have a DAT file I want to read into R, but when I import it the data keeps coming in with 10 columns/variables (taken from the first line) when it is actually supposed to have 29 columns/variables. How do I fix this problem?
DAT file example in Notepad:
smsa66 smsa76 nearc2 nearc4 nearc4a nearc4b ed76 ed66 age76 daded
nodaded momed nomomed momdad14 sinmom14 step14 south66 south76
lwage76 famed black wage76 enroll76 kww iqscore mar76 libcrd14
exp76 exp762
1 1 0 0 0 0 7
5 29 9.94 1 10.25 1 1
0 0 0 0 6.306275 9 1
548 0 15 . 1 0 16
256
1 1 0 0 0 0 12
11 27 8 0 8 0 1
0 0 0 0 6.175867 8 0
481 0 35 93 1 1 9
81
1 1 0 0 0 0 12
12 34 14 0 12 0 1
0 0 0 0 6.580639 2 0
721 0 42 103 1 1 16
256
1 1 1 1 1 0 11
11 27 11 0 12 0 1
0 0 0 0 5.521461 6 0
250 0 25 88 1 1 10
100
1 1 1 1 1 0 12
12 34 8 0 7 0 1
0 0 0 0 6.591674 8 0
729 0 34 108 1 0 16
256
1 1 1 1 1 0 12
11 26 9 0 12 0 1
0 0 0 0 6.214608 6 0
500 0 38 85 1 1 8
64
1 1 1 1 1 0 18
16 33 14 0 14 0 1
0 0 0 0 6.336826 1 0
565 0 41 119 1 1 9
81
1 1 1 1 1 0 14
13 29 14 0 14 0 1
0 0 0 0 6.410175 1 0
608 0 46 108 1 1 9
81

The trouble is that each record wraps across several physical lines: the 29 header names span the first four lines (10/8/9/2), and each observation's 29 values span the next five (7/7/7/7/1), so read.table sizes the table from the short first line. scan() ignores line breaks and simply reads whitespace-separated tokens, which makes it the right tool here. First paste the header and the data into separate strings (note that the lone "." missing-value marker in the file is replaced by NA below):
txt1 <- " smsa66 smsa76 nearc2 nearc4 nearc4a nearc4b ed76 ed66 age76 daded
nodaded momed nomomed momdad14 sinmom14 step14 south66 south76
lwage76 famed black wage76 enroll76 kww iqscore mar76 libcrd14
exp76 exp762"
txt2 <-
" 1 1 0 0 0 0 7
5 29 9.94 1 10.25 1 1
0 0 0 0 6.306275 9 1
548 0 15 NA 1 0 16
256
1 1 0 0 0 0 12
11 27 8 0 8 0 1
0 0 0 0 6.175867 8 0
481 0 35 93 1 1 9
81
1 1 0 0 0 0 12
12 34 14 0 12 0 1
0 0 0 0 6.580639 2 0
721 0 42 103 1 1 16
256
1 1 1 1 1 0 11
11 27 11 0 12 0 1
0 0 0 0 5.521461 6 0
250 0 25 88 1 1 10
100
1 1 1 1 1 0 12
12 34 8 0 7 0 1
0 0 0 0 6.591674 8 0
729 0 34 108 1 0 16
256
1 1 1 1 1 0 12
11 26 9 0 12 0 1
0 0 0 0 6.214608 6 0
500 0 38 85 1 1 8
64
1 1 1 1 1 0 18
16 33 14 0 14 0 1
0 0 0 0 6.336826 1 0
565 0 41 119 1 1 9
81
1 1 1 1 1 0 14
13 29 14 0 14 0 1
0 0 0 0 6.410175 1 0
608 0 46 108 1 1 9
81"
Now the code:
inp <- scan(text = txt2, what = "numeric")   # reads every token; a character `what` means character mode
inmat <- matrix(as.numeric(inp), ncol = 29, byrow = TRUE)   # refold the 8 x 29 = 232 tokens, row by row
dfrm <- as.data.frame(inmat)
scan(text = txt1, what = "")
Read 29 items
[1] "smsa66" "smsa76" "nearc2" "nearc4" "nearc4a" "nearc4b" "ed76"
[8] "ed66" "age76" "daded" "nodaded" "momed" "nomomed" "momdad14"
[15] "sinmom14" "step14" "south66" "south76" "lwage76" "famed" "black"
[22] "wage76" "enroll76" "kww" "iqscore" "mar76" "libcrd14" "exp76"
[29] "exp762"
names(dfrm) <- scan(text = txt1, what = "")
#Read 29 items
dfrm
#-----------------------
smsa66 smsa76 nearc2 nearc4 nearc4a nearc4b ed76 ed66 age76 daded nodaded momed nomomed
1 1 1 0 0 0 0 7 5 29 9.94 1 10.25 1
2 1 1 0 0 0 0 12 11 27 8 0 8 0
3 1 1 0 0 0 0 12 12
# ...remainder of output snipped
Final result:
str(dfrm)
'data.frame': 8 obs. of 29 variables:
$ smsa66 : num 1 1 1 1 1 1 1 1
$ smsa76 : num 1 1 1 1 1 1 1 1
$ nearc2 : num 0 0 0 1 1 1 1 1
$ nearc4 : num 0 0 0 1 1 1 1 1
$ nearc4a : num 0 0 0 1 1 1 1 1
$ nearc4b : num 0 0 0 0 0 0 0 0
$ ed76 : num 7 12 12 11 12 12 18 14
$ ed66 : num 5 11 12 11 12 11 16 13
$ age76 : num 29 27 34 27 34 26 33 29
$ daded : num 9.94 8 14 11 8 9 14 14
$ nodaded : num 1 0 0 0 0 0 0 0
$ momed : num 10.2 8 12 12 7 ...
$ nomomed : num 1 0 0 0 0 0 0 0
$ momdad14: num 1 1 1 1 1 1 1 1
$ sinmom14: num 0 0 0 0 0 0 0 0
$ step14 : num 0 0 0 0 0 0 0 0
$ south66 : num 0 0 0 0 0 0 0 0
$ south76 : num 0 0 0 0 0 0 0 0
$ lwage76 : num 6.31 6.18 6.58 5.52 6.59 ...
$ famed : num 9 8 2 6 8 6 1 1
$ black : num 1 0 0 0 0 0 0 0
$ wage76 : num 548 481 721 250 729 500 565 608
$ enroll76: num 0 0 0 0 0 0 0 0
$ kww : num 15 35 42 25 34 38 41 46
$ iqscore : num NA 93 103 88 108 85 119 108
$ mar76 : num 1 1 1 1 1 1 1 1
$ libcrd14: num 0 1 1 1 0 1 1 1
$ exp76 : num 16 9 16 10 16 8 9 9
$ exp762 : num 256 81 256 100 256 64 81 81
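
For the real file there is no need to paste anything; scan can read it directly. A minimal sketch, assuming the file is saved as card.dat (a hypothetical name) and that "." marks missing values, as in the excerpt above:
nms <- scan("card.dat", what = "", nlines = 4)   # the 29 header names span the first 4 lines
vals <- scan("card.dat", what = "", skip = 4, na.strings = ".")   # "." becomes NA
dfrm <- as.data.frame(matrix(as.numeric(vals), ncol = length(nms), byrow = TRUE))
names(dfrm) <- nms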

Related

Calculate weighted mean from matrix in R

I have a matrix that looks like the following. For rows 1:23, I would like to calculate the weighted mean, where the data in rows 1:23 are the weights and row 24 is the data.
1 107 33 41 22 12 4 122 44 297 123 51 16 7 9 1 1 0
10 5 2 2 1 0 3 4 6 12 3 3 0 1 1 0 0 0
11 1 3 1 0 0 0 4 2 8 3 4 0 0 0 0 0 0
12 2 1 1 0 0 0 2 1 5 6 3 1 0 0 0 0 0
13 1 0 1 0 0 0 3 1 3 5 2 2 0 1 0 0 0
14 3 0 0 0 0 0 3 1 2 3 0 1 0 0 0 0 0
15 0 0 0 0 0 0 2 0 0 1 0 1 0 0 0 0 0
16 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0 0
17 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
18 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 80 27 37 5 6 4 97 48 242 125 44 27 7 8 8 0 2
20 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 47 12 33 12 6 1 63 42 200 96 45 19 6 6 9 2 0
4 45 14 21 9 4 2 54 26 130 71 36 17 8 5 1 0 2
5 42 10 14 6 3 2 45 19 89 45 26 7 4 8 2 1 0
6 17 3 12 5 2 0 18 21 51 41 19 15 5 1 1 0 0
7 16 2 6 0 0 1 14 9 37 23 17 7 3 0 3 0 0
8 9 4 4 2 1 0 7 9 30 15 8 3 3 1 1 0 1
9 12 2 3 1 1 1 6 5 14 12 5 1 2 0 0 1 0
24 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
As an example using the top two rows, there would be an additional column at the end indicating the weighted mean.
1 107 33 41 22 12 4 122 44 297 123 51 16 7 9 1 1 0 6.391011
10 5 2 2 1 0 3 4 6 12 3 3 0 1 1 0 0 0 6.232558
I'm a little new to coding so I wasn't too sure how to do it - any advice would be appreciated!
You can do:
apply(df[-nrow(df), ], 1, function(row) weighted.mean(df[nrow(df), ], row))
I'm assuming your first column is some kind of index and not used for the weighted mean (and that the data is stored in matr_dat):
apply(matr_dat[-nrow(matr_dat), -1], 1,
function(row) weighted.mean(matr_dat[nrow(matr_dat), -1], row))
Using apply with the margin set to 1 applies the function given as the third argument to each row of the data; to calculate the weighted mean, you can use weighted.mean and set the weights to the values of the row.
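To see it work end to end, here is a minimal sketch using only the first weight row and the data row from the example above (with matr_dat standing in for the full matrix):
matr_dat <- rbind(c(1, 107, 33, 41, 22, 12, 4, 122, 44, 297, 123, 51, 16, 7, 9, 1, 1, 0),
                  c(24, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16))
# Drop the data row and the index column, then average the data row
# using each remaining row as the weights.
apply(matr_dat[-nrow(matr_dat), -1, drop = FALSE], 1,
      function(row) weighted.mean(matr_dat[nrow(matr_dat), -1], row))
# [1] 6.391011   -- matches the expected value for row 1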

plot Buy ratio as a function of the time spent on an item in a session, session duration and the number of clicks on a given item in a session

I have a dataframe with 34154695 obs. In the dataset, a Class variable with value 0 indicates "not purchased" and 1 indicates "purchased".
> str(data)
'data.frame': 34154695 obs. of 5 variables:
$ SessionID: int 1 1 1 2 2 2 2 2 2 3 ...
$ Timestamp: Factor w/ 34069144 levels "2014-04-01T03:00:00.124Z",..: 1452469 1452684 1453402 1501801 1501943 1502207 1502429 1502569 1502932 295601 ...
$ ItemID : int 214536500 214536506 214577561 214662742 214662742 214825110 214757390 214757407 214551617 214716935 ...
$ Category : Factor w/ 339 levels "0","1","10","11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Class : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
whereas the number of sessions is
> length(unique(data$SessionID))
[1] 9249729
> head(data, 50)
SessionID Timestamp ItemID Category Class
1 1 2014-04-07T10:54:09.868Z 214536500 0 0
2 1 2014-04-07T10:54:46.998Z 214536506 0 0
3 1 2014-04-07T10:57:00.306Z 214577561 0 0
4 2 2014-04-07T13:56:37.614Z 214662742 0 0
5 2 2014-04-07T13:57:19.373Z 214662742 0 0
6 2 2014-04-07T13:58:37.446Z 214825110 0 0
7 2 2014-04-07T13:59:50.710Z 214757390 0 0
8 2 2014-04-07T14:00:38.247Z 214757407 0 0
9 2 2014-04-07T14:02:36.889Z 214551617 0 0
10 3 2014-04-02T13:17:46.940Z 214716935 0 0
11 3 2014-04-02T13:26:02.515Z 214774687 0 0
12 3 2014-04-02T13:30:12.318Z 214832672 0 0
13 4 2014-04-07T12:09:10.948Z 214836765 0 0
14 4 2014-04-07T12:26:25.416Z 214706482 0 0
15 6 2014-04-06T16:58:20.848Z 214701242 0 0
16 6 2014-04-06T17:02:26.976Z 214826623 0 0
17 7 2014-04-02T06:38:53.104Z 214826835 0 0
18 7 2014-04-02T06:39:05.854Z 214826715 0 0
19 8 2014-04-06T08:49:58.728Z 214838855 0 0
20 8 2014-04-06T08:52:12.647Z 214838855 0 0
21 9 2014-04-06T11:26:24.127Z 214576500 0 0
22 9 2014-04-06T11:28:54.654Z 214576500 0 0
23 9 2014-04-06T11:29:13.479Z 214576500 0 0
24 11 2014-04-03T10:44:35.672Z 214821275 0 0
25 11 2014-04-03T10:45:01.674Z 214821275 0 0
26 11 2014-04-03T10:45:29.873Z 214821371 0 0
27 11 2014-04-03T10:46:12.162Z 214821371 0 0
28 11 2014-04-03T10:46:57.355Z 214821371 0 0
29 11 2014-04-03T10:53:22.572Z 214717089 0 0
30 11 2014-04-03T10:53:49.875Z 214563337 0 0
31 11 2014-04-03T10:55:19.267Z 214706462 0 0
32 11 2014-04-03T10:55:47.327Z 214717436 0 0
33 11 2014-04-03T10:56:30.520Z 214743335 0 0
34 11 2014-04-03T10:57:19.331Z 214826837 0 0
35 11 2014-04-03T10:57:39.433Z 214819762 0 0
36 12 2014-04-02T10:30:13.176Z 214717867 0 0
37 12 2014-04-02T10:33:12.621Z 214717867 0 0
38 13 2014-04-06T14:50:13.638Z 214836761 0 0
39 13 2014-04-06T14:52:54.363Z 214684513 0 0
40 13 2014-04-06T14:53:18.268Z 214836761 0 0
41 14 2014-04-01T10:09:01.362Z 214577732 0 0
42 14 2014-04-01T10:11:14.773Z 214587013 0 0
43 14 2014-04-01T10:12:36.482Z 214577732 0 0
44 17 2014-04-06T11:34:14.289Z 214826897 0 0
45 17 2014-04-06T11:34:16.193Z 214820441 0 0
46 16 2014-04-05T13:08:01.626Z 214684093 0 0
47 16 2014-04-05T13:08:39.897Z 214684093 0 0
48 16 2014-04-05T13:20:53.092Z 214684093 0 0
49 19 2014-04-01T20:52:12.357Z 214561790 0 0
50 19 2014-04-01T20:52:13.758Z 214561790 0 0
I want to plot the buy ratio as a function of the time spent on an item in a session, the number of clicks on a given item in a session, and the session's duration. I would like output like this (the example image is not reproduced here).
Could someone please advise how I should proceed? Really, thank you for any help and suggestions.
Kind regards
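
One possible starting point, offered only as a sketch: it assumes the data.table package, takes a session's duration to be the span between its first and last click, and uses the object and column names from the str output above.
library(data.table)
setDT(data)
# Parse timestamps such as "2014-04-07T10:54:09.868Z".
data[, Timestamp := as.POSIXct(as.character(Timestamp),
                               format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]
# Per-session features: duration in seconds, click count, and
# whether anything in the session was purchased.
sess <- data[, .(duration = as.numeric(difftime(max(Timestamp), min(Timestamp),
                                                units = "secs")),
                 clicks   = .N,
                 bought   = any(Class == "1")),
             by = SessionID]
# Buy ratio as a function of clicks per session; the same pattern
# works with binned duration or with time spent per item.
ratio <- sess[, .(buy_ratio = mean(bought)), by = clicks]
setorder(ratio, clicks)
plot(ratio$clicks, ratio$buy_ratio, type = "b",
     xlab = "clicks per session", ylab = "buy ratio")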

Recoding range of numerics into single numeric in R

I am trying to recode a data frame with four columns. Across all of the columns, I want to recode all the numeric values into these ordinal numeric values:
0 stays as is
1:3 <- 1
4:10 <- 2
11:22 <- 3
22:max <-4
This is the data frame:
> df
T4.1 T4.2 T4.3 T4.4
1 0 54 0 5
2 0 5 0 0
3 0 3 0 0
4 0 2 0 0
5 0 3 0 0
6 0 2 0 0
7 0 4 0 0
8 1 20 0 0
9 1 7 0 2
10 0 14 0 0
11 0 3 0 0
12 0 202 0 41
13 2 12 0 0
14 3 6 0 0
15 3 21 0 3
16 0 143 0 0
17 0 0 0 0
18 4 9 0 0
19 3 15 0 0
20 0 58 0 6
21 2 0 0 0
22 0 52 0 0
23 0 3 0 0
24 0 1 0 0
25 4 6 0 1
26 1 4 0 0
27 0 38 0 1
28 0 6 0 0
29 0 8 0 0
30 0 29 0 4
31 1 14 0 0
32 0 12 0 10
33 4 1 0 3
I'm trying to use the recode function, but I can't seem to figure out how to input a range of numeric values into it. I get the following errors with my attempts:
> recode(df, 11:22=3)
Error: unexpected '=' in "recode(df, 11:22="
> recode(df, c(11:22)=3)
Error: unexpected '=' in "recode(df, c(11:22)="
I would greatly appreciate any advice. Thanks for your time!
Edit: Thanks all for the help!!
A range such as 11:22 cannot appear on the left-hand side of = inside a function call, which is why R reports unexpected '='. You can instead use cut with breaks covering each range:
df_res <- as.data.frame(sapply(df, function(x)
  cut(x,
      breaks = c(-0.5, 0.5, 3.5, 10.5, 22.5, Inf),
      labels = c(0, 1, 2, 3, 4))))
str(df_res)
#'data.frame': 33 obs. of 4 variables:
# $ T4.1: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 2 2 1 ...
# $ T4.2: Factor w/ 5 levels "0","1","2","3",..: 5 3 2 2 2 2 3 4 3 4 ...
# $ T4.3: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
# $ T4.4: Factor w/ 4 levels "0","1","2","4": 3 1 1 1 1 1 1 1 2 1 ...
df_res
# T4.1 T4.2 T4.3 T4.4
# 1 0 4 0 2
# 2 0 2 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 1 0 0
# 6 0 1 0 0
# 7 0 2 0 0
# 8 1 3 0 0
# 9 1 2 0 1
# 10 0 3 0 0
# 11 0 1 0 0
# 12 0 4 0 4
# 13 1 3 0 0
# 14 1 2 0 0
# 15 1 3 0 1
# 16 0 4 0 0
# 17 0 0 0 0
# 18 2 2 0 0
# 19 1 3 0 0
# 20 0 4 0 2
# 21 1 0 0 0
# 22 0 4 0 0
# 23 0 1 0 0
# 24 0 1 0 0
# 25 2 2 0 1
# 26 1 2 0 0
# 27 0 4 0 1
# 28 0 2 0 0
# 29 0 2 0 0
# 30 0 4 0 2
# 31 1 3 0 0
# 32 0 3 0 2
# 33 2 1 0 1
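Since cut returns factor columns (as the str output shows), one possible follow-up if numeric columns are preferred, as a small sketch:
# Convert the factor labels "0".."4" back to numbers.
df_num <- as.data.frame(lapply(df_res, function(x) as.numeric(as.character(x))))
str(df_num)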
I find named vectors are a nice pattern for re-coding variables, especially for irregular patterns. You could use one like this here:
decoder <- c(0, rep(1, 3), rep(2, 7), rep(3, 12))   # codes for 0, 1:3, 4:10, 11:22
names(decoder) <- 0:22                              # values are looked up by their character form
sapply(df, function(x) ifelse(x <= 22, decoder[as.character(x)], 4))   # anything above 22 becomes 4
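A quick check on a hypothetical vector holding one value from each bin:
x <- c(0, 2, 5, 14, 54)
ifelse(x <= 22, decoder[as.character(x)], 4)
# [1] 0 1 2 3 4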
If the re-coding followed more of a regular pattern, cut is the more useful function.

How to import a text file with irregular white spaces in R

I have a text file that I want to use for survival data analysis:
1 0 0 0 15 0 0 1 1 0 0 2 12 0 12 0 12 0
2 0 0 1 20 0 0 1 0 0 0 4 9 0 9 0 9 0
3 0 0 1 15 0 0 0 1 1 0 2 13 0 13 0 7 1
4 0 0 0 20 1 0 1 0 0 0 2 11 1 29 0 29 0
5 0 0 1 70 1 1 1 1 0 0 2 28 1 31 0 4 1
6 0 0 1 20 1 0 1 0 0 0 4 11 0 11 0 8 1
7 0 0 1 5 0 0 0 0 0 1 4 12 0 12 0 11 1
8 0 0 1 30 1 0 1 1 0 0 4 8 1 34 0 4 1
9 0 0 1 25 0 1 0 1 1 0 4 10 1 53 0 4 1
10 0 0 1 20 0 1 0 1 0 0 4 7 0 1 1 7 0
11 0 0 1 30 1 0 1 0 0 1 4 7 1 21 1 44 1
12 0 0 0 20 0 0 1 0 0 1 4 20 0 1 1 20 0
13 0 0 1 25 0 0 1 1 1 0 4 12 1 32 0 32 0
14 0 0 1 70 0 0 0 0 0 1 4 16 0 16 0 16 0
15 0 0 1 20 1 0 1 0 0 0 4 39 0 39 0 39 0
16 0 0 0 10 1 0 1 0 0 1 4 23 1 34 0 34 0
17 0 0 1 10 1 0 0 0 0 0 4 8 0 8 0 8 0
18 0 0 1 15 0 0 0 0 0 0 4 15 0 15 0 6 1
19 0 0 1 10 0 0 0 0 0 1 4 8 0 8 0 8 0
20 0 0 1 15 0 0 0 0 1 0 4 24 1 32 0 32 0
21 0 0 1 16 0 0 1 0 0 0 4 25 1 22 1 43 0
22 0 1 1 55 1 0 1 1 0 0 4 14 1 3 1 56 0
23 0 0 1 20 1 0 1 1 0 0 4 24 1 47 0 11 1
24 0 0 0 30 0 0 0 1 1 0 4 6 1 43 0 43 0
25 0 0 1 40 0 1 0 1 1 0 1 25 0 3 1 25 0
26 0 0 1 15 1 0 1 1 0 0 4 12 0 12 0 12 0
27 0 1 1 50 0 0 1 0 0 1 4 15 1 53 0 32 1
28 0 0 1 40 1 0 1 1 0 0 4 18 1 52 0 51 1
29 0 1 1 45 0 1 1 1 1 0 4 13 1 11 1 21 0
30 0 1 0 40 0 1 1 1 1 0 2 29 0 2 1 29 0
31 0 0 1 28 0 0 1 0 0 0 2 7 0 7 0 3 1
32 0 0 1 19 1 0 1 0 0 0 3 16 0 16 0 16 0
33 0 0 1 15 0 0 1 0 0 0 2 10 0 10 0 3 1
34 0 0 1 5 0 0 1 0 1 0 3 6 0 6 0 4 1
35 0 1 1 35 0 0 1 0 0 0 4 8 1 43 0 7 1
36 0 0 1 2 1 0 1 0 0 0 1 1 1 27 0 27 0
37 0 1 1 5 0 0 1 0 0 0 2 18 0 18 0 18 0
38 0 0 1 55 1 0 1 0 0 1 4 6 1 5 1 47 1
39 0 0 0 10 0 0 0 1 0 0 2 19 1 29 0 29 0
40 0 0 1 15 0 0 1 0 0 0 4 5 0 5 0 5 0
41 0 1 1 20 1 0 1 0 0 1 4 1 1 4 1 97 0
42 0 1 0 30 1 0 1 1 0 1 4 15 1 28 0 28 0
43 0 0 1 25 1 1 1 1 0 1 4 14 1 4 1 7 1
44 0 0 1 95 1 1 1 1 1 1 4 9 0 9 0 3 1
45 0 1 1 30 0 0 0 0 1 0 4 1 1 39 0 39 0
46 0 0 1 15 1 0 1 0 0 0 4 10 0 10 0 10 0
47 0 0 1 20 0 1 1 1 0 0 4 6 1 5 1 46 0
48 0 1 1 6 0 0 1 0 0 0 2 13 1 28 0 28 0
49 0 0 1 15 0 0 1 0 0 1 4 11 1 21 0 21 0
50 0 0 1 7 0 0 1 1 0 0 1 8 1 17 1 38 0
51 0 0 1 13 0 0 1 1 1 0 4 10 0 10 0 10 0
52 0 0 1 25 1 0 1 0 0 1 4 6 1 40 0 5 1
53 0 0 1 25 1 0 1 0 1 1 4 18 1 22 0 9 1
54 0 1 1 20 1 0 1 0 0 1 4 16 1 16 1 21 1
55 0 1 1 25 0 0 1 1 0 0 4 7 1 26 0 26 0
56 0 0 1 95 1 0 1 1 1 1 4 14 0 14 0 14 0
57 0 0 1 17 1 0 1 0 0 0 4 16 0 16 0 16 0
58 0 0 1 3 0 0 1 0 1 0 3 4 0 4 0 1 1
59 0 0 1 15 1 0 1 0 0 0 4 19 0 6 1 19 0
60 0 0 1 65 1 1 1 1 1 1 4 21 1 8 1 10 1
61 0 1 1 15 1 0 1 1 1 1 4 18 0 18 0 18 0
62 0 0 1 40 1 0 1 0 0 0 3 31 0 31 0 13 1
63 0 0 1 45 1 0 1 1 0 1 4 11 1 24 1 40 0
64 0 1 0 35 0 0 1 1 0 0 4 4 1 5 1 47 0
65 0 0 1 85 1 1 1 1 0 1 4 12 1 8 1 9 1
66 0 1 1 15 0 1 0 1 0 1 4 11 1 35 0 19 1
67 0 0 1 70 0 1 1 1 1 0 2 23 1 8 1 60 0
68 0 0 1 6 1 0 0 0 0 1 4 7 0 7 0 7 0
69 0 0 1 20 0 0 1 0 0 0 4 19 1 26 0 6 1
70 0 1 1 36 1 0 1 0 1 1 4 16 1 20 1 23 1
71 1 1 1 50 1 1 1 0 1 0 4 15 0 1 1 15 0
72 1 0 1 21 1 0 1 0 0 0 4 6 1 13 1 23 0
73 1 0 1 16 1 0 1 0 0 0 4 2 1 9 0 9 0
74 1 1 1 3 0 0 1 0 0 0 4 6 1 14 0 14 0
75 1 0 1 5 1 0 1 0 0 0 3 8 0 8 0 2 1
76 1 0 1 32 0 1 1 1 0 1 4 18 1 51 0 18 1
77 1 0 1 38 0 1 1 1 0 0 4 12 1 22 0 22 0
78 1 0 1 16 1 0 1 0 0 0 4 7 1 16 0 16 0
79 1 1 1 9 0 1 0 1 0 0 4 6 1 2 1 2 1
80 1 0 1 17 0 1 1 0 0 0 2 10 1 10 1 22 0
81 1 0 1 22 1 0 1 0 0 0 4 12 1 20 0 5 1
82 1 0 1 10 0 0 1 0 0 0 4 5 1 5 1 14 0
83 1 0 1 12 1 0 1 0 0 0 4 12 0 12 0 12 0
84 1 0 1 80 1 1 1 1 1 1 4 6 1 4 1 41 0
85 1 1 1 15 0 0 1 1 0 0 4 9 1 9 1 21 0
86 1 0 1 50 1 0 1 0 0 1 4 18 1 7 1 56 0
87 1 0 1 50 1 1 1 1 1 1 4 7 1 42 1 67 0
88 1 0 1 15 1 0 1 0 0 0 3 11 0 11 0 11 0
89 1 0 1 8 1 0 1 0 0 0 4 9 1 17 0 17 0
90 1 1 1 45 1 1 1 1 0 0 1 11 1 11 1 18 1
91 1 0 1 20 0 1 1 1 0 1 4 6 1 6 1 14 1
92 1 0 1 5 0 0 1 0 1 0 3 4 1 8 0 5 1
93 1 0 1 25 0 0 1 0 0 0 2 5 1 10 0 5 1
94 1 0 1 40 0 1 1 1 0 0 4 11 1 8 1 31 0
95 1 0 1 4 0 0 1 0 1 0 3 9 1 7 1 23 0
96 1 0 1 25 0 0 1 1 0 1 4 4 1 14 1 46 0
97 1 1 1 20 0 0 1 0 1 0 4 5 1 1 1 38 0
98 1 1 1 26 0 0 1 0 0 1 4 8 1 3 1 35 0
99 1 0 1 10 0 1 1 1 0 0 4 13 1 21 0 21 0
100 1 1 1 85 1 1 1 1 0 1 4 11 0 3 1 11 0
101 1 0 1 75 1 0 1 1 1 0 4 29 1 49 0 16 1
102 1 0 0 5 0 0 1 0 1 0 1 13 0 13 0 13 0
103 1 0 1 20 1 0 1 0 0 0 4 1 1 12 0 12 0
104 1 1 1 8 0 1 0 1 1 0 4 6 1 6 1 13 0
105 1 1 1 10 0 0 1 0 0 1 4 6 1 23 0 23 0
106 1 0 1 10 0 0 0 0 1 1 4 3 1 31 0 31 0
107 1 1 0 2 0 0 1 0 0 0 1 2 1 2 1 10 0
108 1 0 0 5 0 0 0 0 1 0 2 4 1 4 1 17 0
109 1 0 1 10 1 0 0 0 1 0 4 5 1 18 0 18 0
110 1 0 1 18 0 0 1 1 1 0 4 6 1 5 1 33 0
111 1 0 1 20 1 0 1 1 0 0 4 9 1 8 1 17 0
112 1 0 1 80 1 1 1 1 1 1 4 4 1 11 1 13 0
113 1 0 0 17 1 0 1 1 1 1 4 5 1 4 1 35 0
114 1 0 0 35 1 0 1 0 0 0 4 7 1 7 1 71 0
115 1 0 1 50 1 0 1 0 1 1 4 11 0 11 0 3 1
116 1 0 0 20 0 0 1 0 0 0 4 6 1 31 1 42 1
117 1 0 1 25 0 1 1 1 0 0 3 8 0 8 0 5 1
118 1 0 1 20 0 0 0 1 0 1 1 3 1 2 1 30 0
119 1 0 1 20 0 0 1 1 0 0 4 6 1 38 0 38 0
120 1 0 1 10 1 0 1 0 0 0 4 16 0 16 0 16 0
121 1 0 0 15 1 0 1 0 0 0 2 20 0 20 0 20 0
122 1 0 1 15 0 0 1 0 1 0 4 30 0 2 1 30 0
123 1 0 1 15 0 0 1 0 0 0 4 2 1 7 0 7 0
124 1 0 1 20 0 0 1 1 0 0 2 8 1 6 1 22 0
125 1 0 1 13 1 0 1 0 0 0 4 13 0 4 1 5 1
126 1 0 1 25 1 0 1 0 0 1 4 13 1 1 1 31 0
127 1 0 1 25 0 0 1 1 0 1 4 17 0 17 0 10 1
128 1 0 1 8 1 0 1 0 0 0 4 14 0 14 0 14 0
129 1 1 1 30 1 0 1 0 0 1 4 13 0 5 1 13 0
130 1 0 1 40 0 1 1 1 1 0 4 24 0 7 1 17 1
131 1 1 1 12 0 1 1 1 1 0 1 14 1 21 0 21 0
132 1 0 1 15 0 0 1 0 0 0 4 8 1 19 1 25 0
133 1 0 1 25 1 0 1 0 0 0 4 23 0 23 0 8 1
134 1 0 1 15 0 0 1 0 0 0 4 17 1 17 0 11 1
135 1 0 0 20 0 0 1 1 1 0 4 19 1 31 0 31 0
136 1 0 1 22 0 1 1 0 0 0 4 14 1 20 0 20 0
137 1 0 1 15 1 0 1 0 1 0 4 15 1 22 0 22 0
138 1 0 1 7 1 0 1 0 0 0 3 13 0 3 1 13 0
139 1 0 1 30 0 1 1 1 1 0 2 49 0 49 0 4 1
140 1 0 1 20 1 0 1 0 0 1 4 14 0 10 1 14 0
141 1 1 1 35 1 0 1 0 0 1 4 6 1 5 1 49 0
142 1 0 0 10 0 0 1 0 0 0 4 12 0 12 0 12 0
143 1 0 1 8 0 0 1 0 1 0 3 14 0 1 1 14 0
144 1 0 1 13 0 0 0 0 1 0 4 32 1 38 0 38 0
145 1 1 0 10 0 1 1 1 0 0 2 12 1 13 1 41 0
146 1 0 1 8 0 0 0 1 1 0 4 10 1 18 0 18 0
147 1 0 1 7 1 0 1 0 0 0 4 8 0 8 0 8 0
148 1 0 1 52 1 0 1 1 1 1 4 15 1 39 1 76 0
149 1 1 1 14 0 1 1 1 1 0 4 8 1 62 0 62 0
150 1 1 1 7 0 0 1 0 0 0 1 5 1 17 0 17 0
151 1 1 1 20 1 0 1 0 0 0 4 7 1 6 1 17 1
152 1 0 1 15 0 0 0 1 1 1 4 19 1 3 1 42 0
153 1 0 1 10 0 0 1 0 0 0 4 10 0 10 0 2 1
154 1 0 1 35 1 1 1 0 0 0 4 10 1 27 0 27 0
I have used the Import Dataset tool within R, but I cannot seem to find the right settings to import the dataset: the columns are either merged together, or there are additional columns with many NAs.
I have looked around for similar questions, however I cannot find a solution that suits my problem.
How can I import this dataset?
Ensure it is saved as a plain-text file (for example text.txt), then apply the following: read.table("text.txt").
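This works because read.table's default sep = "" treats any run of spaces or tabs as a single separator, so the irregular spacing is harmless. A minimal sketch, assuming the file is saved as text.txt in the working directory:
dat <- read.table("text.txt", header = FALSE)   # default sep = "" splits on any whitespace
dim(dat)   # 154 rows by 18 columns for the data shown above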

Error in running glinternet: a statistical function for automatic model selection using interaction terms, by Stanford's professor T. Hastie

glinternet is an R package implementing an algorithm developed by Trevor Hastie, the eminent Stanford professor of statistical learning, and a former PhD student of his. glinternet() automatically detects interaction terms, which makes it very useful for building a model when there are many variables and the number of possible combinations is enormous.
When I run glinternet I get an error message, which I reproduce here using the base R mtcars dataset:
library(data.table)   # for setDT()
library(dplyr)        # for glimpse()
library(glinternet)

data(mtcars)
setDT(mtcars)
glimpse(mtcars)
x <- as.matrix(mtcars[, -c("am"), with = FALSE])
class(x)
y <- mtcars$am
class(y)
glinter_fit <- glinternet(x, y, numLevels = 2)
Error: pCat + pCont == ncol(X) is not TRUE
Your advice will be appreciated.
It's not very clear from the error, but numLevels must be a vector as long as the number of predictor columns, each element giving the number of categories for that column, with 1 meaning continuous.
In your example everything in x is continuous, so we do:
glinternet(x, y, numLevels = rep(1, ncol(x)))
Call: glinternet(X = x, Y = y, numLevels = rep(1, ncol(x)))
lambda objValue cat cont catcat contcont catcont
1 0.068900 0.1210 0 0 0 0 0
2 0.062800 0.1200 0 1 0 0 0
3 0.057100 0.1180 0 1 0 0 0
4 0.052000 0.1160 0 1 0 0 0
5 0.047300 0.1130 0 2 0 0 0
6 0.043100 0.1100 0 2 0 0 0
7 0.039200 0.1060 0 3 0 0 0
8 0.035700 0.1020 0 3 0 0 0
9 0.032500 0.0983 0 3 0 0 0
10 0.029600 0.0944 0 3 0 0 0
11 0.026900 0.0904 0 3 0 0 0
12 0.024500 0.0866 0 3 0 0 0
13 0.022300 0.0829 0 3 0 0 0
14 0.020300 0.0794 0 3 0 0 0
15 0.018500 0.0760 0 3 0 0 0
16 0.016800 0.0728 0 3 0 1 0
17 0.015300 0.0698 0 4 0 1 0
18 0.014000 0.0668 0 4 0 1 0
19 0.012700 0.0638 0 4 0 2 0
20 0.011600 0.0608 0 4 0 2 0
21 0.010500 0.0579 0 3 0 2 0
22 0.009580 0.0551 0 3 0 2 0
23 0.008720 0.0523 0 3 0 2 0
24 0.007940 0.0497 0 3 0 2 0
25 0.007230 0.0472 0 3 0 3 0
26 0.006580 0.0448 0 5 0 3 0
27 0.005990 0.0425 0 5 0 3 0
28 0.005450 0.0403 0 5 0 3 0
29 0.004960 0.0382 0 5 0 3 0
30 0.004520 0.0361 0 4 0 3 0
31 0.004110 0.0342 0 4 0 3 0
32 0.003740 0.0324 0 4 0 4 0
33 0.003410 0.0307 0 4 0 5 0
34 0.003100 0.0291 0 4 0 6 0
35 0.002820 0.0275 0 3 0 6 0
36 0.002570 0.0261 0 3 0 6 0
37 0.002340 0.0247 0 3 0 8 0
38 0.002130 0.0234 0 3 0 7 0
39 0.001940 0.0221 0 3 0 7 0
40 0.001760 0.0210 0 3 0 7 0
41 0.001610 0.0199 0 3 0 8 0
42 0.001460 0.0188 0 3 0 8 0
43 0.001330 0.0178 0 4 0 10 0
44 0.001210 0.0168 0 4 0 10 0
45 0.001100 0.0159 0 4 0 12 0
46 0.001000 0.0149 0 4 0 12 0
47 0.000914 0.0140 0 4 0 12 0
48 0.000832 0.0132 0 4 0 12 0
49 0.000757 0.0123 0 3 0 13 0
50 0.000689 0.0115 0 2 0 13 0
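For a mix of variable types, numLevels combines the level counts with 1s for the continuous columns. A sketch with hypothetical data, assuming (as the package documentation describes) that categorical columns are coded as integers 0, 1, ..., nLevels - 1:
library(glinternet)
set.seed(1)
X <- cbind(rnorm(100), rnorm(100), sample(0:2, 100, replace = TRUE))   # two continuous, one 3-level categorical
Y <- rnorm(100)
fit <- glinternet(X, Y, numLevels = c(1, 1, 3))   # 1 = continuous, 3 = three categories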
