Losing observations when I use reshape in R

I have a data set:
> head(pain_subset2, n= 50)
PatientID RSE SE SECODE
1 1001-01 0 0 0
2 1001-01 0 0 0
3 1001-02 0 0 0
4 1001-02 0 0 0
5 1002-01 0 0 0
6 1002-01 1 2a 1
7 1002-02 0 0 0
8 1002-02 0 0 0
9 1002-02 0 0 0
10 1002-03 0 0 0
11 1002-03 0 0 0
12 1002-03 1 1 1
> dim(pain_subset2)
[1] 817 4
> table(pain_subset2$RSE)
0 1
788 29
> table(pain_subset2$SE)
0 1 2a 2b 3 4 5
788 7 5 1 6 4 6
> table(pain_subset2$SECODE)
0 1
788 29
I want to create an n × 6 matrix (n = number of PatientIDs, columns = the 6 levels of SE).
When I use reshape, I seem to lose many observations:
> dim(p)
[1] 246 9
My code:
p <- reshape(pain_subset2, timevar = "SE", idvar = c("PatientID","RSE"),v.names = "SECODE", direction = "wide")
p[is.na(p)] <- 0
> table(p$RSE)
0 1
226 20
Compared with the RSE table of the original data (29 ones), I have lost 9 observations with RSE = 1.
This is the output I get:
PatientID RSE SECODE.0 SECODE.2a SECODE.1 SECODE.5 SECODE.3 SECODE.2b SECODE.4
1 1001-01 0 0 0 0 0 0 0 0
3 1001-02 0 0 0 0 0 0 0 0
5 1002-01 0 0 0 0 0 0 0 0
6 1002-01 1 0 1 0 0 0 0 0
7 1002-02 0 0 0 0 0 0 0 0
10 1002-03 0 0 0 0 0 0 0 0
12 1002-03 1 0 0 1 0 0 0 0
13 1002-04 0 0 0 0 0 0 0 0
15 1003-01 0 0 0 0 0 0 0 0
18 1003-02 0 0 0 0 0 0 0 0
21 1003-03 0 0 0 0 0 0 0 0
24 1003-04 0 0 0 0 0 0 0 0
27 1003-05 0 0 0 0 0 0 0 0
30 1003-06 0 0 0 0 0 0 0 0
32 1003-07 0 0 0 0 0 0 0 0
35 1004-01 0 0 0 0 0 0 0 0
36 1004-01 1 0 0 0 1 0 0 0
40 1004-02a 0 0 0 0 0 0 0 0
Does anyone know what is happening? I would really appreciate any help.
Thanks.

Try:
library(dplyr)
library(tidyr)
pain_subset2 %>%
spread(SE, SECODE)
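For what it's worth, reshape(direction = "wide") returns one row per distinct combination of the idvar columns, so p has one row per PatientID/RSE pair rather than one per observation. The 29 rows with RSE = 1 presumably collapse to 20 such pairs, with their SE values spread across the SECODE.* columns. If the goal is one row per patient with a 0/1 flag per SE level, a base R sketch along the following lines may be enough. The toy data mirror a few rows from the question, and the names pain, se_levels, counts and ind are made up for illustration.

```r
# Toy data mirroring a few rows shown in the question (illustrative only)
pain <- data.frame(
  PatientID = c("1001-01", "1001-01", "1002-01", "1002-01",
                "1002-03", "1002-03", "1002-03"),
  SE        = c("0", "0", "0", "2a", "0", "0", "1"),
  stringsAsFactors = FALSE
)

# Fix the SE levels up front so every level gets a column, even if unused here
se_levels <- c("1", "2a", "2b", "3", "4", "5")
pain$SE <- factor(pain$SE, levels = c("0", se_levels))

# Patients x SE levels: how often each level occurs per patient
counts <- unclass(table(pain$PatientID, pain$SE))

# Turn counts into 0/1 indicators and drop the "0" (no side effect) column
ind <- (counts[, se_levels, drop = FALSE] > 0) * 1L
ind
```

Because the patient is the only row key, repeated observations of the same patient can no longer produce extra rows.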


Multiplying multiple columns with each other into a new dataframe in R

I want to multiply many of my binary variables into new columns, so-called interaction variables. My dataset is structured like this:
YearCountry <- data.frame( Time = c("2000","2001", "2002", "2003",
"2000","2001", "2002", "2003",
"2000","2001", "2002", "2003"),
AL = c(1,1,1,1,0,0,0,0,0,0,0,0),
FR = c(0,0,0,0,1,1,1,1,0,0,0,0),
UK = c(0,0,0,0,0,0,0,0,1,1,1,1),
Y2000d = c(1,0,0,0,1,0,0,0,1,0,0,0),
Y2001d = c(0,1,0,0,0,1,0,0,0,1,0,0),
Y2002d = c(0,0,1,0,0,0,1,0,0,0,1,0),
Y2003d = c(0,0,0,1,0,0,0,1,0,0,0,1))
YearCountry
Time AL FR UK Y2000d Y2001d Y2002d Y2003d
1 2000 1 0 0 1 0 0 0
2 2001 1 0 0 0 1 0 0
3 2002 1 0 0 0 0 1 0
4 2003 1 0 0 0 0 0 1
5 2000 0 1 0 1 0 0 0
6 2001 0 1 0 0 1 0 0
7 2002 0 1 0 0 0 1 0
8 2003 0 1 0 0 0 0 1
9 2000 0 0 1 1 0 0 0
10 2001 0 0 1 0 1 0 0
11 2002 0 0 1 0 0 1 0
12 2003 0 0 1 0 0 0 1
I need to multiply the binary variable for each of the countries (AL,FR,UK) with each of the binary variables for a given year so that I get #country x #year new variables. In this case I have three countries and four years which gives 12 new variables. My full data contains 105 countries/regions and stretches over twenty years. I therefore need a general formula. I want data that looks like this
Interact <- data.frame(Time = c("2000","2001", "2002", "2003",
"2000","2001", "2002", "2003",
"2000","2001", "2002", "2003"),
Y2000xAL = c(1,0,0,0,0,0,0,0,0,0,0,0),
Y2001xAL = c(0,1,0,0,0,0,0,0,0,0,0,0),
Y2002xAL = c(0,0,1,0,0,0,0,0,0,0,0,0),
Y2003xAL = c(0,0,0,1,0,0,0,0,0,0,0,0),
Y2000xFR = c(0,0,0,0,1,0,0,0,0,0,0,0),
Y2001xFR = c(0,0,0,0,0,1,0,0,0,0,0,0),
Y2002xFR = c(0,0,0,0,0,0,1,0,0,0,0,0),
Y2003xFR = c(0,0,0,0,0,0,0,1,0,0,0,0),
Y2000xUK = c(0,0,0,0,0,0,0,0,1,0,0,0),
Y2001xUK = c(0,0,0,0,0,0,0,0,0,1,0,0),
Y2002xUK = c(0,0,0,0,0,0,0,0,0,0,1,0),
Y2003xUK = c(0,0,0,0,0,0,0,0,0,0,0,1))
Interact
Time Y2000xAL Y2001xAL Y2002xAL Y2003xAL Y2000xFR Y2001xFR Y2002xFR Y2003xFR Y2000xUK Y2001xUK Y2002xUK Y2003xUK
1 2000 1 0 0 0 0 0 0 0 0 0 0 0
2 2001 0 1 0 0 0 0 0 0 0 0 0 0
3 2002 0 0 1 0 0 0 0 0 0 0 0 0
4 2003 0 0 0 1 0 0 0 0 0 0 0 0
5 2000 0 0 0 0 1 0 0 0 0 0 0 0
6 2001 0 0 0 0 0 1 0 0 0 0 0 0
7 2002 0 0 0 0 0 0 1 0 0 0 0 0
8 2003 0 0 0 0 0 0 0 1 0 0 0 0
9 2000 0 0 0 0 0 0 0 0 1 0 0 0
10 2001 0 0 0 0 0 0 0 0 0 1 0 0
11 2002 0 0 0 0 0 0 0 0 0 0 1 0
12 2003 0 0 0 0 0 0 0 0 0 0 0 1
Here's an approach with dplyr::across. We can make the final result into a plain data.frame with purrr::invoke, as demonstrated in this answer.
library(dplyr)
library(purrr)
library(stringr) # provides str_replace()
YearCountry %>%
mutate(across(AL:UK, ~ . * select(cur_data(), Y2000d:Y2003d))) %>%
select(-(Y2000d:Y2003d)) %>%
invoke(.f = data.frame) %>%
rename_with(~str_replace(.,"\\.",""))
Time ALY2000d ALY2001d ALY2002d ALY2003d FRY2000d FRY2001d FRY2002d FRY2003d UKY2000d UKY2001d UKY2002d UKY2003d
1 2000 1 0 0 0 0 0 0 0 0 0 0 0
2 2001 0 1 0 0 0 0 0 0 0 0 0 0
3 2002 0 0 1 0 0 0 0 0 0 0 0 0
4 2003 0 0 0 1 0 0 0 0 0 0 0 0
5 2000 0 0 0 0 1 0 0 0 0 0 0 0
6 2001 0 0 0 0 0 1 0 0 0 0 0 0
7 2002 0 0 0 0 0 0 1 0 0 0 0 0
8 2003 0 0 0 0 0 0 0 1 0 0 0 0
9 2000 0 0 0 0 0 0 0 0 1 0 0 0
10 2001 0 0 0 0 0 0 0 0 0 1 0 0
11 2002 0 0 0 0 0 0 0 0 0 0 1 0
12 2003 0 0 0 0 0 0 0 0 0 0 0 1
1) model.matrix We split the names by the number of characters in them (the countries have 2 characters in their names and the years have 6) and paste pluses in each. (Alternately use Plus(grep("^..$", nms, value = TRUE)) to get the country names and use that in place of spl["2"] and similarly Plus(grep("^Y....d$", nms, value = TRUE)) in place of spl["6"].)
c(`2` = "AL+FR+UK", `6` = "Y2000d+Y2001d+Y2002d+Y2003d")
and from that the formula:
~(AL + FR + UK):(Y2000d + Y2001d + Y2002d + Y2003d) + 0
and then compute its model matrix.
The formula could also be expanded to one accepted by lm by modifying the sprintf format, so we might not even need to create the model matrix. For example, if we had a response vector R then we could write: s <- sprintf("R ~ (%s)*(%s)", spl["2"], spl["6"]); fo <- formula(s); lm(fo, YearCountry) to include all variables and the interactions of countries and years as well as an intercept.
Plus <- function(x) paste(x, collapse = "+")
nms <- names(YearCountry)[-1]
spl <- sapply(split(nms, nchar(nms)), Plus)
s <- sprintf("~ (%s):(%s)+0", spl["2"], spl["6"])
fo <- formula(s)
model.matrix(fo, YearCountry)
giving this matrix:
AL:Y2000d AL:Y2001d AL:Y2002d AL:Y2003d FR:Y2000d FR:Y2001d FR:Y2002d FR:Y2003d UK:Y2000d UK:Y2001d UK:Y2002d UK:Y2003d
1 1 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0 0 0
8 0 0 0 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Alternately we can write it compactly like this:
Plus <- function(x) paste(x, collapse = "+")
nms <- names(YearCountry)
s <- sprintf("~ (%s):(%s)+0", Plus(nms[2:4]), Plus(nms[5:8]))
fo <- formula(s)
model.matrix(fo, YearCountry)
2) eList Another approach is to use list comprehensions. With the eList package we can do this:
library(eList)
DF(for(i in YearCountry[2:4]) for(j in YearCountry[5:8]) i*j)
giving this data frame. Use as.matrix(...) on it if you want a matrix.
AL.Y2000d AL.Y2001d AL.Y2002d AL.Y2003d FR.Y2000d FR.Y2001d FR.Y2002d FR.Y2003d UK.Y2000d UK.Y2001d UK.Y2002d UK.Y2003d
1 1 0 0 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 0 0 0
8 0 0 0 0 0 0 0 1 0 0 0 0
9 0 0 0 0 0 0 0 0 1 0 0 0
10 0 0 0 0 0 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0 0 0 0 1 0
12 0 0 0 0 0 0 0 0 0 0 0 1
3) listcompr listcompr is another list comprehension package. Note that the development version of this package is needed in order to use bycol=. Replace gen.named.matrix with gen.named.data.frame if you want a data frame.
# devtools::install_github("patrickroocks/listcompr")
library(listcompr)
nms <- names(YearCountry)
gen.named.matrix("{nms[i]}.{nms[j]}", YearCountry[[i]] * YearCountry[[j]],
i = 2:4, j = 5:8, bycol = TRUE)
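If no extra packages are wanted, the same country-by-year products can be built in base R by enumerating all (year, country) name pairs. The following is only a sketch: the helper names countries, years, pairs and cols are made up, and the year columns are assumed to follow the Y####d naming pattern from the question.

```r
# Rebuild the example data from the question
YearCountry <- data.frame(
  Time   = rep(c("2000", "2001", "2002", "2003"), times = 3),
  AL     = rep(c(1, 0, 0), each = 4),
  FR     = rep(c(0, 1, 0), each = 4),
  UK     = rep(c(0, 0, 1), each = 4),
  Y2000d = rep(c(1, 0, 0, 0), times = 3),
  Y2001d = rep(c(0, 1, 0, 0), times = 3),
  Y2002d = rep(c(0, 0, 1, 0), times = 3),
  Y2003d = rep(c(0, 0, 0, 1), times = 3)
)

countries <- c("AL", "FR", "UK")
years <- grep("^Y[0-9]{4}d$", names(YearCountry), value = TRUE)

# All (year, country) combinations; the year varies fastest within each country
pairs <- expand.grid(year = years, country = countries, stringsAsFactors = FALSE)

# One product column per combination
cols <- lapply(seq_len(nrow(pairs)), function(r) {
  YearCountry[[pairs$year[r]]] * YearCountry[[pairs$country[r]]]
})
names(cols) <- paste0(sub("d$", "", pairs$year), "x", pairs$country)

Interact <- data.frame(Time = YearCountry$Time, cols)
Interact
```

With 105 countries and twenty years only the countries vector changes, provided the year pattern still matches.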

EDITED: spreading data based on column match

I have an empty data frame I am trying to populate.
Df1 looks like this:
col1 col2 col3 col4 important_col
1 82 193 104 86 120
2 85 68 116 63 100
3 78 145 10 132 28
4 121 158 103 15 109
5 48 175 168 190 151
6 91 136 156 180 155
Df2 looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A data frame full of 0's.
I combine the data frames to make df_fin.
What I am trying to do now is something similar to a dummy variable approach… I have the column in important_col. What I am trying to do is spread this column out, so if important_col = 28 then put a 1 in column 28.
How can I go about creating this?
EDIT: I added a comment to illustrate what I am trying to achieve. I paste it here also.
Say that the important_col is countries, then the column names would
be all the countries in the world. That is in this example all of the
241 countries in the world. However the data I might have already
collected might only contain 200 of these countries. So
one_hot_encoding here would give me 200 columns but I am missing
potentially 41 countries. So if a new user from a country (not
currently in the data) comes to the data and inputs their country,
then it wouldn't be recognised.
Smaller example:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 11 14 3 11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 19 15 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 17 10 10 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 13 10 8 17 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 18 5 3 18 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 11 10 9 5 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 5 11 18 16 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 5 8 13 8 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10 1 7 16 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 4 17 17 3 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Expected output:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 11 14 3 11 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 19 15 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 17 10 10 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 13 10 8 17 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
5 18 5 3 18 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
6 11 10 9 5 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
7 5 11 18 16 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
8 5 8 13 8 6 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 10 1 7 16 12 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
10 4 17 17 3 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The number of columns is greater than the number of potential entries into important_col. Using the countries example the columns would be all countries in the world and the important_col would consist of a subset of these countries.
Code to generate the above:
df1 <- data.frame(replicate(5, sample(1:20, 10, rep=TRUE)))
colnames(df1) <- c("col1", "col2", "col3", "col4", "important_col")
df2 <- data.frame(replicate(20, sample(0:0, nrow(df1), rep=TRUE)))
colnames(df2) <- gsub("X", "", colnames(df2))
df_fin <- cbind(df1, df2)
df_fin
Does this solve the problem?
Data:
set.seed(123)
df1 <- data.frame(replicate(5, sample(1:20, 10, rep=TRUE)))
colnames(df1) <- c("col1", "col2", "col3", "col4", "important_col")
df2 <- data.frame(replicate(20, sample(0:0, nrow(df1), rep=TRUE)))
colnames(df2) <- gsub("X", "", colnames(df2))
df_fin <- cbind(df1, df2)
Result:
vecp <- colnames(df2)
imp_col <- df1$important_col
m <- matrix(vecp, byrow = TRUE, nrow = length(imp_col), ncol = length(vecp))
d <- ifelse(m == imp_col, 1, 0)
df_fin <- cbind(df1, d)
Output:
col1 col2 col3 col4 important_col 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 6 20 18 20 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 16 10 14 19 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
3 9 14 13 14 9 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
4 18 12 20 16 8 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 19 3 14 1 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1 18 15 10 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 11 5 11 16 5 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 18 1 12 5 10 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9 12 7 6 7 6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 10 20 3 5 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
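The same comparison can be written with outer(), which avoids building the character matrix by hand. A minimal sketch, assuming the possible column values are the integers 1 to 20 as in the example; important_col, all_cols and onehot are illustrative names, and the five values are made up.

```r
# Values to encode and the full set of columns that must exist
important_col <- c(3, 9, 9, 8, 4)
all_cols <- 1:20

# outer() compares every value with every column label;
# multiplying by 1L turns TRUE/FALSE into 1/0
onehot <- outer(important_col, all_cols, "==") * 1L
colnames(onehot) <- all_cols
onehot
```

Columns for values that never occur (here, for example, column 20) stay in the result as all zeros, which is the behaviour asked for in the question.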
What you are trying to do is one-hot encoding, which you can easily achieve using model.matrix.
The example below should point you in the right direction:
df <- data.frame(important_col = as.factor(c(1:3)))
df
important_col
1 1
2 2
3 3
as.data.frame(model.matrix(~.-1, df))
important_col1 important_col2 important_col3
1 1 0 0
2 0 1 0
3 0 0 1
Like Sonny mentioned, model.matrix() should do the job. One potential problem is that you have to add back columns for values that do not show up in important_col, as in the following case:
df <- data.frame(important_col = as.factor(c(1:3, 5)))
df
important_col
1 1
2 2
3 3
4 5
as.data.frame(model.matrix(~.-1, df))
important_col1 important_col2 important_col3 important_col5
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
Column 4 is missing in the second data frame because important_col does not include the value 4. You have to add column 4 back if you need it for analysis.
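One way to avoid adding columns back by hand, sketched here under the assumption that the full set of possible values is known in advance (all_levels is a made-up name), is to declare those values as factor levels before calling model.matrix. Unused levels are kept and show up as all-zero columns.

```r
# Every value that could ever occur, whether or not it is in the data yet
all_levels <- 1:5

# The value 4 is absent from the data but present in the levels
df <- data.frame(important_col = factor(c(1, 2, 3, 5), levels = all_levels))

# One indicator column per level; "- 1" removes the intercept column
mm <- model.matrix(~ important_col - 1, df)
as.data.frame(mm)
```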

Garch(1,1) with Dummy Variable

I am trying to use a GARCH(1,1) model in R to estimate the influence of the day of the week (and later other parameters) on the log return, ln(Pt/Pt-1), of product sales.
I have everything set up in a CSV file, with a dummy variable for each day (D1, D2, ...) taking the value 1 or 0.
I am building the following model in R:
library(rugarch)
# Bind data
ext.reg.D1 <- mydata$D1
ext.reg.D2 <- mydata$D2
ext.reg.D3 <- mydata$D3
ext.reg.D4 <- mydata$D4
ext.reg.D5 <- mydata$D5
ext.reg.D6 <- mydata$D6
ext.reg.D7 <- mydata$D7
ext.reg <- cbind(ext.reg.D1, ext.reg.D2, ext.reg.D3,ext.reg.D4,ext.reg.D5,ext.reg.D6)
y <- mydata$log_return
fit.spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1), submodel = NULL,
                        external.regressors = NULL, variance.targeting = FALSE),
  mean.model = list(armaOrder = c(0, 0), external.regressors = ext.reg),
  distribution.model = "norm", start.pars = list(), fixed.pars = list())
fit <- ugarchfit(data = y, spec = fit.spec)
Error
In .sgarchfit(spec = spec, data = data, out.sample = out.sample, : ugarchfit-->warning: solver failer to converge.
Any ideas how to solve this?
Thanks
Sample data (101 rows shown)
log_return D5 D6 D7 D1 D2 D3 D4
1 -0.02979189 1 0 0 0 0 0 0
2 17.43188265 0 1 0 0 0 0 0
3 -9.12727223 0 0 1 0 0 0 0
4 2.77744081 0 0 0 1 0 0 0
5 9.62597392 0 0 0 0 1 0 0
6 -0.11614358 0 0 0 0 0 1 0
7 10.81279075 0 0 0 0 0 0 1
8 -1.03825650 1 0 0 0 0 0 0
9 -5.49109661 0 1 0 0 0 0 0
10 -16.81177602 0 0 1 0 0 0 0
11 9.74292804 0 0 0 1 0 0 0
12 15.22583595 0 0 0 0 1 0 0
13 -1.79578436 0 0 0 0 0 1 0
14 0.40559431 0 0 0 0 0 0 1
15 -2.38281092 1 0 0 0 0 0 0
16 -4.88853323 0 1 0 0 0 0 0
17 -16.98493635 0 0 1 0 0 0 0
18 7.57998016 0 0 0 1 0 0 0
19 17.56008274 0 0 0 0 1 0 0
20 -0.46754932 0 0 0 0 0 1 0
21 -1.27007966 0 0 0 0 0 0 1
22 -1.79234966 1 0 0 0 0 0 0
23 -5.79461986 0 1 0 0 0 0 0
24 -17.82636881 0 0 1 0 0 0 0
25 9.48124679 0 0 0 1 0 0 0
26 17.64277207 0 0 0 0 1 0 0
27 -0.71191725 0 0 0 0 0 1 0
28 -1.14937870 0 0 0 0 0 0 1
29 -1.62331777 1 0 0 0 0 0 0
30 -5.52787401 0 1 0 0 0 0 0
31 -18.50034717 0 0 1 0 0 0 0
32 10.31502542 0 0 0 1 0 0 0
33 16.21997258 0 0 0 0 1 0 0
34 -1.09910695 0 0 0 0 0 1 0
35 -0.57416519 0 0 0 0 0 0 1
36 -1.83623328 1 0 0 0 0 0 0
37 -5.48021232 0 1 0 0 0 0 0
38 -20.02869823 0 0 1 0 0 0 0
39 11.48799875 0 0 0 1 0 0 0
40 17.55356524 0 0 0 0 1 0 0
41 -1.45430558 0 0 0 0 0 1 0
42 -2.15287757 0 0 0 0 0 0 1
43 -4.91058837 1 0 0 0 0 0 0
44 -4.35107354 0 1 0 0 0 0 0
45 -19.40533612 0 0 1 0 0 0 0
46 6.47785167 0 0 0 1 0 0 0
47 16.54500844 0 0 0 0 1 0 0
48 1.43266482 0 0 0 0 0 1 0
49 1.91234500 0 0 0 0 0 0 1
50 -1.44926252 1 0 0 0 0 0 0
51 -5.69296574 0 1 0 0 0 0 0
52 -14.21241905 0 0 1 0 0 0 0
53 9.85180551 0 0 0 1 0 0 0
54 16.72072000 0 0 0 0 1 0 0
55 -1.04381003 0 0 0 0 0 1 0
56 -1.49048390 0 0 0 0 0 0 1
57 -2.57835848 1 0 0 0 0 0 0
58 -2.93456505 0 1 0 0 0 0 0
59 -21.27981318 0 0 1 0 0 0 0
60 14.27747712 0 0 0 1 0 0 0
61 15.20376637 0 0 0 0 1 0 0
62 -2.36474181 0 0 0 0 0 1 0
63 -0.12825700 0 0 0 0 0 0 1
64 -2.17755007 1 0 0 0 0 0 0
65 -6.50236487 0 1 0 0 0 0 0
66 -20.40159745 0 0 1 0 0 0 0
67 10.12381534 0 0 0 1 0 0 0
68 19.34672964 0 0 0 0 1 0 0
69 -0.18663788 0 0 0 0 0 1 0
70 -1.26430704 0 0 0 0 0 0 1
71 -2.17712050 1 0 0 0 0 0 0
72 -5.20850527 0 1 0 0 0 0 0
73 -19.00303225 0 0 1 0 0 0 0
74 10.78960865 0 0 0 1 0 0 0
75 16.50911599 0 0 0 0 1 0 0
76 -1.20629718 0 0 0 0 0 1 0
77 -0.92077350 0 0 0 0 0 0 1
78 -2.13818901 1 0 0 0 0 0 0
79 -6.39795596 0 1 0 0 0 0 0
80 -16.89947946 0 0 1 0 0 0 0
81 11.84070286 0 0 0 1 0 0 0
82 16.76126417 0 0 0 0 1 0 0
83 -2.32992683 0 0 0 0 0 1 0
84 -0.04347497 0 0 0 0 0 0 1
85 -1.58421553 1 0 0 0 0 0 0
86 -5.11294741 0 1 0 0 0 0 0
87 -22.94382512 0 0 1 0 0 0 0
88 12.08906834 0 0 0 1 0 0 0
89 18.59588505 0 0 0 0 1 0 0
90 -0.66190281 0 0 0 0 0 1 0
91 -3.35891858 0 0 0 0 0 0 1
92 -5.56096067 1 0 0 0 0 0 0
93 -19.12946131 0 1 0 0 0 0 0
94 -2.45717082 0 0 1 0 0 0 0
95 -6.00314421 0 0 0 1 0 0 0
96 16.87403882 0 0 0 0 1 0 0
97 16.72700765 0 0 0 0 0 1 0
98 -1.80683941 0 0 0 0 0 0 1
99 -2.08228231 1 0 0 0 0 0 0
100 -5.98864409 0 1 0 0 0 0 0
101 -14.91991224 0 0 1 0 0 0 0
I think the problem is that the explanatory variables are all dummy variables. You should include another, non-dummy variable as x alongside D1...D7; the model does not make sense without it.
You cannot estimate y (which is a continuous variable) with dummy variables only. Try, for example, adding y-1 (the lagged value of y) to
ext.reg <- cbind(ext.reg.D1, ext.reg.D2, ext.reg.D3,ext.reg.D4,ext.reg.D5,ext.reg.D6)
Good luck.
Change your ext.reg to this:
ext.reg <- cbind(ext.reg.D1, ext.reg.D2, ext.reg.D3, ext.reg.D4,
ext.reg.D5, ext.reg.D6, ext.reg.D7)
With that change the exercise is solved.
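One thing that may be worth checking with either version of ext.reg: the seven day dummies always sum to 1, so if the mean equation also contains a constant (rugarch's mean.model has include.mean = TRUE unless it is switched off), all seven dummies plus the constant are perfectly collinear, whereas six dummies plus the constant are not. The base R sketch below only illustrates that rank argument on made-up data; it does not fit a GARCH model, and the names day, D, rank_all7 and rank_six are hypothetical.

```r
# Two weeks of made-up day-of-week indicators, one dummy per weekday
n   <- 14
day <- rep(1:7, length.out = n)
D   <- sapply(1:7, function(k) as.numeric(day == k))
colnames(D) <- paste0("D", 1:7)

# Constant + all seven dummies: 8 columns, but the dummies sum to the constant
rank_all7 <- qr(cbind(Intercept = 1, D))$rank

# Constant + six dummies: 7 columns, all linearly independent
rank_six <- qr(cbind(Intercept = 1, D[, 1:6]))$rank

c(all7 = rank_all7, six = rank_six)
```

If the rank is lower than the number of columns, the regression coefficients are not identified. That could be one source of solver trouble, though the output in the question does not confirm it.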

Turn a long data structure to a wide matrix structure

I have the following data structure:
ID value
1 1 1
2 1 63
3 1 2
4 1 58
5 2 3
6 2 4
7 3 34
8 3 25
Now I want to turn it into a kind of dyadic data structure: all IDs that share the same value should be connected.
I tried several options, and
df_wide <- dcast(df, ID ~ value)
... has brought me a long way down the road:
ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 39 40
1 1001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1007 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0
4 1011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1018 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 1020 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7 1030 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
8 1036 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Now my main problem is turning it into a proper matrix so that I can get an igraph object out of it.
df_wide_matrix <- data.matrix(df_wide)
df_aus_wide_g <- graph.edgelist(df_wide_matrix ,directed = TRUE)
doesn't get me there...
I also tried to transform it into an adjacency matrix...
df_wide_matrix <- get.adjacency(graph.edgelist(as.matrix(df_wide), directed=FALSE))
... but it didn't work either
If you want to create an edge between all IDs with the same value, try something like this instead. First merge the data frame with itself by value. Then drop the value column and remove all (undirected) edges that are duplicates or self-loops. Finally, convert to a two-column matrix and create the graph.
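library(igraph)
# self-join on value: every pair of IDs sharing a value becomes a row
res <- merge(df, df, by='value', all=FALSE)[,c('ID.x','ID.y')]
# keep each undirected pair once and drop self-loops
res <- res[res$ID.x<res$ID.y,]
resg <- graph.edgelist(as.matrix(res), directed = FALSE)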
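The self-merge step can be tried without igraph on a small, made-up data frame in which some IDs do share values (the sample in the question has none). This is only a sketch of the idea; toy, pairs and edges are illustrative names.

```r
# Made-up data: IDs 1 and 2 share value 10, IDs 1 and 3 share 20, IDs 2 and 3 share 30
toy <- data.frame(ID    = c(1, 1, 2, 2, 3, 3),
                  value = c(10, 20, 10, 30, 20, 30))

# Join the table with itself on value: one row per ordered pair of IDs per shared value
pairs <- merge(toy, toy, by = "value")

# Keep ID.x < ID.y to drop self-pairs and mirrored pairs, then drop repeated pairs
edges <- unique(pairs[pairs$ID.x < pairs$ID.y, c("ID.x", "ID.y")])
edges
```

The resulting two-column data frame is what gets converted with as.matrix() and handed to the graph constructor in the answer above.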

how to convert a matrix of values into a binary matrix

I'd like to convert a matrix of values into a matrix of 'bits'.
I have been looking for solutions and found this, which seems to be part of a solution.
I'll try to explain what I am looking for.
I have a matrix like
> x<-matrix(1:20,5,4)
> x
[,1] [,2] [,3] [,4]
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
which I would like to convert into
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0
2 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
3 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
4 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
5 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
so for each value in the row a "1" in the corresponding column.
If I use
> table(sequence(length(x)),t(x))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
5 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
9 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
13 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
17 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
This is close to what I am looking for, but it returns one line per value.
I only need to consolidate all values from one row of x into a single row.
A plain
> table(x)
x
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
gives the counts for the whole matrix, so what do I need to do to get the values per row?
Here is another option using the table() function:
table(row(x), x)
# x
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0
# 2 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
# 3 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
# 4 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
# 5 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
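# start from an all-zero matrix with one column per possible value
bit_x <- matrix(0, nrow = nrow(x), ncol = max(x))
# for each row, set the columns named by that row's values to 1
for (i in seq_len(nrow(x))) bit_x[i, x[i, ]] <- 1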
Let
(x <- matrix(c(1, 3), 2, 2))
[,1] [,2]
[1,] 1 1
[2,] 3 3
One approach would be
M <- matrix(0, nrow(x), max(x))
M[cbind(c(row(x)), c(x))] <- 1
M
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 0 0 1
In one line:
replace(matrix(0, nrow(x), max(x)), cbind(c(row(x)), c(x)), 1)
Following your approach, and similarly to @Psidom's suggestion:
table(rep(1:nrow(x), ncol(x)), x)
# x
# 1 3
# 1 2 0
# 2 0 2
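To see how the two families of answers differ, the sketch below runs the index-matrix approach and table(row(x), x) on the 5 x 4 matrix from the question. On that input they agree; in general table() only creates columns for values that actually occur and counts repeats, while the index-matrix version always has max(x) columns and holds 0/1 flags.

```r
x <- matrix(1:20, 5, 4)

# Index-matrix approach: mark cell (row of the entry, value of the entry)
M <- matrix(0, nrow(x), max(x))
M[cbind(c(row(x)), c(x))] <- 1

# Contingency-table approach: cross-tabulate row index against value
tab <- table(row(x), x)

# Same dimensions and same entries on this input
dim(M)
dim(tab)
all(M == as.vector(tab))
```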
We can use the reshape2 package.
library(reshape2)
# At first we make the matrix you provided
x <- matrix(1:20, 5, 4)
# then melt it based on first column
da <- melt(x, id.var = 1)
# then cast it
dat <- dcast(da, Var1 ~ value, fill = 0, fun.aggregate = length)
which gives us this
Var1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0
2 2 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0
3 3 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0
4 4 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
5 5 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
