Including squared predictors in model matrix - r

I have the following code
x <- c(1, 2, 3)
y <- c(2, 3, 4)
z <- c(3, 4, 5)
df <- data.frame(x, y, z)
model.matrix(x ~ .^4, df)
This gives me a model matrix with predictors $y, z$, and $y:z$. However, I also want y^2 and z^2, and want to use a solution that uses "$.$", since I have lots of other predictors beyond $y$ and $z$. What's the best way to approach this?

Try this:
> x <- c(1, 2, 3)
> y <- c(2, 3, 4)
> z <- c(3, 4, 5)
> df <- data.frame(x, y, z)
>
> #Assuming that your 1st column is the response variable, then I excluded it to have
> #just the independent variables as a new data.frame called df.2
> df.2=df[,-1]
> model.matrix(x ~ .^4+I(df.2^2), df)
(Intercept) y z I(df.2^2)y I(df.2^2)z y:z
1 1 2 3 4 9 6
2 1 3 4 9 16 12
3 1 4 5 16 25 20
attr(,"assign")
[1] 0 1 2 3 3 4

Related

Writing a summation formula using variables from multiple observations

I am trying to create a new variable for each observation using the following formula:
Index = ∑(BAj / DISTANCEij)
where:
j = focal observation; i= other observation
Basically, I'm taking the focal individual (i) and finding the euclidean distance between it and another point and dividing the other points BA by that distance. Do that for all the other points and then sum them all and repeat all of this for each point.
Here is some sample data:
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
print(df)
ID BA x y
1 1 3 0 1
2 2 5 2 3
3 3 6 3 4
4 4 9 7 9
Currently, I've extracted out vectors and created a formula to calculate part of the formula shown here:
vec1 <- df[1, ]
vec2 <- df[2, ]
dist <- function(vec1, vec2) vec1$BA/sqrt((vec2$x - vec1$x)^2 +
(vec2$y - vec1$y)^2)
My question is how do I repeat this with the x and y values for vec2 changing for each new other point with vec1 remaining the same and then sum them all together?
We may loop over the row sequence, extract the data and apply the dist function
library(dplyr)
library(purrr)
df %>%
mutate(dist_out = map_dbl(row_number(), ~ {
othr <- cur_data()[-.x,]
cur <- cur_data()[.x, ]
sum(dist(cur, othr))
}))
-output
ID BA x y dist_out
1 1 3 0 1 2.049983
2 2 5 2 3 5.943485
3 3 6 3 4 6.593897
4 4 9 7 9 3.404545
Here are two base R ways.
1. for loop
ID <- 1:4
BA <- c(3, 5, 6, 9)
x <- c(0, 2, 3, 7)
y <- c(1, 3, 4, 9)
df <- data.frame(ID, BA, x, y)
n <- nrow(df)
d <- dist(df[c("x", "y")], upper = TRUE)
d <- as.matrix(d)
Index <- numeric(n)
for(j in seq_len(n)) {
d_j <- d[-j, j, drop = TRUE]
Index[j] <- sum(df$BA[j]/d_j)
}
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)
2. sapply loop
Index <- sapply(seq_len(n), \(j) sum(df$BA[j]/d[-j, j, drop = TRUE]))
Index
#> [1] 2.049983 5.943485 6.593897 3.404545
Created on 2022-08-18 by the reprex package (v2.0.1)

polynomial curve fitting for future frames

orignally i have the data in the form
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
using the following code i convert it to
input<-file('stdin', 'r')
mn <- read.table(input, nrows = 1, as.is = TRUE)
DF <- read.table(input, skip = 0)
m <- mn[[1]]
n <- mn[[2]]
x1<- DF[[1]]
y1<-DF[[2]]
x2<-DF[[3]]
y2<-DF[[4]]
fit1<-lm(x1 ~ poly(y1, 3, raw=TRUE))
fit2<-lm(x2 ~ poly(y2, 3, raw=TRUE))
`
m = the current datas length
n = number of points in the future to be predicted
x1= 1 5 9 13
x2= 2 6 10 14
i would like to predict all the values of x1 y1 x2 y2 for n values after the given values.
i tried to fit with lm but i am not sure how to proceed with all the values of data points to be predicted in the future missing and just getting the coefficients in terms of the other would not be sufficient as all of them need to be predicted
In order to get that to run without error one needs to use skip =1 on the second read.table:
mn <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", nrows = 1, as.is = TRUE)
DF <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", skip = 1)
m <- mn[[1]]
n <- mn[[2]]
x1<- DF[[1]]
y1<-DF[[2]]
x2<-DF[[3]]
y2<-DF[[4]]
fit1<-lm(x1 ~ poly(y1, 3, raw=TRUE))
fit2<-lm(x2 ~ poly(y2, 3, raw=TRUE))
So those input data are exactly colinear and you would NOT expect there to be any useful information in either the quadratic or cubic terms. That is in fact recognized by the lm machinery:
> fit1
Call:
lm(formula = x1 ~ poly(y1, 3, raw = TRUE))
Coefficients:
(Intercept) poly(y1, 3, raw = TRUE)1 poly(y1, 3, raw = TRUE)2
-1 1 0
poly(y1, 3, raw = TRUE)3
0
Generally one should be using the data argument
> fit3<-lm(x1 ~ poly(y1, 3, raw=TRUE), DF)
>
> fit4<-lm(x2 ~ poly(y2, 3, raw=TRUE), DF)
But in this case it doesn't seem to matter:
> predict(fit1, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit3, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit2, newdata = list(y1=25:28))
1 2 3 4
3 7 11 15
The way to get predictions is to supply a newdata argument that can be coerced into a dataframe. Using a list value that has items of the same length (in this case a single argument) will succeed.

Error in isoMDS(d): zero or negative distance between objects

I'm trying to do a nonmetric MDS (R version 3.3.3) using the isoMDS function in the MASS package and I get this error:
Error in isoMDS(d): zero or negative distance between objects 1 and 2
Here's an example of what I'm doing:
# LOAD LIBRARY
library(MASS)
# CREATE FAKE DATA
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
x
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 4 7
4 1 2 5 9
# EUCLIDEAN DISTANCE BETWEEN ROWS 1, 2, 3 and 4
d <- dist(x)
d
1 2 3
2 0.000000
3 3.162278 3.162278
4 5.385165 5.385165 2.236068
# NMDS
fit <- isoMDS(d)
Error in isoMDS(d) : distance négative ou nulle entre les objets 1 et 2
I don't know if there's a way of getting around this issue or if I'm doing something wrong. I understand that objects 1 and 2 are identical and that that's probably why the distance is negative or equals to zero. I found out that my question was a "FAQ", but one of the only answers I found is this:
Short answer: you cannot compare distances including NAs, so there is no
way to find a monotone mapping of distances. If the data really are identical for two rows, you can easily drop one of
them whilst doing MDS, and then assign the position found for one to the
other.
So, my next questions are: how do you drop rows whilst doing MDS, and is there any other way to perform a NMDS?
Any help would be greatly appreciated!
The dist function computes the distances between the rows of a data matrix.
Your a, b, c, and d vectors are the columns of the x matrix, not the rows.
A simple solution is to transpose x:
library(MASS)
a <- c(1, 1, 1, 1)
b <- c(2, 2, 2, 2)
c <- c(3, 3, 4, 5)
d <- c(4, 4, 7, 9)
x <- data.frame(a, b, c, d)
# Calculate distance between the columns
d <- dist(t(x))
# NMDS
fit <- isoMDS(d)
# initial value 0.000000
# final value 0.000000
# converged
fit
# $points
# [,1] [,2]
# a -4.594429 0.4509513
# b -2.770312 -0.3638885
# c 1.098884 -0.3114594
# d 6.265857 0.2243966
#
# $stress
# [1] 7.976932e-15
I hope it can help you.
As you noted, you have identical rows.
You can omit identical rows when you first create the distance matrix
d <- dist(x[-1,])
Then continue as normal
fit <- isoMDS(d)
Alternatively, you could try the vegan::metaMDS function:
library(vegan)
#> This is vegan 2.5-3
x <- data.frame(a = c(1, 1, 1, 1),
b = c(2, 2, 2, 2),
c = c(3, 3, 4, 5),
d = c(4, 4, 7, 9))
# The warnings are expected for such a small dataset
fit <- vegan::metaMDS(comm = dist(x))
#> ... Procrustes: rmse 0.09543314 max resid 0.108719
#> *** No convergence -- monoMDS stopping criteria:
#> 17: stress < smin
#> 3: scale factor of the gradient < sfgrmin
#> Warning in vegan::metaMDS(comm = dist(x)): stress is (nearly) zero: you may
#> have insufficient data
ordiplot(fit, type = "text")
Variables/columns "a" and "b" (1 and 2) get the same coordinates.
Similarly, using the smacof::mds function:
library(smacof)
fit2 <- smacof::mds(delta = dist(x), type = "ordinal")
fit2$conf
#> D1 D2
#> 1 0.5742535 0.007220978 # 1 & 2 get the same coordinates
#> 2 0.5742535 0.007220978
#> 3 -0.2749314 -0.034928060
#> 4 -0.8735757 0.020486105

Sorting by successive vectors in R [duplicate]

I have a vector x, that I would like to sort based on the order of values in vector y. The two vectors are not of the same length.
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
The expected result would be:
[1] 4 4 4 2 2 1 3 3 3
what about this one
x[order(match(x,y))]
You could convert x into an ordered factor:
x.factor <- factor(x, levels = y, ordered=TRUE)
sort(x)
sort(x.factor)
Obviously, changing your numbers into factors can radically change the way code downstream reacts to x. But since you didn't give us any context about what happens next, I thought I would suggest this as an option.
How about?:
rep(y,table(x)[as.character(y)])
(Ian's is probably still better)
In case you need to get order on "y" no matter if it's numbers or characters:
x[order(ordered(x, levels = y))]
4 4 4 2 2 1 3 3 3
By steps:
a <- ordered(x, levels = y) # Create ordered factor from "x" upon order in "y".
[1] 2 2 3 4 1 4 4 3 3
Levels: 4 < 2 < 1 < 3
b <- order(a) # Define "x" order that match to order in "y".
[1] 4 6 7 1 2 5 3 8 9
x[b] # Reorder "x" according to order in "y".
[1] 4 4 4 2 2 1 3 3 3
[Edit: Clearly Ian has the right approach, but I will leave this in for posterity.]
You can do this without loops by indexing on your y vector. Add an incrementing numeric value to y and merge them:
y <- data.frame(index=1:length(y), x=y)
x <- data.frame(x=x)
x <- merge(x,y)
x <- x[order(x$index),"x"]
x
[1] 4 4 4 2 2 1 3 3 3
x <- c(2, 2, 3, 4, 1, 4, 4, 3, 3)
y <- c(4, 2, 1, 3)
for(i in y) { z <- c(z, rep(i, sum(x==i))) }
The result in z: 4 4 4 2 2 1 3 3 3
The important steps:
for(i in y) -- Loops over the elements of interest.
z <- c(z, ...) -- Concatenates each subexpression in turn
rep(i, sum(x==i)) -- Repeats i (the current element of interest) sum(x==i) times (the number of times we found i in x).
Also you can use sqldf and do it by a join function in sql likes the following:
library(sqldf)
x <- data.frame(x = c(2, 2, 3, 4, 1, 4, 4, 3, 3))
y <- data.frame(y = c(4, 2, 1, 3))
result <- sqldf("SELECT x.x FROM y JOIN x on y.y = x.x")
ordered_x <- result[[1]]

weighted table data frame with plyr

I'm working with survey data consisting of integer value responses for multiple questions (y1, y2, y3, ...) and a weighted count assigned to each respondent, like this:
foo <- data.frame(wcount = c(10, 1, 2, 3), # weighted counts
y1 = sample(1:5, 4, replace=T), # numeric responses
y2 = sample(1:5, 4, replace=T), #
y3 = sample(1:5, 4, replace=T)) #
>foo
wcount y1 y2 y3
1 10 5 5 5
2 1 1 4 4
3 2 1 2 5
4 3 2 5 3
and I'd like to transform this into a consolidated data frame version of a weighted table, with the first column representing the response values, and the next 3 columns representing the weighted counts. This can be done explicitly by column using:
library(Hmisc)
ty1 <- wtd.table(foo$y1, foo$wcount)
ty2 <- wtd.table(foo$y2, foo$wcount)
ty3 <- wtd.table(foo$y3, foo$wcount)
bar <- merge(ty1, ty2, all=T, by="x")
bar <- merge(bar, ty3, all=T, by="x")
names(bar) <- c("x", "ty1", "ty2", "ty3")
bar[is.na(bar)]<-0
>bar
x ty1 ty2 ty3
1 1 3 0 0
2 2 3 2 0
3 3 0 0 3
4 4 0 1 1
5 5 10 13 12
I suspect there is a way of automating this with plyr and numcolwise or ddply. For instance, the following comes close, but I'm not sure what else is needed to finish the job:
library(plyr)
bar2 <- numcolwise(wtd.table)(foo[c("y1","y2","y3")], foo$wcount)
>bar2
y1 y2 y3
1 1, 2, 5 2, 4, 5 3, 4, 5
2 3, 3, 10 2, 1, 13 3, 1, 12
Any thoughts?
Not a plyr answer, but this struck me as a reshaping/aggregating problem that could be tackled straightforwardly using functions from package reshape2.
First, melt the dataset, making a column of the response value which can be named x (the unique values in y1-y3).
library(reshape2)
dat2 = melt(foo, id.var = "wcount", value.name = "x")
Now this can be cast back wide with dcast, using sum as the aggregation function. This puts y1-y3 back as columns with the sum of wcount for each value of x.
# Cast back wide using the values within y1-y3 as response values
# and filling with the sum of "wcount"
dcast(dat2, x ~ variable, value.var = "wcount", fun = sum)
Giving
x y1 y2 y3
1 1 3 0 0
2 2 3 2 0
3 3 0 0 3
4 4 0 1 1
5 5 10 13 12
you are describing a survey data set that uses replicate weights. see http://asdfree.com/ for many, many examples but for recs, do something like this:
library(survey)
x <- read.csv( "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv" )
rw <- read.csv( "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public_repweights.csv" )
y <- merge( x , rw )
# create a replicate-weighted survey design object
z <- svrepdesign( data = y , weights = ~NWEIGHT , repweights = "brr_weight_[0-9]" )
# now run all of your analyses on the object `z` ..
# see the `survey` package homepage for details
# distribution
svymean( ~ factor( BASEHEAT ) , z )
# mean
svymean( ~ TOTHSQFT , z )

Resources