Originally I have the data in the form
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Using the following code I convert it into the variables below:
input<-file('stdin', 'r')
mn <- read.table(input, nrows = 1, as.is = TRUE)
DF <- read.table(input, skip = 0)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
m = the current data's length
n = number of points in the future to be predicted
x1= 1 5 9 13
x2= 2 6 10 14
I would like to predict all the values of x1, y1, x2, y2 for n values after the given values.
I tried to fit with lm, but I am not sure how to proceed when all of the future data points to be predicted are missing; just getting the coefficients of one variable in terms of another would not be sufficient, since all of them need to be predicted.
In order to get that to run without error one needs to use skip = 1 on the second read.table:
mn <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", nrows = 1, as.is = TRUE)
DF <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", skip = 1)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
So those input data are exactly collinear, and you would NOT expect there to be any useful information in either the quadratic or cubic terms. That is in fact recognized by the lm machinery:
> fit1
Call:
lm(formula = x1 ~ poly(y1, 3, raw = TRUE))
Coefficients:
(Intercept) poly(y1, 3, raw = TRUE)1 poly(y1, 3, raw = TRUE)2
-1 1 0
poly(y1, 3, raw = TRUE)3
0
Generally one should be using the data argument:
> fit3<-lm(x1 ~ poly(y1, 3, raw=TRUE), DF)
>
> fit4<-lm(x2 ~ poly(y2, 3, raw=TRUE), DF)
But in this case it doesn't seem to matter:
> predict(fit1, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit3, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit2, newdata = list(y1=25:28))
1 2 3 4
3 7 11 15
The way to get predictions is to supply a newdata argument that can be coerced into a data frame. Using a list whose items all have the same length (in this case a single item) will succeed.
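To tie this back to the original question, here is a minimal sketch of predicting n steps ahead for both series. It assumes n is the numeric count read from the header line and that the y columns keep advancing by their observed step of 4; neither assumption is stated in the question.

# Minimal sketch (assumptions: n is numeric, y1/y2 keep increasing by 4)
step <- 4
future_y1 <- seq(max(y1) + step, by = step, length.out = n)
future_y2 <- seq(max(y2) + step, by = step, length.out = n)
future_x1 <- predict(fit1, newdata = list(y1 = future_y1))
future_x2 <- predict(fit2, newdata = list(y2 = future_y2))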
Related
I simulated some data that I wanted to split into a list of data frames based on id, but it seems that split() is not working properly.
set.seed(323)
#simulate some data
tsfunc2 <- function() {
  x1 = rnorm(25, mean = 3, sd = 1)
  x2.sample = rnorm(5, mean = 2, sd = 0.5)
  x2 = rep(x2.sample, each = 5)
  mu = rnorm(25, mean = 10, sd = 2)
  y = as.numeric(mu + x1 + x2)
  data.frame(id = rep(1:5, each = 5), time = 1:5, x1 = x1, x2 = x2, y = y)
}
set.seed(63)
#create a dataset in which the simulated data are randomly sampled in order to create imbalanced panel data
fd <- function() {
  df <- tsfunc2()[sample(nrow(tsfunc2()), 20), ]
  ds <- df[with(df, order(id, time)), ]
  return(ds)
}
set.seed(124)
split(fd(), fd()$id) #it seems that data are not properly split based on id (e.g., the first row of id2)
$`1`
id time x1 x2 y
1 1 1 1.614929 1.900059 13.43994
3 1 3 2.236970 1.900059 14.49136
4 1 4 3.212306 1.900059 15.08736
$`2`
id time x1 x2 y
5 1 5 4.425538 1.900059 15.53696 #this row is supposed to be in id1
7 2 2 3.700229 2.027456 17.48522
8 2 3 2.770645 2.027456 15.20741
9 2 4 3.197094 2.027456 13.44979
$`3`
id time x1 x2 y
12 3 2 1.576201 1.658917 16.40684
13 3 3 2.594909 1.658917 14.34763
14 3 4 3.995387 1.658917 16.36730
15 3 5 3.958818 1.658917 15.37498
$`4`
id time x1 x2 y
16 4 1 3.918088 1.636148 15.48205
17 4 2 2.849030 1.636148 12.52288
18 4 3 1.776931 1.636148 12.54456
19 4 4 2.131176 1.636148 13.63235
20 4 5 1.957515 1.636148 15.55745
$`5`
id time x1 x2 y
21 5 1 1.896362 1.569048 12.54131
22 5 2 3.444185 1.569048 14.56303
23 5 3 2.795049 1.569048 12.67120
25 5 5 2.868678 1.569048 13.88765
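A likely explanation, not part of the original post: fd() draws a new random sample every time it is called, so split(fd(), fd()$id) splits one draw by the ids of a different draw, and the rows no longer line up. Evaluating fd() once and splitting that single object keeps them aligned:

# fd() resamples on each call; store one draw and split that
set.seed(124)
d <- fd()
split(d, d$id)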
I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above), I guess because the values are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I try the following code to write the sums into the 4th row (summing only the 2nd through 4th columns):
df[,4] = colSums(df[c(2:4)])
then I get an error that the replacement isn't the same size as the data.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and leave the 1st column total blank or allow me to print "Total"?
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
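If you also want the row totals and a "Total" row, a hedged follow-up (not part of the original answer) binds the numeric column sums back on:

# Add row totals, then append the column sums as a "Total" row
df$Total <- rowSums(df[, c("CA", "AR")])
totals <- numcolwise(sum)(df)              # sums of CA, AR and Total
rbind(df, cbind(Type = "Total", totals))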
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And it's output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I completely misunderstood the question. You are indeed looking for the sum of rows, and I gave the sum of columns.
To do rows instead:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function will return a logical value which is valid for selecting columns and sapply will return the logical values as a vector.
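For the example data (before the Total column is added), this is what the selector evaluates to; Type is the only non-numeric column:

sapply(df, is.numeric)
#  Type    CA    AR
# FALSE  TRUE  TRUE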
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ",stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
I think this should solve your problem:
x <- data.frame(type = c('alpha', 'beta', 'gamma'), x = c(1, 2, 3), y = c(4, 5, 6))
x[, 'Total'] <- rowSums(x[, c(2:3)])
x <- rbind(x, c(type = c('Total'), c(colSums(x[, c(2:4)]))))
library(tidyverse)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22
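Since tidyverse is loaded anyway, here is a hedged dplyr-style sketch (my addition, assuming dplyr >= 1.0 for across()/where()) that produces both the row totals and a Total row:

library(dplyr)
df_tot  <- mutate(df, Total = CA + AR)                        # row totals
tot_row <- summarise(df_tot, across(where(is.numeric), sum))  # column totals
tot_row$Type <- "Total"
bind_rows(df_tot, tot_row)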
I have the following code
x <- c(1, 2, 3)
y <- c(2, 3, 4)
z <- c(3, 4, 5)
df <- data.frame(x, y, z)
model.matrix(x ~ .^4, df)
This gives me a model matrix with predictors y, z, and y:z. However, I also want y^2 and z^2, and I want a solution that uses ".", since I have lots of other predictors beyond y and z. What's the best way to approach this?
Try this:
> x <- c(1, 2, 3)
> y <- c(2, 3, 4)
> z <- c(3, 4, 5)
> df <- data.frame(x, y, z)
>
> #Assuming that your 1st column is the response variable, then I excluded it to have
> #just the independent variables as a new data.frame called df.2
> df.2=df[,-1]
> model.matrix(x ~ .^4+I(df.2^2), df)
(Intercept) y z I(df.2^2)y I(df.2^2)z y:z
1 1 2 3 4 9 6
2 1 3 4 9 16 12
3 1 4 5 16 25 20
attr(,"assign")
[1] 0 1 2 3 3 4
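If hard-coding I(df.2^2) feels awkward, a hedged alternative (my addition) is to build the squared terms programmatically from the column names, still assuming the first column is the response:

# Build "I(y^2) + I(z^2) + ..." from every predictor column name
sq <- paste0("I(", names(df)[-1], "^2)", collapse = " + ")
f  <- as.formula(paste("x ~ .^4 +", sq))
model.matrix(f, df)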
I apologize in advance if this has been asked before, or if I have missed something obvious.
I have two data sets, 'olddata' and 'newdata'
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -5:5, z = -5:5)
I create a model from the old data, and want to predict values from the new data
mymodel <- lm(y ~ x+z, data = olddata)
predict.lm(mymodel, newdata)
However, I'd like to restrict the range of variables in 'newdata' to the range of variables on which the model was trained.
Of course I could do this:
newnewdata <- subset(newdata,
                     x < max(olddata$x) & x > min(olddata$x) &
                     z < max(olddata$z) & z > min(olddata$z))
But this gets intractable over many dimensions. Is there a less repetitive way to do this?
It seems that all the values in your newdata are already within the appropriate ranges, so there's nothing there to subset. If we expand the ranges of newdata:
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)
newdata
x z
1 -10 -10
2 -9 -9
3 -8 -8
4 -7 -7
5 -6 -6
6 -5 -5
7 -4 -4
8 -3 -3
9 -2 -2
10 -1 -1
11 0 0
12 1 1
13 2 2
14 3 3
15 4 4
16 5 5
17 6 6
18 7 7
19 8 8
20 9 9
21 10 10
Then all we need to do is identify the ranges for each variable of olddata and then loop through as many iterations of subset as newdata has columns:
ranges <- sapply(olddata, range, na.rm = TRUE)
for(i in 1:ncol(newdata)) {
col_name <- colnames(newdata)[i]
newdata <- subset(newdata,
newdata[,col_name] >= ranges[1, col_name] &
newdata[,col_name] <= ranges[2, col_name])
}
newdata
x z
4 -7 -7
5 -6 -6
6 -5 -5
7 -4 -4
8 -3 -3
9 -2 -2
10 -1 -1
11 0 0
12 1 1
13 2 2
14 3 3
15 4 4
16 5 5
17 6 6
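The same loop can also be wrapped in a small helper so it works for any pair of training/new data that share column names; this is a hedged sketch of that generalization (clip_to_ranges is my own name, not from the original answer):

clip_to_ranges <- function(new, old) {
  ranges <- sapply(old, range, na.rm = TRUE)
  for (col_name in intersect(colnames(new), colnames(ranges))) {
    new <- new[new[, col_name] >= ranges[1, col_name] &
               new[, col_name] <= ranges[2, col_name], , drop = FALSE]
  }
  new
}
clip_to_ranges(newdata, olddata)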
Here is an approach using the *apply family (using SchaunW's newdata):
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)
minmax <- sapply(olddata[-2], range)
newdata[apply(newdata, 1, function(a) all(a > minmax[1,] & a < minmax[2,])), ]
Some care is required because I have assumed the columns of olddata (after dropping the second column) are identical to newdata.
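A hedged variant (my addition) avoids that positional assumption by building the range matrix from newdata's column names:

# Match ranges by column name rather than by position
minmax <- sapply(olddata[names(newdata)], range)
newdata[apply(newdata, 1, function(a) all(a > minmax[1, ] & a < minmax[2, ])), ]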
Brevity comes at the cost of speed. After increasing nrow(newdata) to 2000 to emphasize the difference I found:
test replications elapsed relative user.self sys.self user.child sys.child
1 orizon() 100 2.193 27.759 2.191 0.002 0 0
2 SchaunW() 100 0.079 1.000 0.075 0.004 0 0
My guess at the main cause is that the repeated subsetting avoids testing rows against the remaining criteria once they have already been excluded.