I am wondering if there is a way to automatically generate random variables which are correlated, even across families (for example a Binomial with a Gaussian), not only variables belonging to the same family. The variables need to belong to the Gaussian, Poisson or Binomial family. Below is a not very automatic way to create correlated variables within the same family.
x1 <- rbinom(100, 1, 0.5)
index <- sort(sample(1:100, 10, replace = FALSE))
x2 <- x1
# flip 10 randomly chosen values so x2 is highly correlated with x1
for (i in 1:length(index)) {
  if (x2[index[i]] == 0) {
    x2[index[i]] <- 1
  } else {
    x2[index[i]] <- 0
  }
}
library(MASS)  # for mvrnorm()
normal <- as.data.frame(mvrnorm(n = 100, mu = c(1, 2),
                                Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2, byrow = TRUE)))
x3 <- normal$V1
x4 <- normal$V2
# correlated Poisson variables built from a shared component
x5.1 <- rpois(100, 0.5)
x5.2 <- rpois(100, 2)
x5.3 <- rpois(100, 1)
x5 <- x5.1 + x5.2
x6 <- x5.1 + x5.3
x <- as.data.frame(cbind(x1, x2, x3, x4, x5, x6))
cor(x)
My goal is to create a dataset with mixed-type correlated variables.
There are several packages that can generate or simulate data in the way you need, and beyond into more complex correlation structures. The simstudy package provides genCorData(), genCorGen(), and addCorGen() to simulate Gaussian data with a given correlation structure and also to produce correlated data from other distributions. Here is some correlated data simulated from a Poisson distribution, for example, taken from the package's vignette on correlated data:
library(simstudy)

l <- c(8, 10, 12)  # lambda for each new variable
dx <- genCorGen(1000, nvars = 3, params1 = l,
                dist = "poisson", rho = .3,
                corstr = "cs", wide = TRUE)
dx
## id V1 V2 V3
## 1: 1 5 16 13
## 2: 2 9 9 6
## 3: 3 7 11 18
## 4: 4 11 14 12
## 5: 5 10 8 15
## ---
## 996: 996 3 2 5
## 997: 997 6 14 11
## 998: 998 6 8 12
## 999: 999 10 12 11
## 1000: 1000 9 9 12
A related package that might be of interest is faux.
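If you specifically need correlations across families (say a Binomial with a Poisson), one generic approach, not tied to any of the packages above, is a Gaussian copula: draw correlated normals, turn them into correlated uniforms with pnorm(), and push each uniform through the quantile function of the target distribution. A minimal sketch (the margins and rho below are arbitrary example values, and the realised correlation is attenuated relative to the latent rho):
library(MASS)  # for mvrnorm()

n   <- 1000
rho <- 0.6
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)

latent <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)  # correlated normals
u <- pnorm(latent)                                 # correlated uniforms

b <- qbinom(u[, 1], size = 1, prob = 0.5)  # Bernoulli margin
p <- qpois(u[, 2], lambda = 2)             # Poisson margin

cor(b, p)  # positive, but somewhat below the latent rho of 0.6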
Originally I have the data in the form
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Using the following code I read it in and convert it:
input <- file('stdin', 'r')
mn <- read.table(input, nrows = 1, as.is = TRUE)
DF <- read.table(input, skip = 0)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
m = the current data's length
n = number of points in the future to be predicted
x1 = 1 5 9 13
x2 = 2 6 10 14
I would like to predict all the values of x1, y1, x2 and y2 for n points after the given values.
I tried to fit with lm, but I am not sure how to proceed when all the future data points are missing; just getting the coefficients of one variable in terms of another would not be sufficient, because all of them need to be predicted.
In order to get that to run without error one needs to use skip = 1 on the second read.table:
mn <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", nrows = 1, as.is = TRUE)
DF <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", skip = 1)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
So those input data are exactly collinear, and you would NOT expect there to be any useful information in either the quadratic or cubic terms. That is in fact recognized by the lm machinery:
> fit1
Call:
lm(formula = x1 ~ poly(y1, 3, raw = TRUE))
Coefficients:
(Intercept) poly(y1, 3, raw = TRUE)1 poly(y1, 3, raw = TRUE)2
-1 1 0
poly(y1, 3, raw = TRUE)3
0
Generally one should be using the data argument
> fit3<-lm(x1 ~ poly(y1, 3, raw=TRUE), DF)
>
> fit4<-lm(x2 ~ poly(y2, 3, raw=TRUE), DF)
But in this case it doesn't seem to matter:
> predict(fit1, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit3, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit2, newdata = list(y2=25:28))
1 2 3 4
24 25 26 27
The way to get predictions is to supply a newdata argument that can be coerced into a data frame. A list whose items have the same length (in this case a single element) will succeed.
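To actually produce the n future values the question asks for, here is one possible sketch (assuming, as the sample data suggest, that y1 and y2 keep increasing with the same constant spacing, and with n set to an example value since the header row is read in as character):
n <- 4  # number of future points (example value)

# extend y1 and y2 by n equally spaced steps beyond their last observed values
step1 <- diff(tail(y1, 2))
step2 <- diff(tail(y2, 2))
new_y1 <- tail(y1, 1) + step1 * seq_len(n)
new_y2 <- tail(y2, 1) + step2 * seq_len(n)

# predict x1 and x2 from the fitted cubic polynomials
data.frame(y1 = new_y1,
           x1 = predict(fit1, newdata = data.frame(y1 = new_y1)),
           y2 = new_y2,
           x2 = predict(fit2, newdata = data.frame(y2 = new_y2)))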
I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above), I guess because those columns are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I try the following code to write the totals into a 4th row (summing only the 2nd through 4th columns)
df[4,] <- colSums(df[c(2:4)])
then I get an error that the replacement isn't the same size as the data.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and either leave the 1st column's total blank or print "Total" there.
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
  Type = c("alpha", "beta", "gamma"),
  CA = c(1, 4, 3),
  AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
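If you want the original layout back, with Type as an ordinary column instead of row names, the margin table can be converted to a data frame (a small follow-up sketch):
out <- addmargins(m, FUN = c(Total = sum), quiet = TRUE)
data.frame(Type = rownames(out), as.data.frame.matrix(out), row.names = NULL)
# reproduces the table in the question, with a Total row and a Total column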
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And its output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I completely misunderstood the question. You are indeed looking for the sum of rows, and I gave the sum of columns.
To do rows instead:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function returns a logical value, and sapply applies it to each column and returns those logical values as a vector, which is valid for selecting columns.
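For illustration, using a small data frame shaped like the one in the question:
df <- data.frame(Type = c("alpha", "beta", "gamma"),
                 CA = c(1, 4, 3),
                 AR = c(5, 9, 8),
                 stringsAsFactors = FALSE)

sapply(df, is.numeric)        # the logical vector used to select columns
#  Type    CA    AR
# FALSE  TRUE  TRUE

df[sapply(df, is.numeric)]    # just the numeric columns, ready for rowSums()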
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ",stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
I think this should solve your problem:
x <- data.frame(type = c('alpha', 'beta', 'gamma'), x = c(1, 2, 3), y = c(4, 5, 6))
x[, 'Total'] <- rowSums(x[, c(2:3)])
x <- rbind(x, c(type = c('Total'), c(colSums(x[, c(2:4)]))))
library(tidyverse)
df <- data.frame(
  Type = c("alpha", "beta", "gamma"),
  CA = c(1, 4, 3),
  AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22
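Since the tidyverse is already loaded, here is a dplyr-based sketch (assuming the df above) that builds the full table with both a Total column and a Total row:
df_tot <- df %>% mutate(Total = CA + AR)   # row totals

df_tot %>%
  bind_rows(summarise(df_tot, Type = "Total",
                      across(where(is.numeric), sum)))   # append a column-total row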
I'm working with survey data consisting of integer value responses for multiple questions (y1, y2, y3, ...) and a weighted count assigned to each respondent, like this:
foo <- data.frame(wcount = c(10, 1, 2, 3), # weighted counts
y1 = sample(1:5, 4, replace=T), # numeric responses
y2 = sample(1:5, 4, replace=T), #
y3 = sample(1:5, 4, replace=T)) #
>foo
wcount y1 y2 y3
1 10 5 5 5
2 1 1 4 4
3 2 1 2 5
4 3 2 5 3
and I'd like to transform this into a consolidated data frame version of a weighted table, with the first column holding the response values and the next 3 columns holding the weighted counts. This can be done explicitly, column by column, using:
library(Hmisc)
ty1 <- wtd.table(foo$y1, foo$wcount)
ty2 <- wtd.table(foo$y2, foo$wcount)
ty3 <- wtd.table(foo$y3, foo$wcount)
bar <- merge(ty1, ty2, all=T, by="x")
bar <- merge(bar, ty3, all=T, by="x")
names(bar) <- c("x", "ty1", "ty2", "ty3")
bar[is.na(bar)]<-0
>bar
x ty1 ty2 ty3
1 1 3 0 0
2 2 3 2 0
3 3 0 0 3
4 4 0 1 1
5 5 10 13 12
I suspect there is a way of automating this with plyr and numcolwise or ddply. For instance, the following comes close, but I'm not sure what else is needed to finish the job:
library(plyr)
bar2 <- numcolwise(wtd.table)(foo[c("y1","y2","y3")], foo$wcount)
>bar2
y1 y2 y3
1 1, 2, 5 2, 4, 5 3, 4, 5
2 3, 3, 10 2, 1, 13 3, 1, 12
Any thoughts?
Not a plyr answer, but this struck me as a reshaping/aggregating problem that could be tackled straightforwardly using functions from package reshape2.
First, melt the dataset, putting the response values (the unique values in y1-y3) into a single column, which can be named x.
library(reshape2)
dat2 = melt(foo, id.var = "wcount", value.name = "x")
Now this can be cast back wide with dcast, using sum as the aggregation function. This puts y1-y3 back as columns with the sum of wcount for each value of x.
# Cast back wide using the values within y1-y3 as response values
# and filling with the sum of "wcount"
dcast(dat2, x ~ variable, value.var = "wcount", fun = sum)
Giving
x y1 y2 y3
1 1 3 0 0
2 2 3 2 0
3 3 0 0 3
4 4 0 1 1
5 5 10 13 12
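A base-R route over the same melted data is xtabs(), which likewise sums the weight over the cross-classification (a sketch; as.data.frame.matrix() just turns the resulting contingency table into a data frame):
# sum wcount for each combination of response value x and question variable
tab <- xtabs(wcount ~ x + variable, data = dat2)
as.data.frame.matrix(tab)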
You are describing a survey data set that uses replicate weights. See http://asdfree.com/ for many, many examples, but for RECS, do something like this:
library(survey)
x <- read.csv( "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv" )
rw <- read.csv( "http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public_repweights.csv" )
y <- merge( x , rw )
# create a replicate-weighted survey design object
z <- svrepdesign( data = y , weights = ~NWEIGHT , repweights = "brr_weight_[0-9]" )
# now run all of your analyses on the object `z` ..
# see the `survey` package homepage for details
# distribution
svymean( ~ factor( BASEHEAT ) , z )
# mean
svymean( ~ TOTHSQFT , z )
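Since the question was about weighted counts, note that the same design object also gives weighted frequency tables via svytable() (a brief sketch):
# weighted counts rather than proportions
svytable( ~ factor( BASEHEAT ) , z )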
I have the following code
x <- c(1, 2, 3)
y <- c(2, 3, 4)
z <- c(3, 4, 5)
df <- data.frame(x, y, z)
model.matrix(x ~ .^4, df)
This gives me a model matrix with predictors y, z, and y:z. However, I also want y^2 and z^2, and want to use a solution that uses ".", since I have lots of other predictors beyond y and z. What's the best way to approach this?
Try this:
> x <- c(1, 2, 3)
> y <- c(2, 3, 4)
> z <- c(3, 4, 5)
> df <- data.frame(x, y, z)
>
> # Assuming that your 1st column is the response variable, I excluded it to
> # have just the independent variables as a new data.frame called df.2
> df.2=df[,-1]
> model.matrix(x ~ .^4+I(df.2^2), df)
(Intercept) y z I(df.2^2)y I(df.2^2)z y:z
1 1 2 3 4 9 6
2 1 3 4 9 16 12
3 1 4 5 16 25 20
attr(,"assign")
[1] 0 1 2 3 3 4
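A more programmatic sketch (one option among several): build the I(.^2) terms from the column names with reformulate(), so nothing has to be typed per predictor and "." still expands to all remaining columns:
preds <- setdiff(names(df), "x")            # every predictor except the response
sq_terms <- paste0("I(", preds, "^2)")      # "I(y^2)" "I(z^2)"
fml <- reformulate(c(".^4", sq_terms), response = "x")
fml
# x ~ .^4 + I(y^2) + I(z^2)
model.matrix(fml, df)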