I have to perform a nonlinear multiple regression with data that looks like the following:
ID Customer Country Industry Machine-type Service hours
1 A China mass A1 120
2 B Europe customized A2 400
3 C US mass A1 60
4 D Rus mass A3 250
5 A China mass A2 480
6 B Europe customized A1 300
7 C US mass A4 250
8 D Rus customized A2 260
9 A China Customized A2 310
10 B Europe mass A1 110
11 C US Customized A4 40
12 D Rus customized A2 80
Dependent variable: Service hours
Independent variables: Customer, Country, Industry, Machine type
I did a linear regression, but because the assumption of linearity does not hold I have to perform a nonlinear regression.
I know nonlinear regression can be done with the nls function. How do I add the categorical variables to the nonlinear regression so that I get the statistical summary in R?
Table after adding dummies:
ID Customer.a Customer.b Customer.c Customer.d Country.China Country.Europe Country.Rus Country.US Industry.customized industry.Customized Industry.mass Machine type.A1 Machine type.A2 Machine type.A3 Service hours
1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 120
2 0 1 0 0 0 1 0 0 1 0 0 0 1 0 400
3 0 0 1 0 0 0 0 1 0 0 1 0 0 1 60
4 0 0 0 1 0 0 1 0 0 0 1 1 0 0 250
5 1 0 0 0 1 0 0 0 1 0 0 0 0 1 480
6 0 1 0 0 0 1 0 0 0 1 0 1 0 0 300
7 0 0 1 0 0 0 0 1 0 0 1 0 0 1 250
8 0 0 0 1 0 0 1 0 1 0 0 0 1 0 260
9 1 0 0 0 1 0 0 0 0 0 1 0 1 0 210
10 0 1 0 0 0 1 0 0 1 0 0 0 1 0 110
11 0 0 1 0 0 0 0 1 0 0 1 0 0 1 40
12 0 0 0 1 0 0 1 0 0 0 1 1 0 0 80
How you handle a categorical predictor depends on the number of levels it has.
For predictors that take only 2 values in the data (for example, gender recorded as male or female), you can simply represent them as a single binary (1/0) variable.
For predictors with more than 2 levels, use 1-of-k dummy encoding, where k is the number of levels the variable takes. If the model has an intercept, leave one of the k columns out as the baseline, otherwise the dummies are collinear with it. See the dummies package for useful functions!
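After this, you can fit the model. nls needs the parameters written out in the formula and a starting value for each of them:
nls(Service.hours ~ b0 + b1 * predictor1 + b2 * predictor2 + bN * predictorN, data = df, start = list(b0 = 0, b1 = 0, b2 = 0, bN = 0))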
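A minimal sketch of the encoding step in base R, assuming the question's table is loaded as a data frame named df (the name, and lower-casing Industry so that "Customized" and "customized" become one level, are assumptions and not part of the original post). model.matrix builds the dummy columns from factors and, because the formula has an intercept, drops one baseline level per factor.

```r
# Rebuild the question's table; values are copied from the post
df <- data.frame(
  Customer      = rep(c("A", "B", "C", "D"), times = 3),
  Country       = rep(c("China", "Europe", "US", "Rus"), times = 3),
  Industry      = tolower(c("mass", "customized", "mass", "mass",
                            "mass", "customized", "mass", "customized",
                            "Customized", "mass", "Customized", "customized")),
  Machine.type  = c("A1", "A2", "A1", "A3", "A2", "A1",
                    "A4", "A2", "A2", "A1", "A4", "A2"),
  Service.hours = c(120, 400, 60, 250, 480, 300, 250, 260, 310, 110, 40, 80),
  stringsAsFactors = TRUE
)

# One 0/1 column per non-baseline level, plus the intercept column
X <- model.matrix(~ Industry + Machine.type, data = df)
colnames(X)
head(X)
```

In this particular table Customer and Country carry the same information (A is always China, B always Europe, and so on), so only one of the two can be included in a model at a time.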
I have a CSV file with 600k lines and 3 columns: the first contains a disease name, the second a gene, and the third a number. I have roughly 4k diseases and 16k genes, so the disease names and gene names are often repeated. The file looks something like this:
cholera xx45 12
Cancer xx65 1
cholera xx65 0
I would like to build a DTM (document-term matrix) in R. I've been trying to use the Corpus command from the tm library, but the corpus doesn't reduce the number of diseases and its size stays around 600k. I'd love to understand how to transform that file into a DTM.
I'm sorry for not being that precise, totally starting with computer science things as a bio guy :)
Cheers!
If you're not concerned with the number in the third column, then you can accomplish what I think you're trying to do using only the first two columns (gene and disease).
Example with some simulated data:
library(data.table)
# Create a table of 100k gene/disease pairs: ~6.5k possible gene names and 40 different diseases
# (the 10k simulated gene names are recycled to fill the 100k rows)
df <- data.frame(
  gene = sapply(1:10000, function(x) paste(c(sample(LETTERS, size = 2), sample(10, size = 1)), collapse = "")),
  disease = sample(40, size = 100000, replace = TRUE)
)
table(df) creates a large matrix, nGenes rows long and nDiseases columns wide. Looking at just the first 6 rows (because it's so large and sparse):
head(table(df))
disease
gene 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
AB10 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
AB2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 1 0 1 0 1
AB3 0 1 0 0 2 1 1 0 0 1 0 0 0 0 0 2 1 0 0 1 0 0 1 0 3 0 1
AB4 0 0 1 0 0 1 0 2 1 1 0 1 0 0 1 1 1 1 0 1 0 2 0 0 0 1 1
AB5 0 1 0 1 0 0 2 2 0 1 1 1 0 1 0 0 2 0 0 0 0 0 0 1 1 1 0
AB6 0 0 2 0 2 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0
disease
gene 28 29 30 31 32 33 34 35 36 37 38 39 40
AB10 0 0 1 2 1 0 0 1 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 0 0 0
AB3 0 0 1 1 1 0 0 0 0 0 1 1 0
AB4 0 0 1 2 1 1 1 1 1 2 0 3 1
AB5 0 2 1 1 0 0 3 4 0 1 1 0 2
AB6 0 0 0 0 0 0 0 1 0 0 0 0 0
Alternatively, you can exclude the counts of 0 and only include combinations that actually exist. Easy aggregation can be done with data.table, e.g. (continuing from the above example)
library(data.table)
dt <- data.table(df)
dt[, .N, by=list(gene, disease)]
which gives a frequency table like the following:
gene disease N
1: HA5 20 2
2: RF9 10 3
3: SD8 40 2
4: JA7 35 4
5: MJ2 1 2
---
75872: FR10 26 1
75873: IC5 40 1
75874: IU2 20 1
75875: IG5 13 1
75876: DW7 21 1
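If the number in the third column should be the cell value rather than a co-occurrence count, base R's xtabs can sum it per disease/gene pair. A sketch using the three sample lines from the question; the column names disease, gene, and n are made up for illustration.

```r
# The three sample lines from the question, with hypothetical column names
d <- data.frame(
  disease = c("cholera", "Cancer", "cholera"),
  gene    = c("xx45", "xx65", "xx65"),
  n       = c(12, 1, 0)
)

# Rows are diseases ("documents"), columns are genes ("terms");
# each cell holds the sum of n for that pair, and absent pairs are 0
dtm <- xtabs(n ~ disease + gene, data = d)
dtm
```

For the full 600k-line file, xtabs(n ~ disease + gene, data = d, sparse = TRUE) returns a sparse matrix instead (this needs the Matrix package), which avoids storing all the zeros.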
I'm having an odd problem while trying to set up a design matrix to do downstream pairwise differential expression analysis on RNAseq data.
For the design matrix, I have both the donor information and each condition:
group<-factor(y$samples$group) #44 samples, 6 different conditions
sample<-factor(y$samples$samples) #44 samples, 11 different donors.
design<- model.matrix(~0+sample+group)
head(design)
Donor11.CD8 Donor12.CD8 Donor14.CD8 Donor15.CD8 Donor16.CD8
1 1 0 0 0 0
2 1 0 0 0 0
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 1 0 0 0 0
Donor17.CD8 Donor18.CD8 Donor19.CD8 Donor20.CD8 Donor3.CD8
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
Donor4.CD8 Treatment2 Treatment3 Treatment4 Treatment5
1 0 0 0 0 0
2 0 0 0 0 1
3 0 0 0 1 0
4 0 0 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
Treatment6
1 1
2 0
3 0
4 0
5 0
6 0
>
The issue is that I seem to be losing a condition (treatment 1) when I form the design matrix, and I'm not sure why.
Many thanks, in advance, for your help!
That's not a problem. Treatment 1 is indicated by all 0s in the treatment columns of the design matrix. Look at row 4: zero for Treatments 2 through 6, which means it is Treatment 1. A separate Treatment 1 column would be redundant, because the donor columns already sum to 1 in every row, so model.matrix drops the first treatment level. This is called a "treatment contrast" because the coefficients in the model contrast the named treatment against the "base" level, which in this case is Treatment 1.
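The same behaviour can be reproduced on a toy design. This is only a sketch: the donor and treatment factors below are invented and much smaller than the 44-sample data in the question.

```r
# Two hypothetical donors, each measured under three hypothetical treatments
donor     <- factor(rep(c("D1", "D2"), each = 3))
treatment <- factor(rep(c("T1", "T2", "T3"), times = 2))

# Same formula shape as in the question: no intercept, donor first, then treatment
design <- model.matrix(~ 0 + donor + treatment)
colnames(design)

# Row 1 is donor D1 under T1: both treatment columns are 0
design[1, ]
```

The first factor in the formula keeps a column for every level because there is no intercept, while every later factor loses its first level. Calling relevel() on the treatment factor before building the matrix changes which level acts as the baseline.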
I'm trying to do discrete choice modeling on the data below. Basically, 30 customers each have 16 different choices of pizza. They can choose more than one type of pizza, and the ones they choose are indicated by the choice variable.
pizza cust choice pan thin pineapple veggie sausage romano mozarella oz
1 1 Cust1 0 1 0 1 0 0 1 0 1
2 2 Cust1 1 0 1 1 0 0 0 0 0
3 3 Cust1 0 0 0 1 0 0 0 1 1
4 4 Cust1 1 0 1 1 0 0 0 0 0
5 5 Cust1 1 1 0 0 1 0 0 0 1
6 6 Cust1 0 0 1 0 1 0 1 0 0
7 7 Cust1 0 0 0 0 1 0 0 0 1
8 8 Cust1 1 0 1 0 1 0 0 1 0
9 9 Cust1 0 1 0 0 0 1 0 1 0
10 10 Cust1 1 0 1 0 0 1 0 0 1
11 11 Cust1 0 0 0 0 0 1 1 0 0
12 12 Cust1 0 0 1 0 0 1 0 0 1
13 13 Cust1 0 1 0 0 0 0 0 0 0
14 14 Cust1 1 0 1 0 0 0 0 1 1
15 15 Cust1 0 0 0 0 0 0 0 0 0
16 16 Cust1 0 0 1 0 0 0 1 0 1
17 1 Cust10 0 1 0 1 0 0 1 0 1
18 2 Cust10 0 0 1 1 0 0 0 0 0
19 3 Cust10 0 0 0 1 0 0 0 1 1
20 4 Cust10 0 0 1 1 0 0 0 0 0
I use the command below to transform my data. I tried making a few changes, like adding chid.var = "chid" and alt.levels = c(1:16). If I use both alt.levels and alt.var, it gives me an error saying pizza already exists and will be replaced. However, I get no error if I use only one of them.
pz <- mlogit.data(pizza,shape = "long",choice = "choice",
varying = 4:8, id = "cust", alt.var = "pizza")
Finally, when I use the mlogit command, I get this error.
mlogit(choice ~ pan + thin + pineapple + veggie + sausage + romano + mozarella + oz, pz)
Error in solve.default(H, g[!fixed]) :
system is computationally singular: reciprocal condition number = 8.23306e-19
This is my first post on stackoverflow. I visit this site very often and so far never needed to post as I found solutions already. I went through almost all similar posts like this one but in vain. I'm new to discrete choice modeling so I don't know if I'm making any fundamental mistake here.
Also, I'm not really sure what chid.var does.
I couldn't solve the mlogit error, but you can use the multinom function from the nnet package instead. It seems to work; I verified the result.
The dataset remains the same as shown in the question, so no transformation is needed.
library("nnet")
pizza_model <- multinom(choice ~ Price + I_cPan + I_cThin, data = pizza_all)
summary(pizza_model)
where choice is the categorical dependent variable you want to predict, and Price, I_cPan, and I_cThin are the independent variables. Below is the output:
Call:
multinom(formula = choice ~ Price + I_cPan + I_cThin, data = pizza_all)
Coefficients:
Values Std. Err.
(Intercept) 0.007192623 1.3298018
Price -0.149665357 0.1464976
I_cPan 0.098438084 0.3138538
I_cThin 0.624447867 0.2637110
Residual Deviance: 553.8519
AIC: 561.8519
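Because choice is 0/1 here, the same kind of model can also be fitted as an ordinary logistic regression with glm from base R, which gives a quick cross-check that needs no extra package. The sketch below runs on simulated data: pizza_sim and the coefficient values used to generate it are invented, and only the column names follow the output above.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the pizza data (not the real pizza_all)
pizza_sim <- data.frame(
  Price   = runif(n, 5, 15),
  I_cPan  = rbinom(n, 1, 0.5),
  I_cThin = rbinom(n, 1, 0.5)
)

# Invented "true" effects, used only to generate a 0/1 choice
eta <- 0.5 - 0.15 * pizza_sim$Price + 0.6 * pizza_sim$I_cThin
pizza_sim$choice <- rbinom(n, 1, plogis(eta))

# Binary logit: one coefficient per predictor plus an intercept
fit <- glm(choice ~ Price + I_cPan + I_cThin, data = pizza_sim, family = binomial)
summary(fit)$coefficients
```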
I have a data frame with the below structure from which I am looking to transpose the variables into categorical. Intent is to find the weighted mix of the variables.
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
data
Expected output:
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
I tried using a combination of ifelse and cut, but just couldn't produce the output.
Any ideas on how I can do this?
TIA
You may use
model.matrix(~ subject + weight + sex:test - 1, data)
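A short sketch of what that call returns on the question's data, rebuilt here as a data frame so the block runs on its own. The interaction sex:test has no main effects next to it in the formula, so model.matrix creates one indicator column for every sex/test combination.

```r
# The six rows from the question
data <- data.frame(
  subject = 1:6,
  weight  = c(2, 3, 2, 4, 3, 2),
  sex     = c("M", "F", "F", "M", "F", "F"),
  test    = c("control", "cond1", "cond2", "control", "control", "control")
)

mm <- model.matrix(~ subject + weight + sex:test - 1, data)
colnames(mm)

# Each row has exactly one 1 among the six sex:test columns
rowSums(mm[, -(1:2)])
```

The indicator columns are named like sexM:testcontrol rather than control_M, so rename them with colnames(mm) <- ... if the exact labels from the expected output matter.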
I think model.matrix is most natural here (see @Julius' answer), but here's an alternative:
library(data.table)
setDT(data)
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight cond1_F cond1_M cond2_F cond2_M control_F control_M
1: 1 2 0 0 0 0 0 1
2: 2 3 1 0 0 0 0 0
3: 3 2 0 0 1 0 0 0
4: 4 4 0 0 0 0 0 1
5: 5 3 0 0 0 0 1 0
6: 6 2 0 0 0 0 1 0
To get the columns in the "right" order (with the control first), set factor levels before casting:
data[, test := relevel(factor(test), "control")]
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
subject weight control_F control_M cond1_F cond1_M cond2_F cond2_M
1: 1 2 0 1 0 0 0 0
2: 2 3 0 0 1 0 0 0
3: 3 2 0 0 0 0 1 0
4: 4 4 0 1 0 0 0 0
5: 5 3 1 0 0 0 0 0
6: 6 2 1 0 0 0 0 0
(Note: reshape2's dcast isn't so good here, since its drop option applies to both rows and cols.)
I have been using the textmatrix() function for a while to create DTMs which I can further use for LSI.
dirLSA<-function(dir){
dtm<-textmatrix(dir)
return(lsa(dtm))
}
textdir<-"C:/RProjects/docs"
dirLSA(textdir)
> tm
$matrix
D1 D2 D3 D4 D5 D6 D7 D8 D9
1. 000 2 0 0 0 0 0 0 0 0
2. 20 1 0 0 1 0 0 1 0 0
3. 200 1 0 0 0 0 0 0 0 0
4. 2014 1 0 0 0 0 0 0 0 0
5. 2015 1 0 0 0 0 0 0 0 0
6. 27 1 0 0 0 0 0 0 1 0
7. 30 1 0 0 0 1 0 1 0 0
8. 31 1 0 2 0 0 0 0 0 0
9. 40 1 0 0 0 0 0 0 0 0
10. 45 1 0 0 0 0 0 0 0 0
11. 500 1 0 0 0 0 0 1 0 0
12. 600 1 0 0 0 0 0 0 0 0
728. bias 0 0 0 2 0 0 0 0 0
729. biased 0 0 0 1 0 0 0 0 0
730. called 0 0 0 1 0 0 0 0 0
731. calm 0 0 0 1 0 0 0 0 0
732. cause 0 0 0 1 0 0 0 0 0
733. chauhan 0 0 0 2 0 0 0 0 0
734. chief 0 0 0 8 0 0 1 0 0
textmatrix() is a function which takes a directory (folder path) and returns document-wise term frequencies. These are used in further analysis such as Latent Semantic Indexing/Analysis (LSI/LSA).
However, I have run into a new problem: I now have tweet data in batch files (~500000 tweets/batch) and I want to carry out similar operations on this data.
I have code modules to clean up my data, and I want to pass the cleaned tweets directly to the LSI function. The problem is that textmatrix() only accepts files and directories, not text that is already in memory.
I tried looking at other packages and code snippets, but that didn't get me any further. Is there any way I can create a line-term matrix of sorts?
I tried calling table(tokenize(cleanline[i])) in a loop, but it won't add new columns for words that are not already in the matrix. Any workaround?
Update: I just tried this:
a<-table(tokenize(cleanline[10]))
b<-table(tokenize(cleanline[12]))
df1<-data.frame(a)
df1
df2<-data.frame(b)
df2
merge(df1,df2, all=TRUE)
I got this:
> df1
Var1 Freq
1 6
2 " 2
3 and 1
4 home 1
5 mabe 1
6 School 1
7 then 1
8 xbox 1
> b<-table(tokenize(cleanline[12]))
> df2<-data.frame(b)
> df2
Var1 Freq
1 13
2 " 2
3 BillGates 1
4 Come 1
5 help 1
6 Mac 1
7 make 1
8 Microsoft 1
9 please 1
10 Project 1
11 really 1
12 version 1
13 wish 1
14 would 1
> merge(df1,df2)
Var1 Freq
1 " 2
> merge(df1,df2, all=TRUE)
Var1 Freq
1 6
2 13
3 " 2
4 and 1
5 home 1
6 mabe 1
7 School 1
8 then 1
9 xbox 1
10 BillGates 1
11 Come 1
12 help 1
13 Mac 1
14 make 1
15 Microsoft 1
16 please 1
17 Project 1
18 really 1
19 version 1
20 wish 1
21 would 1
I think I'm close.
Try something like this
ll <- list(df1,df2)
dtm <- xtabs(Freq ~ ., data = do.call("rbind", ll))
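Note that a term-by-line matrix needs a column saying which line each count came from; without one, the counts are simply summed per term across all lines. A sketch of that idea, assuming small stand-in frames for df1 and df2 (the words and counts are illustrative) and a hypothetical doc column:

```r
# Stand-ins for the per-line frequency tables from the question
df1 <- data.frame(Var1 = c("and", "home", "xbox"), Freq = c(1, 1, 1))
df2 <- data.frame(Var1 = c("help", "home", "Mac"), Freq = c(1, 2, 1))
ll <- list(df1, df2)

# Tag every frame with the line it came from (doc is a hypothetical column name)
tagged <- lapply(seq_along(ll), function(i) {
  d <- ll[[i]]
  d$doc <- paste0("Line", i)
  d
})
long <- do.call(rbind, tagged)

# Terms in rows, lines in columns; combinations that never occur become 0
dtm <- xtabs(Freq ~ Var1 + doc, data = long)
dtm
```

This stacks all the lines once instead of merging them one by one, so words that first appear in a later line get their own row automatically.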
Something that works for me:
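textLSA<-function(text){
  # Term counts for the first line; the column name records the line number
  df<-data.frame(table(tokenize(text[1])))
  colnames(df)[2]<-"Line 1"
  # Merge each remaining line on the term column, keeping terms from both sides
  for(i in seq_along(text)[-1]){
    a<-data.frame(table(tokenize(text[i])))
    colnames(a)[2]<-paste("Line",i)
    df<-merge(df,a, by="Var1", all=TRUE)
  }
  # Terms missing from a line come back as NA; treat them as zero counts
  df[is.na(df)]<-0
  dtm<-as.matrix(df[,-1])
  rownames(dtm)<-df$Var1
  return(lsa(dtm))
}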
What do you think of this code?