How to add dummy variables in R - r

I know there are several questions about this topic, but none of them seem to answer my specific question.
I have a dataset with five independent variables and I want to add two dummy variables to my regression in R. I have my data in Excel and importing the dataset is not a problem (I use read.csv2). Now, when I want to see my dummy variables, D1 and D2, I can't. I can see all the other variables. Both dummy variables take the values 0 and 1 throughout the dataset.
I can easily see a summary of all my data, including D1 and D2 (with median, mean, etc.), and I can call each of the 5 variables separately without any problems at all, but I can't do that with D1 and D2.
> str(tilskuere)
'data.frame': 180 obs. of 7 variables:
$ ATT : int 3166 4315 7123 6575 7895 7323 3579 9571 5345 6595 ...
$ PRICE : int 80 95 120 100 105 115 80 130 105 100 ...
$ viewers: int 41000 43000 56000 66000 157000 91000 51000 30000 36000 72000 ...
$ CB1 : int 10 10 5 2 7 2 3 1 10 1 ...
$ CB2 : num 1 1 1 0 0.33 ...
$ D1 : int 0 0 0 1 0 0 0 0 0 0 ...
$ D2 : int 1 0 0 0 0 1 1 0 0 0 ...
> summary(tilskuere)
> mean(ATT)
[1] 6856.372
> mean(D1)
Error in mean(D1) : object 'D1' not found
To sum up: I can run regressions in R without D1 and D2, but I can't include them as dummy variables because R can't find them when I run the regression. R simply says "object D1 not found."
I hope someone can help. Thank you in advance.
Kind regards
Mikkel

I added the material in your comment to the text and added some linefeeds, and it is now clear that the problem is that columns of a data frame are not first-class objects in R: they have to be reached through the data frame. Try:
mean(tilskuere$D1)
You can see what objects are in your workspace with:
ls()
You appear to have an object named ATT in your workspace as well as a length-180 column by the same name in the object named tilskuere.
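The point about columns can be illustrated with a small sketch. The data frame below is a made-up stand-in for tilskuere (the values are invented, only the column names follow the question). It shows three ways to reach a column that lives inside a data frame, including the data argument of lm(), which is what a regression with D1 and D2 would need.

```r
# Hypothetical stand-in for the asker's data frame; the values are invented.
tilskuere <- data.frame(
  ATT   = c(3166, 4315, 7123, 6575, 7895, 7323, 3579, 9571),
  PRICE = c(80, 95, 120, 100, 105, 115, 80, 130),
  D1    = c(0, 0, 0, 1, 0, 0, 1, 0),
  D2    = c(1, 0, 0, 0, 0, 1, 1, 0)
)

# D1 is a column of tilskuere, not an object in the workspace,
# so it has to be reached through the data frame.
m1 <- mean(tilskuere$D1)          # explicit column access
m2 <- with(tilskuere, mean(D1))   # evaluate the expression inside the data frame

# Modelling functions look up variables in the data frame given via `data`.
fit <- lm(ATT ~ PRICE + D1 + D2, data = tilskuere)
print(coef(fit))
```

This is also consistent with mean(ATT) working in the question: a separate object named ATT happened to exist in the workspace.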

Related

How to divide my dataset for using permanova

Hello everyone :) I have a data set with individuals that belong to 5 different species (one column), and their presence/absence in different landscapes (7 other columns).
'data.frame': 1212 obs. of 10 variables:
$ latitude : num 34.5 34.7 34.7 34.8 34.8 ...
$ longitude : num 127 127 127 127 127 ...
$ species : Factor w/ 5 levels "Bufo gargarizans",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Built : int 0 0 0 0 0 0 0 0 0 0 ...
$ Agriculture: int 1 1 0 1 0 0 1 0 0 0 ...
$ Forested : int 0 0 1 0 0 0 0 1 1 1 ...
$ Grassland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Wetland : int 0 0 0 0 0 0 0 0 0 0 ...
$ Bare : int 0 0 0 0 1 0 0 0 0 0 ...
$ Water : int 1 0 0 0 0 1 0 0 0 0 ...
I am trying to use PERMANOVA followed by a Tukey test to see whether the species use the landscape differently. My supervisor did it in SPSS and it worked very well, so I have to do it in R.
I saw that I need 2 csv files to run PERMANOVA in R, but I have only one. Here is the script that I found on the internet and want to use for my analysis.
library(vegan)
data(dune)
data(dune.env)
# default test by terms
adonis2(dune ~ Management*A1, data = dune.env)
In my case, I should have 1 dataframe with species and 1 dataframe with environmental variables, if I understand correctly.
However, my presence/absence values are inside the environmental categories (see the str of my table above). So if I create 1 dataframe with species only, I will not have numerical values in the dataframe with species.
So I am totally lost. I don't know how to proceed. Can someone help me please? Thank you!
I will split my answer into two parts. The one where I know what I am talking about and one where I am brainstorming ;)
Here is the first part on how to split your data into two data.frames
library(dplyr)  # provides %>%, tibble() and select()
# Set seed
set.seed(1312)
# Some sample data
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=500,replace=T),
Built=sample(c(0,1),size=500,replace=T),
Agriculture=sample(c(0,1),size=500,replace=T),
Forested=sample(c(0,1),size=500,replace=T),
Grassland=sample(c(0,1),size=500,replace=T),
Wetland=sample(c(0,1),size=500,replace=T),
Bare=sample(c(0,1),size=500,replace=T),
Water=sample(c(0,1),size=500,replace=T))
# Split data
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
Now part two: as it says in ?adonis2, the first argument of adonis2 is a formula whose left-hand side must be a community data matrix or a dissimilarity matrix.
Even though I am not sure whether it makes sense, I went wild and followed the instructions :D
df2_dist <- dist(df2)
vegan::adonis2(df2_dist~species, data=df1)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
vegan::adonis2(formula = d2 ~ species, data = d1)
Df SumOfSqs R2 F Pr(>F)
species 4 5.333 0.15802 0.7038 0.88
Residual 15 28.417 0.84198
Total 19 33.750 1.00000
Of course this might be nonsense in terms of content as I took a purely technical approach here, but maybe it helps you to shape your data as required
So I made this code :
setwd("C:/Users/Johan/Downloads/memoire Johanna (1)/memoire Johanna")
xdata=read.csv(file="all10_reduce_focal species1.csv", header=T, sep=",")
str(xdata)
xdata$species <- as.factor(xdata$species)
sample_data <- dplyr::tibble(species=sample(LETTERS[1:5],size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),
Agriculture=sample(c(0,1),size=1212,replace=T),
Forested=sample(c(0,1),size=1212,replace=T),
Grassland=sample(c(0,1),size=1212,replace=T),
Wetland=sample(c(0,1),size=1212,replace=T),
Bare=sample(c(0,1),size=1212,replace=T),
Water=sample(c(0,1),size=1212,replace=T))
library(dplyr)
df1 <- sample_data %>%
dplyr::select(species)
df2 <- sample_data %>%
dplyr::select(Built:Water)
df1_dist <- dist(df1)
vegan::adonis2(df1_dist~Built+Agriculture+Grassland+Forested+Wetland+Bare+Water, data=df2)
Species should be the response, as I am trying to see the effect of the landscape on the species. When I do this I get:
Error in vegdist(as.matrix(lhs), method = method, ...) : input data must be numeric
It's because the "species" variable has only characters. So I changed to make it numeric :
sample_data <- dplyr::tibble(species=sample(c(1:5),size=1212,replace=T),
Built=sample(c(0,1),size=1212,replace=T),......
But the result I got is different from the SPSS result: I don't have any significant variable (in SPSS, Built, Agriculture, Forested and Water are significant).
I think my code is wrong
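One thing that stands out in the code above is that the models are fitted to sample_data, which is filled with random draws from sample(), and not to xdata, so the output cannot be expected to match the SPSS result. A minimal sketch of the same split applied to a data frame shaped like xdata follows. The small data frame here is invented for illustration, the Euclidean dist() simply follows the answer above and is not a recommendation, and species stays on the right-hand side because the left-hand side has to be numeric (which is what the vegdist error is about). The adonis2() call only runs if the vegan package is installed.

```r
# Invented stand-in with the same column layout as xdata (species + 7 landscape columns).
xdata <- data.frame(
  species     = factor(rep(c("sp_A", "sp_B", "sp_C"), each = 4)),
  Built       = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
  Agriculture = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0),
  Forested    = c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1),
  Grassland   = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  Wetland     = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  Bare        = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  Water       = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0)
)

# Numeric presence/absence columns form the response side ...
landscape <- xdata[, c("Built", "Agriculture", "Forested", "Grassland",
                       "Wetland", "Bare", "Water")]
# ... and the grouping factor goes into the data frame of explanatory variables.
groups <- xdata["species"]

landscape_dist <- dist(landscape)  # Euclidean by default, as in the answer

# Run the permutation test only when vegan is available.
if (requireNamespace("vegan", quietly = TRUE)) {
  print(vegan::adonis2(landscape_dist ~ species, data = groups))
}
```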

How do I change the structure of an R data table

I've merged a handful of data sets, all imported from SPSS, CSV, or Excel files, into one large data table. For the most part I can use all the variables I want to run tests on, but every once in a while their structure needs to be changed. As an example, here's my data set:
> str(gadd.us)
'data.frame': 467 obs. of 381 variables:
$ nidaid : Nmnl. item chr "45-D11150341" "45-D11180321" "45-D11220022" "45-D11240432" ...
$ id : Nmnl. item chr "D11150341" "D11180321" "D11220022" "D11240432" ...
$ agew1 : Itvl. item num 17 17 15 18 17 15 15 18 20 18 ...
$ nagew1 : Itvl. item num 17.3 17.2 15.7 18.2 17.2 ...
$ nsex : Nmnl. item w/ 2 labels for 0,1 num 1 1 0 0 0 0 1 1 1 1 ...
and when I focus on just one variable I get something like this
> str(gadd.us$wasiblckw2)
Itvl. item + ms.v. num [1:467] 70 48 40 60 37 46 67 55 45 61 ...
> str(gadd.us$nsex)
Nmnl. item w/ 2 labels for 0,1 num [1:467] 1 1 0 0 0 0 1 1 1 1 ...
So when I try to create a histogram I get an error...
> hist(gadd.us$wasiblckw2)
Error in hist.default(gadd.us$wasiblckw2) :
some 'x' not counted; maybe 'breaks' do not span range of 'x'
If I change this variable using as.numeric() it works just fine. Any idea what's going on here?
If you import your data from SPSS, SAS, or Stata using haven (library(haven)), haven stores variable formats in an attribute: format.spss, format.sas, or format.stata. This can sometimes cause problems for your code. haven has several functions to remove such formats and labels:
gadd.us <- haven::zap_formats(gadd.us)
gadd.us <- haven::zap_labels(gadd.us)
You may also want to try some other zap_ functions.
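The question notes that as.numeric() makes hist() work. A minimal base-R sketch of one likely reason: as.numeric() returns a plain vector without the extra attributes that an import package attached. The attribute names and values below are invented for illustration and are not taken from gadd.us.

```r
# A numeric vector carrying import metadata (hypothetical attributes).
x <- structure(c(70, 48, 40, 60, 37, 46),
               label = "WASI block design", format.spss = "F8.2")
names(attributes(x))      # the metadata travels with the vector

x_plain <- as.numeric(x)  # drops all attributes, keeps the values
attributes(x_plain)       # NULL

# The same idea applied to every numeric column of a data frame.
df <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
df$score <- structure(c(1.5, 2.5, 3.5), label = "Score")
df[] <- lapply(df, function(col) if (is.numeric(col)) as.numeric(col) else col)
```

Unlike the zap_ functions, this throws away every attribute at once, so it is only a reasonable choice when the labels are no longer needed.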

C5.0 decision tree - c50 code called exit with value 1

I am getting the following error
c50 code called exit with value 1
I am doing this on the titanic data available from Kaggle
# Importing datasets
train <- read.csv("train.csv", sep=",")
# this is the structure
str(train)
Output :-
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Then I tried using C5.0 dtree
# Trying with C5.0 decision tree
library(C50)
#C5.0 models require a factor outcome otherwise error
train$Survived <- factor(train$Survived)
new_model <- C5.0(train[-2],train$Survived)
So running the above lines gives me this error
c50 code called exit with value 1
I'm not able to figure out what's going wrong? I was using similar code on different dataset and it was working fine. Any ideas about how can I debug my code?
-Thanks
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
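Regarding your problem: first off, I think you meant to write the following (for a data frame, train[-2] drops the same column, so this is only the more explicit form):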
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked Columns. These two factors have an empty character as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
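The same fix can be applied without assuming that the empty string is the first level. The sketch below uses a small invented data frame (not the Kaggle file) and plain base R to find every factor column that has "" among its levels and rename that level. The replacement label "missing" follows the answer above.

```r
# Invented data: two factor columns contain the empty string as a level.
df <- data.frame(
  Cabin    = factor(c("", "C85", "", "E46")),
  Embarked = factor(c("S", "C", "", "Q")),
  Age      = c(22, 38, 26, 35)
)

# TRUE for factor columns that have "" among their levels.
has_blank <- vapply(df, function(col) is.factor(col) && "" %in% levels(col),
                    logical(1))
blank_cols <- names(df)[has_blank]

# Rename the empty level wherever it sits in the level vector.
for (nm in blank_cols) {
  levels(df[[nm]])[levels(df[[nm]]) == ""] <- "missing"
}
```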
Just in case: you can take a look at the error with
summary(new_model)
This error also occurs when there are special characters in the name of a variable. For example, one will get this error if there is a "я" (a letter from the Russian alphabet) in the name of a variable.
Here is what worked finally:-
Got this idea after reading this post
library(C50)
test$Survived <- NA
combinedData <- rbind(train,test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2],new_train$Survived)
new_model_predict <- predict(new_model,new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition behind this is that in this way both the train and test data set will have consistent factor levels.
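A sketch of a different way to get the same consistency, not what the post above did: instead of binding the two sets together, the test factor can be rebuilt on the training levels with factor(..., levels = ...). The tiny data frames are invented for illustration.

```r
train <- data.frame(Embarked = factor(c("C", "Q", "S", "S")))
test  <- data.frame(Embarked = factor(c("S", "C", "C")))  # level "Q" never occurs here

levels(test$Embarked)   # differs from the training levels

# Re-create the test factor using the training levels.
test$Embarked <- factor(test$Embarked, levels = levels(train$Embarked))
```

Values in the test set that do not occur among the training levels become NA with this approach, so it is worth checking anyNA() afterwards.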
I had the same error, but I was using a numeric dataset without missing values.
After a long time, I discovered that my dataset had a predictor column called "outcome". C5.0Control uses this name as its default label, and this was the cause of the error :'(
My solution was to change the column name. Another way would be to create a C5.0Control object with a different value for the label argument and pass this object to C5.0 via its control argument.
I also struggled for some hours with the same problem (return code "1"), both when building a model and when predicting.
With the hint from Marco's answer I have written a small function to remove all factor levels equal to "" in a data frame or vector, see the code below. However, since R does not pass arguments by reference, you have to use the result of the function (it cannot change the original data frame):
removeBlankLevelsInDataFrame <- function(dataframe) {
for (i in 1:ncol(dataframe)) {
levels <- levels(dataframe[, i])
if (!is.null(levels) && levels[1] == "") {
levels(dataframe[,i])[1] = "?"
}
}
dataframe
}
removeBlankLevelsInVector <- function(vector) {
levels <- levels(vector)
if (!is.null(levels) && levels[1] == "") {
levels(vector)[1] = "?"
}
vector
}
Call of the functions may look like this:
trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)
However, it seems that C50 has a similar problem with character columns containing an empty cell, so you will probably have to extend this to handle character attributes as well if you have some.
I also got the same error, but it was because of some illegal characters in the factor levels of one of the columns.
I used make.names function and corrected the factor levels:
levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))
Then the problem was resolved.
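A small sketch of what make.names does to awkward levels, using an invented factor. One assumption added here that is not in the post above: unique = TRUE is passed so that two different levels cannot end up with the same cleaned name, because assigning duplicate level names would merge those levels.

```r
# Invented factor with levels that are not syntactically valid R names.
risk <- factor(c("low risk", "high-risk", "low risk", "5+"))
levels(risk)

# Replace invalid characters; unique = TRUE keeps distinct levels distinct.
levels(risk) <- make.names(levels(risk), unique = TRUE)
levels(risk)
```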

Reading a csv file using ffdf and subsetting it successfully

I have been researching a way to efficiently extract information from large csv data sets using R. Many seem to recommend the package ff. I was successful in reading the data sets but am now running into problem trying to subset it.
The largest data set contains over 650,000 rows and 1005 columns. Not all columns contain the same data types. Viewed as a dataframe, the structure would look like this:
'data.frame': 5 obs. of 1005 variables:
$ SAMPLING_EVENT_ID : Factor w/ 5 levels "S6230404","S6252242",..: 2 1 3 4 5
$ LATITUDE : num 24.4 24.5 24.5 24.5 24.5
$ LONGITUDE : num -81.9 -81.9 -82 -82 -82
$ YEAR : int 2010 2010 2010 2010 2010
$ MONTH : int 4 3 10 10 10
$ DAY : int 97 88 299 298 300
$ TIME : num 9 10 10 11.58 9.58
$ COUNTRY : Factor w/ 1 level "United_States": 1 1 1 1 1
$ STATE_PROVINCE : Factor w/ 1 level "Florida": 1 1 1 1 1
$ COUNT_TYPE : Factor w/ 2 levels "P21","P22": 2 2 1 1 1
$ EFFORT_HRS : num 6 2 7 6 3.5
$ EFFORT_DISTANCE_KM : num 48.28 8.05 0 0 0
$ EFFORT_AREA_HA : int 0 0 0 0 0
$ OBSERVER_ID : Factor w/ 3 levels "obs132426","obs58643",..: 3 2 1 1 1
$ NUMBER_OBSERVERS : Factor w/ 2 levels "?","1": 2 1 2 2 2
$ Zenaida_macroura : int 0 0 1 0 0
All other variables being similar to this last one i.e. various species of bird.
Here is the code I used to "successfully" read the csv:
B2010 <- read.table.ffdf(x = NULL, "filePath&Name", nrows = -1, first.rows = 50000, next.rows = 50000)
Trying to learn about ffdf output, I entered command lines such as dim(B2010), str(B2010), ls(B2010), etc. dim(B2010) resulted in the appropriate number of rows but only one column (a string per record of the values separated by commas), and ls(B2010) output [1] "physical" "row.names" "virtual" instead of the usual list of variables.
I am not sure how to handle this type of output to be able to extract, say, the rows where STATE_PROVINCE == "California". How do I tell B2010 what the variables are? I think I need to look at this differently but need some of your help to figure it out.
The ultimate goal for me is to subset a bunch of csv data sets (since I have one per year) and put the results back together as dataframe for various analysis.
Thanks,
Joe
To subset an ffdf, use the ffbase package.
As in
require(ffbase)
x <- subset(B2010, B2010$STATE_PROVINCE == "California")
I finally found the solution to getting the ffdf variable names and types properly read and accessible for subsetting:
B2010 <- read.csv.ffdf (file = "filepath/name", colClasses = c("factor", "numeric", "numeric", "integer", "integer", "integer", "numeric", rep("factor",998)), first.rows = 10000, next.rows = 50000, nrows = -1)
This took forever to read but seemed to have worked i.e. I was able to create a subset of the data. Next step: to save the subset back to a "normal" dataframe and/or to a csv.
According to the help page at ?read.table.ffdf, you should be using read.csv.ffdf(...). Then go to the page cited by Brandon.
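For the stated end goal (subset one csv per year and put the results back together as a data frame), here is a minimal base-R sketch. It deliberately swaps ff for plain read.csv, so it only applies when each yearly file fits in memory on its own. The file names, the helper read_state, and the tiny data sets are invented for illustration.

```r
# Write two tiny per-year files to a temporary directory (invented data).
dir <- tempfile("birds_")
dir.create(dir)
write.csv(data.frame(YEAR = 2010L, STATE_PROVINCE = c("Florida", "California"),
                     Zenaida_macroura = c(0L, 1L)),
          file.path(dir, "B2010.csv"), row.names = FALSE)
write.csv(data.frame(YEAR = 2011L, STATE_PROVINCE = c("California", "Texas"),
                     Zenaida_macroura = c(2L, 0L)),
          file.path(dir, "B2011.csv"), row.names = FALSE)

# Hypothetical helper: read one file and keep only the rows for one state.
read_state <- function(path, state) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  d[d$STATE_PROVINCE == state, , drop = FALSE]
}

# Subset each yearly file, then stack the pieces into one data frame.
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
california <- do.call(rbind, lapply(files, read_state, state = "California"))
```

Because each file is subset before the pieces are combined, only the rows of interest are kept across years.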

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that will be used by my internal client for analysing their nutrient intake data. For each dataset that they have, they will re-run the R program.
A key part of the program is a nonlinear mixed method analysis, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they will be analysing children or adults, the number of age band dummies in the formula will differ, although the reference age band dummy will always be the youngest. I think that the number of possible age bands ranges from 4 to about 6, so it's not a large range. It is a trivial matter to count the number of age band dummies, if I need to condition based on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age band dummies in the model? The other variables in the model are constant across datasets.
I've already got the program set up for automatically generating the relevant dummies and dropping those that aren't used in the current analysis. The program after the model is pretty well set up as automated as well. I'm just stuck on what to do with automating the two lme4-based analyses and function. These will only be run once for each dataset.
I've been wondering whether I need to write a function to contain all the lme4 related code, or whether there was an easier way. I would appreciate some pointers on how to do this. It took me one day to work out how to get the function working that I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R related automation questions on the site and I didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to suggestion in the comments about using a string. That sounds like an easy way forward for me, except that I don't then know how to apply the string content in a function as each dummy variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3, AgeBand4, and another analysis might have AgeBand5 as well as those 3? If this was VBA, I would just create subfunctions based on the number of age dummy variables. I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so I have a series of while loops?
This is the section of code I wish to automate, the number of AgeBand dummy variables differs depending on the dataset that will be analysed (children vs. adults). This is using the dataset that I have been testing a SAS to R translation on, but the real datasets will be very similar. It is necessary to have a nonlinear model as this is the basis of the peer-reviewed published method that I am working off.
library(lme4)
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
data=Male.AddSugar,
weights=Replicates)
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
namevec=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9"),
function.arg=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9",
"AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
"Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
Race1,Race3,Weekend,IntakeDay)
~ b0|RespondentID,
data=Male.AddSugar,
start=c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
weights=Replicates
)
These will be the required changes between the datasets:
the number of fixed effect coefficients that I need to pull out of the lmer fit will change.
in the function, the expression, namevec, and function.arg parts will change.
in nlmer, the model statement and start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is incorrectly shown as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected this to a plain factor.
This assumes that you have one variable, "ageband", which is a factor with levels: AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want to be ignored. Since factors are generally treated by R regression functions using the lowest lexicographic value as the reference level, you would get your correct level chosen automagically. You pick your desired levels by creating a dataset that has only the desired levels.
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- subset(inpdat, ageband %in% agelevs)
res <- your_fun(dsub)  # e.g. nlmer(y ~ ageband + <other-parameters>, data=dsub, ...)
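A small sketch of what the factor approach produces, using an invented inpdat. model.matrix() shows the dummy columns that a regression function would build from the factor. droplevels() is an addition here, used so that the unused level is removed from the factor itself as well as from the rows.

```r
# Invented data: one factor column instead of separate dummy columns.
inpdat <- data.frame(
  ageband = factor(c("AgeBand2", "AgeBand3", "AgeBand4",
                     "AgeBand5", "AgeBand2", "AgeBand4")),
  y       = c(1.2, 2.3, 3.1, 4.8, 1.0, 3.3)
)

agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- droplevels(subset(inpdat, ageband %in% agelevs))

# The first level (AgeBand2) becomes the reference; the rest become dummies.
mm <- model.matrix(y ~ ageband, data = dsub)
colnames(mm)
```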
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding to inefficient habits enforced by training in SPSS or other clunky macro processors.
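If the separate dummy columns are kept, the pieces that change between datasets can still be generated from a character vector of dummy names. The sketch below is one possible way to do that in base R, not a tested replacement for the code in the question: make_model_parts is a hypothetical helper, the variable names follow the question, and the lme4 calls are left out so the block runs without the package.

```r
# Hypothetical helper: builds the model pieces from the dummy names in use.
make_model_parts <- function(age_dummies,
                             other_terms = c("Race1", "Race3", "Weekend", "IntakeDay")) {
  fixed   <- c(age_dummies, other_terms)
  b_names <- paste0("b", seq(0, length(fixed)))   # b0 is the intercept

  # Formula for the lmer() fit that supplies starting values.
  lmer_formula <- reformulate(c(fixed, "(1 | RespondentID)"), response = "BoxCoxXY")

  # Linear predictor b0 + b1*x1 + ... as text, then turned into a deriv() function.
  rhs <- paste(c("b0", paste0(b_names[-1], "*", fixed)), collapse = " + ")
  MD  <- deriv(parse(text = rhs), namevec = b_names,
               function.arg = c(b_names, fixed))

  # Three-part formula for nlmer(); nlmer() must be able to find MD when it runs.
  nlmer_text <- sprintf("BoxCoxXY ~ MD(%s) ~ b0 | RespondentID",
                        paste(c(b_names, fixed), collapse = ", "))

  list(b_names = b_names, lmer_formula = lmer_formula, MD = MD,
       nlmer_formula = as.formula(nlmer_text))
}

parts <- make_model_parts(c("AgeBand4", "AgeBand5", "AgeBand6"))
parts$lmer_formula
parts$nlmer_formula
# Starting values would then be the lmer fixed effects renamed to b0, b1, ...:
# start <- setNames(lme4::fixef(fit), parts$b_names)
```

Calling the helper with a longer or shorter vector of age dummies changes the formula, the deriv() function, and the parameter names together, which covers the three items listed in the question.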

Resources