Make dummy variables for a categorical variable - R

Let's say I have a data frame df as follows:
df <- data.frame(type = c("A","B","AB","O","O","B","A"))
Obviously there are 4 kinds of type. However, in my actual data, I don't know in advance how many distinct values a column like type contains. The number of dummy variables should be one less than the number of kinds in type, so in this example it should be 3. My expected output looks like this:
df <- data.frame(type = c("A","B","AB","O","O","B","A"),
                 A    = c(1,0,0,0,0,0,1),
                 B    = c(0,1,0,0,0,1,0),
                 AB   = c(0,0,1,0,0,0,0))
Here I used A, B and AB as dummy variables, but which levels of type are chosen doesn't matter. Even if I don't know the values in type or the number of kinds, I still want to be able to generate the dummy variables.

This is treatment contrasts coding. First, you need a factor variable.
## option 1: if you care about the order of the dummy variables
## the 1st level is the one dropped from the dummy variables;
## putting "O" first matches your example output with "A", "B" and "AB"
f <- factor(df$type, levels = c("O", "A", "B", "AB"))
## option 2: if you don't care, let R order the levels automatically
f <- factor(df$type)
Now, apply treatment contrasts coding.
## option 1 (recommended): using contr.treatment()
m <- contr.treatment(nlevels(f))[f, ]
## option 2 (less efficient): using model.matrix()
m <- model.matrix(~ f)[, -1]
Finally, give the result readable row and column names:
dimnames(m) <- list(seq_along(f), levels(f)[-1])
The resulting m looks like:
#   A B AB
# 1 1 0  0
# 2 0 1  0
# 3 0 0  1
# 4 0 0  0
# 5 0 0  0
# 6 0 1  0
# 7 1 0  0
This is a matrix. If you want a data frame, do data.frame(m).
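If you do this often, the steps above can be wrapped in a small helper. A minimal sketch (make_dummies is a made-up name, not from any package):
make_dummies <- function(x, levels = sort(unique(x))) {
  f <- factor(x, levels = levels)
  ## treatment contrasts: one column per non-reference level,
  ## indexed by the factor's integer codes
  m <- contr.treatment(nlevels(f))[f, , drop = FALSE]
  dimnames(m) <- list(seq_along(f), levels(f)[-1])
  data.frame(m, check.names = FALSE)
}
make_dummies(df$type, levels = c("O", "A", "B", "AB"))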

Related

Procedural way to generate signal combinations and their output in R

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal from; each value in these columns can range between 0 and 3.
sig1 sig2 sig3 sig4
   1    1    1    1
   1    1    1    1
   1    0    1    1
   1    0    1    1
   0    0    1    1
   0    1    2    2
   0    1    2    2
   0    1    1    2
   0    1    1    2
I want to generate composite signals using the state of each cell in the four columns, then see what each of the composite signals tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in a row equal 0. I could generate a new column that reads TRUE when that condition holds and FALSE otherwise, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get the unique combinations, then just turn those into dummy variables:
library(dplyr)
library(dummies)

# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))

# Paste the columns together into one combination key
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))

# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))

# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
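If you'd rather avoid the dummies dependency (the package has since been archived on CRAN), here is a base-R sketch of the same idea using model.matrix(), reusing the sample data above:
# Paste the columns, factor the result, and let model.matrix()
# expand one indicator column per observed combination
sig_tot <- factor(paste0(data$sig1, data$sig2, data$sig3))
dummies <- model.matrix(~ sig_tot - 1)    # -1 keeps every level
colnames(dummies) <- paste0("sig_", levels(sig_tot))
data <- cbind(data, dummies == 1)         # the comparison yields logical columns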

Running an interaction matrix between many variables

I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I am looking to see how often observations with a 'success' in one variable are matched with a success in another variable. In other words, if obs 1 has a success dummy in variable one, how often does it also have a success in variable 2, and so on for all the variables. I have found how to create a matrix table showing interactions when only two columns are involved, however I can't find anything involving many columns. Ideally I'd like to present this in an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX  1 1 1 1
XY  0 1 0 1
XZ  0 0 1 1
The output I'm hoping for would be:
Out A B C D
A   0 1 1 1
B     0 1 2
C       0 2
D         0
Showing the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but it seems these require data organized as two columns and cannot handle the data when it refers to many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks
Here's how to create a correlation matrix of indefinite size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
## count, for each pair of dummy columns, the rows where both equal 1
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)
m <- diag(x = 0, ncol(df) - 1)
## combn() enumerates the pairs in the order that lower.tri() is filled,
## so fill the lower triangle first, then transpose into the upper triangle
m[lower.tri(m)] <- res
m <- t(m)
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
m
#    A  B  C  D
# A  0  1  1  1
# B NA  0  1  2
# C NA NA  0  2
# D NA NA NA  0
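A fully vectorized alternative for the raw pair counts (a sketch, assuming the 0/1 dummies are every column except Dat): for a 0/1 matrix X, crossprod(X) computes t(X) %*% X, whose (i, j) entry is exactly the number of rows where columns i and j are both 1.
X <- as.matrix(df[, -1])    # drop the Dat column, keep the 0/1 dummies
counts <- crossprod(X)      # pairwise co-occurrence counts
diag(counts) <- 0           # zero the diagonal to match the output above
counts[lower.tri(counts)] <- NA
counts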

Imputing NAs for factor variables & converting them to dummy variables

I have a dataframe in which some of the variables (columns) are factors, and for some records I have missing values (NA).
Questions are:
What is the correct approach to replacing/imputing NAs in factor variables?
e.g. for VarX with 4 levels {"A", "B", "C", "D"}, what would be the preferred value to replace NAs with? A/B/C/D? Maybe just 0? Maybe impute with the level that is the majority for this variable's observations?
How to implement such imputation, based on answer to 1?
Once 1 & 2 are resolved, I'll use the following to create dummy variables for the factor variables:
is.fact <- sapply(my_data, is.factor)
my_data.dummy_vars <- dummy.data.frame(my_data[, is.fact], sep = ".")  # from the dummies package
Afterwards, how do I replace all the factor variables in my_data with the dummy variables I've extracted into my_data.dummy_vars?
My use case is to calculate principal components afterwards (which needs all variables to have numerical values, hence the dummy variables).
Thanks
Thanks for clarifying your intentions - that really helps! Here are my thoughts:
Imputing missing data is a non-trivial problem, and maybe a good question for the fine folks at Cross Validated. This is a problem that can only really be addressed in the context of the project, by you (the subject-matter expert). A big question is whether missing values are missing at random, or as a function of some other variables, and whether those are observed or unobserved. If you conclude that they're missing as a function of other (observed) variables, you might even consider a model-based approach, perhaps using GLM. The easiest approach by far (and fine if you don't have many missing values) is to just delete these rows with something like mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. I'll say it again: imputation of missing data is a non-trivial problem that should be considered carefully and in context. Perhaps a good approach is to try a few methods of imputation and see if (and how) your inferences change. If they don't change (much), you'll know you don't need to worry.
Dropping rows instead could be done with a fairly simple mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. If you do any other form of imputation (in a sense, "making up" data), I'd advocate thinking long and hard before concluding that it's the right decision. And, of course, it might be.
Joining two data.frames is pretty straightforward using cbind, something like my_data2 <- cbind(my_data, my_data.dummy_vars). If you need to remove the column with your factor data, my_data3 <- my_data2[,-5] if, for example, the factor data is in column 5.
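If, after weighing those caveats, you do decide on majority-level (mode) imputation as floated in question 1, a minimal sketch (VarX and my_data are placeholder names from the question):
# Replace NAs in a factor column with its most frequent level;
# table() ignores NAs by default, which.max() picks the modal level
mode_level <- names(which.max(table(my_data$VarX)))
my_data$VarX[is.na(my_data$VarX)] <- mode_level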
By dummy variables, do you mean zeroes and ones? This is how I'd structure it:
# first building a fake data frame
x <- 1:10
y <- as.factor(c("A","A","B","B","C","C",NA,"A","B","C"))
df <- data.frame(x,y)
# creating dummy variables
df$dummy_A <- 1*(y=="A")
df$dummy_B <- 1*(y=="B")
df$dummy_c <- 1*(y=="C")
# did it work?
df
    x    y dummy_A dummy_B dummy_c
1   1    A       1       0       0
2   2    A       1       0       0
3   3    B       0       1       0
4   4    B       0       1       0
5   5    C       0       0       1
6   6    C       0       0       1
7   7 <NA>      NA      NA      NA
8   8    A       1       0       0
9   9    B       0       1       0
10 10    C       0       0       1
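A sketch that generalizes this to any number of levels without writing one line per level (using the same toy df; note rows with NA in y stay NA, so settle the imputation question first):
# Build one 0/1 indicator column per level of y
dummies <- sapply(levels(df$y), function(l) as.integer(df$y == l))
colnames(dummies) <- paste0("dummy_", levels(df$y))
data.frame(df[c("x", "y")], dummies)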

Dynamically creating columns of binary values based on a series of TRUE/FALSE conditions

I would like to be able to create new columns in a dataframe, the values of which will be determined by a pre-defined list of conditional statements. The ultimate goal is to arrive at a table of binary values that represent whether a condition is being met for each instance. It may seem like a clunky or odd output, but it is a requirement of an economic model I'm trying to build (a repeated sales model).
Here is a much simplified reproducible example:
df <- data.frame(a = c(1,2,3,4,5), b = c(0.3,0.2,0.5,0.3,0.7))
conditions <- data.frame(y = df$b >= 0.5, z = df$b >= 0.7)
columns <- c("y","z")
for (i in length(columns)) {
  df[, paste("var_", columns[i], sep = "")] <- ifelse(conditions[i], 1, 0)
}
So in this instance, I'd like to get columns "var_y" and "var_z" which have binary values representing if the criteria for conditions y or z are being met.
Right now, I'm getting this error:
Error in ifelse(conditions[i], 1, 0) : (list) object cannot be
coerced to type 'logical'
Which I don't understand, as all of the information in the dataframe "conditions" is of type 'logical'.
We can just do
df[paste0("var_", seq_along(columns))] <- +(conditions)
df
#  a   b var_1 var_2
#1 1 0.3     0     0
#2 2 0.2     0     0
#3 3 0.5     1     0
#4 4 0.3     0     0
#5 5 0.7     1     1
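As for the error itself (my reading of the code in the question): conditions[i] with single brackets returns a one-column data frame, i.e. a list, which ifelse() cannot coerce to logical; conditions[[i]] extracts the underlying logical vector. The loop header for(i in length(columns)) is also off, since it iterates only over the single value 2 rather than 1:2. The unary + in +(conditions) above simply coerces the logical columns to 0/1. A fixed sketch of the original loop approach, keeping the var_y/var_z names:
# seq_along() gives i = 1, 2, ...; [[i]] extracts the logical vector
for (i in seq_along(columns)) {
  df[, paste0("var_", columns[i])] <- ifelse(conditions[[i]], 1, 0)
}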

data.matrix converts zero-values

I am trying to use R to analyse some data, but when I try to convert a data.frame (with zero-values) into a time series, it changes the values.
data    # my input dataset:
  Qua1 Qua2 Qua3 Qua4
A    0    1    1    3
B    0    1    0    0
C    0    2    0    0
D    1    1    3    0
I need to transpose these data and set the colnames, before further analysis
data <- t(data)
colnames(data) <- data[1, ]
data <- data.frame(data[2:5, ])
ts(data)
The data.frame keeps the input values, but when I apply ts() (which converts the data.frame to a numeric matrix via data.matrix) all zeros change to 1, the existing 1s change to 2, and the rest remains the same. As I would like to keep my zeros for later analysis, I would like to avoid this change in values, but how?
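No answer is recorded here, but the likely cause (my reading of the code above): t() on data that includes the character labels produces a character matrix, data.frame() then turns each column into a factor, and data.matrix() replaces each factor value with its integer level code, so "0" (level 1) becomes 1, "1" (level 2) becomes 2, and so on. A sketch of one way to avoid it, mirroring the code above but converting back to numeric before calling ts():
data <- t(data)
colnames(data) <- data[1, ]
data <- data.frame(data[2:5, ], stringsAsFactors = FALSE)
data[] <- lapply(data, as.numeric)  # characters back to numbers, zeros intact
ts(data)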
