So I have a list that contains certain characters as shown below
list <- c("MY","GM+" ,"TY","RS","LG")
And I have a variable named "CODE" in the data frame as follows
code <- c("MY GM+","","LGTY", "RS","TY")
df <- data.frame(1:5,code)
df
code
1 MY GM+
2
3 LGTY
4 RS
5 TY
Now I want to create 5 new variables named "MY","GM+","TY","RS","LG"
Which takes binary value, 1 if there's a match case in the CODE variable
df
code MY GM+ TY RS LG
1 MY GM+ 1 1 0 0 0
2 0 0 0 0 0
3 LGTY 0 0 1 0 1
4 RS 0 0 0 1 0
5 TY 0 0 1 0 0
Really appreciate your help. Thank you.
Since you know how many values will be returned (5), and what you want their types to be (integer), you could use vapply() with grepl(). We can turn the resulting logical matrix into integer values by using integer() in vapply()'s FUN.VALUE argument.
cbind(df, vapply(List, grepl, integer(nrow(df)), df$code, fixed = TRUE))
# code MY GM+ TY RS LG
# 1 MY GM+ 1 1 0 0 0
# 2 0 0 0 0 0
# 3 LGTY 0 0 1 0 1
# 4 RS 0 0 0 1 0
# 5 TY 0 0 1 0 0
I think your original data has a couple of typos, so here's what I used:
List <- c("MY", "GM+" , "TY", "RS", "LG")
df <- data.frame(code = c("MY GM+", "", "LGTY", "RS", "TY"))
I would like to extract every row from the data frame my.data for which the first non-zero element is a 1.
my.data <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
0 2 1 1
2 1 2 1
1 1 1 2
0 0 0 0
0 1 0 0
', header = TRUE)
my.data
desired.result <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
1 1 1 2
0 1 0 0
', header = TRUE)
desired.result
I am not even sure where to begin. Sorry if this is a duplicate. Thank you for any suggestions or advice.
Here's one approach:
# index of rows
idx <- apply(my.data, 1, function(x) any(x) && x[as.logical(x)][1] == 1)
# extract rows
desired.result <- my.data[idx, ]
The result:
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Probably not the best answer, but:
rows.to.extract <- apply(my.data, 1, function(x) {
no.zeroes <- x[x!=0] # removing 0
to.return <- no.zeroes[1] == 1 # finding if first number is 0
# if a row is all 0, then to.return will be NA
# this fixes that problem
to.return[is.na(to.return)] <- FALSE # if row is all 0
to.return
})
my.data[rows.to.extract, ]
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Use apply to iterate over all rows:
first.element.is.one <- apply(my.data, 1, function(x) x[x != 0][1] == 1)
The function passed to apply compares the first [1] non-zero [x != 0] element of x to == 1. It will be called once for each row, x will be a vector of four in your example.
Use which to extract the indices of the candidate rows (and remove NA values, too):
desired.rows <- which(first.element.is.one)
Select the rows of the matrix -- you probably know how to do this.
Bonus question: Where do the NA values mentioned in step 2 come from?
I would like to create a matrix of indicator variables. My initial thought was to use model.matrix, which was also suggested here: Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
However, model.matrix does not seem to work if a factor has only one level.
Here is an example data set with three levels to the factor 'region':
dat = read.table(text = "
reg1 reg2 reg3
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
", sep = "", header = TRUE)
# model.matrix works if there are multiple regions:
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
df.region <- as.data.frame(region)
df.region$region <- as.factor(df.region$region)
my.matrix <- as.data.frame(model.matrix(~ -1 + df.region$region, df.region))
my.matrix
# The following for-loop works even if there is only one level to the factor
# (one region):
# region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
for(i in 1:length(region)) {my.matrix[i,region[i]]=1}
my.matrix
The for-loop is effective and seems simple enough. However, I have been struggling to come up with a solution that does not involve loops. I can use the loop above, but have been trying hard to wean myself off of them. Is there a better way?
I would use matrix indexing. From ?"[":
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector.
Making use of that nice feature:
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
my.matrix[cbind(seq_along(region), region)] <- 1
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 1 0 0
# [4,] 1 0 0
# [5,] 1 0 0
# [6,] 1 0 0
# [7,] 0 1 0
# [8,] 0 1 0
# [9,] 0 1 0
# [10,] 0 0 1
# [11,] 0 0 1
# [12,] 0 0 1
# [13,] 0 0 1
I came up with this solution by modifying an answer to a similar question here:
Reshaping a column from a data frame into several columns using R
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
site <- seq(1:length(region))
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
site <- seq(1:length(region))
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
EDIT:
The line below will extract the data frame of indicator variables from ind:
ind.matrix <- as.data.frame.matrix(ind)
I have a data.frame consisting of numeric and factor variables as seen below.
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))
I want to build out a matrix that assigns dummy variables to the factor and leaves the numeric variables alone.
model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)
As expected when running lm this leaves out one level of each factor as the reference level. However, I want to build out a matrix with a dummy/indicator variable for every level of all the factors. I am building this matrix for glmnet so I am not worried about multicollinearity.
Is there a way to have model.matrix create the dummy for every level of the factor?
(Trying to redeem myself...) In response to Jared's comment on #Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrasts matrix from it. For this then we can use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided:
> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
Alice Bob Charlie David
Alice 1 0 0 0
Bob 0 1 0 0
Charlie 0 0 1 0
David 0 0 0 1
$Fifth
Edward Frank Georgia Hank Isaac
Edward 1 0 0 0 0
Frank 0 1 0 0 0
Georgia 0 0 1 0 0
Hank 0 0 0 1 0
Isaac 0 0 0 0 1
Which slots nicely into #fabians answer:
model.matrix(~ ., data=testFrame,
contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
You need to reset the contrasts for the factor variables:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F),
Fifth=contrasts(testFrame$Fifth, contrasts=F)))
or, with a little less typing and without the proper names:
model.matrix(~ Fourth + Fifth, data=testFrame,
contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)),
Fifth=diag(nlevels(testFrame$Fifth))))
caret implemented a nice function dummyVars to achieve this with 2 lines:
library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))
Checking the final columns:
colnames(testFrame2)
"First" "Second" "Third" "Fourth.Alice" "Fourth.Bob" "Fourth.Charlie" "Fourth.David" "Fifth.Edward" "Fifth.Frank" "Fifth.Georgia" "Fifth.Hank" "Fifth.Isaac"
The nicest point here is you get the original data frame, plus the dummy variables having excluded the original ones used for the transformation.
More info: http://amunategui.github.io/dummyVar-Walkthrough/
dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html
Ok. Just reading the above and putting it all together. Suppose you wanted the matrix e.g. 'X.factors' that multiplies by your coefficient vector to get your linear predictor. There are still a couple extra steps:
X.factors =
model.matrix( ~ ., data=X, contrasts.arg =
lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
contrasts, contrasts = FALSE))
(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)
Then say you get something like this:
attr(X.factors,"assign")
[1] 0 1 **2** 2 **3** 3 3 **4** 4 4 5 6 7 8 9 10 #emphasis added
We want to get rid of the **'d reference levels of each factor
att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))
A tidyverse answer:
library(dplyr)
library(tidyr)
result <- testFrame %>%
mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>%
mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")
yields the desired result (same as #Gavin Simpson's answer):
> head(result, 6)
First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1 1 5 4 0 0 1 0 0 1 0 0 0
2 1 14 10 0 0 0 1 0 0 1 0 0
3 2 2 9 0 1 0 0 1 0 0 0 0
4 2 5 4 0 0 0 1 0 1 0 0 0
5 2 13 5 0 0 1 0 1 0 0 0 0
6 2 15 7 1 0 0 0 1 0 0 0 0
Using the R package 'CatEncoders'
library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))
fit <- OneHotEncoder.fit(testFrame)
z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output
I am currently learning Lasso model and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix()(for high dimensions matrix, using model.matrix will killing our time as suggested by the author of glmnet.).
Just sharing there has a tidy coding to get the same answer as #fabians and #Gavin's answer. Meanwhile, #asdf123 introduced another package library('CatEncoders') as well.
> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
>
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))
Source : R for Everyone: Advanced Analytics and Graphics (page273)
I write a package called ModelMatrixModel to improve the functionality of model.matrix(). The ModelMatrixModel() function in the package in default return a class containing a sparse matrix with all levels of dummy variables which is suitable for input in cv.glmnet() in glmnet package. Importantly, returned
class also stores transforming parameters such as the factor level information, which can then be applied to new data. The function can hand most items in r formula like poly() and interaction. It also gives several other options like handle invalid factor levels , and scale output.
#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 7 17 1 0 0 0
## 2 9 7 0 1 0 0
#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2)))
## First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1 6 3 0 1 0 0
## 2 2 12 0 0 1 0
You can use tidyverse to achieve this without specifying each column manually.
The trick is to make a "long" dataframe.
Then, munge a few things, and spread it back to wide to create the indicators/dummy variables.
Code:
library(tidyverse)
## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)
testFrame %>%
## pivot to "long" format
gather(feature, value, -id) %>%
## add indicator value
mutate(indicator=1) %>%
## create feature name that unites a feature and its value
unite(feature, value, col="feature_value", sep="_") %>%
## convert to wide format, filling missing values with zero
spread(feature_value, indicator, fill=0)
The output:
id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1 1 1 0 0 0 0 0 0 0
2 2 0 1 0 0 0 0 0 0
3 3 0 0 1 0 0 0 0 0
4 4 0 0 0 1 0 0 0 0
5 5 0 0 0 0 1 0 0 0
6 6 1 0 0 0 0 0 0 0
7 7 0 1 0 0 0 0 1 0
8 8 0 0 1 0 0 1 0 0
9 9 0 0 0 1 0 0 0 0
10 10 0 0 0 0 1 0 0 0
11 11 1 0 0 0 0 0 0 0
12 12 0 1 0 0 0 0 0 0
...
model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)
or
model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)
should be the most straightforward