dummyVars() in R and weird column names

For a dataset similar to the one below, I need N-level dummy variables. I use dummyVars() from the caret package.
As you can see, the column names ignore the sep="-" argument, and there are dots in the column names rather than < or > signs.
df <- data.frame(fruit=as.factor(c("apple", "orange","orange", "carrot", "apple")),
st=as.factor(c("CA", "MN","MN", "NY", "NJ")),
wt=as.factor(c("<2","2-4",">4","2-4","<2")),
buy=c(1,1,0,1,0))
fruit st wt buy
1 apple CA <2 1
2 orange MN 2-4 1
3 orange MN >4 0
4 carrot NY 2-4 1
5 apple NJ <2 0
library(caret)
dmy <- dummyVars(buy~ ., data = df, sep="-")
df2 <- data.frame(predict(dmy, newdata = df))
df2
fruit.apple fruit.carrot fruit.orange st.CA st.MN st.NJ st.NY wt..2 wt..4 wt.2.4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
I am confused about why dummyVars() is not using the actual levels in the column names and why it is ignoring the separator argument.
I would appreciate any hint on what I am doing wrong!
EDIT for future readers: following akrun's note, passing check.names = FALSE to data.frame() solved the problem.
df2 <- data.frame(predict(dmy, newdata = df), check.names = FALSE)
fruit-apple fruit-carrot fruit-orange st-CA st-MN st-NJ st-NY wt-<2 wt->4 wt-2-4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
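The dots come from data.frame() rather than from dummyVars(): with the default check.names = TRUE, data.frame() passes the column names through make.names(), which replaces characters that are not valid in syntactic R names with a dot. A minimal base R sketch of that behaviour follows; the matrix x is only an illustrative stand-in for the predict() output, not the real caret result.

```r
# Stand-in for the matrix returned by predict(dmy, newdata = df);
# only the column names matter for this illustration.
x <- matrix(0, nrow = 1, ncol = 3,
            dimnames = list(NULL, c("wt-<2", "wt->4", "wt-2-4")))

# Default (check.names = TRUE): names are sanitised by make.names()
default_names <- names(data.frame(x))
# "wt..2" "wt..4" "wt.2.4"

# check.names = FALSE keeps the names exactly as supplied
kept_names <- names(data.frame(x, check.names = FALSE))
# "wt-<2" "wt->4" "wt-2-4"
```

Names such as wt-<2 are not syntactic, so they need backticks when used with $ or in formulas, for example df2$`wt-<2`.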

Related

Splitting column of comma separated categories into binary matrix

I'm pretty new to R and I really need some help. I have a column cats in my data frame which I would like to spread into a binary matrix, where 1 marks a category the respondent reported interest in and 0 marks one they did not.
I've found that my problem is very similar to the one here:
Split column of comma-separated numbers into multiple columns based on value
However, I am unable to solve my problem using that solution and keep receiving different errors at different points. I suspect this is because my data frame contains strings rather than numbers.
Here is a sample data frame of what I am working with:
df <- data.frame(c("sports", "business,IT,entertainment", "feature,entertainment", "business,politics,sports", "health", "politics", "reviews", "entertainment,health", "IT"))
colnames(df) <- "cats"
# cats
#1 sports
#2 business,IT,entertainment
#3 feature,entertainment
#4 business,politics,sports
#5 health
#6 politics
#7 reviews
#8 entertainment,health
#9 IT
And this is what I'm trying to make it look like:
sports business IT entertainment politics reviews health feature
1 1 0 0 0 0 0 0 0
2 0 1 1 1 0 0 0 0
3 0 0 0 1 0 0 0 1
4 1 1 0 0 1 0 0 0
etc...
Examples of errors I have received are:
Error: row_number() should only be called in a data context
Error in eval_tidy(enquo(var), var_env) : object '' not found
Any help would be greatly appreciated!
+with(df, sapply(unique(unlist(strsplit(as.character(cats), ","))), grepl, cats))
# sports business IT entertainment feature politics health reviews
# [1,] 1 0 0 0 0 0 0 0
# [2,] 0 1 1 1 0 0 0 0
# [3,] 0 0 0 1 1 0 0 0
# [4,] 1 1 0 0 0 1 0 0
# [5,] 0 0 0 0 0 0 1 0
# [6,] 0 0 0 0 0 1 0 0
# [7,] 0 0 0 0 0 0 0 1
# [8,] 0 0 0 1 0 0 1 0
# [9,] 0 0 1 0 0 0 0 0
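One caveat with the grepl approach above: grepl matches patterns, so a category whose name is contained in another (hypothetically, "IT" and "ITsupport") would be flagged for both. The following is a sketch that compares whole categories instead, using a small made-up data frame; the names parts, lv and m are illustrative only.

```r
# Made-up data: "ITsupport" contains "IT" as a substring
df <- data.frame(cats = c("sports", "business,IT", "ITsupport"),
                 stringsAsFactors = FALSE)

# Split each row into its individual categories
parts <- strsplit(df$cats, ",", fixed = TRUE)

# All distinct categories, in order of first appearance
lv <- unique(unlist(parts))

# For each row, test every category for exact membership;
# t() puts rows of df in rows, unary + turns TRUE/FALSE into 1/0
m <- +t(sapply(parts, function(p) lv %in% p))
colnames(m) <- lv
```

Here row 3 gets a 1 only under ITsupport, whereas a pattern match on "IT" would also hit that row.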
One option with mtabulate
library(qdapTools)
mtabulate(strsplit(as.character(df$cats), ","))
# business entertainment feature health IT politics reviews sports
#1 0 0 0 0 0 0 0 1
#2 1 1 0 0 1 0 0 0
#3 0 1 1 0 0 0 0 0
#4 1 0 0 0 0 1 0 1
#5 0 0 0 1 0 0 0 0
#6 0 0 0 0 0 1 0 0
#7 0 0 0 0 0 0 1 0
#8 0 1 0 1 0 0 0 0
#9 0 0 0 0 1 0 0 0
Or with table from base R
table(stack(setNames(strsplit(as.character(df$cats), ","), seq_len(nrow(df))))[2:1])
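The base R one-liner above packs several steps into one call. Unpacked on a shortened version of the data, it might look like this (the intermediate names parts, named, long and tab are introduced here for illustration):

```r
df <- data.frame(cats = c("sports", "business,IT,entertainment", "IT"),
                 stringsAsFactors = FALSE)

# 1. Split each entry into a character vector of categories
parts <- strsplit(as.character(df$cats), ",")

# 2. Name each list element by its row number
named <- setNames(parts, seq_len(nrow(df)))

# 3. stack() turns the named list into a long data frame with
#    columns "values" (the category) and "ind" (the row number)
long <- stack(named)

# 4. Cross-tabulate row number against category; [2:1] puts
#    "ind" first so that it becomes the row dimension
tab <- table(long[2:1])
```

The result is a table of counts, so a category repeated within one entry would show up as 2 rather than 1.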
Using the tidyverse, you can do:
library(tidyverse)
df %>%
rownames_to_column(var="row") %>%
separate_rows(cats, sep=",") %>%
count(row, cats) %>%
spread(cats, n, fill = 0)
Edit thanks to @eipi10

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive, but I can't figure out a way to do it. I have a very big data frame (nrow = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones, similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the all-zero rows using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify, for each row, which columns contain a 1:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as a step towards a solution, but I'm not sure how I would proceed without using a lot of logical operators (I don't know in advance which columns have matches). What I would like to do is move the rows that have two or more 1's into a new combined column, so that the data frame looks like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
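# drop rows that contain no 1 at all
m = m[rowSums(m) > 0, ]
# label each row by the names of its columns that hold a 1, e.g. "AC"
d = factor(sapply(apply(m == 1, 1, which), function(x) paste(names(m)[x], collapse = "")))
# one indicator column per label (+ 0 drops the intercept)
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)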
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0
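A caveat on the labelling step: apply(m == 1, 1, which) returns a list only when rows differ in how many 1's they hold. If, say, every remaining row contained exactly two 1's, apply() would simplify to a matrix and the sapply() step would no longer iterate row by row. A sketch of a variant that builds the label inside apply(), so each row always yields one string (shortened made-up data; the name key is illustrative):

```r
m <- data.frame(A = c(1, 0, 1, 0),
                B = c(0, 0, 0, 1),
                C = c(0, 0, 1, 0),
                D = c(0, 1, 0, 0))

# One label per row: the names of the columns holding a 1, pasted together
key <- apply(m == 1, 1, function(r) paste(colnames(m)[r], collapse = ""))

# One indicator column per distinct label, as in the answer above
d <- factor(key)
result <- as.data.frame(model.matrix(~ d + 0))
names(result) <- levels(d)
```

Each row of result then has exactly one 1, in the column named after its combination.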

Using any in nested ifelse statement

data:
library(magrittr) # provides %>% and %<>%
library(dplyr) # provides mutate()
set.seed(1337)
m <- matrix(sample(c(0,0,0,1), size = 50, replace = TRUE), ncol = 5) %>% as.data.frame
colnames(m) <- LETTERS[1:5]
code:
m %<>%
mutate(newcol = ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(any(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
looks like:
A B C D E newcol desiredResult
1 0 1 1 1 0 0 0
2 0 1 0 0 1 0 0
3 0 1 0 0 0 0 0
4 0 0 0 0 0 0 NA
5 0 1 0 1 0 0 0
6 0 0 1 0 0 0 0
7 1 1 1 1 0 1 1
8 0 1 1 0 0 0 0
9 0 0 0 0 0 0 NA
10 0 0 1 0 0 0 0
question
I want newcol to be the same as desiredResult.
Why can't I use any in that "stratified" manner inside ifelse? Is there a function like any that would work in that situation?
possible workaround
I could define a function:
any_vec <- function(...) {apply(cbind(...),1,any)}
but this does not make me smile too much.
as suggested in the answer
For 0/1 columns like these, pmax works like a vectorized any.
m %>%
mutate(pmaxResult = ifelse(A==1& pmax(B,C) & pmax(D,E),1,
ifelse(pmax(A,B,C,D,E),0,NA)),
desiredResult= ifelse(A==1&(B==1|C==1)&(D==1|E==1),1,
ifelse(!(A==0&B==0&C==0&D==0&E==0),0,NA)))
Here's an alternative approach. I converted to logical at the beginning and back to integer at the end:
m %>%
mutate_all(as.logical) %>%
mutate(newcol = A & pmax(B,C) & pmax(D, E) ,
newcol = replace(newcol, !newcol & !pmax(A,B,C,D,E), NA)) %>%
mutate_all(as.integer)
# A B C D E newcol
# 1 0 1 1 1 0 0
# 2 0 1 0 0 1 0
# 3 0 1 0 0 0 0
# 4 0 0 0 0 0 NA
# 5 0 1 0 1 0 0
# 6 0 0 1 0 0 0
# 7 1 1 1 1 0 1
# 8 0 1 1 0 0 0
# 9 0 0 0 0 0 NA
# 10 0 0 1 0 0 0
I basically replaced the any with pmax.
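The underlying reason is that any() collapses all of its arguments into a single TRUE or FALSE, so inside ifelse() that one value is recycled for every row and the NA branch is either always or never taken. A small base R sketch of the difference, with made-up vectors A and B standing in for the columns:

```r
A <- c(0, 0, 1)
B <- c(0, 1, 0)

# any() reduces everything to one value
single <- any(as.logical(A), as.logical(B))

# Element-wise alternatives return one value per row
by_pmax   <- pmax(A, B) > 0            # works for non-negative 0/1 data
by_or     <- A | B                     # logical OR, element by element
by_reduce <- Reduce(`|`, list(A, B))   # same, for any number of columns
```

The pmax trick relies on the columns holding only 0 and 1 (or other non-negative values); | and Reduce() do not need that assumption.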

R: Generating sparse matrix with all elements as rows and columns

I have a data set of user-to-user relations. Not every user appears in both columns. For example:
U1 U2 T
1 3 1
1 6 1
2 4 1
3 5 1
U1 and U2 represent users of the dataset. When I create a sparse matrix using the following code (df holds the data above as a data frame), I get:
trustmatrix <- xtabs(T~U1+U2,df,sparse = TRUE)
3 4 5 6
1 1 0 0 1
2 0 1 0 0
3 0 0 1 0
This matrix doesn't have all the users as rows and columns, as the matrix below does.
1 2 3 4 5 6
1 0 0 1 0 0 1
2 0 0 0 1 0 0
3 0 0 0 0 1 0
4 0 0 0 0 0 0
5 0 0 0 0 0 0
6 0 0 0 0 0 0
If I want to get the matrix above from the sparse matrix, how can I do so in R?
We can convert the columns to factor with levels as 1 through 6 and then use xtabs
df1[1:2] <- lapply(df1[1:2], factor, levels = 1:6)
as.matrix(xtabs(T~U1+U2,df1,sparse = TRUE))
# U2
#U1 1 2 3 4 5 6
# 1 0 0 1 0 0 1
# 2 0 0 0 1 0 0
# 3 0 0 0 0 1 0
# 4 0 0 0 0 0 0
# 5 0 0 0 0 0 0
# 6 0 0 0 0 0 0
Or another option is to get the expanded index filled with 0s and then use sparseMatrix
library(tidyverse)
library(Matrix)
df2 <- crossing(U1 = 1:6, U2 = 1:6) %>%
left_join(df1) %>%
mutate(T = replace(T, is.na(T), 0))
sparseMatrix(i = df2$U1, j = df2$U2, x = df2$T)
Or use spread
spread(df2, U2, T)
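If the total number of users is known, another possible route is to pass the dimensions straight to sparseMatrix(), which avoids expanding the data with explicit zeros. A sketch under that assumption (n = 6 is assumed here, and the user ids are assumed to be the integers 1 to n):

```r
library(Matrix)

df <- data.frame(U1 = c(1, 1, 2, 3),
                 U2 = c(3, 6, 4, 5),
                 T  = c(1, 1, 1, 1))

n <- 6  # assumed total number of users

# Only the observed (i, j, x) triplets are stored; dims pads the rest
sm <- sparseMatrix(i = df$U1, j = df$U2, x = df$T,
                   dims = c(n, n),
                   dimnames = list(as.character(1:n), as.character(1:n)))
```

as.matrix(sm) gives the dense 6 x 6 version if that is needed for display.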

Building a symmetric binary matrix

I have a matrix that is for example like this:
rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3
I want to make a symmetric binary matrix whose dimnames are the same as the rownames of the matrix above. I want to fill this matrix with 1s and 0s such that 1 indicates that two variables have the same number next to them and 0 indicates that they do not. The matrix would look like this:
dimnames
a c b d y q i j r
a 1 0 0 0 0 0 1 1 0
c 0 1 0 0 0 0 0 0 1
b 0 0 1 0 1 0 0 0 0
d 0 0 0 1 0 1 0 0 0
y 0 0 1 0 1 0 0 0 0
q 0 0 0 1 0 1 0 0 0
i 1 0 0 0 0 0 1 1 0
j 1 0 0 0 0 0 1 1 0
r 0 1 0 0 0 0 0 0 1
Does anybody know how I can do that?
Use dist:
DF <- read.table(text = "rownames V1
a 1
c 3
b 2
d 4
y 2
q 4
i 1
j 1
r 3", header = TRUE)
res <- as.matrix(dist(DF$V1)) == 0L
#alternatively:
#res <- !as.matrix(dist(DF$V1))
#diag(res) <- 0L #for the first version of the question, i.e. a zero diagonal
res <- +(res) #for the second version, i.e. to coerce to an integer matrix
dimnames(res) <- list(DF$rownames, DF$rownames)
# 1 2 3 4 5 6 7 8 9
#1 1 0 0 0 0 0 1 1 0
#2 0 1 0 0 0 0 0 0 1
#3 0 0 1 0 1 0 0 0 0
#4 0 0 0 1 0 1 0 0 0
#5 0 0 1 0 1 0 0 0 0
#6 0 0 0 1 0 1 0 0 0
#7 1 0 0 0 0 0 1 1 0
#8 1 0 0 0 0 0 1 1 0
#9 0 1 0 0 0 0 0 0 1
You can do this using table and tcrossprod.
tcrossprod(table(DF))
# rownames
# rownames a b c d i j q r y
# a 1 0 0 0 1 1 0 0 0
# b 0 1 0 0 0 0 0 0 1
# c 0 0 1 0 0 0 0 1 0
# d 0 0 0 1 0 0 1 0 0
# i 1 0 0 0 1 1 0 0 0
# j 1 0 0 0 1 1 0 0 0
# q 0 0 0 1 0 0 1 0 0
# r 0 0 1 0 0 0 0 1 0
# y 0 1 0 0 0 0 0 0 1
If you want the row and column order as they are found in the data, rather than alphanumerically, you can subset
tcrossprod(table(DF))[DF$rownames, DF$rownames]
or use factor
tcrossprod(table(factor(DF$rownames, levels=unique(DF$rownames)), DF$V1))
If your data is large or sparse, you can use the sparse matrix support in xtabs (sparse = TRUE) together with Matrix::tcrossprod; the row and column order can be changed in the same ways as before.
Matrix::tcrossprod(xtabs(data=DF, ~ rownames + V1, sparse=TRUE))
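The rule in the question (1 where two names share the same number, 0 otherwise) can also be written directly as a pairwise equality test with outer(). This is an additional base R sketch, not one of the answers above; the names v and res are illustrative.

```r
DF <- data.frame(rownames = c("a", "c", "b", "d", "y", "q", "i", "j", "r"),
                 V1 = c(1, 3, 2, 4, 2, 4, 1, 1, 3),
                 stringsAsFactors = FALSE)

# Named vector: values are the group numbers, names are the labels
v <- setNames(DF$V1, DF$rownames)

# Compare every value with every other value; outer() carries the
# names over as dimnames, and unary + turns TRUE/FALSE into 1/0
res <- +outer(v, v, "==")
```

This keeps the rows and columns in the order they appear in the data, but it builds a dense n x n matrix, so for large inputs the sparse tcrossprod route is likely the better fit.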

Resources