Splitting column of comma separated categories into binary matrix - r

I'm pretty new to R and I really need some help. I have a column cats in my data frame which I would like to spread into a binary matrix, where 1 means the respondent reported interest in that category and 0 means they did not.
I've found that my problem is very similar to the one here:
Split column of comma-separated numbers into multiple columns based on value
However I am unable to solve my problem using the said solution and keep receiving multiple different errors at different points. I suspect it's because my data frame contains strings and not integers or numbers.
Here is a sample data frame of what I am working with
df <- data.frame(c("sports", "business,IT,entertainment", "feature,entertainment", "business,politics,sports", "health", "politics", "reviews", "entertainment,health", "IT"))
colnames(df) <- "cats"
# cats
#1 sports
#2 business,IT,entertainment
#3 feature,entertainment
#4 business,politics,sports
#5 health
#6 politics
#7 reviews
#8 entertainment,health
#9 IT
And this is what I'm trying to make it look like
sports business IT entertainment politics reviews health feature
1 1 0 0 0 0 0 0 0
2 0 1 1 1 0 0 0 0
3 0 0 0 1 0 0 0 1
4 1 1 0 0 1 0 0 0
etc...
Examples of errors I have received are:
Error: row_number() should only be called in a data context
Error in eval_tidy(enquo(var), var_env) : object '' not found
Any help would be greatly appreciated!

+with(df, sapply(unique(unlist(strsplit(as.character(cats), ","))), grepl, cats))
# sports business IT entertainment feature politics health reviews
# [1,] 1 0 0 0 0 0 0 0
# [2,] 0 1 1 1 0 0 0 0
# [3,] 0 0 0 1 1 0 0 0
# [4,] 1 1 0 0 0 1 0 0
# [5,] 0 0 0 0 0 0 1 0
# [6,] 0 0 0 0 0 1 0 0
# [7,] 0 0 0 0 0 0 0 1
# [8,] 0 0 0 1 0 0 1 0
# [9,] 0 0 1 0 0 0 0 0
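
The grepl() call above matches substrings, so a category whose name is contained in another one (a hypothetical "IT" next to "ITALY", say) would be flagged for both. A minimal base R sketch that tests exact membership instead, assuming the same sample data; the names parts, all_cats and binary are made up for illustration:

```r
# Sample data from the question
df <- data.frame(cats = c("sports", "business,IT,entertainment",
                          "feature,entertainment", "business,politics,sports",
                          "health", "politics", "reviews",
                          "entertainment,health", "IT"))

# Split each row into its individual categories
parts <- strsplit(as.character(df$cats), ",", fixed = TRUE)

# All distinct categories, in order of first appearance
all_cats <- unique(unlist(parts))

# One 0/1 row per respondent, using exact membership rather than pattern matching
binary <- do.call(rbind, lapply(parts, function(p) as.integer(all_cats %in% p)))
colnames(binary) <- all_cats
binary
```

For the sample data this should give the same matrix as the grepl() version; the two only differ when one category name is a substring of another.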

One option with mtabulate
library(qdapTools)
mtabulate(strsplit(as.character(df$cats), ","))
# business entertainment feature health IT politics reviews sports
#1 0 0 0 0 0 0 0 1
#2 1 1 0 0 1 0 0 0
#3 0 1 1 0 0 0 0 0
#4 1 0 0 0 0 1 0 1
#5 0 0 0 1 0 0 0 0
#6 0 0 0 0 0 1 0 0
#7 0 0 0 0 0 0 1 0
#8 0 1 0 1 0 0 0 0
#9 0 0 0 0 1 0 0 0
Or with table from base R
table(stack(setNames(strsplit(as.character(df$cats), ","), seq_len(nrow(df))))[2:1])
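
The base R one-liner is dense, so here is the same idea unpacked step by step. This is only a sketch of what the call does; the intermediate names parts, long, tab and wide are invented for readability:

```r
# Sample data from the question
df <- data.frame(cats = c("sports", "business,IT,entertainment",
                          "feature,entertainment", "business,politics,sports",
                          "health", "politics", "reviews",
                          "entertainment,health", "IT"))

# Name each split result by its row number so stack() can record the origin
parts <- setNames(strsplit(as.character(df$cats), ",", fixed = TRUE),
                  seq_len(nrow(df)))

# stack() returns a long data frame: 'values' (category) and 'ind' (row number)
long <- stack(parts)

# Cross-tabulate row number against category to get the 0/1 counts
tab <- table(row = long$ind, category = long$values)

# Optional: turn the contingency table into an ordinary data frame
wide <- as.data.frame.matrix(tab)
wide
```

The columns come out in sorted order, and that order can vary with the locale.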

With the tidyverse you can do:
library(tidyverse)
df %>%
rownames_to_column(var="row") %>%
separate_rows(cats, sep=",") %>%
count(row, cats) %>%
spread(cats, n, fill = 0)
Edit thanks to @eipi10

Related

dummyVars() in r and weird column names in R

For a dataset similar to the one below, I need N-level dummy variables. I use dummyVars() from the caret package.
As you can see, the column names ignore the sep = "-" argument, and they contain dots where the < or > signs should be.
df <- data.frame(fruit=as.factor(c("apple", "orange","orange", "carrot", "apple")),
st=as.factor(c("CA", "MN","MN", "NY", "NJ")),
wt=as.factor(c("<2","2-4",">4","2-4","<2")),
buy=c(1,1,0,1,0))
fruit st wt buy
1 apple CA <2 1
2 orange MN 2-4 1
3 orange MN >4 0
4 carrot NY 2-4 1
5 apple NJ <2 0
library(caret)
dmy <- dummyVars(buy~ ., data = df, sep="-")
df2 <- data.frame(predict(dmy, newdata = df))
df2
fruit.apple fruit.carrot fruit.orange st.CA st.MN st.NJ st.NY wt..2 wt..4 wt.2.4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
I am confused about why dummyVars() does not carry the actual levels into the column names and why it ignores the separator argument.
I would appreciate any hint on what I am doing wrong!
EDIT: For future readers: following akrun's note, passing the argument below to data.frame() solved the problem.
df2 <- data.frame(predict(dmy, newdata = df), check.names = FALSE)
fruit-apple fruit-carrot fruit-orange st-CA st-MN st-NJ st-NY wt-<2 wt->4 wt-2-4
1 1 0 0 1 0 0 0 1 0 0
2 0 0 1 0 1 0 0 0 0 1
3 0 0 1 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 1
5 1 0 0 0 0 1 0 1 0 0
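
The dots come from data.frame(), not from dummyVars(): with the default check.names = TRUE, column names are passed through make.names(), which replaces characters that are not valid in syntactic R names with a dot. A small base R sketch of that behaviour; raw_names and m are illustrative stand-ins for the matrix returned by predict():

```r
# Column names of the kind dummyVars() produces with sep = "-"
raw_names <- c("fruit-apple", "st-CA", "wt-<2", "wt->4", "wt-2-4")

# make.names() swaps characters such as "-", "<" and ">" for "."
make.names(raw_names)

# A one-row matrix standing in for the predict() output
m <- matrix(0, nrow = 1, ncol = length(raw_names),
            dimnames = list(NULL, raw_names))

names(data.frame(m))                       # names are sanitised
names(data.frame(m, check.names = FALSE))  # names are kept as they are
```

Columns with non-syntactic names have to be quoted with backticks when accessed, for example df2$`wt-<2`.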

R - Creating a new column within a data frame when two or more columns are a match in a row

I'm currently stuck on a part of my code that feels intuitive but I can't figure a way to do it. I have a very big data frame (nrows = 34036, ncol = 43) in which I want to create a continuous sequence of the variables where the value of the row is 1 (without having multiple columns with 1). It consists of only zeros and ones similar to the following:
A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1
I was able to remove the zeroes using:
#find the sum of each row
placeholderData <- transform(placeholderData, sum=rowSums(placeholderData))
placeholderData <- placeholderData[!(placeholderData$sum <= 0),]
And the data frame now looks like:
A B C D sum
1 0 0 0 1
0 0 0 1 1
0 0 0 1 1
1 0 1 0 2
1 0 1 0 2
0 1 0 0 1
0 1 0 0 1
1 0 0 1 2
My main problem comes when there are two or more 1's in a row. To try to solve this, I used the following code to identify the columns that have a sum of 2 or more:
placeholderData$Matches <- lapply(apply(placeholderData == 1, 1, which), names)
Which added the following column to the data frame:
A B C D sum Matches
1 0 0 0 1 A
0 0 0 1 1 D
0 0 0 1 1 D
1 0 1 0 2 c("A","C")
1 0 1 0 2 c("A","C")
0 1 0 0 1 B
0 1 0 0 1 B
1 0 0 1 2 c("A", "D")
I added the Matches column as an approach to solve the problem, but I'm not sure how would I do it without using a lot of logical operators (I don't know what columns have matches or not). What I would like to do is to aggregate the rows that have more than (or equal to) two 1's into a new column, to be able to have a data frame like this:
A B C D AC AD sum Matches
1 0 0 0 0 0 1 A
0 0 0 1 0 0 1 D
0 0 0 1 0 0 1 D
0 0 0 0 1 0 1 c("A","C")
0 0 0 0 1 0 1 c("A","C")
0 1 0 0 0 0 1 B
0 1 0 0 0 0 1 B
0 0 0 0 0 1 1 c("A", "D")
Then, I would be able to use my code as normal (It works just fine when there are no repeated values in rows). I tried searching to find similar questions, but I'm not sure if I was even asking the right question. I was wondering if anyone could provide some help or some ideas that I could try.
Thank you very much!
This seems a lot like making dummy variables, so I would use the model.matrix function commonly used for dummy variables (one-hot encoding):
m = read.table(header = TRUE, text = "A B C D
1 0 0 0
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
1 0 1 0
1 0 1 0
0 1 0 0
0 1 0 0
1 0 0 1")
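# Drop rows that contain no 1s
m = m[rowSums(m) > 0, ]
# Label each row by the names of the columns that equal 1, e.g. "AC"
d = factor(apply(m == 1, 1, function(r) paste(names(m)[r], collapse = "")))
# One indicator column per label (the "+ 0" drops the intercept)
result = data.frame(model.matrix(~ d + 0))
names(result) = levels(d)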
# A AC AD B D
# 1 1 0 0 0 0
# 2 0 0 0 0 1
# 3 0 0 0 0 1
# 4 0 1 0 0 0
# 5 0 1 0 0 0
# 6 0 0 0 1 0
# 7 0 0 0 1 0
# 8 0 0 1 0 0

R - merge/combine columns with same name but some data values equal zero

First of all, I have a matrix of features and a data.frame of features from two separate text sources. On each of those, I have performed different text mining methods. Now, I want to combine them but I know some of them have columns with identical names like the following:
> dtm.matrix[1:10,66:70]
cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0
> dim(dtm.matrix)
[1] 14300 6543
And the second set looks like this:
> data1.sub[1:10,c(1,37:40)]
Data number cough coughing up blood dehydration dental abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0
> dim(data1.sub)
[1] 14300 168
I got this code from this topic but I'm new to R and I still need some help with it:
data1.sub.merged <- dcast.data.table(merge(
## melt the first data.frame and set the key as ID and variable
setkey(melt(as.data.table(data1.sub), id.vars = "Data number"), "Data number", variable),
## melt the second data.frame
melt(as.data.table(dtm.matrix), id.vars = "Data number"),
## you'll have 2 value columns...
all = TRUE)[, value := ifelse(
## ... combine them into 1 with ifelse
(value.x == 0), value.y, value.x)],
## This is the reshaping formula
"Data number" ~ variable, value.var = "value")
When I run this code, it returns a 1x6667 matrix and doesn't merge "cough" (or any other column) from the two data sets. I'm confused. Could you help me understand how this works?
There are many ways to do that, e.g. using base R, data.table or dplyr. The choice depends on the volume of your data. If you work with very large matrices (which is usually the case with natural language processing and bag-of-words representations), you may need to try several approaches and profile them to find the fastest one.
I did what you wanted with dplyr. It is a bit ugly, but it works. I merge the two data frames, then loop over the variables that exist in both: I sum the two copies (variable.x and variable.y) into one column and then delete them. Note that I changed your column names slightly for reproducibility, but that shouldn't have any impact. Please let me know if that works for you.
df1 <- read.table(text =
' cough nasal sputum yellow intermitt
1 1 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 1 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
10 0 0 0 0 0')
df2 <- read.table(text =
' Data_number cough coughing_up_blood dehydration dental_abscess
1 1 0 0 0 0
2 3 1 0 0 0
3 6 0 0 0 0
4 8 0 0 0 0
5 9 0 0 0 0
6 11 1 0 0 0
7 12 0 0 0 0
8 13 0 0 0 0
9 15 0 0 0 0
10 16 1 0 0 0')
# Check what variables are common
common <- intersect(names(df1),names(df2))
# Set key IDs for data
df1$ID <- seq(1,nrow(df1))
df2$ID <- seq(1,nrow(df2))
# Merge dataframes
df <- merge(df1, df2,by = "ID")
# Sum and clean common variables left in merged dataframe
library(dplyr)
for (variable in common){
# Create a summed variable
df[[variable]] <- df %>% select(starts_with(paste0(variable,"."))) %>% rowSums()
# Delete columns with .x and .y suffixes
df <- df %>% select(-one_of(c(paste0(variable,".x"), paste0(variable,".y"))))
}
df
ID nasal sputum yellow intermitt Data_number coughing_up_blood dehydration dental_abscess cough
1 1 0 0 0 0 1 0 0 0 1
2 2 0 0 0 0 3 0 0 0 2
3 3 0 0 0 0 6 0 0 0 0
4 4 0 0 0 0 8 0 0 0 0
5 5 0 0 0 0 9 0 0 0 0
6 6 0 0 0 0 11 0 0 0 2
7 7 0 0 0 0 12 0 0 0 0
8 8 0 0 0 0 13 0 0 0 0
9 9 0 0 0 0 15 0 0 0 0
10 10 0 0 0 0 16 0 0 0 1
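
Because the loop sums the two copies, rows flagged in both sources end up with 2 in the cough column. If a 0/1 indicator is wanted instead, one option is the element-wise maximum. A minimal base R sketch on a cut-down version of the sample data, assuming (as the ID column above does) that the rows of both tables line up by position; merged is a made-up name:

```r
# Cut-down versions of the two sample tables
df1 <- data.frame(cough = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0), nasal = 0)
df2 <- data.frame(Data_number = c(1, 3, 6, 8, 9, 11, 12, 13, 15, 16),
                  cough = c(0, 1, 0, 0, 0, 1, 0, 0, 0, 1),
                  dehydration = 0)

# Columns present in both tables
common <- intersect(names(df1), names(df2))

# Start from df1 and add the columns that only df2 has
merged <- cbind(df1, df2[setdiff(names(df2), common)])

# For shared columns keep the element-wise maximum, so 1 and 1 stays 1
for (v in common) {
  merged[[v]] <- pmax(df1[[v]], df2[[v]])
}
merged
```

With count data rather than indicators, summing as in the dplyr loop is probably the better choice.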

Finding structural holes constraint, efficiency, ego density and effective size in R

I am applying the functions of the egonet package to an adjacency matrix, but when I run index.egonet it gives me an error.
My adjacency matrix "p2":
p2
1 2 3 4 5 7 8 9 6
1 0 1 1 1 1 0 0 0 0
2 1 0 0 0 1 1 1 1 0
3 1 0 0 0 0 1 0 1 1
4 1 0 0 0 0 0 0 0 0
5 1 1 0 0 0 0 0 0 0
7 0 1 1 0 0 0 0 0 0
8 0 1 0 0 0 0 0 0 0
9 0 1 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
I apply this command to the adjacency matrix to get the desired results, but it gives me an error:
index.egonet(p2)
Error in dati[ego.name, y] : subscript out of bounds
Any alternative or solution to this error would be highly appreciated.
As far as I could understand from working with that function, the ego must be named "EGO" in capital letters.
# Rename only the first node to "EGO" and keep the other node labels as they are
colnames(p2) <- rownames(p2) <- c("EGO", colnames(p2)[-1])
index.egonet(p2)
This should work.

How can I calculate an empirical CDF in R?

I'm reading a sparse table from a file which looks like:
1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1
Note row lengths are different.
Each row represents a single simulation. The value in the i-th column in each row says how many times value i-1 was observed in this simulation. For example, in the first simulation (first row), we got a single result with value '0' (first column), 7 results with value '2' (third column) etc.
I wish to create an average cumulative distribution function (CDF) for all the simulation results, so I could later use it to calculate an empirical p-value for true results.
To do this I can first sum up each column, but I need to take zeros for the undef columns.
How do I read such a table with different row lengths? How do I sum up the columns, replacing 'undef' values with 0? And finally, how do I create the CDF? (I can do this manually, but I guess there is some package which can do that.)
This will read the data in:
dat <- textConnection("1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1
1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1
1 0 0 1 0 3 0 0 0 0 1 0 0 0 1
0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1")
df <- data.frame(scan(dat, fill = TRUE, what = as.list(rep(1, 29))))
names(df) <- paste("Val", 1:29)
close(dat)
Resulting in:
> head(df)
Val 1 Val 2 Val 3 Val 4 Val 5 Val 6 Val 7 Val 8 Val 9 Val 10 Val 11 Val 12
1 1 0 7 0 0 1 0 0 0 5 0 0
2 1 0 0 1 0 0 0 3 0 0 0 0
3 0 0 0 1 0 0 0 2 0 0 0 0
4 1 0 0 1 0 3 0 0 0 0 1 0
5 0 0 0 1 0 0 0 2 0 0 0 0
....
If the data are in a file, provide the file name instead of dat. This code presumes that there are a maximum of 29 columns, as per the data you supplied. Alter the 29 to suit the real data.
We get the column sums using
df.csum <- colSums(df, na.rm = TRUE)
the ecdf() function generates an ECDF from them (it treats the column sums themselves as the sample),
df.ecdf <- ecdf(df.csum)
and we can plot it using the plot() method:
plot(df.ecdf, verticals = TRUE)
You can use the ecdf() (in base R) or Ecdf() (from the Hmisc package) functions.
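
Note that ecdf(df.csum) gives the distribution of the column totals themselves. If, as the question describes, column i holds the number of times the value i-1 was observed, the totals are frequencies, and a CDF over the values can be built from them directly. A rough base R sketch under that reading, pooling all simulations; sim_lines, padded and p_upper are hypothetical names, and the upper-tail p-value is only one possible definition:

```r
# The five simulation rows from the question, one string per row
sim_lines <- c(
  "1 0 7 0 0 1 0 0 0 5 0 0 0 0 2 0 0 0 0 1 0 0 0 1",
  "1 0 0 1 0 0 0 3 0 0 0 0 1 0 0 0 1",
  "0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1",
  "1 0 0 1 0 3 0 0 0 0 1 0 0 0 1",
  "0 0 0 1 0 0 0 2 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 2 1 0 1 0 1"
)

# Parse each row and pad the short ones with zeros on the right
rows <- lapply(strsplit(sim_lines, " +"), as.numeric)
width <- max(lengths(rows))
padded <- t(vapply(rows, function(r) c(r, rep(0, width - length(r))),
                   numeric(width)))

# Pooled frequency of each value; column i corresponds to value i - 1
counts <- colSums(padded)
values <- seq_along(counts) - 1

# Cumulative distribution: P(X <= value)
cdf <- cumsum(counts) / sum(counts)

# Empirical upper-tail p-value: share of simulated results >= x
p_upper <- function(x) sum(counts[values >= x]) / sum(counts)

plot(values, cdf, type = "s", xlab = "value", ylab = "cumulative proportion")
```

When reading from a file, readLines() on the file name could replace the hard-coded sim_lines vector.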
