Hope all goes well.
I am working on a data set that has 7 binary variables (each is either 0 or 1), and they are not mutually exclusive.
I need to convert them all into one categorical variable, which will have up to 2^7 = 128 levels.
I was wondering if anyone has done such a thing in R before?
I really appreciate your time and answer.
Best,
library(tidyr)
data <- data.frame(x1 = c(0, 1, 0, 1), x2 = c(1, 1, 1, 1), x3 = c(0, 0, 0, 0),
x4 = c(1, 0, 1, 0), x5 = c(0, 0, 1, 1), x6 = c(1, 1, 0, 0), x7 = c(1, 0, 0, 1))
data <- unite(data, combine_x, 1:7, remove=FALSE)
data$combine_x <- factor(data$combine_x)
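Note that factor() only creates levels for the combinations that actually occur in the data. If all 2^7 possible levels are needed, one option is to build the level set explicitly. A minimal base R sketch, assuming the columns are named x1 to x7 as in the example (cols and all_levels are illustrative names, not part of the answer above):

```r
# Same example data as above
data <- data.frame(x1 = c(0, 1, 0, 1), x2 = c(1, 1, 1, 1), x3 = c(0, 0, 0, 0),
                   x4 = c(1, 0, 1, 0), x5 = c(0, 0, 1, 1), x6 = c(1, 1, 0, 0),
                   x7 = c(1, 0, 0, 1))
cols <- paste0("x", 1:7)

# Every possible 0/1 combination of 7 variables, pasted with "_"
all_levels <- do.call(paste, c(expand.grid(rep(list(0:1), 7)), sep = "_"))

# Paste the observed rows the same way and use the full level set
data$combine_x <- factor(do.call(paste, c(data[cols], sep = "_")),
                         levels = all_levels)

nlevels(data$combine_x)  # 128 levels, most of them unused in this small example
```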
Using paste, something like this should work:
#create dataframe
df<-as.data.frame(cbind(rbinom(100,1,.5), rbinom(100,1,.5),rbinom(100,1,.5)))
#paste columns together using apply to loop over rows
df$new<-apply(df,1,function(x) paste(x, collapse =""))
Any non-NA character string can be used in the collapse argument (for example, collapse = ":" if you want to separate the values with a colon). Output:
> head(df)
V1 V2 V3 new
1 0 1 1 011
2 0 0 1 001
3 1 1 1 111
4 0 0 1 001
5 0 1 0 010
6 1 0 0 100
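Because paste0 is vectorized, the row-wise apply can be avoided. A small sketch of that variant, using made-up fixed data instead of rbinom so the result is reproducible:

```r
# Hypothetical example data: three binary columns
df <- data.frame(V1 = c(0, 0, 1), V2 = c(1, 0, 1), V3 = c(1, 1, 1))

# do.call passes each column as a separate argument to paste0,
# which then pastes them together element by element
df$new <- do.call(paste0, df)

# Convert to a factor if a categorical variable is needed
df$new <- factor(df$new)
df$new
```

This gives the same strings as the apply version ("011", "001", "111" here) and tends to be faster on large data frames.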
Assume I have a data set with an arbitrary number of rows and columns, like the one shown below.
library(tibble)
tmp <- tibble(id = 1:10,
v1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
v2 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
v3 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
v4 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0))
Each row is a response. The respondent has either said yes (1) or no (0) to a specific question. Here, we have 4 questions.
What is the easiest way to convert this into a concordance matrix like below:
   v1 v2 v3 v4
v1  3  2  1  2
v2  2  2  1  1
v3  1  1  2  2
v4  2  1  2  3
Each cell shows, of those who answered yes to the question on the row, how many also answered yes to the question on the column.
Please note that the number of questions may be greater than 4, so I prefer not to hard-code variable names in the solution. I can make sure the variable names always follow a specific format if that is helpful. A solution that doesn't care about variable names is ideal (we can drop the id column if needed).
The easiest way is with matrix multiplication...
mx <- as.matrix(tmp[,-1])
t(mx) %*% mx
v1 v2 v3 v4
v1 3 2 1 2
v2 2 2 1 1
v3 1 1 2 2
v4 2 1 2 3
crossprod(mx) will do the same thing.
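A quick way to check that the two forms agree, sketched with a plain data.frame in place of the tibble so that no packages are required:

```r
tmp <- data.frame(id = 1:10,
                  v1 = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1),
                  v2 = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1),
                  v3 = c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
                  v4 = c(0, 0, 0, 1, 1, 0, 0, 0, 1, 0))

mx <- as.matrix(tmp[, -1])   # drop the id column
cp <- crossprod(mx)          # same as t(mx) %*% mx

# Both forms give the same matrix
stopifnot(isTRUE(all.equal(cp, t(mx) %*% mx)))

diag(cp)        # number of "yes" answers per question
cp["v1", "v4"]  # respondents who said yes to both v1 and v4
```

The diagonal holds the per-question totals, and each off-diagonal cell counts the rows where both columns are 1.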
Using tcrossprod (dropping the id column first, otherwise it ends up in the result):
tcrossprod(t(tmp[-1]))
I couldn't find an answer to this specific question; sorry if it's been asked before:
library(tidyverse)
#sampledata
df <- data.frame(group=c(1, 1, 1, 1, 0, 0, 0, 0),
v1=c(1, 0, 0, 1, 0, 1, 1, 1),
v2=c(0, 0, 0, 0, 1, 0, 0, 1),
v3=c(0, 1, 0, 1, 1, 0, 1, 1))
I want to find the number of "1"s and "0"s in each v1, v2, v3 for each level of "group".
Currently I have been using
table(df$group, df$v1)
table(df$group, df$v2)
table(df$group, df$v3)
ad nauseam to get the number of "1"s in each variable, but I can't figure out how to create many such tables with one function. Any help would be greatly appreciated.
We can use lapply to apply the same function to multiple columns.
lapply(df[-1], function(x) table(df$group, x))
#$v1
# x
# 0 1
# 0 1 3
# 1 2 2
#$v2
# x
# 0 1
# 0 2 2
# 1 4 0
#$v3
# x
# 0 1
# 0 1 3
# 1 2 2
Or with dplyr we can use count
purrr::map(names(df)[-1], ~count(df, group, !!sym(.x)))
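If only the number of 1s per group is needed, the columns can simply be summed within each group, since every value is 0 or 1. A base R sketch using rowsum (the names ones and zeros are illustrative):

```r
df <- data.frame(group = c(1, 1, 1, 1, 0, 0, 0, 0),
                 v1 = c(1, 0, 0, 1, 0, 1, 1, 1),
                 v2 = c(0, 0, 0, 0, 1, 0, 0, 1),
                 v3 = c(0, 1, 0, 1, 1, 0, 1, 1))

# Sum of each 0/1 column within each group = number of 1s
ones <- rowsum(df[-1], df$group)
ones

# Number of 0s = group size minus number of 1s
zeros <- as.vector(table(df$group)) - ones
zeros
```

The row names of the result are the group levels, so ones["0", "v1"] is the number of 1s in v1 for group 0.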
I have two binary columns:
col1 col2
0 1
0 0
1 0
1 1
I would like to merge these columns so that the result is 1 if a 1 exists in either column (or in both), and 0 otherwise. Example of the output:
merged_col
1
0
1
1
The general merge I tried is this:
merge(df$col1, df$col2, all = TRUE)
Any idea how I can handle the values?
You can just treat them as logical values and use or...
df$col3 <- as.integer(df$col1|df$col2)
The code below should do what you need:
df <- data.frame(col1 = c(0, 0, 1, 1), col2 = c(1, 0, 0, 1))
df$merge_col <- ifelse(df$col1 == 1 | df$col2 == 1, 1, 0)
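The same idea extends to more than two columns. A sketch, assuming all columns listed in the (illustrative) vector cols contain only 0 and 1:

```r
df <- data.frame(col1 = c(0, 0, 1, 1), col2 = c(1, 0, 0, 1))
cols <- c("col1", "col2")  # add further binary columns here as needed

# A row gets 1 if at least one of the selected columns is 1
df$merged_col <- as.integer(rowSums(df[cols]) > 0)

# For 0/1 data the element-wise maximum gives the same result
stopifnot(all(df$merged_col == pmax(df$col1, df$col2)))
df$merged_col
```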
I'm trying to programmatically change a variable from a 0 to a 1 if there are three 1s before and after a 0.
For example, if the number in a vector were 1, 1, 1, 0, 1, 1, and 1, then I want to change the 0 to a 1.
Here is the data, in the vector dummy_code of the data frame original_df:
original_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1))
Here is how I'm trying to have the values be recoded:
desired_df <- data.frame(dummy_code = c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1))
I tried to use the function fill in the package tidyr, but this fills in missing values, so it won't work. If I were to recode the 0 values to be missing, then that would not work either, because it would simply code every NA as 1, when I would only want to code every NA surrounded by three 1s as 1.
Is there a way to do this in an efficient way programmatically?
An rle alternative, using the x from @G. Grothendieck's answer:
r <- rle(x)
Find indexes of runs of three 1:
i1 <- which(r$lengths == 3 & r$values == 1)
Check which of these runs of 1s surround a run of 0s, and get the indexes of the 0 runs to be replaced:
i2 <- i1[which(diff(i1) == 2)] + 1
Replace relevant 0 with 1:
r$values[i2] <- 1
Reverse the rle operation on the updated runs:
inverse.rle(r)
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
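The rle steps can be collected into one helper. The sketch below is a variant rather than the answer's exact code: fill_single_zero is a hypothetical name, and it additionally requires the run of 0s to have length 1, which is an assumption about the intended behaviour (the steps above would also fill a longer run of 0s sitting between two runs of three 1s).

```r
# Hypothetical helper: replace a single 0 by 1 when it sits between
# two runs of exactly `run` ones
fill_single_zero <- function(x, run = 3) {
  r <- rle(x)
  n <- length(r$values)
  if (n < 3) return(x)          # no interior runs to inspect
  mid <- 2:(n - 1)              # indexes of interior runs
  hit <- r$values[mid] == 0 & r$lengths[mid] == 1 &
    r$values[mid - 1] == 1 & r$lengths[mid - 1] == run &
    r$values[mid + 1] == 1 & r$lengths[mid + 1] == run
  r$values[mid[hit]] <- 1       # flip the qualifying 0 runs
  inverse.rle(r)
}

x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
fill_single_zero(x)
```

For the example input this returns the same vector as the step-by-step version.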
A similar solution based on data.table::rleid, slightly more compact and perhaps easier to read:
library(data.table)
d <- data.table(x)
Calculate length of each run:
d[ , n := .N, by = rleid(x)]
For values of "x" that are zero and whose preceding and subsequent runs of 1 are both of length 3, set "x" to 1:
d[x == 0 & shift(n) == 3 & shift(n, type = "lead") == 3, x := 1]
d$x
# [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Here is a one-liner using rollapply from zoo:
library(zoo)
rollapply(c(0, 0, 0, x, 0, 0, 0), 7, function(x) if (all(x[-4] == 1)) 1 else x[4])
## [1] 1 0 0 1 1 1 1 1 1 1 0 0 1
Note: Input used was:
x <- c(1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1)
I have a data frame with only zeros and ones, e.g.
df <- data.frame(v1 = rbinom(100, 1, 0.5),
v2 = rbinom(100, 1, 0.2),
v3 = rbinom(100, 1, 0.4))
Now I want to modify this data set so that each row sums to 1.
So this
1 0 0
1 1 0
0 0 1
1 1 1
0 0 0
should become this:
1 0 0
0.5 0.5 0
0 0 1
0.33 0.33 0.33
0 0 0
edit: rows with all zeros should be left as is
As already pointed out by @lmo, the data.frame (or matrix) can be modified with
df <- df / rowSums(df)
In the case of rows containing only zeros this will lead to rows containing only NaN. Since these rows should be kept as they were, the easiest way is probably to correct for this afterwards with
df[is.na(df)] <- 0
Here is a quick method:
# create matrix
temp <- matrix(c(1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1), ncol = 3, byrow = TRUE)
temp / rowSums(temp)
This exploits the fact that matrices are stored column-wise, so when the vector returned by rowSums is recycled during element-by-element division, each element is divided by the sum of its own row.
If all elements in a row are zero, the division is 0/0 and yields NaN. To avoid that without correcting afterwards, as in @RHertel's answer, another method is the following:
# save rowSum:
mySums <- rowSums(temp)
temp / ifelse(mySums != 0, mySums, 1)
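As a check, here is the same approach applied to the five example rows from the question, including the all-zero row (s and out are illustrative names):

```r
df <- data.frame(v1 = c(1, 1, 0, 1, 0),
                 v2 = c(0, 1, 0, 1, 0),
                 v3 = c(0, 0, 1, 1, 0))

s <- rowSums(df)

# Divide each row by its sum; all-zero rows are divided by 1 and stay zero
out <- df / ifelse(s != 0, s, 1)
out

rowSums(out)  # 1 for every row that had at least one 1, 0 for the all-zero row
```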