Transforming Binary data - r

I have a dataframe that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R so that I create new columns (e.g. number of rooms) and turn the position of the 1 into a value (e.g. a 1 in the 4th column becomes "vhigh")?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should look something like this:
  PurchaseP NumberofRooms ...
1 vhigh     4
2 low       4
3 vhigh     1
4 vhigh     2
I only did it for the first 2 explanatory variables here, but essentially it repeats like this, with each explanatory variable having 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms, V9:V11 = floors, and so on.
In my head something like this could work:
Create an if statement to give each 1 a value depending on column position, e.g. if the value in V4 is 1 then name it "vhigh", and do this for each Vx.
Then combine each group of columns V1:V4, V5:V8, V9:V11 (depending on whether it has 3 or 4 possible factor/integer values) while ignoring 0 values.
Would this work, or is there a simpler approach? How would one code this in R?

Here is an approach that should work for you. I wrote a function, which will take as arguments your data.frame, the columns representing one of your variables of interest (e.g. purchase price is stored in columns 1 to 4), and the names of the levels you would like as a result. The function will then return the result you requested. You'll need to write this out for the 6 variables you are interested in.
I'll simulate some data and illustrate the approach.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))
df
#> X1 X2 X3 X4 X5 X6 X7 X8
#> 1 0 0 0 1 1 0 0 0
#> 2 1 0 0 0 0 0 0 1
#> 3 0 0 0 1 1 0 0 0
#> 4 1 0 0 0 0 0 0 1
We'll say that the first four columns are the purchase price in v.low to v.high, and the second four are the number of rooms (1:4). We'll write a function that takes this information as arguments and returns the result:
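rangeToCol <- function(df,        # your data.frame
                       range,     # the columns that encode the category of interest
                       lev.names  # the names of the category levels
                       ) {
  tdf <- df[range]
  # for each row, find the position of the 1 and look up the matching level name
  # (assumes every row contains exactly one 1 within `range`)
  lev.names[unlist(apply(tdf, 1, function(rw) which(rw == 1)))]
}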
new.df <- data.frame(PurchaseP = rangeToCol(df, 1:4,
                                            c('vlow', 'low', 'high', 'vhigh')),
                     NumberofRooms = rangeToCol(df, 5:8, 1:4))
new.df
#> PurchaseP NumberofRooms
#> 1 vhigh 1
#> 2 vlow 4
#> 3 vhigh 1
#> 4 vlow 4
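Since the question mentions six variables, the calls can also be driven from one specification list instead of being written out by hand. The following is only a sketch under assumptions: `spec` and `decode_all` are hypothetical names, and the column ranges and level labels are the ones used in the simulated example, not the asker's real coding. It uses `max.col()` to locate the 1 in each row and stops if a row does not contain exactly one 1, a case in which the `which()` version would return a vector of the wrong length.

```r
# Hypothetical specification: column range and level labels per variable.
spec <- list(
  PurchaseP     = list(cols = 1:4, levels = c("vlow", "low", "high", "vhigh")),
  NumberofRooms = list(cols = 5:8, levels = as.character(1:4))
)

decode_all <- function(df, spec) {
  out <- lapply(spec, function(s) {
    block <- as.matrix(df[s$cols])
    # Each row must contain exactly one 1 for the decoding to be well defined.
    stopifnot(all(rowSums(block == 1) == 1))
    # max.col() gives the column index of the row maximum, i.e. of the 1.
    idx <- max.col(block, ties.method = "first")
    factor(s$levels[idx], levels = s$levels)
  })
  as.data.frame(out)
}

# Same simulated data as in the answer.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))
decoded <- decode_all(df, spec)
decoded
```

Adding a further variable then only means adding one more entry to `spec`.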

Related

Using R to remove all columns that sum to 0

I have a very large CSV file containing counts of unique DNA sequences, and there is a column for each unique sequence. I started with hundreds of samples and cut it down to only 15 that I care about, but now I have thousands of columns that contain nothing but zeroes and it is messing up my data processing. How do I go about completely removing any column that sums to zero? I’ve seen some similar questions on here but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters my columns all have super crazy names, some several hundred characters long ( AATCGGCTAA..., etc) and the row names are the sample IDs which are also not entirely numeric. Any tips greatly appreciated. I am still new to R so please let me know where I will need to change things in code examples if you can! Thanks!
You can use colSums
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove all columns which sum to 0, because that could be true even if not all elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to only remove columns where all elements are 0, you can do
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
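With real count data, two things may get in the way of the one-liners above: missing values (a column containing `NA` makes `colSums()` return `NA` for that column) and non-numeric columns such as a sample ID. A minimal base R sketch, assuming you want to keep non-numeric columns untouched and ignore `NA` when looking for non-zero values; `drop_all_zero` and `counts` are hypothetical names and the data is made up:

```r
drop_all_zero <- function(df) {
  keep <- vapply(df, function(col) {
    if (!is.numeric(col)) return(TRUE)   # leave non-numeric columns alone
    any(col != 0, na.rm = TRUE)          # TRUE if at least one non-zero value
  }, logical(1))
  df[keep]
}

# Made-up example: one ID column and three sequence-count columns.
counts <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  AATCGG    = c(0, 0, 0),
  GGCTAA    = c(0, 5, NA),
  TTTTAC    = c(NA, 0, 0)
)
result <- drop_all_zero(counts)
names(result)
```

Column names are never typed out here, so very long sequence names should not matter.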
Welcome to SO. Here is a tidyverse approach:
library(tidyverse)
mtcars %>%
  select_if(is.numeric) %>%
  # keep only the columns whose sum is non-zero
  select_if(~ sum(.x) != 0)

How do you sum different columns of binary variables based on a desired set of variables/column?

I used the code below for a total of 25 variables and it worked. Each new variable shows up as either 1 or 0:
jb$finances <- ifelse(grepl("Finances", jb$content.roll),1,0)
I want to be able to add the number of "1" s in each row across the multiple of selected column/variables I just made (using the code above) into another column called "sum.content". I used the code below:
jb <- jb %>%
mutate(sum.content=sum(jb$Finances,jb$Exercise,jb$Volunteer,jb$Relationships,jb$Laugh,jb$Gratitude,jb$Regrets,jb$Meditate,jb$Clutter))
I didn't get an error using the code above, but I did not get the outcome I wanted.
The result was 14 for all my rows. I was expecting something less than or equal to 9, since I only selected 9 variables. I don't want to delete the other variables like V1 and V2; I just want to focus on summing some variables.
This is what I got using the code:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 14
2 0 1 0 0 1 14
2 0 0 0 0 1 14
This is What I want:
V1 V2... Finances Exercise Volunteer Relationships Laugh sum.content
1 1 1 1 1 0 4
2 0 1 0 0 1 1
2 0 0 0 0 1 1
I want R to add up the number of 1's in each row (within the columns I select). How would I go about summing the 1's across a chosen set of variables/columns in code?
Here is an answer that uses dplyr to sum across rows of variables starting with the letter V. We'll simulate some data, convert to binary, and then sum the rows.
library(dplyr)
data <- matrix(rnorm(100, 100, 30), nrow = 10)
# recode to binary
data <- apply(data, 2, function(x) ifelse(x > 100, 1, 0))
# change some of the column names to illustrate impact of
# select() within mutate()
colnames(data) <- c(paste0("V", 1:5), paste0("X", 1:5))
as.data.frame(data) %>%
  mutate(total = select(., starts_with("V")) %>% rowSums)
...and the output, where the sums should equal the sum of V1 - V5 but not X1 - X5:
V1 V2 V3 V4 V5 X1 X2 X3 X4 X5 total
1 1 0 0 0 1 0 0 0 1 0 2
2 1 0 0 1 0 0 0 1 1 0 2
3 1 1 1 0 1 0 0 0 1 0 4
4 0 0 1 1 0 1 0 0 1 0 2
5 0 0 1 0 1 0 1 1 1 0 2
6 0 1 1 0 1 0 0 1 1 1 3
7 1 0 1 1 0 0 0 0 0 1 3
8 1 0 0 1 1 1 0 1 1 1 3
9 1 1 0 0 1 0 1 1 0 0 3
10 0 1 1 0 1 1 0 0 1 0 3
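The reason the original attempt gave the same number in every row is that `sum()` collapses all of its arguments into a single grand total, which `mutate()` then recycles down the column, whereas `rowSums()` returns one total per row. A small base R sketch, using a made-up stand-in for `jb` (the values and the shortened column list are assumptions for illustration only):

```r
# Hypothetical stand-in for the asker's `jb` data frame.
jb <- data.frame(
  V1            = c(1, 2, 2),
  Finances      = c(1, 0, 0),
  Exercise      = c(1, 1, 0),
  Volunteer     = c(1, 0, 0),
  Relationships = c(1, 0, 0),
  Laugh         = c(0, 1, 1)
)

# Name the columns to add up; all other columns are left untouched.
content_cols <- c("Finances", "Exercise", "Volunteer", "Relationships", "Laugh")

# sum() gives one grand total over every selected cell.
grand_total <- sum(jb[content_cols])

# rowSums() gives one total per row, which is what was wanted.
jb$sum.content <- rowSums(jb[content_cols])
jb
```

Selecting by name also avoids relying on a shared prefix such as "V".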

Subsetting a data frame using the sum of each row vector R

Hi, I have some data that I am reading in from a CSV, which is set out in binary form:
1 2 3 4...N
1 0 1 0 1...1
2 1 1 0 1...1
3 0 0 0 0...0
4 1 0 1 1...1
. 1 1 1 0...1
. 1 0 0 0...1
N 0 0 1 1...0
[Screenshot of str(data)]
I want to take a subset of this data where the sum of the row vectors is greater than a number say 10, or x. The first column is a placeholder column for customer ID, so this needs to be excluded. Do you have any suggestions about how I could go about doing this?
I've been trying various things like df=subset() but I've not been able to get the syntax correct.
Thanks in advance.
We can do this with rowSums
df1[rowSums(df1) > 10, , drop = FALSE]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
#7 0 0 0 1 0 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1
#9 1 1 1 1 0 0 1 0 0 0 0 1 1 0 0 1 1 1 0 1
Update
In the OP's dataset, the first column 'X' is not binary and holds larger numbers, so when that column is included the row sums come out greater than 10. It is the index ID and should not be used in the calculation. Removing it inside rowSums gives the intended subset:
df1[rowSums(df1[-1]) > 10, ]
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(0:1, 10* 20, replace = TRUE), ncol = 20))
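To make the effect of the ID column concrete, here is a small sketch with made-up data in which the ID column is excluded by name rather than by position and the threshold is held in a variable `x`. The column name `X` follows the answer's description of the OP's data; the values, the threshold of 2, and `binary_cols` are assumptions for illustration.

```r
# Made-up data: X is a customer ID, V1..V4 are binary indicators.
df1 <- data.frame(
  X  = c(1001, 1002, 1003),
  V1 = c(1, 0, 1),
  V2 = c(1, 0, 1),
  V3 = c(1, 1, 0),
  V4 = c(0, 0, 1)
)

x <- 2                                    # threshold for the row sums
binary_cols <- setdiff(names(df1), "X")   # every column except the ID

# Including X would make every row pass, because the IDs dominate the sums.
with_id    <- df1[rowSums(df1) > x, , drop = FALSE]

# Excluding X keeps only rows with more than x ones.
without_id <- df1[rowSums(df1[binary_cols]) > x, , drop = FALSE]
without_id
```

The ID column is still present in the result; it is only left out of the sum.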

choose the specific variable in r

There are data like this table:
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15
4 0 0 2 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 3 0 0 0 0 0 0 0 0 0 0 0
and I wish to make a new matrix with the specific variable columns which have a numerical value different from zero (in this case the specific variable columns are v1 and v4).
I know the subset function but I cannot find the way to choose conditional columns by using "if statement".
I mean... how can I make a matrix with only the specific columns that have numerical value different from zero by using "if statement"?
Please help me to solve my problem.
Thanks.
You haven't specified what format your data is in, but if you have a matrix or a data.frame, you should be able to use the R extract operator ([) to specify only the columns you want. You can feed it a vector of logical values (TRUE or FALSE) for that specification, so all you need is a function that will return the logical values you want.
As a simple example with a matrix, you could apply a function seeing if there are any non-zero values across each of the columns of the matrix:
> a
[,1] [,2] [,3] [,4]
[1,] 0 1 0 4
[2,] 0 2 0 5
[3,] 0 3 0 6
> a[, apply(a, 2, function(x) { return(any(x != 0)) })]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
This same extract mechanism works on data.frames as well:
> a
V1 V2 V3 V4
1 0 1 0 4
2 0 2 0 5
3 0 3 0 6
> a[, sapply(a, function(x) { return(any(x != 0)) })]
V2 V4
1 1 4
2 2 5
3 3 6
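One caveat with this pattern: if only a single column passes the test, `[` simplifies the result to a plain vector unless `drop = FALSE` is given. A short sketch with a made-up matrix, which also uses `colSums(a != 0) > 0` as an equivalent vectorised test (the names `keep`, `as_vector` and `as_matrix` are just for illustration):

```r
# Made-up matrix in which only the middle column has non-zero values.
a <- matrix(c(0, 0, 0,
              1, 2, 3,
              0, 0, 0), nrow = 3)

# TRUE for columns that contain at least one non-zero value.
keep <- colSums(a != 0) > 0

as_vector <- a[, keep]                # simplified to a numeric vector
as_matrix <- a[, keep, drop = FALSE]  # stays a 3 x 1 matrix

is.matrix(as_vector)
dim(as_matrix)
```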

Creating a factor/categorical variable from 4 dummies

I have a data frame with four columns, let's call them V1-V4 and ten observations. Exactly one of V1-V4 is 1 for each row, and the others of V1-V4 are 0. I want to create a new column called NEWCOL that takes on the value of 3 if V3 is 1, 4 if V4 is 1, and is 0 otherwise.
I have to do this for MANY sets of variables V1-V4 so I would like the solution to be as short as possible so that it will be easy to replicate.
This does it for 4 columns to add a fifth using matrix multiplication:
> cbind( mydf, newcol=data.matrix(mydf) %*% c(0,0,3,4) )
V1 V2 V3 V4 newcol
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 0 1 0 3
6 0 0 1 0 3
7 0 0 0 1 4
8 0 0 0 1 4
9 0 0 0 1 4
10 0 0 0 1 4
It's generalizable to getting multiple columns; we just need the rules. You need to make a matrix with the same number of rows as there are columns in the original data, with one column of weights for each new variable you want to build. This shows how to build one new column from the sum of 3 times the third column plus 4 times the fourth, and another new column from 1 times the first plus 2 times the second.
> cbind( mydf, newcol=data.matrix(mydf) %*% matrix(c(0,0,3,4, # first set of factors
1,2,0,0), # second set
ncol=2) )
V1 V2 V3 V4 newcol.1 newcol.2
1 1 0 0 0 0 1
2 1 0 0 0 0 1
3 0 1 0 0 0 2
4 0 1 0 0 0 2
5 0 0 1 0 3 0
6 0 0 1 0 3 0
7 0 0 0 1 4 0
8 0 0 0 1 4 0
9 0 0 0 1 4 0
10 0 0 0 1 4 0
An example data set:
mydf <- data.frame(V1 = c(1, 1, rep(0, 8)),
V2 = c(0, 0, 1, 1, rep(0, 6)),
V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
V4 = c(rep(0, 6), rep(1, 4)))
# V1 V2 V3 V4
# 1 1 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 0 1 0
# 6 0 0 1 0
# 7 0 0 0 1
# 8 0 0 0 1
# 9 0 0 0 1
# 10 0 0 0 1
Here's an easy approach to generate the new column:
mydf <- transform(mydf, NEWCOL = V3 * 3 + V4 * 4)
# V1 V2 V3 V4 NEWCOL
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 0 0
# 4 0 1 0 0 0
# 5 0 0 1 0 3
# 6 0 0 1 0 3
# 7 0 0 0 1 4
# 8 0 0 0 1 4
# 9 0 0 0 1 4
# 10 0 0 0 1 4
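Because the same step has to be repeated for many sets of dummies, it may help to wrap it in one small function. The sketch below is an assumption-laden illustration rather than either answer's method: `dummies_to_value` is a hypothetical helper, it relies on the stated guarantee that exactly one of the dummy columns is 1 in each row (and checks it), and it uses `max.col()` to find that column.

```r
mydf <- data.frame(V1 = c(1, 1, rep(0, 8)),
                   V2 = c(0, 0, 1, 1, rep(0, 6)),
                   V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
                   V4 = c(rep(0, 6), rep(1, 4)))

# Hypothetical helper: map the position of the 1 in `cols` to `values`.
dummies_to_value <- function(df, cols, values) {
  block <- as.matrix(df[cols])
  stopifnot(length(values) == length(cols),
            all(rowSums(block == 1) == 1))   # exactly one 1 per row
  values[max.col(block, ties.method = "first")]
}

dummy_cols <- c("V1", "V2", "V3", "V4")

# Numeric coding as requested: 3 if V3 is 1, 4 if V4 is 1, 0 otherwise.
mydf$NEWCOL <- dummies_to_value(mydf, dummy_cols, c(0, 0, 3, 4))

# The same helper can build a proper factor from the column names.
mydf$level <- factor(dummies_to_value(mydf, dummy_cols, dummy_cols),
                     levels = dummy_cols)
mydf
```

For another set of dummies, only the `cols` and `values` arguments change.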
