I have the following data
ID v1 v2 v3 v4 v5
 1  1  3  6  4 NA
 2  4  2 NA NA NA
 3  3  1  8  5 NA
 4  2  5  3  1 NA
Can I rearrange the data so that R automatically creates new columns and assigns a binary value (1 or 0) according to the values in each variable (v1 to v5)?
E.g. in the first row, I have the values 1, 3, 4 and 6. Can R automatically create 6 dummy variables and assign the value to the respective column, as below:
ID dummy1 dummy2 dummy3 dummy4 dummy5 dummy6
1 1 0 1 1 0 1
To have something like this:
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 0 1 1 0 1 0 0
2 0 1 0 1 0 0 0 0
3 1 0 1 0 1 0 0 1
4 1 1 1 0 1 0 0 0
Thanks.
We can use base R to do this. Loop through the rows of the dataset except the first column, build the sequence from 1 to the maximum value in the row, check which of these occur in the row, and convert the result to integer with as.integer. Then append NAs at the end to make the lengths in the list output equal, and cbind with the first column:
lst <- apply(df[-1], 1, function(x) as.integer(seq_len(max(x, na.rm = TRUE)) %in% x))
res <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
res[is.na(res)] <- 0
colnames(res)[-1] <- paste0('c', 1:8)
res
# ID c1 c2 c3 c4 c5 c6 c7 c8
#1 1 1 0 1 1 0 1 0 0
#2 2 0 1 0 1 0 0 0 0
#3 3 1 0 1 0 1 0 0 1
#4 4 1 1 1 0 1 0 0 0
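If the total number of columns is not known in advance, the hard-coded 1:8 in the column names can be avoided by deriving the count from the result itself; a small sketch:
colnames(res)[-1] <- paste0('c', seq_len(ncol(res) - 1))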
In base R, you can use:
table(transform(cbind(mydf[1], stack(mydf[-1]))[1:2], values = factor(values, 1:8)))
## values
## ID 1 2 3 4 5 6 7 8
## 1 1 0 1 1 0 1 0 0
## 2 0 1 0 1 0 0 0 0
## 3 1 0 1 0 1 0 0 1
## 4 1 1 1 0 1 0 0 0
Note that you need to convert the stacked values to factor if you want the "7" to be included in the output. This applies to the "data.table" and "tidyverse" approaches as well.
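For illustration, the same base R call without the factor conversion silently drops the unobserved "7" column:
table(cbind(mydf[1], stack(mydf[-1]))[1:2])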
Alternatively, you can try the following with "data.table":
library(data.table)
melt(as.data.table(mydf), "ID", na.rm = TRUE)[
, dcast(.SD, ID ~ factor(value, 1:8), fun = length, drop = FALSE)]
Or the following with the "tidyverse":
library(tidyverse)
mydf %>%
gather(var, val, -ID, na.rm = TRUE) %>%
select(-var) %>%
mutate(var = 1, val = factor(val, 1:8)) %>%
spread(val, var, fill = 0, drop = FALSE)
Sample data:
mydf <- structure(list(ID = 1:4, v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L,
1L, 5L), v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L), v5 = c(NA,
NA, NA, NA)), .Names = c("ID", "v1", "v2", "v3", "v4", "v5"), row.names = c(NA,
4L), class = "data.frame")
If automation is important, you can also use syntax like factor(value, sequence(max(value))) in the "data.table" approach or val = factor(val, sequence(max(val))) in the "tidyverse" approach, as sketched below.
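For instance, a sketch of the automated "data.table" call, with the factor levels derived from the data rather than hard-coded as 1:8:
library(data.table)
melt(as.data.table(mydf), "ID", na.rm = TRUE)[
  , dcast(.SD, ID ~ factor(value, sequence(max(value))), fun = length, drop = FALSE)]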
Another base R answer with some similarities to akrun's is
# create matrix of values
myMat <- as.matrix(dat[-1])
# create result matrix of desired shape, filled with 0s
res <- matrix(0L, nrow(dat), ncol=max(myMat, na.rm=TRUE))
# use matrix indexing to fill in 1s; na.omit drops the index rows produced by
# missing values, which are not allowed in a subscripted assignment
res[na.omit(cbind(dat$ID, as.vector(myMat)))] <- 1L
# convert to data.frame, add ID column, and provide variable names
setNames(data.frame(cbind(dat$ID, res)), c("ID", paste0("c", 1:8)))
which returns
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 1 0 1 1 0 1 0 0
2 2 0 1 0 1 0 0 0 0
3 3 1 0 1 0 1 0 0 1
4 4 1 1 1 0 1 0 0 0
I have the following data, where v1:v4 are boolean (TRUE/FALSE):
df1
id v1 v2 v3 v4
1 T T F F
2 F F T F
3 T F F F
4 F T T T
df2
var weight
v1 1
v2 4
v3 2
v4 5
I require first to replace the TRUE values of each variable based on the variable name and the lookup table df2. So any TRUE under the v1 column, for example, will become 1. FALSE will always be 0.
Then, a Status variable should indicate whether the entire row contains a single non-zero value or multiple:
df.out
id v1 v2 v3 v4 Status
1 1 4 0 0 Multiple
2 0 0 2 0 Single
3 1 0 0 0 Single
4 0 4 2 5 Multiple
Here is one option with tidyverse. Loop across the column names specified in the 'var' column of 'df2', and replace the TRUE values with the corresponding 'weight' element by matching the column name (cur_column()) against the 'var' column. Then create the 'Status' column based on the number of non-zero elements in each row using rowSums:
library(dplyr)
df1 %>%
mutate(across(df2$var,
~ replace(., ., df2$weight[match(cur_column(), df2$var)]))) %>%
mutate(Status = case_when(rowSums(.[df2$var] > 0) > 1
~ 'Multiple', TRUE ~ 'Single'))
-output
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
Or using base R
df1new <- cbind(df1[1], setNames(df2$weight,
df2$var)[col(df1[df2$var])] * df1[df2$var])
df1new$Status <- c("Single", "Multiple")[1 + (rowSums(df1new[df2$var] > 0) > 1)]
-output
> df1new
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
Or another option is Map from base R
lst1 <- Map(`*`, df1[df2$var], df2$weight)
cbind(df1[1], lst1, Status = c('Single', 'Multiple')[1 + (rowSums(df1[-1]) > 1)])
id v1 v2 v3 v4 Status
1 1 1 4 0 0 Multiple
2 2 0 0 2 0 Single
3 3 1 0 0 0 Single
4 4 0 4 2 5 Multiple
data
df1 <- structure(list(id = 1:4, v1 = c(TRUE, FALSE, TRUE, FALSE), v2 = c(TRUE,
FALSE, FALSE, TRUE), v3 = c(FALSE, TRUE, FALSE, TRUE), v4 = c(FALSE,
FALSE, FALSE, TRUE)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(var = c("v1", "v2", "v3", "v4"), weight = c(1L,
4L, 2L, 5L)), class = "data.frame", row.names = c(NA, -4L))
It can be done very simply with a matrix operation, where x holds the logical values of df1 and y the weights of df2:
x <- matrix(c(T,F,T,F,T,F,F,T,F,T,F,T,F,F,F,T), nrow =4 , ncol = 4)
y <- c(1,4,2,5)
z <- x %*% diag(y)
z
The result is:
[,1] [,2] [,3] [,4]
[1,] 1 4 0 0
[2,] 0 0 2 0
[3,] 1 0 0 0
[4,] 0 4 2 5
Then, counting the non-zero entries per row:
Status <- rowSums(z > 0)
res <- as.data.frame(cbind(z, Status))
res
V1 V2 V3 V4 Status
1 1 4 0 0 2
2 0 0 2 0 1
3 1 0 0 0 1
4 0 4 2 5 3
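Note that Status is a count of non-zero values here rather than the Single/Multiple labels requested; if the labels are needed, they can be recoded afterwards:
res$Status <- ifelse(res$Status > 1, "Multiple", "Single")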
I have 2 indicators:
licence age.6-17
Na 1
1 0
Na 0
0 1
How can I change Na to 1 if a person is more than 17 years old (that is, the second column is 0), and to 0 otherwise?
output
licence age.6-17
0 1
1 0
1 0
0 1
Using dplyr and ifelse:
yourdata %>% mutate(licence = ifelse(`age.6-17` == 0, 1, 0))
There is no need to change the "Na" values beforehand, nor the column name.
In addition, in case you would need to replace only the "Na" cells (considering "Na" is a string here):
yourdata %>% mutate(licence = ifelse(licence == "Na", 1 - `age.6-17`, licence))
If, however, it is a real <NA>, you would need is.na(licence) instead of licence == "Na", as sketched below.
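A sketch of that variant, assuming licence is then a numeric column containing real NA values:
yourdata %>% mutate(licence = ifelse(is.na(licence), 1 - `age.6-17`, licence))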
In base you can subset with is.na and then subtract the value of age.6.17 from 1.
x <- read.table(header=T, na.string="Na", text="licence age.6-17
Na 1
1 0
Na 0
0 1")
idx <- is.na(x$licence)
x$licence[idx] <- 1-x$age.6.17[idx]
x
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
or, in case you do not care what is actually stored in the licence column, you can use:
with(x, data.frame(licence=1-age.6.17, age.6.17))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Assuming your NAs are actual NA we can use case_when in dplyr and apply the conditions.
library(dplyr)
df %>%
mutate(licence = case_when(is.na(licence) & age.6.17 == 0 ~ 1L,
is.na(licence) & age.6.17 == 1 ~ 0L,
TRUE ~ licence))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
data
df <- structure(list(licence = c(NA, 1L, NA, 0L), age.6.17 = c(1L,
0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
I have a matrix that contains the column indices of the elements that should be set to 1 in a new matrix I want to create:
> index
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 5
[3,] 1 3 4
[4,] 1 3 5
[5,] 1 4 5
[6,] 2 3 5
[7,] 3 4 5
For example, the first row has the column ids 1, 2, 3, which are set to the value 1; the second row has the column ids 1, 2, 5, which are set to the value 1.
Now I want to create the following matrix:
a1 a2 a3 a4 a5
[1,] 1 1 1 0 0
[2,] 1 1 0 0 1
[3,] 1 0 1 1 0
[4,] 1 0 1 0 1
[5,] 1 0 0 1 1
[6,] 0 1 1 0 1
[7,] 0 0 1 1 1
Data
index <- rbind(c(1,2,3), c(1,2,5), c(1,3,4), c(1,3,5), c(1,4,5), c(2,3,5), c(3,4,5))
Here is an extremely fast and efficient base R method mentioned in the comments using matrix indexing.
# construct 0 matrix with correct dimensions
newMat <- matrix(0L, nrow(myMat), max(myMat))
# fill in matrix using matrix indexing
newMat[cbind(c(row(myMat)), c(myMat))] <- 1L
This returns
newMat
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 0 0
[2,] 1 1 0 0 1
[3,] 1 0 1 1 0
[4,] 1 0 1 0 1
[5,] 1 0 0 1 1
[6,] 0 1 1 0 1
[7,] 0 0 1 1 1
data
myMat <-
structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 3L, 3L, 4L, 3L,
4L, 3L, 5L, 4L, 5L, 5L, 5L, 5L), .Dim = c(7L, 3L))
Here is a solution in base R:
# Your sample matrix
m.idx <- matrix(c(1,1,1,1,1,2,3,2,2,3,3,4,3,4,3,5,4,5,5,5,5), ncol = 3);
# Construct empty matrix of the right dims
m.val <- matrix(0, nrow = nrow(m.idx), ncol = max(m.idx));
for (i in 1:nrow(m.idx)) m.val[i, m.idx[i, ]] <- 1;
m.val;
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 1 1 0 0
#[2,] 1 1 0 0 1
#[3,] 1 0 1 1 0
#[4,] 1 0 1 0 1
#[5,] 1 0 0 1 1
#[6,] 0 1 1 0 1
#[7,] 0 0 1 1 1
Update
Please see below for benchmarking results of all the methods presented here. I've wrapped all the methods inside functions:
# The original matrix with indices
m.idx <- matrix(c(1,1,1,1,1,2,3,2,2,3,3,4,3,4,3,5,4,5,5,5,5), ncol = 3);
# For loop method
method.for_loop <- function(m) {
m.val <- matrix(0, nrow = nrow(m), ncol = max(m));
for (i in 1:nrow(m)) m.val[i, m[i, ]] <- 1;
return(m.val);
}
# lapply method (#Headpoint)
method.lapply <- function(m) {
m.val <- as.data.frame(matrix(0, nrow = nrow(m), ncol = max(m)));
invisible(lapply(1:nrow(m),
function(x) m.val[x,][m[x,]] <<- 1));
return(m.val);
}
# Direct indexing method (#lmo)
method.indexing <- function(m) {
m.val <- matrix(0L, nrow(m), max(m));
m.val[cbind(c(row(m)), c(m))] <- 1L;
return(m.val);
}
# tidyr/dplyr method (#CPak)
method.dplyr_tidyr <- function(m) {
as.data.frame(m) %>%
gather() %>% # wide-to-long format
group_by(key) %>%
mutate(rn = row_number()) %>% # add unique row_id per `key` group
mutate(newval = 1) %>% # fill in `existing` with this value
ungroup() %>% # ungroup and unselect `key` group
select(-key) %>%
spread(value, newval, fill=0) %>% # long-to-wide format
# fill in `non-existing` with `0`
select(-rn) %>% # unselect row_id column
rename_all(funs(paste0("a", .))) # rename columns
}
The results from microbenchmark are given below.
library(microbenchmark);
library(tidyr);
library(dplyr);
library(magrittr);
res <- microbenchmark(
for_loop = method.for_loop(m.idx),
lapply = method.lapply(m.idx),
indexing = method.indexing(m.idx),
dplyr_tidyr = method.dplyr_tidyr(m.idx),
times = 1000L
)
print(res);
# Unit: microseconds
# expr min lq mean median uq max
# for_loop 6.796 9.5405 16.89643 13.497 20.445 96.537
# lapply 1315.765 1441.5990 1696.74392 1518.256 1675.027 66181.880
# indexing 5.695 8.1450 20.49116 14.918 20.094 3139.946
# dplyr_tidyr 18777.669 20525.8095 22225.51936 21647.120 23215.714 84791.858
Conclusion: The for loop and direct indexing methods are tied for the fastest. The lapply method is second, roughly 100x slower, and the tidyr/dplyr method is the slowest, more than 1000x slower.
No for loop, but didn't really check if it is faster.
index <- matrix(c(1,1,1,1,1,2,3,2,2,3,3,4,3,4,3,5,4,5,5,5,5),
ncol = 3)
df <- as.data.frame(matrix(0, nrow = 7, ncol = 5))
invisible(lapply(1:nrow(index),
function(x) df[x,][index[x,]] <<- 1))
df
# V1 V2 V3 V4 V5
# 1 1 1 1 0 0
# 2 1 1 0 0 1
# 3 1 0 1 1 0
# 4 1 0 1 0 1
# 5 1 0 0 1 1
# 6 0 1 1 0 1
# 7 0 0 1 1 1
You can do this with a combination of dplyr and tidyr.
Your data
df <- read.table(text="1 2 3
1 2 5
1 3 4
1 3 5
1 4 5
2 3 5
3 4 5", header=FALSE)
Solution (some of these steps are just to clean up the output):
df %>%
gather() %>% # wide-to-long format
group_by(key) %>%
mutate(rn = row_number()) %>% # add unique row_id per `key` group
mutate(newval = 1) %>% # fill in `existing` with this value
ungroup() %>% # ungroup and unselect `key` group
select(-key) %>%
spread(value, newval, fill=0) %>% # long-to-wide format
# fill in `non-existing` with `0`
select(-rn) %>% # unselect row_id column
rename_all(funs(paste0("a", .))) # rename columns
Output
# A tibble: 7 x 5
a1 a2 a3 a4 a5
* <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0
2 1 1 0 0 1
3 1 0 1 1 0
4 1 0 1 0 1
5 1 0 0 1 1
6 0 1 1 0 1
7 0 0 1 1 1
I have a data frame with three initial columns: ID, start, and end positions. The rest of the columns are numeric chromosomal positions, and it looks like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4
ind2 1 3
ind3 5 7
What I want is to fill out the empty columns (1:n) based on the range for every individual (start:end). For example, for the first individual (ind1) the range goes from position 2 to 4, so the positions within the range are filled with one (1) and the positions outside the range with zero (0). To simplify, the desired output should look like this:
ID start end 1 2 3 4 5 6 7 ... n
ind1 2 4 0 1 1 1 0 0 0 ... 0
ind2 1 3 1 1 1 0 0 0 0 ... 0
ind3 5 7 0 0 0 0 1 1 1 ... 1
I will appreciate any comments.
Supposing you know the number of columns, you could use the between function from the data.table package:
cols <- paste0('c',1:7)
library(data.table)
setDT(DF)[, (cols) := lapply(1:7, function(x) +(between(x, start, end)))][]
which gives:
ID start end c1 c2 c3 c4 c5 c6 c7
1: ind1 2 4 0 1 1 1 0 0 0
2: ind2 1 3 1 1 1 0 0 0 0
3: ind3 5 7 0 0 0 0 1 1 1
Notes:
It is better not to name your columns with just numbers, so I added a c at the start of the column names.
Using + in +(between(x, start, end)) is a kind of trick. The more idiomatic way is as.integer(between(x, start, end)), as in the sketch below.
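For reference, the same call with the more explicit as.integer, which gives identical output (using the DF and cols objects defined below):
setDT(DF)[, (cols) := lapply(1:7, function(x) as.integer(between(x, start, end)))][]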
Used data:
DF <- read.table(text="ID start end
ind1 2 4
ind2 1 3
ind3 5 7", header=TRUE)
If you were to begin with the data frame df, without the columns already added,
ID start end
1 ind1 2 4
2 ind2 1 3
3 ind3 5 7
you could do
mx <- max(df[-1])
M <- Map(function(x, y) replace(integer(mx), x:y, 1L), df$start, df$end)
cbind(df, do.call(rbind, M))
# ID start end 1 2 3 4 5 6 7
# 1 ind1 2 4 0 1 1 1 0 0 0
# 2 ind2 1 3 1 1 1 0 0 0 0
# 3 ind3 5 7 0 0 0 0 1 1 1
The number of new columns will equal the maximum of the start and end columns.
Data:
df <- structure(list(ID = structure(1:3, .Label = c("ind1", "ind2",
"ind3"), class = "factor"), start = c(2L, 1L, 5L), end = c(4L,
3L, 7L)), .Names = c("ID", "start", "end"), class = "data.frame", row.names = c(NA,
-3L))
I want to expand my dataset by replacing each incomplete row with the set of all possible rows. Does anyone have any suggestions for an efficient way to do this?
For example, suppose X and Z can each take values 0 or 1.
Input:
id y x z
1 1 0 0 NA
2 2 1 NA 0
3 3 0 1 1
4 4 1 NA NA
Output:
id y x z
1 1 0 0 0
2 1 0 0 1
3 2 1 0 0
4 2 1 1 0
5 3 0 1 1
6 4 1 0 0
7 4 1 0 1
8 4 1 1 0
9 4 1 1 1
At the moment I just work through the original dataset row by row:
for(i in 1:N){
if(is.na(temp.dat$x[i]) & !is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
augment[,3] <- c(0,1)
}else
if(!is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],2),ncol=ncol(temp.dat),byrow=TRUE)
augment[,4] <- c(0,1)
}else{
if(is.na(temp.dat$x[i]) & is.na(temp.dat$z[i])){
augment <- matrix(rep(temp.dat[i,],4),ncol=ncol(temp.dat),byrow=TRUE)
augment[,3] <- c(0,0,1,1)
augment[,4] <- c(0,1,0,1)
}
}
}
You could try the following:
Create an "indx" of the count of NAs in each row (rowSums(is.na(...)))
Use the "indx" to expand the rows of the original dataset (df[rep(1:nrow...)
Loop over (sapply) the "indx" and use it as the "times" argument in rep, then do expand.grid of the values 0,1 to create "lst"
split the expanded dataset, "df1", by "id"
Use Map to replace the corresponding NA values in "lst2" with the values in "lst"
rbind the list elements
indx <- rowSums(is.na(df[-1]))
df1 <- df[rep(1:nrow(df), 2^indx),]
lst <- sapply(indx, function(x) expand.grid(rep(list(0:1), x)))
lst2 <- split(df1, df1$id)
res <- do.call(rbind,Map(function(x,y) {x[is.na(x)] <- as.matrix(y);x},
lst2, lst))
row.names(res) <- NULL
res
# id y x z
#1 1 0 0 0
#2 1 0 0 1
#3 2 1 0 0
#4 2 1 1 0
#5 3 0 1 1
#6 4 1 0 0
#7 4 1 1 0
#8 4 1 0 1
#9 4 1 1 1
data
df <- structure(list(id = 1:4, y = c(0L, 1L, 0L, 1L), x = c(0L, NA,
1L, NA), z = c(NA, 0L, 1L, NA)), .Names = c("id", "y", "x", "z"
), class = "data.frame", row.names = c("1", "2", "3", "4"))