Combining two (boolean) categorical factors into a new one - r

Given two boolean, categorical factors, how can I get the combination of them as a third category?
> my_data <- data.frame(a = c(0, 0, 1, 1, 1),
b = c(0, 1, 0, 1, 1))
> my_data
a b
1 0 0
2 0 1
3 1 0
4 1 1
5 1 1
I want to add a new category, with the combination of a and b so that:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
I didn't want to be lazy and thought about it for myself:
my_data$c <- as.numeric(as.factor(my_data$a + 1 + (my_data$b + 1) * 2))
This comes close, but I don't find it particularly elegant.
Therefore, any nicer solution in base R would be appreciated.
There are certainly also packages like reshape2 which would offer similar functionality.

The following logic seems to be enough for all the cases you have provided.
my_data$c <- with(my_data, 2*a + b + 1)
my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
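The formula above treats a and b as binary digits. The same numbering can also be sketched with base R's interaction(), which may extend more easily to factors with more than two levels. Here lex.order = TRUE is chosen so that a varies slowest, matching the numbering asked for in the question; the name combo is only illustrative.

```r
my_data <- data.frame(a = c(0, 0, 1, 1, 1),
                      b = c(0, 1, 0, 1, 1))

# interaction() builds a single factor from all combinations of a and b;
# with lex.order = TRUE the levels are "0.0", "0.1", "1.0", "1.1"
combo <- interaction(my_data$a, my_data$b, lex.order = TRUE)

# The integer codes of that factor are the combined category
my_data$c <- as.integer(combo)
my_data$c
# [1] 1 2 3 4 4
```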

Another option with base R:
r <- rle(do.call(paste0, my_data))
r$values <- seq_along(r$values)
my_data$c <- inverse.rle(r)
The result:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
A shorter version of the above code:
r <- rle(do.call(paste0, my_data))$lengths
my_data$c <- rep(seq_along(r), r)
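Note that rle() numbers runs of adjacent identical rows, so the approach above relies on equal combinations sitting next to each other. A sketch that does not assume this, using match() against the unique keys (the name key is illustrative, and ids are assigned in order of first appearance rather than sorted order):

```r
# Same two columns, but with equal combinations not adjacent
my_data <- data.frame(a = c(1, 0, 1, 0, 1),
                      b = c(1, 0, 0, 1, 1))

# One string per row, e.g. "11", "00", ...
key <- do.call(paste0, my_data)

# Position of each key among the distinct keys
my_data$c <- match(key, unique(key))
my_data$c
# [1] 1 2 3 4 1
```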

The expected output in the question is just the input seen as numbers in base 2 converted to base 10 plus 1.
So, looking for a function that converts from base 2 to base 10 I have found the accepted answer to this SO question.
So it's a matter of apply()ing that function to the data frame.
apply(my_data, 1, bitsToInt) + 1
#[1] 1 2 3 4 4
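bitsToInt is not defined in this answer; it comes from the linked question. A minimal stand-in is sketched below, assuming each row holds 0/1 values with the most significant bit first. This definition is an illustration and not necessarily the one from that answer.

```r
my_data <- data.frame(a = c(0, 0, 1, 1, 1),
                      b = c(0, 1, 0, 1, 1))

# Hypothetical helper: read a 0/1 vector as a base-2 number,
# leftmost element being the highest power of two
bitsToInt <- function(bits) {
  sum(bits * 2^(rev(seq_along(bits)) - 1))
}

# Apply row by row and shift from 0-based to 1-based categories
res <- apply(my_data, 1, bitsToInt) + 1
```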

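A general solution with dplyr (note that since dplyr 1.0.0, passing grouping columns to group_indices() is deprecated; group_by(a, b) followed by mutate(c = cur_group_id()) is the current idiom):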
library(dplyr)
my_data %>% mutate(c = group_indices(.,a,b))
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4
A base equivalent:
temp <- unique(my_data)
temp$c <- seq(nrow(temp))
merge(my_data,temp)
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4

Related

Shift (Complete) Specific Rows Left in R

I pulled a data.frame from the internet and need to shift specific rows (5 of 168) one column to the left. I thought it best to append a column to the front of the data.frame and move the rows over, but was unsuccessful. For example, I need something like this:
a b c d e   >>>   a b c d e
0 1 2 3 4         0 1 2 3 4
0 0 1 2 3         0 1 2 3 NA
0 1 2 3 4         0 1 2 3 4
If you know which rows you want to shift, you can replace the first value(s) of these rows with NA, and then use hacksaw::shift_row_values.
library(hacksaw)
data[2, "a"] <- NA
data %>%
shift_row_values(at = 2)
a b c d e
1 0 1 2 3 4
2 0 1 2 3 NA
3 0 1 2 3 4
Data:
data <- read.table(header = T, text = "
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4 ")
Another possible solution, based on base R:
rows <- 2:3
df[rows,] <- cbind(df[rows, -1], NA)
df
#> a b c d e
#> 1 0 1 2 3 4
#> 2 0 1 2 3 NA
#> 3 1 2 3 4 NA
You can replace a row with an offset part plus NA like this:
dat[2,] <- c(dat[2, 2:5], NA)
Data:
dat <- read.table(text="
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4",
header=TRUE)
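The base R idea above can be wrapped in a small helper. This is only a sketch: shift_rows_left and its arguments are hypothetical names, and it assumes every listed row should move exactly one column to the left, dropping its first value.

```r
dat <- read.table(text = "
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4", header = TRUE)

# Hypothetical helper: drop the first value of the given rows
# and pad the last column with NA
shift_rows_left <- function(x, rows) {
  x[rows, ] <- cbind(x[rows, -1, drop = FALSE], NA)
  x
}

res <- shift_rows_left(dat, 2)
res
```

With 5 of 168 rows to fix, rows would simply be the vector of those row indices.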

Add group to data frame using monotonically increasing numbers

I've got a data frame that looks like this (the real data is much larger and more complicated):
df.test = data.frame(
sample = c("a","a","a","a","a","a","b","b"),
day = c(0,1,2,0,1,3,0,2),
value = rnorm(8)
)
sample day value
1 a 0 -1.11182146
2 a 1 0.65679637
3 a 2 0.03652325
4 a 0 -0.95351736
5 a 1 0.16094840
6 a 3 0.06829702
7 b 0 0.33705141
8 b 2 0.24579603
The data frame is organized by experiment, but the experiment ids are missing. The same sample can be used in different experiments, but I know that within a single experiment the days start from 0 and are monotonically increasing.
How can I add experiment ids as numbers {1, 2, ...}?
The resulting data frame should be
sample day value exp
1 a 0 -1.11182146 1
2 a 1 0.65679637 1
3 a 2 0.03652325 1
4 a 0 -0.95351736 2
5 a 1 0.16094840 2
6 a 3 0.06829702 2
7 b 0 0.33705141 3
8 b 2 0.24579603 3
I would appreciate any help, especially with a tidy/dplyr solution.
As indicated in the comments, you can do this with cumsum:
df.test %>% mutate(exp = cumsum(day == 0))
## sample day value exp
## 1 a 0 0.09300394 1
## 2 a 1 0.85322925 1
## 3 a 2 -0.25167313 1
## 4 a 0 -0.14811243 2
## 5 a 1 -1.86789014 2
## 6 a 3 0.45983987 2
## 7 b 0 2.81199150 3
## 8 b 2 0.31951634 3
You can use diff:
library(dplyr)
df.test %>% mutate(exp = cumsum(c(TRUE, diff(day) < 0)))
# sample day value exp
#1 a 0 -0.3382010 1
#2 a 1 2.2241041 1
#3 a 2 2.2202612 1
#4 a 0 1.0359635 2
#5 a 1 0.4134727 2
#6 a 3 1.0144439 2
#7 b 0 -0.1292119 3
#8 b 2 -0.1191505 3
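Both answers rest on an assumption about the data: the first that every experiment contains a day 0 row, the second that day drops whenever a new experiment begins. A base R sketch that also starts a new experiment when the sample changes (the name new_exp is illustrative, and the value column is left out because it plays no role):

```r
df.test <- data.frame(
  sample = c("a", "a", "a", "a", "a", "a", "b", "b"),
  day    = c(0, 1, 2, 0, 1, 3, 0, 2)
)
n <- nrow(df.test)

# A new experiment starts at row 1, when the sample differs from the
# previous row, or when day decreases relative to the previous row
new_exp <- c(TRUE,
             df.test$sample[-1] != df.test$sample[-n] | diff(df.test$day) < 0)

df.test$exp <- cumsum(new_exp)
df.test$exp
# [1] 1 1 1 2 2 2 3 3
```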

Repeating loop and adding columns in R

I am trying to write R code that will take my loop and run it 20 times. Each time I would like to add a column to the existing data frame. Here I tried it by repeating the code 3 times, but I feel like there must be an easier way to automate this. I am very grateful for any help.
My original data file (called "igel") contains two columns ("Year" and "Grid") and 1096 rows. With the loop I pick a random number from the column "Grid" and check whether it has been picked before. If so, it adds 0 to a new column; if not, it adds 1.
Here is the code:
a <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(a) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% a$number == TRUE) {0} else {1})
a<-a %>% add_row(number = num_i, count = count_i)
}
b <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(b) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% b$number == TRUE) {0} else {1})
b<-b %>% add_row(number = num_i, count = count_i)
}
c <- data.frame(matrix(ncol = 2, nrow = 0))
x <- c("number", "count")
colnames(c) <- x
for (i in 1:1096) {
num_i <- sample(igel$Grid, 1)
count_i <- c(if (num_i %in% c$number == TRUE) {0} else {1})
c<-c %>% add_row(number = num_i, count = count_i)
}
df.total<- cbind(a$count,b$count, c$count)
Consider sapply and its wrapper replicate, and calculate number and count separately with vectorized operations instead of growing an object row by row in a loop.
# RUNS 3 SAMPLES OF igel$Grid 1,096 TIMES (ADJUST 3 TO ANY POSITIVE INT LIKE 20)
grid_number <- data.frame(replicate(3, replicate(1096, sample(igel$Grid, 1))))
# RUNS ACROSS 3 COLUMNS TO CHECK CURRENT ROW VALUE IS INCLUDED FOR ALL VALUES BEFORE ROW
grid_count <- sapply(grid_number, function(col)
sapply(seq_along(col), function(i)
ifelse(col[i] %in% col[seq_len(i - 1)], 0, 1) # seq_len() avoids 1:0, which would compare row 1 with itself
)
)
Because of the random sampling within iterations, the above does not exactly reproduce your output df.total (which is a matrix, not a data frame), but the two have the same structure:
dim(df.total)
# [1] 1096 3
dim(grid_count)
# [1] 1096 3
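Try to avoid iterating through rows. It is rarely necessary, if ever. Here is one approach (replace n with 1096 and elem with igel$Grid). Note that duplicated() flags repeated values, so count here is 1 for a value that was drawn before and 0 for a first draw, which is the reverse of the coding in the question; use !duplicated() to match it: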
n = 20
elem = 1:5
df.total = list()
for (i in 1:5) {
a = data.frame(number = sample(elem, n, replace=TRUE))
a$count = as.numeric(duplicated(a$number))
df.total[[i]] = a
}
df.total = as.data.frame(df.total)
df.total
## number count number.1 count.1 number.2 count.2 number.3 count.3 number.4 count.4
## 1 4 0 2 0 5 0 4 0 1 0
## 2 3 0 5 0 3 0 4 1 3 0
## 3 5 0 3 0 4 0 2 0 4 0
## 4 5 1 1 0 2 0 5 0 3 1
## 5 2 0 4 0 2 1 5 1 5 0
## 6 4 1 2 1 2 1 5 1 5 1
## 7 5 1 1 1 3 1 2 1 4 1
## 8 5 1 2 1 5 1 5 1 4 1
## 9 2 1 1 1 1 0 1 0 1 1
## 10 3 1 1 1 5 1 4 1 1 1
## 11 5 1 3 1 1 1 3 0 5 1
## 12 2 1 1 1 2 1 5 1 1 1
## 13 3 1 5 1 4 1 5 1 4 1
## 14 1 0 4 1 2 1 4 1 1 1
## 15 4 1 4 1 2 1 5 1 1 1
## 16 4 1 2 1 5 1 2 1 5 1
## 17 3 1 1 1 1 1 3 1 2 0
## 18 2 1 2 1 2 1 2 1 2 1
## 19 2 1 3 1 1 1 2 1 1 1
## 20 1 1 3 1 2 1 1 1 3 1
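In the question, count is 1 the first time a grid value is drawn and 0 afterwards, and only the count columns are kept. A sketch of that convention with replicate() follows; grid stands in for igel$Grid, and the sizes and the seed are illustrative assumptions.

```r
set.seed(1)      # only to make the illustration reproducible
grid    <- 1:5   # stand-in for igel$Grid
n_draws <- 20    # 1096 in the question
n_runs  <- 3     # 20 in the question

# One column per run: 1 if the value is drawn for the first time, else 0
df.total <- replicate(n_runs, {
  draws <- sample(grid, n_draws, replace = TRUE)
  as.numeric(!duplicated(draws))
})

dim(df.total)
# [1] 20  3
```

The result is a matrix, like the cbind() output in the question; wrap it in as.data.frame() if a data frame is needed.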

detect missings (NA or 0) in data frame

I want to create a new variable in a data frame that contains information about the other variables.
I have got a large data frame. To keep it short let's say:
a <- c(1,0,2,3)
b <- c(3,0,1,1)
c <- c(2,0,2,2)
d <- c(4,1,1,1)
(df <- data.frame(a,b,c,d) )
a b c d
1 1 3 2 4
2 0 0 0 1
3 2 1 2 1
4 3 1 2 1
Aim: Create a new variable that tells me whether a person (row) has zero reports (or missing values / NA) either in the variables a+b or in the variables c+d.
a b c d x
1 1 3 2 4 1
2 0 0 0 1 NA
3 2 1 2 1 1
4 3 1 2 1 1
As I have a large data frame, I was thinking about using df[1:2] and df[3:4] so that I do not need to type every variable name. But I am not sure what the best way to implement it is. Maybe dplyr has a nice option?
df$x <- ifelse(rowSums(df), 1, NA)
EDIT: Answer to the updated question:
df$x <- ifelse(rowSums(df[1:2])&rowSums(df[3:4]), 1, NA)
gives,
a b c d x
1 1 3 2 4 1
2 0 0 0 1 NA
3 2 1 2 1 1
4 3 1 2 1 1
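Since rowSums() returns NA for any row that contains NA, missing values can be handled explicitly with na.rm = TRUE. A sketch, assuming values are non-negative and that a block counts as reported when its non-missing values sum to more than zero (has_ab and has_cd are illustrative names, and the NA values in row 4 are added here for demonstration):

```r
df <- data.frame(a = c(1, 0, 2, NA),
                 b = c(3, 0, 1, NA),
                 c = c(2, 0, 2, 2),
                 d = c(4, 1, 1, 1))

# TRUE when the block has at least one non-missing, non-zero value
has_ab <- rowSums(df[1:2], na.rm = TRUE) > 0
has_cd <- rowSums(df[3:4], na.rm = TRUE) > 0

# 1 only when both blocks have something reported
df$x <- ifelse(has_ab & has_cd, 1, NA)
df$x
# [1]  1 NA  1 NA
```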

Subsequent row summing in dataframe object

I would like to compute a running (cumulative) sum of one column within the groups defined by another column, and store the result in a new column without deleting any rows.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do this, since the for loop will be time consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID separately.
This is what I came up with, and it does the trick (calculating a TEST column in MyDf that has to be equal to Y2 in MyDf.2):
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
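If no ID column were available, the FIRST flag could define the groups instead. A sketch, assuming FIRST is 1 exactly on the first row of each group:

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# cumsum(FIRST) increases by one at each group start, so it acts as a
# group index; ave() then applies cumsum() to Y within each group
MyDf$TEST <- ave(MyDf$Y, cumsum(MyDf$FIRST), FUN = cumsum)
MyDf$TEST
# [1]  1  3  6  4  9 15
```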
