Detect missing values (NA or 0) in a data frame - R

I want to create a new variable in a data frame that summarizes information from the other variables.
I have got a large data frame. To keep it short let's say:
a <- c(1,0,2,3)
b <- c(3,0,1,1)
c <- c(2,0,2,2)
d <- c(4,1,1,1)
(df <- data.frame(a,b,c,d) )
a b c d
1 1 3 2 4
2 0 0 0 1
3 2 1 2 1
4 3 1 2 1
Aim: Create a new variable that tells me whether a person (row) has zero reports (or missing values / NA) either in the variables a+b or in the variables c+d.
a b c d x
1 1 3 2 4 1
2 0 0 0 1 NA
3 2 1 2 1 1
4 3 1 2 1 1
As I have a large data frame, I was thinking about using df[1:2] and df[3:4] so that I do not need to type every variable name. But I am not sure what the best way to implement it is. Maybe dplyr has a nice option?

df$x <- ifelse(rowSums(df), 1, NA)
EDIT: Answer to the updated question:
df$x <- ifelse(rowSums(df[1:2])&rowSums(df[3:4]), 1, NA)
gives,
a b c d x
1 1 3 2 4 1
2 0 0 0 1 NA
3 2 1 2 1 1
4 3 1 2 1 1
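The question also mentions NA values, which the example data does not contain. The following is a minimal sketch of one way to handle them, assuming that an NA should count like a zero within its block and that all values are non-negative. The NA entries in the sample data and the intermediate names ab and cd are added here for illustration only.

```r
# Sample data with two NA values added for illustration (not in the original question)
df <- data.frame(a = c(1, 0, 2, NA),
                 b = c(3, 0, 1, 1),
                 c = c(2, 0, 2, 2),
                 d = c(4, 1, 1, NA))

# Block totals; na.rm = TRUE makes an NA contribute nothing to the sum
ab <- rowSums(df[1:2], na.rm = TRUE)
cd <- rowSums(df[3:4], na.rm = TRUE)

# 1 if both blocks have at least one positive report, NA otherwise
df$x <- ifelse(ab > 0 & cd > 0, 1, NA)
df
```

Without na.rm = TRUE, any NA in a block makes that block's sum NA, so the row would be flagged NA even if the other variable in the block is positive. Which behaviour is wanted depends on how missing values should be interpreted.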

Related

Add group to data frame using monotonically increasing numbers

I've got a data frame that looks like this (the real data is much larger and more complicated):
df.test = data.frame(
sample = c("a","a","a","a","a","a","b","b"),
day = c(0,1,2,0,1,3,0,2),
value = rnorm(8)
)
sample day value
1 a 0 -1.11182146
2 a 1 0.65679637
3 a 2 0.03652325
4 a 0 -0.95351736
5 a 1 0.16094840
6 a 3 0.06829702
7 b 0 0.33705141
8 b 2 0.24579603
The data frame is organized by experiment, but the experiment ids are missing. The same sample can be used in different experiments, but I know that within a single experiment the days start from 0 and are monotonically increasing.
How can I add experiment ids, which can be numbers {1, 2, ...}?
So the resulting data frame will be
sample day value exp
1 a 0 -1.11182146 1
2 a 1 0.65679637 1
3 a 2 0.03652325 1
4 a 0 -0.95351736 2
5 a 1 0.16094840 2
6 a 3 0.06829702 2
7 b 0 0.33705141 3
8 b 2 0.24579603 3
I would appreciate any help, especially with a tidy/dplyr solution.
As indicated in the comments, you can do this with cumsum:
library(dplyr)
df.test %>% mutate(exp = cumsum(day == 0))
## sample day value exp
## 1 a 0 0.09300394 1
## 2 a 1 0.85322925 1
## 3 a 2 -0.25167313 1
## 4 a 0 -0.14811243 2
## 5 a 1 -1.86789014 2
## 6 a 3 0.45983987 2
## 7 b 0 2.81199150 3
## 8 b 2 0.31951634 3
You can use diff, which starts a new experiment whenever day decreases:
library(dplyr)
df.test %>% mutate(exp = cumsum(c(TRUE, diff(day) < 0)))
# sample day value exp
#1 a 0 -0.3382010 1
#2 a 1 2.2241041 1
#3 a 2 2.2202612 1
#4 a 0 1.0359635 2
#5 a 1 0.4134727 2
#6 a 3 1.0144439 2
#7 b 0 -0.1292119 3
#8 b 2 -0.1191505 3
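Both answers rely on the day column alone. A base R sketch that also starts a new experiment when the sample changes, or when the day fails to increase, might look like the following. The extra conditions and the names n, day_reset and sample_change are assumptions added here, not part of the original answers.

```r
df.test <- data.frame(
  sample = c("a", "a", "a", "a", "a", "a", "b", "b"),
  day = c(0, 1, 2, 0, 1, 3, 0, 2),
  value = rnorm(8)
)

n <- nrow(df.test)

# TRUE where the day does not strictly increase relative to the previous row
day_reset <- diff(df.test$day) <= 0

# TRUE where the sample differs from the previous row
sample_change <- df.test$sample[-1] != df.test$sample[-n]

# The first row always opens experiment 1; each later break adds 1
df.test$exp <- cumsum(c(TRUE, day_reset | sample_change))
df.test
```

On the example data this gives the same ids as the answers above. It would differ only for inputs where an experiment does not begin at day 0, or where two consecutive experiments both consist of a single day-0 row.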

Combining two (boolean) categorical factors into a new one

Given two boolean, categorical factors, how can I get the combination of them as a third category?
> my_data <- data.frame(a = c(0, 0, 1, 1, 1),
b = c(0, 1, 0, 1, 1))
> my_data
a b
1 0 0
2 0 1
3 1 0
4 1 1
5 1 1
I want to add a new category, with the combination of a and b so that:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
I didn't want to be lazy, so I tried to work it out myself:
my_data$c <- as.numeric(as.factor(my_data$a + 1 + (my_data$b + 1) * 2))
This comes close, but I don't find it particularly elegant.
Therefore, any nicer solution in base R would be appreciated.
There are certainly also packages like reshape2 which would offer similar functionality.
The following logic seems to be enough for all the cases you have provided.
my_data$c <- with(my_data, 2*a + b + 1)
my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
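Another option with base R (note that rle numbers consecutive runs, so this only gives one id per combination when identical rows are adjacent, as they are here):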
r <- rle(do.call(paste0, my_data))
r$values <- seq_along(r$values)
my_data$c <- inverse.rle(r)
The result:
> my_data
a b c
1 0 0 1
2 0 1 2
3 1 0 3
4 1 1 4
5 1 1 4
A shorter version of the above code:
r <- rle(do.call(paste0, my_data))$lengths
my_data$c <- rep(seq_along(r), r)
The expected output in the question is just the input seen as numbers in base 2 converted to base 10 plus 1.
So, looking for a function that converts from base 2 to base 10 I have found the accepted answer to this SO question.
So it's a matter of apply()ing that function to the data frame.
apply(my_data, 1, bitsToInt) + 1
#[1] 1 2 3 4 4
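The definition of bitsToInt is not reproduced in the text above, so the call cannot be run as shown. The following is a hypothetical stand-in, written here for illustration and not taken from the linked answer, that treats its argument as binary digits with the most significant bit first.

```r
# Hypothetical stand-in for bitsToInt (the linked answer's version is not shown here).
# Interprets a vector of 0/1 values as a binary number, most significant bit first.
bitsToInt <- function(bits) {
  powers <- 2^(rev(seq_along(bits)) - 1)
  sum(bits * powers)
}

my_data <- data.frame(a = c(0, 0, 1, 1, 1),
                      b = c(0, 1, 0, 1, 1))

# Each row (a, b) becomes 2*a + b, then 1 is added so the ids start at 1
res <- apply(my_data, 1, bitsToInt) + 1
res
```

With two columns this reduces to the 2*a + b + 1 formula from the earlier answer; the helper only becomes useful when there are more boolean columns.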
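A general solution with dplyr (in dplyr 1.0.0 and later, passing grouping columns to group_indices() is deprecated in favor of group_by() followed by cur_group_id()):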
library(dplyr)
my_data %>% mutate(c = group_indices(.,a,b))
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4
A base equivalent:
temp <- unique(my_data)
temp$c <- seq_len(nrow(temp))
merge(my_data,temp)
# a b c
# 1 0 0 1
# 2 0 1 2
# 3 1 0 3
# 4 1 1 4
# 5 1 1 4
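Another base R possibility, sketched here as an alternative rather than taken from the answers above, is interaction(), which builds a factor from the combinations of a and b. The name combo is introduced for illustration. Unlike merge(), this keeps the original row order.

```r
my_data <- data.frame(a = c(0, 0, 1, 1, 1),
                      b = c(0, 1, 0, 1, 1))

# lex.order = TRUE orders the levels by a first, then b;
# drop = TRUE removes combinations that never occur, so the ids are consecutive
combo <- interaction(my_data$a, my_data$b, drop = TRUE, lex.order = TRUE)
my_data$c <- as.integer(combo)
my_data
```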

Create equal length vectors from time series based upon factor in R

I have a data frame that is something like this:
time type count
1 -2 a 1
2 -1 a 4
3 0 a 6
4 1 a 2
5 2 a 5
6 0 b 3
7 1 b 7
8 2 b 2
I want to create a new data frame that takes type 'b' and creates the full time series by filling in zeroes for count. It should look like this:
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
I can certainly subset(df, type == 'b') and then hack the beginning and rbind, but I want it to be more dynamic just in case the time vector changes.
We can use complete from tidyr to get the full 'time' for all the unique values of 'type' and filter the value of interest in 'type'.
library(tidyr)
library(dplyr)
val <- "b"
df1 %>%
complete(time, type, fill=list(count=0)) %>%
filter(type== val)
# time type count
# <int> <chr> <dbl>
#1 -2 b 0
#2 -1 b 0
#3 0 b 3
#4 1 b 7
#5 2 b 2
With base R (this assumes that type 'a' covers the full time range):
df1 <- data.frame(time=df[df$type == 'a',]$time, type='b', count=0)
df1[match(df[df$type=='b',]$time, df1$time),]$count <- df[df$type=='b',]$count
df1
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
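If no single type can be assumed to cover the whole time range, a variant along the following lines may be more robust. It is a sketch that takes the full range from all unique time values in the data; the names val, grid and out are introduced here for illustration.

```r
df <- data.frame(
  time = c(-2, -1, 0, 1, 2, 0, 1, 2),
  type = c("a", "a", "a", "a", "a", "b", "b", "b"),
  count = c(1, 4, 6, 2, 5, 3, 7, 2),
  stringsAsFactors = FALSE
)

val <- "b"

# One row per time value seen anywhere in the data, all labelled with the wanted type
grid <- data.frame(time = sort(unique(df$time)), type = val,
                   stringsAsFactors = FALSE)

# Left join: keep every grid row, pulling in counts where type 'b' has them
out <- merge(grid, df[df$type == val, ], by = c("time", "type"), all.x = TRUE)

# Times with no observation for this type get a count of 0
out$count[is.na(out$count)] <- 0
out
```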

Creating new dataframe with missing value

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data, I need it to contain a value for each group (a to d) at each time interval, even if that value is zero. So I need a data frame that looks like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new

Subsequent row summing in dataframe object

I would like to compute a running (cumulative) sum of one column's values within each group defined by another column, and store the result in a new column without dropping any rows.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do this, since the for loop will be time consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with, and it does the trick (calculating a TEST column in MyDf that has to be equal to Y2 in MyDf.2):
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> (MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum)))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
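The loop in the question restarts the sum at each row where FIRST is 1 rather than looking at ID. If the groups have to be derived from that flag, the same idea can be sketched as follows, assuming FIRST marks the first row of every group and the first row of the data has FIRST == 1. The name grp is introduced here for illustration.

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# Each FIRST == 1 opens a new group, so the running count of flags is a group id
grp <- cumsum(MyDf$FIRST == 1)

# Cumulative sum of Y within each derived group
MyDf$TEST <- ave(MyDf$Y, grp, FUN = cumsum)
MyDf
```

On this example grp matches ID exactly, so the result is identical to the ave(Y, ID, FUN=cumsum) version.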

Resources