Subsetting a dataframe to exclude values in the same row as NA

I have a dataframe called "data". One of the columns is called "reward" and another is called "X.targetResp". I want to create a new dataframe, called "reward", that consists of all values from the column "reward" in "data." HOWEVER, I want to exclude values of the "reward" column that are in the same row as an NA value in the "X.targetResp" column of "data".
I've tried the following:
reward <- data$reward %in% filter(!is.na(data$X.targetResp))
reward <- subset(data, reward, !(X.targetResp=="NA"))
reward <- subset(data, reward, !is.na(X.targetResp))
...but I get errors for each of them.
Thanks for your input!

In dplyr, you can use filter() with !is.na() to drop the rows that have NA in X.targetResp, and then use select() to keep the reward column.
library(dplyr)
# Create example data frame
dat <- tibble(reward = 1:5,
              X.targetResp = c(2, 4, NA, NA, 10))
# Print the data frame
dat
# # A tibble: 5 x 2
#   reward X.targetResp
#    <int>        <dbl>
# 1      1            2
# 2      2            4
# 3      3           NA
# 4      4           NA
# 5      5           10
# Use the filter function
reward <- dat %>%
  filter(!is.na(X.targetResp)) %>%
  select(reward)
reward
# # A tibble: 3 x 1
#   reward
#    <int>
# 1      1
# 2      2
# 3      5
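If you need reward as a plain vector rather than a one-column tibble, dplyr's pull() can replace select() in the same pipeline:
# Extract reward as a vector instead of a one-column tibble
reward_vec <- dat %>%
  filter(!is.na(X.targetResp)) %>%
  pull(reward)
reward_vec
# [1] 1 2 5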
And here is a base R solution with similar logic.
subset(dat, !is.na(X.targetResp), "reward")
# # A tibble: 3 x 1
#   reward
#    <int>
# 1      1
# 2      2
# 3      5
You can also consider using drop_na() on X.targetResp from the tidyr package.
library(dplyr)
library(tidyr)
reward <- dat %>%
  drop_na(X.targetResp) %>%
  select(reward)
reward
# # A tibble: 3 x 1
#   reward
#    <int>
# 1      1
# 2      2
# 3      5
Here is an example using the data.table package.
library(data.table)
setDT(dat)
reward <- dat[!is.na(X.targetResp), .(reward)]
reward
#    reward
# 1:      1
# 2:      2
# 3:      5
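A small note on data.table syntax: dropping the .() wrapper in j returns a plain vector rather than a one-column data.table.
# Same filter, but reward comes back as a vector
reward_vec <- dat[!is.na(X.targetResp), reward]
reward_vec
# [1] 1 2 5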

You can simply use na.omit, which is designed to address this problem:
# replicating the example data frame given by @www
data <- data.frame(
  reward = 1:5,
  X.targetResp = c(2, 4, NA, NA, 10)
)
# omitting the rows containing NAs
reward <- na.omit(data)
# resulting data frame with both columns
reward
#   reward X.targetResp
# 1      1            2
# 2      2            4
# 5      5           10
# you can easily extract the first column if necessary
reward[1]
#   reward
# 1      1
# 2      2
# 5      5
Following up @www's comment:
In case there are other columns you want to dodge:
# omitting the rows where only X.targetResp is NA
reward <- data[complete.cases(data["X.targetResp"]), ]
# resulting data frame with both columns
reward
#   reward X.targetResp
# 1      1            2
# 2      2            4
# 5      5           10
# you can easily extract the first column if necessary
reward[1]
#   reward
# 1      1
# 2      2
# 5      5
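If all you actually need is the vector of reward values, a minimal base R sketch (no packages assumed) is a single logical subscript:
# Keep reward values whose row has a non-NA X.targetResp
reward <- data$reward[!is.na(data$X.targetResp)]
reward
# [1] 1 2 5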

Related

Creating a column based on another column while creating a data frame

I know there are several ways to create a column based on another column, but I would like to know how to do it while creating a data frame.
For example, this works, but it is not the way I want to use it:
v1 = rnorm(10)
sample_df <- data.frame(v1 = v1,
                        cs = cumsum(v1))
This does not work:
sample_df2 <- data.frame(v2 = rnorm(10),
                         cs = cumsum(v2))
Is there a way to do it directly in the data.frame function? Thanks in advance.
It cannot be done using data.frame, but package tibble implements a data.frame analogue with the functionality that you want.
library("tibble")
tib <- tibble(x = 1:6, y = cumsum(x))
tib
# # A tibble: 6 × 2
#       x     y
#   <int> <int>
# 1     1     1
# 2     2     3
# 3     3     6
# 4     4    10
# 5     5    15
# 6     6    21
In most cases, the resulting object (called a "tibble") can be treated as if it were a data frame, but if you truly need a data frame, then you can do this:
dat <- as.data.frame(tib)
dat
#   x  y
# 1 1  1
# 2 2  3
# 3 3  6
# 4 4 10
# 5 5 15
# 6 6 21
You can wrap everything in a function if you like:
f <- function(...) as.data.frame(tibble(...))
f(x = 1:6, y = cumsum(x))
#   x  y
# 1 1  1
# 2 2  3
# 3 3  6
# 4 4 10
# 5 5 15
# 6 6 21
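For completeness, base R's within() gets close to the same ergonomics without tibble; it evaluates expressions in the context of the data frame, though the new column is appended after construction rather than during it. A minimal sketch:
# within() evaluates cs in the context of the freshly built data frame
sample_df2 <- within(data.frame(v2 = rnorm(10)), cs <- cumsum(v2))
head(sample_df2, 3)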

R lag across arbitrary number of missing values [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
library(tidyverse)
testdata <- tibble(ID = c(1, NA, NA, 2, NA, 3),
                   Observation = LETTERS[1:6])
testdata1 <- testdata %>%
  mutate(
    ID1 = case_when(
      is.na(ID) ~ lag(ID, 1),
      TRUE ~ ID
    )
  )
testdata1
I have a dataset like testdata, with a valid ID only when ID changes. There can be an arbitrary number of records in a set, but the above case_when and lag() structure does not fill in ID for all records, just for record 2 in each group. Is there a way to get the 3rd (or deeper) IDs filled with the appropriate value?
We can use fill from the tidyr package. Since you are using the tidyverse, tidyr is already included.
testdata1 <- testdata %>%
  fill(ID)
testdata1
# # A tibble: 6 x 2
#      ID Observation
#   <dbl> <chr>
# 1     1 A
# 2     1 B
# 3     1 C
# 4     2 D
# 5     2 E
# 6     3 F
Or we can use na.locf from the zoo package.
library(zoo)
testdata1 <- testdata %>%
  mutate(ID = na.locf(ID))
testdata1
# # A tibble: 6 x 2
#      ID Observation
#   <dbl> <chr>
# 1     1 A
# 2     1 B
# 3     1 C
# 4     2 D
# 5     2 E
# 6     3 F
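If the data is large, data.table's nafill() is another option for last-observation-carried-forward on numeric columns; a sketch, assuming ID stays numeric:
library(data.table)
testdata2 <- as.data.table(testdata)
# type = "locf" carries the last non-NA ID forward
testdata2[, ID := nafill(ID, type = "locf")]
testdata2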

dplyr conditionally filter on only unique items

I have a dataframe with $ID, $Age, and $Score features. I would like to filter this down to unique IDs that have a score below a specific value. For those IDs with multiple scores below the threshold, I want to keep only the oldest (i.e., maximum age).
Here's how I've tried to implement it but it is slow due to the loop. Is there a way to speed this up with dplyr or similar library?
#find the indexes of the items below the threshold
idx <- df$Score <= threshold
#select the below threshold rows
df <- df[idx,]
#find the unique IDs
unique_ids <- unique(df$ID)
unique_items <- data.frame(matrix(ncol=3, nrow=length(unique_ids)))
colnames(unique_items) <- colnames(df)
#loop through each unique ID
for(i in 1:length(unique_ids))
{
  #find all the items that match that unique ID
  my.list <- df[df$ID == unique_ids[i],]
  #find the index of the oldest unique item that is below the threshold
  oldest_idx <- which.max(my.list$Age)
  #assign it to the result dataframe
  unique_items[i,] <- my.list[oldest_idx,]
}
We can also use
df %>%
  filter(Score < 5) %>%
  group_by(ID) %>%
  slice(which.max(Age))
Sample data
library(dplyr)
set.seed(2021)
df <- tibble(ID=rep(1:2, each=5), Age=sample(10), Score=c(1:5, 3:7))
df
# # A tibble: 10 x 3
#       ID   Age Score
#    <int> <int> <int>
#  1     1     7     1
#  2     1     6     2
#  3     1     9     3
#  4     1     2     4
#  5     1     4     5
#  6     2     8     3
#  7     2    10     4
#  8     2     5     5
#  9     2     1     6
# 10     2     3     7
Answer:
df %>%
  filter(Score < 5) %>%
  group_by(ID) %>%
  slice_max(Age) %>%
  ungroup()
# # A tibble: 2 x 3
#      ID   Age Score
#   <int> <int> <int>
# 1     1     9     3
# 2     2    10     4
Here, the score threshold is 5, and the oldest (maximum-age) record below the threshold is returned for each ID.
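If you want to stay in base R, a vectorized sketch with order() and duplicated() also avoids the explicit loop:
# Keep rows below the threshold, sort each ID by decreasing Age,
# then keep the first (i.e., oldest) row per ID
df_f <- df[df$Score < 5, ]
df_f <- df_f[order(df_f$ID, -df_f$Age), ]
df_f[!duplicated(df_f$ID), ]
# # A tibble: 2 x 3
#      ID   Age Score
#   <int> <int> <int>
# 1     1     9     3
# 2     2    10     4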

Insert missing rows in time series data

I have an incomplete time series dataframe and I need to insert rows of NAs for missing time stamps. There should always be 6 time stamps per day, which is indicated by the variable "Signal" (1-6) in the dataframe. I am trying to merge the incomplete dataframe A with a vector B containing all Signals. Simplified example data below:
B <- rep(1:6, 2)
A <- data.frame(Signal = c(1, 2, 3, 5, 1, 2, 4, 5, 6), var1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1))
Expected <- data.frame(Signal = c(1, 2, 3, NA, 5, NA, 1, 2, NA, 4, 5, 6),
                       var1 = c(1, 1, 1, NA, 1, NA, 1, 1, NA, 1, 1, 1))
Note that B represents a dataframe with multiple variables, and the NAs in Expected are rows of NAs in the dataframe. Also, the actual dataframe has more observations (84 in total).
Would be awesome if you guys could help me out!
If you already know there are 6 timestamps in a day, you can do this without B. We can create groups for each day and use complete to add the missing observations as NA.
library(dplyr)
library(tidyr)
A %>%
  group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
  complete(Signal = 1:6) %>%
  ungroup() %>%
  select(-gr)
#    Signal  var1
#     <dbl> <dbl>
#  1      1     1
#  2      2     1
#  3      3     1
#  4      4    NA
#  5      5     1
#  6      6    NA
#  7      1     1
#  8      2     1
#  9      3    NA
# 10      4     1
# 11      5     1
# 12      6     1
If in the output you need Signal to be NA for the missing combinations, you can use
A %>%
  group_by(gr = cumsum(c(TRUE, diff(Signal) < 0))) %>%
  complete(Signal = 1:6) %>%
  mutate(Signal = replace(Signal, is.na(var1), NA)) %>%
  ungroup() %>%
  select(-gr)
#    Signal  var1
#     <dbl> <dbl>
#  1      1     1
#  2      2     1
#  3      3     1
#  4     NA    NA
#  5      5     1
#  6     NA    NA
#  7      1     1
#  8      2     1
#  9     NA    NA
# 10      4     1
# 11      5     1
# 12      6     1
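The same day-grouping idea also works in base R with expand.grid() and merge(), mirroring the first variant above (a sketch, no tidyr assumed):
# Build one group id per day, create the full Signal grid, then left-join
A$gr <- cumsum(c(TRUE, diff(A$Signal) < 0))
full <- expand.grid(gr = unique(A$gr), Signal = 1:6)
out <- merge(full, A, by = c("gr", "Signal"), all.x = TRUE)
out$gr <- NULL
out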

Is there an R function to count values separated by semi-colon in a column and across rows

I am wondering if there is a smart way to count the occurrences of values separated by semicolons in a column, across all rows.
Sample data and output expected
Here's a tidyverse approach:
library(tidyverse)
# example data
df1 = data.frame(var1 = c(2, 4, 3, 5),
                 var2 = c("3;5;2;0;1", "2;3;8;5", "9;6;2", "8;5;4;7;0;1"),
                 stringsAsFactors = FALSE)
df1 %>%
  separate_rows(var2) %>%        # split values into different rows
  filter(var2 %in% df1$var1) %>% # keep values that match var1
  count(var2)                    # count each value
# # A tibble: 4 x 2
#   var2      n
#   <chr> <int>
# 1 2         3
# 2 3         2
# 3 4         1
# 4 5         3
And a base R approach:
v = unlist(strsplit(df1$var2, ";"))
data.frame(table(v[v %in% df1$var1]))
#   Var1 Freq
# 1    2    3
# 2    3    2
# 3    4    1
# 4    5    3
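And, should you already be working with data.table, the same logic as a sketch (note that %in% coerces the character values for the comparison, just as in the base R version):
library(data.table)
dt <- as.data.table(df1)
# One value per row, keep those matching var1, then count
long <- dt[, .(val = unlist(strsplit(var2, ";", fixed = TRUE)))]
long[val %in% dt$var1, .N, by = val][order(val)]
#    val N
# 1:   2 3
# 2:   3 2
# 3:   4 1
# 4:   5 3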
