test values within data frame R - r

i'm trying to figure out how to write a loop that tests if a value in one of many columns is greater than or less than values in two set columns in a data frame. I'd like a 1/0 output and to drop all the columns that are tested. my solution has an embarrassing number of mutates to create new columns that are T or F and then uses a Reduce function to check if TRUE is present in one of the columns from a set position to the end of the data frame. any help on this would be appreciated!
example:
library(tidyverse)
df3 = data.frame(X = sample(1:3, 15, replace = TRUE),
Y = sample(1:3, 15, replace = TRUE),
Z = sample(1:3, 15, replace = TRUE),
A = sample(1:3, 15, replace = TRUE))
df3 <- df3 %>% mutate(T1 = Z >= X & Z <= Y,
T2 = A >= X & A <= Y)
df3$check <- Reduce(`|`, lapply(df3[5:6], `==`, TRUE))

df3 %>%
mutate(check = if_any(Z:A, function(x) {x >= X & x <= Y}))

You can compare the entire subset df3[c('A', 'Z')] at once, which should be more efficient. We are looking for rowSums greater than zero.
To understand the logic:
cols <- c('A', 'Z')
as.integer(rowSums(df3[cols] >= df3$X & df3[cols] <= df3$Y) > 0)
# [1] 1 1 0 0 1 0 0 1 0 0 1 1 0 0 1
To create the column:
transform(df3, check=as.integer(rowSums(df3[cols] >= X & df3[cols] <= Y) > 0))
# X Y Z A check
# 1 1 3 3 2 1
# 2 1 3 3 2 1
# 3 1 1 2 2 0
# 4 1 1 2 2 0
# 5 2 3 2 3 1
# 6 2 1 2 2 0
# 7 2 3 1 1 0
# 8 1 1 1 2 1
# 9 3 1 2 3 0
# 10 3 2 2 2 0
# 11 1 3 3 2 1
# 12 1 2 3 2 1
# 13 2 1 1 1 0
# 14 2 2 1 1 0
# 15 2 2 2 1 1
Data:
dat <- structure(list(X = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L,
1L, 1L, 2L, 2L, 2L), Y = c(3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L, 1L,
2L, 3L, 2L, 1L, 2L, 2L), Z = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 2L), A = c(2L, 2L, 2L, 2L, 3L, 2L, 1L,
2L, 3L, 2L, 2L, 2L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-15L))

Related

New data frame, if specific value(s) is contained AND other values aren't included in a range of columns in r

So, I have a large data frame with monthly observations of n individuals.
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
A 33 6 1 2 1 5
B 36 5 0 2 1 5
C 22 4 1 NA 1 5
D 2 2 0 2 1 5
E 5 2 1 2 1 6
F 7 1 0 2 1 5
G 8 6 1 2 1 5
H 2 8 0 2 2 5
I 1 3 1 2 1 5
J 3 2 0 2 1 5
I want to create a new data frame, in which include the individuals who meet some specific conditions.
E.g. if, for individual i, the range of column y_0101:y_0312 does NOT include values of 3 & 6 & NA, AND include values of 2 | 1 THEN for individual i should be included in new data frame. Which produce the following data frame:
ind y_0101 y_0102 y_0103 y_0104_ .... y_0311 y_0312
B 36 5 0 2 1 5
D 2 2 0 2 1 5
F 7 1 0 2 1 5
H 2 8 0 2 2 5
I tried different ways, but I can't figure out how to get multiple conditions included.
df <- df %>% filter(vars(starts_with("y_"))!=3 | !=6 | != NA)
or
df <- df %>% filter_at(vars(starts_with("y_")), all_vars(!=3 | !=6 | != NA)
I've tried some other things as well, like !%in%, but that doesn't seem to work. Any ideas?
I think you're almost there, but might need a slight shift in the logic:
df <- data.frame(A1 = 1:10,
A2 = 10:1,
A3 = 1:10,
B1 = 1:10)
df %>%
filter_at(vars(starts_with("A")), ~!(.x %in% c(3, 6, NA))) %>%
filter(if_any(starts_with("A"), ~ .x %in% c(1, 2)))
In the first step, I filter out all rows where any of the columns are 3, 6, or NA. In the second row, I filter down to only rows where at least one of the columns is 1 or 2. Does this help with your case?
Here is a base R option using rowSums :
cols <- grep('y_', names(df))
include <- c(1, 2)
not_include <- c(3, 6, NA)
result <- subset(df, rowSums(sapply(df[cols], `%in%`, include)) > 0 &
rowSums(sapply(df[cols], `%in%`, not_include)) == 0)
result
# ind y_0101 y_0102 y_0103 y_0104 y_0311 y_0312
#2 B 36 5 0 2 1 5
#4 D 2 2 0 2 1 5
#6 F 7 1 0 2 1 5
#8 H 2 8 0 2 2 5
data
df <- structure(list(ind = c("A", "B", "C", "D", "E", "F", "G", "H",
"I", "J"), y_0101 = c(33L, 36L, 22L, 2L, 5L, 7L, 8L, 2L, 1L,
3L), y_0102 = c(6L, 5L, 4L, 2L, 2L, 1L, 6L, 8L, 3L, 2L), y_0103 = c(1L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L), y_0104 = c(2L, 2L, NA, 2L,
2L, 2L, 2L, 2L, 2L, 2L), y_0311 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 1L, 1L), y_0312 = c(5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L
)), class = "data.frame", row.names = c(NA, -10L))

Need help replacing values with NA when another condition is met in R (i.e. when another variable is a specific value)

I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
An option in base R would be to create a logical vector based on the elements of 'Day'. Use that index to subset the 'x', 'y' columns and assign them to NA
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
Here's a data.table solution. Since you may be new to R, you need to install the data.table package first. If you have a large data set, data.table may work faster than using data frame. Also, I find the syntax to be easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
Using dplyr, we can use mutate_at and replace like
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))

subsetting data based with the condition of the current and previous entity in r

I have data with the status column. I want to subset my data to the condition of 'f' status, and previous condition of 'f' status.
to simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
df %>%
group_by(id) %>%
filter(status == "f" | lead(status) == "f") %>%
ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))

R - dplyr map slice for repeat rows

I have trouble combining slice and map.
I am interested of doing something similar to this; which is, in my case, transforming a compact person-period file to a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How I am suppose to do so ?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
Is not working for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f on all elements in .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.

Selecting a sequence of random length starting and ending with specific values and limited by another column

I have a fairly large data set that has the form of the following table:
value ID
1 0 A
2 0 A
3 1 A
4 1 A
5 0 A
6 -1 A
7 0 B
8 1 B
9 1 B
10 0 B
11 0 B
12 0 B
13 1 C
14 1 C
15 0 C
16 1 C
17 1 C
18 1 C
19 0 C
Essentially I'd like to transform the above, keeping only the first and last values of sequences that start with an occurrence of zero followed by a unknown number of ones and end at the last occurrence of one:
value ID
2 0 A
4 1 A
7 0 B
9 1 B
15 0 C
18 1 C
Is there an easy way to accomplish this?
dput of the first example follows:
structure(list(value = structure(c(2L, 2L, 3L, 3L, 2L, 1L, 2L,
3L, 3L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 2L), .Label = c("-1",
"0", "1"), class = "factor"), ID = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("value", "ID"), row.names = c(NA, -19L), class = "data.frame")
Here's my attempt using data.table and stringi packages combination
library(stringi)
library(data.table)
setDT(df)[, .(.I[stri_locate_all_regex(paste(value, collapse = ""), "01+")[[1]]], 0:1), by = ID]
# ID V1 V2
# 1: A 2 0
# 2: A 4 1
# 3: B 7 0
# 4: B 9 1
# 5: C 15 0
# 6: C 18 1
This basically converts each group to a single string and then detects the beginning and the end of parts that match the 01+ regex while subsetting from the row index .I. Eventually I'm just adding 0:1 to the data (which seems redundant to me at least).

Resources