Subset data based on conditions in R - r

I have multiple inputs like:
a <- x y z
1 2 2
2 3 2
3 2 4
4 2 4
5 2 1
b <- c(1,2)
c <- c(2,3)
I want to subset this data on the condition that a$x is greater than or equal to b[i] and less than or equal to c[i], for each i.
The output should look like:
d <- x y z
1 2 2
2 3 2
2 3 2
3 2 4
I have tried this:
d = as.data.frame(matrix(ncol=3, nrow=0))
names(d) = names(a)
for (i in 1:length(b)) {
d <- rbind(d,a[which(a$x>=b[i] & a$x<=c[i]),])
}

Using the dplyr::filter function:
library(dplyr)
sub_list <- lapply(1:length(b), function(i) a %>% filter(x >= b[i] & x <= c[i]))
do.call(rbind, sub_list)
x y z
1 1 2 2
2 2 3 2
3 2 3 2
4 3 2 4
Input data:
a <- structure(list(x = 1:5, y = c(2L, 3L, 2L, 2L, 2L), z = c(2L,
2L, 4L, 4L, 1L)), .Names = c("x", "y", "z"), class = "data.frame", row.names = c(NA,
-5L))
b <- c(1,2)
c <- c(2,3)
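For comparison, here is a minimal base R sketch of the same idea without dplyr. It assumes the bounds are paired element by element; the names lower, upper, and pieces are only illustrative and stand in for b and c from the question.

```r
# Input data from the question; lower/upper are illustrative names for b and c
a <- data.frame(x = 1:5, y = c(2L, 3L, 2L, 2L, 2L), z = c(2L, 2L, 4L, 4L, 1L))
lower <- c(1, 2)
upper <- c(2, 3)

# One subset per (lower, upper) pair, then stack the pieces
pieces <- Map(function(lo, hi) a[a$x >= lo & a$x <= hi, ], lower, upper)
d <- do.call(rbind, pieces)
rownames(d) <- NULL  # drop the row names inherited from a
d
```

Rows that fall into more than one interval appear once per interval, which matches the duplicated row in the desired output.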

Related

How to remove rows if values from a specified column in data set 1 does not match the values of the same column from data set 2 using dplyr

I have 2 data sets, both include ID columns with the same IDs. I have already removed rows from the first data set. For the second data set, I would like to remove any rows associated with IDs that do not match the first data set by using dplyr.
Meaning whatever is in DF2 must also be in DF1; if it is not, it must be removed from DF2.
For example:
DF1
ID X Y Z
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
DF2
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
DF2 once rows have been removed
ID A B C
1 1 1 1
2 2 2 2
3 3 3 3
5 5 5 5
6 6 6 6
I used anti_join() which shows me the difference in rows but I cannot figure out how to remove any rows associated with IDs that do not match the first data set by using dplyr.
Try with paste
i1 <- do.call(paste, DF2) %in% do.call(paste, DF1)
# if it is only to compare the 'ID' columns
i1 <- DF2$ID %in% DF1$ID
DF3 <- DF2[i1,]
DF3
ID A B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 5 5 5 5
5 6 6 6 6
DF4 <- DF2[!i1,]
DF4
ID A B C
4 4 4 4 4
7 7 7 7 7
data
DF1 <- structure(list(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L,
5L, 6L), Y = c(1L, 2L, 3L, 5L, 6L), Z = c(1L, 2L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
DF2 <- structure(list(ID = 1:7, A = 1:7, B = 1:7, C = 1:7), class = "data.frame", row.names = c(NA,
-7L))
# Load package
library(dplyr)
# Load dataframes
df1 <- data.frame(
ID = 1:6,
X = 1:6,
Y = 1:6,
Z = 1:6
)
df2 <- data.frame(
ID = 1:7,
X = 1:7,
Y = 1:7,
Z = 1:7
)
# Include all rows in df1
df1 %>%
left_join(df2)
Joining, by = c("ID", "X", "Y", "Z")
ID X Y Z
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
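Since the question asks specifically for dplyr, a filtering join may be closer to the goal than left_join. The following is a sketch, assuming that matching on the ID column alone is what is wanted; the data frames are cut down to two columns for brevity, and kept and dropped are illustrative names.

```r
library(dplyr)

# Reduced versions of the question's data (ID plus one value column)
DF1 <- data.frame(ID = c(1L, 2L, 3L, 5L, 6L), X = c(1L, 2L, 3L, 5L, 6L))
DF2 <- data.frame(ID = 1:7, A = 1:7)

# semi_join keeps the rows of DF2 whose ID occurs in DF1,
# without adding any columns from DF1
kept <- semi_join(DF2, DF1, by = "ID")

# anti_join is the complement: the rows of DF2 with no match in DF1
dropped <- anti_join(DF2, DF1, by = "ID")

kept
dropped
```

Here kept should hold IDs 1, 2, 3, 5, 6 and dropped IDs 4 and 7, the same split as DF3 and DF4 in the %in% answer.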

Map a function to two data frames of unequal lengths

For each row in df1 I would like to execute mult 10 times, once for each year in df2.
One option I can think of is to repeat df1 multiple times and join it to df2. But my actual data are much larger (~20k sections, 15 areas and 100 years), so I am looking for a more efficient way to do this.
# df1
section area a b c
1 1 1 0.1208916 0.7235306 0.7652636
2 2 1 0.8265642 0.2939602 0.6491496
3 1 2 0.9101611 0.7363248 0.1509295
4 2 2 0.8807047 0.5473221 0.6748055
5 1 3 0.2343558 0.2044689 0.9647333
6 2 3 0.4112479 0.9523639 0.1533197
----------
# df2
year d
1 1 0.7357432
2 2 0.4591575
3 3 0.3654561
4 4 0.1996439
5 5 0.2086226
6 6 0.5628826
7 7 0.4772953
8 8 0.8474007
9 9 0.8861693
10 10 0.6694851
mult <- function(a, b, c, d) {a * b * c * d}
The desired output would look something like this
section area year e
1 1 1 1 results of mult()
2 2 1 1 results of mult()
3 1 2 1 results of mult()
4 2 2 1 results of mult()
5 1 3 1 results of mult()
6 2 3 1 results of mult()
7 1 1 2 results of mult()
8 2 1 2 results of mult()
...
dput(df1)
structure(list(section = c(1L, 2L, 1L, 2L, 1L, 2L), area = c(1L,
1L, 2L, 2L, 3L, 3L), a = c(0.12089157756418, 0.826564211165532,
0.91016107192263, 0.880704707000405, 0.234355789143592, 0.411247851792723
), b = c(0.72353063733317, 0.293960151728243, 0.736324765253812,
0.547322086291388, 0.204468948533759, 0.952363904565573), c = c(0.765263637062162,
0.649149592733011, 0.150929539464414, 0.674805536167696, 0.964733332861215,
0.15331974090077)), out.attrs = list(dim = structure(2:3, .Names = c("section",
"area")), dimnames = list(section = c("section=1", "section=2"
), area = c("area=1", "area=2", "area=3"))), class = "data.frame", row.names = c(NA,
-6L))
dput(df2)
structure(list(year = 1:10, d = c(0.735743158031255, 0.459157506935298,
0.365456136409193, 0.199643932981417, 0.208622586680576, 0.562882597092539,
0.477295308141038, 0.847400720929727, 0.886169332079589, 0.669485098216683
)), class = "data.frame", row.names = c(NA, -10L))
Edit: full sized toy dataset
library(dplyr)
df1 <- expand.grid(section = 1:20000,
area = 1:15) %>%
mutate(a = runif(300000),
b = runif(300000),
c = runif(300000))
df2 <- data.frame(year = 1:100,
d = runif(100))
You can use crossing to create combinations of df1 and df2 and apply mult to them.
tidyr::crossing(df1, df2) %>% dplyr::mutate(e = mult(a, b, c, d))
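If you would rather avoid the extra dependency, the same cross product can be sketched in base R by building the row index pairs with expand.grid. This is only an illustration on a small made-up data set (the values below are not from the question), and idx and res are hypothetical names. Note that the result still has nrow(df1) * nrow(df2) rows, so with the full-sized data it will be about 30 million rows whichever way it is built.

```r
mult <- function(a, b, c, d) a * b * c * d

# Small made-up inputs with the same column layout as the question
df1 <- data.frame(section = c(1L, 2L, 1L, 2L), area = c(1L, 1L, 2L, 2L),
                  a = c(0.1, 0.2, 0.3, 0.4), b = c(1, 2, 3, 4), c = c(2, 2, 2, 2))
df2 <- data.frame(year = 1:3, d = c(0.5, 1, 2))

# Every (df1 row, df2 row) pair; the df1 index varies fastest,
# so all sections/areas for year 1 come first, then year 2, and so on
idx <- expand.grid(row = seq_len(nrow(df1)), yr = seq_len(nrow(df2)))

res <- data.frame(
  df1[idx$row, c("section", "area")],
  year = df2$year[idx$yr],
  e = mult(df1$a[idx$row], df1$b[idx$row], df1$c[idx$row], df2$d[idx$yr]),
  row.names = NULL
)
head(res)
```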

Find multiple "switching points" by comparing the answers in columns

I have a data set in which subjects have made choices between A and B for 13 different B's. Below is a simplified example of what the data looks like with 54 subjects and 5 choices. (1 is A, 2 is B).
subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2
I would like to find the questions at which subjects switch from option A to B, i.e. for subject 1 this would be choice4.
In a previous study we did this by computing the number of times the subject chose option A and then selecting the corresponding option B from a separate matrix. See code below.
However, the difference now is that instead of choosing 1 switching point, subjects were asked the questions in a randomized order, and thus there is the possibility of having multiple switching points. For example in the table above, subject 3 switches to B at choice2 and again at choice4.
I would like to find both the first time the subject switches to option B, and the last time (before sticking with B for the rest of the choices).
sure_amounts <- matrix(nrow = 4, ncol = 13) # 4 treatments, 13 questions
sure_amounts[1, ] <- c(0, 2, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 10) # Option B's
sure_amounts[2, ] <- seq(2, 14, 1)
sure_amounts[3, ] <- seq(2, 14, 1)
sure_amounts[4, ] <- seq(2, 14, 1)
b_choice <- matrix(nrow = 201, ncol = 4)
switch_choice <- matrix(nrow = 201, ncol = 4) # switching point from A to B
for(j in 1:4){ # number of treatments
for(i in 1:201){ # number of subjects
choice = NULL
fl = data$ID == i
k = 1 + 36*(j-1) # 36 before going to the next treatment (due to other questions)
choice = c(data[fl,k:(k+12)])
b_choice[i,j] = length(choice[choice==1])
temp = b_choice[i,j]
switch_choice[i,j] <- ifelse(temp==0, 0, sure_amounts[j, temp])
}
}
Does anyone have any tips on how to approach this? Thanks in advance!
I am not sure what you want your expected output to look like, but you can try getting the data into long format and, for each subject, selecting the rows where they switch from 1 -> 2.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = -subject) %>%
group_by(subject) %>%
filter(value == 2 & lag(value) == 1 |
value == 1 & lead(value) == 2)
# subject name value
# <int> <chr> <int>
# 1 1 choice3 1
# 2 1 choice4 2
# 3 2 choice2 1
# 4 2 choice3 2
# 5 3 choice1 1
# 6 3 choice2 2
# 7 3 choice3 1
# 8 3 choice4 2
# 9 4 choice1 1
#10 4 choice2 2
Here we can see that subject 1 moves from 1 -> 2 from choice3 -> choice4 and so on.
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -4L))
A Base R solution:
Essentially, this code subtracts a lagged copy of the decisions and detects where the difference is not equal to zero.
Code:
lapply(as.data.frame(t(df_1)[-1,]), function(x){
t <- x - c(x[-1], 0) # row subtracted by shortened row
z <- which(t[-length(t)] != 0) # values not equal to zero and rm last value
z + 1 # remove lag
})
# $`1`
# [1] 4
# $`2`
# [1] 3
# $`3`
# [1] 2 3 4
# $`4`
# [1] 2
Data:
df_1 <- read.table(text = " subject choice1 choice2 choice3 choice4 choice5
1 1 1 1 1 2 2
2 2 1 1 2 2 2
3 3 1 2 1 2 2
4 4 1 2 2 2 2 ", header = T)
An alternative approach:
library(dplyr)
library(stringr)
library(purrr)
df %>%
mutate(g = paste0(choice1, choice2, choice3, choice4, choice5),
switches = as.character(map(g, ~pluck(str_locate_all(.x, "12"), 1)))) %>%
select(-g)
#> subject choice1 choice2 choice3 choice4 choice5 switches
#> 1 1 1 1 1 2 2 3:4
#> 2 2 1 1 2 2 2 2:3
#> 3 3 1 2 1 2 2 c(1, 3, 2, 4)
#> 4 4 1 2 2 2 2 1:2
data
df <- structure(list(subject = 1:4, choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L,
1L, 2L, 2L), choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L,
2L, 2L), choice5 = c(2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Created on 2020-07-10 by the reprex package (v0.3.0)
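To get only the first and the last switch to B per subject, as the question asks, something along these lines might work in base R. It is a sketch that assumes the choices are coded 1 (A) and 2 (B) and that the columns are in the order the questions should be compared; switch_pos, res, and out are illustrative names.

```r
df <- data.frame(subject = 1:4,
                 choice1 = c(1L, 1L, 1L, 1L), choice2 = c(1L, 1L, 2L, 2L),
                 choice3 = c(1L, 2L, 1L, 2L), choice4 = c(2L, 2L, 2L, 2L),
                 choice5 = c(2L, 2L, 2L, 2L))

choices <- as.matrix(df[, -1])

# Positions where a 1 is followed by a 2; diff() == 1 marks an A -> B switch.
# unname() keeps the column names from leaking into the result names.
switch_pos <- function(x) {
  pos <- unname(which(diff(x) == 1)) + 1L
  if (length(pos) == 0) {
    c(first = NA_integer_, last = NA_integer_)  # subject never switches to B
  } else {
    c(first = pos[1], last = pos[length(pos)])
  }
}

res <- t(apply(choices, 1, switch_pos))

out <- data.frame(subject = df$subject,
                  first_switch = colnames(choices)[res[, "first"]],
                  last_switch = colnames(choices)[res[, "last"]],
                  stringsAsFactors = FALSE)
out
```

For subject 3 this gives choice2 as the first switch and choice4 as the last one; for the other subjects both columns agree because they switch only once.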

How to extract a column based on column name?

I have a data frame df
m n o p
a 1 1 2 5
b 1 2 0 4
c 3 3 3 3
I can extract column m by:
df[,"m"]
Now the problem is, the column name was generated somewhere else (multiple times, in a for loop). For example, column name m was generated by choosing a specific element in the dataframe, gen, in one loop:
> gen[i,1]
[1] m
How do I extract the column based on gen[i,1]?
Just nest the subsetting.
dat[,"m"]
# [1] 1 1 3
i <- 13
gen[i, 1]
# [1] "m"
dat[, gen[i, 1]]
# [1] 1 1 3
Or, if you don't want the column to be dropped:
dat[, gen[i, 1], drop=FALSE]
# m
# a 1
# b 1
# c 3
Data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
We can use select from dplyr
library(dplyr)
i <- 13
dat %>%
select(gen[i, 1])
# m
#a 1
#b 1
#c 3
data
dat <- structure(list(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L),
p = 5:3), class = "data.frame", row.names = c("a", "b", "c"
))
gen <- data.frame(letters)
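Another option, if a plain vector is what you are after, is `[[`. A small sketch using the same dat and gen as above; col_name is just an illustrative variable, and the as.character() call is there as a precaution in case gen[i, 1] comes back as a factor (as it would by default in R versions before 4.0).

```r
dat <- data.frame(m = c(1L, 1L, 3L), n = 1:3, o = c(2L, 0L, 3L), p = 5:3,
                  row.names = c("a", "b", "c"))
gen <- data.frame(letters, stringsAsFactors = FALSE)

i <- 13
col_name <- as.character(gen[i, 1])  # "m"

# [[ always returns the column itself as a vector
dat[[col_name]]

# single-bracket indexing with a character vector keeps the data frame
dat[col_name]
```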

How to use column indices to collect values from columns in R

x y z column_indices
6 7 1 1,2
5 4 2 3
1 3 2 1,3
I have the column indices of the values I would like to collect in a separate column, as shown above. What I want to create is something like this:
x y z column_indices values
6 7 1 1,2 6,7
5 4 2 3 2
1 3 2 1,3 1,2
What is the simplest way to do this in R?
Thanks!
In base R, we can use apply, split the column_indices on ',', convert them to integer and get the corresponding value from the row.
df$values <- apply(df, 1, function(x) {
inds <- as.integer(strsplit(x[4], ',')[[1]])
toString(x[inds])
})
df
# x y z column_indices values
#1 6 7 1 1,2 6, 7
#2 5 4 2 3 2
#3 1 3 2 1,3 1, 2
data
df <- structure(list(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L,
2L, 2L), column_indices = structure(c(1L, 3L, 2L), .Label = c("1,2",
"1,3", "3"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))
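A variation on the same idea, sketched here under the assumption that column_indices is stored as character and that the indices always refer to the x, y, z columns: splitting the indices once and indexing a numeric matrix avoids the character conversion that apply performs on each row. The names idx and vals are illustrative.

```r
df <- data.frame(x = c(6L, 5L, 1L), y = c(7L, 4L, 3L), z = c(1L, 2L, 2L),
                 column_indices = c("1,2", "3", "1,3"),
                 stringsAsFactors = FALSE)

# Parse "1,2" into c(1L, 2L) for every row
idx <- lapply(strsplit(df$column_indices, ",", fixed = TRUE), as.integer)

# Keep only the value columns, as an integer matrix
vals <- as.matrix(df[, c("x", "y", "z")])

# Pick the requested columns row by row and collapse them into one string
df$values <- vapply(seq_len(nrow(df)),
                    function(i) paste(vals[i, idx[[i]]], collapse = ","),
                    character(1))
df
```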
One solution involving dplyr and tidyr could be:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-column_indices) %>%
group_by(column_indices) %>%
mutate(values = toString(value[1:n() %in% unlist(strsplit(column_indices, ","))])) %>%
pivot_wider(names_from = "name", values_from = "value")
column_indices values x y z
<chr> <chr> <int> <int> <int>
1 1,2 6, 7 6 7 1
2 3 2 5 4 2
3 1,3 1, 2 1 3 2
