Filter using Dataset Position in R

I'm not very familiar with dplyr functions in R, but I want to filter my dataset on certain conditions.
Let's say my dataset has more than 100 attributes and I want to filter on multiple conditions.
Can I write the filter using the positions of the columns instead of their names, as follows:
y = filter(retag, c(4:50) != 8 & c(90:110) == 8)
I've tried code similar to this a few times, but still haven't got the result.
I also tried the following code, but I'm not sure how to add another condition to the rowSums call.
retag[rowSums((retag!=8)[,c(4:50)])>=1,]
The only examples I found used the column names instead of their positions.
Is there any way to filter using the column positions, as my data is quite large?

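You can use a combination of filter() and across(). I didn't have your version of the retag dataframe so I created my own as an example. (Newer dplyr versions deprecate across() inside filter() in favour of if_all() and if_any().)
library(dplyr)
set.seed(2000)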
retag <- tibble(
col1 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col2 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col3 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col4 = runif(n = 1000, min = 0, max = 10) %>% round(0),
col5 = runif(n = 1000, min = 0, max = 10) %>% round(0)
)
# filter where the first, second, and third column all equal 5 and the fourth column does not equal 5
retag %>%
filter(
across(1:3, function(x) x == 5),
across(4, function(x) x != 5)
)

if_all() and if_any() were introduced in dplyr 1.0.4 for the purpose of filtering across multiple variables.
library(dplyr)
filter(retag, if_all(X:Y, ~ .x > 10 & .x < 35))
# # A tibble: 5 x 2
# X Y
# <int> <int>
# 1 11 30
# 2 12 31
# 3 13 32
# 4 14 33
# 5 15 34
filter(retag, if_any(X:Y, ~ .x == 2 | .x == 25))
# # A tibble: 2 x 2
# X Y
# <int> <int>
# 1 2 21
# 2 6 25
Data
retag <- structure(list(X = 1:20, Y = 20:39), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))

Here's a base R option.
This will select rows where there is no 8 in columns 4 to 50 and there is at least one 8 in columns 90 to 110.
result <- retag[rowSums(retag[4:50] == 8, na.rm = TRUE) == 0 &
                rowSums(retag[90:110] == 8, na.rm = TRUE) > 0, ]
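The same kind of position-based filter can also be sketched with if_all() and if_any(), which accept column positions as well as names. This is only a minimal sketch on a made-up five-column tibble (the column names, values and the object name kept are illustrative, and it assumes dplyr 1.0.4 or later); for the real data the ranges would be 4:50 and 90:110.

```r
library(dplyr)

# Toy data: column 1 is an id, columns 2:3 play the role of 4:50,
# columns 4:5 play the role of 90:110 (all values are made up)
retag <- tibble(
  id = 1:4,
  a  = c(1, 8, 2, 3),
  b  = c(2, 2, 8, 4),
  c  = c(8, 8, 8, 0),
  d  = c(0, 0, 0, 0)
)

kept <- retag %>%
  filter(
    if_all(2:3, ~ .x != 8),  # no 8 anywhere in columns 2 to 3
    if_any(4:5, ~ .x == 8)   # at least one 8 in columns 4 to 5
  )

kept
```

Conditions separated by commas inside filter() are combined with AND, so a row is kept only when both the if_all() and the if_any() test pass.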

Related

How to add a column that contains specific values when criteria is met?

I have a dataframe:
df <- tibble(
  x = c(1, 2, 3),
  y = c(0, 2, 4)
)
I want to add a NEW variable "z" that will be:
z = c("Lower", "Equal", "Higher")
I was thinking about using a for loop but I'm not sure if that's the most efficient/correct way.

The new variable can be created by taking the difference of 'y' and 'x', getting its sign, and converting that to a factor with the levels and corresponding labels specified. (The expected output describes 'y' relative to 'x', so the difference is y - x.)
library(dplyr)
df1 %>%
  mutate(z = factor(sign(y - x), levels = c(-1, 0, 1),
                    labels = c('Lower', 'Equal', 'Higher')))
Or an option with case_when
df1 %>%
  mutate(tmp = y - x,
         z = case_when(tmp > 0 ~ 'Higher', tmp < 0 ~ 'Lower',
                       TRUE ~ 'Equal'), tmp = NULL)
data
df1 <- tibble(
x = c(1,2,3),
y = c(0,2,4))

A base R option
within(df,z <- c("Lower", "Equal", "Higher")[sign(y-x)+2])
which gives
# A tibble: 3 x 3
x y z
<dbl> <dbl> <chr>
1 1 0 Lower
2 2 2 Equal
3 3 4 Higher

R package "infer" - Iterative bootstrapping / looping over column names

I'm bootstrapping with the infer package.
The statistic of interest is the mean, example data is given by a tibble with 3 columns and 5 rows. My real tibble has 86 rows and 40 columns. For every column I want to do a bootstrap simulation, like shown below for the column "x" in tibble "test_tibble".
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15)
# A tibble: 5 x 3
x y z
<int> <int> <int>
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
specify(test_tibble, response = x) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
# A tibble: 1 x 2
lower_CI upper_CI
<dbl> <dbl>
1 2.10 4
I am now looking for a way of doing the same thing for the other columns in my tibble. I have tried a for-loop like this:
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, response = var_name) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Unfortunately, this returns the following error:
Error: The response variable `var_name` cannot be found in this dataframe.
Is there any way of iterating over the columns x, y and z without entering them manually as arguments for "response"? That'd be quite tedious for 40 columns.

This is a tricky question with a tricky answer.
Take a look at the response argument of the specify function in documentation:
The variable name in x that will serve as the response. This is alternative to using the formula argument.
With this in mind I modified the code to automate the process, adding one more column to the original dataframe and using the formula argument to obtain the same result, using a column of ones as explanatory variable.
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15, w = seq(1, 1, length.out = 5))
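results <- list()
# Loop over the original columns only, skipping the helper column w
for (var_name in setdiff(names(test_tibble), "w")) {
  # Build the formula "<var_name> ~ w" from the column name, then keep only the response column
  results[[var_name]] <- specify(test_tibble, formula = eval(parse(text = paste0(var_name, "~", "w"))))[, 1] %>%
    generate(reps = 100, type = "bootstrap") %>%
    calculate(stat = "mean") %>%
    summarise(
      lower_CI = quantile(probs = 0.025, stat),
      upper_CI = quantile(probs = 0.975, stat)
    )
}
# A for loop does not print its results, so collect them and show one row per column
bind_rows(results, .id = "variable")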
Hope it helps
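If the goal is only a percentile interval for the mean of every column, the loop can also be sketched without infer at all. The following is a hand-rolled base R bootstrap, not the infer API; the helper name boot_ci and the object ci_table are made up for illustration, and the seed is arbitrary.

```r
set.seed(1)
test_tibble <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# Hypothetical helper: percentile bootstrap interval for the mean of one column.
# Assumes the column has more than one value, because sample() treats a single
# number n as 1:n.
boot_ci <- function(values, reps = 100) {
  boot_means <- replicate(reps, mean(sample(values, replace = TRUE)))
  quantile(boot_means, probs = c(0.025, 0.975), names = FALSE)
}

# sapply() walks over the columns by itself, so no column name is typed by hand
ci_table <- t(sapply(test_tibble, boot_ci))
colnames(ci_table) <- c("lower_CI", "upper_CI")
ci_table
```

The result is a matrix with one row per column name, which scales to 40 columns without any change to the code.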

How to use R and dplyr to paste column name as value for lookup

I am using the output coefficients from a glm regression model and I need to create a lookup value, using a key of the form [column name].[Factor Level] (built with paste), and then return the corresponding value from another data table. The column names must be dynamic so that I don't have to explicitly name each column one by one.
The returned values from the lookup are then multiplied by 1 (for factors) or by the actual numeric values, and all coef_colnames are summed into the column Total.
I've built an example in Excel but cannot replicate it in R.
var_Factor1 combines the column name and the factor level from each row (using paste) to build a key for the lookup in the next step
var_Number1 is just the column name, as it is numeric and has no factor levels
library(dplyr)
library(data.table)
# original data
dt = data.table(
Factor1 = c("A","B","C"),
Number1 = c(10, 20,40),
Factor2 = c("D","H","N"),
Number2 = c(2, 5,3)
)
# Lookup table
model_coef = data.table(
Factor1.A = 10,
Factor1.B = 20,
Factor1.C = 30,
Factor2.D = 40,
Factor2.H = 50,
Factor2.N = 60,
Number1 = 200,
Number2 = 500
)
#initial steps
dt <- dt %>% mutate (
var_Factor1 = paste("Factor1", Factor1, sep =".")
, var_Number1 = "Number1"
, var_Factor2 = paste("Factor2", Factor2, sep =".")
, var_Number2 = "Number2"
) %>% mutate (
coef_Factor1 = model_coef[,var_Factor1]
)
#The final output should produce (as replicated from Excel)
final_output = data.table (
Factor1= c("A", "B", "C"),
Number1= c(10, 20, 40),
Factor2= c("D", "H", "N"),
Number2= c(2, 5, 3),
var_Factor1= c("Factor1.A", "Factor1.B", "Factor1.C"),
var_Number1= c("Number1", "Number1", "Number1"),
var_Factor2= c("Factor2.D", "Factor2.H", "Factor2.N"),
var_Number2= c("Number2", "Number2", "Number2"),
coef_Factor1= c(10, 20, 30),
coef_Number1= c(200, 200, 200),
coef_Factor2= c(40, 50, 60),
coef_Number2= c(500, 500, 500),
calc_Factor1= c(10, 20, 30),
calc_Number1= c(2000, 4000, 8000),
calc_Factor2= c(40, 50, 60),
calc_Number2= c(1000, 2500, 1500),
Total= c(3050, 6570, 9590)
)

It's generally a bad idea to try to generate and manipulate dynamic columns.
It will probably be better to use tidy data conventions and make the data "long". Also, it looks like you're trying to mix data.table and dplyr/tidyverse. In particular, this doesn't work: mutate(coef_Factor1 = model_coef[,var_Factor1])
I've tidied your data and modified your code to use dplyr/tidyverse below:
- using tibble instead of data.table
- re-built the lookup table in tidy-long format so it can be left_joined properly to your table
- used mutate to do the calculations that you describe
Beyond your example, if you have more than 2 "Numbers"/"Factors" (your naming/labeling/numbering is confusing btw), there are ways to generalize further so that the code multiplies coef * number generically, for each "number"/combination. Also, your data implies but it isn't clear that A is related to D, B is related to H, etc.
library(tidyverse)
data <- tibble(Factor1 = c("A","B","C"),Number1 = c(10, 20,40),Factor2 = c("D","H","N"),Number2 = c(2, 5,3))
model_coef <- tibble(Factor1.A = 10,Factor1.B = 20,Factor1.C = 30,Factor2.D = 40,Factor2.H = 50,Factor2.N = 60,Number1 = 200,Number2 = 500)
(model_coef_factor1 <- model_coef %>%
select(Factor1.A:Factor1.C) %>%
pivot_longer(cols = everything(), names_to = c("number", "factor"), names_sep = "[.]", values_to = "coef_factor1") %>%
select(-number))
#> # A tibble: 3 x 2
#> factor coef_factor1
#> <chr> <dbl>
#> 1 A 10
#> 2 B 20
#> 3 C 30
(model_coef_factor2 <- model_coef %>%
select(Factor2.D:Factor2.N) %>%
pivot_longer(cols = everything(), names_to = c("number", "factor"), names_sep = "[.]", values_to = "coef_factor2") %>%
select(-number))
#> # A tibble: 3 x 2
#> factor coef_factor2
#> <chr> <dbl>
#> 1 D 40
#> 2 H 50
#> 3 N 60
(final_output <- data %>%
left_join(model_coef_factor1, by = c("Factor1"="factor")) %>%
left_join(model_coef_factor2, by = c("Factor2"="factor")) %>%
mutate(coef_number1 = model_coef$Number1,
coef_number2 = model_coef$Number2,
calc_factor1 = coef_factor1,
calc_number1 = Number1 * coef_number1,
calc_factor2 = coef_factor2,
calc_number2 = Number2 * coef_number2,
total = calc_factor1 + calc_number1 + calc_factor2 + calc_number2) %>%
select(total, everything()))
#> # A tibble: 3 x 13
#> total Factor1 Number1 Factor2 Number2 coef_factor1 coef_factor2
#> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 3050 A 10 D 2 10 40
#> 2 6570 B 20 H 5 20 50
#> 3 9590 C 40 N 3 30 60
#> # ... with 6 more variables: coef_number1 <dbl>, coef_number2 <dbl>,
#> # calc_factor1 <dbl>, calc_number1 <dbl>, calc_factor2 <dbl>,
#> # calc_number2 <dbl>
Created on 2019-10-23 by the reprex package (v0.3.0)
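One way the generalization mentioned above might look, as a rough base R sketch rather than a tidyverse one: treat the one-row coefficient table as a named vector and build the lookup keys from the column names. The names coef_vec, factor_cols, number_cols and total are made up for this illustration, and it assumes every character column is a factor term and every numeric column is a numeric term of the model.

```r
data <- data.frame(
  Factor1 = c("A", "B", "C"), Number1 = c(10, 20, 40),
  Factor2 = c("D", "H", "N"), Number2 = c(2, 5, 3),
  stringsAsFactors = FALSE
)
model_coef <- data.frame(
  Factor1.A = 10, Factor1.B = 20, Factor1.C = 30,
  Factor2.D = 40, Factor2.H = 50, Factor2.N = 60,
  Number1 = 200, Number2 = 500
)

# One-row table -> named numeric vector, so coefficients can be looked up by key
coef_vec <- unlist(model_coef)

factor_cols <- names(data)[sapply(data, is.character)]
number_cols <- names(data)[sapply(data, is.numeric)]

total <- rep(0, nrow(data))
for (col in factor_cols) {
  # Key is "<column name>.<level>", e.g. "Factor1.A"; the coefficient is used as is
  keys <- paste(col, data[[col]], sep = ".")
  total <- total + unname(coef_vec[keys])
}
for (col in number_cols) {
  # Numeric columns are multiplied by the coefficient stored under the column name
  total <- total + data[[col]] * coef_vec[[col]]
}

data$Total <- total
data
```

Because the columns are discovered by type, adding a Factor3 or Number3 column only requires matching entries in the coefficient table.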

Mutating a "rich" character vector into multiple variable denoting presence of their element [duplicate]

I'm currently cleaning some survey data where there are variables with multiple responses in each. For instance, respondents endorse all elements that apply and they are all stored in one variable, e.g. "Dogs, Cats, Rhinos". A reproducible example of one such variable is given below:
library(dplyr); library(magrittr)
set.seed(42)
foo <- data.frame(x = c(sample(LETTERS[1:5],
size = runif(1, min = 0, max = 5),
replace = F) %>% paste0(collapse = ", "),
sample(LETTERS[1:5],
size = runif(1, min = 0, max = 5),
replace = F) %>% paste0(collapse = ", ")))
What I'm looking to accomplish is to decompose the elements of a variable and create new variables denoting the presence (or absence) of a given element. In this case the separator between elements is a comma. An example of the intended output is given below.
fooWant <- data.frame("A" = c(1, 0), "B" = c(1, 1), "D" = c(1, 0), "E" = c(1, 1))
So far my progress hasn't been great: I've only managed to parse the elements into nested lists (code below), and am hoping that someone can take me the rest of the way there. Thanks a ton :)
strsplit(foo$x %>% as.character, split = "[,]\\s?") %>% sapply(X = ., sort)

A tidyverse solution using tidyr::separate_rows and tidyr::spread
library(tidyverse)
foo %>%
rowid_to_column("row") %>%
separate_rows(x) %>%
mutate(n = 1) %>%
spread(x, n, fill = 0) %>%
select(-row)
# A B D E
#1 1 1 1 1
#2 0 1 0 1

How about something like this?
library(dplyr)
library(magrittr)
library(stringr)
set.seed(42)
foo <- data.frame(
x = c(
sample(
LETTERS[1:5],
size = runif(1, min = 0, max = 5),
replace = F
) %>%
paste0(collapse = ", "),
sample(
LETTERS[1:5],
size = runif(1, min = 0, max = 5),
replace = F
) %>%
paste0(collapse = ", "))
)
foo[, LETTERS[1:5]] <- do.call(
rbind,
lapply(
foo$x,
function (df) {
str_count(df, LETTERS[1:5])
}
)
)
str_count counts the number of occurrences of each possible value, and the counts are appended as columns to the right of the original data.
> foo
x A B C D E
1 E, A, D, B 1 1 0 1 1
2 B, E 0 1 0 0 1
Here it is again, but as a tibble to more clearly see the columns:
> library(tibble); as_tibble(foo)
# A tibble: 2 x 6
x A B C D E
<fct> <int> <int> <int> <int> <int>
1 E, A, D, B 1 1 0 1 1
2 B, E 0 1 0 0 1
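A dependency-free variant of the same idea can be sketched in base R with strsplit() and %in%. The input below hard-codes the two strings shown in the output above, and the names parts, levels_found, dummies and result are made up for this illustration. Unlike the str_count version, it only creates columns for elements that actually occur.

```r
foo <- data.frame(x = c("E, A, D, B", "B, E"), stringsAsFactors = FALSE)

# Split each response on a comma followed by optional whitespace
parts <- strsplit(foo$x, ",\\s*")

# Every distinct element that appears anywhere becomes one indicator column
levels_found <- sort(unique(unlist(parts)))

# One row per respondent, one 0/1 column per element
dummies <- t(sapply(parts, function(p) as.integer(levels_found %in% p)))
colnames(dummies) <- levels_found

result <- cbind(foo, dummies)
result
```

Matching whole split elements with %in% avoids the substring matches that a pattern count could pick up if one element name were contained in another.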

Finding maximum number from column for each day of whole year and creating a plot up to this number in R

There is a dataset covering a whole year:
Month Day Time X Y
...
3 1 0 2 4
3 1 1 4 2
3 1 2 7 3
3 1 3 8 8
3 1 4 4 6
3 1 5 1 4
3 1 6 6 6
3 1 7 7 9
...
3 2 0 5 7
3 2 1 7 2
3 2 2 9 3
...
4 1 0 2 8
...
I want to find the maximum value of X for each day and create a plot for each day, starting from the beginning of the day (Time 0) up to the time of that maximum. I tried to use a data frame but got a bit lost, and the dataset is quite big, so I'm not sure this is the best idea.
Any ideas how to do it?

If I understood you correctly, this should work:
Sample dataset:
set.seed(123)
df <- data.frame(Month = sample(c(1:12), 30, replace = TRUE),
Day = sample(c(1:31), 30, replace = TRUE),
Time = sample(c(1:24), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
Using tidyverse (ggplot and dplyr):
library(tidyverse)
df %>%
#Grouping by month and day
group_by(Month, Day) %>%
#Creating new variables for x and y - the daily max value, and setting values bigger than that max to NA.
mutate(maxX = max(x, na.rm = TRUE),
maxY = max(y, na.rm = TRUE),
plotX = ifelse(x > maxX, NA, x),
plotY = ifelse(y > maxY, NA, y)) %>%
ungroup() %>%
#Select and gather only the needed variables for the plot
select(Time, plotX, plotY) %>%
gather(plot, value, -Time) %>%
#Plot
ggplot(aes(Time, value, color = plot)) +
geom_point()

You can try a tidyverse approach. Duplicated Times per Day and Month are removed without any ranking.
library(tidyverse)
set.seed(123)
df <- data.frame(Month = sample(c(1:2), 30, replace = TRUE),
Day = sample(c(1:2), 30, replace = TRUE),
Time = sample(c(1:10), 30, replace = TRUE),
x = rnorm(30, mean = 10, sd = 5),
y = rnorm(30, mean = 10, sd = 5))
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>% # remove duplicated "Time"s.
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x)) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))+
facet_grid(Month~Day, labeller = label_both)
Or try to put all in one plot using different colors
df %>%
group_by(Month, Day) %>%
filter(!duplicated(Time)) %>%
filter(x<=max(x) & Time <= Time[x == max(x)]) %>%
ggplot(aes(Time, x, color = interaction(Month, Day))) +
geom_line() +
geom_point(data=. %>% filter(x == max(x)))
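The filtering step on its own, without the plotting, can be sketched in base R as follows. The toy data mimics the layout in the question (values are made up), the name upto_peak is hypothetical, and the sketch assumes each day has a single row per Time; with ties in X, which.max() picks the first maximum.

```r
df <- data.frame(
  Month = 3,
  Day   = c(1, 1, 1, 1, 2, 2, 2),
  Time  = c(0, 1, 2, 3, 0, 1, 2),
  X     = c(2, 4, 8, 7, 5, 9, 3)
)

# Split into one data frame per Month/Day, keep rows up to the time of the
# daily maximum of X, then stack the pieces back together
per_day <- split(df, list(df$Month, df$Day), drop = TRUE)
upto_peak <- do.call(rbind, lapply(per_day, function(d) {
  d[d$Time <= d$Time[which.max(d$X)], ]
}))

upto_peak
```

The resulting data frame can then be passed to ggplot() with facets or colours per day, as in the plots above.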