How can I show an output in r by a condition - r

I'm trying to show the output of my code depending on a condition, print.condition(pregunta_menos < 12347), but it only displays an error. How can I display the minimum sum in the output?
Here's part of my code:
pregunta_menos <- colSums(!is.na(df))
as.data.table(pregunta_menos,keep.rownames = TRUE)
I need to print ("The Minimum column is:" )

Perhaps:
library(tidyverse)
set.seed(111)
#some dummy data
(df <-
paste0('x', 1:5) %>%
map_dfc(~ tibble('{.x}' := rnorm(5))))
#> # A tibble: 5 × 5
#> x1 x2 x3 x4 x5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.235 0.140 -0.174 -1.57 0.362
#> 2 -0.331 -1.50 -0.407 -0.0859 0.347
#> 3 -0.312 -1.01 1.85 -0.359 0.190
#> 4 -2.30 -0.948 0.394 -1.19 -0.160
#> 5 -0.171 -0.494 0.798 0.364 0.327
min_col <-
df %>%
colSums() %>%
which.min() %>%
names()
paste("The Minimum column is:", min_col)
#> [1] "The Minimum column is: x2"
#Calculate only with numeric columns
min_col_numeric_only <-
df %>%
# keep only numeric columns, then sum each one (works for any number of rows)
select(where(is.numeric)) %>%
colSums() %>%
which.min() %>%
names()
paste("The Minimum column is:", min_col_numeric_only)
#> [1] "The Minimum column is: x2"
Created on 2021-11-23 by the reprex package (v2.0.1)
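For the original question (counting the non-missing values per column and reporting the smallest count only when a condition holds), a minimal base R sketch could look like the following. The data frame, its column names, and the threshold of 3 are made up for illustration; the question used 12347 on its own data.

```r
# Hypothetical survey data: NA marks an unanswered question
df <- data.frame(
  q1 = c(1, NA, 3, 4),
  q2 = c(NA, NA, 3, 4),
  q3 = c(1, 2, 3, 4)
)

# Number of non-missing answers per column
pregunta_menos <- colSums(!is.na(df))

# Column(s) with the fewest answers; ties are all reported
min_count <- min(pregunta_menos)
min_name <- names(pregunta_menos)[pregunta_menos == min_count]

# Illustrative threshold (an assumption, not the value from the question)
threshold <- 3

# Print the message only when the condition holds
if (min_count < threshold) {
  msg <- paste0("The Minimum column is: ", paste(min_name, collapse = ", "),
                " (", min_count, " answers)")
  print(msg)
}
```

Using a plain if () around print() avoids the need for a print.condition() call, which is likely what triggered the error.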

Related

Trying to isolate all columns (representing years) via conditional operator and renaming them in R

library(readr)
d <- read.csv("per_capita.csv")
rc <- d[,-2:-3]
df <- data.frame(rc)
draw <- df$X1994[df$Country.Name == "India"]
format(draw, scientific = F, big.marks = ",")
library(dplyr)
df %>%
filter(Country.Name == "India") %>%
select(names(.)[-1][readr::parse_integer(names(.)[-1] > 1994])
I tried this code and it's giving me an error in the last line. Also, how should I rename these columns in the CSV file without using a data frame?
The column names are: X1994, X1995..... and so on.
Thank You!
If you want to select columns that have numbers greater than a value, you could do this:
library(tidyverse)
#example
set.seed(24)
df <- tibble(country = rep(c("India", "Canada"), each = 3),
X1990 = runif(6),
X1991 = runif(6),
X1992 = runif(6))
df |>
filter(country == "India") |>
select(!!!vars(colnames(df)[-1][which(parse_number(colnames(df)[-1]) > 1990)]))
#> # A tibble: 3 x 2
#> X1991 X1992
#> <dbl> <dbl>
#> 1 0.280 0.672
#> 2 0.764 0.673
#> 3 0.802 0.320
Although that is pretty complicated. It might be better to go long, filter, then go wide:
df |>
filter(country == "India") |>
mutate(id = row_number()) |>
pivot_longer(contains("X")) |>
mutate(name = parse_number(name))|>
filter(name > 1990) |>
pivot_wider(names_from = name, values_from = value)|>
select(-c(id, country))
#> # A tibble: 3 x 2
#> `1991` `1992`
#> <dbl> <dbl>
#> 1 0.280 0.672
#> 2 0.764 0.673
#> 3 0.802 0.320
We can see that this answer is pretty long and cumbersome. Maybe we stick with base R:
cols <- which(as.numeric(sub("^.*?(\\d+).*$", "\\1", colnames(df)[-1])) > 1990) +1
rows <- df$country == "India"
df[rows,cols]
#> # A tibble: 3 x 2
#> X1991 X1992
#> <dbl> <dbl>
#> 1 0.280 0.672
#> 2 0.764 0.673
#> 3 0.802 0.320
Or actually, maybe we can make the tidyverse version cleaner if we just look for strings that have values higher than the target year:
all_years <- 1990:1995
df |>
filter(country == "India") |>
select(contains(paste0("X", all_years[all_years > 1990])))
#> # A tibble: 3 x 2
#> X1991 X1992
#> <dbl> <dbl>
#> 1 0.280 0.672
#> 2 0.764 0.673
#> 3 0.802 0.320
Using the same logic, we can also do a partial string match with base R:
all_years <- 1990:1995
cols <- grepl(paste(all_years[all_years>1990], collapse = "|"), colnames(df))
rows <- df$country == "India"
df[rows,cols]
#> # A tibble: 3 x 2
#> X1991 X1992
#> <dbl> <dbl>
#> 1 0.280 0.672
#> 2 0.764 0.673
#> 3 0.802 0.320
Hopefully one of these helps and strikes your fancy. Lots of options out there for whatever flavor you're in the mood for.
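On the renaming part of the question: the X prefix is added by read.csv itself, because names such as 1994 are not syntactically valid R names. A small sketch, assuming a CSV shaped like the one described (the inline text below is invented sample data), shows two ways to get plain year names:

```r
# Invented sample standing in for per_capita.csv
csv_text <- "Country.Name,1994,1995\nIndia,1.5,2.5\nCanada,3.5,4.5"

# Default behaviour: column names are made syntactic, so years get an X prefix
raw <- read.csv(text = csv_text)

# Option 1: keep the header exactly as written in the file
kept <- read.csv(text = csv_text, check.names = FALSE)

# Option 2: strip the prefix afterwards from names that look like X + four digits
names(raw) <- sub("^X(\\d{4})$", "\\1", names(raw))

# Selecting years after 1994 for one country, using the plain names
years <- suppressWarnings(as.integer(names(kept)))  # NA for non-year columns
india_late <- kept[kept$Country.Name == "India", which(years > 1994), drop = FALSE]
```

Numeric-looking names need backticks or [[ ]] when referenced later (for example kept[["1995"]]), which is the trade-off for dropping the prefix.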

How to pass tibble of variable names and function calls to tibble

I'm trying to go from a tibble of variable names and functions like this:
N <- 100
dat <-
tibble(
variable_name = c("a", "b"),
variable_value = c("rnorm(N)", "rnorm(N)")
)
to a tibble with two variables a and b of length N
dat2 <-
tibble(
a = rnorm(N),
b = rnorm(N)
)
Is there a !!! or rlang-y way to accomplish this?
We can evaluate the string
library(dplyr)
library(purrr)
library(tibble)
deframe(dat) %>%
map_dfc(~ eval(rlang::parse_expr(.x)))
-output
# A tibble: 100 x 2
a b
<dbl> <dbl>
1 0.0750 2.55
2 -1.65 -1.48
3 1.77 -0.627
4 0.766 -0.0411
5 0.832 0.200
6 -1.91 -0.533
7 -0.0208 -0.266
8 -0.409 1.08
9 -1.38 -0.181
10 0.727 0.252
# … with 90 more rows
Here is a base way with a pipe and an as_tibble call.
Map(function(x) eval(str2lang(x)), setNames(dat$variable_value, dat$variable_name)) %>%
as_tibble
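Both answers rely on N being visible where the strings are evaluated. A package-free sketch that passes N explicitly, so the result does not depend on a global variable, might look like this. The spec data frame mirrors dat from the question, with a smaller N and a second distribution chosen only for illustration.

```r
# Same shape as `dat` in the question (values here are illustrative)
spec <- data.frame(
  variable_name  = c("a", "b"),
  variable_value = c("rnorm(N)", "runif(N)"),
  stringsAsFactors = FALSE
)

# Parse each string into a call and evaluate it with N supplied explicitly
build_columns <- function(spec, N) {
  exprs <- setNames(spec$variable_value, spec$variable_name)
  cols <- lapply(exprs, function(txt) eval(str2lang(txt), envir = list(N = N)))
  as.data.frame(cols)
}

set.seed(42)
dat2 <- build_columns(spec, N = 5)
```

Evaluating strings as code runs whatever they contain, so this is only reasonable when the strings come from a trusted source.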

r studio: simulate my code 1000 times and pick the things which p value<0.05

Here is my original code:
x = rbinom(1000,1,0.5)
z = log(1.3)*x
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
k=glm(y~x,family="binomial")$coef
t=exp(k)
How can I simulate it 1000 times and pick the one with a p-value<0.05?
This is a perfect application for the tidyverse and its list columns. Please see the explanation in the inline comments.
library(tidyverse)
library(broom)
# create a tibble with an id column for each simulation and x wrapped in list()
sim <- tibble(id = 1:1000,
x = list(rbinom(1000,1,0.5))) %>%
# to generate z, pr, y, k use map and map2 from the purrr package to loop over the list column x
# `~ ... ` is similar to `function(.x) {...}`
# `.x` represents the variable you are using map on
mutate(z = map(x, ~ log(1.3) * .x),
pr = map(z, ~ 1 / (1 + exp(-.x))),
y = map(pr, ~ rbinom(1000, 1, .x)),
k = map2(x, y, ~ glm(.y ~ .x, family="binomial")),
# use broom::tidy to get the model summary in form of a tibble
sum = map(k, broom::tidy)) %>%
# select id and sum and unnest the tibbles
select(id, sum) %>%
unnest(cols = c(sum)) %>%
# drop the intercepts and keep every .x with a p < 0.05
filter(term !="(Intercept)",
p.value < 0.05)
sim
#> # A tibble: 545 x 6
#> id term estimate std.error statistic p.value
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 .x 0.301 0.127 2.37 0.0176
#> 2 7 .x 0.263 0.127 2.06 0.0392
#> 3 8 .x 0.293 0.127 2.31 0.0211
#> 4 11 .x 0.377 0.128 2.96 0.00312
#> 5 12 .x 0.265 0.127 2.08 0.0373
#> 6 13 .x 0.366 0.127 2.88 0.00403
#> 7 14 .x 0.461 0.128 3.61 0.000305
#> 8 17 .x 0.274 0.127 2.16 0.0309
#> 9 18 .x 0.394 0.127 3.09 0.00200
#> 10 19 .x 0.371 0.127 2.92 0.00354
#> # … with 535 more rows
Created on 2020-05-18 by the reprex package (v0.3.0)
I won't do this for you, but these are the steps you'll probably want to go through:
Write your code as a function that returns the value you're interested in (presumably t)
Use something like replicate to run this function many times and record all the answers
Use something like quantile to extract the percentile you're interested in
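The three steps above can be sketched in base R as follows. The function name simulate_once, the seed, and the chosen quantiles are assumptions made for illustration; the model itself is the one from the question.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Step 1: one simulation, returning the odds ratio for x and its p-value
simulate_once <- function(n = 1000) {
  x  <- rbinom(n, 1, 0.5)
  z  <- log(1.3) * x
  pr <- 1 / (1 + exp(-z))
  y  <- rbinom(n, 1, pr)
  fit <- glm(y ~ x, family = "binomial")
  coefs <- summary(fit)$coefficients
  c(odds_ratio = exp(coefs["x", "Estimate"]),
    p_value    = coefs["x", "Pr(>|z|)"])
}

# Step 2: run it 1000 times; replicate() returns a 2 x 1000 matrix, so transpose
results <- as.data.frame(t(replicate(1000, simulate_once())))

# Keep only the simulations where x is significant at the 5% level
significant <- results[results$p_value < 0.05, ]

# Step 3: summarise the simulated odds ratios
or_quantiles <- quantile(results$odds_ratio, c(0.025, 0.5, 0.975))
```

Unlike the tibble version above, each call here draws a fresh x, so every replicate is an independent simulation of the whole process.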

How to write a function that conducts paired t-tests on all group/variable combinations in a data frame

I have a data frame similar to data created below:
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878),each=5))
Var <- data.frame(Variable=c("Copper","Iron","Lead","Zinc","CaCO"))
n <- 10
Variable <- do.call("rbind",replicate(n,Var,simplify=F))
Location <- rep(c("Alpha","Beta","Gamma"), times=c(20,20,10))
Location <- data.frame(Location)
set.seed(1)
FirstPt<- data.frame(FirstPt=sample(1:100,50,replace=T))
LastPt <- data.frame(LastPt=sample(1:100,50,replace=T))
First3<- data.frame(First3=sample(1:100,50,replace=T))
First5<- data.frame(First5=sample(1:100,50,replace=T))
First7<- data.frame(First7=sample(1:100,50,replace=T))
First10<- data.frame(First10=sample(1:100,50,replace=T))
Last3<- data.frame(Last3=sample(1:100,50,replace=T))
Last5<- data.frame(Last5=sample(1:100,50,replace=T))
Last7<- data.frame(Last7=sample(1:100,50,replace=T))
Last10<- data.frame(Last10=sample(1:100,50,replace=T))
data <- cbind(ID,Location,Variable,FirstPt,LastPt,First3,First5,First7,
First10,Last3,Last5,Last7,Last10)
This may be a two part question, but I want to write a function that groups all Variables that are the same (for instance, all the observations that are Copper) and conducts a paired t test between all possible combinations of the numeric columns (FirstPt:Last10). I want it to return the p values in a data frame like this:
Test P-Value
FirstPt.vs.LastPt …
FirstPt.vs.First3 …
etc. …
This will likely be a second function, but I also want to do this after the observations are grouped by Location so that the output data frame will look like this:
Test P-Value
FirstPt.vs.LastPt.InAlpha
FirstPt.vs.LastPt.InBeta
etc.
You can do both of these with one function:
library(tidyverse)
t.test.by.group.combos <- function(.data, groups){
by <- gsub(x = rlang::quo_get_expr(enquo(groups)), pattern = "\\((.*)?\\)", replacement = "\\1")[-1]
.data %>%
group_by(!!!groups) %>%
select_if(is.integer) %>%
group_split() %>%
map(.,
~pivot_longer(., cols = (FirstPt:Last10), names_to = "name", values_to = "val") %>%
nest(data = val) %>%
full_join(.,.,by = by) %>%
filter(name.x != name.y) %>%
mutate(test = paste(name.x, "vs",name.y, !!!groups, sep = "."),
p.value = map2_dbl(data.x,data.y, ~t.test(unlist(.x), unlist(.y))$p.value)) %>%
select(test,p.value)%>%
filter(!duplicated(p.value))
) %>%
bind_rows()
}
t.test.by.group.combos(data, vars(Variable))
#> # A tibble: 225 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO 0.511
#> 2 FirstPt.vs.First3.CaCO 0.184
#> 3 FirstPt.vs.First5.CaCO 0.494
#> 4 FirstPt.vs.First7.CaCO 0.354
#> 5 FirstPt.vs.First10.CaCO 0.893
#> 6 FirstPt.vs.Last3.CaCO 0.496
#> 7 FirstPt.vs.Last5.CaCO 0.909
#> 8 FirstPt.vs.Last7.CaCO 0.439
#> 9 FirstPt.vs.Last10.CaCO 0.146
#> 10 LastPt.vs.First3.CaCO 0.578
#> # … with 215 more rows
t.test.by.group.combos(data, vars(Variable, Location))
#> # A tibble: 674 x 2
#> test p.value
#> <chr> <dbl>
#> 1 FirstPt.vs.LastPt.CaCO.Alpha 0.850
#> 2 FirstPt.vs.First3.CaCO.Alpha 0.822
#> 3 FirstPt.vs.First5.CaCO.Alpha 0.895
#> 4 FirstPt.vs.First7.CaCO.Alpha 0.810
#> 5 FirstPt.vs.First10.CaCO.Alpha 0.645
#> 6 FirstPt.vs.Last3.CaCO.Alpha 0.870
#> 7 FirstPt.vs.Last5.CaCO.Alpha 0.465
#> 8 FirstPt.vs.Last7.CaCO.Alpha 0.115
#> 9 FirstPt.vs.Last10.CaCO.Alpha 0.474
#> 10 LastPt.vs.First3.CaCO.Alpha 0.991
#> # … with 664 more rows
This is kind of a lengthy function, but in general we group by the groups argument, then we select the groups and any integer columns, then we split the dataframe by the groups. After, we map all the combinations of variables and perform t.tests for each combo. Lastly, we rejoin all the groups into one dataframe.
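Note that t.test() is unpaired unless paired = TRUE is passed. Since the question asks for paired tests, here is a compact base R sketch of the same idea with pairing switched on. The helper name paired_p_values and the toy data are hypothetical, and it assumes that rows within a group line up as pairs (one row per ID).

```r
# Paired t-tests for every pair of value columns within each group
paired_p_values <- function(df, value_cols, group_cols) {
  pairs  <- combn(value_cols, 2, simplify = FALSE)   # all column pairs, no repeats
  groups <- split(df, df[group_cols], drop = TRUE)   # one data frame per group
  out <- lapply(names(groups), function(g) {
    d <- groups[[g]]
    data.frame(
      Test = vapply(pairs, function(p) paste(p[1], "vs", p[2], g, sep = "."),
                    character(1)),
      P.Value = vapply(pairs, function(p)
        t.test(d[[p[1]]], d[[p[2]]], paired = TRUE)$p.value, numeric(1))
    )
  })
  do.call(rbind, out)
}

# Toy data with the same layout as the question, but smaller
set.seed(1)
toy <- data.frame(
  Variable = rep(c("Copper", "Iron"), each = 6),
  FirstPt  = sample(1:100, 12, replace = TRUE),
  LastPt   = sample(1:100, 12, replace = TRUE),
  First3   = sample(1:100, 12, replace = TRUE)
)

res <- paired_p_values(toy, c("FirstPt", "LastPt", "First3"), "Variable")
```

Passing c("Variable", "Location") as group_cols on the full data would give the per-location version, with group labels such as Copper.Alpha appended to each test name.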
I think this is what you want. The key was to use group_by and do from tidyverse.
df <- NULL
for(i in (4:(ncol(data)-1))){
for(j in ((i+1):ncol(data))){
df <- rbind(df,data %>%
group_by(Location) %>%
do(data.frame(pval = t.test(.[[i]],.[[j]], data = .)$p.value)) %>%
ungroup() %>%
mutate(Test = paste0(colnames(data)[i],'.vs.',colnames(data)[j]))
)
}
}
df$Test <- paste0(df$Test,'.In',df$Location)
You can probably achieve what you want using the code below:
library(dplyr)
library(tidyr)
data %>%
pivot_longer(cols = FirstPt:Last10) %>%
group_by(Variable) %>%
summarise(p_value = list(combn(name, 2, function(x)
t.test(value[name == x[1]], value[name == x[2]])$p.value)),
test = list(combn(name, 2, paste, collapse = "_"))) %>%
unnest(cols = c(test, p_value))
# Variable p_value test
# <fct> <dbl> <chr>
# 1 CaCO 0.915 FirstPt_LastPt
# 2 CaCO 0.529 FirstPt_First3
# 3 CaCO 0.337 FirstPt_First5
# 4 CaCO 0.350 FirstPt_First7
# 5 CaCO 0.395 FirstPt_First10
# 6 CaCO 0.765 FirstPt_Last3
# 7 CaCO 0.204 FirstPt_Last5
# 8 CaCO 0.873 FirstPt_Last7
# 9 CaCO 0.479 FirstPt_Last10
#10 CaCO 1 FirstPt_FirstPt
# … with 24,740 more rows
To group by Location as well, add it to the group_by call and keep the rest of the code as it is.

Order data frame by the last column with dplyr

library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(colnames(df) %>% tail(1) %>% desc())
I am looping over a list of data frames. There are different columns in the data frames and the last column of each may have a different name.
I need to arrange every data frame by its last column. The simple case looks like the above code.
Using arrange_at and ncol:
df %>% arrange_at(ncol(.), desc)
As arrange_at has been superseded by across() in recent dplyr versions, you could also use:
# option 1
df %>% arrange(desc(.[ncol(.)]))
# option 2
df %>% arrange(across(ncol(.), desc))
If we need to arrange by the last column name, either use the name string
df %>%
arrange_at(vars(last(names(.))), desc)
Or specify the index
df %>%
arrange_at(ncol(.), desc)
The new dplyr way (I guess from 1.0.0 on) would be using across(last_col()):
library(dplyr)
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
arrange(across(last_col(), desc))
#> # A tibble: 10 x 4
#> a b c d
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.283 0.443 1.30 0.910
#> 2 0.797 -0.0819 -0.936 0.828
#> 3 0.0717 -0.858 -0.355 0.671
#> 4 -1.38 -1.08 -0.472 0.426
#> 5 1.52 1.43 -0.0593 0.249
#> 6 0.827 -1.28 1.86 0.0824
#> 7 -0.448 0.0558 -1.48 -0.143
#> 8 0.377 -0.601 0.238 -0.918
#> 9 0.770 1.93 1.23 -1.43
#> 10 0.0532 -0.0934 -1.14 -2.08
packageVersion("dplyr")
#> [1] ‘1.0.4’
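Since the question mentions looping over a list of data frames whose last columns have different names, a package-free sketch of that loop might look like this. The two sample data frames and the helper name sort_by_last_col are invented for illustration.

```r
# Two hypothetical data frames whose last columns have different names
frames <- list(
  data.frame(a = c(1, 2, 3), score = c(0.2, 0.9, 0.5)),
  data.frame(x = c("p", "q"), y = c(10, 20), total = c(3, 7))
)

# Order a data frame by its last column, largest value first
sort_by_last_col <- function(d) {
  d[order(d[[ncol(d)]], decreasing = TRUE), , drop = FALSE]
}

# Apply the same ordering to every data frame in the list
sorted <- lapply(frames, sort_by_last_col)
```

The column is addressed by position with ncol(d), so its name never has to be known in advance.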
