Is there a way to build a single table or data frame that combines several tables?
table(df$col1)
table(df$col1,df$col2<0)
table(df$col1,df$col3>0)
table(df$col1,df$col4>0)
In the example above, I group my dataset by the values in df$col1 and count the records that satisfy the condition df$col2<0. What I get is a TRUE/FALSE matrix with the number of records that do and do not fulfil the condition. I want a combined table that still groups the data by df$col1 and shows the TRUE counts for df$col2<0, df$col3>0 and df$col4>0 in the same table.
Based on the description, we could use cbind:
r1 <- cbind(table(df$col1), table(df$col1, df$col2 < 0)[, 2],
            table(df$col1, df$col3 > 0)[, 2], table(df$col1, df$col4 > 0)[, 2])
If there are many columns, this can be done by looping
r2 <- do.call(cbind, c(list(col1 = table(df$col1)),
                       Map(function(x, y) table(df$col1, get(y)(x, 0))[, 2],
                           df[-1], c("<", ">", ">"))))
all.equal(r1, r2, check.attributes = FALSE)
#[1] TRUE
We can also do this with group by operations.
library(dplyr)
df %>%
  mutate(col2 = col2 < 0) %>%
  mutate_at(3:4, funs(. > 0)) %>%
  group_by(col1) %>%
  mutate(n = n()) %>%
  group_by(n, add = TRUE) %>%
  summarise_all(sum)
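A more direct variant of the grouped approach, as a sketch: because sum() of a logical vector counts its TRUE values, each condition can be counted inside a single summarise() call. funs() and the add argument of group_by() are deprecated in recent dplyr releases, so this version avoids both. It rebuilds the example data from this answer so that it runs on its own.

```r
library(dplyr)

# Same example data as in this answer
set.seed(24)
df <- as.data.frame(matrix(sample(-2:5, 10 * 4, replace = TRUE), ncol = 4))
names(df) <- paste0("col", 1:4)

# One row per value of col1: group size plus the number of rows meeting each condition
res <- df %>%
  group_by(col1) %>%
  summarise(
    n    = n(),
    col2 = sum(col2 < 0),
    col3 = sum(col3 > 0),
    col4 = sum(col4 > 0)
  )
res
```

Unlike the table()-based version, this does not fail when a condition is never TRUE, since the count is simply 0.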
data
set.seed(24)
df <- as.data.frame(matrix(sample(-2:5, 10*4, replace = TRUE), ncol=4))
names(df) <- paste0("col", 1:4)
Suppose I have a data frame with a bunch of columns where I want to do the same NA replacement:
dd <- data.frame(x = c(NA, LETTERS[1:4]), a = rep(NA_real_, 5), b = c(1:4, NA))
For example, in the data frame above I'd like to do something like replace_na(dd, where(is.numeric), 0) to replace the NA values in columns a and b.
I could do
num_cols <- purrr::map_lgl(dd, is.numeric)
r <- as.list(setNames(rep(0, sum(num_cols)), names(dd)[num_cols]))
replace_na(dd, r)
but I'm looking for something tidier/more idiomatic/nicer ...
If we need to do the replacement dynamically with where(is.numeric), we can wrap it in across
library(dplyr)
library(tidyr)
dd %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))
Or we can specify the replace as a list of key/value pairs
replace_na(dd, list(a = 0, b = 0))
This list can be built programmatically: select the numeric columns, reduce them to a single row of zeros with summarise (or build name/value pairs and convert them with deframe), and pass the result to replace_na
library(tibble)
dd %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), ~ 0)) %>%
  as.list() %>% # replace_na() expects a list; newer tidyr versions may reject a data frame
  replace_na(dd, .)
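The deframe route mentioned above could look like the following sketch. The names num_names, repl and filled are illustrative only; the data frame is the one from the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

dd <- data.frame(x = c(NA, LETTERS[1:4]), a = rep(NA_real_, 5), b = c(1:4, NA))

# Names of the numeric columns
num_names <- names(select(dd, where(is.numeric)))

# Two-column tibble (name, value) -> named vector -> named list of replacements
repl <- as.list(deframe(tibble(name = num_names, value = 0)))

filled <- replace_na(dd, repl)
filled
```

Column x is left untouched because it is not part of the replacement list.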
My data frame has 21 columns, but only one is relevant to this problem:
I want to replace the values 2, 3, 4 and 5 in column a with the value 1 (in the same column).
Instead of repeating the code below for each of the values 2, 3, 4 and 5, I'm looking for something more elegant:
df <- df %>% mutate (a = replace(a, a == 2,1))
df <- df %>% mutate (a = replace(a, a == 3,1))
df <- df %>% mutate (a = replace(a, a == 4,1))
df <- df %>% mutate (a = replace(a, a == 5,1))
So I'm just stuck on the "or" condition I need to create inside the code...
Any solution?
You can replace values in multiple columns using across, and match multiple values with %in%. For example, to replace values in columns a, b, c and d, you can do:
library(dplyr)
df <- df %>% mutate(across(a:d, ~replace(., . %in% 2:5, 1)))
#For dplyr < 1.0.0 use `mutate_at`
#df <- df %>% mutate_at(vars(a:d), ~replace(., . %in% 2:5, 1))
In base R, you can do this with lapply:
cols <- c('a','b','c','d')
df[cols] <- lapply(df[cols], function(x) replace(x, x %in% 2:5, 1))
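For the single-column case in the question, the same %in% test replaces the chain of "or" conditions. A small sketch on made-up values:

```r
# Toy data (made-up values); only column a is modified
df <- data.frame(a = c(1, 2, 3, 6), b = c(5, 0, 4, 9))

# x %in% 2:5 is TRUE wherever x equals 2, 3, 4 or 5
df$a <- replace(df$a, df$a %in% 2:5, 1)
df
#   a b
# 1 1 5
# 2 1 0
# 3 1 4
# 4 6 9
```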
I'm trying to filter a specific column across a list of data frames. Typically, for a single data frame using dplyr, I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
data.frame(c = 11:20, d = 21:30),
data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and carry out. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful. Any help is appreciated.
You can use filter_at to refer to a column by position.
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column and use it
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be:
lapply(x, function(y) y[y[[1]] > 7, ])
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
data.frame(a = 11:20, b = 21:30),
data.frame(a = 15:25, b = 35:45))
Output
[[1]]
a
1 8
[[2]]
a
1 11
[[3]]
a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
filter(across(names(.)[1], ~ .> 7)))
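dplyr 1.0.4 and later provide if_all() and if_any() for using across-style column selection inside filter(). A sketch of the same positional filter, assuming such a version is installed and using the list from the question:

```r
library(dplyr)

x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))

# Keep the rows whose first column is greater than 7, in every data frame of the list
res <- lapply(x, function(d) filter(d, if_all(1, ~ .x > 7)))

# Number of rows kept per data frame
sapply(res, nrow)
# [1]  3 10 11
```

With a single selected column, if_all() and if_any() give the same result; they differ only when several columns are selected.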
My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic on this, but I couldn't find a viable solution for my problem.
P.S. My real dataset has about 35000 rows (so it may need a different solution than one that works for only 4 rows).
After selecting the columns of interest, use pmap to apply t.test to each row, taking the first three and the next three observations as the two samples, and bind the extracted p-value to the original data as a new column
library(tidyverse)
df %>%
  select(AC.1:AM.3) %>%
  pmap_dbl(~ c(...) %>%
             {t.test(.[1:3], .[4:6])$p.value}) %>%
  bind_cols(df, pval_AC_AM = .)
Or, after selecting the columns, use gather to convert to 'long' format, separate the group label from the replicate number, spread the groups back into columns, apply t.test in summarise, and join with the original data
df %>%
  select(compound, AC.1:AM.3) %>%
  gather(key, val, -compound) %>%
  separate(key, into = c('key1', 'key2')) %>%
  spread(key1, val) %>%
  group_by(compound) %>%
  summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
  right_join(df)
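gather() and spread() have been superseded by pivot_longer() and pivot_wider() since tidyr 1.0.0. A sketch of the long-format approach with pivot_longer(), assuming that version or later; the column names grp and replicate are illustrative, and the data are random values in the shape shown in the question:

```r
library(dplyr)
library(tidyr)

# Example data in the shape shown in the question (random values)
set.seed(1)
m <- matrix(rnorm(12 * 4), ncol = 12)
colnames(m) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3",
                 "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), m)

# Split names such as "AC.1" into a group label and a replicate number,
# then run one Welch t-test per compound
pvals <- df %>%
  select(compound, AC.1:AM.3) %>%
  pivot_longer(-compound, names_to = c("grp", "replicate"), names_sep = "\\.") %>%
  group_by(compound) %>%
  summarise(pval_AC_AM = t.test(value[grp == "AC"], value[grp == "AM"])$p.value)

left_join(df, pvals, by = "compound")
```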
Update
If a row contains only a single unique value, t.test throws an error. One option is to catch that error and return NA for those cases. This can be done with purrr's possibly
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA)
df %>%
  select(AC.1:AM.3) %>%
  pmap_dbl(~ c(...) %>%
             {posttest(.[1:3], .[4:6])}) %>%
  bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
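The same guard can be written without purrr using base R's tryCatch(). A minimal sketch; safe_pvalue is a hypothetical name, not part of the answer above:

```r
# Return the t-test p-value, or NA if t.test() signals an error
# (for example when the data are essentially constant)
safe_pvalue <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

constant_case <- safe_pvalue(rep(3, 5), rep(1, 5))     # constant data, so NA
regular_case  <- safe_pvalue(c(1, 2, 3), c(2, 4, 7))   # an ordinary p-value
```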
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438
I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
data snapshot
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However it works only when one NA needs replacing: successfully replaced systolic NA values. When I try to replace more than one NAs, this error comes up.
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables from vectors to data frames, but that only returned the original data without any replacements.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, but may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                               size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
A function to replace the NA values of a given variable by sampling between its min and max within each group (here make).
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(.dots = group) %>%
    mutate(replace_var := round(runif(n(), min(!!var, na.rm = TRUE),
                                      max(!!var, na.rm = TRUE)), 0)) %>%
    rowwise %>%
    mutate_at(.vars = vars(!!var),
              .funs = funs(replace_na(., replace_var))) %>%
    select(-replace_var) %>%
    ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
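The function above relies on funs() and group_by(.dots = ), both deprecated in recent dplyr releases. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later: fill_na_within_range is a hypothetical helper, and the exercise data below are made-up values that only borrow column names from the question.

```r
library(dplyr)

# Hypothetical helper: fill NAs in `cols` with rounded uniform draws between
# each group's observed min and max. Groups with no observed value stay NA.
fill_na_within_range <- function(data, cols, group) {
  data %>%
    group_by(across(all_of(group))) %>%
    mutate(across({{ cols }}, function(v) {
      if (all(is.na(v))) return(v)
      rng <- range(v, na.rm = TRUE)
      draws <- round(runif(length(v), rng[1], rng[2]))
      # Only the NA positions receive a drawn value
      replace(v, is.na(v), draws[is.na(v)])
    })) %>%
    ungroup()
}

# Toy data (made-up values) using column names from the question
set.seed(42)
exercise <- data.frame(
  student_id = c(1, 1, 1, 3, 3, 3, 3),
  systolic   = c(120, NA, 124, 125, NA, NA, 130),
  heart_rate = c(70, 72, NA, 79, 86, NA, 80)
)

filled <- fill_na_within_range(exercise, c(systolic, heart_rate), group = "student_id")
filled
```

Because each NA gets its own draw inside mutate(), the "length 2, not length 1" error from replace_na() does not arise, and several columns can be handled in one call.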