I have a dataframe like the following:
observations<- data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00KS177011","00P0006","00P006","00P006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29), hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Black","Red","White","White","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to reduce this so that whenever two or more rows share the same X and Y, they are combined into a single row (smallest hzdept, largest hzdepb, Z summed), i.e.
data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00P0006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,0,6), hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Red","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?
Edit: OK, now that I see from your comment above how hzdept and hzdepb are supposed to be combined:
library(tidyverse)
# Sum Z within each X/Y pair
df <- observations %>% count(X, Y, wt = Z, name = "Z")
# Smallest hzdept per X/Y pair
df_hzdept <- observations %>%
  arrange(hzdept) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdept)
# Largest hzdepb per X/Y pair
df_hzdepb <- observations %>%
  arrange(desc(hzdepb)) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdepb)
df <- df %>%
  left_join(df_hzdept, by = c("X", "Y")) %>%
  left_join(df_hzdepb, by = c("X", "Y"))
Using dplyr
Here is how you would group by two columns and summarise the other columns in a dataframe using their minimum, maximum, and sum:
library(magrittr) # For the pipe: %>%
observations %>%
dplyr::group_by(X, Y) %>%
dplyr::summarise(hzdept = min(hzdept),
hzdepb = max(hzdepb),
Z = sum(Z), .groups = 'drop')
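As a quick sanity check of the grouped summary above, here is a minimal sketch on a made-up four-row data frame. The name `toy` and the IDs `p1`/`p2` are invented for illustration; only the column layout follows the question.

```r
library(dplyr)

# Hypothetical toy data: two rows share X = "p1" and Y = "Black"
toy <- data.frame(X = c("p1", "p1", "p2", "p2"),
                  hzdept = c(0, 16, 0, 20),
                  hzdepb = c(16, 30, 20, 30),
                  Y = c("Black", "Black", "Red", "Blue"),
                  Z = c(0.53, 0.47, 0.67, 0.33))

# One row per X/Y pair: shallowest top, deepest bottom, summed Z
combined <- toy %>%
  group_by(X, Y) %>%
  summarise(hzdept = min(hzdept),
            hzdepb = max(hzdepb),
            Z = sum(Z), .groups = "drop")

print(combined)
```

The two "Black" rows for `p1` should collapse into a single row spanning 0 to 30, while the `p2` rows stay separate because their Y values differ.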
I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
[image: data snapshot]
So for student 3, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, this works only when a single NA needs replacing (image: successfully replaced systolic NA values). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions/comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, though it may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                               size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
Next, a function that replaces the NA values of a given variable by sampling between that variable's min and max within each group (here make):
replace_na_sample <- function(df_miss, var, group = "make") {
  var <- enquo(var)
  df_miss %>%
    group_by(.dots = group) %>%
    # One candidate value per row, drawn between the group's observed min and max
    mutate(replace_var := round(runif(n(), min(!!var, na.rm = TRUE),
                                      max(!!var, na.rm = TRUE)), 0)) %>%
    rowwise %>%
    # Fill each NA with the candidate drawn for its row
    mutate_at(.vars = vars(!!var),
              .funs = funs(replace_na(., replace_var))) %>%
    select(-replace_var) %>%
    ungroup
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
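For what it's worth, the error in the question comes from `replace_na()` on a data frame expecting a single replacement value per column, so a length-2 vector is rejected. Below is a shorter sketch of the same per-group idea done in one grouped `mutate()`, assuming dplyr 1.0.0 or later for `across()`. The data frame `exercise_toy` and the helper `fill_na_uniform` are made-up names standing in for the real `exercise` data, and the sketch assumes every group has at least one non-missing value per column.

```r
library(dplyr)

# Hypothetical stand-in for the `exercise` data frame from the question
exercise_toy <- data.frame(
  student_id = c(1, 1, 1, 3, 3, 3, 3),
  systolic   = c(120, NA, 124, 125, NA, NA, 130),
  heart_rate = c(70, 72, NA, 79, 86, NA, 80)
)

# Replace each NA with a rounded uniform draw between the observed min and max.
# Assumes x has at least one non-missing value.
fill_na_uniform <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  missing <- is.na(x)
  x[missing] <- round(runif(sum(missing), lo, hi), 0)
  x
}

set.seed(1)
filled <- exercise_toy %>%
  group_by(student_id) %>%
  mutate(across(c(systolic, heart_rate), fill_na_uniform)) %>%
  ungroup()

print(filled)
```

Because the helper draws one value per missing entry, any number of NAs within a student's rows can be filled in a single pass.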
I have a large dataframe and want to standardise multiple columns while conditioning the mean and the standard deviation on values. Say I have the following example data:
set.seed(123)
df = data.frame("sample" = c(rep(1:2, each = 5)),
"status" = c(0,1),
"s1" = runif(10, -1, 1),
"s2" = runif(10, -5, 5),
"s3" = runif(10, -25, 25))
and want to standardise each of s1-s3 using the mean and standard deviation of the status==0 rows only. If I had to do this for, say, s1 only, I could do the following:
df = df %>% group_by(sample) %>%
mutate(sd_s1 = (s1 - mean(s1[status==0])) / sd(s1[status==0]))
But my problem arises when I have to perform this operation on multiple columns. I tried writing a function to include with mutate_at:
standardize <- function(x) {
return((x - mean(x[status==0]))/sd(x[status==0]))
}
df = df %>% group_by(sample) %>%
mutate_at(vars(s1:s3), standardize)
This just produces NA values for s1-s3.
I have tried to use the answer provided in:
R - dplyr - mutate - use dynamic variable names, but cannot figure out how to do the subsetting.
Any help is greatly appreciated. Thanks!
We could just use
df %>%
group_by(sample) %>%
mutate_at(vars(s1:s3), funs((.- mean(.[status == 0]))/sd(.[status == 0])))
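A likely reason the standalone `standardize()` function fails is that `status` inside it is looked up where the function was defined (the global environment), not in the grouped data. Passing the reference rows in as an argument avoids that. Since `funs()` has been deprecated since dplyr 0.8.0, here is a sketch of the same computation with `across()`, assuming dplyr 1.0.0 or later. The helper name `standardise_by` is made up for illustration.

```r
library(dplyr)

set.seed(123)
df <- data.frame(sample = rep(1:2, each = 5),
                 status = c(0, 1),
                 s1 = runif(10, -1, 1),
                 s2 = runif(10, -5, 5),
                 s3 = runif(10, -25, 25))

# Centre and scale x using only the rows flagged by the logical vector `ref`
standardise_by <- function(x, ref) {
  (x - mean(x[ref])) / sd(x[ref])
}

df_std <- df %>%
  group_by(sample) %>%
  # `status` is resolved inside the current group by the data mask
  mutate(across(s1:s3, ~ standardise_by(.x, status == 0))) %>%
  ungroup()

print(df_std)
```

Within each sample, the status == 0 rows of the standardised columns should then have mean 0 and standard deviation 1.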
I'm experimenting with dplyr, tidyr and purrr. I have data like this:
library(tidyverse)
set.seed(123)
df <- tibble(X1 = rep(LETTERS[1:4], 6),   # tibble() supersedes the deprecated data_frame()
             X2 = sort(rep(1:6, 4)),
             ref = sample(1:50, 24),
             sampl1 = sample(1:50, 24),
             var2 = sample(1:50, 24),
             meas3 = sample(1:50, 24))
Now dplyr is awesome because I can do things like mutate_at() to manipulate multiple columns at once. e.g:
df <- df %>%
mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>%
mutate_at(vars(contains("first")), funs(second = . *2 ))
and tidyr allows me to nest subsets of the data as sub-tables in a single column:
df <- df %>% nest(-X1)
and thanks to purrr I can summarize these sub-tables while retaining the original data in the nested column:
df %>% mutate(mean = map_dbl(data, ~ mean(.x$meas3_first_second)))
How can I use purrr and mutate_at() to generate multiple summary columns (take the means of different (but not all) columns in each nested sub-table)?
In this example I'd like to take the mean of every column with the word "second" in it. I had hoped that this might produce a new nested column which I could then unnest(), but it does not work.
df %>% mutate(mean = map(data, ~ mutate_at(vars(contains("second")),
funs(mean_comp_exp = mean(.)))))
How can I achieve this?
The comment by @aosmith was correct and helpful. In addition, I realised I needed to use summarise_at() and not mutate_at(), like so:
df %>%
mutate(mean = map(data, ~ summarise_at(.x, vars(contains("second")),
funs(mean_comp_exp = mean(.) )))) %>%
unnest(mean)
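The same nested summary can be sketched with `across()`, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later (for the `nest(data = ...)` spelling). The output column suffix `_mean` and the result name `res` are choices made for this illustration only.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
df <- tibble(X1 = rep(LETTERS[1:4], 6),
             X2 = sort(rep(1:6, 4)),
             ref = sample(1:50, 24),
             sampl1 = sample(1:50, 24),
             var2 = sample(1:50, 24),
             meas3 = sample(1:50, 24))

res <- df %>%
  # Adds sampl1_first, var2_first, meas3_first
  mutate(across(-c(X1, X2, ref), list(first = ~ .x - ref))) %>%
  # Adds sampl1_first_second, var2_first_second, meas3_first_second
  mutate(across(contains("first"), list(second = ~ .x * 2))) %>%
  nest(data = -X1) %>%
  # One-row summary per nested table, then spread into regular columns
  mutate(means = map(data, ~ summarise(.x, across(contains("second"), mean,
                                                  .names = "{.col}_mean")))) %>%
  unnest(means)

print(res)
```

The nested `data` column is kept alongside the new summary columns, so the original rows remain available for later steps.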
I am having trouble figuring out how to perform a chisq.test within a nested list column of a data frame. If I need to turn the data list-column into a matrix, how do I do that, and then how do I properly refer to the variables for the chisq.test? Take the example below. Thank you!
Here is an example:
a <- rep(c('A', 'B'), 10)
b <- rep(c('a', 'b'), each = 10)
c <- as.numeric(rep(c(1:10), each = 2))
df <- as.data.frame(cbind(a, b, c)) %>%
mutate(c = as.numeric(c))
Is the distribution the same between the levels of factor 'b' ('a' and 'b'), using 'c' as counts, within each subgroup of factor 'a' ('A' and 'B')?
dfnest <- df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, ~chisq.test(.$b~.$c)$p.value))
The last line is what I want to accomplish, but the above is incorrect - how do I use the chisq.test within the list-column data, and insert the p.value into a new column?
Changing the arguments in the call of chisq.test returns the expected result.
df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, ~chisq.test(.)$p.value))
You can also use an anonymous function.
df %>%
nest(-a) %>%
mutate(chisq_p = map_dbl(data, function(f) { chisq.test(f)$p.value }))
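If `c` really holds counts, one possible reading of the question is a goodness-of-fit test on the total count per level of `b` within each level of `a`. That interpretation is an assumption, not something the question confirms. Under it, a sketch could build the table with `xtabs()` before calling `chisq.test()`; it uses `nest(data = -a)`, the spelling introduced in tidyr 1.0.0.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(a = rep(c("A", "B"), 10),
                 b = rep(c("a", "b"), each = 10),
                 c = as.numeric(rep(1:10, each = 2)))

dfnest <- df %>%
  nest(data = -a) %>%
  # xtabs(c ~ b) sums the counts in c for each level of b;
  # chisq.test() on that one-way table tests for equal proportions
  mutate(chisq_p = map_dbl(data, ~ chisq.test(xtabs(c ~ b, data = .x))$p.value))

print(dfnest)
```

If the intended test is instead an association between two categorical variables, the table passed to `chisq.test()` would need to be two-way, for example from `table()` on both columns.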