I have a dataframe like the following:
observations<- data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00KS177011","00P0006","00P006","00P006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,16,0,6,13,29), hzdepb = c(20,30,15,30,13,30,20,30,16,30,6,13,29,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Black","Red","White","White","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,0.53,0.47,0.2,0.23,0.53,0.04))
I want to reduce this so that whenever two rows share the same X and Y, their observations are combined, i.e.:
data.frame(X=c("00KS089001","00KS089001","00KS089002","00KS089002","00KS089003","00KS089003","00KS105001","00KS105001", "00KS177011","00P0006","00P006"), hzdept = c(0,20,0,15,0,13,0,20,0,0,6), hzdepb = c(20,30,15,30,13,30,20,30,30,6,30),Y=c("Red","White","Red","White","Green","Red","Red","Blue", "Black","Red","White"), Z = c(0.67,0.33,0.5,0.5,0.43,0.57,0.67,0.33,1.00,0.20,0.80))
Any suggestions on how to best go about this?
Edit: OK, now that I see from your comment above how hzdept and hzdepb are supposed to be combined:
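library(tidyverse)
# Sum Z within each X/Y group
df <- observations %>% count(X, Y, wt = Z, name = "Z")
# Smallest hzdept per X/Y group
df_hzdept <- observations %>%
  arrange(hzdept) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdept)
# Largest hzdepb per X/Y group
df_hzdepb <- observations %>%
  arrange(desc(hzdepb)) %>%
  distinct(X, Y, .keep_all = TRUE) %>%
  select(X, Y, hzdepb)
# Join on the grouping keys explicitly
df <- df %>%
  left_join(df_hzdept, by = c("X", "Y")) %>%
  left_join(df_hzdepb, by = c("X", "Y"))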
Using dplyr
Here is how you would group by two columns and summarise the remaining columns using their minimum, maximum, and sum:
library(magrittr) # For the pipe: %>%
observations %>%
  dplyr::group_by(X, Y) %>%
  dplyr::summarise(hzdept = min(hzdept),
                   hzdepb = max(hzdepb),
                   Z = sum(Z), .groups = 'drop')
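For comparison, the same grouping logic can be sketched without any packages. This is only an illustration under assumptions: `toy` is a hypothetical four-row subset of the question's data, and `combined` is a name introduced here.

```r
# Hypothetical four-row subset of the question's data
toy <- data.frame(
  X      = c("00KS177011", "00KS177011", "00KS105001", "00KS105001"),
  hzdept = c(0, 16, 0, 20),
  hzdepb = c(16, 30, 20, 30),
  Y      = c("Black", "Black", "Red", "Blue"),
  Z      = c(0.53, 0.47, 0.67, 0.33)
)

# Split into one data frame per observed X/Y combination
groups <- split(toy, list(toy$X, toy$Y), drop = TRUE)

# Collapse each group to a single row: min top depth, max bottom depth, summed Z
combined <- do.call(rbind, lapply(groups, function(g) {
  data.frame(X = g$X[1], Y = g$Y[1],
             hzdept = min(g$hzdept), hzdepb = max(g$hzdepb), Z = sum(g$Z))
}))
rownames(combined) <- NULL
combined
```

The two "Black" rows collapse into one row spanning depths 0 to 30, while the "Red" and "Blue" rows are left as they were.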
I have a dataset that is organized by groups (site) and has baseline observations (trt == 0) and observations collected from a modified environment (trt == 1, although it's not experimental data which is why I'm doing this). For the trt == 1 observations, I would like to calculate the quantile of each observation within the baseline ecdf for that group (i.e. site). My instinct was to use map2_dbl() but the ecdf to compare to is within the list-column itself, not external to the data. I'm struggling to get the correct syntax (in the R tidyverse).
df <- tibble(site = rep(letters[1:4], length.out = 2000),
             trt = rep(c(0, 1), each = 1000),
             value = c(rnorm(n = 1000), rnorm(n = 1000, mean = 0.1)))
# calculate ecdf for baseline:
baseline <- df %>%
  filter(trt == 0) %>%
  group_by(site) %>%
  summarize(ecdf0 = list(ecdf(value)))
# compare each trt = 1 observation to ecdf for that site:
trtQuantile <- df %>%
  filter(trt == 1) %>%
  inner_join(baseline, by = "site")
# the next line is where I'm struggling to get the correct map syntax
head(trtQuantile)
# for the first row I am aiming for the result given by:
trtQuantile$ecdf0[[1]](trtQuantile$value[[1]])
Any advice from the purrr masters is appreciated! Thanks.
You can use map2_dbl:
library(dplyr)
library(purrr)
trtQuantile %>% mutate(out = map2_dbl(ecdf0, value, ~.x(.y)))
Or mapply in base R:
trtQuantile$out <- mapply(function(x, y) x(y), trtQuantile$ecdf0, trtQuantile$value)
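To see what the pairwise call is doing, here is a small deterministic sketch in base R. The names `baseline_values`, `ecdfs`, and `trt` are made up for this illustration, and the baseline values are chosen so the results can be checked by hand.

```r
# Hypothetical baseline values for two sites
baseline_values <- list(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# One ecdf function per site, stored in a named list
ecdfs <- lapply(baseline_values, ecdf)

# Hypothetical treated observations
trt <- data.frame(site = c("a", "b", "a"), value = c(2.5, 40, 0))

# Pair each observation with the ecdf of its own site and evaluate it
site_ecdfs <- ecdfs[as.character(trt$site)]
trt$quantile <- unname(mapply(function(f, v) f(v), site_ecdfs, trt$value))
trt$quantile
# 0.5 1.0 0.0
```

Each element of the function list is called with the matching element of the value vector, which is the same thing `~.x(.y)` does inside map2_dbl.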
I'm trying to replace the NAs in multiple column variables with randomly generated values from each student_id's subset row data:
So for student 3, systolic has two NAs that need replacing. I used the min and max values of each variable within the student 3 subset to generate random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(systolic = round(sample(runif(1000, 125, 130), 2), 0),
                  diastolic = round(sample(runif(1000, 85, 85), 3), 0),
                  heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
                  phys_score = round(sample(runif(1000, 8, 9), 2), 0)))
However, it works only when a single NA needs replacing (in that case the systolic NA value was replaced successfully). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions or comments would be appreciated. Thanks.
A solution that makes things a little more automated but may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
  as_tibble(rownames = "car") %>%
  select(car) %>%
  separate(car, c("make", "name"), " ") %>%
  bind_cols(mtcars[, -1] %>%
              map_df(~ .[sample(c(TRUE, NA), prob = c(0.8, 0.2),
                                size = length(.), replace = TRUE)])) %>%
  filter(make %in% c("Mazda", "Hornet", "Merc"))
A function that replaces the NA values of a given variable by sampling between that variable's minimum and maximum within each group (here, make):
replace_na_sample <- function(df_miss, var, group = "make") {
  df_miss %>%
    # across(all_of()) replaces the deprecated `.dots` argument (needs dplyr >= 1.0)
    group_by(across(all_of(group))) %>%
    # Draw one candidate per row between the group's min and max,
    # and use it only where the original value is NA
    mutate({{ var }} := ifelse(
      is.na({{ var }}),
      round(runif(n(), min({{ var }}, na.rm = TRUE),
                  max({{ var }}, na.rm = TRUE)), 0),
      {{ var }}
    )) %>%
    ungroup()
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
  replace_na_sample(cyl, group = "make") %>%
  replace_na_sample(disp, group = "make") %>%
  replace_na_sample(hp, group = "make")
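For a single column, the core idea can also be sketched in base R. replace_na expects one replacement value per column, which is why a length-2 replacement triggers the error in the question. Logical indexing instead accepts one value per missing entry. The vector `systolic` and its values are hypothetical.

```r
set.seed(42)

# Hypothetical readings for one student, with two missing values
systolic <- c(125, NA, 130, NA, 127)

# Locate the missing entries
missing <- is.na(systolic)

# Draw one rounded value per missing entry from the observed range
filled <- systolic
filled[missing] <- round(runif(sum(missing),
                               min(systolic, na.rm = TRUE),
                               max(systolic, na.rm = TRUE)))
filled
```

Because `runif` is asked for exactly `sum(missing)` draws, the replacement always has the right length, however many NAs the column contains.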
Working with dplyr, I am trying to match a row with n other rows on a variable, so I can feed the matching set to summarise(). I've only succeeded with a loop so far. Example data:
dfraw <- data.frame( id = c(1:20), age = c(30:35, 32:37, 34:41) )
set.seed(1)
df <- dfraw %>%
  mutate( var = age + runif(20) - 0.5 ) %>%
  arrange( age )
To calculate a z-score of var from the five closest matches on age, I can do:
for ( i in 1:nrow(df) ) {
  df$windowedz[i] <- df %>%
    arrange( abs( df$age[i] - age) ) %>%
    head(n=6) %>% tail(n=5) %>% # 5 closest matches excluding row `i`
    summarise( (df$var[i] - mean(var) ) / sd(var) ) %>%
    as.numeric
}
Is there a more elegant way to achieve this? If I use group_by, I can't seem to generate a matching group from the individual variable (df$age[i] in the example).
Edit: Minor changes for clarification, arrange as part of the example data definition, modified loop to insert a scalar rather than a list in column windowedz
Edit: With the package RcppRoll I was partially successful:
library(RcppRoll)
df <- df %>%
  mutate(
    mean = roll_mean( var, n = 5, fill = NA ),
    sd = roll_sd( var, n = 5, fill = NA ),
    roll_z = (var - mean) / sd
  )
The issue with this solution is that the window contains the value that is to be transformed. So there is no equivalent to the head-tail manoeuvre that removes the matched row from the matching set. Also, this approach gives strange results if I calculate roll_z directly instead of calculating mean and sd first.
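One possible way to drop the explicit loop while still excluding the matched row is sketched below in base R. `windowed_z` is a hypothetical helper name, not something from dplyr or RcppRoll. Unlike the head/tail step, which assumes row `i` sorts first, this version removes row `i` by index, so the two can differ when several rows share the same age.

```r
# Hypothetical helper: z-score of var[i] against its k nearest neighbours on age
windowed_z <- function(age, var, k = 5) {
  vapply(seq_along(age), function(i) {
    # Every row except the one being scored
    others <- setdiff(seq_along(age), i)
    # The k rows closest in age (ties keep their original order)
    nearest <- others[order(abs(age[others] - age[i]))][seq_len(k)]
    (var[i] - mean(var[nearest])) / sd(var[nearest])
  }, numeric(1))
}

# Small deterministic check: seven rows with var equal to age
age <- 1:7
var <- as.numeric(1:7)
z <- windowed_z(age, var)
z[4]
```

The function returns a plain numeric vector, so it could be used inside `mutate()` as `windowedz = windowed_z(age, var)`.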
I am having trouble figuring out how to perform a chisq.test within a nested list column of a data frame. If I need to turn the data list-column into a matrix, how do I do that, and then how do I properly refer to the variables for the chisq.test? Take the example below. Thank you!
Here is an example:
a <- rep(c('A', 'B'), 10)
b <- rep(c('a', 'b'), each = 10)
c <- as.numeric(rep(c(1:10), each = 2))
df <- as.data.frame(cbind(a, b, c)) %>%
mutate(c = as.numeric(c))
Is the distribution of the 'c' counts the same across the levels of factor 'b' ('a' and 'b') within each subgroup of factor 'a' ('A' and 'B')?
dfnest <- df %>%
  nest(-a) %>%
  mutate(chisq_p = map_dbl(data, ~chisq.test(.$b~.$c)$p.value))
The last line is what I want to accomplish, but the above is incorrect - how do I use the chisq.test within the list-column data, and insert the p.value into a new column?
Changing the arguments in the call of chisq.test returns the expected result. chisq.test has no formula interface, so pass the two columns as separate arguments (x and y) instead:
df %>%
  nest(data = -a) %>%
  mutate(chisq_p = map_dbl(data, ~ chisq.test(.x$b, .x$c)$p.value))
You can also use an anonymous function:
df %>%
  nest(data = -a) %>%
  mutate(chisq_p = map_dbl(data, function(f) { chisq.test(f$b, f$c)$p.value }))
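The same per-group test can be sketched in base R, assuming the goal is a test of independence between b and c within each level of a. The data frame is rebuilt here so the snippet stands alone, and `p_values` is a name introduced for the illustration. The warning about the chi-squared approximation is suppressed because every cell count in this toy data is 0 or 1.

```r
# Rebuild the example data with c kept numeric from the start
df <- data.frame(a = rep(c("A", "B"), 10),
                 b = rep(c("a", "b"), each = 10),
                 c = rep(1:10, each = 2))

# One chi-squared test of b against c per level of a
p_values <- sapply(split(df, df$a), function(g) {
  suppressWarnings(chisq.test(table(g$b, g$c))$p.value)
})
p_values
```

`split()` plays the role of `nest()` here, and `table(g$b, g$c)` builds the contingency table that chisq.test otherwise constructs internally when given two vectors.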