I am working with a large data frame of clinical information. It includes info on several cell types, cell counts, and treatment response for a large group of patients. I want to plot each individual cell type into a box plot comparing cell counts for positive vs negative response:
[this is an example of what I want][1]
[1]: https://i.stack.imgur.com/JHvda.png
Because of the size of the data frame, I want to filter the data so that I only create box plots that show significance.
I decided that the best way to do this was to create a column for p-values and use the p-values assigned to filter the data frame.
I tried the following:
# zb is the original data frame
zb.pval <- zb %>%
group_by(cell_type) %>%
mutate(res.ct.pval = kruskal.test(response ~ count)$p.value)
However, I get an error:
Error: Problem with mutate() input res.ct.pval.
x all observations are in the same group
i Input res.ct.pval is kruskal.test(response ~ count)$p.value.
i The error occured in group 30: cell_type = "ACTIVATION HLA DR FITC- CD69 PE- CD19 ECD- CD56 PC5.5- CD16 PC7+ CD134 APC+ CD4 700+ CD3 750- CD8 750+".
I decided to try a for loop:
pval.list <- list()
for(ct in pval.list){
tmp <- zb %>%
filter(cell_type == ct)
pval.list[[ct]] <- kruskal.test(tmp$response ~ tmp$count)$p.value
}
pval.df <- pval.list %>%
bind_rows() %>%
rownames_to_column(var = "cell_type")
But I just end up with an empty data frame. I do not have much experience with R and am wondering if I am making any obvious mistakes. Otherwise, I am curious whether anyone has a better way of doing this.
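Two things stand out in the attempts above. kruskal.test() expects the formula as measurement ~ group, so count ~ response rather than response ~ count, and the for loop iterates over pval.list, which is still empty at that point, so its body never runs. Below is a minimal sketch of a per-cell-type test with summarise(). The toy data frame, the 0.05 cutoff, and the guard for cell types with only one response level are assumptions for illustration, not part of the original data.

```r
library(dplyr)

set.seed(42)
# Hypothetical stand-in for zb: three cell types, one of which ("X")
# has only a single response level and therefore cannot be tested
toy <- data.frame(
  cell_type = c(rep("T", 20), rep("B", 20), rep("X", 5)),
  response  = c(rep(c("pos", "neg"), each = 10),
                rep(c("pos", "neg"), each = 10),
                rep("pos", 5)),
  count     = c(rnorm(10, 50), rnorm(10, 10), rnorm(20, 30), rnorm(5, 30))
)

# One p-value per cell type; NA when only one response level is present
pvals <- toy %>%
  group_by(cell_type) %>%
  summarise(
    p = if (n_distinct(response) < 2) NA_real_
        else kruskal.test(count ~ response)$p.value,
    .groups = "drop"
  )

# Keep only the rows of the cell types that pass the assumed cutoff
significant <- pvals %>% filter(p < 0.05)
toy.sig <- toy %>% semi_join(significant, by = "cell_type")
```

toy.sig can then be passed to the plotting code, so that box plots are only drawn for the cell types that passed the filter.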
I am attempting to use the new dplyr scoped summarise() verbs to search through my data table and create a summary data frame, grouped by treatment arm, for each one of a set of multiple outcomes. Each summary contains statistics for both numeric predictor variables (2.5th, 50th, 97.5th percentiles) and categorical predictor variables (counts). I appear to have been successful with these computations (thanks to this massively helpful post - r summarize_if with multiple conditions).
However, my data frames are not visually friendly: the quantile() and table() results are stored as lists in each data frame cell, so I am unable to scroll through the data frame in the RStudio viewer to browse the results. Does anybody have any suggestions on how to reorganize or view this data frame in RStudio so that the full contents of these lists are easier to see?
Thank you kindly!
outcome.dfs <- list()
dt <- data.table(short.vars.full) # convert to data table to allow for subsetting with column name stored in variable
for (ae.outcome in outcomes.list) {
outcome.dfs[[ae.outcome]] <- dt[get(ae.outcome) == ae.outcome, ] %>%
select(-USUBJID) %>%
group_by(XDARM) %>%
summarise_all(~ if(is.numeric(.))
list(format(round(quantile(., probs = c(.025,0.50,0.975), na.rm = TRUE), 2), nsmall=2))
else if (is.factor(.))
list(table(.)))
}
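One possible way to make such a summary browsable is to collapse each list cell into a single string before viewing it. The sketch below uses a hypothetical toy data frame in place of one element of outcome.dfs, and the " / " separator is an arbitrary choice. It assumes dplyr 1.0 or later for across() and where().

```r
library(dplyr)

# Hypothetical toy data standing in for one filtered outcome table
toy <- data.frame(arm = c("A", "A", "B", "B"),
                  age = c(30, 40, 50, 60))

# Same kind of list-column summary as in the question
summ <- toy %>%
  group_by(arm) %>%
  summarise(age = list(format(round(quantile(age, probs = c(.025, 0.50, 0.975),
                                             na.rm = TRUE), 2), nsmall = 2)))

# Collapse every list column into one string per cell
flat <- summ %>%
  mutate(across(where(is.list),
                ~ vapply(.x, paste, character(1), collapse = " / ")))
```

flat then holds ordinary character columns, which the viewer can display in full. For table() results the names (factor levels) are lost by paste(), so they would need to be pasted in separately if they matter.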
I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables with each iteration appending the dataframe to the previous dataframe. I think this is supposed to involve imap_dfr from purrr but I can't quite get the code right. I want to use tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
require(pollster)
# rv and voter are dummy variables that I want to recode to 1
# and 0 so that I can get the percent of people who are 1s
# in each variable. Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
mutate_at(
voter_vars, ~
recode(.x,
"1" = 0,
"2" = 1)) %>%
mutate_at(
voter_vars, ~
as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables I write a function
news_summary <- function(var1){
var1 <- ensym(var1)
df3 <- df2 %>%
group_by(!!var1) %>%
summarise_at(vars(voter_vars),
funs(weighted.mean(., weight, na.rm=TRUE)))
return(df3)
}
This creates a data frame output if I run it for one variable in the dataset
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use to do this) to make it work. I tried this based on code I found elsewhere. But it doesn't work.
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
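A minimal sketch of one way the row-binding could work, using plain map_dfr() since the index that imap_dfr() provides is not needed here. The toy data frame, the helper name summary_by, and the output columns variable and level are hypothetical. The grouping values are converted to character so that levels from different variables can share one column.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data in place of the recoded illinois data (df2)
toy <- data.frame(
  educ   = c("hs", "hs", "college", "college"),
  race   = c("a", "b", "a", "b"),
  voter  = c(1, 0, 1, 1),
  weight = c(1, 3, 2, 2)
)

# Weighted mean of voter within each level of one grouping variable,
# with the variable passed as a string
summary_by <- function(data, group_var) {
  data %>%
    group_by(level = as.character(.data[[group_var]])) %>%
    summarise(voter = weighted.mean(voter, weight, na.rm = TRUE),
              .groups = "drop") %>%
    mutate(variable = group_var, .before = 1)
}

# Run the helper for each grouping variable and row-bind the results
result <- map_dfr(c("educ", "race"), ~ summary_by(toy, .x))
```

Passing the variable names as strings and looking them up with .data[[...]] avoids the need for ensym() and !! when the names already live in a character vector such as demographic_vars.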
I want to aggregate my data so that each time interval becomes one point in a diagram. I have a data frame with two columns: the first is a timestamp and the second is a value. For each time period (for example, 1 second), I want to add up all the values that fall within it.
I don't know how to do this with the aggregate function, because it does not seem to support time values.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the time stamp, or the second?).
I'm trying to understand. Are you trying to:
Split your original data.frame into multiple data.frames?
View a specific subset of your data? Effectively, you want to filter your data?
Group your data.frame into specific increments of a set time interval to then aggregate the results?
Assuming that you have named the variables in your data frame time and value, I've addressed these three examples below.
#Set Data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180,0.000500,0.000005),num,TRUE),
value = sample(1:100,num,TRUE))
#Example 1: Split your data in to multiple dataframes (using base functions)
temp1 <- tempdf[ tempdf$time>0.0003 , ]
temp2 <- tempdf[ tempdf$time>0.0003 & tempdf$time<0.0004 , ]
#Example 2: Filter your data (using dplyr::filter() function)
dplyr::filter(tempdf, time>0.0003 & time<0.0004)
#Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
mutate(group = floor(time*10000)/10000) %>%
group_by(group) %>%
summarise(avg = mean(value),
num = n())
I hope that helps?
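If the goal is a total per interval rather than an average, the same binning idea can be combined with the base aggregate() function. This is a minimal sketch that assumes the columns are named time and value, uses the three rows shown in the question, and picks an interval length of 0.0001 purely for illustration.

```r
# The three rows from the question, with assumed column names
df <- data.frame(time  = c(0.000180, 0.000185, 0.000474),
                 value = c(8, 8, 32))

bin_width <- 0.0001  # assumed interval length, adjust as needed

# Label each row with the start of the interval it falls into
df$bin <- floor(df$time / bin_width) * bin_width

# Add up the values within each interval
totals <- aggregate(value ~ bin, data = df, FUN = sum)
```

totals has one row per interval, which gives one point per interval when plotted.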
To reduce the dataset, I have been advised to use "stratified sampling".
Because I'm very new to R programming, the existing posts on Stack Overflow aren't easy for me to follow, as they include very little explanation.
I have a data set of over 60000 obs. and 24 variables. Out of all variables, 21 are quantitative (numbers).
How do I get sample data out of that? Also
- Where do I specify the dataset name
- do I need to name the new "reduced" dataset, so I could include it for the further analysis?
ADDED CODE (this is what I used for sampling):
# Sample a percentage of values from each stratum (10% in this case)
DB.quant.sample = lapply(split(DB.quant, DB.quant$group_size), function(DB.quant) {
DB.quant[sample(1:nrow(DB.quant), ceiling(nrow(DB.quant) * 0.1)), ]
})
Browse[1]>DB.quant[sample(1:nrow(DB.quant), 6000), ]
#DB.quant is the dataset and group_size is one of the variables. I'm not sure which variable I should use.
I'm having a problem illustrating graphically and intuitively how a cluster algorithm works.
I started:
DB <- na.omit(DB)
DB.quant <- DB[c(2,3,4,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24)]
And then:
d <- dist(DB.quant.sample) # but I'm getting an error:
Error in dist(DB.quant.sample, method = "euclidean") : (list) object cannot be coerced to type 'double'
Example image of my DataSet:
first few rows of the data
I'm not sure exactly how you want to sample, but here's a simple example using the built-in iris data frame. Below are two ways to do it. One using Base R and the other using the dplyr package.
Base R
Split the data frame into three separate smaller data frames, one for each Species.
Randomly sample 5 rows per Species.
# Sample 5 rows from each stratum
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), 5), ]
})
# Sample a percentage of values from each stratum (10% in this case)
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), ceiling(nrow(df) * 0.1)), ]
})
This gives us a list containing three data frames, one for each of the three different unique values of Species.
Combine the three samples into a single data frame.
df.sample = do.call(rbind, df.sample)
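This combining step is also relevant to the dist() error in the question: lapply(split(...)) returns a list of data frames, while dist() expects a numeric matrix or data frame, so it can only be applied once the pieces have been bound together. A minimal sketch with iris, assuming that only the numeric columns should enter the distance calculation (the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed so the sample is reproducible

# Stratified sample: 5 rows per Species, still a list at this point
samples <- lapply(split(iris, iris$Species), function(df) {
  df[sample(nrow(df), 5), ]
})

# Combine into one data frame, then keep numeric columns for dist()
combined <- do.call(rbind, samples)
d <- dist(combined[sapply(combined, is.numeric)])
```

The result d can then be handed to a clustering function such as hclust().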
dplyr package
Do the grouping and sampling in a single chain of functions using the pipe (%>%) operator:
library(dplyr)
# Sample 5 values from each stratum
df.sample = iris %>%
group_by(Species) %>%
sample_n(5)
# Sample a percentage of values from each stratum (10% in this case)
df.sample = iris %>%
group_by(Species) %>%
sample_frac(0.1)
I have some data (download link: http://spreadsheets.google.com/pub?key=0AkBd6lyS3EmpdFp2OENYMUVKWnY1dkJLRXAtYnI3UVE&output=xls) that I'm trying to filter. I had reconfigured the data so that instead of one row per country, and one column per year, each row of the data frame is a country-year combination (i.e. Afghanistan, 1960, NA).
Now that I've done that, I want to create a subset of the initial data that excludes any country that has 10+ years of missing contraceptive use data.
I had thought to create a list of the unique country names in a second data frame, and then add a variable to that frame that holds the # of rows for each country that have an NA for contraceptive use (i.e. for Afghanistan it would have 46). My first thought (being most fluent in VB.net) was to use a for loop to iterate through the countries, get the NA count for that country, and then update the second data frame with that value.
In that vein I tried the following:
for(x in cl){
  x$rc = nrow(subset(BCU, BCU$Country == x$Country))
}
After that failed, a little more Googling brought me to a question on here (forgot to grab the link) that suggested using by(). Based on that I tried:
by(cl, 1:nrow(cl), cl$rc <- nrow(subset(BCU, BCU$Country == cl$Country
& BCU$Contraceptive_Use == "NA")))
(cl is the second data frame listing the country names, and BCU is the initial contraceptive use data frame)
I'm fairly new to R (the problem I'm working is for an R course on Udacity), so I'll freely admit this may not be the best approach, but I'm still curious how to do this sort of aggregation.
They all seem to have >= 10 years of missing data (unless I miscalculated somewhere):
library(tidyr)
library(dplyr)
dat <- read.csv("contraceptive use.csv", stringsAsFactors=FALSE, check.names=FALSE)
dat <- rename(gather(dat, year, value, -1),
country=`Contraceptive prevalence (% of women ages 15-49)`)
dat %>%
group_by(country) %>%
summarise(missing_count=sum(is.na(value))) %>%
arrange(desc(missing_count)) -> missing
sum(missing$missing_count >= 10)
## [1] 213
length(unique(dat$country))
## [1] 213
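For the filtering step the question asks about, a grouped filter() can drop whole countries in one pass. The sketch below uses a hypothetical toy data frame with the same long layout (country, year, value) rather than the real file, and reads "10+ years of missing data" as a count of 10 or more NA rows per country.

```r
library(dplyr)

# Hypothetical toy data: country A has 11 missing years, country B has none
toy <- data.frame(
  country = rep(c("A", "B"), each = 12),
  year    = rep(1960:1971, 2),
  value   = c(rep(NA, 11), 5, 1:12)
)

# Keep only countries with fewer than 10 missing values
kept <- toy %>%
  group_by(country) %>%
  filter(sum(is.na(value)) < 10) %>%
  ungroup()
```

Inside a grouped filter() the condition is evaluated once per country, so every row of a country is kept or dropped together.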