To reduce the dataset, I have been advised to use "stratified sampling".
Because I'm very new to R programming, the existing posts on Stack aren't easy for me to follow; they offer very little explanation.
I have a dataset of over 60,000 observations and 24 variables, 21 of which are quantitative (numeric).
How do I draw a sample from it? Also:
- Where do I specify the dataset name?
- Do I need to name the new "reduced" dataset so that I can use it in further analysis?
ADDED CODE (this is what I used for sampling):
# Sample a percentage of values from each stratum (10% in this case)
DB.quant.sample = lapply(split(DB.quant, DB.quant$group_size), function(DB.quant) {
DB.quant[sample(1:nrow(DB.quant), ceiling(nrow(DB.quant) * 0.1)), ]
})
Browse[1]>DB.quant[sample(1:nrow(DB.quant), 6000), ]
# DB.quant is the dataset and group_size is one of the variables. I'm not sure which variable I should use.
I'm also having trouble illustrating graphically and intuitively how a clustering algorithm works.
I started with:
DB <- na.omit(DB)
DB.quant <- DB[c(2,3,4,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24)]
And then:
d <- dist(DB.quant.sample) # but I'm getting an error:
Error in dist(DB.quant.sample, method = "euclidean") : (list) object cannot be coerced to type 'double'
Example image of my DataSet:
first few rows of the data
I'm not sure exactly how you want to sample, but here's a simple example using the built-in iris data frame. Below are two ways to do it: one using base R and the other using the dplyr package.
Base R
Split the data frame into three smaller data frames, one for each Species.
Randomly sample 5 rows per Species.
# Sample 5 rows from each stratum
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), 5), ]
})
# Sample a percentage of values from each stratum (10% in this case)
df.sample = lapply(split(iris, iris$Species), function(df) {
df[sample(1:nrow(df), ceiling(nrow(df) * 0.1)), ]
})
This gives us a list containing three data frames, one for each unique value of Species.
Combine the three samples into a single data frame.
df.sample = do.call(rbind, df.sample)
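The combining step also bears on the dist() error in the question: lapply() returns a list of data frames, and dist() cannot coerce a list to numeric. A minimal sketch using iris as a stand-in (the names strata and combined are only illustrative, and the numeric-column filter is an assumption about what should go into the distance matrix):

```r
set.seed(1)  # make the random sample reproducible

# Stratified sample: 5 rows per Species, returned as a list of data frames
strata <- lapply(split(iris, iris$Species), function(df) {
  df[sample(nrow(df), 5), ]
})

# dist(strata) would fail here because strata is a list, not a matrix
# or data frame. Bind the pieces back into one data frame first.
combined <- do.call(rbind, strata)

# Keep only the numeric columns before computing Euclidean distances
d <- dist(combined[sapply(combined, is.numeric)], method = "euclidean")
```

Applied to the question's data, the same pattern would mean calling do.call(rbind, DB.quant.sample) before dist().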
dplyr package
Do the grouping and sampling in a single chain of functions using the pipe (%>%) operator:
library(dplyr)
# Sample 5 values from each stratum
df.sample = iris %>%
group_by(Species) %>%
sample_n(5)
# Sample a percentage of values from each stratum (10% in this case)
df.sample = iris %>%
group_by(Species) %>%
sample_frac(0.1)
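In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(). A rough equivalent of the two chains above, assuming a recent dplyr is installed (by_count and by_share are illustrative names):

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Sample 5 rows from each stratum
by_count <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Sample 10% of the rows from each stratum
by_share <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()
```

Note that slice_sample(prop = ...) rounds the group size down, whereas the base R version above uses ceiling(), so the two can differ by one row per stratum when the group size is not a multiple of 10.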
I am trying to create a table which provides the weighted means of a list of variables by categories of another list of variables. I want to iterate over the second list of variables with each iteration appending the dataframe to the previous dataframe. I think this is supposed to involve imap_dfr from purrr but I can't quite get the code right. I want to use tidyverse for my code.
I'll use the illinois dataset from the pollster package for my example.
library(pollster)
library(dplyr)
# rv and voter are dummy variables that I want to recode to 1 and 0
# so that I can get the percent of people who are 1s in each variable.
# Here I recode them.
voter_vars <- c("rv", "voter")
df2 <- illinois %>%
mutate_at(
voter_vars, ~
recode(.x,
"1" = 0,
"2" = 1)) %>%
mutate_at(
voter_vars, ~
as.numeric(.x))
So those are the variables I want as the columns in my table. To get the weighted means for these two variables, I wrote a function:
news_summary <- function(var1){
var1 <- ensym(var1)
df3 <- df2 %>%
group_by(!!var1) %>%
summarise_at(vars(voter_vars),
funs(weighted.mean(., weight, na.rm=TRUE)))
return(df3)
}
This returns a data frame if I run it for one variable in the dataset:
news_summary(educ6)
But what I want to do is run it for three variables in the dataset, rowbinding each output to the previous output so I have a table with all of the weighted means together.
demographic_vars <- c("educ6", "raceethnic", "maritalstatus")
However, I don't quite understand how to put this into imap_dfr (which I think is what I am supposed to use) to make it work. I tried the following, based on code I found elsewhere, but it doesn't work:
purrr::imap_dfr(demographic_vars ~ news_summary(!!.x))
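One possible way to get the row-bound table is sketched below on a small made-up data frame instead of the illinois data, so the values are purely hypothetical. It assumes the summary function can be changed to take the grouping variable as a string, which avoids the need for !! altogether. It uses map_dfr() because only the variable names are iterated over (imap_dfr() is meant for cases where the index or name of each element is needed too), and across() in place of the deprecated funs().

```r
library(dplyr)
library(purrr)

# Toy stand-in for the recoded survey data (hypothetical values)
toy <- data.frame(
  educ6  = c("HS", "HS", "BA", "BA"),
  sex    = c("F", "M", "F", "M"),
  rv     = c(1, 0, 1, 1),
  voter  = c(0, 0, 1, 1),
  weight = c(1, 3, 2, 2)
)
voter_vars <- c("rv", "voter")
demographic_vars <- c("educ6", "sex")

# Variant of news_summary() that takes the grouping variable as a string.
# The categories are stored in a common "category" column so that the
# per-variable results can be stacked on top of each other.
news_summary <- function(data, group_var) {
  data %>%
    group_by(category = as.character(.data[[group_var]])) %>%
    summarise(
      across(all_of(voter_vars), ~ weighted.mean(.x, weight, na.rm = TRUE)),
      .groups = "drop"
    ) %>%
    mutate(variable = group_var, .before = 1)
}

# Run the summary once per demographic variable and row-bind the results
result <- map_dfr(demographic_vars, ~ news_summary(toy, .x))
```

The result has one row per category of each demographic variable, with a variable column recording which demographic the row came from.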
I have a function which detrends data:
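detrender <- function(Data, Vars, timevar) {
  Data_detrend <- Data
  for (v in seq_along(Vars)) {
    # Regress the variable on time
    ff <- as.formula(paste0(Vars[[v]], " ~ ", timevar))
    fit <- lm(ff, data = Data_detrend)
    # Detrend only if the time trend is significant at the 5% level
    if (anova(fit)[["Pr(>F)"]][1] < 0.05) {
      message(paste("Detrending variable", v))
      # Replace the non-missing values with the regression residuals
      Data_detrend[[Vars[[v]]]][!is.na(Data_detrend[[Vars[[v]]]])] <- residuals(fit)
    }
  }
  Data_detrend
}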
useVar <- c("depression_sum")
inertia <- detrender(inertia,useVar, "timestamp")
This works fine and detrends my data. However, my data file contains many groups (Rid), and the detrending is currently done across all groups at once. I want to detrend per group (Rid) and have tried to apply my function using group_by in dplyr. However, this code adds thousands of new rows. That is wrong: the number of rows should stay the same, with only the data detrended.
inertia_2<- inertia %>%
group_by(Rid) %>%
do(detrender(inertia,useVar, "timestamp")) %>%
ungroup()
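The extra rows most likely come from passing the whole inertia data frame to detrender() inside do(): each group then returns a full copy of the data, so the result has roughly the original row count times the number of groups. Inside do(), the rows of the current group are available as the dot (.). A minimal base R sketch of per-group detrending follows; detrend_one() is a compact, hypothetical stand-in for the detrender() function above, and the toy data is made up.

```r
# Compact stand-in for detrender(), handling a single variable.
# na.exclude pads the residuals with NA so they line up with the rows.
detrend_one <- function(data, var, timevar) {
  fit <- lm(reformulate(timevar, response = var), data = data,
            na.action = na.exclude)
  p <- anova(fit)[["Pr(>F)"]][1]
  if (!is.na(p) && p < 0.05) {
    data[[var]] <- residuals(fit)
  }
  data
}

# Toy data: two groups with opposite linear trends (hypothetical values)
set.seed(1)
toy <- data.frame(
  Rid = rep(c("a", "b"), each = 20),
  timestamp = rep(1:20, times = 2)
)
toy$depression_sum <- ifelse(toy$Rid == "a", 2, -1) * toy$timestamp + rnorm(40)

# Split by group, detrend each piece on its own, and bind the pieces back
pieces <- lapply(split(toy, toy$Rid), detrend_one,
                 var = "depression_sum", timevar = "timestamp")
detrended <- do.call(rbind, pieces)

# dplyr equivalent of the idea, using the original function:
# inertia %>% group_by(Rid) %>% do(detrender(., useVar, "timestamp")) %>% ungroup()
```

Because each group is fitted separately, the number of rows in detrended matches the input.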
I am working with a large data frame of clinical information. It includes info on several cell types, cell counts, and treatment response for a large group of patients. I want to plot each individual cell type into a box plot comparing cell counts for positive vs negative response:
[this is an example of what I want][1]
[1]: https://i.stack.imgur.com/JHvda.png
Because of the size of the data frame, I want to filter the data so that I only create box plots that show significance.
I decided that the best way to do this was to create a column for p-values and use the p-values assigned to filter the data frame.
I tried the following:
# zb is the original data frame
zb.pval <- zb %>%
group_by(cell_type) %>%
mutate(res.ct.pval = kruskal.test(response ~ count)$p.value)
However, I get an error:
Error: Problem with mutate() input res.ct.pval.
x all observations are in the same group
i Input res.ct.pval is kruskal.test(response ~ count)$p.value.
i The error occured in group 30: cell_type = "ACTIVATION HLA DR FITC- CD69 PE- CD19 ECD- CD56 PC5.5- CD16 PC7+ CD134 APC+ CD4 700+ CD3 750- CD8 750+".
I decided to try a for loop:
pval.list <- list()
for(ct in pval.list){
tmp <- zb %>%
filter(cell_type == ct)
pval.list[[ct]] <- kruskal.test(tmp$response ~ tmp$count)$p.value
}
pval.df <- pval.list %>%
bind_rows() %>%
rownames_to_column(var = "cell_type")
But I just end up with an empty data frame. I do not have a ton of experience with R and am wondering if I am making any obvious mistakes. Otherwise, I am curious if anyone has a better way of doing this.
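Two things stand out in the attempts above. The for loop iterates over pval.list, which is empty at that point, so its body never runs. And the formula response ~ count treats count as the grouping variable; if the goal is to compare counts between response groups, the formula would normally be count ~ response. A minimal base R sketch under that assumption, using made-up data (zb_toy and its values are hypothetical), which also skips cell types that have only one response level:

```r
# Toy stand-in for zb (hypothetical values)
set.seed(7)
zb_toy <- data.frame(
  cell_type = rep(c("A", "B", "C"), each = 20),
  response  = c(rep(c("pos", "neg"), each = 10),   # A: both levels
                rep(c("pos", "neg"), each = 10),   # B: both levels
                rep("pos", 20)),                   # C: one level only
  count     = c(rnorm(10, 10), rnorm(10, 20),      # A: clear difference
                rnorm(20, 15),                     # B: no difference
                rnorm(20, 15))                     # C
)

# One Kruskal-Wallis p-value per cell type; NA where the test is undefined
pvals <- sapply(split(zb_toy, zb_toy$cell_type), function(d) {
  if (length(unique(d$response)) < 2) return(NA_real_)
  kruskal.test(count ~ response, data = d)$p.value
})

pval_df <- data.frame(cell_type = names(pvals), p_value = unname(pvals))

# Keep only the cell types with a significant difference
significant <- subset(pval_df, !is.na(p_value) & p_value < 0.05)
```

The significant cell types can then be used to filter the original data frame before plotting, for example with zb_toy[zb_toy$cell_type %in% significant$cell_type, ].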
I am trying to expand a dataframe by including, for each row, 500 simulated values from a Poisson distribution whose parameter Theta (count_mean) is already stored in the dataframe. The example below is only a small sample, since my real data consists of more than 50,000 rows (i.e. ids).
example.data <- data.frame(id=c("4008", "4118", "5330"),
count_mean=c(2, 25, 11)
)
So for each row, I know I have to generate the simulated values by:
rpois(500, example.data$count_mean)
How can I introduce these values into the same dataframe, in which each new column presents one simulated value for each row?
You can use sapply to simulate the numbers and then use cbind to bind your data together:
simdata <- t(sapply(example.data$count_mean, function(x) rpois(500, x)))
colnames(simdata) <- paste0("sim_", 1:500)
cbind(example.data, simdata)
However, I would encourage you to work with a different data format: maybe a long table would be more appropriate in this situation than the current wide table.
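A long layout of the kind suggested here could look like the following base R sketch, with one row per id and simulation. The names n_sim, sim_long, sim and value are only illustrative. It relies on rpois() accepting a vector of means, which is recycled element by element.

```r
set.seed(123)  # make the simulation reproducible

example.data <- data.frame(id = c("4008", "4118", "5330"),
                           count_mean = c(2, 25, 11))
n_sim <- 500

# One row per (id, simulation) pair: each id and its mean are repeated
# n_sim times, and rpois() draws one value per repeated mean.
sim_long <- data.frame(
  id    = rep(example.data$id, each = n_sim),
  sim   = rep(seq_len(n_sim), times = nrow(example.data)),
  value = rpois(n_sim * nrow(example.data),
                rep(example.data$count_mean, each = n_sim))
)
```

With 50,000 ids this gives 25 million rows but only three columns, which is usually easier to summarise and plot than 500 separate columns.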
Another option using dplyr and tidyr:
library(dplyr)
library(tidyr)
example.data %>%
rowwise() %>%
mutate(poisson = list(rpois(500, count_mean))) %>%
unnest(poisson) %>%
group_by(id) %>%
mutate(count=row_number()) %>%
pivot_wider(names_from="count", names_prefix="sim_", values_from="poisson")
I want to aggregate my data. The goal is to have one point in a diagram for each time interval. I have a data frame with 2 columns: the first column is a timestamp and the second is a value. I want to evaluate each time period, meaning that all values within a time period (for example 1 second) should be added together.
I don't know how to do this with the aggregate function, because it does not seem to support time values.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the time stamp, or is the second column the time stamp?).
I'm trying to understand. Are you trying to:
Split your original data frame into multiple data frames?
View a specific subset of your data? Effectively, you want to filter your data?
Group your data frame into specific increments of a set time interval to then aggregate the results?
Assuming that you have named the variables on your dataframe as time and value, I've addressed these three examples below.
#Set Data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180,0.000500,0.000005),num,TRUE),
value = sample(1:100,num,TRUE))
#Example 1: Split your data into multiple dataframes (using base functions)
temp1 <- tempdf[ tempdf$time>0.0003 , ]
temp2 <- tempdf[ tempdf$time>0.0003 & tempdf$time<0.0004 , ]
#Example 2: Filter your data (using dplyr::filter() function)
dplyr::filter(tempdf, time>0.0003 & time<0.0004)
#Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
mutate(group = floor(time*10000)/10000) %>%
group_by(group) %>%
summarise(avg = mean(value),
num = n())
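Since the question asks for the values to be added together within each period, the same grouping idea can also be sketched with the base aggregate() function and sum. The bin width of 0.0001 is an assumption chosen to suit the example data (the question mentions 1 second), and the data is rebuilt here so the snippet runs on its own.

```r
# Rebuild the example data
set.seed(4444)
tempdf <- data.frame(time  = sample(seq(0.000180, 0.000500, 0.000005), 100, TRUE),
                     value = sample(1:100, 100, TRUE))

bin_width <- 0.0001  # assumed interval length; use 1 for one-second bins

# Integer index of the interval each timestamp falls into
tempdf$bin <- floor(tempdf$time / bin_width)

# Sum the values within each interval: one row (one plot point) per bin
sums <- aggregate(value ~ bin, data = tempdf, FUN = sum)

# Start time of each interval, for use on the x axis
sums$bin_start <- sums$bin * bin_width
```

aggregate() works here because the timestamps are first reduced to a plain numeric bin index, so no special time handling is needed.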
I hope that helps?