I have a dataset including 60 predictors and a dependent variable which indicates whether a purchase has taken place and how much was spent. The conversion rate in my data is 3.5% and I want to downsample it to 2.5% by excluding records with a purchase. The original distributions should be preserved.
Thank you for your help!
bjoern.
First, some simpler data (2 columns instead of 60) with 3.5% TRUE values in column b:
library(tidyverse)
n <- 10000
df <- data.frame(a = rnorm(n)) %>%
  mutate(b = row_number() <= .035 * n)
df %>%
summarize(mean(b))
mean(b)
1 0.035
One way to downsample would be to rbind all of the rows where b is FALSE (which you keep in full) with a sample of the rows where b is TRUE, reduced toward the target rate via sample_frac:
df2 <- rbind(
df %>% filter(!b),
df %>% filter(b) %>% sample_frac(.025/.035)
)
df2 %>%
summarize(mean(b))
mean(b)
1 0.02525253
You might not get exactly 2.5%, depending on the original size of your data, since we can only sample whole rows. Also, the fraction .025/.035 is approximate: removing TRUE rows shrinks the total row count too, so the result lands slightly above the target (about 2.53% here).
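If you want to land as close to the target as whole-row sampling allows, you can instead solve rate = t / (t + n_false) for t, the number of TRUE rows to keep. A minimal sketch, continuing from the df above:
target <- 0.025
n_false <- sum(!df$b)
# t / (t + n_false) = target  =>  t = target * n_false / (1 - target)
n_true_keep <- round(target * n_false / (1 - target))
df3 <- rbind(
  df %>% filter(!b),
  df %>% filter(b) %>% sample_n(n_true_keep)
)
mean(df3$b)
# ~0.025, up to rounding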
I am using RStudio. I have a dataset with 1000k records and columns including FINAL_CLASSIFICATION and AGE. The FINAL_CLASSIFICATION column contains values from 1 to 7: rows with 1, 2 or 3 are infected with SARS-COVID, while rows with 4, 5, 6 or 7 are healthy. I need to make a histogram of the ages of the infected, so I understand I must build the group of ages that coincide with values 1, 2 and 3 of FINAL_CLASSIFICATION, but I cannot find a way to create that group.
Could you help me?
I have the following code
#1)
# import the data into R
# RECOMMENDATION: use read_csv
library(tidyverse) # for read_csv, the dplyr verbs, and purrr's map used below
covid_dataset <- read_csv("Desktop/Course in R/Examples/covid_dataset.csv")
View(covid_dataset)
#------------------------------------------------------------------------------------------
#2) Extract a random sample of 100k records and assign it into a new variable. From now on work with this dataset
# HINT: use dplyr's sample_n function
sample <- sample_n(covid_dataset, 100000)
#The syntax is sample_n(x, n), where x is the dataset from which we want to
#extract the sample and n is the sample size we want.
nrow(sample)
#with this function we can corroborate that we have extracted a 100K sample.
#------------------------------------------------------------------------------------------
#3)Make a statistical summary of the dataset and also show the data types by column.
summary(sample)
#The summary function is the one that gives us the summary statistics.
map(sample, class)
#The map() function gives us the data type of each column; we can see that
#most columns are numeric.
#-------------------------------------------------------------------------------------------
#4)Filter the rows that are positive for SARS-COVID and calculate the number of records.
## Positive cases are those that in the FINAL_CLASSIFICATION column have 1, 2 or 3.
## To filter the rows, we will make use of the PIPE operator together with dplyr's select and filter.
#select picks out the FINAL_CLASSIFICATION column, and filter keeps the rows
#where it is 1, 2 or 3, i.e. SARS-COVID positive results.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 1
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 2
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3)
# Here I filter the rows for which the column FINAL_CLASSIFICATION has a 3
# I do them separately to have a better view of the records.
#Now if we want to get them all together we simply do the following
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3)
#This gives us the rows less than or equal to 3, which is the same as the rows
#where the FINAL_CLASSIFICATION column has 1, 2 or 3.
#Now, if we want the number of records, doing it separately, we simply add
#another PIPE operator with the nrow() function to get the number of rows for each filter.
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 1) %>% nrow()
#gives us a result of 1471
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 2) %>% nrow()
#gives us a result of 46
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION == 3) %>% nrow()
#Gives us a result of 37703
#If we add the 3 results, we have that the total number of records is
1471+46+37703
#Which gives us 39220
#But it can be simplified by doing it in a straightforward way as follows
sample %>% select(FINAL_CLASSIFICATION) %>% filter(FINAL_CLASSIFICATION <= 3) %>% nrow()
#And we notice that we get the same result as the previous code.
#In conclusion, we have a total of 39220 positive SARS-COVID cases.
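A more compact way to see all three counts at once, if helpful, is dplyr's count():
sample %>%
  filter(FINAL_CLASSIFICATION <= 3) %>%
  count(FINAL_CLASSIFICATION)
#This returns one row per class (1, 2, 3) with the corresponding n.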
#---------------------------------------------------------------------------------------------
#5)Count the number of null records per column (HINT: Use sapply or map, and is.na)
sapply(sample, function(x) sum(is.na(x)))
#This shows us the number of NA's per column. We notice that the only column
#that has NA's is DATE_DEF, with a total of 95044; this tells us that out of the
#100K records, only approximately 5K have a known DATE_DEF.
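Per the hint, the purrr equivalent of the same NA count would be:
map_int(sample, ~ sum(is.na(.x)))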
#------------------------------------------------------------------------------------------
#6)
##a)Calculate the mean age of covid infectees.
##b)Make a histogram of the ages of the infected persons.
##c)Make a density plot of the ages of the infected persons
sample %>%
  group_by(FINAL_CLASSIFICATION <= 3) %>%
  summarise(average = mean(AGE))
#The TRUE group is the infected; their average age is 43.9
#Now we make a histogram of the ages of the infected persons
sample %>% group_by(FINAL_CLASSIFICATION <=3, AGE) %>% summarise(count = n())
It is in the last part where I have doubts. I want to find the average age of the infected people; I used the code above with group_by, but I don't know if that is correct. And my remaining doubts are with the other two questions in #6: how to make the histogram and the density plot.
What I gathered is that you wish to:
1. create a variable 'FINAL_CLASSIFICATION' based on values of 'FINAL_RANKING,'
2. summarize the average age of groups in FINAL_CLASSIFICATION, and
3. create a histogram of the positive cases in FINAL_CLASSIFICATION.
I created a random sample of 100 cases with random assumptions for AGE and FINAL_RANKING:
library(dplyr)
library(ggplot2)
sample <- tibble(FINAL_RANKING = sample(1:7, 100, replace = T), AGE = sample(10:100, 100, replace = T) )
sample <- sample %>%
mutate(
FINAL_CLASSIFICATION = case_when(
FINAL_RANKING %in% 1:3 ~ "SARS_COVID_POSITIVE",
FINAL_RANKING %in% 4:7 ~ "SARS_COVID_NEGATIVE")
)
sample %>%
group_by(FINAL_CLASSIFICATION) %>%
summarize(average_age = mean(AGE))
sample %>%
filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
ggplot(., aes(x = AGE)) +
geom_histogram()
Gives summary output:
# A tibble: 2 x 2
FINAL_CLASSIFICATION average_age
<chr> <dbl>
1 SARS_COVID_NEGATIVE 51.8
2 SARS_COVID_POSITIVE 58.6
and the plot: a histogram of AGE for the positive cases.
As noted in the ggplot message, you should pick a better bin setting via the bins or binwidth argument of geom_histogram().
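For part c) of #6, the density plot is the same pipeline with geom_density() in place of geom_histogram(); a minimal sketch on the same simulated data:
sample %>%
  filter(FINAL_CLASSIFICATION == "SARS_COVID_POSITIVE") %>%
  ggplot(aes(x = AGE)) +
  geom_density()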
I am having trouble doing cross-validation for a hierarchical dataset. There is a level-2 factor ("ID") that needs to be equally represented in each subset. For this dataset, there are 157 rows and 28 IDs. I want to divide my data up into five subsets of roughly 31 rows each, where each of the 28 IDs is represented (an ID, i.e. a stand, can be repeated within a subset).
I have gotten as far as:
library(dplyr)
df %>%
group_by(ID) %>%
and have no clue where to take it from there. Any help is appreciated!
Here's what I'd do: assign one row from each ID randomly to each of the 5 subsets, and then distribute the leftovers fully randomly. Without sample data this is untested, but it should at least get you on the right track.
df %>%
  group_by(ID) %>%
  mutate(
    # a random ranking of this ID's rows
    random_rank = rank(runif(n())),
    # ranks 1-5 send one row of this ID to each of the 5 subsets;
    # any leftover rows are assigned fully at random
    strata = ifelse(random_rank <= 5, random_rank, sample(1:5, size = n(), replace = TRUE))
  ) %>%
  select(-random_rank) %>%
  ungroup()
This should create a strata column as described above. If you want to split the data into a list of data frames, one per stratum, use ... %>% group_by(strata) %>% group_split().
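As a quick sanity check, assuming you save the result above as df_strata (a name I'm making up here), you can confirm the subset sizes and the ID coverage:
# rows per subset (should be roughly 157 / 5 each)
df_strata %>% count(strata)
# number of distinct IDs represented in each subset
df_strata %>% distinct(strata, ID) %>% count(strata)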
I have a dataset that looks like this:
group=rep(1:4,each=100)
values=round(runif(400,25,350),0)
data<-data.frame(values,group)
Each group comprises 100 observations (values).
For each group, I would like to take 20 random samples without replacement, with sample sizes starting at 10 and increasing by 5 up to 95.
Thus for each group I want 20 samples with size=10, 20 samples with size=15, ..., 20 samples with size=95.
Any idea on how to do it using some tidyverse solution?
At the moment I did this:
library(tidyverse)
library(infer) # for rep_sample_n
data %>%
  group_by(group) %>%
  nest() %>%
  mutate(v = map(data, ~rep_sample_n(., size = 10, replace = FALSE, reps = 20))) %>%
  unnest(v)
It seems to correctly replicate a sample of size 10 twenty times, but I still need to vary the size...
Thanks.
You could create a sequence of sample sizes, wrap your group_by/nest/etc. code into a for loop, then add each new sample to a list.
Notice how the size argument in ~rep_sample_n is now sizes[i] rather than a fixed number.
sizes <- seq(10, 95, by = 5)
sample_list <- list()
for (i in seq_along(sizes)) {
  new_data <- data %>%
    group_by(group) %>%
    nest() %>%
    mutate(v = map(data, ~rep_sample_n(., size = sizes[i], replace = FALSE, reps = 20))) %>%
    unnest(v)
  sample_list[[i]] <- new_data # [[ ]] so each list element stores a whole data frame
}
I am suggesting a for loop instead of lapply(), as it makes more sense to me and this application doesn't take much time anyway.
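That said, since you asked for a tidyverse solution: an untested purrr sketch that collects all sizes into one data frame, assuming rep_sample_n comes from the infer package, could look like this:
library(tidyverse)
library(infer) # assumed source of rep_sample_n
sizes <- seq(10, 95, by = 5)
all_samples <- map_dfr(sizes, function(s) {
  data %>%
    group_by(group) %>%
    nest() %>%
    mutate(v = map(data, ~ rep_sample_n(.x, size = s, replace = FALSE, reps = 20))) %>%
    unnest(v) %>%
    select(-data) %>%
    mutate(size = s) # record the sample size alongside each draw
})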
I have a tibble (or data frame, if you like) that is 19 columns of pure numerical data and I want to filter it down to only the rows where at least one value is above or below a threshold. I prefer a tidyverse/dplyr solution but whatever works is fine.
This is related to this question, but distinct in at least two ways that I can see:
I have no identifier column (besides the row number, I suppose)
I need to subset based on the max across the current row being evaluated, not across a column
Here are attempts I've tried:
data %>% filter(max(.) < 8)
data %>% filter(max(value) < 8)
data %>% slice(which.max(.))
Here's a way which will keep rows having at least one value above the threshold. For keeping rows with values below the threshold, just reverse the inequality inside any() -
data %>%
filter(apply(., 1, function(x) any(x > threshold)))
Actually, @r2evans has a better answer in the comments -
data %>%
filter(rowSums(. > threshold) >= 1)
A couple more options that should scale pretty well:
library(dplyr)
# a more dplyr-y option
iris %>%
filter_all(any_vars(. > 5))
# or taking advantage of base functions
iris %>%
filter(do.call(pmax, as.list(.))>5)
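If you are on dplyr 1.0 or later, note that filter_all() is superseded; the same filter can be written with if_any() (restricted to numeric columns here, since iris also contains the factor Species):
iris %>%
  filter(if_any(where(is.numeric), ~ .x > 5))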
Maybe there are better and more efficient ways, but these two functions should do what you need if I understood correctly. This solution assumes you have only numerical data.
You transpose the tibble (so you obtain a numerical matrix).
Then you use map to get the max or min by column (which is the max/min by row in the initial dataset).
You obtain the row indexes you are looking for.
Finally, you can filter your dataset.
# Random data --------------------------------------------------------------
data <- as_tibble(replicate(10, runif(20)), .name_repair = "unique")
# Thresholds to be used -----------------------------------------------------
max_threshold <- 0.9
min_threshold <- 0.1
# lesser_max ----------------------------------------------------------------
lesser_max <- function(data, max_threshold = 0.9) {
  # the max of each column of the transposed data is the max of each row of data
  row_max <- data %>%
    t() %>%
    as_tibble(.name_repair = "unique") %>%
    map_dbl(max)
  # keep the rows whose max is below the threshold
  data[row_max < max_threshold, ]
}
# greater_min ---------------------------------------------------------------
greater_min <- function(data, min_threshold = 0.1) {
  # the min of each column of the transposed data is the min of each row of data
  row_min <- data %>%
    t() %>%
    as_tibble(.name_repair = "unique") %>%
    map_dbl(min)
  # keep the rows whose min is above the threshold
  data[row_min > min_threshold, ]
}
# Examples ------------------------------------------------------------------
data %>%
  lesser_max(max_threshold)
data %>%
  greater_min(min_threshold)
We can use base R methods
data[Reduce(`|`, lapply(data, `>`, threshold)), ]
I'm working with a data frame in R which has one column Z. I'm looking to add an extra column X which consists of 0.5 and -0.5, indicating two different groups. My goal is to have as many people in group 0.5 as in group -0.5. I've tried looking for solutions, but the only thing I've come across is the sample() function, which doesn't give equal groups; it just uses equal probability.
The code I've used so far is:
Z = rnorm(1000, 50, sd = 5)
df = data.frame(Z)
Simply replicate -0.5 and 0.5 nrow(df) / 2 times each, then shuffle with sample(). This creates a random vector with an equal number of each value (it assumes nrow(df) is even).
df$X <- sample(rep(c(-0.5, 0.5), nrow(df) / 2))
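A quick check that the groups really are balanced:
table(df$X)
# -0.5  0.5
#  500  500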
Solution idea
Randomly select, without replacement, half the integers from 1 through the number of rows for one group; the complement is the other group.
Let's create a data frame for testing:
X <- data.frame(Z=rnorm(10))
Here's an R implementation of one solution:
n <- nrow(X)
X$Group <- ifelse(sample.int(n, n) * 2 <= n, 1/2, -1/2)
Generalization
To partition the data frame randomly into g groups, sample 1..n (which randomly permutes the row indexes) and reduce them modulo g:
g <- 2
X$Group <- factor(sample.int(n, n) %% g)
The counts of each group will all be within 1 of each other.
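For example, a quick check with the 10-row test frame above (n = 10, g = 2):
table(X$Group)
# 0 1
# 5 5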
With the sample function I create a random order of row numbers. So I give your df rownames and the second df rownames, then I assign 500 values -0.5 and the other 500 the value 0.5.
Run random() more than once to see that it's completely random, but 0.5 and -0.5 each occur 500 times in the final df.
library(tidyverse)
random <- function() {
  Z = rnorm(1000, 50, sd = 5)
  df = as_tibble(Z) %>% rename(Z = value) %>% rownames_to_column() # create column for merging
  result <- sample(1:1000) %>%
    as_tibble() %>%
    mutate(X = ifelse(value > 500, 0.5, -0.5)) %>% # exactly 500 values get -0.5 and the other 500 get 0.5
    rownames_to_column() %>% # row number as column
    right_join(df, by = "rowname") %>% # merge both df's
    select(Z, X) # keep only the final columns
  print(result)
}
random()
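A quick check of the balance (random() also returns the tibble, since print() returns its argument):
table(random()$X)
# -0.5  0.5
#  500  500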