Good afternoon,
I have to add dummy data to a data frame whenever a specific variable is absent from several given intervals.
require(plyr)
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
for (x in df$length) {
  if (x <= 0 | x > 1e+07) {
    df <- rbind.fill(df, data.frame(length = 5000000, grade = 1000))
  }
}
This works fine, but I am having trouble checking whether x is absent from each "length" interval from 0 to 1e+08, with a step of 1e+07, and adding "1000" to "grade" when that is the case. I tried a lot of things, and in the end my data frame is only 1 row larger.
After that, I will create subgroups based on these intervals and I need a value for each subgroup.
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
In the end, the objective is to still get an empty space on a boxplot for each condition where there is no data, as the "1000" I added is way above the limit threshold.
The next step will be to do the same but for each “company” variable.
I hope I am clear, sorry for my English.
Thanks
You can do it using dplyr and tidyr.
First, cut your df$length:
df <- data.frame(length = c(1.5e+07, 2.5e+07), grade = c(1000, 1000), company = "TEST")
df$length <- cut(df$length, breaks = seq(0, 1e+08, 1e+07))
Now we can use dplyr to left_join on all the levels of length; we then complete company:length, filter out any NA companies, and change the NA grades to 1000:
library(dplyr)
library(tidyr)
df %>% left_join(data.frame(length = levels(df$length)), .) %>%
complete(length, company) %>%
filter(!is.na(company)) %>%
mutate(grade = ifelse(is.na(grade), 1000, grade))
Source: local data frame [10 x 3]
length company grade
(fctr) (fctr) (dbl)
1 (0,1e+07] TEST 1000
2 (1e+07,2e+07] TEST 1000
3 (2e+07,3e+07] TEST 1000
4 (3e+07,4e+07] TEST 1000
5 (4e+07,5e+07] TEST 1000
6 (5e+07,6e+07] TEST 1000
7 (6e+07,7e+07] TEST 1000
8 (7e+07,8e+07] TEST 1000
9 (8e+07,9e+07] TEST 1000
10 (9e+07,1e+08] TEST 1000
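Since the stated goal was an empty slot on the boxplot for intervals with no data, a possible follow-up sketch (df_complete is a hypothetical name for the pipeline result saved to a variable, and the ylim of 500 is an assumed display threshold):
library(ggplot2)
# Save the completed data frame from the pipeline above
df_complete <- df %>% left_join(data.frame(length = levels(df$length)), .) %>%
  complete(length, company) %>%
  filter(!is.na(company)) %>%
  mutate(grade = ifelse(is.na(grade), 1000, grade))
# coord_cartesian() zooms without dropping data, so intervals whose only
# value is the dummy 1000 show up as empty slots on the x axis
ggplot(df_complete, aes(x = length, y = grade)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 500))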
Starting data
I'm working in R and I have a set of data generated from groups (cohorts) of animals treated with different doses of different drugs. A simplified reproducible example of my dataset follows:
# set starting values for simulation of animal cohorts across doses of various drugs with a few numeric endpoints
cohort_size <- 3
animals <- letters[1:cohort_size]
drugs <- factor(c("A", "B", "C"))
doses <- factor(c(0, 10, 100))
total_size <- cohort_size * length(drugs) * length(doses)
# simulate data based on above parameters
df <- cbind(expand.grid(drug = drugs, dose = doses, animal = animals),
data.frame(
other_metadata = sample(LETTERS[24:26], size = total_size, replace = TRUE),
num1 = rnorm(total_size, mean = 10, sd = 3),
num2 = rnorm(total_size, mean = 60, sd = 9),
num3 = runif(total_size, min = 1, max = 5)))
This produces something like:
## drug dose animal other_metadata num1 num2 num3
## 1 A 0 a X 6.448411 54.49473 4.111368
## 2 B 0 a Y 9.439396 67.39118 4.917354
## 3 C 0 a Y 8.519773 67.11086 3.969524
## 4 A 10 a Z 6.286326 69.25982 2.194252
## 5 B 10 a Y 12.428265 70.32093 1.679301
## 6 C 10 a X 13.278707 68.37053 1.746217
My goal
For each drug treatment, I consider the dose == 0 animals as my control group for that drug (let's say each was run at a different time and has its own control group). I wish to calculate the mean of each numeric endpoint (columns 5:7 in this example) for the control group. Next I want to normalize (divide) every numeric endpoint (columns 5:7) for every animal by the mean of its respective control group.
In other words num1 for all animals where drug == "A" should be divided by the mean of num1 for all animals where drug == "A" AND dose == 0 and so on for each endpoint.
The final output should be the same size as the original data.frame with all of the non-numeric metadata columns remaining unchanged on the left side and all the numeric data columns now with the normalized values.
Naturally I'd like to find the simplest solution possible - minimizing the creation of new variables and ideally in a single dplyr pipeline.
What I've tried so far
I should say that I have technically solved this but the solution is super ugly with a ton of steps so I'm hoping to get help to find a more elegant solution.
I know I can easily get the averages for the control groups into a new data.frame using:
df %>%
filter(dose == 0) %>%
group_by(drug, dose) %>%
summarise_all(mean)
I've looked into several things but can't figure out how to implement them. In order of what seems most promising to me:
dplyr::group_modify()
dplyr::rowwise()
sweep() in some type of loop
Thanks in advance for any help you can offer!
If the intention is to divide the numeric columns by the mean of the control-group values within each 'drug', group by 'drug' and use mutate with across (from dplyr 1.0.0), dividing each column's values (.) by the mean of the values where 'dose' is 0:
library(dplyr) # 1.0.0
df %>%
group_by(drug) %>%
mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
If the dplyr version is < 1.0.0, use mutate_if:
df %>%
group_by(drug) %>%
mutate_if(is.numeric, ~ ./mean(.[dose == 0]))
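A quick sanity check, assuming the normalized result is saved to a variable (df_norm is a hypothetical name): after dividing by the control means, the dose == 0 rows should average to exactly 1 in every numeric column.
df_norm <- df %>%
  group_by(drug) %>%
  mutate(across(where(is.numeric), ~ ./mean(.[dose == 0])))
# Control-group means should now all be 1 (up to floating-point error)
df_norm %>%
  filter(dose == 0) %>%
  summarise(across(where(is.numeric), mean))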
I want to divide my data set into train and test data, but I have one column as a group. All members of a group must be in either train or test. For example, if the group column is like this:
group
1
1
1
1
1
2
2
2
3
3
if one of the rows of the first group is in the train set, the rest of that group (the first 5 rows here) must be there too, and so on.
A solution using dplyr. dat_train and dat_test are the final results. I assume a case with 10,000 groups in the training dataset and 5,000 groups in the testing dataset.
library(dplyr)
# Set seed for reproducibility
set.seed(12345)
# Create an example data frame with group and data
dat <- tibble(group = rep(1:15000, each = 5),
data = rnorm(75000))
# Step 1: Create a lookup table of the distinct groups
g <- dat %>% distinct(group)
# Step 2: Use sample_n to sample groups for the training set
g_train <- g %>% sample_n(size = 10000)
# Step 3: Use semi_join and anti_join to split dat into train and test
dat_train <- dat %>% semi_join(g_train, by = "group")
dat_test <- dat %>% anti_join(g_train, by = "group")
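A couple of quick checks to confirm that no group is split across the two sets:
# No overlap in group membership, and all rows accounted for
intersect(dat_train$group, dat_test$group)     # should be integer(0)
nrow(dat_train) + nrow(dat_test) == nrow(dat)  # should be TRUE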
Let's assume you have a total of 20 groups and you want 8 groups in the training set and the remaining 12 in your test set.
First, let's generate some data to play with:
dat <- data.frame(group=factor(rep(1:20, each=5)), value=rnorm(100))
As you want to sample by group rather than by observation, draw a random sample of 8 groups for your training set and put the rest into the test set.
train.groups <- sample(levels(dat$group), 8)
dat.train <- dat[dat$group %in% train.groups, ]
dat.test <- dat[!(dat$group %in% train.groups), ]
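If you would rather specify a proportion of groups than a fixed count, the same idea works (a sketch; the 40% is an arbitrary choice):
n_train <- round(0.4 * nlevels(dat$group))  # 40% of the 20 groups -> 8
train.groups <- sample(levels(dat$group), n_train)
dat.train <- dat[dat$group %in% train.groups, ]
dat.test <- dat[!(dat$group %in% train.groups), ]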
You could use dplyr (part of the tidyverse) to solve this.
Assuming your dataset's name is df1, here is an example:
library(dplyr)
training_data <- df1 %>% filter(group == 1)
testing_data <- df1 %>% filter(group == 2)
In R, I have a large list of large data frames consisting of two columns, value and count. The function I am using in the previous step returns the value of each observation in value; the corresponding count column shows how many times that specific value has been observed. The following code produces one data frame as an example; however, the data frames in the list each have different values and value ranges:
d <- as.data.frame(
cbind(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
)
Now I would like to aggregate the data to be able to create viewable visualizations. This aggregation should be applied to all data frames in the list, each of which has a different value range. I am looking for a function that cuts the data into new values and counts, a little like a histogram function. For example, for all data with a value of 0 to 100, the counts should be summed (and so on, over a defined interval, with a clean interval border starting point like 0).
My first try was to create a simple value vector in which each value is repeated the number of times given by the count field. The next step would have been applying the hist() function without plotting to obtain the aggregated values and counts, which can be configured through hist()'s arguments. However, this produces vectors that are too large (several GB each) for R to handle. I appreciate any solutions or hints!
I am not entirely sure I understand your question correctly, but this might solve your problem or at least point you in a direction. I make a list of data frames and then generate a new column containing the result of applying binfunction to each data frame, using map from the purrr package.
library(tidyverse)
d1 <- d2 <- tibble(
value = runif(n = 1856, min = 921, max = 4187),
count = runif(n = 1856, min = 0, max = 20000)
)
d <- tibble(name = c('d1', 'd2'), data = list(d1, d2))
binfunction <- function(data) {
  data %>%
    mutate(bin = value - (value %% 100)) %>%  # floor each value to its 100-wide bin
    group_by(bin) %>%
    mutate(sum = sum(count)) %>%              # total count per bin, repeated per row
    select(bin, sum)
}
d_binned <- d %>%
mutate(binned = map(data, binfunction)) %>%
select(-data) %>%
unnest() %>%
group_by(name, bin) %>%
slice(1L)
d_binned
#> Source: local data frame [66 x 3]
#> Groups: name, bin [66]
#>
#> # A tibble: 66 x 3
#> name bin sum
#> <chr> <dbl> <dbl>
#> 1 d1 900 495123.8
#> 2 d1 1000 683108.6
#> 3 d1 1100 546524.4
#> 4 d1 1200 447077.5
#> 5 d1 1300 604759.2
#> 6 d1 1400 506225.4
#> 7 d1 1500 499666.5
#> 8 d1 1600 541305.9
#> 9 d1 1700 514080.9
#> 10 d1 1800 586892.9
#> # ... with 56 more rows
d_binned %>%
ggplot(aes(x = bin, y = sum, fill = name)) +
geom_col() +
facet_wrap(~name)
See this comment for my inspiration for the binning. It bins the data in groups of 100, so e.g. bin 1100 represents 1100 to <1200 etc. I imagine you can adapt the binfunction to your needs.
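If you would rather stay in base R, roughly the same binning can be sketched with cut() and tapply() (binfun_base is a hypothetical helper name, assuming the same value/count layout):
# Base R sketch: sum the counts within 100-wide bins of value
binfun_base <- function(df) {
  breaks <- seq(0, ceiling(max(df$value) / 100) * 100, by = 100)
  bins <- cut(df$value, breaks = breaks, right = FALSE)
  tapply(df$count, bins, sum)  # named vector of summed counts per bin
}
# Apply it to every data frame in a plain list:
lapply(list(d1 = d1, d2 = d2), binfun_base)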
I have a dataframe that looks something like this:
time id trialNum trialType accX gravX
1 1 6 7 low -0.38876217 10.185266
2 2 1 6 low 0.68254705 10.741545
3 3 3 15 high -0.21906854 9.466929
4 4 2 15 none -0.03370001 9.490829
5 5 4 1 high 0.16511542 10.986796
6 6 9 2 none -0.10441621 9.915561
You can generate something similar using this:
testDF <- data.frame(time = 1:50,
id = sample(1:10, size=50, replace=T),
trialNum = sample(1:15, size = 50, replace=T),
trialType = sample(c("none", "low", "high"),
size = 50, replace=T),
accX = sin(seq(1,50,1)),
gravX = 0.1)
And a function to calculate the average time between peaks in a filtered signal (returning mean time, and variance of the time differences):
library(dplyr)
library(signal)
library(quantmod)
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
What I'm trying to do is apply this function to each combination of id, trialNum, and trialType using group_by:
tempTrial <-
group_by(testDF, id, trialNum, trialType) %>%
summarise(meanTime = calcStepTime(.)[1],
varianceTime= calcStepTime(.)[2])
The problem is that in the output data frame (tempTrial), every row of meanTime and varianceTime is identical.
In this toy dataset, the columns sometimes all show NA (this doesn't happen in my actual dataset).
Am I doing something incorrectly that causes every row to be identical in these 2 columns? The pipeline should take each combination of id, trialNum, and trialType and calculate the peak times for each of those separately. However, it seems to store only a single value for every combination?
The chain is working properly in the sense that . refers to the grouped data frame group_by(testDF, id, trialNum, trialType). Since your defined function has no way of using the group information in ., the results are what you see (i.e. the function applied to the whole data frame).
So your problem here is the incorrect use of summarise. Latrunculia's answer shows you that the proper way to use summarise in the way you expect is to apply the function to combinations of columns in your data frame, in which case the function applies by group in each variable.
dplyr has a do function for applications where you wish to apply a function to the data frame subset implied by group_by. Simply replace your summarise with do:
tempTrial <- group_by(testDF, id, trialNum, trialType) %>%
  do(meanTime = calcStepTime(.)[1],
     varianceTime = calcStepTime(.)[2])
The documentation for do is not terribly clear, but this post describes the application very well.
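One caveat worth adding: when do() is given named arguments, as above, the result columns are list columns, so you may want to flatten them afterwards (a small sketch, assuming the tempTrial produced by the do() call):
# Named arguments to do() produce list columns; unlist() flattens them
tempTrial$meanTime <- unlist(tempTrial$meanTime)
tempTrial$varianceTime <- unlist(tempTrial$varianceTime)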
What you get right now is the result of calcStepTime applied on the whole (ungrouped) data frame for each group.
Try rewriting the function such that it depends on the variables, but not on the data frame.
calcStepTime <- function(var1, var2, var3){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, var1 - var2)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- var3[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return(c(meanStepTime, varianceStepTime))
}
testDF %>% group_by(id, trialNum, trialType) %>%
  summarise(meanTime = calcStepTime(accX, gravX, time)[1],
            varianceTime = calcStepTime(accX, gravX, time)[2])
It gives the right result if you just pipe the testDF data frame into it. It breaks for the grouped data frame, but I can't tell whether that's because the function is not defined for the subsets or because of a problem within the function itself.
Let me know if it works for the full data.
As noted by yourself and Latrunculia, calcStepTime is very likely to return NaN/NA on the 50-observation dataset. This occurs when either no peak or only a single peak is found within a group of observations. You may want to defend against this in your analysis code. I used this for testing:
testDF <- data.frame(time = 1:200,
id = sample(1:2, size=200, replace=T),
trialNum = sample(1:1, size = 200, replace=T),
trialType = sample(c("low"), size = 200, replace=T),
accX = sin(seq(1,200,1)),
gravX = 0.1)
If you change the return type of your function to data_frame (tibble), like so:
calcStepTime <- function(df){
bf <- butter(1, c(0.03,0.05), type="pass")
filtered <- filtfilt(bf, df$accX - df$gravX)
peaks <- findPeaks(filtered)
peakValue <- filtered[peaks]
peakTime <- df$time[peaks]
timeDifferences <- diff(peakTime)
meanStepTime <- mean(timeDifferences)
varianceStepTime <- var(timeDifferences)
return (data_frame("meanStepTime" = meanStepTime,
"varianceStepTime" = varianceStepTime))
}
Then you can take advantage of purrr::by_slice() for a fairly elegant solution:
library(purrr)
testDF %>%
group_by(id, trialNum, trialType) %>%
by_slice(calcStepTime, .collate="cols")
I got this from my test sample:
# A tibble: 2 x 5
id trialNum trialType meanStepTime1 varianceStepTime1
<int> <int> <fctr> <dbl> <dbl>
1 1 1 low 42.75 802.2500
2 2 1 low 39.75 616.9167
Note that .collate="cols" is the important argument that tells by_slice() to create the named columns for the results in the output. I'm a little curious myself as to why the "1" has been appended to the names we set in the data_frame returned by your function.
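A side note: by_slice() was removed from later purrr releases, so on a current tidyverse a roughly equivalent sketch uses tidyr::nest() with purrr::map() (assuming the tibble-returning calcStepTime above):
library(tidyr)
library(purrr)
testDF %>%
  group_by(id, trialNum, trialType) %>%
  nest() %>%                                  # one nested data frame per group
  mutate(res = map(data, calcStepTime)) %>%   # apply the function to each subset
  select(-data) %>%
  unnest(res)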
I'm looking for a better option for saving a group mean directly in the same data frame as a new column. I usually solve this problem with the steps shown below. Is there a way to save the means without merging them explicitly, but doing it right away, maybe with dplyr?
data <- data.frame(group = rep(c("low","high"),2),
values = runif(n = 4, min = 0, max = 2))
data_mean <- data %>% group_by(group) %>% summarise (mean(values))
merge(data_mean, data)
group mean(values) values
1 high 0.2889459 0.07079697
2 high 0.2889459 0.50709475
3 low 0.7767188 0.93176182
4 low 0.7767188 0.62167588
Just use mutate instead of summarise; it should do what you want:
data %>%
group_by(group) %>%
mutate(mean = mean(values))
#Source: local data frame [4 x 3]
#Groups: group
#
# group values mean
#1 low 1.4017168 0.7478336
#2 high 0.8074821 1.1018971
#3 low 0.0939505 0.7478336
#4 high 1.3963122 1.1018971
Note: my values are different from yours because you didn't use set.seed for reproducibility of random numbers.
You could use tapply in base R. Indexing the per-group means back by group keeps them aligned with the rows (a bare tapply() returns only one value per group, which would be recycled out of order):
within(data, means <- tapply(values, group, mean, na.rm=TRUE)[group])
# group values means
# 1 low 1.1069518 1.515846
# 2 high 1.6729194 1.001568
# 3 low 0.8961838 1.515846
# 4 high 1.3587732 1.001568
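For what it's worth, ave() is the base R idiom that returns a per-row vector directly, which avoids indexing back by group:
# ave() returns the group mean repeated for every row of that group
within(data, means <- ave(values, group, FUN = function(x) mean(x, na.rm = TRUE)))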