I have a dataframe (df) with yearly values for atmospheric deposition and a vector of 12 elements (mul).
str(df) 'data.frame': 220 obs. of 11 variables:
$ year : num 1900 1902 1903 1904 1906 ...
$ BOX1 : num 0.72 0.72 0.72 0.72 0.72 ...
mul <- c(0.7,0.7,1.3,1.7,0.7,1.0,0.7, 1.7,1.7,1.7,1.0,0.4)
I want to multiply each element of a given column 'BOX' by the 12 values of 'mul', so that a 220-element column becomes a 2640-element column.
In the following example I've done it for the first element of column BOX1 but I need to apply the procedure to all the elements at once.
df$BOX1[1] * mul
[1] 0.503811 0.503811 0.935649 1.223541 0.503811 0.719730 0.503811 1.223541 1.223541 1.151568 0.719730 0.287892
Thank you for your help
# example data
df = data.frame(year = c(1900,1902,1903),
BOX1 = c(0.72, 0.75, 0.80))
mul <- c(0.7,0.7,1.3,1.7,0.7,1.0,0.7, 1.7,1.7,1.7,1.0,0.4)
library(tidyverse)
df %>%
rowwise() %>% # for each row
mutate(x = list(BOX1 * mul)) %>% # multiply value in BOX1 with mul and save results as a list
ungroup() %>% # drop the rowwise grouping
unnest(x) # expand the list column x into one row per element
# # A tibble: 36 x 3
# year BOX1 x
# <dbl> <dbl> <dbl>
# 1 1900 0.72 0.504
# 2 1900 0.72 0.504
# 3 1900 0.72 0.936
# 4 1900 0.72 1.22
# 5 1900 0.72 0.504
# 6 1900 0.72 0.72
# 7 1900 0.72 0.504
# 8 1900 0.72 1.22
# 9 1900 0.72 1.22
# 10 1900 0.72 1.22
# # ... with 26 more rows
You can remove column BOX1 if you want.
You can also try to use a vectorized function instead of rowwise, which might be faster:
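# vectorise over x only, so that each BOX1 value is multiplied by the whole of mul
f = function(x,y) x*y
f = Vectorize(f, vectorize.args = "x", SIMPLIFY = FALSE)
df %>%
mutate(x = f(BOX1, mul)) %>% # f returns a list with one 12-element vector per row
unnest(x)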
Another solution with tidyverse :
library(tidyr)
library(dplyr)
library(tibble)
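tibble(mul = mul) %>% rowid_to_column() %>% # row ids keep the duplicates in 'mul'
crossing(df) %>% mutate(v = BOX1 * mul) %>% # when calling 'crossing'; 'mul' now refers to the column
arrange(year, rowid) # restore the year-by-year order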
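If you'd rather avoid extra packages, the same expansion can be sketched in base R with rep(). This is only a minimal sketch on the example data from the answer above; the column name month is my own assumption (12 multipliers suggests months), so rename it as needed.

```r
# Example data from the answer above
df <- data.frame(year = c(1900, 1902, 1903),
                 BOX1 = c(0.72, 0.75, 0.80))
mul <- c(0.7, 0.7, 1.3, 1.7, 0.7, 1.0, 0.7, 1.7, 1.7, 1.7, 1.0, 0.4)

# Repeat every yearly value 12 times; mul is then recycled once per year
long <- data.frame(
  year  = rep(df$year, each = length(mul)),
  month = rep(seq_along(mul), times = nrow(df)),  # hypothetical label for the position in mul
  x     = rep(df$BOX1, each = length(mul)) * mul
)

nrow(long)  # 36
```

Because each value is repeated exactly length(mul) times, the recycling of mul lines up with the year boundaries.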
I'm new in R and I'm struggling with this df that looks like this:
Date Group Factor 1 Factor 2 Spread
2019-04-01 a 1.01 1.011 0.01
2019-04-02 a 1.02 1.012 0.02
2019-04-03 a 1.03 1.013 0.03
2019-04-01 b 1.005 1.004 0.01
2019-04-02 b 1.0051 1.0041 0.02
2019-04-03 b 1.0052 1.0042 0.03
I would like to check the group in each row: if the group is "a", compute Factor1/Factor1(1 day lag) * Factor2 + Spread, and if the group is not "a", do not add the spread.
Since you are conditioning on the group, this is a good example of by (base R), dplyr::group_by, or data.table's x[,,by=].
The equation is effectively the same in all three, capitalizing on the fact that (Group[1] == "a") will be coerced from a logical to numeric when multiplied by a number; since FALSE translates to 0, this effectively disables adding Spread.
Base
I use within here to make the internals a little more readable, but this is not a requirement (in which case you'd need to prepend x$ in front of all of the variable names).
The lagging can be done using dplyr::lag (even if you don't use the rest of the package for this) or many other techniques. I don't find stats::lag to be the most intuitive in applications like this, but I'm sure somebody will suggest a way to incorporate it into an answer. The use of c(NA, ...) ensures that we don't bring in a different group's data or impute data we don't have, since we have no value to bring in on the first row of a group. Finally, head(..., n = 1) returns the first element of a vector/list, while head(..., n = -1) (negative) returns all but the last.
newx <- by(x, x$Group, function(y) {
within(y, {
NewVal = Factor2 * Factor1 / c(NA, head(Factor1, n=-1)) + (Group[1] == "a") * Spread
})
})
newx
# x$Group: a
# Date Group Factor1 Factor2 Spread NewVal
# 1 2019-04-01 a 1.01 1.011 0.01 NA
# 2 2019-04-02 a 1.02 1.012 0.02 1.042020
# 3 2019-04-03 a 1.03 1.013 0.03 1.052931
# -------------------------------------------------------
# x$Group: b
# Date Group Factor1 Factor2 Spread NewVal
# 4 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5 2019-04-02 b 1.0051 1.0041 0.02 1.0042
# 6 2019-04-03 b 1.0052 1.0042 0.03 1.0043
This is really just a list with some fancy by-specific formatting, so you can treat it as such and combine the pieces in an efficient base-R way:
do.call("rbind.data.frame", c(newx, stringsAsFactors = FALSE))
# Date Group Factor1 Factor2 Spread NewVal
# a.1 2019-04-01 a 1.0100 1.0110 0.01 NA
# a.2 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# a.3 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# b.4 2019-04-01 b 1.0050 1.0040 0.01 NA
# b.5 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# b.6 2019-04-03 b 1.0052 1.0042 0.03 1.004300
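To see the pieces of that expression in isolation, here is a minimal sketch for group "a" alone, using plain vectors typed in from the question's table (the lower-case variable names are mine, not part of the answer).

```r
# Group "a" from the question, as plain vectors
group   <- c("a", "a", "a")
factor1 <- c(1.01, 1.02, 1.03)
factor2 <- c(1.011, 1.012, 1.013)
spread  <- c(0.01, 0.02, 0.03)

# Lag by one row: drop the last element and put NA in front
lagged <- c(NA, head(factor1, n = -1))

# TRUE is coerced to 1 (FALSE to 0) when multiplied, which switches Spread on or off
new_val <- factor2 * factor1 / lagged + (group[1] == "a") * spread
new_val
```

The first element is NA because there is no previous day to divide by; the other two should match the NewVal column shown above for group "a".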
dplyr
Many find the tidyverse line of packages to read intuitively.
library(dplyr)
x %>%
group_by(Group) %>%
mutate(NewVal = Factor2 * Factor1 / lag(Factor1) + (Group[1] == "a") * Spread) %>%
ungroup()
# # A tibble: 6 x 6
# Date Group Factor1 Factor2 Spread NewVal
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2019-04-01 a 1.01 1.01 0.01 NA
# 2 2019-04-02 a 1.02 1.01 0.02 1.04
# 3 2019-04-03 a 1.03 1.01 0.03 1.05
# 4 2019-04-01 b 1.00 1.00 0.01 NA
# 5 2019-04-02 b 1.01 1.00 0.02 1.00
# 6 2019-04-03 b 1.01 1.00 0.03 1.00
data.table
On a different note, many find data.table better because of efficiencies gained from in-place modification (most of R's operations are copy-on-write, meaning some operations re-copy the object or a portion of it with each change).
library(data.table)
X <- as.data.table(x)
X[, NewVal := Factor2 * Factor1 / shift(Factor1) + (Group[1] == "a") * Spread, by = "Group"]
X
# Date Group Factor1 Factor2 Spread NewVal
# 1: 2019-04-01 a 1.0100 1.0110 0.01 NA
# 2: 2019-04-02 a 1.0200 1.0120 0.02 1.042020
# 3: 2019-04-03 a 1.0300 1.0130 0.03 1.052931
# 4: 2019-04-01 b 1.0050 1.0040 0.01 NA
# 5: 2019-04-02 b 1.0051 1.0041 0.02 1.004200
# 6: 2019-04-03 b 1.0052 1.0042 0.03 1.004300
The "in-place" part is evident on the second line here, where it appears as if the [ operation should just return a subset or something of the data ... but in this case using := causes the columns to be created (or changed) in-place.
First and foremost, regardless if you have input or not, thank you for taking your time to view my question.
Let me break down what I am doing, the sample dataset, and the error.
What I currently have is data for several different ID's that list the dispersion per day. (you will see below). I want to loop through the dates and add two columns to the data : Rolling Means columns & Rolling standard deviation column.
The code I have written out so far is this:
library(zoo)
Testing1 <- function(dataset, k) {
ops <- data.frame()
for (i in unique(dataset$Date)) {
ops <- dataset %>% mutate(rolling_mean = rollmean(dataset$Dispersion,k)) %>%
mutate(rolling_std = rollapply(dataset$Dispersion, width = k, FUN = sd))
}
Results <<- ops
}
However, I get the following error:
Error in mutate_impl(.data, dots) :
Column rolling_mean must be length 30 (the number of rows) or one, not 26
I am assuming that the row differential is due to me specifying a 5-day window for the rolling average, meaning it won't calculate it for the first 4 rows. But how do I go about telling R that it's OK to put NAs in those rows? Or if you have any other solution, that would work as well. Please help.
Here's a sample of the data:
Identifier Date Dispersion
1000 2/15/2018 0.390
1000 2/16/2018 0.664
1000 2/17/2018 0.526
1000 2/18/2018 0.933
1000 2/19/2018 0.009
1000 2/20/2018 0.987
1000 2/21/2018 0.517
1000 2/22/2018 0.641
1000 2/23/2018 0.777
1000 2/24/2018 0.613
1001 2/15/2018 0.617
1001 2/16/2018 0.234
1001 2/17/2018 0.303
1001 2/18/2018 0.796
1001 2/19/2018 0.359
1001 2/20/2018 0.840
1001 2/21/2018 0.291
1001 2/22/2018 0.699
1001 2/23/2018 0.882
1001 2/24/2018 0.467
1002 2/15/2018 0.042
1002 2/16/2018 0.906
1002 2/17/2018 0.077
1002 2/18/2018 0.156
1002 2/19/2018 0.350
1002 2/20/2018 0.060
1002 2/21/2018 0.457
1002 2/22/2018 0.770
1002 2/23/2018 0.433
1002 2/24/2018 0.366
You get this error because the length of the rolling means/stds does not match the length of Dispersion. Simply add k - 1 NAs at the beginning of your means/stds vectors.
Below is a working example (it needs dplyr for %>% and mutate, and zoo for the rolling functions). You can modify this based on your needs.
my_function <- function(df, k) {
df %>%
mutate(
rolling_mean = c(rep(NA, k - 1), rollmean(Dispersion, k)),
rolling_std = c(rep(NA, k - 1), rollapply(Dispersion, width = k, FUN = sd))
)
}
For example, you may want to add group_by to compute these values for each Identifier:
my_function <- function(df, k) {
df %>%
group_by(Identifier) %>%
mutate(
rolling_mean = c(rep(NA, k - 1), rollmean(Dispersion, k)),
rolling_std = c(rep(NA, k - 1), rollapply(Dispersion, width = k, FUN = sd))
)
}
Update following up on @G. Grothendieck's comment:
It turns out the zoo package can already pad the result with NAs: the right-aligned variants rollmeanr() and rollapplyr() accept fill = NA, so the code above can be refactored as:
my_function <- function(df, k) {
df %>%
mutate(
rolling_mean = rollmeanr(Dispersion, k, fill = NA),
rolling_std = rollapplyr(Dispersion, width = k, FUN = sd, fill = NA)
)
}
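To see where the "length 30 ... not 26" mismatch comes from without any packages, here is a minimal base R sketch. It uses embed() in place of zoo's rolling functions (my substitution, purely for illustration) on the first six Dispersion values of Identifier 1000.

```r
# First six Dispersion values for Identifier 1000 from the sample data
dispersion <- c(0.390, 0.664, 0.526, 0.933, 0.009, 0.987)
k <- 5

# embed() puts each window of k consecutive values in one row
windows  <- embed(dispersion, k)
raw_mean <- rowMeans(windows)
raw_sd   <- apply(windows, 1, sd)

# A window of k values only exists from position k onwards,
# so the result is k - 1 elements shorter than the input
length(raw_mean)     # 2

# Padding with k - 1 leading NAs restores the original length
padded_mean <- c(rep(NA, k - 1), raw_mean)
padded_sd   <- c(rep(NA, k - 1), raw_sd)
length(padded_mean)  # 6
```

With 30 rows and k = 5 the unpadded result has 30 - 5 + 1 = 26 values, which is exactly what the error message complains about.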
I'd take a look at tibbletime.
Assuming your data frame is named mydata and the Date column is a character: first convert the Date, then convert to a time-aware tibble:
library(dplyr)
library(tibbletime)
mydata <- mydata %>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
as_tbl_time(index = Date)
Now you can define functions for rolling mean and sd:
mean_5 <- rollify(mean, window = 5)
sd_5 <- rollify(sd, window = 5)
mydata %>%
mutate(rolling_mean = mean_5(Dispersion),
rolling_std = sd_5(Dispersion))
# A time tibble: 30 x 5
# Index: Date
Identifier Date Dispersion rolling_mean rolling_std
<int> <date> <dbl> <dbl> <dbl>
1 1000 2018-02-15 0.39 NA NA
2 1000 2018-02-16 0.664 NA NA
3 1000 2018-02-17 0.526 NA NA
4 1000 2018-02-18 0.933 NA NA
5 1000 2018-02-19 0.009 0.504 0.342
6 1000 2018-02-20 0.987 0.624 0.393
7 1000 2018-02-21 0.517 0.594 0.394
8 1000 2018-02-22 0.641 0.617 0.393
9 1000 2018-02-23 0.777 0.586 0.367
10 1000 2018-02-24 0.613 0.707 0.182
# ... with 20 more rows
How can I randomly add missing values to some or each column (say random ~5% missing in each) in a simulated dataframe, plus, is there a more efficient way of simulating a dataframe with both continuous and factor columns?
#Simulate some data
N <- 2000
data <- data.frame(id = 1:2000,age = rnorm(N,18:90),bmi = rnorm(N,15:40),
chol = rnorm(N,50:350), insulin = rnorm(N,2:40),sbp = rnorm(N, 50:200),
dbp = rnorm(N, 30:150), sex = c(rep(1, 1000), rep(2, 1000)),
smoke = rep(c(1, 2), 1000), educ = sample(LETTERS[1:4]))
#Manually add some missing values
data <- data %>%
mutate(age = "is.na<-"(age, age <19 | age >88),
bmi = "is.na<-"(bmi, bmi >38 | bmi <16),
insulin = "is.na<-"(insulin, insulin >38),
educ = "is.na<-"(educ, bmi >35))
The best solution in my opinion would be the mice package. This is an R package dedicated to imputation. It also has a function called ampute for introducing missing data into a data.frame.
ampute - Generate Missing Data For Simulation Purposes
This function generates multivariate missing data in a MCAR, MAR or MNAR manner.
The advantage of this solution is you can set multiple parameters for the simulation of your missing data.
ampute(data, prop = 0.5, patterns = NULL, freq = NULL, mech = "MAR",
weights = NULL, cont = TRUE, type = NULL, odds = NULL,
bycases = TRUE, run = TRUE)
As you can see, you can set the proportion of missingness (prop), the missing data mechanism (mech; "MCAR" would be your choice for missing completely at random) and several other parameters. This solution is also quite clean, since it takes only one line of code.
Here's a tidyverse approach that will remove roughly 20% of your data for each column you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
age = rnorm(N,18:90),
bmi = rnorm(N,15:40),
chol = rnorm(N,50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
data %>%
gather(var, value, -id) %>% # reshape data
mutate(r = runif(nrow(.)), # simulate a random number from 0 to 1 for each row
value = ifelse(var %in% c_names & r <= prc_missing, NA, value)) %>% # if it's one of the variables you specified and the random number is less than your threshold update to NA
select(-r) %>% # remove random number
spread(var, value) # reshape back to original format
# id age bmi chol
# 1 1 17.37355 15.91898 49.83548
# 2 2 19.18364 16.78214 50.74664
# 3 3 19.16437 17.07456 52.69696
# 4 4 NA 16.01065 53.55666
# 5 5 22.32951 19.61983 53.31124
# 6 6 22.17953 19.94387 54.29250
# 7 7 24.48743 NA 56.36458
# 8 8 25.73832 20.52925 57.76853
# 9 9 26.57578 NA 57.88765
# 10 10 26.69461 24.41794 59.88111
# 11 11 29.51178 26.35868 60.39811
# 12 12 NA 25.89721 60.38797
# 13 13 NA 27.38767 62.34112
# 14 14 28.78530 27.94619 61.87064
# 15 15 33.12493 27.62294 65.43302
# 16 16 32.95507 NA 66.98040
# 17 17 33.98381 30.60571 65.63278
# 18 18 35.94384 NA 65.95587
# 19 19 36.82122 34.10003 68.56972
# 20 20 37.59390 34.76318 68.86495
And this is an alternative that will remove exactly 20% of data for the columns you specify:
set.seed(1)
# example data
N <- 20
data <- data.frame(id = 1:N,
age = rnorm(N,18:90),
bmi = rnorm(N,15:40),
chol = rnorm(N,50:350))
library(tidyverse)
# specify which variables should have missing data and prc of missing data
c_names = c("age","bmi")
prc_missing = 0.20
n_remove = prc_missing * nrow(data)
data %>%
gather(var, value, -id) %>% # reshape data
sample_frac(1) %>% # shuffle rows
group_by(var) %>% # for each variables
mutate(value = ifelse(var %in% c_names & row_number() <= n_remove, NA, value)) %>% # update to NA top x number of rows if it's one of the variables you specified
spread(var, value) # reshape to original format
# # A tibble: 20 x 4
# id age bmi chol
# <int> <dbl> <dbl> <dbl>
# 1 1 17.4 15.9 49.8
# 2 2 19.2 16.8 50.7
# 3 3 19.2 17.1 52.7
# 4 4 NA 16.0 53.6
# 5 5 22.3 NA 53.3
# 6 6 22.2 19.9 54.3
# 7 7 24.5 20.8 56.4
# 8 8 25.7 NA 57.8
# 9 9 26.6 NA 57.9
# 10 10 NA NA 59.9
# 11 11 NA 26.4 60.4
# 12 12 NA 25.9 60.4
# 13 13 29.4 27.4 62.3
# 14 14 28.8 27.9 61.9
# 15 15 33.1 27.6 65.4
# 16 16 33.0 29.6 67.0
# 17 17 34.0 30.6 65.6
# 18 18 35.9 31.9 66.0
# 19 19 36.8 34.1 68.6
# 20 20 37.6 34.8 68.9
Would this work?
n_rows <- nrow(data)
perc_missing <- 5 # percentage missing data
row_missing <- sample(n_rows, round(perc_missing/100 * n_rows, 0)) # sample randomly x% of rows
col_missing <- 1 # define column
data[row_missing, col_missing] <- NA # assign missing values
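To repeat this for several columns at once, the same idea can be wrapped in a small helper. This is only a sketch: add_missing(), its arguments and the toy data frame are hypothetical names I made up for illustration. It draws a fresh set of rows for every column, so the NAs are missing completely at random and each listed column loses the same number of values.

```r
# Hypothetical helper: set a fixed share of randomly chosen rows to NA in each listed column
add_missing <- function(d, cols, prop = 0.05) {
  n_remove <- round(prop * nrow(d))
  for (col in cols) {
    rows <- sample(nrow(d), n_remove)  # a different random set of rows per column
    d[rows, col] <- NA
  }
  d
}

set.seed(1)
toy <- data.frame(id  = 1:100,
                  age = rnorm(100, mean = 50, sd = 10),
                  sex = factor(rep(c("f", "m"), 50)))

toy_na <- add_missing(toy, cols = c("age", "sex"))
colSums(is.na(toy_na))
```

It works for numeric and factor columns alike, because assigning NA does not depend on the column type.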
I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, will need to be balanced across multiple variables so that each subset ends up having a similar mean of B, C, D that's included in their subgroup compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
colnames(data)<-c("A","B","C","D","E","F","G","H")
data=as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, etc. That's a lot of constraints. Probably there are, however, algorithms that could find the set membership assignment schedule that minimizes the difference in set means.
However, if you can separately take 20 observations from each column without regard to which row it came from, then here's one option:
# Group into sets with same means
same_means = data %>%
gather(key, value) %>%
arrange(value) %>%
group_by(key) %>%
mutate(set = c(rep(1:25, 10), rep(25:1, 10)))
# Check means by set for each column
same_means %>%
group_by(key, set) %>%
summarise(mean=mean(value)) %>%
spread(key, mean) %>% as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
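The idea behind that assignment can be sketched in base R for a single column. Note that this is a variant, not the exact code above: here the set labels run 1..25 and then 25..1 alternately along the sorted values (a "serpentine" order), whereas the answer switches direction only once, halfway through.

```r
set.seed(2)
values <- runif(500, min = 0, max = 10)
n_sets <- 25

# Labels 1..25 followed by 25..1, repeated along the sorted values
pattern <- c(seq_len(n_sets), rev(seq_len(n_sets)))
set_id <- integer(length(values))
set_id[order(values)] <- rep(pattern, length.out = length(values))

# Every set has 20 values, and the set means end up close together
table(set_id)
set_means <- tapply(values, set_id, mean)
range(set_means)
```

Each set gets one low and one high value from every run of 50 sorted values, which is why the means stay close to each other.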
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
mutate(set = sample(rep(1:(500/20), each=20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>% group_by(set) %>%
summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
mutate(set = sample(rep(1:ceiling(500/20), each=20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows in each set.
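As a quick check of how the sizes vary, here is the same expression outside of mutate(), assuming a hypothetical data set with 505 rows instead of 500:

```r
set.seed(104)
n_rows   <- 505   # hypothetical row count that is not divisible by 20
set_size <- 20
n_sets   <- ceiling(n_rows / set_size)           # 26 sets

labels <- rep(seq_len(n_sets), each = set_size)  # 520 labels, 15 more than needed
set_id <- sample(labels)[seq_len(n_rows)]        # shuffle, then keep the first 505

table(set_id)
```

Because the labels are shuffled before the surplus is cut off, the 15 missing rows are spread randomly over the sets rather than all coming out of the last one.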
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
groupdata2::fold(k = 25, num_col = "total") # k = 500/20=25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
dplyr::group_by(group) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
dplyr::select(-total) %>%
dplyr::group_by(.folds) %>%
dplyr::summarise_all(.funs = mean) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
# Extract the variables of interest
# We assume these are numeric but we could add a check
data_to_balance <- data[, cols]
# Standardize all columns
# And calculate rowwise sums
data_std <- data_to_balance %>%
dplyr::mutate_all(.funs = function(x){(x-mean(x))/sd(x)}) %>%
dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create `num_tries` unique numerically balanced splits
data_std <- data_std %>%
groupdata2::fold(
k = k,
num_fold_cols = num_tries,
num_col = "total"
)
# The new fold column names ".folds_1", ".folds_2", etc.
fold_col_names <- paste0(".folds_", seq_len(num_tries))
# Remove total column
data_std <- data_std %>%
dplyr::select(-total)
# Calculate score for each split
# This could probably be done more efficiently without a for loop
variance_scores <- c()
for (fcol in fold_col_names){
score <- data_std %>%
dplyr::group_by(!!as.name(fcol)) %>%
dplyr::summarise(across(where(is.numeric), mean)) %>%
dplyr::summarise(across(where(is.numeric), sd)) %>%
sum()
variance_scores <- append(variance_scores, score)
}
# Get the fold column with the lowest score
lowest_fcol_index <- which.min(variance_scores)
best_fcol <- fold_col_names[[lowest_fcol_index]]
# Add the best fold column / grouping factor to the original data
data[["group"]] <- data_std[[best_fcol]]
# Return the original data and the score of the best fold column
list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
data = data,
cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
k = 25,
num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
I'm a newbie in R. I have a data set with 3 sets of lung function measurements and 3 corresponding dates for each observation, given below. I would like to extract the slope for each observation (decline in lung function) using R and insert it in a new column.
1. How should I approach the problem?
2. Is my data set arranged in right format?
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 9/12/1991 3/11/1993
18303 0.79 4/23/1992
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991
example for 1st observation, slope=0.0003
Thanks..
If I understood the question, I think you want the slope between each set of visits:
library(dplyr)
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), ~ as.Date(., "%m/%d/%Y")) %>% # funs() is deprecated; a lambda does the same
do(tibble(slope=diff(unlist(.[,2:4]))/diff(unlist(.[,5:7])), # change in FEV1 per day between visits
after_visit=1+(1:length(slope))))
## Source: local data frame [18 x 3]
## Groups: ID [9]
##
## ID slope after_visit
## <int> <dbl> <dbl>
## 1 18105 -2.309469e-04 2
## 2 18105 -2.830189e-04 3
## 3 18200 -3.663004e-05 2
## 4 18200 NA 3
## 5 18303 NA 2
## 6 18303 NA 3
## 7 24204 -3.484321e-04 2
## 8 24204 6.745363e-05 3
## 9 28102 -5.639098e-04 2
## 10 28102 -2.359882e-04 3
## 11 34104 2.594810e-04 2
## 12 34104 -2.747253e-05 3
## 13 43108 -4.433498e-04 2
## 14 43108 -1.098901e-04 3
## 15 103114 -3.937008e-04 2
## 16 103114 -4.123711e-04 3
## 17 114101 -3.448276e-04 2
## 18 114101 2.604167e-05 3
Alternate munging:
group_by(df, ID) %>%
mutate_at(vars(starts_with("DATE")), ~ as.Date(., "%m/%d/%Y")) %>%
do(tibble(date=as.Date(unlist(.[,5:7]), origin="1970-01-01"), # in the event you wanted to keep the data less awful and have one observation per row, this preserves the Date class
reading=unlist(.[,2:4]))) %>%
do(tibble(slope=diff(.$reading)/unclass(diff(.$date))))
This is a bit of a "hacky" solution but if I understand your question correctly (some clarification may be needed), this should work in your case. Note, this is somewhat specific to your case since the column pairs are expected to be in the order you specified.
library(dplyr)
library(lubridate)
### Load Data
tdf <- read.table(header=TRUE, stringsAsFactors = FALSE, text = '
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 NA 9/12/1991 3/11/1993 NA
18303 0.79 NA NA 4/23/1992 NA NA
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991') %>% as_tibble()
#####################################
### Reshape the data by column pairs.
#####################################
### Function to reshape a single column pair
xform_data <- function(x) {
df<-data.frame(tdf[,'ID'],
names(tdf)[x],
tdf[,names(tdf)[x]],
tdf[,names(tdf)[x+3]], stringsAsFactors = FALSE)
names(df) <- c('ID', 'DateKey', 'Val', 'Date'); df
}
### Create a new data frame with the data in a deep format (i.e. reshaped)
### 'lapply' is used to reshape each pair of columns (date and value).
### 'lapply' returns a list of data frames (on df per pair) and 'bind_rows'
### combines them into one data frame.
newdf <-
bind_rows(lapply(2:4, function(x) {xform_data(x)})) %>%
mutate(Date = mdy(Date, tz='utc'))
#####################################
### Calculate the slopes per ID
### (Date is POSIXct here, so each slope is in FEV1 units per second)
#####################################
slopedf <-
newdf %>%
arrange(DateKey, Date) %>%
group_by(ID) %>%
do(slope = lm(Val ~ Date, data = .)$coefficients[[2]]) %>%
mutate(slope = as.vector(slope)) %>%
ungroup
slopedf
## # A tibble: 9 x 2
## ID slope
## <int> <dbl>
## 1 18105 -3.077620e-09
## 2 18200 -4.239588e-10
## 3 18303 NA
## 4 24204 -5.534095e-10
## 5 28102 -4.325210e-09
## 6 34104 1.690414e-09
## 7 43108 -2.490139e-09
## 8 103114 -4.645589e-09
## 9 114101 -1.924497e-09
##########################################
### Adding slope column to original data.
##########################################
> tdf %>% left_join(slopedf, by = 'ID')
## # A tibble: 9 x 8
## ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13 slope
## <int> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993 -3.077620e-09
## 2 18200 0.87 0.85 NA 9/12/1991 3/11/1993 <NA> -4.239588e-10
## 3 18303 0.79 NA NA 4/23/1992 <NA> <NA> NA
## 4 24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994 -5.534095e-10
## 5 28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992 -4.325210e-09
## 6 34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994 1.690414e-09
## 7 43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995 -2.490139e-09
## 8 103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996 -4.645589e-09
## 9 114101 0.73 0.59 0.60 6/25/1989 8/5/1990 8/24/1991 -1.924497e-09
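To make the units of the two answers easier to compare, here is a minimal base R sketch for ID 18105 only, with time measured in days since the first visit. The variable names are mine; the values are typed in from the question's table.

```r
# ID 18105 from the question
fev1  <- c(1.35, 1.25, 1.04)
dates <- as.Date(c("6/9/1990", "8/16/1991", "8/27/1993"), format = "%m/%d/%Y")

# Time in days since the first visit
days <- as.numeric(dates - dates[1])

# Slope between consecutive visits (first answer)
pairwise_slopes <- diff(fev1) / diff(days)

# One least-squares slope over all visits (second answer), here per day
fit <- lm(fev1 ~ days)
overall_slope <- unname(coef(fit)[2])

pairwise_slopes
overall_slope
```

A per-day slope is 86400 times a per-second slope, which is why the lm() results above look so much smaller than the pairwise ones.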