Arranging Columns in R

I have data of the form:
Department LengthAfter
1 A 8.42
2 B 10.93
3 D 9.98
4 A 10.13
5 B 10.54
6 C 7.82
7 A 9.55
8 D 12.53
9 C 7.87
I would like to make a new table or data frame in which the column headers are the departments (A, B, C, D) and the values under each column are the LengthAfter values corresponding to that department, e.g.
A B C D
8.42 10.93 7.82 9.98
Can anyone help with this? Thank you

Using the tidyverse, you can use pivot_wider to pivot your data into the desired form. Before that, arrange by Department, so that the columns appear in the order shown above and the LengthAfter values fill each column in their order of appearance.
library(tidyverse)
df %>%
  arrange(Department) %>%
  group_by(Department) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = "Department", values_from = "LengthAfter") %>%
  select(-rn)
Output
A B C D
<dbl> <dbl> <dbl> <dbl>
1 8.42 10.9 7.82 9.98
2 10.1 10.5 7.87 12.5
3 9.55 NA NA NA
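An alternative sketch, under the assumption of tidyr >= 1.1.0 (not part of the original answer): pivot_wider can sort the new columns itself via names_sort, which makes the arrange() step unnecessary.
df %>%
  group_by(Department) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = Department, values_from = LengthAfter,
              names_sort = TRUE) %>%   # sorts columns A, B, C, D alphabetically
  select(-rn)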

You can use the reshape2 package for this. dcast needs a row identifier within each department so the values are spread out rather than aggregated:
library(reshape2)
df_old$rn <- ave(df_old$LengthAfter, df_old$Department, FUN = seq_along)
df_new <- dcast(df_old, rn ~ Department, value.var = "LengthAfter")[-1]

Base-R
dept_length <- read.csv("/Users/usr/SO_Department_LengthAfter.tsv", sep="\t");
dl_list <- with(dept_length, tapply(LengthAfter, Department, `c`));
n.obs <- sapply(dl_list, length);
seq.max <- seq_len(max(n.obs));
sapply(dl_list, `[`, i = seq.max);
Returns:
A B C D
[1,] 8.42 10.93 7.82 9.98
[2,] 10.13 10.54 7.87 12.53
[3,] 9.55 NA NA NA
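The trick in the last line is that indexing a vector past its end with `[` returns NA, so the shorter department vectors are padded with NAs up to the length of the longest one.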
References:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html
How to convert a list consisting of vector of different lengths to a usable data frame in R? (Stack Overflow)
https://eeob-biodata.github.io/R-Data-Skills/05-split-apply-combine/

Related

tidyr::spread() with multiple keys and values

I assume this has been asked multiple times but I couldn't find the proper words to find a workable solution.
How can I spread() a data frame based on multiple keys for multiple values?
A simplified version of the data I'm working with (I have many more columns to spread, but only two keys: id and the time point of a given measurement) looks like this:
df <- data.frame(id = rep(1:10, 3),
                 time = rep(1:3, each = 10),
                 x = rnorm(n = 30),
                 y = rnorm(n = 30))
> head(df)
id time x y
1 1 1 -2.62671241 0.01669755
2 2 1 -1.69862885 0.24992634
3 3 1 1.01820778 -1.04754037
4 4 1 0.97561596 0.35216040
5 5 1 0.60367158 -0.78066767
6 6 1 -0.03761868 1.08173157
> tail(df)
id time x y
25 5 3 0.03621258 -1.1134368
26 6 3 -0.25900538 1.6009824
27 7 3 0.13996626 0.1359013
28 8 3 -0.60364935 1.5750232
29 9 3 0.89618748 0.0294315
30 10 3 0.14709567 0.5461084
What I'd like to have is a data frame populated like this:
one row per id, with a column for each combination of time point and measurement variable.
With the devel version of tidyr (tidyr_0.8.3.9000), we can use pivot_wider to reshape multiple value columns from long to wide format.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(time = str_c("time", time)) %>%
  pivot_wider(names_from = time, values_from = c("x", "y"), names_sep = "")
# A tibble: 10 x 7
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 -0.256 0.483 -0.254 -0.652 0.655 0.291
# 2 2 1.10 -0.596 -1.85 1.09 -0.401 -1.24
# 3 3 0.756 -2.19 -0.0779 -0.763 -0.335 -0.456
# 4 4 -0.238 -0.675 0.969 -0.829 1.37 -0.830
# 5 5 0.987 -2.12 0.185 0.834 2.14 0.340
# 6 6 0.741 -1.27 -1.38 -0.968 0.506 1.07
# 7 7 0.0893 -0.374 -1.44 -0.0288 0.786 1.22
# 8 8 -0.955 -0.688 0.362 0.233 -0.902 0.736
# 9 9 -0.195 -0.872 -1.76 -0.301 0.533 -0.481
#10 10 0.926 -0.102 -0.325 -0.678 -0.646 0.563
NOTE: The numbers are different as there was no set seed while creating the sample dataset
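(A small aside, not from the original answer: fixing the RNG state before building df makes examples like this reproducible.)
set.seed(123)  # any constant seed works; 123 is arbitrary
df <- data.frame(id = rep(1:10, 3), time = rep(1:3, each = 10),
                 x = rnorm(30), y = rnorm(30))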
Reshaping with multiple value variables can best be done with dcast from data.table or reshape from base R.
library(data.table)
out <- dcast(setDT(df), id ~ paste0("time", time), value.var = c("x", "y"), sep = "")
out
# id xtime1 xtime2 xtime3 ytime1 ytime2 ytime3
# 1: 1 0.4334921 -0.5205570 -1.44364515 0.49288757 -1.26955148 -0.83344256
# 2: 2 0.4785870 0.9261711 0.68173681 1.24639813 0.91805332 0.34346260
# 3: 3 -1.2067665 1.7309593 0.04923993 1.28184341 -0.69435556 0.01609261
# 4: 4 0.5240518 0.7481787 0.07966677 -1.36408357 1.72636849 -0.45827205
# 5: 5 0.3733316 -0.3689391 -0.11879819 -0.03276689 0.91824437 2.18084692
# 6: 6 0.2363018 -0.2358572 0.73389984 -1.10946940 -1.05379502 -0.82691626
# 7: 7 -1.4979165 0.9026397 0.84666801 1.02138768 -0.01072588 0.08925716
# 8: 8 0.3428946 -0.2235349 -1.21684977 0.40549497 0.68937085 -0.15793111
# 9: 9 -1.1304688 -0.3901419 -0.10722222 -0.54206830 0.34134397 0.48504564
#10: 10 -0.5275251 -1.1328937 -0.68059800 1.38790593 0.93199593 -1.77498807
Using reshape we could do
# setDF(df) # in case df is a data.table now
reshape(df, idvar = "id", timevar = "time", direction = "wide")
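By default reshape() names the new columns x.1, x.2, ..., y.3 (v.names and the time value joined by sep). Assuming that naming rule, passing sep = "time" should reproduce the "xtime1" style of the answers above:
reshape(df, idvar = "id", timevar = "time", direction = "wide", sep = "time")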
Your input data frame is not tidy. You should use gather to make it so.
gather(df, key, value, -id, -time) %>%
  mutate(key = paste0(key, "time", time)) %>%
  select(-time) %>%
  spread(key, value)
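Since gather()/spread() are now superseded, here is a sketch of the same idea with their tidyr >= 1.0.0 replacements (my translation, not part of the original answer):
df %>%
  pivot_longer(c(x, y), names_to = "key", values_to = "value") %>%  # stack x and y into key/value pairs
  mutate(key = paste0(key, "time", time)) %>%
  select(-time) %>%
  pivot_wider(names_from = key, values_from = value)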

Group values with identical ID into columns without summarizing them in R

I have a dataframe that looks like this, but with a lot more Proteins
Protein z
Irak4 -2.46
Irak4 -0.13
Itk -0.49
Itk 4.22
Itk -0.51
Ras 1.53
For further operations I need the data to be grouped by protein name into columns like this.
Irak4 Itk Ras
-2.46 -0.49 1.53
-0.13 4.22 NA
NA -0.51 NA
I tried different packages like dplyr or reshape, but did not manage to transform the data into the desired format.
Is there any way to achieve this? I think the missing datapoints for some Proteins are the main problem here.
I am quite new to R, so my apologies if I am missing an obvious solution.
Here is an option with tidyverse
library(tidyverse)
DF %>%
  group_by(Protein) %>%
  mutate(idx = row_number()) %>%
  spread(Protein, z) %>%
  select(-idx)
# A tibble: 3 x 3
# Irak4 Itk Ras
# <dbl> <dbl> <dbl>
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
Before we spread the data, we need to create unique identifiers.
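For newer tidyr versions, the same unique-identifier idea works with pivot_wider; a sketch assuming tidyr >= 1.0.0 (not part of the original answer):
DF %>%
  group_by(Protein) %>%
  mutate(idx = row_number()) %>%  # unique id within each protein
  ungroup() %>%
  pivot_wider(names_from = Protein, values_from = z) %>%
  select(-idx)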
In base R you could use unstack first, which gives you a named list of vectors containing the values from the z column.
Use lapply to iterate over that list and pad the vectors with NAs via the `length<-` function, so that all vectors have equal length. Then we can call data.frame.
lst <- unstack(DF, z ~ Protein)
data.frame(lapply(lst, `length<-`, max(lengths(lst))))
# Irak4 Itk Ras
#1 -2.46 -0.49 1.53
#2 -0.13 4.22 NA
#3 NA -0.51 NA
data
DF <- structure(list(Protein = c("Irak4", "Irak4", "Itk", "Itk", "Itk",
"Ras"), z = c(-2.46, -0.13, -0.49, 4.22, -0.51, 1.53)), .Names = c("Protein",
"z"), class = "data.frame", row.names = c(NA, -6L))
library(data.table)
dcast(setDT(DF), rowid(Protein) ~ Protein, value.var = 'z')
Protein Irak4 Itk Ras
1: 1 -2.46 -0.49 1.53
2: 2 -0.13 4.22 NA
3: 3 NA -0.51 NA
In base R you can do:
data.frame(sapply(a <- unstack(DF, z ~ Protein), `length<-`, max(lengths(a))))
Irak4 Itk Ras
1 -2.46 -0.49 1.53
2 -0.13 4.22 NA
3 NA -0.51 NA
Or using reshape:
reshape(transform(DF, gr = ave(z, Protein, FUN = seq_along)),
        v.names = 'z', timevar = 'Protein', idvar = 'gr', dir = 'wide')
gr z.Irak4 z.Itk z.Ras
1 1 -2.46 -0.49 1.53
2 2 -0.13 4.22 NA
5 3 NA -0.51 NA

How to divide dataset into balanced sets based on multiple variables

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data<-matrix(runif(4000, min=0, max=10), nrow=500, ncol=8 )
> colnames(data)<-c("A","B","C","D","E","F","G","H")
The sets, each containing for example 20 rows, need to be balanced across multiple variables, so that each subset ends up with similar means of B, C, and D compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, and so on. That's a lot of constraints. There probably are, however, algorithms that could find the set-membership assignment that minimizes the differences in set means.
However, if you can separately take 20 observations from each column without regard to which rows they came from, then here's one option:
# Group into sets with same means
same_means = data %>%
  gather(key, value) %>%
  arrange(value) %>%
  group_by(key) %>%
  mutate(set = c(rep(1:25, 10), rep(25:1, 10)))  # serpentine assignment over the sorted values keeps set means similar
# Check means by set for each column
same_means %>%
  group_by(key, set) %>%
  summarise(mean = mean(value)) %>%
  spread(key, mean) %>%
  as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
  mutate(set = sample(rep(1:(500/20), each = 20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>%
  group_by(set) %>%
  summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
  mutate(set = sample(rep(1:ceiling(500/20), each = 20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows per set.
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)
# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)
# Standardize all columns and calculate row sums
data_std <- data %>%
  dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
  dplyr::mutate(total = rowSums(across(where(is.numeric))))
# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
  groupdata2::fold(k = 25, num_col = "total")  # k = 500/20 = 25
# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds
# Check the means
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
> group A B C D E F G H
> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 4.48 5.05 4.80 5.65 5.04 4.60 5.12 4.85
> 2 2 5.57 5.17 3.21 5.46 4.46 5.89 5.06 4.79
> 3 3 4.33 6.02 4.57 6.18 4.76 3.79 5.94 3.71
> 4 4 4.51 4.62 4.62 5.27 4.65 5.41 5.26 5.23
> 5 5 4.55 5.10 4.19 5.41 5.28 5.39 5.57 4.23
> 6 6 4.82 4.74 6.10 4.34 4.82 5.08 4.89 4.81
> 7 7 5.88 4.49 4.13 3.91 5.62 4.75 5.46 5.26
> 8 8 4.11 5.50 5.61 4.23 5.30 4.60 4.96 5.35
> 9 9 4.30 3.74 6.45 5.60 3.56 4.92 5.57 5.32
> 10 10 5.26 5.50 4.35 5.29 4.53 4.75 4.49 5.45
> # … with 15 more rows
# Check the standard deviations of the means
# Could be used to compare methods
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
> A B C D E F G H
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best to compare the means and mean variances (or standard deviations as above) of different methods on the standardized data though. In that case, one could calculate the sum of the variances and minimize it.
data_std %>%
  dplyr::select(-total) %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd)) %>%
  sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){
  # Extract the variables of interest
  # We assume these are numeric but we could add a check
  data_to_balance <- data[, cols]
  # Standardize all columns
  # And calculate rowwise sums
  data_std <- data_to_balance %>%
    dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
    dplyr::mutate(total = rowSums(across(where(is.numeric))))
  # Create `num_tries` unique numerically balanced splits
  data_std <- data_std %>%
    groupdata2::fold(
      k = k,
      num_fold_cols = num_tries,
      num_col = "total"
    )
  # The new fold column names: ".folds_1", ".folds_2", etc.
  fold_col_names <- paste0(".folds_", seq_len(num_tries))
  # Remove total column
  data_std <- data_std %>%
    dplyr::select(-total)
  # Calculate score for each split
  # This could probably be done more efficiently without a for loop
  variance_scores <- c()
  for (fcol in fold_col_names){
    score <- data_std %>%
      dplyr::group_by(!!as.name(fcol)) %>%
      dplyr::summarise(across(where(is.numeric), mean)) %>%
      dplyr::summarise(across(where(is.numeric), sd)) %>%
      sum()
    variance_scores <- append(variance_scores, score)
  }
  # Get the fold column with the lowest score
  lowest_fcol_index <- which.min(variance_scores)
  best_fcol <- fold_col_names[[lowest_fcol_index]]
  # Add the best fold column / grouping factor to the original data
  data[["group"]] <- data_std[[best_fcol]]
  # Return the original data and the score of the best fold column
  list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
  data = data,
  cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
  k = 25,
  num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
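For reference, that unbalanced variant corresponds to dropping num_col from the fold() call inside the function:
data_std <- data_std %>%
  groupdata2::fold(k = k, num_fold_cols = num_tries)  # random splits, no numerical balancing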
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
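As a small illustration of those options (the column names "diagnosis" and "participant_id" are hypothetical, purely for the sketch):
df_folded <- groupdata2::fold(
  df,
  k = 5,
  cat_col = "diagnosis",      # balance the levels of a categorical column across folds
  id_col = "participant_id"   # keep all rows with the same ID in the same fold
)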

Max-If Excel replication in R

I'm trying to replicate the max if function from Excel in R.
ksu price max
9144037 3.11 3.11
8448749 4.19 5.24
9649391 0 8.39
8448749 4.19 5.24
8448749 4.19 5.24
8448749 4.19 5.24
8448749 4.19 5.24
9649391 8.39 8.39
8448749 5.24 5.24
9144037 1.99 3.11
9144037 1.99 3.11
If I were doing it in Excel, I'd use MAX(IF()). This code is supposed to look at the max price for each ksu and return that max value in the last column. I've tried this:
max(price[ksu == ksu])
But it doesn't give me the desired output. It only returns one max value regardless of the ksu.
Assuming you have a data.frame called df you could easily use the ave function to get what you want. An example:
> df <- data.frame(grp = c('a','a','b','b'), vals = 1:4)
> df
grp vals
1 a 1
2 a 2
3 b 3
4 b 4
> # Returns a vector
> ave(df$vals, df$grp, FUN = max)
[1] 2 2 4 4
> # So we can store it back into the data.frame if we want
> df$max <- ave(df$vals, df$grp, FUN = max)
> df
grp vals max
1 a 1 2
2 a 2 2
3 b 3 4
4 b 4 4
So using your variable names (but still assuming the data.frame is df):
df$max <- ave(df$price, df$ksu, FUN = max)
Suppose your data is in a data.frame called dat; then we can use the dplyr package:
library(dplyr)
dat %>%
  group_by(ksu) %>%
  mutate(max = max(price))
ksu price max
<int> <dbl> <dbl>
1 9144037 3.11 3.11
2 8448749 4.19 5.24
3 9649391 0.00 8.39
...
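For completeness, a data.table sketch of the same grouped maximum (my addition; assumes the data.table package and the data frame dat from above):
library(data.table)
setDT(dat)[, max := max(price), by = ksu]  # adds the per-ksu max by reference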

Extracting the slope for individual observation

I'm a newbie in R. I have a data set with 3 lung function measurements taken on 3 corresponding dates for each observation, given below. I would like to extract the slope for each observation (decline in lung function) using R and insert it in a new column.
1. How should I approach the problem?
2. Is my data set arranged in right format?
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 9/12/1991 3/11/1993
18303 0.79 4/23/1992
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991
For example, for the 1st observation, slope = 0.0003.
Thanks.
If I understood the question, I think you want the slope between each set of visits:
library(dplyr)
group_by(df, ID) %>%
  mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
  do(data_frame(slope = diff(unlist(.[, 2:4])) / diff(unlist(.[, 5:7])),
                after_visit = 1 + (1:length(slope))))
## Source: local data frame [18 x 3]
## Groups: ID [9]
##
## ID slope after_visit
## <int> <dbl> <dbl>
## 1 18105 -2.309469e-04 2
## 2 18105 -2.830189e-04 3
## 3 18200 -3.663004e-05 2
## 4 18200 NA 3
## 5 18303 NA 2
## 6 18303 NA 3
## 7 24204 -3.484321e-04 2
## 8 24204 6.745363e-05 3
## 9 28102 -5.639098e-04 2
## 10 28102 -2.359882e-04 3
## 11 34104 2.594810e-04 2
## 12 34104 -2.747253e-05 3
## 13 43108 -4.433498e-04 2
## 14 43108 -1.098901e-04 3
## 15 103114 -3.937008e-04 2
## 16 103114 -4.123711e-04 3
## 17 114101 -3.448276e-04 2
## 18 114101 2.604167e-05 3
Alternate munging:
group_by(df, ID) %>%
  mutate_at(vars(starts_with("DATE")), funs(as.Date(., "%m/%d/%Y"))) %>%
  do(data_frame(date = as.Date(unlist(.[, 5:7]), origin = "1970-01-01"), # keeps one observation per row and preserves the Date class
                reading = unlist(.[, 2:4]))) %>%
  do(data_frame(slope = diff(.$reading) / unclass(diff(.$date))))
This is a bit of a "hacky" solution, but if I understand your question correctly (some clarification may be needed), it should work in your case. Note that this is somewhat specific to your case, since the column pairs are expected to be in the order you specified.
library(dplyr)
library(lubridate)
### Load Data
tdf <- read.table(header=TRUE, stringsAsFactors = FALSE, text = '
ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13
18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993
18200 0.87 0.85 NA 9/12/1991 3/11/1993 NA
18303 0.79 NA NA 4/23/1992 NA NA
24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994
28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992
34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994
43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995
103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996
114101 0.73 0.59 0.6 6/25/1989 8/5/1990 8/24/1991') %>% tbl_df
#####################################
### Reshape the data by column pairs.
#####################################
### Function to reshape a single column pair
xform_data <- function(x) {
  df <- data.frame(tdf[, 'ID'],
                   names(tdf)[x],
                   tdf[, names(tdf)[x]],
                   tdf[, names(tdf)[x + 3]], stringsAsFactors = FALSE)
  names(df) <- c('ID', 'DateKey', 'Val', 'Date')
  df
}
### Create a new data frame with the data in a deep format (i.e. reshaped)
### 'lapply' is used to reshape each pair of columns (date and value).
### 'lapply' returns a list of data frames (one df per pair) and 'bind_rows'
### combines them into one data frame.
newdf <-
bind_rows(lapply(2:4, function(x) {xform_data(x)})) %>%
mutate(Date = mdy(Date, tz='utc'))
#####################################
### Calculate the slopes per ID
#####################################
slopedf <-
  newdf %>%
  arrange(DateKey, Date) %>%
  group_by(ID) %>%
  do(slope = lm(Val ~ Date, data = .)$coefficients[[2]]) %>%
  mutate(slope = as.vector(slope)) %>%
  ungroup
slopedf
## # A tibble: 9 x 2
## ID slope
## <int> <dbl>
## 1 18105 -3.077620e-09
## 2 18200 -4.239588e-10
## 3 18303 NA
## 4 24204 -5.534095e-10
## 5 28102 -4.325210e-09
## 6 34104 1.690414e-09
## 7 43108 -2.490139e-09
## 8 103114 -4.645589e-09
## 9 114101 -1.924497e-09
##########################################
### Adding slope column to original data.
##########################################
tdf %>% left_join(slopedf, by = 'ID')
## # A tibble: 9 x 8
## ID FEV1_Date11 FEV1_Date12 FEV1_Date13 DATE11 DATE12 DATE13 slope
## <int> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 18105 1.35 1.25 1.04 6/9/1990 8/16/1991 8/27/1993 -3.077620e-09
## 2 18200 0.87 0.85 NA 9/12/1991 3/11/1993 <NA> -4.239588e-10
## 3 18303 0.79 NA NA 4/23/1992 <NA> <NA> NA
## 4 24204 4.05 3.95 3.99 6/8/1992 3/22/1993 11/5/1994 -5.534095e-10
## 5 28102 1.19 1.04 0.96 10/31/1990 7/24/1991 6/27/1992 -4.325210e-09
## 6 34104 1.03 1.16 1.15 7/25/1992 12/8/1993 12/7/1994 1.690414e-09
## 7 43108 0.92 0.83 0.79 6/23/1993 1/12/1994 1/11/1995 -2.490139e-09
## 8 103114 2.43 2.28 2.16 6/5/1994 6/21/1995 4/7/1996 -4.645589e-09
## 9 114101 0.73 0.59 0.60 6/25/1989 8/5/1990 8/24/1991 -1.924497e-09
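One caveat worth adding (my note, not from the original answer): because mdy(Date, tz = 'utc') returns a POSIXct, the lm() slopes above are in FEV1 units per second. To compare with the diff()-based answer, convert to per-day units:
slopedf %>% mutate(slope_per_day = slope * 86400)
# e.g. -3.077620e-09 * 86400 is roughly -2.659e-04 per day for ID 18105,
# in line with the per-visit slopes from the first answer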
