How to subtract first entry from last entry in grouped data - r

I would appreciate some help with the following task: from the data frame below (C), for each id I would like to subtract the first entry under column d_2 from the final entry and then store the results in another data frame containing the same ids. I can then merge this with my initial data frame. Please note that the subtraction has to be in this order (last entry minus first entry for each id).
Here is the code:
id <- c("A1", "A1", "B10","B10", "B500", "B500", "C100", "C100", "C100", "D40", "D40", "G100", "G100")
d_1 <- c( rep(1.15, 2), rep(1.44, 2), rep(1.34, 2), rep(1.50, 3), rep(1.90, 2), rep(1.59, 2))
set.seed(2)
d_2 <- round(runif(13, -1, 1), 2)
C <- data.frame(id, d_1, d_2)
id d_1 d_2
A1 1.15 -0.63
A1 1.15 0.40
B10 1.44 0.15
B10 1.44 -0.66
B500 1.34 0.89
B500 1.34 0.89
C100 1.50 -0.74
C100 1.50 0.67
C100 1.50 -0.06
D40 1.90 0.10
D40 1.90 0.11
G100 1.59 -0.52
G100 1.59 0.52
Desired result:
id2 <- c("A1", "B10", "B500", "C100", "D40", "G100")
difference <- c(1.03, -0.81, 0, 0.68, 0.01, 1.04)
diff_df <- data.frame(id2, difference)
id2 difference
A1 1.03
B10 -0.81
B500 0.00
C100 0.68
D40 0.01
G100 1.04
I attempted this by using ddply to obtain the first and last entries, but I'm really struggling with what to put as the function argument in the second call below to get the desired outcome.
C_1 <- ddply(C, .(id), function(x) x[c(1, nrow(x)), ])
ddply(C_1, .(patient), function )
To be honest, I'm not very familiar with plyr (the package that provides ddply); I got the code above from another post on Stack Exchange.
My original data is a groupedData object (from the nlme package), and I believe another way of approaching this is gapply, but again I'm struggling with the third argument, which is usually a function:
grouped_C <- groupedData(d_1 ~ d_2 | id, data = C, FUN = mean, labels = list( x = "", y = ""), units = list(""))
x1 <- gapply(grouped_C, "d_2", first_entry)
x2 <- gapply(grouped_C, "d_2", last_entry)
where first_entry and last_entry are functions to help me get the first and last entries.
I can then get the difference with x2 - x1. However, I'm not sure what to use as first_entry and last_entry in the code above (perhaps something to do with head or tail?).
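In fact, I suspect (though I'm not sure) that they could be as simple as something like this:
first_entry <- function(x) head(x, 1) # first value of a vector
last_entry  <- function(x) tail(x, 1) # last value of a vector
With these, x2 - x1 should give one difference per id.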
Any help would be much appreciated.

This can be done easily with dplyr; its first and last functions are very helpful for this task.
library(dplyr) # install the dplyr package if needed, then load it
diff_df <- C %>% # store the result of the pipeline in a new data frame, diff_df; the %>% operator chains operations together so you don't have to reference the data frame at each step
  group_by(id) %>% # group the whole data frame C by id
  summarize(difference = last(d_2) - first(d_2)) # for each id group, a one-line summary: the first entry of d_2 subtracted from the last entry of d_2
# id difference #this is the result stored in diff_df
#1 A1 1.03
#2 B10 -0.81
#3 B500 0.00
#4 C100 0.68
#5 D40 0.01
#6 G100 1.04
Edit note: updated post with %>% instead of %.% which is deprecated.
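For comparison, the same result can be had in base R; this is a sketch, not part of the original answer:
aggregate(d_2 ~ id, data = C, FUN = function(x) x[length(x)] - x[1])
# returns a data frame with columns id and d_2; rename d_2 to difference if desired
# merging back into the original data, as the question mentions, is then e.g.
# merged <- merge(C, diff_df, by = "id")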

If you have any singletons and they need to be left alone, then this will solve your problem. It's the same as docendo discimus's answer, but with an if-else component to deal with the singleton cases:
library(dplyr)
diff_df <- C %>%
group_by(id) %>%
summarize(difference = if(n() > 1) last(d_2) - first(d_2) else d_2)
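A quick illustration with a made-up singleton row appended (the id Z9 is hypothetical):
C2 <- rbind(C, data.frame(id = "Z9", d_1 = 1.00, d_2 = 0.50))
C2 %>%
  group_by(id) %>%
  summarize(difference = if (n() > 1) last(d_2) - first(d_2) else d_2)
# the singleton Z9 is left alone: its "difference" is simply its single d_2 value, 0.50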

Related

Manipulate list object into data frame

I have data such as this. I am using the survey package to produce the MEAN, SE and FREQ of each of the variables in the vector named vars. I am new to manipulating lists in R and would really appreciate help!
library(survey)
df <- data.frame(
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
vars <- c("weight","married","pens")
design <- svydesign(ids=~1, data=df, weights=~weight)
myfun <- function(x){
  means <- svymean(as.formula(paste0('~(', x, ')')), design, na.rm = T)
  table <- svytable(as.formula(paste0('~(', x, ')')), design)
  results <- list(svymean = means, svytable = table)
  return(results)
}
lapply(vars, myfun)
The output looks like this:
[[1]]
[[1]]$svymean
mean SE
weight 0.79791 0.1177
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
[[2]]
[[2]]$svymean
mean SE
married 0.91085 0.0717
[[2]]$svytable
married
0 1
0.46 4.70
[[3]]
[[3]]$svymean
mean SE
pens 0.46272 0.2255
[[3]]$svytable
pens
0 1
2.45 2.11
I want to extract/manipulate this list above to create a dataframe that looks more like this:
question mean SE sum_svytable
weight 0.797 0.1177 5.16
married 0.910 0.071 5.16
As you can see, the sum_svytable is the sum of the frequencies produced in the $svytable generated list for each variable. Even though this number is the same for each variable (5.16 for all) in my example, it is not the same in my dataset.
sum_svytable was derived like this:
output of myfun function for weight:
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
I simply summed the frequencies for each response:
sum_svytable (for weight) = 0.46 + 0.55 + 0.60 + 0.66 + 0.67 + 1.10 + 1.12 = 5.16
I don't mind how this result is arrived at, I just need it to be in a df!
Is this possible?
An option is to loop over the list of outputs from myfun, extract the svymean component, create a data.frame, add the column of sums from the svytable element, rbind the list elements, and create the question column from the row names:
out <- lapply(vars, myfun)
lst1 <- lapply(out, function(x)
  cbind(setNames(as.data.frame(x$svymean), c("mean", "SE")),
        sum_svytable = sum(x$svytable)))
out1 <- do.call(rbind, lst1)
out1$question <- row.names(out1)
row.names(out1) <- NULL
out1[c('question', 'mean', 'SE', 'sum_svytable')]
# question mean SE sum_svytable
#1 weight 0.7979070 0.1177470 5.16
#2 married 0.9108527 0.0716663 5.16
#3 pens 0.4627193 0.2254907 4.56
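A tidyverse alternative, in case it is useful (a sketch using purrr; survey::SE extracts the standard errors from the svystat object):
library(purrr)
map_dfr(out, ~ data.frame(
  question = names(.x$svymean),            # variable name stored in the svystat object
  mean = as.numeric(.x$svymean),           # point estimate
  SE = as.numeric(survey::SE(.x$svymean)), # standard error
  sum_svytable = sum(.x$svytable)          # sum of the weighted frequencies
))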

How can I get row-wise max based on condition of specific column in R dataframe?

I'm trying to get the maximum value BY ROW across several columns (climatic water deficit -- def_59_z_#) depending on how much time has passed (time since fire -- YEAR.DIFF). Here are the conditions:
If 1 year has passed: select the deficit value for the first year (def_59_z_1).
If 2 years: max deficit of first 2 years.
If 3 years: max of deficit of first 3 years.
If 4 years: max of deficit of first 4 years.
If 5 or more years: max of first 5 years.
However, I am unable to extract a row-wise max when I include a condition. There are several existing posts that address row-wise min and max (examples 1 and 2) and sd (example 3) -- but these don't use conditions. I've tried using apply but I haven't been able to find a solution when I have multiple columns involved as well as a conditional requirement.
The following code simply returns 3.5 in the new column def59_z_max15, which is the maximum value that occurs anywhere in the dataframe, except when YEAR.DIFF is 1, in which case def59_z_1 is directly returned. But for all the other conditions I want 0.98, 0.67, 0.7, 1.55, 1.28: values that reflect the row maximum of the specified columns. Link to sample data here. How can I achieve this?
I appreciate any/all suggestions!
data <- data %>%
  mutate(def59_z_max15 = ifelse(YEAR.DIFF == 1,
    (def59_z_1),
    ifelse(YEAR.DIFF == 2,
      max(def59_z_1, def59_z_2),
      ifelse(YEAR.DIFF == 3,
        max(def59_z_1, def59_z_2, def59_z_3),
        ifelse(YEAR.DIFF == 4,
          max(def59_z_1, def59_z_2, def59_z_3, def59_z_4),
          max(def59_z_1, def59_z_2, def59_z_3, def59_z_4, def59_z_5))))))
Throw this function into an apply-family function:
func <- function(x) {
  first.val <- x[1] # YEAR.DIFF for this row
  if (first.val < 5) {
    return(max(x[2:(first.val + 1)])) # max of the first YEAR.DIFF deficit columns
  } else {
    return(max(x[2:6])) # 5 or more years: max of all five deficit columns
  }
}
Your desired output should be obtained by:
apply(data, 1, func) # MARGIN = 1 applies func to each row
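One caveat: apply coerces its input to a matrix, so it is safest to pass only the numeric columns, in the order func expects. A sketch, using the column names from the sample data:
data$def59_z_max15 <- apply(data[c("YEAR.DIFF", paste0("def59_z_", 1:5))], 1, func)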
An option would be to get the pmax (row-wise max, vectorized) for each set of conditions separately in a loop (map): if the value of YEAR.DIFF is 1, select only def59_z_1; for 2, take the max of def59_z_1 and def59_z_2; and so on up to 5, the max of def59_z_1 through def59_z_5. Then coalesce the columns together and replace the remaining NAs with the pmax of all the def59_z columns.
library(tidyverse)
out <- map_dfc(1:5, ~
    df1 %>%
      select(seq_len(.x) + 1) %>%
      transmute(val = na_if((df1[["YEAR.DIFF"]] == .x) *
        pmax(!!! rlang::syms(names(.))), 0))) %>%
  transmute(def59_z_max15 = coalesce(!!! rlang::syms(names(.)))) %>%
  bind_cols(df1, .) %>%
  mutate(def59_z_max15 = case_when(is.na(def59_z_max15) ~
    pmax(!!! rlang::syms(names(.)[2:6])), TRUE ~ def59_z_max15))
head(out, 10)
# YEAR.DIFF def59_z_1 def59_z_2 def59_z_3 def59_z_4 def59_z_5 def59_z_max15
#1 5 0.25 -2.11 0.98 -0.07 0.31 0.98
#2 9 0.67 0.65 -0.27 0.52 0.26 0.67
#3 10 0.56 0.33 0.03 0.70 -0.09 0.70
#4 2 -0.34 1.55 -1.11 -0.40 0.94 1.55
#5 4 0.98 0.71 0.41 1.28 -0.14 1.28
#6 3 0.71 -0.17 1.70 -0.57 0.43 1.70
#7 4 -1.39 -1.71 -0.89 0.78 1.22 0.78
#8 4 -1.14 -1.46 -0.72 0.74 1.32 0.74
#9 2 0.71 1.39 1.07 0.65 0.29 1.39
#10 1 0.28 0.82 -0.64 0.45 0.64 0.28
data
df1 <- read.csv("https://raw.githubusercontent.com/CaitLittlef/random/master/data.csv")
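For readers on dplyr 1.0 or later, a simpler row-wise sketch of the same idea (this assumes the df1 above; it is not part of the original answer):
library(dplyr)
df1 %>%
  rowwise() %>%
  mutate(def59_z_max15 = max(c_across(def59_z_1:def59_z_5)[seq_len(min(YEAR.DIFF, 5))])) %>%
  ungroup()
# for each row, take the first min(YEAR.DIFF, 5) deficit values and return their max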

Return value in column 1 when value in column 2 exceeds 2 for 1st time

I have a dataframe called "new_dat" containing the time (days) in column t, and temperature data (and occasionally NA) in columns A - C (please see the example in the code below):
> new_dat
t A B C
1 0.00 0.82 0.88 0.46
2 0.01 0.87 0.94 0.52
3 0.02 NA NA NA
4 0.03 0.95 1.03 0.62
5 0.04 0.98 1.06 0.67
6 0.05 1.01 1.09 0.71
7 0.06 2.00 1.13 2.00
8 0.07 1.06 1.16 0.78
9 0.08 1.07 1.18 0.81
10 0.09 1.09 1.20 0.84
11 0.10 1.10 1.21 0.86
12 0.11 2.00 1.22 0.87
Here is a dput() of the dataframe:
structure(list(
  t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11),
  A = c(0.82, 0.870000000000001, NA, 0.949999999999999, 0.979999999999997, 1.01, 2, 1.06, 1.07, 1.09, 1.1, 2),
  B = c(0.879999999999999, 0.940000000000001, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.2, 1.21, 1.22),
  C = c(0.460000000000001, 0.520000000000003, NA, 0.619999999999997, 0.669999999999998, 0.709999999999997, 2, 0.780000000000001, 0.809999999999999, 0.84, 0.859999999999999, 0.87)),
  .Names = c("t", "A", "B", "C"), row.names = c(NA, 12L), class = "data.frame")
As output, I want a vector (list?) of the values of column t where the temperature reading from columns A-C >= 2 for the first time (and only the first time), or - if the temperature is never >= 2 - return the last time reading in column t (0.11 in my example). So 'A' would return the value 0.06 (and not 0.11), 'B' would have the value 0.11 and 'C' 0.06. I intended to use the vector generated to create a new dataframe something like this:
A B C
0.06 0.11 0.06
I'm inexperienced with R (and code in general) so, despite reading that looping can be inefficient (but not really understanding how to accomplish what I want without it), I tried to solve this by looping first by column and then by row, as follows:
# create a blank vector to add my results to
aer <- c()
# loop by column, then by row, adding values according to the if statement
for (c in 2:ncol(new_dat)) {
  for (r in 1:nrow(new_dat)) {
    if (!is.na(new_dat[r, c]) & new_dat[r, c] >= 2) {
      aer <- c(aer, new_dat$t[r])
    }
  }
}
This returns my vector, aer, as:
> aer
[1] 0.06 0.11 0.06
So it's returning both instances where 'A' is 2, and the one from column 'C'.
I don't know how to instruct the loop to stop and move to the next column after finding one instance where my 'if' statement is true. I also tried adding an 'else' to cover the situation where the temperature doesn't exceed 2:
else {
aer <- c(aer, new_dat$t[nrow(new_dat)])
But this did not work.
I would appreciate any help in completing the code, or suggestions for a better solution.
library(tidyverse)
new_dat %>%
  gather(col, temp, -t) %>% # reshape data to long format
  na.omit() %>% # remove rows with NAs
  group_by(col) %>% # for each original column
  summarise(v = ifelse(is.na(first(t[temp >= 2])), last(t), first(t[temp >= 2]))) %>% # return the first t with temp >= 2, or the last t if there is none
  spread(col, v) # reshape again to wide
# # A tibble: 1 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.06 0.11 0.06
This solution will create the dataframe for you automatically, instead of returning a vector for you to create the dataframe yourself.
Here is a two-step solution.
First get an index vector of the values you want, then use that index vector to subset the dataframe.
inx <- sapply(new_dat[-1], function(x) {
  w <- which(x >= 2)
  if (length(w)) min(w) else NROW(x)
})
new_dat[inx, 1]
#[1] 0.06 0.11 0.06
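To get the one-row data frame shown in the question, one possible follow-up (a sketch building on inx above):
as.data.frame(t(setNames(new_dat[inx, 1], names(new_dat)[-1])))
#      A    B    C
# 1 0.06 0.11 0.06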

Random sample in R when data is in long format

I need to randomly sample a dataset which is arranged in long format. In my dataset, each subject has 4 observations, so if I randomly sample individual rows I lose one or more observations per subject.
This is a simulated data for illustration purposes, my data is much bigger.
sub sex group dv1 dv2
P1 m A 0.66 0.94
P1 m B 0.98 0.26
P1 m C 0.02 0.03
P1 m D 0.60 0.30
P2 m A 0.92 0.99
P2 m B 0.82 0.09
P2 m C 0.44 0.67
P2 m D 0.53 0.80
P3 f A 0.29 0.22
P3 f B 0.46 0.20
P3 f C 0.37 0.77
P3 f D 0.76 0.54
P4 m A 0.28 0.99
P4 m B 0.16 0.57
P4 m C 0.46 0.75
P4 m D 0.28 0.21
In this example, I need to randomly select 2 males. I tried using the dplyr package (see below), but if I ask for a sample of 2, it just gives me 2 rows for sex = "m" and 2 rows for sex = "f": 4 randomly chosen rows in total. What I need is 8 rows, where 4 come from one male and 4 from another. Changing the grouping parameter to sub doesn't work, as it barks that there are only 2 levels in the group (actually, it would work in this toy example as there are 4 rows for each sub, but note that I am choosing something like 50 samples from a bigger dataset). Also, it would just give me 2 random rows for each sub, which is not what I need.
library(dplyr)
subset <- data %>%
  group_by(sex) %>%
  sample_n(2)
Please do not suggest to reshape the date to wide format and sample it there, as I know that I can do that. I am sure there must be a way to sample in long format.
I would sample from the subject IDs and then filter by those sampled IDs:
Look at all males:
male_subset <- data %>% filter(sex == "m")
Look for the unique male IDs:
male_IDs <- unique(male_subset$sub)
Sample from the unique IDs:
sampled_IDs <- sample(male_IDs, 2)
Now subset your data based on these sampled IDs:
data %>% filter(sub %in% sampled_IDs)
This should return all four rows for each of the 2 sampled individuals.
I'm not sure if I've quite understood what you want. Would this do it?
data %>% filter(sex == 'm') %>% filter(sub %in% sample(paste0('P',1:4), 2))
You'd have to change what's in the paste0 function for your real data, of course.
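A variant of the same idea that avoids hard-coding the IDs (a sketch):
library(dplyr)
data %>%
  filter(sex == "m") %>%
  filter(sub %in% sample(unique(sub), 2))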
In base R,
set.seed(1)
subset <- sample(data[data$sex == "m", ]$sub, 2)
data_subset <- data[data$sub %in% subset, ]
nrow(data_subset)
# [1] 8
Works, but not flashy.

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column is considered the new column 1. Then I remove the column that was most correlated from the set, so the data is decreased by one column. I repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, and so on. I am therefore wondering if there is an efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overheads during the loop, not from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- copy(df) # create a temporary copy in which to calculate the correlations
set(result, j = 1L, value = temp[[1]]) # the first column is the same
for (icol in 2:n) {
  mch <- match(c(max(cor(temp)[-1, 1])), cor(temp)[, 1]) # determine which column is most correlated with column 1
  set(x = result, i = NULL, j = icol, value = temp[[1]] + temp[[mch]]) # aggregate and place the result in the result data.table
  set(x = temp, i = NULL, j = 1L, value = result[[icol]]) # set the aggregate as the new 1st column
  set(x = temp, i = NULL, j = mch, value = NULL) # remove the used column
}
Try
# this assumes, as in the question, that temp starts as a copy of df and that result already holds df's first column
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i] # set the aggregate as the new 1st column
  temp[, maxcor] <- NULL   # remove the used column
}
The error was caused because, in the last iteration, subsetting temp with temp[, -1] yields a single vector; standard R behavior is to drop the class from data frame to vector in such cases, so sapply iterates over the vector's elements rather than over columns. The drop = FALSE argument prevents this.
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
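Putting the pieces together, a self-contained sketch (the data here is a made-up stand-in, not the asker's):
set.seed(123)
df <- as.data.frame(matrix(round(runif(30), 2), nrow = 6)) # 6 rows, 5 columns
n <- ncol(df)
temp <- df   # working copy; its first column accumulates the aggregation
result <- df # column 1 is already correct; columns 2..n are overwritten below
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE],
                                   function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor]
  temp[, 1] <- result[, i]
  temp[, maxcor] <- NULL
}
result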
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
