Manipulate list object into data frame - r

library(survey)
I have data such as this. I am using the survey package to produce the MEAN, SE and FREQ of each variable in the vector named vars. I am new to manipulating lists in R and would really appreciate help!
df <- data.frame(
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
vars <- c("weight","married","pens")
design <- svydesign(ids=~1, data=df, weights=~weight)
myfun <- function(x){
means <- svymean(as.formula(paste0('~(', x, ')')), design, na.rm = TRUE)
table <- svytable(as.formula(paste0('~(', x, ')')), design)
results <- list(svymean = means, svytable = table)
return(results)
}
lapply(vars, myfun)
The output looks like this:
[[1]]
[[1]]$svymean
mean SE
weight 0.79791 0.1177
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
[[2]]
[[2]]$svymean
mean SE
married 0.91085 0.0717
[[2]]$svytable
married
0 1
0.46 4.70
[[3]]
[[3]]$svymean
mean SE
pens 0.46272 0.2255
[[3]]$svytable
pens
0 1
2.45 2.11
I want to extract/manipulate this list above to create a dataframe that looks more like this:
question mean SE sum_svytable
weight 0.797 0.1177 5.16
married 0.910 0.071 5.16
As you can see, the sum_svytable is the sum of the frequencies produced in the $svytable generated list for each variable. Even though this number is the same for each variable (5.16 for all) in my example, it is not the same in my dataset.
sum_svytable was derived like this:
output of myfun function for weight:
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
I simply summed the frequencies for each response:
sum_svytable(for weight) = 0.46 +0.55+ 0.60+ 0.66+ 0.67+ 1.10+ 1.12
I don't mind how this result is arrived at, I just need it to be in a df!
Is this possible?

One option is to loop over the list returned by 'myfun': extract the 'svymean' component and convert it to a data.frame, add a column holding the sum of the 'svytable' element, rbind the list elements, and create the 'question' column from the row names.
out <- lapply(vars, myfun)
lst1 <- lapply(out, function(x)
cbind(setNames(as.data.frame(x$svymean), c("mean", "SE")),
sum_svytable = sum(x$svytable)))
out1 <- do.call(rbind, lst1)
out1$question <- row.names(out1)
row.names(out1) <- NULL
out1[c('question', 'mean', 'SE', 'sum_svytable')]
# question mean SE sum_svytable
#1 weight 0.7979070 0.1177470 5.16
#2 married 0.9108527 0.0716663 5.16
#3 pens 0.4627193 0.2254907 4.56
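The same list-to-data-frame pattern can be tried without the survey package. The following is only a minimal sketch in base R. It assumes a hypothetical stand-in list, out_mock, with made-up field names (name, mean, se, freq) in place of the real output of lapply(vars, myfun); the numbers are the rounded values printed in the question.

```r
# Hypothetical stand-in for lapply(vars, myfun): one element per variable.
# The field names (name, mean, se, freq) are assumptions for illustration only.
out_mock <- list(
  list(name = "weight", mean = 0.79791, se = 0.1177,
       freq = c(0.46, 0.55, 0.60, 0.66, 0.67, 1.10, 1.12)),
  list(name = "married", mean = 0.91085, se = 0.0717,
       freq = c(0.46, 4.70))
)

# Build a one-row data frame per list element, then stack the rows.
rows <- lapply(out_mock, function(x) {
  data.frame(question = x$name, mean = x$mean, SE = x$se,
             sum_svytable = sum(x$freq))
})
res <- do.call(rbind, rows)
print(res)
```

Building one small data frame per element and stacking them with do.call(rbind, ...) keeps each step easy to inspect; for long lists a single preallocated data frame may be faster.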


Convert one string column in three columns

I am trying to separate values for the estimates and CIs into three columns, so that the column with info of the type 99.99[-99.9,99.9] is converted into three separated columns.
Please consider the data:
out <-
structure(list(name = c("total_gray_vol_0_to_psychosis_24", "total_gray_vol_24_to_psychosis_48",
"psychosis_0_to_total_gray_vol_24", "psychosis_24_to_total_gray_vol_48"
), Std.Estimate = c(0.304045656442265, 1.48352171485462, 0.673583361513608,
0.703098685562618), Std.SE = c(0.239964279466103, 2.72428816136731,
0.112111316151443, 0.14890331153936), CI = c("0.3 [-0.17, 0.77]",
"1.48 [-3.86, 6.82]", "0.67 [0.45, 0.89]", "0.7 [0.41, 0.99]"
)), class = "data.frame", row.names = c(NA, -4L))
The farthest I got was to extract the first digit with:
library(stringr)
str_match(out$CI, pattern= "([[0-9]+]*)([[0-9]+]*)([[0-9]+]*)")
But this is not working, as it is returning only the first digits, and for some reason four columns.
How do I split the column CI into three columns (estimate, lower, upper) correctly?
You could also use tidyr::extract for this purpose as follows. Also note that in regex argument you need to define as many capturing groups as the length of into argument.
library(tidyr)
out %>%
extract(CI, c('estimate', 'lower', 'upper'), '([-\\d.]+)\\s+\\[([-\\d.]+)\\W+([-\\d.]+)\\]')
name Std.Estimate Std.SE estimate lower upper
1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.3 -0.17 0.77
2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.7 0.41 0.99
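To see how capturing groups map to columns, here is a small base R sketch using regexec() and regmatches(). The pattern is an illustrative assumption written for this example, not the one from the answer above. The first element of the result is the full match and the remaining elements are the groups, which is also why str_match() in the question returned four columns for three groups.

```r
# One CI string taken from the question's data.
ci <- "1.48 [-3.86, 6.82]"

# Three capturing groups: estimate, lower bound, upper bound.
# (Illustrative pattern; an assumption for this sketch.)
pattern <- "([-0-9.]+)\\s+\\[([-0-9.]+),\\s*([-0-9.]+)\\]"

m <- regmatches(ci, regexec(pattern, ci))[[1]]
# m[1] is the whole match, m[2:4] are the three captured groups.
parts <- as.numeric(m[2:4])
names(parts) <- c("estimate", "lower", "upper")
print(parts)
```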
Here is an option using tidyr::separate
library(dplyr)
library(tidyr)
library(stringr)
out %>%
separate(CI, c("estimate", "lower", "upper"), sep = "\\s|[|]") %>%
mutate(across(
c(estimate, lower, upper),
~ .x %>% str_remove_all("\\[|\\]|,|\\s") %>% as.numeric()))
# name Std.Estimate Std.SE estimate lower upper
#1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.30 -0.17 0.77
#2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
#3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
#4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.70 0.41 0.99
First, split the entries on whitespace, then remove the brackets and commas from the resulting new columns and coerce them to numeric.
Using base R
out <- cbind(out, read.table(text = gsub("[][]|,", "", out$CI),
header = FALSE, col.names = c("estimate", "lower", "upper")))
out$CI <- NULL
out
Output:
name Std.Estimate Std.SE estimate lower upper
1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.30 -0.17 0.77
2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.70 0.41 0.99
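Another base R possibility, sketched here under the assumption that every entry contains exactly three numbers, is to pull out all numeric tokens with gregexpr() and regmatches() instead of stripping the punctuation. The vector ci below is a two-element excerpt of the CI column from the question.

```r
# Two CI strings taken from the question's data.
ci <- c("0.3 [-0.17, 0.77]", "1.48 [-3.86, 6.82]")

# Find every (optionally negative) decimal number in each string.
tokens <- regmatches(ci, gregexpr("-?[0-9]+\\.?[0-9]*", ci))

# Assumes exactly three numbers per entry; otherwise rbind would misalign.
m <- do.call(rbind, lapply(tokens, as.numeric))
colnames(m) <- c("estimate", "lower", "upper")
print(m)
```

The resulting matrix can be attached to the original data with cbind(), in the same way as the read.table() result above.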

Find the optimal way to calculate the highest possible Pearson correlation for millions of combinations in R

I am working on a large dataset. However I will start with a small example to illustrate what I am trying to achieve.
I have the following vectors:
season <- c("2019")
round <- c("1")
team <- c("Team A", "Team B", "Team C")
margin <- c(33, 56, 63)
score_A <- c(0.330, 0.256, 0.118)
score_B <- c(0.584, 0.176, 0.342)
score_C <- c(0.118, 0.193, 0.286)
And I create a data frame like so:
df1 <- data.frame(season, round, team, score_A, score_B, score_C, margin)
I then apply weightings to each of the scores like:
df1$score_A <- df1$score_A * 0.25
df1$score_B <- df1$score_B * 0.5
df1$score_C <- df1$score_C * 0.75
I then sum all the scores and create a team total score:
df1$score_total <- df1$score_A + df1$score_B + df1$score_C
library(dplyr)
df1 <- df1%>%group_by(season, round, team)%>%dplyr::mutate(score_Team_Total=sum(score_total))
I know that I can then calculate the Pearson like so:
> cor(df1$margin, df1$score_Team_Total, method = "pearson")
[1] -0.5505451
Although this does not give me a line by line return, I am not quite sure yet how to calculate that.
However this is where it starts to get tricky.
I have a number of weightings that I would like to apply to each of the scores like so:
weightings <- c(0.25,0.5,0.75,1,1.25,1.5,2,2.5,3)
And I have a number of scores for each weighting (score_A right through to score_R).
The first combination would be:
df1$score_A <- df1$score_A * 0.25
df1$score_B <- df1$score_B * 0.25
df1$score_C <- df1$score_C * 0.25
The second combination would be:
df1$score_A <- df1$score_A * 0.25
df1$score_B <- df1$score_B * 0.25
df1$score_C <- df1$score_C * 0.5
The third combination would be:
df1$score_A <- df1$score_A * 0.25
df1$score_B <- df1$score_B * 0.5
df1$score_C <- df1$score_C * 0.5
And so on.
But how can I get the Pearson correlation for each combination and return the highest possible Pearson?
I know there will be millions of combinations, as I ran this:
> length(permutations(7, 9, repeats.allowed = TRUE))
[1] 363182463
But I have 9 different variables in my weightings (0.25,0.5,0.75,1,1.25,1.5,2,2.5,3) and 18 different scores (score_A through to score_R).
So when I tried:
> length(permutations(9, 18, repeats.allowed = TRUE))
I received this error:
Error: cannot allocate vector of size 73.7 Gb
So I know the number is going to be very large.
I need to apply each combination of weightings to the scores, then create the totals and calculate the Pearson.
A dataframe or list with the results would be too large I assume, so is there a way to return the optimal combination? The output would look something like:
score_A score_B score_C pearson
weighting 0.25 0.50 0.25 0.63
I am still new to R and learning so I am not quite sure where to go from here.
You should realize that you are trying to explore 9^18 permutations, that is:
options(scipen = 999)
9^18
# [1] 150094635296999136
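What about exploring a subset of them? The following code draws 18^7 random weights and arranges them into a matrix with 18 columns, which gives 18^6 random combinations of your weightings (one combination per row):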
set.seed(1)
n_scores <- 18
p <- 7
aux <- matrix(sample(weightings, n_scores^p, replace = TRUE), ncol = n_scores)
# First combination
aux[1,]
[1] 3.00 2.00 0.50 1.00 1.25 2.00 1.50 2.50 0.25 0.75 3.00 2.50 1.25 3.00
[15] 0.75 2.50 0.50 0.50
Then, you could repeat this smaller exploration several times and look at the similarity of the several optimal combinations to gain some insights.
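If storing every sampled combination is too expensive, one possibility is to keep only the best combination seen so far. The sketch below is an assumption-laden illustration, not part of the original answer: it uses a four-team toy data set with three scores, an arbitrary 1000 random draws, and the hypothetical names scores and best.

```r
set.seed(1)

# Toy data: four teams, three scores (assumed values for illustration).
margin <- c(33, 56, 63, 50)
scores <- cbind(score_A = c(0.330, 0.256, 0.118, 0.2),
                score_B = c(0.584, 0.176, 0.342, 0.15),
                score_C = c(0.118, 0.193, 0.286, 0.2))
weightings <- c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 2, 2.5, 3)

# Running best: only one combination is held in memory at any time.
best <- list(cor = -Inf, w = NULL)
for (i in seq_len(1000)) {
  w <- sample(weightings, ncol(scores), replace = TRUE)
  r <- cor(margin, drop(scores %*% w))   # weighted total vs. margin
  if (!is.na(r) && r > best$cor) best <- list(cor = r, w = w)
}
names(best$w) <- colnames(scores)
print(best)
```

A random search like this does not guarantee the global optimum; it only reports the best of the combinations that happened to be sampled.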
Edit upon the comment by @Michael:
First, I modify your toy example to have an extra row:
season <- c("2019")
round <- c("1")
team <- c("Team A", "Team B", "Team C", "Team D")
margin <- c(33, 56, 63, 50)
score_A <- c(0.330, 0.256, 0.118, 0.2)
score_B <- c(0.584, 0.176, 0.342, 0.15)
score_C <- c(0.118, 0.193, 0.286, 0.2)
df1 <- data.frame(season, round, team, score_A, score_B, score_C, margin)
Then, I generate 9 sets of weights:
weightings <- c(0.25,0.5,0.75,1,1.25,1.5,2,2.5,3)
set.seed(1)
n_scores <- 3
p <- 3
aux1 <- matrix(sample(weightings, n_scores^p, replace = TRUE), ncol = n_scores)
colnames(aux1) <- c("score_A", "score_B", "score_C")
Finally, I perform the main operation
aux2 <- cbind(df1$score_A, df1$score_B, df1$score_C)
df2 <- data.frame(aux1,
pearson = c(cor(df1$margin, apply(aux1, 1, function(x) rowSums(t(x*t(aux2)))))))
df2
# score_A score_B score_C pearson
# 1 3.00 1.25 1.25 -0.8473964
# 2 1.00 1.25 1.25 -0.6385250
# 3 2.00 1.50 0.50 -0.8222945
# 4 0.25 2.00 3.00 -0.2510155
# 5 0.50 3.00 0.25 -0.6804298
# 6 2.00 1.25 1.00 -0.8025296
# 7 0.50 1.25 0.75 -0.6260844
# 8 0.75 3.00 1.50 -0.6088807
# 9 0.25 3.00 1.50 -0.5591034
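For a case this small (three scores, nine weightings) every combination can be enumerated, since 9^3 is only 729. The following sketch assumes the same four-team toy data and uses expand.grid() with one matrix product; the names scores, grid, and totals are hypothetical. This does not scale to 18 scores, where 9^18 combinations are far too many to enumerate.

```r
# Toy data from the modified example (four teams, three scores).
margin <- c(33, 56, 63, 50)
scores <- cbind(score_A = c(0.330, 0.256, 0.118, 0.2),
                score_B = c(0.584, 0.176, 0.342, 0.15),
                score_C = c(0.118, 0.193, 0.286, 0.2))
weightings <- c(0.25, 0.5, 0.75, 1, 1.25, 1.5, 2, 2.5, 3)

# All 9^3 weight combinations, one per row.
grid <- as.matrix(expand.grid(score_A = weightings,
                              score_B = weightings,
                              score_C = weightings))

# Column j of totals holds the weighted team totals for combination j.
totals <- scores %*% t(grid)
r <- drop(cor(margin, totals))   # one correlation per combination

best_row <- which.max(r)
print(c(grid[best_row, ], pearson = r[best_row]))
```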
Edit upon the 2nd comment by @Michael:
After creating aux2 as above, generate aux3 as below. aux3 will have one column named w_x for each set of weights that you are exploring, plus the original columns from df1 that you will need for your next computations. Each w_x is a weighted sum of the scores:
aux3 <- apply(aux1, 1, function(x) rowSums(t(x*t(aux2))))
colnames(aux3) <- paste0("w_", 1:ncol(aux3))
df1 %>%
select(season, round, team, margin) %>%
cbind(aux3) -> aux3
aux3
# season round team margin w_1 w_2 w_3 w_4 w_5 w_6
# 1 2019 1 Team A 33 1.86750 1.20750 1.5950 1.6045 1.94650 1.5080
# 2 2019 1 Team B 56 1.22925 0.71725 0.8725 0.9950 0.70425 0.9250
# 3 2019 1 Team C 63 1.13900 0.90300 0.8920 1.5715 1.15650 0.9495
# 4 2019 1 Team D 50 1.03750 0.63750 0.7250 0.9500 0.60000 0.7875
# w_7 w_8 w_9
# 1 0.98350 2.1765 2.0115
# 2 0.49275 1.0095 0.8815
# 3 0.70100 1.5435 1.4845
# 4 0.43750 0.9000 0.8000

How can I get row-wise max based on condition of specific column in R dataframe?

I'm trying to get the maximum value BY ROW across several columns (climatic water deficit -- def59_z_#) depending on how much time has passed (time since fire -- YEAR.DIFF). Here are the conditions:
If 1 year has passed, select the deficit value for the first year (def59_z_1).
If 2 years: max deficit of first 2 years.
If 3 years: max of deficit of first 3 years.
If 4 years: max of deficit of first 4 years.
If 5 or more years: max of first 5 years.
However, I am unable to extract a row-wise max when I include a condition. There are several existing posts that address row-wise min and max (examples 1 and 2) and sd (example 3) -- but these don't use conditions. I've tried using apply but I haven't been able to find a solution when I have multiple columns involved as well as a conditional requirement.
The following code simply returns 3.5 in the new column def59_z_max15, which is the maximum value that occurs in the dataframe -- except when YEAR.DIFF is 1, in which case def59_z_1 is directly returned. But for all the other conditions, I want 0.98, 0.67, 0.7, 1.55, 1.28 -- values that reflect the row maximum of the specified columns. Link to sample data here. How can I achieve this?
I appreciate any/all suggestions!
data <- data %>%
mutate(def59_z_max15 = ifelse(YEAR.DIFF == 1,
(def59_z_1),
ifelse(YEAR.DIFF == 2,
max(def59_z_1, def59_z_2),
ifelse(YEAR.DIFF == 3,
max(def59_z_1, def59_z_2, def59_z_3),
ifelse(YEAR.DIFF == 4,
max(def59_z_1, def59_z_2, def59_z_3, def59_z_4),
max(def59_z_1, def59_z_2, def59_z_3, def59_z_4, def59_z_5))))))
Throw this function in an apply family function
func <- function(x) {
first.val <- x[1]
if (first.val < 5) {
return(max(x[2:(first.val + 1)]))
} else {
return(max(x[2:6]))
}
}
Your desired output should be obtained by:
apply(data, 1, function(x) func(x)) #do it by row by setting arg2 = 1
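A variant of the same idea, sketched here in base R, masks the columns that lie beyond each row's year count and then takes the row maximum. The four rows below are copied from the sample data shown in this thread; the helper names k and vals are assumptions for illustration.

```r
# Four rows from the sample data (YEAR.DIFF followed by five deficit columns).
df <- data.frame(
  YEAR.DIFF = c(5, 2, 1, 9),
  def59_z_1 = c(0.25, -0.34, 0.28, 0.67),
  def59_z_2 = c(-2.11, 1.55, 0.82, 0.65),
  def59_z_3 = c(0.98, -1.11, -0.64, -0.27),
  def59_z_4 = c(-0.07, -0.40, 0.45, 0.52),
  def59_z_5 = c(0.31, 0.94, 0.64, 0.26)
)

# Number of leading columns to consider per row, capped at 5.
k <- pmin(df$YEAR.DIFF, 5)

vals <- as.matrix(df[paste0("def59_z_", 1:5)])
# col(vals) gives each cell's column index; k is recycled down the rows,
# so cells in columns beyond a row's k are excluded from the maximum.
vals[col(vals) > k] <- -Inf

df$def59_z_max15 <- apply(vals, 1, max)
print(df)
```

This sketch assumes there are no missing values in the deficit columns; with NAs, max() would need na.rm = TRUE and some thought about all-NA rows.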
An option would be to get the pmax (row-wise max, vectorized) for each set of conditions separately in a loop (map): if the value of 'YEAR.DIFF' is 1, select only 'def59_z_1'; for 2, get the max of 'def59_z_1' and 'def59_z_2'; ...; for 5, get the max of 'def59_z_1' to 'def59_z_5'. Then coalesce the columns together and replace the remaining NA values with the pmax of all the 'def59_z' columns.
library(tidyverse)
out <- map_dfc(1:5, ~
df1 %>%
select(seq_len(.x) + 1) %>%
transmute(val = na_if((df1[["YEAR.DIFF"]] == .x)*
pmax(!!! rlang::syms(names(.))), 0))) %>%
transmute(def59_z_max15 = coalesce(!!! rlang::syms(names(.)))) %>%
bind_cols(df1, .)%>%
mutate(def59_z_max15 = case_when(is.na(def59_z_max15) ~
pmax(!!! rlang::syms(names(.)[2:6])), TRUE ~ def59_z_max15))
head(out, 10)
# YEAR.DIFF def59_z_1 def59_z_2 def59_z_3 def59_z_4 def59_z_5 def59_z_max15
#1 5 0.25 -2.11 0.98 -0.07 0.31 0.98
#2 9 0.67 0.65 -0.27 0.52 0.26 0.67
#3 10 0.56 0.33 0.03 0.70 -0.09 0.70
#4 2 -0.34 1.55 -1.11 -0.40 0.94 1.55
#5 4 0.98 0.71 0.41 1.28 -0.14 1.28
#6 3 0.71 -0.17 1.70 -0.57 0.43 1.70
#7 4 -1.39 -1.71 -0.89 0.78 1.22 0.78
#8 4 -1.14 -1.46 -0.72 0.74 1.32 0.74
#9 2 0.71 1.39 1.07 0.65 0.29 1.39
#10 1 0.28 0.82 -0.64 0.45 0.64 0.28
data
df1 <- read.csv("https://raw.githubusercontent.com/CaitLittlef/random/master/data.csv")
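The difference between max() and pmax() explains why the ifelse() attempt in the question returned a single value everywhere: max() collapses all of its arguments into one number, while pmax() compares element by element. A minimal illustration with two made-up vectors a and b (hypothetical names, values borrowed from the sample data):

```r
a <- c(0.25, -0.34, 0.98)
b <- c(-2.11, 1.55, 0.71)

overall <- max(a, b)    # a single number: the largest value across both vectors
per_row <- pmax(a, b)   # one value per position: the larger of a[i] and b[i]

print(overall)
print(per_row)
```

Inside mutate(), max(col1, col2) therefore yields the maximum over whole columns, which ifelse() then recycles to every row.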

Return value in column 1 when value in column 2 exceeds 2 for 1st time

I have a dataframe called "new_dat" containing the time (days) in column t, and temperature data (and occasionally NA) in columns A - C (please see the example in the code below):
> new_dat
t A B C
1 0.00 0.82 0.88 0.46
2 0.01 0.87 0.94 0.52
3 0.02 NA NA NA
4 0.03 0.95 1.03 0.62
5 0.04 0.98 1.06 0.67
6 0.05 1.01 1.09 0.71
7 0.06 2.00 1.13 2.00
8 0.07 1.06 1.16 0.78
9 0.08 1.07 1.18 0.81
10 0.09 1.09 1.20 0.84
11 0.10 1.10 1.21 0.86
12 0.11 2.00 1.22 0.87
Here is a dput() of the dataframe:
structure(list(t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07,
0.08, 0.09, 0.1, 0.11), A = c(0.82, 0.870000000000001, NA,
0.949999999999999,
0.979999999999997, 1.01, 2, 1.06, 1.07, 1.09, 1.1, 2), B =
c(0.879999999999999,
0.940000000000001, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.2,
1.21, 1.22), C = c(0.460000000000001, 0.520000000000003, NA,
0.619999999999997, 0.669999999999998, 0.709999999999997, 2,
0.780000000000001,
0.809999999999999, 0.84, 0.859999999999999, 0.87)), .Names = c("t",
"A", "B", "C"), row.names = c(NA, 12L), class = "data.frame")
As output, I want a vector (list?) of the values of column t where the temperature reading from columns A-C >= 2 for the first time (and only the first time), or - if the temperature is never >= 2 - return the last time reading in column t (0.11 in my example). So 'A' would return the value 0.06 (and not 0.11), 'B' would have the value 0.11 and 'C' 0.06. I intended to use the vector generated to create a new dataframe something like this:
A B C
0.06 0.11 0.06
I'm inexperienced with R (and code in general) so, despite reading that looping can be inefficient (but not really understanding how to accomplish what I want without it), I tried to solve this by looping first by column and then by row as follows:
#create blank vector to add my results to
aer <- c()
#loop by column, then by row, adding values according to the if statement
for (c in 2:ncol(new_dat)){
c <- c
for (r in 1:nrow(new_dat)){
r <- r
if ((!is.na(new_dat[r,c] )) & (new_dat[r,c] >= 2)){
aer <- c(aer, new_dat$t[r])
}
}
}
This returns my vector, aer, as:
> aer
[1] 0.06 0.11 0.06
So it's returning both instances where 'A' is 2, and the one from column 'C'.
I don't know how to instruct the loop to stop and move to the next column after finding one instance where my 'if' statement is true. I also tried adding an 'else' to cover the situation where the temperature doesn't exceed 2:
else {
aer <- c(aer, new_dat$t[nrow(new_dat)])
But this did not work.
I would appreciate any help in completing the code, or suggestions for a better solution.
library(tidyverse)
new_dat %>%
gather(col, temp, -t) %>% # reshape data
na.omit() %>% # remove rows with NAs
group_by(col) %>% # for each column value
summarise(v = ifelse(is.na(first(t[temp >= 2])), last(t), first(t[temp >= 2]))) %>% # return the last t value if there are no temp >=2 otherwise return the first t with temp >= 2
spread(col, v) # reshape again
# # A tibble: 1 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.06 0.11 0.06
This solution will create the dataframe for you automatically, instead of returning a vector for you to create the dataframe yourself.
Here is a two steps solution.
First get an index vector of the values you want, then use that index vector to subset the dataframe.
inx <- sapply(new_dat[-1], function(x) {
w <- which(x >= 2)
if(length(w)) min(w) else NROW(x)
})
new_dat[inx, 1]
#[1] 0.06 0.11 0.06
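For completeness, the loop from the question can also be made to work by using break to leave the inner loop at the first hit. This is a sketch rather than a recommended replacement for the answers above; it rebuilds new_dat with rounded values from the question, and the variable hit is a hypothetical name.

```r
new_dat <- data.frame(
  t = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11),
  A = c(0.82, 0.87, NA, 0.95, 0.98, 1.01, 2, 1.06, 1.07, 1.09, 1.10, 2),
  B = c(0.88, 0.94, NA, 1.03, 1.06, 1.09, 1.13, 1.16, 1.18, 1.20, 1.21, 1.22),
  C = c(0.46, 0.52, NA, 0.62, 0.67, 0.71, 2, 0.78, 0.81, 0.84, 0.86, 0.87)
)

aer <- numeric(0)
for (col in names(new_dat)[-1]) {
  # Default: the last time reading, used when the column never reaches 2.
  hit <- new_dat$t[nrow(new_dat)]
  for (r in seq_len(nrow(new_dat))) {
    val <- new_dat[r, col]
    if (!is.na(val) && val >= 2) {
      hit <- new_dat$t[r]
      break   # stop at the first match and move on to the next column
    }
  }
  aer[col] <- hit
}
print(aer)
```

Setting the default before the inner loop avoids the need for an else branch, which in the original attempt would have appended a value for every row that failed the test.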

How to subtract first entry from last entry in grouped data

I would appreciate some help with the following task: From the data frame below (C), for each id I would like to subtract the first entry under column d_2 from the final entry and then store the results in another dataframe containing the same ids. I can then merge this with my initial dataframe. Please note that the subtraction has to be in this order (last entry minus first entry for each id).
Here is the code:
id <- c("A1", "A1", "B10","B10", "B500", "B500", "C100", "C100", "C100", "D40", "D40", "G100", "G100")
d_1 <- c( rep(1.15, 2), rep(1.44, 2), rep(1.34, 2), rep(1.50, 3), rep(1.90, 2), rep(1.59, 2))
set.seed(2)
d_2 <- round(runif(13, -1, 1), 2)
C <- data.frame(id, d_1, d_2)
id d_1 d_2
A1 1.15 -0.63
A1 1.15 0.40
B10 1.44 0.15
B10 1.44 -0.66
B500 1.34 0.89
B500 1.34 0.89
C100 1.50 -0.74
C100 1.50 0.67
C100 1.50 -0.06
D40 1.90 0.10
D40 1.90 0.11
G100 1.59 -0.52
G100 1.59 0.52
Desired result:
id2 <- c("A1", "B10", "B500", "C100", "D40", "G100")
difference <- c(1.03, -0.81, 0, 0.68, 0.01, 1.04)
diff_df <- data.frame(id2, difference)
id2 difference
A1 1.03
B10 -0.81
B500 0.00
C100 0.68
D40 0.01
G100 1.04
I attempted this by using ddply to obtain the first and last entries but I'm really struggling with indexing the "function argument" in the second code (below) to get the desired outcome.
C_1 <- ddply(C, .(id), function(x) x[c(1, nrow(x)), ])
ddply(C_1, .(patient), function )
To be honest, I'm not very familiar with ddply (from the plyr package); I got the code above from another post on Stack Exchange.
My original data is a groupedData and I believe another way of approaching this is using gapply but again I'm struggling with the third argument here (usually a function)
grouped_C <- groupedData(d_1 ~ d_2 | id, data = C, FUN = mean, labels = list( x = "", y = ""), units = list(""))
x1 <- gapply(grouped_C, "d_2", first_entry)
x2 <- gapply(grouped_C, "d_2", last_entry)
where first_entry and last_entry are functions to help me get the first and last entries.
I can then get the difference with: x2 - x1. However, I'm not sure what to input as first_entry and last_entry in the code above (perhaps something to do with head or tail?).
Any help would be much appreciated.
This can be done easily with dplyr. The last and first functions are very helpful for this task.
library(dplyr) #install the package dplyr and load it into library
diff_df <- C %>% #create a new data.frame (diff_df) and store the output of the following operation in it. The %>% operator chains several operations together so you don't have to reference the data.frame each time; here we use your data.frame C for the following steps
group_by(id) %>% #group the whole data.frame C by id
summarize(difference = last(d_2)-first(d_2)) #for each group of id, create a single line summary where the first entry of d_2 (for that group) is subtracted from the last entry of d_2 for that group
# id difference #this is the result stored in diff_df
#1 A1 1.03
#2 B10 -0.81
#3 B500 0.00
#4 C100 0.68
#5 D40 0.01
#6 G100 1.04
Edit note: updated post with %>% instead of %.% which is deprecated.
If you have any singletons and they need to be left alone, then this will solve your problem. It's the same as docendo discimus's answer, but with an if-else component to deal with the singleton cases:
library(dplyr)
diff_df <- C %>%
group_by(id) %>%
summarize(difference = if(n() > 1) last(d_2) - first(d_2) else d_2)
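If dplyr is not available, the same last-minus-first calculation can be sketched in base R with tapply(), followed by merge() to join the result back onto the original data as the question describes. The d_2 values are typed in from the table in the question rather than regenerated with runif(); the name merged is a hypothetical choice.

```r
id <- c("A1", "A1", "B10", "B10", "B500", "B500", "C100", "C100", "C100",
        "D40", "D40", "G100", "G100")
d_2 <- c(-0.63, 0.40, 0.15, -0.66, 0.89, 0.89, -0.74, 0.67, -0.06,
         0.10, 0.11, -0.52, 0.52)
C <- data.frame(id, d_2)

# Last entry minus first entry within each id (rows assumed to be in order).
difference <- tapply(C$d_2, C$id, function(v) v[length(v)] - v[1])
diff_df <- data.frame(id = names(difference),
                      difference = as.vector(difference))

# Join the per-id difference back onto the original rows.
merged <- merge(C, diff_df, by = "id")
print(diff_df)
```

With this formulation a group containing a single row gets a difference of 0, because its first and last entries are the same value.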
