Convert one string column into three columns - r

I am trying to split the estimates and CIs into three columns, so that a column with values of the form 99.99[-99.9,99.9] becomes three separate columns.
Please consider the data:
out <-
structure(list(name = c("total_gray_vol_0_to_psychosis_24", "total_gray_vol_24_to_psychosis_48",
"psychosis_0_to_total_gray_vol_24", "psychosis_24_to_total_gray_vol_48"
), Std.Estimate = c(0.304045656442265, 1.48352171485462, 0.673583361513608,
0.703098685562618), Std.SE = c(0.239964279466103, 2.72428816136731,
0.112111316151443, 0.14890331153936), CI = c("0.3 [-0.17, 0.77]",
"1.48 [-3.86, 6.82]", "0.67 [0.45, 0.89]", "0.7 [0.41, 0.99]"
)), class = "data.frame", row.names = c(NA, -4L))
The farthest I got was to extract the first digit with:
library(stringr)
str_match(out$CI, pattern= "([[0-9]+]*)([[0-9]+]*)([[0-9]+]*)")
But this is not working, as it is returning only the first digits, and for some reason four columns.
How do I split the column CI into three columns (estimate, lower, upper) correctly?

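You could also use tidyr::extract for this purpose, as follows. Note that the regex argument must define as many capturing groups as there are names in the into argument.
library(tidyr) # also provides the %>% pipe
out %>%
# ',\\s*' rather than '\\W+' between the bounds, so a negative upper bound would keep its sign
extract(CI, c('estimate', 'lower', 'upper'), '([-\\d.]+)\\s+\\[([-\\d.]+),\\s*([-\\d.]+)\\]')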
name Std.Estimate Std.SE estimate lower upper
1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.3 -0.17 0.77
2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.7 0.41 0.99

Here is an option using tidyr::separate
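library(dplyr)
library(tidyr)
library(stringr)
out %>%
# splitting on whitespace yields exactly three pieces per row
separate(CI, c("estimate", "lower", "upper"), sep = "\\s+") %>%
mutate(across(
c(estimate, lower, upper),
~ .x %>% str_remove_all("\\[|\\]|,|\\s") %>% as.numeric()))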
# name Std.Estimate Std.SE estimate lower upper
#1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.30 -0.17 0.77
#2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
#3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
#4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.70 0.41 0.99
First, split the entries on whitespace, which leaves the brackets and the comma attached to the pieces; then remove those characters from the resulting new columns and coerce to numeric.

Using base R
out <- cbind(out, read.table(text = gsub("[][]|,", "", out$CI),
header = FALSE, col.names = c("estimate", "lower", "upper")))
out$CI <- NULL
out
-output
name Std.Estimate Std.SE estimate lower upper
1 total_gray_vol_0_to_psychosis_24 0.3040457 0.2399643 0.30 -0.17 0.77
2 total_gray_vol_24_to_psychosis_48 1.4835217 2.7242882 1.48 -3.86 6.82
3 psychosis_0_to_total_gray_vol_24 0.6735834 0.1121113 0.67 0.45 0.89
4 psychosis_24_to_total_gray_vol_48 0.7030987 0.1489033 0.70 0.41 0.99
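
Another base R route, sketched here on the assumption of R 3.4 or later, is utils::strcapture(), which applies a regex with capture groups and converts each group to the type given in a prototype data frame. The pattern and the names ci, proto, parts and m are illustrative, not taken from the answers above. The regexec() call at the end also shows why a three-group pattern yields four pieces, as with str_match in the question: the first piece is the whole match.

```r
# Illustrative copy of the CI column from the question
ci <- c("0.3 [-0.17, 0.77]", "1.48 [-3.86, 6.82]",
        "0.67 [0.45, 0.89]", "0.7 [0.41, 0.99]")

# Three capture groups: estimate, lower bound, upper bound
pattern <- "([-0-9.]+) \\[([-0-9.]+), ([-0-9.]+)\\]"

# The prototype fixes the column names and types of the result
proto <- data.frame(estimate = numeric(), lower = numeric(), upper = numeric())
parts <- strcapture(pattern, ci, proto)
print(parts)

# regexec() returns the full match followed by one entry per group
m <- regmatches(ci, regexec(pattern, ci))
length(m[[1]])  # 4: whole match + 3 groups
```

The resulting columns are already numeric, so parts can be bound to the original data frame with cbind() without a separate as.numeric() step.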

Related

Probability, sample function - intervals

I have a problem with the sample function: I get an "incorrect number of probabilities" error. Can I specify the probabilities in another way? I don't know whether this function works on intervals.
OL_x = c(15.0:47.0,0.0:15.0,47:80,80:105)
x = sample(OL_x,1000,replace = TRUE,prob = c(0.60,0.22,0.13,0.05) )+ runif(1000,0,1)
You need to have a probability associated with each value. I don't know of a way to assign a probability to an interval, so doing it "by hand" could look like:
probs = c(rep(0.60, 48-15), rep(0.22,16-0), rep(0.13, 81-47), rep(0.05, 106-80))
x = sample(OL_x, 1000, replace = TRUE, prob = probs) + runif(1000,0,1)
This is not very efficient, because you need to calculate the size of each interval by hand; there are probably better ways of doing this.
The prob argument must supply one weight for each element of x. OL_x is a vector with 109 elements, since the : integer sequence operator expands out your values, so four probabilities are too few. I'm not quite sure what you are trying to create, but if you are after 1000 values drawn from the values presented with the probabilities described, try:
# keep groups separate as a list
OL_x = list(15.0:47.0,0.0:15.0,47:80,80:105)
# number of values in each group
vapply(X = OL_x, FUN = length, FUN.VALUE = 0L)
# [1] 33 16 34 26
# create 109 probabilities
rep(c(0.60,0.22,0.13,0.05), times = vapply(X = OL_x, FUN = length, FUN.VALUE = 0L))
# [1] 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60
# [14] 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.60
# [27] 0.60 0.60 0.60 0.60 0.60 0.60 0.60 0.22 0.22 0.22 0.22 0.22 0.22
# ...
# create 1000 samples
x = sample(
x = unlist(OL_x),
size = 1000,
replace = TRUE,
prob = rep(c(0.60,0.22,0.13,0.05),
times = vapply(X = OL_x, FUN = length, FUN.VALUE = 0L))
) + runif(1000,0,1)
head(x)
# [1] 18.826530 36.948981 15.366685 5.142625 47.659682 14.946690
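
One thing to keep in mind with either approach: prob holds per-element weights, and they need not sum to one, so repeating 0.60 for each of 33 values does not give the first interval a 60% share overall. If the intent is that each interval as a whole is chosen with the stated probability (an assumption about the question, not something it states), a minimal sketch is to divide each interval's probability by its size. The names groups, group_prob, sizes and w are illustrative.

```r
set.seed(42)  # only to make the sketch reproducible

groups <- list(15:47, 0:15, 47:80, 80:105)
group_prob <- c(0.60, 0.22, 0.13, 0.05)

sizes <- lengths(groups)                     # 33 16 34 26
w <- rep(group_prob / sizes, times = sizes)  # one weight per value, summing to 1

x <- sample(unlist(groups), size = 1000, replace = TRUE, prob = w) +
  runif(1000, 0, 1)

# Proportion of draws landing at 80 or above
mean(x >= 80)
```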

Manipulate list object into data frame

library(survey)
I have data such as this. I am using the survey package to produce the MEAN, SE and FREQ of each variable in the vector named vars. I am new to manipulating lists in R and would really appreciate help!
df <- data.frame(
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
vars <- c("weight","married","pens")
design <- svydesign(ids=~1, data=df, weights=~weight)
myfun <- function(x){
means <- svymean(as.formula(paste0('~(', x, ')')), design, na.rm = TRUE)
table <- svytable(as.formula(paste0('~(', x, ')')), design)
results <- list(svymean = means, svytable = table)
return(results)
}
lapply(vars, myfun)
The output looks like this:
[[1]]
[[1]]$svymean
mean SE
weight 0.79791 0.1177
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
[[2]]
[[2]]$svymean
mean SE
married 0.91085 0.0717
[[2]]$svytable
married
0 1
0.46 4.70
[[3]]
[[3]]$svymean
mean SE
pens 0.46272 0.2255
[[3]]$svytable
pens
0 1
2.45 2.11
I want to extract/manipulate this list above to create a dataframe that looks more like this:
question mean SE sum_svytable
weight 0.797 0.1177 5.16
married 0.910 0.071 5.16
As you can see, the sum_svytable is the sum of the frequencies produced in the $svytable generated list for each variable. Even though this number is the same for each variable (5.16 for all) in my example, it is not the same in my dataset.
sum_svytable was derived like this:
output of myfun function for weight:
[[1]]$svytable
weight
0.23 0.55 0.6 0.66 0.67 1.1 1.12
0.46 0.55 0.60 0.66 0.67 1.10 1.12
I simply summed the frequencies for each response:
sum_svytable(for weight) = 0.46 +0.55+ 0.60+ 0.66+ 0.67+ 1.10+ 1.12
I don't mind how this result is arrived at, I just need it to be in a df!
Is this possible?
An option is to loop over the list of output from 'myfun', extract the 'svymean' component of each element and convert it to a data.frame, add a column holding the sum of the 'svytable' element, rbind the list elements, and create the 'question' column from the row names:
out <- lapply(vars, myfun)
lst1 <- lapply(out, function(x)
cbind(setNames(as.data.frame(x$svymean), c("mean", "SE")),
sum_svytable = sum(x$svytable)))
out1 <- do.call(rbind, lst1)
out1$question <- row.names(out1)
row.names(out1) <- NULL
out1[c('question', 'mean', 'SE', 'sum_svytable')]
# question mean SE sum_svytable
#1 weight 0.7979070 0.1177470 5.16
#2 married 0.9108527 0.0716663 5.16
#3 pens 0.4627193 0.2254907 4.56

How can I get row-wise max based on condition of specific column in R dataframe?

I'm trying to get the maximum value BY ROW across several columns (climatic water deficit -- def_59_z_#) depending on how much time has passed (time since fire -- YEAR.DIFF). Here are the conditions:
If 1 year has passed, select the deficit value for first year (def_59_z_1).
If 2 years: max deficit of first 2 years.
If 3 years: max of deficit of first 3 years.
If 4 years: max of deficit of first 4 years.
If 5 or more years: max of first 5 years.
However, I am unable to extract a row-wise max when I include a condition. There are several existing posts that address row-wise min and max (examples 1 and 2) and sd (example 3) -- but these don't use conditions. I've tried using apply but I haven't been able to find a solution when I have multiple columns involved as well as a conditional requirement.
The following code simply returns 3.5 in the new column def59_z_max15, which is the maximum value that occurs anywhere in the dataframe, except when YEAR.DIFF is 1, in which case def59_z_1 is returned directly. For the other conditions I want values such as 0.98, 0.67, 0.7, 1.55, 1.28, which reflect the row maximum of the specified columns. Link to sample data here. How can I achieve this?
I appreciate any/all suggestions!
data <- data %>%
mutate(def59_z_max15 = ifelse(YEAR.DIFF == 1,
(def59_z_1),
ifelse(YEAR.DIFF == 2,
max(def59_z_1, def59_z_2),
ifelse(YEAR.DIFF == 3,
max(def59_z_1, def59_z_2, def59_z_3),
ifelse(YEAR.DIFF == 4,
max(def59_z_1, def59_z_2, def59_z_3, def59_z_4),
max(def59_z_1, def59_z_2, def59_z_3, def59_z_4, def59_z_5))))))
Pass this function to one of the apply family functions:
# assumes each row x holds YEAR.DIFF first, then def59_z_1 to def59_z_5
func <- function(x) {
first.val <- x[1]
if (first.val < 5) {
return(max(x[2:(first.val + 1)]))
} else {
return(max(x[2:6]))
}
}
Your desired output should be obtained by:
apply(data, 1, func) # do it by row by setting arg2 = 1
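
For comparison, here is a vectorised base R sketch of the same rule. It assumes the column layout shown in the question (YEAR.DIFF plus def59_z_1 to def59_z_5); the data frame toy and the helper names k and vals are made up for illustration. The nested ifelse attempt in the question returns 3.5 for most rows because max() collapses all of its arguments into a single number, whereas the rule needs one maximum per row.

```r
# Toy rows with the same layout as the question's data
toy <- data.frame(
  YEAR.DIFF = c(5, 9, 2, 4, 1),
  def59_z_1 = c(0.25, 0.67, -0.34, 0.98, 0.28),
  def59_z_2 = c(-2.11, 0.65, 1.55, 0.71, 0.82),
  def59_z_3 = c(0.98, -0.27, -1.11, 0.41, -0.64),
  def59_z_4 = c(-0.07, 0.52, -0.40, 1.28, 0.45),
  def59_z_5 = c(0.31, 0.26, 0.94, -0.14, 0.64)
)

# Number of years to consider per row, capped at 5
k <- pmin(toy$YEAR.DIFF, 5)

vals <- as.matrix(toy[paste0("def59_z_", 1:5)])

# Hide the years beyond k in each row so they cannot win the maximum
vals[col(vals) > k] <- -Inf

# pmax() takes the element-wise (here: row-wise) maximum across columns
toy$def59_z_max15 <- do.call(pmax, as.data.frame(vals))
toy$def59_z_max15
```

Capping YEAR.DIFF with pmin() folds the "5 or more years" case into the same code path, so no branching is needed.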
An option would be to get the pmax (row-wise max, vectorized) for each condition separately in a loop with map: if the value of 'YEAR.DIFF' is 1, select only 'def59_z_1'; for 2, get the max of 'def59_z_1' and 'def59_z_2'; and so on up to 5, the max of 'def59_z_1' to 'def59_z_5'. Then coalesce the columns together and replace the remaining NAs with the pmax of all the 'def59_z' columns.
library(tidyverse)
out <- map_dfc(1:5, ~
df1 %>%
select(seq_len(.x) + 1) %>%
transmute(val = na_if((df1[["YEAR.DIFF"]] == .x)*
pmax(!!! rlang::syms(names(.))), 0))) %>%
transmute(def59_z_max15 = coalesce(!!! rlang::syms(names(.)))) %>%
bind_cols(df1, .)%>%
mutate(def59_z_max15 = case_when(is.na(def59_z_max15) ~
pmax(!!! rlang::syms(names(.)[2:6])), TRUE ~ def59_z_max15))
head(out, 10)
# YEAR.DIFF def59_z_1 def59_z_2 def59_z_3 def59_z_4 def59_z_5 def59_z_max15
#1 5 0.25 -2.11 0.98 -0.07 0.31 0.98
#2 9 0.67 0.65 -0.27 0.52 0.26 0.67
#3 10 0.56 0.33 0.03 0.70 -0.09 0.70
#4 2 -0.34 1.55 -1.11 -0.40 0.94 1.55
#5 4 0.98 0.71 0.41 1.28 -0.14 1.28
#6 3 0.71 -0.17 1.70 -0.57 0.43 1.70
#7 4 -1.39 -1.71 -0.89 0.78 1.22 0.78
#8 4 -1.14 -1.46 -0.72 0.74 1.32 0.74
#9 2 0.71 1.39 1.07 0.65 0.29 1.39
#10 1 0.28 0.82 -0.64 0.45 0.64 0.28
data
df1 <- read.csv("https://raw.githubusercontent.com/CaitLittlef/random/master/data.csv")

Plotting multiple columns in ggplot

I am trying to plot all columns of a data frame based on a column in the data frame. The df basically looks like this:
iters a b c
1 1 0.92 0.83 0.97
2 2 0.12 0.93 0.76
3 3 0.55 0.41 0.87
4 4 0.43 0.55 0.49
So far I have tried this code:
df <- melt(acc_s1, id.vars = 'iter', variable.name = 'letter')
ggplot(df, aes(iter,value)) + geom_line(aes(colour = letter))
Unfortunately, my result looks like this (don't mind the slightly different names):
Any ideas where this comes from?
Thanks

Column Mean for rows with unique values

How can I compute the mean R, R1, R2, R3 values from the rows sharing the same lon,lat field? I'm sure this question exists multiple times but I could not easily find it.
lon lat length depth R R1 R2 R3
1 147.5348 -35.32395 13709 1 0.67 0.80 0.84 0.83
2 147.5348 -35.32395 13709 2 0.47 0.48 0.56 0.54
3 147.5348 -35.32395 13709 3 0.43 0.29 0.36 0.34
4 147.4290 -35.27202 12652 1 0.46 0.61 0.60 0.58
5 147.4290 -35.27202 12652 2 0.73 0.96 0.95 0.95
6 147.4290 -35.27202 12652 3 0.77 0.92 0.92 0.91
I'd recommend using the split-apply-combine strategy, where you're splitting by BOTH lon and lat, applying mean to each group, then recombining into a single data frame.
I'd recommend using dplyr:
library(dplyr)
mydata %>%
group_by(lon, lat) %>%
summarize(
mean_r = mean(R)
, mean_r1 = mean(R1)
, mean_r2 = mean(R2)
, mean_r3 = mean(R3)
)
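
If you prefer to stay in base R, the same split-apply-combine step can be sketched with aggregate(). The data frame below re-types the rows from the question under the name mydata used above; the result name res is illustrative.

```r
# Rows copied from the question
mydata <- data.frame(
  lon = rep(c(147.5348, 147.4290), each = 3),
  lat = rep(c(-35.32395, -35.27202), each = 3),
  length = rep(c(13709, 12652), each = 3),
  depth = rep(1:3, times = 2),
  R  = c(0.67, 0.47, 0.43, 0.46, 0.73, 0.77),
  R1 = c(0.80, 0.48, 0.29, 0.61, 0.96, 0.92),
  R2 = c(0.84, 0.56, 0.36, 0.60, 0.95, 0.92),
  R3 = c(0.83, 0.54, 0.34, 0.58, 0.95, 0.91)
)

# One row per (lon, lat) pair, holding the mean of each R column
res <- aggregate(cbind(R, R1, R2, R3) ~ lon + lat, data = mydata, FUN = mean)
print(res)
```

Note that the formula interface drops rows with missing values by default, which makes no difference for this data.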
