Scatterplot with categorical x-axis (and uncertainty boxes) in R

I have made some calculations on data measured on several systems of photovoltaic panels. I have 11 different photovoltaic systems, and for each of them I have 3 different numerical values.
My results are in a matrix that has 11 rows (each of them corresponding to one of the photovoltaic systems), and 3 columns (containing the 3 numerical quantities computed for each system).
Here is a minimal reproducible matrix :
monthly_LR monthly_CSD monthly_HW
solon 0.398 0.417 0.48
sanyo 0.489 0.479 0.59
atersa NA NA NA
sunpower 0.129 NA 0.19
schott_efg 0.387 0.486 0.47
BP 0.235 0.161 0.22
solarworld 1.153 1.245 1.25
schott_main 0.531 0.628 0.62
wurth 2.889 2.886 2.85
first 1.631 1.651 1.64
mhi 0.974 0.888 1.02
and the corresponding dput output so you can reproduce it :
structure(c(0.398, 0.489, NA, 0.129, 0.387, 0.235, 1.153, 0.531,
2.889, 1.631, 0.974, 0.417, 0.479, NA, NA, 0.486, 0.161, 1.245,
0.628, 2.886, 1.651, 0.888, 0.48, 0.59, NA, 0.19, 0.47, 0.22,
1.25, 0.62, 2.85, 1.64, 1.02), .Dim = c(11L, 3L), .Dimnames = list(
c("solon", "sanyo", "atersa", "sunpower", "schott_efg", "BP",
"solarworld", "schott_main", "wurth", "first", "mhi"), c("monthly_LR",
"monthly_CSD", "monthly_HW")))
I also have another matrix which contains the uncertainties associated with each value of the first matrix :
monthly_LR_uncertainty monthly_CSD_uncertainty monthly_HW_uncertainty
solon 0.14 0.09 0.07
sanyo 0.13 0.06 0.07
atersa NA 0.13 NA
sunpower 0.18 0.18 0.20
schott_efg 0.14 0.07 0.06
BP 0.14 0.14 0.15
solarworld 0.16 0.04 0.03
schott_main 0.15 0.08 0.07
wurth 0.12 0.10 0.11
first 0.08 0.09 0.10
mhi 0.08 0.07 0.08
and the corresponding dput output so you can reproduce it :
structure(c(0.14, 0.13, NA, 0.18, 0.14, 0.14, 0.16, 0.15, 0.12,
0.08, 0.08, 0.09, 0.06, 0.13, 0.18, 0.07, 0.14, 0.04, 0.08, 0.1,
0.09, 0.07, 0.07, 0.07, NA, 0.2, 0.06, 0.15, 0.03, 0.07, 0.11,
0.1, 0.08), .Dim = c(11L, 3L), .Dimnames = list(c("solon", "sanyo",
"atersa", "sunpower", "schott_efg", "BP", "solarworld", "schott_main",
"wurth", "first", "mhi"), c("monthly_LR_uncertainty", "monthly_CSD_uncertainty",
"monthly_HW_uncertainty")))
Now, here is the type of scatterplot I would like to obtain (I almost got what I wanted with boxplots, but I would now prefer a scatterplot):
I would like the x-axis to be categorical, as it is when I make a boxplot (i.e. one category for each of the 11 rows).
Above each category on the x-axis, I would like 3 points corresponding to the 3 values in the matching row of the first matrix, with boxes indicating the uncertainty of each result.
The image below (a graph from an article written by a researcher who used to work in my lab) shows exactly what I want to obtain. The 11 categories on the x-axis correspond to my 11 rows. The three different points for each category (blue, red, green) correspond to the 3 values for that category in the first matrix. The box associated with each point corresponds to its uncertainty (given in the second matrix).

Let's say a is the matrix of values and b is the matrix of uncertainties:
# x positions, one per row
x = 1:nrow(a)
# horizontal offset for data of same group
offset = 0.2
# draw empty plot; the limits leave room for the offset points and the error bars
plot(NULL, xlim=c(0.5, nrow(a)+0.5), ylim=c(0, max(a+0.5*b, na.rm=TRUE)), xaxt='n', ylab='performance', xlab='')
# add error bars (arrows with angle=90)
arrows(x0=x, x1=x, y0 = a[,1]-0.5*b[,1], y1 = a[,1]+0.5*b[,1], angle=90, code=3, len=0.01)
arrows(x0=x-offset, x1=x-offset, y0 = a[,2]-0.5*b[,2], y1 = a[,2]+0.5*b[,2], angle=90, code=3, col=2, len=0.02)
arrows(x0=x+offset, x1=x+offset, y0 = a[,3]-0.5*b[,3], y1 = a[,3]+0.5*b[,3], angle=90, code=3, col=4, len=0.02)
# add points
points(x, a[,1], pch=1, col=1)
points(x-offset, a[,2], pch=2, col=2)
points(x+offset, a[,3], pch=3, col=4)
# axis labels
axis(1, at = 1:nrow(a), labels = rownames(a), las=3)
# add legend
legend(x='topleft', legend=colnames(a), col=c(1,2,4), pch=c(1,2,3), inset=0.02)
Also have a look at this answer for grouped boxplots.
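Since the question asks for boxes rather than error bars, here is a minimal sketch that swaps arrows() for rect(). It is not the code behind the article's figure. The three-row matrices are a toy subset of the question's data, and the offsets, the box half-width, the colours, and the choice to draw each box from value - uncertainty to value + uncertainty are all assumptions made for illustration.

```r
# Toy subset (first three rows) of the question's two matrices;
# b reuses a's column names here only to keep the sketch short
sys <- c("solon", "sanyo", "atersa")
a <- matrix(c(0.398, 0.489, NA,  0.417, 0.479, NA,  0.48, 0.59, NA),
            nrow = 3,
            dimnames = list(sys, c("monthly_LR", "monthly_CSD", "monthly_HW")))
b <- matrix(c(0.14, 0.13, NA,  0.09, 0.06, 0.13,  0.07, 0.07, NA),
            nrow = 3, dimnames = list(sys, colnames(a)))

x       <- seq_len(nrow(a))              # one x position per row
offsets <- c(-0.2, 0, 0.2)               # assumed horizontal shift per column
half_w  <- 0.07                          # assumed half-width of each box
cols    <- c("blue", "red", "darkgreen") # assumed colours

# y limits wide enough to contain every box
ylim <- range(a - b, a + b, na.rm = TRUE)

plot(NULL, xlim = c(0.5, nrow(a) + 0.5), ylim = ylim,
     xaxt = "n", xlab = "", ylab = "value")
for (j in seq_len(ncol(a))) {
  # draw a box only where both the value and its uncertainty are known
  ok <- !is.na(a[, j]) & !is.na(b[, j])
  rect(x[ok] + offsets[j] - half_w, a[ok, j] - b[ok, j],
       x[ok] + offsets[j] + half_w, a[ok, j] + b[ok, j],
       border = cols[j])
  points(x + offsets[j], a[, j], pch = 16, col = cols[j])
}
axis(1, at = x, labels = rownames(a), las = 2)
legend("topleft", legend = colnames(a), col = cols, pch = 16, inset = 0.02)
```

With the full 11-row matrices the same loop should apply unchanged. If the uncertainties are meant as full widths rather than half-widths, replace b with 0.5 * b, as the arrows() version does.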

Related

Mutating new columns based on common string using existing columns

Sample data:
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01
I have a dataset which has columns of some variable of interest, say alpha, beta, and so on. I also have this saved as a character vector. I want to be able to mutate new columns based on these variable names, suffixed with an identifier, using the existing columns in the dataset as part of some transformation, like this:
df %>% mutate(
alpha_new = ((alpha_5-alpha_1) / (X_5-X_1) * Y),
beta_new = ((beta_5-beta_1) / (X_5-X_1) * Y)
)
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.21 0.02 0.61 10 5 3 0.01 16.1 9.60
2 0.01 0.02 0.37 0.4 0.01 0.8 0.5 -14.4 -11.1
3 0.02 0.03 0.55 0.01 0.01 0.3 0.99 0 38.0
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.145 15.1
5 0.11 0.1 -0.08 0.22 0.015 0.01 0.01 -1.64 0
6 0.22 0.21 -0.08 0.02 0.03 0.01 0.01 0.0800 0
In my real data I have many more columns like this, and I'm struggling to implement this in a "tidy" way that isn't hardcoded. What's the best practice for my situation?
Sample code:
structure(
list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,
0.02, 0.03, 0.05, 0.10, 0.21),
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)
),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")
) -> df
variable_of_interest <- c("alpha", "beta")
Here's another way to approach this, creating the columns dynamically. With map_dfc from purrr you can column-bind the new results, building each new column name with !! (bang-bang) on the left-hand side of the := operator and using the .data pronoun to access column values on the right-hand side.
library(tidyverse)
bind_cols(
df,
map_dfc(
variable_of_interest,
~ transmute(df, !!paste0(.x, '_new') :=
(.data[[paste0(.x, '_5')]] - .data[[paste0(.x, '_1')]]) /
(X_5 - X_1) * Y)
)
)
Output
X_5 X_1 Y alpha_5 alpha_1 beta_5 beta_1 alpha_new beta_new
1 0.21 0.02 0.61 10.00 5.000 3.00 0.01 16.05263 9.599474
2 0.01 0.02 0.37 0.40 0.010 0.80 0.50 -14.43000 -11.100000
3 0.02 0.03 0.55 0.01 0.010 0.30 0.99 0.00000 37.950000
4 0.04 0.05 0.29 0.01 0.005 0.03 0.55 -0.14500 15.080000
5 0.11 0.10 -0.08 0.22 0.015 0.01 0.01 -1.64000 0.000000
6 0.22 0.21 -0.08 0.02 0.030 0.01 0.01 0.08000 0.000000
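For comparison, the same idea of building column names from strings can be sketched in base R with a plain loop. This is an illustrative alternative, not part of the answer above. It rebuilds the question's sample data as an ordinary data.frame, and the helper names hi and lo are made up for the example.

```r
# The question's sample data as a plain data.frame
df <- data.frame(
  X_5     = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
  X_1     = c(0.02, 0.02, 0.03, 0.05, 0.10, 0.21),
  Y       = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
  alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
  alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
  beta_5  = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
  beta_1  = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)
)
variable_of_interest <- c("alpha", "beta")

for (v in variable_of_interest) {
  hi <- df[[paste0(v, "_5")]]   # e.g. alpha_5
  lo <- df[[paste0(v, "_1")]]   # e.g. alpha_1
  # same transformation as in the question, stored as <variable>_new
  df[[paste0(v, "_new")]] <- (hi - lo) / (df$X_5 - df$X_1) * df$Y
}

round(df[c("alpha_new", "beta_new")], 3)
```

The loop scales to any number of prefixes in variable_of_interest, as long as each has matching _5 and _1 columns.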
Better to pivot the data first
library(dplyr)
library(tidyr)
# your data
df <- structure(list(X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22), X_1 = c(0.02,
0.02, 0.03, 0.05, 0.1, 0.21), Y = c(0.61, 0.37, 0.55, 0.29, -0.08,
-0.08), alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02), alpha_1 = c(5,
0.01, 0.01, 0.005, 0.015, 0.03), beta_5 = c(3, 0.8, 0.3, 0.03,
0.01, 0.01), beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01)), class = "data.frame", row.names = c(NA,
-6L))
df <- df |> mutate(id = 1:n()) |>
pivot_longer(cols = -c(id, Y, X_5, X_1),
names_to = c("name", ".value"), names_sep="_") |>
mutate(new= (`5` - `1`) / (X_5 - X_1) * Y) |>
pivot_wider(id_cols = id, names_from = "name", values_from = c(`5`,`1`, `new`),
names_glue = "{name}_{.value}", values_fn = sum)
df
#> # A tibble: 6 × 7
#> id alpha_5 beta_5 alpha_1 beta_1 alpha_new beta_new
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 3 5 0.01 16.1 9.60
#> 2 2 0.4 0.8 0.01 0.5 -14.4 -11.1
#> 3 3 0.01 0.3 0.01 0.99 0 38.0
#> 4 4 0.01 0.03 0.005 0.55 -0.145 15.1
#> 5 5 0.22 0.01 0.015 0.01 -1.64 0
#> 6 6 0.02 0.01 0.03 0.01 0.0800 0
Created on 2023-02-16 with reprex v2.0.2
Note: if you want to add X_5 and X_1 in the output use id_cols = c(id, X_5, X_1) instead.
I modified your data to create a slightly more complicated situation, which I hope is close to your real one. This idea requires that the two columns you want to pair up sit next to each other. The first step is to collect the column names that begin with lowercase letters. The next step is to create a data frame: the names in odd positions of target go in the first column, and the names in even positions go in the second. I was thinking along the same lines as Ben and used map2_dfc to create the output data frame. In the function, I replace the lowercase prefix with X so that I can pick out the two matching columns in the original data (i.e., the ones starting with X). Then I do the calculation as you specified. Finally, I create a column name for the outcome inside the loop. If you want to add the result to the original data, you can run the final line with cbind.
grep(x = names(df), pattern = "[[:lower:]]+_[0-9]+", value = TRUE) -> target
tibble(first_element = target[c(TRUE, FALSE)],
second_element = target[c(FALSE, TRUE)]) -> mydf
map2_dfc(.x = mydf$first_element,
.y = mydf$second_element,
.f = function(x, y) {
sub(x = x, pattern = "[[:lower:]]+", replacement = "X") -> foo1
sub(x = y, pattern = "[[:lower:]]+", replacement = "X") -> foo2
outcome <- ((df[x] - df[y]) / (df[foo1] - df[foo2]) * df["Y"])
names(outcome) <- paste(x,
sub(x = y, pattern = "[[:lower:]]+", replacement = ""),
sep = "")
return(outcome)
}) -> result
cbind(df, result)
# alpha_5_1 alpha_2_6 beta_5_1 beta_3_4
#1 16.05263 0.10736 9.599474 0.27145
#2 -14.43000 0.10730 -11.100000 0.28564
#3 0.00000 0.28710 37.950000 0.50820
#4 -0.14500 0.21576 15.080000 0.64206
#5 -1.64000 -0.06416 0.000000 -0.61352
#6 0.08000 -0.08480 0.000000 -0.25400
DATA
structure(list(
X_5 = c(0.21, 0.01, 0.02, 0.04, 0.11, 0.22),
X_1 = c(0.02,0.02, 0.03, 0.05, 0.10, 0.21),
X_2 = 1:6,
X_6 = 6:11,
X_3 = 21:26,
X_4 = 31:36,
Y = c(0.61, 0.37, 0.55, 0.29, -0.08, -0.08),
alpha_5 = c(10, 0.4, 0.01, 0.01, 0.22, 0.02),
alpha_1 = c(5, 0.01, 0.01, 0.005, 0.015, 0.03),
alpha_2 = c(0.12, 0.55, 0.39, 0.28, 0.99, 0.7),
alpha_6 = 1:6,
beta_5 = c(3, 0.8, 0.3, 0.03, 0.01, 0.01),
beta_1 = c(0.01, 0.5, 0.99, 0.55, 0.01, 0.01),
beta_3 = c(0.55, 0.28, 0.76, 0.86, 0.31, 0.25),
beta_4 = c(5, 8, 10, 23, 77, 32)),
row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame")) -> df

R - transpose dataframe with multiple id columns and multiple variables [duplicate]

I am trying to use pivot_longer. However, I am not sure how to use names_sep or names_pattern to solve this.
dat <- tribble(
~group, ~BP, ~HS, ~BB, ~lowerBP, ~upperBP, ~lowerHS, ~upperHS, ~lowerBB, ~upperBB,
"1", 0.51, 0.15, 0.05, 0.16, 0.18, 0.5, 0.52, 0.14, 0.16,
"2.1", 0.67, 0.09, 0.06, 0.09, 0.11, 0.66, 0.68, 0.08, 0.1,
"2.2", 0.36, 0.13, 0.07, 0.12, 0.15, 0.34, 0.38, 0.12, 0.14,
"2.3", 0.09, 0.17, 0.09, 0.13, 0.16, 0.08, 0.11, 0.15, 0.18,
"2.4", 0.68, 0.12, 0.07, 0.12, 0.14, 0.66, 0.69, 0.11, 0.13,
"3", 0.53, 0.15, 0.06, 0.14, 0.16, 0.52, 0.53, 0.15, 0.16)
Desired output (First row from wide data)
group names values lower upper
1 BP 0.51 0.16 0.18
1 HS 0.15 0.5 0.52
1 BB 0.05 0.14 0.16
Here is a solution that follows a method similar to the one @Fnguyen used, but with the newer pivot_longer and pivot_wider functions:
library(dplyr)
library(tidyr)
longer<-pivot_longer(dat, cols=-1, names_pattern = "(.*)(..)$", names_to = c("limit", "name")) %>%
mutate(limit=ifelse(limit=="", "value", limit))
answer <-pivot_wider(longer, id_cols = c(group, name), names_from = limit, values_from = value, names_repair = "check_unique")
Most of the selecting, separating, mutating and renaming takes place within the pivot function calls.
Update:
The regular expression "(.*)(..)$" means:
( ) ( ) Look for two parts,
(.*) the first part should have zero or more characters
(..) the second part should be exactly 2 characters, anchored by "$" at the end of the string
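To see how that pattern splits the column names, here is a small base R check using sub(). It only illustrates the regular expression on a few of the question's column names and is not what pivot_longer runs internally.

```r
nms <- c("BP", "HS", "lowerBP", "upperHS")

# First capture group: everything before the last two characters
limit <- sub("(.*)(..)$", "\\1", nms)
# Second capture group: the last two characters
name  <- sub("(.*)(..)$", "\\2", nms)

data.frame(column = nms, limit = limit, name = name)
```

The plain "BP" and "HS" columns give an empty first part, which is why the answer recodes limit == "" to "value" before pivoting wider.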
A data.table version (I'm not sure yet how to retain the original names so that you don't need to substitute them back afterwards; see https://github.com/Rdatatable/data.table/issues/2551):
library(data.table)
df <- data.table(dat)
v <- c("BP","HS","BB")
setnames(df, v, paste0("x",v) )
g <- melt(df, id.vars = "group",
measure.vars = patterns(values = "x" ,
lower = "lower",
upper = "upper"),
variable.name = "names")
g[names==1, names := "BP" ]
g[names==2, names := "HS" ]
g[names==3, names := "BB" ]
group names values lower upper
1: 1 BP 0.51 0.16 0.18
2: 2.1 BP 0.67 0.09 0.11
3: 2.2 BP 0.36 0.12 0.15
4: 2.3 BP 0.09 0.13 0.16
5: 2.4 BP 0.68 0.12 0.14
6: 3 BP 0.53 0.14 0.16
7: 1 HS 0.15 0.50 0.52
8: 2.1 HS 0.09 0.66 0.68
9: 2.2 HS 0.13 0.34 0.38
10: 2.3 HS 0.17 0.08 0.11
11: 2.4 HS 0.12 0.66 0.69
12: 3 HS 0.15 0.52 0.53
13: 1 BB 0.05 0.14 0.16
14: 2.1 BB 0.06 0.08 0.10
15: 2.2 BB 0.07 0.12 0.14
16: 2.3 BB 0.09 0.15 0.18
17: 2.4 BB 0.07 0.11 0.13
18: 3 BB 0.06 0.15 0.16
Based on your example data, this solution using dplyr, tidyr and tibble works for me:
library(dplyr)
library(tidyr)  # gather, separate, spread
library(tibble) # rowid_to_column
dat %>%
gather(key, values,-group) %>%
mutate(names = gsub("lower","",gsub("upper","",key))) %>%
separate(key, into = c("key1","key2") ,"[[:upper:]]", perl=T) %>%
mutate(key1 = case_when(key1 == "" ~ "values", TRUE ~ key1)) %>%
select(group,names,key1,values) %>%
rowid_to_column() %>%
spread(key1,values) %>%
select(-rowid) %>%
group_by(group,names) %>%
summarise_all(mean,na.rm = TRUE)
I'd like to add an alternative tidyverse solution drawing from the answer provided by @Dave2e.
Like Dave2e's solution it's a two-step procedure (first rename, then reshape). Instead of reshaping the data twice, I add the prefix "values" to the columns named "BP", "HS", and "BB" using rename_with. This was necessary for getting the column names right when using the .value sentinel in the names_to argument of pivot_longer.
library(dplyr)
library(tidyr)
dat %>%
rename_with(~sub("^(BP|HS|BB)$", "values\\1", .)) %>% # add prefix values
pivot_longer(cols = -1, # the piped data is already the first argument
names_pattern = "(.*)(BP|HS|BB)$",
names_to = c(".value", "names"))

Creating multiple ggplots from select columns in one data frame using for loop or lapply

I am having all sorts of trouble trying to create a loop, or using lapply to generate multiple plots from one data frame.
df
target A.O2 A.H2O A.conc A.bias B.O2 B.H2O B.conc B.bias C.O2 C.H2O C.conc C.bias
1 85 20.90 0.06 254.96 0.01 20.90 0.06 255.02 0.03 20.90 0.06 254.98 0.01
2 50 20.90 0.09 150.09 0.09 20.90 0.09 150.06 0.08 20.90 0.09 150.00 0.03
3 25 20.94 0.09 75.24 0.31 20.94 0.09 75.47 0.62 20.94 0.09 74.98 -0.04
4 85 10.00 0.08 251.99 -1.22 10.00 0.08 252.02 -1.21 10.00 0.08 252.01 -1.21
5 50 10.00 0.09 148.51 -1.06 10.00 0.09 148.52 -1.05 10.00 0.09 148.50 -1.06
6 25 10.00 0.07 74.00 -1.27 10.00 0.07 74.03 -1.24 10.00 0.07 74.03 -1.24
7 85 0.10 0.06 246.99 -3.13 0.10 0.06 247.01 -3.13 0.10 0.06 247.00 -3.13
8 50 0.10 0.14 146.50 -2.39 0.10 0.14 146.50 -2.39 0.10 0.14 146.45 -2.42
9 25 0.10 0.10 72.97 -2.55 0.10 0.10 73.04 -2.45 0.10 0.10 73.04 -2.44
I want to create plots where X = O2 (A.O2, B.O2, C.O2) and Y = bias (A.bias, B.bias, C.bias) and the points are grouped based off the values in the target column.
library(ggrepel)
ggplot(df, aes(A.O2, A.bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target))) +
geom_text_repel(aes(label=paste(A.bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
ggtitle('A') +
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
I want to repeat the same code where the only thing that changes is the X and Y values in aes() and ggtitle(). I have tried looking up similar posts for using for loops or lapply to do this but nothing seems to work.
You could reshape the data to long format and use facet_grid. This is easy with reshape once we swap the prefix and suffix of the column names.
names(df) <- sapply(lapply(strsplit(names(df), "\\."), rev), paste, collapse=".")
dfl <- reshape(df, varying=2:13, direction="long")
library(ggplot2)
library(ggrepel)
ggplot(dfl, aes(O2, bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target)))+
geom_text_repel(aes(label=paste(bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
facet_grid("time") +
# ggtitle(z) + ## not needed
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
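The renaming and reshape step used above can be checked on a small toy frame. The sketch below uses a two-row, two-instrument subset of the question's columns (the values are taken from the question's data, the subset itself is an assumption) and shows that swapping "A.O2" to "O2.A" lets reshape() guess the variable names and the time values from the column names.

```r
# Toy subset of the question's data: two rows, instruments A and B only
df <- data.frame(target = c(85, 50),
                 A.O2 = c(20.9, 20.9), A.bias = c(0.01, 0.09),
                 B.O2 = c(20.9, 20.9), B.bias = c(0.03, 0.08))

# Swap prefix and suffix: "A.O2" becomes "O2.A"; "target" is unchanged
names(df) <- sapply(lapply(strsplit(names(df), "\\."), rev),
                    paste, collapse = ".")

# reshape() splits the varying names at "." into variable and time
dfl <- reshape(df, varying = 2:5, direction = "long")
dfl
```

The long frame has one row per original row and instrument, with the instrument letter in the time column, which is what facet_grid("time") relies on.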
Or if you want three single plots you may put the code into a function together with ggsave to use in an lapply loop.
FUN <- function(x) {
ggplot(dfl[dfl$time == x, ], aes(O2, bias)) +
theme_bw() +
theme(legend.position = 'bottom', plot.title = element_text(hjust=0.5)) +
geom_point(aes(colour = factor(target))) +
geom_line(aes(colour = factor(target)))+
geom_text_repel(aes(label=paste(bias),
hjust= 0.4,
vjust=-.8, colour = factor(target)),
size = 3) +
# facet_grid("time") + ## not needed
ggtitle(x) +
labs(
x = expression('O'[2]),
y = "bias",
colour = 'conc'
)
ggsave(paste0("plot", x, ".png"))
}
times <- c("A", "B", "C")
lapply(times, FUN)
This saves the three plots in your working directory:
dir()
# [1] plotA.png
# [2] plotB.png
# [3] plotC.png
Data:
df <- structure(list(target = c(85L, 50L, 25L, 85L, 50L, 25L, 85L,
50L, 25L), A.O2 = c(20.9, 20.9, 20.94, 10, 10, 10, 0.1, 0.1,
0.1), A.H2O = c(0.06, 0.09, 0.09, 0.08, 0.09, 0.07, 0.06, 0.14,
0.1), A.conc = c(254.96, 150.09, 75.24, 251.99, 148.51, 74, 246.99,
146.5, 72.97), A.bias = c(0.01, 0.09, 0.31, -1.22, -1.06, -1.27,
-3.13, -2.39, -2.55), B.O2 = c(20.9, 20.9, 20.94, 10, 10, 10,
0.1, 0.1, 0.1), B.H2O = c(0.06, 0.09, 0.09, 0.08, 0.09, 0.07,
0.06, 0.14, 0.1), B.conc = c(255.02, 150.06, 75.47, 252.02, 148.52,
74.03, 247.01, 146.5, 73.04), B.bias = c(0.03, 0.08, 0.62, -1.21,
-1.05, -1.24, -3.13, -2.39, -2.45), C.O2 = c(20.9, 20.9, 20.94,
10, 10, 10, 0.1, 0.1, 0.1), C.H2O = c(0.06, 0.09, 0.09, 0.08,
0.09, 0.07, 0.06, 0.14, 0.1), C.conc = c(254.98, 150, 74.98,
252.01, 148.5, 74.03, 247, 146.45, 73.04), C.bias = c(0.01, 0.03,
-0.04, -1.21, -1.06, -1.24, -3.13, -2.42, -2.44)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

Keep NA values in their original position when reordering vector

I have a large set of data that I want to reorder in groups of twelve using the sample() function in R, to generate randomised data sets with which I can carry out a permutation test. However, the data contain NA values where measurements could not be collected, and I would like them to stay in their original positions when the data are shuffled.
Currently, NAs are shuffled randomly with all other values. For example, where example.data is a made-up example set of 12 values:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42)
sample(example.data, replace = F, prob = NULL)
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 NA 0.12 0.42 1.00 NA
Whereas a suitable reordering would be:
[1] 0.64 0.83 NA 0.33 0.47 0.90 0.25 0.12 NA NA 0.42 1.00
Is there a simple way to do this?
Thank you for your help!
This has been solved, but I have a follow-up question.
Extending this: if I have a data set of length 24, how would I go about reordering the first and second sets of 12 values individually?
For example, a vector extending from the first example:
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42, 0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)
Where example.data[1:12] and example.data[13:24] are shuffled separately within their own respective groups.
The code I am trying to work this solution into is as follows:
shuffle.data = function(input.data,nr,ns){
simdata <- input.data
for(i in 1:nr){
start.row <- (ns*(i-1))+1
end.row <- start.row + actual.length[i] - 1
newdata = sample(input.data[start.row:end.row], size=actual.length[i], replace=F)
simdata[start.row:end.row] <- newdata
}
return(simdata)}
Where input.data is the raw input data (example.data), nr is the number of groups (2), ns is the size of each group (12), and actual.length is a vector holding the length of each group excluding NAs (actual.length <- c(9, 8) in the example above).
Thank you again for your help!
Is this what you are looking for ?
example.data[!is.na(example.data)] <- sample(example.data[!is.na(example.data)], replace = F, prob = NULL)
We can shuffle only the non-NA elements by creating an index of their positions:
i1 <- which(!is.na(example.data))
example.data[i1] <- example.data[sample(i1)]
example.data
#[1] 0.25 0.64 NA 0.83 0.12 1.00 0.42 0.47 NA NA 0.33 0.90
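For the follow-up question (shuffling each block of 12 separately), one possible sketch applies the same index idea within groups. The function names shuffle_keep_na and shuffle_in_groups are made up for this example, and it assumes the groups are consecutive blocks of equal size.

```r
example.data <- c(0.33, 0.12, NA, 0.25, 0.47, 0.83, 0.90, 0.64, NA, NA, 1.00, 0.42,
                  0.73, NA, 0.56, 0.12, 1.0, 0.47, NA, 0.62, NA, 0.98, NA, 0.05)

# Shuffle the non-NA values of x among the non-NA positions only
shuffle_keep_na <- function(x) {
  ok <- which(!is.na(x))
  # sample.int() on the count avoids the sample() surprise when a
  # group has a single non-NA value (sample(5) means sample(1:5))
  x[ok] <- x[ok][sample.int(length(ok))]
  x
}

# Apply the shuffle separately to consecutive blocks of group_size values
shuffle_in_groups <- function(x, group_size) {
  groups <- ceiling(seq_along(x) / group_size)
  unsplit(lapply(split(x, groups), shuffle_keep_na), groups)
}

set.seed(1)  # only to make the example repeatable
shuffled <- shuffle_in_groups(example.data, 12)
shuffled
```

Because the NA positions are found inside each group, no separate actual.length vector is needed.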
