I am running split.default() in R on the data below. The problem is that the resulting sub data frames in the list come back sorted by the column-name prefix.
I don't want this to happen; I want to preserve the order in which the prefixes appear in the original data. Is there an approach that I can follow to do so? Please suggest.
Input Data
data <- structure(list(`B-DIODE` = c(1.2, 0.4), `B-DIODE` = c(1.3, 0.6
), `A-DIODE` = c(1.4, 0.8), `A-ACC1` = c(1.5, 1), `A-ACC2` = c(1.6,
1.2), `A-ANA0` = c(1.7, 1.4), `A-ANA1` = c(1.8, 1.6), `A-BRICKID` = c(1.9,
1.8), `A-CC0` = c(2L, 2L), `A-CC1` = c(2.1, 2.2), `A-DIGDN` = c(2.2,
2.4), `A-DIGDP` = c(2.3, 2.6), `A-DN1` = c(2.4, 2.8), `A-DN2` = c(2.5,
3), `A-DP1` = c(2.6, 3.2), `A-DP2` = c(2.7, 3.4), `A-SCL` = c(2.8,
3.6), `A-SDA` = c(2.9, 3.8), `A-USB0DN` = 3:4, `A-USB0DP` = c(3.1,
4.2), `A-USB1DN` = c(3.2, 4.4), `A-USB1DP` = c(3.3, 4.6), `A-ACC1` = c(3.4,
4.8), `A-ACC2` = c(3.5, 5), `A-ANA0` = c(3.6, 5.2), `A-ANA1` = c(3.7,
5.4), `A-BRICKID` = c(3.8, 5.6), `A-CC0` = c(3.9, 5.8), `A-CC1` = c(4L,
6L), `A-DIGDN` = c(4.1, 6.2), `A-DIGDP` = c(4.2, 6.4), `A-DN1` = c(4.3,
6.6), `A-DN2` = c(4.4, 6.8), `A-DP1` = c(4.5, 7), `A-DP2` = c(4.6,
7.2), `A-SCL` = c(4.7, 7.4), `A-SDA` = c(4.8, 7.6), `A-USB0DN` = c(4.9,
7.8), `A-USB0DP` = c(5L, 8L), `A-USB1DN` = c(5.1, 8.2), `A-USB1DP` = c(5.2,
8.4), `A-NA` = c(5.3, 8.6), `A-ACC2PWRLKG_0v4` = c(5.4, 8.8),
`A-ACC2PWRLKG_0v4` = c(5.5, 9), `A-P_IN_Leak` = c(5.6, 9.2
)), class = "data.frame", row.names = c(NA, -2L))
Code
split.default(data, sub("-.*", "", names(data)))
Output
$`A`
A-DIODE A-ACC1 A-ACC2 A-ANA0 A-ANA1 A-BRICKID A-CC0 A-CC1 A-DIGDN A-DIGDP A-DN1 A-DN2 A-DP1 A-DP2 A-SCL A-SDA A-USB0DN A-USB0DP A-USB1DN A-USB1DP
1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3
2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6
A-ACC1.1 A-ACC2.1 A-ANA0.1 A-ANA1.1 A-BRICKID.1 A-CC0.1 A-CC1.1 A-DIGDN.1 A-DIGDP.1 A-DN1.1 A-DN2.1 A-DP1.1 A-DP2.1 A-SCL.1 A-SDA.1 A-USB0DN.1
1 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
2 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
A-USB0DP.1 A-USB1DN.1 A-USB1DP.1 A-NA A-ACC2PWRLKG_0v4 A-ACC2PWRLKG_0v4.1 A-P_IN_Leak
1 5 5.1 5.2 5.3 5.4 5.5 5.6
2 8 8.2 8.4 8.6 8.8 9.0 9.2
$B
B-DIODE B-DIODE.1
1 1.2 1.3
2 0.4 0.6
In the above output I want the $B to appear first and then $A as that's the sequence that Input Data followed.
One option is to convert the names to factor and set the levels as needed
new_name <- sub("-.*", "", names(data))
split.default(data, factor(new_name, levels = unique(new_name)))
#$B
# B-DIODE B-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#$A
# A-DIODE A-ACC1 A-ACC2 A-ANA0 ....
#1 1.4 1.5 1.6 1.7 ....
#2 0.8 1.0 1.2 1.4 ....
By specifying the levels as unique(new_name), we ensure that the list is split in order of first occurrence in the data frame and not alphabetically.
As @thelatemail suggests, we can also avoid converting the names to a factor by reordering the list with unique(new_name):
split.default(data, new_name)[unique(new_name)]
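To see both variants side by side on a smaller example, here is a minimal sketch. The data frame `toy` and its column names are made up for illustration and are not part of the original data.

```r
# Toy data (hypothetical): prefix "B" appears before prefix "A"
toy <- data.frame(1:2, 3:4, 5:6, 7:8)
names(toy) <- c("B-x", "B-y", "A-x", "A-z")

grp <- sub("-.*", "", names(toy))

# Default: a character grouping vector is turned into a factor with
# sorted levels, so the groups come back sorted
by_default <- split.default(toy, grp)
names(by_default)

# Factor with levels in order of first appearance keeps the original order
by_factor <- split.default(toy, factor(grp, levels = unique(grp)))
names(by_factor)

# Reordering the sorted result by unique(grp) gives the same group order
by_reorder <- split.default(toy, grp)[unique(grp)]
names(by_reorder)
```

The factor variant fixes the order before splitting, while the reordering variant fixes it afterwards; for a handful of groups the difference is mostly a matter of taste.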
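Another option is to create the group indices for splitting using rle. This numbers each run of adjacent columns that share a prefix, so the list is named by position rather than by prefix
rl <- rle(sub("-.*", "", names(data)))
split.default(data, rep(seq_along(rl$lengths), rl$lengths))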
#$`1`
# B-DIODE B-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#
#$`2`
# A-DIODE A-ACC1 A-ACC2 A-ANA0 A-ANA1 A-BRICKID A-CC0 A-CC1 A-DIGDN A-DIGDP
#1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3
#2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6
# A-DN1 A-DN2 A-DP1 A-DP2 A-SCL A-SDA A-USB0DN A-USB0DP A-USB1DN A-USB1DP
#1 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3
#2 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6
# A-ACC1.1 A-ACC2.1 A-ANA0.1 A-ANA1.1 A-BRICKID.1 A-CC0.1 A-CC1.1 A-DIGDN.1
#1 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1
#2 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2
# A-DIGDP.1 A-DN1.1 A-DN2.1 A-DP1.1 A-DP2.1 A-SCL.1 A-SDA.1 A-USB0DN.1
#1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
#2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
# A-USB0DP.1 A-USB1DN.1 A-USB1DP.1 A-NA A-ACC2PWRLKG_0v4 A-ACC2PWRLKG_0v4.1
#1 5 5.1 5.2 5.3 5.4 5.5
#2 8 8.2 8.4 8.6 8.8 9.0
# A-P_IN_Leak
#1 5.6
#2 9.2
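As a small sketch of how run-length based indices behave when a prefix shows up again later in the column order (the names in `nms` are hypothetical):

```r
# Hypothetical column names where prefix "A" occurs in two separate runs
nms <- c("A-x", "A-y", "B-x", "A-z")
prefix <- sub("-.*", "", nms)

rl <- rle(prefix)
rl$values   # one entry per run
rl$lengths  # number of consecutive columns in each run

# One group index per run, repeated over the columns in that run.
# The runs are counted via rl$lengths; length(rl) would only give the
# number of components of the rle object, which is always 2.
idx <- rep(seq_along(rl$lengths), rl$lengths)
idx
```

Unlike the factor approach, the second run of "A" gets its own index here, so columns are grouped only with their immediate neighbours. Whether that is desirable depends on the data.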
cutoff KM KM_lo KM_hi rstm rstm_lo rstm_hi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017-01-01 2.1 1.4 4.9 7.2 3.9 10.2
2 2017-04-01 3.5 2.1 4.7 8.9 6.6 10.8
3 2017-07-01 3.7 2.8 4.2 7.2 6.2 8.4
How do I convert this to a long table? I am struggling to get it into the format I want. I tried the gather and melt functions. The output table would look something like this
cutoff VAR Val Val-hi Val-lo
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 4.9 1.4
2 2017-01-01 rstm 7.2 10.2 3.9
3 2017-07-01 KM 3.7 4.2 2.8
Sample data
structure(list(cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"
), KM = c(2.1, 3.5, 3.7), KM_lo = c(1.4, 2.1, 2.8), KM_hi = c(4.9,
4.7, 4.2), rstm = c(7.2, 8.9, 7.2), rstm_lo = c(3.9, 6.6, 6.2
), rstm_hi = c(10.2, 10.8, 8.4)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
We may do
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
rename_with(~ str_c(., "_none"), c("KM", "rstm")) %>%
pivot_longer(cols = -cutoff, names_to = c("VAR", ".value"),
names_sep = "_") %>%
rename_with(~ c("Val", "Val-lo", "Val-hi"), 3:5)
-output
# A tibble: 6 × 5
cutoff VAR Val `Val-lo` `Val-hi`
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 1.4 4.9
2 2017-01-01 rstm 7.2 3.9 10.2
3 2017-04-01 KM 3.5 2.1 4.7
4 2017-04-01 rstm 8.9 6.6 10.8
5 2017-07-01 KM 3.7 2.8 4.2
6 2017-07-01 rstm 7.2 6.2 8.4
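The key piece here is the ".value" entry in names_to: the part of the column name in that position becomes the name of an output column instead of a value in a key column. A minimal sketch on made-up data (the data frame `wide` and its columns are hypothetical):

```r
library(tidyr)

# Hypothetical wide table: two variables (a, b), each with an x and a y column
wide <- data.frame(id = 1:2,
                   a_x = c(1, 2), a_y = c(3, 4),
                   b_x = c(5, 6), b_y = c(7, 8))

# The first name piece goes into VAR; the second piece (".value")
# supplies the names of the value columns x and y
long <- pivot_longer(wide, cols = -id,
                     names_to = c("VAR", ".value"),
                     names_sep = "_")
long
```

This also suggests why the answer first renames KM and rstm to KM_none and rstm_none: every column name then has two pieces for names_sep to split.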
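Here is another pivot_longer approach. Note that, as the output shows, it returns only the lo and hi columns; the unsuffixed KM and rstm values are not kept: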
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-cutoff,
names_to = c("VAR", ".value"),
names_pattern = "(.+)_(.+)"
) %>%
na.omit()
cutoff VAR lo hi
<chr> <chr> <dbl> <dbl>
1 2017-01-01 KM 1.4 4.9
2 2017-01-01 rstm 3.9 10.2
3 2017-04-01 KM 2.1 4.7
4 2017-04-01 rstm 6.6 10.8
5 2017-07-01 KM 2.8 4.2
6 2017-07-01 rstm 6.2 8.4
library(tidyverse)
df <-
structure(
list(
cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"),
KM = c(2.1, 3.5, 3.7),
KM_lo = c(1.4, 2.1, 2.8),
KM_hi = c(4.9, 4.7, 4.2),
rstm = c(7.2, 8.9, 7.2),
rstm_lo = c(3.9, 6.6, 6.2),
rstm_hi = c(10.2, 10.8, 8.4)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame")
)
df %>%
pivot_longer(cols = -cutoff) %>%
separate(col = name, into = c("name", "suffix"), sep = "_", remove = TRUE) %>%
mutate(id = data.table::rleid(name)) %>%
pivot_wider(id_cols = c(id, cutoff, name), names_from = suffix, names_prefix = "VAL_", values_from = value) %>%
select(-id) %>%
rename(VAL = VAL_NA)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 4, 7,
#> 10, 13, 16].
#> # A tibble: 6 x 5
#> cutoff name VAL VAL_lo VAL_hi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2017-01-01 KM 2.1 1.4 4.9
#> 2 2017-01-01 rstm 7.2 3.9 10.2
#> 3 2017-04-01 KM 3.5 2.1 4.7
#> 4 2017-04-01 rstm 8.9 6.6 10.8
#> 5 2017-07-01 KM 3.7 2.8 4.2
#> 6 2017-07-01 rstm 7.2 6.2 8.4
Created on 2021-09-28 by the reprex package (v2.0.1)
I want to calculate the correlation between the 'y' column and each column in the 'col_df' data frame.
For each calculation I want to keep only the names of the columns with a significant p-value (p_value < 0.05).
y is a 64X1 vector of 0s and 1s.
Example of col_df (60X12000):
a b c d e
7.6 4.9 8.9 6.0 4.2
25.0 6.5 4.6 13.2 3.0
col_df <- as.matrix(df)
test <- col_df[, apply(col_df, MARGIN = 2, FUN = function(x)
(cor.test(y, col_df[,x], method = "pearson")$p.value <0.05))]
This is the error:
Error in col_df[, x] : subscript out of bounds
Is this the way to do that?
This is a working solution:
df <- structure(list(a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6),
d = c(13.2, 13.2, 6, 6, 6, 6, 6, 13.2, 13.2, 13.2),
e = c(3, 4.2, 3, 4.2, 3, 3, 3, 4.2, 4.2, 4.2)),
class = "data.frame", row.names = c(NA, -10L))
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)
test <- df[, apply(df, MARGIN = 2, FUN = function(x)
(cor.test(y, x, method = "pearson")$p.value < 0.05))]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
The difference from your solution is that apply() gives you the column itself as x and
not an index. Hence, all you have to do is replace col_df[,x] in your solution with
just x.
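A tiny sketch of what apply() hands to the function, using a made-up matrix `m`:

```r
# Hypothetical 3 x 2 matrix
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))

# With MARGIN = 2, FUN receives each column's values,
# so x is a vector of length nrow(m), not a column number
by_value <- apply(m, MARGIN = 2, FUN = function(x) sum(x))
by_value

# If an index is really wanted, iterate over the column positions instead
by_index <- sapply(seq_len(ncol(m)), function(j) sum(m[, j]))
by_index
```

In the question's code the column values themselves ended up being used as column indices in col_df[, x], which is presumably where the "subscript out of bounds" error comes from.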
You can simplify it a little with sapply(). I also recommend not to put everything into
a single line. It is hard to read and harder to debug.
Columns <- sapply(df, FUN = function(x) (cor.test(y, x, method = "pearson")$p.value < 0.05))
test <- df[, Columns]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
Created on 2020-07-22 by the reprex package (v0.3.0)
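Since the question asks for the column names only, a possible variation is to keep the p-values and take the names from them. This is a sketch on made-up data (`dat`, `signal` and `noise` are hypothetical), not the answer's original code:

```r
set.seed(1)  # made-up data for illustration only
y <- rep(c(0, 1), each = 10)
dat <- data.frame(
  signal = y + rnorm(20, sd = 0.1),  # constructed to track y closely
  noise  = rep(c(1, 2), times = 10)  # constructed to be unrelated to y
)

# One p-value per column; vapply() checks that each result is a single number
p_values <- vapply(dat, function(x) cor.test(y, x)$p.value, numeric(1))

# Names of the columns below the 0.05 threshold
sig_cols <- names(p_values)[p_values < 0.05]
sig_cols

# drop = FALSE keeps a data frame even if only one column is selected
sig_df <- dat[, sig_cols, drop = FALSE]
```

With thousands of columns, some p-values will fall below 0.05 by chance alone, so a correction such as p.adjust() may be worth considering before filtering.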
I have a large data frame (thousands of columns) in which a few columns have duplicate column names. There are also sets of columns whose names share one part while the other part differs.
Using R and the above two properties, I want to split all such columns into different data frames for further analysis. To achieve this I want to run the following dynamic logic on the data frame:
First: Find the columns with duplicate column names and cbind() them into separate data frames. If 10 columns have the same column name, they form one data frame, and another 5 with the same column name form another data frame.
Second: Find the columns whose names match in the part before - but differ in the part after -, and cbind() them into separate data frames.
Below is the sample input data (the real data is too big, but follows exactly the same properties), for which the first two columns will form a single data frame based on the above logic. Another data frame will contain the columns from the third to the last.
I tried split(), but that hasn't worked out so far. Any suggestions on how I can do this?
Sample Input Data
structure(list(`A-DIODE` = c(1.2, 0.4), `A-DIODE` = c(1.3, 0.6
), `B-DIODE` = c(1.4, 0.8), `B-ACC1` = c(1.5, 1), `B-ACC2` = c(1.6,
1.2), `B-ANA0` = c(1.7, 1.4), `B-ANA1` = c(1.8, 1.6), `B-BRICKID` = c(1.9,
1.8), `B-CC0` = c(2L, 2L), `B-CC1` = c(2.1, 2.2), `B-DIGDN` = c(2.2,
2.4), `B-DIGDP` = c(2.3, 2.6), `B-DN1` = c(2.4, 2.8), `B-DN2` = c(2.5,
3), `B-DP1` = c(2.6, 3.2), `B-DP2` = c(2.7, 3.4), `B-SCL` = c(2.8,
3.6), `B-SDA` = c(2.9, 3.8), `B-USB0DN` = 3:4, `B-USB0DP` = c(3.1,
4.2), `B-USB1DN` = c(3.2, 4.4), `B-USB1DP` = c(3.3, 4.6), `B-ACC1` = c(3.4,
4.8), `B-ACC2` = c(3.5, 5), `B-ANA0` = c(3.6, 5.2), `B-ANA1` = c(3.7,
5.4), `B-BRICKID` = c(3.8, 5.6), `B-CC0` = c(3.9, 5.8), `B-CC1` = c(4L,
6L), `B-DIGDN` = c(4.1, 6.2), `B-DIGDP` = c(4.2, 6.4), `B-DN1` = c(4.3,
6.6), `B-DN2` = c(4.4, 6.8), `B-DP1` = c(4.5, 7), `B-DP2` = c(4.6,
7.2), `B-SCL` = c(4.7, 7.4), `B-SDA` = c(4.8, 7.6), `B-USB0DN` = c(4.9,
7.8), `B-USB0DP` = c(5L, 8L), `B-USB1DN` = c(5.1, 8.2), `B-USB1DP` = c(5.2,
8.4), `B-NA` = c(5.3, 8.6), `B-ACC2PWRLKG_0v4` = c(5.4, 8.8),
`B-ACC2PWRLKG_0v4` = c(5.5, 9), `B-P_IN_Leak` = c(5.6, 9.2
)), class = "data.frame", row.names = c(NA, -2L))
Output Based On Logic Discussed Above
Data Frame 1
A-DIODE A-DIODE
1.2 1.3
0.4 0.6
Data Frame 2
B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1
3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5
4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0
B-DP2.1 B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
7.2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
We can use split.default on the substring of names of the dataset
split.default(df1, sub("-.*", "", names(df1)))
Or if we know there would be only one character before -
split.default(df1, substr(names(df1), 1, 1))
#$A
# A-DIODE A-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#$B
# B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
#1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
#2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
# B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1 B-DP2.1
#1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6
#2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0 7.2
# B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
#1 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
#2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
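As for why plain split() may not have worked: on a data frame it dispatches to the data frame method, which groups rows, whereas split.default() treats the data frame as a list of columns. A minimal sketch on made-up data (`toy` is hypothetical):

```r
# Hypothetical data frame with two "A" columns and one "B" column
toy <- data.frame(1:2, 3:4, 5:6)
names(toy) <- c("A-x", "A-y", "B-x")
grp <- substr(names(toy), 1, 1)

# split.default() works on the columns: one data frame per prefix
cols <- split.default(toy, grp)
n_cols <- sapply(cols, ncol)   # columns per group
n_cols

# split() on a data frame groups the rows instead
rows <- split(toy, c("r1", "r2"))
n_rows <- sapply(rows, nrow)   # rows per group
n_rows
```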
Considering this is my dataset below
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.7 2.5 5.0 2.0 virginica
7.7 3.0 6.1 2.3 virginica
6.7 3.3 5.7 2.1 virginica
4.8 3.0 1.4 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.6 1.4 0.1 setosa
6.3 3.3 4.7 1.6 versicolor
5.6 2.9 3.6 1.3 versicolor
5.9 3.0 4.2 1.5 versicolor
df <- structure(list(Sepal.Length = c(5.7, 7.7, 6.7, 4.8, 5.5, 4.9,
6.3, 5.6, 5.9), Sepal.Width = c(2.5, 3, 3.3, 3, 4.2, 3.6, 3.3,
2.9, 3), Petal.Length = c(5, 6.1, 5.7, 1.4, 1.4, 1.4, 4.7, 3.6,
4.2), Petal.Width = c(2, 2.3, 2.1, 0.1, 0.2, 0.1, 1.6, 1.3, 1.5
), Species = structure(c(3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("setosa",
"versicolor", "virginica"), class = "factor")), row.names = c(NA,
-9L), class = "data.frame")
My goal is to
Subtract the values of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width in every row of "setosa" from those in the 1st row of Species == "virginica",
which I am doing like this below
Virginia1_vs_Setosa1a <- df[1:4][df$Species == "virginica",][1,] - df[1:4][df$Species == "setosa",][1,]
Virginia1_vs_Setosa1a
0.9 -0.5 3.6 1.9
Virginia1_vs_Setosa2a <- df[1:4][df$Species == "virginica",][1,] - df[1:4][df$Species == "setosa",][2,]
Virginia1_vs_Setosa2a
0.2 -1.7 3.6 1.8
Virginia1_vs_Setosa3a <- df[1:4][df$Species == "virginica",][1,] - df[1:4][df$Species == "setosa",][3,]
Virginia1_vs_Setosa3a
0.8 -1.1 3.6 1.9
Take the product of each element
Virginia1_vs_Setosa1 <- as.numeric(
Virginia1_vs_Setosa1a[1]*Virginia1_vs_Setosa1a[2]*
Virginia1_vs_Setosa1a[3]*Virginia1_vs_Setosa1a[4])
0.9*-0.5*3.6*1.9 = -3.078
Virginia1_vs_Setosa2 <- as.numeric(
Virginia1_vs_Setosa2a[1]*Virginia1_vs_Setosa2a[2]*
Virginia1_vs_Setosa2a[3]*Virginia1_vs_Setosa2a[4])
0.2*-1.7*3.6*1.8 = -2.2032
Virginia1_vs_Setosa3 <- as.numeric(
Virginia1_vs_Setosa3a[1]*Virginia1_vs_Setosa3a[2]*
Virginia1_vs_Setosa3a[3]*Virginia1_vs_Setosa3a[4])
0.8*-1.1*3.6*1.9 = -6.0192
Similarly for the 2nd row in virginica with every row in setosa.
Virginia2_vs_Setosa1a <- df[1:4][df$Species == "virginica",][2,] - df[1:4][df$Species == "setosa",][1,]
Virginia2_vs_Setosa2a <- df[1:4][df$Species == "virginica",][2,] - df[1:4][df$Species == "setosa",][2,]
Virginia2_vs_Setosa3a <- df[1:4][df$Species == "virginica",][2,] - df[1:4][df$Species == "setosa",][3,]
Virginia2_vs_Setosa1 <- as.numeric(
Virginia2_vs_Setosa1a[1]*Virginia2_vs_Setosa1a[2]*
Virginia2_vs_Setosa1a[3]*Virginia2_vs_Setosa1a[4])
Virginia2_vs_Setosa2 <- as.numeric(
Virginia2_vs_Setosa2a[1]*Virginia2_vs_Setosa2a[2]*
Virginia2_vs_Setosa2a[3]*Virginia2_vs_Setosa2a[4])
Virginia2_vs_Setosa3 <- as.numeric(
Virginia2_vs_Setosa3a[1]*Virginia2_vs_Setosa3a[2]*
Virginia2_vs_Setosa3a[3]*Virginia2_vs_Setosa3a[4])
rm(Virginia2_vs_Setosa1a, Virginia2_vs_Setosa2a,
Virginia2_vs_Setosa3a)
Similarly with 3rd row in virginica with every row in setosa
Virginia3_vs_Setosa1a <- df[1:4][df$Species == "virginica",][3,] - df[1:4][df$Species == "setosa",][1,]
Virginia3_vs_Setosa2a <- df[1:4][df$Species == "virginica",][3,] - df[1:4][df$Species == "setosa",][2,]
Virginia3_vs_Setosa3a <- df[1:4][df$Species == "virginica",][3,] - df[1:4][df$Species == "setosa",][3,]
Virginia3_vs_Setosa1 <- as.numeric(
Virginia3_vs_Setosa1a[1]*Virginia3_vs_Setosa1a[2]*
Virginia3_vs_Setosa1a[3]*Virginia3_vs_Setosa1a[4])
Virginia3_vs_Setosa2 <- as.numeric(
Virginia3_vs_Setosa2a[1]*Virginia3_vs_Setosa2a[2]*
Virginia3_vs_Setosa2a[3]*Virginia3_vs_Setosa2a[4])
Virginia3_vs_Setosa3 <- as.numeric(
Virginia3_vs_Setosa3a[1]*Virginia3_vs_Setosa3a[2]*
Virginia3_vs_Setosa3a[3]*Virginia3_vs_Setosa3a[4])
rm(Virginia3_vs_Setosa1a, Virginia3_vs_Setosa2a,
Virginia3_vs_Setosa3a)
And finally creating a 3*3 matrix like this below
matrix(c(Virginia1_vs_Setosa1, Virginia1_vs_Setosa2, Virginia1_vs_Setosa3, Virginia2_vs_Setosa1, Virginia2_vs_Setosa2, Virginia2_vs_Setosa3,
Virginia3_vs_Setosa1, Virginia3_vs_Setosa2, Virginia3_vs_Setosa3), nrow=3, ncol=3)
[,1] [,2] [,3]
[1,] -3.0780 0.0000 4.9020
[2,] -2.2032 -26.0568 -8.8236
[3,] -6.0192 -17.3712 -4.6440
As you can see my solution is very clunky and inefficient. I will be very thankful if anybody can show me an efficient way of achieving the same results.
You can do this with a double for loop. Maybe there are solutions with the *apply family of functions but this one works.
f <- droplevels(df$Species[df$Species != "versicolor"])
sp <- split(df[df$Species != "versicolor", ], f)
res <- matrix(0, 3, 3)
for(i in 1:nrow(sp[[1]])){
for(j in 1:nrow(sp[[2]])){
res[i, j] <- prod(sp[[2]][j, -5] - sp[[1]][i, -5])
}
}
res
# [,1] [,2] [,3]
#[1,] -3.0780 0.0000 4.9020
#[2,] -2.2032 -26.0568 -8.8236
#[3,] -6.0192 -17.3712 -4.6440
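For comparison, the same computation can be sketched with the *apply family instead of explicit loops. The two matrices below simply repeat the virginica and setosa rows from the question; the names `vir` and `set` are chosen here for illustration.

```r
# Virginica and setosa measurements from the question, one row per flower
vir <- rbind(c(5.7, 2.5, 5.0, 2.0),
             c(7.7, 3.0, 6.1, 2.3),
             c(6.7, 3.3, 5.7, 2.1))
set <- rbind(c(4.8, 3.0, 1.4, 0.1),
             c(5.5, 4.2, 1.4, 0.2),
             c(4.9, 3.6, 1.4, 0.1))

# Column j holds virginica row j against every setosa row:
# res[i, j] = prod(vir[j, ] - set[i, ])
res <- sapply(seq_len(nrow(vir)), function(j) {
  apply(set, 1, function(s) prod(vir[j, ] - s))
})
res
```

This is not necessarily faster than the double loop for a 3 x 3 result, but it avoids preallocating the matrix and hard-coding its size.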
For this particular case, you can borrow some ideas from outer
X <- lapply(split(df[df$Species=="virginica", 1:4], 1:3), unlist)
Y <- lapply(split(df[df$Species=="setosa", 1:4], 1:3), unlist)
FUN <- function(l1, l2) mapply(function(v,w) prod(v-w), l1, l2)
Y <- rep(Y, rep.int(length(X), length(Y)))
if (length(X))
X <- rep(X, times = ceiling(length(Y)/length(X)))
matrix(FUN(X, Y), ncol=3L, byrow=TRUE)
For the most general case, you will need to generate every possible pair of different rows, then calculate according to your formula. Using data.table, it would be something like:
library(data.table)
setDT(df)
setorder(df, Species)[, numid := rowid(Species)]
parts <- split(df, by=c("Species", "numid"))
combis <- CJ(parts, parts, sorted=FALSE)
combis[, .(
Species1=V1[[1]][,Species],
numid1=V1[[1]][,numid],
Species2=V2[[1]][,Species],
numid2=V2[[1]][,numid],
differ=prod(V1[[1]][, 1:4] - V2[[1]][, 1:4])),
by=seq_len(combis[,.N])][
Species1!=Species2, -1L]
I have a character field in a dataframe that contains numbers e.g. (0.5,3.5,7.8,2.4).
For every record I am trying to extract the largest value from the string and put it in a new column.
e.g.
x csi
1 0.5, 6.7, 2.3
2 9.5, 2.6, 1.1
3 0.7, 2.3, 5.1
4 4.1, 2.7, 4.7
The desired output would be:
x csi csi_max
1 0.5, 6.7, 2.3 6.7
2 9.5, 2.6, 1.1 9.5
3 0.7, 2.3, 5.1 5.1
4 4.1, 2.7, 4.7 4.7
I have had various attempts ...with my latest attempt being the following - which provides the maximum csi score from the entire column rather than from the individual row's csi numbers...
library(stringr)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
df$max_csi <- max(numextract(df$csi))
Thank you
We can use tidyverse
library(dplyr)
library(tidyr)
df1 %>%
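separate_rows(csi, convert = TRUE) %>%  # convert = TRUE makes the pieces numeric, so max() compares numbers rather than strings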
group_by(x) %>%
summarise(csi_max = max(csi)) %>%
left_join(df1, .)
# x csi csi_max
#1 1 0.5, 6.7, 2.3 6.7
#2 2 9.5, 2.6, 1.1 9.5
#3 3 0.7, 2.3, 5.1 5.1
#4 4 4.1, 2.7, 4.7 4.7
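One thing to keep in mind whenever a string is split into pieces: the pieces are character values, and max() on a character vector compares them as strings. A short sketch with made-up values:

```r
vals <- c("9.5", "10.2", "2.6")  # made-up values, still character

# String comparison: "9.5" sorts after "10.2" because "9" > "1"
chr_max <- max(vals)
chr_max

# Numeric comparison after conversion
num_max <- max(as.numeric(vals))
num_max
```

For the sample data in the question both comparisons happen to agree, because every value has a single digit before the decimal point.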
Or this can be done with pmax from base R after separating the 'csi' column into a data.frame with read.table
df1$csi_max <- do.call(pmax, read.table(text=df1$csi, sep=","))
Hope this helps!
df$csi_max <- sapply(df$csi, function(x) max(as.numeric(unlist(strsplit(as.character(x), split=",")))))
Output is:
x csi csi_max
1 1 0.5, 6.7, 2.3 6.7
2 2 9.5, 2.6, 1.1 9.5
3 3 0.7, 2.3, 5.1 5.1
4 4 4.1, 2.7, 4.7 4.7
#sample data
> dput(df)
structure(list(x = 1:4, csi = structure(c(1L, 4L, 2L, 3L), .Label = c("0.5, 6.7, 2.3",
"0.7, 2.3, 5.1", "4.1, 2.7, 4.7", "9.5, 2.6, 1.1"), class = "factor")), .Names = c("x",
"csi"), class = "data.frame", row.names = c(NA, -4L))
Edit:
As suggested by @RichScriven, a more efficient way could be
df$csi_max <- sapply(strsplit(as.character(df$csi), ","), function(x) max(as.numeric(x)))
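A variation on the same idea, sketched with vapply() so that each row is checked to return exactly one number. The vector `csi` below is hypothetical and deliberately has a different number of values per row:

```r
# Hypothetical rows holding different numbers of values
csi <- c("0.5, 6.7, 2.3", "9.5, 2.6", "4.1")

# strsplit() returns a list with one character vector per row
parts <- strsplit(csi, ",")

# as.numeric() tolerates the leading space left after each comma
csi_max <- vapply(parts, function(p) max(as.numeric(p)), numeric(1))
csi_max
```

Because each row is handled on its own, this does not rely on every row having the same number of values.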
A solution using the splitstackshape package.
library(splitstackshape)
dat$csi_max <- apply(cSplit(dat, "csi")[, -1], 1, max)
dat
# x csi csi_max
# 1 1 0.5, 6.7, 2.3 6.7
# 2 2 9.5, 2.6, 1.1 9.5
# 3 3 0.7, 2.3, 5.1 5.1
# 4 4 4.1, 2.7, 4.7 4.7
DATA
dat <- read.table(text = "x csi
1 '0.5, 6.7, 2.3'
2 '9.5, 2.6, 1.1'
3 '0.7, 2.3, 5.1'
4 '4.1, 2.7, 4.7'",
header = TRUE, stringsAsFactors = FALSE)