I have a large data frame (thousands of columns) in which a few columns share exactly the same column name. There is also a set of columns whose names match only partly: the part before the - is the same, while the part after it differs.
Using R and these two properties, I want to split all such columns into separate data frames for further analysis. To achieve this, I want to apply the following logic to the data frame:
First: find the columns with identical names and cbind() them into separate data frames. If 10 columns share one column name, they form one data frame, and another 5 that share a different name form another.
Second: find and cbind() columns into a data frame when the part of the column name before the - matches that of another column, even though the part after the - does not.
Below is sample input data (the real data is too big to post, but has exactly the same properties). Here, the first two columns should form one data frame, and the columns from the third to the last should form another.
I tried split(), but that hasn't worked out so far. Any suggestions on how I can do this?
Sample Input Data
structure(list(`A-DIODE` = c(1.2, 0.4), `A-DIODE` = c(1.3, 0.6
), `B-DIODE` = c(1.4, 0.8), `B-ACC1` = c(1.5, 1), `B-ACC2` = c(1.6,
1.2), `B-ANA0` = c(1.7, 1.4), `B-ANA1` = c(1.8, 1.6), `B-BRICKID` = c(1.9,
1.8), `B-CC0` = c(2L, 2L), `B-CC1` = c(2.1, 2.2), `B-DIGDN` = c(2.2,
2.4), `B-DIGDP` = c(2.3, 2.6), `B-DN1` = c(2.4, 2.8), `B-DN2` = c(2.5,
3), `B-DP1` = c(2.6, 3.2), `B-DP2` = c(2.7, 3.4), `B-SCL` = c(2.8,
3.6), `B-SDA` = c(2.9, 3.8), `B-USB0DN` = 3:4, `B-USB0DP` = c(3.1,
4.2), `B-USB1DN` = c(3.2, 4.4), `B-USB1DP` = c(3.3, 4.6), `B-ACC1` = c(3.4,
4.8), `B-ACC2` = c(3.5, 5), `B-ANA0` = c(3.6, 5.2), `B-ANA1` = c(3.7,
5.4), `B-BRICKID` = c(3.8, 5.6), `B-CC0` = c(3.9, 5.8), `B-CC1` = c(4L,
6L), `B-DIGDN` = c(4.1, 6.2), `B-DIGDP` = c(4.2, 6.4), `B-DN1` = c(4.3,
6.6), `B-DN2` = c(4.4, 6.8), `B-DP1` = c(4.5, 7), `B-DP2` = c(4.6,
7.2), `B-SCL` = c(4.7, 7.4), `B-SDA` = c(4.8, 7.6), `B-USB0DN` = c(4.9,
7.8), `B-USB0DP` = c(5L, 8L), `B-USB1DN` = c(5.1, 8.2), `B-USB1DP` = c(5.2,
8.4), `B-NA` = c(5.3, 8.6), `B-ACC2PWRLKG_0v4` = c(5.4, 8.8),
`B-ACC2PWRLKG_0v4` = c(5.5, 9), `B-P_IN_Leak` = c(5.6, 9.2
)), class = "data.frame", row.names = c(NA, -2L))
Output Based On Logic Discussed Above
Data Frame 1
A-DIODE A-DIODE
1.2 1.3
0.4 0.6
Data Frame 2
B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1
3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5
4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0
B-DP2.1 B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
7.2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
We can use split.default on the substring of names of the dataset
split.default(df1, sub("-.*", "", names(df1)))
Or if we know there would be only one character before -
split.default(df1, substr(names(df1), 1, 1))
#$A
# A-DIODE A-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#$B
# B-DIODE B-ACC1 B-ACC2 B-ANA0 B-ANA1 B-BRICKID B-CC0 B-CC1 B-DIGDN B-DIGDP B-DN1 B-DN2 B-DP1 B-DP2 B-SCL B-SDA B-USB0DN B-USB0DP
#1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1
#2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2
# B-USB1DN B-USB1DP B-ACC1.1 B-ACC2.1 B-ANA0.1 B-ANA1.1 B-BRICKID.1 B-CC0.1 B-CC1.1 B-DIGDN.1 B-DIGDP.1 B-DN1.1 B-DN2.1 B-DP1.1 B-DP2.1
#1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6
#2 4.4 4.6 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0 7.2
# B-SCL.1 B-SDA.1 B-USB0DN.1 B-USB0DP.1 B-USB1DN.1 B-USB1DP.1 B-NA B-ACC2PWRLKG_0v4 B-ACC2PWRLKG_0v4.1 B-P_IN_Leak
#1 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6
#2 7.4 7.6 7.8 8 8.2 8.4 8.6 8.8 9.0 9.2
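As a small self-contained sketch of the same idea, the prefix before the first - can be extracted with sub() and used as the grouping vector. The data frame `toy` and its column names are made up for illustration and are not part of the original data.

```r
# Toy data with duplicated and partially matching column names (hypothetical)
toy <- data.frame(1:2, 3:4, 5:6, 7:8, check.names = FALSE)
names(toy) <- c("A-DIODE", "A-DIODE", "BB-ACC1", "BB-ACC2")

# Everything before the first "-" defines the group,
# so prefixes longer than one character are handled too
prefix <- sub("-.*", "", names(toy))

# split.default() treats the data frame as a list and splits its columns
parts <- split.default(toy, prefix)

names(parts)     # "A"  "BB"
ncol(parts$A)    # 2
ncol(parts$BB)   # 2
```

Each element of `parts` is itself a data frame, so it can be passed on to further analysis directly.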
Related
cutoff KM KM_lo KM_hi rstm rstm_lo rstm_hi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017-01-01 2.1 1.4 4.9 7.2 3.9 10.2
2 2017-04-01 3.5 2.1 4.7 8.9 6.6 10.8
3 2017-07-01 3.7 2.8 4.2 7.2 6.2 8.4
How do I convert this to a long table? I am struggling to create it into the format I want. I tried the gather and melt functions. The output table would look something like this
cutoff VAR Val Val-hi Val-lo
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 4.9 1.4
2 2017-01-01 rstm 7.2 10.2 3.9
3 2017-07-01 KM 3.7 4.2 2.8
Sample data
structure(list(cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"
), KM = c(2.1, 3.5, 3.7), KM_lo = c(1.4, 2.1, 2.8), KM_hi = c(4.9,
4.7, 4.2), rstm = c(7.2, 8.9, 7.2), rstm_lo = c(3.9, 6.6, 6.2
), rstm_hi = c(10.2, 10.8, 8.4)), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
We may do
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
rename_with(~ str_c(., "_none"), c("KM", "rstm")) %>%
pivot_longer(cols = -cutoff, names_to = c("VAR", ".value"),
names_sep = "_") %>%
rename_with(~ c("Val", "Val-lo", "Val-hi"), 3:5)
-output
# A tibble: 6 × 5
cutoff VAR Val `Val-lo` `Val-hi`
<chr> <chr> <dbl> <dbl> <dbl>
1 2017-01-01 KM 2.1 1.4 4.9
2 2017-01-01 rstm 7.2 3.9 10.2
3 2017-04-01 KM 3.5 2.1 4.7
4 2017-04-01 rstm 8.9 6.6 10.8
5 2017-07-01 KM 3.7 2.8 4.2
6 2017-07-01 rstm 7.2 6.2 8.4
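Here is another pivot_longer approach. Note that in the output shown below, the bare KM and rstm values are not kept, so only the lo and hi columns remain: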
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-cutoff,
names_to = c("VAR", ".value"),
names_pattern = "(.+)_(.+)"
) %>%
na.omit()
cutoff VAR lo hi
<chr> <chr> <dbl> <dbl>
1 2017-01-01 KM 1.4 4.9
2 2017-01-01 rstm 3.9 10.2
3 2017-04-01 KM 2.1 4.7
4 2017-04-01 rstm 6.6 10.8
5 2017-07-01 KM 2.8 4.2
6 2017-07-01 rstm 6.2 8.4
library(tidyverse)
df <-
structure(
list(
cutoff = c("2017-01-01", "2017-04-01", "2017-07-01"),
KM = c(2.1, 3.5, 3.7),
KM_lo = c(1.4, 2.1, 2.8),
KM_hi = c(4.9, 4.7, 4.2),
rstm = c(7.2, 8.9, 7.2),
rstm_lo = c(3.9, 6.6, 6.2),
rstm_hi = c(10.2, 10.8, 8.4)
),
row.names = c(NA,-3L),
class = c("tbl_df",
"tbl", "data.frame")
)
df %>%
pivot_longer(cols = -cutoff) %>%
separate(col = name, into = c("name", "suffix"), sep = "_", remove = TRUE) %>%
mutate(id = data.table::rleid(name)) %>%
pivot_wider(id_cols = c(id, cutoff, name), names_from = suffix, names_prefix = "VAL_", values_from = value) %>%
select(-id) %>%
rename(VAL = VAL_NA)
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 6 rows [1, 4, 7,
#> 10, 13, 16].
#> # A tibble: 6 x 5
#> cutoff name VAL VAL_lo VAL_hi
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2017-01-01 KM 2.1 1.4 4.9
#> 2 2017-01-01 rstm 7.2 3.9 10.2
#> 3 2017-04-01 KM 3.5 2.1 4.7
#> 4 2017-04-01 rstm 8.9 6.6 10.8
#> 5 2017-07-01 KM 3.7 2.8 4.2
#> 6 2017-07-01 rstm 7.2 6.2 8.4
Created on 2021-09-28 by the reprex package (v2.0.1)
Trying to run an OLS regression model in R.
data = read.csv("C:/.../VOLATILITY.csv")
head(data)
volt LfquantBS HfquantBS LfbankVOL HfbankMM HfnonbankMM HfindMM
1 18.23 3.7 9.2 3.2 2.6 35.3 7.9
2 16.09 4.1 11.4 3.2 2.7 35.3 8.2
3 16.79 4.1 11.4 3.2 2.7 35.3 8.2
4 17.01 4.1 11.4 3.2 2.7 35.3 8.2
5 16.09 4.1 11.4 3.2 2.7 35.3 8.2
6 19.66 6.2 10.5 4.2 1.8 30.7 8.6
model <- lm(volt ~ lfquantBS + HfquantBs + LfbankVOL + HfbankMM + HfnonbankMM
+ HfindMM)
Error in eval(predvars, data, env) : object 'volt' not found
Have done this before without any problem. Any help appreciated.
The call needs a data argument, because the columns volt, LfquantBS, etc. exist only inside the data.frame object named 'data'. In addition, case matters: the formula uses lfquantBS and HfquantBs, while the columns in the dataset are named LfquantBS and HfquantBS.
lm(volt ~ LfquantBS + HfquantBS + LfbankVOL + HfbankMM +
HfnonbankMM + HfindMM, data = data)
-output
Call:
lm(formula = volt ~ LfquantBS + HfquantBS + LfbankVOL + HfbankMM +
HfnonbankMM + HfindMM, data = data)
Coefficients:
(Intercept) LfquantBS HfquantBS LfbankVOL HfbankMM HfnonbankMM HfindMM
23.2866 1.0846 -0.9858 NA NA NA NA
Regarding the comment "Have done this before without any problem": it is possible that the OP used attach(data) in the past, which made those columns available as objects on the search path, or created them as standalone vectors before constructing the data.frame.
data
data <- structure(list(volt = c(18.23, 16.09, 16.79, 17.01, 16.09, 19.66
), LfquantBS = c(3.7, 4.1, 4.1, 4.1, 4.1, 6.2), HfquantBS = c(9.2,
11.4, 11.4, 11.4, 11.4, 10.5), LfbankVOL = c(3.2, 3.2, 3.2, 3.2,
3.2, 4.2), HfbankMM = c(2.6, 2.7, 2.7, 2.7, 2.7, 1.8), HfnonbankMM = c(35.3,
35.3, 35.3, 35.3, 35.3, 30.7), HfindMM = c(7.9, 8.2, 8.2, 8.2,
8.2, 8.6)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6"))
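To illustrate why the data argument matters, here is a minimal sketch on made-up data. The names `dat`, `resp` and `pred` are hypothetical and chosen only so that they do not clash with other objects in the session.

```r
# Small made-up data frame; the column names are illustrative only
dat <- data.frame(resp = c(1.0, 2.1, 2.9, 4.2, 5.1),
                  pred = c(1, 2, 3, 4, 5))

# Without `data =`, lm() looks for `resp` and `pred` in the calling
# environment. No such objects exist there, so the call fails and
# tryCatch() returns the error message instead.
res <- tryCatch(lm(resp ~ pred), error = function(e) conditionMessage(e))
is.character(res)

# With `data =`, the formula variables are resolved inside the data frame
fit <- lm(resp ~ pred, data = dat)
coef(fit)
```

Passing the data frame explicitly is usually preferable to attach(), since it keeps the model call independent of whatever happens to be on the search path.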
I want to calculate the correlation between 'y' column and each column in 'col_df' dataframe.
For each calculation I want to save only the columns name with significant p_value (p_value<0.05).
y is a vector 64X1 of 0 and 1.
Example of the col_df- 60X12000
a b c d e
7.6 4.9 8.9 6.0 4.2
25.0 6.5 4.6 13.2 3.0
col_df <- as.matrix(df)
test <- col_df[, apply(col_df, MARGIN = 2, FUN = function(x)
(cor.test(y, col_df[,x], method = "pearson")$p.value <0.05))]
This is the error:
Error in col_df[, x] : subscript out of bounds
Is this the way to do that?
This is a working solution:
df <- structure(list(a = c(7.6, 7.6, 25, 25, 25, 25, 7.6, 7.6, 7.6, 25),
b = c(4.9, 4.9, 6.5, 6.5, 4.9, 6.5, 4.9, 4.9, 6.5, 6.5),
c = c(8.9, 4.6, 8.9, 8.9, 8.9, 4.6, 4.6, 8.9, 8.9, 4.6),
d = c(13.2, 13.2, 6, 6, 6, 6, 6, 13.2, 13.2, 13.2),
e = c(3, 4.2, 3, 4.2, 3, 3, 3, 4.2, 4.2, 4.2)),
class = "data.frame", row.names = c(NA, -10L))
y <- c(1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L)
test <- df[, apply(df, MARGIN = 2, FUN = function(x)
(cor.test(y, x, method = "pearson")$p.value < 0.05))]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
The difference from your solution is that apply() passes each column itself to the function as x,
not its index. Hence, all you have to do is replace col_df[,x] in your solution with
just x.
You can simplify it a little with sapply(). I also recommend not to put everything into
a single line. It is hard to read and harder to debug.
Columns <- sapply(df, FUN = function(x) (cor.test(y, x, method = "pearson")$p.value < 0.05))
test <- df[, Columns]
test
#> a b
#> 1 7.6 4.9
#> 2 7.6 4.9
#> 3 25.0 6.5
#> 4 25.0 6.5
#> 5 25.0 4.9
#> 6 25.0 6.5
#> 7 7.6 4.9
#> 8 7.6 4.9
#> 9 7.6 6.5
#> 10 25.0 6.5
Created on 2020-07-22 by the reprex package (v0.3.0)
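Since the question asked for the names of the significant columns rather than the columns themselves, a sketch along the following lines might help. The data here is simulated and is not the asker's data; vapply() is used instead of sapply() only so that the return type is checked.

```r
# Made-up example data (not the asker's data)
set.seed(1)
y <- rep(c(0, 1), each = 10)
col_df <- data.frame(
  a = y * 5 + rnorm(20, sd = 0.5),  # strongly related to y
  b = rnorm(20),                    # unrelated noise
  c = rnorm(20)                     # unrelated noise
)

# vapply() hands each column to the function as a vector,
# so `x` is the column itself, not its index
p_values <- vapply(col_df,
                   function(x) cor.test(y, x, method = "pearson")$p.value,
                   numeric(1))

# Names of the columns whose p-value is below 0.05
sig_cols <- names(p_values)[p_values < 0.05]
sig_cols
```

Keeping the p-values in a named vector also makes it easy to inspect them before applying the 0.05 cutoff.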
I am running split.default() in R on the following data. The problem is that the resulting sub data frames in the list come out sorted by column name.
I don't want this to happen; I want to preserve the column name sequence of the original data. Is there an approach I can follow to do so? Please suggest.
Input Data
data <- structure(list(`B-DIODE` = c(1.2, 0.4), `B-DIODE` = c(1.3, 0.6
), `A-DIODE` = c(1.4, 0.8), `A-ACC1` = c(1.5, 1), `A-ACC2` = c(1.6,
1.2), `A-ANA0` = c(1.7, 1.4), `A-ANA1` = c(1.8, 1.6), `A-BRICKID` = c(1.9,
1.8), `A-CC0` = c(2L, 2L), `A-CC1` = c(2.1, 2.2), `A-DIGDN` = c(2.2,
2.4), `A-DIGDP` = c(2.3, 2.6), `A-DN1` = c(2.4, 2.8), `A-DN2` = c(2.5,
3), `A-DP1` = c(2.6, 3.2), `A-DP2` = c(2.7, 3.4), `A-SCL` = c(2.8,
3.6), `A-SDA` = c(2.9, 3.8), `A-USB0DN` = 3:4, `A-USB0DP` = c(3.1,
4.2), `A-USB1DN` = c(3.2, 4.4), `A-USB1DP` = c(3.3, 4.6), `A-ACC1` = c(3.4,
4.8), `A-ACC2` = c(3.5, 5), `A-ANA0` = c(3.6, 5.2), `A-ANA1` = c(3.7,
5.4), `A-BRICKID` = c(3.8, 5.6), `A-CC0` = c(3.9, 5.8), `A-CC1` = c(4L,
6L), `A-DIGDN` = c(4.1, 6.2), `A-DIGDP` = c(4.2, 6.4), `A-DN1` = c(4.3,
6.6), `A-DN2` = c(4.4, 6.8), `A-DP1` = c(4.5, 7), `A-DP2` = c(4.6,
7.2), `A-SCL` = c(4.7, 7.4), `A-SDA` = c(4.8, 7.6), `A-USB0DN` = c(4.9,
7.8), `A-USB0DP` = c(5L, 8L), `A-USB1DN` = c(5.1, 8.2), `A-USB1DP` = c(5.2,
8.4), `A-NA` = c(5.3, 8.6), `A-ACC2PWRLKG_0v4` = c(5.4, 8.8),
`A-ACC2PWRLKG_0v4` = c(5.5, 9), `A-P_IN_Leak` = c(5.6, 9.2
)), class = "data.frame", row.names = c(NA, -2L))
Code
split.default(data, sub("-.*", "", names(data)))
Output
$`A`
A-DIODE A-ACC1 A-ACC2 A-ANA0 A-ANA1 A-BRICKID A-CC0 A-CC1 A-DIGDN A-DIGDP A-DN1 A-DN2 A-DP1 A-DP2 A-SCL A-SDA A-USB0DN A-USB0DP A-USB1DN A-USB1DP
1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3
2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6
A-ACC1.1 A-ACC2.1 A-ANA0.1 A-ANA1.1 A-BRICKID.1 A-CC0.1 A-CC1.1 A-DIGDN.1 A-DIGDP.1 A-DN1.1 A-DN2.1 A-DP1.1 A-DP2.1 A-SCL.1 A-SDA.1 A-USB0DN.1
1 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
2 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
A-USB0DP.1 A-USB1DN.1 A-USB1DP.1 A-NA A-ACC2PWRLKG_0v4 A-ACC2PWRLKG_0v4.1 A-P_IN_Leak
1 5 5.1 5.2 5.3 5.4 5.5 5.6
2 8 8.2 8.4 8.6 8.8 9.0 9.2
$B
B-DIODE B-DIODE.1
1 1.2 1.3
2 0.4 0.6
In the above output I want the $B to appear first and then $A as that's the sequence that Input Data followed.
One option is to convert the names to factor and set the levels as needed
new_name <- sub("-.*", "", names(data))
split.default(data, factor(new_name, levels = unique(new_name)))
#$B
# B-DIODE B-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#$A
# A-DIODE A-ACC1 A-ACC2 A-ANA0 ....
#1 1.4 1.5 1.6 1.7 ....
#2 0.8 1.0 1.2 1.4 ....
By specifying levels as unique(new_name), we ensure that the list is split in the order in which the groups occur in the data frame, not alphabetically.
As @thelatemail suggests, we can also avoid converting the names to a factor by reordering the resulting list with unique(new_name)
split.default(data, new_name)[unique(new_name)]
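The effect of the factor levels on the order of the result can be checked on a tiny example. This is only a sketch; the data frame `toy` and its column names are invented for illustration.

```r
# Toy data frame whose first group ("B") sorts after the second ("A")
toy <- data.frame(1:2, 3:4, 5:6, check.names = FALSE)
names(toy) <- c("B-x", "B-y", "A-x")

grp <- sub("-.*", "", names(toy))

# A plain character grouping is converted to a factor with sorted levels
alphabetical <- split.default(toy, grp)
names(alphabetical)   # "A" "B"

# Levels in order of first appearance keep the original sequence
in_order <- split.default(toy, factor(grp, levels = unique(grp)))
names(in_order)       # "B" "A"

# Equivalent: split alphabetically, then reorder the resulting list
reordered <- split.default(toy, grp)[unique(grp)]
names(reordered)      # "B" "A"
```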
Another option is to create the group indices for splitting using rle
rl <- rle(sub("-.*", "", names(data)))
split.default(data, rep(seq_along(rl$lengths), rl$lengths))
#$`1`
# B-DIODE B-DIODE.1
#1 1.2 1.3
#2 0.4 0.6
#
#$`2`
# A-DIODE A-ACC1 A-ACC2 A-ANA0 A-ANA1 A-BRICKID A-CC0 A-CC1 A-DIGDN A-DIGDP
#1 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3
#2 0.8 1.0 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6
# A-DN1 A-DN2 A-DP1 A-DP2 A-SCL A-SDA A-USB0DN A-USB0DP A-USB1DN A-USB1DP
#1 2.4 2.5 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3
#2 2.8 3.0 3.2 3.4 3.6 3.8 4 4.2 4.4 4.6
# A-ACC1.1 A-ACC2.1 A-ANA0.1 A-ANA1.1 A-BRICKID.1 A-CC0.1 A-CC1.1 A-DIGDN.1
#1 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1
#2 4.8 5.0 5.2 5.4 5.6 5.8 6 6.2
# A-DIGDP.1 A-DN1.1 A-DN2.1 A-DP1.1 A-DP2.1 A-SCL.1 A-SDA.1 A-USB0DN.1
#1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
#2 6.4 6.6 6.8 7.0 7.2 7.4 7.6 7.8
# A-USB0DP.1 A-USB1DN.1 A-USB1DP.1 A-NA A-ACC2PWRLKG_0v4 A-ACC2PWRLKG_0v4.1
#1 5 5.1 5.2 5.3 5.4 5.5
#2 8 8.2 8.4 8.6 8.8 9.0
# A-P_IN_Leak
#1 5.6
#2 9.2
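One thing to keep in mind with the rle approach is that it groups contiguous runs of a prefix, so the same prefix appearing in two separate places ends up in two list elements. The following sketch, on invented data, shows this behaviour.

```r
# Toy data in which the "B" columns are not adjacent (hypothetical names)
toy <- data.frame(1:2, 3:4, 5:6, 7:8, check.names = FALSE)
names(toy) <- c("B-x", "A-x", "A-y", "B-y")

rl <- rle(sub("-.*", "", names(toy)))

# One index per run, repeated for the length of that run
run_id <- rep(seq_along(rl$lengths), rl$lengths)
run_id           # 1 2 2 3

by_run <- split.default(toy, run_id)
length(by_run)   # 3: the two "B" columns are not merged
```

If columns with the same prefix should always end up together regardless of position, the factor-based variant is the safer choice.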
I have a character field in a dataframe that contains numbers e.g. (0.5,3.5,7.8,2.4).
For every record I am trying to extract the largest value from the string and put it in a new column.
e.g.
x csi
1 0.5, 6.7, 2.3
2 9.5, 2.6, 1.1
3 0.7, 2.3, 5.1
4 4.1, 2.7, 4.7
The desired output would be:
x csi csi_max
1 0.5, 6.7, 2.3 6.7
2 9.5, 2.6, 1.1 9.5
3 0.7, 2.3, 5.1 5.1
4 4.1, 2.7, 4.7 4.7
I have had various attempts ...with my latest attempt being the following - which provides the maximum csi score from the entire column rather than from the individual row's csi numbers...
library(stringr)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
df$max_csi <- max(numextract(df$csi))
Thank you
We can use tidyverse
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(csi, convert = TRUE) %>%
group_by(x) %>%
summarise(csi_max = max(csi)) %>%
left_join(df1, .)
# x csi csi_max
#1 1 0.5, 6.7, 2.3 6.7
#2 2 9.5, 2.6, 1.1 9.5
#3 3 0.7, 2.3, 5.1 5.1
#4 4 4.1, 2.7, 4.7 4.7
Or this can be done with pmax from base R after separating the 'csi' column into a data.frame with read.table
df1$csi_max <- do.call(pmax, read.table(text=df1$csi, sep=","))
Hope this helps!
df$csi_max <- sapply(df$csi, function(x) max(as.numeric(unlist(strsplit(as.character(x), split=",")))))
Output is:
x csi csi_max
1 1 0.5, 6.7, 2.3 6.7
2 2 9.5, 2.6, 1.1 9.5
3 3 0.7, 2.3, 5.1 5.1
4 4 4.1, 2.7, 4.7 4.7
#sample data
> dput(df)
structure(list(x = 1:4, csi = structure(c(1L, 4L, 2L, 3L), .Label = c("0.5, 6.7, 2.3",
"0.7, 2.3, 5.1", "4.1, 2.7, 4.7", "9.5, 2.6, 1.1"), class = "factor")), .Names = c("x",
"csi"), class = "data.frame", row.names = c(NA, -4L))
Edit:
As suggested by @RichScriven, a more efficient way could be
df$csi_max <- sapply(strsplit(as.character(df$csi), ","), function(x) max(as.numeric(x)))
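The as.numeric() step in these solutions matters, because max() on character values compares them as strings. A short sketch on made-up data (note the two-digit value 10.2, which is not in the original example) shows the difference.

```r
# Made-up data containing a two-digit value
df <- data.frame(x = 1:3,
                 csi = c("0.5, 6.7, 2.3", "9.5, 10.2, 1.1", "0.7, 2.3, 5.1"),
                 stringsAsFactors = FALSE)

pieces <- strsplit(df$csi, ",")

# Convert to numeric before taking the maximum; as.numeric() tolerates
# the leading space left over from splitting on ","
df$csi_max <- vapply(pieces, function(p) max(as.numeric(p)), numeric(1))
df$csi_max                            # 6.7 10.2 5.1

# Comparing the strings instead ranks "9.5" above "10.2"
string_max <- max(trimws(pieces[[2]]))
string_max                            # "9.5"
```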
A solution using the splitstackshape package.
library(splitstackshape)
dat$csi_max <- apply(cSplit(dat, "csi")[, -1], 1, max)
dat
# x csi csi_max
# 1 1 0.5, 6.7, 2.3 6.7
# 2 2 9.5, 2.6, 1.1 9.5
# 3 3 0.7, 2.3, 5.1 5.1
# 4 4 4.1, 2.7, 4.7 4.7
DATA
dat <- read.table(text = "x csi
1 '0.5, 6.7, 2.3'
2 '9.5, 2.6, 1.1'
3 '0.7, 2.3, 5.1'
4 '4.1, 2.7, 4.7'",
header = TRUE, stringsAsFactors = FALSE)