R: box plot with 2 or more series

My data frame is simple (and probably is not strictly a dataframe):
date MAE_f0 MAE_f1
1 20140101 0.2 0.2
2 20140102 1.9 0.1
3 20140103 0.1 0.3
4 20140104 7.8 15.9
5 20140105 1.9 4.6
6 20140106 0.8 0.8
7 20140107 0.5 0.6
8 20140108 0.2 0.2
9 20140109 0.2 0.2
10 20140110 0.8 1.1
11 20140111 0.2 0.2
12 20140112 0.4 0.4
13 20140113 2.8 0.9
14 20140114 5.4 5.8
15 20140115 0.2 0.3
16 20140116 4.9 3.1
17 20140117 3.7 6.0
18 20140118 1.4 2.1
19 20140119 0.9 3.0
20 20140120 0.2 3.6
21 20140121 0.3 0.3
22 20140122 0.4 0.4
23 20140123 0.6 1.7
24 20140124 6.1 4.7
25 20140125 0.1 0.0
26 20140126 7.4 4.9
27 20140127 0.8 0.9
28 20140128 0.3 0.3
29 20140129 3.0 4.2
30 20140130 9.9 17.3
For every day I have 2 variables: MAE for f0 and MAE for f1.
I can calculate the frequency for my 2 variables over the whole time period using "cut" with the same intervals for both:
cut(mae.df$MAE_f0,c(0,2,5,10,50))
cut(mae.df$MAE_f1,c(0,2,5,10,50))
Now I can use boxplot to plot each variable against its frequency distribution:
boxplot(mae.df$MAE_f0~cut(mae.df$MAE_f0,c(0,2,5,10,50)))
boxplot(mae.df$MAE_f1~cut(mae.df$MAE_f1,c(0,2,5,10,50)))
The two resulting box plots are very simple (I can't show them because I have no "reputation"): the x axis has the frequency intervals (0-2, 2-5, 5-10, 10-50) and the y axis shows the box plot of the variable (MAE_f0 in the first plot) for each interval.
The question is simple: I'd like a single plot with both variables, MAE_f0 and MAE_f1, and their frequency distribution, that is, 2 box plots for each frequency interval (2 for 0-2, 2 for 2-5, and so on).
I know that my knowledge of R and data frames is very poor and that I'm clearly missing something important, especially about data frames and reshaping. Sorry in advance for that! I've seen some nice examples on Stack Overflow about grouped box plots, all without a time variable, but I can't figure out how to adjust my data frame to do the same.
I hope my question is not misplaced; sorry again if it is.
Umbe

Here is how I would do this. I think it makes sense to melt your data first. A quick tutorial on melting your data is available here.
# First, make this reproducible by using dput for the data frame
df <- structure(list(date = 20140101:20140130, MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4, 0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8, 0.3, 3, 9.9), MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("date", "MAE_f0", "MAE_f1"), row.names = c(NA, -30L), class = "data.frame")
require(ggplot2)
require(reshape2)
# Melt the original data frame
df2 <- melt(df, measure.vars = c("MAE_f0", "MAE_f1"))
head(df2)
# date variable value
# 1 20140101 MAE_f0 0.2
# 2 20140102 MAE_f0 1.9
# 3 20140103 MAE_f0 0.1
# 4 20140104 MAE_f0 7.8
# 5 20140105 MAE_f0 1.9
# 6 20140106 MAE_f0 0.8
# Create a "cuts" variable with the correct breaks
df2$cuts <- cut(df2$value,
breaks = c(-Inf, 2, 5, 10, +Inf),
labels = c("first cut", "second cut", "third cut", "fourth cut"))
head(df2)
# date variable value cuts
# 1 20140101 MAE_f0 0.2 first cut
# 2 20140102 MAE_f0 1.9 first cut
# 3 20140103 MAE_f0 0.1 first cut
# 4 20140104 MAE_f0 7.8 third cut
# 5 20140105 MAE_f0 1.9 first cut
# 6 20140106 MAE_f0 0.8 first cut
# Plotting
ggplot(df2, aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
facet_wrap(~ cuts, nrow = 1)
Result: (plot image not included)
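For readers who prefer to stay with base graphics, the same grouping can be sketched with boxplot() and a two-factor formula. This is only a sketch: the names long and b are mine, and the data are retyped from the question so the block runs on its own. Note that cut() with breaks c(0, 2, 5, 10, 50) leaves out values equal to 0 (the intervals are open on the left), so the 0.0 in MAE_f1 would become NA; include.lowest = TRUE is used below to keep it.

```r
# Data from the question (date column omitted; it is not needed for the plot)
df <- data.frame(
  MAE_f0 = c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4, 0.2,
             4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8, 0.3, 3, 9.9),
  MAE_f1 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3,
             3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)
)

# Stack the two columns into long format: one row per (series, value)
long <- data.frame(
  variable = factor(rep(c("MAE_f0", "MAE_f1"), each = nrow(df))),
  value    = c(df$MAE_f0, df$MAE_f1)
)

# Same breaks for both series; include.lowest = TRUE keeps values equal to 0
long$cuts <- cut(long$value, breaks = c(0, 2, 5, 10, 50), include.lowest = TRUE)

# One box per series within each interval (empty groups are kept, not dropped)
b <- boxplot(value ~ variable + cuts, data = long,
             col = c("tomato", "skyblue"), las = 2,
             xlab = "", ylab = "MAE")
legend("topleft", legend = levels(long$variable), fill = c("tomato", "skyblue"))
```

The returned object b holds the per-group counts in b$n, which is a quick way to spot intervals where one series has no observations.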

Here is one way. Reshape your data, then add a fake data point: there is no data point for MAE_f0 in the (10,50] interval (frequency 10-50), so without it the single remaining box in that interval would be drawn at full width. Combine the reshaped data and the fake data. When you draw the figure, use coord_cartesian with the range of y values in the original data set, so the fake point (value 60) stays out of view. I hope this gives you the graphic you want. Here, your data is called mydf.
library(dplyr)
library(tidyr)
library(ggplot2)
mydf <- structure(list(V1 = 1:30, V2 = 20140101:20140130, V3 = c(0.2,
1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4,
0.2, 4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8,
0.3, 3, 9.9), V4 = c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2,
0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3, 3.1, 6, 2.1, 3, 3.6, 0.3,
0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)), .Names = c("V1",
"V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -30L
))
ana <- select(mydf, -V1) %>%
rename(date = V2, MAE_f0 = V3, MAE_f1 = V4) %>%
gather(variable, value, -date) %>%
mutate(frequency = cut(value, breaks = c(-Inf,2,5,10,50)))
# Create a fake df
extra <- data.frame(date = 20140101,
variable = "MAE_f0",
value = 60,
frequency = "(10,50]")
new <- rbind(ana, extra)
ggplot(data = new, aes(x = frequency, y = value, fill = variable)) +
geom_boxplot(position = "dodge") +
coord_cartesian(ylim = range(ana$value) + c(-0.25, 0.25))
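To see why the fake row is needed, it can help to count observations per series and interval before plotting. A small base-R check (the names long and tab are mine; the data are retyped from the question):

```r
MAE_f0 <- c(0.2, 1.9, 0.1, 7.8, 1.9, 0.8, 0.5, 0.2, 0.2, 0.8, 0.2, 0.4, 2.8, 5.4, 0.2,
            4.9, 3.7, 1.4, 0.9, 0.2, 0.3, 0.4, 0.6, 6.1, 0.1, 7.4, 0.8, 0.3, 3, 9.9)
MAE_f1 <- c(0.2, 0.1, 0.3, 15.9, 4.6, 0.8, 0.6, 0.2, 0.2, 1.1, 0.2, 0.4, 0.9, 5.8, 0.3,
            3.1, 6, 2.1, 3, 3.6, 0.3, 0.4, 1.7, 4.7, 0, 4.9, 0.9, 0.3, 4.2, 17.3)

# Long format: one row per (series, value)
long <- data.frame(
  variable = rep(c("MAE_f0", "MAE_f1"), each = length(MAE_f0)),
  value    = c(MAE_f0, MAE_f1)
)

# Same breaks as in the answer above
long$frequency <- cut(long$value, breaks = c(-Inf, 2, 5, 10, 50))

# Cross-tabulate series against interval; the MAE_f0 row has a 0
# in the (10,50] column because its largest value is 9.9
tab <- table(long$variable, long$frequency)
print(tab)
```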

Related

How can I plot multiple columns under X and Y in ggplot2

data <- structure(list(A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5,
36.7, 44.3, 46.4), E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4,
4.4, 10.6, 16.5), A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4,
21.6, 31.1, 36.2), E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3,
84.7, 71.5, 58.1, 48.7)), row.names = c(NA, -10L), class = "data.frame")
data
#> A_w E_w A_e E_e
#> 1 0.00 1.2 0.00 99.4
#> 2 0.69 1.2 0.18 99.3
#> 3 1.41 1.5 0.37 98.9
#> 4 2.89 1.6 0.79 98.4
#> 5 6.42 1.9 1.93 97.1
#> 6 13.30 2.3 4.82 93.3
#> 7 25.50 3.4 11.40 84.7
#> 8 36.70 4.4 21.60 71.5
#> 9 44.30 10.6 31.10 58.1
#> 10 46.40 16.5 36.20 48.7
Created on 2021-05-31 by the reprex package (v2.0.0)
I am trying to plot this data with all A values as X and Es as Y. How can I put either a) both of these columns plotted on a ggplot2, or b) rearrange this dataframe to combine the A columns and E columns into a final dataframe with only two columns with 2x as many rows as pictured?
Thanks for any help, I am a beginner (obviously)
Edit for Clarity: It's important that the A_e & E_e values remain as pairs, similar to how the A_w and E_w values remain as pairs. The end result plot should resemble the ORANGE and BLUE lines of this image, but I am trying to replicate this while learning R.
Currently I can plot each pair separately by splitting the data into two data frames of 10 rows and 2 columns each:
A_w E_w
1 0.00 1.2
2 0.69 1.2
3 1.41 1.5
4 2.89 1.6
5 6.42 1.9
6 13.30 2.3
7 25.50 3.4
8 36.70 4.4
9 44.30 10.6
10 46.40 16.5
and the second plot
# A tibble: 10 x 2
A_e E_e
<dbl> <dbl>
1 0 99.4
2 0.18 99.3
3 0.37 98.9
4 0.79 98.4
5 1.93 97.1
6 4.82 93.3
7 11.4 84.7
8 21.6 71.5
9 31.1 58.1
10 36.2 48.7
But my end goal is to have them both on the same plot, like in the Excel graph (orange + blue graph) above.
Here is a try
library(dplyr)
library(ggplot2)
line_1_data <- data %>%
select(A_w, E_w) %>%
mutate(xend = lead(A_w), yend = lead(E_w)) %>%
filter(!is.na(xend))
line_2_data <- data %>%
select(A_e, E_e) %>%
mutate(xend = lead(A_e), yend = lead(E_e)) %>%
filter(!is.na(xend))
# Multiple columns plotted with different geoms
ggplot(data = data) +
# The blue line
geom_point(aes(x = A_w, y = E_w), color = "blue") +
geom_curve(data = line_1_data, aes(x = A_w, y = E_w, xend = xend,
yend = yend), color = "blue",
curvature = 0.02) +
# The orange line
geom_point(aes(x = A_e, y = E_e), color = "orange") +
geom_curve(data = line_2_data,
aes(x = A_e, y = E_e, xend = xend, yend = yend), color = "orange",
curvature = -0.02) +
# The red connection between the two lines
geom_curve(data = tail(data, 1),
aes(x = A_w, y = E_w, xend = A_e, yend = E_e), curvature = 0.1,
color = "red") +
# The black straight line between each pair
geom_curve(
aes(x = A_w, y = E_w, xend = A_e, yend = E_e), curvature = 0,
color = "black")
You could start from this:
data <- data.frame(
A_w = c(0,0.69,1.41,2.89,6.42,
13.3,25.5,36.7,44.3,46.4),
E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
A_e = c(0,0.18,0.37,0.79,1.93,
4.82,11.4,21.6,31.1,36.2),
E_e = c(99.4,99.3,98.9,98.4,
97.1,93.3,84.7,71.5,58.1,48.7)
)
library(tidyverse)
data %>% pivot_longer(everything(), names_sep = '_', names_to = c('.value', 'type')) %>%
ggplot(aes(x = A, y = E, color = type)) +
geom_point() +
geom_line()
Doing it "by hand":
# dummy data:
df = data.frame(A_w=rnorm(10), E_w=rnorm(10), A_e=rnorm(10), E_e=rnorm(10))
# stack the A columns and the E columns, keeping each (A, E) pair on the same row
df2 = data.frame(A=c(df$A_w, df$A_e), E=c(df$E_w, df$E_e))
Output (first ten rows; the data are random, so your values will differ):
> head(df2, 10)
A E
1 1.25522468 -0.2441768
2 -0.50585191 -0.1383637
3 0.42374270 -0.9664189
4 -0.39858532 -0.3442157
5 -1.05665363 -1.3574362
6 0.79191788 -0.8202841
7 -1.31349592 0.7280619
8 -0.05609851 0.6365495
9 1.01068811 2.0222241
10 -1.15572972 -0.2190794
And for the plot: ggplot(df2, aes(x=A, y=E)) + geom_point()
Output:
There are ways to do this without having to join the columns by listing their names (with the tidyr package, for example), but I think this solution is easier to understand from a beginner's point of view.
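A small extension of the same by-hand idea, in case the two series need to stay distinguishable in the plot: add a column that records which pair each row came from. This is a sketch using the question's data; the names long and type are my own choices.

```r
data <- data.frame(
  A_w = c(0, 0.69, 1.41, 2.89, 6.42, 13.3, 25.5, 36.7, 44.3, 46.4),
  E_w = c(1.2, 1.2, 1.5, 1.6, 1.9, 2.3, 3.4, 4.4, 10.6, 16.5),
  A_e = c(0, 0.18, 0.37, 0.79, 1.93, 4.82, 11.4, 21.6, 31.1, 36.2),
  E_e = c(99.4, 99.3, 98.9, 98.4, 97.1, 93.3, 84.7, 71.5, 58.1, 48.7)
)

# Stack the (A_w, E_w) pairs on top of the (A_e, E_e) pairs and tag each row
long <- data.frame(
  A    = c(data$A_w, data$A_e),
  E    = c(data$E_w, data$E_e),
  type = rep(c("w", "e"), each = nrow(data))
)

# Base-graphics version of the two-line plot: empty frame first,
# then one connected line with points per series
plot(long$A, long$E, type = "n", xlab = "A", ylab = "E")
for (t in unique(long$type)) {
  sub <- long[long$type == t, ]
  lines(sub$A, sub$E, type = "b", col = if (t == "w") "blue" else "orange")
}
```

With ggplot2 loaded, the same long data frame can be passed to ggplot(long, aes(x = A, y = E, color = type)) + geom_point() + geom_line().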

R Make scatter plots (ggplot) from columns based on attributes from rows

I have the following type of table :
df0 <- read.table(text = 'Sample Method Mg Al Ca Ti
Sa A 5.5 2.2 33 0.2
Sb A 4.2 1.2 44 0.1
Sc A 1.1 0.5 25 0.3
Sd A 3.3 1.3 31 0.5
Se A 6.2 0.2 55 0.6
Sa B 5.2 2 35 0.25
Sb B 4.6 1.3 48 0.1
Sc B 1.6 0.8 22 0.32
Sd B 3.1 1.6 29 0.4
Se B 6.8 0.3 51 0.7
Sa C 5.6 2.5 30 0.2
Sb C 4.1 1.2 41 0.15
Sc C 1 0.6 22 0.4
Sd C 3.2 1.5 30 0.5
Se C 6.8 0.1 51 0.65', header = T, stringsAsFactors = F)
The table contains chemical compositions. I would like to use Method A as a reference (x-axis) and automatically produce scatter plots with the data from Methods B and C on the y-axis (with a linear trend), plus a 1:1 reference line that would correspond to a perfect match.
In other words, I would like to produce plots like this:
I think a solution could start from transforming the data frame into:
df <- read.table(text = 'Sample Mg_A Al_A Ca_A Ti_A Mg_B Al_B Ca_B Ti_B Mg_C Al_C Ca_C Ti_C
Sa 5.5 2.2 33 0.2 5.2 2 35 0.25 5.6 2.5 30 0.2
Sb 4.2 1.2 44 0.1 4.6 1.3 48 0.1 4.1 1.2 41 0.15
Sc 1.1 0.5 25 0.3 1.6 0.8 22 0.32 1 0.6 22 0.4
Sd 3.3 1.3 31 0.5 3.1 1.6 29 0.4 3.2 1.5 30 0.5
Se 6.2 0.2 55 0.6 6.8 0.3 51 0.7 6.8 0.1 51 0.65
', header = T, stringsAsFactors = F)
But I don't know how to go further.
Any help would be appreciated.
Best, Anne-Christine
You can use the following code
library(tidyverse)
df0 %>%
pivot_wider(names_from = Method, values_from = c(Mg, Al, Ca, Ti)) %>%
pivot_longer(cols = -Sample) %>% #wide to long data format
separate(name, c("key","number"), sep = "_") %>%
group_by(number) %>% # Group the values according to number
mutate(row = row_number()) %>% #For creating unique IDs
pivot_wider(names_from = number, values_from = value) %>%
ggplot() +
geom_point(aes(x=A, y=B, color = "A vs B")) +
geom_point(aes(x=A, y=C, color = "A vs C")) +
geom_abline(slope=1, intercept=0) +
geom_smooth(aes(x=A, y=B, color = "A vs B"), method=lm, se=FALSE, fullrange=TRUE)+
geom_smooth(aes(x=A, y=C, color = "A vs C"), method=lm, se=FALSE, fullrange=TRUE)+
facet_wrap(key~., scales = "free")+
theme_bw()+
ylab("B or C") +
xlab("A")
Data
df0 = structure(list(Sample = c("Sa", "Sb", "Sc", "Sd", "Se", "Sa",
"Sb", "Sc", "Sd", "Se", "Sa", "Sb", "Sc", "Sd", "Se"), Method = c("A",
"A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C",
"C"), Mg = c(5.5, 4.2, 1.1, 3.3, 6.2, 5.2, 4.6, 1.6, 3.1, 6.8,
5.6, 4.1, 1, 3.2, 6.8), Al = c(2.2, 1.2, 0.5, 1.3, 0.2, 2, 1.3,
0.8, 1.6, 0.3, 2.5, 1.2, 0.6, 1.5, 0.1), Ca = c(33L, 44L, 25L,
31L, 55L, 35L, 48L, 22L, 29L, 51L, 30L, 41L, 22L, 30L, 51L),
Ti = c(0.2, 0.1, 0.3, 0.5, 0.6, 0.25, 0.1, 0.32, 0.4, 0.7,
0.2, 0.15, 0.4, 0.5, 0.65)), class = "data.frame", row.names = c(NA,
-15L))
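For readers who want to see the pairing step in isolation, here is a base-R sketch for a single element and a single comparison (Mg, Method B against Method A). It pairs the samples with merge(), fits the linear trend with lm(), and draws the 1:1 line. The object names ref, alt, paired and fit are mine, and only the Mg column is retyped from the question.

```r
df0 <- data.frame(
  Sample = rep(c("Sa", "Sb", "Sc", "Sd", "Se"), times = 3),
  Method = rep(c("A", "B", "C"), each = 5),
  Mg     = c(5.5, 4.2, 1.1, 3.3, 6.2, 5.2, 4.6, 1.6, 3.1, 6.8, 5.6, 4.1, 1, 3.2, 6.8)
)

# Reference method (x-axis) and the method compared against it (y-axis)
ref <- df0[df0$Method == "A", c("Sample", "Mg")]
alt <- df0[df0$Method == "B", c("Sample", "Mg")]

# Pair the measurements by sample; the suffixes give columns Mg_A and Mg_B
paired <- merge(ref, alt, by = "Sample", suffixes = c("_A", "_B"))

# Linear trend of Method B against Method A
fit <- lm(Mg_B ~ Mg_A, data = paired)

plot(paired$Mg_A, paired$Mg_B, xlab = "Mg (Method A)", ylab = "Mg (Method B)")
abline(a = 0, b = 1, lty = 2)   # 1:1 reference line (perfect match)
abline(fit, col = "blue")       # fitted linear trend
```

Repeating this for each element and for Method C is what the pivot_wider/pivot_longer pipeline above automates.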

Determine how close proportions are to an even split

I've got a dataset that has info about bunch of cities in it. Variables include % of residents that are several different race categories, % of residents in several employment sectors, etc. I'm trying to determine, for each category, how close each city is to an even split among the options.
So for race, there's 4 race categories, so a city that's 25% of each would be (for example) 1, while a city that was 100% white would be a 0. However, with 7 employment sectors, each would have to be 14.29% for a perfect score (the point being that I'm doing this on multiple categories with different numbers of groups in each category). My output would be a column that has some kind of numeric score for how evenly the group I'm looking at (for example, race) is spread out.
I'm programming in R, so a solution there would be great, but I'm up for whatever kind of answer might be useful.
Here's a sample data frame if that's useful
testdata <- structure(list(city = c("City1", "City2", "City3", "City4"), black = c(0.4, 0.1, 0.3, 0.2), white = c(0.3, 0.7, 0.1, 0.2), hisp = c(0.2, 0.1, 0.2, 0.2),asian = c(0.1, 0.1, 0.4, 0.4), service =c(0.10, 0.14, 0.4, 0.0),tech = c(0.00, 0.14, 0.6, 0.2),govt = c(0.15, 0.14, 0.0, 0.2),nonprofit = c(0.20, 0.14, 0.0, 0.3),agriculture = c(0.05, 0.14, 0.0, 0.1),manufacturing = c(0.40, 0.14, 0.0, 0.1),marketing = c(0.10, 0.16, 0.0, 0.1)), row.names = c(NA, -4L), class = "data.frame")
Here's one way to proceed :
Split the columns by category. The example you shared has two broad categories, race and employment sector. Once each column is assigned to a category, the even-split value is 1 divided by the number of rows in each group (after reshaping to long format, that is the number of options in the category), and close_ratio is the absolute difference between that value and the observed one.
library(dplyr)
testdata %>%
tidyr::pivot_longer(cols = -city) %>%
mutate(category=case_when(name %in% c('black', 'white', 'hisp', 'asian') ~ 'race',
TRUE ~ 'sectors')) %>%
group_by(city, category) %>%
mutate(close_ratio = abs(1/n() - value))
# city name value category close_ratio
# <chr> <chr> <dbl> <chr> <dbl>
# 1 City1 black 0.4 race 0.15
# 2 City1 white 0.3 race 0.0500
# 3 City1 hisp 0.2 race 0.0500
# 4 City1 asian 0.1 race 0.15
# 5 City1 service 0.1 sectors 0.0429
# 6 City1 tech 0 sectors 0.143
# 7 City1 govt 0.15 sectors 0.00714
# 8 City1 nonprofit 0.2 sectors 0.0571
# 9 City1 agriculture 0.05 sectors 0.0929
#10 City1 manufacturing 0.4 sectors 0.257
# … with 34 more rows
A close_ratio of 0 is ideal: it means the value is exactly the even-split share. The further it is from 0, the more uneven the split.
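If a single score per city and category is wanted (1 for a perfectly even split, 0 when one group holds everything), one possible way to collapse the per-group deviations is to sum them and rescale by the largest total deviation possible for k groups, which is 2 * (1 - 1/k). This is my own sketch rather than part of the answer above; the function name evenness_score is hypothetical, and it assumes the proportions within a category sum to 1.

```r
# Score in [0, 1]: 1 = perfectly even split, 0 = everything in one group.
# Assumes p holds the proportions of one category and sums to 1.
evenness_score <- function(p) {
  k <- length(p)
  total_dev <- sum(abs(p - 1 / k))   # sum of the per-group close_ratio values
  max_dev <- 2 * (1 - 1 / k)         # total deviation when one group has 100%
  1 - total_dev / max_dev
}

# Race proportions from the sample data, one row per city
race <- rbind(
  City1 = c(black = 0.4, white = 0.3, hisp = 0.2, asian = 0.1),
  City2 = c(0.1, 0.7, 0.1, 0.1),
  City3 = c(0.3, 0.1, 0.2, 0.4),
  City4 = c(0.2, 0.2, 0.2, 0.4)
)

race_score <- apply(race, 1, evenness_score)
print(round(race_score, 3))

# The two extremes described in the question
evenness_score(rep(0.25, 4))    # even split
evenness_score(c(1, 0, 0, 0))   # one group only
```

Because the rescaling depends on k, the same function can be applied to the seven employment-sector columns without changes.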

Select all values of a variable for which there is data for every year

Say I have some data with 2 numeric variables ranging from 0 to 1 (it1, it2), a name variable, which has the name of the subject the numeric variable belongs to and then some date for every measure, ranging from year 2014 to 2017. Now, what I want to do is create a data set that only contains measures of people that have values for every year of my measure, and then in the future maybe specify that I only want measures for people with data ranging from 2015 to 2017. Does anybody have any hint on what package or code could help me with my problem? Thanks in advance.
date <- c("2015-11-26", "2015-12-30","2016-11-13", "2014-09-22", "2014-01-13", "2014-07-26", "2016-11-26", "2016-04-04", "2017-04-09", "2017-02-23", "2015-03-22")
names <- c("Max", "Allen", "Allen", "Bob", "Max", "Sarah", "Max", "Sarah", "Max", "Sarah", "Sarah")
it1 <- c(0.6, 0.3, 0.1, 0.2, 0.3, 0.8, 0.8, 0.5, 0.5, 0.3, 0.7)
it2 <- c(0.5, 0.8, 0.1, 0.4, 0.4, 0.4, 0.5, 0.8, 0.6, 0.5, 0.4)
date <- as.Date(date, format = "%Y-%m-%d")
myframe <- data.frame(date, names, it1, it2)
Desired output:
date <- c("2015-11-26", "2014-01-13", "2014-07-26", "2016-11-26", "2016-04-04", "2017-04-09", "2017-02-23", "2015-03-22")
names <- c("Max", "Max", "Sarah", "Max", "Sarah", "Max", "Sarah", "Sarah")
it1 <- c(0.6, 0.3, 0.8, 0.8, 0.5, 0.5, 0.3, 0.7)
it2 <- c(0.5, 0.4, 0.4, 0.5, 0.8, 0.6, 0.5, 0.4)
date <- as.Date(date, format = "%Y-%m-%d")
myframe <- data.frame(date, names, it1, it2)
Create a table of year vs. name and for those names in all years select out those rows. No packages are used.
tab <- table(as.POSIXlt(myframe$date)$year + 1900, myframe$names)
subset(myframe, names %in% colnames(tab)[colSums(sign(tab)) == nrow(tab)])
giving:
date names it1 it2
1 2015-11-26 Max 0.6 0.5
5 2014-01-13 Max 0.3 0.4
6 2014-07-26 Sarah 0.8 0.4
7 2016-11-26 Max 0.8 0.5
8 2016-04-04 Sarah 0.5 0.8
9 2017-04-09 Max 0.5 0.6
10 2017-02-23 Sarah 0.3 0.5
11 2015-03-22 Sarah 0.7 0.4
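The question also mentions restricting the check to a range of years later on (for example 2015 to 2017). A base-R sketch of that variation: filter to the range first, then keep the names that appear in every year of the range. The helper name keep_complete_years and its arguments are hypothetical.

```r
myframe <- data.frame(
  date = as.Date(c("2015-11-26", "2015-12-30", "2016-11-13", "2014-09-22",
                   "2014-01-13", "2014-07-26", "2016-11-26", "2016-04-04",
                   "2017-04-09", "2017-02-23", "2015-03-22")),
  names = c("Max", "Allen", "Allen", "Bob", "Max", "Sarah", "Max", "Sarah",
            "Max", "Sarah", "Sarah"),
  it1 = c(0.6, 0.3, 0.1, 0.2, 0.3, 0.8, 0.8, 0.5, 0.5, 0.3, 0.7),
  it2 = c(0.5, 0.8, 0.1, 0.4, 0.4, 0.4, 0.5, 0.8, 0.6, 0.5, 0.4)
)

# Keep rows within [from, to] for people observed in every year of that range
keep_complete_years <- function(df, from, to) {
  yr <- as.integer(format(df$date, "%Y"))
  inside <- yr >= from & yr <= to
  in_range <- df[inside, ]
  # Number of distinct years observed per person within the range
  n_years <- tapply(yr[inside], as.character(in_range$names),
                    function(y) length(unique(y)))
  complete <- names(n_years)[n_years == (to - from + 1)]
  in_range[in_range$names %in% complete, ]
}

res <- keep_complete_years(myframe, 2015, 2017)
print(res)
```

Calling keep_complete_years(myframe, 2014, 2017) should select the same people as the table-based approach above, since that approach checks every year present in the data.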
library(lubridate)
myframe[with(data = myframe[year(myframe$date) >= 2014 & year(myframe$date) <= 2017,],
expr = ave(year(date), names, FUN = function(x)
all(year(date) %in% x))) == 1,]
# date names it1 it2
#1 2015-11-26 Max 0.6 0.5
#5 2014-01-13 Max 0.3 0.4
#6 2014-07-26 Sarah 0.8 0.4
#7 2016-11-26 Max 0.8 0.5
#8 2016-04-04 Sarah 0.5 0.8
#9 2017-04-09 Max 0.5 0.6
#10 2017-02-23 Sarah 0.3 0.5
#11 2015-03-22 Sarah 0.7 0.4

R: Extracting the highest numeric value from each character value in a column

I have a character field in a dataframe that contains numbers e.g. (0.5,3.5,7.8,2.4).
For every record I am trying to extract the largest value from the string and put it in a new column.
e.g.
x csi
1 0.5, 6.7, 2.3
2 9.5, 2.6, 1.1
3 0.7, 2.3, 5.1
4 4.1, 2.7, 4.7
The desired output would be:
x csi csi_max
1 0.5, 6.7, 2.3 6.7
2 9.5, 2.6, 1.1 9.5
3 0.7, 2.3, 5.1 5.1
4 4.1, 2.7, 4.7 4.7
I have had various attempts ...with my latest attempt being the following - which provides the maximum csi score from the entire column rather than from the individual row's csi numbers...
library(stringr)
numextract <- function(string){
str_extract(string, "\\-*\\d+\\.*\\d*")
}
df$max_csi <- max(numextract(df$csi))
Thank you
We can use tidyverse
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(csi, convert = TRUE) %>% # convert = TRUE makes csi numeric, so max() compares numbers, not strings
group_by(x) %>%
summarise(csi_max = max(csi)) %>%
left_join(df1, .)
# x csi csi_max
#1 1 0.5, 6.7, 2.3 6.7
#2 2 9.5, 2.6, 1.1 9.5
#3 3 0.7, 2.3, 5.1 5.1
#4 4 4.1, 2.7, 4.7 4.7
Or this can be done with pmax from base R after separating the 'csi' column into a data.frame with read.table
df1$csi_max <- do.call(pmax, read.table(text=df1$csi, sep=","))
Hope this helps!
df$csi_max <- sapply(df$csi, function(x) max(as.numeric(unlist(strsplit(as.character(x), split=",")))))
Output is:
x csi csi_max
1 1 0.5, 6.7, 2.3 6.7
2 2 9.5, 2.6, 1.1 9.5
3 3 0.7, 2.3, 5.1 5.1
4 4 4.1, 2.7, 4.7 4.7
#sample data
> dput(df)
structure(list(x = 1:4, csi = structure(c(1L, 4L, 2L, 3L), .Label = c("0.5, 6.7, 2.3",
"0.7, 2.3, 5.1", "4.1, 2.7, 4.7", "9.5, 2.6, 1.1"), class = "factor")), .Names = c("x",
"csi"), class = "data.frame", row.names = c(NA, -4L))
Edit:
As suggested by @RichScriven, a more efficient way could be
df$csi_max <- sapply(strsplit(as.character(df$csi), ","), function(x) max(as.numeric(x)))
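Two things go wrong in the attempt from the question: str_extract() returns only the first match in each string, and max() is then applied once to the whole vector instead of once per row. The row-wise logic can be sketched in base R as below; the helper name row_max is hypothetical, and the fifth row is an extra value I added to show why the pieces should be converted to numbers before comparing.

```r
df <- data.frame(
  x = 1:5,
  csi = c("0.5, 6.7, 2.3", "9.5, 2.6, 1.1", "0.7, 2.3, 5.1",
          "4.1, 2.7, 4.7", "10.2, 9.5"),
  stringsAsFactors = FALSE
)

# Split each string on commas, convert the pieces to numbers,
# and take the maximum within each row
row_max <- function(s) {
  pieces <- strsplit(as.character(s), ",", fixed = TRUE)
  vapply(pieces, function(p) max(as.numeric(p)), numeric(1))
}

df$csi_max <- row_max(df$csi)
print(df)

# Comparing the pieces as text would pick "9.5" over "10.2"
max(c("10.2", "9.5"))
```

as.character() is kept inside the helper so that it also works when csi is stored as a factor, as in the dput() output above.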
A solution using the splitstackshape package.
library(splitstackshape)
dat$csi_max <- apply(cSplit(dat, "csi")[, -1], 1, max)
dat
# x csi csi_max
# 1 1 0.5, 6.7, 2.3 6.7
# 2 2 9.5, 2.6, 1.1 9.5
# 3 3 0.7, 2.3, 5.1 5.1
# 4 4 4.1, 2.7, 4.7 4.7
DATA
dat <- read.table(text = "x csi
1 '0.5, 6.7, 2.3'
2 '9.5, 2.6, 1.1'
3 '0.7, 2.3, 5.1'
4 '4.1, 2.7, 4.7'",
header = TRUE, stringsAsFactors = FALSE)
