Errors when exporting plots to Plot.ly - r

I have this data (sample of the first 20 rows):
Codering variable value
1 Z1 Week.0 0
2 Z2 Week.0 0
3 Z3 Week.0 0
4 Z4 Week.0 0
5 Z5 Week.0 0
6 Z6 Week.0 0
7 Z7 Week.0 0
8 Z8 Week.0 0
9 Z9 Week.0 0
10 Z101 Week.0 NA
11 Z102 Week.0 NA
12 Z1 Week.1 0
13 Z2 Week.1 0
14 Z3 Week.1 0
15 Z4 Week.1 0
16 Z5 Week.1 0
17 Z6 Week.1 0
18 Z7 Week.1 0
19 Z8 Week.1 0
and I plot it using:
pZ <- ggplot(zmeltdata,aes(x=variable,y=value,color=Codering,group=Codering)) +
geom_line()+
geom_point()+
theme_few()+
theme(legend.position="right")+
scale_color_hue(name = "Treatment group:")+
scale_y_continuous(labels = percent)+
ylab("Germination percentage")+
xlab("Week number")+
labs(title = "Z. monophyllum germination data")
pZ
The graph displays just fine:
Yet when I want to export this to Plot.ly I get the following errors:
> py <- plotly()
> response<-py$ggplotly(pZ)
Error in if (all(xcomp) && all(ycomp)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In trace.list[[lind[1]]]$y == trace.list[[lind[2]]]$y :
longer object length is not a multiple of shorter object length
I have searched for these errors, yet the explanations thoroughly confuse me. "Missing value where TRUE/FALSE needed" is supposed to occur when you use logical terms such as IF/ELSE/TRUE/FALSE in your process, which I don't do at all! Even when checking for any NA's in the value of the graph I get:
> is.na(pZ)
data layers scales mapping theme coordinates facet plot_env labels
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
and the 'longer object length is not a multiple of shorter object length' warning is supposed to pop up when you have objects of different lengths, but I'm only using 1 object with 3 columns that have exactly the same length. The value of the graph does give me a NULL when I ask for those rows, but that is supposed to happen.
> nrow(zmeltdata)
[1] 143
> nrow(test)
NULL
All in all, I'm very confused and don't know how to handle these errors correctly. Could someone elaborate?
Thanks for your time.
EDIT: I have tried to export a different graph to Plot.ly using a random sample of 1:100 and that worked just fine. I'm pretty sure the error is in my data, I just can't figure out how to fix it.
EDIT2: In response to @Gregor:
> dput(head(zmeltdata, 20))
structure(list(Codering = structure(c(16L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L), .Label = c("B1", "C2", "C3", "C8", "M1", "M101", "M102",
"M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9", "Z1", "Z101",
"Z102", "Z2", "Z3", "Z4", "Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"),
variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Week.0",
"Week.1", "Week.2", "Week.3", "Week.4", "Week.5", "Week.6",
"Week.7", "Week.8", "Week.9", "Week.10", "Week.11", "Week.12"
), class = "factor"), value = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
NA, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Codering",
"variable", "value"), row.names = c(NA, 20L), class = "data.frame")
And the tail:
> dput(tail(zmeltdata, 43))
structure(list(Codering = structure(c(19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L,
26L, 17L, 18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L,
18L, 16L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 17L, 18L), .Label = c("B1",
"C2", "C3", "C8", "M1", "M101", "M102", "M2", "M3", "M4", "M5",
"M6", "M7", "M8", "M9", "Z1", "Z101", "Z102", "Z2", "Z3", "Z4",
"Z5", "Z6", "Z7", "Z8", "Z9"), class = "factor"), variable = structure(c(10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L,
11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L), .Label = c("Week.0", "Week.1", "Week.2", "Week.3",
"Week.4", "Week.5", "Week.6", "Week.7", "Week.8", "Week.9", "Week.10",
"Week.11", "Week.12"), class = "factor"), value = c(0.1, 0.06,
0.05, 0.09, 0.04, 0.08, 0.05, 0.08, 0, 0, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("Codering",
"variable", "value"), row.names = 101:143, class = "data.frame")
I am not at all surprised by these. There are quite a few NA's in the dataset, but they shouldn't be an issue, since I have used a similar (bigger) dataset before.
And I also have the .csv file for you to use if you wish: https://www.mediafire.com/?jij1vlp14a29ntt

The issue is about handling NA's... I got https://plot.ly/~marianne2/417/z-monophyllum-germination-data/ by running the following code:
pZ <- ggplot(na.omit(zmeltdata), aes(x=variable, y=value, color=Codering,
group=Codering)) +
geom_line() +
geom_point() +
# theme_few() +
theme(legend.position="right") +
scale_color_hue(name="Treatment group:") +
# scale_y_continuous(labels = percent) +
ylab("Germination percentage") +
xlab("Week number") +
labs(title="Z. monophyllum germination data")
py$ggplotly(pZ, kwargs=list(fileopt="overwrite", filename="test_zdata"))
Note that I had to comment out theme_few() and scale_y_continuous(labels = percent) because from loading only "ggplot2", I would get the following errors:
Error: could not find function "theme_few"
and
Error in structure(list(call = match.call(), aesthetics = aesthetics, :
object 'percent' not found
respectively. I guess these are dependency issues (maybe you're using a version of "ggthemes"?).
I don't know what kind of magic theme_few() does, but if I don't use na.omit() on zmeltdata, my pZ plot looks like this:
Eww, "Week.10" comes after "Week.1" instead of after "Week.9"... So you wouldn't want to send this to plotly anyway! So I cannot exactly reproduce your ggplot example. But I wonder if you really want to keep these NA's (the CSV itself reads "NA", I was expecting blank "cells"). Don't you want to pre-process these anyway?
Note that I get the following warning message when I don't use na.omit() on zmeltdata:
Warning messages:
1: Removed 20 rows containing missing values (geom_path).
2: Removed 47 rows containing missing values (geom_point).
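For what it's worth, here is a minimal sketch of how NA's can end up as the "missing value where TRUE/FALSE needed" error. It does not reproduce what ggplotly does internally; the vectors x and y are made up and only mimic the kind of all(... == ...) comparison that shows up in the error and warning messages.

```r
# Hypothetical y-values of two traces, each ending in a missing value
x <- c(0, 0.1, NA)
y <- c(0, 0.1, NA)

# Comparing with NA gives NA, and all() of TRUE, TRUE, NA is NA as well
cmp <- all(x == y)
is.na(cmp)

# if () cannot branch on NA, which is what raises the error
msg <- tryCatch(
  if (cmp) "same" else "different",
  error = function(e) conditionMessage(e)
)
msg

# With the incomplete values dropped, the comparison is a plain TRUE
ok <- !is.na(x) & !is.na(y)
cmp_clean <- all(x[ok] == y[ok])
cmp_clean
```

So the error can appear even if you never write an if yourself: it is enough that some code you call tests a condition built from values that contain NA.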
Again, beyond pure displaying/plotting considerations, since this looks like scientific data, wouldn't you want to number weeks with actual numbers, or pad the digits if you really want a string? ("Week.01", "Week.02", etc.)
And it looks like the missing data is all trailing... There's just no data (yet) for weeks 10+, right?
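A small sketch of both options, using a made-up vector of labels in the same "Week.N" format (the names weeks, week_num and week_padded are just for illustration):

```r
# Made-up labels in the same format as the "variable" column
weeks <- factor(c("Week.0", "Week.1", "Week.10", "Week.2"))

# Option 1: strip the prefix and keep the week as an actual number
week_num <- as.integer(sub("Week.", "", as.character(weeks), fixed = TRUE))
week_num

# Option 2: keep a string, but pad the digits so that it sorts correctly
week_padded <- sprintf("Week.%02d", week_num)
sort(week_padded)
```

With either version, week 10 ends up after week 9 (here: after week 2) instead of right after week 1.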
Thanks for reporting,
Marianne

Related

Replace multiple characters from multiple columns in R

Given a dataframe as follows:
structure(list(date = structure(1:24, .Label = c("2010Y1-01m",
"2010Y1-02m", "2010Y1-03m", "2010Y1-04m", "2010Y1-05m", "2010Y1-06m",
"2010Y1-07m", "2010Y1-08m", "2010Y1-09m", "2010Y1-10m", "2010Y1-11m",
"2010Y1-12m", "2011Y1-01m", "2011Y1-02m", "2011Y1-03m", "2011Y1-04m",
"2011Y1-05m", "2011Y1-06m", "2011Y1-07m", "2011Y1-08m", "2011Y1-09m",
"2011Y1-10m", "2011Y1-11m", "2011Y1-12m"), class = "factor"),
a = structure(c(1L, 18L, 19L, 20L, 22L, 23L, 2L, 4L, 5L,
7L, 8L, 10L, 1L, 21L, 3L, 6L, 9L, 11L, 12L, 13L, 14L, 15L,
16L, 17L), .Label = c("--", "10159.28", "10295.69", "10580.82",
"10995.65", "11245.84", "11327.23", "11621.99", "12046.63",
"12139.78", "12848.27", "13398.26", "13962.6", "14559.72",
"14982.58", "15518.64", "15949.87", "7363.45", "8237.71",
"8830.99", "9309.47", "9316.56", "9795.77"), class = "factor"),
b = structure(c(2L, 16L, 23L, 24L, 4L, 6L, 7L, 9L, 10L, 12L,
14L, 17L, 1L, 22L, 3L, 5L, 8L, 11L, 13L, 15L, 18L, 19L, 20L,
21L), .Label = c("-", "--", "1058.18", "1455.6", "1539.01",
"1867.07", "2036.92", "2102.23", "2372.84", "2693.96", "2769.65",
"2973.04", "3146.88", "3227.23", "3604.71", "365.07", "3678.01",
"4043.18", "4438.55", "4860.76", "5360.94", "555.51", "653.19",
"980.72"), class = "factor"), c = structure(c(2L, 6L, 10L,
11L, 13L, 15L, 16L, 18L, 20L, 22L, 24L, 7L, 1L, 9L, 12L,
14L, 17L, 19L, 21L, 23L, 3L, 4L, 5L, 8L), .Label = c("-",
"--", "1092.73", "1222.48", "1409.07", "158.18", "1748.44",
"2179.42", "227.68", "268.53", "331.81", "366.95", "434.19",
"486.41", "538.49", "606.62", "614.75", "651.46", "729.44",
"736.55", "836.46", "890.81", "929.72", "981.65"), class = "factor")), class = "data.frame", row.names = c(NA,
-24L))
How could I replace -- and - in only columns a and b with NA? Thanks.
You can use:
cols <- c('a', 'b')
df[cols][df[cols] == '--' | df[cols] == '-'] <- NA
Or using dplyr:
library(dplyr)
df %>% mutate(across(c(a, b), ~replace(., . %in% c('--', '-'), NA)))
I think it's better to try to avoid the data being read in like this in the first place, but if you need to correct it after, you can try using the na.strings argument in type.convert. Notice that it's na.strings with an "s" -- it's plural, so more than one value can be used to represent NA values.
df[c("a", "b")] <- lapply(df[c("a", "b")], type.convert, na.strings = c("--", "-"))
str(df)
# 'data.frame': 24 obs. of 4 variables:
# $ date: Factor w/ 24 levels "2010Y1-01m","2010Y1-02m",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ a : num NA 7363 8238 8831 9317 ...
# $ b : num NA 365 653 981 1456 ...
# $ c : Factor w/ 24 levels "-","--","1092.73",..: 2 6 10 11 13 15 16 18 20 22 ...
head(df)
# date a b c
# 1 2010Y1-01m NA NA --
# 2 2010Y1-02m 7363.45 365.07 158.18
# 3 2010Y1-03m 8237.71 653.19 268.53
# 4 2010Y1-04m 8830.99 980.72 331.81
# 5 2010Y1-05m 9316.56 1455.60 434.19
# 6 2010Y1-06m 9795.77 1867.07 538.49
Note that in this particular case, you could also use the side effect of as.numeric(as.character(...)) converting anything that can't be coerced to numeric to NA, but keep in mind that you will get a warning for each column that you use this approach on.
lapply(df[c("a", "b")], function(x) as.numeric(as.character(x)))
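As for avoiding the problem when the data is read in: a minimal sketch, assuming the data comes from a CSV file. The csv_text below is a made-up three-row stand-in for the real file. Note that na.strings in read.csv applies to every column, so this only fits if you are happy to treat the dashes as NA everywhere (which is not the case for column c in the question).

```r
# Made-up stand-in for the real file
csv_text <- paste(
  "date,a,b",
  "2010Y1-01m,--,--",
  "2010Y1-02m,7363.45,365.07",
  "2011Y1-01m,--,-",
  sep = "\n"
)

# Both "--" and "-" are read as NA, so a and b come in as numeric
df2 <- read.csv(text = csv_text, na.strings = c("--", "-"))
str(df2)
```

The date column is left alone, because na.strings only matches whole fields, not dashes inside a longer string.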

how to produce Sankey diagram with auto-references and circular-references using NetworkD3 library R

I have this data:
list(nodes = structure(list(name = c(NA, NA, "1.1.1. Formação Florestal",
"1.1.2. Formação Savanica", NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "3.1. Pastagem", NA, NA, NA, "3.2.1. Cultura Anual e Perene",
NA, "3.3. Mosaico de Agricultura e Pastagem", NA, NA, "4.2. Infraestrutura Urbana",
"4.5. Outra Área não Vegetada", NA, NA, NA, NA, NA, NA, NA, "5.1 Rio ou Lago ou Oceano"
)), class = "data.frame", row.names = c(NA, -33L)), links = structure(list(
source = c(3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 15L, 15L,
15L, 15L, 15L, 15L, 15L, 19L, 19L, 19L, 19L, 21L, 21L, 21L,
21L, 21L, 21L, 24L, 25L, 25L, 25L, 33L), target = c(3L, 21L,
4L, 21L, 15L, 3L, 25L, 4L, 33L, 19L, 15L, 21L, 3L, 25L, 4L,
33L, 15L, 19L, 4L, 21L, 4L, 21L, 25L, 33L, 15L, 3L, 4L, 25L,
4L, 33L, 33L), value = c(0.544859347827813, 0.00354385993588971,
0.494359662221154, 4.67602736159475, 2.20248911690968, 0.501437742068369,
0.00354375594818463, 24.8427814053755, 0.439418727642527,
0.0079740332093807, 11.8060486886398, 2.76329829691466, 0.000886029792298199,
0.00177186270758855, 3.35504921147758, 0.14263144351167,
1.12170804870686, 0.0478454594554582, 0.217079959877658,
0.00620223918980076, 1.79754946594068, 9.02868098124075,
0.00442981113709027, 0.242743895018645, 0.498770814980772,
0.00265782877794886, 0.000885894856554407, 0.379188333632346,
0.00265793188317263, 0.00265771537700804, 0.39158027235054
)), row.names = c(NA, -31L), class = "data.frame"))
and I'm trying to produce a sankey diagram using the networkD3 package with this simple code:
sankeyNetwork(Links = landuse$links, Nodes = landuse$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
I received this message:
Warning message:
It looks like Source/Target is not zero-indexed. This is required in JavaScript and so your plot may not render.
But even if I zero-index the target/source, nothing is rendering in dev. I have the data in the same format as in this example, so I would like to know what the possible problem is.
EDIT:
I have auto-references and circular-references. Is it possible to do the diagram with this type of data using the package?
Based on the example you provided a link to in one of your comments (here), you don't actually want auto and circular references, but instead what you want is two distinct nodes for each thing, one for the left column and one for the right column (e.g. "Formação Florestal" in the left/1985 column and "Formação Florestal" in the right/2017 column).
You can achieve that with the data you provided by distinguishing the source and target nodes that have the same index as separate nodes, like so...
landuse <- list(
nodes = data.frame(
name = c(
NA, NA, "1.1.1. Formação Florestal", "1.1.2. Formação Savanica", NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, "3.1. Pastagem", NA, NA, NA,
"3.2.1. Cultura Anual e Perene", NA,
"3.3. Mosaico de Agricultura e Pastagem", NA, NA,
"4.2. Infraestrutura Urbana", "4.5. Outra Área não Vegetada", NA, NA, NA,
NA, NA, NA, NA,"5.1 Rio ou Lago ou Oceano"
),
stringsAsFactors = FALSE
),
links = data.frame(
source = c(
3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 15L, 15L, 15L, 15L, 15L, 15L, 15L,
19L, 19L, 19L, 19L, 21L, 21L, 21L, 21L, 21L, 21L, 24L, 25L, 25L, 25L, 33L
),
target = c(
3L, 21L, 4L, 21L, 15L, 3L, 25L, 4L, 33L, 19L, 15L, 21L, 3L, 25L, 4L, 33L,
15L, 19L, 4L, 21L, 4L, 21L, 25L, 33L, 15L, 3L, 4L, 25L, 4L, 33L,33L
),
value = c(
0.544859347827813, 0.00354385993588971, 0.494359662221154,
4.67602736159475, 2.20248911690968, 0.501437742068369,
0.00354375594818463, 24.8427814053755, 0.439418727642527,
0.0079740332093807, 11.8060486886398, 2.76329829691466,
0.000886029792298199, 0.00177186270758855, 3.35504921147758,
0.14263144351167, 1.12170804870686, 0.0478454594554582,
0.217079959877658, 0.00620223918980076, 1.79754946594068,
9.02868098124075, 0.00442981113709027, 0.242743895018645,
0.498770814980772, 0.00265782877794886, 0.000885894856554407,
0.379188333632346, 0.00265793188317263, 0.00265771537700804,
0.39158027235054
),
stringsAsFactors = FALSE
)
)
# create a links data frame where the right and left column versions of each node
# are distinguishable
links <-
data.frame(source = paste0(landuse$nodes$name[landuse$links$source], " (1985)"),
target = paste0(landuse$nodes$name[landuse$links$target], " (2017)"),
value = landuse$links$value,
stringsAsFactors = FALSE)
# build a nodes data frame from the new links data frame
nodes <- data.frame(name = unique(c(links$source, links$target)),
stringsAsFactors = FALSE)
# change the source and target variables to be the zero-indexed position of
# each node in the new nodes data frame
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
# remove the year indicator from the node names
nodes$name <- substring(nodes$name, 1, nchar(nodes$name) - 7)
# plot it
library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
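Well, because of how sankeyNetwork is built, the source and target columns have to be zero-based indices into the nodes data frame. Your values are one-based row positions in landuse$nodes (the smallest one is 3 only because the first two nodes are never used).
You can reindex the links by subtracting 1, so that each index still points at the right node name:
landuse$links$source <- landuse$links$source - 1
landuse$links$target <- landuse$links$target - 1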
sankeyNetwork(Links = landuse$links, Nodes = landuse$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "km²", fontSize = 12, nodeWidth = 30)
Admittedly, it does not look as pretty as the sankey you link to in your question. Why? Because of your data:
You have "autoreferences": links where the source and the target are the same node. These generate the weird semicircles that start and end in the same node.
You have "circular references": links where source 'X' goes to target 'Y', source 'Y' goes to target 'Z', and then source 'Z' goes back to target 'X'. These generate the weird curves.
Some of your values are several orders of magnitude smaller than others, so the small ones are barely visible.
You may need to sanity check your data:
Are you really interested in "autoreferences"? If not, delete them.
Are you comfortable with circular references, or would you prefer to duplicate nodes to show a linear sankey?
Are you interested in showing very small nodes? If not, delete them.
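If you decide to delete them, a rough sketch in base R could look like this. The toy links data frame is a four-row excerpt (with rounded values) of the one in the question, and the 0.005 threshold is an arbitrary assumption that you would need to pick yourself.

```r
# Four-row excerpt of the links, values rounded
links <- data.frame(
  source = c(3L, 3L, 4L, 15L),
  target = c(3L, 21L, 4L, 19L),
  value  = c(0.54, 0.0035, 24.8, 0.008)
)

# Drop "autoreferences": links whose source and target are the same node
no_self <- links[links$source != links$target, ]

# Drop links below an arbitrary, made-up threshold
min_value <- 0.005
big_enough <- no_self[no_self$value >= min_value, ]
big_enough
```

The filtered links still need to be zero-indexed before they are passed to sankeyNetwork.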

issue when reordering factor variable by numeric

require(ggplot2)
The data: It's shark incidents grouped by shark species. It's actually a real dataset, already summarized.
D <- structure(list(FL_FATAL = structure(c(2L, 2L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("FATAL",
"NO FATAL"), class = "factor"), spec = structure(c(26L, 24L,
6L, 26L, 25L, 16L, 2L, 11L, 27L, 5L, 24L, 29L, 12L, 21L, 13L,
15L, 28L, 1L, 17L, 19L, 8L, 3L, 6L, 13L, 22L, 18L, 27L, 14L,
23L, 20L, 7L, 4L, 8L, 9L, 10L), .Label = c("blacknose", "blacktip",
"blue", "bonnethead", "bronze", "bull", "caribbean", "draughtsboard",
"dusky", "galapagos", "ganges", "hammerhead", "involve", "leon",
"mako", "nurse", "porbeagle", "recovered", "reef", "sand", "sandtiger",
"sevengill", "spinner", "tiger", "unconfired", "white", "whitespotted",
"whitetip", "wobbegong"), class = "factor"), N = c(368L, 169L,
120L, 107L, 78L, 77L, 68L, 59L, 56L, 53L, 46L, 42L, 35L, 35L,
33L, 30L, 29L, 29L, 26L, 25L, 25L, 25L, 24L, 24L, 21L, 21L, 20L,
20L, 17L, 16L, 16L, 15L, 11L, 11L, 11L)), .Names = c("FL_FATAL",
"spec", "N"), row.names = c(NA, -35L), class = "data.frame")
head(D)
# FL_FATAL spec N Especies
# 1 NO FATAL white 368 white
# 2 NO FATAL tiger 169 tiger
# 3 NO FATAL bull 120 bull
# 4 FATAL white 107 white
# 5 NO FATAL unconfired 78 unconfired
# 6 NO FATAL nurse 77 nurse
Reordering a factor variable by a numeric making a new variable.
# Re-order spec creating Especies variable ordered by D$N
D$Especies <- factor(D$spec, levels = unique(D[order(D$N), "spec"]))
# These two plots work as expected
ggplot(D, aes(x=N, y=Especies)) +
geom_point(aes(size = N, color = FL_FATAL))
ggplot(D, aes(x=N, y=Especies)) +
geom_point(aes(size = N, color = FL_FATAL)) +
facet_grid(. ~ FL_FATAL)
Reordering using reorder()
# Using reorder isn't working, or am I missing something?
ggplot(D, aes(x=N, y=reorder(D$spec, D$N))) +
geom_point(aes(size = N, color = FL_FATAL))
# adding facets makes it worse
ggplot(D, aes(x=N, y=reorder(D$spec, D$N))) +
geom_point(aes(size = N, color = FL_FATAL)) +
facet_grid(. ~ FL_FATAL)
Which would be the correct approach for producing the plots with reorder()?
The problem is that by using D$ in your reorder call, you're reordering spec independent of the data frame, so the values no longer match up with the corresponding x values. You need to use it directly on the variables:
ggplot(D, aes(x=N, y=reorder(spec, N, sum))) +
geom_point(aes(size = N, color = FL_FATAL)) +
facet_grid(. ~ FL_FATAL)
I'm surprised you like your first way; it's a happy coincidence that it worked out. Most of your species have one N value (NO FATAL only), but a few have both FATAL and NO FATAL. Whenever more than one numeric value corresponds to a factor level, reorder applies a function to those values to get the score used for the final sort. The default function is mean, but you probably want sum, to sort by the total number of incidents.
D$spec_order <- reorder(D$spec, D$N, sum)
ggplot(D, aes(x=N, y=spec_order)) +
geom_point(aes(size = N, color = FL_FATAL))
ggplot(D, aes(x=N, y=spec_order)) +
geom_point(aes(size = N, color = FL_FATAL)) +
facet_grid(. ~ FL_FATAL)
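To see the difference between mean and sum without plotting anything, here is a toy example. The data frame d is made up and chosen so that the two orderings differ; it is not taken from the shark data.

```r
# Made-up counts: "white" has two rows, the others have one
d <- data.frame(
  spec = c("white", "white", "tiger", "bull"),
  N    = c(100, 100, 150, 50)
)

# Default FUN = mean: white scores 100, so it sorts before tiger (150)
by_mean <- reorder(d$spec, d$N)
levels(by_mean)

# FUN = sum: white scores 200, so it sorts after tiger (150)
by_sum <- reorder(d$spec, d$N, FUN = sum)
levels(by_sum)
```

The levels are sorted by increasing score, so the last level is the one that ends up at the top of the y axis.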

Add column of ranks with mean/average tied ranks

I know that this question is very similar to this one:
Add a column of ranks
Considering we have data like this:
test <- data.frame(A=c("aaabbb",
"aaaabb",
"aaaabb",
"aaaaab",
"bbbaaa",
"bbbbaa"),
B=c("10.00",
"00.04",
"00.04",
"00.00",
"20.00",
"00.06"
))
I need the tied ranks to be averaged though, so that I have something like this:
> test
A B C
1 aaabbb 10.00 1
2 aaaabb 00.04 2.5
3 aaaabb 00.04 2.5
4 aaaaab 00.00 3
5 bbbaaa 20.00 4
6 bbbbaa 00.06 5
EDIT:
> dput(qual_orderedadj_ranks)
structure(list(words = structure(c(29L, 7L, 28L, 6L, 19L, 21L,
9L, 11L, 30L, 1L, 8L, 10L, 13L, 12L, 5L, 26L, 27L, 32L, 33L,
3L, 22L, 18L, 16L, 24L, 25L, 31L, 23L, 2L, 17L), .Label = c("average","yellow", "emerald",
"sense","slate", "turcquoise", "green", "orange", "fair", "chestnut", "sand", "good",
"silver", "sense", "sense", "gray", "lousy", "wine", "smalt", "sense", "taupe", "poor",
"blue", "red", "black", "gold", "white", "teal", "terracotta", "purple", "violett",
"olive", "khaki"), class = "factor"), enzo = c(9.57973168019844, 2.68331227860491,
1.85920971038049, 1.28384868054554, 0.885031778228944, 0.740942048756444,
0.415649187810432, 0.0418303446590026, 0.0836608598897025, 0.680367202534345,
1.53377945661345, 1.70660459871111, 39.2413924890553,
239.081124461913, 0, 0, 0, 0, 0, 86.5734538416169, 24.2262630473592,
0.669305983927372, 0.5093534157301, 0.25098462655732, 0.0836608598897025,
0.0418303446590026, 0.276963945905033, 0.839118699701029, 1.00634089909635),
ranks = c(1, 2, 3, 4, 5, 6, 7, 17, 17, 10, 11, 12, 13, 14,
17, 17, 17, 17, 17, 20, 21, 22, 23, 24, 17, 17, 27, 28, 29)), row.names =
c(1L, 2L, 3L, 4L, 6L, 8L, 9L, 28L, 27L, 22L, 21L, 20L, 18L, 16L, 11L,
12L, 13L, 14L, 15L, 17L, 19L, 23L, 24L, 25L, 26L, 29L, 10L, 7L,
5L), .Names = c("words", "enzo", "ranks"), class = "data.frame")
Try this:
within(test, B <- rank(A))
Or, if you want to use the original order in A:
within(test, B <- ave(seq_along(A), by=A))
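If the ranks are meant to be based on the numeric value of B (an assumption, the question does not say so explicitly), a sketch with rank() and averaged ties would be the following. The new column name C is taken from the desired output in the question.

```r
test <- data.frame(
  A = c("aaabbb", "aaaabb", "aaaabb", "aaaaab", "bbbaaa", "bbbbaa"),
  B = c("10.00", "00.04", "00.04", "00.00", "20.00", "00.06"),
  stringsAsFactors = FALSE
)

# B is stored as text, so convert it before ranking;
# ties.method = "average" (the default) gives tied values their mean rank
test$C <- rank(as.numeric(test$B), ties.method = "average")
test
```

The two 00.04 rows would otherwise occupy ranks 2 and 3, so both get 2.5.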

Split a R dataframe by rows containing a keyword

Is there a quick way to split a large data.frame by keywords?
For example, given the data set below, is there a quick way to split the data frame at each occurrence of the source:restaurant line? Another take on the question: is there a quick way of creating a factor for the data frame based on a list of cut-offs (in this case c(3,7,10)) that would give me e.g. factors=c(A,A,A,B,B,B,B,C,C,C), which I could then use in split(mylist,factors)? Thanks
mylist=structure(list(V1 = structure(c(5L, 3L, 7L, 8L, 6L, 4L, 7L, 2L,
1L, 7L), .Label = c("cider", "claret", "custard", "krispies",
"rhubarb", "shreddies", "source:restaurant", "weetabix"), class = "factor"),
V2 = c(1L, 5L, NA, 9L, 13L, 17L, NA, 21L, 25L, NA), V3 = c(2L,
6L, NA, 10L, 14L, 18L, NA, 22L, 26L, NA), V4 = c(3L, 7L,
NA, 11L, 15L, 19L, NA, 23L, 27L, NA), V5 = c(4L, 8L, NA,
12L, 16L, 20L, NA, 24L, 28L, NA)), .Names = c("V1", "V2",
"V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -10L
))
A very clunky possible solution is below, but I'm hoping for something a bit more elegant.
temp=NULL
a=which(mylist[,1] == 'source:restaurant')
for(i in seq_along(a)){temp=c(temp,rep(letters[i],(a[i]-length(temp))))}
temp=as.factor(temp)
split(mylist,temp)
The factor:
factor(cumsum(mylist$V1 == "source:restaurant") + 1)
the split:
split(mylist, cumsum(mylist$V1 == "source:restaurant"))
UPDATE: you probably have the source:restaurant row at the end of each group that it marks. To account for this, shifting the marker down by one row would be better:
factor(cumsum(c(0, head(mylist$V1 == "source:restaurant", -1))) + 1)
split(mylist, cumsum(c(0, head(mylist$V1 == "source:restaurant", -1))))
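For the other take on the question (building the factor straight from the cut-offs), a possible sketch is below. It assumes the cut-offs are the row numbers of the last row of each group, as in c(3,7,10); the names cutoffs, grp and is_marker are only for illustration.

```r
# Last row of each group, as given in the question
cutoffs <- c(3, 7, 10)

# Group sizes are the differences between consecutive cut-offs: 3, 4, 3
grp <- rep(LETTERS[seq_along(cutoffs)], times = diff(c(0, cutoffs)))
grp

# Same grouping derived from a marker column, shifted down by one row
is_marker <- seq_len(max(cutoffs)) %in% cutoffs
grp2 <- LETTERS[cumsum(c(0, head(is_marker, -1))) + 1]
identical(grp, grp2)
```

Either vector can be passed as the second argument to split(mylist, ...).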
