get sum of consecutive day values - r

I have a large dataset as follows:
Date rain code
2009-04-01 0.0 0
2009-04-02 0.0 0
2009-04-03 0.0 0
2009-04-04 0.7 1
2009-04-05 54.2 1
2009-04-06 0.0 0
2009-04-07 0.0 0
2009-04-08 0.0 0
2009-04-09 0.0 0
2009-04-10 0.0 0
2009-04-11 0.0 0
2009-04-12 5.3 1
2009-04-13 10.1 1
2009-04-14 6.0 1
2009-04-15 8.7 1
2009-04-16 0.0 0
2009-04-17 0.0 0
2009-04-18 0.0 0
2009-04-19 0.0 0
2009-04-20 0.0 0
2009-04-21 0.0 0
2009-04-22 0.0 0
2009-04-23 0.0 0
2009-04-24 0.0 0
2009-04-25 4.3 1
2009-04-26 42.2 1
2009-04-27 45.6 1
2009-04-28 12.6 1
2009-04-29 6.2 1
2009-04-30 1.0 1
I am trying to calculate the sum of consecutive rain values whenever the code is "1", and I need each of those sums separately. For example, I want the sum of the rain values from 2009-04-12 to 2009-04-15. In other words, I need a way to detect where the code equals 1 over consecutive days and to sum the rain values for each such run.
Any help on the above problem would be greatly appreciated.

One straightforward solution is to use rle. But I suspect there might be more "elegant" solutions out there.
# assuming dd is your data.frame
dd.rle <- rle(dd$code)
# get the start position of each run of consecutive 1's
start <- (cumsum(dd.rle$lengths) - dd.rle$lengths + 1)[dd.rle$values == 1]
# how long does each run of 1's extend?
ival <- dd.rle$lengths[dd.rle$values == 1]
# using these two, compute the sums
apply(as.matrix(seq_along(start)), 1, function(idx) {
  sum(dd$rain[start[idx]:(start[idx] + ival[idx] - 1)])
})
# [1] 54.9 30.1 111.9
Edit: An even simpler method with rle and tapply.
dd.rle <- rle(dd$code)
# get the length of each run of consecutive 1's
ival <- dd.rle$lengths[dd.rle$values == 1]
# using these lengths, construct a `factor` with length(ival) levels
levl <- factor(rep(seq_along(ival), ival))
# use these levels to group `rain[code == 1]` and compute the sums
tapply(dd$rain[dd$code == 1], levl, sum)
# 1 2 3
# 54.9 30.1 111.9
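Another compact variant (a sketch, not part of the answer above) builds a run id directly from the points where code changes and then sums within each run; data.table::rleid() would produce an equivalent id.
# label every run of equal code values with an id that increases at each change
run_id <- cumsum(c(1, diff(dd$code) != 0))
# sum rain within the runs where code == 1
tapply(dd$rain[dd$code == 1], run_id[dd$code == 1], sum)
# should reproduce the same three sums: 54.9, 30.1, 111.9 (named by run id)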

The following is a vectorized way of getting the desired result.
df <- read.table(textConnection("Date rain code\n2009-04-01 0.0 0\n2009-04-02 0.0 0\n2009-04-03 0.0 0\n2009-04-04 0.7 1\n2009-04-05 54.2 1\n2009-04-06 0.0 0\n2009-04-07 0.0 0\n2009-04-08 0.0 0\n2009-04-09 0.0 0\n2009-04-10 0.0 0\n2009-04-11 0.0 0\n2009-04-12 5.3 1\n2009-04-13 10.1 1\n2009-04-14 6.0 1\n2009-04-15 8.7 1\n2009-04-16 0.0 0\n2009-04-17 0.0 0\n2009-04-18 0.0 0\n2009-04-19 0.0 0\n2009-04-20 0.0 0\n2009-04-21 0.0 0\n2009-04-22 0.0 0\n2009-04-23 0.0 0\n2009-04-24 0.0 0\n2009-04-25 4.3 1\n2009-04-26 42.2 1\n2009-04-27 45.6 1\n2009-04-28 12.6 1\n2009-04-29 6.2 1\n2009-04-30 1.0 1"),
                   header = TRUE)
# running total of rain; each run's sum is cumsum at its last row minus cumsum just before it starts
df$cumsum <- cumsum(df$rain)
# +1 marks the row just before a run of 1's starts, -1 marks the last row of a run
df$diff <- c(diff(df$code), 0)
df$result <- rep(NA, nrow(df))
if (nrow(df[df$diff == -1, ]) == nrow(df[df$diff == 1, ])) {
  result <- df[df$diff == -1, "cumsum"] - df[df$diff == 1, "cumsum"]
  df[df$diff == -1, "result"] <- result
} else {
  # the last run of 1's extends to the end of the data, so close it at the final row
  result <- c(df[df$diff == -1, "cumsum"], df[nrow(df), "cumsum"]) - df[df$diff == 1, "cumsum"]
  df[df$diff == -1, "result"] <- result[-length(result)]
  df[nrow(df), "result"] <- result[length(result)]
}
df
## Date rain code cumsum diff result
## 1 2009-04-01 0.0 0 0.0 0 NA
## 2 2009-04-02 0.0 0 0.0 0 NA
## 3 2009-04-03 0.0 0 0.0 1 NA
## 4 2009-04-04 0.7 1 0.7 0 NA
## 5 2009-04-05 54.2 1 54.9 -1 54.9
## 6 2009-04-06 0.0 0 54.9 0 NA
## 7 2009-04-07 0.0 0 54.9 0 NA
## 8 2009-04-08 0.0 0 54.9 0 NA
## 9 2009-04-09 0.0 0 54.9 0 NA
## 10 2009-04-10 0.0 0 54.9 0 NA
## 11 2009-04-11 0.0 0 54.9 1 NA
## 12 2009-04-12 5.3 1 60.2 0 NA
## 13 2009-04-13 10.1 1 70.3 0 NA
## 14 2009-04-14 6.0 1 76.3 0 NA
## 15 2009-04-15 8.7 1 85.0 -1 30.1
## 16 2009-04-16 0.0 0 85.0 0 NA
## 17 2009-04-17 0.0 0 85.0 0 NA
## 18 2009-04-18 0.0 0 85.0 0 NA
## 19 2009-04-19 0.0 0 85.0 0 NA
## 20 2009-04-20 0.0 0 85.0 0 NA
## 21 2009-04-21 0.0 0 85.0 0 NA
## 22 2009-04-22 0.0 0 85.0 0 NA
## 23 2009-04-23 0.0 0 85.0 0 NA
## 24 2009-04-24 0.0 0 85.0 1 NA
## 25 2009-04-25 4.3 1 89.3 0 NA
## 26 2009-04-26 42.2 1 131.5 0 NA
## 27 2009-04-27 45.6 1 177.1 0 NA
## 28 2009-04-28 12.6 1 189.7 0 NA
## 29 2009-04-29 6.2 1 195.9 0 NA
## 30 2009-04-30 1.0 1 196.9 0 111.9
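If only the three run totals are needed rather than the annotated data frame, the non-NA entries of result can be pulled out directly (a small sketch):
df$result[!is.na(df$result)]
# [1]  54.9  30.1 111.9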

Related

Iterate over an xpath (string) in R for data scraping

I've got some (pretty simple) code to download a table of data:
library(rvest)
link = "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics"
aguada = read_html(link)
stats = aguada %>% html_nodes("tbody")
stats = aguada %>% html_nodes(xpath="/html/body/div[1]/div[6]/div/div/div/div[4]/table") %>% html_table()
my_df <- as.data.frame(stats)
And now I'm trying to do the same, but for the URL of each player in the same table:
for (i in 1:17){
  url_path="/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr[i]/td[1]/a"
  jugador[i] = aguada %>% html_nodes(xpath=url_path) %>% html_attr("href")
}
I've tried the code above, and while it doesn't crash, it doesn't work as intended either. I want to create a vector with the URLs (or something like that) so I can then easily get the stats for each player. While we're at it, I'd also like to know whether, instead of writing 1:17 in the for loop and counting the players manually, there's a way to automate that too, so I can do something like for i in 1:table_length.
You need to initialise the vector jugador before you can assign the links into it. Also, when building a path where one piece changes on each iteration, use paste to concatenate the fixed parts of the string with the number i, as shown below:
jugador <- vector()
for(i in 1:17){
  url_path <- paste("/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr[", i, "]/td[1]/a", sep = "")
  jugador[i] <- aguada %>% html_nodes(xpath = url_path) %>% html_attr("href")
}
Result:
> jugador
[1] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/15257?"
[2] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/17101?"
[3] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/17554?"
[4] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/43225?"
[5] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/262286?"
[6] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/623893?"
[7] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/725720?"
[8] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/858052?"
[9] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1645559?"
[10] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1651515?"
[11] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1717089?"
[12] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1924883?"
[13] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1924884?"
[14] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1931124?"
[15] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1950388?"
[16] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1971299?"
[17] "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/person/1991297?"
The links end up in the last column of the table. Without a loop:
library(tidyverse)
library(rvest)
page <-
  "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics" %>%
  read_html()
df <- page %>%
  html_table() %>%
  pluck(1) %>%
  janitor::clean_names() %>%
  mutate(link = page %>%
           html_elements("td a") %>%
           html_attr("href") %>%
           unique())
# A tibble: 17 x 21
jugador p i pts_pr pts as_pr as ro_pr rd_pr rt_pr rt bl_prom bl re_pr re min_pr tc_percent x2p_percent x3p_percent tl_percent link$value
<chr> <int> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 F. MEDINA 22 9 6 131 1.3 29 0.5 0.8 1.3 28 0 0 0.6 13 22 37 55.6 26.8 60 https://hosted.dcd.share~
2 J. SANTISO 23 23 12 277 5.6 128 0.4 2.9 3.3 75 0 0 0.7 15 31 43.1 43.2 43 75 https://hosted.dcd.share~
3 A. ZUVICH 17 1 8.2 139 0.7 11 2 2.9 4.9 83 0.5 8 1.1 19 15.9 59.8 67.1 16.7 76.5 https://hosted.dcd.share~
4 A. YOUNG 15 14 12.5 187 1.3 20 0.4 3.3 3.7 55 0.5 7 0.6 9 30.5 36.2 41.9 32 78.8 https://hosted.dcd.share~
5 E. VARGAS 23 23 16.1 370 1.9 44 3.5 8.4 11.9 273 1.6 37 1.1 25 30.3 53.3 53.5 0 62.6 https://hosted.dcd.share~
6 L. PLANELLS 23 0 3.6 83 1.6 37 0.5 1.1 1.6 37 0.1 2 0.7 17 15.1 35.4 35.1 35.6 90 https://hosted.dcd.share~
7 T. METZGER 11 9 6.8 75 0.6 7 1.7 3.3 5 55 0.4 4 0.5 5 23.1 37 44.2 28.9 40 https://hosted.dcd.share~
8 L. SILVA 19 0 1.1 21 0.1 2 0.2 0.2 0.3 6 0.1 1 0 0 4 35 71.4 15.4 100 https://hosted.dcd.share~
9 J. STOLL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1.2 0 0 0 0 https://hosted.dcd.share~
10 G. BRUN 4 0 0.8 3 0 0 0.3 0 0.3 1 0 0 0 0 0.6 50 0 50 0 https://hosted.dcd.share~
11 A. GENTILE 3 0 0 0 0 0 0.3 0.3 0.7 2 0 0 0 0 1 0 0 0 0 https://hosted.dcd.share~
12 L. CERMINATO 19 5 8.6 163 1.7 33 1.3 3.6 4.9 93 0.7 14 0.9 17 20.9 44.1 51.9 27.1 57.1 https://hosted.dcd.share~
13 J. ADAMS 8 8 16.6 133 1.9 15 1 2.5 3.5 28 0.3 2 1.9 15 28.9 46.2 53.9 26.7 81.8 https://hosted.dcd.share~
14 K. FULLER 5 5 4.6 23 1.8 9 0.6 0.6 1.2 6 0 0 0.4 2 20.1 17.1 0 28.6 83.3 https://hosted.dcd.share~
15 S. MAC 4 4 12.5 50 2 8 0 3 3 12 0.5 2 1.8 7 29.9 37.8 35.5 42.9 76.9 https://hosted.dcd.share~
16 O. JOHNSON 12 12 15.4 185 3.4 41 1 3.2 4.2 50 0.3 4 0.8 9 31.8 47.3 53.6 34.7 75 https://hosted.dcd.share~
17 G. SOLANO 2 2 15.5 31 6.5 13 0.5 5.5 6 12 0 0 1 2 32.4 41.4 55.6 18.2 71.4 https://hosted.dcd.share~
Inside the string, i is just a regular character; XPath knows nothing about it, and it has no connection to the variables in your R session.
However, if you want to select all elements with a given XPath, you don’t need the index at all. That is, the following XPath expression works (I’ve simply removed the [i] part):
/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a
Here’s the corresponding ‘rvest’ code. Note that it uses no loop:
library(rvest)
link = "https://hosted.dcd.shared.geniussports.com/fubb/es/competition/34409/team/2442/statistics"
aguada = read_html(link)
jugador = aguada %>%
  html_nodes(xpath = "/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a/@href")
Or, alternatively:
jugador = aguada %>%
  html_nodes(xpath = "/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a") %>%
  html_attr("href")
Both return a vector of hrefs. The first solution has a slightly different return type (an xml_nodeset of attribute nodes), but for most purposes they behave similarly.
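If you do use the attribute-node variant and want plain character strings, one way (a sketch) is to run html_text() on the result, since the text of an attribute node is its value:
jugador <- aguada %>%
  html_nodes(xpath = "/html/body/div[1]/div[6]/div/div/div/div[4]/table/tbody/tr/td[1]/a/@href") %>%
  html_text()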

Check for nearest value in a column

Is there a way to check which value in a vector/column is nearest to a given value?
So, for example, I have a column with numbers of days:
days: 50, 49, 59, 180, 170, 199, 200
I want to make a new column in the dataframe that marks an X every time the days column has the value 183 or a value close to 183.
It should look like this:
DAYS new column
0
12
12
14
133
140 X
0
12
14
15
178
183 X
0
15
30
72
172 X
Hope you can help me!
You're essentially searching for local minima of the distance to your target. Start off by normalizing your data to the target, i.e. 183, and search for the values closest to zero: those are your local minima. I added data with values greater than your target to demonstrate.
df <- data.frame(DAYS = c(0,12,12,14,133,140,0,12,14,15,178,183,184,190,0,15,30,72,172,172.5))
df$localmin <- abs(df$DAYS - 183)
df
> df
DAYS localmin
1 0.0 183.0
2 12.0 171.0
3 12.0 171.0
4 14.0 169.0
5 133.0 50.0
6 140.0 43.0
7 0.0 183.0
8 12.0 171.0
9 14.0 169.0
10 15.0 168.0
11 178.0 5.0
12 183.0 0.0
13 184.0 1.0
14 190.0 7.0
15 0.0 183.0
16 15.0 168.0
17 30.0 153.0
18 72.0 111.0
19 172.0 11.0
20 172.5 10.5
targets <- which(diff(sign(diff(c(df$localmin, 183)))) == 2) + 1L
df$targets <- 0
df$targets[targets] <- 1
df
> df
DAYS localmin targets
1 0.0 183.0 0
2 12.0 171.0 0
3 12.0 171.0 0
4 14.0 169.0 0
5 133.0 50.0 0
6 140.0 43.0 1
7 0.0 183.0 0
8 12.0 171.0 0
9 14.0 169.0 0
10 15.0 168.0 0
11 178.0 5.0 0
12 183.0 0.0 1
13 184.0 1.0 0
14 190.0 7.0 0
15 0.0 183.0 0
16 15.0 168.0 0
17 30.0 153.0 0
18 72.0 111.0 0
19 172.0 11.0 0
20 172.5 10.5 1
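If you only need the single value nearest to the target (rather than marking one value per cycle as above), a minimal base R sketch is:
days <- c(50, 49, 59, 180, 170, 199, 200)
target <- 183
# index and value of the entry closest to 183
idx <- which.min(abs(days - target))
days[idx]
# [1] 180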

Using Lagged Values Conditionally in R

What I want to do is take the split_coefficient value from the rows where split_coefficient != 1 and use it in calculations with the adjusted_close of the prior dates in the data frame. I'm trying to create a loop in R that will multiply the adjusted_close values by that split_coefficient up to, but not including, the row whose split_coefficient != 1, and then repeat the process through the end of the data set. I am able to identify the rows with split_coefficient != 1 using which(y[,6] != 1), but I cannot figure out how to write the loops to accomplish this task. Any help on how to create this loop would be greatly appreciated. Thank you in advance.
timestamp open high low close adjusted_close split_coefficient
7/20/2018 31.61 31.72 30.95 31.04 31.04 1
7/19/2018 31.17 31.57 30.69 31.19 31.19 1
7/18/2018 30.53 31.33 30.26 30.63 30.63 1
7/17/2018 31.67 31.825 30.49 30.89 30.89 1
7/16/2018 31.24 31.79 31 31.23 31.23 1
7/13/2018 32.06 32.37 31.36 31.45 31.45 1
7/12/2018 32.29 32.68 31.69 31.69 31.69 1
7/11/2018 33.37 33.47 32.43 32.93 32.93 1
7/10/2018 32.19 32.8185 31.75 31.84 31.84 1
7/9/2018 33.32 33.37 32.249 32.48 32.48 0.25
7/6/2018 36.03 36.17 34.15 34.23 34.23 1
7/5/2018 36.47 37.46 36.05 36.09 36.09 1
7/3/2018 36.28 37.8299 36 37.33 37.33 1
7/2/2018 38.74 39.22 37.03 37.08 37.08 1
6/29/2018 36.71 37.06 35.78 37 37 1
6/28/2018 38.88 40.51 37.46 38.03 38.03 0.35
6/27/2018 36.14 39.43 35.21 38.56 38.56 1
6/26/2018 36.54 37.89 35.715 36.48 36.48 1
6/25/2018 34.24 39.745 34.24 38.11 38.11 1
6/22/2018 33.04 33.57 32.72 33.06 33.06 1
6/21/2018 32.26 34.84 32.21 34.15 34.15 1
6/20/2018 32.13 32.21 31.655 32.02 32.02 0.5
6/19/2018 33.33 33.92 32.43 32.79 32.79 1
6/18/2018 32.55 33.02 31.19 31.24 31.24 1
6/15/2018 31.94 32.52 31.52 31.67 31.67 1
6/14/2018 31.5 31.83 30.91 31.33 31.33 1
6/13/2018 31.58 32.45 31.44 32.39 32.39 1
6/12/2018 31.86 32.41 31.66 31.97 31.97 1
6/11/2018 32.67 32.77 31.91 32.09 32.09 1
6/8/2018 33.46 33.56 32.41 32.6 32.6 1
I'll try to clarify my question:
On 6/20/18, the split coefficient is .50. What I want to do is multiply the adjusted_close values from 6/8/18 to 6/19/18 by that split_coefficient of .5. The split_coefficient then changes to .35 on 6/28/18, where I want to multiply the adjusted_close from 6/21/18 to 6/27/18 by .35. Since the split_coefficient changes periodically, I thought a loop or series of loops would accomplish this.
Based on what I wrote above, I am looking for the following output with a new column named New.Adj.close, which contains the values calculated by multiplying the adjusted_close values for 6/8/18 - 6/19/18 by the split_coefficient from 6/20/18:
timestamp open high low close adjusted_close dividend_amount split_coefficient New.Adj.close
6/19/2018 33.33 33.92 32.43 32.79 32.79 0 1 16.395
6/18/2018 32.55 33.02 31.19 31.24 31.24 0 1 15.62
6/15/2018 31.94 32.52 31.52 31.67 31.67 0 1 15.835
6/14/2018 31.5 31.83 30.91 31.33 31.33 0 1 15.665
6/13/2018 31.58 32.45 31.44 32.39 32.39 0 1 16.195
6/12/2018 31.86 32.41 31.66 31.97 31.97 0 1 15.985
6/11/2018 32.67 32.77 31.91 32.09 32.09 0 1 16.045
6/8/2018 33.46 33.56 32.41 32.6 32.6 0 1 16.3
Okay, this uses the tidyverse, but you can recode it in base R or whatever; the important thing is the logic.
As mentioned, you do not normally want to use loops for a task like this, and in this case you would need a do-while loop. Instead, take advantage of vectorization.
measure_date <- seq(as.Date("2000/1/1"), by = "day", length.out = 20)
pattern <- c(.5, 1,1,1,1)
split_coefficient <- c(pattern, pattern, pattern, pattern)
value_to_multiply <- c(1:20)
df <- data.frame(measure_date, value_to_multiply, split_coefficient)
# doing this because OP's data is reversed
df <- dplyr::arrange(df, measure_date)
# Change the 1s to NAs.
df$newsplit <- ifelse(df$split_coefficient == 1, NA, df$split_coefficient)
df <- tidyr::fill(df , newsplit)
df$multiplied <- df$value_to_multiply*df$newsplit
df
Results
measure_date value_to_multiply split_coefficient newsplit multiplied
1 2000-01-01 1 0.5 0.5 0.5
2 2000-01-02 2 1.0 0.5 1.0
3 2000-01-03 3 1.0 0.5 1.5
4 2000-01-04 4 1.0 0.5 2.0
5 2000-01-05 5 1.0 0.5 2.5
6 2000-01-06 6 0.5 0.5 3.0
7 2000-01-07 7 1.0 0.5 3.5
8 2000-01-08 8 1.0 0.5 4.0
9 2000-01-09 9 1.0 0.5 4.5
10 2000-01-10 10 1.0 0.5 5.0
11 2000-01-11 11 0.5 0.5 5.5
12 2000-01-12 12 1.0 0.5 6.0
13 2000-01-13 13 1.0 0.5 6.5
14 2000-01-14 14 1.0 0.5 7.0
15 2000-01-15 15 1.0 0.5 7.5
16 2000-01-16 16 0.5 0.5 8.0
17 2000-01-17 17 1.0 0.5 8.5
18 2000-01-18 18 1.0 0.5 9.0
19 2000-01-19 19 1.0 0.5 9.5
20 2000-01-20 20 1.0 0.5 10.0
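Mapping the same newsplit/fill() idea onto the posted data is sketched below; it assumes the data frame is called y (as in the question's which(y[,6] != 1)) and that each non-1 coefficient should apply to every row strictly before it, back to the previous split:
library(dplyr)
library(tidyr)

y_adj <- y %>%
  arrange(as.Date(timestamp, format = "%m/%d/%Y")) %>%          # oldest first
  mutate(flag = ifelse(split_coefficient == 1, NA, split_coefficient)) %>%
  mutate(flag = lead(flag)) %>%                                  # a split applies only to earlier rows
  fill(flag, .direction = "up") %>%
  mutate(New.Adj.close = adjusted_close * flag)
# e.g. 6/19/2018: 32.79 * 0.5 = 16.395, matching the expected output above;
# rows after the most recent split (7/10/2018 onwards) have no later coefficient and stay NA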
To clarify, do you just want to multiply adjusted_close by split_coefficient for the observations where split_coefficient equals 1? If so,
library(dplyr)
y %>% filter(split_coefficient == 1) %>% mutate(new_col = split_coefficient * adjusted_close)
Apologies if I misunderstood the question.
As highlighted in the comments, using loops in R is usually avoided and better alternatives are available. For example you can use ifelse:
df <- data.frame(
  adjusted_close = sample(1:5, 10, TRUE),
  split_coefficient = sample(1:2, 10, TRUE)
)
# adjusted_close split_coefficient
# 1 5 1
# 2 2 2
# 3 3 2
# 4 2 2
# 5 4 2
# 6 5 2
# 7 1 1
# 8 2 1
# 9 2 2
# 10 2 1
df$m <- ifelse(df$split_coefficient == 1,
               df$adjusted_close,
               df$adjusted_close * df$split_coefficient)
# df
# adjusted_close split_coefficient m
# 1 5 1 5
# 2 2 2 4
# 3 3 2 6
# 4 2 2 4
# 5 4 2 8
# 6 5 2 10
# 7 1 1 1
# 8 2 1 2
# 9 2 2 4
# 10 2 1 2
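Since multiplying by a coefficient of 1 leaves the value unchanged anyway, the ifelse() is arguably redundant and the whole thing collapses to a single vectorized product (sketch):
df$m <- df$adjusted_close * df$split_coefficient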

Partially transpose a dataframe in R

Given the following set of data:
transect <- c("B","N","C","D","H","J","E","L","I","I")
sampler <- c(rep("J",5),rep("W",5))
species <- c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW")
weight <- c(2.80,52.00,56.00,2.80,16.00,55.00,16.20,18.30,52.50,57.00)
wingspan <- c(13.9, 52.0, 57.0, 13.7, 11.0,52.5, 10.7, 11.1, 52.3, 55.1)
week <- c(1,2,3,4,5,6,7,8,9,9)
# Warning to R newbs: really bad idea to build a data.frame this way; cbind() coerces every column to character first
ex <- as.data.frame(cbind(transect,sampler,species,weight,wingspan,week))
What I’m trying to achieve is to transpose the species and its associated information on weight and wingspan. For a better idea of the expected result please see below. My data set is about half a million lines long with approximately 200 different species so it will be a very large dataframe.
transect sampler week ROBweight HAWweight PIGweight ROBwingspan HAWwingspan PIGwingspan
1 B J 1 2.8 0.0 0.0 13.9 0.0 0.0
2 N J 2 0.0 52.0 0.0 0.0 52.0 0.0
3 C J 3 0.0 56.0 0.0 0.0 57.0 0.0
4 D J 4 2.8 0.0 0.0 13.7 0.0 0.0
5 H J 5 0.0 0.0 16.0 0.0 0.0 11.0
6 J W 6 0.0 55.0 0.0 0.0 52.5 0.0
7 E W 7 0.0 0.0 16.2 0.0 0.0 10.7
8 L W 8 0.0 0.0 18.3 0.0 0.0 11.1
9 I W 9 0.0 52.5 0.0 0.0 52.3 0.0
10 I W 9 0.0 57.0 0.0 0.0 55.1 0.0
The main problem is that you don't currently have unique "id" variables, which will create problems for the usual suspects of reshape and dcast.
Here's a solution. I've used getanID from my "splitstackshape" package, but it's pretty easy to create your own unique ID variable using many different methods.
library(splitstackshape)
library(reshape2)
idvars <- c("transect", "sampler", "week")
ex <- getanID(ex, id.vars=idvars)
From here, you have two options:
reshape from base R:
reshape(ex, direction = "wide",
        idvar = c("transect", "sampler", "week", ".id"),
        timevar = "species")
melt and dcast from "reshape2"
First, melt your data into a "long" form.
exL <- melt(ex, id.vars=c(idvars, ".id", "species"))
Then, cast your data into a wide form.
dcast(exL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1 B J 1 1 NA NA NA NA 2.8 13.9
# 2 C J 3 1 56.0 57.0 NA NA NA NA
# 3 D J 4 1 NA NA NA NA 2.8 13.7
# 4 E W 7 1 NA NA 16.2 10.7 NA NA
# 5 H J 5 1 NA NA 16.0 11.0 NA NA
# 6 I W 9 1 52.5 52.3 NA NA NA NA
# 7 I W 9 2 57.0 55.1 NA NA NA NA
# 8 J W 6 1 55.0 52.5 NA NA NA NA
# 9 L W 8 1 NA NA 18.3 11.1 NA NA
# 10 N J 2 1 52.0 52.0 NA NA NA NA
A better option: "data.table"
Alternatively (and perhaps preferably), you can use the "data.table" package (at least version 1.8.11) as follows:
library(data.table)
library(reshape2) ## Also required here
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(ex)
DT[, .id := sequence(.N), by = c("transect", "sampler", "week")]
DTL <- melt(DT, measure.vars=c("weight", "wingspan"))
dcast.data.table(DTL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1: B J 1 1 NA NA NA NA 2.8 13.9
# 2: C J 3 1 56.0 57.0 NA NA NA NA
# 3: D J 4 1 NA NA NA NA 2.8 13.7
# 4: E W 7 1 NA NA 16.2 10.7 NA NA
# 5: H J 5 1 NA NA 16.0 11.0 NA NA
# 6: I W 9 1 52.5 52.3 NA NA NA NA
# 7: I W 9 2 57.0 55.1 NA NA NA NA
# 8: J W 6 1 55.0 52.5 NA NA NA NA
# 9: L W 8 1 NA NA 18.3 11.1 NA NA
# 10: N J 2 1 52.0 52.0 NA NA NA NA
Add fill = 0 to either of the dcast versions to replace NA values with 0.
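A more recent alternative (a sketch, not part of the original answers) is tidyr::pivot_wider(), generating the per-group row id with dplyr's row_number(); note that the cbind() in the question coerced the measurement columns to character, so they are converted back first:
library(dplyr)
library(tidyr)

ex %>%
  mutate(across(c(weight, wingspan), ~ as.numeric(as.character(.x)))) %>%   # undo cbind()'s coercion
  group_by(transect, sampler, week) %>%
  mutate(.id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from  = species,
              values_from = c(weight, wingspan),
              values_fill = 0)
# columns come out as weight_ROB, wingspan_ROB, ... rather than ROBweight, ROBwingspan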

R: Overlapping ggplots

I have a quick question about R. I am trying to make a layered histogram from some data I am pulling out of files, but I am having a hard time getting ggplot to work with me. I keep getting the error below, and although I have been looking around for an answer, I haven't found much.
Error: ggplot2 doesn't know how to deal with data of class uneval
Execution halted
Here is a brief look at my program so far.
library("ggplot2")
ex <- '/home/Data/run1.DOC'
ex2 <- '/home/Data/run2.DOC'
...
ex<- read.table(ex,header=TRUE)
ex2<- read.table(ex2,header=TRUE)
...
colnames(ex) <- c(1:18)
colnames(ex2) <- c(1:18)
...
Ex <- c(ex$'14')
Ex2 <- c(ex2$'14')
...
ggplot()+
geom_histogram(data = Ex, fill = "red", alpha = 0.2) +
geom_histogram(data = Ex2, fill = "blue", alpha = 0.2)
And my data in the files looks a bit like this:
head(ex,10)
1 2 3 4 5 6 7 8 9 10 11 12
1 1:28 400 0.42 400 0.42 1 1 2 41.8 0 0.0 0.0
2 1:96 5599 39.99 5599 39.99 34 42 50 100.0 100 100.0 100.0
3 1:53 334 0.63 334 0.63 1 2 2 62.1 0 0.0 0.0
4 1:27 6932 49.51 6932 49.51 48 52 57 100.0 100 100.0 100.0
5 1:36 27562 124.15 27562 124.15 97 123 157 100.0 100 100.0 100.0
6 1:14 2340 16.71 2340 16.71 13 17 21 100.0 100 100.0 95.7
7 1:96 8202 49.71 8202 49.71 23 43 80 100.0 100 100.0 100.0
8 1:34 3950 28.21 3950 28.21 22 33 36 100.0 100 100.0 100.0
9 1:60 5563 24.62 5563 24.62 11 24 41 100.0 100 96.5 75.2
10 1:06 1646 8.11 1646 8.11 7 8 13 100.0 100 87.2 32.0
13 14 15 16 17 18
1 0.0 0.0 0.0 0.0 0.0 0.0
2 93.6 82.9 57.9 24.3 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0
4 100.0 97.1 87.1 57.1 0.0 0.0
5 100.0 100.0 100.0 100.0 88.3 71.2
6 40.0 0.0 0.0 0.0 0.0 0.0
7 81.2 66.7 54.5 47.9 29.1 0.0
8 76.4 55.7 0.0 0.0 0.0 0.0
9 57.5 35.4 26.5 4.4 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.0
But much larger. The values in ex and ex2 are percentages from 0 to 100. The colnames line replaces column headers like %_above_30 (which R doesn't handle nicely) with plain numbers, so each column is simply named by its position.
Does anyone know/see the problem here because I am not really getting it.
Thanks!!
Maybe try combining the two data frames into one and supplying that to a single geom_histogram:
# maybe reshape it something like this (base reshape or the
# reshape package may be a better tool)
dat <- data.frame(rbind(ex, ex2),
                  colvar = factor(c(rep("ex", nrow(ex)), rep("ex2", nrow(ex2)))))
# note: fill must be mapped inside aes(); an x aesthetic (the column being plotted) is still needed
ggplot(data = dat, aes(fill = colvar)) +
  geom_histogram(position = "identity", alpha = 0.2)
This is untested as your code isn't reproducible (please see this link on how to make a reproducible example).
Here's the idea I'm talking about with a reproducible example:
library(ggplot2)
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart <- read.table(path, sep=",",head=T,row.names=1)
fmla <- "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model <- glm(fmla, data=saheart, family=binomial(link="logit"),
             na.action=na.exclude)
dframe <- data.frame(chd=as.factor(saheart$chd),
                     prediction=predict(model, type="response"))
ggplot(dframe, aes(x=prediction, fill=chd)) +
  geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
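Adapting that pattern to the two files from the question might look like this (a sketch; column "14" is assumed to hold the percentages of interest, and binwidth = 5 is arbitrary):
dat <- data.frame(value  = c(ex$`14`, ex2$`14`),
                  source = factor(rep(c("ex", "ex2"), c(nrow(ex), nrow(ex2)))))
ggplot(dat, aes(x = value, fill = source)) +
  geom_histogram(position = "identity", binwidth = 5, alpha = 0.5)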
