I have a data frame `df3`:
df3
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
After applying the split function, `df5 = split(df3, f = df3$d)`, I get:
> df5 = split(df3,f=df3$d)
> df5
$`Sep 2020`
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
$`Oct 2020`
x d y
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
$`Nov 2020`
x d y
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
I would like to loop dynamically through the split data frame.
I need to find out whether any values of `x` present in Nov 2020 are also present in Oct 2020.
If a name is present in both, I then have to check the previous month, Sep 2020, and so on, and count the number of consecutive months in which the name occurs. Here `df3$d` is in `as.yearmon` format. If any names in `df5[["Nov 2020"]]$x` are also present in `df5[["Oct 2020"]]$x`, I want to extract them and store them in an object along with their count. Here the count is 2, since the names are present in Nov 2020 and Oct 2020. Earlier months should be checked only for names that are present in the most recent month. For this example, the output should be:
> df4
names_present present_for
1 bpa 2
2 db 2
Thank you in advance
I have a data frame like this:
x d y
bbc Sep 2020 123
rsb Sep 2020 234
atc Sep 2020 345
svc Sep 2020 543
mwe Sep 2020 567
bpa Oct 2020 322
mwe Oct 2020 456
uhs Oct 2020 786
se Oct 2020 543
db Oct 2020 778
rsb Nov 2020 358
svc Nov 2020 678
db Nov 2020 321
rb Nov 2020 689
bpa Nov 2020 765
The column `d` is in `as.yearmon` format.
I want to find the values in column `x` that are repeated across consecutive month-year values in column `d`.
In this data frame, bpa and db are present in both Nov 2020 and Oct 2020, so the output needs to look like this:
names_present present_for
bpa 2
db 2
A value of `x` should appear in the output only if it is present in consecutive months. Here svc is present in Nov 2020 and Sep 2020, but it cannot be part of the output because it is absent in Oct 2020.
I tried splitting the data frame on the `df$d` column, but I couldn't loop through the split data frames and get the required output. The split data frame looks like this:
$`Nov 2020`
x d y
11 rsb Nov 2020 358
12 svc Nov 2020 678
13 db Nov 2020 321
14 rb Nov 2020 689
15 bpa Nov 2020 765
$`Oct 2020`
x d y
6 bpa Oct 2020 322
7 mwe Oct 2020 456
8 uhs Oct 2020 786
9 se Oct 2020 543
10 db Oct 2020 778
$`Sep 2020`
x d y
1 bbc Sep 2020 123
2 rsb Sep 2020 234
3 atc Sep 2020 345
4 svc Sep 2020 543
5 mwe Sep 2020 567
The code needs to check the most recent month-year first, then the previous one, and so on: look at the `x` names present in Nov 2020, then check Oct 2020, then Sep 2020, and count the occurrences. If a name is present in Nov 2020 and Sep 2020 but not in Oct 2020, it cannot be part of the output data frame. There can be any number of month-year values; the data frame given here is just a small sample, so I want to find this information dynamically.
I've been struggling with this for a long time. Would be great if anyone could help me solve this. Thank you in advance.
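The check described above can be sketched in base R as follows. This is only a minimal sketch under some assumptions: `d` is stored as first-of-month `Date` values (with zoo, `as.Date(df3$d)` would convert a `yearmon` column), "consecutive" means consecutive among the months that actually occur in the data, and the helper name `streak` is made up for illustration.

```r
# Sample data from the question; d is stored as first-of-month Dates here
# (an assumption made so the example needs no extra packages).
df3 <- data.frame(
  x = c("bbc", "rsb", "atc", "svc", "mwe",
        "bpa", "mwe", "uhs", "se", "db",
        "rsb", "svc", "db", "rb", "bpa"),
  d = as.Date(rep(c("2020-09-01", "2020-10-01", "2020-11-01"), each = 5)),
  y = c(123, 234, 345, 543, 567, 322, 456, 786, 543, 778,
        358, 678, 321, 689, 765),
  stringsAsFactors = FALSE
)

# Distinct months, most recent first, and the names seen in each of them
months <- sort(unique(df3$d), decreasing = TRUE)
names_by_month <- lapply(months, function(m) unique(df3$x[df3$d == m]))

# Hypothetical helper: number of months in a row, starting at the most
# recent one, in which the name nm occurs
streak <- function(nm) {
  present <- vapply(names_by_month, function(v) nm %in% v, logical(1))
  if (all(present)) length(present) else which(!present)[1] - 1L
}

# Only names from the most recent month are candidates
latest <- names_by_month[[1]]
counts <- vapply(latest, streak, integer(1))

# Keep names present in at least two consecutive months
keep <- counts >= 2
df4 <- data.frame(names_present = latest[keep],
                  present_for = unname(counts[keep]),
                  stringsAsFactors = FALSE)
df4 <- df4[order(df4$names_present), ]
rownames(df4) <- NULL
df4
```

On the sample data this gives bpa and db, each with a count of 2. If calendar gaps between months also need to break a streak, the month sequence would have to be checked explicitly, which this sketch does not do.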
I have a data frame called `data` with the columns date, month, discharge, and station, and another data frame called `perc` with the columns month, W1_Percentile, and B1_Percentile. W1_Percentile and B1_Percentile hold the monthly percentile values for each of the two gauging stations. I want my final output to have the same columns as `data`, plus an additional "Percentile" column containing the percentile value for the respective month and gauging station, taken from `perc`. What steps should I follow?
Here is the sample of input data:
date <- as.Date(c('1950-03-12','1954-03-23','1991-06-27','1997-09-04','1991-06-27','1987-05-06','1987-05-29','1856-07-08','1993-06-04', '2001-09-19','2001-05-06','2001-05-27'))
month <- c('Mar','Mar','Jun','Sep','Jun','May','May','Jul','Jun','Sep','May','May')
disch <- c(125,1535,1654,154,4654,453,1654,145,423,433,438,6426)
station <- c('W1','W1','W1','W1','W1','W1','B1','B1','B1','B1','B1','B1')
data <- data.frame("Date"= date, "Month" = month,"Discharge"=disch,"station"=station)
Date Month Discharge station
1 1950-03-12 Mar 125 W1
2 1954-03-23 Mar 1535 W1
3 1991-06-27 Jun 1654 W1
4 1997-09-04 Sep 154 W1
5 1991-06-27 Jun 4654 W1
6 1987-05-06 May 453 W1
7 1987-05-29 May 1654 B1
8 1856-07-08 Jul 145 B1
9 1993-06-04 Jun 423 B1
10 2001-09-19 Sep 433 B1
11 2001-05-06 May 438 B1
12 2001-05-27 May 6426 B1
Month <- c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
W1 <- c(106,313,531.40,164.10,40,23.39,18.30,24,16,16,12,34)
B1 <- c(1330,1550,1948,1880,1260,853.15,680.15,486.10,503,625,738,1070)
perc <- data.frame("Month"=Month,"W1_Percentile"=W1,"B1_Percentile"=B1)
Month W1_Percentile B1_Percentile
1 Jan 106.00 1330.00
2 Feb 313.00 1550.00
3 Mar 531.40 1948.00
4 Apr 164.10 1880.00
5 May 40.00 1260.00
6 Jun 23.39 853.15
7 Jul 18.30 680.15
8 Aug 24.00 486.10
9 Sep 16.00 503.00
10 Oct 16.00 625.00
11 Nov 12.00 738.00
12 Dec 34.00 1070.00
This is how I want the final output to look:
Date Month Discharge station Percentile
1 1950-03-12 Mar 125 W1 531.40
2 1954-03-23 Mar 1535 W1 531.40
3 1991-06-27 Jun 1654 W1 23.39
4 1997-09-04 Sep 154 W1 16.00
5 1991-06-27 Jun 4654 W1 23.39
6 1987-05-06 May 453 W1 40.00
7 1987-05-29 May 1654 B1 1260.00
8 1856-07-08 Jul 145 B1 680.15
9 1993-06-04 Jun 423 B1 853.15
10 2001-09-19 Sep 433 B1 503.00
11 2001-05-06 May 438 B1 1260.00
12 2001-05-27 May 6426 B1 1260.00
We need to first convert your perc data into a long format so that we have the columns we want to add to data, then it's a simple join:
library(tidyr)
library(dplyr)
# make the column names the same as the values in data
names(perc)[2:3] = c("W1", "B1")
# convert to long format
perc_long = gather(perc, key = "station", value = "percentile", W1, B1)
# join
left_join(data, perc_long)
# Joining, by = c("Month", "station")
# Date Month Discharge station percentile
# 1 1950-03-12 Mar 125 W1 531.40
# 2 1954-03-23 Mar 1535 W1 531.40
# 3 1991-06-27 Jun 1654 W1 23.39
# 4 1997-09-04 Sep 154 W1 16.00
# 5 1991-06-27 Jun 4654 W1 23.39
# 6 1987-05-06 May 453 W1 40.00
# 7 1987-05-29 May 1654 B1 1260.00
# 8 1856-07-08 Jul 145 B1 680.15
# 9 1993-06-04 Jun 423 B1 853.15
# 10 2001-09-19 Sep 433 B1 503.00
# 11 2001-05-06 May 438 B1 1260.00
# 12 2001-05-27 May 6426 B1 1260.00
There are many ways to do these operations; it's essentially a combination of two R-FAQs. For additional reference, see:
Reshaping data.frame from wide to long format
How to join (merge) data frames (inner, outer, left, right)
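As one of those other ways, here is a minimal base R sketch that skips the reshape and join and instead looks each row up in a month-by-station matrix. It is an alternative to the tidyr/dplyr code above, not a replacement for it; the name `lookup` is made up for illustration, and the sketch assumes every month and station in `data` has a matching row and column in `perc`.

```r
# Input data as given in the question
data <- data.frame(
  Date = as.Date(c("1950-03-12", "1954-03-23", "1991-06-27", "1997-09-04",
                   "1991-06-27", "1987-05-06", "1987-05-29", "1856-07-08",
                   "1993-06-04", "2001-09-19", "2001-05-06", "2001-05-27")),
  Month = c("Mar", "Mar", "Jun", "Sep", "Jun", "May",
            "May", "Jul", "Jun", "Sep", "May", "May"),
  Discharge = c(125, 1535, 1654, 154, 4654, 453,
                1654, 145, 423, 433, 438, 6426),
  station = rep(c("W1", "B1"), each = 6),
  stringsAsFactors = FALSE
)
perc <- data.frame(
  Month = month.abb,
  W1_Percentile = c(106, 313, 531.40, 164.10, 40, 23.39,
                    18.30, 24, 16, 16, 12, 34),
  B1_Percentile = c(1330, 1550, 1948, 1880, 1260, 853.15,
                    680.15, 486.10, 503, 625, 738, 1070),
  stringsAsFactors = FALSE
)

# Numeric matrix with months as row names and stations as column names
lookup <- as.matrix(perc[, c("W1_Percentile", "B1_Percentile")])
rownames(lookup) <- perc$Month
colnames(lookup) <- sub("_Percentile$", "", colnames(lookup))

# A two-column character matrix indexes the lookup by (row name, column name)
data$Percentile <- lookup[cbind(as.character(data$Month),
                                as.character(data$station))]
data
```

An unmatched month or station raises a "subscript out of bounds" error here rather than producing `NA`, which is a difference from `left_join`.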
I think I might be missing something very simple, but:
How can I fill the last spots of a matrix with NA instead of having it repeat the previous values?
Data example:
x <- 1:27
m <- matrix(x, nrow = 12, ncol = ceiling(length(x)/12), byrow = FALSE)
col_names <- c("2013", "2014", "2015")
row_names <- c("Jan", "Feb", "Mar", "Apr", "May",
"Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
dimnames(m) <- list(row_names, col_names)
m
2013 2014 2015
Jan 1 13 25
Feb 2 14 26
Mar 3 15 27
Apr 4 16 1 # NOT NA?
May 5 17 2
Jun 6 18 3
Jul 7 19 4
Aug 8 20 5
Sep 9 21 6
Oct 10 22 7
Nov 11 23 8
Dec 12 24 9
I would like all values after 2015 March to be filled with NA.
If you assign a shorter vector to a longer vector in R, it recycles the values in the shorter vector. That's what you are observing here. (Note that a matrix is just a vector with dimension attribute.) This behaviour cannot be avoided. So, you should assign NA after creating the matrix:
m[28:length(m)] <- NA
Or, alternatively, you could append the necessary number of NA values to 1:27 when creating the matrix.
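A minimal sketch of that second option, padding the input with `NA` before building the matrix. The variable names `n_rows`, `n_cols`, and `padded` are made up for illustration.

```r
x <- 1:27
n_rows <- 12
n_cols <- ceiling(length(x) / n_rows)

# Pad x with NA so its length matches the number of matrix cells exactly;
# nothing is left for R to recycle
padded <- c(x, rep(NA, n_rows * n_cols - length(x)))

m <- matrix(padded, nrow = n_rows, ncol = n_cols,
            dimnames = list(month.abb, as.character(2013:2015)))
m
```

Because the padded vector fills all 36 cells, the entries from Apr 2015 onwards are `NA` and no recycling warning is raised.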
Create a list of dimnames and from that create a matrix of NAs. Finally, fill it:
x <- 1:27 # input as per question
dnm <- list(month.abb, 2013:2015) # list of dimnames
m <- matrix(NA, nrow = length(dnm[[1]]), ncol = length(dnm[[2]]), dimnames = dnm)
m[seq_along(x)] <- x
Note: You might not want to do this at all and instead create a monthly time series:
library(zoo)
z <- zooreg(x, as.yearmon("2013-01"), freq = 12)
giving:
> z
Jan 2013 Feb 2013 Mar 2013 Apr 2013 May 2013 Jun 2013 Jul 2013 Aug 2013
1 2 3 4 5 6 7 8
Sep 2013 Oct 2013 Nov 2013 Dec 2013 Jan 2014 Feb 2014 Mar 2014 Apr 2014
9 10 11 12 13 14 15 16
May 2014 Jun 2014 Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 Dec 2014
17 18 19 20 21 22 23 24
Jan 2015 Feb 2015 Mar 2015
25 26 27
or a ts series:
> as.ts(z)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 1 2 3 4 5 6 7 8 9 10 11 12
2014 13 14 15 16 17 18 19 20 21 22 23 24
2015 25 26 27
or directly:
ts(x, start = c(2013, 1), freq = 12)
I have this data.frame:
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))
First 20 rows of data:
head(counts, 20)
year month count
1 2000 Jan 14
2 2000 Feb 182
3 2000 Mar 462
4 2000 Apr 395
5 2000 May 107
6 2000 Jun 127
7 2000 Jul 371
8 2000 Aug 158
9 2000 Sep 147
10 2000 Oct 41
11 2000 Nov 141
12 2000 Dec 27
13 2001 Jan 72
14 2001 Feb 7
15 2001 Mar 40
16 2001 Apr 351
17 2001 May 342
18 2001 Jun 81
19 2001 Jul 442
20 2001 Aug 389
Let's say I calculate the standard deviation of these data using the usual R code:
library(plyr)
ddply(counts, .(month), summarise, s.d. = sd(count))
month s.d.
1 Apr 145.3018
2 Aug 140.9949
3 Dec 173.9406
4 Feb 127.5296
5 Jan 148.2661
6 Jul 162.4893
7 Jun 133.4383
8 Mar 125.8425
9 May 168.9517
10 Nov 93.1370
11 Oct 167.9436
12 Sep 166.8740
This gives the standard deviation around the mean for each month. How can I get R to output the standard deviation around the maximum value of each month?
You want the maximum value per month and the average difference from that maximum (which is not the same as the standard deviation):
counts <- data.frame(year = sort(rep(2000:2009, 12)), month = rep(month.abb, 10), count = sample(1:500, 120, replace = TRUE))
library(data.table)
counts=data.table(counts)
counts[,mean(count-max(count)),by=month]
This question is highly vague. If you want to calculate the standard deviation of the differences to the maximum, you can use this code:
> library(plyr)
> ddply(counts, .(month), summarise, sd = sd(count - max(count)))
month sd
1 Apr 182.5071
2 Aug 114.3068
3 Dec 117.1049
4 Feb 184.4638
5 Jan 138.1755
6 Jul 167.0677
7 Jun 100.8841
8 Mar 144.8724
9 May 173.3452
10 Nov 132.0204
11 Oct 127.4645
12 Sep 152.2162
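One thing worth noting about the code above: subtracting a constant from every value does not change `sd()`, so `sd(count - max(count))` equals `sd(count)` on the same data. The two tables presumably differ only because `sample()` was called without a seed and drew different numbers each time. If "standard deviation around the maximum" is instead read as the root-mean-square deviation from the monthly maximum, a minimal base R sketch could look like this. The helper `dev_around_max`, the seed, and the `n - 1` denominator are assumptions for illustration, not something the question specifies.

```r
set.seed(1)  # arbitrary seed so the random sample is reproducible
counts <- data.frame(year = rep(2000:2009, each = 12),
                     month = rep(month.abb, 10),
                     count = sample(1:500, 120, replace = TRUE))

# Shifting by a constant leaves the standard deviation unchanged
sd_plain   <- tapply(counts$count, counts$month, sd)
sd_shifted <- tapply(counts$count, counts$month, function(v) sd(v - max(v)))
all.equal(sd_plain, sd_shifted)

# Hypothetical helper: spread measured around the maximum instead of the mean
dev_around_max <- function(v) sqrt(sum((v - max(v))^2) / (length(v) - 1))
res <- aggregate(count ~ month, data = counts, FUN = dev_around_max)
res
```

Since the sum of squared deviations is smallest around the mean, the values in `res` are never smaller than the ordinary per-month standard deviations.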