ggplot2 plotting alternate rows from 2 columns - r

I have a dataframe with 3 columns: "ID", "on.tank" and "on.mains". The data look like:
ID: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (sequential); on.tank: 25, 0, 10, 0, 43, 0, 5; on.mains: 0, 12, 0, 11, 0, 2, 0
Columns 2 and 3 alternate between zero and a value: whenever one is zero, the other has a value.
I want to create one column that interleaves the values, and a second column that is a factor alternating on.tank, on.mains, on.tank, and so on, since the data represent days on tank, then days on mains, then days on tank, etc.
I tried using melt, but it doesn't interleave; it stacks the data, so I get on.tank, on.tank, on.tank, ... for 2000 rows and then on.mains, on.mains, ...
> dput(head(data))
structure(list(ID = 1:6, on.tank = c(0, 56, 0, 1, 0, 97), on.main = c(-1,
0, -9, 0, -18, 0)), .Names = c("ID", "on.tank", "on.main"), row.names = c(NA,
6L), class = "data.frame")

Here's your data:
df <- data.frame(ID = 1:7,
                 on.tank  = c(25, 0, 10, 0, 43, 0, 5),
                 on.mains = c(0, 12, 0, 11, 0, 2, 0))
Using base R:
df$On.which <- ifelse(df$on.tank > df$on.mains, "on.tank", "on.mains")
This will work unless any of your values are negative. If you have negative values use:
df$On.which <- ifelse(df$on.mains==0, "on.tank", "on.mains")
Does this do what you need? If you replace the quoted names with the columns themselves, you can also use this method to merge the values into one column.
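To make the "remove the quotes" hint concrete, here is a minimal sketch using the answer's sample data; the column name "days" is illustrative, not from the original post:

```r
# Sample data from the answer above
df <- data.frame(ID = 1:7,
                 on.tank  = c(25, 0, 10, 0, 43, 0, 5),
                 on.mains = c(0, 12, 0, 11, 0, 2, 0))

# Factor column: which source each row's value belongs to
df$On.which <- ifelse(df$on.mains == 0, "on.tank", "on.mains")

# Value column: the two columns merged into one (unquoted columns
# instead of the quoted labels)
df$days <- ifelse(df$on.mains == 0, df$on.tank, df$on.mains)

df$days
# 25 12 10 11 43  2  5
```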

Related

R, use newer data to update list

This question is very similar to one I asked previously, which has an answer; however, I've since realized the problem I'm trying to solve has evolved, so I figured I should start fresh.
I have two data frames like so:
df1<-structure(list(protocol_no = c("study1", "study2", "study3",
"study4", "study5", "study6", "study7"), status = c("New", "Open",
"Closed", "New", "PI signoff", "Closed", "Open")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(record_id = c(11, 12, 13, 14, 15, 16), protocol_no = c("study1",
"study2", "study3", "study4", "study5", "study6"), status = c("New",
"Closed", "Closed", "New", "PI signoff", "Closed"), form_1_complete = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
They reference largely the same data, but df1 will always be newer and have more rows, whereas df2 is older and has more columns. In real life they will have 20,000+ rows.
I need to update df2 with the new information from df1. This might mean adding new rows that need to be numbered (the record_id column), and it might mean updating the "status" column if it has changed.
For instance, in this example the row for study7 is new and needs to be added with record_id = 17 (because 16 is where that list left off). Additionally, the status of study2 changed from Closed to Open (it's 'Open' in df1), so that needs to be updated.
Things that wouldn't work:
The previous solution used binding rows plus distinct(), but in this scenario, since study2 has changed and needs to be updated, that approach would bind two copies of study2 with no reliable way to decide which one to drop.
Output I'm looking for:
A dataframe with all 4 columns, a record_id for every row, one row per protocol ('protocol_no'), and any statuses that have changed updated to reflect df1. Like so:
Here, an update join would be enough:
library(data.table)
setDT(df2)[as.data.table(df1), status := i.status, on = .(protocol_no)]
Or use rows_upsert and reuse the code from the other post:
library(dplyr)
library(tidyr)
rows_upsert(df2, df1) %>%
  fill(record_id) %>%
  mutate(record_id = record_id + (rowid(record_id) - 1))
Output:
record_id protocol_no status form_1_complete
1 11 study1 New 0
2 12 study2 Open 0
3 13 study3 Closed 0
4 14 study4 New 0
5 15 study5 PI signoff 0
6 16 study6 Closed 0
7 17 study7 Open NA

Continuous multiplication same column previous value

I have a problem.
I have the following data frame.
1         2
NA        100
1.00499   NA
1.00813   NA
0.99203   NA
Two columns. In the second column, apart from the starting value, there are only NAs. I want to fill the first NA of the 2nd column by multiplying the 1st value of column 2 by the 2nd value of column 1 (100 * 1.00499). The 3rd value of column 2 should then be the product of the newly created 2nd value of column 2 and the 3rd value of column 1, and so on, so that at the end all the NAs are replaced by values.
These two sources have helped me understand how to refer to different rows, but in both cases a new column is created. I don't want that; I want to fill the already existing column 2.
Use a value from the previous row in an R data.table calculation
https://statisticsglobe.com/use-previous-row-of-data-table-in-r
Can anyone help me?
Thanks so much in advance.
Sample code
library(quantmod)
data.N225 <- getSymbols("^N225", from = "1965-01-01", to = "2022-03-30",
                        auto.assign = FALSE, src = 'yahoo')
data.N225[c(1:3, nrow(data.N225)), ]
data.N225 <- na.omit(data.N225)
N225 <- data.N225[, 6]
N225$DiskreteRendite <- Delt(N225$N225.Adjusted)
N225[c(1:3, nrow(N225)), ]
options(digits = 5)
N225.diskret <- N225[, 3]
N225.diskret[c(1:3, nrow(N225.diskret)), ]
N225$diskretplus1 <- N225$DiskreteRendite + 1
N225[c(1:3, nrow(N225)), ]
library(dplyr)
N225$normiert <- "Value"
N225$normiert[1, ] <- 100
N225[c(1:3, nrow(N225)), ]
N225.new <- N225[, 4:5]
N225.new[c(1:3, nrow(N225.new)), ]
Here is the code to create the data frame in R studio.
a <- c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947)
b <- c(100, NA, NA, NA, NA, NA)
c <- data.frame(ONE = a, TWO = b)
You could use cumprod for the cumulative product:
transform(
  df,
  TWO = cumprod(c(na.omit(TWO), na.omit(ONE)))
)
which yields
ONE TWO
1 NA 100.0000
2 1.0050 100.5000
3 1.0081 101.3140
4 1.0095 102.2765
5 1.0016 102.4402
6 0.9947 101.8972
data
> dput(df)
structure(list(ONE = c(NA, 1.005, 1.0081, 1.0095, 1.0016, 0.9947
), TWO = c(100, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-6L))
What about (gasp) a for loop?
(I'll use dat instead of c for your dataframe to avoid confusion with function c()).
for (row in 2:nrow(dat)) {
  if (!is.na(dat$TWO[row - 1])) {
    dat$TWO[row] <- dat$ONE[row] * dat$TWO[row - 1]
  }
}
This means:
For each row from the second to the end, if the TWO in the previous row is not a missing value, calculate the TWO in this row by multiplying ONE in the current row and TWO from the previous row.
Output:
#> ONE TWO
#> 1 NA 100.0000
#> 2 1.0050 100.5000
#> 3 1.0081 101.3140
#> 4 1.0095 102.2765
#> 5 1.0016 102.4402
#> 6 0.9947 101.8972
Created on 2022-04-28 by the reprex package (v2.0.1)
I'd love to read a dplyr solution!
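Since a dplyr solution was wished for, here is a sketch using coalesce() to splice the starting value into the multiplier column before taking the cumulative product; variable names follow the question's data:

```r
library(dplyr)

dat <- data.frame(ONE = c(NA, 1.0050, 1.0081, 1.0095, 1.0016, 0.9947),
                  TWO = c(100, NA, NA, NA, NA, NA))

# coalesce() keeps 100 in row 1 and the ONE multipliers elsewhere,
# so cumprod() reproduces the running product in a single step
dat <- dat %>%
  mutate(TWO = cumprod(coalesce(ONE, TWO)))

dat$TWO
# 100.0000 100.5000 101.3140 102.2765 102.4402 101.8972
```

This assumes, as in the question, that only row 1 of TWO is non-NA and only row 1 of ONE is NA.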

Add consecutive temp values above threshold to create "degree hours"

I am working with a dataset of hourly temperatures and I need to calculate "degree hours" above a heat threshold for each extreme event. I intend to run stats on the intensities (combined magnitude and duration) of each event to compare multiple sites over the same time period.
Example of data:
Temp
1 14.026
2 13.714
3 13.25
.....
21189 12.437
21190 12.558
21191 12.703
21192 12.896
Data after selecting only hours above the threshold of 18 degrees and then subtracting 18 to reveal degrees above 18:
Temp
5297 0.010
5468 0.010
5469 0.343
5470 0.081
5866 0.010
5868 0.319
5869 0.652
After this step, I need help summing the consecutive hours during which the reading exceeded my specified threshold.
What I am hoping to produce out of above sample:
Temp
1 0.010
2 0.434
3 0.010
4 0.971
I've debated manipulating these data within a time series or by adding additional columns, but I do not want multiple rows for each warming event. I would immensely appreciate any advice.
This is an alternative solution in base R.
You have some data that walks around, and you want to sum up the points above a cutoff. For example:
set.seed(99999)
x <- cumsum(rnorm(30))
plot(x, type='b')
abline(h=2, lty='dashed')
which looks like this:
First, we want to split the data into groups based on when they cross the cutoff. We can use run-length encoding on the indicator to get a compressed version:
x.rle <- rle(x > 2)
which has the value:
Run Length Encoding
lengths: int [1:8] 5 2 3 1 9 4 5 1
values : logi [1:8] FALSE TRUE FALSE TRUE FALSE TRUE ...
The first group is the first 5 points where x > 2 is FALSE; the second group is the two following points, and so on.
We can create a group id by replacing the values in the rle object, and then back transforming:
x.rle$values <- seq_along(x.rle$values)
group <- inverse.rle(x.rle)
Finally, we aggregate by group, keeping only the data above the cut off:
aggregate(x~group, subset = x > 2, FUN=sum)
Which produces:
group x
1 2 5.113291213
2 4 2.124118005
3 6 11.775435706
4 8 2.175868979
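For completeness, the steps above can be combined into one runnable chunk; wrapping x and group in a data frame gives aggregate() an explicit data argument instead of relying on variables being found in the calling environment:

```r
set.seed(99999)
x <- cumsum(rnorm(30))          # a random walk

x.rle <- rle(x > 2)             # runs of below/above the cutoff
x.rle$values <- seq_along(x.rle$values)   # one id per run
group <- inverse.rle(x.rle)     # back to one id per point

# sum the points above the cutoff, per run
aggregate(x ~ group, data = data.frame(x, group), subset = x > 2, FUN = sum)
```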
I'd use data.table for this, although there are certainly other ways.
library( data.table )
setDT( df )
temp.threshold <- 18
First make a column showing the previous value from each one in your data. This will help to find the point at which the temperature rose above your threshold value.
df[ , lag := shift( Temp, fill = 0, type = "lag" ) ]
Now use that previous value column to compare with the Temp column. Mark every point at which the temperature rose above the threshold with a 1, and all other points as 0.
df[ , group := 0L
][ Temp > temp.threshold & lag <= temp.threshold, group := 1L ]
Now we can get cumsum of that new column, which will give each sequence after the temperature rose above the threshold its own group ID.
df[ , group := cumsum( group ) ]
Now we can get rid of every value not above the threshold.
df <- df[ Temp > temp.threshold, ]
And summarise what's left by finding the "degree hours" of each "group".
bygroup <- df[ , sum( Temp - temp.threshold ), by = group ]
I modified your input data a little to provide a couple of test events where the data rose above the threshold:
structure(list(num = c(1L, 2L, 3L, 4L, 5L, 21189L, 21190L, 21191L,
21192L, 21193L, 21194L), Temp = c(14.026, 13.714, 13.25, 20,
19, 12.437, 12.558, 12.703, 12.896, 21, 21)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("num",
"Temp"), spec = structure(list(cols = structure(list(num = structure(list(), class = c("collector_integer",
"collector")), Temp = structure(list(), class = c("collector_double",
"collector"))), .Names = c("num", "Temp")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
With that data, here's the output of the code above (note $V1 is in "degree hours"):
> bygroup
group V1
1: 1 3
2: 2 6

How to randomly select two sets of rows from each group in R [duplicate]

This question already has answers here:
Stratified random sampling from data frame
(6 answers)
Closed 6 years ago.
I have a dataframe called test.data with a column called Ethnicity. There are three ethnic groups (more in the actual data): Adygei, Balochi and Biaka_Pygmies. I want to subset this data frame to two randomly chosen samples (rows) from each ethnic group, producing the result shown below. How can I do this in R?
test.data <- structure(list(Sample = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), Ethnicity = c("Adygei", "Adygei", "Adygei",
"Adygei", "Balochi", "Balochi", "Balochi", "Balochi", "Balochi",
"Biaka_Pygmies", "Biaka_Pygmies", "Biaka_Pygmies"), Height = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Sample", "Ethnicity",
"Height"), row.names = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), class = "data.frame")
result
Sample Ethnicity Height
1793102418_A 1793102418_A Adygei 0
1793102460_A 1793102460_A Adygei 0
1749751189_A 1749751189_A Balochi 0
1749751285_A 1749751285_A Balochi 0
1749751195_A 1749751195_A Biaka_Pygmies 0
1775705355_A 1775705355_A Biaka_Pygmies 0
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(test.data)); then, grouped by 'Ethnicity', sample the row indices and subset the rows with them.
setDT(test.data)[, .SD[sample(1:.N,2)], Ethnicity]
Or using tapply from base R:
test.data[with(test.data, unlist(tapply(seq_len(nrow(test.data)),
                                        Ethnicity, FUN = sample, 2))), ]
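As another option, dplyr's slice_sample() (available since dplyr 1.0.0) does the grouped sampling directly; the toy data below are illustrative, not the original test.data:

```r
library(dplyr)

# Illustrative data with the same shape as the question's test.data
test.data <- data.frame(
  Sample    = paste0("sample_", 1:12),
  Ethnicity = rep(c("Adygei", "Balochi", "Biaka_Pygmies"), times = c(4, 5, 3)),
  Height    = 0
)

result <- test.data %>%
  group_by(Ethnicity) %>%
  slice_sample(n = 2) %>%   # 2 random rows per group
  ungroup()
```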

Conditionally separate daily data based on monthly mean values

I am new to R. I have daily data and want to separate out the months whose mean is less than 1 from the rest of the data, then do something with the daily data of the remaining months (mean greater than 1). The important thing is not to touch the daily values of months whose mean is less than 1.
I have used aggregate(file, as.yearmon, mean) to get monthly means, but I'm failing to grasp how to use them to exclude a given month's daily values from the analysis. Any suggestion on where to start would be highly appreciated.
I have reproduced data using a small subset of it and dput:
structure(list(V1 = c(0, 0, 0, 0.43, 0.24, 0, 1.06, 0, 0, 0, 1.57, 1.26, 1.34, 0, 0, 0, 2.09, 0, 0, 0.24)), .Names = "V1", row.names = c(NA, 20L), class = "data.frame")
A snippet of code I am using:
library(zoo)
file <- read.table("text.txt")
x_daily <- zooreg(file, start=as.Date("2000-01-01"))
x1_daily <- x_daily[]
con_daily <- subset(x1_daily, aggregate(x1_daily,as.yearmon,mean) > 1 )
Let's create some sample data:
feb2012 <- data.frame(year=2012, month=2, day=1:28, data=rnorm(28))
feb2013 <- data.frame(year=2013, month=2, day=1:28, data=rnorm(28) + 10)
jul2012 <- data.frame(year=2012, month=7, day=1:31, data=rnorm(31) + 10)
jul2013 <- data.frame(year=2013, month=7, day=1:31, data=rnorm(31) + 10)
d <- rbind(feb2012, feb2013, jul2012, jul2013)
You can get an aggregate of the data column by month like this:
> a <- aggregate(d$data, list(year=d$year, month=d$month), mean)
> a
year month x
1 2012 2 0.09704817
2 2013 2 9.93354271
3 2012 7 10.19073868
4 2013 7 9.78324133
Perhaps not the best way, but an easy way to filter the d data frame by the mean of the corresponding year and month is to work with a temporary data frame that merges d and a, like this:
work <- merge(d, a)
subset(work, x > 1)
I hope this will help you get started!
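The same filtering can also be sketched with dplyr: a grouped filter() on the mean keeps exactly the daily rows of months whose mean exceeds 1. This rebuilds the sample data from the answer (with a seed added so the result is reproducible):

```r
library(dplyr)

set.seed(1)
feb2012 <- data.frame(year = 2012, month = 2, day = 1:28, data = rnorm(28))
feb2013 <- data.frame(year = 2013, month = 2, day = 1:28, data = rnorm(28) + 10)
jul2012 <- data.frame(year = 2012, month = 7, day = 1:31, data = rnorm(31) + 10)
jul2013 <- data.frame(year = 2013, month = 7, day = 1:31, data = rnorm(31) + 10)
d <- rbind(feb2012, feb2013, jul2012, jul2013)

# keep daily rows only for year/month groups whose mean exceeds 1;
# Feb 2012 (mean near 0) is dropped, the other months are kept
kept <- d %>%
  group_by(year, month) %>%
  filter(mean(data) > 1) %>%
  ungroup()
```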
