Hi, I'm trying to build a simple accumulation model in R. It is very simple to do in Excel, but I need to do it for about 1000 data sets, so I would like to code it in R.
Simply put, the model is for accumulating and melting snow. The result should be in the 'pack' column, which should just be the previous day's pack + snow - melt. Any thoughts on the best way to refer to the previous day's pack? (It should start with 0 snowpack on day 1.)
The second problem is that pack cannot be negative, so on days when it melts but there is no accumulated snow, the pack should stay at 0.
df <- read.csv(file = "ddf_mod.csv", header = TRUE)
> df
day snow melt pack
1 1 0 6 0
2 2 0 2 0
3 3 0 8 0
4 4 0 2 0
5 5 2 0 2
6 6 3 0 5
7 7 4 0 9
8 8 5 0 14
9 9 0 5 9
10 10 0 6 3
11 11 0 3 0
12 12 5 0 5
13 13 8 0 13
14 14 1 0 14
15 15 3 0 17
16 16 0 0 17
The part where it can't be below 0 makes this a bit trickier than normal, but you can accomplish this stepwise calculation with Reduce(). For example
new_melt <- Reduce(function(prev, change) {
max(prev + change$snow - change$melt, 0)
},
split(df[c("snow","melt")], seq.int(nrow(df))),
init=0,
accumulate = TRUE)[-1]
Here we split the snow/melt values into a list of one-row data frames (one per day) using split() and then iterate over them. Each step takes the previous value, adds snow, subtracts melt, and uses max() to make sure the result never goes below 0. (We then remove the initial value with [-1].) You can merge this new value with the original data to see that it gives what you want:
cbind(df, new_melt)
# day snow melt pack new_melt
# 1 1 0 6 0 0
# 2 2 0 2 0 0
# 3 3 0 8 0 0
# 4 4 0 2 0 0
# 5 5 2 0 2 2
# 6 6 3 0 5 5
# 7 7 4 0 9 9
# 8 8 5 0 14 14
# 9 9 0 5 9 9
# 10 10 0 6 3 3
# 11 11 0 3 0 0
# 12 12 5 0 5 5
# 13 13 8 0 13 13
# 14 14 1 0 14 14
# 15 15 3 0 17 17
# 16 16 0 0 17 17
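For comparison, the same clamped running total can be written as an explicit loop, which some readers find easier to follow than Reduce(). This is only a minimal sketch: it types in the example data by hand instead of reading ddf_mod.csv, and the name pack_loop is made up for illustration.

```r
# Example data typed in from the question (the real data come from ddf_mod.csv).
df <- data.frame(
  day  = 1:16,
  snow = c(0, 0, 0, 0, 2, 3, 4, 5, 0, 0, 0, 5, 8, 1, 3, 0),
  melt = c(6, 2, 8, 2, 0, 0, 0, 0, 5, 6, 3, 0, 0, 0, 0, 0)
)

pack_loop <- numeric(nrow(df))  # hypothetical name for the result
prev <- 0                       # the pack before day 1 is 0
for (i in seq_len(nrow(df))) {
  # previous pack + snow - melt, clamped so it never drops below 0
  pack_loop[i] <- max(prev + df$snow[i] - df$melt[i], 0)
  prev <- pack_loop[i]
}
df$pack_loop <- pack_loop
```

To run this over many data sets, one option is to wrap the loop in a function that takes a data frame and call it with lapply() over a list of files.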
I have the following data frame:
df <- data.frame(A_TR1=sample(10:20, 8, replace = TRUE),A_TR2=seq(2, 16, by=2), A_TR3=seq(1, 16, by=2),
B_TR1=seq(1, 16, by=2),B_TR2=seq(2, 16, by=2), B_TR3=seq(1, 16, by=2))
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3
1 11 2 1 1 2 1
2 12 4 3 3 4 3
3 18 6 5 5 6 5
4 11 8 7 7 8 7
5 17 10 9 9 10 9
6 17 12 11 11 12 11
7 14 14 13 13 14 13
8 11 16 15 15 16 15
What I would like to do, is subtract B_TR1 from A_TR1, B_TR2 from A_TR2, and so on and create new columns from these, similar to below:
df$x_TR1 <- (df$A_TR1 - df$B_TR1)
df$x_TR2 <- (df$A_TR2 - df$B_TR2)
df$x_TR3 <- (df$A_TR3 - df$B_TR3)
> df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 12 2 1 1 2 1 11 0 0
2 11 4 3 3 4 3 8 0 0
3 19 6 5 5 6 5 14 0 0
4 13 8 7 7 8 7 6 0 0
5 12 10 9 9 10 9 3 0 0
6 16 12 11 11 12 11 5 0 0
7 16 14 13 13 14 13 3 0 0
8 18 16 15 15 16 15 3 0 0
I would like to name these columns "x TR1", "x TR2", etc. I tried to do the following:
xdf <- df%>%mutate(across(starts_with("A_TR"), -across(starts_with("B_TR")), .names="x TR{.col}"))
However, I get an error in mutate():
attempt to select less than one element in integerOneIndex
I also don't know how to create the proper column names, in terms of getting the numbers right -- I am not even sure the glue() syntax allows for it. Any help appreciated here.
We could use .names in the first across() to replace the substring 'A' with 'x' in the column names (.col), while subtracting the second set of columns:
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("A_TR"),
.names = "{str_replace(.col, 'A', 'x')}") -
across(starts_with("B_TR")))
Output:
df
A_TR1 A_TR2 A_TR3 B_TR1 B_TR2 B_TR3 x_TR1 x_TR2 x_TR3
1 10 2 1 1 2 1 9 0 0
2 10 4 3 3 4 3 7 0 0
3 16 6 5 5 6 5 11 0 0
4 12 8 7 7 8 7 5 0 0
5 20 10 9 9 10 9 11 0 0
6 19 12 11 11 12 11 8 0 0
7 17 14 13 13 14 13 4 0 0
8 14 16 15 15 16 15 -1 0 0
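If you would rather avoid dplyr, the same pairing can be sketched in base R by building the three sets of column names and subtracting the two sub-data-frames. This is a sketch under assumptions: A_TR1 is fixed to arbitrary values here (the question uses sample()), and a_cols, b_cols and x_cols are names invented for the example.

```r
# Deterministic stand-in for the question's data (A_TR1 is random there).
df <- data.frame(
  A_TR1 = c(11, 12, 18, 11, 17, 17, 14, 11),
  A_TR2 = seq(2, 16, by = 2), A_TR3 = seq(1, 16, by = 2),
  B_TR1 = seq(1, 16, by = 2), B_TR2 = seq(2, 16, by = 2), B_TR3 = seq(1, 16, by = 2)
)

a_cols <- grep("^A_TR", names(df), value = TRUE)  # "A_TR1" "A_TR2" "A_TR3"
b_cols <- sub("^A", "B", a_cols)                  # matching B columns
x_cols <- sub("^A", "x", a_cols)                  # names for the new columns

# Data frame subtraction is element-wise, so each A column is paired
# with the B column in the same position.
df[x_cols] <- df[a_cols] - df[b_cols]
```

Deriving b_cols from a_cols (rather than selecting the B columns separately) keeps the pairs aligned even if the columns are stored in a different order.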
I have a vector of consecutive states (you can only go from 3 to 4, from 4 to 5 etc., and there's no way back):
cons_states <- c(3,4,5,6)
At the same time, I have this data:
from to status id
2 3 1 1
2 4 0 2
2 5 0 3
2 6 0 4
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 3 0 10
16 4 0 11
16 5 0 12
16 6 0 13
16 8 1 14
8 3 0 15
8 4 1 16
8 5 0 17
8 6 0 18
I have two assumptions that I would like my data to satisfy:
First, once a state has been visited there is no way back. For example, once state 3 has been visited (to=3 & status=1), no later row should allow a move to state 3 (there should be no more rows with to=3):
from to status id
2 3 1 1
2 4 0 2
2 5 0 3
2 6 0 4
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 4 0 11
16 5 0 12
16 6 0 13
16 8 1 14
8 4 1 16
8 5 0 17
8 6 0 18
I managed to do it with (it's ugly I realize it, but it works):
# Rows where a transition actually happened
ind <- data[which(data$status == 1), ]
res <- NULL
for (j in seq_len(nrow(ind))) {
  ind_to <- ind[j, "to"]
  ind_id <- ind[j, "id"]
  # ids of later rows that would return to the state visited in row j
  id_remove <- data[which(data$to == ind_to & data$id > ind_id), "id"]
  if (length(id_remove) == 0) next
  res <- c(res, id_remove)
}
Which gives me a vector of IDs to remove from my data that fulfills my first assumption.
I would also like to satisfy a second assumption: when moving to a state that belongs to the vector cons_states, we can only go to the next consecutive state that has not been visited yet. As we can see, when the state in "from" itself belongs to cons_states, the problem doesn't exist. Otherwise, the data still allow moves to states in cons_states other than the next consecutive one.
My desired output would be:
from to status id
2 3 1 1
2 8 0 5
2 16 0 6
3 4 0 7
3 8 0 8
3 16 1 9
16 4 0 11
16 8 1 14
8 4 1 16
I have spent a lot of time trying to figure this out, but I keep getting stuck writing complicated loops that don't work. Is there a way to do it that is not super complicated?
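One possible reading of the two rules can be sketched as a single pass over blocks of rows. The sketch below rests on several assumptions that the question does not confirm: consecutive rows with the same "from" are treated as one block of candidate moves, a state only counts as visited after its block ends, and "next consecutive" means the first entry of cons_states not yet visited. The names block, visited and keep are invented for the example.

```r
cons_states <- c(3, 4, 5, 6)
data <- data.frame(
  from   = c(2, 2, 2, 2, 2, 2, 3, 3, 3, 16, 16, 16, 16, 16, 8, 8, 8, 8),
  to     = c(3, 4, 5, 6, 8, 16, 4, 8, 16, 3, 4, 5, 6, 8, 3, 4, 5, 6),
  status = c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0),
  id     = 1:18
)

# Assumption: a run of rows sharing the same `from` is one block of candidates.
block <- cumsum(c(TRUE, diff(data$from) != 0))

visited <- c()                   # states reached so far (status == 1)
keep <- logical(nrow(data))
for (b in unique(block)) {
  rows <- which(block == b)
  to <- data$to[rows]
  # First state of cons_states that has not been visited yet (may be empty).
  next_cons <- head(setdiff(cons_states, visited), 1)
  # Rule 1: never return to a visited state.
  # Rule 2: among cons_states, only the next unvisited one is allowed.
  keep[rows] <- !(to %in% visited) &
    (!(to %in% cons_states) | to %in% next_cons)
  # Update the visited set only after the whole block has been checked.
  visited <- c(visited, to[data$status[rows] == 1])
}
result <- data[keep, ]
```

On the example data this keeps ids 1, 5, 6, 7, 8, 9, 11, 14 and 16, which matches the desired output above; whether it generalizes depends on the assumptions listed.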
I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work and I've spent the better part of a day on it (I'm somewhat new to R).
I have two columns of data. The first column 'snowday' represents the first day of a winter season, with the corresponding snow depth in the 'depth' column. This is several years of data (~62 years). So there should be sixty two years of first, second, third, etc days for the snowday column- this produces duplicates in snowday:
snowday row depth
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
...
75 4633 24
75 4634 4
75 4635 6
75 4636 20
75 4637 29
75 4638 1
I added a "row" column so that every row of the data frame is uniquely identified (which I vaguely understand to be how this is done), so rows 1:4638 are the total measurements taken over ~62 years at 75 days per year. Now I'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and I get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I want is something like this (the columns are defined by "snowday", and the rows hold the depths recorded on that particular day across the various years):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2 8 9 19 0 3
0 8 0 0 0 4 0 6 6 0 1 0 2 0
3 5 0 0 0 2 0 1 0 2 7 0 12 4
I think I'm fundamentally missing something here- I've tried working through drop=TRUE or convert = TRUE, and the output values are either all zeros or NA's depending on how I tinker. Also, all values in the data.frame(seasondata) are integers. Any thoughts?
It seems to me that what you wish to do is split the depth column according to the values of snowday, and then bind all 75 columns together.
There is a complication, in that 62*75 is not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by filling short columns up with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
depth = round(runif(10, 0, 10), 0))
# snowday depth
# 1 1 3
# 2 1 4
# 3 1 6
# 4 2 9
# 5 2 2
# 6 2 9
# 7 3 9
# 8 3 7
# 9 4 6
# 10 4 1
We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now, we use the by function to split up depth by values of snowdays, and fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
by(seasondata$depth, seasondata$snowday,
function(x) {
c(x, rep(NA, m - length(x)))
}
)
)
out
# 1 2 3 4
# [1,] 3 9 9 6
# [2,] 4 2 7 1
# [3,] 6 9 NA NA
Using spread:
You can use spread if you wish. In this case, you have to define row correctly: row should be 1 for the first occurrence of snowday == 1, 2 for its second occurrence, and so on. Likewise, row should be 1 for the first occurrence of snowday == 2, 2 for its second occurrence, etc. (lapply is used here rather than sapply, which would simplify to a matrix if every snowday had the same number of observations.)
seasondata$row <- unlist(lapply(rle(seasondata$snowday)$lengths, seq_len))
seasondata
# snowday depth row
# 1 1 3 1
# 2 1 4 2
# 3 1 6 3
# 4 2 9 1
# 5 2 2 2
# 6 2 9 3
# 7 3 9 1
# 8 3 7 2
# 9 4 6 1
# 10 4 1 2
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
# row 1 2 3 4
# 1 1 3 9 9 6
# 2 2 4 2 7 1
# 3 3 6 9 NA NA
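The rle() approach above relies on all rows for a given snowday being contiguous. As a hedged alternative that does not depend on row order, the occurrence counter can be computed with ave() and the wide table filled by matrix indexing, all in base R. The names occurrence, days and wide are made up for this sketch, and the depths are typed in from the fake data above.

```r
seasondata <- data.frame(
  snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
  depth   = c(3, 4, 6, 9, 2, 9, 9, 7, 6, 1)
)

# Within each snowday, number the observations 1, 2, 3, ...
occurrence <- ave(seq_along(seasondata$snowday), seasondata$snowday,
                  FUN = seq_along)

days <- sort(unique(seasondata$snowday))
# Start from an all-NA matrix so short columns stay padded with NA.
wide <- matrix(NA_real_, nrow = max(occurrence), ncol = length(days),
               dimnames = list(NULL, as.character(days)))
# Each (occurrence, snowday) pair addresses exactly one cell.
wide[cbind(occurrence, match(seasondata$snowday, days))] <- seasondata$depth
wide
```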
Data
I have following data frame:
dm <- data.frame(vehid = rep(c(2,3), each=10),
frameid=rep(c(7,10,11,13,6,5,14,13,12,1),2),
frame.diff = rep(c(NA,3,1,2,7,1,9,1,1,11), 2),
s.frame = rep(c(rep(0,9), 12), 2))
Question
For the row where s.frame != 0, I want to find every frameid that falls in the range running from that row's frameid to its s.frame, and assign the value loop to those rows and . to all others.
Desired Output
> dom
vehid frameid frame.diff s.frame loop
1 2 7 NA 0 loop
2 2 10 3 0 loop
3 2 11 1 0 loop
4 2 13 2 0 .
5 2 6 7 0 loop
6 2 5 1 0 loop
7 2 14 9 0 .
8 2 13 1 0 .
9 2 12 1 0 loop
10 2 1 11 12 loop
11 3 7 NA 0 loop
12 3 10 3 0 loop
13 3 11 1 0 loop
14 3 13 2 0 .
15 3 6 7 0 loop
16 3 5 1 0 loop
17 3 14 9 0 .
18 3 13 1 0 .
19 3 12 1 0 loop
20 3 1 11 12 loop
In the above example, for both vehids, the sequence was 1 to 12 (frameid:s.frame). The output shows the word 'loop' for the sequence and '.' for all others.
There should be a simple solution for this, but I can't figure it out. I would appreciate it if someone could provide a data.table or plyr solution. Thanks.
I'm not sure I understand your question correctly, but here's a data.table option
library(data.table)
setDT(dm)[, loop := ifelse(frameid %between% c(frameid[s.frame != 0], s.frame[s.frame != 0]), "loop", "."), by = vehid]
dm
# vehid frameid frame.diff s.frame loop
# 1: 2 7 NA 0 loop
# 2: 2 10 3 0 loop
# 3: 2 11 1 0 loop
# 4: 2 13 2 0 .
# 5: 2 6 7 0 loop
# 6: 2 5 1 0 loop
# 7: 2 14 9 0 .
# 8: 2 13 1 0 .
# 9: 2 12 1 0 loop
# 10: 2 1 11 12 loop
# 11: 3 7 NA 0 loop
# 12: 3 10 3 0 loop
# 13: 3 11 1 0 loop
# 14: 3 13 2 0 .
# 15: 3 6 7 0 loop
# 16: 3 5 1 0 loop
# 17: 3 14 9 0 .
# 18: 3 13 1 0 .
# 19: 3 12 1 0 loop
# 20: 3 1 11 12 loop
Using dplyr
library(dplyr)
dm %>%
group_by(vehid) %>%
mutate(loop = c(".", "loop")[(max(frameid[s.frame != 0]) <= frameid & frameid <= min(s.frame[s.frame != 0])) + 1])
# vehid frameid frame.diff s.frame loop
# 1 2 7 NA 0 loop
# 2 2 10 3 0 loop
# 3 2 11 1 0 loop
# 4 2 13 2 0 .
# 5 2 6 7 0 loop
# 6 2 5 1 0 loop
# 7 2 14 9 0 .
# 8 2 13 1 0 .
# 9 2 12 1 0 loop
# 10 2 1 11 12 loop
# 11 3 7 NA 0 loop
# 12 3 10 3 0 loop
# 13 3 11 1 0 loop
# 14 3 13 2 0 .
# 15 3 6 7 0 loop
# 16 3 5 1 0 loop
# 17 3 14 9 0 .
# 18 3 13 1 0 .
# 19 3 12 1 0 loop
# 20 3 1 11 12 loop
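For readers without data.table or dplyr, the same range test can be sketched in base R by looping over the row indices of each vehid. It assumes, as in the example, that each vehid has exactly one row with s.frame != 0; the names rows, hit, bounds and inside are invented here.

```r
dm <- data.frame(vehid = rep(c(2, 3), each = 10),
                 frameid = rep(c(7, 10, 11, 13, 6, 5, 14, 13, 12, 1), 2),
                 frame.diff = rep(c(NA, 3, 1, 2, 7, 1, 9, 1, 1, 11), 2),
                 s.frame = rep(c(rep(0, 9), 12), 2))

dm$loop <- "."                       # default label
for (rows in split(seq_len(nrow(dm)), dm$vehid)) {
  g <- dm[rows, ]
  hit <- g$s.frame != 0              # the row that defines the range
  # range() orders the two end points, whichever of them is larger
  bounds <- range(g$frameid[hit], g$s.frame[hit])
  inside <- g$frameid >= bounds[1] & g$frameid <= bounds[2]
  dm$loop[rows[inside]] <- "loop"
}
```

Using range() means the sketch does not care whether frameid or s.frame is the lower end of the interval.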
I am working with longitudinal data. I want to remove the observations of people that were only measured once (ids 5,7,9 below). How do I do this? Assume id is the unique identifier for people in the data set. Therefore, I would want to remove observations associated with ids 5,7, and 9. I've played with duplicated, unique, the table function, and the count function in plyr but haven't been successful. Example data below.
y<-sample(1:10, 20, replace=TRUE)
x<-sample(c(0,1),20, replace=TRUE)
id<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,7,8,8,8,9)
data<-data.frame(cbind(y,x,id))
You would have received immediate assistance had you tagged the post as R, data.frame.
Here, the ! (logical NOT) operator negates the %in% test, so rows whose id matches one of the values c(5,7,9) are dropped:
> data[!data$id %in% c(5,7,9),]
y x id
1 3 0 1
2 2 1 1
3 3 0 1
4 9 0 2
5 9 0 2
6 1 0 2
7 9 0 3
8 7 0 3
9 4 0 3
10 9 1 4
11 7 0 4
12 8 1 4
14 4 1 6
15 1 0 6
17 2 0 8
18 8 0 8
19 2 0 8
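The answer above hard-codes the ids to drop. A more general sketch counts the observations per id and keeps only ids that occur more than once, so the single-measurement ids do not need to be known in advance. The seed is arbitrary and is only there to make the made-up y and x reproducible; n_obs and repeated are names invented for the example.

```r
set.seed(42)  # arbitrary seed, only so the random y and x are reproducible
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6, 6, 7, 8, 8, 8, 9)
data <- data.frame(y  = sample(1:10, 20, replace = TRUE),
                   x  = sample(c(0, 1), 20, replace = TRUE),
                   id = id)

# For every row, the number of rows that share its id
n_obs <- ave(data$id, data$id, FUN = length)

# Keep only people who were measured more than once
repeated <- data[n_obs > 1, ]
```

An equivalent filter is data[data$id %in% names(which(table(data$id) > 1)), ], which uses the table function mentioned in the question.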