I have the following data.frame:
head(data.c)
mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
Because mark contains NAs, I need to add a column holding a running sequence that starts at 1. Where mark is NA, the sequence should not advance and the cell should repeat the previous value. The result should look like this:
mark high_mark mark_cum mark_seq
5 0 0 1
7 1 1 2
7 1 2 3
NA 0 2 3
7 1 3 4
7 1 4 5
NA 0 4 5
1) cumsum
This solution uses the fact that each mark_seq element equals the cumulative number of non-NA elements in mark up to that point.
transform(data.c, mark_seq = cumsum(!is.na(mark)))
giving:
mark high_mark mark_cum mark_seq
1 5 0 0 1
2 7 1 1 2
3 7 1 2 3
4 NA 0 2 3
5 7 1 3 4
6 7 1 4 5
7 NA 0 4 5
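To see why the cumsum approach works, it can help to look at the intermediate vectors. The following is a minimal sketch on a bare vector that mirrors the mark column above; the name not_na is introduced only for illustration.

```r
# Same values as the mark column in the example input
mark <- c(5, 7, 7, NA, 7, 7, NA)

# TRUE where a value is present, FALSE where it is NA
not_na <- !is.na(mark)

# TRUE counts as 1 and FALSE as 0, so the running total only
# advances on non-NA rows and repeats on NA rows
mark_seq <- cumsum(not_na)
mark_seq
# [1] 1 2 3 3 4 5 5
```

Note that if mark started with an NA, this sequence would start at 0 rather than 1.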
Here data.c was built from the input shown in the Note at the end of this answer:
data.c <- read.table(text = Lines, header = TRUE)
2) na.locf
Here is a second solution using seq_along and na.locf (from zoo). It creates a sequence as long as the number of non-NA elements in mark and uses replace to put its values in the positions of the non-NA elements. Then na.locf fills in the NAs with the prior values.
library(zoo)
transform(data.c, mark_seq=na.locf(replace(mark, !is.na(mark), seq_along(na.omit(mark)))))
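One thing to watch with this approach: by default na.locf drops leading NAs, so if mark began with an NA the result would be shorter than the column and transform would fail. A minimal sketch of a guard, assuming a hypothetical mark vector that starts with NA (the names numbered and mark_seq are just for illustration):

```r
library(zoo)

# Hypothetical variant of the mark column that starts with an NA
mark <- c(NA, 5, 7, NA, 7)

# Number the non-NA positions 1, 2, 3, ... and leave the NAs in place
numbered <- replace(mark, !is.na(mark), seq_along(na.omit(mark)))

# na.rm = FALSE keeps the leading NA, so the result has the same
# length as the input and can be stored as a column
mark_seq <- na.locf(numbered, na.rm = FALSE)
mark_seq
# [1] NA  1  2  2  3
```

Whether a leading NA should stay NA or become 0 depends on what you want; the cumsum solution would give 0 there.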
3) mark_cum
The question does not say how the input column mark_cum is constructed, but in the sample output the mark_seq column equals the mark_cum column plus 1. If that is always the case, an easy solution is:
transform(data.c, mark_seq = mark_cum + 1)
Note: We used this as the input:
Lines <- "mark high_mark mark_cum
5 0 0
7 1 1
7 1 2
NA 0 2
7 1 3
7 1 4
NA 0 4"
My LIST of data.frames below is made from my data. However, LIST is missing the scale column, which is available in the original data.
How can I put the missing scale column back into LIST to achieve my DESIRED_LIST?
Reproducible data and code are below.
m3="
scale study outcome time ES bar
2 1 1 0 1 8
2 1 2 0 2 7
1 2 1 0 3 6
1 2 1 1 4 5
2 3 1 0 5 4
2 3 1 1 6 3
1 4 1 0 7 2
1 4 2 0 8 1"
data <- read.table(text = m3, h=T)
LIST <- list(data.frame(study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
DESIRED_LIST <- list(data.frame(scale=c(2,2) ,study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(scale=c(2,2) ,study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(scale=c(1,1,1,1),study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
In base R, you could do:
lapply(LIST, \(x) merge(x, data)[names(data)])
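Indexing by names(data) keeps every column of data, including ES and bar, which DESIRED_LIST does not have. If only scale should come back, a variant along these lines may be closer to the desired output. This is a sketch using the data from the question; result is a name chosen here for illustration.

```r
# Original data, as defined in the question
data <- read.table(text = "
scale study outcome time ES bar
2 1 1 0 1 8
2 1 2 0 2 7
1 2 1 0 3 6
1 2 1 1 4 5
2 3 1 0 5 4
2 3 1 1 6 3
1 4 1 0 7 2
1 4 2 0 8 1", header = TRUE)

LIST <- list(data.frame(study = c(3, 3), outcome = c(1, 1), time = 0:1),
             data.frame(study = c(1, 1), outcome = c(1, 2), time = c(0, 0)),
             data.frame(study = c(2, 2, 4, 4), outcome = c(1, 1, 1, 2),
                        time = c(0, 1, 0, 0)))

# merge() joins on the shared columns (study, outcome, time);
# c("scale", names(x)) then keeps scale plus x's own columns, in that order
result <- lapply(LIST, function(x) merge(x, data)[c("scale", names(x))])
```

Keep in mind that merge sorts the rows by the join columns, so the row order can differ from the input when a list element is not already sorted.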
I have some cumulative count data. Because of reporting inaccuracies, the cumulative sum sometimes decreases, as in 0 1 2 2 3 3 2 4 5.
I would like to create a new vector that retains the largest value reported and carries it forward until the cumulative count catches up. The corrected version of the above would be 0 1 2 2 3 3 3 4 5.
I tried the following
mydf <- data.frame(ts1 = c(0,1,1,1,2,3,2,2,3,4,4,5))
mydf$lag1 <- dplyr::lag(mydf[,1])  # dplyr's lag; stats::lag does not shift a plain vector
mydf$corrected <- ifelse(is.na(mydf[,2]),mydf[,1],
ifelse(mydf[,2] > mydf[,1], mydf[,2], mydf[,1]))
which returns:
ts1 lag1 corrected
1 0 NA 0
2 1 0 1
3 1 1 1
4 1 1 1
5 2 1 2
6 3 2 3
7 2 3 3
8 2 2 2
9 3 2 3
10 4 3 4
11 4 4 4
12 5 4 5
This worked the first time the next value was smaller than the previous one (row 7), but it fails the second time (row 8).
I thought there must be a better way of doing this. I want a new vector that equals the input vector unless the value decreases, in which case it retains the prior value until the input vector exceeds that retained value.
You are looking for cummax:
cummax(mydf$ts1)
#[1] 0 1 1 1 2 3 3 3 3 4 4 5
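As a quick check, here is the same call on the short example from the question text; the names reported and corrected are only for illustration.

```r
# Cumulative counts from the question, with one spurious dip (the 2 after 3 3)
reported <- c(0, 1, 2, 2, 3, 3, 2, 4, 5)

# cummax returns the running maximum, so a dip is replaced by the
# largest value seen so far until the input catches up
corrected <- cummax(reported)
corrected
# [1] 0 1 2 2 3 3 3 4 5
```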
I could use some help. I need to add a new variable to a dataframe based on whether or not the value of a variable in a dataframe equals the index value of another vector. Below is a simplified example:
vector [2 7 15 4 5]
dataframe (4 variables; Index, Site, Quad, Count)
Index Site Quad Count
1 2 3 0
1 3 7 2
2 1 8 0
2 3 3 1
3 2 3 0
4 3 7 2
5 1 8 0
5 3 3 1
The variable I would like to create would match the value of df$Index with the corresponding position in the vector. That is, when df$Index = 1, the new variable would be 2 (position 1 in the vector); when df$Index = 2, the new variable would be 7 (position 2 in the vector); when df$Index = 3, the new variable would be 15 (position 3 in the vector).
I've ended up in a R wormhole, and know the solution is simple, but I cannot seem to get it. Thanks for any help.
If your indexes are actually integer indices, for example
dd<-read.table(text="Index Site Quad Count
1 2 3 0
1 3 7 2
2 1 8 0
2 3 3 1
3 2 3 0
4 3 7 2
5 1 8 0
5 3 3 1", header=TRUE)
vec <- c(2, 7, 15, 4, 5)
Then you can create the new column with
dd$value <- vec[dd$Index]
dd
# Index Site Quad Count value
# 1 1 2 3 0 2
# 2 1 3 7 2 2
# 3 2 1 8 0 7
# 4 2 3 3 1 7
# 5 3 2 3 0 15
# 6 4 3 7 2 4
# 7 5 1 8 0 5
# 8 5 3 3 1 5
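If the Index values were labels rather than positions, direct indexing would not work, and match could be used to look up the position first. A minimal sketch with made-up data (keys, values and dd2 are hypothetical names, not from the question):

```r
# Hypothetical lookup: Index holds labels (10, 20, 30), not positions
keys   <- c(10, 20, 30)
values <- c(2, 7, 15)

dd2 <- data.frame(Index = c(10, 10, 30, 20))

# match() gives the position of each Index in keys;
# that position is then used to pick from values
dd2$value <- values[match(dd2$Index, keys)]
dd2$value
# [1]  2  2 15  7
```

An Index that does not appear in keys would get NA here.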
I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work and I've spent the better part of a day on it (I'm somewhat new to R).
I have two columns of data. The first column, 'snowday', is the day number within a winter season, and the 'depth' column holds the corresponding snow depth. The data cover several years (~62), so each snowday value (first, second, third, etc.) appears once per year, which produces duplicates in snowday:
snowday row depth
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
...
75 4633 24
75 4634 4
75 4635 6
75 4636 20
75 4637 29
75 4638 1
I added a "row" column to make the data frame more transient (which I only vaguely understand, to be honest), so rows 1:4638 are the total measurements taken over ~62 years at 75 days per year. Now I'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and I get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I want is something like this, where the columns are defined by "snowday" and the rows hold the depths recorded for that particular day over the various years (e.g. days 1 through 14):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2 8 9 19 0 3
0 8 0 0 0 4 0 6 6 0 1 0 2 0
3 5 0 0 0 2 0 1 0 2 7 0 12 4
I think I'm fundamentally missing something here. I've tried working through drop = TRUE or convert = TRUE, and the output values are either all zeros or NAs depending on how I tinker. Also, all values in the data.frame (seasondata) are integers. Any thoughts?
It seems to me that what you wish to do is split up the depth column according to the values of snowday, and then bind all 75 columns together.
There is a complication, in that 62*75 is not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by filling short columns up with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
depth = round(runif(10, 0, 10), 0))
# snowday depth
# 1 1 3
# 2 1 4
# 3 1 6
# 4 2 9
# 5 2 2
# 6 2 9
# 7 3 9
# 8 3 7
# 9 4 6
# 10 4 1
We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now we use the by function to split up depth by values of snowday, fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
by(seasondata$depth, seasondata$snowday,
function(x) {
c(x, rep(NA, m - length(x)))
}
)
)
out
# 1 2 3 4
# [1,] 3 9 9 6
# [2,] 4 2 7 1
# [3,] 6 9 NA NA
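The same split-and-pad idea can also be sketched with split and sapply, relying on the fact that indexing a vector past its end yields NA. This is an alternative under the same assumptions as above, not a replacement for the by version; the depth values are hard-coded to match the fake data, and out2 is a name chosen here.

```r
# Same fake data as above, with the depths written out explicitly
seasondata <- data.frame(
  snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
  depth   = c(3, 4, 6, 9, 2, 9, 9, 7, 6, 1))

# Longest column length (number of "years")
m <- max(table(seasondata$snowday))

# split() groups depth by snowday; x[seq_len(m)] pads short groups
# with NA, so sapply() can bind them into an m-row matrix
out2 <- sapply(split(seasondata$depth, seasondata$snowday),
               function(x) x[seq_len(m)])
out2
```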
Using spread:
You can use spread if you wish. In this case, you have to define row correctly. row should be 1 for the first first snowday (snowday == 1), 2 for the second first snowday, etc. row should also be 1 for the first second snowday, 2 for the second second snowday, etc.
seasondata$row <- unlist(lapply(rle(seasondata$snowday)$lengths, seq_len))  # lapply always returns a list, even when all runs have equal length
seasondata
# snowday depth row
# 1 1 3 1
# 2 1 4 2
# 3 1 6 3
# 4 2 9 1
# 5 2 2 2
# 6 2 9 3
# 7 3 9 1
# 8 3 7 2
# 9 4 6 1
# 10 4 1 2
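The rle approach assumes the rows are already grouped by snowday. If they are not (for example, if the data are ordered by year instead), a within-group counter can be built with ave. A minimal sketch on a hypothetical unsorted snowday vector; row_id is a name introduced here:

```r
# Hypothetical snowday values that are NOT grouped together
snowday <- c(1, 2, 1, 2, 3, 1)

# For each distinct snowday, number its occurrences 1, 2, 3, ...
# in order of appearance, regardless of how the rows are sorted
row_id <- ave(seq_along(snowday), snowday, FUN = seq_along)
row_id
# [1] 1 1 2 2 1 3
```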
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
# row 1 2 3 4
# 1 1 3 9 9 6
# 2 2 4 2 7 1
# 3 3 6 9 NA NA
I want to subtract the smallest value in each subset of a data frame from each value in that subset i.e.
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
I think the output you show is not correct. In any case, from what you explain, I think this is what you want. It uses the base function ave:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.
Echoing Arun's comment above, I think your expected output might be off. In any event, you can use tapply to calculate the minimum of each subset and then use match to line those minimums up with the original values:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
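Both answers boil down to computing a per-group minimum aligned with the original rows and subtracting it. A minimal sketch that makes the intermediate step visible, using the data from the question (group_min and shifted are names introduced here for illustration):

```r
# Data from the question, ordered by group
A <- c(1, 3, 5, 6, 4, 5, 6, 7, 10)
B <- rep(1:4, length.out = length(A))
df <- data.frame(A, B)
df <- df[order(df$B), ]

# ave() returns the group minimum repeated for every row of the group
group_min <- ave(df$A, df$B, FUN = min)
group_min
# [1] 1 1 1 3 3 5 5 6 6

# Subtracting it shifts each group so its smallest value becomes 0
shifted <- df$A - group_min
shifted
# [1] 0 3 9 0 2 0 1 0 1
```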