data frame cumulative run length encoding in R

I've got a data frame containing observation values of 1 or 0. I want to count the continual occurrences of 1, resetting at each 0. The run length encoding function (rle) seems like it should do the job, but I can't work out how to get the data into the desired format. I want to do this without writing a custom function. In the data below, I have the observation column in a data frame, and I want to derive the "continual" column and write it back to the data frame. This link was a good start.
observation continual
0 0
0 0
0 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
0 0
0 0

You can do this pretty easily in a couple of steps:
x <- rle(mydf$observation) ## run rle on the relevant column
new <- sequence(x$lengths) ## create a sequence of the lengths values
new[mydf$observation == 0] <- 0 ## replace relevant values with zero
new
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 0
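To write the result back to the data frame, as the question asks, just assign the vector to the new column (assuming your data frame is named mydf as above):
mydf$continual <- new ## add the computed column to the data frame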

Using the development version of data.table you could try
library(data.table) ## v >= 1.9.5
setDT(df)[, continual := seq_len(.N) * observation, by = rleid(observation)]
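For reference, rleid() gives each run of consecutive identical values its own group id, which is why seq_len(.N) restarts at every change in observation. A small illustration:
rleid(c(0, 0, 1, 1, 1, 0)) ## one id per run
# [1] 1 1 2 2 2 3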

There is probably a better way, but:
g <- c(0, cumsum(abs(diff(df$observation))))
df$continual <- ave(g, g, FUN = seq_along)
df$continual[df$observation == 0] <- 0
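To see how this works, the intermediate vector g labels each run of identical values (values shown for the example data above), and ave then counts within each label:
g
# [1] 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2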

Simply adapting the accepted answer from the question you linked:
unlist(mapply(function(x, y) seq(x)*y, rle(df$observation)$lengths, rle(df$observation)$values))
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 0

You can use a simple base R one-liner, exploiting the fact that observation contains only 0s and 1s, coupled with a vectorized operation (see the caveat after the output below):
transform(df, continual=ifelse(observation, cumsum(observation), observation))
# observation continual
#1 0 0
#2 0 0
#3 0 0
#4 1 1
#5 1 2
#6 1 3
#7 1 4
#8 1 5
#9 1 6
#10 1 7
#11 1 8
#12 1 9
#13 1 10
#14 1 11
#15 1 12
#16 0 0
#17 0 0
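One caveat worth noting (my note, not part of the original answer): cumsum never resets, so this one-liner relies on observation containing a single run of 1s. A hedged variant that also resets across multiple runs, reusing the rle/sequence idea from the first answer:
transform(df, continual = sequence(rle(observation)$lengths) * observation)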

Related

How to add sequence for specific values in R

I have the following data frame in R
a b
1 0
2 0
3 0
4 1
5 1
6 1
7 0
8 0
9 0
10 1
11 1
Desired dataframe would be
a b Flag
1 0 1
2 0 2
3 0 3
4 1 4
5 1 4
6 1 4
7 0 5
8 0 6
9 0 7
10 1 8
11 1 8
The sequence should increment at each 0 and stay the same across a run of 1s.
I am doing it with the following command
df$flag <- with(df, match(b, unique(b)))
But it does not give me the desired output.
This has been updated to account for the first element of b being 1. Thanks to @tk3 for pointing out that a change was needed.
It looks like your rule is to increase flag if b is zero OR if it is the first 1 in a sequence.
This will give your answer.
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
If you just wanted to increment flag when b is zero, you could use
cumsum(1 - df$b). Except that would not change the flag for the first 1 in a series. So I wanted to make an altered version of b that sets b = 0 for all of the first ones. You can use c(df$b[1], diff(df$b) > 0) to find all of the places where b changed from zero to one - the "first ones". Now
df$b - c(df$b[1],diff(df$b)>0)
0 0 0 0 1 1 0 0 0 0 1
changes all of the "first ones" to zeros unless it is the first element of b. With this altered b we can use cumsum as above. We want to take cumsum of
1 - ( df$b - c(df$b[1],diff(df$b)>0) ) = 1 + c(df$b[1],diff(df$b)>0) - df$b
which is the expression given in my answer above:
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
The original version worked only for df$b[1] = 0. The updated version should also work for df$b[1] = 1.
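Putting it together as a column assignment, a minimal sketch assuming the data frame is named df as in the question:
df$Flag <- cumsum(1 + c(df$b[1], diff(df$b) > 0) - df$b) ## add the Flag column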
The following seems to do what you want.
I find it a bit complicated but it works.
sp <- split(df, cumsum(c(0, abs(diff(df$b)))))
df2 <- lapply(sp, function(DF) {
  DF$Flag <- as.integer(DF$b != 1)
  if (DF$b[1] == 1) DF$Flag[1] <- 1
  DF
})
rm(sp) # clean up
df2 <- do.call(rbind, df2)
df2$Flag <- cumsum(df2$Flag)
row.names(df2) <- NULL
df2
# a b Flag
#1 1 0 1
#2 2 0 2
#3 3 0 3
#4 4 1 4
#5 5 1 4
#6 6 1 4
#7 7 0 5
#8 8 0 6
#9 9 0 7
#10 10 1 8
#11 11 1 8
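For comparison, here is a hedged alternative sketch (mine, not from either answer above) that builds the run ids with rle: every 0 opens a new flag, and only the first row of each run of 1s does.
r <- rle(df$b)
grp <- rep(seq_along(r$lengths), r$lengths) ## run id for every row
df$Flag <- cumsum(df$b == 0 | !duplicated(grp))
df$Flag
# [1] 1 2 3 4 4 4 5 6 7 8 8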

R Spread function with duplicates- still can't get to work after adding transient row

I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work and I've spent the better part of a day on it (somewhat new to R).
I have two columns of data. The first column 'snowday' represents the day of the winter season, with the corresponding snow depth in the 'depth' column. This is several years of data (~62 years), so there should be sixty-two years of first, second, third, etc. days in the snowday column, which produces duplicates in snowday:
snowday row depth
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
...
75 4633 24
75 4634 4
75 4635 6
75 4636 20
75 4637 29
75 4638 1
I added a "row" column to make the data frame more transient (which I vaguely understand to be hones so 1:4638 rows is the total measurements taken over ~62 years at 75 days per year . Now i'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and I get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I want it to look like is something like this (the columns are defined by "snowday" and the values in each column are the depths recorded for that day across the various years), e.g. days 1 through 14:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2 8 9 19 0 3
0 8 0 0 0 4 0 6 6 0 1 0 2 0
3 5 0 0 0 2 0 1 0 2 7 0 12 4
I think I'm fundamentally missing something here. I've tried working through drop = TRUE and convert = TRUE, and the output values are either all zeros or NAs depending on how I tinker. Also, all values in the data frame (seasondata) are integers. Any thoughts?
It seems to me what you wish to do is to split up the depth column according to values of snowday, and then bind all 75 columns together.
There is a complication, in that 62*75 is not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by filling short columns up with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
depth = round(runif(10, 0, 10), 0))
# snowday depth
# 1 1 3
# 2 1 4
# 3 1 6
# 4 2 9
# 5 2 2
# 6 2 9
# 7 3 9
# 8 3 7
# 9 4 6
# 10 4 1
We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now, we use the by function to split up depth by values of snowday, fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
               by(seasondata$depth, seasondata$snowday,
                  function(x) {
                    c(x, rep(NA, m - length(x)))
                  }))
out
# 1 2 3 4
# [1,] 3 9 9 6
# [2,] 4 2 7 1
# [3,] 6 9 NA NA
Using spread:
You can use spread if you wish. In this case, you have to define row correctly: row should be 1 for the first occurrence of snowday == 1, 2 for the second occurrence, and so on. Likewise, row should be 1 for the first occurrence of snowday == 2, 2 for the second occurrence, etc.
seasondata$row <- unlist(sapply(rle(seasondata$snowday)$lengths, seq_len))
seasondata
# snowday depth row
# 1 1 3 1
# 2 1 4 2
# 3 1 6 3
# 4 2 9 1
# 5 2 2 2
# 6 2 9 3
# 7 3 9 1
# 8 3 7 2
# 9 4 6 1
# 10 4 1 2
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
# row 1 2 3 4
# 1 1 3 9 9 6
# 2 2 4 2 7 1
# 3 3 6 9 NA NA
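As a side note, in tidyr 1.0+ spread() has been superseded by pivot_wider(); a hedged equivalent of the call above:
library(tidyr)
pivot_wider(seasondata, id_cols = row, names_from = snowday, values_from = depth)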

How to create a table that shows the frequency of all dummy variables in R

I am a rookie in R.
I want to create a frequency table of all my dummy variables, and I have data like this
ID Dummy_2008 Dummy_2009 Dummy_2010 Dummy_2011 Dummy_2012 Dummy_2013
1 1 1 0 0 1 1
2 0 0 1 1 0 1
3 0 0 1 0 0 1
4 0 1 1 0 0 1
5 0 0 0 0 1 0
6 0 0 0 1 0 0
I want to see the total frequency in each variable, like this
0 1 sum
Dummy_2008 5 1 6
Dummy_2009 4 2 6
Dummy_2010 3 3 6
Dummy_2011 4 2 6
Dummy_2012 4 2 6
Dummy_2013 2 4 6
I only know how to use table(), but that only handles one variable at a time.
I have many time series dummy variables, and I want to see the trend in them.
Many thanks for the help
Terence
Here is another option using mtabulate and addmargins
library(qdapTools)
addmargins(as.matrix(mtabulate(df1[-1])),2)
# 0 1 Sum
#Dummy_2008 5 1 6
#Dummy_2009 4 2 6
#Dummy_2010 3 3 6
#Dummy_2011 4 2 6
#Dummy_2012 4 2 6
#Dummy_2013 2 4 6
result = as.data.frame(t(sapply(dat[,-1], table)))
result$Sum = rowSums(result)
0 1 Sum
Dummy_2008 5 1 6
Dummy_2009 4 2 6
Dummy_2010 3 3 6
Dummy_2011 4 2 6
Dummy_2012 4 2 6
Dummy_2013 2 4 6
Explanation:
sapply applies a function to each column of a data frame and returns a matrix. So sapply(dat[,-1], table) returns a matrix with the output of table for each column (except the first column, which we've excluded).
The matrix needs to be transposed so that the column names from the original data frame are the rows and the dummy values are the columns, so we use the t (transpose) function for that.
We want a data frame, not a matrix, so we wrap the whole thing in as.data.frame.
Next, we want another column giving the total number of values, so we use the rowSums function.
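One hedged caveat (my addition): table() drops a level entirely if a column never contains it, which would make sapply() return a ragged list instead of a matrix. Fixing the factor levels keeps every column's table the same length:
result = as.data.frame(t(sapply(dat[,-1], function(x) table(factor(x, levels = 0:1)))))
result$Sum = rowSums(result)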

R Counting duplicate values and adding them to separate vectors [duplicate]

This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
How do I get a contingency table?
(6 answers)
x <- c(1,1,1,2,3,3,4,4,4,5,6,6,6,6,6,7,7,8,8,8,8)
y <- c('A','A','C','A','B','B','A','C','C','B','A','A','C','C','B','A','C','A','A','A','B')
X <- data.frame(x,y)
Above I have a data frame where I want to identify the duplicates in vector x, while counting the number of duplicate instances for each (x, y) pair.
For example, I have found that ddply and this post here are similar to what I am looking for (Find how many times duplicated rows repeat in R data frame).
library(plyr) ## ddply comes from the plyr package
ddply(X, .(x, y), nrow)
This counts the number of times each combination occurs, e.g. 1-A occurs 2 times. However, I am looking for R to return each unique identifier in vector x along with the counted number of times it matches each value in column y (getting rid of vector y if necessary), like below:
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
Any help will be appreciated, thanks
You just need the table function :)
> table(X)
y
x A B C
1 2 0 1
2 1 0 0
3 0 2 0
4 1 0 2
5 0 1 0
6 2 1 2
7 1 0 1
8 3 1 0
This is fairly straightforward by casting your data.frame.
require(reshape2)
dcast(X, x ~ y, fun.aggregate=length)
Or if you want things to be faster (say, when working on large data), then you can use the newly implemented dcast.data.table function from the data.table package:
require(data.table) ## >= 1.9.0
setDT(X) ## convert data.frame to data.table by reference
dcast.data.table(X, x ~ y, fun.aggregate=length)
Both result in:
x A B C
1: 1 2 0 1
2: 2 1 0 0
3: 3 0 2 0
4: 4 1 0 2
5: 5 0 1 0
6: 6 2 1 2
7: 7 1 0 1
8: 8 3 1 0
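In current data.table versions the generic dcast dispatches on data.tables directly, so after the setDT(X) conversion above the shorter call should work too (a hedged note):
dcast(X, x ~ y, fun.aggregate = length)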

subtract first value from each subset of dataframe

I want to subtract the smallest value in each subset of a data frame from each value in that subset, i.e.
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
I think the output you show is not correct. In any case, from what you explain, I think this is what you want. This uses the base function ave:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.
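For example, a hedged data.table sketch of the same operation:
library(data.table)
setDT(df) ## convert to data.table by reference
df[, A := A - min(A), by = B] ## subtract each group's minimum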
Echoing Arun's comment above, I think your expected output might be off. In any event, you should be able to use tapply to calculate each subset's minimum and then use match to line those values up with the original rows:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
