Binning a discrete variable (preferably in dplyr) - r

I would like to "bin" a large discrete variable by combining every two consecutive rows into one bin, and to label each bin with the value from its first row.
As an example:
x<-data.frame(x=c(1,2,3,4,5,6,7,8,9,10,11,12),
y=c(1,1,3,3,5,5,7,7,9,9,11,11))
x

We may use gl to create the grouping bin
library(dplyr)
x %>%
mutate(grp = as.integer(gl(n(), 2, n())))
x y grp
1 1 1 1
2 2 1 1
3 3 3 2
4 4 3 2
5 5 5 3
6 6 5 3
7 7 7 4
8 8 7 4
9 9 9 5
10 10 9 5
11 11 11 6
12 12 11 6

Performing the steps exactly as you outlined them would look like this:
library(dplyr)
x %>%
# n() is the row count; length(x) only works here because the column is also named x
mutate(bins = rep(seq_len(n() / 2), each = 2)) %>%
group_by(bins) %>%
filter(row_number() == 1) %>%
ungroup()
However, this gives exactly the same result (without the bins column) in one line of code:
x[seq(1, nrow(x), by = 2), ]

Another way using seq and ceiling.
x$bin <- ceiling(seq(nrow(x))/2)
x
# x y bin
#1 1 1 1
#2 2 1 1
#3 3 3 2
#4 4 3 2
#5 5 5 3
#6 6 5 3
#7 7 7 4
#8 8 7 4
#9 9 9 5
#10 10 9 5
#11 11 11 6
#12 12 11 6
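The answers above number the bins 1, 2, 3, ..., while the question asks for each bin to be named after its first row. A minimal base R sketch of that last step, assuming the rows are already in the intended order and that the label should come from column x (the helper vector bin_id is only illustrative):

```r
# Rebuild the example data from the question
x <- data.frame(x = 1:12,
                y = c(1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11))

# Bin index: rows 1-2 -> 1, rows 3-4 -> 2, and so on
bin_id <- ceiling(seq_len(nrow(x)) / 2)

# Label every row of a bin with the x value of the bin's first row
x$bin <- ave(x$x, bin_id, FUN = function(v) v[1])
x$bin
```

For this data the labels coincide with column y, which is a quick way to check the result.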

Related

Get the value of a cell of a data frame based on the value in one of the columns in R

I have an example data frame in which columns "a" and "b" hold certain values, and column "c" holds either 1 or 2. I would like to create a column "d" whose value in each row is taken from the column whose index is given in "c".
x = data.frame(a = c(1:10), b = c(3:12), c = seq(1:2))
x
a b c
1 1 3 1
2 2 4 2
3 3 5 1
4 4 6 2
5 5 7 1
6 6 8 2
7 7 9 1
8 8 10 2
9 9 11 1
10 10 12 2
Thus column "d" will contain 1 for the first row, since the index in column "c" is 1; for the second row d = 4, since the index in column "c" is 2; and so on. Standard indexing in R did not help me: it just returns the values of column c. How can I solve this?
You can build a two-column matrix of row and column numbers and use it to subset values from the data frame.
x$d <- x[cbind(1:nrow(x), x$c)]
x
# a b c d
#1 1 3 1 1
#2 2 4 2 4
#3 3 5 1 3
#4 4 6 2 6
#5 5 7 1 5
#6 6 8 2 8
#7 7 9 1 7
#8 8 10 2 10
#9 9 11 1 9
#10 10 12 2 12
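To make the matrix-indexing step more explicit, here is a small sketch that restricts the lookup to the value columns first. The name value_cols is an assumption for illustration, and the approach assumes the values in c are positions within value_cols:

```r
x <- data.frame(a = 1:10, b = 3:12, c = rep(1:2, 5))

# Columns that c is allowed to point at (illustrative name)
value_cols <- c("a", "b")
m <- as.matrix(x[value_cols])

# Each row of idx is one (row, column) pair to extract from m
idx <- cbind(seq_len(nrow(x)), x$c)
x$d <- m[idx]
x$d
```

Indexing a matrix with a two-column matrix returns one element per (row, column) pair, which is why a single subsetting call is enough.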
If the input is a tibble, you need to convert it to a data frame to use the above answer.
If you don't want to convert it, here is another option using rowwise.
library(dplyr)
x <- as_tibble(x)
x %>% rowwise() %>% mutate(d = c_across(a:b)[c])
By using dplyr::mutate and ifelse,
x %>% mutate(d = ifelse(c == 1, a, b))
a b c d
1 1 3 1 1
2 2 4 2 4
3 3 5 1 3
4 4 6 2 6
5 5 7 1 5
6 6 8 2 8
7 7 9 1 7
8 8 10 2 10
9 9 11 1 9
10 10 12 2 12

Create new column and carry forward value from previous group to next

I am trying to carry forward the value from the previous group to the next group. I tried to solve it using rleid but could not get the desired result.
df <- data.frame(signal = c(1,1,5,5,5,2,3,3,3,4,4,5,5,5,5,6,7,7,8,9,9,9,10),
desired_outcome = c(NA, NA, 1, 1, 1, 5, 2, 2, 2, 3, 3, 4, 4,4,4,5,6,6,7,8,8,8,9))
# outcome column has the expected result -
signal desired_outcome
1 1 NA
2 1 NA
3 5 1
4 5 1
5 5 1
6 2 5
7 3 2
8 3 2
9 3 2
10 4 3
11 4 3
12 5 4
13 5 4
14 5 4
15 5 4
16 6 5
17 7 6
18 7 6
19 8 7
20 9 8
21 9 8
22 9 8
23 10 9
rle gives the lengths and values of the runs in which the same value occurs. Then: drop the last value, prepend an NA so the remaining values shift one position to the right, and repeat each value by the corresponding entry of lengths (i.e. the run lengths in the original vector).
with(rle(df$signal), rep(c(NA, head(values, -1)), lengths))
# [1] NA NA 1 1 1 5 2 2 2 3 3 4 4 4 4 5 6 6 7 8 8 8 9
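The one-liner above can be unpacked step by step. This sketch uses only the first six values of signal so the intermediate results stay short; the variable names are illustrative:

```r
signal <- c(1, 1, 5, 5, 5, 2)

runs <- rle(signal)
runs$values   # 1 5 2  (one entry per run)
runs$lengths  # 2 3 1  (how long each run is)

# Drop the last run value and prepend NA: each run now sees the previous run's value
shifted <- c(NA, head(runs$values, -1))   # NA 1 5

# Expand back to the original length
prev <- rep(shifted, runs$lengths)        # NA NA 1 1 1 5
prev
```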
Another way is to first lag signal, then use rleid to create groups, and use mutate to broadcast the first value of each group to all of its rows.
library(dplyr)
df %>%
mutate(out = lag(signal)) %>%
group_by(group = data.table::rleid(signal)) %>%
mutate(out = first(out)) %>%
ungroup() %>%
select(-group)
# A tibble: 23 x 2
# signal out
# <dbl> <dbl>
# 1 1 NA
# 2 1 NA
# 3 5 1
# 4 5 1
# 5 5 1
# 6 2 5
# 7 3 2
# 8 3 2
# 9 3 2
#10 4 3
# … with 13 more rows

dplyr calculate new columns in batch

I would like to add new columns to a data.frame using dplyr. One by one it is easy using mutate. However, I have a situation where I have a function that calculates several parameters based on some other column and I would like to add them to the table in one go. Suppose I have a function
f = function(x) {data.frame(A = x + 1, B = x + 2, C = x + 3)}
And I want to run this function against a column in a data.frame and add the results to the same data.frame, so
df = data.frame(x = 1:10)
df %>% XXX(f(x))
would result in data.frame like this:
x A B C
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
10 11 12 13
I know I have read about a function like XXX in the example above, but I'm unable to find it right now. Does anybody have a hint?
We can use do
library(dplyr)
df %>%
do(data.frame(., f(.$x)))
# x A B C
#1 1 2 3 4
#2 2 3 4 5
#3 3 4 5 6
#4 4 5 6 7
#5 5 6 7 8
#6 6 7 8 9
#7 7 8 9 10
#8 8 9 10 11
#9 9 10 11 12
#10 10 11 12 13
Or
library(purrr)
df %>%
map_df(f) %>%
bind_cols(df, .)
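Since f already returns a data frame with one row per input value, the same result can also be sketched in base R by binding the columns directly. This assumes f preserves row order and row count, as the f from the question does:

```r
f <- function(x) data.frame(A = x + 1, B = x + 2, C = x + 3)
df <- data.frame(x = 1:10)

# Append the columns computed by f to the original data frame
out <- cbind(df, f(df$x))
head(out, 3)
```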

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to increase continuously within each group and to skip two numbers after each x group (e.g. 2 is followed by 5).
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment, the groupings can be arbitrary; you simply need to recode them as consecutive integers in order of appearance. There are a few ways to do this: @akrun has shown that it can be done with the match function, or you can use as.numeric on a factor if that is easier to follow.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
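The two steps can be wrapped into one small function. This is only a sketch: the name gapped_seq and the gap parameter are hypothetical, and it assumes the rows of each group are contiguous, as in the example:

```r
# group: vector of group labels; gap: how many numbers to skip between groups
gapped_seq <- function(group, gap = 2) {
  # Recode groups as 1, 2, 3, ... in order of first appearance
  idx <- match(group, unique(group))
  # Running row number plus the accumulated gaps of all earlier groups
  seq_along(group) + (idx - 1) * gap
}

x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)
res <- gapped_seq(x)
res
# 1 2 5 6 7 10 11 14 17 20 21 22 25 26
```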
To create the sequence 0,0,4,4,4,9,..., you take the minimum of each group and subtract 1. The easiest way to do this is with dplyr.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24

How to find difference between values in two rows in an R dataframe using dplyr

I have an R dataframe such as:
df <- data.frame(period=rep(1:4,2),
farm=c(rep('A',4),rep('B',4)),
cumVol=c(1,5,15,31,10,12,16,24),
other = 1:8);
period farm cumVol other
1 1 A 1 1
2 2 A 5 2
3 3 A 15 3
4 4 A 31 4
5 1 B 10 5
6 2 B 12 6
7 3 B 16 7
8 4 B 24 8
How do I find the change in cumVol at each farm in each period, ignoring the 'other' column? I would like a dataframe like this (optionally with the cumVol column remaining):
period farm volume other
1 1 A 0 1
2 2 A 4 2
3 3 A 10 3
4 4 A 16 4
5 1 B 0 5
6 2 B 2 6
7 3 B 4 7
8 4 B 8 8
In practice there may be many 'farm'-like columns, and many 'other'-like (ie. ignored) columns. I'd like to be able to specify all the column names using variables.
I am using the dplyr package.
In dplyr:
require(dplyr)
df %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = cumVol[1]))
Source: local data frame [8 x 5]
Groups: farm
period farm cumVol other volume
1 1 A 1 1 0
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 0
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
Perhaps the desired output should actually be as follows?
df %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = 0))
period farm cumVol other volume
1 1 A 1 1 1
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 10
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
Edit: Following up on your comments, I think you are looking for arrange(). If that is not the case, it might be best to start a new question.
df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) );
df1 %>%
arrange(desc(period), desc(farm)) %>%
group_by(period, farm) %>%
summarise(cumVol=sum(cumCropVol))
Edit: Follow up #2
df1 <- data.frame(period=rep(1:4,4), farm=rep(c(rep('A',4),rep('B',4)),2), crop=(c(rep('apple',8), rep('pear',8))), cumCropVol=c(1,5,15,31,10,12,16,24,11,15,25,31,20,22,26,34), other = rep(1:8,2) );
df <- df1 %>%
arrange(desc(period), desc(farm)) %>%
group_by(period, farm) %>%
summarise(cumVol=sum(cumCropVol))
ungroup(df) %>%
arrange(farm) %>%
group_by(farm) %>%
mutate(volume = cumVol - lag(cumVol, default = 0))
Source: local data frame [8 x 4]
Groups: farm
period farm cumVol volume
1 1 A 12 12
2 2 A 20 8
3 3 A 40 20
4 4 A 62 22
5 1 B 30 30
6 2 B 34 4
7 3 B 42 8
8 4 B 58 16
In dplyr -- so you don't have to replace NAs
library(dplyr)
df %>%
group_by(farm)%>%
mutate(volume = c(0,diff(cumVol)))
period farm cumVol other volume
1 1 A 1 1 0
2 2 A 5 2 4
3 3 A 15 3 10
4 4 A 31 4 16
5 1 B 10 5 0
6 2 B 12 6 2
7 3 B 16 7 4
8 4 B 24 8 8
Would creating a new column in your original dataset be an option?
Here is an option using the data.table operator :=.
require("data.table")
DT <- data.table(df)
DT[, volume := c(0,diff(cumVol)), by="farm"]
or
diff_2 <- function(x) c(0,diff(x))
DT[, volume := diff_2(cumVol), by="farm"]
Output:
# > DT
# period farm cumVol other volume
# 1: 1 A 1 1 0
# 2: 2 A 5 2 4
# 3: 3 A 15 3 10
# 4: 4 A 31 4 16
# 5: 1 B 10 5 0
# 6: 2 B 12 6 2
# 7: 3 B 16 7 4
# 8: 4 B 24 8 8
tapply and transform?
transform(df, volumen=unlist(tapply(cumVol, farm, function(x) c(0, diff(x)))))
period farm cumVol other volumen
A1 1 A 1 1 0
A2 2 A 5 2 4
A3 3 A 15 3 10
A4 4 A 31 4 16
B1 1 B 10 5 0
B2 2 B 12 6 2
B3 3 B 16 7 4
B4 4 B 24 8 8
ave is a better option; see @thelatemail's comment
with(df, ave(cumVol,farm,FUN=function(x) c(0,diff(x))) )
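The question also asks for the column names to be supplied through variables. A possible base R sketch of the ave approach with that in mind; group_cols and value_col are illustrative names, and the rows are assumed to be ordered by period within each farm:

```r
df <- data.frame(period = rep(1:4, 2),
                 farm   = rep(c("A", "B"), each = 4),
                 cumVol = c(1, 5, 15, 31, 10, 12, 16, 24),
                 other  = 1:8)

group_cols <- "farm"    # may hold several grouping column names
value_col  <- "cumVol"

# One factor combining all grouping columns; drop = TRUE removes unused combinations
g <- interaction(df[group_cols], drop = TRUE)

# Within each group: 0 for the first row, then successive differences
df$volume <- ave(df[[value_col]], g, FUN = function(v) c(0, diff(v)))
df
```

Columns not named in group_cols or value_col, such as other, are left untouched.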

Resources