I want to add a column to a dataframe that makes a cumulated sum of another variable if yet another variable is equal for two rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column, if Var1 obs for row 2 equals Var1 obs for row 1. With other words, if it is equal to the obs before.
If the cumsum is based on the Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal= ave(Var2, cumsum(c(TRUE,Var1[-1]!=
Var1[-nrow(df)])), FUN=cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects similar successive values in a vector and describe it in a nice synthetic way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can that there are 2 lengths different from 1 corresponding to values 4 and 2)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumcum on a new vector y <- 1:10, but only for repeated values of x (which will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else to achieve the desired output):
cumval <- unlist(tapply(y, f, cumsum))
Related
I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub where the resulting data frame is going to be df_sub plus the observations that occur before and after.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
var2 = c(2, 45))
They look like
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar
We could use match to get the index, then add or subtract 1 on those index, take the unique and subset the rows
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)) )
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))),]
-output
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
left_join(df_sub %>%
mutate(flag = TRUE)) %>%
filter(flag|lag(flag)|lead(flag)) %>%
select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via join. Subset the data by including minimum row number - 1 and maximum row number + 1.
library(dplyr)
tmp <- df %>%
mutate(row = row_number()) %>%
inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56
So I have this big data set with 32 variables and I need to work with relative values of these variables using all possible subtractions among them. Ex. var1-var2...var1-var32; var3-var4...var3-var32, and so on. I'm new in R, so I would like to do this without going full manually on the process. I'm out of idea, other than doing all manually. Any help appreciated! Thanks!
Ex:
df_original
id
Var1
Var2
Var3
x
1
3
2
y
2
5
7
df_wanted
id
Var1
Var2
Var3
Var1-Var2
Var1-Var3
Var2-Var3
x
1
3
2
-2
-1
1
y
2
5
7
-3
-5
-2
You can do this combn which will create combination of columns taking 2 at a time. In combn you can apply a function to every combination where we can subtract the two columns from the dataframe and add the result as new columns.
cols <- grep('Var', names(df), value = TRUE)
new_df <- cbind(df, do.call(cbind, combn(cols, 2, function(x) {
setNames(data.frame(df[x[1]] - df[x[2]]), paste0(x, collapse = '-'))
}, simplify = FALSE)))
new_df
# id Var1 Var2 Var3 Var1-Var2 Var1-Var3 Var2-Var3
#1 x 1 3 2 -2 -1 1
#2 y 2 5 7 -3 -5 -2
data
df <- structure(list(id = c("x", "y"), Var1 = 1:2, Var2 = c(3L, 5L),
Var3 = c(2L, 7L)), class = "data.frame", row.names = c(NA, -2L))
I would like to add values to a column based on non-unique values in another column. For example, say I have a dataframe with a currently empty column that looks like this:
Site
Species Richness
A
0
A
0
A
0
B
0
B
0
I want to assign known species richness values for each site. Let's say site A has species richness 3, and site B has species richness 5. I would like the output to be:
Site
Species Richness
A
3
A
3
A
3
B
5
B
5
How do I input species richness values for specific sites?
I've tried this:
rows_update(df, tibble(Site = A, richness = 3))
rows_update(df, tibble(Site = B, richness = 5))
But I get an error message saying "'x' key values are not unique"
Any help would be appreciated!
Here, we could make use of join on from data.table and assign := the corresponding column of 'SpeciesRichness'. It would be more efficient
library(data.table)
setDT(df)[data.table(Site = c('A','B'), SpeciesRichness = c(3, 5)),
SpeciesRichness := i.SpeciesRichness, on = .(Site)]
The issue with ?rows_update is that the by column should be uniquely identifying in both data.
The two tables are matched by a set of key variables whose values must uniquely identify each row.
With 'df', the values are replicated 3 times for 'A' and 2 for 'B'. Using dplyr, we can do a left_join
library(dplyr)
df %>%
left_join(tibble(Site = c('A', "B"), new = c(3, 5))) %>%
transmute(Site, SpeciesRichness = new)
-output
# Site SpeciesRichness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
data
df <- structure(list(Site = c("A", "A", "A", "B", "B"),
SpeciesRichness = c(0L,
0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
You can create a dataframe with Site and Richness value and join them together.
In base R :
df1 <- data.frame(Site = rep(c('A', 'B'), c(3, 2)))
df2 <- data.frame(Site = c('A', 'B'), richness = c(3, 5))
df1 <- merge(df1, df2)
df1
# Site richness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
You can also use match :
df1$richness <- df2$richness[match(df1$Site, df2$Site)]
You could define the values then use case_when
x <- 3
y <- 5
df %>%
mutate(SpeciesRichness= case_when(Site=="A" ~ x,
Site=="B" ~ y))
Output:
Site SpeciesRichness
1 A 3
2 A 3
3 A 3
4 B 5
5 B 5
I'm not quite sure even how to google my question, so I think an example is best to explain what I want to achieve. In summary, I'd like to multiply each value of a data frame, grouped by some variable, and the multiplication value will depend on which group is it. I'll place down an example:
data <- data.frame(group = c("a", "b", "c"), value = c(1, 2, 3))
multiplier <- c(a = 1, b = 2, c = 3)
data %>%
group_by(group) %>%
// Something that multiplies the value column by the corresponding multiplier contained in the vector
EDIT:
The expected returned values to replace the value column should be 1, 4, 9 respectively to the order.
I think this should do it:
library(dplyr)
data %>%
mutate(value = value * multiplier[as.character(group)])
# group value
#1 a 1
#2 b 4
#3 c 9
Alternatively, you could append multiplier as a column of data and then calculate.
data %>%
mutate(mutiplier = multiplier[as.character(group)]) %>%
mutate(new.value = value * multiplier)
# group value mutiplier new.value
#1 a 1 1 1
#2 b 2 2 4
#3 c 3 3 9
In base R, we can do
transform(merge(data, stack(multiplier), by.x = 'group', by.y = 'ind'),
value = value * values)[-3]
# group value
#1 a 1
#2 b 4
#3 c 9
I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
group_by(rleid) %>%
summarise(State = first(State), min = min(ID), max = max(ID)) %>%
select(-rleid)
Or as per mentioned by #alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:
df %>%
group_by(State, rleid = data.table::rleid(State)) %>%
summarise_all(funs(min, max)) %>%
select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=c(1, head(cumsum(temp$lengths) + 1, -1)),
max.ID=cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As #cryo111 notes in the comments below, the data set might be unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would need to first convert the timestamps to a date-time format, with a function like as.POSIXct, use df <- df[order(df$ID),], and then employ a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
max.ID=df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of IDby the tuple rle and State.
you can remove the rle column with
dt2[,rle:=NULL]
Chained:
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))