An example of the data:
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)   # not seeded, so the counts below are one possible draw
data <- data.frame(Var1, Var2, Count)
This produces:
Var1 Var2 Count
1 X A 89
2 Y B 97
3 Z A 29
4 X B 38
5 Y A 50
6 Z B 88
I would like to divide only the counts where Var2 is B by two, to get:
Var1 Var2 Count Count2
1 X A 89 89
2 Y B 97 48.5
3 Z A 29 29
4 X B 38 19
5 Y A 50 50
6 Z B 88 44
But I'm not sure how to divide based only on the value of another variable.
I'm new to coding, so any help is appreciated!
Base R solution:
data$Count2 <- data$Count ## copy to new variable
## Then change the subset to desired value. LHS subsets, RHS provides change
data$Count2[data$Var2 == "B"] <- data$Count[data$Var2 == "B"]/2
And a tidyverse/dplyr solution:
library(dplyr)
data <- data %>%
  mutate(Count2 = ifelse(Var2 == "B", Count/2, Count))
# alternatively, this is identical to the above
data <- mutate(data, Count2 = ifelse(Var2 == "B", Count/2, Count))
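As a side note, dplyr's type-strict if_else() could be used instead of ifelse(); since Count is integer and Count/2 is double, the false branch then needs an explicit conversion (a minor variation, not from the original answer):
data <- data %>%
  mutate(Count2 = if_else(Var2 == "B", Count/2, as.numeric(Count)))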
A slight variation on the dplyr solution above: use replace to update a portion of the column inside mutate.
library(tidyverse)
library(magrittr)   # the compound-assignment pipe %<>% comes from magrittr
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)
data <- data.frame(Var1, Var2, Count)
data %<>%
  mutate(Count = replace(Count, Var2 == 'B', Count[Var2 == 'B']/2))
With data.table:
library(data.table)
## convert Count to numeric first so the halved (non-integer) values fit the column type
setDT(data)[, Count := as.numeric(Count)][Var2 == 'B', Count := Count/2]
Related
I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub, where the resulting data frame is df_sub plus the observations that occur immediately before and after it.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))
They look like this:
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar.
We could use match to get the indices, add and subtract 1 from those indices, take the unique values, and subset the rows:
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)))
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))), ]
Output:
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
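One caveat: if a matched row were the first or last row of df, the shifted indices would fall outside 1..nrow(df). A small guard (a hypothetical addition, not in the original answer) would be:
i <- unique(v1 + rep(c(-1, 0, 1), each = length(v1)))
df[i[i >= 1 & i <= nrow(df)], ]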
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
  left_join(df_sub %>%
              mutate(flag = TRUE)) %>%
  filter(flag | lag(flag) | lead(flag)) %>%
  select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via join. Subset the data by including minimum row number - 1 and maximum row number + 1.
library(dplyr)
tmp <- df %>%
  mutate(row = row_number()) %>%
  inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56
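Note that the min/max padding works because the matched rows happen to be contiguous here; it would over-select if df_sub matched non-adjacent rows. A sketch that pads each matched row individually instead:
rows <- sort(unique(c(tmp$row - 1, tmp$row, tmp$row + 1)))
df[rows[rows >= 1 & rows <= nrow(df)], ]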
I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"), X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
So for "Weight" we are calculating the sum over all possible pairs of sites, and this separately for each column (whose name ends up in "Type").
I suppose the answer to this has to be using apply on a combn expression, like here Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand, taking the combinations of "Site":
sites <- combn(obs$Site, 2)
Then the individual columns like so:
combA <- combn(obs$X, 2, function(x) sum(x))
and binding those data sets together, but this obviously becomes annoying very quickly.
I have tried to do all the variable columns in one go like this
b <- apply(newdf[, -1], 1, function(x){
  sum(utils::combn(x, 2))
})
but there is something wrong with that.
Can anyone help, please?
One option would be to create a function and then map that function over all the columns that you have.
library(dplyr)
library(purrr)
func1 <- function(var){
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
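For the example data, this should return something like:
Node1 Node2 Weight Type
1 A B 33 X
2 A C 44 X
3 B C 55 X
4 A B 99 Y
5 A C 110 Y
6 B C 121 Y
7 A B 165 Z
8 A C 176 Z
9 B C 187 Z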
Here is an example using combn
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        FUN = function(k) cbind(data.frame(t(obs[k, 1])),
                                stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
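To match the column names requested in the question, assign the do.call(...) result above to res and rename:
names(res) <- c("Node1", "Node2", "Weight", "Type")
res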
Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187
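As a side note, the merge(sites, Type) step is just a Cartesian product of site pairs and types, so tidyr's crossing() could be used instead (row order may differ):
crossing(sites, Type)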
I am new to dplyr and I am struggling with what I believe is a simple function. I have a dataset similar to:
library(dplyr)
dat <- data.frame(t = rep(seq(1, 5, 1), 4),
                  id = rep(c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1, 10), rep(0, 10)))
dat <- arrange(dat, t)
dat <- data.frame(dat, group = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
                                 "D", "C", "A", "D", "C", "A", "A", "C", "C", "B"))
dat
I would like to attach a new column to the dataset dat containing the following operation:
1. For each row (for example row 3, with id == C), take the remaining rows whose value in group differs from the starting id (C in this case).
2. Group those observations by time t.
3. If the id (in this case the id C in row 3) has value 1 in column h: sum all the values of x (from the group based on t) and divide by the standard deviation of the combined x and y values (from the group based on t). If the id has value 0 in column h, place a 0. If there are no such observations, the code should also place a 0.
For example, for id A in row 1 the code should produce a 0, because all the other observations at time t == 1 have group == A. For id B in row 2 the code should produce (11 + 16) / sd(c(11, 16, 61, 66)).
How can I do this in dplyr, or in any other way that does not involve explicit looping? Thank you.
The data looks like
dat
# t id x y h group
# 1 1 A 1 51 1 B
# 2 1 B 6 56 1 A
# 3 1 C 11 61 0 A
# 4 1 D 16 66 0 A
# 5 2 A 2 52 1 A
# 6 2 B 7 57 1 B
# 7 2 C 12 62 0 C
# 8 2 D 17 67 0 D
# 9 3 A 3 53 1 A
# 10 3 B 8 58 1 B
# 11 3 C 13 63 0 D
# 12 3 D 18 68 0 C
# 13 4 A 4 54 1 A
# 14 4 B 9 59 1 D
# 15 4 C 14 64 0 C
# 16 4 D 19 69 0 A
# 17 5 A 5 55 1 A
# 18 5 B 10 60 1 C
# 19 5 C 15 65 0 C
# 20 5 D 20 70 0 B
I tried the following but it does not produce the correct result.
library(purrr)   # map_dbl is used below
dat %>%
  group_by(t) %>%
  mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(), ~
    sd(c(x[-.x], y[-.x])))), 0))
This just illustrates the relative speed of data.table vs dplyr. I took the whole ifelse from the mutate above, packed it into a data.table operation, and grouped by t. So the results will not be the desired ones, but they are at least the same for dplyr and data.table.
library(data.table)
library(dplyr)
library(purrr)   # for map_dbl
datDT <- data.table(dat)
DTF <- function(){
  d <- datDT[, new := ifelse(id != group, h * (sum(x) /
         map_dbl(row_number(x), ~ sd(c(x[-.x], y[-.x])))), 0), by = t]
  d
}
DPF <- function(){
  d <- dat %>%
    group_by(t) %>%
    mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(x), ~
      sd(c(x[-.x], y[-.x])))), 0))
  d
}
dtres = DTF()
dplres = DPF()
all.equal(dtres, data.table(dplres))
library(microbenchmark)
mc <- microbenchmark(times = 100,
DT = DTF(),
DPLYR = DPF()
)
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
DT 7.428605 7.821919 8.324179 8.056762 8.429851 15.39028 100 a
DPLYR 11.154076 11.439025 11.895716 11.720050 12.139022 16.40934 100 b
The gain is not huge, but it is still noticeable, and I'm sure some optimization remains, e.g. setting keys or getting rid of the ifelse, but I leave that to the real data.table experts :).
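For instance, a sketch of those two ideas (not benchmarked here; row_number()/map_dbl() still come from dplyr/purrr):
setkey(datDT, t)   # pre-sort/index by the grouping column
datDT[, new := fifelse(id != group,
                       h * (sum(x) / map_dbl(row_number(x), ~ sd(c(x[-.x], y[-.x])))),
                       0), by = t]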
So if you're new to both, maybe dig into data.table, since you can also use dplyr verbs on data.tables (like below) and be slightly faster than with tbl structures.
dtres %>%
  group_by(t) %>%
  summarise(mN = mean(new))
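Coming back to the question itself, here is one way to implement the stated per-row logic (a sketch checked only against the two worked examples in the question, not an answer from the original thread):
library(dplyr)
library(purrr)
dat %>%
  group_by(t) %>%
  mutate(new = map_dbl(row_number(), function(i) {
    # other rows at this t whose group differs from this row's id
    idx <- setdiff(which(group != id[i]), i)
    if (h[i] == 0 || length(idx) == 0) return(0)
    sum(x[idx]) / sd(c(x[idx], y[idx]))
  })) %>%
  ungroup()
# row 1 (id A): no other row at t == 1 has group != A, so 0
# row 2 (id B): (11 + 16) / sd(c(11, 16, 61, 66))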
Is there a function in dplyr that allows you to test the same condition against a selection of columns?
Take the following dataframe:
Demo1 <- c(8,9,10,11)
Demo2 <- c(13,14,15,16)
Condition <- c('A', 'A', 'B', 'B')
Var1 <- c(13,76,105,64)
Var2 <- c(12,101,23,23)
Var3 <- c(5,5,5,5)
df <- as.data.frame(cbind(Demo1, Demo2, Condition, Var1, Var2, Var3), stringsAsFactors = F)
df[4:6] <- lapply(df[4:6], as.numeric)
I want to take all the rows in which there is at least one value greater than 100 in any of Var1, Var2, or Var3. I realise that I could do this with a series of or statements, like so:
df <- df %>%
filter(Var1 > 100 | Var2 > 100 | Var3 > 100)
However, since I have quite a few columns in my actual dataset this would be time-consuming. I am assuming that there is some reasonably straightforward way to do this but haven't been able to find a solution on SO.
We can do this with filter_at and any_vars:
df %>%
  filter_at(vars(matches("^Var")), any_vars(. > 100))
# Demo1 Demo2 Condition Var1 Var2 Var3
#1 9 14 A 76 101 5
#2 10 15 B 105 23 5
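In more recent versions of dplyr (>= 1.0.4), the superseded filter_at()/any_vars() pair can be replaced by if_any():
df %>%
  filter(if_any(starts_with("Var"), ~ .x > 100))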
Or, using base R, create a logical index with lapply and Reduce and use it to subset the rows:
df[Reduce(`|`, lapply(df[grepl("^Var", names(df))], `>`, 100)), ]
In base R, one can write the same filter using rowSums:
df[rowSums((df[,grepl("^Var",names(df))] > 100)) >= 1, ]
# Demo1 Demo2 Condition Var1 Var2 Var3
# 2 9 14 A 76 101 5
# 3 10 15 B 105 23 5
I want to add a column to a data frame holding a cumulative sum of another variable, accumulated only while yet another variable is equal across consecutive rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column if the Var1 value in row 2 equals the Var1 value in row 1; in other words, if it is equal to the observation before.
If the cumsum is based on Var1 as a grouping variable:
library(dplyr)
df %>%
  group_by(Var1) %>%
  mutate(CumVal = cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal = ave(Var2, cumsum(c(TRUE, Var1[-1] != Var1[-nrow(df)])), FUN = cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach:
df %>%
  group_by(indx = cumsum(c(TRUE, (lag(Var1) != Var1)[-1]))) %>%
  mutate(CumVal = cumsum(Var2)) %>%
  ungroup() %>%
  select(-indx)
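A data.table version of the adjacent-run grouping uses rleid(), which builds the same run index in one step:
library(data.table)
setDT(df)[, CumVal := cumsum(Var2), by = rleid(Var1)]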
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a nice, compact way. E.g., let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that there are two lengths different from 1, namely 4 and 2, corresponding to the runs of 2s and 0s)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, but only within the runs of repeated values in x (run membership will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
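For this example, f is 1 2 3 3 3 3 4 4 5 6, so the result is (names come from unlist):
cumval
#  1  2 31 32 33 34 41 42  5  6
#  1  2  3  7 12 18  7 15  9 10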