An example of the data:
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)   # not seeded, so the counts below are one possible draw
data <- data.frame(Var1, Var2, Count)
This produces:
Var1 Var2 Count
1 X A 89
2 Y B 97
3 Z A 29
4 X B 38
5 Y A 50
6 Z B 88
I would like to divide only the counts where Var2 is B by two, to get:
Var1 Var2 Count Count2
1 X A 89 89
2 Y B 97 48.5
3 Z A 29 29
4 X B 38 19
5 Y A 50 50
6 Z B 88 44
But I'm not sure how to divide based only on the value of another variable.
I'm new to coding, so any help is appreciated!
Base R solution:
data$Count2 <- data$Count ## copy to new variable
## Then change the subset to desired value. LHS subsets, RHS provides change
data$Count2[data$Var2 == "B"] <- data$Count[data$Var2 == "B"]/2
And a tidyverse/dplyr solution:
library(dplyr)
data <- data %>%
  mutate(Count2 = ifelse(Var2 == "B", Count/2, Count))
# alternatively, this is identical to the above
data <- mutate(data, Count2 = ifelse(Var2 == "B", Count/2, Count))
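As a side note, dplyr's type-strict if_else() could be used instead of ifelse(); since Count is integer and Count/2 is double, the false branch then needs an explicit conversion (a minor variation, not from the original answer):
data <- data %>%
  mutate(Count2 = if_else(Var2 == "B", Count/2, as.numeric(Count)))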
A slight variation on the dplyr solution above: use replace to update a portion of the column inside mutate.
library(tidyverse)
library(magrittr)   # the compound-assignment pipe %<>% comes from magrittr
Var1 <- rep(c("X", "Y", "Z"), 2)
Var2 <- rep(c("A", "B"), 3)
Count <- sample(10:100, 6)
data <- data.frame(Var1, Var2, Count)
data %<>%
  mutate(Count = replace(Count, Var2 == 'B', Count[Var2 == 'B']/2))
With data.table:
library(data.table)
## convert Count to numeric first so the halved (non-integer) values fit the column type
setDT(data)[, Count := as.numeric(Count)][Var2 == 'B', Count := Count/2]
Related
I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub, where the resulting data frame is df_sub plus the observations that occur immediately before and after it.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))
They look like this:
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar.
We could use match to get the indices, add and subtract 1 from those indices, take the unique values, and subset the rows:
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)))
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))), ]
Output:
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
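One caveat: if a matched row were the first or last row of df, the shifted indices would fall outside 1..nrow(df). A small guard (a hypothetical addition, not in the original answer) would be:
i <- unique(v1 + rep(c(-1, 0, 1), each = length(v1)))
df[i[i >= 1 & i <= nrow(df)], ]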
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
  left_join(df_sub %>%
              mutate(flag = TRUE)) %>%
  filter(flag | lag(flag) | lead(flag)) %>%
  select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via join. Subset the data by including minimum row number - 1 and maximum row number + 1.
library(dplyr)
tmp <- df %>%
  mutate(row = row_number()) %>%
  inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56
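Note that the min/max padding works because the matched rows happen to be contiguous here; it would over-select if df_sub matched non-adjacent rows. A sketch that pads each matched row individually instead:
rows <- sort(unique(c(tmp$row - 1, tmp$row, tmp$row + 1)))
df[rows[rows >= 1 & rows <= nrow(df)], ]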
I have a data frame that has percentage values for a number of variables and observations, as follows:
obs <- data.frame(Site = c("A", "B", "C"), X = c(11, 22, 33), Y = c(44, 55, 66), Z = c(77, 88, 99))
I need to prepare this data as an edge list for network analysis, with "Site" as the nodes and the remaining variables as the edges. The result should look like this:
Node1 Node2 Weight Type
A B 33 X
A C 44 X
...
B C 187 Z
So for "Weight" we are calculating the sum over all possible pairs of sites, and this separately for each column (whose name ends up in "Type").
I suppose the answer to this has to be using apply on a combn expression, like here Applying combn() function to data frame, but I haven't quite been able to work it out.
I can do this all by hand, taking the combinations of "Site":
sites <- combn(obs$Site, 2)
Then the individual columns like so:
combA <- combn(obs$X, 2, function(x) sum(x))
and binding those data sets together, but this obviously becomes annoying very quickly.
I have tried to do all the variable columns in one go like this
b <- apply(newdf[, -1], 1, function(x){
  sum(utils::combn(x, 2))
})
but there is something wrong with that.
Can anyone help, please?
One option would be to create a function and then map that function over all the columns that you have.
library(dplyr)
library(purrr)
func1 <- function(var){
  obs %>%
    transmute(Node1 = combn(Site, 2)[1, ],
              Node2 = combn(Site, 2)[2, ],
              Weight = combn(!!sym(var), 2, function(x) sum(x)),
              Type = var)
}
map(colnames(obs)[-1], func1) %>% bind_rows()
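For the example data, this should return something like:
Node1 Node2 Weight Type
1 A B 33 X
2 A C 44 X
3 B C 55 X
4 A B 99 Y
5 A C 110 Y
6 B C 121 Y
7 A B 165 Z
8 A C 176 Z
9 B C 187 Z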
Here is an example using combn
do.call(
  rbind,
  combn(1:nrow(obs),
        2,
        FUN = function(k) cbind(data.frame(t(obs[k, 1])),
                                stack(data.frame(as.list(colSums(obs[k, -1]))))),
        simplify = FALSE
  )
)
which gives
X1 X2 values ind
1 A B 33 X
2 A B 99 Y
3 A B 165 Z
4 A C 44 X
5 A C 110 Y
6 A C 176 Z
7 B C 55 X
8 B C 121 Y
9 B C 187 Z
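To match the column names requested in the question, assign the do.call(...) result above to res and rename:
names(res) <- c("Node1", "Node2", "Weight", "Type")
res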
Try it this way:
library(tidyverse)
obs_long <- obs %>% pivot_longer(-Site, names_to = "type")
sites <- combn(obs$Site, 2) %>% t() %>% as_tibble()
Type <- tibble(type = c("X", "Y", "Z"))
merge(sites, Type) %>%
  left_join(obs_long, by = c("V1" = "Site", "type" = "type")) %>%
  left_join(obs_long, by = c("V2" = "Site", "type" = "type")) %>%
  mutate(res = value.x + value.y) %>%
  select(-c(value.x, value.y))
V1 V2 type res
1 A B X 33
2 A C X 44
3 B C X 55
4 A B Y 99
5 A C Y 110
6 B C Y 121
7 A B Z 165
8 A C Z 176
9 B C Z 187
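As a side note, the merge(sites, Type) step is just a Cartesian product of site pairs and types, so tidyr's crossing() could be used instead (row order may differ):
crossing(sites, Type)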
I am new to dplyr and I am struggling with what I believe is a simple function. I have a dataset similar to:
library(dplyr)
dat <- data.frame(t = rep(seq(1, 5, 1), 4),
                  id = rep(c(rep("A", 5), rep("B", 5), rep("C", 5), rep("D", 5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1, 10), rep(0, 10)))
dat <- arrange(dat, t)
dat <- data.frame(dat, group = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
                                 "D", "C", "A", "D", "C", "A", "A", "C", "C", "B"))
dat
I would like to attach a new column to the dataset dat containing the following operation:
1. For each row (for example row 3, with id == C), take the remaining rows whose value in group differs from the starting id (C in this case).
2. Group those observations by time t.
3. If the id (in this case the id C in row 3) has value 1 in column h: sum all the values of x (from the group based on t) and divide by the standard deviation of the combined x and y values (from the group based on t). If the id has value 0 in column h, place a 0. If there are no such observations, the code should also place a 0.
For example, for id A in row 1 the code should produce a 0, because all the other observations at time t == 1 have group == A. For id B in row 2 the code should produce (11 + 16) / sd(c(11, 16, 61, 66)).
How can I do this in dplyr, or in any other way that does not involve explicit looping? Thank you.
The data looks like
dat
# t id x y h group
# 1 1 A 1 51 1 B
# 2 1 B 6 56 1 A
# 3 1 C 11 61 0 A
# 4 1 D 16 66 0 A
# 5 2 A 2 52 1 A
# 6 2 B 7 57 1 B
# 7 2 C 12 62 0 C
# 8 2 D 17 67 0 D
# 9 3 A 3 53 1 A
# 10 3 B 8 58 1 B
# 11 3 C 13 63 0 D
# 12 3 D 18 68 0 C
# 13 4 A 4 54 1 A
# 14 4 B 9 59 1 D
# 15 4 C 14 64 0 C
# 16 4 D 19 69 0 A
# 17 5 A 5 55 1 A
# 18 5 B 10 60 1 C
# 19 5 C 15 65 0 C
# 20 5 D 20 70 0 B
I tried the following but it does not produce the correct result.
library(purrr)   # map_dbl is used below
dat %>%
  group_by(t) %>%
  mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(), ~
    sd(c(x[-.x], y[-.x])))), 0))
This just illustrates the relative speed of data.table vs dplyr. I took the whole ifelse from the mutate above, packed it into a data.table operation, and grouped by t. So the results will not be the desired ones, but they are at least the same for dplyr and data.table.
library(data.table)
library(dplyr)
library(purrr)   # for map_dbl
datDT <- data.table(dat)
DTF <- function(){
  d <- datDT[, new := ifelse(id != group, h * (sum(x) /
         map_dbl(row_number(x), ~ sd(c(x[-.x], y[-.x])))), 0), by = t]
  d
}
DPF <- function(){
  d <- dat %>%
    group_by(t) %>%
    mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(x), ~
      sd(c(x[-.x], y[-.x])))), 0))
  d
}
dtres = DTF()
dplres = DPF()
all.equal(dtres, data.table(dplres))
library(microbenchmark)
mc <- microbenchmark(times = 100,
DT = DTF(),
DPLYR = DPF()
)
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
DT 7.428605 7.821919 8.324179 8.056762 8.429851 15.39028 100 a
DPLYR 11.154076 11.439025 11.895716 11.720050 12.139022 16.40934 100 b
The gain is not huge, but it is still noticeable, and I'm sure some optimization remains, e.g. setting keys or getting rid of the ifelse, but I leave that to the real data.table experts :).
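For instance, a sketch of those two ideas (not benchmarked here; row_number()/map_dbl() still come from dplyr/purrr):
setkey(datDT, t)   # pre-sort/index by the grouping column
datDT[, new := fifelse(id != group,
                       h * (sum(x) / map_dbl(row_number(x), ~ sd(c(x[-.x], y[-.x])))),
                       0), by = t]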
So if you're new to both, maybe dig into data.table, since you can also use dplyr verbs on data.tables (like below) and be slightly faster than with tbl structures.
dtres %>%
  group_by(t) %>%
  summarise(mN = mean(new))
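Coming back to the question itself, here is one way to implement the stated per-row logic (a sketch checked only against the two worked examples in the question, not an answer from the original thread):
library(dplyr)
library(purrr)
dat %>%
  group_by(t) %>%
  mutate(new = map_dbl(row_number(), function(i) {
    # other rows at this t whose group differs from this row's id
    idx <- setdiff(which(group != id[i]), i)
    if (h[i] == 0 || length(idx) == 0) return(0)
    sum(x[idx]) / sd(c(x[idx], y[idx]))
  })) %>%
  ungroup()
# row 1 (id A): no other row at t == 1 has group != A, so 0
# row 2 (id B): (11 + 16) / sd(c(11, 16, 61, 66))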
Is there a function in dplyr that allows you to test the same condition against a selection of columns?
Take the following dataframe:
Demo1 <- c(8,9,10,11)
Demo2 <- c(13,14,15,16)
Condition <- c('A', 'A', 'B', 'B')
Var1 <- c(13,76,105,64)
Var2 <- c(12,101,23,23)
Var3 <- c(5,5,5,5)
df <- as.data.frame(cbind(Demo1, Demo2, Condition, Var1, Var2, Var3), stringsAsFactors = F)
df[4:6] <- lapply(df[4:6], as.numeric)
I want to take all the rows in which there is at least one value greater than 100 in any of Var1, Var2, or Var3. I realise that I could do this with a series of or statements, like so:
df <- df %>%
filter(Var1 > 100 | Var2 > 100 | Var3 > 100)
However, since I have quite a few columns in my actual dataset this would be time-consuming. I am assuming that there is some reasonably straightforward way to do this but haven't been able to find a solution on SO.
We can do this with filter_at and any_vars:
df %>%
  filter_at(vars(matches("^Var")), any_vars(. > 100))
# Demo1 Demo2 Condition Var1 Var2 Var3
#1 9 14 A 76 101 5
#2 10 15 B 105 23 5
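In more recent versions of dplyr (>= 1.0.4), the superseded filter_at()/any_vars() pair can be replaced by if_any():
df %>%
  filter(if_any(starts_with("Var"), ~ .x > 100))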
Or, using base R, create a logical index with lapply and Reduce and use it to subset the rows:
df[Reduce(`|`, lapply(df[grepl("^Var", names(df))], `>`, 100)), ]
In base R, one can write the same filter using rowSums:
df[rowSums((df[,grepl("^Var",names(df))] > 100)) >= 1, ]
# Demo1 Demo2 Condition Var1 Var2 Var3
# 2 9 14 A 76 101 5
# 3 10 15 B 105 23 5
I want to add a column to a data frame holding a cumulative sum of another variable, accumulated only while yet another variable is equal across consecutive rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column if the Var1 value in row 2 equals the Var1 value in row 1; in other words, if it is equal to the observation before.
If the cumsum is based on Var1 as a grouping variable:
library(dplyr)
df %>%
  group_by(Var1) %>%
  mutate(CumVal = cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal = ave(Var2, cumsum(c(TRUE, Var1[-1] != Var1[-nrow(df)])), FUN = cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach:
df %>%
  group_by(indx = cumsum(c(TRUE, (lag(Var1) != Var1)[-1]))) %>%
  mutate(CumVal = cumsum(Var2)) %>%
  ungroup() %>%
  select(-indx)
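A data.table version of the adjacent-run grouping uses rleid(), which builds the same run index in one step:
library(data.table)
setDT(df)[, CumVal := cumsum(Var2), by = rleid(Var1)]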
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a nice, compact way. E.g., let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that there are two lengths different from 1, namely 4 and 2, corresponding to the runs of 2s and 0s)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, but only within the runs of repeated values in x (run membership will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
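For this example, f is 1 2 3 3 3 3 4 4 5 6, so the result is (names come from unlist):
cumval
#  1  2 31 32 33 34 41 42  5  6
#  1  2  3  7 12 18  7 15  9 10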