I have the following data frame
df <- data.frame(Gender = c(rep(c("M","F"),each=4)),
DiffA=c(1,1,-1,-1,1,1,1,-1),
DiffB=c(1,-1,1,-1,1,1,1,-1))
I would like to create two new variables that summarize, for each gender, (i) the number of rows in which DiffA and DiffB are both positive and (ii) the number of rows in which both are negative, in order to obtain:
df2 <- data.frame(Gender = c("M","F"),
Diff_Pos=c(1,3),
Diff_Neg=c(1,1))
I have failed to combine the dplyr summary function n(), which returns the row count, with the required logical condition. Thanks in advance.
I would consider doing
library(dplyr)
library(tidyr)
df %>% filter(DiffA == DiffB) %>% count(Gender, DiffA) %>% spread(DiffA, n)
#   Gender    -1     1
#   (fctr) (int) (int)
# 1      F     1     3
# 2      M     1     1
The analogous data.table code is
library(data.table)
dcast(setDT(df)[DiffA == DiffB, .N, by=.(Gender, DiffA)], Gender ~ DiffA, value.var = "N")
# Gender -1 1
# 1: F 1 3
# 2: M 1 1
If your real data goes beyond -1 and 1, wrap the relevant columns in sign().
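As a rough illustration of the sign() suggestion, here is a base R sketch on made-up values. The data frame df_wide and its numbers are hypothetical, not taken from the question. Note that sign() returns 0 for zeros, so rows where both differences are exactly 0 would appear in a separate 0 column.

```r
# Hypothetical data with magnitudes other than -1 and 1
df_wide <- data.frame(
  Gender = rep(c("M", "F"), each = 4),
  DiffA  = c(2.5, 0.3, -4, -1, 7, 1, 3, -2),
  DiffB  = c(1, -6, 2, -0.5, 2, 9, 1, -8)
)

# Reduce each difference to its sign (-1, 0 or 1)
signed <- transform(df_wide, SignA = sign(DiffA), SignB = sign(DiffB))

# Keep rows where both differences share a sign, then count by gender
same_sign <- subset(signed, SignA == SignB)
counts <- table(Gender = same_sign$Gender, Sign = same_sign$SignA)
counts
```

The columns of counts are labelled "-1" and "1", so the layout matches the tables shown above.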
Here is a base R option
with(subset(df, DiffA==DiffB), table(Gender, DiffA))
# DiffA
#Gender -1 1
# F 1 3
# M 1 1
This should work:
df %>%
dplyr::mutate(
Diff_Pos = DiffA > 0 & DiffB > 0,
Diff_Neg = DiffA < 0 & DiffB < 0) %>%
dplyr::group_by(Gender) %>%
dplyr::summarise(
Diff_Pos = sum(Diff_Pos),
Diff_Neg = sum(Diff_Neg))
I am trying to analyse a dataframe where every row represents a timeseries. My df is structured as follows:
df <- data.frame(key = c("10A", "11xy", "445pe"),
Obs1 = c(0, 22, 0),
Obs2 = c(10, 0, 0),
Obs3 = c(0, 3, 5),
Obs4 = c(0, 10, 0)
)
I would now like to create a new dataframe, where every row represents again the key, and the columns consist of the following results:
"TotalZeros": counts the total number of zeros for each row (=key)
"LeadingZeros": counts the number of zeros before the first nonzero obs for each row
This means I would like to receive the following dataframe in the end:
key TotalZeros LeadingZeros
10A 3 1
11xy 1 0
445pe 3 2
I managed to count the total number of zeros for each row:
zeroCountDf <- data.frame(key = df$key, TotalZeros = rowSums(df[-1] == 0))
But I am struggling with counting the LeadingZeros. I found how to get the position of the first nonzero value in a vector, but I don't understand how to apply this approach to my dataframe:
vec <- c(0,1,1)
min(which(vec != 0)) # returns 2, meaning the second position is first nonzero value
Can anyone explain how to count leading zeros for every row in a dataframe? I am new to R and thankful for any insight and tips. Thanks in advance.
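One way to carry the which() idea over to every row is apply() with a small helper. This is only a sketch: the helper name count_leading_zeros is made up for illustration, and it assumes that a row consisting entirely of zeros should count all of its values as leading zeros.

```r
df <- data.frame(key  = c("10A", "11xy", "445pe"),
                 Obs1 = c(0, 22, 0),
                 Obs2 = c(10, 0, 0),
                 Obs3 = c(0, 3, 5),
                 Obs4 = c(0, 10, 0))

# Hypothetical helper: zeros before the first nonzero value
count_leading_zeros <- function(x) {
  nonzero <- which(x != 0)
  # Assumption: an all-zero row counts every value as a leading zero
  if (length(nonzero) == 0) length(x) else nonzero[1] - 1
}

obs <- as.matrix(df[-1])  # drop the key column before comparing with 0

result <- data.frame(
  key          = df$key,
  TotalZeros   = rowSums(obs == 0),
  LeadingZeros = apply(obs, 1, count_leading_zeros)
)
result
```

Using nonzero[1] instead of min(which(...)) avoids the warning that min() gives on an empty vector.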
We could use rowCumsums from matrixStats along with rowSums
library(matrixStats)
cbind(df[1], total_zeros = rowSums(df[-1] == 0),
Leading_zeros = rowSums(!rowCumsums(df[-1] != 0)))
Output:
key total_zeros Leading_zeros
1 10A 3 1
2 11xy 1 0
3 445pe 3 2
or in tidyverse, we may also use rowwise
library(dplyr)
df %>%
mutate(total_zeros = rowSums(select(., starts_with("Obs")) == 0)) %>%
rowwise %>%
transmute(key, total_zeros,
Leading_zeros = sum(!cumsum(c_across(starts_with('Obs')) != 0))) %>%
ungroup
Output:
# A tibble: 3 x 3
key total_zeros Leading_zeros
<chr> <dbl> <int>
1 10A 3 1
2 11xy 1 0
3 445pe 3 2
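To see why the cumulative-sum idiom used above counts leading zeros, it may help to walk through it on a single vector. The vector x below is simply the Obs values of key 445pe; the intermediate variable names are illustrative.

```r
x <- c(0, 0, 5, 0)

# Running count of nonzero values seen so far; it stays 0 until the first nonzero
seen_nonzero <- cumsum(x != 0)    # 0 0 1 1

# Negating a number gives TRUE only for 0, i.e. for the leading zeros
is_leading <- !seen_nonzero       # TRUE TRUE FALSE FALSE

# Summing the logical vector counts the TRUE values
leading_zeros <- sum(is_leading)  # 2
```

The zero after the 5 is not counted, because the running count has already moved past 0 at that point.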
Edit: Added Miff's comment to the solution.
Here is a tidyverse solution:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(starts_with("Obs"),
names_pattern = "Obs(\\d+)") %>%
arrange(key, as.integer(name)) %>%
group_by(key) %>%
summarize(
leading_zeros = sum(cumsum(abs(value)) == 0),
total_zeros = sum(value == 0),
trailing_zeros = sum(cumsum(abs(value)) == last(cumsum(abs(value)))) - 1)
This returns
# A tibble: 3 x 4
key leading_zeros total_zeros trailing_zeros
<chr> <int> <int> <dbl>
1 10A 1 3 2
2 11xy 0 1 0
3 445pe 2 3 1
A data.table option
library(data.table)
setDT(df)[
, .(
total_zeros = rowSums(.SD == 0),
Leading_zeros = which.max(.SD != 0) - 1,
Trailing_zeros = length(.SD)-max(which(.SD!=0))
),
key
]
gives
key total_zeros Leading_zeros Trailing_zeros
1: 10A 3 1 2
2: 11xy 1 0 0
3: 445pe 3 2 1
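Trailing zeros can also be viewed as the leading zeros of the reversed row. A minimal base R sketch of that idea follows; the helper names are hypothetical, and an all-zero row is assumed to count fully in both directions.

```r
# Hypothetical helpers; an all-zero vector yields its full length
count_leading_zeros  <- function(x) sum(cumsum(x != 0) == 0)
count_trailing_zeros <- function(x) count_leading_zeros(rev(x))

# Observation values from the question, one row per key
obs <- rbind(`10A`   = c(0, 10, 0, 0),
             `11xy`  = c(22, 0, 3, 10),
             `445pe` = c(0, 0, 5, 0))

trailing <- apply(obs, 1, count_trailing_zeros)
trailing
```

Unlike max(which(x != 0)), this version does not warn when a row contains no nonzero value.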
I have a data.frame with more than 50 columns and 10,000 rows. I want to select the columns that contain 0 or 1, and exclude the other values in those columns.
A sample data.frame is below:
dummy_df <- data.frame(
id=1:4,
gender=c(4,1,0,1),
height=seq(150, 180,by = 10),
smoking=c(3,0,1,0)
)
I want to select all the columns that contain 0 or 1 values and exclude other values, such as the 4 in gender and the 3 in smoking, to get the following:
gender smoking
1 0
0 1
1 0
But I have 50 columns in the actual data frame and I don't know which of them contain 0 or 1.
What I have tried:
dummy_df %>% select_if(~ all( . %in% 0:1))
Is this useful for you?
library(dplyr)
dummy_df %>%
  select(- c(id, height)) %>%
  rowwise() %>%
  filter(any(c_across(everything()) == 0)|any(c_across(everything()) == 1))
# A tibble: 3 x 2
# Rowwise:
gender smoking
<dbl> <dbl>
1 1 0
2 0 1
3 1 0
EDIT:
If you don't know in advance which cols contain 0 and/or 1, you can determine that in base R:
temp <- dummy_df[sapply(dummy_df, function(x) any(x == 0|x == 1))]
Now you can filter for rows with 0 and/or 1:
temp %>%
rowwise() %>%
  filter(any(c_across(everything()) == 0)|any(c_across(everything()) == 1))
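One caveat with the any() test: in the sample data it also selects id, because id contains a 1. A possible stricter variant, sketched below in base R, keeps only columns where 0/1 values make up most of the column. The 0.5 threshold and the variable names are assumptions made for illustration, not part of the question, and would need tuning on real data.

```r
dummy_df <- data.frame(
  id      = 1:4,
  gender  = c(4, 1, 0, 1),
  height  = seq(150, 180, by = 10),
  smoking = c(3, 0, 1, 0)
)

# Share of values in each column that are 0 or 1
share_binary <- sapply(dummy_df, function(x) mean(x %in% c(0, 1)))

# Hypothetical rule: treat a column as binary if at least half its values are 0/1
binary_cols <- names(share_binary)[share_binary >= 0.5]
temp <- dummy_df[binary_cols]

# Keep only rows where every selected column holds 0 or 1
keep_rows <- apply(temp, 1, function(r) all(r %in% c(0, 1)))
clean <- temp[keep_rows, , drop = FALSE]
clean
```

On the sample data this keeps gender and smoking and drops the first row, which holds the 4 and the 3.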
I think it's more like a case of filter than select:
library(dplyr)
dummy_df %>%
filter(if_all(c(gender, smoking), ~ .x %in% c(0, 1)))
id gender height smoking
1 2 1 160 0
2 3 0 170 1
3 4 1 180 0
I have some groups of data and in each group there is one number that is a multiple of 7.
For each group, I want to subtract the first value from that multiple.
Reproducible example:
temp.df <- data.frame("temp" = c(48:55, 70:72, 93:99))
temp.df$group <- cumsum(c(TRUE, diff(temp.df$temp) > 1))
Expected result:
group 1: 49-48 = 1
group 2: 70-70 = 0
group 3: 98-93 = 5
Can you suggest a way that does not require a loop?
You can get the number divisible by 7 in each group and subtract the first value from it.
This can be done in base R using aggregate.
aggregate(temp~group, temp.df, function(x) x[x %% 7 == 0] - x[1])
# group temp
#1 1 1
#2 2 0
#3 3 5
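For readers who want to see the individual steps, here is a sketch that works through one group by hand and then applies the same function to all groups with tapply(). It assumes, as the question states, that each group contains exactly one multiple of 7; the intermediate names are illustrative.

```r
temp.df <- data.frame(temp = c(48:55, 70:72, 93:99))
temp.df$group <- cumsum(c(TRUE, diff(temp.df$temp) > 1))

# One group, step by step
x <- temp.df$temp[temp.df$group == 1]  # the values 48 to 55
is_multiple <- x %% 7 == 0             # TRUE only for 49
first_diff <- x[is_multiple] - x[1]    # 49 - 48

# All groups at once; assumes exactly one multiple of 7 per group
diffs <- tapply(temp.df$temp, temp.df$group,
                function(x) x[x %% 7 == 0] - x[1])
diffs
```

If a group had no multiple of 7, or more than one, the function would return a vector of a different length, so that case would need separate handling.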
You can also do this using dplyr
library(dplyr)
temp.df %>%
group_by(group) %>%
summarise(temp = temp[temp %% 7 == 0] - first(temp))
and data.table
library(data.table)
setDT(temp.df)[, .(temp = temp[temp %% 7 == 0] - first(temp)), group]
We can also do
library(dplyr)
temp.df %>%
group_by(group) %>%
summarise(temp = temp[which.max(!temp %% 7)] - first(temp))
# A tibble: 3 x 2
# group temp
# <int> <int>
#1 1 1
#2 2 0
#3 3 5
This feels like it should be more straightforward and I'm just missing something. The goal is to filter the data into a new df where both var values 1 & 2 are represented in the group
here's some toy data:
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E",2))
var <- c(1,1,2,1,1,2,1,2,2,2)
id <- c(1:10)
df <- as.data.frame(cbind(id, grp, var))
Only grp A and C should be present in the new data, because they are the only groups in which both var values 1 and 2 appear.
I tried dplyr, but obviously '&' won't work, because the condition is evaluated row by row and no single row has both values, and '|' just returns the same df:
df.new <- df %>% group_by(grp) %>% filter(var==1 & var==2) #returns no rows
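Here is another dplyr method. It also works for more than two factor levels in var. Note that it relies on var being a factor: if var is a character column (the default result of as.data.frame(cbind(...)) since R 4.0.0), levels(var) is NULL and every row is kept.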
library(dplyr)
df2 <- df %>%
group_by(grp) %>%
filter(all(levels(var) %in% var)) %>%
ungroup()
df2
# # A tibble: 5 x 3
# id grp var
# <fct> <fct> <fct>
# 1 1 A 1
# 2 2 A 1
# 3 3 A 2
# 4 6 C 2
# 5 7 C 1
We can condition on there being at least one instance of var == 1 and at least one instance of var == 2 by doing the following:
library(tidyverse)
df1 <- tibble(grp, var, id) # avoids coercion to character/factor
df1 %>%
group_by(grp) %>%
filter(sum(var == 1) > 0 & sum(var == 2) > 0)
grp var id
<chr> <dbl> <int>
1 A 1 1
2 A 1 2
3 A 2 3
4 C 2 6
5 C 1 7
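The same group-level condition can be sketched in base R with ave(), which evaluates a function per group and recycles the result to every row of that group. This is an illustrative alternative, not one of the answers above; it builds the data with data.frame() so that var stays numeric.

```r
grp <- c(rep("A", 3), rep("B", 2), rep("C", 2), rep("D", 1), rep("E", 2))
var <- c(1, 1, 2, 1, 1, 2, 1, 2, 2, 2)
id  <- 1:10
df  <- data.frame(id, grp, var)  # keeps var numeric, unlike cbind()

# Per group: does the group contain both 1 and 2?
# ave() stores the logical result as 1/0 in a numeric vector
has_both <- ave(df$var, df$grp,
                FUN = function(v) all(c(1, 2) %in% v)) == 1

df_new <- df[has_both, ]
df_new
```

Changing c(1, 2) to a longer vector extends the check to more required values.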
I am slicing a data.frame by removing the 50% of rows with the lowest revenue. Now I want to join the result back to the old data.frame so I can compare the sliced result against the result before slicing.
I have a solution, but I am looking for a more elegant one.
require(dplyr)
> #creating my data.frame with revenue for id and subid
> df <- data.frame(id = gl(n = 2, k= 5, length = 10),
+ subid = gl(n = 6, k = 2, length = 10),
+ rev = rnorm(10, 100, 15))
> df
id subid rev
1 1 1 102.80694
2 1 1 77.88691
3 1 2 122.71019
4 1 2 67.13475
5 1 3 93.21146
6 2 3 91.48368
7 2 4 103.05535
8 2 4 82.27343
9 2 5 106.03651
10 2 5 81.14182
>
> #keep only subid with 50% highest turnover within each id
> df_sliced <- df %>%
+ arrange(id, desc(rev)) %>%
+ group_by(id) %>%
+ slice(seq(n()*0.5)) %>%
+ group_by(id) %>%
+ summarise(rev_sliced = sum(rev))
>
> df_sliced
Source: local data frame [2 x 2]
id rev_sliced
(fctr) (dbl)
1 1 225.5171
2 2 209.0919
>
> #now I want to join back and compare my sliced result with result before slice.
> df_desired <- df %>%
+ group_by(id) %>%
+ summarise(rev = sum(rev)) %>%
+ cbind(df_sliced) #this will obviously also give me two columns with id. Desired result is with only one column for id.
>
> df_desired
id rev id rev_sliced
1 1 463.7503 1 225.5171
2 2 463.9908 2 209.0919
I have not figured out how to use a join here, nor how to do everything in one chain.
For the sliced sum, you can calculate the sum of rev that is above the 50% quantile as follows; then you can calculate both in the same summarize expression without the need of a join:
df %>%
group_by(id) %>%
summarise(rev_sliced = sum(rev[rev > quantile(rev, 0.5)]),
rev = sum(rev))
# A tibble: 2 x 3
# id rev_sliced rev
# <int> <dbl> <dbl>
#1 1 225.5171 463.7502
#2 2 209.0919 463.9908
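If an explicit join is still wanted, a base R sketch with aggregate() and merge() is shown below. It reuses the rev values printed in the question and merges on id, so the result has a single id column. Using median(x) in place of quantile(x, 0.5) is a substitution made here for brevity; the two are equivalent for this purpose.

```r
# rev values copied from the printed data frame in the question
df <- data.frame(
  id  = rep(1:2, each = 5),
  rev = c(102.80694, 77.88691, 122.71019, 67.13475, 93.21146,
          91.48368, 103.05535, 82.27343, 106.03651, 81.14182)
)

# Total revenue per id
totals <- aggregate(rev ~ id, df, sum)

# Revenue per id restricted to rows strictly above the group median
sliced <- aggregate(rev ~ id, df, function(x) sum(x[x > median(x)]))
names(sliced)[2] <- "rev_sliced"

# Join on id, giving one id column plus both sums
df_desired <- merge(totals, sliced, by = "id")
df_desired
```

With an odd number of rows per id, as here, the strict comparison drops the median row itself, which matches slice(seq(n()*0.5)) keeping two of five rows.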