Removing all columns summing to zero with dplyr - r

I'm currently working on a dataframe that looks something like this:
Site Spp1 Spp2 Spp3 LOC TYPE
S01 2 4 0 A FLOOD
S02 4 0 0 A REG
....
S10 0 1 0 B FLOOD
S11 1 0 0 B REG
What I'm trying to do is subset the dataframe so I can run some indicator species analysis in R.
The following code works in that I create two subsets of the data, merge them into one frame and then drop the unused factor levels
A.flood <- filter(data, TYPE == "FLOOD", LOC == "A")
B.flood <- filter(data, TYPE == "FLOOD", LOC == "B")
A.B.flood <- rbind(A.flood, B.flood) %>% droplevels.data.frame(A.B.flood, except = c("A", "B"))
What I also need to do is drop all Spp columns (in my real dataset there are ~60) that sum to zero. Is there a way to achieve this with dplyr, and if so, is it possible to pipe that code onto the existing A.B.flood dataframe code?
Thanks!
EDIT
I managed to remove all the columns that summed to zero, by selecting only the columns that summed to > zero:
A.B.flood.subset <- A.B.flood[, apply(A.B.flood[1:(ncol(A.B.flood))], 2, sum)!=0]

I realize this question is now quite old, but I came across it and found another solution using dplyr's "select" and base R's "which", which might seem clearer to dplyr enthusiasts:
A.B.flood.subset <- A.B.flood %>% select(which(!colSums(A.B.flood, na.rm=TRUE) %in% 0))
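Note that colSums() needs numeric input, so the call above assumes every column of A.B.flood is numeric. A minimal base R sketch that only tests the species columns and leaves Site, LOC and TYPE alone, assuming the species columns all start with "Spp" (the toy data and the names spp_cols and zero_cols are made up for illustration):

```r
# Toy data in the shape of the question (values are illustrative)
A.B.flood <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2, 4, 0, 4),
  Spp2 = c(4, 0, 0, 0),
  Spp3 = c(0, 0, 0, 0),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

# Only the species columns are summed; the other columns are never touched
spp_cols  <- grep("^Spp", names(A.B.flood), value = TRUE)
zero_cols <- spp_cols[colSums(A.B.flood[spp_cols], na.rm = TRUE) == 0]

# Keep everything except the species columns whose total is zero
A.B.flood.subset <- A.B.flood[, setdiff(names(A.B.flood), zero_cols), drop = FALSE]
names(A.B.flood.subset)
```

With this toy data only Spp3 is dropped.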

Without using any package, we can use rowSums of the 'Spp' columns (subset the columns using grep) and double negate so that rows with sum>0 will be TRUE and others FALSE. Use this index to subset the rows.
data[!!rowSums(data[grep('Spp', names(data))]),]
Or using dplyr/magrittr, we select the 'Spp' columns, get the sum of each row with Reduce, double negate and use extract from magrittr to subset the original dataset with the index derived.
library(dplyr)
library(magrittr)
data %>%
select(matches('^Spp')) %>%
Reduce(`+`, .) %>%
`!` %>%
`!` %>%
extract(data,.,)
data
data <- structure(list(Site = c("S01", "S02", "S03", "S04"),
Spp1 = c(2L,
4L, 0L, 4L), Spp2 = c(4L, 0L, 0L, 0L), Spp3 = c(0L, 0L, 0L, 0L
), LOC = c("A", "A", "A", "A"), TYPE = c("FLOOD", "REG",
"FLOOD",
"REG")), .Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC",
"TYPE"), class = "data.frame", row.names = c(NA, -4L))

You should convert to tidy data with tidyr::gather() and the data frame will be much easier to manipulate.
library(tidyr)
library(dplyr)
A.B.flood %>% gather(Species, Sp.Count, -Site, -LOC, -TYPE) %>%
group_by(Species) %>%
filter(Sp.Count > 0)
Voila, your tidy data minus the zero counts.
# Site LOC TYPE Species Sp.Count
# <fctr> <fctr> <fctr> <chr> <int>
#1 S01 A FLOOD Spp1 2
#2 S02 A REG Spp1 4
#3 S11 B REG Spp1 1
#4 S01 A FLOOD Spp2 4
#5 S10 B FLOOD Spp2 1
Personally I'd keep it like this. If you want your original format back with the zero counts for the non-discarded species, just add %>% spread(Species, Sp.Count, fill = 0) to the pipeline.
# Site LOC TYPE Spp1 Spp2
#* <fctr> <fctr> <fctr> <dbl> <dbl>
#1 S01 A FLOOD 2 4
#2 S02 A REG 4 0
#3 S10 B FLOOD 0 1
#4 S11 B REG 1 0
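In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same idea with those functions, which drops a species only when its total is zero and so keeps the zero counts of the remaining species. The toy data is modelled on the rows shown in the question and is an assumption:

```r
library(dplyr)
library(tidyr)

# Toy data modelled on the question (values are illustrative)
A.B.flood <- data.frame(
  Site = c("S01", "S02", "S10", "S11"),
  Spp1 = c(2, 4, 0, 1),
  Spp2 = c(4, 0, 1, 0),
  Spp3 = c(0, 0, 0, 0),
  LOC  = c("A", "A", "B", "B"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

wide <- A.B.flood %>%
  pivot_longer(starts_with("Spp"), names_to = "Species", values_to = "Sp.Count") %>%
  group_by(Species) %>%
  filter(sum(Sp.Count) > 0) %>%   # keep a species only if its total is non-zero
  ungroup() %>%
  pivot_wider(names_from = Species, values_from = Sp.Count)

wide
```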

There is an even easier and quicker way to do this, although it uses base R rather than dplyr:
A.B.flood.subset <- A.B.flood[, colSums(A.B.flood != 0) > 0]
or with a MWE:
df <- data.frame (x = rnorm(100), y = rnorm(100), z = rep(0, 100))
df[, colSums(df != 0) > 0]
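One caveat worth spelling out: colSums(df != 0) > 0 keeps every column that contains at least one non-zero value, which is not quite the same as "does not sum to zero" when negative values are possible. For non-negative counts the two tests agree. A small sketch with made-up data (toy, by_sum and by_any are illustrative names):

```r
# Column a sums to zero but is not all zeros
toy <- data.frame(a = c(-1, 1), b = c(0, 0), c = c(1, 2))

by_sum <- names(toy)[colSums(toy) != 0]       # columns whose total is non-zero
by_any <- names(toy)[colSums(toy != 0) > 0]   # columns with any non-zero entry

by_sum
by_any
```

Here by_sum contains only c, while by_any contains a and c.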

For those who want to use dplyr 1.0.0 with the where keyword, you can do:
A.B.flood %>%
select(where( ~ is.numeric(.x) && sum(.x) != 0))
returns:
Spp1 Spp2
1 2 4
2 4 0
3 0 0
4 4 0
using the same data given by @akrun:
A.B.flood <- structure(
list(
Site = c("S01", "S02", "S03", "S04"),
Spp1 = c(2L,
4L, 0L, 4L),
Spp2 = c(4L, 0L, 0L, 0L),
Spp3 = c(0L, 0L, 0L, 0L),
LOC = c("A", "A", "A", "A"),
TYPE = c("FLOOD", "REG",
"FLOOD",
"REG")
),
.Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC",
"TYPE"), class = "data.frame", row.names = c(NA, -4L))

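The where() call above also drops the non-numeric columns (Site, LOC, TYPE). If those should be kept, one possible variant is to keep a column when it is either non-numeric or has a non-zero sum. This is a sketch assuming dplyr 1.0.0 or later, with toy data copied from the structure above:

```r
library(dplyr)

A.B.flood <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2L, 4L, 0L, 4L),
  Spp2 = c(4L, 0L, 0L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

# Keep non-numeric columns as they are; keep numeric ones only if they do not sum to zero
kept <- A.B.flood %>%
  select(where(~ !is.numeric(.x) || sum(.x) != 0))

names(kept)
```
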
Related

cumsum with a condition to restart in R

I have this dataset containing multiple columns. I want to use cumsum() on a column, conditioning the sum on another column. That is, when X happens I want the sum to restart from zero, but I also want the row where the "x" event occurs to be included in the running sum. The example below makes this more precise.
inv ass port G cumsum(G)
i x 2 1 1
i x 2 0 1
i x 0 1 2
i x 3 0 0
i x 3 1 1
So in the 3rd row the condition port == 0 happens. I want to cumsum(G), but on the 3rd row i want to still sum the value of G and to restart the count from the following row.
I'm using dplyr to group_by(investor, asset) but I'm stuck here since I'm doing:
res1 <- res %>%
group_by(investor, asset) %>%
mutate(posdays = ifelse(operation < 0 & portfolio == 0, 0, cumsum(G))) %>%
ungroup()
This restarts the cumsum() but excludes the value in the 3rd row from the sum.
I think I need something that says "cumsum(G), but when condition "x" holds in a row, restart the sum in the following row".
Can you help me?
You may use cumsum to create groups as well.
library(dplyr)
df <- df %>%
group_by(group = cumsum(dplyr::lag(port == 0, default = 0))) %>%
mutate(cumsum_G = cumsum(G)) %>%
ungroup
df
# inv ass port G group cumsum_G
# <chr> <chr> <int> <int> <dbl> <int>
#1 i x 2 1 0 1
#2 i x 2 0 0 1
#3 i x 0 1 0 2
#4 i x 3 0 1 0
#5 i x 3 1 1 1
You may remove the group column from output using %>% select(-group).
data
df <- structure(list(inv = c("i", "i", "i", "i", "i"), ass = c("x",
"x", "x", "x", "x"), port = c(2L, 2L, 0L, 3L, 3L), G = c(1L,
0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
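The same lag-then-cumsum idea can be sketched in base R with ave(), here combined with the investor/asset grouping from the question. The second investor "j" and the helper names restart and block are made up for illustration, and rows are assumed to be sorted so that each investor/asset pair is contiguous:

```r
df <- data.frame(
  inv  = c("i", "i", "i", "i", "i", "j", "j", "j"),
  ass  = "x",
  port = c(2, 2, 0, 3, 3, 1, 0, 4),
  G    = c(1, 0, 1, 0, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# 1 on the row that follows a port == 0 row, 0 elsewhere
restart <- c(0, head(df$port == 0, -1))

# Each restart opens a new block; the port == 0 row itself stays in the old block
block <- cumsum(restart)

# Running sum of G within investor, asset and block
df$cumsum_G <- ave(df$G, df$inv, df$ass, block, FUN = cumsum)
df
```

Because the running sum is also split by inv and ass, a restart flag that spills over from the last row of one investor onto the first row of the next is harmless.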

R: count times per column a condition is met and row names appear in a list

I have a dataframe with count information (df1)
rownames sample1 sample2 sample3
m1 0 5 1
m2 1 7 5
m3 6 2 0
m4 3 1 0
and a second with sample information (df2)
rownames batch total count
sample1 a 10
sample2 b 15
sample3 a 6
I also have two lists with information about the m values (could easily be turned into another data frame if necessary but I would rather not add to the count information as it is quite large). No patterns (such as even and odd) exist, I am just using a very simplistic example
x <- c("m1", "m3") and y <- c("m2", "m4")
What I would like to do is add another two columns to the sample information: for each sample, a count of the m values that are 5 or more and appear in list x or y
rownames batch total count x y
sample1 a 10 1 0
sample2 b 15 1 1
sample3 a 6 0 1
My current strategy is to make a list of values for both x and y and then append them to df2. Here are my attempts so far:
numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) and numX <- colSums(df1[sum(rownames(df1)>10 %in% x),]) both return a list of 0s
numX <- colSums(df1[rownames(df1)>10 %in% x,]) returns a list of the sum of count values meeting the conditions for each column
numX <- length(df1[rownames(df1)>10 %in% novel,]) returns the number of times the condition is met (in this example 2L)
I am not really sure how to approach this so I have just been throwing around attempts. I've tried looking for answers but maybe I am just struggling to find the proper wording.
We may do this with rowwise
library(dplyr)
df2 %>%
rowwise %>%
mutate(x = +(sum(df1[[rownames]][df1$rownames %in% x]) >= 5),
y = +(sum(df1[[rownames]][df1$rownames %in% y]) >= 5)) %>%
ungroup
-output
# A tibble: 3 × 5
rownames batch totalcount x y
<chr> <chr> <int> <int> <int>
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
Or based on the data, a base R option would be
out <- aggregate(. ~ grp, FUN = sum,
transform(df1, grp = c('x', 'y')[1 + (rownames %in% y)] )[-1])
df2[out$grp] <- +(t(out[-1]) >= 5)
-output
> df2
rownames batch totalcount x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
data
df1 <- structure(list(rownames = c("m1", "m2", "m3", "m4"), sample1 = c(0L,
1L, 6L, 3L), sample2 = c(5L, 7L, 2L, 1L), sample3 = c(1L, 5L,
0L, 0L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(rownames = c("sample1", "sample2", "sample3"),
batch = c("a", "b", "a"), totalcount = c(10L, 15L, 6L)),
class = "data.frame", row.names = c(NA,
-3L))
How about using dplyr and reshape2::melt
library(dplyr)
library(reshape2)
df3 <- df1 %>%
melt %>%
filter(value >= 5) %>%
mutate(x = as.numeric(rownames %in% c("m1", "m3")),
y = as.numeric(rownames %in% c("m2", "m4"))) %>%
select(-rownames, - value) %>%
group_by(variable) %>%
summarise(x = sum(x), y = sum(y))
df2 %>% left_join(df3, by = c("rownames" = "variable"))
rownames batch total_count x y
1 sample1 a 10 1 0
2 sample2 b 15 1 1
3 sample3 a 6 0 1
You can create a named list of vectors and for each rownames count how many values of x and y in the respective sample is >= 5.
Base R option -
list_vec <- list(x = x, y = y)
cbind(df2, do.call(rbind, lapply(df2$rownames, function(x)
sapply(list_vec, function(y) {
sum(df1[[x]][df1$rownames %in% y] >= 5)
}))))
# rownames batch total.count x y
#1 sample1 a 10 1 0
#2 sample2 b 15 1 1
#3 sample3 a 6 0 1
Using tidyverse -
library(dplyr)
library(purrr)
list_vec <- lst(x, y)
df2 %>%
bind_cols(map_df(df2$rownames, function(x)
map(list_vec, ~sum(df1[[x]][df1$rownames %in% .x] >= 5))))
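Since the count table is large, it may be worth noting that the counting can also be done once on a logical matrix instead of sample by sample. A base R sketch, assuming the same df1, df2, x and y as in the data above (hits is an illustrative name):

```r
df1 <- data.frame(
  rownames = c("m1", "m2", "m3", "m4"),
  sample1  = c(0, 1, 6, 3),
  sample2  = c(5, 7, 2, 1),
  sample3  = c(1, 5, 0, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  rownames   = c("sample1", "sample2", "sample3"),
  batch      = c("a", "b", "a"),
  totalcount = c(10, 15, 6),
  stringsAsFactors = FALSE
)
x <- c("m1", "m3")
y <- c("m2", "m4")

# TRUE wherever a count is 5 or more; rows are named by m value
hits <- as.matrix(df1[-1]) >= 5
rownames(hits) <- df1$rownames

# Count TRUEs per sample within each list, then align to the row order of df2
df2$x <- unname(colSums(hits[x, , drop = FALSE])[df2$rownames])
df2$y <- unname(colSums(hits[y, , drop = FALSE])[df2$rownames])
df2
```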

Filter row based on max & unique value in one column

It's a bit tricky to explain, but I'll try my best. I have a df as below. I need to filter rows by group based on the maximum pop, but only among countries that have not already been picked in an earlier group. (In the expected output (image), A does not feature in Group 2 because it had already occurred in Group 1.)
In short, I need unique values in the country column while at the same time getting the maximum value of pop (on a group level). I hope the picture can convey what I could not. (Tidyverse solution preferred)
[Image: expected output]
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
I think this will do. Explanation of the syntax:
split the data into a list with one tibble per group
set the first group aside (after filtering it for its max pop value, it is used as .init in the next step)
use purrr::reduce to reduce the remaining list of tibbles to a single tibble
in each iteration of reduce:
.init, the filtered first group, is the starting value
countries already picked in previous groups are removed through anti_join
the remaining rows are filtered for max pop again
the previously picked rows are added back with bind_rows()
Thus, in the end we will have the desired tibble.
library(dplyr)
library(purrr)
df %>% group_split(Group) %>% .[-1] %>%
reduce(.init =df %>% group_split(Group) %>% .[[1]] %>%
filter(pop == max(pop)),
~ .y %>%
anti_join(.x, by = c("country" = "country")) %>%
filter(pop == max(pop)) %>%
bind_rows(.x) %>% arrange(Group))
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
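For readers who find the reduce() call hard to follow, the same steps can be sketched as an explicit base R loop. This is only an illustration of the logic, not the answer's code: seen and picks are made-up names, and which.max() keeps just the first row on ties, whereas filter(pop == max(pop)) keeps all tied rows.

```r
df <- data.frame(
  Group   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop     = c(200, 100, 50, 200, 150, 120, 200, 150, 100),
  stringsAsFactors = FALSE
)

seen  <- character(0)   # countries already picked in earlier groups
picks <- list()

for (g in unique(df$Group)) {
  # Candidates: rows of this group whose country has not been picked yet
  cand <- df[df$Group == g & !(df$country %in% seen), ]
  if (nrow(cand) == 0) next
  best <- cand[which.max(cand$pop), ]   # first row with the largest pop
  seen <- c(seen, best$country)
  picks[[length(picks) + 1]] <- best
}

result <- do.call(rbind, picks)
result
```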
You can create a helper function that writes the maximum pop from each group in a vector and use it to filter the dataframe.
library(tidyverse)
max_values <- c()
helper <- function(dat, ...){
dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
max_value <- max(dat) # get current max. value
max_values <<- c(max_values, max_value) # append
return(max_value)
}
df %>%
group_by(Group) %>%
filter(pop == helper(pop))
which gives you:
# A tibble: 3 x 3
# Groups: Group [3]
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 H 120
Data used:
> df
Group country pop
1 1 A 200
2 1 B 100
3 1 C 50
4 2 A 200
5 2 E 150
6 2 F 120
7 3 A 200
8 3 E 150
9 3 G 100
10 3 H 120
Here is another possibility, but it is overly simplified in that it does not take into account the possibility of a country having a higher population in a Group where it does not win.
library(dplyr)
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
df %>%
group_by(country) %>%
summarize(popmax = max(pop)) %>%
inner_join(df, by = c("popmax" = 'pop')) %>%
rename(country = country.y) %>%
select(-country.x) %>%
group_by(country) %>%
arrange(Group) %>%
slice(1) %>%
ungroup() %>%
group_by(Group) %>%
arrange(country) %>%
slice(1) %>%
select(Group, country, popmax) %>%
rename(pop = popmax)
My answer fails (while other answers don't) with this data set:
df <- tribble(
~Group, ~ country, ~pop,
1 , 'A', 200,
1 , 'B', 100,
1 , 'C', 50,
1 , 'G', 150,
2 , 'A', 200,
2 , 'E', 150,
2 , 'F', 120,
3 , 'A', 200,
3 , 'E', 150,
3 , 'G', 100
)
Update
@Crestor, who is claiming that my answer is not correct:
My answer is correct because my code gives the desired output as requested by OP.
Your objection that my code does not work on another scenario may be correct, but in this setting it is irrelevant, as my answer was only intended to solve the task at hand.
Here is the answer to your raised scenario with this dataset:
df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
expected output by Crestor:
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
My code for your scenario, @crestor:
library(dplyr)
df1 %>%
group_by(country) %>%
arrange(Group) %>%
filter(pop == max(pop)) %>%
group_by(Group) %>%
filter(pop == max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
Original answer to the question by OP
To keep it simple: First arrange to bring your dataset in position. Then group_by and keep first row in each group with slice. Then group_by Group and filter the max pop
library(dplyr)
df %>%
arrange(country, pop) %>%
group_by(country) %>%
slice(1) %>%
group_by(Group) %>%
filter(pop==max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100

How to replace NULL/? with 'None' or '0' in r

DF1 is
ID CompareID Distance
1 256 0
1 834 0
1 946 0
2 629 0
2 735 1
2 108 1
Expected output should be DF2 as below (Condition for generating DF2 -> In DF1, For any ID if 'Distance'==1, put the corresponding 'CompareID' into 'SimilarID' column, for 'Distance'==0, ignore the corresponding 'CompareID')
ID SimilarID
1 None
2 735,108
The comparison is done correctly, but I got the output below
ID SimilarID
1 ?
2 735,108
I understand that the ? mark is displayed because there is no 'CompareID' to put in 'SimilarID'.
I want to replace this '?' with 'None' or '0'. Kindly help.
In some cases, I see a 'NULL' value instead of '?'.
Thanks !
Using the data.table package, where df is your original data ...
library(data.table)
setDT(df)[, .(SimilarID = if(all(Distance == 0)) "None"
else toString(CompareID[Distance == 1])), by = ID]
# ID SimilarID
# 1: 1 None
# 2: 2 735, 108
This follows your expected output by returning, by ID
"None" when all of the Distance column is zero
the CompareID values for when Distance is 1, as a comma-delimited string
Data:
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), CompareID = c(256L,
834L, 946L, 629L, 735L, 108L), Distance = c(0L, 0L, 0L, 0L, 1L,
1L)), .Names = c("ID", "CompareID", "Distance"), class = "data.frame", row.names = c(NA,
-6L))
Try the following using dplyr:
summarise.func <- function (Distance,CompareID) {
SimilarID <- CompareID[Distance == 1]
if (length(SimilarID)==0) "None" else paste0(SimilarID, collapse=",")
}
library(dplyr)
df2 <- df1 %>% group_by(ID) %>%
summarise(SimilarID=summarise.func(Distance,CompareID))
First, define a summarizing function summarise.func that:
Extracts the CompareID to a SimilarID vector if the Distance == 1.
If this SimilarID vector has elements, then return a string that are these CompareIDs collapsed with ","; otherwise return "None".
Then, use this summarise.func to summarise SimilarID grouped by ID.
Using your data:
print(df2)
### A tibble: 2 x 2
## ID SimilarID
## <int> <chr>
##1 1 None
##2 2 735,108
Using aggregate in base R:
df2 <- aggregate((CompareID*Distance)~ID, df, FUN=function(x)
ifelse(sum(x)>0, paste(x[x>0], collapse = ","), "None"))
names(df2) <- c("ID", "SimilarID") #if necessary
# ID SimilarID
#1 1 None
#2 2 735,108
CompareID*Distance ensures that CompareID is ignored when Distance==0. The result is then grouped by ID: if the sum is greater than 0, the non-zero values (x[x>0]) are joined into a comma-delimited string; otherwise "None" is returned.
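If the summary has already been computed without such a guard, the empty results can also be replaced afterwards. A base R sketch under the assumption that the unguarded step produces empty strings (or NA) for IDs with no match; by_id and raw are illustrative names:

```r
df <- data.frame(
  ID        = c(1, 1, 1, 2, 2, 2),
  CompareID = c(256, 834, 946, 629, 735, 108),
  Distance  = c(0, 0, 0, 0, 1, 1)
)

# Unguarded summary: collapse the matching CompareIDs per ID.
# Keeping all ID levels means IDs without a match get an empty vector.
keep  <- df$Distance == 1
by_id <- split(df$CompareID[keep], factor(df$ID[keep], levels = unique(df$ID)))
raw   <- vapply(by_id, paste, character(1), collapse = ",")

# Collapsing an empty vector yields "", so replace blanks (and NAs) afterwards
raw[is.na(raw) | raw == ""] <- "None"

df2 <- data.frame(ID = names(raw), SimilarID = unname(raw), stringsAsFactors = FALSE)
df2
```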

Select values per specific occurrence from data frame in R

I am stuck with this one:
I have a dataframe with the following properties:
a variable type (values: "P", "T", "I")
a variable id (subject id)
a variable RT (reaction time)
It looks like this:
id type rt
1 T 333
1 P 912
1 P 467
1 I 773
1 I 123
...
2 P 125
2 I 843
2 T 121
2 P 982
...
The order of the variable type is random for each subject but each subject has the same amount of each type. What I want is to select the first 2 RT values where type=="P" for each participant and then average over occurrences, so that I get a mean RT of all participants for the first occurrence of P, and a mean for the second occurrence of P.
So for, say, 20 participants, I want to extract a total of 20 RTs for the first occurrence and 20 RTs for the second occurrence.
I tried tapply, aggregate, for loop and simple subsetting but these either average "too early" or fail since the order is random for each subject.
Try
# install.packages("dplyr")  # slice() is available in released versions of dplyr
library(dplyr)
df%>%
group_by(id) %>%
filter(type=="P") %>%
slice(1:2)%>%
mutate(N=row_number()) %>%
group_by(N) %>%
summarise(rt=mean(rt))
#Source: local data frame [2 x 2]
# N rt
#1 1 518.5
#2 2 724.5
Or using data.table
library(data.table)
setDT(df)[type=="P", list(rt=rt[1:2], N=seq_len(.N)), by=id][,
list(Meanrt=mean(rt)), by=N]
# N Meanrt
#1: 1 518.5
#2: 2 724.5
Or using aggregate from base R
df1 <- subset(df, type=="P")
df1$indx <- with(df1, ave(rt, id, FUN=seq_along))
aggregate(rt~indx, df1[df1$indx %in% 1:2,], FUN=mean)
# indx rt
#1 1 518.5
#2 2 724.5
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), type = c("T",
"P", "P", "I", "I", "P", "I", "T", "P"), rt = c(333L, 912L, 467L,
773L, 123L, 125L, 843L, 121L, 982L)), .Names = c("id", "type",
"rt"), class = "data.frame", row.names = c(NA, -9L))
I hope I got it right, using dplyr:
df %>%
group_by(id, type) %>%
mutate(occ=1:n()) %>%
group_by(type, occ) %>%
summarise(av=mean(rt)) %>%
filter(type=="P")
Source: local data frame [2 x 3]
Groups: type
type occ av
1 P 1 518.5
2 P 2 724.5
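A variant of the dplyr approaches above that numbers the P trials per participant first and then averages per occurrence, written as a sketch against a current dplyr release. The column name occ and the output name mean_rt are arbitrary choices:

```r
library(dplyr)

df <- data.frame(
  id   = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  type = c("T", "P", "P", "I", "I", "P", "I", "T", "P"),
  rt   = c(333, 912, 467, 773, 123, 125, 843, 121, 982),
  stringsAsFactors = FALSE
)

res <- df %>%
  filter(type == "P") %>%          # only P trials, in their original order
  group_by(id) %>%
  mutate(occ = row_number()) %>%   # 1st, 2nd, ... P trial of each participant
  ungroup() %>%
  filter(occ <= 2) %>%             # first two occurrences only
  group_by(occ) %>%
  summarise(mean_rt = mean(rt))    # average across participants per occurrence

res
```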
