How to take the diversity function and add it as a column - r

I have been following this code here:
calculating diversity by groups
However, when I run the code it returns only the values. I tried cbind to put them into another data frame, but I am afraid the rows do not match. Is there a way to run that code so that the result lands in the same data frame, with each value aligned with the row it was calculated from?

Following the code given in the linked answer, what you need is right_join to link the diversity vector back to the method_one data frame:
diversity(aggregate(. ~ LOC_ID + week + year + POSTCODE, method_one, sum)[, 5:25],
          MARGIN = 1, index = "simpson") %>%
  tibble(diversity = ., ID = 1:length(.)) %>%
  right_join(method_one %>% mutate(ID = 1:nrow(.)))
Explanation:
method_one %>% mutate(ID=1:nrow(.)) adds an ID column to method_one.
tibble(diversity=., ID=1:length(.)) turns the result of the diversity call into a tibble with a matching ID column.
right_join(x, y, by = NULL, ...) returns all rows from y, and all columns from x and y. Rows in y with no match in x get NA in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. With by = NULL it joins on all columns with matching names, in this case ID.
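As a minimal illustration of the join-on-ID idea (toy data, not the OP's; base R merge with all.y = TRUE plays the role of right_join here):

```r
# Toy stand-in for method_one (column names are made up)
method_one <- data.frame(LOC_ID = c(1, 1, 2), count = c(10, 20, 30))
method_one$ID <- 1:nrow(method_one)

# A per-row result vector, e.g. what diversity() would return
div <- c(0.5, 0.7, 0.2)
div_df <- data.frame(diversity = div, ID = seq_along(div))

# Keep every row of method_one, attach the matching diversity value by ID
out <- merge(div_df, method_one, by = "ID", all.y = TRUE)
```

Because both sides carry the same row index as ID, each diversity value ends up on the row it was computed from.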

Here is an option with dplyr
library(dplyr)
library(vegan)
method_one %>%
  group_by(LOC_ID, week, year, POSTCODE) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup %>%
  select(5:25) %>%
  diversity %>%
  tibble(diversity = .) %>%
  bind_cols(method_one, .)
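For reference, the index = "simpson" option in vegan::diversity computes 1 - sum(p^2) per row, where p are the proportional abundances; a minimal base-R check with made-up counts:

```r
# One row of made-up species counts
counts <- c(10, 5, 5)

# Proportional abundances
p <- counts / sum(counts)

# Simpson diversity: 1 - sum(p^2)
simpson <- 1 - sum(p^2)
simpson  # 1 - (0.5^2 + 0.25^2 + 0.25^2) = 0.625
```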

Related

Counting the rows based on two other column values, and manipulate the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all website in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then do a group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) in mutate
library(dplyr)
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(click_tracking))
If the column is factor, convert to logical with as.logical
df %>%
  group_by(Date) %>%
  mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
  group_by(Date) %>%
  summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to compare a logical column against TRUE in the first place
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(data=df, click_tracking ~ Date, mean)
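With made-up data, both the count (sum of a logical) and the proportion (mean of a logical) per month can be checked directly; base R here so the sketch is self-contained:

```r
df <- data.frame(
  website = c("a", "b", "c", "a", "b"),
  Date = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
  click_tracking = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Count of click_tracking == TRUE per month: sum() on the logical column
counts <- aggregate(click_tracking ~ Date, data = df, FUN = sum)

# Proportion of TRUE per month: mean() on the logical column
props <- aggregate(click_tracking ~ Date, data = df, FUN = mean)
```

For "2020 01" this gives a count of 2 and a proportion of 2/3; no loop over Date is needed.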

Calculate percent NA by ID variable in R

Title is self-explanatory. Looking to calculate percent NA by ID group in R. There are lots of posts on calculating NA by variable column but almost nothing on doing it by row groups.
If there are multiple columns, after grouping by 'ID', use summarise_at to loop over the columns, create a logical vector with is.na, get the mean, and multiply by 100
library(dplyr)
df1 %>%
  group_by(ID) %>%
  summarise_at(vars(-group_cols()), ~ 100 * mean(is.na(.)))
If we want to get the percentage across all other variables,
library(tidyr)
df1 %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  summarise(Perc = 100 * mean(is.na(value)))
Or with aggregate from base R
aggregate(.~ ID, df1, FUN = function(x) 100 * mean(is.na(x)), na.action = na.pass)
Or to get the percentage across the other columns: unlist them, build a table from the logical vector and the 'ID' column, and use prop.table to get the percentage
prop.table(table(cbind(ID = df1$ID,
                       value = is.na(unlist(df1[setdiff(names(df1), "ID")])))))
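A quick base-R check of the idea on made-up data (two IDs, four cells each, so the expected percentages are easy to verify by hand):

```r
df1 <- data.frame(
  ID = c(1, 1, 2, 2),
  v1 = c(NA, 2, 3, NA),
  v2 = c(5, NA, 7, 8)
)

# Per-ID, per-column percent NA (na.action = na.pass keeps the NA rows)
by_col <- aggregate(. ~ ID, df1, FUN = function(x) 100 * mean(is.na(x)),
                    na.action = na.pass)

# Per-ID percent NA across all non-ID columns
vals <- df1[setdiff(names(df1), "ID")]
across_all <- tapply(is.na(unlist(vals)), rep(df1$ID, ncol(vals)),
                     function(x) 100 * mean(x))
```

ID 1 has 2 NA cells out of 4 (50%), ID 2 has 1 out of 4 (25%), which both approaches recover.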

Normalize specified columns in dplyr by value in first row

I have a data frame with four rows, 23 numeric columns and one text column. I'm trying to normalize all the numeric columns by subtracting the value in the first row.
I've tried getting it to work with mutate_at, but I couldn't figure out a good way to get it to work.
I got it to work by converting to a matrix and converting back to a tibble:
## First, did some preprocessing to get out the group I want
totalNKFoldChange <- filter(signalingFrame,
                            Population == "Total NK") %>% ungroup
totalNKFoldChange_mat <- select(totalNKFoldChange, signalingCols) %>%
  as.matrix()
normedNKFoldChange <- sweep(totalNKFoldChange_mat,
                            2, totalNKFoldChange_mat[1, ])
normedNKFoldChange %<>%
  cbind(Timepoint = levels(totalNKFoldChange$Timepoint)) %>%
  as.tibble %>%
  mutate(Timepoint = factor(Timepoint,
                            levels = levels(totalNKFoldChange$Timepoint)))
I'm so certain there's a nicer way to do it that would be fully dplyr native. Anyone have tips? Thank you!!
If we want to normalize all the numeric columns by subtracting the value in the first row, use mutate_if
library(dplyr)
df1 %>%
  mutate_if(is.numeric, list(~ . - first(.)))
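In current dplyr, mutate_if is superseded; the same normalization written with across (a sketch on made-up data):

```r
library(dplyr)

df1 <- data.frame(grp = c("a", "b", "c"),
                  x = c(10, 12, 15),
                  y = c(2, 4, 8))

# Subtract the first row's value from every numeric column
out <- df1 %>%
  mutate(across(where(is.numeric), ~ . - first(.)))
```

Here x becomes 0, 2, 5 and y becomes 0, 2, 6, while the text column grp is left untouched.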

Why does dplyr::distinct behave like this for grouped data frames

My question involves the distinct function from dplyr.
First, set up the data:
set.seed(0)
df <- data.frame(
x = sample(10, 100, rep = TRUE),
y = sample(10, 100, rep = TRUE)
)
Consider the following two uses of distinct.
df %>%
  group_by(x) %>%
  distinct()
df %>%
  group_by(x) %>%
  distinct(y)
The first produces a different result from the second. As far as I can tell, the first set of operations finds "all distinct values of x, returning the first value of y", whereas the second finds "for each value of x, all distinct values of y".
Why should this be so when
df %>%
  distinct(x, y)
df %>% distinct()
produce the same result?
EDIT: It looks like this is a known bug already: https://github.com/hadley/dplyr/issues/1110
As far as I can tell, the answer is that distinct considers grouping columns when determining distinctness, which to me seems inconsistent with how the rest of dplyr works.
Thus:
df %>%
  group_by(x) %>%
  distinct()
Group by x, find values that are distinct in x(!). This seems to be a bug.
However:
df %>%
  group_by(x) %>%
  distinct(y)
Group by x, find values that are distinct in y given x. This is equivalent to either of these cases:
df %>%
  distinct(x, y)
df %>% distinct()
Both find distinct values in x and y.
The take-home message seems to be: Don't use grouping and distinct. Just use the relevant column names as arguments in distinct.
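The ungrouped equivalence is easy to sanity-check: distinct() with no grouping deduplicates on all columns, which is exactly what unique() does for a data frame (base R here, so the check is self-contained):

```r
set.seed(0)
df <- data.frame(x = sample(3, 20, replace = TRUE),
                 y = sample(3, 20, replace = TRUE))

# unique() keeps the first occurrence of each (x, y) combination,
# matching distinct(df) and distinct(df, x, y)
u <- unique(df)
```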

Run function on all pairs of objects in column of data frame

Suppose I have a data frame with factor "subject", and continuous variables "a" and "b". For each level of subject, I create a distance matrix from a and b:
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.)))
This returns an n-by-2 data frame, with subject and dmat as columns. What I would like to do is compute the matrix norm of each pairwise difference. Something along the lines of:
norm(data$dmat[[1]]-data$dmat[[2]])
norm(data$dmat[[1]]-data$dmat[[3]])
# etc etc
Ideally, I'd get out an n^2-by-3 data frame, with the first two columns indicating the two subject levels that are being compared, and the third column containing this norm calculation.
Apologies for not providing a sample dataset. I'm hoping the answer is simple enough, but if one is required I will try to write some code to generate one.
You can use mapply for this.
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup %>%
  do(data.frame(s1 = rep(.$subject, each = nrow(.)),
                s2 = rep(.$subject, times = nrow(.)),
                dist = mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y))))
I would probably find the matrix representation of this result easier to understand:
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup %>%
  do(data.frame(matrix(mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y)), nrow = nrow(.))))
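A self-contained sketch of the same idea in base R, with made-up data: split() builds the per-subject distance matrices, expand.grid() enumerates the subject pairs, and mapply() applies the norm to each pair.

```r
set.seed(1)
data <- data.frame(subject = rep(c("s1", "s2", "s3"), each = 4),
                   a = rnorm(12), b = rnorm(12))

# One distance matrix per subject
dmats <- lapply(split(data[c("a", "b")], data$subject),
                function(d) as.matrix(dist(d)))

# All ordered pairs of subjects and the norm of each pairwise difference
pairs <- expand.grid(s1 = names(dmats), s2 = names(dmats))
pairs$dist <- mapply(function(i, j) norm(dmats[[i]] - dmats[[j]]),
                     as.character(pairs$s1), as.character(pairs$s2))
```

This yields the n^2-by-3 shape the question asks for; the diagonal entries are zero and the matrix of norms is symmetric, which is a quick sanity check.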
