applying a function across rows in a dataframe - r

I have a dataset of approximate counts of birds of 5 species. I wrote a function to calculate species diversity using Brillouin's Index. My data and my function look like this:
df <- data.frame(
sp1 = c(2, 3, 4, 5),
sp2 = c(1, 6, 7, 2),
sp3 = c(1, 9, 4, 3),
sp4 = c(2, 2, 2, 4),
sp5 = c(3, 3, 2, 1),
treatment1 = c("A", "B", "C", "A"),
treatment2 = c("D", "E", "D", "E")
)
# write a function that estimates Brillouin's Index
Brillouin_Index <- function(x) {
  N <- sum(x)
  (log10(factorial(N)) - sum(log10(factorial(x)))) / N
}
df2 <- df %>%
  mutate(bindex = Brillon_Index(matrix(df[1:5, ])))
How do I apply my function to calculate Brillouin's Index across rows? I thought something like the above would work, but no luck yet. The goal is to use the diversity index as the response variable in relation to treatment1 and treatment2, which is why I'd like to sum across rows and get a single value for each row in a new variable called bindex. Any help will be greatly appreciated. Best,

We can use rowwise to group by row
library(dplyr)
df <- df %>%
  rowwise() %>%
  mutate(bindex = Brillouin_Index(as.matrix(c_across(1:5)))) %>%
  ungroup()
-output
df
# A tibble: 4 x 8
# sp1 sp2 sp3 sp4 sp5 treatment1 treatment2 bindex
# <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#1 2 1 1 2 3 A D 0.464
#2 3 6 9 2 3 B E 0.528
#3 4 7 4 2 2 C D 0.527
#4 5 2 3 4 1 A E 0.505
Or use apply in base R
df$bindex <- apply(df[1:5], 1, Brillouin_Index)
df$bindex
#[1] 0.4643946 0.5277420 0.5273780 0.5051951
Or with dapply in collapse
library(collapse)
df$bindex <- dapply(slt(df, 1:5), Brillouin_Index, MARGIN = 1)
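One caveat worth noting: factorial(N) overflows to Inf once N exceeds 170, so with larger total counts the index comes back as NaN. A minimal sketch of a more stable variant, assuming base R's lfactorial() (the log of the factorial) and keeping the base-10 logs used above:
# same index computed on the log scale to avoid overflow for large N;
# lfactorial() returns the natural log of the factorial, so divide by log(10)
# to stay on the base-10 scale of the original function
Brillouin_Index_stable <- function(x) {
  N <- sum(x)
  (lfactorial(N) - sum(lfactorial(x))) / (N * log(10))
}
# should agree with Brillouin_Index() on small counts:
# apply(df[1:5], 1, Brillouin_Index_stable)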

Multiply numbers from different data frames based on all the possible combinations

I have 5 data frames like the ones below:
df_mon <- data.frame(mon = as.factor(c(6, 7, 8, 9, 10)),
number = c(1.11, 1.02, 0.95, 0.92, 0.72))
df_year <- data.frame(year = as.factor(c(1, 2)),
number = c(1.61, 0.4))
df_cat <- data.frame(cat = c("A", "B", "C"),
number = c(1.11, 1.02, 0.44))
df_bin <- data.frame(bin = as.factor(c(1, 2)),
number = c(1.42, 0.56))
df_cat2 <- data.frame(cat2 = c("A", "B", "C", "D", "AA"),
number = c(0.11, 1.22, 1.34, 0.88, 0.75))
I need to multiply all the numbers in the 'number' columns from each of these data frames with each other. That is, take every possible combination of the values in the first column of each data set and multiply the corresponding numbers. The final results data frame should look something like this (the first 3 rows are done):
results_df <- data.frame(combi = c("mon6_year1_catA_bin1_cat2A", "mon6_year1_catA_bin1_cat2B", "mon6_year1_catA_bin1_cat2C"),
final_number = c(1.11*1.61*1.11*1.42*0.11, 1.11*1.61*1.11*1.42*1.22, 1.11*1.61*1.11*1.42*1.34))
The first column in results_df shows which combination was used to calculate final_number. In the first example, the 'number' value for month 6 (1.11) from df_mon is taken and multiplied with the following:
category 1 (1.61) from df_year
category A (1.11) from df_cat
category 1 (1.42) from df_bin
category A (0.11) from df_cat2
The answer for this combination is 1.11 x 1.61 x 1.11 x 1.42 x 0.11 = 0.3098.
The 2nd row shows the next possible combination and so on.
I'm not sure how to achieve this, so any help will be greatly appreciated!
Maybe you can try expand.grid like below
lst <- list(df_mon, df_year, df_cat, df_bin, df_cat2)
results_df <- data.frame(
  combi = do.call(
    paste,
    c(do.call(
      expand.grid,
      lapply(lst, function(v) paste0(names(v[1]), v[, 1]))
    ), sep = "_")
  ),
  final_number = Reduce(
    "*",
    do.call(
      expand.grid,
      lapply(lst, `[[`, 2)
    )
  )
)
which gives
> head(results_df)
combi final_number
1 mon6_year1_catA_bin1_cat2A 0.30985097
2 mon7_year1_catA_bin1_cat2A 0.28472792
3 mon8_year1_catA_bin1_cat2A 0.26518777
4 mon9_year1_catA_bin1_cat2A 0.25681342
5 mon10_year1_catA_bin1_cat2A 0.20098441
6 mon6_year2_catA_bin1_cat2A 0.07698161
Here is an approach using dplyr and tidyr.
library(dplyr)
library(tidyr)
df_all <- df_mon %>%
  full_join(df_year, by = character()) %>% # by = character() ensures a cross join
  full_join(df_cat, by = character()) %>%
  full_join(df_bin, by = character()) %>%
  full_join(df_cat2, by = character()) %>%
  pivot_longer(cols = c(-mon, -year, -cat, -bin, -cat2)) %>%
  group_by(mon, year, cat, bin, cat2) %>%
  summarize(final_number = prod(value), .groups = "keep")
# A tibble: 300 x 6
# Groups: mon, year, cat, bin, cat2 [300]
mon year cat bin cat2 final_number
<fct> <fct> <chr> <fct> <chr> <dbl>
1 6 1 A 1 A 0.310
2 6 1 A 1 AA 2.11
3 6 1 A 1 B 3.44
4 6 1 A 1 C 3.77
5 6 1 A 1 D 2.48
6 6 1 A 2 A 0.122
7 6 1 A 2 AA 0.833
8 6 1 A 2 B 1.36
9 6 1 A 2 C 1.49
10 6 1 A 2 D 0.978
# ... with 290 more rows
It keeps the variables from the other data.frames intact as columns for further analysis, but you could create your combi column with a little paste().
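For example, a hedged sketch of that paste() step on df_all from above (the prefixes simply mirror the column names used in the expected output):
library(dplyr)
df_all %>%
  ungroup() %>%
  mutate(combi = paste0("mon", mon, "_year", year, "_cat", cat,
                        "_bin", bin, "_cat2", cat2))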

Replacing NA's with LOCF using Sparklyr

My aim is to replace NAs in a Spark data frame using the Last Observation Carried Forward (LOCF) method. I wrote the following code and it works. However, it seems to take longer than expected for a larger dataset.
It would be great if someone can recommend a better approach or improve the code.
Example and Code with Sparklyr
In the following example, NAs are replaced after ordering the rows by time and grouping them by grp.
df_with_nas <- data.frame(time = seq(as.Date('2001/01/01'),
as.Date('2010/01/01'), length.out = 10),
grp = c(rep(1, 5), rep(2, 5)),
v1 = c(1, rep(NA, 3), 5, rep(NA, 5)),
v2 = c(NA, NA, 3, rep(NA, 4), 3, NA, NA))
tbl <- copy_to(sc, df_with_nas, overwrite = TRUE)
tbl %>%
spark_apply(function(df) {
library(dplyr)
na_locf <- function(x) {
v <- !is.na(x)
c(NA, x[v])[cumsum(v) + 1]
}
df %>% arrange(time) %>% group_by(grp) %>% mutate_at(vars(-v1, -grp),
funs(na_locf(.)))
})
# # Source: spark<?> [?? x 4]
# time grp v1 v2
# <dbl> <dbl> <dbl> <dbl>
# 1 11323 1 1 NaN
# 2 11688. 1 NaN NaN
# 3 12053. 1 NaN 3
# 4 12419. 1 NaN 3
# 5 12784. 1 5 3
# 6 13149. 2 NaN NaN
# 7 13514. 2 NaN NaN
# 8 13880. 2 NaN 3
# 9 14245. 2 NaN 3
# 10 14610 2 NaN 3
data.table
Following approach with data.table works quite fast for the data I have. I am expecting the size of the data to increase soon, and then I may have to rely on sparklyr.
library(data.table)
setDT(df_with_nas)
df_with_nas <- df_with_nas[order(time)]
cols <- c("v1", "v2")
df_with_nas[, (cols) := zoo::na.locf(.SD, na.rm = FALSE),
by = grp, .SDcols = cols]
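If you would rather drop the zoo dependency, data.table's own nafill() (available from data.table 1.12.4 on) should do the same job; a sketch, not tested on the full-size data:
library(data.table)
# fill each numeric column forward within each group using data.table's native LOCF
df_with_nas[, (cols) := lapply(.SD, nafill, type = "locf"), by = grp, .SDcols = cols]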
I did this sort of loop; it is quite slow...
df_with_nas <- df_with_nas %>% mutate(row = 1:nrow(df_with_nas))
for (n in 1:50) {
  df_with_nas <- df_with_nas %>%
    arrange(row) %>%
    mutate_all(~ if_else(is.na(.), lag(., 1), .))
}
Run it until no NAs remain; then collect(df_with_nas) will trigger the computation.
You can leverage the spark_apply() function and run the na.locf function in each of your cluster nodes.
Install R runtimes on each of your cluster nodes.
Install the zoo R package on each node as well.
Run spark apply this way:
data_filled <- spark_apply(data_with_holes, function(df) zoo::na.locf(df))
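If the fill has to respect grp as in the question, spark_apply() also takes a group_by argument, so a hedged variant (assuming the column names from the question) might look like:
# run na.locf within each grp partition instead of across the whole frame;
# note that ordering by time within each group is not handled here
data_filled <- spark_apply(data_with_holes,
                           function(df) zoo::na.locf(df, na.rm = FALSE),
                           group_by = "grp")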
You can do this quite quickly using SQL, with the added benefit that you can easily apply LOCF on a grouped basis. The pattern you want is LAST_VALUE(column, true) OVER (window): this searches the window for the most recent value of the column that is not NA (passing true as the second argument sets ignore nulls to true). Since you want to look backwards from the current row, the window should be
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING
Of course, if the first value in the group is NA it will remain NA.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
test_table <- data.frame(
v1 = c(1, 2, NA, 3, NA, 5, NA, 6, NA),
v2 = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
time = c(1, 2, 3, 4, 5, 2, 1, 3, 4)
) %>%
sdf_copy_to(sc, ., "test_table")
spark_session(sc) %>%
sparklyr::invoke("sql", "SELECT *, LAST_VALUE(v1, true)
OVER (PARTITION BY v2
ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND -1 FOLLOWING)
AS last_non_na
FROM test_table") %>%
sdf_register() %>%
mutate(v1 = ifelse(is.na(v1), last_non_na, v1))
#> # Source: spark<?> [?? x 4]
#> v1 v2 time last_non_na
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 NaN
#> 2 2 1 2 1
#> 3 2 1 3 2
#> 4 3 1 4 2
#> 5 3 1 5 3
#> 6 NaN 2 1 NaN
#> 7 5 2 2 NaN
#> 8 6 2 3 5
#> 9 6 2 4 6
Created on 2019-08-27 by the reprex package (v0.3.0)

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once; each row represents one admission for a patient. Each admission, regardless of the subject, also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know which rows are the subsequent re-admissions. The variable to use is "Daystoevent"; the lowest number represents the index admission. I want to create a new variable such that, for each subject, the lowest value of "Daystoevent" marks the "index" admission and each subsequent admission gets a number "1", "2", etc. I want to do this WITHOUT changing to the wide (horizontal) format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
library(dplyr)
df <- df %>%
  group_by(Subject) %>%
  arrange(Subject, Daystoevent) %>%
  mutate(Admission = if_else(Daystoevent == min(Daystoevent), 0, 1)) %>%
  ungroup()
for (i in 1:(nrow(df) - 1)) {
  # guard against overwriting the next subject's index row (coded 0)
  if (df$Admission[i] == 1 && df$Admission[i + 1] != 0) {
    df$Admission[i + 1] <- 2
  } else if (df$Admission[i + 1] != 0) {
    df$Admission[i + 1] <- df$Admission[i] + 1
  }
}
df[df == 0] <- "index"
df
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- tibble(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
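A more compact alternative, sketched on the same data (not part of the original answer), lets row_number() do the counting within each subject:
library(dplyr)
df %>%
  group_by(Subject) %>%
  arrange(Daystoevent, .by_group = TRUE) %>%
  # the earliest admission becomes "index"; later ones get 1, 2, ...
  mutate(Admission = if_else(row_number() == 1, "index",
                             as.character(row_number() - 1))) %>%
  ungroup()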

Ordering a dataframe by its subsegments

My team and I are dealing with many thousands of URLs that have similar segments.
Some URLs have one segment ("seg", plural, "segs") in a position of interest to us. Other similar URLs have a different seg in the position of interest to us.
We need to sort a dataframe consisting of URLs and associated unique segs
in the position of interest, showing the frequency of those unique segs.
Here is a simplified example:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
We are looking for the following:
url freq seg
1   3    a    # url #1 appears three times, each time with seg = "a"
2   2    b    # url #2 appears twice, each time with seg = "b"
3   3    c    # url #3 appears three times with seg = "c",
3   2    x    #   twice with seg = "x", and
3   1    y    #   once with seg = "y"
4   1    d    # and so on
I can get there using a loop and several small steps, but am convinced there is a more elegant way of doing this. Here's my inelegant approach:
Create empty dataframe with num.unique rows and three columns (url, freq, seg)
result <- data.frame(url=0, Freq=0, seg=0)
Determine the unique URLs
unique.df.url <- unique(df$url)
Loop through the dataframe
for (xx in unique.df.url) {
url.seg <- df[df$url == xx, ] # create a dataframe for each unique url and its associated segs
freq.df.url <- data.frame(table(url.seg)) # summarize the frequency distribution of the segs by url
result <- rbind(result,freq.df.url) # append a new data.frame onto the last one
}
Eliminate rows in the dataframe where Frequency = 0
result.freq <- result[which(result$Freq > 0), ]
Sort the dataframe by URL
result.order <- result.freq[order(result.freq$url), ]
This yields the desired results, but since it is so inelegant, I am concerned that once we move to scale, the time required will be prohibitive or at least a concern. Any suggestions?
In base R you can do this :
aggregate(freq~seg+url,`$<-`(df,freq,1),sum)
# or aggregate(freq~seg+url, data.frame(df,freq=1),sum)
# seg url freq
# 1 a 1 3
# 2 b 2 2
# 3 c 3 3
# 4 x 3 2
# 5 y 3 1
# 6 d 4 1
The trick with $<- is just to add a column freq of value 1 everywhere, without changing your source table.
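For illustration (a sketch, not part of the original answer), the modified copy that aggregate() sees looks like this, while df itself keeps only url and seg:
head(`$<-`(df, freq, 1), 3)
#   url seg freq
# 1   1   a    1
# 2   3   c    1
# 3   1   a    1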
Another possibility:
subset(as.data.frame(table(df[2:1])),Freq!=0)
# seg url Freq
# 1 a 1 3
# 8 b 2 2
# 15 c 3 3
# 17 x 3 2
# 18 y 3 1
# 22 d 4 1
Here I use [2:1] to switch the order of columns so table orders the results in the required way.
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
library(dplyr)
df %>% count(url, seg) %>% arrange(url, desc(n))
# # A tibble: 6 x 3
# url seg n
# <dbl> <fct> <int>
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
Would the following code be better for you?
library(dplyr)
df %>% group_by(url, seg) %>% summarise(n())
Or paste & tapply:
url <- c(1, 3, 1, 4, 2, 3, 1, 3, 3, 3, 3, 2)
seg <- c("a", "c", "a", "d", "b", "c", "a", "x", "x", "y", "c", "b")
df <- data.frame(url,seg)
want <- tapply(url, INDEX = paste(url, seg, sep = "_"), length)
want <- data.frame(do.call(rbind, strsplit(names(want), "_")), want)
colnames(want) <- c("url", "seg", "freq")
want <- want[order(want$url, -want$freq), ]
rownames(want) <- NULL # needed?
want <- want[ , c("url", "freq", "seg")] # needed?
want
An option is to use table and convert the result to a data frame to get the data in the format needed by the OP:
library(tidyverse)
table(df) %>% as.data.frame() %>%
filter(Freq > 0 ) %>%
arrange(url, desc(Freq))
# url seg Freq
# 1 1 a 3
# 2 2 b 2
# 3 3 c 3
# 4 3 x 2
# 5 3 y 1
# 6 4 d 1
OR
df %>% group_by(url, seg) %>%
summarise(freq = n()) %>%
arrange(url, desc(freq))
# # A tibble: 6 x 3
# # Groups: url [4]
# url seg freq
# <dbl> <fctr> <int>
# 1 1.00 a 3
# 2 2.00 b 2
# 3 3.00 c 3
# 4 3.00 x 2
# 5 3.00 y 1
# 6 4.00 d 1
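Since the question mentions scale, a data.table sketch may also be worth considering (assuming data.table is an acceptable dependency; it is not part of the answers above):
library(data.table)
setDT(df)
# count rows per (url, seg) pair, then order by url and descending frequency
df[, .(freq = .N), by = .(url, seg)][order(url, -freq)]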

Subset tibble based on column sums, while retaining character columns

I have a feeling this is a pretty stupid issue, but I haven't been able to find the solution.
I have a tibble where each row is a sample, the first column is a character variable containing the sample ID, and all subsequent columns are numeric variables.
For example:
id <- c("a", "b", "c", "d", "e")
x1 <- rep(1,5)
x2 <- seq(1,5,1)
x3 <- rep(2,5)
x4 <- seq(0.1, 0.5, 0.1)
tb <- tibble(id, x1, x2, x3, x4)
I want to subset this to include only the columns with a sum greater than 5, and the id column. With the old dataframe structure, I know the following worked:
df <- as.data.frame(tb)
df2 <- cbind(df$id, df[, colSums(df[, 2:5]) > 5])
colnames(df2)[1] <- "id"
However, when I try to subset this way with a tibble, I get the error message:
Error: Length of logical index vector must be 1 or 5, got: 4
Does anyone know how to accomplish this task without converting to the old data frame format? Preferably without creating an intermediate tibble with the id variable missing, because separating my ids from my data is just asking for trouble down the road.
Thanks!
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
df <- tibble(id = letters[1:5], x1 = 1, x2 = 1:5, x3 = 2, x4 = seq(.1, .5, len = 5))
### two additional examples of how to generate the Tibble data
### exploiting that its arguments are evaluated lazily and sequentially
# df <- tibble(id = letters[1:5], x1 = 1, x2 = 1:5, x3 = x1 + 1, x4 = x2/10)
# df <- tibble(x2 = 1:5, id = letters[x2], x3 = 2, x1 = x3-1, x4 = x2/10) %>%
# select(id, num_range("x", 1:4))
base R solution, cf. HubertL's comment above,
### HubertL's base solution
df[c(TRUE,colSums(df[2:5])>5)]
#> # A tibble: 5 x 3
#> id x2 x3
#> <chr> <int> <dbl>
#> 1 a 1 2
#> 2 b 2 2
#> 3 c 3 2
#> 4 d 4 2
#> 5 e 5 2
dplyr solution, cf David Klotz's comment,
### Klotz's dplyr solution
library(dplyr)
df %>% select_if(function(x) is.character(x) || sum(x) > 5)
#> # A tibble: 5 x 3
#> id x2 x3
#> <chr> <int> <dbl>
#> 1 a 1 2
#> 2 b 2 2
#> 3 c 3 2
#> 4 d 4 2
#> 5 e 5 2
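As a side note (not from the original answers), select_if() is superseded in dplyr 1.0 and later; an equivalent sketch with where() would be:
library(dplyr)
df %>% select(where(~ is.character(.x) || sum(.x) > 5))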
