Count if >0 into frequency distribution in R

Say I have 900 dataframes at hand, and I wanted to get something similar to a frequency distribution based off of another column for each "type".
Sample code:
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
This is similar to what I currently have just working with one dataframe:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
If Sepal.Length = 0 for any row, then I'm not going to count it (the Count column will be 0 for that row).
But I'm going to have 900 dataframes that I'll be running this code on, so I was thinking about running it through a loop.
Ideally, if two dataframes are inputted, and both have Sepal.Length values >0 for row 1, then I want the count to be 2 for row 1 / type 1. Is there a better way to approach this? And if I do go for the looping option then is there a way to combine all the dataframes to tell R that row 1 / type 1 has multiple > 0 values?

For your iris example, what it sounds like you want is:
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- slice(df1, 1:7)
df2 <- df2 %>%
mutate(type = 1:7)
group_by(df2, type) %>%
transmute(has_sepal = sum(Sepal.Length > 0))
# A tibble: 7 x 2
# Groups: type [7]
# type has_sepal
# <int> <int>
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 1
# 7 7 1
To do this over 900 data frames: if you just want it to work on iris-like data, hard-code the variable names. Someone who is familiar with writing functions using tidyverse evaluation could write a more general version for you, but that's still on my todo list.
f_fill_in_blank_first <- function(tib){
  # hard-code the grouping variable <var1> and the counted variable <var2>
  group_by(tib, <var1>) %>%
    transmute(var2_not_zero = sum(<var2> != 0))
}
f_iris <- function(tib){
  group_by(tib, type) %>%
    transmute(var1_not_zero = sum(Sepal.Length != 0))
}
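For what it's worth, a minimal sketch of such a general version using tidy evaluation's curly-curly operator (the function and argument names here are placeholders I made up, not from the original post):
count_not_zero <- function(tib, group_var, value_var) {
  # group by one unquoted column and count non-zero values of another
  group_by(tib, {{ group_var }}) %>%
    transmute(not_zero = sum({{ value_var }} != 0))
}
# count_not_zero(df2, type, Sepal.Length)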
Depending on the structure of your 900 data frames, you could vapply with this function (edit: no, not quite this function; refactor it so it produces a named atomic vector if you want to vapply it) to put the whole thing into an array, then collapse one of the dimensions with apply and sum.
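For example, here is a rough sketch of that idea, assuming every data frame has the same seven type values and a Sepal.Length column as in the iris example:
f_iris_vec <- function(tib) {
  out <- tib %>%
    group_by(type) %>%
    summarise(not_zero = sum(Sepal.Length != 0)) %>%
    arrange(type)
  set_names(out$not_zero, out$type)    # named atomic vector, suitable for vapply
}
dfs <- list(df2, df2, df2)                      # stand-in for your 900 data frames
counts <- vapply(dfs, f_iris_vec, numeric(7))   # 7 types x length(dfs) matrix
rowSums(counts)                                 # total non-zero count per type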

If you want to keep your code:
df2 %>% select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0)) %>%
mutate(Percentage = Count/7)
You can wrap it into a function (add_a_count):
library(tidyverse)
df1 <- as_tibble(iris)
df2 <- df1 %>%
mutate(type = 1:nrow(df1))
add_a_count = function(df)
{
counted_df = df %>%
select(type, Sepal.Length) %>%
mutate(Count = ifelse(Sepal.Length > 0, 1, 0),
Percentage = Count/7)
return(counted_df)
}
I generate 100 duplicates of the test df2 with the following function:
duplicate_df = function(df, no_duplicates)
{
tmp_df_list = list()
for(i in c(1:no_duplicates))
{
print(paste0("Duplicate ", i, " generated."))
tmp_df_list[[i]] = df
}
return(tmp_df_list)
}
data_frames_list = duplicate_df(df = df2, no_duplicates = 100)
And use it with lapply:
counted_data_frames = lapply(data_frames_list, add_a_count)
The list counted_data_frames can relatively easily be manipulated (You can use another apply function if you want a non-list output). This might not be the fastest way to do it, but it's straightforward.
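For instance, a minimal sketch of one such non-list summary, assuming every data frame has its rows in the same order (so row i always refers to the same type):
count_matrix <- sapply(counted_data_frames, `[[`, "Count")   # rows = types, columns = data frames
rowSums(count_matrix)                                        # total count per type across all data frames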
EDIT
You can get your Count columns by looping over the list of data frames. A new data frame counts_data_frame contains all counts, with each column holding the counts from one original data frame:
counts_data_frame = data.frame(type = seq(from = 1, to = nrow(df2)))
for(i in c(1:length(counted_data_frames)))
{
counts_data_frame = cbind(counts_data_frame, as.vector(counted_data_frames[[i]]["Count"]))
}
When looping over the rows of this new data frame, you can sum up your counts and get a vector of counts for plotting:
counts_summarised = vector(length = nrow(counts_data_frame))
for(i in c(1:nrow(counts_data_frame)))
{
counts_summarised[i] = sum(counts_data_frame[i, 2:ncol(counts_data_frame)])
}
plot(counts_summarised, ylab = "Counts", xlab = "Type")

In this solution, I will show you how to:
import all CSV files into separate data frames in a list, assuming that they all have the same column name for the variable you are interested in and that the files are in one folder (your working directory, preferably);
count the number of zero and non-zero measurements and their proportions;
convert the list into a data frame.
Specifically, I used lapply() for looping through the data.frames, converting the list to a data.frame using enframe(), unnesting the value column with unnest(), and spreading the pct by type using spread().
Let's first create a data to work with.
library(tidyverse)
# create a list
datlist <- list()
# this list will contain ten data frames with
# a sample with up to 8 0's and 20 random uniforms as observations
for (i in seq_len(10)){
datlist[[i]] = data.frame(x = sample(c(sample(c(0,1,2,3,4), 8, replace = T), runif(20,0,10))))
}
# name each element of the list datlist
name_element <- LETTERS[1:10]
datlist <- set_names(datlist, name_element)
# save each file separately
mapply(write.csv, datlist, file=paste0(names(datlist), '.csv'), row.names = FALSE)
The following will import your data into R and store them as data.frames in a list.
# import all csv files in the folder into separate data frames in the temp list
temp <- list.files(pattern = "*.csv")
myfiles <- lapply(temp, read.csv)
The following will calculate the percentages by type if we assume that each file contains the same variables.
# Calculate the frequency and relative distributions
lapply(myfiles,
function(varname) mutate(varname, type = if_else(x == 0, 0, 1)) %>%
group_by(type) %>% summarise(n = n()) %>%
mutate(pct = n / sum(n))
) %>%
enframe() %>% # convert the list into a data.frame
unnest(value) %>% # unnest the values
spread(type, pct) # spread the values by type
# A tibble: 17 x 4
name n `0` `1`
<int> <int> <dbl> <dbl>
1 1 3 0.107 NA
2 1 25 NA 0.893
3 2 28 NA 1.00
4 3 1 0.0357 NA
5 3 27 NA 0.964
6 4 2 0.0714 NA
7 4 26 NA 0.929
8 5 28 NA 1.00
9 6 28 NA 1.00
10 7 2 0.0714 NA
11 7 26 NA 0.929
12 8 3 0.107 NA
13 8 25 NA 0.893
14 9 1 0.0357 NA
15 9 27 NA 0.964
16 10 1 0.0357 NA
17 10 27 NA 0.964

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~
linked_df = df2 %>%
mutate(links=case_when(links==0 & var %in% df1$var ~
sample(c(df1$ids),n(),replace=T) # unsure about this line
TRUE ~ links)
I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL
Something like this could do the trick, just a map of a filter of the first dataframe:
df2 %>%
as_tibble() %>%
mutate(links = map(var, ~sample(filter(df1, var == .)$ids), 1),
index = row_number()) %>%
unnest(links, keep_empty = TRUE) %>%
group_by(index) %>%
slice_sample(n = 1) %>%
ungroup() %>%
select(-index)
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 1
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA

Appending a column to each data frame within a list

I have a list of dataframes and want to append a new column to each, however I keep getting various error messages. Can anybody explain why the below code doesn't work for me? I'd be happy if rowid_to_column works, as the data in my actual set is already ordered correctly; otherwise I'd like a new column with a sequence going from 1:length(data$data).
##dataset
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
nest_by(Location)
###map + rowid_to_column
attempt1<- data%>%
map(.,rowid_to_column(.,var = "hour"))
##mutate
attempt2<-data %>%
map(., mutate("Hours" = 1:6))
###add column
attempt3<- data%>%
map(.$data,add_column(.data,hours = 1:6))
newcolumn<- 1:6
###lapply
attempt4<- lapply(data,cbind(data$data,newcolumn))
Many thanks,
Stuart
You were nearly there with your base R attempt, but you want to iterate over data$data, which is a list of data frames.
data$data <- lapply(data$data, function(x) {
hour <- seq_len(nrow(x))
cbind(x, hour)
})
data$data
# [[1]]
# Day Average Amplitude hour
# 1 1 6.070539 1.123182 1
# 2 2 3.638313 8.218556 2
# 3 3 11.220683 2.049816 3
# 4 4 12.832782 14.858611 4
# 5 5 12.485757 7.806147 5
# 6 6 19.250489 6.181270 6
Edit: updated, as I realised it was iterating over columns rather than rows. This approach will work if the data frames have different numbers of rows, which the methods with the vector defined as 1:6 will not.
A data.table approach:
library(data.table)
setDT(data)
data[, data := lapply(data, function(x) cbind(x, new_col = 1:6))]
data$data
# [[1]]
# Day Average Amplitude test new_col
# 1 1 11.139917 0.3690539 1 1
# 2 2 5.350847 7.0925508 2 2
# 3 3 9.602104 6.1782818 3 3
# 4 4 14.866074 13.7356913 4 4
# 5 5 1.114201 1.1007080 5 5
# 6 6 2.447236 5.9944926 6 6
#
# [[2]]
# Day Average Amplitude test new_col
# 1 1 17.230213 13.966576 1 1
# .....
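If the nested data frames can have different numbers of rows, a small variation of the same data.table call (a sketch) avoids hard-coding 1:6:
data[, data := lapply(data, function(x) cbind(x, new_col = seq_len(nrow(x))))]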
A purrr approach:
data<- tibble(Location = c(rep("London",6),rep("Glasgow",6),rep("Dublin",6)),
Day= rep(seq(1,6,1),3),
Average = runif(18,0,20),
Amplitude = runif(18,0,15))%>%
group_split(Location) %>%
purrr::map_dfr(~.x %>% mutate(Hours = c(1:6)))
If you want to use your approach and preserve the same data structure, this is a way again using purrr (you need to ungroup, otherwise it will not work due to the rowwise grouping)
data %>% ungroup() %>%
mutate_at("data", .f = ~map(.x, ~.x %>% mutate(Hours = c(1:6))) )

Replace column values in a df with matching index with new values in R

I have a data frame df containing 2 variables, df and val. df contains the numbers 1-255 and val is a column of generated values. I also have new_vals, which is a vector of 255 different values.
df = (seq(1,255,by=1))
df = as.data.frame(df)
df$val = seq(0,1,length.out=255)
new_vals = (df$val+1)
new_vals=as.data.frame(new_vals)
I want to replace the values in df, where each number 1-255 in df$df corresponds to one of the 255 numbers in new_vals. If the index matches, replace df$val with the value at that index from new_vals.
dataframe df
df val
1 0.000000000
2 0.003937008
3 0.007874016
dataframe new_vals (these are the values at indices 1, 2, 3)
new_vals
<dbl>
1.000000
1.003937
1.007874
Expected Output of dataframe df after replacing values at matching index
df val
1 1.000000
2 1.003937
3 1.007874
What is the easiest way I could do this?
Edit: I realize in this example I could just replace the column, but imagine df$df's order of 1-255 was randomized, or the data had more rows.
If I'm understanding correctly, here's a way to match indices with dplyr:
library(dplyr)
new_vals %>%
mutate(index = row_number()) %>%
left_join(df, by = c("index" = "df"), keep = T)
Which gives us:
new_vals index df val
1 1.000000 1 1 0.000000000
2 1.003937 2 2 0.003937008
3 1.007874 3 3 0.007874016
The full proposed solution, keeping only the relevant columns, would be:
new_vals %>%
mutate(index = row_number()) %>%
left_join(df, by = c("index" = "df"), keep = T) %>%
select(df, val = new_vals)
Which gives us:
df val
1 1 1.000000
2 2 1.003937
3 3 1.007874
4 4 1.011811
5 5 1.015748
6 6 1.019685
7 7 1.023622
8 8 1.027559
9 9 1.031496
10 10 1.035433
If you are sure that df$df contains 1-255, then:
df$val[which(df$df %in% c(1:255))] <- new_vals$new_vals
In addition, a for loop gives you more control and checks the index explicitly:
for (row in df$df) {
df$val[df$df==row] <- new_vals$new_vals[row]
}
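Since df$df is itself an index into new_vals, direct indexing is another sketch that also works when the order of df$df is randomized:
df$val <- new_vals$new_vals[df$df]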

findInterval by group with dplyr [duplicate]

This question already has answers here:
How to quickly form groups (quartiles, deciles, etc) by ordering column(s) in a data frame
(11 answers)
Closed 1 year ago.
In this example I have a tibble with two variables:
a group variable gr
the variable of interest val
set.seed(123)
df <- tibble(gr = rep(1:3, each = 10),
val = gr + rnorm(30))
Goal
I want to produce a discretized version of val using the function findInterval, but the breakpoints should be gr-specific, since in my actual data as well as in this example, the distribution of val depends on gr. The breakpoints are determined within each group using the quartiles of val.
What I did
I first construct a nested tibble containing the vectors of breakpoints for each value of gr:
df_breakpoints <- bind_cols(gr = 1:3,
purrr::map_dfr(1:3, function(gr) {
c(-Inf, quantile(df$val[df$gr == gr], c(0.25, 0.5, 0.75)), Inf)
})) %>%
nest(bp = -gr) %>%
mutate(bp = purrr::map(.$bp, unlist))
Then I join it with df:
df <- inner_join(df, df_breakpoints, by = "gr")
My first guess to define the discretized variable lvl was
df %>% mutate(lvl = findInterval(x = val, vec = bp))
It produces the error
Error : Problem with `mutate()` input `lvl2`.
x 'vec' must be sorted non-decreasingly and not contain NAs
ℹ Input `lvl` is `findInterval(x = val, vec = bp)`.
Then I tried
df$lvl <- purrr::imap_dbl(1:nrow(df),
~findInterval(x = df$val[.x], vec = df$bp[[.x]]))
or
df %>% mutate(lvl = purrr::map2_int(df$val, df$bp, findInterval))
It does work. However, it is highly inefficient. With my actual data (1.2 million rows) it takes several minutes to run. I guess there is a much better way of doing this than iterating over rows. Any idea?
You can do this in a single group_by + mutate step:
library(dplyr)
df %>%
group_by(gr) %>%
mutate(breakpoints = findInterval(val,
c(-Inf, quantile(val, c(0.25, 0.5, 0.75)), Inf))) %>%
ungroup
# gr val breakpoints
# <int> <dbl> <int>
# 1 1 0.440 1
# 2 1 0.770 2
# 3 1 2.56 4
# 4 1 1.07 3
# 5 1 1.13 3
# 6 1 2.72 4
# 7 1 1.46 4
# 8 1 -0.265 1
# 9 1 0.313 1
#10 1 0.554 2
# … with 20 more rows
findInterval is applied for each gr separately.
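If equal-sized quartile groups are all you need (as in the linked duplicate), dplyr::ntile() is a common shortcut; note this is just a sketch, and that ntile() buckets by rank rather than by the quantile breakpoints used above, so results can differ at ties:
df %>%
  group_by(gr) %>%
  mutate(quartile = ntile(val, 4)) %>%
  ungroup()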

Create column of a tibble (or data frame) that contains a list from a long-format tibble

I have objects that have varying numbers of events at varying times. This is currently stored in a long format (using tibbles from library(tidyverse)) :
timing_tbl <- tibble(ID = c(101,101,101,102,102,103,103,103,103),
event_time = c(0,4,8,0,6,0,4,9,12))
The real data has thousands of objects, with up to 50 or so events, so I want to make this process as efficient as possible.
I would like to convert this to a pseudo-wide format, where the first column is the patient ID, and the second column is a list of the event times for that object. I can do that where the second column is a column of tibbles in the following way
tmp <- lapply(unique(timing_tbl$ID),
function(x) timing_tbl[timing_tbl$ID == x, "event_time"])
timing_tbl2 <- tibble(unique(timing_tbl$ID),tmp)
> timing_tbl2[1,2]
# A tibble: 1 x 1
tmp
<list>
1 <tibble [3 × 1]>
> timing_tbl2[[1,2]]
# A tibble: 3 x 1
event_time
<dbl>
1 0
2 4.00
3 8.00
I would prefer to store these objects as lists, as I then want to find the “distance” between each pair of objects using the following function, and I worry that extracting the vector from the list adds unnecessary processing, slowing down the calculation.
lap_exp2 <- function(x,y,tau) {
exp(-abs(x - y)/tau)
}
distance_lap2 <- function(vec1,vec2,tau) {
## vec1 is first list of event times
## vec2 is second list of event times
## tau is the decay parameter
0.5*(sum(outer(vec1,vec1,FUN=lap_exp2, tau = tau)) +
sum(outer(vec2,vec2,FUN=lap_exp2, tau = tau))
) -
sum(outer(vec1,vec2,FUN=lap_exp2, tau = tau))
}
distance_lap2(timing_tbl2[[1,2]]$event_time,timing_tbl2[[2,2]]$event_time,2)
[1] 0.8995764
If I try extracting the list instead of the tibble using [[
tmp <- lapply(unique(timing_tbl$ID),
function(x) timing_tbl[[timing_tbl$ID == x, "event_time"]])
I get the following error, which makes sense
Error in col[[i, exact = exact]] : attempt to select more than one element in vectorIndex
Is there a reasonably simple way I can extract the column from the long tibble as a list and store it in the new tibble? Is this even the right way to go about this?
I've found using tidyr::nest a good way to generate the 'list columns' I think you may be after (especially for stuffing in time series-ish sort of data). Hope the following helps!
library(dplyr)
library(tidyr)
library(purrr)
timing_tbl <- tibble(ID = c(101,101,101,102,102,103,103,103,103),
event_time = c(0,4,8,0,6,0,4,9,12))
ID_times <-
timing_tbl %>%
group_by(ID) %>%
nest(.key = "times_df") %>%
split(.$ID) %>%
map(~ .$times_df %>% unlist(use.names = F))
# > ID_times
# $`101`
# [1] 0 4 8
# $`102`
# [1] 0 6
# $`103`
# [1] 0 4 9 12
dists_long <-
names(ID_times) %>%
expand.grid(IDx = ., IDy = .) %>%
filter(IDx != IDy) %>%
rowwise() %>%
mutate(dist = distance_lap2(vec1 = ID_times[[IDx]], vec2 = ID_times[[IDy]], tau = 2))
# # A tibble: 6 x 3
# IDx IDy dist
# <fct> <fct> <dbl>
# 1 102 101 0.900
# 2 103 101 0.981
# 3 101 102 0.900
# 4 103 102 1.68
# 5 101 103 0.981
# 6 102 103 1.68
