Look up value in other dataframe for computation - r

I have a df that looks like this:
df1 <- data.frame(country = c("C1","C1","C2","C2"),year = c(1998,2001,1998,2001), amount = c(11000,11500,5000,4100))
I created another df based on the first one as follows:
df2 <- aggregate(amount ~ year, df1, sum)
I would like to create a new column df1$ratio corresponding to each country's share of the total amount for each year. It should look like:
df3 <- data.frame(country = c("C1","C1","C2","C2"),year = c(1998,2001,1998,2001), amount = c(11000,11500,5000,4100), ratio = c(.6875, .7372,.3125,.2628))
Any idea?

Instead of a two-step process, it can be done with ave from base R:
df1$ratio <- with(df1, amount/ave(amount, year, FUN = sum))
Or with mutate from dplyr:
library(dplyr)
df1 %>%
  group_by(year) %>%
  mutate(ratio = amount/sum(amount))
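For comparison, here is a minimal sketch of the two-step lookup the question describes, merging the yearly totals from df2 back onto df1 (column names taken from the example data):
# look up each year's total in df2, then divide
tmp <- merge(df1, df2, by = "year", suffixes = c("", "_total"))
tmp$ratio <- tmp$amount / tmp$amount_total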

Related

How to subset a data frame by id, with sampling 1 row by id? (in R)

I have a big data frame and each row has an id code.
But I want to create another data frame with only one row for each id.
How can I do it?
This is one part of the data (the id column is "codigo_pon"):
Using dplyr, you can do this:
library(dplyr)
your_data %>%
  group_by(id_column) %>%
  sample_n(1) %>%
  ungroup()
Based on the question, you could do something like this:
library(tidyverse)
# Example data
data <-
  tibble(
    id = rep(1:20, each = 5),
    value = rnorm(100)
  )
# Sample data, 1 row by id
data %>%
  # Group by id variable
  group_by(id) %>%
  # Sample 1 row by id
  sample_n(size = 1)
base R
data[!ave(seq_len(nrow(data)), data$codigo_pon,
          FUN = function(z) seq_along(z) != sample(length(z), size = 1)), ]
or
do.call(rbind, by(data, data$codigo_pon,
                  FUN = function(z) z[sample(nrow(z), size = 1), ]))
(Previously I suggested aggregate, but that sampled each column separately, breaking up the rows.)
data.table
library(data.table)
as.data.table(data)[, .SD[sample(.N, size = 1),], by = codigo_pon]
(dplyr has already been demonstrated twice)
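All of these approaches draw rows with sample(), so if the result needs to be reproducible, set the random seed first, e.g.:
set.seed(123)  # any fixed seed makes the sampled rows repeatable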

How to collapse one dataset to get incremental instances in a new dataset?

I'm trying to collapse a dataframe whose unit of analysis is country-pta-year into country-year. I tried group_by and summarise, but it would sum up all the values instead of adding them incrementally, as each observation of "value" is present for each PTA in different years. Below are the dataframe (df) I have and the dataframe I would like to achieve (df2).
What should I do next?
country <- c("USA","USA","USA","USA","USA","USA")
year <- c(2000,2001,2002,2000,2001,2002)
pta <- c("a","a","a","y","y","y")
value <- c(0,1,1,0,0,1)
df <- data.frame(country, year,pta, value)
country1 <- c("USA","USA","USA")
year1 <- c(2000,2001,2002)
value1 <- c(0,1,2)
df2 <- data.frame(country1,year1, value1)
It is a group-by sum, i.e. grouped by 'country' and 'year', get the sum of 'value':
library(dplyr)
df %>%
  group_by(country, year) %>%
  summarise(value = sum(value, na.rm = TRUE), .groups = 'drop')
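For the example data this collapses to one USA row per year with values 0, 1 and 2 (for 2002, both PTA 'a' and 'y' contribute 1), matching df2. As a sketch, the same group-by sum in base R would be:
# same grouped sum with aggregate from base R
aggregate(value ~ country + year, df, sum)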

Mutating a count of rows per group matching a subset condition

I wish to mutate a new column called SF_COUNT which counts, per group (ID), the number of rows where the column type contains 'SF'.
A reproducible example looks as follows:
df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'))
My final data frame looks like:
final_df <- data.frame(ID = c(1234,1234,1234,4567,4567,4567,4567,8900,8900,8900),type = c('RF','SF','SF','RF','SF','SF','SF','RF','SF','SF'), SF_COUNT = c(2,2,2,3,3,3,3,2,2,2))
How can I achieve this in dplyr please?
After grouping by 'ID', get the sum of the logical vector (type == 'SF') in mutate to create the new column:
library(dplyr)
df <- df %>%
  group_by(ID) %>%
  mutate(SF_COUNT = sum(type == 'SF', na.rm = TRUE))
If 'SF' should be matched as a substring, use str_detect instead:
library(stringr)
df <- df %>%
  group_by(ID) %>%
  mutate(SF_COUNT = sum(str_detect(type, 'SF'), na.rm = TRUE))
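A base R analogue, mirroring the ave approach from the first question above, would be (a sketch):
# count of 'SF' rows per ID, repeated on every row of the group
df$SF_COUNT <- with(df, ave(type == 'SF', ID, FUN = sum))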

sample multiple different sample sizes using crossing and sample_n to create single df

I am attempting to sample a dataframe using sample_n. I know that sample_n usually takes a single size= argument at a time; however, I would like to sample sizes from 2 up to the number of rows in the df. Unfortunately, the code I have compiled below does not do the job. The needed output would be a dataframe with an id column, or a list split by the id column from crossing().
df <- data.frame(Date = 1:15,
                 grp = rep(1:3, each = 5),
                 frq = rep(c(3, 2, 4), each = 5))
data_sampled_by_stratum <- df %>%
  group_by(Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  group_by(id) %>%
  sample_n(size = c(2:15)) %>%
  group_by(CLUSTER_ID, Date) %>%
  filter(n() > 2)
If you had a column with different sites, you could do this:
data_sampled_by_stratum <- data_grouped_by_stratum %>%
  group_by(siteid, Date) %>%
  crossing(id = seq(500)) %>% # repeat dataframes
  sample_n(rbinom(1, sum(siteid == i), (1 - s)^2)) # i and s must already exist in the calling environment
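As a minimal sketch of another way to draw a different sample size per draw, assuming the goal is one sample of each size from 2 up to nrow(df) with an id column recording the size (slice_sample is the newer dplyr counterpart of sample_n):
library(dplyr)
library(purrr)
samples <- map_dfr(2:nrow(df), function(n) {
  df %>%
    slice_sample(n = n) %>% # draw n rows without replacement
    mutate(id = n)          # tag the draw with its sample size
})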

How to Calculate A Daily Max for Different Locations in R?

So currently I am able to calculate a daily max for one site using the following code:
library('dplyr')
library('data.table')
library('tidyverse')
library('tidyr')
library('lubridate')
# takes a data frame with enter_yard / exit_yard datetime columns and
# returns the timestamped +1/-1 events with a running Volume count
funcVolume <- function(max_data)
{
  vecOnes <- array(1, c(length(max_data$enter_yard), 1))
  vecTime <- c(max_data$enter_yard, max_data$exit_yard)
  vecCount <- c(vecOnes, -vecOnes)
  df_test <- data.frame(T = vecTime, Count = vecCount)
  df_test <- df_test %>%
    arrange(T) %>%
    mutate(Volume = cumsum(Count))
  df_test
}
df_test <- funcVolume(max_data) # max_data is the one-site input data frame
df_test2 <- df_test
df_test2$date <- as.Date(format(df_test$T, "%Y-%m-%d"))
df_test3 <- tibble(x = df_test2$Volume, y = df_test2$date) %>%
  arrange(y)
dataset <- df_test3 %>%
  group_by(y) %>%
  dplyr::filter(x == max(x)) %>%
  distinct(x, .keep_all = TRUE) %>%
  ungroup()
However, I would like to do this for multiple locations. In my original dataframe, I have a column that lists the name of the site, and two columns for when an object enters or leaves a site. The name is just a general text column, and the other two columns are datetime columns. Ideally, I would want an output that looks like the following:
Date | Max Count | Site
x    | y         | z
x    | a         | b
I also have a couple million rows of data, so I need something that can run in a reasonable time frame.
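A minimal sketch of one way to extend the same idea to multiple locations, assuming the input is a data frame max_data with columns site, enter_yard and exit_yard (names taken from the description above, not from the original post):
library(dplyr)
library(tidyr)
daily_max_by_site <- max_data %>%
  # one row per event: +1 at an entry time, -1 at an exit time
  pivot_longer(c(enter_yard, exit_yard), names_to = "event", values_to = "T") %>%
  mutate(Count = ifelse(event == "enter_yard", 1, -1)) %>%
  arrange(site, T) %>%
  group_by(site) %>%
  mutate(Volume = cumsum(Count)) %>%
  # daily maximum of the running count, per site
  group_by(site, date = as.Date(T)) %>%
  summarise(max_count = max(Volume), .groups = "drop")
Because this is a single grouped pipeline rather than a loop over sites, it should scale to a few million rows in a reasonable time frame.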
