I am trying to create a summary table that tells me a bike's usage within a borough. The formula is
(No. of times a Bike is rented in particular Borough) / (Total No of rentals in that Borough).
Final output should look something like this.
BikeId Borough Pct
1 K&C 0.02
1 Hammersmith 0.45
7 K&C 0.32
To achieve that I am trying to implement a function as below:
smplData <- function(df) {
#initialize an empty dataframe
summDf <- data.frame(BikeId = character(), Borough = character(), Pct =
double())
#create a vector of unique borough names
boro <- unique(df[,"Start.Borough"])
for (i in 1:length(boro)){
#looping through each borough and create a freq table
bkCntBor<- table(df[df$Start.Borough==boro[i],"Bike.Id"])
#total number of rentals in a particular borough
borCnt <- nrow(df[df$Start.Borough==boro[i],])
for (j in 1:length(bkCntBor)){
#looping thru each bike for the ith borough and calculate ratio of jth bike
bkPct <- as.vector(bkCntBor[j])/borCnt
#temp dataframe to store a single row corresponding to bike, boro and ratio
dfTmp <- data.frame(BikeId = names(bkCntBor[j]), Borough = boro[i],
Pct = bkPct)
#append to summary table
summDf <<- rbind(summDf, dfTmp)
}
}
}
The head of the df dataset is as below
>head(df)
Bike.Id Start.Borough Rental.Id
1 K&C 61349872
1 K&C 61361611
1 Royal Parks 61362295
1 K&C 61364627
1 K&C 61367817
1 H&F 61368333
When I run the function after inserting one record in summDf I get the below error
Error in data.frame(BikeId = names(bkCntBor[j]), Borough = boro[i], Pct = bkPct) :
arguments imply differing number of rows: 0, 1
I can run the function code in the console by passing one value at a time for i and j, but when I run it as a function I get the error mentioned above.
Any help you guys can provide will be amazing.
Here is some sample data for the same.
Bike.Id Start.Borough
1 K&C
1 K&C
1 K&C
7 K&C
7 K&C
1 Hammersmith
1 Hammersmith
7 Hammersmith
9 Hammersmith
9 Westminster
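A note on the error message itself, as a minimal sketch rather than a confirmed diagnosis: in R, 1:length(x) counts down to c(1, 0) when x is empty, so a loop written that way still runs, and indexing with 0 produces zero-length values that data.frame() cannot combine with a length-one value. Whether this is what happens in the function above depends on the real data (for example, a borough whose frequency table comes out empty), which is an assumption here. The object names below are illustrative only.

```r
# An empty frequency table, standing in for a borough with no countable bikes
empty <- table(character(0))

# 1:length() counts down when the length is zero, so the loop body would run
idx_colon <- 1:length(empty)

# seq_along() yields a zero-length index, so the loop body would be skipped
idx_safe <- seq_along(empty)

# Combining a zero-length column with a length-one column fails in data.frame()
res <- tryCatch(
  data.frame(BikeId = character(0), Borough = "K&C"),
  error = function(e) e
)
```

Separately, summDf <<- rbind(...) assigns to a variable outside the function, so the local summDf created on the first line stays empty and the function itself returns nothing useful.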
Here's an option using dplyr
library(dplyr)
dd %>%
group_by(Start.Borough, Bike.Id) %>%
summarize(n=n()) %>%
mutate(pct = n / sum(n)) %>%
select(-n)
First we use group_by() and summarize() to find the counts of each borough/bike combination. Then we use mutate() to divide each borough/bike count by the total number of rentals in that borough.
Start.Borough Bike.Id pct
<fctr> <int> <dbl>
1 Hammersmith 1 0.50
2 Hammersmith 7 0.25
3 Hammersmith 9 0.25
4 K&C 1 0.60
5 K&C 7 0.40
6 Westminster 9 1.00
with the sample input
dd <- data.frame(Bike.Id = c(1, 1, 1, 7, 7, 1, 1, 7, 9, 9),
Start.Borough = c("K&C", "K&C", "K&C", "K&C", "K&C", "Hammersmith",
"Hammersmith", "Hammersmith", "Hammersmith", "Westminster"))
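For comparison, the same within-borough shares can be sketched in base R with table() and prop.table(). This is an alternative under the same sample input, not the method used in the answer above; the names counts, shares and res are illustrative.

```r
dd <- data.frame(
  Bike.Id = c(1, 1, 1, 7, 7, 1, 1, 7, 9, 9),
  Start.Borough = c("K&C", "K&C", "K&C", "K&C", "K&C", "Hammersmith",
                    "Hammersmith", "Hammersmith", "Hammersmith", "Westminster")
)

# Cross-tabulate rentals: rows are boroughs, columns are bikes
counts <- table(dd$Start.Borough, dd$Bike.Id)

# Divide each row by its row total (margin = 1) to get within-borough shares
shares <- prop.table(counts, margin = 1)

# Convert to long format and drop borough/bike pairs that never occur
res <- as.data.frame(shares, stringsAsFactors = FALSE)
names(res) <- c("Borough", "BikeId", "Pct")
res <- res[res$Pct > 0, ]
res <- res[order(res$Borough, res$BikeId), ]
res
```

Note that BikeId comes back as character in this version because it is taken from the table's dimension names.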
Related
I have a dataframe of coordinates for different studies that have been conducted. The studies are either experiment or observation however at some locations both experiment AND observation occur. For these sites, I would like to create a new study category called both. How can I do this using dplyr?
Example Data
df1 <- data.frame(matrix(ncol = 4, nrow = 6))
colnames(df1)[1:4] <- c("value", "study", "lat","long")
df1$value <- c(1,1,2,3,4,4)
df1$study <- rep(c('experiment','observation'),3)
df1$lat <- c(37.541290,37.541290,38.936604,29.9511,51.509865,51.509865)
df1$long <- c(-77.434769,-77.434769,-119.986649,-90.0715,-0.118092,-0.118092)
df1
value study lat long
1 1 experiment 37.54129 -77.434769
2 1 observation 37.54129 -77.434769
3 2 experiment 38.93660 -119.986649
4 3 observation 29.95110 -90.071500
5 4 experiment 51.50986 -0.118092
6 4 observation 51.50986 -0.118092
Note that the value above is duplicated when study has experiment AND observation.
The ideal output would look like this
value study lat long
1 1 both 37.54129 -77.434769
2 2 experiment 38.93660 -119.986649
3 3 observation 29.95110 -90.071500
4 4 both 51.50986 -0.118092
Within each 'value' group, we can replace study with 'both' when the group contains both experiment and observation, and then keep only the distinct rows
library(dplyr)
df1 %>%
group_by(value) %>%
mutate(study = if(all(c("experiment", "observation") %in% study))
"both" else study) %>%
ungroup %>%
distinct
-output
# A tibble: 4 × 4
value study lat long
<dbl> <chr> <dbl> <dbl>
1 1 both 37.5 -77.4
2 2 experiment 38.9 -120.
3 3 observation 30.0 -90.1
4 4 both 51.5 -0.118
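If dplyr is not available, the same idea can be sketched in base R. This is only an illustration under the example data above, assuming (as in the example) that each value has a single lat/long pair; the names types, label and out are made up for the sketch.

```r
df1 <- data.frame(
  value = c(1, 1, 2, 3, 4, 4),
  study = rep(c("experiment", "observation"), 3),
  lat   = c(37.541290, 37.541290, 38.936604, 29.9511, 51.509865, 51.509865),
  long  = c(-77.434769, -77.434769, -119.986649, -90.0715, -0.118092, -0.118092)
)

# Study types recorded at each site, as a list keyed by value
types <- split(df1$study, df1$value)

# "both" when a site has both types, otherwise its single type
label <- vapply(
  types,
  function(s) if (all(c("experiment", "observation") %in% s)) "both" else s[1],
  character(1)
)

# One row per site, with the collapsed label attached
out <- unique(df1[, c("value", "lat", "long")])
out$study <- unname(label[as.character(out$value)])
out <- out[, c("value", "study", "lat", "long")]
out
```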
What I'm trying to accomplish is basically the equivalent of a vlookup in Excel with an if statement to determine which table to use based on the value of a given column.
Main dataset looks like this:
#STATE CODE AMOUNT
# NJ 1 88
# DE 2 75
# VA 1 24
# PA 1 32
Then there are a handful of other tables that I need to use for the lookups to add the factor column, depending on the state - some states are unique, and the rest all use a common table. For example (the actual tables are much longer than this):
NJ:
#CODE FACTOR
# 1 0.75
# 2 0.90
PA:
#CODE FACTOR
# 1 0.80
# 2 0.95
All Other:
#CODE FACTOR
# 1 0.82
# 2 0.93
So the final output would be:
#STATE CODE AMOUNT FACTOR
# NJ 1 88 0.75
# DE 2 75 0.93
# VA 1 24 0.82
# PA 1 32 0.80
Is there a way to conditionally join/lookup from the various factor tables depending on the value of State, in this example? Or would I need to combine the factor tables into a single table and explicitly list every state/factor combination and then join based on both State and Code? Thanks for your help.
Instead of having a different data frame for each state, it would be easier to keep all of the factors in one data frame.
Here is one approach -
library(dplyr)
combined_states <- bind_rows(lst(NJ, PA, other), .id = "STATE")
main %>%
mutate(STATE_temp = replace(STATE,
!STATE %in% unique(combined_states$STATE), 'other')) %>%
left_join(combined_states, by = c('STATE_temp' = 'STATE', 'CODE')) %>%
select(-STATE_temp)
# STATE CODE AMOUNT FACTOR
# <chr> <dbl> <dbl> <dbl>
#1 NJ 1 88 0.75
#2 DE 2 75 0.93
#3 VA 1 24 0.82
#4 PA 1 32 0.8
Note that the name of the data frame other must match the replacement value used for STATE_temp.
data
main <- tibble(STATE = c('NJ', 'DE', 'VA', 'PA'),
CODE = c(1, 2, 1, 1),
AMOUNT = c(88, 75, 24, 32))
NJ <- tibble(CODE = c(1, 2), FACTOR = c(0.75, 0.9))
PA <- tibble(CODE = c(1, 2), FACTOR = c(0.8, 0.95))
other <- tibble(CODE = c(1, 2), FACTOR = c(0.82, 0.93))
Make a list of all data frames from your global environment with state abbreviation names, bind them into a single data frame with a STATE column, and join to the main dataset. Then for anything left over we can use rows_patch to fill in NA values.
library(dplyr)
state_tables = ls(pattern = paste0("^(", paste(state.abb, collapse = "|"), ")$")) |> # match only objects named exactly like a state abbreviation
mget() |>
bind_rows(.id = "STATE")
main = main |>
left_join(state_tables, by = c("STATE", "CODE")) |>
rows_patch(other, by = "CODE")
main
# A tibble: 4 × 4
# STATE CODE AMOUNT FACTOR
# <chr> <dbl> <dbl> <dbl>
# 1 NJ 1 88 0.75
# 2 DE 2 75 0.93
# 3 VA 1 24 0.82
# 4 PA 1 32 0.8
(Using Ronak's kindly shared data)
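The "state-specific factor first, common factor as fallback" logic in both answers can also be sketched in base R with two merges. This is an illustrative variant, not either answerer's code; row_id and FACTOR_default are helper columns introduced only for this sketch.

```r
main <- data.frame(STATE = c("NJ", "DE", "VA", "PA"),
                   CODE = c(1, 2, 1, 1),
                   AMOUNT = c(88, 75, 24, 32))
NJ <- data.frame(CODE = c(1, 2), FACTOR = c(0.75, 0.9))
PA <- data.frame(CODE = c(1, 2), FACTOR = c(0.8, 0.95))
other <- data.frame(CODE = c(1, 2), FACTOR = c(0.82, 0.93))

# Stack the state-specific tables and tag each with its state
specific <- rbind(data.frame(STATE = "NJ", NJ), data.frame(STATE = "PA", PA))

# Rename the fallback factor so it does not clash after merging
names(other)[names(other) == "FACTOR"] <- "FACTOR_default"

# merge() reorders rows, so remember the original order
main$row_id <- seq_len(nrow(main))

# First lookup: state-specific factor (NA where the state has no table)
out <- merge(main, specific, by = c("STATE", "CODE"), all.x = TRUE)
# Second lookup: common factor by CODE only
out <- merge(out, other, by = "CODE", all.x = TRUE)

# Prefer the state-specific factor, fall back to the common one
out$FACTOR <- ifelse(is.na(out$FACTOR), out$FACTOR_default, out$FACTOR)
out <- out[order(out$row_id), c("STATE", "CODE", "AMOUNT", "FACTOR")]
out
```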
I'm new to R and trying to use it in place of Excel (where I have more experience). I'm still working out the full 'for' logic, but not being able to look up the values I need to check whether it works the way I think it should is stopping me in my tracks. The goal is to generate what will be used as a factor with 3 levels: 0 = no duplicates, 1 = duplicate, oldest, 2 = duplicate, newest.
I have a dataframe that looks like this
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- cbind(Person, Date, ID, DuplicateStatus, IdealResult)
I am trying to use a for loop to evaluate if person duplicates. If a person does not duplicate, value= 0 and if they do duplicate, they should have a 1 for the oldest value and a 2 for the newest value (see ideal result). NOTE: I have already sorted the data to be by person and then date, so if duplicated, first appearance is oldest.
The previous "VLOOKUP in R" answers I found here are aimed at merging datasets based on identical values in multiple datasets. Here, I am attempting to modify a column based on the relationship between columns within a single dataset.
currentID = 0
nextID =0
for(i in mydata$ID){
currentID = i
nextID = currentID++1
CurrentPerson ##Vlookup function that does - find currentID in ID, return associated value in Person column in same position.
NextPerson ##Vlookup function that does - find nextID in ID, return associated value in Person column in same position.
if CurrentPerson = NextPerson, then DuplicateStatus at ID associated with current person should be 1, and DuplicateStatus at ID associated with NextPerson = 2.
**This should end when current person = total number of people
Thanks!
You really need to spend some time with a simple tutorial on R. Your cbind() function converts all of your data to a character matrix which is probably not what you want. Look at the results of str(mydata). Instead of looping, this creates an index number within each Person group and then zeros out the groups with a single observation:
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
IR <- ave(mydata$ID, mydata$Person, FUN=seq_along)
IR
# [1] 1 1 1 2 1 1 2
tbl <- table(mydata$Person)
tozero <- mydata$Person %in% names(tbl[tbl == 1])
IR[tozero] <- 0
IR
# [1] 0 0 1 2 0 1 2
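The point about cbind() can be checked with a small sketch. The three-row vectors here are a shortened stand-in for the question's data, used only for illustration.

```r
Person <- c("A", "B", "C")
ID <- c(1, 2, 3)

# cbind() on atomic vectors builds a matrix, and a matrix has a single type,
# so the numeric ID values are coerced to character
m <- cbind(Person, ID)
is.matrix(m)
is.character(m)

# data.frame() keeps each column's own type
d <- data.frame(Person, ID)
is.numeric(d$ID)
```

Because m is a matrix, an expression such as m$ID fails ($ is not valid for atomic vectors), which is one reason the loop in the question cannot look up values by column name.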
Is what you are looking for just to count the number of observations for a person, in one column (like a column ID)? If so, this will work using tidyverse:
Person <- c("A", "B", "C", "C", "D", "E","E")
Date <- c(1/1/20, 1/1/20,12/25/19, 1/1/20, 1/1/20, 12/25/19, 1/1/20)
ID <- c(1,2,3,4,5,6,7)
DuplicateStatus <- c(0,0,0,0,0,0,0)
IdealResult <- c(0,0,1,2,0,1,2)
mydata <- data.frame(Person, Date, ID, DuplicateStatus, IdealResult)
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = seq_along(Person))
mydata
# A tibble: 7 x 6
# Groups: Person [5]
Person Date ID DuplicateStatus IdealResult Duplicate
<fct> <dbl> <dbl> <dbl> <dbl> <int>
1 A 0.05 1 0 0 1
2 B 0.05 2 0 0 1
3 C 0.0253 3 0 1 1
4 C 0.05 4 0 2 2
5 D 0.05 5 0 0 1
6 E 0.0253 6 0 1 1
7 E 0.05 7 0 2 2
You could assign the row number within each group, provided the group has more than one row.
This can be implemented in base R, dplyr as well as data.table
In base R :
mydata$ans <- with(mydata, ave(ID, Person, FUN = function(x)
seq_along(x) * (length(x) > 1)))
# Person Date ID IdealResult ans
#1 A 0.0500000 1 0 0
#2 B 0.0500000 2 0 0
#3 C 0.0252632 3 1 1
#4 C 0.0500000 4 2 2
#5 D 0.0500000 5 0 0
#6 E 0.0252632 6 1 1
#7 E 0.0500000 7 2 2
Using dplyr:
library(dplyr)
mydata %>% group_by(Person) %>% mutate(ans = row_number() * (n() > 1))
and with data.table
library(data.table)
setDT(mydata)[, ans := seq_along(ID) * (.N > 1), Person]
data
mydata <- data.frame(Person, Date, ID, IdealResult)
I would argue that n() is the ideal function for your problem
library(tidyverse)
mydata <- mydata %>%
group_by(Person) %>%
mutate(Duplicate = n())
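A side note on the Date column: the outputs above show 0.05 and 0.0253 because unquoted 1/1/20 and 12/25/19 are evaluated as arithmetic (division), not as dates. A minimal sketch of one way to store them as real dates, assuming the intended format is month/day/two-digit year:

```r
# Unquoted, this is just 1 divided by 1 divided by 20
not_a_date <- 1/1/20

# Quoted strings can be parsed with an explicit format
Date <- as.Date(c("12/25/19", "1/1/20"), format = "%m/%d/%y")

# Real dates sort and subtract as expected
days_apart <- as.numeric(Date[2] - Date[1])
```

With proper Date values, sorting by person and then date (as described in the question) can be done with order() rather than relying on the input order.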
I'm trying to create decile factors corresponding to my dataframe's values. I would like the factors to appear as a range e.g. if the value is "164" then the factored result should be "160 - 166".
In the past I would do this:
quantile(countries.Imported$Imported, seq(0,1, 0.1), na.rm = T) # display deciles
Imported.levels <- c(0, 1000, 10000, 20000, 30000, 50000, 80000) # create levels from observed deciles
Imported.labels <- c('< 1,000t', '1,000t - 10,000t', '10,000t - 20,000t', etc) # create corresponding labels
colfunc <- colorRampPalette(c('#E5E4E2', '#8290af','#512888'))
# apply factor function
Imported.colors <- colfunc(10)
names(Imported.colors) <- Imported.labels
countries.Imported$Imported.fc <- factor(
cut(countries.Imported$Imported, Imported.levels),labels = Imported.labels)
Instead, I would like to apply a function that will factor the values into decile range. I want to avoid manually setting factor labels since I will be running many queries and plotting maps that have discrete legends. I've created a column called Value.fc but I cannot format it to "160 - 166" from "(160, 166]". Please see the problematic code below:
corn_df <- corn_df %>%
mutate(Value.fc = gtools::quantcut(Value, 10))
corn_df %>%
select(Value, unit_desc, domain_desc, Value.fc) %>%
head(6)
A tibble: 6 x 4
Value unit_desc domain_desc Value.fc
<dbl> <chr> <chr> <fct>
1 164. BU / ACRE TOTAL (160,166]
2 196. BU / ACRE TOTAL (191,200]
3 203. BU / ACRE TOTAL (200,230]
4 205. BU / ACRE TOTAL (200,230]
5 172. BU / ACRE TOTAL (171,178]
6 213. BU / ACRE TOTAL (200,230]
You can try to use dplyr::ntile() or Hmisc::cut2().
If you're interested in where each decile of the variable starts and ends, you can use Hmisc::cut2() and stringr::str_extract_all()
require(tidyverse)
require(Hmisc)
require(stringr)
df <- data.frame(value = 1:100) %>%
mutate(decline = cut2(value, g=10),
decline = factor(sapply(str_extract_all(decline, "\\d+"),
function(x) paste(x, collapse="-"))))
head(df)
value decline
1 1 1-11
2 2 1-11
3 3 1-11
4 4 1-11
5 5 1-11
6 6 1-11
If you only need the decile number of each value, you can use dplyr::ntile().
require(tidyverse)
df <- data.frame(value = 1:100) %>%
mutate(decline = ntile(value, 10))
head(df)
value decline
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
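To get labels in the exact "160 - 166" form asked for, one option is to rewrite the factor levels after cutting. The sketch below assumes the levels look like "(160,166]" as in the question's output; relabel_ranges is a hypothetical helper name, and the breaks are illustrative values loosely based on the intervals printed in the question, not computed deciles.

```r
# Hypothetical helper: turn interval levels such as "(160,166]" into "160 - 166"
relabel_ranges <- function(f) {
  lev <- levels(f)
  lev <- gsub("[][()]", "", lev)             # strip the brackets and parentheses
  lev <- sub(",", " - ", lev, fixed = TRUE)  # replace the separating comma
  levels(f) <- lev
  f
}

x <- c(164, 196, 203, 205, 172, 213)
f <- cut(x, breaks = c(160, 166, 171, 178, 191, 200, 230))
f2 <- relabel_ranges(f)
as.character(f2)
```

Since only the levels are rewritten, their order is preserved, so the result should work as a discrete legend without setting labels by hand. The same helper could be applied to the Value.fc column produced by gtools::quantcut(), as long as its levels follow the same pattern.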
I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record and Trial:
library(dplyr)
getInfo <- function(race_df) {
  # longest distance (End - Start) within each record/trial
  race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(Max.Distance = max(End - Start)))
  # largest Number within each record/trial
  race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(Max.Number = max(Number)))
  # combine both summaries and sort them
  rd_rmn_merge <- merge(x = race_distance, y = race_max_number)
  total_summary <- rd_rmn_merge[order(rd_rmn_merge$Record, rd_rmn_merge$Trial), ]
  return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to get the longest distance together with its corresponding 'Number', regardless of whether that Number is the maximum. So Record 1, Trial 2 should look like this instead:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to be able to create a function that is able to take arguments 'Record' and 'Trial' through the 'race_df' dataframe to make searching for a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame( Record = c(1,1,1,2),
Trial = c(2,2,3,1),
Start = c(1,4,1,1),
End = c(4,6,3,5),
Speed = c(12,11,10,14),
Number = c(9,10,17,5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
mutate( Max.Distance = End - Start) %>%
select(-Start,-End,-Speed) %>%
group_by(Record) %>%
nest() %>%
mutate( data = map( data, ~ filter(.x, Max.Distance == max(Max.Distance)) )) %>%
unnest(cols = data)
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note: if you want to keep all of your columns in the final data frame, just remove the select() call.
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) where we have the maximum distance, right?
So, it boils down to two filters:
filter rows for the record and trial.
filter the row inside that subset that has the maximum distance
Between those two filters, we have to calculate the distance although I suggest you move that outside the function because it is basically a one time operation.
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5), Speed = c(12, 11, 10, 14),
Number = c(9, 10, 17, 5))
get_longest <- function(df, record, trial){
df %>%
filter(Record == record & Trial == trial) %>%
mutate(Distance = End - Start) %>%
filter(Distance == max(Distance)) %>%
select(Number, Distance)
}
get_longest(race_df, 1, 2)
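To get the longest distance and its corresponding Number for every Record/Trial combination at once, rather than one lookup at a time, a base R sketch with ave() is shown below. It is an illustration under the sample data above, not part of either answer; ties are kept, so a trial with two equally long runs would return two rows.

```r
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14), Number = c(9, 10, 17, 5))

# Distance covered in each row
race_df$Distance <- race_df$End - race_df$Start

# Largest distance within each Record/Trial group, repeated for every row
group_max <- ave(race_df$Distance, race_df$Record, race_df$Trial, FUN = max)

# Keep the rows that reach their group's maximum, along with their Number
longest <- race_df[race_df$Distance == group_max,
                   c("Record", "Trial", "Distance", "Number")]
longest
```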