for loop: not equally-sized data frames in R

I'm working on a data frame. Here's what it looks like:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on, up to 635 rows.
The other dataset that I want to compare against looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
There is also a last dataset that I want to compare by previousHour.
The first and second datasets share the attribute category; the first and third share the attribute hour.
previousHour is computed per category. For example, for the Work cluster category, previousHour should look like this:
hour
0
3
4
4
4
5
and so on, up to 144 rows for each category.
For the Shopping cluster category, previousHour should look like this:
hour
0
3
3
4
4
5
and so on, up to 144 rows.
The Airport and Residential categories follow the same pattern, each with 144 rows.
SumHour dataset:
category sumHour
1 Airport 2208
2 Residential area 1656
3 Shopping cluster 1656
4 Work cluster 1656
Here's, what I ideally want to find in R:
#for n in 1: number of rows{
# calculate sumHours(in SumHours dataset) - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that. Here's what I tried:
mergetbl <- function(tbl1, tbl2)
{
  newtbl = data.frame(hour = numeric(), forgHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows
  for (n in 1:ntbl1rows)
  {
    # check the previous hour from IDA dataset !!!!
    # calculate sumDate - previousHour = newHourSum and store it as newHourSum
    # calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
    # add to the empty dataframe
    newHourSum <- 3588 - tbl1
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    newtbl <- rbind(newtbl, footfallHour)
  }
}
But nothing happened to newtbl...
Here's what ideally looks like for newtbl:
hour forgHour
0 1337.79 (the function should calculate this)
3 ...
3 ...
3 ...
4 ...
3 ...
and so on...

Thinking in terms of vectors, try this:
### this is to get your Foreigners/Locals to be at the same size as tbl1
Foreigners = ifelse(tbl1$category == "Work cluster", tbl2$Foreigners[1],
             ifelse(tbl1$category == "Shopping cluster", tbl2$Foreigners[2],
             ifelse(tbl1$category == "Airport", tbl2$Foreigners[3],
                    tbl2$Foreigners[4])))
Locals = ifelse(tbl1$category == "Work cluster", tbl2$Locals[1],
         ifelse(tbl1$category == "Shopping cluster", tbl2$Locals[2],
         ifelse(tbl1$category == "Airport", tbl2$Locals[3],
                tbl2$Locals[4])))
And now, the function
resultHour = function(tbl1, tbl2, ForeOrLoca)
{
  previousHour = rep(0, nrow(tbl1))
  for (i in 2:nrow(tbl1))
  {
    previousHour[i] = tbl1$hour[i - 1]
  }
  ### The conditional sum matching the category from tbl1
  newHourSum = ifelse(tbl1$category == "Work cluster", sum(with(tbl1, hour * I(category == "Work cluster"))),
               ifelse(tbl1$category == "Shopping cluster", sum(with(tbl1, hour * I(category == "Shopping cluster"))),
               ifelse(tbl1$category == "Airport", sum(with(tbl1, hour * I(category == "Airport"))),
                      sum(with(tbl1, hour * I(category == "Residential area"))))))
  ## and finally, this
  hour = as.vector(tbl1$hour)
  footfallHour <- (hour / (newHourSum - previousHour)) * ForeOrLoca
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
This is the output I get:
> head(newtbl)
hour footfallHour
[1,] 3 1337.7926
[2,] 3 1506.2762
[3,] 3 12631.9264
[4,] 4 1785.2162
[5,] 3 441.7132
[6,] 3 1506.2762
Using the function:
TheResultIWant = resultHour(tbl1, tbl2, Foreigners)

For your new question.
Provided you cut your data frame into several containing only one category, you can use this function:
new_function_hour_result = function(tbl1_categ, vec_categ, prevHour_Categ, sumHour_Categ)
{
  hour = as.vector(tbl1_categ$hour)
  footfallHour <- (hour / (sumHour_Categ - prevHour_Categ)) * vec_categ
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
With tbl1_categ your data frame for a given category, vec_categ your foreigner or local data for a given category, prevHour_Categ the previousHour for a given category, and finally sumHour_Categ the sumHour for a given category.
To get your vectors to be the same size as the df they will be compared to, for instance for vec_categ in the locals/airport case:
locals_airport = rep(category[3, 3], nrow(tbl1_airport))
for foreigners and the airport category: foreig_airport = rep(category[3, 2], nrow(tbl1_airport))
this will repeat the value contained in category[3, 2], nrow(tbl1_airport) times (note that rep() takes the repeat count as its second argument; it has no nrow = argument).
for locals and workcluster: locals_workcluster = rep(category[1, 3], nrow(tbl1_workcluster))
and so on for each vector (i.e. prevHour_Categ, sumHour_Categ, vec_categ) for each category.
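As an alternative to cutting the data frame into one piece per category, the same per-category quantities can be computed in place with merge() and ave(). This is only a sketch: the column names (category, hour, Foreigners) follow the question, and the data below are made up.

```r
# Sketch: per-category footfall share without splitting the data frame.
tbl1 <- data.frame(category = c("Airport", "Airport", "Work cluster"),
                   hour     = c(3, 4, 3))
tbl2 <- data.frame(category   = c("Airport", "Work cluster"),
                   Foreigners = c(15095152, 1600000))

m <- merge(tbl1, tbl2, by = "category")          # attach Foreigners to each row
m$previousHour <- ave(m$hour, m$category,        # lagged hour within category
                      FUN = function(h) c(0, head(h, -1)))
m$sumHour <- ave(m$hour, m$category, FUN = sum)  # per-category hour sum
m$footfallHour <- (m$hour / (m$sumHour - m$previousHour)) * m$Foreigners
```

The ave() calls replace both the previousHour loop and the nested ifelse() sums, and the merge() replaces the hand-built Foreigners vector.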

Related

Is there an R function to keep all rows even if criteria not met?

I have a data table of teams data by number of players by Region, and I want to copy it to an existing spreadsheet that logs, by region, the teams within certain size categories (1, 2, 3-7, etc.). In some cases a team does not meet the criteria (0), so it does not pull into the data table. How do I ensure the data table continues to show these rows with a 0 rather than removing them altogether?
Data is available here: https://github.com/rweingarten/Team-Data
team_sizes <- setDT(teams)[, .(sum = sum(uniqueN(player_id))), by = list(REGION, team_ID)]
team_size_3<-subset(team_sizes, team_sizes$sum >= 3 & team_sizes$sum <= 7, by = (team_sizes$REGION))
team_size_three <- setDT(team_size_3)[, uniqueN(team_ID), by = REGION]
This team_size_three results in:
team_size_three
PA 3
GB 1
NE 1
NJ 5
NY 1
OK 2
but it should display:
team_size_three
AR 0
PA 3
GA 1
NE 1
NJ 5
NY 1
OK 2
MD 0
where you can see AR and MD pulling in 0s.
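No answer is recorded for this question here, but one common pattern is to join the filtered counts back onto the full set of regions and fill the gaps with 0. A sketch using the question's column names (REGION, team_ID, player_id) with made-up data:

```r
library(data.table)

# Made-up data in the shape described by the question.
teams <- data.table(REGION    = c("PA", "PA", "PA", "AR"),
                    team_ID   = c(1, 1, 1, 2),
                    player_id = c(10, 11, 12, 13))

team_sizes <- teams[, .(size = uniqueN(player_id)), by = .(REGION, team_ID)]
in_band <- team_sizes[size >= 3 & size <= 7,
                      .(n_teams = uniqueN(team_ID)), by = REGION]

# Right-join onto every region so regions with no qualifying teams keep a row.
result <- in_band[unique(teams[, .(REGION)]), on = "REGION"]
result[is.na(n_teams), n_teams := 0]
```

Here PA has one team of size 3 and AR has a team of size 1, so the result keeps AR with n_teams = 0 instead of dropping it.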

R data frame: Change value in 1 column depending on value in another

I have a data frame called nurse. It contains several columns, but only one (nurse$word) is relevant here. I want to create a new column named nurse$w.frequency which looks at the words in the nurse$word column and, if it finds the one specified, changes the corresponding nurse$w.frequency value to a specified integer.
nurse <- read.csv(...)
file word w.frequency
1 determining
2 journey
3 journey
4 serving
5 work
6 journey
... ...
The word frequency for determining and journey, for instance, is 1590 and 4650 respectively. So it should look like the following:
file word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
... ...
I have tried it with an ifelse statement (below), which seems to work; however, every time I try to change the actual word and frequency it overwrites the results from before.
nurse$w.frequency <- ifelse(nurse$word == "determining", nurse$w.frequency[nurse$word["determining"]] <- 1590, "")
You could first initialise an empty column
nurse$w.frequency <- NA
then populate it with the data you want
nurse$w.frequency[nurse$word == "determining"] <- 1590
nurse$w.frequency[nurse$word == "journey"] <- 4650
Using dplyr:
nurse %>%
mutate(w.frequency =
case_when(
word == "determining" ~ "1590",
word == "journey" ~ "4650",
TRUE ~ ""
))
Gives us:
word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
Data:
nurse <- data.frame(word = c("determining", "journey", "journey", "serving", "work", "journey"))
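Another base R option, assuming the same columns: keep the word/frequency pairs in a named vector and index into it, which avoids writing one assignment per word.

```r
# Lookup table as a named vector: names are the words, values the frequencies.
freqs <- c(determining = 1590, journey = 4650)

nurse <- data.frame(word = c("determining", "journey", "journey",
                             "serving", "work", "journey"))

# Indexing by name returns NA for words not in the lookup table.
nurse$w.frequency <- freqs[as.character(nurse$word)]
```

Adding a new word is then a single entry in freqs rather than another assignment line.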

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing week for each Car name in price1, capture the row corresponding to that Car name in price2, and insert the row in price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my search in SO is leading me to answers regarding handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, fetched from price2. The final output should contain 54 observations for each of the 10 car names, 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE); df[, `:=`(Price = ifelse(is.na(Price), AveragePrice, Price), Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 dataframes and you want to make sure the dataframe "price1" has the correct row names (names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the value is NA
  if (is.na(price1[i, 1])) {
    # if it is NA, replace it with the corresponding value in price2
    price1[i, 1] <- price2[i, 1]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <-
  price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks use complete() or you can even fudge it and right_join a table that you will purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
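The expand-then-fill idea in the previous paragraph can be sketched with tidyr and dplyr. A sketch assuming the column names from the question, with a cut-down example (Weeks 1 and 5 present, the rest filled from price2):

```r
library(dplyr)
library(tidyr)

price1 <- data.frame(Name  = c("Car 1", "Car 1"), Week = c(1, 5),
                     Price = c(20000, 20000), Rebate = c(500, 400))
price2 <- data.frame(Name = "Car 1", AveragePrice = 20200, AverageRebate = 410)

filled <- price1 %>%
  complete(Name, Week = 1:54) %>%          # one row per Name/Week combination
  left_join(price2, by = "Name") %>%       # bring in the per-car averages
  mutate(Price  = coalesce(Price,  AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)
```

complete() builds the missing Name/Week rows with NA, and coalesce() keeps the observed value where one exists, falling back to the average otherwise.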

Gathering data using R - multiple urls

I have a dataframe which has a several columns and rows - some contain information, some are filled with NA, which should be replaced with certain data.
The rows represent specific instruments and columns contain various details of the instrument in a given row. The last column of the dataframe has a url for each instrument, which then will be used to grab data for empty columns:
Issuer NIN or ISIN Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes NA NA NA NA NA
2 NBRK KZW1KD079146 discount notes NA NA NA NA NA
3 NBRK KZW1KD079153 discount notes NA NA NA NA NA
4 NBRK KZW1KD089137 discount notes NA NA NA NA NA
URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913
For example, with the following code I get the details for the first instrument in the row NBRK KZW1KD079112:
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
Which gives the following:
V1
V2
1 Trading code NTK007_1911
2 List of securities official
3 System of quotation price
4 Unit of quotation nominal value percentage fraction
5 Quotation currency KZT
6 Quotation accuracy 4 characters
7 Trade lists admission date 04/21/17
8 Trade opening date 04/24/17
9 Trade lists exclusion date 04/28/17
10 Security <NA>
11 Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12 NSIN KZW1KD079112
13 Currency of issue and service KZT
14 Nominal value in issue's currency 100.00
15 Number of registered bonds 1,929,319,196
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
18 Settlement basis (days in month / days in year) actual / 365
19 Date of circulation start 04/21/17
20 Circulation term, days 7
21 Register fixation date at maturity 04/27/17
22 Principal repayment date 04/28/17
23 Paying agent Central securities depository JSC (Almaty)
24 Registrar Central securities depository JSC (Almaty)
From this, I will have to keep only:
14 Nominal value in issue's currency 100.00
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
19 Date of circulation start 04/21/17
22 Principal repayment date 04/28/17
I then will copy the needed data to the initial dataframe and carry on with the next row ... The dataframe consist of 100+ rows and will keep changing.
I would appreciate any help.
UPDATE:
Looks like the data that I need are not always in sp[[4]]. Sometimes its sp[[7]], maybe in the future it will be totally different table. Is there any way that looks for the information in the scraped tables and identifies a specific table that could be used further to collect data?:
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
library(XML)
library(reshape2)
library(dplyr)
name = c(
"NBRK KZW1KD079112 discount notes",
"NBRK KZW1KD079146 discount notes",
"NBRK KZW1KD079153 discount notes",
"NBRK KZW1KD089137 discount notes")
URL = c(
"http://www.kase.kz/en/gsecs/show/NTK007_1911",
"http://www.kase.kz/en/gsecs/show/NTK007_1914",
"http://www.kase.kz/en/gsecs/show/NTK007_1915",
"http://www.kase.kz/en/gsecs/show/NTK008_1913")
# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)
# define the columns wanted and the mapping to desired name
# extend to all wanted columns
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
"Number of bonds outstanding" = "# of Bonds Issue")
# function returns a data frame of wanted columns for given URL
getValues <- function (name, url) {
# get the table and rename columns
sp = readHTMLTable(url, stringsAsFactors = FALSE)
df <- sp[[4]]
names(df) <- c("full_name", "value")
# filter and remap wanted columns
result <- df[df$full_name %in% names(wanted),]
result$column_name <- sapply(result$full_name, function(x) {wanted[[x]]})
# add the identifier to every row
result$name <- name
return (result[,c("name", "column_name", "value")])
}
# invoke function for each name/URL pair - returns list of data frames
columns <- apply(instruments[,c("name", "URL")], 1, function(x) {getValues(x[["name"]], x[["URL"]])})
# bind using dplyr:bind_rows to make a tall data frame
tall <- bind_rows(columns)
# make wide using dcast from reshape2
wide <- dcast(tall, name ~ column_name, value.var = "value")
wide
# name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes 1,929,319,196 100.00
# 2 NBRK KZW1KD079146 discount notes 1,575,000,000 100.00
# 3 NBRK KZW1KD079153 discount notes 701,390,693 100.00
# 4 NBRK KZW1KD089137 discount notes 1,380,368,000 100.00
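For the UPDATE (the wanted rows are not always in sp[[4]]): one option is to scan every scraped table for the wanted row labels and take the first match, instead of hard-coding an index. This is a sketch assuming the two-column label/value shape of the readHTMLTable output shown above:

```r
# Sketch: pick the table that actually contains the wanted row labels.
findDetailTable <- function(sp, labels) {
  for (df in sp) {
    if (is.data.frame(df) && ncol(df) >= 2 &&
        any(labels %in% as.character(df[[1]]))) {
      return(df)
    }
  }
  NULL  # nothing matched
}
```

Inside getValues() the hard-coded `df <- sp[[4]]` would then become `df <- findDetailTable(sp, names(wanted))`, which keeps working if the site moves the details to a different table.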

How to find sum and average for some columns based on the numbers from another column in R

GIVEN DATA
I have 6 columns of data of vehicle trajectory (observation of vehicles' change in position, velocity, etc over time) a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which it was observed. Please note that each frame is 0.1 seconds long so 10 frames make 1 second. The IDs of frames is in Frame ID column. Vehicle class is the type of vehicle (1=motorcycle, 2=car, 3=truck). Vehicle velocity column represents instantaneous speed of vehicle in that instant of time i.e. in a frame. Lane represents the number or ID of the lane in which vehicle is present in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30 seconds time period i.e. 5-305, 305-605, 605-905,..., 9605-9905. The ouput should tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID' == 5:'Frame ID' == 305) {
  table('Vehicle ID')
  mean('Vehicle Velocity', 'Vehicle class' == 2)
}  # For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300-frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group:
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
