Gathering data using R - multiple URLs

I have a dataframe with several columns and rows - some contain information, others are filled with NA and should be replaced with certain data.
The rows represent specific instruments and the columns contain various details of the instrument in a given row. The last column of the dataframe holds a URL for each instrument, which will then be used to grab data for the empty columns:
Issuer NIN or ISIN Type Nominal Value # of Bonds Issue Volume Start Date End Date
1 NBRK KZW1KD079112 discount notes NA NA NA NA NA
2 NBRK KZW1KD079146 discount notes NA NA NA NA NA
3 NBRK KZW1KD079153 discount notes NA NA NA NA NA
4 NBRK KZW1KD089137 discount notes NA NA NA NA NA
URL
1 http://www.kase.kz/en/gsecs/show/NTK007_1911
2 http://www.kase.kz/en/gsecs/show/NTK007_1914
3 http://www.kase.kz/en/gsecs/show/NTK007_1915
4 http://www.kase.kz/en/gsecs/show/NTK008_1913
For example, with the following code I get the details for the first instrument, NBRK KZW1KD079112:
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]
Which gives the following:
                                   V1                                     V2
1 Trading code NTK007_1911
2 List of securities official
3 System of quotation price
4 Unit of quotation nominal value percentage fraction
5 Quotation currency KZT
6 Quotation accuracy 4 characters
7 Trade lists admission date 04/21/17
8 Trade opening date 04/24/17
9 Trade lists exclusion date 04/28/17
10 Security <NA>
11 Bond's name short-term notes of the National Bank of the Republic of Kazakhstan
12 NSIN KZW1KD079112
13 Currency of issue and service KZT
14 Nominal value in issue's currency 100.00
15 Number of registered bonds 1,929,319,196
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
18 Settlement basis (days in month / days in year) actual / 365
19 Date of circulation start 04/21/17
20 Circulation term, days 7
21 Register fixation date at maturity 04/27/17
22 Principal repayment date 04/28/17
23 Paying agent Central securities depository JSC (Almaty)
24 Registrar Central securities depository JSC (Almaty)
From this, I will have to keep only:
14 Nominal value in issue's currency 100.00
16 Number of bonds outstanding 1,929,319,196
17 Issue volume, KZT 192,931,919,600
19 Date of circulation start 04/21/17
22 Principal repayment date 04/28/17
I will then copy the needed data to the initial dataframe and carry on with the next row... The dataframe consists of 100+ rows and will keep changing.
I would appreciate any help.
UPDATE:
It looks like the data I need are not always in sp[[4]]. Sometimes they are in sp[[7]], and in the future they may be in a different table altogether. Is there a way to search the scraped tables for the information and identify the specific table that can then be used to collect the data?
sp = readHTMLTable(newd$URL[[1]])
sp[[4]]

library(XML)
library(reshape2)
library(dplyr)

name = c(
  "NBRK KZW1KD079112 discount notes",
  "NBRK KZW1KD079146 discount notes",
  "NBRK KZW1KD079153 discount notes",
  "NBRK KZW1KD089137 discount notes")
URL = c(
  "http://www.kase.kz/en/gsecs/show/NTK007_1911",
  "http://www.kase.kz/en/gsecs/show/NTK007_1914",
  "http://www.kase.kz/en/gsecs/show/NTK007_1915",
  "http://www.kase.kz/en/gsecs/show/NTK008_1913")
# data
instruments <- data.frame(name, URL, stringsAsFactors = FALSE)

# define the columns wanted and the mapping to the desired names
# (extend to all wanted columns)
wanted <- c("Nominal value in issue's currency" = "Nominal Value",
            "Number of bonds outstanding" = "# of Bonds Issue")

# function returns a data frame of the wanted columns for a given URL
getValues <- function(name, url) {
  # get the table and rename columns
  sp <- readHTMLTable(url, stringsAsFactors = FALSE)
  df <- sp[[4]]
  names(df) <- c("full_name", "value")
  # filter and remap wanted columns
  result <- df[df$full_name %in% names(wanted), ]
  result$column_name <- sapply(result$full_name, function(x) wanted[[x]])
  # add the identifier to every row
  result$name <- name
  return(result[, c("name", "column_name", "value")])
}

# invoke the function for each name/URL pair - returns a list of data frames
columns <- apply(instruments[, c("name", "URL")], 1,
                 function(x) getValues(x[["name"]], x[["URL"]]))
# bind using dplyr::bind_rows to make a tall data frame
tall <- bind_rows(columns)
# make it wide using dcast from reshape2; value.var names the value column
wide <- dcast(tall, name ~ column_name, value.var = "value")
wide
#                               name # of Bonds Issue Nominal Value
# 1 NBRK KZW1KD079112 discount notes    1,929,319,196        100.00
# 2 NBRK KZW1KD079146 discount notes    1,575,000,000        100.00
# 3 NBRK KZW1KD079153 discount notes      701,390,693        100.00
# 4 NBRK KZW1KD089137 discount notes    1,380,368,000        100.00
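To address the UPDATE: rather than hardcoding sp[[4]], you can scan the list returned by readHTMLTable for the table that actually contains the labels you are after. A minimal sketch under that assumption (findTable is a hypothetical helper name, not part of the original answer):

# find the first scraped table that has at least two columns and
# any of the wanted labels in its first column; NULL if none matches
findTable <- function(tables, labels) {
  for (tbl in tables) {
    if (!is.null(tbl) && ncol(tbl) >= 2 && any(labels %in% tbl[[1]])) {
      return(tbl)
    }
  }
  NULL
}

Inside getValues, df <- sp[[4]] would then become df <- findTable(sp, names(wanted)), which keeps working when the page layout shifts the table to sp[[7]] or elsewhere.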

Related

How to extract data from the dataset with a certain condition and how to combine data from two columns into one in R

This is my dataset example for one person (shown as a screenshot in the original post).
I have made this table:
deathYear diagYear fcediags pid
     2013       NA      I21   1
     2011       NA      I63   2
     2033       NA      I21   4
     2029       NA      I25   5
     2020       NA      I21  18
     2012       NA      I63  19
The problem is with the data for diagYear above: the results are NA.
And also:
The table T2 should only show the rows for persons that have at least one of the Diags "I20", "I21", "I22", "I25" or "I63" in the document data (no matter whether document$fces$alive is TRUE or FALSE), but, for these persons only, it should also show the year of death (extracted from the date as in the deathYear code above), no matter whether the person died of some other diagnosis.
I also need to make one Year column instead of the two (deathYear and diagYear), containing the year extracted from document$FCEs$date (please see the picture), according to these conditions:
1. if document$fces$alive is TRUE, the Year column should hold the year only if at least one Diag1 in the person's document set is "I20", "I21", "I22", "I25" or "I63";
2. if document$fces$alive is FALSE (but only for the persons from condition 1), the Year column should hold the deathYear from the code above, no matter the Diag1 value for the case of death.
I have tried these codes:
getDiags <- function(x) {
  document <- fromJSON(x)
  fcediags <- document$FCEs$Diag1
  fcedage  <- document$FCEs$pAge
  fcealive <- document$FCEs$alive
  deathYear <- 2030
  if (length(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")) > 0)
    deathYear <- as.numeric(strsplit(document$FCEs[document$FCEs$alive == FALSE, ]$date, "/")[[1]][1])
  diagYear <- 0
  v1 = c("I20", "I21", "I22", "I25", "I63")
  for (i in 1:length(document$FCEs$Diag1)) {
    if (document$FCEs$Diag1[i] %in% v1) {
      diagYear <- as.numeric(strsplit(document$FCEs[document$FCEs$Diag1[i], ]$date, "/")[[1]][1])
    }
  } # this block of code doesn't work, it shows NA in the table
  return(data.frame(fcedage, fcediags, fcealive, sex, ldl, pid = document$ID, deathYear, diagYear))
}
for (i in 1:length(fces$fcediags)) {
  T2 <- subset(fces, fces$fcediags == "I20" | fces$fcediags == "I21" | fces$fcediags == "I22" |
                     fces$fcediags == "I25" | fces$fcediags == "I63",
               select = c(deathYear, diagYear, fcediags, pid))
}
# I've obviously made this table wrong, because it shows rows only for the "I20", "I21", ..., "I63" Diag1s; for these persons it should also show the year of death (document$fces$alive == FALSE) regardless of the Diag1 value for the case of death.
(pid is the person's ID.) These attempts are not good enough: the results in the diagYear column shouldn't be NA, and the two columns should be merged into one.
Can someone please help me? Thank you in advance!
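For what it's worth, a likely cause of the NAs: inside the loop, document$FCEs[document$FCEs$Diag1[i], ] indexes the rows by the diagnosis string instead of by the loop position i, so the date lookup fails. A sketch of the fixed loop plus a combined Year column, under my reading of conditions 1 and 2 above (meant as a drop-in for the body of getDiags):

v1 <- c("I20", "I21", "I22", "I25", "I63")
diagYear <- NA
for (i in seq_along(document$FCEs$Diag1)) {
  if (document$FCEs$Diag1[i] %in% v1) {
    # index the date by position i, not by the diagnosis string
    diagYear <- as.numeric(strsplit(document$FCEs$date[i], "/")[[1]][1])
  }
}

# one combined Year column instead of deathYear/diagYear
hasDiag <- any(document$FCEs$Diag1 %in% v1)
died    <- any(document$FCEs$alive == FALSE)
Year    <- if (!hasDiag) NA else if (died) deathYear else diagYear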

Extract and sum numbers from free text input, add to df

I have a dataframe with a column that contains free text entries on years of education. From the free text entries I want to extract all of the numbers and sum them.
Example: data_en$education[1] gives "6 primary school 10 highschool"
With the following code I can extract both numbers and sum them.
library(stringr)
x <- as.numeric(str_extract_all(data_en$education[1], "[0-9]+")[[1]])
x <- as.vector(x)
x <- sum(x)
However, I would ideally like to do this for all free text entries (i.e. each row) and subsequently add the results to the dataframe per row (i.e. in a variable such as data_en$educationNum). I'm a bit stuck on how to proceed.
You can use sapply:
data_en$educationNum <- sapply(str_extract_all(data_en$education, "[0-9]+"),
                               function(i) sum(as.numeric(i)))
data_en
#                        education educationNum
# 1 6 primary school 10 highschool           16
# 2 10 primary school 2 highschool           12
# 3                      no school            0
Data:
data_en <- data.frame(education = c("6 primary school 10 highschool",
                                    "10 primary school 2 highschool",
                                    "no school"))
You just need to map over the output of str_extract_all:
x <- c('300 primary 1 underworld', '6 secondary 9 dungeon lab')
library(purrr)
map_dbl(str_extract_all(x, '\\d+'), ~ sum(as.numeric(.)))
# [1] 301 15
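For reference, a base-R equivalent of the purrr line above (same x, same regex; vapply enforces one numeric result per element) could be:

vapply(str_extract_all(x, '\\d+'),
       function(i) sum(as.numeric(i)), numeric(1))
# [1] 301  15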

How to check for skipped values in a series in a R dataframe column?

I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks, but there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe, price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing weeks for each Car name in price1, capture the row corresponding to that Car name in price2, and insert that row into price1. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my searching on SO leads me to answers about handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
try this, good luck
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price  = ifelse(is.na(Price),  AveragePrice,  Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
So if I understand your problem correctly, you basically have 2 dataframes and you want to make sure the dataframe price1 has the correct row names (the names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the Price value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace it (and the Rebate) with the
    # corresponding values in price2 for the same car name
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i]  <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <- price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete(), as sketched below, or you can even fudge it and right_join a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
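A minimal sketch of the complete() route, assuming price1 and price2 as shown in the question (the coalesce() fill from price2 is my own addition, not part of the original answer):

library(dplyr)
library(tidyr)

filled <- price1 %>%
  complete(Name, Week = 1:54) %>%        # one row per car per week
  left_join(price2, by = "Name") %>%     # bring in the per-car averages
  mutate(Price  = coalesce(Price,  AveragePrice),
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)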

for loop: not equally-sized data frames in R

I'm working on a data frame which looks like this:
shape id day hour week id footfall category area name
22496 22/3/14 3 12 634 Work cluster CBD area 1
22670 22/3/14 3 12 220 Shopping cluster Orchard Road 1
23287 22/3/14 3 12 723 Airport Changi Airport 2
16430 22/3/14 4 12 947 Work cluster CBD area 2
4697 22/3/14 3 12 220 Residential area Ang Mo Kio 2
4911 22/3/14 3 12 1001 Shopping cluster Orchard Rd 3
11126 22/3/14 3 12 220 Residential area Ang Mo Kio 2
and so on... for 635 rows in total.
The other dataset that I want to compare it with (linked in the original post) looks like this:
category Foreigners Locals
Work cluster 1600000 3623900
Shopping cluster 1800000 3646666.667
Airport 15095152 8902705
Residential area 527700 280000
There is also a last dataset that I want to compare with, their previousHour.
The first and second datasets share the attribute category, and the first and third share the attribute hour.
previousHour is derived per category. E.g., for the Work cluster category (linked in the original post), previousHour should look like this:
hour
0
3
4
4
4
5
until 144 rows are returned... for each category.
The shopping-category data is linked in the original post; previousHour for shopping, for example, should look like this:
hour
0
3
3
4
4
5
until 144 rows are returned...
The airport and residential category data are also linked in the original post; each returns 144 rows.
SumHour dataset:
category sumHour
1 Airport 2208
2 Residential area 1656
3 Shopping cluster 1656
4 Work cluster 1656
Here's what I ideally want to do in R:
#for n in 1: number of rows{
# calculate sumHours(in SumHours dataset) - previousHour = newHourSum and store it as newHourSum
# calculate hour/(newHourSum-previousHour) * Foreigners and store it as footfallHour
# add to the empty dataframe }
I'm not sure how to do that; here's what I tried:
mergetbl <- function(tbl1, tbl2)
{
  newtbl = data.frame(hour = numeric(), forgHour = numeric())
  ntbl1rows <- nrow(tbl1)  # get the number of rows
  for (n in 1:ntbl1rows)
  {
    # for n in 1:number of rows {
    #   check the previous hour from the IDA dataset !!!!
    #   calculate sumDate - previousHour = newHourSum and store it as newHourSum
    #   calculate hour/(newHourSum - previousHour) * Foreigners and store it as footfallHour
    #   add to the empty dataframe }
    newHourSum <- 3588 - tbl1
    footfallHour <- (tbl1$hour / (newHourSum - previousHour)) * tbl2$Foreigners
    newtbl <- rbind(newtbl, footfallHour)
  }
}
But nothing happened to newtbl...
Here's what ideally looks like for newtbl:
hour forgHour
0 1337.79 (the function should calculate this)
3 ...
3 ...
3 ...
4 ...
3 ...
and so on...
Thinking in terms of vectors gives this; try the following:
### this is to get your Foreigners/Locals to be the same size as tbl1
Foreigners <- ifelse(tbl1$category == "Work cluster",     tbl2$Foreigners[1],
              ifelse(tbl1$category == "Shopping cluster", tbl2$Foreigners[2],
              ifelse(tbl1$category == "Airport",          tbl2$Foreigners[3],
                                                          tbl2$Foreigners[4])))
Locals <- ifelse(tbl1$category == "Work cluster",     tbl2$Locals[1],
          ifelse(tbl1$category == "Shopping cluster", tbl2$Locals[2],
          ifelse(tbl1$category == "Airport",          tbl2$Locals[3],
                                                      tbl2$Locals[4])))
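As a side note, the same category-to-value mapping can be written more compactly with match(), given that tbl2 has one row per category:

# look up each row's category in tbl2 and pull the matching value
Foreigners <- tbl2$Foreigners[match(tbl1$category, tbl2$category)]
Locals     <- tbl2$Locals[match(tbl1$category, tbl2$category)]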
And now, the function
resultHour <- function(tbl1, tbl2, ForeOrLoca)
{
  previousHour <- rep(0, nrow(tbl1))
  for (i in 2:nrow(tbl1))
  {
    previousHour[i] <- tbl1$hour[i - 1]
  }
  ### the conditional sum matching the category from tbl1
  newHourSum <- ifelse(tbl1$category == "Work cluster",
                       sum(with(tbl1, hour * I(category == "Work cluster"))),
                ifelse(tbl1$category == "Shopping cluster",
                       sum(with(tbl1, hour * I(category == "Shopping cluster"))),
                ifelse(tbl1$category == "Airport",
                       sum(with(tbl1, hour * I(category == "Airport"))),
                       sum(with(tbl1, hour * I(category == "Residential area"))))))
  ### and finally, this
  hour <- as.vector(tbl1$hour)
  footfallHour <- (hour / (newHourSum - previousHour)) * ForeOrLoca
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
this is the output I get :
> head(newtbl)
hour footfallHour
[1,] 3 1337.7926
[2,] 3 1506.2762
[3,] 3 12631.9264
[4,] 4 1785.2162
[5,] 3 441.7132
[6,] 3 1506.2762
Using the function:
TheResultIWant <- resultHour(tbl1, tbl2, Foreigners)
For your new question: provided you cut your data frame into several frames, each containing only one category (see the sketch after this answer), you can use this function:
new_function_hour_result <- function(tbl1_categ, vec_categ, prevHour_Categ, sumHour_Categ)
{
  hour <- as.vector(tbl1_categ$hour)
  footfallHour <- (hour / (sumHour_Categ - prevHour_Categ)) * vec_categ
  newtbl <- cbind(hour, footfallHour)
  return(newtbl)
}
With tbl1_categ your data frame for a given category, vec_categ your foreigner or local data for a given category, prevHour_Categ the previousHour for a given category, and finally sumHour_Categ the sumHour for a given category.
To get your vectors to be the same size as the data frame they will be compared to:
for instance, for vec_categ in the locals/airport case:
locals_airport = rep(category[3,3], times = nrow(tbl1_airport))
for foreigners and the airport category:
foreig_airport = rep(category[3,2], times = nrow(tbl1_airport))
this will repeat the value contained in category[3,2], nrow(tbl1_airport) times.
for locals and workcluster:
locals_workcluster = rep(category[1,3], times = nrow(tbl1_workcluster))
and so on for each vector (i.e. prevHour_Categ, sumHour_Categ, vec_categ) for each category!
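One possible way to do the cutting and the per-category calls in a single pass is split() plus lapply(); a sketch assuming tbl1 and tbl2 as before, a SumHour data frame named sumHour with columns category and sumHour, and my own construction of the previous-hour vector:

# split tbl1 into one data frame per category
by_categ <- split(tbl1, tbl1$category)

results <- lapply(names(by_categ), function(categ) {
  piece <- by_categ[[categ]]
  vec   <- tbl2$Foreigners[match(categ, tbl2$category)]    # per-category value
  prev  <- c(0, head(piece$hour, -1))                      # previous hour within this category
  sumH  <- sumHour$sumHour[match(categ, sumHour$category)] # per-category sum
  new_function_hour_result(piece, vec, prev, sumH)
})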

Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the january database, but he left and now is not listed in february
2- Customer X is listed in both january and february databases
3- Customer X entered the database in february, so he is not listed in january
I am stuck on the following problem: I need to create a single database with all customers that are listed in either dataframe, together with their respective information. However, for a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join (a join diagram was shown in the original post), and then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,  y=1:9,   z=1:9)
d2 <- data.frame(x=1:10, y=11:20, z=21:30)  # example data
d3 <- merge(d1, d2, by="x", all=TRUE)       # merge
# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
d3[is.na(d3[,2]), ][, 2:3] <- d3[is.na(d3[,2]), ][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though; perhaps you should consider @flodel's comments. Also note there are demons when your original Jan data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned them in your question.
Try:
data <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
although I haven't tested it since there is no data; but if you just join the ID column from Feb, it should filter out anything that isn't in both frames.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id = 1:10, age = round(runif(10, 25, 55)),
                  city = c("NY","LA","BOS","CHI","DC"), gender = rep(c("M","F"), each = 5))
new <- data.table(id = 11:16, age = round(runif(6, 25, 55)),
                  city = c("NY","LA","BOS","CHI","DC","SF"), gender = c("M","F"))
feb <- rbind(jan[6:10, ], new)
new <- data.table(id = 17:22, age = round(runif(6, 25, 55)),
                  city = c("NY","LA","BOS","CHI","DC","SF"), gender = c("M","F"))
mar <- rbind(jan[1:5, ], new)
setkey(jan, id)
setkey(feb, id)
join <- data.table(merge(jan, feb, by = "id", all = TRUE))
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- data.table(merge(x, y, by = "id", all = TRUE))
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL]                # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender")) # rename columns that remain
  return(join)
}
Reduce("f", list(jan, feb, mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
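For intuition, Reduce() folds a two-argument function across a list from left to right:

Reduce(function(a, b) a + b, list(1, 2, 3))
# [1] 6    i.e. (1 + 2) + 3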
