Writing a function to clean string data and rename columns - r

I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of strings. I want to remove from one column the part of each string that exactly matches the corresponding element of another column, apply a couple more stringr functions, convert the matrix to a data frame, and rename the columns. In the last step I want to append a number to each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote so I can't figure out why it won't work. I tried running each line individually by filling in the inputs like this and it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
x[,y] <- str_remove(x[,y], x[,z])
x[,y] <- str_sub(x[,y], 4, -4)
x[,y] <- str_trim(x[,y], "both")
x <- as.data.frame(x)
colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive this when I run the str_remove function individually, but it still changes the elements. But it changes nothing when I run the UDF.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
[,3] [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015" "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
Date1 User1
1 6 April 2014 Copnovelist
2 18 Dec. 2015 kenneth bell
3 26 May 2015 Simon.B :-)
4 22 July 2013 Lilla Lukacs

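I realized I just needed to assign the function's result to see it work. R does not modify Review1 in place, and because the last line of the function is an assignment, the value is returned invisibly, so nothing is printed either.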
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
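A minimal sketch of the same function that makes the return value explicit. The parameter names (user_col, date_col, suffix) and the sample titles, ratings and texts are placeholders, not from the question. Wrapping the date in fixed() is an added suggestion: str_remove treats its pattern as a regular expression by default, so the "." in "18 Dec. 2015" would otherwise match any character.

```r
library(stringr)

# x: character matrix with 5 columns; user_col/date_col: column indices;
# suffix: value appended to every column name
transform_reviews <- function(x, user_col, date_col, suffix) {
  # fixed() matches the date literally instead of as a regular expression
  x[, user_col] <- str_remove(x[, user_col], fixed(x[, date_col]))
  # drop the leading "By " and the trailing " on"
  x[, user_col] <- str_sub(x[, user_col], 4, -4)
  x[, user_col] <- str_trim(x[, user_col], "both")
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  colnames(x) <- paste0(c("Title", "Rating", "Date", "User", "Text"), suffix)
  x  # the last expression is the value itself, so it is returned visibly
}

# Hypothetical two-row input in the shape described in the question
reviews <- cbind(
  c("title 1", "title 2"),
  c("5", "4"),
  c("6 April 2014", "18 Dec. 2015"),
  c("By Copnovelist on 6 April 2014", "By kenneth bell on 18 Dec. 2015"),
  c("text 1", "text 2")
)
cleaned <- transform_reviews(reviews, 4, 3, 1)
cleaned$User1
# [1] "Copnovelist"  "kenneth bell"
```

The result still has to be assigned (or printed) by the caller, since the original matrix is never changed.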


Scraping table from timeanddate.com using R

I'm trying to scrape weather data (in R) for the 2nd of March on the following web page: https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020 I am interested in the table at the end, below "Stockholm Weather History for..."
Just above and to the right of that table is a drop-down list where I chose the 2nd of March. But when I scrape using rselenium I only get the data of the 1st of March.
How can I get the data for the 2nd (or any other date except the 1st)?
I have also tried to scrape the entire page using read_html, but I can't find a way to extract the data I want from that.
The following code only seems to work for the 1st, not for any other date in the month.
library(tidyverse)
library(rvest)
library(RSelenium)
library(stringr)
library(dplyr)
rD <- rsDriver(browser="chrome", port=4234L, chromever ="85.0.4183.83")
remDr <- rD[["client"]]
remDr$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
webElems <- remDr$findElements(using="class name", value="sticky-wr")
s<-webElems[[1]]$getElementText()
s<-as.character(s)
print(s)
Here's an approach with RSelenium
library(RSelenium)
library(rvest)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.timeanddate.com/weather/sweden/stockholm/historic?month=3&year=2020")
client$findElement(using = "link text","Mar 2")$clickElement()
source <- client$getPageSource()[[1]]
read_html(source) %>%
html_node(xpath = '//*[@id="wt-his"]') %>%
html_table %>%
head
Conditions Conditions Conditions Comfort Comfort Comfort
1 Time Temp Weather Wind Humidity Barometer Visibility
2 12:20 amMon, Mar 2 39 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
3 12:50 am 37 °F Chilly. 7 mph ↑ 87% 29.18 "Hg N/A
4 1:20 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
5 1:50 am 37 °F Passing clouds. 7 mph ↑ 87% 29.18 "Hg N/A
6 2:20 am 37 °F Overcast. 8 mph ↑ 87% 29.18 "Hg N/A
You can then iterate over dates for findElement().
You can find the xpath by right clicking on the table and choosing Inspect in Chrome:
Then, you can find the table element, right click and choose Copy > Copy XPath.
It is always useful to use your browser's "developer tools" to inspect the web page and figure out how to extract the information you need.
A couple of tutorials that explain this I just googled:
https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47
https://www.scrapingbee.com/blog/web-scraping-r/
For example, in this particular webpage, when we select a new date in the dropdown list, the webpage sends a GET request to the server, which returns a JSON string with the data of the requested date. Then the webpage updates the data in the table (probably using javascript -- did not check this).
So, in this case you need to emulate this behavior, capture the json file and parse the info in it.
In Chrome, if you look at the developer tool network pane, you will see that the address of the GET request is of the form:
https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=YYYYMMDD&month=M&year=YYYY&json=1
where YYYY stands for year with 4 digits, MM(M) month with two (one) digits, and DD day of the month with two digits.
So you can set your code to do the GET request directly to this address, get the json response and parse it accordingly.
library(rjson)
library(rvest)
library(plyr)
library(dplyr)
year <- 2020
month <- 3
day <- 7
# create formatted url with desired dates
url <- sprintf('https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=%4d%02d%02d&month=%d&year=%4d&json=1', year, month, day, month, year)
webpage <- read_html(url) %>% html_text()
# json string is not formatted the way fromJSON function needs
# so I had to parse it manually
# split string on each row
x <- strsplit(webpage, "\\{c:")[[1]]
# remove first element (garbage)
x <- x[2:length(x)]
# clean last 2 characters in each row
x <- sapply(x, FUN=function(xx){substr(xx[1], 1, nchar(xx[1])-2)}, USE.NAMES = FALSE)
# function to get actual data in each row and put it into a dataframe
parse.row <- function(row.string) {
# parse columns using '},{' as divider
a <- strsplit(row.string, '\\},\\{')[[1]]
# remove some leftover characters from parsing
a <- gsub('\\[\\{|\\}\\]', '', a)
# remove what I think is metadata
a <- gsub('h:', '', gsub('s:.*,', '', a))
df <- data.frame(time=a[1], temp=a[3], weather=a[4], wind=a[5], humidity=a[7],
barometer=a[8])
return(df)
}
# use ldply to run function parse.row for each element of x and combine the results in a single dataframe
df.final <- ldply(x, parse.row)
Result:
> head(df.final)
time temp weather wind humidity barometer
1 "12:20 amSat, Mar 7" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
2 "12:50 am" "28 °F" "Passing clouds." "No wind" "100%" "29.80 \\"Hg"
3 "1:20 am" "28 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
4 "1:50 am" "30 °F" "Passing clouds." "2 mph" "100%" "29.80 \\"Hg"
5 "2:20 am" "30 °F" "Passing clouds." "1 mph" "100%" "29.80 \\"Hg"
6 "2:50 am" "30 °F" "Low clouds." "No wind" "100%" "29.80 \\"Hg"
I left everything as strings in the data frame, but you can convert the columns to numeric or dates if you need to.
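To fetch several days, the request URL can be generated for a whole range of dates. This is only a sketch: build_url is a hypothetical helper, and it assumes the endpoint still follows the pattern described above, which is not verified here.

```r
# Build the request URL for one or more dates (vectorised through sprintf)
build_url <- function(date, place = "sweden/stockholm") {
  date <- as.Date(date)
  sprintf(
    "https://www.timeanddate.com/scripts/cityajax.php?n=%s&mode=historic&hd=%s&month=%d&year=%s&json=1",
    place,
    format(date, "%Y%m%d"),          # hd=YYYYMMDD
    as.integer(format(date, "%m")),  # month without a leading zero
    format(date, "%Y")               # four-digit year
  )
}

dates <- seq(as.Date("2020-03-01"), as.Date("2020-03-07"), by = "day")
urls <- build_url(dates)
urls[2]
# [1] "https://www.timeanddate.com/scripts/cityajax.php?n=sweden/stockholm&mode=historic&hd=20200302&month=3&year=2020&json=1"
```

Each element of urls can then be passed through the download and parsing steps above, ideally with a pause between requests.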

R for loop is numeric

I have a "test" dataframe with 3 companies (ciknum variable)
and years in which each company filed annual reports (fileyear):
ciknum fileyear
1 1408356 2013
2 1557255 2013
3 1557255 2014
4 1557255 2015
5 1557255 2016
6 1557255 2017
7 1555538 2014
8 1555538 2015
9 1555538 2016
10 1555538 2017
These two columns are numeric:
> is.numeric(test$ciknum)
[1] TRUE
> is.numeric(test$fileyear)
[1] TRUE
However, I need a loop that goes through each ciknum-fileyear pair to download annual reports from one site. The download requires numeric inputs, and it seems I am not passing them. For instance, the following loop fails, and neither firm nor year appears to be numeric:
for (row in 1:nrow(test)){
firm <- test[row, "ciknum"]
year <- test[row, "fileyear"]
my_getFilings(firm, '10-K', year, downl.permit="y") #download function over firm-year
}
Error: Input year(s) is not numeric #error repeated 10 times (one per row)
I checked whether the new variables firm and year are numeric, and the evidence is mixed. On the one hand, year seems to be read as a numeric variable:
> for (row in 1:nrow(test)){
+ firm <- test[row, "ciknum"]
+ year <- test[row, "fileyear"]
+
+ if(year>2015) {
+ print(paste("I have this", firm, "showing a numeric", year))
+ }
+ }
[1] "I have this 1557255 showing a numeric 2016" #it only states years>2015. Seems it reads a number
[1] "I have this 1557255 showing a numeric 2017"
[1] "I have this 1555538 showing a numeric 2016"
[1] "I have this 1555538 showing a numeric 2017"
But on the other hand, it seems it does not:
> for (row in 1:nrow(test)){
+ firm <- test[row, "ciknum"]
+ year <- test[row, "fileyear"]
+
+ if(!is.numeric(year)) {
+ print(paste("is not numeric"))
+ }
+ }
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
Can anyone tell me whether these are numeric variables or not? Getting lost on this one... My download function "my_getFilings" seems to depend on that.
Thank you in advance.
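One possible explanation, offered as a guess because the question does not show how test was created: if test is a tibble (for example, read in with readr), test[row, "fileyear"] returns a 1x1 data frame rather than a number. A data frame is not numeric, yet comparisons such as year > 2015 still work on it, which would produce exactly this mixed picture. The sketch below reproduces the effect in base R, using drop = FALSE to mimic the tibble behaviour.

```r
test <- data.frame(ciknum = c(1408356, 1557255), fileyear = c(2013, 2016))

# drop = FALSE keeps the result as a 1x1 data frame, as a tibble would by default
year_df <- test[1, "fileyear", drop = FALSE]
is.numeric(year_df)   # FALSE: a data frame is a list, not a numeric vector
year_df > 2012        # the comparison still works element-wise

# Extracting the column first always yields a plain numeric value
year <- test[["fileyear"]][1]
is.numeric(year)      # TRUE
```

If that is the cause, extracting with test[["fileyear"]][row] (or test$fileyear[row]) inside the loop should pass a plain number to the download function.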

How to sort by Library of Congress Classification (LCC) number in R

Library of Congress Classification numbers are used in libraries to give call numbers to things so they can be ordered on the shelf. They can be simple or quite complex, with a few mandatory parts but many optional ones. (See "entering call numbers in 050" on 050 Library of Congress Call Number for how they break down, or lc_callnumber for a Ruby tool that sorts them.)
I would like to sort by LCC number in R. I've looked at Sort a list of nontrivial elements in R and Sorting list of list of elements of a custom class in R? but haven't got it figured out.
Here are four call numbers, entered in sorted order:
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
sort sorts them by character, so 276 < 7 < 76.73 < 90.
> sort(call_numbers)
[1] "QA 276.45 R3 A35 2010" "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016"
To sort them properly I think I'd have to define a class and then some methods on it, like this:
library(stringr)
class(call_numbers) <- "LCC"
## Just pick out the letters and digits for now, leave the rest
## until sorting works, then work down more levels.
lcc_regex <- '([[:alpha:]]+?) ([[:digit:]\\.]+?) (.*)'
"<.LCC" <- function(x, y) {
x_lcc <- str_match(x, lcc_regex)
y_lcc <- str_match(y, lcc_regex)
if(x_lcc[2] < y_lcc[2]) return(x)
if(as.integer(x_lcc[3]) < as.integer(y_lcc[3])) return(x)
}
"==.LCC" <- function(x, y) {
x_lcc <- str_match(x, lcc_regex)
y_lcc <- str_match(y, lcc_regex)
x_lcc[2] == y_lcc[2] && x_lcc[3] == y_lcc[3]
}
">.LCC" <- function(x, y) {
x_lcc <- str_match(x, lcc_regex)
y_lcc <- str_match(y, lcc_regex)
if(x_lcc[2] > y_lcc[2]) return(x)
if(as.integer(x_lcc[3]) > as.integer(y_lcc[3])) return(x)
}
This doesn't change the sort order. I haven't defined a subset method ("[.myclass") because I have no idea what it should be.
This might be a simpler approach. It assumes every call number has the following format: 2-letter code, space, number, space, letter-number, space ... year.
The strategy is to split the call number on spaces, extract the first three fields as separate vectors, and then sort on them in sequence with the order function.
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
#split on the spaces
split<-strsplit(call_numbers, " " )
#Retrieve the 2 letter code
letters<-sapply(split, function(x){x[1]})
#retrieve the 2nd number group and convert to numeric values for sorting
second<-sapply(split, function(x){as.numeric(x[2])})
#obtain the 3rd grouping
third<-sapply(split, function(x){x[3]})
#find the year
year<-sapply(split, function(x){x[length(x)]})
df<-data.frame(call_numbers)
#sort on the letter code, then the number, then the third field
call_numbers[order(letters, second, third)]
For this limited dataset the technique works.
mixedsort from the gtools package (available on CRAN, not part of base R) turns out to do just the trick:
library(gtools)
call_numbers <- c("QA 7 H3 1992", "QA 76.73 R3 W53 2015", "QA 90 H33 2016", "QA 276.45 R3 A35 2010")
mixedsort(call_numbers)
## [1] "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016" "QA 276.45 R3 A35 2010"
Further, mixedorder can be used to sort a data frame by one column.
This is a special case of what was answered earlier in How to sort a character vector where elements contain letters and numbers in R?
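A short sketch of the data frame case with mixedorder. The titles are made-up placeholders; only the call numbers come from the question.

```r
library(gtools)

# Titles are hypothetical; the call numbers are deliberately out of shelf order
books <- data.frame(
  call_number = c("QA 276.45 R3 A35 2010", "QA 7 H3 1992",
                  "QA 90 H33 2016", "QA 76.73 R3 W53 2015"),
  title = c("Book A", "Book B", "Book C", "Book D"),
  stringsAsFactors = FALSE
)

# mixedorder returns the row permutation, just as order does
sorted <- books[mixedorder(books$call_number), ]
sorted$call_number
```

Like the mixedsort call above, this only compares the embedded numbers naturally; it does not implement the full LCC filing rules.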
I feel like I spent way too much time on figuring out a solution to exactly what you're trying to do --only mine was for JavaScript. But it basically comes down to the notion of "normalization" of these numbers so that they can be sorted alphabetically.
Maybe this solution can be used and ported over to R. At the very least, hopefully this could get you started. It involves some regular expressions and a little bit of extra scripting to get the call numbers into a state where they can be sorted.
https://github.com/rayvoelker/js-loc-callnumbers/blob/master/locCallClass.js
Good luck!
The easiest (and most elegant) way is to use str_sort from the stringr package:
# install.packages("stringr") ## Uncomment if not already installed
library(stringr)
str_sort(call_numbers, numeric = TRUE)
[1] "QA 7 H3 1992" "QA 76.73 R3 W53 2015" "QA 90 H33 2016"
[4] "QA 276.45 R3 A35 2010"

Extract dates from a vector of character strings

I have a vector with two elements. Each element contains a string of characters
with two sets of dates. I need to extract the latter of these two dates,
and make a new vector or list with them.
#webextract vector
webextract <- list("The Employment Situation, December 2006 January 5 \t 8:30 am\r","The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r")
#This is what webextract looks like:
[[1]]
[1] The Employment Situation, December 2006 January 5 \t 8:30 am\r
[[2]]
[1] The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r
webextract is the result of scraping a URL that returns plain text, which is why it looks like that. What I need to extract is "January 5" and "Feb. 2". I have been experimenting with grep and strsplit and failed to get anywhere. I have gone through all related SO questions without success. Thank you for your help.
We can try with gsub after unlisting the 'webextract'
gsub("^\\D+\\d+\\s+|(,\\s+\\d+)*\\D+\\d+:.*$", "", unlist(webextract))
#[1] "January 5" "Feb. 2"
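The same result can be reached by matching the dates rather than deleting everything around them. This is a sketch under one assumption: each release date is a capitalized month name or abbreviation followed by a one- or two-digit day. That holds for the two sample strings but has not been checked beyond them. The word boundary after the day is what rejects "December 2006" and "January 2007".

```r
webextract <- list(
  "The Employment Situation, December 2006 January 5 \t 8:30 am\r",
  "The Employment Situation, January 2007 \tFeb. 2, 2007\t 8:30 am \r"
)

# Month word, optional ".", a space, then 1-2 digits ending at a word boundary
pattern <- "\\b[A-Z][a-z]+\\.? \\d{1,2}\\b"

x <- unlist(webextract)
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Keep the last match in each string; NA when nothing matches
release_dates <- vapply(
  matches,
  function(m) if (length(m) > 0) tail(m, 1) else NA_character_,
  character(1)
)
release_dates
# [1] "January 5" "Feb. 2"
```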

Creating a unified time-series, with dates coming from different (natural) languages

I am using the as.Date function as follows:
x$time_date <- as.Date(x$time_date, format = "%H:%M - %d %b %Y")
This worked fine until I saw a lot of NA values in the output, which I traced back to some of the dates stemming from a different language: German.
My English dates look like this: 18:00 - 10 Dec 2014
Where the German equivalent is: 18:00 - 10 Dez 2014
The month December is abbreviated the German way. This is not recognised by the as.Date function. I have the same problem for five other months:
Mar - März
May - Mai
Jun - Juni
Jul - Juli
Oct - Okt
This looks like it would be of use, but I am unsure of how to implement it for 'unrecognised' formats:
How to change multiple Date formats in same column
I attempted to just go through and use gsub to replace all the occurrences of German months, but without luck. x below is the data.table and I work on just the time_date column:
x$time_date <- gsub("(März)?", "Mar", x$time_date) %>%
gsub("(Mai)?", "May", .) %>%
gsub("(Juni)?", "Jun", .) %>%
gsub("(Juli)?", "Jul", .) %>%
gsub("(Okt)?", "Oct", .) %>%
gsub("(Dez)?", "Dec", .)
Not only did this not work, but it is also a very slow process and I have nearly 20 GB of pure .csv files to work through.
In the as.Date documentation there is mention of different locales / languages, but not how to work with several simultaneously. I also found instructions on how to use different languages, however my data is all mixed, so I can only think of a conditional loop using the correct language for each file, however that would also be slow.
Is there a known workaround for this, which I can't find?
Create a table tab that contains all the translations and then use subscripting to actually do the translation. The code below seems to work for me on Windows provided your input abbreviations are the same as the standard ones generated but the precise language names ("German", etc.) may vary depending on your system. See ?Sys.setlocale for more information. Also if the abbreviations in your input are different than the ones generated here you will have to add those to tab yourself, e.g. tab <- c(tab, Juli = "Jul")
langs <- c("French", "German", "English")
tab <- unlist(lapply(langs, function(lang) {
Sys.setlocale("LC_TIME", lang)
nms <- format(ISOdate(2000, 1:12, 1), "%b")
setNames(month.abb, nms)
}))
x <- c("18:00 - 10 Juli 2014", "18:00 - 10 Mai 2014") # test input
source_month <- gsub("[^[:alpha:]]", "", x)
mapply(sub, source_month, tab[source_month], x, USE.NAMES = FALSE)
giving:
[1] "18:00 - 10 Jul 2014" "18:00 - 10 May 2014"
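A locale-free alternative is to replace the six abbreviations from the question literally. The gsub attempt in the question fails because a pattern such as "(Dez)?" is optional and therefore also matches the empty string at every position, so the replacement gets inserted throughout the text. The sketch below assumes those six German forms are the only ones that differ from the English abbreviations; translate_months is a hypothetical helper name.

```r
# German forms listed in the question and their English abbreviations
german  <- c("M\u00e4rz", "Mai", "Juni", "Juli", "Okt", "Dez")
english <- c("Mar", "May", "Jun", "Jul", "Oct", "Dec")

translate_months <- function(x) {
  for (i in seq_along(german)) {
    # fixed = TRUE does a literal match with no regular expression involved
    x <- gsub(german[i], english[i], x, fixed = TRUE)
  }
  x
}

x <- c("18:00 - 10 Dez 2014", "18:00 - 10 Juli 2014", "18:00 - 10 Dec 2014")
translated <- translate_months(x)
translated
# [1] "18:00 - 10 Dec 2014" "18:00 - 10 Jul 2014" "18:00 - 10 Dec 2014"

# Parsing then works as before, assuming an English (or C) LC_TIME locale:
# as.Date(translated, format = "%H:%M - %d %b %Y")
```

Literal matching is usually cheaper than regular expression matching, which may help with the large files mentioned in the question, though that has not been measured here.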
