I have a "test" dataframe with 3 companies (ciknum variable)
and years in which each company filed annual reports (fileyear):
ciknum fileyear
1 1408356 2013
2 1557255 2013
3 1557255 2014
4 1557255 2015
5 1557255 2016
6 1557255 2017
7 1555538 2014
8 1555538 2015
9 1555538 2016
10 1555538 2017
These two columns are numeric:
> is.numeric(test$ciknum)
[1] TRUE
> is.numeric(test$fileyear)
[1] TRUE
However, I need a loop that goes over each ciknum-fileyear pair to download annual reports from a site. The download function requires numeric inputs, and it seems it is not getting them. For instance, the following loop reports that the input is not numeric (the same happens whether I check firm or year):
for (row in 1:nrow(test)){
firm <- test[row, "ciknum"]
year <- test[row, "fileyear"]
my_getFilings(firm, '10-K', year, downl.permit="y") #download function over firm-year
}
Error: Input year(s) is not numeric #error repeated 10 times (one per row)
I checked whether the new variables firm and year are numeric, and the evidence is mixed. On the one hand, year seems to be read as a numeric variable:
> for (row in 1:nrow(test)){
+ firm <- test[row, "ciknum"]
+ year <- test[row, "fileyear"]
+
+ if(year>2015) {
+ print(paste("I have this", firm, "showing a numeric", year))
+ }
+ }
[1] "I have this 1557255 showing a numeric 2016" #it only states years>2015. Seems it reads a number
[1] "I have this 1557255 showing a numeric 2017"
[1] "I have this 1555538 showing a numeric 2016"
[1] "I have this 1555538 showing a numeric 2017"
But on the other hand, it seems it does not:
> for (row in 1:nrow(test)){
+ firm <- test[row, "ciknum"]
+ year <- test[row, "fileyear"]
+
+ if(!is.numeric(year)) {
+ print(paste("is not numeric"))
+ }
+ }
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
[1] "is not numeric"
Can anyone tell me whether these are numeric variables or not? Getting lost on this one... My download function "my_getFilings" seems to depend on that.
Thank you in advance.
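One possible explanation, which the post does not confirm, is that test is a tibble rather than a plain data.frame. Indexing a tibble as test[row, "fileyear"] returns a 1x1 tibble instead of a bare number, so is.numeric() is FALSE even though a comparison such as year > 2015 still works. The sketch below mimics that behaviour in base R with drop = FALSE (the two-row data frame is made up for illustration) and shows that extracting the column with [[ first gives a plain numeric value:

```r
# Made-up two-row stand-in for the "test" data frame
test <- data.frame(ciknum = c(1408356, 1557255), fileyear = c(2013, 2016))

# drop = FALSE keeps the result as a 1x1 data frame,
# which is what a tibble does by default for [row, col]
cell <- test[2, "fileyear", drop = FALSE]
is.numeric(cell)   # FALSE: a data frame is a list, not a numeric vector
cell > 2015        # the comparison still works element-wise

# Extract the column as a vector first, then pick the row
year <- test[["fileyear"]][2]
is.numeric(year)   # TRUE
```

If this is the cause, using test[["ciknum"]][row] and test[["fileyear"]][row] inside the loop should hand plain numbers to my_getFilings.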
I want to exclude rows with participants who show error rates above 15%.
When I look at the error rate of participant 2, for example, it is 2,97%:
semdata[2,"error_rate"]
[1] "2,97"
But if I run this ifelse statement, many participants get excluded whose error rates are not above 15% (e.g., participant 2), while others are correctly kept.
for(i in 1:NROW(semdata)){
#single trial blocks
ifelse((semdata[i,"error_rate"] >= 15),print(paste(i, "exclusion: error rate ST too high",semdata[i,"dt_tswp.err.prop_st"])),0)
ifelse((semdata[i,"error_rate"] >= 15),semdata[i,6:NCOL(semdata)]<-NA,0)
#dual-task blocks
# ifelse((semdata[i,"error_rate"] >= 15),print(paste(i, "exclusion: error rate DT too high")),0)
# ifelse((semdata[i,"error_rate"] >= 15),semdata[i,6:NCOL(semdata)]<-NA,0)
}
[1] "1 exclusion: error rate ST too high 6,72"
[1] "2 exclusion: error rate ST too high 2,97"
[1] "7 exclusion: error rate ST too high 2,87"
[1] "9 exclusion: error rate ST too high 5,28"
...
What am I doing wrong here?
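You are comparing strings here. Because error_rate is character, R converts 15 to "15" and compares the two values as text, so "6,72" sorts after "15":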
"6,72" > 15
#[1] TRUE
You should convert the data to numeric before comparing. Use sub to replace the decimal comma with a point, then as.numeric:
as.numeric(sub(",", ".", "6,72"))
#[1] 6.72
This can be compared with 15.
as.numeric(sub(",", ".", "6,72")) > 15
#[1] FALSE
For the entire column you can do:
semdata$error_rate <- as.numeric(sub(",", ".", semdata$error_rate))
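Once the column is numeric, the row-by-row ifelse loop can probably be replaced by logical indexing. The following is a minimal sketch on made-up data; the column name score stands in for the columns 6:NCOL(semdata) of the real data set:

```r
# Made-up example data; error_rate is stored as text with decimal commas
semdata <- data.frame(
  id = 1:4,
  error_rate = c("6,72", "2,97", "15,30", "21,05"),
  score = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Convert the decimal comma to a point, then to numeric
semdata$error_rate <- as.numeric(sub(",", ".", semdata$error_rate, fixed = TRUE))

# Flag participants at or above the 15% threshold
too_high <- semdata$error_rate >= 15

# Blank out the data columns for the flagged rows in one step
semdata[too_high, "score"] <- NA
semdata
```

Here only rows 3 and 4 are set to NA, because the comparison is now done on numbers rather than on strings.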
I am having issues using stringr to extract the text between two markers. I need to get everything between these markers, including the line breaks:
reprEx <- "2100\n\nELECTRONIC WITHDRAWALS| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DES $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DESCRIPTION AMOUNT\n04/09 Pmt ID 7807388390 Refunded IN Error On 04/08"
desiredResult <- "| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n[OTHER WITHDRAWALS| WITHDRAWALS\nDATE DES $93,735.18\n["
I have tried using:
desiredResult <- str_match(reprEx, "ELECTRONIC WITHDRAWALS\\s*(.*?)\\s*OTHER WITHDRAWALS")[,2]
but I just get NA back. I just want to get everything in the string that is between the first occurrence of ELECTRONIC WITHDRAWALS and the first occurrence of OTHER WITHDRAWALS. I can't tell whether the line breaks are what is causing the problem.
I think your desiredResult is inconsistent with your paragraph, so I'll prioritize the latter:
everything in the string that is between the first occurrence of ELECTRONIC WITHDRAWALS and the first occurrence of OTHER WITHDRAWALS
first <- gregexpr("ELECTRONIC WITHDRAWALS", reprEx)[[1]]
first
# [1] 7 66
# attr(,"match.length")
# [1] 22 22
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# generalized a little, in case you change the reprEx string
leftside <- if (first[1] > 0) first[1] + attr(first, "match.length")[1] else 1
second <- gregexpr("OTHER WITHDRAWALS", substr(reprEx, leftside, nchar(reprEx)))[[1]]
second
# [1] 124 176
# attr(,"match.length")
# [1] 17 17
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
rightside <- leftside + second[1] - 2
c(leftside, rightside)
# [1] 29 151
substr(reprEx, leftside, rightside)
# [1] "| om93 CCD ID: 964En To American Hon\nELECTRONIC WITHDRAWALSda Finance Corp 295.00\nTotal Electronic Withdrawals $93,735.18\n["
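For comparison, here is a regex-only sketch. In both the ICU engine used by stringr and in PCRE, `.` does not match a line break by default, which would explain the NA from the original pattern; the inline flag (?s) switches that on. The example uses base R with perl = TRUE and a shortened, made-up stand-in for reprEx:

```r
# Shortened, illustrative stand-in for reprEx (not the original string)
txt <- "2100\n\nELECTRONIC WITHDRAWALS| om93\nELECTRONIC WITHDRAWALSda 295.00\n[OTHER WITHDRAWALS| X\n[OTHER WITHDRAWALS| Y"

# (?s) lets "." match line breaks; ".*?" is lazy, so the match stops
# at the first OTHER WITHDRAWALS after the first ELECTRONIC WITHDRAWALS
pattern <- "(?s)ELECTRONIC WITHDRAWALS(.*?)OTHER WITHDRAWALS"

m <- regmatches(txt, regexec(pattern, txt, perl = TRUE))[[1]]
between <- m[2]   # first capture group
between
```

The same (?s) prefix should also work in the str_match call from the question, since stringr patterns accept inline flags.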
I am writing a function to be applied to many individual matrices. Each matrix has 5 columns of string text. I want to remove a piece of one string which matches the string inside another element exactly, then apply a couple more stringr functions, transform it into a data frame, then rename the columns and in the last step I want to add a number to the end of each column name, since I will apply this to many matrices and need to identify the columns later.
This is very similar to another function I wrote so I can't figure out why it won't work. I tried running each line individually by filling in the inputs like this and it works perfectly:
Review1[,4] <- str_remove(Review1[,4], Review1[,3])
Review1[,4] <- str_sub(Review1[,4], 4, -4)
Review1[,4] <- str_trim(Review1[,4], "both")
Review1 <- as.data.frame(Review1)
colnames(Review1) <- c("Title", "Rating", "Date", "User", "Text")
Review1 <- Review1 %>% rename_all(paste0, 1)
But when I run the function nothing seems to happen at all.
Transform_Reviews <- function(x, y, z, a) {
x[,y] <- str_remove(x[,y], x[,z])
x[,y] <- str_sub(x[,y], 4, -4)
x[,y] <- str_trim(x[,y], "both")
x <- as.data.frame(x)
colnames(x) <- c("Title", "Rating", "Date", "User", "Text")
x <- x %>% rename_all(paste0, a)
}
Transform_Reviews(Review1, 4, 3, 1)
This is the only warning message I get. I also receive this when I run the str_remove function individually, but it still changes the elements. But it changes nothing when I run the UDF.
Warning messages:
1: In stri_replace_first_regex(string, pattern, fix_replacement(replacement), ... :
empty search patterns are not supported
This is an example of the part of Review1 that I'm working with.
[,3] [,4]
[1,] "6 April 2014" "By Copnovelist on 6 April 2014"
[2,] "18 Dec. 2015" "By kenneth bell on 18 Dec. 2015"
[3,] "26 May 2015" "By Simon.B :-) on 26 May 2015"
[4,] "22 July 2013" "By Lilla Lukacs on 22 July 2013"
This is what I want the output to look like:
Date1 User1
1 6 April 2014 Copnovelist
2 18 Dec. 2015 kenneth bell
3 26 May 2015 Simon.B :-)
4 22 July 2013 Lilla Lukacs
I realized I just needed to use an assignment operator to see my function work.
Review1 <- Transform_Reviews(Review1, 4, 3, 1)
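The reason is that R functions work on a copy of their arguments, so Review1 itself is never changed inside the function. In addition, the last expression of Transform_Reviews is an assignment, and an assignment returns its value invisibly, so nothing is printed either. A minimal sketch with a toy function (add_one is a made-up name):

```r
# Toy function whose last expression is an assignment
add_one <- function(x) {
  x <- x + 1
}

value <- 1
add_one(value)            # prints nothing: the result is returned invisibly
value                     # still 1: the caller's object is not modified

value <- add_one(value)   # assigning the result keeps the change
value                     # now 2

# withVisible() shows that the result exists but is flagged invisible
withVisible(add_one(1))
```

Ending the function with a bare x, or return(x), would make the result print when the function is called without assignment.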
I have got a list of data frames, and I assigned them names based on the date in year.month format. However, in the list itself, the data frame objects are stored in the wrong order.
For example like this:
[1] "2004.Apr" "2004.Aug" "2004.Dec" "2004.Feb" "2004.Jul" "2004.Jun"
"2004.Mar" "2004.May" "2004.Nov" "2004.Oct" "2004.Jan"
So I want to reorder the objects within the list into the pattern illustrated below, i.e. from Jan to Dec. Is there a way of doing it without having to use a loop?
[1] "2004.Jan" "2004.Feb" "2004.Mar" "2004.Apr" "2004.May" "2004.Jun"
"2004.Jul" "2004.Aug" "2004.Sep" "2004.Oct" "2004.Nov" "2004.Dec"
You can factor the list names and set levels, but I assume you have many different lists with year and month combinations. You can use the built in constant month.abb to get the corresponding index, see below:
# sample list
byMonthList <- list()[rep(1,11)]
names(byMonthList) <- c("2004.Apr", "2004.Aug", "2004.Dec", "2004.Feb", "2004.Jul", "2004.Jun", "2004.Mar", "2004.May", "2004.Nov", "2004.Oct", "2004.Jan")
# split month name from year.month names using strsplit
# extract month name using sapply
ls_month <- sapply(strsplit(names(byMonthList), split="\\."), function(x) x[2])
ls_month
# [1] "Apr" "Aug" "Dec" "Feb" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Jan"
# use match and built in constant month.abb to get numeric value of months
ls_month_num <- match(ls_month, month.abb)
ls_month_num
# [1] 4 8 12 2 7 6 3 5 11 10 1
# Use ls_month_num to reorder your list
byMonthList[names(byMonthList)[order(ls_month_num)]]
# Or
byMonthList[names(byMonthList)[order(match(sapply(strsplit(names(byMonthList), split="\\."), function(x) x[2]), month.abb))]]
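Note that the code above orders by month only. If the names span several years, a sketch along the following lines (with made-up names) orders by year first and month second:

```r
# Made-up names covering two years
nms <- c("2005.Jan", "2004.Dec", "2004.Feb", "2005.Mar")

# Split "year.month" into its two parts
parts <- strsplit(nms, ".", fixed = TRUE)
yr <- as.integer(vapply(parts, `[`, character(1), 1))
mo <- match(vapply(parts, `[`, character(1), 2), month.abb)

# order() accepts several keys: year first, then month number
ord <- order(yr, mo)
nms[ord]
```

The same index can be applied to the list itself, e.g. byMonthList[ord] when yr and mo are computed from names(byMonthList).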
The zoo package has the as.yearmon function, which returns a 'yearmon'-classed value that is actually numeric and so can be used for ordering:
inp[ order(zoo::as.yearmon(inp, format="%Y.%b"))]
[1] "2004.Jan" "2004.Feb" "2004.Mar" "2004.Apr" "2004.May" "2004.Jun" "2004.Jul"
[8] "2004.Aug" "2004.Oct" "2004.Nov" "2004.Dec"
Data input:
inp <- scan(text='"2004.Apr" "2004.Aug" "2004.Dec" "2004.Feb" "2004.Jul" "2004.Jun" "2004.Mar" "2004.May" "2004.Nov" "2004.Oct" "2004.Jan"', what="")
They get rearranged that way because the names are stored as a character vector. If you use factor(name_vector, levels = name_vector), with name_vector already in the order you want, they'll stay in that order.
I have an output from Elastic that takes very long to convert to an R data frame. I have tried multiple options and feel there may be some trick to speed up the process.
The structure of the list is as follows. The list has aggregated data over 29 days (say). If the Elastic query output is in the list 'v_day', then v_day[[5]]$articles_over_time$buckets[1:29] represents each of the 29 days:
length(v_day[[5]]$articles_over_time$buckets)
[1] 29
page(v_day[[5]]$articles_over_time$buckets[[1]],method="print")
$key
[1] 1446336000000
$doc_count
[1] 35332
$group_by_state
$group_by_state$doc_count_error_upper_bound
[1] 0
$group_by_state$sum_other_doc_count
[1] 0
$group_by_state$buckets
$group_by_state$buckets[[1]]
$group_by_state$buckets[[1]]$key
[1] "detail"
$group_by_state$buckets[[1]]$doc_count
[1] 876
There is a "key" value right at the top (1446336000000) that I am interested in (let's call it the "time bucket key").
Within each day (take day i), "v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets" has more data I am interested in. This is an aggregation over each property (a property is an entity in the scheme of things here).
page(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,method="print")
[[1]]
[[1]]$key
[1] "detail"
[[1]]$doc_count
[1] 876
[[2]]
[[2]]$key
[1] "ff8081814fdf2a9f014fdf80b05302e0"
[[2]]$doc_count
[1] 157
[[3]]
[[3]]$key
[1] "ff80818150a7d5930150a82abbc50477"
[[3]]$doc_count
[1] 63
[[4]]
[[4]]$key
[1] "ff8081814ff5f428014ffb5de99f1da5"
[[4]]$doc_count
[1] 57
[[5]]
[[5]]$key
[1] "ff8081815038099101503823fe5d00d9"
[[5]]$doc_count
[1] 56
This shows data for 5 properties on day i. Each property has a "key" (let's call it the "property bucket key") and a "doc_count" that I am interested in.
Eventually I want a data frame with the columns "time bucket key", "property bucket key" and "doc count".
Currently I am looping over the days with the code below:
v <- NULL
ndays <- length(v_day[[5]]$articles_over_time$buckets)
for (i in 1:ndays) {
v1 <- do.call("rbind", lapply(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets, data.frame))
th_dt <- as.POSIXct(v_day[[5]]$articles_over_time$buckets[[i]]$key / 1000, origin="1970-01-01")
v1$view_date <- th_dt
v <- rbind(v, v1)
msg <- sprintf("Read views for %s. Found %d \n", th_dt, sum(v1$doc_count))
cat(msg)
}
v
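Part of the cost in the loop above likely comes from calling data.frame once per property and from growing v with rbind on every iteration. A possible alternative, sketched here on a tiny made-up list with the same shape as the buckets, builds one data frame per day from plain vectors and binds everything once at the end:

```r
# Tiny made-up stand-in for v_day[[5]]$articles_over_time$buckets
buckets <- list(
  list(key = 1446336000000, doc_count = 1033L,
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 876L),
         list(key = "p1", doc_count = 157L)))),
  list(key = 1446422400000, doc_count = 63L,
       group_by_state = list(buckets = list(
         list(key = "detail", doc_count = 63L))))
)

# One data frame per day, built from vectors instead of per-row data.frame calls
per_day <- lapply(buckets, function(b) {
  inner <- b$group_by_state$buckets
  day <- as.POSIXct(b$key / 1000, origin = "1970-01-01", tz = "UTC")
  data.frame(
    view_date = rep(day, length(inner)),
    property_key = vapply(inner, function(p) p$key, character(1)),
    doc_count = vapply(inner, function(p) as.numeric(p$doc_count), numeric(1)),
    stringsAsFactors = FALSE
  )
})

# Bind once at the end instead of growing the result inside the loop
v <- do.call(rbind, per_day)
v
```

Whether this is fast enough for the real data is untested; for very large outputs, data.table::rbindlist or dplyr::bind_rows are commonly used in place of do.call(rbind, ...).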