How to select specific column from imported table?


I am using the following formula in Google Sheets to pull in some financial data:
=TRANSPOSE(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT","table",4))
The IMPORTHTML result is:
Forward Annual Dividend Rate 4 2.04
Forward Annual Dividend Yield 4 1.11%
Trailing Annual Dividend Rate 3 1.94
Trailing Annual Dividend Yield 3 1.05%
5 Year Average Dividend Yield 4 2.02
Payout Ratio 4 32.93%
Dividend Date 3 Mar 11, 2020
Ex-Dividend Date 4 Feb 18, 2020
Last Split Factor 2 2:1
Last Split Date 3 Feb 17, 2003
I am TRANSPOSING the result to prepare the data for querying:
Forward Annual Dividend Rate 4 Forward Annual Dividend Yield 4 Trailing Annual Dividend Rate 3 ...
2.04 1.11% 1.94 ...
What I need is the value of the Ex-Dividend Date 4 column (so: Feb 18, 2020), and later other columns too, so I am seeking a generic solution. I have tried multiple ways (see below), but all of them result in #VALUE! errors:
=QUERY(TRANSPOSE(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT","table",4)), "SELECT * LIMIT 2 OFFSET 1 WHERE COL=""Ex-Dividend Date 4"")")
=QUERY(TRANSPOSE(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT","table",4)), "SELECT [Ex-Dividend Date 4] LIMIT 2 OFFSET 1")
How do I query this table correctly?

try:
=INDEX(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT",
"table", 4), 8, 2)
or already formatted:
=TEXT(INDEX(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT",
"table", 4), 8, 2), "mm/dd/yyyy")
in QUERY:
=QUERY(IMPORTHTML("https://finance.yahoo.com/quote/MSFT/key-statistics?p=MSFT",
"table", 4), "select Col2 where Col1 contains 'Ex-Dividend Date 4'", 0)

Related

index a dataframe with repeated values according to vector

I am trying to average values in different months over vectors of dates. Basically, I have a dataframe with monthly values of a variable, and I'm trying to get a representative average of the experienced values for samples that sometimes span month boundaries.
I've ended up with a dataframe of monthly values, and vectors of the representative number of "month-year" combinations of every sampling duration (e.g. if a sample was out from Jan 28, 2000 to Feb 1, 2000, the vector would show 4 values of Jan 2000, 1 value of Feb 2000). Later I'm going to average the values with these weights, so it's important that the returned variable values appear in representative numbers.
I am having trouble figuring out how to index the dataframe pulling the representative value repeatedly. See below.
library(tidyverse)  # provides tribble() and %>%
# data frame of monthly values
reprex_df <-
tribble(
~my, ~value,
"2000-01", 10,
"2000-02", 11,
"2000-03", 15,
"2000-04", 9,
"2000-05", 13
) %>%
as.data.frame()
# vector of month-year dates from Jan 28 to Feb 1:
reprex_vec <- c("2000-01","2000-01","2000-01","2000-01","2000-02")
# I want to index the df using the vector to get a return vector of
# January value*4, Feb value*1, or 10, 10, 10, 10, 11
# I tried this:
reprex_df[reprex_df$my %in% reprex_vec,"value"]
# but %in% only returns each value once ("10 11", not "10 10 10 10 11").
# is there a different way I should be indexing to account for repeated values?
# eventually I will take an average, e.g.:
mean(reprex_df[reprex_df$my %in% reprex_vec,"value"])
# but I want this average to equal 10.2 for mean(c(10,10,10,10,11)), not 10.5 for mean(c(10,11))

Simple tidy solution with inner_join:
dplyr::inner_join(reprex_df, data.frame(my = reprex_vec), by = "my")$value

in base R:
merge(reprex_df, list(my = reprex_vec))
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11

Perhaps use match from base R to get the index
reprex_df[match(reprex_vec, reprex_df$my),]
my value
1 2000-01 10
1.1 2000-01 10
1.2 2000-01 10
1.3 2000-01 10
2 2000-02 11
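
To get the weighted average the question is ultimately after, the match() lookup can be fed straight into mean(). This is a minimal sketch that rebuilds the question's data as a plain data.frame; the names idx, repeated_values, counts and weighted_avg are only illustrative. An equivalent variant counts the month occurrences with table() and passes them to weighted.mean() as weights.

```r
# Monthly values and the month-year vector from the question
reprex_df <- data.frame(
  my = c("2000-01", "2000-02", "2000-03", "2000-04", "2000-05"),
  value = c(10, 11, 15, 9, 13)
)
reprex_vec <- c("2000-01", "2000-01", "2000-01", "2000-01", "2000-02")

# match() returns one row index per element of reprex_vec, repeats included
idx <- match(reprex_vec, reprex_df$my)
repeated_values <- reprex_df$value[idx]
weighted_avg <- mean(repeated_values)

# Alternative: count each month once and use the counts as weights
counts <- table(reprex_vec)
weighted_avg2 <- weighted.mean(
  reprex_df$value[match(names(counts), reprex_df$my)],
  as.numeric(counts)
)
```

Both variants give 10.2 for this example, rather than the 10.5 that %in% produces.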

Another base R option using setNames
with(
reprex_df,
data.frame(
my = reprex_vec,
value = setNames(value, my)[reprex_vec]
)
)
gives
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11

How can I use a vectorised function to multiply all values in a different data frame for a given ID in R?

I have a huge dataset with 750,000 IDs, for which I want to aggregate monthly values to yearly values by multiplying all values for a given ID. The ID consists of a combination of an identification number and a year.
The data I want to extract:
ID        monthly value
1 - 1997  Product of Monthly Values in Year 1997
1 - 1998  Product of Monthly Values in Year 1998
1 - 1999  Product of Monthly Values in Year 1999
...       ...
2 - 1997  Product of Monthly Values in Year 1997
2 - 1998  Product of Monthly Values in Year 1998
2 - 1999  Product of Monthly Values in Year 1999
...       ...
The dataset which is the source:
ID        monthly value
1 - 1997  Monthly Value 1 in Year 1997
1 - 1997  Monthly Value 2 in Year 1997
1 - 1997  Monthly Value 3 in Year 1997
...       ...
2 - 1997  Monthly Value 1 in Year 1997
2 - 1997  Monthly Value 2 in Year 1997
2 - 1997  Monthly Value 3 in Year 1997
...       ...
I have written a for loop, which takes about 0.74s for 10 IDs, which is way too slow. It would take about 15 hours to run through the whole dataset. The for loop multiplies all monthly values for a given ID and stores the result in a separate data frame.
for (i in 1:nrow(yearlyreturns)){
yearlyreturns[i, "yret"] <- prod(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"]) - 1
yearlyreturns[i, "monthcount"] <- length(monthlyreturns[monthlyreturns$ID == yearlyreturns[i,"ID"],"change"])
}
I don't know how to get from here to a vectorised function, which takes less time.
Is this possible to do in R?

Something like this:
library(dplyr)
library(stringr)  # for str_replace()
df %>%
mutate(monthly_value = paste("Product of", str_replace(monthly_value, 'Value\\s\\d', 'Values'))) %>%
group_by(ID, monthly_value) %>%
summarise()
ID monthly_value
<chr> <chr>
1 1 - 1997 Product of Monthly Values in Year 1997
2 2 - 1997 Product of Monthly Values in Year 1997
data:
structure(list(ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997",
"2 - 1997", "2 - 1997"), monthly_value = c("Monthly Value 1 in Year 1997",
"Monthly Value 2 in Year 1997", "Monthly Value 3 in Year 1997",
"Monthly Value 1 in Year 1997", "Monthly Value 2 in Year 1997",
"Monthly Value 3 in Year 1997")), class = "data.frame", row.names = c(NA,
-6L))

Based on the for loop code, this may be done with a join
library(data.table)
setDT(yearlyreturns)[monthlyreturns, c("yret", "monthcount")
:= .(prod(change) -1, .N), on = .(ID), by = .EACHI]
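
As a minimal base R sketch of the grouped aggregation the question asks for (one product and one count per ID), assuming a monthlyreturns data frame with the ID and change columns used in the loop. The sample values below are made up for illustration, and tapply() is only one of several ways to group.

```r
# Hypothetical sample data with the column names from the question's loop
monthlyreturns <- data.frame(
  ID = c("1 - 1997", "1 - 1997", "1 - 1997", "2 - 1997", "2 - 1997"),
  change = c(1.10, 0.90, 1.05, 1.02, 1.03)
)

# One pass per statistic: tapply() splits change by ID and applies the function
yret <- tapply(monthlyreturns$change, monthlyreturns$ID, prod) - 1
monthcount <- tapply(monthlyreturns$change, monthlyreturns$ID, length)

# Assemble the yearly table; names(yret) holds the IDs in sorted order
yearlyreturns <- data.frame(
  ID = names(yret),
  yret = as.numeric(yret),
  monthcount = as.integer(monthcount)
)
```

Because each ID is processed as a group rather than by repeatedly scanning the whole monthly table, this avoids the per-row subsetting that makes the loop slow.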

In addition to the most excellent previous answers - here's a link to an earlier post comparing 10 common ways to calculate means by group. Data.table based solutions are definitely the way to go - especially for datasets with millions of rows. Unless you're writing to individual output files - I'm not sure why this would take hours rather than minutes.

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!

You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
mutate(
date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
day=format(date, "%d"),
month=format(date, "%m"),
year=format(date, "%Y"),
day.of.year=format(date, "%j")
)
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
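Edit: note that the output for all the new columns is character. We can tell this without using str() because of the leading zeros in the new columns. To get rid of them, convert only those columns, for example with out <- out %>% mutate(across(c(day, month, year, day.of.year), as.integer)), or append that mutate call to the end of the existing pipeline. Converting every column with mutate_all(as.integer) would not work here, because the 16-digit code is too large for an integer and would become NA.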
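
For comparison, the same idea can be sketched in base R without any packages, assuming the codes are stored as character strings as in the question. The variable names date, day and dayofyear are only illustrative.

```r
# Codes from the question; characters 6 to 13 hold the date as yyyymmdd
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Parse the embedded date, then read the day and day-of-year from it
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")
day <- as.integer(format(date, "%d"))
dayofyear <- as.integer(format(date, "%j"))

dat2 <- data.frame(code, date, day, dayofyear)
```

Wrapping format() in as.integer() drops the leading zeros, so day is 24, 2, 15, 7 and dayofyear is 55, 2, 74, 250, matching the values requested in the question.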

How to remove rows based on different column values in R?

I have a 24-hour time series (a really large dataset) and I only need to work with certain hours of the day: hour = 7, 15 and 23.
I have to remove all the rows that correspond to the other hours (1, 2, 3, 4, ...). I have to filter the rows in a gap of 40 hours. I always need to keep the rows that correspond to hours 7, 15 and 23 of each day.
I've been struggling to create another data.frame that includes only these rows, or even just to set the unneeded ones to NULL.
I added one print from my data. The columns represent the months, days & hours of 1955. It keeps going until the last day and hour record of 1955.
Data Example

One way to select the 7th, 15th, and 23rd hours based on the data in the screen capture, using the base R extract operator, is as follows.
# Tempo is seconds since midnight, one value per hour
Tempo <- c(0,3600,7200,10800,14400,18000,21600,25200,28800,
32400,36000,39600,43200,46800,50400,54000,57600,
61200,64800,68400,72000,75600,79200,82800)
Ano <- rep(1955,24)
Mes <- rep(1,24)
Dia <- rep(1,24)
Hora <- 0:23
NiveldoMar <- c(1.07,0.91,0.81,0.78,0.91,1.05,1.32,1.57,
1.60,1.48,1.30,1.07,1.10,1.22,1.42,1.45,1.40,1.32,
1.27,1.40,1.62,NA,NA,NA)
ContoleInterno <- rep(0,24)
data <- data.frame(Tempo,Ano,Mes,Dia,Hora,NiveldoMar)
# select data from hours 7,15, and 23
data[data$Hora %in% c(7,15,23),]
...and the output:
> data[data$Hora %in% c(7,15,23),]
Tempo Ano Mes Dia Hora NiveldoMar
8 25200 1955 1 1 7 1.57
16 54000 1955 1 1 15 1.45
24 82800 1955 1 1 23 NA
>

Multiplying multiple values by a single value in a loop

I am attempting to multiply a column of numbers representing daily precipitation amounts by the corresponding monthly precipitation amount of the same year. From the example below, this means multiplying every PPT value in January 1890 by the monthly PPT value for January 1890, i.e. multiplying 31 numbers from D.SIM by the same number from M.SIM, and then doing the same for all the remaining months and years in the record. Is there an easy way?
Many thanks.
Dataset: D.SIM
Day Month Year PPT
1 1 1890 2.4
2 1 1890 0.0
3 1 1890 3.6
Dataset: M.SIM
Year Jan Feb Mar ...
1890 78.5 69.6 62.1 ...

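Repeat each monthly value once per day so that it lines up with the daily values. No loop is needed for a single month:
# df is assumed to be the monthly data frame (M.SIM in the question)
JAN <- data.frame(Jan = rep(df$Jan, each = 31))
and then repeat this for the other 11 months, using each month's number of days.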
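
An alternative that handles all months and years at once is to look up the monthly value for each daily row by year and month. This is a minimal sketch, assuming M.SIM has one row per year and columns named with the standard three-letter month abbreviations as in the question. The February row in D.SIM and the helper names monthly, row_idx, col_idx, monthly_PPT and product are hypothetical additions for illustration.

```r
# Daily data from the question, plus one made-up February row
D.SIM <- data.frame(
  Day = c(1, 2, 3, 1),
  Month = c(1, 1, 1, 2),
  Year = c(1890, 1890, 1890, 1890),
  PPT = c(2.4, 0.0, 3.6, 1.0)
)
# Monthly data from the question (first three months only)
M.SIM <- data.frame(Year = 1890, Jan = 78.5, Feb = 69.6, Mar = 62.1)

# Month columns as a numeric matrix: one row per year, one column per month
monthly <- as.matrix(M.SIM[, setdiff(names(M.SIM), "Year")])

# For every daily row, find the matching year row and month column
row_idx <- match(D.SIM$Year, M.SIM$Year)
col_idx <- match(month.abb[D.SIM$Month], colnames(monthly))

# A two-column index matrix picks one monthly value per daily row
D.SIM$monthly_PPT <- monthly[cbind(row_idx, col_idx)]
D.SIM$product <- D.SIM$PPT * D.SIM$monthly_PPT
```

Matching on column names rather than positions keeps the lookup correct even if the month columns are not stored in calendar order.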
