Can R do the equivalent of an HLOOKUP nested within a VLOOKUP?

I am trying (unsuccessfully) to do the equivalent of an HLOOKUP nested within a VLOOKUP in Excel using R Studio.
Here is the situation.
I have two tables. Table 1 has historical stock prices, where each column represents a ticker name and each row represents a particular date. Table 1 contains the closing stock price for each ticker on each date.
Assume Table 1 looks like this:
| Date     | MSFT | AMZN | EPD |
|----------|------|------|-----|
| 6/1/2020 |  196 | 2600 |  19 |
| 5/1/2020 |  186 | 2200 |  20 |
| 4/1/2020 |  176 | 2000 |  15 |
| 3/1/2020 |  166 | 1800 |  14 |
| 2/1/2020 |  170 | 2200 |  18 |
| 1/1/2020 |  180 | 2300 |  17 |
Table 2 has a list of ticker symbols, as well as two dates and placeholders for the stock price on each date. Date1 is always an earlier date than Date2, and each of Date1 and Date2 corresponds with a date in Table 1. Note that Date1 and Date2 are different for each row of Table 2.
My objective is to pull the applicable PriceOnDate1 and PriceOnDate2 into Table 2, similar to the VLOOKUP / HLOOKUP functions in Excel. (I can't use Excel going forward on this, as the file is too big for Excel to handle.) Then I can calculate the return for each row with a formula like this: (PriceOnDate2 - PriceOnDate1) / PriceOnDate1
Assume I want Table 2 to look like this, but I am unable to pull in the pricing data for PriceOnDate1 and PriceOnDate2:
| Ticker | Date1    | Date2    | PriceOnDate1 | PriceOnDate2 |
|--------|----------|----------|--------------|--------------|
| MSFT   | 1/1/2020 | 4/1/2020 | _________    | ________     |
| MSFT   | 2/1/2020 | 6/1/2020 | _________    | ________     |
| AMZN   | 5/1/2020 | 6/1/2020 | _________    | ________     |
| EPD    | 1/1/2020 | 3/1/2020 | _________    | ________     |
| EPD    | 1/1/2020 | 4/1/2020 | _________    | ________     |
My question is whether there is a way to use R to pull into Table 2 the closing price data from Table 1 for each Date1 and Date2 in each row of Table 2. For instance, in the first row of Table 2, ideally the R code would pull in 180 for PriceOnDate1 and 176 for PriceOnDate2.
I've tried searching for answers, but I am unable to craft a solution that would allow me to do this in R Studio. Can anyone please help me with a solution? I greatly appreciate your time. THANK YOU!!

Working in something like R requires you to think of the data a bit differently. Your Table 1 is probably easiest to work with pivoted into a long format. You can then just join together on the Ticker and Date to pull the values you want.
Data:
table_1 <- data.frame(Date = c("6/1/2020", "5/1/2020", "4/1/2020",
                               "3/1/2020", "2/1/2020", "1/1/2020"),
                      MSFT = c(196, 186, 176, 166, 170, 180),
                      AMZN = c(2600, 2200, 2000, 1800, 2200, 2300),
                      EPD = c(19, 20, 15, 14, 18, 17))

# only created part of Table 2
table_2 <- data.frame(Ticker = c("MSFT", "AMZN"),
                      Date1 = c("1/1/2020", "5/1/2020"),
                      Date2 = c("4/1/2020", "6/1/2020"))
Solution:
The tidyverse approach is pretty easy here.
library(dplyr)
library(tidyr)
First, pivot Table 1 to be longer.
table_1_long <- table_1 %>%
  pivot_longer(-Date, names_to = "Ticker", values_to = "Price")
Then join in the prices that you want by matching the Date and Ticker.
table_2 %>%
  left_join(table_1_long, by = c(Date1 = "Date", "Ticker")) %>%
  left_join(table_1_long, by = c(Date2 = "Date", "Ticker")) %>%
  rename(PriceOnDate1 = Price.x,
         PriceOnDate2 = Price.y)
#   Ticker    Date1    Date2 PriceOnDate1 PriceOnDate2
# 1   MSFT 1/1/2020 4/1/2020          180          176
# 2   AMZN 5/1/2020 6/1/2020         2200         2600
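From there, the return from the question is one mutate away. A minimal sketch, assuming you save the joined result first (table_2_prices is a name introduced here just for illustration):

# Save the joined result, then compute the return for each row
table_2_prices <- table_2 %>%
  left_join(table_1_long, by = c(Date1 = "Date", "Ticker")) %>%
  left_join(table_1_long, by = c(Date2 = "Date", "Ticker")) %>%
  rename(PriceOnDate1 = Price.x, PriceOnDate2 = Price.y)

table_2_prices %>%
  mutate(Return = (PriceOnDate2 - PriceOnDate1) / PriceOnDate1)
# For the first MSFT row: (176 - 180) / 180 = -0.0222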

The mapply function would do it here:
Let's say your first table is stored in a data.frame called df and the second in a data.frame called df2.
df2$PriceOnDate1 <- mapply(function(ticker, date) df[[ticker]][df$Date == date],
                           df2$Ticker, df2$Date1)
df2$PriceOnDate2 <- mapply(function(ticker, date) df[[ticker]][df$Date == date],
                           df2$Ticker, df2$Date2)
In this code, the HLOOKUP is the double bracket ([[), which returns the column with that name. The VLOOKUP is the single bracket ([), which returns the value at a certain position.
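A minimal standalone illustration of the two steps, assuming Table 1 is in df as above:

df[["MSFT"]]                          # "HLOOKUP": picks the column by name
df[["MSFT"]][df$Date == "4/1/2020"]   # "VLOOKUP": keeps the row where the date matches
# [1] 176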

This can be done with a single join if both data frames are in long format, followed by a pivot_wider to get the desired final shape.
The code below uses @Adam's sample data. Note that in the sample data the dates are plain character strings (factors on R versions before 4.0); you'll probably want your dates coded as R's Date class in your real data.
library(tidyverse)
table_2 %>%
  pivot_longer(-Ticker, values_to = "Date") %>%
  left_join(
    table_1 %>%
      pivot_longer(-Date, names_to = "Ticker", values_to = "Price")
  ) %>%
  pivot_wider(names_from = name, values_from = c(Date, Price)) %>%
  rename_all(~ gsub("Date_", "", .))
  Ticker    Date1    Date2 Price_Date1 Price_Date2
1   MSFT 1/1/2020 4/1/2020         180         176
2   AMZN 5/1/2020 6/1/2020        2200        2600
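If you want the exact column names from the question, one more rename finishes it (assuming you save the pipeline above, e.g. as result, a name used here only for illustration):

result %>%
  rename(PriceOnDate1 = Price_Date1, PriceOnDate2 = Price_Date2)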

Related

Group by and Pivot table in R

I'm just starting to learn R and transition a project from Jupyter Notebook to an R Markdown document. I have a data set that looks like this:
| DATE       | ROUTE | STOP_NAME | BOARDING |
|------------|-------|-----------|----------|
| 2020-03-09 | 1     | STOP A    | 2        |
| 2020-03-09 | 1     | STOP B    | 3        |
| 2020-03-09 | 2     | STOP C    | 1        |
There are 20,xxx records over several days and 16 routes. I am trying to group by DATE and ROUTE and sum the BOARDING column. I was able to do this in Python using
df.groupby(['DATE','ROUTE'],as_index = False)['BOARDING'].sum().pivot('DATE','ROUTE').fillna(0)
I've been able to create a table in R close to what I want using:
groupcol1 <- c("DATE","ROUTE")
datacol1 <- ("BOARDING")
route_totals_table <- ddply(df,groupcol1,function(x) colSums(x[datacol1]))
But this gives me a table with a row for each date and route. I want a table like this:
| DATE       | Route 1 | Route 2 | Route 3 |
|------------|---------|---------|---------|
| 2020-03-09 | 25      | 45      | 10      |
| 2020-03-10 | 36      | 69      | 22      |
| 2020-03-11 | 95      | 100     | 29      |
I would suggest using the tidyverse package to do this work, and the spread or pivot_wider functions from the tidyr package. Suppose your data is in a data.frame called "dat":
library(tidyverse)
# using spread
dat %>%
  mutate(ROUTE = paste0("Route ", ROUTE)) %>%
  group_by(DATE, ROUTE) %>%
  summarise(BOARDING = sum(BOARDING)) %>%
  spread(ROUTE, BOARDING)

# using pivot_wider
dat %>%
  mutate(ROUTE = paste0("Route ", ROUTE)) %>%
  group_by(DATE, ROUTE) %>%
  summarise(BOARDING = sum(BOARDING)) %>%
  pivot_wider(names_from = ROUTE, values_from = BOARDING)
Both return:
  DATE       `Route 1` `Route 2`
  <chr>          <int>     <int>
1 2020-03-09         5         1

How do I programmatically filter last 52 weeks on the most recent date of my dataframe?

I am trying to programmatically filter my dataframe to the last 52 weeks, counting back from the most recent completed week in the dataframe. The most recent date is '04-21-2019'. The data will be loaded weekly, and I want to avoid filtering it manually.
I know lubridate exists, but I do not know which function would filter my data to the last 52 weeks. Would it be better to filter on 365 days instead?
The table looks like this:
| Date (week) | Product |
|-------------|---------|
| Apr 21, 19  | A       |
| Apr 21, 19  | B       |
| Apr 21, 19  | C       |
| Apr 14, 19  | A       |
| Apr 14, 19  | B       |
| Apr 14, 19  | C       |
and so on
The ideal result would give me a table with the last 52 weeks of data available.
As you said, you can use {lubridate}.
If I understand your question correctly, this should be what you are looking for:
library(dplyr)
library(lubridate)

text_date <- "04-21-2019"
last_date <- mdy(text_date)
first_date <- last_date - weeks(52)

data <- tibble(
  date = c("04-21-2019", "01-21-2019", "08-21-2018", "04-21-2018"),
  product = LETTERS[1:4]
)

data %>%
  mutate(date = mdy(date)) %>%
  filter(date >= first_date, date <= last_date)
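On the 52-weeks-versus-365-days question: 52 weeks is 364 days, which keeps the window aligned on the same weekday and is usually what you want for weekly data. If you prefer calendar days, the only change is the offset:

first_date <- last_date - days(365)  # calendar-day window instead of 52 weeks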
Please, for your next question on SO, provide a reproducible example with a usable dataframe.

How can I transform a dataframe with POSIXct dates into a time series?

I have a data frame (DF) with two columns. In column one I have dates, in column two I have my value of interest (VOI).
DF looks like this:
| Date     | VOI |
|----------|-----|
| Jan-1971 | 34  |
| Jan-1972 | 28  |
| Jan-1973 | 29  |
| Jan-1974 | 37  |
| ...      | ... |
| Jan-2017 | 36  |
| Fev-1971 | 48  |
| Fev-1972 | 49  |
| Fev-1973 | 52  |
| Fev-1974 | 50  |
| ...      | ... |
| Mar-1971 | 30  |
| ...      | ... |
| Mar-2017 | 36  |
| ...      | ... |
| Dez-1971 | 15  |
| ...      | ... |
| Dez-2017 | 19  |
In a nutshell, the data are presented in aggregated cycles of months.
First I have all the VOIs for January from 1971 to 2017 (47 data points), then all the VOIs for February over the same period, with the same number of points. This repeats through December, each month with 47 data points.
I applied ymd() from the lubridate package to transform my dates into POSIXct values.
Now I wanted to create a time series object out of my VOIs. I tried:
ts = xts(x = df$Vazao, order.by = index(df$Date))
and
ts = xts(x = df$Vazao, order.by = df$Data)
but neither worked. I don't know where I am making a mistake, but I wonder if it has anything to do with the fact that my dates don't come in chronological order. I thought that using the ymd() command would sort that out and "make R understand" that my time series goes from Jan 1971, Feb 1971, Mar 1971, ..., Dec 2017.
How would I transform this data frame into a time series object?
Thank you for your input.
Is this what you are looking for?
First, make up some data.
y <- 1971:2017
m <- seq(as.Date("2017-01-01"), as.Date("2017-12-31"), by = 28)
m <- unique(format(m, "%b"))

Date <- expand.grid(y, m)[2:1]
Date <- apply(Date, 1, paste, collapse = "-")
DF <- data.frame(Date = Date, VOI = sample(100, length(Date), TRUE))

head(DF)
#       Date VOI
# 1 Jan-1971  12
# 2 Jan-1972  89
# 3 Jan-1973  99
# 4 Jan-1974  77
# 5 Jan-1975   5
# 6 Jan-1976  46
Now, it's just a matter of applying function xts with the appropriate arguments. Note that your Date column doesn't have a day value, so I have to paste one. Day 01 is always a good choice.
library(xts)
ts <- xts(DF[, "VOI"], order.by = as.Date(paste0("01-", DF$Date), "%d-%b-%Y"))

str(ts)
# An 'xts' object on 1971-01-01/2017-12-01 containing:
#   Data: int [1:564, 1] 76 90 7 61 3 49 1 19 51 90 ...
#   Indexed by objects of class: [Date] TZ: UTC
#   xts Attributes:
#   NULL

head(ts)
#            [,1]
# 1971-01-01   76
# 1971-02-01   90
# 1971-03-01    7
# 1971-04-01   61
# 1971-05-01    3
# 1971-06-01   49
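As a usage note, once the data is an xts object you can subset it with ISO-8601 date strings, which is often the reason for the conversion in the first place:

ts["1971"]             # everything in 1971
ts["1971-01/1971-06"]  # January through June 1971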
Since your Date has only month and year, you can use the zoo::yearmon function to convert it to class yearmon, which is accepted by xts.
The expectation for the order.by argument of xts is explained in the help as:
"An xts object extends the S3 class zoo from the package of the same name. The first difference in this extension provides for a requirement that the index values not only be unique and ordered, but also must be of a time-based class. Currently acceptable classes include: Date, POSIXct, timeDate, as well as yearmon and yearqtr where the index values remain unique."
A solution can be as:
# Sample data. This data will have Date in `Jan-1971` format.
# Data has been created only for 36 months.
set.seed(1)
df <- data.frame(Date = format(seq(as.Date("1971-01-01"),
                                   as.Date("1973-12-31"), by = "month"),
                               "%b-%Y"),
                 VOI = as.integer(runif(36) * 100),
                 stringsAsFactors = FALSE)

library(zoo)
library(xts)

# Convert the Date column to type `yearmon`
ts <- xts(x = df$VOI, order.by = as.yearmon(df$Date, "%b-%Y"))

head(ts)
#          [,1]
# Jan 1971   26
# Feb 1971   37
# Mar 1971   57
# Apr 1971   90
# May 1971   20
# Jun 1971   89
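If a downstream function expects a classic stats::ts object rather than xts (e.g. decompose() or stl()), the monthly series converts cleanly. A minimal sketch, assuming the data are strictly monthly with no gaps:

# Convert the xts series to a regular monthly ts object
voi_ts <- stats::ts(as.numeric(ts), start = c(1971, 1), frequency = 12)
str(voi_ts)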

Summarising Multiple Columns in R (while retaining filter)

I've hit a bit of a brick wall with my code below. Essentially, dftable should be a filtered dataframe containing clicks on a widget (I loop through the columns for each widget).
I then want to get the sum of all pageviews the widget was active on (it's not on all pages, and I filter as such to exclude those where it is NA). However, dfviews just returns all pageviews, as opposed to filtering on where the widget is not NA.
Any guidance would be appreciated:
mixpanelData example:
| Group | Date       | WidgetClick | Widget2Click | ViewedPageResult |
|-------|------------|-------------|--------------|------------------|
| ABC   | 01/01/2017 | 123456      | NA           | 1450544          |
| ABN   | 01/01/2017 | NA          | 1245         | 4560000          |
| ABN   | 01/02/2017 | NA          | 1205         | 4561022          |
| BNN   | 01/02/2017 | 1044        | NA           | 4561021          |
And my ideal output would be along the lines of... (with proportions, which is fine as I can handle those)
WidgetClick CSV:
| Date       | WidgetClick | ViewedPageResult |
|------------|-------------|------------------|
| 01/01/2017 | 123456      | 1450544          |
| 01/02/2017 | 1044        | 4561021          |
Widget2Click CSV:
| Date       | Widget2Click | ViewedPageResult |
|------------|--------------|------------------|
| 01/01/2017 | 1245         | 4560000          |
| 01/02/2017 | 1205         | 4561022          |
Code is provided below...
library(dplyr)
library(lazyeval)   # for interp()

vars <- colnames(mixpanelData)
vars <- vars[-c(1, 2)]
k <- 1
for (v in vars) {
  filename <- paste(v, k, ".csv", sep = "")
  dftable <- mixpanelData %>%
    filter(!is.na(v)) %>%
    group_by(Date) %>%
    summarise_(clicksum = interp(~ sum(var, na.rm = TRUE), var = as.name(v)))
  dfviews <- mixpanelData %>%
    filter(!is.na(v)) %>%
    group_by(Date) %>%
    summarise(viewsum = sum(ViewedPageResult))
  total <- merge(dftable, dfviews, by = "Date")
  total <- mutate(total, proportion = clicksum / viewsum * 100)
  write.csv(total, file = filename, row.names = FALSE, na = "")
  k <- k + 1
}
In your desired results, you show two separate tables. But you also mention that you have several widgets, so separate tables might not be ideal. I'll show how you can get separate tables, and then how you can calculate for all widgets at once. (As an aside, the reason your filter returns everything: in filter(!is.na(v)), v is just the character string naming the column, e.g. "WidgetClick", and a string is never NA, so no rows are dropped.)
Separate tables
Using dplyr and tidyr, you can use filter to get your two tables like so:
library(dplyr)
library(tidyr)

df <- read.table(text = "Group Date WidgetClick Widget2Click ViewedPageResult
ABC 01/01/2017 123456 NA 1450544
ABN 01/01/2017 NA 1245 4560000
ABN 01/02/2017 NA 1205 4561022
BNN 01/02/2017 1044 NA 4561021",
                 header = TRUE, stringsAsFactors = FALSE)
df %>% filter(!is.na(WidgetClick)) %>% select(-Widget2Click)
#   Group       Date WidgetClick ViewedPageResult
# 1   ABC 01/01/2017      123456          1450544
# 2   BNN 01/02/2017        1044          4561021

df %>% filter(!is.na(Widget2Click)) %>% select(-WidgetClick)
#   Group       Date Widget2Click ViewedPageResult
# 1   ABN 01/01/2017         1245          4560000
# 2   ABN 01/02/2017         1205          4561022
Single table
To get all the results in a single table, you first need to gather the Widget*Click columns and then filter:
df %>%
  gather(Widget_number, Click, starts_with("Widget")) %>%
  filter(!is.na(Click))
#   Group       Date ViewedPageResult Widget_number  Click
# 1   ABC 01/01/2017          1450544   WidgetClick 123456
# 2   BNN 01/02/2017          4561021   WidgetClick   1044
# 3   ABN 01/01/2017          4560000  Widget2Click   1245
# 4   ABN 01/02/2017          4561022  Widget2Click   1205
EDIT
To summarise the number of clicks per month per widget, you can mutate to add a Year_month column using as.yearmon from the zoo package. Then group_by Widget_number and Year_month, and summarise to get the total clicks per month. You can do other calculations, such as proportions, inside the summarise statement. I assumed the date format was "%m/%d/%Y"; make sure that's the case.
library(zoo)

df %>%
  gather(Widget_number, Click, starts_with("Widget")) %>%
  filter(!is.na(Click)) %>%
  mutate(Year_month = as.yearmon(as.Date(Date, "%m/%d/%Y"))) %>%
  group_by(Widget_number, Year_month) %>%
  summarise(Sum_clicks = sum(Click, na.rm = TRUE))
#   Widget_number Year_month    Sum_clicks
#   <chr>         <S3: yearmon>      <int>
# 1 Widget2Click  Jan 2017            2450
# 2 WidgetClick   Jan 2017          124500
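To get back to one CSV per widget, as in the original loop, you can split the long table by widget and write each piece out. A sketch using the gathered data above; clicksum, viewsum, and proportion mirror the names in the question's loop:

library(purrr)

df %>%
  gather(Widget_number, Click, starts_with("Widget")) %>%
  filter(!is.na(Click)) %>%
  group_by(Widget_number, Date) %>%
  summarise(clicksum = sum(Click),
            viewsum = sum(ViewedPageResult)) %>%
  mutate(proportion = clicksum / viewsum * 100) %>%
  group_split() %>%
  walk(~ write.csv(.x, paste0(.x$Widget_number[1], ".csv"), row.names = FALSE))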

How to group by column value using R

I have a table
Employee Details:
| EmpID | WorkingPlaces | Salary |
|-------|---------------|--------|
| 1001  | Bangalore     | 5000   |
| 1001  | Chennai       | 6000   |
| 1002  | Bombay        | 1000   |
| 1002  | Chennai       | 500    |
| 1003  | Pune          | 2000   |
| 1003  | Mangalore     | 1000   |
The same employee works at different places in a month. How do I find the top 2 most highly paid employees (by total salary)?
The result table should look like
| EmpID | WorkingPlaces | Salary |
|-------|---------------|--------|
| 1001  | Chennai       | 6000   |
| 1001  | Bangalore     | 5000   |
| 1003  | Pune          | 2000   |
| 1003  | Mangalore     | 1000   |
My code (in R):
knime.out <- aggregate(x= $"EmpID", by = list(Thema = $"WorkingPlaces", Project = $"Salary"), FUN = "length") [2]
But this doesn't give me the expected result. Kindly help me correct the code.
We can try with dplyr
library(dplyr)

df1 %>%
  group_by(EmpID) %>%
  mutate(SumSalary = sum(Salary)) %>%
  arrange(-SumSalary, EmpID) %>%
  head(4) %>%
  select(-SumSalary)
A base R solution, considering your dataframe as df: first aggregate the data by EmpID and calculate the sum, then select the top 2 EmpIDs with the highest total salary, and subset those IDs from the original dataframe using %in%.
temp <- aggregate(Salary ~ EmpID, df, sum)
df[df$EmpID %in% temp$EmpID[tail(order(temp$Salary), 2)], ]
#   EmpID WorkingPlaces Salary
# 1  1001     Bangalore   5000
# 2  1001       Chennai   6000
# 5  1003          Pune   2000
# 6  1003     Mangalore   1000
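With dplyr 1.0 or later, the same idea reads a little more directly with slice_max, which also avoids the hard-coded head(4):

library(dplyr)

# Find the two employees with the highest total salary,
# then keep all of their rows from the original data
top_ids <- df %>%
  group_by(EmpID) %>%
  summarise(Total = sum(Salary)) %>%
  slice_max(Total, n = 2) %>%
  pull(EmpID)

df %>% filter(EmpID %in% top_ids)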
