How to group by column value using R programming

I have a table
Employee Details:
EmpID | WorkingPlaces | Salary
1001 | Bangalore | 5000
1001 | Chennai | 6000
1002 | Bombay | 1000
1002 | Chennai | 500
1003 | Pune | 2000
1003 | Mangalore | 1000
The same employee may work at different places in a month. How do I find the top 2 highest-paid employees?
The result table should look like
EmpID | WorkingPlaces | Salary
1001 | Chennai | 6000
1001 | Bangalore | 5000
1003 | Pune | 2000
1003 | Mangalore | 1000
My code (in R):
knime.out <- aggregate(x= $"EmpID", by = list(Thema = $"WorkingPlaces", Project = $"Salary"), FUN = "length") [2]
But this doesn't give me the expected result. Kindly help me correct the code.

We can try with dplyr
library(dplyr)
df1 %>%
  group_by(EmpID) %>%
  mutate(SumSalary = sum(Salary)) %>%
  arrange(-SumSalary, EmpID) %>%
  head(4) %>%
  select(-SumSalary)

A base R solution, considering your dataframe as df. We first aggregate the data by EmpID and calculate the sum of each employee's salary. Then we select the top 2 EmpIDs with the highest totals and subset those IDs from the original dataframe using %in%.
temp <- aggregate(Salary~EmpID, df, sum)
df[df$EmpID %in% temp$EmpID[tail(order(temp$Salary), 2)], ]
# EmpID WorkingPlaces Salary
#1 1001 Bangalore 5000
#2 1001 Chennai 6000
#5 1003 Pune 2000
#6 1003 Mangalore 1000
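Note that head(4) in the dplyr answer only works because every employee happens to have exactly two rows. A base R sketch that generalizes to the top n employees by total salary, regardless of how many rows each employee has (the data frame df and n = 2 here are assumptions recreating the question's data):

```r
# Sample data matching the question
df <- data.frame(
  EmpID = c(1001, 1001, 1002, 1002, 1003, 1003),
  WorkingPlaces = c("Bangalore", "Chennai", "Bombay", "Chennai", "Pune", "Mangalore"),
  Salary = c(5000, 6000, 1000, 500, 2000, 1000)
)

n <- 2  # how many top-paid employees to keep
totals <- aggregate(Salary ~ EmpID, df, sum)                 # total salary per employee
top_ids <- totals$EmpID[order(-totals$Salary)][seq_len(n)]   # n highest totals
result <- df[df$EmpID %in% top_ids, ]                        # all rows for those employees
```

This keeps all rows for employees 1001 and 1003, matching the expected result, and keeps working if an employee has one, two, or more working places.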


Can R do the equivalent of an HLOOKUP nested within a VLOOKUP?

I am trying (unsuccessfully) to do the equivalent of an HLOOKUP nested within a VLOOKUP in Excel using R Studio.
Here is the situation.
I have two tables. Table 1 has historical stock prices, where each column represents a ticker name and each row represents a particular date. Table 1 contains the closing stock price for each ticker on each date.
Assume Table 1 looks like this:
|----------------------------|
| Date |MSFT | AMZN |EPD |
|----------------------------|
| 6/1/2020 | 196 | 2600 | 19 |
| 5/1/2020 | 186 | 2200 | 20 |
| 4/1/2020 | 176 | 2000 | 15 |
| 3/1/2020 | 166 | 1800 | 14 |
| 2/1/2020 | 170 | 2200 | 18 |
| 1/1/2020 | 180 | 2300 | 17 |
|----------------------------|
Table 2 has a list of ticker symbols, as well as two dates and placeholders for the stock price on each date. Date1 is always an earlier date than Date2, and each of Date1 and Date2 corresponds with a date in Table 1. Note that Date1 and Date2 are different for each row of Table 2.
My objective is to pull the applicable PriceOnDate1 and PriceOnDate2 into Table 2, similar to the VLOOKUP / HLOOKUP functions in Excel. (I can't use Excel going forward on this, as the file is too big for Excel to handle.) Then I can calculate the return for each row by a formula like this: (PriceOnDate2 - PriceOnDate1) / PriceOnDate1
Assume I want Table 2 to look like this, but I am unable to pull in the pricing data for PriceOnDate1 and PriceOnDate2:
|-----------------------------------------------------------|
| Ticker | Date1 | Date2 |PriceOnDate1 |PriceOnDate2 |
|-----------------------------------------------------------|
| MSFT | 1/1/2020 | 4/1/2020 | _________ | ________ |
| MSFT | 2/1/2020 | 6/1/2020 | _________ | ________ |
| AMZN | 5/1/2020 | 6/1/2020 | _________ | ________ |
| EPD | 1/1/2020 | 3/1/2020 | _________ | ________ |
| EPD | 1/1/2020 | 4/1/2020 | _________ | ________ |
|-----------------------------------------------------------|
My question is whether there is a way to use R to pull into Table 2 the closing price data from Table 1 for each Date1 and Date2 in each row of Table 2. For instance, in the first row of Table 2, ideally the R code would pull in 180 for PriceOnDate1 and 176 for PriceOnDate2.
I've tried searching for answers, but I am unable to craft a solution that would allow me to do this in R Studio. Can anyone please help me with a solution? I greatly appreciate your time. THANK YOU!!
Working in something like R requires you to think of the data a bit differently. Your Table 1 is probably easiest to work with pivoted into a long format. You can then just join together on the Ticker and Date to pull the values you want.
Data:
table_1 <- data.frame(Date = c("6/1/2020", "5/1/2020", "4/1/2020",
                               "3/1/2020", "2/1/2020", "1/1/2020"),
                      MSFT = c(196, 186, 176, 166, 170, 180),
                      AMZN = c(2600, 2200, 2000, 1800, 2200, 2300),
                      EPD = c(19, 20, 15, 14, 18, 17))
# only created part of Table 2
table_2 <- data.frame(Ticker = c("MSFT", "AMZN"),
                      Date1 = c("1/1/2020", "5/1/2020"),
                      Date2 = c("4/1/2020", "6/1/2020"))
Solution:
The tidyverse approach is pretty easy here.
library(dplyr)
library(tidyr)
First, pivot Table 1 to be longer.
table_1_long <- table_1 %>%
  pivot_longer(-Date, names_to = "Ticker", values_to = "Price")
Then join in the prices that you want by matching the Date and Ticker.
table_2 %>%
  left_join(table_1_long, by = c(Date1 = "Date", "Ticker")) %>%
  left_join(table_1_long, by = c(Date2 = "Date", "Ticker")) %>%
  rename(PriceOnDate1 = Price.x,
         PriceOnDate2 = Price.y)
# Ticker Date1 Date2 PriceOnDate1 PriceOnDate2
# 1 MSFT 1/1/2020 4/1/2020 180 176
# 2 AMZN 5/1/2020 6/1/2020 2200 2600
The mapply function would do it here:
Let's say your first table is stored in a data.frame called df and the second in a data.frame called df2:
df2$PriceOnDate1 <- mapply(function(ticker, date) df[[ticker]][df$Date == date],
                           df2$Ticker, df2$Date1)
df2$PriceOnDate2 <- mapply(function(ticker, date) df[[ticker]][df$Date == date],
                           df2$Ticker, df2$Date2)
In this code, the HLOOKUP is the double bracket ([[), which selects the column with that name. The VLOOKUP is the single bracket ([), which selects the value at the matching position.
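As a minimal illustration of those two lookup steps on their own, using the table_1 sample data from above (sketch, not the full mapply solution):

```r
table_1 <- data.frame(Date = c("6/1/2020", "5/1/2020", "4/1/2020",
                               "3/1/2020", "2/1/2020", "1/1/2020"),
                      MSFT = c(196, 186, 176, 166, 170, 180),
                      AMZN = c(2600, 2200, 2000, 1800, 2200, 2300))

col <- table_1[["MSFT"]]                  # "HLOOKUP": pick the column by name
price <- col[table_1$Date == "1/1/2020"]  # "VLOOKUP": pick the row by date
# price is 180
```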
This can be done with a single join if both data frames are in long format, followed by a pivot_wider to get the desired final shape.
The code below uses @Adam's sample data. Note that in the sample data, the dates may be coded as factors. You'll probably want your dates coded as R's Date class in your real data.
library(tidyverse)
table_2 %>%
  pivot_longer(-Ticker, values_to = "Date") %>%
  left_join(
    table_1 %>%
      pivot_longer(-Date, names_to = "Ticker", values_to = "Price")
  ) %>%
  pivot_wider(names_from = name, values_from = c(Date, Price)) %>%
  rename_all(~gsub("Date_", "", .))
Ticker Date1 Date2 Price_Date1 Price_Date2
1 MSFT 1/1/2020 4/1/2020 180 176
2 AMZN 5/1/2020 6/1/2020 2200 2600

Populating column based on row matches without for loop

Is there a way to obtain the annual count values based on the state, species, and year, without using a for loop?
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | ?
Dora | CA | 2 | Regal Tang | ?
Lookup table:
State | Species | Year | AnnualCt
NY | Clownfish | 2012 | 500
NY | Clownfish | 2014 | 200
CA | Regal Tang | 2001 | 400
CA | Regal Tang | 2014 | 680
CA | Regal Tang | 2000 | 700
The output would be:
Name | State | Age | Species | Annual Ct
Nemo | NY | 5 | Clownfish | 200
Dora | CA | 2 | Regal Tang | 680
What I've tried:
pets <- data.frame("Name" = c("Nemo", "Dora"), "State" = c("NY", "CA"),
                   "Age" = c(5, 2), "Species" = c("Clownfish", "Regal Tang"))
fishes <- data.frame("State" = c("NY", "NY", "CA", "CA", "CA"),
                     "Species" = c("Clownfish", "Clownfish", "Regal Tang",
                                   "Regal Tang", "Regal Tang"),
                     "Year" = c("2012", "2014", "2001", "2014", "2000"),
                     "AnnualCt" = c("500", "200", "400", "680", "700"))
pets["AnnualCt"] <- NA
for (row in 1:nrow(pets)) {
  pets$AnnualCt[row] <- as.character(droplevels(
    fishes[which(fishes$State == pets[row, ]$State &
                   fishes$Species == pets[row, ]$Species &
                   fishes$Year == 2014),
           which(colnames(fishes) == "AnnualCt")]))
}
I'm confused as to what you're trying to do; isn't this just this?
library(dplyr)
left_join(pets, fishes) %>%
  filter(Year == 2014) %>%
  select(-Year)
#Joining, by = c("State", "Species")
# Name State Age Species AnnualCt
#1 Nemo NY 5 Clownfish 200
#2 Dora CA 2 Regal Tang 680
Explanation: left_join both data.frames by State and Species, filter for Year == 2014 and output without Year column.
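For comparison, the same result can be had in base R with merge() plus a subset on Year. A sketch using the pets/fishes data from the question (merge joins on the shared State and Species columns automatically):

```r
pets <- data.frame(Name = c("Nemo", "Dora"), State = c("NY", "CA"),
                   Age = c(5, 2), Species = c("Clownfish", "Regal Tang"))
fishes <- data.frame(State = c("NY", "NY", "CA", "CA", "CA"),
                     Species = c("Clownfish", "Clownfish", "Regal Tang",
                                 "Regal Tang", "Regal Tang"),
                     Year = c("2012", "2014", "2001", "2014", "2000"),
                     AnnualCt = c("500", "200", "400", "680", "700"))

# keep only the 2014 counts, then merge on the shared State/Species columns
res <- merge(pets, fishes[fishes$Year == "2014", c("State", "Species", "AnnualCt")])
```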

R - Performing a CountIF for a multiple rows data frame

I've googled lots of examples about how to perform a CountIF in R, however I still didn't find the solution for what I want.
I basically have 2 dataframes:
df1: customer_id | date_of_export - here, we have only 1 date of export per customer
df2: customer_id | date_of_delivery - here, a customer can have different delivery dates (which means, same customer will appear more than once in the list)
And I need to count, for each customer_id in df1, how many deliveries they got after the export date. So, I need to count if df1$customer_id = df2$customer_id AND df1$date_of_export <= df2$date_of_delivery
To understand better:
customer_id | date_of_export
1 | 2018-01-12
2 | 2018-01-12
3 | 2018-01-12
customer_id | date_of_delivery
1 | 2018-01-10
1 | 2018-01-17
2 | 2018-01-13
2 | 2018-01-20
3 | 2018-01-04
My output should be:
customer_id | date_of_export | deliveries_after_export
1 | 2018-01-12 | 1 (one delivery after the export date)
2 | 2018-01-12 | 2 (two deliveries after the export date)
3 | 2018-01-12 | 0 (no delivery after the export date)
It doesn't seem that complicated, but I haven't found a good approach. I've been struggling for 2 days and have accomplished nothing.
I hope I made myself clear here. Thank you!
I would suggest merging the two data.frames together and then it's a simple sum():
library(data.table)
df3 <- merge(df1, df2)
setDT(df3)[, .(deliveries_after_export = sum(date_of_delivery > date_of_export)),
           by = .(customer_id, date_of_export)]
# customer_id date_of_export deliveries_after_export
#1: 1 2018-01-12 1
#2: 2 2018-01-12 2
#3: 3 2018-01-12 0
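If data.table isn't available, a base R sketch of the same merge-then-count idea (object and column names assumed from the question):

```r
df1 <- data.frame(customer_id = 1:3,
                  date_of_export = as.Date(rep("2018-01-12", 3)))
df2 <- data.frame(customer_id = c(1, 1, 2, 2, 3),
                  date_of_delivery = as.Date(c("2018-01-10", "2018-01-17",
                                               "2018-01-13", "2018-01-20",
                                               "2018-01-04")))

m <- merge(df1, df2)                        # join on customer_id
m$after <- m$date_of_delivery > m$date_of_export
out <- aggregate(after ~ customer_id + date_of_export, data = m, FUN = sum)
```

Summing a logical vector counts the TRUE values, which is exactly the CountIF behavior asked for.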

Summarize number of cases in category and calculate new column

I have a dataset with the following structure
zip code |type of crime
------ |------
1002 |crime1
1002 |crime1
1002 |crime2
1002 |crime1
9210 |crime1
9210 |crime1
9210 |crime2
9210 |crime2
I also have a list of minimum sentences for each crime
crime | minimum sentence (days)
------| ------
crime1|10
crime2|15
Using these two tables, I would like to do the following:
calculate the total of each crime in each neighborhood
zip code | crime |number of crimes
------ | ------ |-----
1002 | crime1 | 3
1002 | crime2 | 1
9210 | crime1 | 2
9210 | crime2 | 2
multiply each crime by its minimum sentence and then calculate the total of days by neighborhood.
zip | crime | crimexdays
---- | ------ | -----
1002 | crime1 | 30
1002 | crime2 | 15
9210 | crime1 | 20
9210 | crime2 | 30
I'd really appreciate any help here. Cheers!!
Get the frequency with count, left_join with the second dataset, and use transmute to create the new column:
df1 %>%
  count(zipcode, typeofcrime) %>%
  left_join(df2, by = c("typeofcrime" = "crime")) %>%
  transmute(zipcode, typeofcrime, crimexsentence = n * minimumsentence)
# zipcode typeofcrime crimexsentence
# <int> <chr> <int>
#1 1002 crime1 30
#2 1002 crime2 15
#3 9210 crime1 20
#4 9210 crime2 30

How to sort unique values based on another column in R

I would like to extract unique values based on the sum in another column. For example, I have the following data frame "music"
ID | Song | artist | revenue
7520 | Dance with me | R Kelly | 2000
7531 | Gone girl | Vincent | 1890
8193 | Motivation | R Kelly | 3500
9800 | What | Beyonce | 12000
2010 | Excuse Me | Pharell | 1010
1999 | Remove me | Jack Will | 500
Basically, I would like to sort out the top 5 artists based on revenue, without duplicate entries for a given artist.
You just need order() to do this. For instance:
head(unique(music$artist[order(music$revenue, decreasing=TRUE)]))
or, to retain all columns (although uniqueness of artists would be a little trickier):
head(music[order(music$revenue, decreasing=TRUE),])
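The "trickier" all-columns version can be sketched with duplicated(): after sorting by revenue, drop repeat artists so each artist keeps only their single highest-revenue row. Note this ranks artists by their best single song, not by summed revenue (the music data below recreates the question's table, assuming consistent capitalization of "R Kelly"):

```r
music <- data.frame(
  ID = c(7520, 7531, 8193, 9800, 2010, 1999),
  Song = c("Dance with me", "Gone girl", "Motivation", "What",
           "Excuse Me", "Remove me"),
  artist = c("R Kelly", "Vincent", "R Kelly", "Beyonce",
             "Pharell", "Jack Will"),
  revenue = c(2000, 1890, 3500, 12000, 1010, 500)
)

sorted <- music[order(music$revenue, decreasing = TRUE), ]  # highest revenue first
top5 <- head(sorted[!duplicated(sorted$artist), ], 5)       # first (best) row per artist
```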
Here's the dplyr way:
df <- read.table(text = "
ID | Song | artist | revenue
7520 | Dance with me | R Kelly | 2000
7531 | Gone girl | Vincent | 1890
8193 | Motivation | R Kelly | 3500
9800 | What | Beyonce | 12000
2010 | Excuse Me | Pharell | 1010
1999 | Remove me | Jack Will | 500
", header = TRUE, sep = "|", strip.white = TRUE)
You can group_by the artist, and then you can choose how many entries you want to peek at (here just 3):
require(dplyr)
df %>%
  group_by(artist) %>%
  summarise(tot = sum(revenue)) %>%
  arrange(desc(tot)) %>%
  head(3)
Result:
Source: local data frame [3 x 2]
artist tot
1 Beyonce 12000
2 R Kelly 5500
3 Vincent 1890
