How to "wrap" rows in R? - r

I currently have a data set that has all information within one single row (or column if I transpose).
The very first items in the data are actually column names:
Country | Population | Country Column One | Country Column 2 | USA | 400 million | USA Column 1 | USA Column 2 | Canada | 38 Million | Canada Column 1 | Canada Column 2 | etc..
I notice that I can just "wrap" and have everything start at a new row once it reaches a new country. How would I go about that? Is there a more efficient way?

d <- t(matrix(scan(text=string, sep='|', what = "", strip.white = TRUE), 4))
colnames(d) <- d[1,]
data.frame(d[-1,])
Country Population Country.Column.One Country.Column.2
1 USA 400 million USA Column 1 USA Column 2
2 Canada 38 Million Canada Column 1 Canada Column 2
string <- "Country | Population | Country Column One | Country Column 2 | USA | 400 million | USA Column 1 | USA Column 2 | Canada | 38 Million | Canada Column 1 | Canada Column 2 "

Here is a base R option using read.table + gsub
> read.table(text = gsub("(([^|]+\\|){3}[^|]+)\\|", "\\1\n", s),sep = "|",header = TRUE)
Country Population Country.Column.One Country.Column.2
1 USA 400 million USA Column 1 USA Column 2
2 Canada 38 Million Canada Column 1 Canada Column 2
given
s <- "Country | Population | Country Column One | Country Column 2 | USA | 400 million | USA Column 1 | USA Column 2 | Canada | 38 Million | Canada Column 1 | Canada Column 2 "

Here is a custom approach:
We create a tibble separate the rows and pull it as vector
with split we create a list
then we use bind_rows and do pivoting.
library(tidyverse)
my_vec <- as_tibble(string) %>%
separate_rows("value", sep = " \\| ") %>%
pull(value)
my_list <- split(my_vec, ceiling(seq_along(my_vec) / 4))
bind_rows(my_list) %>%
pivot_longer(-`1`) %>%
pivot_wider(names_from = `1`, values_from = value) %>%
select(-name)
Country Population `Country Column One` `Country Column 2`
<chr> <chr> <chr> <chr>
1 USA 400 million USA Column 1 "USA Column 2"
2 Canada 38 Million Canada Column 1 "Canada Column 2 "

Related

How to fuzzy lookup a string in one column to another column ignoring sub-setted words

I have the following 2 dataframes vendor_list and firm_list:
MARKET_ID <- c(1,2,3,4,5)
MARKET_NAME <- c("DELHI","MUMBAI","BANGALORE","KOLKATA","CHENNAI")
vendor_list <- data.frame(MARKET_ID,MARKET_NAME)
MARKET_NAME <- c("DELHI MUNICIPAL CORP","DELHI","MUMBAI","BENGALURU","BANGALORES","CITYKOLKATA")
POPULATION <- c(1000,2000,3000,4000,5000,6000)
firm_list <- data.frame(MARKET_NAME,POPULATION)
I need to search for strings in MARKET_NAME column in vendor_list dataframe in MARKET_NAME column in firm_list dataframe. But there are certain conditions:
It should only show as a match if the string is present as a stand alone block, i.e. it should not be a sub-set of the word.
So,
The match of DELHI to DELHI MUNICIPAL CORP is TRUE
The match of DELHI to DELHI is TRUE
The match of BANGALORE to BANGALORES is FALSE as BANGALORE is a sub-set of BANGALORES
The match of KOLKATA to CITYKOLKATA is FALSE as KOLKATA is a sub-set of CITYKOLKATA
Thus, the final dataframe final_market_info after lookup should look like this:
| MARKET_ID| MARKET_NAME.x | MARKET_NAME.y | POPULATION |
| 1 | DELHI | DELHI MUNICIPAL CORP| 1000 |
| 1 | DELHI | DELHI | 2000 |
| 2 | MUMBAI | MUMBAI | 3000 |
I had tried stringdist_join in stringr package using lcs and jw method but it was not giving me correct result like this.
Is this what you need?
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
left_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2
4 BENGALURU 4000 <NA> NA
5 BANGALORES 5000 <NA> NA
6 CITYKOLKATA 6000 <NA> NA
The point here is that the elements in vendor_list$MARKET_NAME are wrapped into word \\boundary markers to get exact matches and concatenated as an alternation pattern.
To remove the rows without matches, use inner_join instead of left_join:
firm_list %>%
mutate(match = str_extract(MARKET_NAME, str_c("\\b", vendor_list$MARKET_NAME, collapse = "|", "\\b"))) %>%
inner_join(., vendor_list %>% rename(match = MARKET_NAME), by = "match")
MARKET_NAME POPULATION match MARKET_ID
1 DELHI MUNICIPAL CORP 1000 DELHI 1
2 DELHI 2000 DELHI 1
3 MUMBAI 3000 MUMBAI 2

Deleting rows based on a calculated criteria in R

I conducted an analysis for some M&A-Deals. My current output looks like this:
Deal-Nr | Event-Date | Target-Nation | CAR | SIC
----------------------------------------------------
1 | 01-01-1999 | Italy | 5.1% | 201
2 | 02-01-1999 | Germany | 2.3% | 202
3 | 06-01-1999 | Spain | 1.5% | 201
4 | 10-09-1999 | Germany | 0.3% | 201
5 | 15-09-1999 | UK | 1.1% | 201
6 | 25-10-2000 | Spain | 0.8% | 201
However, for my final analysis I want to exclude all deals within the same SIC-Code, which do not have at least 180 trading days between them. So in this case, I would want to exclude my deal 3 from the analysis (as they have the same SIC-code and do not have 180 days between them). Then the code should continue and check the next deal within that SIC-Code industry and remove (<180 days) or keep it (>180 days). This should be done for all the different SIC codes in my analysis.
As I'm rather new in R, I'm reaching out for help. Thank you so much for your support.
Edit:
As indicated below I provide some further information. I'm interested in the deals that are in the same SIC-Code and >180 days apart. This would mean in the table to remove row (3) and row (5). If one deal is more than 180 days apart the subsequent dates should be checked.
First, your Event.Date column needs to be a real date, not a string. I'm inferring month-day-year. From there, we need to group by SIC and calculate the difference in dates.
base R
dat$Event.Date <- as.Date(dat$Event.Date, format = "%d-%m-%Y")
keep <- ave(as.numeric(dat$Event.Date), dat$SIC, FUN = function(z) c(TRUE, diff(z) >= 180)) > 0
dat[keep,]
# Deal.Nr Event.Date Target.Nation CAR SIC
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 4 4 1999-09-10 Germany 0.3% 201
# 6 6 2000-10-25 Spain 0.8% 201
dplyr
library(dplyr)
dat %>%
# mutate(Event.Date = as.Date(Event.Date, format = "%d-%m-%Y")) %>%
# group_by(SIC) %>%
# filter(c(TRUE, diff(Event.Date) >= 180)) %>%
# ungroup()
# . + # A tibble: 4 x 5
# Deal.Nr Event.Date Target.Nation CAR SIC
# <int> <date> <chr> <chr> <int>
# 1 1 1999-01-01 Italy 5.1% 201
# 2 2 1999-01-02 Germany 2.3% 202
# 3 4 1999-09-10 Germany 0.3% 201
# 4 6 2000-10-25 Spain 0.8% 201
data.table
library(data.table)
as.data.table(dat
# )[, Event.Date := as.Date(Event.Date, format = "%d-%m-%Y")
# ][, .SD[c(TRUE, diff(Event.Date) >= 180),], by = .(SIC)]
+ > SIC Deal.Nr Event.Date Target.Nation CAR
# 1: 201 1 1999-01-01 Italy 5.1%
# 2: 201 4 1999-09-10 Germany 0.3%
# 3: 201 6 2000-10-25 Spain 0.8%
# 4: 202 2 1999-01-02 Germany 2.3%
Data
dat <- structure(list(Deal.Nr = 1:6, Event.Date = c("01-01-1999", "02-01-1999", "06-01-1999", "10-09-1999", "15-09-1999", "25-10-2000"), Target.Nation = c("Italy", "Germany", "Spain", "Germany", "UK", "Spain"), CAR = c("5.1%", "2.3%", "1.5%", "0.3%", "1.1%", "0.8%"), SIC = c(201L, 202L, 201L, 201L, 201L, 201L)), row.names = c(NA, -6L), class = "data.frame")

How to find top 5 most occurring names in column grouped by another column

I'm trying to find the top occuring names in a column for each group from another column. I am new to R and am struggling to understand how other solutions are achieving this (solutions I find seem to resolve either the first or second part of the above).
A sample of the dataset is as follows:
Australia City | International City | Port_Region | Airline | Month_num
"Melbourne" | "Kular Lumpar" | "East Asia" | "Air Asia" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 1
"Melbourne" | "Auckland" | "Oceania" | "Air New Zealand" | 2
I am trying to find the top occurring airlines per month for an Australia city and display in a jitter chart.
Where I am having issues is with grouping the flights by airline and finding the top airlines.
The current code I am trying is:
sort(table(airlineMelb$Airline),decreasing = TRUE)[1:5]
airlineMelbPop <- c("Air New Zealand", "Air Asia")
as.factor(airlineMelbPop) %>%
ggplot(aes(x=Month_num, y=Port_Region, color=Airline)) +
labs(title="Most popular airlines per month for Melbourne") +
geom_jitter()
Any help would be greatly appreciated.
Edit: I can get the below now. This seems to be on the right track, where it is showing, for example, 'Qantas Airways' has 248 occurrences during the 9th month.
> dt = as.data.table(airlineMelb)
> dt[, .(nobs = .N), by = .(Australian_City, Month_num, Airline)][order(-nobs)]
Australian_City Month_num Airline nobs
1: Melbourne 9 Qantas Airways 248
2: Melbourne 12 Qantas Airways 242
3: Melbourne 3 Qantas Airways 224
4: Melbourne 6 Qantas Airways 224
5: Melbourne 1 Qantas Airways 195
---
494: Melbourne 1 SriLankan Airlines 2
495: Melbourne 1 LATAM Airlines 2
496: Melbourne 1 Scoot Tigerair 2
497: Melbourne 1 Japan Airlines 2
498: Melbourne 1 Air Canada 2
How can this be used with ggplot2 to graph the top 5 airlines for each month (the above is only showing 5 months?
You can use data.table to get the counts and the choose 5 rows from the sorted count column
library(data.table)
dt=data.table(airlineMelb)
dt[,counts:=sort(.N,descending=T),by=c("Australia City","Month_num","Airline")]
dt_top_5=dt[,.SD[1:5],by=c("Australia City","Month_num","Airline")]
The first groupby gets the count in each group and sorts in descending order
The second groupby is used to extract the first 5 rows from each sorted group.
Note, if a particular group has less than 5 rows, a row of NA will be added
With data.table you could do
library(data.table)
dt = as.data.table(airlineMelb)
dt_res = dt[, .(nobs = .N), by = .(city, month, airline)][order(-nobs)]
.N gives you the number of observations within the groups in by, giving you the number of observations per airline, per city, and per month in decreasing order.

How to search and extract matched words from a different column in a dataframe?

I have a variable in a dataframe which has field name 'Destination'. This field contains destination/places (can be country, continent, multiple countires, cities, city, etc. or both). I have another dataframe which contains 3 columns continent_name, country_name, city_name, etc. I want to get new column having continent, country, city names by matching destination field with 2 dataframe columns.
Dataframe A:
+---------+------------------------------------+
| Name | Destination |
+---------+------------------------------------+
| Alex | North America, Europe & France |
| Mike | Boston, London, Germany, Australia |
| Charlie | China, Europe, India, New York |
| Lophy | Antartica, UK, Europe, Delhi |
+---------+------------------------------------+
Dataframe B:
---------------+-----------+----------+
| Continent | Country | City |
+---------------+-----------+----------+
| north america | france | boston |
| anatartica | germany | london |
| europe | australia | delhi |
| XYZ | china | new york |
| ABC | india | RST |
| PQR | UK | JKL |
+---------------+-----------+----------+
Expected Output:
+---------+-----------------------+--------------------+----------------+
| Name | Continent | Country | City |
+---------+-----------------------+--------------------+----------------+
| Alex | North America, Europe | France | |
| Mike | NA | Germany, Australia | Boston, London |
| Charlie | Europe | China, India | New York |
| Lophy | Antartica, Europe | UK | Delhi |
+---------+-----------------------+--------------------+----------------+
First all continent name should be matched and stored in comma separated value in case of multiple matches then country names and then city names.
I went through multiple questions but couldn't get anything concrete.
The easiest is to put both tables in long format and join them, then go back to wide format using the destination type :
library(tidyverse)
B2 <- B %>%
gather(type,lower_dest) %>%
mutate_at("lower_dest", tolower)
A2 <- A %>%
separate_rows(Destination,sep="\\s*[,&]\\s*") %>%
mutate(lower_dest = tolower(Destination))
left_join(A2, B2, by = "lower_dest") %>%
group_by(Name, type) %>%
summarize_at("Destination", paste,collapse=", ") %>%
spread(type, Destination) %>%
ungroup
# # A tibble: 4 x 4
# Name City Continent Country
# * <chr> <chr> <chr> <chr>
# 1 Alex <NA> North America, Europe France
# 2 Charlie New York Europe China, India
# 3 Lophy Delhi Antartica, Europe UK
# 4 Mike Boston, London <NA> Germany, Australia
data
A <-
tribble(~Name , ~Destination ,
'Alex' , 'North America, Europe & France',
'Mike' , 'Boston, London, Germany, Australia',
'Charlie' , 'China, Europe, India, New York',
'Lophy' , 'Antartica, UK, Europe, Delhi')
# anatartica typo corrected into antartica
B <- tribble(~Continent, ~Country, ~City,
'north america' , 'france' , 'boston' ,
'antartica' , 'germany' , 'london' ,
'europe' , 'australia' , 'delhi' ,
'XYZ' , 'china' , 'new york' ,
'ABC' , 'india' , 'RST' ,
'PQR' , 'UK' , 'JKL')
# data
d <- read.table(text = "Name Destination
Alex 'North America, Europe & France'
Mike 'Boston, London, Germany, Australia'
Charlie 'China, Europe, India, New York'
Lophy 'Antartica, UK, Europe, Delhi'",
header = TRUE,
stringsAsFactors = FALSE)
d$Destination <- gsub("&", ",", d$Destination)
d$Destination <- tolower(d$Destination)
d$Destination <- trimws(d$Destination)
d
d2 <- read.table(text = " Continent Country City
'north america' france boston
anatartica germany london
europe australia delhi
XYZ china 'new york'
ABC india RST
PQR UK JKK", header = TRUE, stringsAsFactors = FALSE)
d2
# splits ..
check_fun <- function(a, b) {
toString(intersect(trimws(strsplit(d$Destination[a], ",")[[1]], "both"), d2[[b]]))
}
want <- as.data.frame(do.call(cbind,
lapply(colnames(d2),
function(x) {
sapply(seq_along(d$Destination),
function(y) {
check_fun(y, x)
}
)
})), stringsAsFactors = FALSE)
colnames(want) <- colnames(d2)
want$Name <- d$Name
want
# Continent Country City Name
# 1 north america, europe france Alex
# 2 germany, australia boston, london Mike
# 3 europe china, india new york Charlie
# 4 europe delhi Lophy
Several functions that will help you:
tolower() will put all your words to lower case so you have matches when there is a mix of capital letters.
str_split() from stringr will allow you to separate your destinations by elements separated by commas
So first, you need to get a vector with all your destinations:
destination_vector <-unique(unlist(strsplit(tolower(Destination), ","))) will do. Because strsplit gives you a list, you need unlist to get a vector. unique will get delete duplicates if any.
Then, you need to check if any of your destinations is in Continent, Country or City:
Continent[Continent %in% destination_vector] will do. The same for country and city
Then, you can use paste with sep="," to join all using commas as separator.
Best!

How to groupby column value using R programming

I have a table
Employee Details:
EmpID | WorkingPlaces | Salary
1001 | Bangalore | 5000
1001 | Chennai | 6000
1002 | Bombay | 1000
1002 | Chennai | 500
1003 | Pune | 2000
1003 | Mangalore | 1000
A same employee works for different places in a month. How to find the top 2 highly paid employees.
The result table should look like
EmpID | WorkingPlaces | Salary
1001 | Chennai | 6000
1001 | Bangalore | 5000
1003 | Pune | 2000
1003 | Mangalore | 1000
My code: in R language
knime.out <- aggregate(x= $"EmpID", by = list(Thema = $"WorkingPlaces", Project = $"Salary"), FUN = "length") [2]
But this doesnt give me the expected result. Kindly help me to correct the code.
We can try with dplyr
library(dplyr)
df1 %>%
group_by(EmpID) %>%
mutate(SumSalary = sum(Salary)) %>%
arrange(-SumSalary, EmpID) %>%
head(4) %>%
select(-SumSalary)
A base R solution. Considering your dataframe as df. We first aggregate the data by EmpId and calculate their sum. Then we select the top 2 EmpID's for which the salary is highest and find the subset of those ID's in the original dataframe using %in%.
temp <- aggregate(Salary~EmpID, df, sum)
df[df$EmpID %in% temp$EmpID[tail(order(temp$Salary), 2)], ]
# EmpID WorkingPlaces Salary
#1 1001 Bangalore 5000
#2 1001 Chennai 6000
#5 1003 Pune 2000
#6 1003 Mangalore 1000

Resources