How to include select 2-word phrases as tokens in tidytext? - r

I'm preprocessing some text data for further analysis. I tokenized the text into single words using unnest_tokens(), but I want to keep certain commonly occurring two-word phrases such as "United States" or "social security". How can I do this using tidytext?
tidy_data <- data %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
dput(data[1:6, 1:6])
structure(list(race = c("US House", "US House", "US House", "US House",
"", "US House"), district = c(8L, 3L, 6L, 17L, 2L, 1L), party = c("Republican",
"Republican", "Republican", "Republican", "", "Republican"),
state = c("AZ", "AZ", "KY", "TX", "IL", "NH"), sponsor = c(4,
4, 4, 1, NA, 4), approve = structure(c(1L, 1L, 1L, 4L, NA,
1L), .Label = c("no oral statement of approval, authorization",
"beginning of the spot", "middle of the spot", "end of the spot"
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")

If I were in this situation and I only had a short list of two-word phrases I need to keep in my analysis, I would do some prudent replacing before and after tokenization.
First, I would replace the two-word phrases with something that will stick together and not get broken apart by the tokenization process I'm using, like perhaps "united states" to "united_states".
library(tidyverse)
library(tidytext)
df <- tibble(text = c("I live in the United States",
"United we stand, divided we fall",
"Information security is important!",
"I work at the Social Security Administration"))
df_parsed <- df %>%
mutate(text = str_to_lower(text),
text = str_replace_all(text, "united states", "united_states"),
text = str_replace_all(text, "social security", "social_security"))
df_parsed
#> # A tibble: 4 x 1
#> text
#> <chr>
#> 1 i live in the united_states
#> 2 united we stand, divided we fall
#> 3 information security is important!
#> 4 i work at the social_security administration
Then you can tokenize like normal, and afterward, replace the things you just made with the two-word phrases again, so "united_states" back to "united states".
df_parsed %>%
unnest_tokens(word, text) %>%
mutate(word = case_when(word == "united_states" ~ "united states",
word == "social_security" ~ "social security",
TRUE ~ word))
#> # A tibble: 21 x 1
#> word
#> <chr>
#> 1 i
#> 2 live
#> 3 in
#> 4 the
#> 5 united states
#> 6 united
#> 7 we
#> 8 stand
#> 9 divided
#> 10 we
#> # … with 11 more rows
Created on 2019-08-03 by the reprex package (v0.3.0)
If you have a long list of these, it's going to get difficult and onerous, and then it might make sense to look at ways to use bigram and unigram tokenization. You can see an example of that here.
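For a moderately long list, the protect-then-restore idea above can be driven by a vector of phrases instead of one str_replace_all() call per phrase. The following is only a minimal base R sketch: the names phrases, protect_phrases and restore_phrases are hypothetical, and strsplit() is used as a simple stand-in for unnest_tokens(), so the tokenization details will differ from tidytext.

```r
# Hypothetical phrase list; extend as needed
phrases <- c("united states", "social security")

# Lowercase the text and join each listed phrase with an underscore
protect_phrases <- function(text, phrases) {
  text <- tolower(text)
  for (p in phrases) {
    text <- gsub(p, gsub(" ", "_", p, fixed = TRUE), text, fixed = TRUE)
  }
  text
}

# Turn the underscores back into spaces after tokenization
restore_phrases <- function(tokens) gsub("_", " ", tokens, fixed = TRUE)

text <- c("I live in the United States",
          "I work at the Social Security Administration")

protected <- protect_phrases(text, phrases)

# Stand-in tokenizer (assumption): split on anything that is not a
# letter, digit, or underscore
tokens <- unlist(strsplit(protected, "[^[:alnum:]_]+"))
tokens <- restore_phrases(tokens[nzchar(tokens)])
tokens
```

One caveat of this sketch: restore_phrases() converts every underscore, so tokens that contained underscores in the raw text would also be changed.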

Related

How to remove a row in a dataframe based on certain conditions

I have some data that looks like this:
id  ethnicity
1   white
2   south asian
2   other
3   other
4   white
4   south asian
As seen above, an id can have two ethnicity values. How would I go about removing the 'other' rows if that id already has an entry such as "white" or "south asian", while keeping the "white" or "south asian" entry?
I have noticed there are also ids that have both a south asian and a white entry.
My priority for keeping rows would be South Asian > White > Other.
So the expected output would be:
id  ethnicity
1   white
2   south asian
3   other
4   south asian
If the intention is to get the prioritized 'ethnicity' per 'id', convert the column 'ethnicity' to an ordered factor with the levels specified in the order of preference, then group by 'id' and filter for the first level that is available in each group.
library(dplyr)
df2 %>%
mutate(ethnicity = ordered(ethnicity,
c( "south asian", "white", "other"))) %>%
group_by(id) %>%
filter(ethnicity %in% first(levels(droplevels(ethnicity)))) %>%
ungroup
-output
# A tibble: 4 × 2
id ethnicity
<int> <ord>
1 1 white
2 2 south asian
3 3 other
4 4 south asian
data
df2 <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L), ethnicity = c("white",
"south asian", "other", "other", "white", "south asian")),
class = "data.frame", row.names = c(NA,
-6L))
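The same priority rule can also be sketched in base R, which may help if dplyr is not available. This is only an illustration under the assumption that every ethnicity value appears in the priority vector; the names priority, rank, keep and result are hypothetical.

```r
df2 <- data.frame(
  id = c(1L, 2L, 2L, 3L, 4L, 4L),
  ethnicity = c("white", "south asian", "other", "other", "white", "south asian"),
  stringsAsFactors = FALSE
)

# Lower position in this vector means higher priority
priority <- c("south asian", "white", "other")
rank <- match(df2$ethnicity, priority)

# Keep the rows whose rank equals the best (smallest) rank within their id
keep <- rank == ave(rank, df2$id, FUN = min)
result <- df2[keep, ]
result
```

If an id has several rows with the same best ethnicity, all of them are kept, which mirrors the behaviour of the filter() call above.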

Combining data with Base R

I currently need to translate my dplyr code into base R code. My dplyr code gives me 3 columns, competitor sex, the olympic season and the number of different sports. The code looks like this:
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
My data structure looks like this.
structure(list(Name = c("A Lamusi", "Juhamatti Tapio Aaltonen",
"Andreea Aanei", "Jamale (Djamel-) Aarrass (Ahrass-)", "Nstor Abad Sanjun",
"Nstor Abad Sanjun"), Sex = c("M", "M", "F", "M", "M", "M"),
Age = c(23L, 28L, 22L, 30L, 23L, 23L), Height = c(170L, 184L,
170L, 187L, 167L, 167L), Weight = c(60, 85, 125, 76, 64,
64), Team = c("China", "Finland", "Romania", "France", "Spain",
"Spain"), NOC = c("CHN", "FIN", "ROU", "FRA", "ESP", "ESP"
), Games = c("2012 Summer", "2014 Winter", "2016 Summer",
"2012 Summer", "2016 Summer", "2016 Summer"), Year = c(2012L,
2014L, 2016L, 2012L, 2016L, 2016L), Season = c("Summer",
"Winter", "Summer", "Summer", "Summer", "Summer"), City = c("London",
"Sochi", "Rio de Janeiro", "London", "Rio de Janeiro", "Rio de Janeiro"
), Sport = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
"Gymnastics", "Gymnastics"), Event = c("Judo Men's Extra-Lightweight",
"Ice Hockey Men's Ice Hockey", "Weightlifting Women's Super-Heavyweight",
"Athletics Men's 1,500 metres", "Gymnastics Men's Individual All-Around",
"Gymnastics Men's Floor Exercise"), Medal = c(NA, "Bronze",
NA, NA, NA, NA), BMI = c(20.7612456747405, 25.1063327032136,
43.2525951557093, 21.7335354170837, 22.9481157445588, 22.9481157445588
)), .Names = c("Name", "Sex", "Age", "Height", "Weight",
"Team", "NOC", "Games", "Year", "Season", "City", "Sport", "Event",
"Medal", "BMI"), row.names = c(NA, 6L), class = "data.frame")
Does anyone know how to translate this into base R?
Since you are grouping twice in dplyr you can use double aggregate in base R
setNames(aggregate(Name~Sex + Season,
aggregate(Name~Sex + Season + Sport, olympics, length), length),
c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
This gives the same output as the dplyr option
library(dplyr)
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
A base R option would be using aggregate twice
out <- aggregate(BMI ~ Sex + Season,
aggregate(BMI ~ Sex + Season + Sport, olympics, length), length)
names(out) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
out
# Competitor_Sex Olympic_Season Num_Sports
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
It is similar to the OP's output
olympics %>%
group_by(Sex, Season, Sport) %>%
summarise(n()) %>%
group_by(Sex, Season) %>%
summarise(n()) %>%
setNames(c("Competitor_Sex", "Olympic_Season", "Num_Sports"))
# A tibble: 3 x 3
# Groups: Sex [2]
# Competitor_Sex Olympic_Season Num_Sports
# <chr> <chr> <int>
#1 F Summer 1
#2 M Summer 3
#3 M Winter 1
Or it can be done in a compact way with table from base R
table(sub(",[^,]+$", "", names(table(do.call(paste,
c(olympics[c("Sex", "Season", "Sport")], sep=","))))))
# F,Summer M,Summer M,Winter
# 1 3 1
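Since the goal is the number of different sports per sex and season, another way to read the double grouping is "deduplicate, then count". A minimal sketch of that reading, using only the three relevant columns from the sample data (the names distinct_rows and out are hypothetical):

```r
olympics <- data.frame(
  Sex    = c("M", "M", "F", "M", "M", "M"),
  Season = c("Summer", "Winter", "Summer", "Summer", "Summer", "Summer"),
  Sport  = c("Judo", "Ice Hockey", "Weightlifting", "Athletics",
             "Gymnastics", "Gymnastics"),
  stringsAsFactors = FALSE
)

# One row per distinct Sex/Season/Sport combination
distinct_rows <- unique(olympics[c("Sex", "Season", "Sport")])

# Count the remaining sports within each Sex/Season group
out <- aggregate(Sport ~ Sex + Season, distinct_rows, length)
names(out) <- c("Competitor_Sex", "Olympic_Season", "Num_Sports")
out
```

The duplicated Gymnastics row is dropped by unique(), so it is counted once for the male summer competitors.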

Create new variable in dataframe based on condition in one column, pulling from other column? (dplyr)

I have the following dataframe:
df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia",
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana",
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L,
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper",
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds",
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L,
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia",
"Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country",
"CommodRank", "topCommodInCountry", "Main_Commod"))
df
country CommodRank topCommodInCountry Main_Commod
1 Ghana 1 TRUE Gold
2 Eritrea 2 FALSE Copper
3 Ethiopia 3 FALSE Nickel
4 Ethiopia 1 TRUE Gold
5 Congo - Kinshasa 3 FALSE Gold
6 Ethiopia 1 TRUE Gold
7 Ethiopia 1 TRUE Gold
8 Ghana 1 TRUE Gold
9 Botswana 1 TRUE Diamonds
10 Nigeria 1 TRUE Iron Ore
I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.
df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))
country CommodRank topCommodInCountry Main_Commod topcommod
1 Ghana 1 TRUE Gold Gold
2 Eritrea 2 FALSE Copper unknown
3 Ethiopia 3 FALSE Nickel unknown
4 Ethiopia 1 TRUE Gold Gold
5 Congo - Kinshasa 3 FALSE Gold unknown
6 Ethiopia 1 TRUE Gold Gold
7 Ethiopia 1 TRUE Gold Gold
8 Ghana 1 TRUE Gold Gold
9 Botswana 1 TRUE Diamonds Diamonds
10 Nigeria 1 TRUE Iron Ore Iron Ore
I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.
IIUC, there are multiple ways to do this, for example:
df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)])
# A tibble: 10 x 5
# Groups: country [6]
country CommodRank topCommodInCountry Main_Commod topCom
<chr> <int> <lgl> <chr> <chr>
1 Ghana 1 TRUE Gold Gold
2 Eritrea 2 FALSE Copper unknown
3 Ethiopia 3 FALSE Nickel Gold
4 Ethiopia 1 TRUE Gold Gold
5 Congo - Kinshasa 3 FALSE Gold unknown
6 Ethiopia 1 TRUE Gold Gold
7 Ethiopia 1 TRUE Gold Gold
8 Ghana 1 TRUE Gold Gold
9 Botswana 1 TRUE Diamonds Diamonds
10 Nigeria 1 TRUE Iron Ore Iron Ore
Regarding OP's question in comment how to handle ties of multiple top Commodities, you could do the following:
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))
If there are multiple unique top commodities in a country, they will be pasted together into a single string, separated by /.
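Since the question allows non-dplyr solutions, the same lookup can be sketched in base R with match(). This is only an illustration on a shortened copy of the data; the helper name top is hypothetical, and ties are resolved by taking the first rank-1 commodity per country.

```r
df <- data.frame(
  country     = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", "Congo - Kinshasa"),
  CommodRank  = c(1L, 2L, 3L, 1L, 3L),
  Main_Commod = c("Gold", "Copper", "Nickel", "Gold", "Gold"),
  stringsAsFactors = FALSE
)

# One rank-1 commodity per country (first one wins if there are ties)
top <- df[df$CommodRank == 1, c("country", "Main_Commod")]
top <- top[!duplicated(top$country), ]

# Look up each row's country; countries without a rank-1 row become "unknown"
df$topcommod <- top$Main_Commod[match(df$country, top$country)]
df$topcommod[is.na(df$topcommod)] <- "unknown"
df
```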
Another pattern with dplyr:
df %>% arrange(CommodRank) %>%
mutate(topCommod = Main_Commod[1])
This is not an answer, but I learned a lot from @docendo discimus's answer. It took me a second to understand the "if negative" (!any(topCommodInCountry)), and I was wondering whether it's only me or whether it would take my computer a second more to do that too :)
Using the same dataset, I examined the idea of making the if/else positive. First I checked that the two solutions give identical results:
identical(
#Negative
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)]),
#Positive
df %>%
mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)]
else "unknown"))
[1] TRUE
Next, I benchmarked the two:
require(rbenchmark)
benchmark("Negative" = {
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
df %>%
mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)]
else "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
The difference is not that big but I'm assuming that with a bigger dataset it will increase.
test replications elapsed relative user.self sys.self
1 Negative 10000 12.59 1.015 12.44 0
2 Positive 10000 12.41 1.000 12.30 0

How to delimit a string field into two different numeric columns in R

I have a dataframe which has a text field that captures how long a person has stayed in a city. It is in the format of y year(s) m month(s) with y and m being numeric. If the person has lived in the city less than a year, then the value will only be in the format m months
I want to convert this column into two separate numeric columns, one of them showing the years lived and the other showing the months lived.
Here is a sample of my dataframe:
df <- structure(list(Time.in.current.role = c("1 year 1 month", "11
months",
"3 years 11 months", "1 year 1 month", "8 months"), City =
c("Philadelphia",
"Seattle", "Washington D.C.", "Ashburn", "Cork, Ireland")), .Names =
c("Time.in.current.role",
"City"), row.names = c(NA, 5L), class = "data.frame")
My desired dataframe looks like:
result <- structure(list(Year = c(1, 0, 3, 1, 0), Month = c(1, 11,
11,
1, 8), City = structure(c(3L, 4L, 5L, 1L, 2L), .Label = c("Ashburn",
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."
), class = "factor")), .Names = c("Year", "Month", "City"), row.names
= c(NA,
-5L), class = "data.frame")
I was thinking of using grep to locate which rows have the substring "year" in it and which rows have the substring "month" in it. But after that, I am having trouble trying to get the number that appropriately associates to either "year" or "month".
* EDIT *
In my original post, I forgot to account for the case that it is possible to have only y year(s). Here is the new original dataframe and desired dataframe:
df <- structure(list(Time.in.current.role = c("1 year 1 month", "11
months",
"3 years 11 months", "1 year 1 month", "8 months", "2 years"),
City = c("Philadelphia", "Seattle", "Washington D.C.", "Ashburn",
"Cork, Ireland", "Washington D.C.")), .Names =
c("Time.in.current.role",
"City"), row.names = c(1L, 2L, 3L, 4L, 5L, 18L), class =
"data.frame")
result <- structure(list(Year = c(1, 0, 3, 1, 0, 2), Month = c(1, 11,
11,
1, 8, 0), City = structure(c(3L, 4L, 5L, 1L, 2L, 5L), .Label =
c("Ashburn",
"Cork, Ireland", "Philadelphia", "Seattle", "Washington D.C."
), class = "factor")), .Names = c("Year", "Month", "City"), row.names
= c(NA,
-6L), class = "data.frame")
This defines a function extr (also see alternative definition at end) that will extract from its first argument the match to the second argument's capture group, i.e. the match to the part of the regular expression within parentheses. Then the match is converted to numeric, or if the pattern is not found 0 is returned.
It is only 3 lines of code, has a pleasing symmetry in how it handles the year and month and can handle not only year and month but also just year and just month. It allows junk before the y and m such as the \n shown in the sample data in the question.
library(gsubfn)
extr <- function(x, pat) strapply(x, pat, as.numeric, empty = 0, simplify = TRUE)
transform(df, Year = extr(Time.in.current.role, "(\\d+) +\\W*y"),
Month = extr(Time.in.current.role, "(\\d+) +\\W*m"))
giving (for the data frame defined in the question):
Time.in.current.role City Year Month
1 1 year 1 month Philadelphia 1 1
2 11 \nmonths Seattle 0 11
3 3 years 11 months Washington D.C. 3 11
4 1 year 1 month Ashburn 1 1
5 8 months Cork, Ireland 0 8
Note that strapply uses the tcl regex engine by default. If tcltk does not work on your system, use this slightly longer version of extr. Even better would be to fix your installation: tcltk is a base package, so if it does not work, your R installation is broken.
extr <- function(x, pat) {
sapply(strapply(x, pat, as.numeric), function(x) if (is.null(x)) 0 else x)
}
A quick 'n dirty solution:
Code:
ym <- gsub("[^0-9|^ ]", "", df$Time.in.current.role)
ym <- gsub("^ | $", "", ym)
df$Year <- ifelse(
grepl(" ", ym),
gsub("([0-9]+) .+", "\\1", ym),
0
)
df$Month <- gsub(".+ ([0-9]+)$", "\\1", ym)
df$Time.in.current.role <- NULL
df
City Year Month
1 Philadelphia 1 1
2 Seattle 0 11
3 Washington D.C. 3 11
4 Ashburn 1 1
5 Cork, Ireland 0 8
Words:
Start by deleting everything that is not a number or a space
Delete all spaces at the start or end of string
If the string contains two numbers then extract first as the year, otherwise year = 0.
The last number is always the month.
Drop original column from data.frame
Enjoy
You could do the following:
z = regmatches(x = df$Time.in.current.role, gregexpr("\\d+", df$Time.in.current.role))
years = sapply(z, function(x){ifelse(length(x)==1, 0, x[1])})
months = sapply(z, function(x){ifelse(length(x)==1, x[1], x[2])})
This gives:
> years
[1] "1" "0" "3" "1" "0"
> months
[1] "1" "11" "11" "1" "8"
This method works if there are one or two numbers. If there is only one, it assumes that the number corresponds to months. A case where this does not work is, for example, "5 years".
In this case you could do the following:
m = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ m", df$Time.in.current.role))
y = regmatches(x = df$Time.in.current.role, gregexpr("\\d+ y", df$Time.in.current.role))
y2 = sapply(y, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
m2 = sapply(m, function(x){ifelse(length(x)==0,0,gsub("\\D+","",x))})
Example:
> df
Time.in.current.role City
1 1 year 1 month Philadelphia
2 11 months Seattle
3 3 years 11 months Washington D.C.
4 1 year 1 month Ashburn
5 8 months Cork, Ireland
6 5 years Miami
> y2
[1] "1" "0" "3" "1" "0" "5"
> m2
[1] "1" "11" "11" "1" "8" "0"
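The two regmatches steps can also be folded into one small helper that returns numeric values directly. This is a minimal base R sketch, not part of the answer above: extract_num is a hypothetical name, and it assumes each number is followed, after optional whitespace, by a word starting with "y" or "m".

```r
# Return the number in front of a unit word, or 0 if that unit is absent
extract_num <- function(x, unit) {
  pat <- paste0("([0-9]+)[[:space:]]*", unit)
  hits <- regmatches(x, regexec(pat, x))
  vapply(hits,
         function(hit) if (length(hit) == 2) as.numeric(hit[2]) else 0,
         numeric(1))
}

x <- c("1 year 1 month", "11 \nmonths", "3 years 11 months",
       "8 months", "2 years")

Year  <- extract_num(x, "y")
Month <- extract_num(x, "m")
data.frame(Year, Month)
```

Because the pattern allows any whitespace between the number and the unit, the embedded line break in "11 \nmonths" is handled as well.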
An alternative would be to use the package splitstackshape to split the column in two. To do that, you would first set a delimiter between years and months with gsub, then remove all letters and whitespace, and then use cSplit:
# replace delimiter year with ;
df$Time.in.current.role <- gsub("year", ";", df$Time.in.current.role)
# If no year was found add 0; at the beginning of the cell
df$Time.in.current.role[!grepl(";", df$Time.in.current.role)] <- paste0("0;", df$Time.in.current.role[!grepl(";", df$Time.in.current.role)])
# remove characters and whitespace
df$Time.in.current.role <- gsub("[[:alpha:]]|\\s+", "", df$Time.in.current.role)
# Split column by ;
df <- splitstackshape::cSplit(df, "Time.in.current.role", sep = ";")
# Rename new columns
colnames(df)[2:3] <- c("Year", "Month")
df
City Year Month
1: Philadelphia 1 1
2: Seattle 0 11
3: Washington D.C. 3 11
4: Ashburn 1 1
5: Cork, Ireland 0 8

replace NA with values from another table based on groupings (not one-by-one lookup table)

My objective is to replace values in one table with values from another look-up table. There is one catch: this lookup table isn't a one-by-one lookup table as discussed in Replace na's with value from another df; instead, the lookup will be done based on multiple column groupings. As a result, if multiple entries are returned based on those groupings in the look-up table, all of them would need to be populated in the original table.
I was able to do this task but I need help with two things:
a) My code is really messy. Every time I have to do a similar thing, I end up spending an enormous amount of time trying to figure out what I have done, and then re-using it. So, I'd appreciate anything that's cleaner and simpler.
b) It's very slow. I have multiple ifelse statements. When I run this on the actual data with 36M records, it takes a lot of time.
Here's my source with dummy data:
dput(DFile)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2",
"G3", "G3", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5"),
Country_SV = c("United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "United States", "United States", "UK",
"UK"), Product_BU = c("Laptop", "Laptop", "Laptop", "Laptop",
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop",
"Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", "Laptop"
), Prob_model3 = c(0, 79647405.9878251, 282615405.328728,
NA, NA, 363419594.065383, 0, 72870592.8458704, 260045174.088548,
369512727.253779, 0, 79906001.2878251, 285128278.558728,
405490639.873629, 234, NA), DoS.FY = c(2014, 2013, 2012,
NA, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015, 2016, NA), Insured = c("Covered", "Covered", "Covered",
NA, NA, "Not Covered", "Not Covered", "Not Covered", "Not Covered",
"Not Covered", "Not Covered", "Not Covered", "Not Covered",
"Not Covered", "Covered", NA)), .Names = c("Region_SL", "Country_SV",
"Product_BU", "Prob_model3", "DoS.FY", "Insured"), row.names = c(NA,
16L), class = "data.frame")
Here's my grouped look-up table:
dput(Master_Joined)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G3",
"G4", "G5", "G5", "G5"), Country_SV = c("United States", "United States",
"United States", "United States", "United States", "United States",
"United States", "UK", "UK", "UK"), Product_BU = c("Laptop",
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord",
"Laptop", "Laptop"), DoS.FY = c(2014, 2013, 2012, 2015, 2015,
2015, 2015, 2016, 2017, 2017), Insured = c("Covered", "Covered",
"Covered", "Uncovered", "Not Covered", "Not Covered", "Not Covered",
"Covered", "Uncovered", "Covered")), .Names = c("Region_SL",
"Country_SV", "Product_BU", "DoS.FY", "Insured"), row.names = c(NA,
10L), class = "data.frame")
This is "grouped" in a sense that all entries are unique.
Finally, here's my code:
#Which fields are missing?
Missing<-DFile[is.na(DFile$Prob_model3),]
Column_name<-colnames(DFile)[4]
colnames(DFile)[4]<-"temp_prob"
#Replace Prob_model3
DFile<-DFile %>%
group_by(Region_SL, Country_SV, Product_BU) %>%
dplyr::mutate(Average_Value = mean(temp_prob,na.rm = TRUE)) %>%
rowwise() %>%
dplyr::mutate(Col_name1 = ifelse(is.na(temp_prob),Average_Value,temp_prob)) %>%
dplyr::select(Region_SL:Product_BU,DoS.FY,Insured,Col_name1)
colnames(DFile)[6]<-Column_name
Missing$DoS.FY<-NULL
Missing_FYear<-Missing %>%
inner_join(Master_Joined,by = c("Region_SL", "Country_SV", "Product_BU")) %>%
group_by(Region_SL, Country_SV, Product_BU, DoS.FY, Insured.y) %>%
dplyr::distinct() %>%
left_join(Missing)
Missing_FYear$Prob_model3<-NULL
DFile <-DFile %>%
left_join(Missing_FYear,by = c("Region_SL", "Country_SV", "Product_BU", "Insured")) %>%
dplyr::rowwise() %>%
mutate(DoS.FY=ifelse((is.na(`DoS.FY.y`)|is.na(`DoS.FY.x`)),sum(`DoS.FY.y`,`DoS.FY.x`,na.rm=TRUE),`DoS.FY.x`), Insured_Combined = ifelse(is.na(Insured),Insured.y,Insured)) %>%
dplyr::select(Region_SL:Product_BU,Prob_model3,DoS.FY, Insured_Combined)
colnames(DFile)[6]<-"Insured"
#Check again
Missing<-DFile[is.na(DFile$Prob_model3),]
if (nrow(Missing) > 1)
{ #you have NaNs, replace them with 0
DFile[is.nan(DFile$Prob_model3),"Prob_model3"] <- 0
}
Missing<-DFile[is.na(DFile$Prob_model3),]
Expected output: DFile as it is after running the code above.
I'd sincerely appreciate your help. I have been struggling with this problem for about a week now.
One idea is to find the rows of DFile that contain NA and take the Region_SL values of those rows. We then use plyr's rbind.fill to bind the matching rows of the look-up table to DFile, giving new_df. Next, we filter out any rows with NA (ignoring the last column, col 6, which is Prob_model3). Finally, within each Region_SL group, we replace the remaining NA values in Prob_model3 with the group mean.
library(dplyr)
ind <- unique(as.integer(which(is.na(DFile), arr.ind = TRUE)[,1]))
new_df <- plyr::rbind.fill(Master_Joined[Master_Joined$Region_SL %in% DFile$Region_SL[ind],], DFile)
new_df %>%
arrange(Region_SL, Prob_model3) %>%
filter(complete.cases(.[-6])) %>%
group_by(Region_SL) %>%
mutate(Prob_model3 = replace(Prob_model3, is.na(Prob_model3), mean(Prob_model3, na.rm = T))) %>%
ungroup()
# A tibble: 21 × 6
# Region_SL Country_SV Product_BU DoS.FY Insured Prob_model3
# <chr> <chr> <chr> <dbl> <chr> <dbl>
#1 G1 United States Laptop 2014 Covered 0
#2 G1 United States Laptop 2013 Covered 79647406
#3 G1 United States Laptop 2012 Covered 282615405
#4 G1 United States Laptop 2014 Covered 120754270
#5 G1 United States Laptop 2013 Covered 120754270
#6 G1 United States Laptop 2012 Covered 120754270
#7 G1 United States Laptop 2015 Uncovered 120754270
#8 G2 United States Laptop 2015 Not Covered 363419594
#9 G2 United States Laptop 2015 Not Covered 363419594
#10 G3 United States Laptop 2015 Not Covered 0
# ... with 11 more rows
Another way to think about this is to only merge the rows which have missing values in DoS.FY or Insured with the master data:
#Replace missing probabilities by grouped average
DFile_new <- DFile %>% group_by(Region_SL,Country_SV,Product_BU) %>% mutate(Prob_model3 = coalesce(Prob_model3,mean(Prob_model3, na.rm = T))) %>% ungroup()
#This leads to one NaN because for
# 16 G5 UK Laptop NA NA <NA>
#there are no other rows in the same group
DFile_new$Prob_model3[is.nan(DFile_new$Prob_model3)] <- 0
#Split dataset into two parts
#1) The part that has no NA's in DoS.FY and Insured
DFile_new1 <- filter(DFile_new,!is.na(DoS.FY) & !is.na(Insured))
#2) The part has NA's in either DoS.FY or Insured
DFile_new2 <- filter(DFile_new,is.na(DoS.FY) | is.na(Insured))
#merge DFile_new2 and Master_Joined
DFile_new2 <- merge(DFile_new2,Master_Joined,by=c("Region_SL","Country_SV","Product_BU")) %>%
mutate(DoS.FY.x = coalesce(DoS.FY.x,DoS.FY.y), Insured.x = coalesce(Insured.x,Insured.y)) %>%
select(-Insured.y,-DoS.FY.y) %>% rename(Insured=Insured.x, DoS.FY = DoS.FY.x)
#Put all rows in frame
my_out_new <- rbind(DFile_new1,DFile_new2)
This yields the same result as OP's code (albeit in a different order):
> compare <- function(df1,df2) {
+ idx1 <- c()
+ idx2 <- c()
+ for(i in 1:nrow(df1)) {
+ found <- FALSE
+ for(j in 1:nrow(df2)) {
+ if(!(j %in% idx2)) {
+ idx = as.logical(df1[i,] != df2[j,])
+ d <- suppressWarnings(abs(as.numeric(df1[i,idx])-as.numeric(df2[j,idx]))) < 1e-5
+ if(!(any(is.na(d))) & all(d)) {
+ idx1 <- c(idx1,i)
+ idx2 <- c(idx2,j)
+ break;
+ }
+ }
+ }
+ }
+ rbind(idx1,idx2)
+ }
> compare(my_out,my_out_new)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
idx1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
idx2 1 2 3 14 15 16 17 4 18 5 6 7 8 9 10 11 12
[,18] [,19] [,20]
idx1 18 19 20
idx2 13 19 20
(where my_out is the resulting DFile of the OP's code)
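Regarding the speed concern in the question, the group-mean fill step does not need rowwise() or dplyr at all. The following is only a small base R sketch on made-up numbers (the toy DFile and the names grp_mean and filled are hypothetical), showing a vectorized version of "replace NA with the group mean, and NaN with 0":

```r
# Toy data (assumption): one grouping column instead of three
DFile <- data.frame(
  Region_SL   = c("G1", "G1", "G1", "G2", "G2", "G5"),
  Prob_model3 = c(0, 10, NA, NA, 30, NA),
  stringsAsFactors = FALSE
)

# Mean of the non-missing values per group, repeated for every row
grp_mean <- ave(DFile$Prob_model3, DFile$Region_SL,
                FUN = function(v) mean(v, na.rm = TRUE))

# Fill missing values with the group mean
filled <- ifelse(is.na(DFile$Prob_model3), grp_mean, DFile$Prob_model3)

# Groups with no observed value yield NaN; set those to 0
filled[is.nan(filled)] <- 0
filled
```

With several grouping columns, ave() accepts them as additional arguments after the first grouping vector.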
