Keep specific rows of a data frame based on word sequence in R

I have a dataframe (df) like this. What I want to do is to go through the values for each ID and if there are two strings starting with the same word, I want to compare them to keep distinct values.
df <- data.frame(id = c(1,1,2,3,3,4,4,4,4,5),
value = c('australia', 'australia sydney', 'brazil',
'australia', 'usa', 'australia sydney', 'australia sydney randwick', 'australia', 'australia sydney circular quay', 'australia sydney'))
I want to compare the first words: if they differ, keep both rows; if they are the same, move on to the second words and compare those, and so on.
For example, for ID 1 I want to keep only the row with the value 'australia sydney', and for ID 4 I want to keep both 'australia sydney circular quay' and 'australia sydney randwick'.
For this example I need to get rows 2:5, 7, 9 and 10.

Based on your edit, you can check within groups if any entry matches the start of any other entry and remove entries that do:
library(tidyverse)
df %>%
group_by(id) %>%
filter(!map_lgl(seq_along(value), ~ any(if (length(value) == 1) FALSE else str_detect(value[-.x], paste0("^", value[.x])))))
# A tibble: 7 x 2
# Groups: id [5]
id value
<dbl> <chr>
1 1 australia sydney
2 2 brazil
3 3 australia
4 3 usa
5 4 australia sydney randwick
6 4 australia sydney circular quay
7 5 australia sydney
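The same idea can be sketched in base R without regular expressions, which avoids surprises if a value contains characters such as "." or "(". This is only a sketch under two assumptions that are not in the original answer: values are compared as whole words (a trailing space is appended so that "australia" does not count as a prefix of a hypothetical "australian"), and exact duplicates within an id do not occur (they would remove each other). The helper name is_prefix_of_other is made up for illustration.

```r
df <- data.frame(
  id = c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5),
  value = c("australia", "australia sydney", "brazil", "australia", "usa",
            "australia sydney", "australia sydney randwick", "australia",
            "australia sydney circular quay", "australia sydney"),
  stringsAsFactors = FALSE
)

# TRUE for each element that is a whole-word prefix of another element.
# Appending a space makes the comparison respect word boundaries.
is_prefix_of_other <- function(v) {
  padded <- paste0(v, " ")
  vapply(seq_along(padded),
         function(i) any(startsWith(padded[-i], padded[i])),
         logical(1))
}

# Apply the check within each id and put the flags back in row order
drop <- unsplit(lapply(split(df$value, df$id), is_prefix_of_other), df$id)
result <- df[!drop, ]
result
```

The row names of result are the original row numbers, so they can be checked directly against the rows requested in the question.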

Related

Insert a value to a column by condition

I am attempting to fill in a new column in my dataset. I have a dataset containing information on football matches. There is a column called "Stadium", which has various stadium names. I wish to add a new column which contains the country in which each stadium is located. My set looks something like this:
Match ID Stadium
1 Anfield
2 Camp Nou
3 Stadio Olimpico
4 Anfield
5 Emirates
I am attempting to create a new column looking like this:
Match ID Stadium Country
1 Anfield England
2 Camp Nou Spain
3 Stadio Olimpico Italy
4 Anfield England
5 Emirates England
There is only a handful of stadiums but many rows, meaning I am trying to find a way to avoid inserting the values manually. Any tips?
You want to get the unique stadium names from your data, manually create a vector with the country for each of those stadiums, then join them using Stadium as a key.
library(dplyr)
# Example data
df <- data.frame(`Match ID` = 1:12,
Stadium = rep(c("Stadio Olympico", "Anfield",
"Emirates"), 4))
# Get the unique stadium names in a vector
unique_stadiums <- df %>% pull(Stadium) %>% unique()
unique_stadiums
#> [1] "Stadio Olympico" "Anfield" "Emirates"
# Manually create a vector of country names corresponding to each element of
# the unique stadium name vector. Ordering matters here!
countries <- c("Italy", "England", "England")
# Place them both into a data.frame
lookup <- data.frame(Stadium = unique_stadiums, Country = countries)
# Join the country names to the original data on the stadium key
left_join(x = df, y = lookup, by = "Stadium")
#> Match.ID Stadium Country
#> 1 1 Stadio Olympico Italy
#> 2 2 Anfield England
#> 3 3 Emirates England
#> 4 4 Stadio Olympico Italy
#> 5 5 Anfield England
#> 6 6 Emirates England
#> 7 7 Stadio Olympico Italy
#> 8 8 Anfield England
#> 9 9 Emirates England
#> 10 10 Stadio Olympico Italy
#> 11 11 Anfield England
#> 12 12 Emirates England
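If relying on the order of two parallel vectors feels fragile, a named character vector can serve as the lookup instead. The following is a minimal base R sketch, not part of the original answer; the name country_of is made up, and the data are the five rows from the question.

```r
df <- data.frame(
  Match.ID = 1:5,
  Stadium = c("Anfield", "Camp Nou", "Stadio Olimpico", "Anfield", "Emirates"),
  stringsAsFactors = FALSE
)

# Each stadium is paired with its country by name, so ordering no longer matters
country_of <- c(
  "Anfield" = "England",
  "Camp Nou" = "Spain",
  "Stadio Olimpico" = "Italy",
  "Emirates" = "England"
)

# Indexing by name returns NA for any stadium missing from the lookup
df$Country <- unname(country_of[df$Stadium])

# Stadiums that still need an entry in the lookup (empty here)
missing_stadiums <- unique(df$Stadium[is.na(df$Country)])
df
```

Checking missing_stadiums after the lookup is a cheap way to catch misspelled stadium names before they turn into silent NA values.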

How to iterate one dataframe based on a mapping file in R?

Serial No.  Company 1  Company 2  Company 3
01          NA         2          NA
02          2          NA         5
03          NA         NA         4
04          1          NA         NA
05          NA         4          NA
I have a data structure like this where the column headings represent some companies and the row headings represents consumers who buy the products. 'NA' representing no purchase for that company's products by the consumer.
I have a second mapping file where the companies are represented as row headings as follows -
Company    Country  Category
Company 1  UK       FMCG
Company 2  UK       FMCG
Company 3  India    FMCG
Company 4  US       Nicotine
The data set is for over 10000 consumers and 1000 companies. I'm getting the market share for different countries and categories using the aggregate function and mapping file.
I want to make a loop that iterates over the values in the first data frame to change the share for different countries and categories. The idea is that I can choose which country's (or category's) share needs to be changed, along with the new share, and then use the mapping file to adjust the values for companies in that country (or category). The values need to be changed only for those consumers who buy products from companies belonging to that country (or category).
Can someone suggest how this can be done in R (preferably) or Python?
Edit:
Before iteration I will use the aggregate function in R to get the shares for a country (or category) like this -
Country  Share
UK       0.33
US       0.02
IN       0.41
IR       0.11
PK       0.13
In the loop I want to be able to specify the share for some country (say UK) to whatever is required (say 0.5). The mapping file will be used to iterate values to the first data structure where people have bought products from companies in UK.
The final output will be something like this.
Country  Share
UK       0.50
US       0.00
IN       0.38
IR       0.11
PK       0.01
Here's a guess: ultimately, this is a combination of reshape from wide to long, then merge/join, and finally aggregation/summarizing by group. If you need more information for either operation, using those key-words (on SO) will provide very useful information.
base R (and reshape2)
## reshape
dat1melted <- reshape2::melt(dat1, "Serial No.", variable.name = "Company")
dat1melted$Company <- as.character(dat1melted$Company)
dat1melted <- dat1melted[!is.na(dat1melted$value),]
dat1melted
# Serial No. Company value
# 2 02 Company 1 2
# 4 04 Company 1 1
# 6 01 Company 2 2
# 10 05 Company 2 4
# 12 02 Company 3 5
# 13 03 Company 3 4
## merge
dat1merged <- merge(dat1melted, dat2, by = "Company", all.x = TRUE)
dat1merged
# Company Serial No. value Country Category
# 1 Company 1 02 2 UK FMCG
# 2 Company 1 04 1 UK FMCG
# 3 Company 2 01 2 UK FMCG
# 4 Company 2 05 4 UK FMCG
# 5 Company 3 02 5 India FMCG
# 6 Company 3 03 4 India FMCG
## aggregate by group
aggregate(value ~ Country, data = dat1merged, FUN = sum)
# Country value
# 1 India 9
# 2 UK 9
dplyr
library(dplyr)
# library(tidyr) # pivot_longer
dat1 %>%
## reshape
tidyr::pivot_longer(-`Serial No.`, names_to = "Company") %>%
filter(!is.na(value)) %>%
## merge
left_join(., dat2, by = "Company") %>%
## aggregate by group
group_by(Country) %>%
summarize(value = sum(value))
# # A tibble: 2 x 2
# Country value
# <chr> <int>
# 1 India 9
# 2 UK 9
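Once the data are in long form with a Country column, one possible way to push a country's share to a chosen target is to multiply that country's values by a single factor. This is only a guess at what the question is after, not something the answer above states: it assumes shares are sums of value divided by the grand total, that all other countries' values stay untouched, and that the target is strictly between 0 and 1. If S_in is the country's current total and S_out the total of everything else, the factor k must satisfy k * S_in / (k * S_in + S_out) = target, which gives k = target * S_out / ((1 - target) * S_in). The function names shares and rescale_country are made up for this sketch.

```r
# Long-form data as produced by the merge step above
merged <- data.frame(
  Company = c("Company 1", "Company 1", "Company 2", "Company 2",
              "Company 3", "Company 3"),
  value = c(2, 1, 2, 4, 5, 4),
  Country = c("UK", "UK", "UK", "UK", "India", "India"),
  stringsAsFactors = FALSE
)

# Share of each country in the grand total
shares <- function(d) {
  totals <- tapply(d$value, d$Country, sum)
  totals / sum(totals)
}

# Scale one country's values so that its share equals `target` (0 < target < 1)
rescale_country <- function(d, country, target) {
  in_country <- d$Country == country
  s_in <- sum(d$value[in_country])
  s_out <- sum(d$value[!in_country])
  k <- target * s_out / ((1 - target) * s_in)
  d$value[in_country] <- d$value[in_country] * k
  d
}

adjusted <- rescale_country(merged, "UK", 0.75)
new_shares <- shares(adjusted)
new_shares
```

Only rows belonging to the chosen country change, which matches the requirement that values are altered only for consumers who bought from companies in that country. The same function works for Category if the column name is swapped.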

dplyr, filter if both values are above a number [duplicate]

I have a data set like such.
df = data.frame(Business = c('HR','HR','Finance','Finance','Legal','Legal','Research'), Country = c('Iceland','Iceland','Norway','Norway','US','US','France'), Gender=c('Female','Male','Female','Male','Female','Male','Male'), Value =c(10,5,20,40,10,20,50))
I need to keep only the rows of groups where both the male value and the female value are >= 10. For example, Iceland HR should be removed (its male value is 5), as well as Research France (it has no female row).
I've tried df %>% group_by(Business,Country) %>% filter((Value>=10)) but this only filters out the individual values less than 10. Any ideas?
Maybe this can help:
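library(dplyr)
library(reshape2)
# Reshape to wide format: one row per Business/Country, one Value column per Gender
df2 <- reshape(df, idvar = c('Business','Country'), timevar = 'Gender', direction = 'wide')
# Flag groups where both values are at least 10 (rows with an NA flag are dropped by filter)
df2 %>% mutate(Index = ifelse(Value.Female >= 10 & Value.Male >= 10, 1, 0)) %>%
filter(Index == 1) -> df3
# Drop the Index column and melt back to long format
df4 <- reshape2::melt(df3[,-5], id.vars = c('Business','Country'))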
Business Country variable value
1 Finance Norway Value.Female 20
2 Legal US Value.Female 10
3 Finance Norway Value.Male 40
4 Legal US Value.Male 20
You could just use two ave steps, one with length, one with min.
df <- df[with(df, ave(Value, Country, FUN=length)) == 2, ]
df[with(df, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
# 5 Legal US Female 10
# 6 Legal US Male 20
Notice that this also works if we shuffle the rows of the data frame.
set.seed(42)
df2 <- df[sample(1:nrow(df)), ]
df2 <- df2[with(df2, ave(Value, Country, FUN=length)) == 2, ]
df2[with(df2, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 5 Legal US Female 10
# 6 Legal US Male 20
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
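For completeness, the grouped filter attempted in the question can be made to work by testing a condition on the whole group rather than on each row. The following dplyr sketch is not from either answer above; it assumes that a group must contain both a "Female" and a "Male" row to be kept.

```r
library(dplyr)

df <- data.frame(
  Business = c("HR", "HR", "Finance", "Finance", "Legal", "Legal", "Research"),
  Country = c("Iceland", "Iceland", "Norway", "Norway", "US", "US", "France"),
  Gender = c("Female", "Male", "Female", "Male", "Female", "Male", "Male"),
  Value = c(10, 5, 20, 40, 10, 20, 50),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(Business, Country) %>%
  # Keep a group only if both genders are present and every value is >= 10
  filter(all(c("Female", "Male") %in% Gender), all(Value >= 10)) %>%
  ungroup()
result
```

Because all() returns a single TRUE or FALSE per group, filter() keeps or drops each Business/Country group as a whole.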

Parsing data from an Excel cell that has more than one data point in it in R

I have an Excel sheet of patient information. The heading for one of the columns is "Discharge diagnosis" The problem is that some patients were discharged with more than one diagnosis and so more than one diagnosis is in some of the cells, separated by a "/".
I am using R to analyze the data. I am trying to find the frequency of any given discharge diagnosis.
How can I get R to look for a diagnosis no matter how it is presented in a cell?
For example, I want to know the frequency of the discharge diagnosis "flu". Some patients have a diagnosis of "flu" while others have a diagnosis of "flu/pneumonia". How can I get R to recognize both of these as containing "flu"?
You didn't provide a sample dataset, so I've made one up. I assume you're OK with getting the data from Excel since you didn't specifically ask about that.
library(tidyverse)
library(stringr)
pts <- tribble(~Pt, ~Diag,
"Bob", "Flu/Pneumonia",
"Cathy", "Flu/Explosive Diarrhea",
"Carol", "Pneumonia/Syphilis")
What I can do next is split the Diag column by the / character, and then use unnest to make a data frame in which each patient gets a record for each diagnosis.
pts <- pts %>%
mutate(Diags = str_split(Diag, "/")) %>%
unnest(Diags)
# A tibble: 6 x 3
Pt Diag Diags
<chr> <chr> <chr>
1 Bob Flu/Pneumonia Flu
2 Bob Flu/Pneumonia Pneumonia
3 Cathy Flu/Explosive Diarrhea Flu
4 Cathy Flu/Explosive Diarrhea Explosive Diarrhea
5 Carol Pneumonia/Syphilis Pneumonia
6 Carol Pneumonia/Syphilis Syphilis
Here is a frequency table of diagnoses:
pts %>% count(Diags)
# A tibble: 4 x 2
Diags n
<chr> <int>
1 Explosive Diarrhea 1
2 Flu 2
3 Pneumonia 2
4 Syphilis 1
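If the tidyverse is not available, roughly the same counts can be obtained in base R with strsplit and table. This is a sketch on the made-up data above, with two extra assumptions: stray spaces around "/" should be ignored, and the match for a single diagnosis should be case-insensitive.

```r
pts <- data.frame(
  Pt = c("Bob", "Cathy", "Carol"),
  Diag = c("Flu/Pneumonia", "Flu/Explosive Diarrhea", "Pneumonia/Syphilis"),
  stringsAsFactors = FALSE
)

# One character vector of diagnoses per patient
diag_list <- strsplit(pts$Diag, "/", fixed = TRUE)

# Frequency of every diagnosis across all patients
diag_counts <- table(trimws(unlist(diag_list)))

# Number of patients whose cell contains "flu", however it is combined
has_flu <- vapply(diag_list,
                  function(d) "flu" %in% tolower(trimws(d)),
                  logical(1))
n_flu <- sum(has_flu)

diag_counts
n_flu
```

Matching on the split pieces rather than searching the raw cell with a pattern avoids counting a diagnosis whose name merely contains the letters "flu".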

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 world cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score-football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
You could save some typing this way. First compute the score differences and the winners. When result is w, the home team is the winner (and when it is l, the away team is), so you do not have to look at the scores at all. Once you have added the score difference and the winner, keep only the rows whose score difference equals max().
mydf <- read.csv(file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE, strip.white = TRUE)
mydf <- head(mydf,n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
winner = ifelse(result == "w", as.character(home),
ifelse(result == "l", as.character(away), "draw"))) %>%
filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option without using ifelse for creating the "winner" column. This is based on row/column indexes. The numeric column index is created by matching the result column with its unique elements (match(football$result,..), and the row index is just 1:nrow(football). Subset the "football" dataset with columns 'home', 'away' and cbind it with an additional column 'draw' with NAs so that the 'd' elements in "result" change to NA.
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- cbind(football[c('home', 'away')],draw=NA)[
cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
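The row/column indexing trick is easier to see on a tiny example. In the sketch below the first three games are taken from the output shown earlier, while the fourth row ("Team A" against "Team B") is a made-up draw added purely to show how NA arises; the scores are left out because only the result column is needed.

```r
games <- data.frame(
  home = c("Cameroon", "Spain", "Germany", "Team A"),
  away = c("Croatia", "Netherlands", "Portugal", "Team B"),
  result = c("l", "l", "w", "d"),
  stringsAsFactors = FALSE
)

# Candidate winners per game: column 1 = home, column 2 = away, column 3 = NA
candidates <- as.matrix(cbind(games[c("home", "away")], draw = NA))

# Map each result to the column that holds the winner: w -> 1, l -> 2, d -> 3
col_idx <- match(games$result, c("w", "l", "d"))

# A two-column index matrix picks exactly one (row, column) cell per game
winner <- candidates[cbind(seq_len(nrow(games)), col_idx)]
winner
```

Indexing a matrix with a two-column matrix selects one cell per row of the index, which is what replaces the nested ifelse calls.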
