Looping over a data frame and adding a new column in R with certain logic

I have a data frame which contains information about sales branches, customers and sales.
branch <- c("Chicago","Chicago","Chicago","Chicago","Chicago","Chicago","LA","LA","LA","LA","LA","LA","LA","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa")
customer <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)
sales <- c(33816,24534,47735,1467,39389,30659,21074,20195,45165,37606,38967,41681,47465,3061,23412,22993,34738,19408,11637,36234,23809)
data <- data.frame(branch, customer, sales)
What I need to accomplish is to iterate over each branch, take each customer in the branch, and divide that customer's sales by the total for the branch. I need to do this to find out how much each customer contributes toward the total sales of the corresponding branch. E.g. for customer 1 I would like to divide 33816/177600 and store this value in a new column (177600 is the total of the Chicago branch).
I have tried to write a function to iterate over each row in a for loop but I am not sure how to do it at a branch level. Any guidance is appreciated.

Consider base R's ave for a new column of inline aggregates. Summing by customer first also handles the case where the same customer has multiple records within the same branch:
data$customer_contribution <- ave(data$sales, data$customer, FUN=sum) /
ave(data$sales, data$branch, FUN=sum)
data
# branch customer sales customer_contribution
# 1 Chicago 1 33816 0.190405405
# 2 Chicago 2 24534 0.138141892
# 3 Chicago 3 47735 0.268778153
# 4 Chicago 4 1467 0.008260135
# 5 Chicago 5 39389 0.221784910
# 6 Chicago 6 30659 0.172629505
# 7 LA 7 21074 0.083576241
# 8 LA 8 20195 0.080090263
# 9 LA 9 45165 0.179117441
# 10 LA 10 37606 0.149139610
# 11 LA 11 38967 0.154537126
# 12 LA 12 41681 0.165300433
# 13 LA 13 47465 0.188238887
# 14 Tampa 14 3061 0.017462291
# 15 Tampa 15 23412 0.133560003
# 16 Tampa 16 22993 0.131169705
# 17 Tampa 17 34738 0.198172193
# 18 Tampa 18 19408 0.110718116
# 19 Tampa 19 11637 0.066386372
# 20 Tampa 20 36234 0.206706524
# 21 Tampa 21 23809 0.135824795
Or less wordy:
data$customer_contribution <- with(data, ave(sales, customer, FUN=sum) /
ave(sales, branch, FUN=sum))
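For completeness, the explicit loop the question attempted can also work if you iterate at the branch level rather than the row level. A base R sketch (note: unlike the ave() version, this does not first aggregate repeat customers):

```r
# Iterate over branches; divide each customer's sales by the branch total
data$customer_contribution <- NA_real_
for (b in unique(data$branch)) {
  idx <- data$branch == b
  data$customer_contribution[idx] <- data$sales[idx] / sum(data$sales[idx])
}
```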

We can use dplyr::group_by and dplyr::mutate to calculate fractional sales of total by branch.
library(dplyr)
library(magrittr)
data %>%
group_by(branch) %>%
mutate(sales.norm = sales / sum(sales))
## A tibble: 21 x 4
## Groups: branch [3]
# branch customer sales sales.norm
# <fct> <dbl> <dbl> <dbl>
# 1 Chicago 1. 33816. 0.190
# 2 Chicago 2. 24534. 0.138
# 3 Chicago 3. 47735. 0.269
# 4 Chicago 4. 1467. 0.00826
# 5 Chicago 5. 39389. 0.222
# 6 Chicago 6. 30659. 0.173
# 7 LA 7. 21074. 0.0836
# 8 LA 8. 20195. 0.0801
# 9 LA 9. 45165. 0.179
#10 LA 10. 37606. 0.149
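As a quick sanity check on the grouped calculation, the per-branch shares should sum to 1:

```r
library(dplyr)

data %>%
  group_by(branch) %>%
  mutate(sales.norm = sales / sum(sales)) %>%
  summarise(total = sum(sales.norm))
# each branch should report total = 1
```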


dplyr relative frequency within group

(hopefully) simplified
I asked farmers of two farm types (organic and conventional) to report whether species A and B occur (0/1) on their land.
So, I have
df<-data.frame(id=1:10,
farmtype=c(rep("org",4), rep("conv",6)),
spA=c(0,0,0,1,1,1,1,1,1,1),
spB=c(1,1,1,0,0,0,0,0,0,0)
)
And my question is pretty simple... In what percentage of organic or conventional farms do the species occur?
solution
sp A occurs in 25% of org farms and 100% of conv farms
sp B occurs in 75% of org farms and 0% of conv farms
None of the solutions outlined below achieve that.
**additional question**
All I want is a simple ggplot with the species on the x-axis and the percentage of detection on the y-axis (once for org and once for conv).
ggplot(df.melt)+
geom_bar(aes(x=species, fill=farmtype))
### but, of course, this counts the farm types, not the species detections
janitor's tabyl is your friend. What you're calculating is "row"-percentages, but what you want is "col"-percentages. E.g.
set.seed(1234)
df <- data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
library(janitor)
df |>
tabyl(species, farmtype) |>
adorn_percentages("col")
# species conventional organic
# a 0.2553191 0.2641509
# b 0.2765957 0.2452830
# c 0.2553191 0.1886792
# d 0.2127660 0.3018868
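janitor can also format those proportions for display; adorn_pct_formatting() rounds and appends the percent sign:

```r
library(janitor)

df |>
  tabyl(species, farmtype) |>
  adorn_percentages("col") |>
  adorn_pct_formatting(digits = 1)
```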
But you could also use your own approach. Group by farmtype in the second group_by and remember to save the dataframe. This would be easier to use with ggplot2 as it is already in a long format.
df <-
df %>%
group_by(species, farmtype) %>%
dplyr::summarise(count = n()) %>%
group_by(farmtype) %>%
dplyr::mutate(prop = count/sum(count))
df
# A tibble: 8 × 4
# Groups: farmtype [2]
# species farmtype count prop
# <chr> <chr> <int> <dbl>
# a conventional 12 0.255
# a organic 14 0.264
# b conventional 13 0.277
# b organic 13 0.245
# c conventional 12 0.255
# c organic 10 0.189
# d conventional 10 0.213
# d organic 16 0.302
df %>%
ggplot(aes(x = species, y = prop, fill = farmtype)) +
geom_col()
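If you would rather see the org and conv bars side by side than stacked, dodge the bars (using the same summarized df with its prop column):

```r
library(ggplot2)

ggplot(df, aes(x = species, y = prop, fill = farmtype)) +
  geom_col(position = position_dodge())
```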
Update: a variant of the second option, also suggested by Isaac Bravo.
Here you can have another option using your approach:
df %>%
group_by(farmtype, species) %>%
summarize(n = n()) %>%
mutate(percentage = n/sum(n))
OUTPUT:
farmtype species n percentage
<chr> <chr> <int> <dbl>
1 conventional a 12 0.235
2 conventional b 12 0.235
3 conventional c 12 0.235
4 conventional d 15 0.294
5 organic a 16 0.327
6 organic b 9 0.184
7 organic c 14 0.286
8 organic d 10 0.204
If I understand the poster's first question correctly, the poster seeks the proportion of organic versus conventional farm types among farms that grew a given species. This can also be accomplished using the data.table package as follows.
First, the example data set is recreated by setting the seed.
set.seed(1234) ##setting seed for reproducible example
df<-data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
require(data.table)
df = data.table(df)
Next, the "no" answers are filtered out because we are only interested in farms that reported growing the species in the "occur" column. We then count the occurrences of the species for each farm type. The column "N" gives the count.
#Filter out "no" answers because they shouldn't affect the result sought
#and count the number of farmtypes that reported each species
ans = df[occ == "yes",.N,by = .(farmtype,species)]
ans
# farmtype species N
#1: conventional a 8
#2: conventional c 8
#3: organic a 6
#4: conventional d 11
#5: organic d 5
#6: organic c 7
#7: organic b 4
#8: conventional b 6
The total occurrences of each species for either farm type are then counted. As a check for this result, each row for a given species should give the same species total.
#Total number of farms that reported the species
ans[,species_total := sum(N), by = species] #
ans
# farmtype species N species_total
#1: conventional a 8 14
#2: conventional c 8 15
#3: organic a 6 14
#4: conventional d 11 16
#5: organic d 5 16
#6: organic c 7 15
#7: organic b 4 10
#8: conventional b 6 10
Finally, the columns are combined to calculate the proportion of organic or conventional farms for each species that was reported. As a check against the result, the proportion of organic and the proportion of conventional for each species should sum to 1 because there are only two farm types.
##Calculate the proportion of each farm type reported for each species
ans[, proportion := N/species_total]
ans
# farmtype species N species_total proportion
#1: conventional a 8 14 0.5714286
#2: conventional c 8 15 0.5333333
#3: organic a 6 14 0.4285714
#4: conventional d 11 16 0.6875000
#5: organic d 5 16 0.3125000
#6: organic c 7 15 0.4666667
#7: organic b 4 10 0.4000000
#8: conventional b 6 10 0.6000000
##Gives the proportion of organic farms specifically
ans[farmtype == "organic"]
# farmtype species N species_total proportion
#1: organic a 6 14 0.4285714
#2: organic d 5 16 0.3125000
#3: organic c 7 15 0.4666667
#4: organic b 4 10 0.4000000
If, on the other hand, one wanted to calculate the fraction of each species to all species occurrences reported for organic or conventional farms, you could use this code:
ans = df[,.N, by = .(species, farmtype,occ)] ##count by species,farmtype, and occurrence
ans[, spf := sum(N), by = .(occ,farmtype)] ##spf is the total number of times an occurrence was reported for each type
ans[, prop := N/spf]
ans = ans[occ == "yes"] ##proportion of the given species to all species occurrences reported for each farm type
ans
# species farmtype occ N spf prop
#1: a conventional yes 8 33 0.2424242
#2: c conventional yes 8 33 0.2424242
#3: a organic yes 6 22 0.2727273
#4: d conventional yes 11 33 0.3333333
#5: d organic yes 5 22 0.2272727
#6: c organic yes 7 22 0.3181818
#7: b organic yes 4 22 0.1818182
#8: b conventional yes 6 33 0.1818182
This result means that, for example, conventional farmers reported species "a" about 24.2% of the times that they reported any species. The result can be verified by selecting a species and farmtype and calculating manually as a spot check.
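That spot check can be done directly on the same ans table; e.g. for species "a" on conventional farms:

```r
# species "a" accounts for 8 of the 33 "yes" reports from conventional farms
ans[species == "a" & farmtype == "conventional", .(N, spf, prop)]
8 / 33  # ~0.242, matching prop
```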

Comparing Dates Across Multiple Variables

I'm attempting to figure out the amount of days in between games and if that has an impact on wins/losses, this is the information I'm starting with:
schedule:
Home Away Home_Final Away_Final Date
DAL  OAK  30         35         9/1/2015
KC   PHI  21         28         9/2/2015
This is the result I'd like to get:
Home Away Home_Final Away_Final Date      Home_Rest Away_Rest Adv  Adv_Days Adv_Won
DAL  OAK  30         35         9/1/2015  null      null      null null     null
KC   PHI  21         28         9/2/2015  null      null      null null     null
DAL  PHI  28         7          9/9/2015  8         7         1    1        1
OAK  KC   14         21         9/9/2015  8         7         1    1        0
'Home_Rest' = the home team's number of days since its previous game
'Away_Rest' = the away team's number of days since its previous game
'Adv' = true/false flag that one side had a rest advantage
'Adv_Days' = the size of that advantage in days
'Adv_Won' = whether the side with the advantage won
Here is what I've tried. I was able to get it to count how many days were between games for one team, but when I bring all the other teams in, I can't wrap my head around how to do it.
library(tidyverse)
library(lubridate)
team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif = interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))
You can extend this to all teams using a grouped mutate; see the documentation for group_by().
Something like
schedule %>%
group_by(vars_to_group_by) %>%
mutate(new_var = expr_to_calculate_new_var)
In the future, it would be helpful if you included code to recreate a minimal dataset for your example.
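Concretely, and assuming the schedule were first reshaped to one row per team per game (the column names team and date below are hypothetical, not taken from the original data), the grouped version of the question's own lubridate code might look like:

```r
library(dplyr)
library(lubridate)

# hypothetical long format: one row per team per game
schedule_long <- tribble(
  ~team,  ~date,
  "DAL",  "9/1/2015",
  "PHI",  "9/2/2015",
  "DAL",  "9/9/2015",
  "PHI",  "9/9/2015"
)

schedule_long %>%
  group_by(team) %>%
  arrange(mdy(date), .by_group = TRUE) %>%
  mutate(days_off = as.numeric(mdy(date) - lag(mdy(date))))
```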
The problem is that before you can calculate differences between dates, you must put your dataframe in a friendlier format. The Date applies to both teams, i.e. one value applies to two columns of the dataframe, which makes a uniform treatment difficult.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tibble::tribble(
~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
"DAL", "OAK", 30, 35, "9/1/2015",
"KC", "PHI", 21, 28, "9/2/2015",
"DAL", "PHI", 28, 7, "9/9/2015",
"OAK", "KC", 14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
rbind(
schedule %>%
transmute(
id,
Team = Home,
Role = "Home",
Final = Home_Final,
Date
),
schedule %>%
transmute(
id,
Team = Away,
Role = "Away",
Final = Away_Final,
Date
)
)
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
schedule_relational %>%
group_by(Team) %>%
arrange(Date) %>%
mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function to convert dates in character format is mdy(), because your dates are in month/day/year format.
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
schedule_relational %>%
pivot_wider(
names_from = Role,
values_from = c(Team, Final, Rest),
names_glue = "{Role}_{.value}"
)
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
result %>%
transmute(
Home = Home_Team,
Away = Away_Team,
Home_Final,
Away_Final,
Date,
Home_Rest,
Away_Rest,
Adv = as.integer(Home_Rest != Away_Rest),
Adv_Days = abs(as.integer(Home_Rest - Away_Rest)),
Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final | Away_Rest > Home_Rest & Away_Final > Home_Final)
)
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if, instead of reducing Adv and Adv_Won to yes/no (discrete) values, you tracked the number of days of rest and the difference in score. That way you could also correlate the results in terms of magnitude.
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be more convoluted solutions, but this is very clear to read and understand.
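Since the steps above can be coalesced, here is one possible single-pipeline form of the same logic (a sketch reusing the schedule tibble defined above; same operations, just chained):

```r
library(dplyr)
library(tidyr)
library(lubridate)

result_final <-
  bind_rows(
    schedule %>% transmute(id = row_number(), Team = Home, Role = "Home",
                           Final = Home_Final, Date),
    schedule %>% transmute(id = row_number(), Team = Away, Role = "Away",
                           Final = Away_Final, Date)
  ) %>%
  group_by(Team) %>%
  arrange(mdy(Date)) %>%
  mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
  ungroup() %>%
  pivot_wider(names_from = Role, values_from = c(Team, Final, Rest),
              names_glue = "{Role}_{.value}") %>%
  transmute(
    Home = Home_Team, Away = Away_Team, Home_Final, Away_Final, Date,
    Home_Rest, Away_Rest,
    Adv = as.integer(Home_Rest != Away_Rest),
    Adv_Days = abs(as.integer(Home_Rest - Away_Rest)),
    Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final |
                           Away_Rest > Home_Rest & Away_Final > Home_Final)
  )
```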

Is there a R function to convert numeric values from a vector into observations(rows) in a dataframe? [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 1 year ago.
First time asking a question here, sorry if I'm not clear enough.
Here's my data:
df <- data.frame(Year=c("2018","2018","2019","2019","2018","2018","2019","2019"),Area=c("CF","CF","CF","CF","NY","NY","NY","NY"), Birth=c(1000,1100,1100,1000,2000,2100,2100,2000),Gender= c("F","M","F","M","F","M","F","M"))
df
# Year Area Birth Gender
# 1 2018 CF 1000 F
# 2 2018 CF 1100 M
# 3 2019 CF 1100 F
# 4 2019 CF 1000 M
# 5 2018 NY 2000 F
# 6 2018 NY 2100 M
# 7 2019 NY 2100 F
# 8 2019 NY 2000 M
where Birth is the number of new babies born.
What I want to do is create a classification model that predicts how likely a newborn baby is to be male/female, with area/year as predictors.
Yes, I know it could be a linear regression with Birth as Y and the others as X; however, I somehow ended up in this situation.
With the given data, the result is trivially 50% of observations being male and 50% female. What I want to know is the probability of a baby being male/female, not which observation (row) is male/female, which I already know.
Is there a way to make each birth an observation, i.e. 1000+1100+1100+1000+2000+2100+2100+2000 = 12400 rows of data? Something like: the 1st observation is a female baby born in 2018 in CF, the 2nd observation is another female baby born in 2018 in CF, and so on, for all 12400.
Or any suggestion to deal with this?
We may use uncount
library(dplyr)
library(tidyr)
df %>%
uncount(Birth) %>%
as_tibble
Output:
# A tibble: 12,400 x 3
Year Area Gender
<chr> <chr> <chr>
1 2018 CF F
2 2018 CF F
3 2018 CF F
4 2018 CF F
5 2018 CF F
6 2018 CF F
7 2018 CF F
8 2018 CF F
9 2018 CF F
10 2018 CF F
# … with 12,390 more rows
Or using base R
transform(df[rep(seq_len(nrow(df)), df$Birth),], Birth = sequence(df$Birth))
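The base R one-liner works because rep(seq_len(nrow(df)), df$Birth) repeats each row index Birth times, while sequence(df$Birth) rebuilds a within-row counter. On a toy example:

```r
rep(seq_len(3), c(2, 1, 3))
#> [1] 1 1 2 3 3 3
sequence(c(2, 1, 3))
#> [1] 1 2 1 1 2 3
```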
You could use dplyr and summarize:
library(tidyverse)
df_expanded <- df %>%
group_by(Year, Area, Gender) %>%
summarize(expanded = 1:Birth)
# A tibble: 12,400 x 4
# Groups: Year, Area, Gender [8]
Year Area Gender expanded
<chr> <chr> <chr> <int>
1 2018 CF F 1
2 2018 CF F 2
3 2018 CF F 3
4 2018 CF F 4
5 2018 CF F 5
6 2018 CF F 6
7 2018 CF F 7
8 2018 CF F 8
9 2018 CF F 9
10 2018 CF F 10
# … with 12,390 more rows
Uncount is without a doubt the best solution for this problem. One alternative to the solutions shown could be
library(dplyr)
library(tidyr)
df %>%
mutate(Birth = lapply(Birth, function(n) 1:n)) %>%
unnest(Birth)
This returns
# A tibble: 12,400 x 4
Year Area Birth Gender
<chr> <chr> <int> <chr>
1 2018 CF 1 F
2 2018 CF 2 F
3 2018 CF 3 F
4 2018 CF 4 F
5 2018 CF 5 F
6 2018 CF 6 F
7 2018 CF 7 F
8 2018 CF 8 F
9 2018 CF 9 F
10 2018 CF 10 F
# ... with 12,390 more rows
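Whichever expansion you use, a quick check that the counts round-trip (count() is dplyr's shorthand for group_by + tally):

```r
library(dplyr)
library(tidyr)

df %>%
  uncount(Birth) %>%
  count(Year, Area, Gender)
# n should reproduce the original Birth column (1000, 1100, ...)
```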

How to calculate a mean based on conditions in a for loop in R

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated the incorrect numbers). The correct numbers are also out of order compared to how they're listed in the dataframe. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to rows of the original data.frame, but you apply [-i] to the already-subset vector, so the positions no longer match.
Try something like this: first find the minimum age, then exclude the current row and calculate the average activity of the cases with age >= that pre-calculated minimum.
for (i in 1:10){
test.avg[i] <- {amin=age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
You can use map_df:
library(tidyverse)
test.df %>%
mutate(map_df(1:nrow(test.df), ~
test.df %>%
filter(age >= test.df$age[.x]) %>%
summarise(av_acti= mean(activity))))

How to remove rows that contain duplicate characters in R

I want to remove the entire row if the values in two columns are duplicated (i.e. p1 equals p2). Any quick help doing this in R (for a very large dataset) would be highly appreciated. For example:
mydf <- data.frame(p1=c('a','c','a','b','d','b','c','c','e'),
p2=c('b','c','d','c','d','b','d','e','e'),
value=c(10,20,10,11,12,13,14,15,16))
This gives:
mydf
p1 p2 value
1 a b 10
2 c c 20
3 a d 10
4 b c 11
5 d d 12
6 b b 13
7 c d 14
8 c e 15
9 e e 16
I want to get:
p1 p2 value
1 a b 10
2 a d 10
3 b c 11
4 c d 14
5 c e 15
Your note in the comments suggests your actual problem is more complex. There's some preprocessing you can do to your strings before comparing p1 to p2. You will have the domain expertise to know which steps are appropriate, but here's a first pass: I remove all spaces and punctuation from p1 and p2, then convert both to uppercase before testing for equality. You can modify the clean_str function to include more / different cleaning operations.
Additionally, you may consider approximate matching to address typos / colloquial naming conventions. Package stringdist is a good place to start.
mydf <- data.frame(p1=c('New York','New York','New York','TokYo','LosAngeles','MEMPHIS','memphis','ChIcAGo','Cleveland'),
p2=c('new York','New.York','MEMPHIS','Chicago','knoxville','tokyo','LosAngeles','Chicago','CLEVELAND'),
value=c(10,20,10,11,12,13,14,15,16),
stringsAsFactors = FALSE)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 1 New York new York 10
#> 2 New York New.York 20
#> 3 New York MEMPHIS 10
#> 4 TokYo Chicago 11
#> 5 LosAngeles knoxville 12
#> 6 MEMPHIS tokyo 13
#> 7 memphis LosAngeles 14
#> 8 ChIcAGo Chicago 15
#> 9 Cleveland CLEVELAND 16
clean_str <- function(col){
#removes all punctuation
d <- gsub("[[:punct:][:blank:]]+", "", col)
d <- toupper(d)
return(d)
}
mydf$p1 <- clean_str(mydf$p1)
mydf$p2 <- clean_str(mydf$p2)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 3 NEWYORK MEMPHIS 10
#> 4 TOKYO CHICAGO 11
#> 5 LOSANGELES KNOXVILLE 12
#> 6 MEMPHIS TOKYO 13
#> 7 MEMPHIS LOSANGELES 14
Created on 2020-05-03 by the reprex package (v0.3.0)
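Following the pointer to approximate matching above, a minimal stringdist sketch (assuming the package is installed; "osa" is its default edit-distance method):

```r
library(stringdist)

# small distances flag likely typos that exact comparison would miss
stringdist("NEWYORK", "NEWY0RK", method = "osa")
#> [1] 1
stringdist("MEMPHIS", "MEMPHIS", method = "osa")
#> [1] 0
```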
Several ways to do that. Among them:
Base R
mydf[mydf$p1 != mydf$p2, ]
dplyr
library(dplyr)
mydf %>% filter(p1 != p2)
data.table
library(data.table)
setDT(mydf)
mydf[p1 != p2]
Here's a two-step solution based on #Chase's data:
First step (as suggested by #Chase) - preprocess your data in p1 and p2 to make them comparable:
# set to lower-case:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], tolower)
# remove anything that's not alphanumeric between words:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], function(x) gsub("(\\w+)\\W(\\w+)", "\\1\\2", x))
Second step - (i) using apply, paste the rows together, (ii) use grepl and backreference \\1 to look out for immediately adjacent duplicates in these rows, and (iii) remove (-) those rows which contain these duplicates:
mydf[-which(grepl("\\b(\\w+)\\s+\\1\\b", apply(mydf, 1, paste0, collapse = " "))),]
p1 p2 value
3 newyork memphis 10
4 tokyo chicago 11
5 losangeles knoxville 12
6 memphis tokyo 13
7 memphis losangeles 14
