Showing multiple columns in aggregate function including strings/characters in R - r

R noob question here.
Let's say I have this data frame:
City State Pop
Fresno CA 494
San Francisco CA 805
San Jose CA 945
San Diego CA 1307
Los Angeles CA 3792
Reno NV 225
Henderson NV 257
Las Vegas NV 583
Gresham OR 105
Salem OR 154
Eugene OR 156
Portland OR 583
Fort Worth TX 741
Austin TX 790
Dallas TX 1197
San Antonio TX 1327
Houston TX 2100
I want to get, say, the 3rd lowest population per State, which would give:
City State Pop
San Jose CA 945
Las Vegas NV 583
Eugene OR 156
Dallas TX 1197
I tried this one:
ord_pop_state <- aggregate(Pop ~ State , data = ord_pop, function(x) { x[3] } )
And I get this one:
State Pop
CA 945
NV 583
OR 156
TX 1197
What am I missing here, in order to get the desired output that includes the City?

I would suggest trying the data.table package for such a task, as the syntax is easier and the code is more efficient. I would also suggest adding the order function in order to make sure that the data is sorted:
library(data.table)
setDT(ord_pop)[order(Pop), .SD[3L], keyby = State]
# State City Pop
# 1: CA San Jose 945
# 2: NV Las Vegas 583
# 3: OR Eugene 156
# 4: TX Dallas 1197
So basically, the data was first ordered by Pop, and then we took the third row of .SD (.SD is the notation for the Subset of Data within each group), keyed by State.
Though this is easily solvable with base R too (we will assume that the data is sorted here): we can just create an index per group and then do a simple subset by that index.
ord_pop$indx <- with(ord_pop, ave(Pop, State, FUN = seq_along)) # seq_along is safer than seq for single-row groups
ord_pop[ord_pop$indx == 3L, ]
# City State Pop indx
# 3 San Jose CA 945 3
# 8 Las Vegas NV 583 3
# 11 Eugene OR 156 3
# 15 Dallas TX 1197 3
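If the data were not already sorted, ordering it first keeps the per-group index valid. A minimal guard, not part of the original answer (the State-then-Pop ordering is an assumption about the desired order):
ord_pop <- ord_pop[order(ord_pop$State, ord_pop$Pop), ] # sort by State, then ascending Pop, before building indx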

Here is a dplyr version:
df2 <- df %>%
group_by(state) %>% # Group observations by state
arrange(-pop) %>% # Within those groups, sort in descending order by pop
slice(3) # Extract the third row in each arranged group
Here's the toy data I used to test it:
set.seed(1)
df <- data.frame(state = rep(LETTERS[1:3], each = 5), city = rep(letters[1:5], 3), pop = round(rnorm(15, 1000, 100), digits=0))
And here's the output from that; it's a coincidence that 'b' was third-largest in each case, not a glitch in the code:
> df2
Source: local data frame [3 x 3]
Groups: state
state city pop
1 A b 1018
2 B b 1049
3 C b 1039
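Note that arrange(-pop) ranks each group in descending order, so slice(3) returns the third largest; for the original question's third lowest, the same pattern with an ascending sort should work (a small variation on the answer above, with df3 as a hypothetical result name):
df3 <- df %>%
  group_by(state) %>%
  arrange(pop) %>% # ascending: smallest pop first within each group
  slice(3)         # third-lowest row per state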

In R, the same end results can be achieved using different packages. The choice of package is a trade-off between efficiency and simplicity of code.
Since you come from a strong SQL background, this might be easier to use:
library(sqldf)
#Example to return 3rd lowest population of a State
result <- sqldf('Select City, State, Pop from data order by Pop limit 1 offset 2;')
# Note: this sample query returns the 3rd lowest Pop overall and needs to be modified to get the desired per-state result.
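A hedged sketch of the per-state version, using a correlated subquery (it assumes the data frame is named data, as above, and that Pop has no ties within a state):
library(sqldf)
# a row is the 3rd lowest in its state when exactly two rows of that state have a smaller Pop
result <- sqldf('SELECT City, State, Pop
                 FROM data d1
                 WHERE (SELECT COUNT(*) FROM data d2
                        WHERE d2.State = d1.State AND d2.Pop < d1.Pop) = 2
                 ORDER BY State;')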

Related

R how to speed up pattern matching using vectors

I have a column in one dataframe with city and state names in it:
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac <- as.data.frame(ac)
I would like to search for the values in ac$ac in another data frame column, d$description and return the value of column id if there is a match.
dput(df)
structure(list(month = c(202110L, 201910L, 202005L, 201703L,
201208L, 201502L), id = c(100559687L, 100558763L, 100558934L,
100558946L, 100543422L, 100547618L), description = c("residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video programming service multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone service unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
I have tried to do this via accessing the row indexes of the matches via the following methods:
which(ac$ac %in% df$description) -- this returns integer(0), because %in% tests for exact equality, not substring matches.
grep(ac$ac, df$description, value = FALSE) -- this returns the first index, 1, but it isn't vectorized: grep only uses the first pattern.
str_detect(string = ac$ac, pattern = df$description) -- but this returns all FALSE, which is incorrect.
My question: how do I search for ac$ac in df$description and return the corresponding value of df$id in the event of a match? Note that the vectors are not of the same length. I am looking for ALL matches, not just the first. I would prefer something simple and fast, because the actual datasets that I will be using have over 100k rows each but any suggestions or ideas are welcome. Thanks.
Edit. Due to Andre's initial answer below, the name of the question was changed to account for the change in the scope of the question.
Edit (12/7): bounty added to generate additional interest and a fast, efficient scalable solution.
Edit (12/8): Clarification--I would like to be able to add the id variable from df to the ac dataframe, as in ac$id.
The simplest solutions are usually the fastest!
Here is my suggestion:
str = paste0(ac, collapse="|")
df$id[grep(str, df$description)]
But you can also do it this way
df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
Or this way
df$id[grepl(str, df$description, perl=T)]
However, the options have to be compared. By the way, I added suggestions from @Andre Wildberg and @Martina C. Arnolda.
Below is the Benchmark.
str = paste0(ac, collapse="|")
fFiolka1 = function() df$id[grep(str, df$description)]
fFiolka2 = function() df$id[as.logical(rowSums(!is.na(sapply(ac, function(x) stringr::str_match(df$description, x)))))]
fFiolka3 = function() df$id[grepl(str, df$description, perl=T)]
fWildberg1 = function() df$id[unlist(sapply(ac, function(x) grep(x, df$description)))]
fWildberg2 = function() df$id[as.logical(rowSums(sapply(ac, function(x) stri_detect_regex(df$description, x))))]
fArnolda1 = function() df[grep(str, df$description), ]["id"]
fArnolda2 = function() df[stringi::stri_detect_regex(df$description, str), ]["id"]
fArnolda3 = function() df %>% filter(description %>% str_detect(str)) %>% select(id)
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fFiolka1(), fFiolka2(), fFiolka3(),
fWildberg1(), fWildberg2(),
fArnolda1(), fArnolda2(), fArnolda3(),
times=100))
Note that, for the sake of simplicity, I left ac as a vector!
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
Special update for @jvalenti
OK, now I understand better what you want to achieve. However, in order to fully show the best solution, I have slightly modified your data. Here it is:
library(tidyverse)
ac <- c("san francisco ca", "pittsburgh pa", "philadelphia pa", "washington dc", "new york ny", "aliquippa pa", "gainesville fl", "manhattan ks")
ac = tibble(ac = ac)
df = structure(list(
month = c(202110L, 201910L, 202005L, 201703L, 201208L, 201502L),
id = c(100559687L, 100558763L, 100558934L, 100558946L, 100543422L, 100547618L),
description = c(
"residential local telephone pittsburgh pa local with more san francisco ca flat rate with eas philadelphia pa plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95",
"digital video san francisco ca pittsburgh pa multilatino ultra bensalem pa service includes digital economy multilatino digital preferred tier and certain additonal digital channels coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice only harrisburg pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking coaxial cable transmission",
"residential all distance telephone pittsburgh pa unlimited voice philadelphia pa san francisco ca pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking",
"local spot advertising 30 second advertisement austin tx weekday 6 am 6 pm other audience demographic w18 49 number of rating points for daypart 0 29 average cpp 125",
"residential public switched toll pittsburgh pa manhattan ks ks plan area residence switched toll base san philadelphia pa ca average revenue per minute 0 18 minute online"
)), row.names = c(1L, 1245L, 3800L, 10538L, 20362L, 50000L), class = "data.frame")
Below you will find four different solutions: one based on a for loop, two based on functions from the dplyr package, and one using a function from the collapse package.
fSolition1 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition1()
fSolition2 = function(){
ac %>% group_by(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition2()
fSolition3 = function(){
ac %>% rowwise(ac) %>%
mutate(id = list(df$id[grep(ac, df$description)])) %>%
unnest(id)
}
fSolition3()
fSolition4 = function(){
ac %>%
collapse::ftransform(id = lapply(ac, function(x) df$id[grep(x, df$description)])) %>%
unnest(id)
}
fSolition4()
Note that, for the given data, all of these functions return the following table as a result:
# A tibble: 12 x 2
ac id
<chr> <int>
1 san francisco ca 100559687
2 san francisco ca 100558763
3 san francisco ca 100558946
4 pittsburgh pa 100559687
5 pittsburgh pa 100558763
6 pittsburgh pa 100558934
7 pittsburgh pa 100558946
8 pittsburgh pa 100547618
9 philadelphia pa 100559687
10 philadelphia pa 100558946
11 philadelphia pa 100547618
12 manhattan ks 100547618
It's time for a benchmark
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), times=100))
It is perhaps no surprise to anyone that the collapse-based solution is the fastest. However, second place may be a big surprise: the good old for-loop solution! Anyone still want to say that for is slow?
Special update for @Gwang-Jin Kim
Operating directly on plain vectors did not change much. Look below.
df_ac = ac$ac
df_description = df$description
df_id = df$id
fSolition5 = function(){
id = vector("list", length = length(df_ac))
for(i in seq_along(df_ac)){
id[[i]] = df_id[grep(df_ac[i], df_description)]
}
ac %>% mutate(id = id) %>% unnest(id)
}
fSolition5()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), times=100))
But the combination of for and ftransform can be surprising!!!
fSolition6 = function(){
id = vector("list", nrow(ac))
for(i in seq_along(ac$ac)){
id[[i]] = df$id[grep(ac$ac[i], df$description)]
}
ac %>% collapse::ftransform(id = id) %>% unnest(id)
}
fSolition6()
library(microbenchmark)
ggplot2::autoplot(microbenchmark(
fSolition1(), fSolition2(), fSolition3(), fSolition4(), fSolition5(), fSolition6(), times=100))
Last update for @jvalenti
Dear jvalenti, in your question you wrote "I have a column in one dataframe with city and state names" and later that "the actual datasets that I will be using have over 100k rows each". My conclusion is that it is very likely that a given city will appear several times in your description variable.
However, in a comment you wrote "I don't want to change the number of rows in ac".
So what kind of results do you expect? Let's see what can be done with it.
Solution 1 - we return all id as a list of vectors
ac %>% collapse::ftransform(id = map(ac, ~df$id[grep(.x, df$description)]))
# # A tibble: 8 x 2
# ac id
# * <chr> <list>
# 1 san francisco ca <int [3]>
# 2 pittsburgh pa <int [5]>
# 3 philadelphia pa <int [3]>
# 4 washington dc <int [0]>
# 5 new york ny <int [0]>
# 6 aliquippa pa <int [0]>
# 7 gainesville fl <int [0]>
# 8 manhattan ks <int [1]>
Solution 2 - we only return the first id
ac %>% collapse::ftransform(id = map_int(ac, ~df$id[grep(.x, df$description)][1]))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100559687
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 3 - we only return the last id
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)>0, df$id[idx[length(idx)]], NA)}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558946
# 2 pittsburgh pa 100547618
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 4 - or maybe you would like to choose any id from all possible
ac %>%
collapse::ftransform(id = map_int(ac, function(x) {
idx = grep(x, df$description)
ifelse(length(idx)==0, NA, ifelse(length(idx)==1, df$id[idx], df$id[sample(idx, 1)]))}))
# # A tibble: 8 x 2
# ac id
# * <chr> <int>
# 1 san francisco ca 100558763
# 2 pittsburgh pa 100559687
# 3 philadelphia pa 100547618
# 4 washington dc NA
# 5 new york ny NA
# 6 aliquippa pa NA
# 7 gainesville fl NA
# 8 manhattan ks 100547618
Solution 5 - if you want to see all the ids and keep the number of rows in ac the same at the same time
ac %>%
collapse::ftransform(id = map(ac, function(x) {
idx = grep(x, df$description)
if(length(idx)==0) tibble(id = NA, idn = "id1") else tibble(
id = df$id[idx],
idn = paste0("id",1:length(id)))})) %>%
unnest(id) %>%
pivot_wider(ac, names_from = idn, values_from = id)
# # A tibble: 8 x 6
# ac id1 id2 id3 id4 id5
# <chr> <int> <int> <int> <int> <int>
# 1 san francisco ca 100559687 100558763 100558946 NA NA
# 2 pittsburgh pa 100559687 100558763 100558934 100558946 100547618
# 3 philadelphia pa 100559687 100558946 100547618 NA NA
# 4 washington dc NA NA NA NA NA
# 5 new york ny NA NA NA NA NA
# 6 aliquippa pa NA NA NA NA NA
# 7 gainesville fl NA NA NA NA NA
# 8 manhattan ks 100547618 NA NA NA NA
Unfortunately, your description does not indicate which of the above five solutions is acceptable for you. You will have to decide for yourself.
Perhaps this is an option?
ac$id <- sapply(ac$ac, function(x) d$id[grep(x, d$description)])
# ac id
# 1 san francisco ca 100559687
# 2 pittsburgh pa 100558946
# 3 philadelphia pa
# 4 washington dc
# 5 new york ny
# 6 aliquippa pa
# 7 gainesville fl
# 8 manhattan ks 100547618
Try this sapply with grep.
df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ]
[1] 100559687 100558946 100547618
EDIT: try stri_detect_regex from stringi. It should be 2-5 times faster.
library(stringi)
df$id[ as.logical( rowSums( sapply( ac$ac, function(x)
stri_detect_regex( df$description, x ) ) ) ) ]
[1] 100559687 100558946 100547618
Microbenchmark on an extended data set with 1.728M rows:
Memory should not be a problem unless you are using a system with less than 4Gb RAM total.
nrow(df)
[1] 1728000
library(microbenchmark)
microbenchmark(
"grep1" = { res <- sapply(ac$ac, function(x) df$id[grep(x, df$description)]) },
"grep2" = { res <- df$id[ unlist( sapply( ac$ac, function(x) grep(x, df$description ) ) ) ] },
"stringi" = { res <- df$id[ as.logical( rowSums( sapply( ac$ac, function(x) stri_detect_regex( df$description, x ) ) ) ) ] }, times=10 )
Unit: seconds
expr min lq mean median uq max neval cld
grep1 96.90757 97.98706 100.13299 99.05837 101.99050 107.04312 10 b
grep2 97.51382 97.66425 100.00610 99.20753 101.17921 106.86661 10 b
stringi 46.15548 46.65894 48.68073 47.29635 50.15713 53.50351 10 a
Memory footprint during microbenchmark:
Path: /Library/Frameworks/R.framework/Versions/4.0/Resources/bin/exec/R
Physical footprint: 638.3M
Physical footprint (peak): 1.8G
You can use regex_inner_join from package fuzzyjoin
> library(fuzzyjoin)
> regex_inner_join(df, ac, by = c(description = "ac"))
month id
1 202110 100559687
2 201703 100558946
3 201502 100547618
description
1 residential local telephone service local with more san francisco ca flat rate with eas package plan includes voicemail call forwarding call waiting caller id call restriction three way calling id block speed dialing call return call screening modem rental voip transmission telephone access line 34 95 modem rental 7 00 total 41 95
2 residential all distance telephone service unlimited voice only pittsburgh pa flat rate with eas only features call waiting caller id caller id with call waiting call screening call forwarding call forwarding selective call return 69 3 way calling anonymous call rejection repeat dialing speed dial caller id blocking
3 residential public switched toll interstate manhattan ks ks plan area residence switched toll base period average revenue per minute 0 18 minute online
ac
1 san francisco ca
2 pittsburgh pa
3 manhattan ks
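If only the matched pairs are needed rather than the full joined rows, selecting the two columns after the join keeps the result compact (a usage note, assuming the same df and ac objects as above):
library(fuzzyjoin)
library(dplyr)
regex_inner_join(df, ac, by = c(description = "ac")) %>%
  select(ac, id) # keep just the matched city-state string and its id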
Checking using a regular expression and non-expensive functions should be fast:
First, we generate the pattern to be checked: ac_regex <- paste(ac$ac, collapse = "|").
There are several ways to detect matches in description and subset. Here are three:
# 1 grep()
df[grep(ac_regex, df$description), ]["id"]
# 2 stringi::stri_detect_*()
df[stri_detect_regex(df$description, ac_regex), ]["id"]
# 3 stringr::str_detect() + tidy subsetting
df %>% filter(description %>% str_detect(ac_regex)) %>% select(id)
All three return the desired subset of df:
id
1 100559687
2 100558946
3 100547618
(You need the packages tidyverse and stringi for options 2 and 3.)
Let's benchmark (using package bench):
bench::mark(
base_grep = df[grep(ac_regex, df$description), ]["id"],
base_stringi = df[stringi::stri_detect_regex(df$description, ac_regex), ]["id"],
tidy = df %>% filter(description %>% str_detect(ac_regex)) %>% select(id),
check = F
)
expression median
<bch:expr> <bch:tm>
1 base_grep 146.61µs
2 base_stringi 119.6µs
3 tidy 1.99ms
I'd go with stringi!
First, there is no c$c assignment in the provided code. All the data is assigned to a variable called c, and this variable does not have the c member (c$c) you are trying to work with.
Second, it is very bad practice to assign data to a variable named after a basic R function, as in c <- c(...).

summarizing and grouping rows based on rank or order of data values in R

My data looks like this:
EMPLOYEE_ID  LAST_NAME  FIRST_NAME  UNIT  CITY     STATE  DATA_RANK
221          SMITH      JILL        X1    DALLAS   TX     2
221          SMITH-WU   JILL                       TX     1
331          DEVIN      MARY        X2    HOUSTON         2
331          TRUNG      MARY              HOUSTON  TX     1
441          SWAN       ANNA-BELLE  X2    AUBURN   CA     1
441          DUCK       ANNA        X3    AUBURN          2
I am trying to get the output to look like this (grouped by EMPLOYEE_ID), picking the row that has DATA_RANK = 1 where there is a duplicate EMPLOYEE_ID.
EMPLOYEE_ID  LAST_NAME  FIRST_NAME  UNIT  CITY     STATE  DATA_RANK
221          SMITH-WU   JILL                       TX     1
331          TRUNG      MARY              HOUSTON  TX     1
441          SWAN       ANNA-BELLE  X2    AUBURN   CA     1
I tried using the following code:
data <- data %>%
group_by(EMPLOYEE_ID, substr(LAST_NAME,0,4), substr(FIRST_NAME,0,3)) %>%
mutate_at(vars(-group_cols()),funs(na.locf(., na.rm = FALSE, fromLast = FALSE))) %>%
filter(row_number()==n())
But that's not quite getting me here. Any thoughts? Thank you!
Is there a reason why you're using substr()?
I believe this code should work.
data %>%
group_by(EMPLOYEE_ID) %>%
filter(DATA_RANK == 1)
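If some EMPLOYEE_IDs might lack a row with DATA_RANK equal to 1, slice_min is a more defensive variant that keeps the best-ranked row per employee (a sketch, assuming dplyr >= 1.0.0):
data %>%
  group_by(EMPLOYEE_ID) %>%
  slice_min(DATA_RANK, n = 1, with_ties = FALSE) %>% # lowest available rank per employee
  ungroup()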

standardizing customer ids based on the same company name

I need to use one of the many customer ids and standardize it across all company names that are exactly the same.
Before
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1756 Lightz California
After
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1211 Lightz California
The customer ids for the two companies are now the same. Which code would be the best for this?
We can use match here, as it returns the first matching position; we can match Company with itself. According to ?match:
match returns a vector of the positions of (first) matches of its first argument in its second.
df$Customer.Ids <- df$Customer.Ids[match(df$Company, df$Company)]
df
# Customer.Ids Company Location
#1 1211 Lightz New York
#2 1325 Comput.Inc Seattle
#3 1211 Lightz California
where
match(df$Company, df$Company) #returns
#[1] 1 2 1
Some other options, using sapply
df$Customer.Ids <- df$Customer.Ids[sapply(df$Company, function(x)
which.max(x == df$Company))]
Here we loop over each Company and get the first instance of its occurrence.
Or another option using ave, which follows the same logic as that of @Shree: get the first occurrence by group.
with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))
#[1] 1211 1325 1211
Here's a way using the dplyr package. It'll replace all ids with the first instance for any company:
df %>%
group_by(Company) %>%
mutate(
Customer.Ids = Customer.Ids[1]
) %>%
ungroup()
# A tibble: 3 x 3
Customer.Ids Company Location
<int> <fct> <fct>
1 1211 Lightz New York
2 1325 Comput.Inc Seattle
3 1211 Lightz California

r: dplyr mutate error non-numeric argument to binary operator

Trying mutate in dplyr on the data.frame (named list) gives an error: non-numeric argument to binary operator. I tried converting delayed and 'on time' to numeric but am still getting the error; is there an error in the code?
list$delayed <- as.numeric(as.character(list$delayed))
list$'on time' <- as.numeric(as.character(list$'on time'))
list <- mutate(list, total = delayed + 'on tine', pctdelay = delayed / total * 100)
Carrier City delayed on time
1 Alaska Los Angeles 62 497
2 Alaska Phoenix 12 221
3 Alaska San Diego 20 212
4 Alaska San Francisco 102 503
5 Alaska Seattle 305 1841
6 AM WEST Los Angeles 117 694
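The likely cause: in the mutate call, 'on tine' is both a typo for on time and a quoted string literal, so delayed + 'on tine' tries to add a character constant, which raises non-numeric argument to binary operator. A column name containing a space needs backticks, not quotes. A minimal sketch of the fix (keeping the question's data frame name list, though shadowing base::list is best avoided):
library(dplyr)
# backticks (not quotes) refer to the column whose name contains a space
list <- mutate(list,
               total = delayed + `on time`,
               pctdelay = delayed / total * 100)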

Find all largest values in a range, across different objects in data frame

I wonder if there is a simpler way than writing if...else... for the following case. I have a data frame and I only want the rows with a number in column "percentage" >= 95. Moreover, for one object, if there are multiple rows fitting this criterion, I only want the largest one(s). If there is more than one largest, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result shows:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using if...else statements, but I feel there should be a smarter way to say: 1. keep rows >= 95; 2. if more than one, choose the largest; 3. if more than one largest, keep them all.
You can do this by creating an indicator variable that flags the rows holding the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are greater than 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the combined condition creates a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
FUN=function(i) i==max(i)) , ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
You could do this in data.table as well, using the same approach as @user20650:
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
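With dplyr >= 1.0.0, slice_max expresses the same idea and keeps ties by default (an alternative sketch, not from the original answers):
dat %>%
  group_by(object) %>%
  slice_max(percentage, n = 1, with_ties = TRUE) %>% # all rows tied for the group maximum
  filter(percentage >= 95)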
The following also works:
ddf2 = ddf[ddf$percentage >= 95, ]    # keep rows meeting the threshold
ddf3 = ddf2[-c(1:nrow(ddf2)), ]       # empty data frame with the same columns
for(oo in unique(ddf2$object)){
tempdf = ddf2[ddf2$object == oo, ]    # rows for this object
maxval = max(tempdf$percentage)
tempdf = tempdf[tempdf$percentage == maxval, ]  # keep all rows tied for the max
for(i in 1:nrow(tempdf)) ddf3[nrow(ddf3)+1, ] = tempdf[i, ]
}
ddf3
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98
