Select unique values in dataframe based on sorted value - r

How can I select unique values from a data frame, keeping for each one the row with the highest value in a second column?
Example:
name value
cheese 15
pepperoni 12
cheese 9
tomato 4
cheese 3
tomato 2
The best I've come up with (though I am SURE there's a better way) is to sort df by value in descending order, extract df$name, run unique() on that, then do a left join back with dplyr.
The ideal outcome is this:
name value
cheese 15
pepperoni 12
tomato 4
Thanks in advance!

Judging by your expected result, you want, for each name, the row with the largest value. One way to do this is the following.
library(dplyr)
group_by(mydf, name) %>%
slice(which.max(value))
# A tibble: 3 x 2
# Groups: name [3]
# name value
# <fct> <int>
#1 cheese 15
#2 pepperoni 12
#3 tomato 4
DATA
mydf <- structure(list(name = structure(c(1L, 2L, 1L, 3L, 1L, 3L), .Label = c("cheese",
"pepperoni", "tomato"), class = "factor"), value = c(15L, 12L,
9L, 4L, 3L, 2L)), class = "data.frame", row.names = c(NA, -6L
))
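The approach described in the question (sort by value, then keep the first occurrence of each name) also works without the join. Here is a minimal base R sketch of that idea, assuming the same data rebuilt with character names; `sorted` and `result` are illustrative names, not part of the original answer.

```r
# Same data as mydf above, rebuilt with character names for simplicity
mydf <- data.frame(
  name  = c("cheese", "pepperoni", "cheese", "tomato", "cheese", "tomato"),
  value = c(15L, 12L, 9L, 4L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Sort by value, largest first
sorted <- mydf[order(mydf$value, decreasing = TRUE), ]

# duplicated() flags every occurrence of a name after the first,
# so negating it keeps the top-valued row for each name
result <- sorted[!duplicated(sorted$name), ]
result
```

Because the rows are sorted before duplicates are dropped, the first row kept for each name is the one with the highest value.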

Related

Aggregating data in R using group by and keeping values of other columns that are not NA

I wonder if someone can help out. I have the following dataset, where each ID is a company that has hired different numbers of people over time, so IDs are duplicated. We also have the address for each ID, but it was not collected for every row:
ID  Address   Number of hiring
1             5
2   Montreal  2
3             3
4   Helsinki  4
1   London    1
2             3
3   Dubai     5
and I'd like to group by ID and add a column showing the total number of people each ID has hired, as well as a column showing the address of the ID. When I do this, because some addresses are missing, R automatically takes the first row for each ID, which may have a missing address. The following should be the result:
ID Address Total Number of hiring
1 London 6
2 Montreal 5
3 Dubai 8
4 Helsinki 4
I am trying to do this with dplyr in R.
You can select the first non-empty Address for each ID :
library(dplyr)
df %>%
group_by(ID) %>%
summarise(Address = Address[Address != ''][1],
total_hiring = sum(Number_of_hiring, na.rm =TRUE))
# ID Address total_hiring
# <int> <chr> <int>
#1 1 London 6
#2 2 Montreal 5
#3 3 Dubai 8
#4 4 Helsinki 4
data
df <- structure(list(ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L), Address = c("",
"Montreal", "", "Helsinki", "London", "", "Dubai"), Number_of_hiring = c(5L,
2L, 3L, 4L, 1L, 3L, 5L)), class = "data.frame", row.names = c(NA, -7L))
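The snippet above treats a missing address as an empty string. If some missing addresses are stored as NA instead, `Address != ''` evaluates to NA for those rows and the first element picked can itself be NA. The base R sketch below handles both cases. The NA entries in this copy of the data and the helper `first_known` are assumptions made for illustration.

```r
# Variant of the data where some missing addresses are NA (an assumption)
df <- data.frame(
  ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
  Address = c(NA, "Montreal", "", "Helsinki", "London", NA, "Dubai"),
  Number_of_hiring = c(5L, 2L, 3L, 4L, 1L, 3L, 5L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first value that is neither NA nor empty
first_known <- function(x) {
  x <- x[!is.na(x) & x != ""]
  if (length(x) > 0) x[1] else NA_character_
}

# One address and one total per ID
address <- tapply(df$Address, df$ID, first_known)
total   <- tapply(df$Number_of_hiring, df$ID, sum, na.rm = TRUE)

result <- data.frame(
  ID = as.integer(names(total)),
  Address = as.vector(address),
  total_hiring = as.vector(total),
  stringsAsFactors = FALSE
)
result
```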

Group by function query

Hi guys, I am new to R.
I have attached a screenshot of the df I am working with (https://i.stack.imgur.com/CUz4l.png); here is a short description:
I have a data frame with a total of 7 columns, one of which is a month column. The other 6 columns hold (integer) values, and these also have empty rows.
I need to summarise by count of all 6 columns, grouped by month.
I tried the following code: group_by(Month) %>% summarise(count=n(),na.omit())
and got the following error:
Error: Problem with summarise() input ..2.
x argument "object" is missing, with no default
i Input ..2 is na.omit().
i The error occurred in group 1: Month = "1".
Run rlang::last_error() to see where the error occurred.
Can someone please assist?
Head of data: https://i.stack.imgur.com/stfoG.png
> dput(head(Dropoff))
structure(list(Start.Date = c("01-11-2019 06:07", "01-11-2019 06:07",
"01-11-2019 06:08", "01-11-2019 06:08", "02-11-2019 06:08", "02-11-2019 06:07"
), End.Date = c("01-11-2019 06:12", "01-11-2019 09:28", "01-11-2019 10:02",
"01-11-2019 13:05", "02-11-2019 06:13", "02-11-2019 06:16"),
Month = structure(c(3L, 3L, 3L, 3L, 3L, 3L), .Label = c("1",
"2", "11"), class = "factor"), nps = c(9L, 10L, 9L, 8L, 9L,
9L), effort = c(9L, 10L, 9L, 9L, 9L, 8L), knowledge = c(NA,
NA, 5L, NA, NA, 5L), confidence = c(5L, 5L, NA, NA, 5L, NA
), listening = c(NA, NA, NA, 5L, NA, NA), fcr = c(1L, 1L,
1L, 1L, 1L, 1L), fixing.issues = c(NA, NA, NA, NA, NA, NA
)), row.names = c(NA, 6L), class = "data.frame")
I'd like the output to look something like this:
Month  count of nps  count of effort
1      xxx           xxx
2      xxx           xxx
11     6             6
...and so on (count) for all the variables.
The following
df%>% group_by(Month) %>% summarise(count=n())
provides this output: https://i.stack.imgur.com/u3nxv.png
This is not what I am hoping for.
It looks like na.omit() causes the problem here: it is called without an argument, hence the 'argument "object" is missing' error. Given that you want to count NA but not have them in any following sum, you might use
df[is.na(df)] = 0
and then
df %>% group_by(Month) %>% summarise(count=n())
Thanks for the clarifications. The semi-manual solution
df %>% group_by(Month) %>% summarize(
c_nps= sum(!is.na(nps)),
c_effort= sum(!is.na(effort)),
c_knowledge= sum(!is.na(knowledge)),
c_confidence= sum(!is.na(confidence)),
c_listening= sum(!is.na(listening)),
c_fcr= sum(!is.na(fcr))
)
should do the trick. Since it's only 6 columns to be summarized, I would use the manual specification over an automated implementation (i.e. count non-NA in all other columns).
It results in
# A tibble: 1 x 7
Month c_nps c_effort c_knowledge c_confidence c_listening c_fcr
<fct> <int> <int> <int> <int> <int> <int>
1 11 6 6 2 3 1 6
Cheers and good luck!
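To see why `n()` alone was not enough: it counts the rows in a group, whereas `sum(!is.na(x))` counts only the non-missing entries of a column. A small base R sketch of the difference, using the knowledge column from the posted data (`month`, `rows_per_month` and `non_missing` are illustrative names):

```r
# knowledge column and Month from the six posted rows
knowledge <- c(NA, NA, 5L, NA, NA, 5L)
month     <- rep("11", 6)

# Row count per month: the base R analogue of n()
rows_per_month <- tapply(knowledge, month, length)

# Non-missing count per month: is.na() gives TRUE/FALSE, sum() counts the TRUEs
non_missing <- tapply(knowledge, month, function(x) sum(!is.na(x)))

rows_per_month  # 6 rows in month 11
non_missing     # 2 non-missing values in month 11
```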
From your example, I understand that you want to count the non-NA values in every column.
Dropoff %>% group_by(Month) %>%
summarise_at(vars(nps:fixing.issues), list(count=~sum(!is.na(.x))))
summarise_at() applies a summary to every column selected in the vars() expression. Here I chose all columns from nps to fixing.issues.
As the summarising function (which describes how the data is summarised), I count all non-NA values. The functions are given as a named list. Here the ~ is shorthand for an anonymous function; a longer way to write it would be: function(x) sum(!is.na(x))
The "count" expression works as follows: is.na checks each element of the column vector (.x) for NA, and ! negates the result. As this yields a vector of TRUE/FALSE values, you can count the TRUE values with sum.
The expression works for all column types (text, numbers, ...).
Giving the result:
# A tibble: 1 x 8
Month nps_count effort_count knowledge_count confidence_count listening_count fcr_count fixing.issues_count
<fct> <int> <int> <int> <int> <int> <int> <int>
1 11 6 6 2 3 1 6 0
If that is not what you are aiming for, please clarify your question.
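In dplyr 1.0.0 and later, summarise_at() is superseded by across(). The sketch below shows the same non-NA count written that way, assuming a recent dplyr is installed. The data frame is a trimmed copy of the posted head of Dropoff, and `result` is an illustrative name.

```r
library(dplyr)

# Trimmed copy of the posted rows (date columns left out)
Dropoff <- data.frame(
  Month = factor(rep("11", 6), levels = c("1", "2", "11")),
  nps = c(9L, 10L, 9L, 8L, 9L, 9L),
  effort = c(9L, 10L, 9L, 9L, 9L, 8L),
  knowledge = c(NA, NA, 5L, NA, NA, 5L),
  confidence = c(5L, 5L, NA, NA, 5L, NA),
  listening = c(NA, NA, NA, 5L, NA, NA),
  fcr = c(1L, 1L, 1L, 1L, 1L, 1L),
  fixing.issues = rep(NA, 6)
)

# Count non-NA values in every column from nps to fixing.issues, per month;
# .names appends "_count" to each original column name
result <- Dropoff %>%
  group_by(Month) %>%
  summarise(across(nps:fixing.issues,
                   ~ sum(!is.na(.x)),
                   .names = "{.col}_count"))
result
```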

Is there a equivalent for the tidyr fill() for strings in R?

So I have a data frame like this one:
First Group Bob
Joe
John
Jesse
Second Group Jane
Mary
Emily
Sarah
Grace
I would like to fill in the empty cells in the first column in the data frame with the last string in that column i.e
First Group Bob
First Group Joe
First Group John
First Group Jesse
Second Group Jane
Second Group Mary
Second Group Emily
Second Group Sarah
Second Group Grace
With tidyr, there is fill() but it obviously doesn't work with strings. Is there an equivalent for strings? If not is there a way to accomplish this?
Seems fill() is designed to be used in isolation. When using fill() inside a mutate() statement this error appears (regardless of the data type), but it works when using it as just a component of the pipe structure. Could that have been the problem?
Just for full clarity, a quick example. Assuming you have a data frame called 'people' with columns 'group' and 'name', the right structure would be:
people %>%
fill(group)
and the following would give the error you described (and a similar error when using numbers):
people %>%
mutate(
group = fill(group)
)
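Note that fill() does work on character columns, but it only replaces NA values, so empty strings have to be converted to NA first. A minimal sketch, assuming the blanks in the 'people' data frame are empty strings (the column contents are reconstructed from the question for illustration):

```r
library(tidyr)

# Hypothetical reconstruction of the data with blank strings in 'group'
people <- data.frame(
  group = c("First Group", "", "", "", "Second Group", "", "", "", ""),
  name  = c("Bob", "Joe", "John", "Jesse", "Jane", "Mary", "Emily", "Sarah", "Grace"),
  stringsAsFactors = FALSE
)

# fill() only fills NA, so turn the empty strings into NA first
people$group[people$group == ""] <- NA

# Carry the last non-missing value downwards (the default direction)
filled <- fill(people, group)
filled
```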
(I made the assumption that this was output from an R console session. If it's a raw text file the data input may need to be done with read.fwf.)
The display suggests those are empty character values in the "spaces".
First set them to NA and then use na.locf from zoo:
dat[dat==""] <- NA
dat[1:2] <- lapply(dat[1:2], zoo::na.locf)
dat
#------------
V1 V2 V3
1 First Group Bob
2 First Group Joe
3 First Group John
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
To start with what I was using:
dat <-
structure(list(V1 = structure(c(2L, 1L, 1L, 1L, 3L, 1L, 1L, 1L,
1L), .Label = c("", "First", "Second"), class = "factor"), V2 = structure(c(2L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "Group"), class = "factor"),
V3 = structure(c(1L, 6L, 7L, 5L, 4L, 8L, 2L, 9L, 3L), .Label = c("Bob",
"Emily", "Grace", "Jane", "Jesse", "Joe", "John", "Mary",
"Sara"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
If I have to take a stab at what your data structure is, I might have something like this:
df <- data.frame(c1=c("First Group", "", "", "", "Second Group", "", "", "", ""),
c2=c("Bob","Joe","Jon","Jesse","Jane","Mary","Emily","Sara","Grace"),
stringsAsFactors = FALSE)
Then, a very basic way to do this would be by simply looping:
for(i in 2:nrow(df)) if(df$c1[i]=="") df$c1[i] <- df$c1[i-1]
df
c1 c2
1 First Group Bob
2 First Group Joe
3 First Group Jon
4 First Group Jesse
5 Second Group Jane
6 Second Group Mary
7 Second Group Emily
8 Second Group Sara
9 Second Group Grace
However, I would suggest you accept @42-'s solution if you have anything other than a small data set, as zoo::na.locf is optimized to work with large numbers of records and is a well-respected, widely used, stable package.
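If you would rather avoid both the loop and an extra package, the same fill can be done in a vectorised way in base R. This sketch assumes the first row always carries a label; `is_label` is an illustrative name.

```r
df <- data.frame(
  c1 = c("First Group", "", "", "", "Second Group", "", "", "", ""),
  c2 = c("Bob", "Joe", "Jon", "Jesse", "Jane", "Mary", "Emily", "Sara", "Grace"),
  stringsAsFactors = FALSE
)

# TRUE where a row carries its own label
is_label <- df$c1 != ""

# cumsum() numbers each row by how many labels have appeared so far,
# which indexes into the vector of labels (assumes row 1 is a label)
df$c1 <- df$c1[is_label][cumsum(is_label)]
df
```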

Getting column name as a result for question about row

I have a dataset called test.
I need to write code for the following:
Q1. What does jim do the most? The answer will be: 10 runs
Q2. What are the three things mike does least? The answer will be: walks 6, runs 4, drives 4
Q3. Who travels furthest? The answer will be: Jim 40
This will give you an idea of how to put it into tidy format, and start to answer the questions.
library(tidyverse)
df <- data.frame(stringsAsFactors=FALSE,
name = c("paul", "john", "mike", "jim"),
walks = c(10L, 9L, 6L, 7L),
runs = c(6L, 5L, 4L, 10L),
cycles = c(2L, 5L, 8L, 9L),
drives = c(2L, 3L, 4L, 5L),
flys = c(2L, 6L, 8L, 9L)
)
df
df <- df %>% gather(key = transport, value = "freq", walks:flys)
df
df %>% filter(name == "jim") %>%
group_by(transport) %>%
arrange(desc(freq))
which gives you an output table like:
# A tibble: 5 x 3
# Groups: transport [5]
name transport freq
<chr> <chr> <int>
1 jim runs 10
2 jim cycles 9
3 jim flys 9
4 jim walks 7
5 jim drives 5
which lets you answer Q1.
Notice that gather() is used to put the data into tidy format, like:
name transport freq
1 paul walks 10
2 john walks 9
3 mike walks 6
4 jim walks 7
5 paul runs 6
This looks like your homework, so it's probably better for you to figure out the rest for yourself, but this will get you on the right track.
Look into the functions in dplyr that you need.
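In current versions of tidyr, gather() is superseded by pivot_longer(). The sketch below shows the same reshaping step written that way, assuming tidyr 1.0.0 or later; `long` is an illustrative name. The rows come out in a different order than with gather(), but the content is the same.

```r
library(tidyr)

df <- data.frame(
  name   = c("paul", "john", "mike", "jim"),
  walks  = c(10L, 9L, 6L, 7L),
  runs   = c(6L, 5L, 4L, 10L),
  cycles = c(2L, 5L, 8L, 9L),
  drives = c(2L, 3L, 4L, 5L),
  flys   = c(2L, 6L, 8L, 9L),
  stringsAsFactors = FALSE
)

# One row per (name, transport) pair, with the count in 'freq'
long <- pivot_longer(df,
                     cols = walks:flys,
                     names_to = "transport",
                     values_to = "freq")
long
```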

how many people received 4 drugs of interest? R

I have a long list of people receiving drugs, coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs: "C07ABC" & "C09XYZ" & "C08123" & "C03ZOO". Some people may have used some agents (e.g. C07 or C08) more than once; that's OK, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had each drug. However, because I have various patterns that I want to look up, I would like to use the grepl function. To explain this further, my first attempt at this problem used a sum command:
sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])
However, this doesn't work, because I think the sum command needs a boolean vector. Also, I think the | sign isn't correct here either (I want an &), but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need, but I am unsure how I would code this.
Thanks in advance.
df
names fruit dates atc
4 john kiwi 2010-07-01 C07ABC
7 john apple 2010-09-01 C09XYZ
9 john banana 2010-11-01 C08123
13 john orange 2010-12-01 C03ZOO
14 john apple 2011-01-01 C07ABC
2 mary orange 2010-05-01 C09123
5 mary apple 2010-07-01 C03QRT
8 mary orange 2010-07-01 C09ZOO
10 mary apple 2010-09-01 C03123
12 mary apple 2010-11-01 C09123
1 tom apple 2010-02-01 C03897
3 tom banana 2010-03-01 C02CAMN
6 tom apple 2010-06-01 C07123
11 tom kiwi 2010-08-01 C02DA12
You might consider avoiding the use of regular expressions, and instead derive some set of meaningful columns from column atc. For combinations, you probably want a 2-way table of person and drug, and then compute on the matrix to count combinations.
For example:
tab <- xtabs(~ names + atc, df)
combo <- c("C07ABC", "C09XYZ", "C08123", "C03ZOO")
haveCombo <- rowSums(tab[,combo] > 0) == length(combo)
sum(haveCombo)
The last two lines could easily be turned into a function for each combination.
EDIT: This approach can be applied to other, derived columns, so if you're interested in the prefix then,
df$agent <- substring(df$atc, 1, 3)
tab <- xtabs(~ names + agent, df)
combo <- c("C07", "C09", "C08", "C03")
and proceed as before.
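As a sketch of the "turn it into a function" suggestion above, the helper below wraps the table-and-rowSums steps so they can be reused for any combination. The name `count_combo` and its arguments are hypothetical. The data frame is a reduced copy of the question's data with only the two columns needed.

```r
# Reduced copy of the question's data (names and atc only)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  atc = c("C07ABC", "C09XYZ", "C08123", "C03ZOO", "C07ABC",
          "C09123", "C03QRT", "C09ZOO", "C03123", "C09123",
          "C03897", "C02CAMN", "C07123", "C02DA12"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of people who have every code in 'combo'
count_combo <- function(ids, codes, combo) {
  tab <- table(ids, codes)                      # person-by-code counts
  if (!all(combo %in% colnames(tab))) return(0L) # a code nobody has
  have_combo <- rowSums(tab[, combo, drop = FALSE] > 0) == length(combo)
  sum(have_combo)
}

# Exact codes, then three-character prefixes
n_exact  <- count_combo(df$names, df$atc,
                        c("C07ABC", "C09XYZ", "C08123", "C03ZOO"))
n_prefix <- count_combo(df$names, substring(df$atc, 1, 3),
                        c("C07", "C09", "C08", "C03"))
```

With this data both calls count one person (john), since neither mary nor tom has all four agents.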
Besides not needing to pass entire data frame rows to sum, you also had extra quote marks in that pattern:
> sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
[1] 12
I think this is easier to read:
> sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
[1] 12
But now I read that you want all of those used, and want the calculation done within each patient id. That might have required using & as the connector, but I decided to try a different route: use unique and then count the number of unique matches, all within an aggregate operation.
> aggregate(atc ~ names, data=df,
function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
names atc
1 john 5
2 mary 5
3 tom 2
Although that's the number of matching items, not the number of unique items, because I forgot to put value=TRUE in the grep call (and I also need to use substr to avoid separately counting congeners with different trailing ATC codes):
> aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
names atc
1 john 4
2 mary 2
3 tom 2
This would be somewhat similar to @MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:
combo <- c("C07", "C09", "C08", "C03")
tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
#------
john mary tom
4 2 2
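If you want to stay with grepl and express the "&" from the question directly, one option is to require that every pattern matches at least one of a person's codes. A minimal sketch under that reading of the question, using a reduced copy of the data; `patterns` and `has_all` are illustrative names.

```r
# Reduced copy of the question's data (names and atc only)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  atc = c("C07ABC", "C09XYZ", "C08123", "C03ZOO", "C07ABC",
          "C09123", "C03QRT", "C09ZOO", "C03123", "C09123",
          "C03897", "C02CAMN", "C07123", "C02DA12"),
  stringsAsFactors = FALSE
)

patterns <- c("^C07", "^C09", "^C08", "^C03")

# For each person: does every pattern match at least one of their codes?
has_all <- tapply(df$atc, df$names, function(codes) {
  all(vapply(patterns, function(p) any(grepl(p, codes)), logical(1)))
})

has_all       # john TRUE, mary FALSE, tom FALSE
sum(has_all)  # number of people with the full regimen
```

Here any() plays the role of "used at least once" and all() plays the role of the "&" across the four patterns.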
You can try this:
drugs <- c("C07ABC","C09XYZ", "C08123", "C03ZOO")
table(unique(df[df$atc %in% drugs, c("names", "atc")])$names)
# john mary tom
# 4 0 0
names(which(table(unique(df[df$atc %in% drugs, c("names", "atc")])$names) > 3))
# [1] "john"
Data
df <- structure(list(names = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("john", "mary", "tom"
), class = "factor"), fruit = structure(c(3L, 1L, 2L, 4L, 1L,
4L, 1L, 4L, 1L, 1L, 1L, 2L, 1L, 3L), .Label = c("apple", "banana",
"kiwi", "orange"), class = "factor"), dates = structure(c(5L,
7L, 8L, 9L, 10L, 3L, 5L, 5L, 7L, 8L, 1L, 2L, 4L, 6L), .Label = c("2010-02-01",
"2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-08-01",
"2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01"), class = "factor"),
atc = structure(c(8L, 11L, 9L, 6L, 8L, 10L, 5L, 12L, 3L,
10L, 4L, 1L, 7L, 2L), .Label = c("C02CAMN", "C02DA12", "C03123",
"C03897", "C03QRT", "C03ZOO", "C07123", "C07ABC", "C08123",
"C09123", "C09XYZ", "C09ZOO"), class = "factor")), .Names = c("names",
"fruit", "dates", "atc"), class = "data.frame", row.names = c("4",
"7", "9", "13", "14", "2", "5", "8", "10", "12", "1", "3", "6",
"11"))
This is just a continuation of @Michael Lawrence's answer. I changed the drugs to what @user2363642 wanted, and I also took a substring of the atc column to use only the first three characters, which, again, I believe is what @user2363642 wanted. Also, for the rowSums, I first changed all non-zero quantities to 1, to ensure we don't double count drugs.
drugs <- c("C07", "C09", "C08", "C03")
df$atc.abbr <- substring(df$atc, 1, 3)
xt <- xtabs(~ names + atc.abbr, df)
xt[xt>0] <- 1
rowSums(xt[,drugs]) >= length(drugs)
Output:
john mary tom
TRUE FALSE FALSE
