R count occurrences of an element by groups [duplicate] - r

This question already has answers here:
Add column with order counts
(2 answers)
Count number of rows within each group
(17 answers)
Closed 7 years ago.
What is the easiest way to count the occurrences of an element in a vector or data.frame within each group?
I don't mean just counting the total (as other Stack Overflow questions ask), but giving a different number to every successive occurrence.
For example, for this simple data frame (though I will work with data frames with more columns):
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata,myorder=ave(rep(1,nrow(mydata)),mydata$A, FUN=cumsum))
and here the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there a single command to do it, or a specialized package?
I want to use tidyr's spread() function on the result later.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want the total number of occurrences at the end, but the cumulative number of occurrences up to each element.
OK, my problem is a little bit more complex
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know how to solve the first example I wrote above.
But what happens when I also want it by a second grouping variable?
Something like occurrences(letter) by group:
group letter "occurrences within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found a way with
ave(rep(1,nrow(mydata)),list(mydata$group, mydata$letter), FUN=cumsum)
though there should be something easier.

Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the operation run separately within each group defined by the columns group and letter. .N is the number of rows within that group (if the by argument were empty, it would be the number of rows in the whole table), so within each sub-table the rows are indexed from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
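As a rough sketch of how .N behaves with and without by, the snippet below rebuilds the example table under the hypothetical name dt. It also tries rowid(), a data.table helper that should give the same per-group running index; treat that as an alternative to check against your data.table version, not as part of the original answer.

```r
library(data.table)

# Rebuild the example table (dt is a name chosen for this sketch)
dt <- data.table(group  = c("x", "x", "x", "x", "y", "y", "y"),
                 letter = c("A", "A", "A", "B", "B", "A", "A"))

# Without by, .N is the number of rows in the whole table
total_rows <- dt[, .N]

# With by, .N is the number of rows in each (group, letter) sub-table
counts <- dt[, .N, by = .(group, letter)]

# Running index within each sub-table, as in the answer above
dt[, myorder := seq_len(.N), by = .(group, letter)]

# Alternative: rowid() numbers repeated combinations of its arguments
dt[, myorder2 := rowid(group, letter)]
```

seq_len(.N) is used here instead of 1:.N only as a defensive habit: it stays well-behaved if a group ever has zero rows.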
Alternatively, a dplyr solution that is pretty much the same:
library(dplyr)
mydata %>%
group_by(group, letter) %>%
mutate(myorder = 1:n())
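If you would rather avoid extra packages, a base R variant of the same idea might look like the sketch below. It is a simplification of the ave() call from the question, not something taken from the answers: seq_along is applied within each combination of group and letter.

```r
# Example data from the question
mydata <- data.frame(group  = c("x", "x", "x", "x", "y", "y", "y"),
                     letter = c("A", "A", "A", "B", "B", "A", "A"))

# ave() splits the first argument by the grouping vectors and applies FUN
# to each piece; seq_along() then numbers the rows of each piece 1, 2, 3, ...
mydata$myorder <- ave(seq_len(nrow(mydata)),
                      mydata$group, mydata$letter,
                      FUN = seq_along)

print(mydata)
```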

Related

How to extend values down an r dataframe when row conditions are met? [duplicate]

This question already has answers here:
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Replacing NAs with latest non-NA value
(21 answers)
Closed yesterday.
Note that solutions to other questions may resolve this one, such as Replace NA with previous or next value, by group, using dplyr. However, this question isn't about replacing NAs in general; NAs are acceptable here in certain circumstances. It is about replacing every cell in a group that falls after the first non-NA value in that group with that first non-NA value. When I researched this issue I didn't find solutions that fit, because I only want to replace NAs in certain circumstances: NAs that occur before the first non-NA in a group should remain, and a group that contains only NAs should keep all of them.
Is there a method, preferably using dplyr or data.table, for extending target values down a range of an R data frame once a specified condition is met in a row within a group? I vaguely remember an rleid function in data.table that may do the trick, but I'm having trouble implementing it. The result could be either a new column or an overwrite of the existing column "State" in the example below.
For example, if we start with the below example dataframe, I'd like to send the target value of 1 in each row to the end of each ID grouping, after the first occurrence of that target value of 1 in a group, and as better explained in the illustration underneath:
myDF <- data.frame(
ID = c(1,1,1,1,2,2,2,3,3,3),
State = c(NA,NA,1,NA,1,NA,NA,NA,NA,NA))
You can use fill from tidyr:
library(dplyr)
library(tidyr)
myDF %>%
group_by(ID) %>%
fill(State, .direction = "down")
# A tibble: 10 × 2
# Groups: ID [3]
ID State
<dbl> <dbl>
1 1 NA
2 1 NA
3 1 1
4 1 1
5 2 1
6 2 1
7 2 1
8 3 NA
9 3 NA
10 3 NA
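fill() carries the last non-NA value forward, which matches the example because each group has at most one distinct non-NA value. If a group could contain several different non-NA values and you literally want the first one extended to the end of the group, a base R sketch could look like this. The helper name fill_after_first is made up for illustration.

```r
myDF <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  State = c(NA, NA, 1, NA, 1, NA, NA, NA, NA, NA))

# Hypothetical helper: overwrite everything from the first non-NA onward
# with that first non-NA value; leading NAs and all-NA groups are untouched.
fill_after_first <- function(x) {
  first <- which(!is.na(x))[1]          # NA if the group has no non-NA value
  if (!is.na(first)) {
    x[first:length(x)] <- x[first]
  }
  x
}

# Apply the helper separately within each ID
myDF$State <- ave(myDF$State, myDF$ID, FUN = fill_after_first)

print(myDF)
```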

How to count singular entries from multiple entries in a data frame [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I'm trying very hard to break my C mold but, as you'll see, it's still present in my R code. I know there will be a smart R way of doing this!
I'm essentially trying to go through a long list of individuals held in a data frame. Each individual can have multiple rows if they have taken more than one drug, or even multiple instances of the same drug. Each row has a drug name entry, similar to:
patientID drugname
1 A
2 A
2 B
3 C
3 C
4 A
I have a list containing the unique drug names from this DF (A, B, C). I would like to build a data frame with columns drugname and drugCount. In drugCount I want the number of distinct patients each drug was prescribed to, not multiple counts per person: more of a binary "was this drug given to person X?".
A start of an attempt using a very C-style manner:
uniqueDrugList <- unique(therapyDF$prodcode)
numDrugs <- length(uniqueDrugList)
idList <- unique(therapyDF$patid)  # list of unique patient IDs
# data.frame(), not as.data.frame(), builds a data frame from named columns
prevalenceDF <- data.frame(drugName = character(numDrugs), drugcount = integer(numDrugs), prevalence = numeric(numDrugs), stringsAsFactors = FALSE)
for (i in seq_along(idList)) {
  individualDF <- subset(therapyDF, patid == idList[[i]])
  for (j in seq_len(numDrugs)) {
    if (uniqueDrugList[[j]] %in% individualDF$prodcode) {
      # prevalenceDF <- ... somehow tally up here
    }
  }
}
Firstly, I take a subset of my main DF for each individual, using a list of unique IDs. Then, for each unique drug (and this is where it is slow), I want to see whether that drug is present in that individual's records. If it is present, I would like to add 1 to that drug's entry; otherwise I move on to the next individual's subset.
Expected output
drugname count
A 3
B 1
C 1
We can do a group by 'drugname' and get the length of unique elements of 'patientID'
library(dplyr)
df %>%
group_by(drugname) %>%
summarise(count = n_distinct(patientID))
# A tibble: 3 x 2
# drugname count
# <chr> <int>
#1 A 3
#2 B 1
#3 C 1
Or use table from base R after getting the unique rows
table(unique(df)[2])
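For completeness, here is a self-contained base R sketch of the same count, using the example data from the question (the data frame name df follows the answer above). It is only one of several possible base R routes: tapply() groups the patient IDs by drug and counts the distinct IDs in each group.

```r
# Example data from the question
df <- data.frame(patientID = c(1, 2, 2, 3, 3, 4),
                 drugname  = c("A", "A", "B", "C", "C", "A"))

# Number of distinct patients per drug
counts <- tapply(df$patientID, df$drugname,
                 function(ids) length(unique(ids)))

# Reshape the named vector into the requested two-column layout
result <- data.frame(drugname = names(counts),
                     count    = as.vector(counts))

print(result)
```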

Generate cross-section from panel data in R [duplicate]

This question already has answers here:
data.frame Group By column [duplicate]
(4 answers)
Closed 6 years ago.
I have a panel data file (long format) and I need to convert it to cross-sectional data. That is, I don't just need a transformation to the wide format; I need exactly one observation per individual that contains the mean for each variable.
Here's what I want to do: I have panel data (a number of observations for each individual) in a data frame, and I'm looking for an easy way in R to generate a new data frame that contains aggregated data for each individual, i.e. either the sum of all observations in each variable or their mean. It might also be interesting to get a measure of volatility.
For example I have a given data frame panel_data that contains panel data:
> individual <- c(1,1,2,2,3,3)
> var1 <- c(2,3,3,3,4,3)
> panel_data <- data.frame(individual,var1)
> panel_data
individual var1
1 1 2
2 1 3
3 2 3
4 2 3
5 3 4
6 3 3
The result should look like this:
> cross_data
individual var1
1 1 5
2 2 6
3 3 7
Now this is only an example. I need this feature in a number of variants, the most important one probably being the intra-individual mean for each variable.
There are ways to do this using base R or using the popular packages data.table or dplyr. Everyone has their own preference and mine is dplyr.
You can very easily perform a variety of operations to summarise your data per individual. With dplyr syntax, you first group_by individual to specify that operations should be performed on groups defined by the variable "individual". You can then summarise your groups using a function you specify.
Try the following:
library("dplyr")
panel_data %>%
group_by(individual) %>%
summarise(sum_var1 = sum(var1), mean_var1=mean(var1))
Do not be put off by the %>% notation, it is just a convenient shortcut to chain operations:
x %>% f is equivalent to f(x)
x %>% f(a) is equivalent to f(x, a)
x %>% f(a) %>% g(b) is equivalent to g(f(x, a), b)
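To make the equivalence concrete, the sketch below writes the same summary twice, once piped and once as nested calls, using the panel_data example from the question. The sd_var1 column is an addition for this sketch, offered as one possible volatility measure; it is not part of the original answer.

```r
library(dplyr)

panel_data <- data.frame(individual = c(1, 1, 2, 2, 3, 3),
                         var1       = c(2, 3, 3, 3, 4, 3))

# Piped form: read top to bottom
piped <- panel_data %>%
  group_by(individual) %>%
  summarise(sum_var1  = sum(var1),
            mean_var1 = mean(var1),
            sd_var1   = sd(var1))    # sd as an assumed volatility measure

# Nested form: the same calls, read inside out
nested <- summarise(group_by(panel_data, individual),
                    sum_var1  = sum(var1),
                    mean_var1 = mean(var1),
                    sd_var1   = sd(var1))

print(piped)
```

Both objects should hold one row per individual with identical values; the pipe only changes how the calls are written, not what they compute.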

R Data-Frame: Get Maximum of Variable B conditional on Variable A [duplicate]

This question already has answers here:
Extract the maximum value within each group in a dataframe [duplicate]
(3 answers)
Closed 7 years ago.
I am searching for an efficient and fast way to do the following:
I have a data frame with, say, 2 variables, A and B, where the values for A can occur several times:
mat<-data.frame('VarA'=rep(seq(1,10),2),'VarB'=rnorm(20))
VarA VarB
1 0.95848233
2 -0.07477916
3 2.08189370
4 0.46523827
5 0.53500190
6 0.52605101
7 -0.69587974
8 -0.21772252
9 0.29429577
10 3.30514605
1 0.84938361
2 1.13650996
3 1.25143046
Now I want to get a vector giving me for every unique value of VarA
unique(mat$VarA)
the maximum of VarB conditional on VarA.
In the example here that would be
1 0.95848233
2 1.13650996
3 2.08189370
etc...
My data-frame is very big so I want to avoid the use of loops.
Try this:
library(dplyr)
mat %>% group_by(VarA) %>%
summarise(max=max(VarB))
Try the data.table package:
library(data.table)
mat <- data.table(mat)
result <- mat[,max(VarB),VarA]
print(result)
Try this:
library(plyr)
ddply(mat, .(VarA), summarise, VarB=max(VarB))
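A loop-free base R sketch of the same grouped maximum is shown below, in case adding a package is not an option. The seed is arbitrary and only there to make the random example reproducible; it is not part of the question.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility
mat <- data.frame(VarA = rep(seq(1, 10), 2), VarB = rnorm(20))

# Named vector: one maximum of VarB per unique value of VarA
max_by_A <- tapply(mat$VarB, mat$VarA, max)

# The same result as a two-column data frame
agg <- aggregate(VarB ~ VarA, data = mat, FUN = max)

print(agg)
```

tapply() returns a vector named by the values of VarA, which matches the "vector" the question asks for; aggregate() is handier if a data frame is needed downstream.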

Select groups with more than one distinct value per group [duplicate]

This question already has answers here:
Select groups with more than one distinct value
(3 answers)
Closed 7 years ago.
I have data like below:
ID category class
1 a m
1 a s
1 b s
2 a m
3 b s
4 c s
5 d s
I want to subset the data by only including those "ID" which have several (> 1) different categories.
My expected output:
ID category class
1 a m
1 a s
1 b s
Is there a way to do so?
I tried
library(dplyr)
df %>%
group_by(ID) %>%
filter(n_distinct(category, class) > 1)
But it gave me an error:
# Error: expecting a single value
Using data.table
library(data.table) #see: https://github.com/Rdatatable/data.table/wiki for more
setDT(data) #convert to native 'data.table' type by reference
data[ , if(uniqueN(category) > 1) .SD, by = ID]
uniqueN is data.table's (fast) native equivalent of length(unique()), and .SD is just the whole data.table (in more general cases, it can represent a subset of columns, e.g. when the .SDcols argument is used). So basically the middle statement (j, the column selection argument) says to return all columns and rows associated with an ID for which there are at least two distinct values of category.
Use uniqueN's by argument to extend this to cases that count distinct combinations of multiple columns.
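Since the question started from dplyr, here is a sketch of how the attempted filter could be written so that it counts distinct categories only. It assumes the data frame is called df, as in the question's attempt, and rebuilds the example data so the snippet runs on its own.

```r
library(dplyr)

# Example data from the question
df <- data.frame(ID       = c(1, 1, 1, 2, 3, 4, 5),
                 category = c("a", "a", "b", "a", "b", "c", "d"),
                 class    = c("m", "s", "s", "m", "s", "s", "s"))

# Keep every row of an ID that has more than one distinct category
result <- df %>%
  group_by(ID) %>%
  filter(n_distinct(category) > 1) %>%
  ungroup()

print(result)
```

Only category is passed to n_distinct() here because the question asks about IDs with several different categories; passing class as well would count distinct category/class combinations instead.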
