Count values in column then reset [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
I am trying to add a column that counts occurrences of each name, restarting from 1 each time the name changes, like this:
NAME ID
PIERRE 1
PIERRE 2
PIERRE 3
PIERRE 4
JACK 1
ALEXANDRE 1
ALEXANDRE 2
Reproducible data:
df <- structure(list(NAME = structure(c(3L, 3L, 3L, 3L, 2L, 1L, 1L),
  .Label = c("ALEXANDRE", "JACK", "PIERRE"), class = "factor")),
  class = "data.frame", row.names = c(NA, -7L))

You could build a sequence along the elements in each group (= Name):
ave(1:nrow(df), df$NAME, FUN = seq_along)
Or, if names may reoccur later on, and it should still count as a new group (= Name-change), e.g.:
groups <- cumsum(c(FALSE, df$NAME[-1]!=head(df$NAME, -1)))
ave(1:nrow(df), groups, FUN = seq_along)
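To see the difference between the two (a quick sketch on made-up data where a name reoccurs after a gap):

df2 <- data.frame(NAME = c("PIERRE", "JACK", "PIERRE"))
ave(1:nrow(df2), df2$NAME, FUN = seq_along)
# [1] 1 1 2  -- PIERRE's reappearance continues its old count
groups <- cumsum(c(FALSE, df2$NAME[-1] != head(df2$NAME, -1)))
ave(1:nrow(df2), groups, FUN = seq_along)
# [1] 1 1 1  -- every run of equal names restarts at 1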

Using dplyr together with data.table's rleid():
library(dplyr)
library(data.table)

df %>%
  group_by(ID_temp = rleid(NAME)) %>%
  mutate(ID = seq_along(ID_temp)) %>%
  ungroup() %>%
  select(-ID_temp)
Or just data.table:
setDT(df)[, ID := seq_len(.N), by=rleid(NAME)]
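For the reproducible data above, both the pipeline and the one-liner should produce ID = 1, 2, 3, 4, 1, 1, 2.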

Here's a quick way to do it.
First you can set up your data:
mydata <- data.frame("name"=c("PIERRE", "ALEX", "PIERRE", "PIERRE", "JACK", "PIERRE", "ALEX"))
Next, I add a placeholder column of 1s (this is the part that makes the solution inelegant):
mydata$placeholder <- 1
Finally, I add up the placeholder column (cumulative sum), grouped by the name column:
mydata$ID <- ave(mydata$placeholder, mydata$name, FUN=cumsum)
Since I started with unsorted names, my dataframe is currently unsorted, but that can be fixed with:
mydata <- mydata[order(mydata$name, mydata$ID),]
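If you'd rather not create the placeholder column at all, the same cumulative-sum idea fits in one line (a sketch of the identical logic):

mydata$ID <- ave(rep(1, nrow(mydata)), mydata$name, FUN = cumsum)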

Related

identify observations based on 2 elements in 2 dataframes that do not match [duplicate]

This question already has answers here:
Delete rows that exist in another data frame? [duplicate]
(3 answers)
Find complement of a data frame (anti - join)
(7 answers)
I want to identify observations in one data frame that do not match those in another, using two indicators (id and date). Below are sample df1 and df2.
df1
id date n
12-40 12/22/2018 3
11-08 10/02/2016 11
df2
id date interval
12-40 12/22/2018 3
11-08 10/02/2016 32
22-22 11/10/2015 11
I want a df that outputs rows that are in df2, but not in df1, like so. Note that row 3 (based on id and date) of df2 is not in df1.
df3
id date interval
22-22 11/10/2015 11
I tried doing this in tidyverse and was not able to get the code to work. Does anyone have suggestions on how to do this?
We can use anti_join from dplyr (the OP mentioned working with the tidyverse). Here we join on both 'id' and 'date', as described in the OP's post; more complex joins are possible with the tidyverse as well.
library(dplyr)
anti_join(df2, df1, by = c('id', 'date'))
# id date interval
#1 22-22 11/10/2015 11
Or a similar option with data.table, which should be very efficient:
library(data.table)
setDT(df2)[!df1, on = .(id, date)]
# id date interval
#1: 22-22 11/10/2015 11
data
df1 <- structure(list(id = c("12-40", "11-08"),
  date = c("12/22/2018", "10/02/2016"), n = c(3L, 11L)),
  class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(id = c("12-40", "11-08", "22-22"),
  date = c("12/22/2018", "10/02/2016", "11/10/2015"),
  interval = c(3L, 32L, 11L)), class = "data.frame", row.names = c(NA, -3L))
Try this (both options are base R, follow the OP's directions, and do not require any packages):
#Code 1
df3 <- df2[!paste(df2$id, df2$date) %in% paste(df1$id, df1$date), ]
Output:
id date interval
3 22-22 11/10/2015 11
Alternatively, with subset():
#Code 2
df3 <- subset(df2,!paste(id,date) %in% paste(df1$id,df1$date))
Output:
id date interval
3 22-22 11/10/2015 11
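One caveat with the paste() approach: the default separator is a space, so differently split values can collide (id "a b" with date "c" pastes to the same key as id "a" with date "b c"). Using an unlikely separator is a cheap guard (same idea, only the separator changed):

df3 <- subset(df2, !paste(id, date, sep = "\037") %in% paste(df1$id, df1$date, sep = "\037"))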
Some data used:
#Data1
df1 <- structure(list(id = c("12-40", "11-08"),
  date = c("12/22/2018", "10/02/2016"), n = c(3L, 11L)),
  class = "data.frame", row.names = c(NA, -2L))
#Data2
df2 <- structure(list(id = c("12-40", "11-08", "22-22"),
  date = c("12/22/2018", "10/02/2016", "11/10/2015"),
  interval = c(3L, 32L, 11L)), class = "data.frame", row.names = c(NA, -3L))
Another base R option using merge + subset + complete.cases
df3 <- subset(
u <- merge(df1, df2, by = c("id", "date"), all.y = TRUE),
!complete.cases(u)
)[names(df2)]
which gives
> df3
id date interval
3 22-22 11/10/2015 11
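Note that this variant only works because unmatched rows pick up an NA in the column that exists solely in df1 (n here); if df1's own columns can legitimately contain NA, complete.cases will misclassify rows, and the anti_join-style solutions above are the safer choice.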

How do I aggregate data in R in a way that returns the entire row that satisfies the aggregation condition? [no dplyr]

I have data that looks like this:
ID FACTOR_VAR INT_VAR
1 CAT 1
1 DOG 0
I want to aggregate by ID such that the resulting dataframe contains the entire row that satisfies my aggregate condition. So if I aggregate by the max of INT_VAR, I want to return the whole first row:
ID FACTOR_VAR INT_VAR
1 CAT 1
The following will not work because FACTOR_VAR is a factor:
new_data <- aggregate(data[, c("ID", "FACTOR_VAR", "INT_VAR")], by = list(data$ID), FUN = max)
How can I do this? I know dplyr has a group by function, but unfortunately I am working on a computer for which downloading packages takes a long time. So I'm looking for a way to do this with just vanilla R.
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
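For the sample data below, this should keep only the row with the maximum INT_VAR per ID:

#  ID FACTOR_VAR INT_VAR
#1  1        CAT       1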
You can use aggregate for this. If you want to retain all the columns, merge can be used with it.
merge(aggregate(INT_VAR ~ ID, data = df, max), df, all.x = T)
# ID INT_VAR FACTOR_VAR
#1 1 1 CAT
data
df <- structure(list(ID = c(1L, 1L), FACTOR_VAR = structure(1:2, .Label = c("CAT", "DOG"), class = "factor"), INT_VAR = 1:0), class = "data.frame", row.names = c(NA,-2L))
We can do this in dplyr
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(INT_VAR == max(INT_VAR))
Or using data.table
library(data.table)
setDT(df)[, .SD[INT_VAR == max(INT_VAR)], by = ID]
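On large tables the .SD subset can be slow; a common faster data.table idiom computes the matching row indices first and subsets once (a sketch, same result):

setDT(df)[df[, .I[INT_VAR == max(INT_VAR)], by = ID]$V1]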

Return record where frequency count does not match [duplicate]

This question already has answers here:
duplicates in multiple columns
(2 answers)
I am working on a dataset in R where each WO has Invoice records with values "K" and "B". I want the WO returned where the frequency per WO does not match between the "K" and "B" records. For example, given the following table:
df <- structure(list(WO = c(917595L, 917595L, 1011033L, 1011033L),
Invoice = c("B", "K", "B", "K"), freq = c(3L, 6L, 2L, 2L)),
.Names = c("WO", "Invoice", "freq"),
class = "data.frame", row.names = c(NA, -4L)
)
I want 917595 returned because 3 does not equal 6. However, 1011033 should not be returned because its frequencies match.
Reshaping the data lets you compare the frequency values side by side.
library(dplyr)
library(reshape2)

dframe <- read.csv(text = "WO,Invoice,freq
917595,B,3
917595,K,6
1011033,B,2
1011033,K,2", stringsAsFactors = FALSE)

dcast(dframe, WO ~ Invoice, value.var = "freq") %>%
  filter(B != K)
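The reshaped comparison should leave only the mismatched work order:

#       WO B K
# 1 917595 3 6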
We could do it with base R using duplicated:
df[!(duplicated(df[c(1, 3)]) | duplicated(df[c(1, 3)], fromLast = TRUE)), ]
# WO Invoice freq
#1 917595 B 3
#2 917595 K 6
Or another option is to group by 'WO' and check if the number of unique elements in 'freq' is greater than 1
library(data.table)
setDT(df)[, if (uniqueN(freq) > 1) .SD, WO]
# WO Invoice freq
#1: 917595 B 3
#2: 917595 K 6
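The same "more than one distinct freq per WO" check can also be written in base R with ave (a sketch):

df[as.logical(ave(df$freq, df$WO, FUN = function(x) length(unique(x)) > 1)), ]
#      WO Invoice freq
#1 917595       B    3
#2 917595       K    6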

Count IDs of groups if one variable is equal within a group

I have a data frame in R like the following:
Group.ID status
1 1 open
2 1 open
3 2 open
4 2 closed
5 2 closed
6 3 open
I want to count the number of IDs where all status values are "open" for that ID. For example, Group ID 1 has two observations and both statuses are "open", so that is one for my count. Group ID 2 does not qualify because not all of its statuses are open.
I can count rows or group IDs under simple conditions; however, I don't know how to express the "all statuses equal one value within a group" logic.
DATA.
df1 <-
structure(list(Group.ID = c(1, 1, 2, 2, 2, 3), status = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("closed", "open"), class = "factor")), .Names = c("Group.ID",
"status"), row.names = c(NA, -6L), class = "data.frame")
Here are two solutions, both using base R: one with aggregate and a shorter one with tapply. If you just want the total count of Group.IDs matching your request, I suggest the second.
agg <- aggregate(status ~ Group.ID, df1, function(x) as.integer(all(x == "open")))
sum(agg$status)
#[1] 2
sum(tapply(df1$status, df1$Group.ID, FUN = function(x) all(x == "open")))
#[1] 2
A dplyr solution:
library(dplyr)
df1 %>%
  group_by(Group.ID) %>%
  summarise(all_open = all(status == "open")) %>%
  summarise(count = sum(all_open))
# count = 2

Count the number of duplicate for a column

My objective is to count how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; the values are dates from 1/4/16 to 7/3/16, each appearing about 144 times. Example (one duplicate each for example's sake):
1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16
I used date = count(date), where date is my data frame of dates, but once I execute it my date sequence is no longer in order. Hope someone can solve my problem.
If we need to count the total number of duplicates:
sum(table(df1$date)-1)
#[1] 5
If instead we need the count of each date, one option is to group by 'date' and get the number of rows. This can be done with data.table:
library(data.table)
setDT(df1)[, .N, date]
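For the ten-row example in the question, this should give one row per date, in order of first appearance:

#      date N
#1:  1/4/16 2
#2: 31/3/16 2
#3: 30/3/16 2
#4: 29/3/16 2
#5: 28/3/16 2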
If you want the count of the number of duplicates in your column, you can use duplicated:
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT: As per the update, if you want the count of each date, you can use the table function, which gives you exactly that:
table(df$V1)
#1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
# 2 2 2 2 2
library(dplyr)
library(janitor)
df %>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
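Worth knowing: get_dupes() returns every row whose value occurs more than once, including the first occurrence, so tally() here counts all duplicated rows (10 for the ten-row sample), not just the extra copies (5) the way sum(duplicated(...)) does.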
One way is to create a data frame with the unique values of your initial data, which preserves the order, and then use left_join from the dplyr package to join the two data frames. Note that the column names should match.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
"31/3/16"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
library(dplyr)  # for left_join(); count() below is plyr's, which returns a 'freq' column

df1 <- unique(Initial_data)
count1 <- plyr::count(Initial_data)
left_join(df1, count1, by = 'V1')
# V1 freq
#1 1/4/16 2
#2 31/3/16 2
#3 30/3/16 2
#4 29/3/16 2
#5 28/3/16 3
If you want to count the number of duplicated records, use:
sum(duplicated(df))
And if you want the proportion of duplicated rows, use:
mean(duplicated(df))
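A quick check against the ten-row example (a sketch, assuming the dates sit in a one-column data frame df):

sum(duplicated(df))   # 5   -- five rows repeat an earlier row
mean(duplicated(df))  # 0.5 -- half of the rows are repeats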
