Count group IDs where one variable is equal across the whole group - R

I have a data frame in R like the following:
  Group.ID status
1        1   open
2        1   open
3        2   open
4        2 closed
5        2 closed
6        3   open
I want to count the number of group IDs that satisfy the condition: all status values are "open" for the same ID. For example, Group ID 1 has two observations and their status is "open" in both, so that's one for my count. Group ID 2 does not count, because not all of its status values are open.
I can count rows or group IDs under simple conditions. However, I don't know how to apply the "all status values equal to one value within a group" logic.
DATA.
df1 <-
structure(list(Group.ID = c(1, 1, 2, 2, 2, 3), status = structure(c(2L,
2L, 2L, 1L, 1L, 2L), .Label = c("closed", "open"), class = "factor")), .Names = c("Group.ID",
"status"), row.names = c(NA, -6L), class = "data.frame")

Here are two solutions, both using base R: a more complicated one with aggregate and a simpler one with tapply. If you just want the total count of Group.IDs matching your request, I suggest you use the second solution.
agg <- aggregate(status ~ Group.ID, df1, function(x) as.integer(all(x == "open")))
sum(agg$status)
#[1] 2
sum(tapply(df1$status, df1$Group.ID, FUN = function(x) all(x == "open")))
#[1] 2
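If you also want to know which Group.IDs qualify, not just how many, the same tapply result can be filtered with names(which(...)); a small extension of the above:
open_by_id <- tapply(df1$status, df1$Group.ID, FUN = function(x) all(x == "open"))
names(which(open_by_id))
#[1] "1" "3"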

A dplyr solution:
library(dplyr)
df1 %>%
  group_by(Group.ID) %>%
  filter(all(status == "open")) %>%
  n_groups()
#[1] 2


R: For each level in column A, replace values in column B following a condition in column C

I have a list of species observations in N sites. Observations are presence, absence or unknown (1, 0, 'na'). What I need to do is, for each species, satisfy the condition:
for each SITE, if no 1 %in% SITE, replace all 0 with 'na'
I've managed a workaround using a nested loop and lists, but that seems horribly inefficient. Some questions pertaining to matching values in a column provided more elegant solutions, but I couldn't apply them in this more complex setting.
Here's some dummy data:
x <- c(1,2,3,4,5,6,7,8,9,10)
site <- c(1,1,1,2,2,2,3,3,3,1)
sp1 <- factor(c(1,1,'na','na',0,0,'na','na','na',0))
sp2 <- factor(c(0,0,1,1,'na','na',0,1,'na','na'))
table <- cbind.data.frame(x,site,sp1,sp2)
And what I did:
for (j in c(3:4)) {
  site.present <- unique(table$site[which(table[, j] == 1)])
  for (i in 1:length(table[, j])) {
    ifelse(!(table[i, 2] %in% site.present),
           ifelse(table[i, j] == 0, table[i, j] <- 'na', T), T)
  }
}
In this example [5,3] and [6,3] should become 'na' instead of 0 (because for sp1 there is no presence in site 2). The code above works, but it seems silly for processing millions of entries...
Much appreciated!
Using dplyr and base::replace. We can replace any zero with NA where there is no species record equal to 1 in that site.
library(dplyr)
df <- table
df %>%
  mutate_all(~ as.numeric(as.character(.))) %>%
  group_by(site) %>%
  # mutate(sp1_mod = replace(sp1, all(sp1 != 1, na.rm = TRUE) & sp1 == 0, NA)) # for one column
  mutate_at(vars('sp1', 'sp2'), list(~ replace(., all(. != 1, na.rm = TRUE) & . == 0, NA)))
Also, instead of naming variables inside vars one by one, we can use one of the select helpers (see ?dplyr::select). For example, matches can match any column name that starts with sp followed by one or more digits:
mutate_at(vars(matches('sp\\d+')), list(~ replace(., !any(. == 1, na.rm = TRUE) & . == 0, NA)))
Is this what you are looking for?
library(dplyr)
table %>%
  group_by(site) %>%
  mutate(sp1 = if_else(
    !any(sp1 == 1) & sp1 == 0,
    "na",
    as.character(sp1)
  ))
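With newer dplyr (>= 1.0) the same idea extends to all species columns at once via across; a sketch, assuming every species column name starts with sp:
library(dplyr)
table %>%
  group_by(site) %>%
  mutate(across(starts_with("sp"),
                ~ if_else(!any(.x == 1) & .x == 0, "na", as.character(.x))))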
If I understand you right, you want a compact and fast solution that can be applied at once to an entire range of species columns from 1 to n.
I would first reshape the data to long format, then use by to set sp to NA per site wherever all of its values are elements of c(0, NA), and thirdly, optionally, reshape back to the original wide format.
tmp <- reshape(dat, varying=list(3:ncol(dat)), v.names="sp", idvar=1:2, direction="long")
tmp <- do.call(rbind, by(tmp, tmp[c("site", "time")], function(x)
  if (all(x$sp %in% c(0, NA))) cbind(x[-4], sp=NA) else x))
dat <- reshape(tmp, timevar="time", idvar=c("x", "site"), direction="wide", sep="")
dat
# x site sp1 sp2
# 1.1.1 1 1 1 0
# 2.1.1 2 1 1 0
# 3.1.1 3 1 <NA> 1
# 10.1.1 10 1 0 <NA>
# 4.2.1 4 2 <NA> 1
# 5.2.1 5 2 <NA> <NA>
# 6.2.1 6 2 <NA> <NA>
# 7.3.1 7 3 <NA> 0
# 8.3.1 8 3 <NA> 1
# 9.3.1 9 3 <NA> <NA>
If we want more speed, we could use melt and dcast from the data.table package for the reshaping, which almost doubles the speed. The code changes slightly:
library(data.table)
tmp <- melt(dat, id.vars=c("x", "site"), variable.name="time", value.name="sp")
tmp <- do.call(rbind, by(tmp, tmp[c("site", "time")], function(x)
  if (all(x$sp %in% c(0, NA))) cbind(x[-4], sp=NA) else x))
dcast(tmp, x + site ~ time, value.var="sp")
To test whether both work, expand the dataset to the number of Zoraptera species, which is 28, and run the code again:
set.seed(42)
n <- 28 - 2
add <- setNames(as.data.frame(
  replicate(n, factor(sample(c(1, 0, NA), nrow(dat), replace=TRUE)))),
  paste0("sp", 3:(n + 2)))
dat <- cbind(dat, add)
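With the expanded data in place, a rough timing comparison can be done with system.time. This is a sketch only: f_base and f_dt are hypothetical wrappers around the base-R reshape/by version and the data.table melt/dcast version shown above, and exact timings will vary by machine:
# f_base and f_dt are assumed (hypothetical) wrappers around the two approaches above
system.time(f_base(dat))  # base R: reshape() + by()
system.time(f_dt(dat))    # data.table: melt() + by() + dcast()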
Data
# I'd rather use a neutral name for the data, since `table` is a function name, see `?table`
dat <- structure(list(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), site = c(1,
1, 1, 2, 2, 2, 3, 3, 3, 1), sp1 = structure(c(2L, 2L, 3L, 3L,
1L, 1L, 3L, 3L, 3L, 1L), .Label = c("0", "1", "na"), class = "factor"),
sp2 = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 2L, 3L, 3L), .Label = c("0",
"1", "na"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
# first thing to do is make proper NAs!
levels(dat$sp1) <- levels(dat$sp2) <- c(0, 1, NA)

Count values in column then reset [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 4 years ago.
I am trying to create a column that counts the occurrences of each name, restarting from 1 each time the name changes, like this:
NAME ID
PIERRE 1
PIERRE 2
PIERRE 3
PIERRE 4
JACK 1
ALEXANDRE 1
ALEXANDRE 2
Reproducible data:
df <- structure(list(NAME = structure(c(3L, 3L, 3L, 3L, 2L, 1L, 1L),
  .Label = c("ALEXANDRE", "JACK", "PIERRE"), class = "factor")),
  class = "data.frame", row.names = c(NA, -7L))
You could build a sequence along the elements in each group (= Name):
ave(1:nrow(df), df$NAME, FUN = seq_along)
Or, if names may reoccur later on, and it should still count as a new group (= Name-change), e.g.:
groups <- cumsum(c(FALSE, df$NAME[-1]!=head(df$NAME, -1)))
ave(1:nrow(df), groups, FUN = seq_along)
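For instance, with a name that reoccurs later (a quick check on a hypothetical vector):
nm <- c("PIERRE", "PIERRE", "JACK", "PIERRE")
groups <- cumsum(c(FALSE, nm[-1] != head(nm, -1)))
ave(seq_along(nm), groups, FUN = seq_along)
#[1] 1 2 1 1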
Using dplyr and data.table (rleid comes from data.table):
library(dplyr)
library(data.table)
df %>%
  group_by(ID_temp = rleid(NAME)) %>%
  mutate(ID = seq_along(ID_temp)) %>%
  ungroup() %>%
  select(-ID_temp)
Or just data.table:
setDT(df)[, ID := seq_len(.N), by=rleid(NAME)]
Here's a quick way to do it.
First you can set up your data:
mydata <- data.frame("name"=c("PIERRE", "ALEX", "PIERRE", "PIERRE", "JACK", "PIERRE", "ALEX"))
Next, I add a dummy column of 1s that makes the solution inelegant:
mydata$placeholder <- 1
Finally, I add up the placeholder column (cumulative sum), grouped by the name column:
mydata$ID <- ave(mydata$placeholder, mydata$name, FUN=cumsum)
Since I started with unsorted names, my dataframe is currently unsorted, but that can be fixed with:
mydata <- mydata[order(mydata$name, mydata$ID),]
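For reference, the sorted result should look like this (the placeholder column is still attached and can be dropped afterwards):
mydata
#     name placeholder ID
# 2   ALEX           1  1
# 7   ALEX           1  2
# 5   JACK           1  1
# 1 PIERRE           1  1
# 3 PIERRE           1  2
# 4 PIERRE           1  3
# 6 PIERRE           1  4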

Return record where frequency count does not match [duplicate]

This question already has answers here:
duplicates in multiple columns
(2 answers)
Closed 5 years ago.
I am working on a dataset in R where each WO has records with Invoice values "K" and "B". I want the WO returned where the frequency per WO does not match between the "K" and "B" records. For example, take the following table:
df <- structure(list(WO = c(917595L, 917595L, 1011033L, 1011033L),
Invoice = c("B", "K", "B", "K"), freq = c(3L, 6L, 2L, 2L)),
.Names = c("WO", "Invoice", "freq"),
class = "data.frame", row.names = c(NA, -4L)
)
I want 917595 returned because 3 does not equal 6. However, 1011033 should not be returned because its frequencies match.
Reshaping the data lets you do a comparison on the frequency values.
library(dplyr)
library(reshape2)
dframe <-
  "WO,Invoice,freq
917595,B,3
917595,K,6
1011033,B,2
1011033,K,2" %>%
  read.csv(text = .,
           stringsAsFactors = FALSE)
dcast(dframe,
      WO ~ Invoice,
      value.var = "freq") %>%
  filter(B != K)
We could do it with base R using duplicated
df[!(duplicated(df[c(1, 3)]) | duplicated(df[c(1, 3)], fromLast = TRUE)), ]
#      WO Invoice freq
#1 917595       B    3
#2 917595       K    6
Or another option is to group by 'WO' and check if the number of unique elements in 'freq' is greater than 1
library(data.table)
setDT(df)[, if (uniqueN(freq) > 1) .SD, WO]
#       WO Invoice freq
#1: 917595       B    3
#2: 917595       K    6
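The same grouped check translates to dplyr as well, if you prefer that syntax; a sketch using n_distinct:
library(dplyr)
df %>%
  group_by(WO) %>%
  filter(n_distinct(freq) > 1) %>%  # keep WOs whose B/K frequencies differ
  ungroup()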

Count the number of duplicates in a column

My objective is to count how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; the values are all dates, with about 144 duplicates each, running from 1/4/16 to 7/3/16. Example (I put one duplicate of each for example's sake):
1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16
So I used the function date = count(date), where date is my df date. But once I execute it, my date sequence is not in order anymore. Hope someone can solve my problem.
If we need to count the total number of duplicates
sum(table(df1$date)-1)
#[1] 5
Suppose we need the count of each date; one option would be to group by 'date' and get the number of rows. This can be done with data.table:
library(data.table)
setDT(df1)[, .N, date]
If you want the count of the number of duplicates in your column, you can use duplicated:
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT
As per the update, if you want the count of each date, you can use the table function, which will give you exactly that:
table(df$V1)
# 1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
#      2       2       2       2       2
library(dplyr)
library(janitor)
df %>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
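For example, to count duplicated rows per date rather than in total, a sketch (assuming the column is named V1, as in the other answers):
library(dplyr)
library(janitor)
df %>%
  get_dupes(V1) %>%  # keep only rows whose V1 value occurs more than once
  group_by(V1) %>%
  tally()            # number of duplicated rows per date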
One way is to create a data frame with the unique values of your initial data, which preserves the order, and then use left_join from the dplyr package to join the two data frames. Note that the column name should be the same in both.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
"31/3/16"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
df1 <- unique(Initial_data)
count1 <- plyr::count(Initial_data)  # frequency of each date; plyr::count returns a "freq" column
left_join(df1, count1, by = 'V1')
#        V1 freq
#1  1/4/16    2
#2 31/3/16    2
#3 30/3/16    2
#4 29/3/16    2
#5 28/3/16    3
If you want to count the number of duplicated records, use:
sum(duplicated(df))
and when you want the proportion of duplicated records, use:
mean(duplicated(df))
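A quick check on a small vector:
x <- c("1/4/16", "1/4/16", "31/3/16")
sum(duplicated(x))   #[1] 1          -> one duplicated record
mean(duplicated(x))  #[1] 0.3333333  -> a third of the rows are duplicates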

Look up data in a data.table and add it to a new column

I have two data tables as shown below:
bigrams
         w1w2 freq         w1    w2
 common names    1     common names
department of    4 department    of
  family name    6     family  name
bigrams = setDT(structure(list(w1w2 = c("common names", "department of", "family name"
), freq = c(1L, 4L, 6L), w1 = c("common", "department", "family"
), w2 = c("names", "of", "name")), .Names = c("w1w2", "freq",
"w1", "w2"), row.names = c(NA, -3L), class = "data.frame"))
unigrams
w1 freq
common 2
department 3
family 4
name 5
names 1
of 9
unigrams = setDT(structure(list(w1 = c("common", "department", "family", "name",
"names", "of"), freq = c(2L, 3L, 4L, 5L, 1L, 9L)), .Names = c("w1",
"freq"), row.names = c(NA, -6L), class = "data.frame"))
desired output
         w1w2 freq         w1    w2 w1freq w2freq
 common names    1     common names      2      1
department of    4 department    of      3      9
  family name    6     family  name      4      5
What I have done so far
setkey(bigrams, w1)
setkey(unigrams, w1)
result <- bigrams[unigrams]
This gives me the i.freq column for w1, but when I try to do the same for w2, the i.freq column is updated to reflect the freq of w2 instead.
How can I get the freq for both w1 and w2 in separate columns?
Note: I have already seen solutions to data.table Lookup value and translate and Modify column of a data.table based on another column and add the new column
You can do two joins, and in v1.9.6 of data.table you can specify the on= argument to join on differing column names.
library(data.table)
bigrams[unigrams, on=c("w1"), nomatch = 0][unigrams, on=c(w2 = "w1"), nomatch = 0]
            w1w2 freq         w1    w2 i.freq i.freq.1
1:   family name    6     family  name      4        5
2:  common names    1     common names      2        1
3: department of    4 department    of      3        9
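To get the exact column names from the desired output (w1freq, w2freq), an alternative is two update joins, which add the columns by reference and keep the original row order; a sketch:
bigrams[unigrams, on = c(w1 = "w1"), w1freq := i.freq]
bigrams[unigrams, on = c(w2 = "w1"), w2freq := i.freq]
bigrams
#            w1w2 freq         w1    w2 w1freq w2freq
#1:  common names    1     common names      2      1
#2: department of    4 department    of      3      9
#3:   family name    6     family  name      4      5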
You can do this with a bit of reshaping.
library(dplyr)
library(tidyr)
bigrams %>%
  rename(w1w2_string = w1w2,
         w1w2_freq = freq) %>%
  gather(order, string,
         w1, w2) %>%
  left_join(unigrams %>%
              rename(string = w1)) %>%
  gather(type, value,
         string, freq) %>%
  unite(order_type, order, type) %>%
  spread(order_type, value)
Edit: Explanation
The first observation is that bigrams in fact contains information about three different units of analysis: a bigram and two unigrams. Convert to long form so that the unit of analysis is a unigram; then we can merge in the other unigram data. Next, note that the unigram data has two different pieces of information per row: the frequency of the unigram and the text of the unigram. Convert to long form again so that the unit of analysis is a single piece of information about a unigram. Now spread, so that each new column is a type of information about a unigram.
