Count the number of duplicates in a column - r

My objective is to count how many duplicates there are in a column. I have a column of 3516 obs. of 1 variable; they are all dates, with about 144 duplicates of each, from 1/4/16 to 7/3/16. Example (I put 1 duplicate of each for example's sake):

1/4/16
1/4/16
31/3/16
31/3/16
30/3/16
30/3/16
29/3/16
29/3/16
28/3/16
28/3/16

So I used date = count(date), where date is my df date. But once I execute it, my date sequence is not in order anymore. Hope someone can solve my problem.

If we need to count the total number of duplicates (table() counts each date, subtracting 1 leaves the extra copies of each, and sum() totals them):
sum(table(df1$date)-1)
#[1] 5
Suppose we need the count of each date; one option would be to group by 'date' and get the number of rows. This can be done with data.table:
library(data.table)
setDT(df1)[, .N, date]
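For comparison, a dplyr sketch of the same per-date count. Note that count() returns one row per group, ordered by the grouping values, which is likely why the asker's date sequence changed: the dates sort as strings, not chronologically.
library(dplyr)
df1 %>% count(date)   # columns: date, n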

If you want the count of the number of duplicates in your column, you can use duplicated:
sum(duplicated(df$V1))
#[1] 5
Assuming V1 as your column name.
EDIT
As per the update, if you want the count of each date, you can use the table function, which will give you exactly that:
table(df$V1)
# 1/4/16 28/3/16 29/3/16 30/3/16 31/3/16
#      2       2       2       2       2
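If a data frame of counts is handier than a named table, the same result can be wrapped (a small sketch):
as.data.frame(table(df$V1))   # columns Var1 (the date) and Freq (its count)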

library(dplyr)
library(janitor)
df %>% get_dupes(Variable) %>% tally()
You can add group_by in the pipe too if you want.
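A minimal runnable sketch of the pipeline, assuming the column is named V1 as in the other answers:
library(dplyr)
library(janitor)
df <- data.frame(V1 = c("1/4/16", "1/4/16", "31/3/16", "31/3/16", "30/3/16"))
df %>% get_dupes(V1) %>% tally()                    # total rows belonging to a duplicated date
df %>% get_dupes(V1) %>% group_by(V1) %>% tally()   # per-date counts among the duplicates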

One way is to create a data frame with the unique values of your initial data, which will preserve the order, and then use left_join from the dplyr package to join the two data frames. Note that the column names should match.
Initial_data <- structure(list(V1 = structure(c(1L, 1L, 5L, 5L, 4L, 4L, 3L, 3L,
  2L, 2L, 2L), .Label = c("1/4/16", "28/3/16", "29/3/16", "30/3/16",
  "31/3/16"), class = "factor")), .Names = "V1", class = "data.frame",
  row.names = c(NA, -11L))
library(dplyr)                        # left_join()
df1 <- unique(Initial_data)           # unique values, original order preserved
count1 <- plyr::count(Initial_data)   # frequency of each value in the full data, not of df1
left_join(df1, count1, by = 'V1')
#        V1 freq
#1   1/4/16    2
#2  31/3/16    2
#3  30/3/16    2
#4  29/3/16    2
#5  28/3/16    3

If you want to count the number of duplicated records, use:
sum(duplicated(df))
and when you want the proportion of duplicates, use:
mean(duplicated(df))
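For an actual percentage rather than a proportion, scale by 100:
100 * mean(duplicated(df))   # percentage of duplicated rows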

Related

Using lapply to sum a subset of a dataframe

I'm quite new to R and using lapply. I have a large dataframe and I'm attempting to use lapply to output the sum of some subsets of this dataframe.
group_a  group_b  n_variants_a  n_variants_b
      1       NA             1             2
     NA        2             5             4
      1        2             2             0
I want to look at subsets based on multiple different groups (group_a, group_b) and sum each column of n_variants.
Running this over just one group and n_variant set works:
sum(subset(df, !is.na(group_a))$n_variants_a)
However, I want to sum every n_variants column based on every grouping. My lapply script for this outputs values of 0 for each sum:
summed_variants <- lapply(list_of_groups, function(g) {
  lapply(list_of_variants, function(v) {
    sum(subset(df, !is.na(g))$v)
  })
})
I was wondering if I need to use paste0 to paste the list of variants in, but I couldn't get this to work.
Thanks for your help!
We may use Map/mapply for this - loop over the group names and their corresponding 'n_variants' columns (assuming they are in order), extract the columns by name, apply the !is.na condition to subset the 'n_variants' column, and take the sum:
mapply(function(x, y) sum(df1[[y]][!is.na(df1[[x]])]),
       names(df1)[1:2], names(df1)[3:4])
group_a group_b
      3       4
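As an aside, the reason the nested lapply in the question returns 0 is that g and v are character strings there: subset(df, !is.na(g)) tests the string g (which is never NA), and $v looks for a column literally named "v". Indexing with [[ fixes both; a sketch of a corrected nested version, assuming the group and variant names are supplied as character vectors:
list_of_groups   <- c("group_a", "group_b")
list_of_variants <- c("n_variants_a", "n_variants_b")
summed_variants <- lapply(list_of_groups, function(g) {
  sapply(list_of_variants, function(v) {
    sum(df1[[v]][!is.na(df1[[g]])])   # [[ works with a column name stored in a string
  })
})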
Or another option can be done using tidyverse: loop across the 'n_variants' columns, get the column name (cur_column()), replace the substring to obtain the matching 'group' column, get() its values, build the condition to subset the column, and take the sum.
library(stringr)
library(dplyr)
df1 %>%
  summarise(across(contains('variants'),
    ~ sum(.x[!is.na(get(str_replace(cur_column(), 'n_variants', 'group')))])))
-output
  n_variants_a n_variants_b
1            3            4
data
df1 <- structure(list(group_a = c(1L, NA, 1L), group_b = c(NA, 2L, 2L
), n_variants_a = c(1L, 5L, 2L), n_variants_b = c(2L, 4L, 0L)),
class = "data.frame", row.names = c(NA,
-3L))

How can I apply case_when(mapply(adist, x, y) <= 3 ~ x, TRUE ~ y) to columns of different length and order

Hi, I have been trying for a while to match two large columns of names; several have different spellings, etc. So far I have written some code to practice on a smaller dataset:
examples %>% mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1, TRUE ~ example_2))
This manages to create a new column with the name from example_1 if it is within an edit distance of 3. However, it does not give the name from example_2 when that criterion is not met, which I need it to do.
This code also only works on the adjacent row of each column, whereas I need it to work on a dataset whose two columns differ in length (one is larger, so they can't be put in the same order).
It also needs to skip the NAs in the smaller column of names (they are there to pad it to the same length as the other one).
Anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
  mutate(across(contains("example"), as.character)) %>%
  mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1,
                            TRUE ~ example_2))
#           example_1         example_2             new_ID
#1     sheilaovensnew    sheilowansknew     sheilowansknew
#2     sandramaymeres      candramymars       candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4    grarryfieldsred   grarryfieldsred    grarryfieldsred
#5         terrifrank        terryfrenk         terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
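For the unequal-length case asked about, one possible approach (a hedged sketch, not part of the answer above) is to compute the full edit-distance matrix with adist() and, for each name in the shorter vector, take its closest match in the longer one. Here short_names and long_names are hypothetical stand-ins for your two columns, with the padding NAs already dropped:
short_names <- c("sandramaymeres", "terrifrank")
long_names  <- c("candramymars", "terryfrenk", "sheilowansknew")
d    <- adist(short_names, long_names)   # rows = short names, cols = long names
best <- apply(d, 1, which.min)           # index of the nearest long name
new_ID <- ifelse(d[cbind(seq_along(best), best)] <= 3,
                 short_names,            # keep the short spelling if within 3 edits
                 long_names[best])       # otherwise fall back to the closest long spelling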

Count values in column then reset [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 4 years ago.
I am trying to create a column that counts occurrences of each name, restarting each time the name changes, like this:
NAME       ID
PIERRE      1
PIERRE      2
PIERRE      3
PIERRE      4
JACK        1
ALEXANDRE   1
ALEXANDRE   2
Reproducible data
structure(list(NAME = structure(c(3L, 3L, 3L, 3L, 2L, 1L, 1L),
  .Label = c("ALEXANDRE", "JACK", "PIERRE"), class = "factor")),
  class = "data.frame", row.names = c(NA, -7L))
You could build a sequence along the elements in each group (= Name):
ave(1:nrow(df), df$NAME, FUN = seq_along)
Or, if names may reoccur later on, and it should still count as a new group (= Name-change), e.g.:
groups <- cumsum(c(FALSE, df$NAME[-1]!=head(df$NAME, -1)))
ave(1:nrow(df), groups, FUN = seq_along)
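With the reproducible data above, both calls return the ID column directly (they agree here because no name reoccurs after a change):
# [1] 1 2 3 4 1 1 2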
Using dplyr and data.table (rleid() comes from data.table):
library(dplyr)
library(data.table)
df %>%
  group_by(ID_temp = rleid(NAME)) %>%
  mutate(ID = seq_along(ID_temp)) %>%
  ungroup() %>%
  select(-ID_temp)
Or just data.table:
setDT(df)[, ID := seq_len(.N), by=rleid(NAME)]
Here's a quick way to do it.
First you can set up your data:
mydata <- data.frame("name"=c("PIERRE", "ALEX", "PIERRE", "PIERRE", "JACK", "PIERRE", "ALEX"))
Next, I add a dummy column of 1s that makes the solution inelegant:
mydata$placeholder <- 1
Finally, I add up the placeholder column (cumulative sum), grouped by the name column:
mydata$ID <- ave(mydata$placeholder, mydata$name, FUN=cumsum)
Since I started with unsorted names, my dataframe is currently unsorted, but that can be fixed with:
mydata <- mydata[order(mydata$name, mydata$ID),]
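For completeness, a compact dplyr equivalent of the same grouped cumulative count (a sketch; row_number() plays the role of the cumsum over the placeholder column):
library(dplyr)
mydata %>%
  group_by(name) %>%
  mutate(ID = row_number()) %>%
  ungroup()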

Create function to count values across list of columns

R folks:
I have a dataframe with many sets of columns. Each set is a bank of survey items. I would like to count the number of columns in each set having a certain value. I wrote a function to do this but it results in a list of repeated values that is appended to my dataframe.
df<- structure(list(RespondentID = c(6764279930, 6779986023, 6760279439,
6759243066),
q1 = c(3L, 3L, 4L, 1L),
q2 = c(2L, 2L, 4L, 4L),
q3 = c(4L, 2L, 4L, 5L),
q0010_0004 = c(1L, 2L, 3L, 1L)),
.Names = c("RespondentID", "q1", "q2", "q3", "q4"),
row.names = c(NA, 4L), class = "data.frame")
group1<-c("q1","q2","q3","q4")
# Objective: Count number of ratings==4 for each row
# Make function that receives list of columns &
# then returns ONE column in dataframe with total # columns
# having certain value (in this case, 4)
countcol <- function(colgroup) {
  s <- subset(df, select = c(colgroup))  # select only the columns designated by the list
  s$sum <- Reduce("+", apply(X = s, 1, FUN = function(x) sum(x == 4, na.rm = TRUE)))  # count instances of value == 4
  s2 <- subset(s, select = c(sum))  # keep ONE column with the result for each row
  return(s2$sum)
}
countcol(group1)
My function countcol runs without errors but, as stated above, results in what appears to be a transposed list of results for each row. I would like to have ONE number for each row that indicates the count of values.
I attempted various apply functions here but could not prevail. Anyone have a tip?
Thanks!
rowSums can give the result the OP is looking for. This returns the count of ratings == 4 for each row:
rowSums(df[2:5]==4)
#1 2 3 4
#1 0 3 1
Or just part of the function from the OP can give the answer:
apply(df[2:5], 1, function(x)(sum(x==4)))
#1 2 3 4
#1 0 3 1
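If you want to keep the column-group abstraction from the question, the same idea can be wrapped over the group vector (a sketch using the group1 defined in the question):
count_value <- function(colgroup, value = 4) {
  rowSums(df[colgroup] == value, na.rm = TRUE)   # one count per row
}
count_value(group1)
#1 2 3 4
#1 0 3 1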

Merging in R based on column and row

For a sample dataframe:
survey <- structure(list(id = 1:10, cntry = structure(c(2L, 3L, 1L, 2L,
2L, 3L, 1L, 1L, 3L, 2L), .Label = c("DE", "FR", "UK"), class = "factor"),
age.cat = structure(c(1L, 1L, 2L, 4L, 1L, 3L, 4L, 4L, 1L,
2L), .Label = c("Y_15.24", "Y_40.54", "Y_55.plus", "Y_less.15"
), class = "factor")), .Names = c("id", "cntry", "age.cat"
), class = "data.frame", row.names = c(NA, -10L))
I want to add an extra column that is populated from another dataframe, age.cat:
age.cat <- structure(list(cntry = structure(c(2L, 3L, 1L), .Label = c("DE",
"FR", "UK"), class = "factor"), Y_less.15 = c(0.2, 0.2, 0.3),
Y_15.24 = c(0.2, 0.1, 0.2), Y_25.39 = c(0.2, 0.3, 0.1), Y_40.54 = c(0.3,
0.2, 0.1), Y_55.plus = c(0.1, 0.2, 0.3)), .Names = c("cntry",
"Y_less.15", "Y_15.24", "Y_25.39", "Y_40.54", "Y_55.plus"), class = "data.frame", row.names = c(NA,
-3L))
The age.cat dataframe lists the proportions of people in the three countries by age category. The proportion for the corresponding country/age category needs to be added as an additional column in the survey dataframe. Previously, with a single country, I used merge, but as I understand it that wouldn't work here, since I need to match on both a column and a row.
Does anyone have any ideas?
Using data.table, I'd do this directly as follows:
require(data.table) # v1.9.6+
dt1[dt2, ratio := unlist(mget(age.cat)), by=.EACHI, on="cntry"]
where,
dt1 = as.data.table(survey)[, age.cat := as.character(age.cat)]
dt2 = as.data.table(age.cat)
For each row in dt2, the matching rows in dt1$cntry are found corresponding to dt2$cntry (it helps to think of it like a subset operation by matching on the cntry column). The age.cat values for those matching rows are extracted and passed to the mget() function, which looks for variables named with the values in age.cat, finds them in dt2 (we allow columns in dt2 to be visible for exactly this purpose), and extracts the corresponding values. Since it returns a list, we unlist it. Those values are assigned to the column ratio by reference.
Since this avoids unnecessary materialising of intermediate data by melting/gathering, it is quite efficient. Additionally, since it adds a new column by reference while joining, it avoids another intermediate materialisation and is doubly efficient.
Personally, I find the code much more straightforward to understand as to what's going on (with sufficient base R knowledge of course), but that is of course subjective.
Slightly more detailed explanation:
The general form of data.table syntax is DT[i, j, by] which reads:
Take DT, subset rows by i, then compute j grouped by by.
The i argument in data.table, in addition to being subset operations e.g., dt1[cntry == "FR"], can also be another data.table.
Consider the expression: dt1[dt2, on="cntry"].
The first thing it does is to compute, for each row in dt2, all matching row indices in dt1 by matching on the column provided in on = "cntry". For example, for dt2$cntry == "FR", the matching row indices in dt1 are c(1,4,5,10). These row indices are internally computed using fast binary search.
Once the matching row indices are computed it looks as to whether an expression is provided in the j argument. In the above expression j is empty. Therefore it returns all the columns from both dt1 and dt2 (leading to a right join).
In other words, data.table allows join operations to be performed in a similar fashion to subsets (because in both operations, the purpose of i argument is to obtain matching rows). For example, dt1[cntry == "FR"] would first compute the matching row indices, and then extract all columns for those rows (since no columns are provided in the j argument). This has several advantages. For example, if we would only like to return a subset of columns, then we can do, for example:
dt1[dt2, .(cntry, Y_less.15), on="cntry"]
This is efficient because we look at the j expression and notice that only those two columns are required. Therefore, on the computed row indices, we only extract the required columns, thereby avoiding unnecessary materialisation of all the other columns. Hence it is efficient.
Also, just like how we can select columns, we can also compute on columns. For example, what if you'd like to get sum(Y_less.15)?
dt1[dt2, sum(Y_less.15), on="cntry"]
# [1] 2.3
This is great, but it computes the sum on all the matching rows. What if you'd like to get the sum for each row in dt2$cntry? This is where by = .EACHI comes in.
dt1[dt2, sum(Y_less.15), on="cntry", by=.EACHI]
# cntry V1
# 1: FR 0.2
# 2: UK 0.2
# 3: DE 0.3
by=.EACHI ensures that the j expression is evaluated for each row in i = dt2.
Similarly, we can also add/update columns while joining using the := operator. And that's the answer shown above. The only tricky part there is to extract the values for those matching rows from dt2, since they are stored in separate columns. Hence we use mget(). And the expression unlist(mget(.)) gets evaluated for each row in dt2 while matching on "cntry" column. And the corresponding values are assigned to ratio by using the := operator.
For more details on history of := operator see this, this and this post on SO.
For more on by=.EACHI, see this post.
For more on data.table syntax introduction and reference semantics, see the vignettes.
Hope this helps.
You can turn age.cat into long format and then use a join, as follows:
library(dplyr)
library(tidyr)
age.cat <- gather(age.cat, age.cat, proportion, -cntry)
inner_join(survey, age.cat)
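With more recent tidyr (1.0+), the same reshape can be written with pivot_longer, since gather() is superseded (a sketch under that assumption):
library(dplyr)
library(tidyr)
age.cat.long <- pivot_longer(age.cat, -cntry,
                             names_to = "age.cat", values_to = "proportion")
inner_join(survey, age.cat.long, by = c("cntry", "age.cat"))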
We can do a join after melting the second dataset to 'long' format
library(data.table) #v1.9.7
melt(setDT(age.cat), id.var="cntry")[survey, on = c("cntry", "variable" = "age.cat")]
# cntry variable value id
# 1: FR Y_15.24 0.2 1
# 2: UK Y_15.24 0.1 2
# 3: DE Y_40.54 0.1 3
# 4: FR Y_less.15 0.2 4
# 5: FR Y_15.24 0.2 5
# 6: UK Y_55.plus 0.2 6
# 7: DE Y_less.15 0.3 7
# 8: DE Y_less.15 0.3 8
# 9: UK Y_15.24 0.1 9
#10: FR Y_40.54 0.3 10
If we are using the CRAN version, i.e. data.table_1.9.6:
melt(setDT(age.cat), id.var="cntry", variable.name = "age.cat")[survey,
    on = c("cntry", "age.cat")]
You can do this using the packages reshape2 and dplyr:
age.cat %>% melt(variable.name="age.cat") %>% left_join(survey, .)
#### id cntry age.cat value
#### 1 1 FR Y_15.24 0.2
#### 2 2 UK Y_15.24 0.1
#### 3 3 DE Y_40.54 0.1
Is that what you want?
