Create a new column with data.table that counts unique values


ID
1
1
2
3
3
3
3
I want to add a column, using data.table, that counts how many rows have each ID value (the 1s, 2s, 3s, etc.). The final data.table would be:
ID CountID
1 2
1 2
2 1
3 4
3 4
3 4
3 4
I tried this, but it does not work:
df[, CountID := uniqueN(df, by = ID)]

Using the dplyr package, count the rows per id and merge the counts back:
library(dplyr)
df1 = group_by(df, id) %>% count()
merge(df, df1)
id n
1 1 3
2 1 3
3 1 3
4 2 1
5 3 4
6 3 4
7 3 4
8 3 4
9 4 2
10 4 2
Data
df = data.frame('id' = c( 1 , 1 , 1, 2, 3, 3, 3, 3, 4, 4))
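As a possible shortcut for the count-then-merge pattern above, dplyr also provides add_count(), which appends the per-group row count directly. This is only a minimal sketch on the same df; the column name n is add_count()'s default, and res is a name chosen here for illustration.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 3, 3, 3, 3, 4, 4))

# add_count() keeps every row and appends the group size as column `n`
res <- add_count(df, id)
res$n
# [1] 3 3 3 1 4 4 4 4 2 2
```

Because the rows are never collapsed, no merge step is needed afterwards.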

data.table
You can use .N, the number of rows in the current by group, for this:
library(data.table)
DT[, CountID := .N, by = ID]
DT
# ID CountID
# <int> <int>
# 1: 1 2
# 2: 1 2
# 3: 2 1
# 4: 3 4
# 5: 3 4
# 6: 3 4
# 7: 3 4
base R
DT$CountID2 <- ave(rep(1L, nrow(DT)), DT$ID, FUN = length)
Data
DT <- setDT(structure(list(ID = c(1L, 1L, 2L, 3L, 3L, 3L, 3L), CountID = c(2L, 2L, 1L, 4L, 4L, 4L, 4L)), class = c("data.table", "data.frame"), row.names = c(NA, -7L)))
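To see why .N rather than uniqueN fits the question, the sketch below compares the two on the same ID values. It assumes only the data.table package; the column names RowsPerID and DistinctIDs are made up for illustration.

```r
library(data.table)

DT <- data.table(ID = c(1L, 1L, 2L, 3L, 3L, 3L, 3L))

# .N is the number of rows in each `by` group, which is the count asked for
DT[, RowsPerID := .N, by = ID]

# uniqueN() counts distinct values; over the whole ID column that is one number
DT[, DistinctIDs := uniqueN(ID)]

DT$RowsPerID
# [1] 2 2 1 4 4 4 4
DT$DistinctIDs
# [1] 3 3 3 3 3 3 3
```

So uniqueN answers "how many different IDs are there?", while .N with by = ID answers "how many rows share this ID?".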

Related

Creating loop to count the number of unique values in column based on values in another column

So, for example, I have the following dataframe, data:
col1 col2
1 5
1 5
1 3
2 10
2 11
3 11
Now, I want to make a new column, col3, which gives me the number of unique values in col2 for every grouping in col1.
So far, I have the following code:
length(unique(data$col2[data$col1 == 1]))
Which would here return the number 2.
However, I'm having a hard time making a loop that goes through all the values in col1 to create the new column, col3.

We can use n_distinct after grouping
library(dplyr)
data <- data %>%
group_by(col1) %>%
mutate(col3 = n_distinct(col2)) %>%
ungroup
-output
data
# A tibble: 6 × 3
col1 col2 col3
<int> <int> <int>
1 1 5 2
2 1 5 2
3 1 3 2
4 2 10 2
5 2 11 2
6 3 11 1
Or with data.table
library(data.table)
setDT(data)[, col3 := uniqueN(col2), col1]
data
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))

You want the counts for every row, so using a for loop you would do
data$col3 <- NA_real_
for (i in seq_len(nrow(data))) {
data$col3[i] <- length(unique(data$col2[data$col1 == data$col1[i]]))
}
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
However, this for loop recomputes the group count once per row, which is wasteful, and in this case we can use the grouping function ave, which comes with base R (the \(x) function shorthand below needs R 4.1 or later).
data <- transform(data, col3=ave(col2, col1, FUN=\(x) length(unique(x))))
data
# col1 col2 col3
# 1 1 5 2
# 2 1 5 2
# 3 1 3 2
# 4 2 10 2
# 5 2 11 2
# 6 3 11 1
Data:
data <- structure(list(col1 = c(1L, 1L, 1L, 2L, 2L, 3L), col2 = c(5L,
5L, 3L, 10L, 11L, 11L)), class = "data.frame", row.names = c(NA,
-6L))
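Another way to avoid the row-by-row loop, sketched here under the assumption that a base R solution is wanted, is to compute one count per group with tapply and then look each row's group up by name. The intermediate name counts is hypothetical.

```r
data <- data.frame(col1 = c(1L, 1L, 1L, 2L, 2L, 3L),
                   col2 = c(5L, 5L, 3L, 10L, 11L, 11L))

# One distinct-value count per col1 group, named by the group value
counts <- tapply(data$col2, data$col1, function(x) length(unique(x)))

# Look up each row's group count by name; as.vector() drops names and dims
data$col3 <- as.vector(counts[as.character(data$col1)])
data$col3
# [1] 2 2 2 2 2 1
```

The count is computed once per group (three times here) instead of once per row.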

reverse code select vars in df1 conditional on lookup values in df2

I am converting old base R code into tidyverse and could use some help. I want to reverse code some vars in df1 conditional on the variable being tagged as positive==1 in a lookup table df2. Here's my base R solution:
library(tidyverse)
set.seed(1)
df1 <- data.frame(item1 = sample(1:4, 10, replace = TRUE),
item2 = sample(1:4, 10, replace = TRUE),
item3 = sample(1:4, 10, replace = TRUE))
df1
# item1 item2 item3
# 1 2 1 4
# 2 2 1 1
# 3 3 3 3
# 4 4 2 1
# 5 1 4 2
# 6 4 2 2
# 7 4 3 1
# 8 3 4 2
# 9 3 2 4
# 10 1 4 2
df2 <- data.frame(name = c("item1", "item2"),
positive = c(1, 0))
# name positive
# 1 item1 1
# 2 item2 0
vars <- c("item1", "item2")
# reverse code if positive==1
# 4=1, 3=2, 2=3, 1=4
for (i in vars) {
if (df2$positive[df2$name==i]==1) {
df1[i] <- 4 - df1[, i] + 1 # should reverse code item1
}
}
df1
# item1 item2 item3
# 1 3 1 4
# 2 3 1 1
# 3 2 3 3
# 4 1 2 1
# 5 4 4 2
# 6 1 2 2
# 7 1 3 1
# 8 2 4 2
# 9 2 2 4
# 10 4 4 2

We can use mutate_at, where we select the columns by subsetting the 'name' column with the binary values of 'positive' converted to logical, and replace each value x in those columns with 4 - x + 1 (that is, 5 - x)
library(dplyr)
dfn <- df1 %>%
mutate_at(vars(intersect(names(.),
as.character(df2$name)[as.logical(df2$positive)])), ~ 4 - . + 1)
dfn
# item1 item2 item3
#1 3 1 4
#2 3 1 1
#3 2 3 3
#4 1 2 1
#5 4 4 2
#6 1 2 2
#7 1 3 1
#8 2 4 2
#9 2 2 4
#10 4 4 2
Or with base R
vars1 <- with(df2, as.character(name[as.logical(positive)]))
df1[vars1] <- lapply(df1[vars1], function(x) 4 - x + 1)
data
df1 <- structure(list(item1 = c(2L, 2L, 3L, 4L, 1L, 4L, 4L, 3L, 3L,
1L), item2 = c(1L, 1L, 3L, 2L, 4L, 2L, 3L, 4L, 2L, 4L), item3 = c(4L,
1L, 3L, 1L, 2L, 2L, 1L, 2L, 4L, 2L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
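In dplyr 1.0.0 and later, mutate_at is superseded by across, so the same idea could be written as in the sketch below. It assumes a 1 to 4 response scale, where reversing maps x to (min + max) - x = 5 - x; the name to_reverse is hypothetical.

```r
library(dplyr)

df1 <- data.frame(item1 = c(2L, 2L, 3L, 4L, 1L, 4L, 4L, 3L, 3L, 1L),
                  item2 = c(1L, 1L, 3L, 2L, 4L, 2L, 3L, 4L, 2L, 4L),
                  item3 = c(4L, 1L, 3L, 1L, 2L, 2L, 1L, 2L, 4L, 2L))
df2 <- data.frame(name = c("item1", "item2"), positive = c(1, 0))

# Names of the items flagged positive == 1 in the lookup table
to_reverse <- as.character(df2$name[df2$positive == 1])

# Reverse only those columns; all_of() errors if a listed name is missing
dfn <- df1 %>% mutate(across(all_of(to_reverse), ~ 5L - .x))
dfn$item1
# [1] 3 3 2 1 4 1 1 2 2 4
```

Columns not listed in to_reverse (item2 and item3 here) are left unchanged.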

how to determine duplicate rows where not all are the same in a column?

Suppose I want to find duplicate rows based on the columns:
cols<-c("col1", "col2")
I know that for the data df4 the duplicate rows are:
Jo<-df4[duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE), ]
and the data set with these duplicate rows removed is given by:
No<-df4[!(duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE)), ]
I want to modify the code above. Suppose there is a column called mode. It takes integers between 1 to 4. A group of duplicate rows should not be treated as duplicates when every row in it has mode == 2.
example
col1 col2 mode
1 3 5
5 3 9
1 2 1
1 2 1
3 2 2
3 2 2
4 1 3
4 1 2
4 1 2
output
Jo:
col1 col2 mode
1 2 1
1 2 1
4 1 3
4 1 2
4 1 2
No:
col1 col2 mode
1 3 5
5 3 9
3 2 2
3 2 2
In the example above, the two rows with col1 = 3 and col2 = 2 both have mode == 2, so they are not duplicates; in the last three rows one mode is not 2, so they are duplicates.

Based on the updated dataset,
library(dplyr)
out1 <- df2 %>%
group_by_at(vars(cols)) %>%
filter(n() > 1, !all(mode ==2))
out2 <- anti_join(df2, out1)
out1
# A tibble: 5 x 3
# Groups: col1, col2 [2]
# col1 col2 mode
# <int> <int> <int>
#1 1 2 1
#2 1 2 1
#3 4 1 3
#4 4 1 2
#5 4 1 2
out2
# col1 col2 mode
#1 1 3 5
#2 5 3 9
#3 3 2 2
#4 3 2 2
Or with data.table
library(data.table)
i1 <- setDT(df2)[ , .I[.N > 1 & !all(mode == 2)], by = cols]$V1
df2[i1]
# col1 col2 mode
#1: 1 2 1
#2: 1 2 1
#3: 4 1 3
#4: 4 1 2
#5: 4 1 2
df2[!i1]
# col1 col2 mode
#1: 1 3 5
#2: 5 3 9
#3: 3 2 2
#4: 3 2 2
Or using base R
i1 <- duplicated(df2[1:2])|duplicated(df2[1:2], fromLast = TRUE)
out11 <- df2[i1 & with(df2, !ave(mode==2, col1, col2, FUN = all)),]
out22 <- df2[setdiff(row.names(df2), row.names(out11)),]
data
df2 <- structure(list(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), mode = c(5L,
9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-9L))
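The rule used in all three answers (a group counts as duplicated when it has more than one row and not every mode is 2) can also be spelled out as explicit per-row flags. This is a minimal base R sketch; key, grp_size, all_mode2 and is_dup are names introduced here for illustration.

```r
df2 <- data.frame(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
                  col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
                  mode = c(5L, 9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L))

# One grouping key per (col1, col2) combination that actually occurs
key <- interaction(df2$col1, df2$col2, drop = TRUE)

# Group size and "every mode in the group is 2", repeated on each row
grp_size  <- ave(rep(1L, nrow(df2)), key, FUN = length)
all_mode2 <- ave(df2$mode == 2, key, FUN = all)

is_dup <- grp_size > 1 & !all_mode2
Jo <- df2[is_dup, ]   # duplicated groups, excluding all-mode-2 groups
No <- df2[!is_dup, ]  # everything else
rownames(Jo)
# [1] "3" "4" "7" "8" "9"
```

Since No is defined as the complement of the same flag, every row lands in exactly one of the two data frames.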

R: reordering columns based on order of different column

I have the following data:
x y id
1 2
2 2 1
3 4
5 6 2
3 4
2 1 3
The blanks in column id should have the same values as the next id value. Meaning my data should actually look like this:
x y id
1 2 1
2 2 1
3 4 2
5 6 2
3 4 3
2 1 3
I also have a list:
list[[1]] = 1 3 2
Or alternatively a column:
c(1,3,2) = 1, 3, 2
Now I would like to reorder my data based on column id according to the order in the list. My data should then look like this:
x y id
1 2 1
2 2 1
3 4 3
2 1 3
3 4 2
5 6 2
Is there an efficient way to do this?
EDIT: I don't think it is a duplicate of "in R Sorting by absolute value without changing the data", because I do not want to sort by absolute value but by a specific order that is given in a list.

A base R option would be (assuming that the blanks in the 'id' column are NA)
i1 <- !is.na(df1$id)
df1[i1,][match(df1$id[i1], list[[1]]),] <- df1[i1, ]
df1
# x y id
#1 1 2 NA
#2 2 2 1
#3 3 4 NA
#4 2 1 3
#5 3 4 NA
#6 5 6 2
If we need to replace each NA with the next non-NA element
library(zoo)
df1$id <- na.locf(df1$id, fromLast = TRUE)
data
df1 <- structure(list(x = c(1L, 2L, 3L, 5L, 3L, 2L), y = c(2L, 2L, 4L,
6L, 4L, 1L), id = c(NA, 1L, NA, 2L, NA, 3L)), class = "data.frame",
row.names = c(NA, -6L))
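One way to get the layout asked for in the question, sketched here in base R only, is to fill the blanks first and then sort by each id's position in the desired order. The helper fill_next and the vector ord are hypothetical names; ord stands in for list[[1]] from the question.

```r
df1 <- data.frame(x = c(1L, 2L, 3L, 5L, 3L, 2L),
                  y = c(2L, 2L, 4L, 6L, 4L, 1L),
                  id = c(NA, 1L, NA, 2L, NA, 3L))
ord <- c(1, 3, 2)  # desired order of the id values

# Replace each NA with the next non-NA value, walking from the end backwards
fill_next <- function(x) {
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

df1$id <- fill_next(df1$id)

# match() gives each id's position in `ord`; order() sorts rows by that position
res <- df1[order(match(df1$id, ord)), ]
res$id
# [1] 1 1 3 3 2 2
res$x
# [1] 1 2 3 2 3 5
```

Rows that share an id keep their original relative order, because order() breaks ties by position.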

Conditionally remove rows from a database using R

ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records.
Each person has exactly one record whose Number is 1; the rest are 2.
The variable Var has different values for the same person.
When Number equals 1, the corresponding Var (call it P) differs between persons.
Now, for every person, I want to delete the rows whose Var > P.
In the end, I want this:
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5

You can use dplyr::first on Var[Number==1] to get each person's threshold Var value
library(dplyr)
df %>% group_by(ID) %>% mutate(Flag=first(Var[Number==1])) %>%
filter(Var <= Flag) %>% select(-Flag)
# short version, if you are sure each ID has exactly one row with Number==1
df %>% group_by(ID) %>% filter(Var <= Var[Number==1])

Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5

A base R option would be
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x!=0])),]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")
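The threshold P can also be looked up explicitly with match, which may be easier to read than the ave expression. This is a sketch under the question's assumption of at most one Number == 1 row per ID; ref, P and kept are names chosen for illustration.

```r
df1 <- data.frame(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
                  Number = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L),
                  Var = c(6L, 7L, 8L, 9L, 10L, 3L, 4L, 5L, 6L))

# Rows that define each person's threshold P
ref <- df1[df1$Number == 1, ]

# Look up P for every row via its ID (NA if an ID has no Number == 1 row)
P <- ref$Var[match(df1$ID, ref$ID)]

# which() drops the NA comparisons, so IDs without a threshold are removed
kept <- df1[which(df1$Var <= P), ]
kept$Var
# [1] 6 7 8 3 4 5
```

Whether IDs without a Number == 1 row should be dropped or kept is a design choice; this sketch drops them.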
