I have the following dataframe:
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)
I would like to count the number of distinct user_id values for each column, counting only the rows where that column is not NA. The output I am hoping for is:
user_id event_id plan_id treatment_id system
1 4 4 3 3 2
I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the nonnull count and the count distinct for each column:
colSums(!is.empty(df[,]))
apply(df[,], 2, function(x) length(unique(x)))
Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately be thrown into a function to be applied with a number of other summary statistics to a list of data frames.
I have tried a brute-force method, where I set the values to 1 if not null and 0 otherwise, and then copy the id into that column where the value is 1. I can then use the count-distinct line from above to get my output. However, I get the wrong values when copying the id into the other columns, and the number of adjustments is suboptimal. See code:
binary <- cbind(df$user_id, !is.empty(df[,2:length(df)]))
copied <- binary %>% replace(. > 0, binary[.,1])
I'd greatly appreciate your help.
1: Base
sapply(df, function(x) {
  length(unique(df$user_id[!is.na(x)]))
})
# user_id event_id plan_id treatment_id system
# 4 4 3 3 2
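Since the question mentions applying this to a list of data frames, the sapply approach above could be wrapped in a small helper. This is only a sketch: the function name count_distinct_ids and its id_col argument are made up for illustration, and it assumes every data frame in the list contains the id column.

```r
# Hypothetical helper: for every column, count the distinct ids among
# the rows where that column is not NA.
count_distinct_ids <- function(data, id_col = "user_id") {
  ids <- data[[id_col]]
  sapply(data, function(x) length(unique(ids[!is.na(x)])))
}

user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)

# Apply to a list of data frames (the same one twice, purely for illustration)
res <- lapply(list(a = df, b = df), count_distinct_ids)
res$a
```

Each element of res is a named integer vector, so it can be combined with other per-column summaries inside a larger summary function.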
2: Base
aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
# ind user_id
#1 user_id 4
#2 event_id 4
#3 plan_id 3
#4 treatment_id 3
#5 system 2
3: tidyverse
library(tidyverse)
df %>%
  mutate(key = user_id) %>%
  pivot_longer(!key) %>%
  filter(!is.na(value)) %>%
  group_by(name) %>%
  summarise(value = n_distinct(key)) %>%
  pivot_wider()
## A tibble: 1 x 5
# event_id plan_id system treatment_id user_id
# <int> <int> <int> <int> <int>
#1 4 3 2 3 4
Thanks @dcarlson, I had misunderstood the question:
apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})
A data.table option with uniqueN
> setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
user_id event_id plan_id treatment_id system
1: 4 4 3 3 2
Using dplyr you can use summarise with across :
library(dplyr)
df %>% summarise(across(everything(), ~n_distinct(user_id[!is.na(.x)])))
# user_id event_id plan_id treatment_id system
#1 4 4 3 3 2
I am trying to find a tidyverse-based programmatic approach to calculating the number of variables meeting a condition in a dataset. The data contains a row for each individual and a variable containing a code that describes something about that individual. I need to efficiently create a tally of how many times that variable code meets multiple sets of criteria. My current process uses dplyr's mutate along with row-wise summing within a tidyverse pipeline to create the required tally.
Similar posts answer the question by summing rowwise, as I already do. In practice, this approach results in an extensive amount of code and slow processing, since there are five variables, thousands of individuals, and a dozen criteria to tally separately.
Here is a demonstration of what I've tried so far. The desired output here is calculated as if the condition were for the code in the variables to match 20 or 24.
## Sample data and result
library(dplyr)
sample <- tibble(
subjectNum = 1:10,
var1 = c(20, 24, 20, 1, 24, 27, 7, 21, 20, 3),
var2 = c(24, 20, 7, 19, 12, 8, 8, 10, 22, NA),
var3 = c(NA, NA, 24, 20, NA, 20, 9, 3, 24, NA),
desired_output = c(2, 2, 2, 1, 1, 1, 0, 0, 2, 0)
)
sample_calc <- sample %>%
rowwise() %>%
mutate(output = sum(var1 %in% c(20, 24), var2 %in% c(20, 24), var3 %in% c(20, 24), na.rm= TRUE))
all(sample_calc$output == sample_calc$desired_output) # should return TRUE
The actual analysis requires conducting such a test for multiple sets of criteria that are available in a separate data file. It also requires the data structure to generally be maintained, so solutions using pivot_longer to count the variables fail as well.
We can use the vectorized rowSums: loop across the columns that start with 'var', apply the condition to each one inside across, and take rowSums of the resulting logical columns. This should be more efficient than a rowwise sum.
library(dplyr)
sample %>%
mutate(output = rowSums(across(starts_with('var'),
~ .x %in% c(20, 24)), na.rm = TRUE))
-output
# A tibble: 10 × 6
subjectNum var1 var2 var3 desired_output output
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 24 NA 2 2
2 2 24 20 NA 2 2
3 3 20 7 24 2 2
4 4 1 19 20 1 1
5 5 24 12 NA 1 1
6 6 27 8 20 1 1
7 7 7 8 9 0 0
8 8 21 10 3 0 0
9 9 20 22 24 2 2
10 10 3 NA NA 0 0
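The question also mentions tallying against several sets of criteria read from a separate file. A base R sketch of that extension follows. The criteria list here (set_a, set_b) is invented for illustration, and sample_df is a plain data.frame copy of the sample data rather than the tibble used above.

```r
# Sample data from the question, as a base data.frame
sample_df <- data.frame(
  subjectNum = 1:10,
  var1 = c(20, 24, 20, 1, 24, 27, 7, 21, 20, 3),
  var2 = c(24, 20, 7, 19, 12, 8, 8, 10, 22, NA),
  var3 = c(NA, NA, 24, 20, NA, 20, 9, 3, 24, NA)
)

# Hypothetical criteria sets; the real ones would come from a separate file
criteria <- list(set_a = c(20, 24), set_b = c(7, 8, 9))

var_cols <- grep("^var", names(sample_df), value = TRUE)

# One tally column per criteria set. %in% never returns NA, so NA codes
# simply count as "no match".
tallies <- sapply(criteria, function(codes) {
  rowSums(sapply(sample_df[var_cols], function(col) col %in% codes))
})

# The original columns are kept and the tallies are appended
result <- cbind(sample_df, tallies)
```

The data keeps its one-row-per-subject shape, and adding another criteria set only means adding another element to the list.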
I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
library(dplyr)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter the df to the case that this is not true:
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. As you have more than 10000 rows, what remains is to see whether or not there is at least one instance that has n_unique > 1, for instance by filter(n_unique > 1)
If you run this (with dat being your data frame), you will see how many unique values of B there are for each value of A:
tapply(dat$B, dat$A, function(x) length(unique(x)))
So if the max of this vector is 1 then there are no values of A that have more than one corresponding value of B.
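As a sketch of that final check, using the altered example data from the earlier answer (the variable names is_paired and offending are only illustrative):

```r
dat <- data.frame(
  A = c(1, 1, 2, 9, 9, 2, 9),
  B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)

# Number of distinct B values for each value of A
n_unique <- tapply(dat$B, dat$A, function(x) length(unique(x)))

# TRUE only if every A maps to exactly one B
is_paired <- all(n_unique == 1)

# Values of A that break the pairing
offending <- names(n_unique)[n_unique > 1]
```

This only checks that A determines B. If the pairing should hold in both directions, the same check can be repeated with A and B swapped.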
I have the following dataframe
student_id <- c(1,1,1,2,2,2)
test_score <- c(100, 90, 80, 100, 70, 90)
test_type <- c("English", "English", "English", "Spanish", "Spanish", "Spanish")
time_period <- c(1, 0, 1, 0, 1, 0)
df <- data.frame(student_id, test_score, test_type, time_period)
I am trying to filter my observations so that each student_id has a Spanish test and an English test. I have tried the following:
df <- df %>%
group_by(student_id, test_type) %>%
dplyr::filter(row_number() == 1)
But this seems to only return values from the English test. Is there a way to return single observations from each student_id for English and Spanish tests?
Your example data does not contain any student who has taken both types of test, i.e., keeping only the students who have done both English and Spanish would leave you with an empty data frame. However, let's suppose the following is your data:
df <- data.frame(student_id = c(1,1,2,2,3,3),
test_score = c(100, 90, 80, 100, 70, 90),
test_type = c("English", "English", "English", "Spanish", "Spanish", "Spanish"),
time_period = c(1, 0, 1, 0, 1, 0)
)
Here, student 2 has done both, and we wish to filter for all students who have done both types of exams. One approach is to look at each student and count the number of unique exam types. If that is larger than 1, then we found the relevant rows (including students who have completed three or more languages).
library(dplyr)
df %>% group_by(student_id) %>%
mutate(n_dist = n_distinct(test_type)) %>%
filter(n_dist>1) %>%
select(-n_dist)
# A tibble: 2 x 4
# Groups: student_id [1]
student_id test_score test_type time_period
<dbl> <dbl> <fct> <dbl>
1 2 80 English 1
2 2 100 Spanish 0
This gives you all rows for student 2.
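A stricter variant, sketched in base R, checks explicitly for English and Spanish rather than for any two distinct test types. The names has_both and both are illustrative only, and the data is the modified example from this answer.

```r
df <- data.frame(
  student_id = c(1, 1, 2, 2, 3, 3),
  test_score = c(100, 90, 80, 100, 70, 90),
  test_type = c("English", "English", "English", "Spanish", "Spanish", "Spanish"),
  time_period = c(1, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# One logical per student: did this student take both test types?
has_both <- tapply(df$test_type, df$student_id, function(x) {
  all(c("English", "Spanish") %in% x)
})

# Look the flag up for every row by student id and keep the matching rows
keep <- as.vector(has_both[as.character(df$student_id)])
both <- df[keep, ]
```

With more than two languages in the data, this keeps a student only if both named types are present, whereas counting distinct types would also keep, say, a student with French and German.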
Having said that, it is a bit unclear what you wish to achieve, but if all you want is the first row per student x test_type combination, then your code does work. Another option is to use slice, as in:
df %>% group_by(student_id, test_type) %>% slice(1)
# A tibble: 4 x 4
# Groups: student_id, test_type [4]
student_id test_score test_type time_period
<dbl> <dbl> <fct> <dbl>
1 1 100 English 1
2 2 80 English 1
3 2 100 Spanish 0
4 3 70 Spanish 1
Note I am using df as defined in my answer above.
I have a data set
customerId <- c(101,101,101,102,102,102,104,104,106,109,109,109)
Purchasedate<- c("2020-06-19","2020-06-20","2020-06-21","2020-06-24","2020-06-27","2020-06-28","2020-06-20","2020-06-21"
,"2020-06-24","2020-06-10","2020-06-14","2020-06-16")
df <- data.frame(customerId,Purchasedate)
I am trying to find out following output
101 3
104 2
as customer ids 101 and 104 are the only ones whose purchase dates are all consecutive.
Using R, I am trying to find the customer ids that made purchases on consecutive days, and for how many days.
Maybe you could consider checking for difference between dates using group_by for each id, filter by those ids where the difference is always 1, and summarise to total up the number of rows/dates:
library(dplyr)
df %>%
group_by(customerId) %>%
mutate(diffDays = c(1, diff(Purchasedate))) %>%
filter(n_distinct(diffDays) == 1 & n() > 1) %>%
summarise(continuousDays = n())
Output
customerId continuousDays
<dbl> <int>
1 101 3
2 104 2
Data
df <- structure(list(customerId = c(101, 101, 101, 102, 102, 102, 104,
104, 106, 109, 109, 109), Purchasedate = structure(c(18432, 18433,
18434, 18437, 18440, 18441, 18433, 18434, 18437, 18423, 18427,
18429), class = "Date")), row.names = c(NA, -12L), class = "data.frame")
A similar question was answered here: https://stackoverflow.com/a/53713204/12400385
Adapting it slightly to your case
library(dplyr)
library(lubridate)
# Convert Purchasedate to a date column
df <- df %>%
mutate(Purchasedate = ymd(Purchasedate))
# Create custom function
max_consec <- function(x) {
  y <- c(unclass(diff(x))) # c and unclass -- preparing it for rle
  r <- rle(y)
  with(r, max(lengths[values==1]) + 1)
}
# Apply function to each customer
df %>%
group_by(customerId) %>%
summarize(max.consecutive = max_consec(Purchasedate))
#-------
# A tibble: 5 x 2
customerId max.consecutive
<dbl> <dbl>
1 101 3
2 102 2
3 104 2
4 106 -Inf
5 109 -Inf
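The -Inf values appear because max() is called on an empty vector when a customer has no pair of consecutive days. A possible guarded variant is sketched below in base R. The name max_consec_safe is made up, and returning 1 for customers without any consecutive pair is an assumption about what is wanted, not part of the answer above.

```r
customerId <- c(101, 101, 101, 102, 102, 102, 104, 104, 106, 109, 109, 109)
Purchasedate <- c("2020-06-19", "2020-06-20", "2020-06-21", "2020-06-24",
                  "2020-06-27", "2020-06-28", "2020-06-20", "2020-06-21",
                  "2020-06-24", "2020-06-10", "2020-06-14", "2020-06-16")
df <- data.frame(customerId, Purchasedate, stringsAsFactors = FALSE)

# Hypothetical guarded version: longest run of consecutive days,
# or 1 when no two purchases fall on consecutive days.
max_consec_safe <- function(x) {
  x <- sort(as.Date(x))
  if (length(x) < 2) return(1L)
  r <- rle(as.numeric(diff(x)))      # day gaps between successive purchases
  runs <- r$lengths[r$values == 1]   # lengths of runs of one-day gaps
  if (length(runs) == 0) return(1L)
  max(runs) + 1L
}

res <- tapply(df$Purchasedate, df$customerId, max_consec_safe)
res
```

Filtering res for values greater than 1 then gives the customers with at least two consecutive purchase days.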
I have looked into dplyr and tidyr and even base R but I cannot seem to figure out how to subset my data based on a row value.
I have tried using dplyr filter() and select() functions but because gender, language, and age are in the id column, I cannot filter by just typing data %>% filter(gender == 1).
I have a list of 50 raters. For the example here I will display 5. I have 183 rows, which include the raters' answers to each question, and the last three rows hold demographic data, such as age, gender, and whether someone is a native or non-native speaker. I will illustrate here with 6 rows as an example.
What I am trying to do is find a way to subset my data according to the values in the age, gender, and language rows. Let's say I want to select all the ratings for gender 1, or for language 1, or for gender 1 AND language 1.
Thank you.
Code:
data <- data.frame("id" = c(901,902,903,"age",
"gender",
"language"),
"rater1" = c(7, 9, 9, 21, 1, 1),
"rater2" = c(9, 9, 9, 39, 2, 2),
"rater3" = c(9, 9, 9, 38, 2, 1),
"rater4" = c(9, 9, 9, 33, 2, 1),
"rater5" = c(2, 9, 9, 21, 2, 1))
In order to filter by gender and the other variables of interest we will need to rearrange the data so that they are columns and not rows within a column. One way we can do that is to use gather and then spread. After changing the structure you can utilize dplyr filtering.
library(dplyr)
library(tidyr)
data <- data %>%
gather("Rater",rater1:rater5, value = "Value") %>%
spread(id, value = Value) %>%
filter(gender == 1)
Well, I am not sure whether this scales well for your use case but you could do basic indexing:
# data
x <- data.frame("id" = c(901,902,903,"age","gender","language"),
"rater1" = c(7, 9, 9, 21, 1, 1),
"rater2" = c(9, 9, 9, 39, 2, 2),
"rater3" = c(9, 9, 9, 38, 2, 1),
"rater4" = c(9, 9, 9, 33, 2, 1),
"rater5" = c(2, 9, 9, 21, 2, 1))
# ensure id is character and not factor
x$id <- as.character(x$id)
# select all raters whose gender or language is 1
x[, c(TRUE, x[x$id == "gender", -1] == 1) |
c(TRUE, x[x$id == "language", -1] == 1) ]
The TRUE ensures that the id column is kept in any case and the -1 ensures that the logical vector has the desired length (number of columns).
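The question also asks for gender 1 AND language 1. The same indexing idea can be sketched with & in place of |. The intermediate names is_gender1 and is_lang1 are only illustrative, and drop = FALSE is added so that the result stays a data frame even if no rater matches.

```r
x <- data.frame(id = c(901, 902, 903, "age", "gender", "language"),
                rater1 = c(7, 9, 9, 21, 1, 1),
                rater2 = c(9, 9, 9, 39, 2, 2),
                rater3 = c(9, 9, 9, 38, 2, 1),
                rater4 = c(9, 9, 9, 33, 2, 1),
                rater5 = c(2, 9, 9, 21, 2, 1),
                stringsAsFactors = FALSE)

# One logical per rater column (the id column is dropped with -1)
is_gender1 <- unname(unlist(x[x$id == "gender", -1])) == 1
is_lang1 <- unname(unlist(x[x$id == "language", -1])) == 1

# Keep the id column plus the raters that satisfy both conditions
sel <- x[, c(TRUE, is_gender1 & is_lang1), drop = FALSE]
sel
```

With the example data only rater1 satisfies both conditions, so sel contains the id and rater1 columns.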
I'd suggest working with two data frames, one (I call demo) for the demographic information on raters, 1 row per rater, and one (I call ratings) for the ratings each rater gave, 1 row per response:
library(tidyr)
library(dplyr)
demo = tail(data, 3)
ratings = head(data, -3)
demo_cols = demo$id
demo = data.frame(t(demo[-1]))
names(demo) = demo_cols
demo$rater = as.numeric(sub(pattern = "rater", replacement = "", rownames(demo)))
demo
# age gender language rater
# rater1 21 1 1 1
# rater2 39 2 2 2
# rater3 38 2 1 3
# rater4 33 2 1 4
# rater5 21 2 1 5
ratings = tidyr::pivot_longer(ratings, cols = starts_with("rater"),
names_to = "rater", names_prefix = "rater") %>%
mutate(rater = as.numeric(rater))
ratings
# # A tibble: 15 x 3
# id rater value
# <fct> <dbl> <dbl>
# 1 901 1 7
# 2 901 2 9
# 3 901 3 9
# 4 901 4 9
# 5 901 5 2
# 6 902 1 9
# ...
Then, when you want to do something like "select all the ratings for gender 1, or for language 1, or for gender 1 AND language 1", you do a simple filter of demo, and join to the ratings data to get the matching records:
demo %>% filter(gender == 1 & language == 1) %>%
inner_join(ratings)
# Joining, by = "rater"
# age gender language rater id value
# 1 21 1 1 1 901 7
# 2 21 1 1 1 902 9
# 3 21 1 1 1 903 9
You could also do the complete join, ratings_with_demo = inner_join(ratings, demo), and filter that data frame directly. But remember that, if you do this, each row is a response. If you want to do something like count the number of raters by gender, the demo data frame is a much nicer starting place.
Just turn it on its side. Make sure to turn id into row names first, and then remove id to prevent type coercion. t also returns a matrix, so you'll need to turn the data back into a data frame with as_tibble or as.data.frame:
library(dplyr)
data <- as_tibble(t(`rownames<-`(data, data$id)[-1]))
Now filter should do what you expect:
data %>% filter(gender == 1)
#### OUTPUT ####
# A tibble: 1 x 6
`901` `902` `903` age gender language
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7 9 9 21 1 1