R - Distinct count across columns

I have a large and complex dataset (a bit too complex to share here, and probably not necessary to share the whole thing), but the snippet in the update below shows what it looks like. This is just one day; the full sample spans hundreds of days.
What I want to do is devise a way to count the variation of Genre within each Row. To put it more simply (I hope): each Row has 12 Columns and I want to measure the variation of Genre across those 12 Columns (it's the BBC iPlayer, which many of you might be familiar with). E.g. if a Row comprises 4 "sport", 4 "drama", and 4 "documentary", there would be a distinct count of 3 genres.
I'm thinking that a simple distinct count would be a good way to measure variation within each row (the higher the distinct count, the higher the variation), but it's not a very nuanced approach. I.e. if a row comprises 11 "sport" and 1 "documentary", it's a distinct count of 2. If it comprises 6 "sport" and 6 "documentary", it's still a distinct count of 2 - so distinct count doesn't really capture that difference.
I guess I'm asking for advice on two things here:
Firstly, what would be the most appropriate way to measure variation of Genre within each Row?
Secondly, how would I go about doing that? I.e. what code / packages would I need?
I hope that's all clear, but if not, I'd be happy to elaborate on anything. It's perhaps worth noting (as I mentioned above) that I want to determine variation on a specific date, and the sample data shared here is just one date (but I have hundreds).
Thanks in advance :)
*** Update ***
Thanks for the comments below - especially about sharing a snapshot of the real data (which you'll find below). My apologies - I'm a bit of a novice in this area and not really familiar with the proper conventions!
Here's a sample of the data - I hope it's right and I hope it helps:
structure(list(Row = c(0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L), Genre = c("", "Sport", "Drama", "Documentary",
"Entertainment", "Drama", "Comedy", "Crime Drama", "Entertainment",
"Documentary", "Entertainment", "History", "Crime Drama", "",
"", "", "", "", "", "", "", "", "", "", "", "Drama", "Drama",
"Documentary", "Entertainment", "Period Drama"), Column = c(1L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L, 5L)), row.names = c(NA,
30L), class = "data.frame")

First create some reproducible data. All we need is Row and Genre:
set.seed(42)
Row <- rep(1:10, each=10)
Genre <- sample(c("Sport", "Drama", "Documentary", "Entertainment", "History", "Crime Drama", "Period Drama", "Film - Comedy", "Film-Thriller"), 100, replace=TRUE)
example <- data.frame(Row, Genre)
str(example)
# 'data.frame': 100 obs. of 2 variables:
# $ Row : int 1 1 1 1 1 1 1 1 1 1 ...
# $ Genre: chr "Sport" "History" "Sport" "Film-Thriller" ...
Now to get the number of different genres in each row:
Count <- tapply(example$Genre, example$Row, function(x) length(unique(x)))
Count
# 1 2 3 4 5 6 7 8 9 10
# 7 5 6 7 6 8 7 7 7 6
There are 7 genres in row 1 and only 5 in row 2. For more detail:
xtabs(~Genre+Row, example)
# Row
# Genre 1 2 3 4 5 6 7 8 9 10
# Crime Drama 0 0 1 1 3 1 1 0 2 1
# Documentary 0 1 1 1 1 0 1 0 1 0
# Drama 1 1 1 3 2 1 2 1 1 0
# Entertainment 2 2 3 1 1 1 1 2 0 0
# Film - Comedy 1 0 3 2 0 1 2 2 0 2
# Film-Thriller 1 3 0 0 0 1 1 1 2 2
# History 1 3 0 1 2 1 2 2 2 1
# Period Drama 1 0 0 1 0 2 0 1 1 2
# Sport 3 0 1 0 1 2 0 1 1 2
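On the first part of the question - a measure more nuanced than a distinct count - one option (my suggestion; nothing in the question prescribes it) is the Shannon entropy of the genre proportions: it is 0 when a row is all one genre and grows as genres become more evenly spread.
Entropy <- tapply(example$Genre, example$Row, function(x) {
  p <- table(x) / length(x)  # genre proportions within the row
  -sum(p * log(p))           # 0 = single genre; log(k) = k genres evenly spread
})
round(Entropy, 2)
With this measure, 11 "sport" and 1 "documentary" scores about 0.29 while a 6/6 split scores log(2) ≈ 0.69, so the even split registers as more varied even though both have a distinct count of 2.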

Reproducible sample data:
set.seed(42)
sampdata <- transform(
  expand.grid(Date = Sys.Date() + 0:2, Row = 0:3, Column = 1:12),
  Genre = sample(c("Crime Drama", "Documentary", "Drama", "Entertainment"),
                 size = 48, replace = TRUE)
)
head(sampdata)
# Date Row Column Genre
# 1 2022-02-18 0 1 Crime Drama
# 2 2022-02-19 0 1 Crime Drama
# 3 2022-02-20 0 1 Crime Drama
# 4 2022-02-18 1 1 Crime Drama
# 5 2022-02-19 1 1 Documentary
# 6 2022-02-20 1 1 Entertainment
nrow(sampdata)
# [1] 144
Using dplyr and tidyr, we can group, summarize, then pivot:
library(dplyr)
# library(tidyr) # pivot_wider
sampdata %>%
  group_by(Date, Row) %>%
  summarize(
    Uniq = n_distinct(Genre),
    Var = var(table(Genre))
  ) %>%
  tidyr::pivot_wider(
    id_cols = Date,  # name id_cols explicitly; positional use errors in newer tidyr
    names_from = Row, values_from = c(Uniq, Var)
  ) %>%
  ungroup()
# # A tibble: 3 x 9
# Date Uniq_0 Uniq_1 Uniq_2 Uniq_3 Var_0 Var_1 Var_2 Var_3
# <date> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 2022-02-18 2 3 2 2 0 3 18 18
# 2 2022-02-19 3 3 1 3 3 3 NA 3
# 3 2022-02-20 2 3 3 3 18 3 3 3
Two things: the Uniq_# columns are per-Row counts of distinct Genre values, and the Var_# columns are the variance of those counts. For instance, in your example, two genres with counts 6 and 6 will have a variance of 0, but counts of 11 and 1 will have a variance of 50 (var(c(11, 1))), indicating more variation for that Date/Row combination.
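You can check those two variances directly:
var(c(6, 6))   # 0: a perfectly even 6/6 split
var(c(11, 1))  # 50: a skewed 11/1 split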
Because we use group_by, if you have even more grouping variables, it is straightforward to extend this, both in the grouping and in what aggregation we can do in addition to n_distinct(.) and var(.).
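For instance, with a hypothetical extra grouping variable Channel (not in the sample data, just for illustration):
# sketch only: Channel is a made-up column standing in for any extra grouping variable
sampdata %>%
  group_by(Date, Channel, Row) %>%
  summarize(
    Uniq = n_distinct(Genre),
    Var = var(table(Genre)),
    .groups = "drop"
  )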
BTW: depending on your other calculations, analysis, and reporting/plotting, it might be useful to keep this in the long format, removing the pivot_wider.
sampdata %>%
  group_by(Date, Row) %>%
  summarize(
    Uniq = n_distinct(Genre),
    Var = var(table(Genre))
  ) %>%
  ungroup()
# # A tibble: 12 x 4
# Date Row Uniq Var
# <date> <int> <int> <dbl>
# 1 2022-02-18 0 2 0
# 2 2022-02-18 1 3 3
# 3 2022-02-18 2 2 18
# 4 2022-02-18 3 2 18
# 5 2022-02-19 0 3 3
# 6 2022-02-19 1 3 3
# 7 2022-02-19 2 1 NA
# 8 2022-02-19 3 3 3
# 9 2022-02-20 0 2 18
# 10 2022-02-20 1 3 3
# 11 2022-02-20 2 3 3
# 12 2022-02-20 3 3 3
Good examples of when to keep it long include further aggregation by Date/Row and plotting with ggplot2 (which really rewards long-data).
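For instance, a quick sketch (assuming ggplot2 is installed) plotting the distinct-genre counts per Row over time, straight from the long summary:
library(ggplot2)
sampdata %>%
  group_by(Date, Row) %>%
  summarize(Uniq = n_distinct(Genre), .groups = "drop") %>%
  ggplot(aes(Date, Uniq, colour = factor(Row))) +
  geom_line() +
  geom_point()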

Related

R - delete rows according to the value of another row

I am quite a beginner in R but thanks to the community of Stackoverflow I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables; the relevant ones are:
id_house, which identifies each household
id_ind, an id which takes the value 1 for the first individual in the household, 2 for the next, 3 for the third...
indicator_tb_men, which indicates whether the first person answered the survey (1 = yes, 0 = no). All the other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
  arrange(id_house, id_ind) %>%
  group_by(id_house) %>%
  filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
In base R we can use nested logic:
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
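The same one-liner, unpacked into two steps for readability (identical logic; the answered variable is just an intermediate name for illustration):
# households whose first member (id_ind == 1) answered the survey
answered <- df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1]
# keep every row belonging to one of those households
df[df$id_house %in% answered, ]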
Data: Using Ronak Shah's data

Conditionally remove rows from a database using R

I have a data frame like this:
ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6
Each person has several records, and each person has exactly one record with Number equal to 1; the rest have Number 2.
The variable Var takes different values for the same person.
Where Number equals 1, the corresponding Var (call it P) differs from person to person.
Now, for each person, I want to delete the rows where Var > P.
At the end, I want this
ID Number Var
1 2 6
1 2 7
1 1 8
2 2 3
2 2 4
2 1 5
You can use dplyr::first where Number == 1 to get the first Var value:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Flag = first(Var[Number == 1])) %>%
  filter(Var <= Flag) %>%
  select(-Flag)

# shorter version, if you are sure there is exactly one Number == 1 per ID
df %>% group_by(ID) %>% filter(Var <= Var[Number == 1])
Here is a solution with data.table:
library(data.table)
dt <- fread(
"ID Number Var
1 2 6
1 2 7
1 1 8
1 2 9
1 2 10
2 2 3
2 2 4
2 1 5
2 2 6")
dt[, .SD[Var <= Var[Number==1]], ID]
# ID Number Var
# 1: 1 2 6
# 2: 1 2 7
# 3: 1 1 8
# 4: 2 2 3
# 5: 2 2 4
# 6: 2 1 5
A base R option: Var * (Number == 1) zeroes out every row except the Number == 1 row, and ave() then broadcasts that row's nonzero value (P) within each ID:
df1[with(df1, Var <= ave(Var * (Number == 1), ID, FUN = function(x) x[x != 0])), ]
# ID Number Var
#1 1 2 6
#2 1 2 7
#3 1 1 8
#6 2 2 3
#7 2 2 4
#8 2 1 5
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Number = c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L), Var = c(6L, 7L, 8L, 9L, 10L,
3L, 4L, 5L, 6L)), row.names = c(NA, -9L), class = "data.frame")

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new dataframe that repeats these observations according to the number of observations per id in the original data. I am able to extract the last observation for each id using the following code:
library(dplyr)
df <- df %>%
  group_by(id) %>%
  filter((x == 0 & row_number() == n()) | (x == 1 & row_number() == n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to compute, for each row, the maximum row number within its ID; subsetting the data frame with that index vector repeats the last observation of each ID once per original row.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
  group_by(id) %>%
  mutate(time = last(time),
         x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
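A tiny toy example (made up here, not from the question) of that recycling:
tibble(g = c(1, 1, 1, 2, 2), v = 1:5) %>%
  group_by(g) %>%
  mutate(v_last = last(v))
# v_last is 3 on every row of group 1 and 5 on every row of group 2,
# because the length-1 result of last(v) is recycled within each group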
This can also be applied to an arbitrary number of variables using mutate_at (superseded by across() in current dplyr):
df %>%
  group_by(id) %>%
  mutate_at(vars(-id), ~ last(.))
slice will be your friend in the tidyverse I reckon - rep(n(), n()) repeats the index of the last row n() times within each group:
df %>%
  group_by(id) %>%
  slice(rep(n(), n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too (merging the full id column against the de-duplicated last rows repeats each last row once per original row):
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Tidy up and reshape messy dataset (reshape/gather/unite function)?

Following my earlier question:
R: reshape/gather function to create dataset ready for multilevel analysis
I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story:
I have a big dataset, 240 cases. Each row is a case (breast cancer patient). Somewhere at the end of the dataset (say from column 417 onwards) I have partner data of the patients; the partners also filled in a questionnaire.
In the beginning, there are demographic variables for both patients and partners, followed by test outcomes only of patients, then followed by partner data.
I want to create a dataset where I 'split' the patient and partner data but keep it coupled. Thus: I want to duplicate the subject ID and create a new column with 1s and 2s (1 corresponding to patient and 2 to partner).
Then, I want my data much as it is now, but some variables can be matched up (for example, I now have "date of birth" for the patient [pgebdat] and for the partner [prgebdat] separately; of course, I can turn this into 'gebdat' with the two birth dates below each other).
This code worked for me for a small subset of my data:
mydf_long <- mydf4 %>%
  unite(bb1:bb50rec, col = `1`, sep = ";") %>%      # combine the patient responses
  unite(pbb1:pbb50recM, col = `2`, sep = ";") %>%   # combine the partner responses
  gather(couple, value, `1`:`2`) %>%                # form into long data
  separate(value, sep = ";",
           into = paste0("bb", 1:104, ",")) %>%     # separate and retrieve original answers
                                                    # (names come out as "bb1," etc., as below)
  arrange(id)
results in:
id groep_MNC zkhs fbeh pgebdat couple bb1,
1 3 1 1 1 1955-12-01 1 4
2 3 1 1 1 1955-12-01 2 5
3 5 1 1 1 1943-04-09 1 2
4 5 1 1 1 1943-04-09 2 2
But this copies the patient's date of birth into the partner row as well.
I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:
Example of data
id groep_MNC zkhs fbeh pgebdat p_age pgesl prgebdat pr_age prgesl relpnst
1 3 1 1 1 1955-12-01 42.50000 1 <NA> NA 2 1
2 5 1 1 1 1943-04-09 55.16667 1 1962-04-18 36.50000 1 2
3 7 1 1 1 1958-04-10 40.25000 1 <NA> NA 2 1
4 10 1 1 1 1958-04-17 40.25000 1 1957-07-31 41.33333 2 1
5 12 1 1 2 1947-11-01 50.66667 1 1944-06-08 54.58333 2 1
And then, after a couple of hundred patient-only variables, this partner data comes along:
pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
1 5 5 5 5 2 5 4 2 3
2 2 1 4 1 3 4 3 3 4
3 5 3 4 4 4 3 5 3 4
4 5 3 5 5 5 5 4 4 4
5 5 5 5 5 5 4 4 3 4
note, I didn't create this dataset myself - I'm just here to tidy up the mess :)
Edit: The dataset is in Dutch. pgesl = gender for patient, prgesl = gender for partner, etc.
Using the melt function from the data.table package, you can specify multiple measures with patterns() and as a result create more than one value column:
library(data.table)
melt(setDT(df), measure.vars = patterns('_age', 'gesl', 'gebdat'),
     value.name = c('age', 'geslacht', 'geboortedatum')
)[, variable := c('patient', 'partner')[variable]][]
you get:
id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable age geslacht geboortedatum
1: 3 1 1 1 1 5 5 patient 42.50000 1 1955-12-01
2: 5 1 1 1 2 2 1 patient 55.16667 1 1943-04-09
3: 7 1 1 1 1 5 3 patient 40.25000 1 1958-04-10
4: 10 1 1 1 1 5 3 patient 40.25000 1 1958-04-17
5: 12 1 1 2 1 5 5 patient 50.66667 1 1947-11-01
6: 3 1 1 1 1 5 5 partner NA 2 <NA>
7: 5 1 1 1 2 2 1 partner 36.50000 1 1962-04-18
8: 7 1 1 1 1 5 3 partner NA 2 <NA>
9: 10 1 1 1 1 5 3 partner 41.33333 2 1957-07-31
10: 12 1 1 2 1 5 5 partner 54.58333 2 1944-06-08
Instead of patterns you could also use a list of column indexes or columnnames.
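For example, the equivalent call with explicit column names instead of patterns() (using the same df as below):
melt(setDT(df),
     measure.vars = list(c("p_age", "pr_age"),
                         c("pgesl", "prgesl"),
                         c("pgebdat", "prgebdat")),
     value.name = c("age", "geslacht", "geboortedatum")
)[, variable := c("patient", "partner")[variable]][]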
HTH
Used data:
df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L),
groep_MNC = c(1L, 1L, 1L, 1L, 1L),
zkhs = c(1L, 1L, 1L, 1L, 1L),
fbeh = c(1L, 1L, 1L, 1L, 2L),
pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
pgesl = c(1L, 1L, 1L, 1L, 1L),
prgebdat = c("<NA>", "1962-04-18", "<NA>", "1957-07-31", "1944-06-08"),
pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
prgesl = c(2L, 1L, 2L, 2L, 2L),
relpnst = c(1L, 2L, 1L, 1L, 1L),
pbb1 = c(5L, 2L, 5L, 5L, 5L),
pbb2 = c(5L, 1L, 3L, 3L, 5L)),
.Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5"))

How can I exclude zeros when finding a frequency and separate into 4 categories

I have a data frame data.2016 and am trying to find the frequency with which "DIPL" occurs (excluding zeros); "DIPL" is the number of worm parasites found in a fish.
Data looks something like this:
data.2016
Site DIPL
1 0
1 1
1 1
2 6
2 8
2 1
2 1
3 0
3 0
3 0
4 1258
4 501
I want the output to look like this:
Site freq
1 2
2 4
3 0
4 2
From this I can interpret that, of the 3 fish found at site 1, 2 of them had worm parasites.
I've tried
aggregate(DIPL ~ Site, data = data.2016, frequency) # and get:
Site DIPL
1 1 1
2 2 1
3 3 1
4 4 1
Is there a way to count the number of fish with worms from the DIPL column (meaning the value in the column is higher than zero) per site?
Just use a custom function that removes the zeros.
aggregate(DIPL ~ Site, data.2016, function(x) length(x[x != 0])) # or sum(x != 0)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
Another option would be to temporarily transform the DIPL column then just take the sum.
aggregate(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0), sum)
# Site DIPL
# 1 1 2
# 2 2 4
# 3 3 0
# 4 4 2
xtabs() is fun too ...
xtabs(DIPL ~ Site, transform(data.2016, DIPL = DIPL != 0))
# Site
# 1 2 3 4
# 2 4 0 2
By the way, frequency is for use on time-series data.
Data:
data.2016 <- structure(list(Site = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L), DIPL = c(0L, 1L, 1L, 6L, 8L, 1L, 1L, 0L, 0L, 0L, 1258L,
501L)), .Names = c("Site", "DIPL"), class = "data.frame", row.names = c(NA,
-12L))
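For comparison, a dplyr equivalent (an alternative, assuming you have dplyr installed) that also reproduces the freq column name from the expected output:
library(dplyr)
data.2016 %>%
  group_by(Site) %>%
  summarize(freq = sum(DIPL != 0))
# Sites 1-4 give freq 2, 4, 0, 2, as in the desired output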
Might something like this be what you're looking for?
# first some fake data
site <- c("A","A","A","B","B","B")
numworms <- c(1,0,3,0,0,42)
data.frame(site,numworms)
site numworms
1 A 1
2 A 0
3 A 3
4 B 0
5 B 0
6 B 42
tapply(numworms, site, function(x) sum(x>0))
A B
2 1
