Counting Number of Times Each Row is Duplicated in R

I want to count the number of times each row appears in my dataset, which consists of five columns. I tried using table; however, it only seems to work for counting duplicates in one column, not several, since I get the error
attempt to make a table with >= 2^31 elements
As a quick example, say my dataframe is as follows:
dat <- data.frame(
SSN = c(204,401,204,666,401),
Name=c("Blossum","Buttercup","Blossum","MojoJojo","Buttercup"),
Age = c(7,8,7,43,8),
Gender = c(0,0,0,1,0)
)
How do I add another column with how many times each row appears in this dataframe?

With dplyr, we could group by all columns:
dat %>%
group_by(across(everything())) %>%
mutate(n = n())
# # A tibble: 5 x 5
# # Groups: SSN, Name, Age, Gender [3]
# SSN Name Age Gender n
# <dbl> <chr> <dbl> <dbl> <int>
# 1 204 Blossum 7 0 2
# 2 401 Buttercup 8 0 2
# 3 204 Blossum 7 0 2
# 4 666 MojoJojo 43 1 1
# 5 401 Buttercup 8 0 2
(mutate(n = n()) has a shortcut, add_tally(), if you prefer. Use summarize(n = n()) or count() if you want to collapse the data frame to the unique rows while adding counts.)

Using the data.table package: setDT() converts the data.frame into a data.table in place (by reference).
The := operator then modifies dat in place, adding the row count (.N) computed within groups defined by all columns of dat (by = names(dat)).
Note: the result of an in-place modification is returned invisibly, so you need to print it explicitly or append [] to the call (dat[, ...][]).
setDT(dat)
dat[,by=names(dat),N:=.N][]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 204 Blossum 7 0 2
#> 4: 666 MojoJojo 43 1 1
#> 5: 401 Buttercup 8 0 2
or (to collapse lines)
setDT(dat)
dat[,by=names(dat),.N]
#> SSN Name Age Gender N
#> 1: 204 Blossum 7 0 2
#> 2: 401 Buttercup 8 0 2
#> 3: 666 MojoJojo 43 1 1

We can use add_count without grouping as well
library(dplyr)
dat %>%
add_count(across(everything()))
-output
# SSN Name Age Gender n
#1 204 Blossum 7 0 2
#2 401 Buttercup 8 0 2
#3 204 Blossum 7 0 2
#4 666 MojoJojo 43 1 1
#5 401 Buttercup 8 0 2

I am not sure which is your desired output. Below are some base R options
> aggregate(
+ cnt ~ .,
+ cbind(dat, cnt = 1),
+ sum
+ )
SSN Name Age Gender cnt
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 666 MojoJojo 43 1 1
> transform(
+ cbind(dat, n = 1),
+ n = ave(n, SSN, Name, Age, Gender, FUN = sum)
+ )
SSN Name Age Gender n
1 204 Blossum 7 0 2
2 401 Buttercup 8 0 2
3 204 Blossum 7 0 2
4 666 MojoJojo 43 1 1
5 401 Buttercup 8 0 2
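The error in the question comes from how table() works: given several columns, it allocates one cell for every combination of their unique values, whether or not that combination occurs in the data. A minimal base R sketch of that effect, and of one way around it by counting on a single pasted key (the "|" separator and the names n_cells and key are illustrative assumptions; the separator must not occur in the data):

```r
dat <- data.frame(
  SSN = c(204, 401, 204, 666, 401),
  Name = c("Blossum", "Buttercup", "Blossum", "MojoJojo", "Buttercup"),
  Age = c(7, 8, 7, 43, 8),
  Gender = c(0, 0, 0, 1, 0)
)

# Number of cells table(dat) would allocate:
# the product of the number of unique values in each column.
n_cells <- prod(sapply(dat, function(x) length(unique(x))))

# Collapse each row into one string key, then count rows per key.
# Assumption: "|" never appears inside the data itself.
key <- do.call(paste, c(dat, sep = "|"))
dat$n <- ave(seq_len(nrow(dat)), key, FUN = length)
```

Here n_cells is 54 even though the data contain only 3 distinct rows; with many distinct values per column the product can quickly exceed the 2^31 limit mentioned in the error.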

Related

Remove if unit only has one observation

I have a long form of clinical data that looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102,104,104,104)
outcome <- c(1,1,1,1,1,NA,1,NA,NA,NA,NA,NA)
time <- c(1,2,3,1,2,3,1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
A patient should be kept in the database only if they have 2 or 3 observations (so patients that have complete data for 0 or only 1 time points should be thrown out). So for this example my desired result is this:
patientid <- c(100,100,100,101,101,101)
outcome <- c(1,1,1,1,1,NA)
time <- c(1,2,3,1,2,3)
Data <- data.frame(patientid=patientid, outcome=outcome, time=time)
Hence patients 102 and 104 are thrown out of the database because they were missing the outcome variable at 2 or 3 of the time points.
We can group by 'patientid' and filter on a logical condition: the number of non-NA 'outcome' values per patient, i.e. the sum of !is.na(outcome), must be greater than one.
library(dplyr)
Data %>%
group_by(patientid) %>%
filter(sum(!is.na(outcome)) > 1) %>%
ungroup
-output
# A tibble: 6 x 3
# patientid outcome time
# <dbl> <dbl> <dbl>
#1 100 1 1
#2 100 1 2
#3 100 1 3
#4 101 1 1
#5 101 1 2
#6 101 NA 3
A base R option using subset + ave
subset(
Data,
ave(!is.na(outcome), patientid, FUN = sum) > 1
)
giving
patientid outcome time
1 100 1 1
2 100 1 2
3 100 1 3
4 101 1 1
5 101 1 2
6 101 NA 3
A data.table option
setDT(Data)[, Y := sum(!is.na(outcome)), patientid][Y > 1, ][, Y := NULL][]
or a simpler one (thanks @akrun)
setDT(Data)[Data[, .I[sum(!is.na(outcome)) > 1], .(patientid)]$V1]
which gives
patientid outcome time
1: 100 1 1
2: 100 1 2
3: 100 1 3
4: 101 1 1
5: 101 1 2
6: 101 NA 3
Another dplyr option stores the per-patient count of non-missing outcomes in a helper column and filters on it:
library(dplyr)
Data %>%
  group_by(patientid) %>%
  mutate(observation = sum(!is.na(outcome))) %>% # number of non-missing outcomes per patient
  filter(observation >= 2) %>%
  ungroup()
output:
# A tibble: 6 x 4
patientid outcome time observation
<dbl> <dbl> <dbl> <int>
1 100 1 1 3
2 100 1 2 3
3 100 1 3 3
4 101 1 1 2
5 101 1 2 2
6 101 NA 3 2
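To check which patients a filter like the ones above would keep before applying it, the per-patient counts can be tabulated first. A small base R sketch (n_obs, keep_ids and kept are illustrative names, not taken from the answers):

```r
Data <- data.frame(
  patientid = rep(c(100, 101, 102, 104), each = 3),
  outcome = c(1, 1, 1, 1, 1, NA, 1, NA, NA, NA, NA, NA),
  time = rep(1:3, times = 4)
)

# Count non-missing outcomes per patient.
n_obs <- tapply(!is.na(Data$outcome), Data$patientid, sum)

# IDs of patients with at least two non-missing outcomes.
keep_ids <- as.numeric(names(n_obs)[n_obs > 1])

# Keep only the rows belonging to those patients.
kept <- Data[Data$patientid %in% keep_ids, ]
```

Printing n_obs before subsetting makes it easy to confirm that the threshold matches the intended rule.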

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid explicit loops in R where possible, since they can get messy and tend to be slow. For this sort of task, the dplyr package is a good fit and well worth learning; it can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
library(dplyr)
df2 = df1 %>%
  group_by(subject, part) %>%
  summarise(
    sad_mean = mean(sad, na.rm = TRUE),
    # percentage of rows in the group where sad is missing
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part, you can compute the mean of sad and the percentage of NA values; since is.na() returns a logical vector, its mean is the proportion of missing values.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
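The mean(is.na(x)) idiom used above does not depend on dplyr or data.table. As a hedged sketch, the same percentages can be computed in base R with aggregate() (res is an illustrative name; the data are rebuilt with rep() only to keep the example self-contained):

```r
df1 <- data.frame(
  subject = rep(1:2, each = 16),
  part = rep(rep(0:3, each = 4), times = 2),
  sad = c(1, 7, 7, 4, NA, NA, 2, 2, NA, 2, 3, NA, NA, 2, 2, 1,
          NA, 5, NA, 6, 6, NA, NA, 3, 3, NA, NA, 5, 3, NA, 7, 2)
)

# is.na() gives TRUE/FALSE; the mean of a logical vector is the share of TRUEs.
res <- aggregate(
  list(missing = is.na(df1$sad)),
  by = list(subject = df1$subject, part = df1$part),
  FUN = function(v) mean(v) * 100
)

# aggregate() varies the first grouping variable fastest, so reorder.
res <- res[order(res$subject, res$part), ]
```

The missing column of res should match the missing vector the question asks for.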
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
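To combine these percentages with the means already stored in df2, you can use left_join() (with no by argument it joins on the shared columns, here subject and part):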
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

Add a new row in each group (Day)

I am trying to make a function with this data and would really appreciate help with this!
example<- data.frame(Day=c(2,4,8,16,32,44,2,4,8,16,32,44,2,4,8,16,32,44),
Replicate=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,
1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,
1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
Treament=c("CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC","CC",
"HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP","HP",
"LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL","LL"),
AFDM=c(94.669342,94.465752,84.897023,81.435993,86.556221,75.328294,94.262162,88.791240,75.735474,81.232403,
67.050593,76.346244,95.076522,88.968823,83.879073,73.958836,70.645724,67.184695,99.763156,92.022673,
92.245362,74.513934,50.083136,36.979418,94.872932,86.353037,81.843173,67.795465,46.622106,18.323099,
95.089932,93.244212,81.679814,65.352385,18.286525,7.517794,99.559972,86.759404,84.693433,79.196504,
67.456961,54.765706,94.074014,87.543693,82.492548,72.333367,51.304676,51.304676,98.340870,86.322153,
87.950873,84.693433,63.316485,63.723665))
Example:
I want to insert a new row with an AFDM value (e.g., 0.9823666) that was calculated with another function.
This new row must be added before each Day 2 row, i.e. at the start of each group (call it Day 0), and I want to preserve the Replicate and Treatment of each group.
Thus, this new row must be: Day 0, Replicate=same, Treatment=same, AFDM=0.9823666.
This is so I can later run a regression with the data (from 0 to 44, 3 replicates for each Treatment).
I would prefer a solution on dplyr.
Thanks in advance
We can create a grouping column with cumsum, then expand the dataset with complete and fill the other columns
library(dplyr)
library(tidyr)
example %>%
group_by(grp = cumsum(Day == 2)) %>%
complete(Day = c(0, unique(Day)), fill = list(AFDM = 0.9823666)) %>%
fill(Replicate, Treament, .direction = 'updown')
# A tibble: 63 x 5
# Groups: grp [9]
# grp Day Replicate Treament AFDM
# <int> <dbl> <dbl> <chr> <dbl>
# 1 1 0 1 CC 0.982
# 2 1 2 1 CC 94.7
# 3 1 4 1 CC 94.5
# 4 1 8 1 CC 84.9
# 5 1 16 1 CC 81.4
# 6 1 32 1 CC 86.6
# 7 1 44 1 CC 75.3
# 8 2 0 2 CC 0.982
# 9 2 2 2 CC 94.3
#10 2 4 2 CC 88.8
# … with 53 more rows
You can use distinct to get unique Replicate and Treament, add Day and AFDM column with the default values and bind the rows to the original dataframe.
library(dplyr)
example %>%
distinct(Replicate, Treament) %>%
mutate(Day = 0, AFDM = 0.9823666) %>%
bind_rows(example) %>%
arrange(Replicate, Treament)
# Replicate Treament Day AFDM
#1 1 CC 0 0.9823666
#2 1 CC 2 94.6693420
#3 1 CC 4 94.4657520
#4 1 CC 8 84.8970230
#5 1 CC 16 81.4359930
#6 1 CC 32 86.5562210
#7 1 CC 44 75.3282940
#8 1 HP 0 0.9823666
#9 1 HP 2 99.7631560
#10 1 HP 4 92.0226730
#.....
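The distinct-then-bind idea can also be sketched in base R. The example below uses a small made-up data frame (small, with invented AFDM values) rather than the full example data, and keeps the question's Treament spelling of the column name:

```r
# Hypothetical reduced data: 2 treatments x 2 replicates x 2 days.
small <- data.frame(
  Day = rep(c(2, 4), times = 4),
  Replicate = rep(c(1, 1, 2, 2), times = 2),
  Treament = rep(c("CC", "HP"), each = 4),
  AFDM = c(94.7, 94.5, 94.3, 88.8, 99.8, 92.0, 94.9, 86.4)
)

# One Day 0 row per Replicate/Treament combination.
day0 <- unique(small[, c("Replicate", "Treament")])
day0$Day <- 0
day0$AFDM <- 0.9823666

# Stack the new rows on the original data and sort within each group.
out <- rbind(day0[, names(small)], small)
out <- out[order(out$Treament, out$Replicate, out$Day), ]
rownames(out) <- NULL
```

Sorting by Day as the last key puts the new Day 0 row at the top of every group, which is the layout needed for a regression from Day 0 onward.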

Create lags relative to whole change within group

I've tried creating a variable that represents the lagged version of another variable relative to the whole change of the variable within the group.
Let's use this example dataframe:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I've tried the following, but while lagging the variable works, I am not able to create a new variable that shows the lag of the variable relative to the whole change for the player:
result <-
  game_data %>%
  group_by(player) %>%
  mutate(
    lag_score = score - dplyr::lag(score, n = 1, default = NA),
    lag_score_relative = lag_score / sum(lag_score))
result
# A tibble: 9 x 5
# Groups: player [3]
player level score lag_score lag_score_relative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 NA NA
2 1 2 150 150 NA
3 1 3 170 20 NA
4 2 1 80 NA NA
5 2 2 100 20 NA
6 2 3 110 10 NA
7 3 1 75 NA NA
8 3 2 100 25 NA
9 3 3 0 -100 NA
For example, for player 1 it should be in
Level 1: NA/170 = NA
Level 2: 150/170
Level 3: 20/170
Thanks in advance, I hope anyone can help.
If you sum the lagged scores you include an NA. The sum then returns NA. You divide by NA which in the end returns NA for every value. To avoid this just set the na.rm argument to TRUE in your call of sum and NAs do not get included in the sum:
game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3),
score=as.numeric(c(0,150,170,80,100,110,75,100,0)))
game_data %>%
group_by(player) %>%
mutate(
lag_score = score - dplyr::lag(score, n=1, default = NA),
lag_score_relative = lag_score/sum(lag_score, na.rm = TRUE))
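The effect of na.rm can be seen without dplyr. The following base R sketch (using ave() and diff() as stand-ins for the grouped mutate; this is a swapped-in technique, not the answer's code) reproduces the per-player calculation:

```r
game_data <- data.frame(
  player = rep(1:3, each = 3),
  level = rep(1:3, times = 3),
  score = c(0, 150, 170, 80, 100, 110, 75, 100, 0)
)

# A single NA makes the whole sum NA unless na.rm = TRUE.
plain_sum <- sum(c(NA, 150, 20))
safe_sum <- sum(c(NA, 150, 20), na.rm = TRUE)

# Change in score from the previous level, within each player.
game_data$lag_score <- ave(
  game_data$score, game_data$player,
  FUN = function(s) c(NA, diff(s))
)

# Each change relative to the player's total change.
game_data$lag_score_relative <- ave(
  game_data$lag_score, game_data$player,
  FUN = function(d) d / sum(d, na.rm = TRUE)
)
```

For player 1 this gives NA, 150/170 and 20/170, matching the values the question asks for.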

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have at least one piece of information that is not age or surgery afterward, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(dplyr)
library(tidyr)
# Use a tibble and make sure every column is character (not factor).
dfTest = as_tibble(testdat) %>%
  mutate(across(everything(), as.character))
# Flag the first row of each run of "surgery"/"age" rows as the start of a new patient, then number the patients.
dfTest = dfTest %>%
  mutate(couldBeStart = col1 %in% c("surgery", "age")) %>%
  mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
  mutate(patientID = cumsum(isStart)) %>%
  select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
testdat = testdat %>%
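  # default = TRUE treats the first row as "not preceded by surgery", so the counter is not NA when the data start with 'age'
  mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery', default = TRUE))))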
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
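The cumsum() idea can be sketched in base R as well: mark each row that begins a new run of keyword rows, then take the running total of those marks. This sketch rests on the same assumption as the answers (a patient's surgery and age rows are adjacent); is_key and starts are illustrative names.

```r
col1 <- c("surgery", "age", "weight", "albumin", "abiotics",
          "surgery", "age", "weight", "BAPPS", "abiotics",
          "surgery", "age", "weight",
          "age", "weight", "BAPPS", "albumin")

# TRUE where the row is one of the keywords that can open a patient block.
is_key <- col1 %in% c("surgery", "age")

# A block starts at a keyword row that is not preceded by another keyword row.
starts <- is_key & !c(FALSE, head(is_key, -1))

# The running count of block starts is the patient ID.
ID <- cumsum(starts)
```

Because the first row is compared against an implicit FALSE, the numbering starts at 1 whether the data begin with surgery or with age.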
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4
