summarize data with NAs using ddply function - r

I am trying to summarize data with NA values and am using the ddply function.
For example, using the data included below,
set.seed(123)
dat <- data.frame(IndID = rep(c("AAA", "BBB", "CCC"), 100),
                  ValOne = sample(c(1, 0, NA), replace = T, 300),
                  ValTwo = sample(c(1, NA), replace = T, 300),
                  VarThree = sample(c("Thanks", "alot"), replace = T, 300))
> head(dat)
  IndID ValOne ValTwo
1   AAA      1     NA
2   BBB     NA      1
3   CCC      0     NA
4   AAA     NA     NA
5   BBB     NA     NA
6   CCC      1      1
I want to calculate the number of times each individual has a 1 in the ValOne and ValTwo columns. I am using the code below to create a new data.frame that summarizes the data by IndID, using both the length and sum functions.
library(plyr)
tbl <- ddply(dat, "IndID", summarise,
             ColOne = length(dat$ValOne[dat$ColOne == 1]),
             NumHighHDOP = sum(dat$ValTwo[dat$ValTwo == 1], na.rm = T))
As seen below,
  IndID ColOne NumHighHDOP
1   AAA      0         155
2   BBB      0         155
3   CCC      0         155
the resulting table summarizes the data for the entire data.frame and not for each individual.
Both approaches (length and sum) are struggling with the NAs in the data.frame. Any suggestions would be appreciated.
EDIT: With the new data set, which includes a factor column, is it also possible to calculate the number of "Thanks" for each individual?

We can use dplyr. We group by 'IndID' and get the count of 1s in each column with summarise_each. To remove the NA elements, either use na.omit or a logical condition that returns TRUE only for 1.
library(dplyr)
dat %>%
  group_by(IndID) %>%
  summarise_each(funs(sum(. == 1 & !is.na(.))))
# or
# summarise_each(funs(sum(na.omit(.) == 1)))
Update
Based on the updated dataset in the OP's post, if we also want to count the 'Thanks' in the third column, we can use %in% (assuming that 'Thanks' is not present in the other two columns and 1 is not present in the last column).
dat %>%
  group_by(IndID) %>%
  summarise_each(funs(sum(na.omit(.) %in% c(1, 'Thanks'))))
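In current dplyr, summarise_each is deprecated in favour of across; a roughly equivalent sketch (untested here), counting 1s in the numeric columns and "Thanks" in the text column:
library(dplyr)
# across() version (dplyr >= 1.0.0) of the summarise_each call above
dat %>%
  group_by(IndID) %>%
  summarise(across(everything(), ~ sum(na.omit(.) %in% c(1, "Thanks"))))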

Related

Find rows with incomplete set depending on a factor, then replace values that exist by NA for the incomplete set

I cannot work this one out.
I have an incomplete dataset (many rows and variables) with one factor that specifies whether all the other variables are pre- or post-something. I need to get summary statistics for all variables pre- and post-, including only rows where both the pre- and post- values are not NA.
I am trying to find a way to replace existing values with NA if the set is incomplete, separately for each variable.
The following is a simple example of what I am trying to achieve:
df = data.frame(
  id = c(1, 1, 2, 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20),
  var3change = c(5, 10, 15, 20),
  var4change = c(NA, 2, 3, 8)
)
which leads to:
  id myfactor var2change var3change var4change
1  1        1         10          5         NA
2  1        2         10         10          2
3  2        1         NA         15          3
4  2        2         20         20          8
My desired output would be:
  id myfactor var2change var3change var4change
1  1        1         10          5         NA
2  1        2         10         10         NA
3  2        1         NA         15          3
4  2        2         NA         20          8
I have much more than one variable to deal with and the set is incomplete in a different way for each variable independently. I have the feeling this may be achieved with smart use of existing functions from the plyr / tidyr packages but I cannot find an elegant way to apply the concepts to my problem.
Any help would be appreciated.
You can group by id and, if any value in a column is NA within the group, replace all of that column's values in the group with NA. To apply a function to multiple columns we use across.
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(across(starts_with('var'), ~ if (any(is.na(.))) NA else .))
# for dplyr < 1.0.0 we can use `mutate_at`
# mutate_at(vars(starts_with('var')), ~ if (any(is.na(.))) NA else .)
#     id myfactor var2change var3change var4change
#  <dbl> <fct>         <dbl>      <dbl>      <dbl>
#1     1 1                10          5         NA
#2     1 2                10         10         NA
#3     2 1                NA         15          3
#4     2 2                NA         20          8
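For reference, a base R sketch of the same idea (not part of the original answer; untested): within each id, a var*change column is blanked out whenever its group contains an NA.
# ave() applies the function to each id group; v * NA turns the whole group to NA
vars <- grep("^var", names(df), value = TRUE)
df[vars] <- lapply(df[vars], function(x)
  ave(x, df$id, FUN = function(v) if (anyNA(v)) v * NA else v))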
It would help to have a grouping variable (group) as well as your time variable (myfactor). Then you can do some finagling to create the variables you want with dplyr.
library(dplyr)
df = data.frame(
  group = rep(c(1, 2), each = 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20)
)
df %>% group_by(group) %>%
  mutate(var3change = all(!is.na(var2change)),
         var4change = if_else(var3change, var2change, as.numeric(NA)))
I'm assuming that the dataset you have is ordered, so each pair of observations is grouped by their row index.
By default, the mean() function will return an NA if any of the inputs to it are NA. This is therefore a neat way of getting an NA by group, using dplyr.
library(dplyr)
df = data.frame(
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20)
)
# 1 Create ID variable to group rows in pairs
id = c()
j = 0
for (i in 1:length(df$var2change)) {
  k = floor(j / 2)
  id = c(id, k)
  j = j + 1
}
df$id = id
# Set all variables within group to NA if one of them is
df = df %>%
  group_by(id) %>%
  mutate(var_changed = mean(var2change))
If you have an explicit ID variable in your data, you can replace the first part of this solution.
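If pairing consecutive rows is all that is needed, the id-building loop above can also be written as a single vectorized line (a sketch; the labels are 1, 1, 2, 2 instead of the loop's 0, 0, 1, 1, but the grouping is identical, and it assumes an even number of rows):
# Pair consecutive rows without a loop
df$id <- rep(seq_len(nrow(df) / 2), each = 2)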
EDIT: doing this for multiple variables (based on change to the question):
df = data.frame(
  id = c(1, 1, 2, 2),
  myfactor = as.factor(c(1, 2, 1, 2)),
  var2change = c(10, 10, NA, 20),
  var3change = c(5, 10, 15, 20),
  var4change = c(NA, 2, 3, 8)
)
for (col in 2:4) {
  col = paste0("var", col, "change")
  df = df %>%
    group_by(id) %>%
    mutate(new_col = mean(get(col)))
  df[["new_col"]] = ifelse(is.na(df[["new_col"]]), NA, df[[col]])
  df[col] = NULL
  names(df)[names(df) == "new_col"] <- col
}
If speed is an issue, you could speed this up by moving the group_by outside the loop.
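For illustration, a sketch of that tip taken to its conclusion (untested): grouping once and handling every column in a single grouped mutate removes the loop entirely, which essentially converges on the across() pattern from the first answer.
library(dplyr)
# Group once; a column's values become NA within a group whenever that group contains an NA
df %>%
  group_by(id) %>%
  mutate(across(starts_with("var"), ~ if (anyNA(.)) NA else .)) %>%
  ungroup()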

Assigning values to patterns of letters in character strings using R

I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
We create a named vector ('nm1'), split 'shotchart' to extract only the runs of 'M', and then use the named vector to map each run to its value and take the sum.
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
       function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
  mutate(score = str_extract_all(shotchart, "M+") %>%
           map_dbl(~ nm1[.x] %>%
                     sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and base the result on the count of "M" characters minus 1, as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
                 score = NA_integer_,
                 stringsAsFactors = F)
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i) - 1)[(nchar(i) - 1) > 0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5

Multiple values in one cell

I have data looking somewhat similar to this:
number type results
     1    5 x, y, z
     2    6 a
     3    8 x
     1    5 x, y
Basically, I have data in Excel with commas in a couple of individual cells, and I need to count each comma-separated value after a certain requirement is met by subsetting.
Question: How do I get the total of 5 when subsetting the data with number == 1 and type == 5 in R?
If we need the total count, then another option is str_count after subsetting
library(stringr)
with(df, sum(str_count(results[number==1 & type==5], "[a-z]"), na.rm = TRUE))
#[1] 5
Or with gregexpr from base R
with(df, sum(lengths(gregexpr("[a-z]", results[number==1 & type==5])), na.rm = TRUE))
#[1] 5
If there is no matching pattern for an element, use
with(df, sum(unlist(lapply(gregexpr("[a-z]",
results[number==1 & type==5]), `>`, 0)), na.rm = TRUE))
Here is an option using dplyr and tidyr. filter() keeps the rows that meet the conditions, separate_rows() splits the comma-separated values into separate rows, group_by() groups the data, and tally() counts the rows in each group.
dt2 <- dt %>%
  filter(number == 1, type == 5) %>%
  separate_rows(results) %>%
  group_by(results) %>%
  tally()
# # A tibble: 3 x 2
#   results     n
#     <chr> <int>
# 1       x     2
# 2       y     2
# 3       z     1
Or you can use count(results) alone, as the following code shows.
dt2 <- dt %>%
  filter(number == 1, type == 5) %>%
  separate_rows(results) %>%
  count(results)
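If only the single total of 5 is needed rather than the per-value counts, the separated rows can simply be counted (a sketch using the dt defined under DATA below):
library(dplyr)
library(tidyr)
# Number of comma-separated values for number == 1 and type == 5
dt %>%
  filter(number == 1, type == 5) %>%
  separate_rows(results) %>%
  nrow()
#[1] 5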
DATA
dt <- read.table(text = "number type results
1 5 'x, y, z'
2 6 a
3 8 x
1 5 'x, y'",
header = TRUE, stringsAsFactors = FALSE)
Here is a method using base R. You split results on the commas and get the length of each list element, then add these up, grouping by number.
aggregate(sapply(strsplit(df$results, ","), length), list(df$number), sum)
  Group.1 x
1       1 5
2       2 1
3       3 1
Your data:
df = read.table(text="number type results
1 5 'x, y, z'
2 6 'a'
3 8 'x'
1 5 'x, y'",
header=TRUE, stringsAsFactors=FALSE)

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A", "B", "C"), 4), "CatNum" = rep(1:2, 6),
                  "SomeVal" = runif(12))
I would like to quickly build a data frame with sum values for all the combinations of the categories derived from CatA and CatNum, as well as for the categories derived from each column separately. For the primitive example above, the first few combinations can be obtained with simple code:
df_sums <- data.frame(
  "Category" = c("Total for A",
                 "Total for A and 1",
                 "Total for A and 2"),
  "Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
            sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
           Category       Sum
1       Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time
Achieve some flexibility in how the function is applied; for instance, I may want to apply mean instead of sum
Save the "Total for" string to a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along the lines of:
require(dplyr)
df_sums_experiment <- dta %>%
  group_by(CatA, CatNum) %>%
  summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  gather(key, val, -SomeVal) %>%
  group_by(val) %>%
  summarise(sum(SomeVal))
     val sum(SomeVal)
   (chr)        (dbl)
1      1    2.8198078
2      2    3.0778622
3      A    2.1801780
4    A_1    1.2101839
5    A_2    0.9699941
6      B    1.4405782
7    B_1    0.4076565
8    B_2    1.0329217
9      C    2.2769138
10   C_1    1.2019674
11   C_2    1.0749464
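In current tidyr, gather is superseded by pivot_longer; a roughly equivalent sketch (untested; values_transform coerces the mixed character/integer columns to character before stacking):
library(dplyr)
library(tidyr)
# pivot_longer() replacement for the gather() step above
dta %>%
  unite(measurevar, CatA, CatNum, remove = FALSE) %>%
  pivot_longer(-SomeVal, names_to = "key", values_to = "val",
               values_transform = list(val = as.character)) %>%
  group_by(val) %>%
  summarise(Sum = sum(SomeVal))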
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
                function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
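data.table also ships cube() (and groupingsets()), which computes an aggregate for every combination of the grouping columns directly; a sketch, untested here, which additionally returns a grand-total row with both columns NA:
library(data.table)
# cube() aggregates over all subsets of the grouping columns
dt <- as.data.table(dta)
cube(dt, sum(SomeVal), by = c("CatA", "CatNum"))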
Split the data, then use lapply and rbind the results:
# result
res <- do.call(rbind,
               lapply(
                 c(split(dta, dta$CatA),
                   split(dta, dta$CatNum),
                   split(dta, dta[, 1:2])),
                 function(i) sum(i[, "SomeVal"])))
# prettify the result
res1 <- data.frame(Category = paste0("Total for ", rownames(res)),
                   Sum = res[, 1])
res1$Category <- sub(".", " and ", res1$Category, fixed = TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464

Two equal max values in R

I have a data frame with some numbers (score) and repeating IDs. I want to get the maximum value for each ID.
I used this function
top = aggregate(df$score, list(df$ID), max)
This returned a top data frame with the maximum value for each ID.
But it so happens that, for one of the IDs, there are two EQUAL max values, and this function ignores the second one.
Is there any way to retain BOTH max values?
For example:
df
ID score
 1    12
 1    15
 1     1
 1    15
 2    23
 2    12
 2    13
The above function gives me this:
top
ID Score
 1    15
 2    23
I need this:
top
ID Score
 1    15
 1    15
 2    23
I recommend data.table as Chris mentioned (good for speed, but steeper learning curve).
Or if you don't want data.table you could use plyr:
library(plyr)
ddply(df, .(ID), subset, score==max(score))
# same as ddply(df, .(ID), function (x) subset(x, score==max(score)))
You can convert to a data.table:
library(data.table)
DT <- as.data.table(df)
DT[, .SD[score == max(score)], by = ID]
Here is a dplyr solution.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(score == max(score))
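More recent dplyr also has slice_max(), which keeps ties by default; a sketch (untested):
library(dplyr)
# slice_max() (dplyr >= 1.0.0) keeps all tied maxima when with_ties = TRUE (the default)
df %>%
  group_by(ID) %>%
  slice_max(score, n = 1, with_ties = TRUE)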
Otherwise, to build on what you have done, we can use a sneaky property of merge on your "top" data frame; see the following example:
df1 <- data.frame(ID = c(1,1,5,2), score = c(5,5,2,6))
top_df <- data.frame(ID = c(1,2), score = c(5,6))
merge(df1, top_df)
which gives:
  ID score
1  1     5
2  1     5
3  2     6
Staying with a data.frame:
df[unlist(by(df, df$ID, FUN=function(D) rownames(D)[D$score == max(D$score)] )),]
# ID score
#2 1 15
#4 1 15
#5 2 23
This works because by splits df into a list of data.frames on the basis of df$ID but retains the original rownames of df (see by(df, df$ID, I)). Therefore, the rownames of each subset D that correspond to a max score value in each group can still be used to subset the original df.
A simple base R solution:
df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2),
                 score = c(12, 15, 1, 15, 23, 12, 13))
Several options:
df[df$score %in% tapply(df$score, df$ID, max), ]
df[df$score %in% aggregate(score ~ ID, data = df, max)$score, ]
df[df$score %in% aggregate(df$score, list(df$ID), max)$x, ]
Output:
  ID score
2  1     15
4  1     15
5  2     23
Using sqldf:
library(sqldf)
sqldf('SELECT df.ID, score FROM df
JOIN (SELECT ID, MAX(score) AS score FROM df GROUP BY ID)
USING (score)')
Output:
  ID score
2  1     15
4  1     15
5  2     23
