Count consecutive non NA items - r

I have a dataset that looks like this:
library(purrr)
library(dplyr)
temp <- data.frame(col_A = c(1,2,NA,3,4,5,6), col_B = c(NA,1,2,NA,1,NA,NA))
col_A col_B
1 NA
2 1
NA 2
3 NA
4 1
5 NA
6 NA
I want to create a new dataframe which contains, for each column, a running count of consecutive non-NA items that resets to 0 at every NA.
Like the following example:
count_A count_B
1 0
2 1
0 2
1 0
2 1
3 0
4 0
I am struggling to get the count of items.
My closest approximation is this:
count_days<-function(prev,new){
ifelse(!is.na(new),prev+1,0)
}
temp[,"col_A"] %>%
mutate(count_a=accumulate(count_a,count_days))
But I get the following error:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Can anyone help me with this code, or suggest another approach?
I know this piece of code only tries to compute the count and does not create the new df, but I think that step is easier once I get the correct result.

Using rle in a (somewhat nested) lapply approach. We first flag, for each column, which elements are NA. Then rle gives us the values and lengths of each run. Each run length is expanded into a sequence 1..n, the sequences that belong to NA runs are set to 0 by multiplication, and the result is unlisted.
res <- as.data.frame(lapply(lapply(temp, is.na), function(x) {
r <- rle(x)
s <- lapply(r$lengths, seq_len)  # lapply always returns a list; sapply may simplify to a vector or matrix
s[r$values] <- lapply(s[r$values], `*`, 0)
unlist(s)
}))
res
# col_A col_B
# 1 1 0
# 2 2 1
# 3 0 2
# 4 1 0
# 5 2 1
# 6 3 0
# 7 4 0
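The same run-length idea can be traced step by step on a single column. What follows is only a minimal base R sketch, assuming the col_A and col_B vectors from the question; the helper name count_non_na is made up for illustration and is not part of the answer above.

```r
# Hypothetical helper (not from the answer above): running count of
# consecutive non-NA values, reset to 0 at every NA.
count_non_na <- function(x) {
  r <- rle(is.na(x))                  # runs of NA / non-NA
  within_run <- sequence(r$lengths)   # 1..n inside every run
  keep <- rep(!r$values, r$lengths)   # TRUE where the run is non-NA
  within_run * keep                   # zero out the NA runs
}

col_A <- c(1, 2, NA, 3, 4, 5, 6)
col_B <- c(NA, 1, 2, NA, 1, NA, NA)

rle(is.na(col_A))$lengths   # 2 1 4
count_non_na(col_A)         # 1 2 0 1 2 3 4
count_non_na(col_B)         # 0 1 2 0 1 0 0
```

Applying the helper to every column, for example with as.data.frame(lapply(temp, count_non_na)), should reproduce the table shown above.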

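We can use rleid from data.table. rleid numbers the runs of !is.na(x), rowid numbers the elements within each run, and multiplying by !is.na(x) sets the NA positions to 0.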
library(data.table)
setDT(temp)[, lapply(.SD, function(x) rowid(rleid(!is.na(x))) * !is.na(x))]
# col_A col_B
#1: 1 0
#2: 2 1
#3: 0 2
#4: 1 0
#5: 2 1
#6: 3 0
#7: 4 0

You can use sequence and rle from base R, together with dplyr.
First set all non-NA values to 1, then use rle to number the elements within each run of identical values.
library(tidyverse)
temp %>%
replace(.,!is.na(.),1) %>%
mutate(col_A=case_when(!is.na(col_A)~sequence(rle(col_A)$lengths))) %>%
mutate(col_B=case_when(!is.na(col_B)~sequence(rle(col_B)$lengths))) %>%
replace(.,is.na(.),0)
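The fold that the question attempted with accumulate can also be written in base R. The error in the question arises because temp[,"col_A"] is a plain numeric vector, for which mutate has no method. The sketch below assumes the asker's count_days rule; Reduce with accumulate = TRUE stands in for purrr's accumulate here, and running_count is a hypothetical name.

```r
# Same rule as the asker's count_days, written for one element at a time.
count_days <- function(prev, new) if (is.na(new)) 0 else prev + 1

# Hypothetical wrapper: fold count_days over a vector, starting from 0.
running_count <- function(x) {
  out <- unlist(Reduce(count_days, x, init = 0, accumulate = TRUE))
  out[-1]  # drop the initial 0 that Reduce puts in front
}

temp <- data.frame(col_A = c(1, 2, NA, 3, 4, 5, 6),
                   col_B = c(NA, 1, 2, NA, 1, NA, NA))

res <- as.data.frame(lapply(temp, running_count))
names(res) <- c("count_A", "count_B")
res
```

This walks the vector element by element, so it is likely slower than the rle-based answers on long columns, but it stays close to the logic in the question.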

Related

r replace missing values with a constant and column name follow a common pattern

My dataset has columns and values like this. The column names all start with a common string, Col_a_**
ID Col_a_01 Col_a_02 Col_a_03
1 1 2 1
2 1 NA 0
3 NA 0 2
4 1 0 1
5 0 0 2
My goal is to replace the missing values with the mode values for that column.
The expected dataset to be like this
ID Col_a_01 Col_a_02 Col_a_03
1 1 2 1
2 1 0** 0
3 1** 0 2
4 1 0 1
5 0 0 2
The NA in the first column is replaced by 1 because the mode of the 1st column is 1. The NA in the second column is replaced by 0 because the mode for the 2nd column is 0.
I can do this like this below
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
df$Col_a_01[is.na(df$Col_a_01)] <- getmode(df$Col_a_01)
df$Col_a_02[is.na(df$Col_a_02)] <- getmode(df$Col_a_02)
df$Col_a_03[is.na(df$Col_a_03)] <- getmode(df$Col_a_03)
But this becomes unwieldy if I have 100 columns starting with the similar names ending in 1,2,3..100. I am curious if there is an easier and more elegant way to accomplish this. Thanks in advance.
You can change the NA values with ifelse or replace. To apply a function to multiple columns, use across in dplyr.
library(dplyr)
df <- df %>%
mutate(across(starts_with('Col_a'), ~replace(., is.na(.), getmode(.))))
In base R, use lapply:
cols <- grep('Col_a', names(df))
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), getmode(x)))
We can use na.aggregate with FUN specified as getmode
library(zoo)
library(dplyr)
df1 <- df1 %>%
mutate(across(starts_with('Col_a'), ~ na.aggregate(.x, FUN = getmode)))
-output
df1
ID Col_a_01 Col_a_02 Col_a_03
1 1 1 2 1
2 2 1 0 0
3 3 1 0 2
4 4 1 0 1
5 5 0 0 2
Or it can be simply
na.aggregate(df1, FUN = getmode)
ID Col_a_01 Col_a_02 Col_a_03
1 1 1 2 1
2 2 1 0 0
3 3 1 0 2
4 4 1 0 1
5 5 0 0 2
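One detail worth checking: getmode as written in the question counts NA as a value, so a column in which NA is the most frequent entry would get NA back as its "mode". Below is a minimal base R sketch, assuming the example data from the question; getmode_narm is a hypothetical variant that drops NA first, not something defined in the answers above.

```r
# Example data from the question.
df <- data.frame(
  ID       = 1:5,
  Col_a_01 = c(1, 1, NA, 1, 0),
  Col_a_02 = c(2, NA, 0, 0, 0),
  Col_a_03 = c(1, 0, 2, 1, 2)
)

# Hypothetical variant of getmode that ignores NA when counting.
getmode_narm <- function(v) {
  v <- v[!is.na(v)]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]  # first value wins on ties
}

# Select columns by their common prefix and impute each one in place.
cols <- startsWith(names(df), "Col_a")
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), getmode_narm(x)))
df
```

Because which.max returns the first maximum, ties are resolved in favour of the value that appears first in the column.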

Generate matrix of unique combination using 2 variables from a data frame r

I have a data frame as
df<- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
I want to get all unique combinations of values for every pair of the 5 variables in the data frame df.
Apply a function f (which extracts the unique rows) to each pair of columns:
f<-function(col,df)
{
return(unique(df[,col]))
}
# All combinations of two column names
comb_col<-combn(colnames(df),2)
Output:
apply(comb_col,2,f,df=df)
[[1]]
Var1 Var2
1 0 0
2 1 0
3 0 1
4 1 1
5 0 2
6 1 2
7 0 3
8 1 3
9 0 4
10 1 4
[[2]]
Var1 Var3
1 0 0
2 1 0
11 0 1
12 1 1
21 0 2
22 1 2
31 0 3
32 1 3
...
You can use the distinct function from the dplyr package:
df <- as.data.frame(expand.grid(0:1, 0:4, 0:3,0:7, 2:7))
library(dplyr)
df %>%
distinct(Var1, Var2)
You can also keep the rest of your columns with the .keep_all = TRUE argument.
If you want to get the unique values for all possible pairs of columns:
# Generate matrix with all combinations of variables
comb <- combn(names(df), 2)
# Generate a list with all unique values in your data.frame
# distinct_() is deprecated; select the pair of columns with across(all_of())
apply(comb, 2, function(x) df %>% distinct(across(all_of(x))))
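The pairing and the extraction of unique rows can also be done in one step without dplyr, since combn accepts a function to apply to each combination. This is a minimal base R sketch, assuming the expand.grid data from the question; the object name pairs and the "Var1_Var2" style labels are illustrative choices.

```r
df <- expand.grid(0:1, 0:4, 0:3, 0:7, 2:7)  # columns Var1 ... Var5

# For every pair of column names, keep the unique rows of those two columns.
pairs <- combn(names(df), 2,
               FUN = function(p) unique(df[p]),
               simplify = FALSE)

# Label each list element with the pair it came from, e.g. "Var1_Var2".
names(pairs) <- combn(names(df), 2, FUN = paste, collapse = "_")

length(pairs)               # 10 pairs from 5 columns
nrow(pairs[["Var1_Var2"]])  # 10 unique rows (2 x 5 values)
```

With simplify = FALSE the result stays a list of data frames, which matters here because the pairs have different numbers of unique rows.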

How to identify frequencies of multiple columns based on condition using R

I have a dataframe which contains 63 columns and 50 rows. I have given below a toy dataset.
>df
rs_1 rs_2 rs_3 rs_4 ... rs_60 A.Ag B.Ag C.Ag
0 0 1 2 ... 1 02:/01 02:/07 03:07/04:01
1 2 1 2 ... 0 02:/01 02:/07 03:07/04:01
2 1 1 2 ... 2 02:/01 02:/07 03:07/04:01
0 0 1 0 ... 2 02:/01 02:/07 03:07/04:01
Now I need to find the most frequent values of the columns A.Ag, B.Ag and C.Ag for each rs_* = 0, 1 and 2 separately. The desired outcome, for example for rs_* = 0, would be:
rs_id Code A.Ag Code B.Ag Code C.Ag
rs_1 02:/01 2 02:/07 5 03:07 5
rs_2 02:/01 3 01:/05 2 05:00 4
Could you please help me with this? I tried the following loop:
for (i in 1:60){
if (file[,i]==0)
{
temp1 = data.frame(sort(table(file[,61]), decreasing = TRUE)) # only for the A.Ag column
temp1$Var1 = names(file)[i]
res_types = rbind(res_types, temp1)
}
}
I got the frequencies and the rs_id, but could not get the code. Can anyone help me with this?
The desired outcome would be:
rs_id Code Combination A.A Combination B.Ag Combination C.Ag
rs_1 0 1:01/1:01 7 13:02/13:02 2 03:04/03:04 3
rs_1 0 1:01/11:01 5 13:02/49:01 2 03:04/15:02 3
rs_1 0 1:01/2:01 4 13:02/57:01 2 03:04/7:01 3
rs_1 1 1:01/2:05 3 13:02/8:01 4 06:02/06:02 3
rs_1 1 1:01/24:02 3 14:01/14:02 3 06:02/15:02 3
rs_1 1 1:01/24:02 3 14:01/14:02 2 06:02/15:02 3
rs_2 0 1:01/31:01 3 15:01/15:01 1 06:02/3:03 4
rs_2 0 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1
It might be easier to do this using the data.table package. Explanation inline.
library(data.table)
#convert into a long format
longDat <- melt(dat, measure.vars=patterns("^rs"), variable.name="rs_id",
value.name="val_id")
#for each group of rs_id (rs_1, ..., rs_60) and val_id in (0,1,2),
#count the frequency of each code
longDat[,
unlist(
lapply(c("A.Ag","B.Ag","C.Ag"),
function(x) setNames(aggregate(get(x), list(get(x)), length), c("Code", x))
),
recursive=FALSE),
by=c("rs_id", "val_id")]
Is this what you are looking for? Does this help?
data:
library(data.table)
dat <- fread("rs_1,rs_2,rs_3,rs_4,rs_60,A.Ag,B.Ag,C.Ag
0,0,1,2,1,02:/01,02:/07,03:07/04:01
1,2,1,2,0,02:/01,02:/07,03:07/04:01
2,1,1,2,2,02:/01,02:/07,03:07/04:01
0,0,1,0,2,02:/01,02:/07,03:07/04:01")
Edit: the OP asked to retrieve the top 3 for each rs_id, val_id and *.Ag.
It is probably more readable to handle one *.Ag column at a time: count, take the top 3, and finally merge all the results as follows:
library(data.table)
#convert into a long format
longDat <- melt(dat, measure.vars=patterns("^rs"), variable.name="rs_id",
value.name="val_id")
ids <- c("rs_id", "val_id")
Reduce(function(dt1,dt2) merge(dt1,dt2,by=ids,all=TRUE),
lapply(c("A.Ag","B.Ag","C.Ag"), function(x) {
res <- longDat[, list(.N), by=c(ids, x)][order(-N)]
setnames(res[, head(.SD ,3L), by=ids], c(x, "N"), c(paste0(x,"_Code"), x))
}))
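The reshape-then-count step can be followed without data.table as well. This is only a rough base R sketch, assuming a cut-down version of the toy data above (rs_1, rs_2 and A.Ag only); the helper column n and the object names long and counts are illustrative.

```r
# Cut-down toy data: two rs columns and one antigen column.
dat <- data.frame(
  rs_1 = c(0, 1, 2, 0),
  rs_2 = c(0, 2, 1, 0),
  A.Ag = c("02:/01", "02:/01", "02:/01", "02:/01")
)

# Long format: one row per (original row, rs column).
rs_cols <- grep("^rs", names(dat), value = TRUE)
long <- data.frame(
  rs_id  = rep(rs_cols, each = nrow(dat)),
  val_id = unlist(dat[rs_cols], use.names = FALSE),
  A.Ag   = rep(dat$A.Ag, times = length(rs_cols)),
  n      = 1L  # helper column, summed below to obtain counts
)

# Frequency of each code within every rs_id / val_id group.
counts <- aggregate(n ~ rs_id + val_id + A.Ag, data = long, FUN = sum)
counts <- counts[order(counts$rs_id, counts$val_id, -counts$n), ]
counts
```

Taking the top few codes per group then amounts to keeping the first rows of each rs_id and val_id block in this ordered table.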

R - How would I check to see if columns corresponding to a given group are all equal within a group?

Let's say I have data like this:
group value
1 0
1 0
1 0
2 1
2 0
3 1
3 0
4 1
4 1
How would I iterate through all values of "group" to check whether the values within each group are all equal? I want a dataset that includes ONLY the groups whose values are not identical. I'm not sure of an easy way to do this that avoids a for loop.
You can do:
tapply(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L
# 1 2 3 4
# FALSE TRUE TRUE FALSE
To subset the table, write the same with ave:
DF[ ave(DF$value, DF$group, FUN = function(x) length(unique(x))) > 1L, ]
# group value
# 4 2 1
# 5 2 0
# 6 3 1
# 7 3 0
With packages, the latter step looks like...
library(data.table)
setDT(DF)[, if (uniqueN(value) > 1L) .SD, by=group]
# or
library(dplyr)
DF %>% group_by(group) %>% filter(n_distinct(value) > 1L)
Here is another option using table
tbl <- rowSums(table(df1)>0)>1
subset(df1, group %in% names(tbl)[tbl])
# group value
#4 2 1
#5 2 0
#6 3 1
#7 3 0
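To make the ave step reproducible, here is a minimal self-contained sketch, assuming the example data from the question; the intermediate name n_values is illustrative. ave returns one value per row (the number of distinct values in that row's group), which is what makes it usable as a row filter.

```r
DF <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  value = c(0, 0, 0, 1, 0, 1, 0, 1, 1)
)

# Number of distinct values in each row's group, repeated for every row.
n_values <- ave(DF$value, DF$group, FUN = function(x) length(unique(x)))
n_values  # 1 1 1 2 2 2 2 1 1

# Keep only the rows whose group holds more than one distinct value.
mixed <- DF[n_values > 1, ]
mixed
```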

count adjacent NAs in data.frame column

I would like to add an extra column "na_count" that counts adjacent NAs in the column value, like
value na_count
8 0
2 0
NA 4
NA 4
NA 4
NA 4
5 0
9 0
1 0
NA 2
NA 2
5 0
NA 3
NA 3
NA 3
8 0
5 0
NA 1
Is there perhaps a way with dplyr window functions?
Here is an option using dplyr (as the author asked for). We create a grouping column by taking the difference of logical vector (!is.na(value)), compare with 1 and do the cumsum, then create the 'NA_count' by multiplying the logical vector with number of elements in the group (n()).
library(dplyr)
df1 %>%
select(-na_count) %>% #removing the column that was not needed
group_by(grp=cumsum(c(TRUE,abs(diff(!is.na(value)))==1))) %>%
mutate(NA_count = is.na(value)*n()) %>%
ungroup() %>%
select(-grp)
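The grouping key is the part of this pipeline that is easiest to misread, so it may help to look at it on its own. This is a minimal base R sketch, assuming the value column from the question; ave stands in for the dplyr group_by and mutate steps, and the names is_na, grp and na_count are illustrative.

```r
value <- c(8, 2, NA, NA, NA, NA, 5, 9, 1, NA, NA, 5, NA, NA, NA, 8, 5, NA)

# A new group starts wherever the NA status changes from one row to the next.
is_na <- is.na(value)
grp <- cumsum(c(TRUE, abs(diff(!is_na)) == 1))
grp  # 1 1 2 2 2 2 3 3 3 4 4 5 6 6 6 7 7 8

# Within a group all rows share the same NA status, so summing the NA flags
# gives the run length for NA groups and 0 for the others.
na_count <- ave(as.integer(is_na), grp, FUN = sum)
na_count  # 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
```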
Or we can convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by the rleid of logical vector (is.na(value)), we get the nrow (.N), multiply with the logical vector and extract the 'V1' column.
library(data.table)#v1.9.6+
setDT(df1)[, .N*is.na(value) ,rleid(is.na(value))]$V1
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
If this is to create a new column,
setDT(df1)[, Na_count:= .N*is.na(value) ,rleid(is.na(value))]
Or we can use rle (run length encoding) from base R. We get the rle of 'value' that are NA (is.na(df1$value)) in a list, use within.list to change the 'values' i.e. TRUE elements by using that as index to the corresponding 'lengths' and then return the atomic vector with inverse.rle.
inverse.rle(within.list(rle(is.na(df1$value)),
{values[values] <- lengths[values] }))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
Or a slightly more compact version is
inverse.rle(within.list(rle(is.na(df1$value)), values <-lengths*values))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
Not with dplyr, but using rle from base-R:
# get run-length of missings
dd_rle <- rle(is.na(dd$value))
# use rep: value is length if missing, 0 otherwise, number of repetitions
# is length of runs
# na_count2 so comparison to expected output possible
dd$na_count2 <- rep(ifelse(dd_rle$values, dd_rle$lengths, 0),
dd_rle$lengths)
