I would like to add an extra column "na_count" that counts adjacent NAs in the column value, like this:
value na_count
8 0
2 0
NA 4
NA 4
NA 4
NA 4
5 0
9 0
1 0
NA 2
NA 2
5 0
NA 3
NA 3
NA 3
8 0
5 0
NA 1
Is there perhaps a way with dplyr window functions?
Here is an option using dplyr, as the question asks. We create a grouping column by taking the difference of the logical vector (!is.na(value)), comparing its absolute value with 1, and taking the cumsum, so that the group id increments each time the NA status changes. We then create 'NA_count' by multiplying the logical vector is.na(value) by the number of elements in the group (n()).
library(dplyr)
df1 %>%
select(-na_count) %>% #removing the column that was not needed
group_by(grp=cumsum(c(TRUE,abs(diff(!is.na(value)))==1))) %>%
mutate(NA_count = is.na(value)*n()) %>%
ungroup() %>%
select(-grp)
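To see how the grouping column is built, the intermediate vectors can be inspected in base R. This is only a sketch: the vector value is reconstructed from the table in the question, and the names not_na, flips and grp are made up for illustration.

```r
# Sample data reconstructed from the question's table (an assumption)
value <- c(8, 2, NA, NA, NA, NA, 5, 9, 1, NA, NA, 5, NA, NA, NA, 8, 5, NA)

not_na <- !is.na(value)

# TRUE at the first element and wherever the NA status flips
flips <- c(TRUE, abs(diff(not_na)) == 1)

# Group id that increments at every flip
grp <- cumsum(flips)
grp
# [1] 1 1 2 2 2 2 3 3 3 4 4 5 6 6 6 7 7 8

# Group size for every element, zeroed out where the value is not NA
na_count <- ave(value, grp, FUN = length) * is.na(value)
na_count
# [1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
```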
Or we can convert the 'data.frame' to a 'data.table' (setDT(df1)), group by the rleid of the logical vector (is.na(value)), take the number of rows in each group (.N), multiply it by the logical vector, and extract the 'V1' column.
library(data.table)#v1.9.6+
setDT(df1)[, .N*is.na(value) ,rleid(is.na(value))]$V1
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
If this is to create a new column,
setDT(df1)[, Na_count:= .N*is.na(value) ,rleid(is.na(value))]
Or we can use rle (run-length encoding) from base R. We compute the rle of the logical vector is.na(df1$value), which is a list; use within.list to replace the TRUE elements of 'values' with the corresponding 'lengths' (the FALSE elements become 0 on coercion); and then rebuild the full-length vector with inverse.rle.
inverse.rle(within.list(rle(is.na(df1$value)),
{values[values] <- lengths[values] }))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
Or a slightly more compact version is
inverse.rle(within.list(rle(is.na(df1$value)), values <-lengths*values))
#[1] 0 0 4 4 4 4 0 0 0 2 2 0 3 3 3 0 0 1
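If the rle object is unfamiliar, it may help to look at its two components on a short vector first. A minimal sketch, with x as a made-up example vector:

```r
x <- c(8, NA, NA, 5, NA)

r <- rle(is.na(x))
r$lengths
# [1] 1 2 1 1
r$values
# [1] FALSE  TRUE FALSE  TRUE

# Keep the run length where the run is NA, 0 otherwise
r$values <- r$lengths * r$values

# Expand the runs back to the original length
out <- inverse.rle(r)
out
# [1] 0 2 2 0 1
```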
Not with dplyr, but using rle from base-R:
# get run-length of missings
dd_rle <- rle(is.na(dd$value))
# use rep: value is length if missing, 0 otherwise, number of repetitions
# is length of runs
# na_count2 so comparison to expected output possible
dd$na_count2 <- rep(ifelse(dd_rle$values, dd_rle$lengths, 0),
dd_rle$lengths)
My dataset has columns and values like this. The column names all start with a common string, Col_a_**
ID Col_a_01 Col_a_02 Col_a_03
1 1 2 1
2 1 NA 0
3 NA 0 2
4 1 0 1
5 0 0 2
My goal is to replace the missing values with the mode values for that column.
The expected dataset to be like this
ID Col_a_01 Col_a_02 Col_a_03
1 1 2 1
2 1 0** 0
3 1** 0 2
4 1 0 1
5 0 0 2
The NA in the first column is replaced by 1 because the mode of the 1st column is 1. The NA in the second column is replaced by 0 because the mode for the 2nd column is 0.
I can do this as shown below:
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
df$Col_a_01[is.na(df$Col_a_01)] <- getmode(df$Col_a_01)
df$Col_a_02[is.na(df$Col_a_02)] <- getmode(df$Col_a_02)
df$Col_a_03[is.na(df$Col_a_03)] <- getmode(df$Col_a_03)
But this becomes unwieldy if I have 100 columns with similar names ending in 1, 2, 3, ..., 100. I am curious if there is an easier and more elegant way to accomplish this. Thanks in advance.
You can replace the NA values with ifelse/replace; to apply a function to multiple columns, use across in dplyr.
library(dplyr)
df <- df %>%
mutate(across(starts_with('Col_a'), ~replace(., is.na(.), getmode(.))))
In base R, use lapply:
cols <- grep('Col_a', names(df))
df[cols] <- lapply(df[cols], function(x) replace(x, is.na(x), getmode(x)))
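One caveat worth checking: getmode as written counts NA like any other value, so a column in which NA is the most frequent entry gets NA back as its "mode". A possible sketch of a variant that drops NAs first; the name getmode_narm is hypothetical and not part of the question:

```r
# getmode copied from the question
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Hypothetical variant: ignore NAs before looking for the mode
getmode_narm <- function(v) getmode(v[!is.na(v)])

x <- c(NA, NA, NA, 1, 1, 0)
getmode(x)
# [1] NA
getmode_narm(x)
# [1] 1

# Used the same way as before
filled <- replace(x, is.na(x), getmode_narm(x))
filled
# [1] 1 1 1 1 1 0
```

For the sample data in the question both versions give the same result, because NA is never the most frequent value there.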
We can use na.aggregate with FUN specified as getmode
library(zoo)
library(dplyr)
df1 <- df1 %>%
mutate(across(starts_with('Col_a'), ~ na.aggregate(.x, FUN = getmode)))
-output
df1
ID Col_a_01 Col_a_02 Col_a_03
1 1 1 2 1
2 2 1 0 0
3 3 1 0 2
4 4 1 0 1
5 5 0 0 2
Or it can be simply
na.aggregate(df1, FUN = getmode)
ID Col_a_01 Col_a_02 Col_a_03
1 1 1 2 1
2 2 1 0 0
3 3 1 0 2
4 4 1 0 1
5 5 0 0 2
I have a data.frame with 4 categorical variables on a scale of 1-5.
data.frame(
first=c(2,3,3,2,2),
second=c(5,5,4,5,5),
third=c(5,5,5,4,4),
fourth=c(2,1,1,1,2))
first second third fourth
2 5 5 2
3 5 5 1
3 4 5 1
2 5 4 1
2 5 4 2
I want to turn the variable names into a single column and count how often each value occurs, with one new column per level of the categorical scale.
newvar 1 2 3 4 5
first 0 3 2 0 0
second 0 0 0 1 4
third 0 0 0 2 3
fourth 3 2 0 0 0
Using data.table :
library(data.table)
dcast(melt(df), variable~value)
# variable 1 2 3 4 5
#1 first 0 3 2 0 0
#2 second 0 0 0 1 4
#3 third 0 0 0 2 3
#4 fourth 3 2 0 0 0
This returns some warnings because we are relying on the default options of melt and dcast; it is safe to ignore them in this case. To avoid the warnings, you can use this extended version.
library(data.table)
dcast(melt(setDT(df), measure.vars = names(df)), variable~value, fun.aggregate = length)
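For comparison, the same table of counts can probably also be produced without any packages. This is a different technique from the melt/dcast approach above: it tabulates each column against the fixed levels 1:5 with table and factor. The data frame is copied from the question; the name counts is made up.

```r
df <- data.frame(
  first  = c(2, 3, 3, 2, 2),
  second = c(5, 5, 4, 5, 5),
  third  = c(5, 5, 5, 4, 4),
  fourth = c(2, 1, 1, 1, 2))

# Tabulate every column against the full scale so that absent levels count as 0,
# then transpose so that variables are rows and scale levels are columns
counts <- t(sapply(df, function(x) table(factor(x, levels = 1:5))))
counts
#        1 2 3 4 5
# first  0 3 2 0 0
# second 0 0 0 1 4
# third  0 0 0 2 3
# fourth 3 2 0 0 0
```

The result is a matrix with the variable names as row names, not a data.frame with a name column.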
Not the cleanest method, but it works.
Use pivot_longer to transform the data into long format.
Then count how many occurrences there are of each value for each of your original columns.
Transform the data back into wide format using pivot_wider, filling value/column combinations that never occur with 0 (values_fill); the last two lines rearrange the data to match your desired output.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(c(first:fourth)) %>%
count(name, value) %>%
pivot_wider(names_from = "value",
values_from = "n",
values_fill = 0) %>% # without this, missing combinations are NA instead of 0
select(name, `1`, `2`, `3`, `4`, `5`) %>%
arrange(match(name, c("first", "second", "third", "fourth")), desc(name))
I have a dataset that looks like this:
library(purrr)
library(dplyr)
temp <- data.frame(col_A = c(1,2,NA,3,4,5,6), col_B = c(NA,1,2,NA,1,NA,NA))
col_A col_B
1 NA
2 1
NA 2
3 NA
4 3
5 NA
6 NA
I want to create a new dataframe which contains the count of non NA items for each column.
Like the following example:
count_A count_B
1 0
2 1
0 2
1 0
2 1
3 0
4 0
I am struggling to get the count of items.
My closest approximation is this:
count_days<-function(prev,new){
ifelse(!is.na(new),prev+1,0)
}
temp[,"col_A"] %>%
mutate(count_a=accumulate(count_a,count_days))
But I get the following error:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Can anyone help me with this code or suggest another approach?
I know this piece of code just tries to count, not creating the new df, which I think is easier after I get the correct result.
Using rle in a (somewhat nested) lapply approach. We first flag which elements of each column are NA. Then, using rle, we get the values and lengths of the runs and number the elements within each run. The runs that correspond to NA are set to 0 by multiplication, and the result is unlisted.
res <- as.data.frame(lapply(lapply(temp, is.na), function(x) {
r <- rle(x)
# lapply (not sapply) so the result stays a list even if all runs have equal length
s <- lapply(r$lengths, seq_len)
s[r$values] <- lapply(s[r$values], `*`, 0)
unlist(s)
}))
res
# col_A col_B
# 1 1 0
# 2 2 1
# 3 0 2
# 4 1 0
# 5 2 1
# 6 3 0
# 7 4 0
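The same idea can be sketched more compactly for a single vector with sequence, which numbers the elements within each run directly. The vector x below is col_B from the question's code; pos is a made-up name.

```r
x <- c(NA, 1, 2, NA, 1, NA, NA)

r <- rle(is.na(x))

# Position of each element within its run of NA / non-NA values
pos <- sequence(r$lengths)
pos
# [1] 1 1 2 1 1 1 2

# Zero out the positions that belong to NA runs
count <- pos * !is.na(x)
count
# [1] 0 1 2 0 1 0 0
```

Wrapped in a function, this could be applied to every column with lapply in the same way as above.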
We can use rleid from data.table
library(data.table)
setDT(temp)[, lapply(.SD, function(x) rowid(rleid(!is.na(x))) * !is.na(x))]
# col_A col_B
#1: 1 0
#2: 2 1
#3: 0 2
#4: 1 0
#5: 2 1
#6: 3 0
#7: 4 0
You can use sequence and rle from base R together with dplyr.
First set all non-NA values to 1, then use rle to number the elements within each run of identical values.
library(dplyr)
temp %>%
replace(.,!is.na(.),1) %>%
mutate(col_A=case_when(!is.na(col_A)~sequence(rle(col_A)$lengths))) %>%
mutate(col_B=case_when(!is.na(col_B)~sequence(rle(col_B)$lengths))) %>%
replace(.,is.na(.),0)
I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(FALSE, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
We get a running count within groups defined by diff(df$a) < 0, which marks the positions in the vector that are less than their predecessors. We prepend FALSE with c(FALSE, ...) to account for the first position, which has no predecessor. The cumulative sum of that vector creates an index for grouping. The function ave applies a function within each group of that index; we use seq_along for a running count. Since seq_along starts at 1, we subtract one, ave(...) - 1, to start from zero.
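Broken into steps, the intermediate vectors look roughly like this. The names drops, grp and run are made up for illustration; a is the vector from the question.

```r
a <- c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1)

# TRUE where a value is smaller than its predecessor; the first element has none
drops <- c(FALSE, diff(a) < 0)

# Every drop starts a new group
grp <- cumsum(drops)
grp
# [1] 0 0 0 0 1 1 1 1 1 1 1 2 2 3

# Number the elements within each group, starting from zero
run <- ave(a, grp, FUN = seq_along) - 1
run
# [1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
```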
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(grp = cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(run = row_number() - 1) %>%
ungroup()
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
That is the count without a reset.
With a reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))