R: colSums & Mutate (Wrong Result Size)

I am trying to use mutate to add colSums as a row at the bottom of a data frame, but the first column of the table is a character vector containing labels.
For example,
df=data.frame(
label = c("A","B","C","D","E","F","G","H","I","J"),
x1=c(1,0,0,NA,0,1,1,NA,0,1),
x2=c(1,1,NA,1,1,0,NA,NA,0,1),
x3=c(0,1,0,1,1,0,NA,NA,0,1),
x4=c(1,0,NA,1,0,0,NA,0,0,1),
x5=c(1,1,NA,1,1,1,NA,1,0,1))
Without the label col,
df %>% mutate(Total = colSums(df[, 1:5], na.rm = TRUE))
should work fine... but I tried
df %>% mutate(Total = colSums(df[, 2:6], na.rm = TRUE))
which gave me an error message
Error in mutate_impl(.data, dots) : wrong result size (5), expected
10 or 1
How can I ignore that first column and still mutate colSums into the bottom of my data frame?
Thank you.

mutate adds a new column to a data.frame. You indicate you're trying to add a new row to the bottom. Hence the error message: in trying to create a new column, mutate expects a vector of length 10 (the number of rows) or a single value that it can recycle to fill the entire column.
If you want to add a totals row to a data.frame, try janitor::adorn_totals("row"):
library(janitor)
df %>%
adorn_totals("row")
label x1 x2 x3 x4 x5
A 1 1 0 1 1
B 0 1 1 0 1
C 0 NA 0 NA NA
D NA 1 1 1 1
E 0 1 1 0 1
F 1 0 0 0 1
G 1 NA NA NA NA
H NA NA NA 0 1
I 0 0 0 0 0
J 1 1 1 1 1
Total 4 5 4 3 7
Self-promotion disclaimer: I wrote the janitor package and this function; I'm posting this answer because the function addresses precisely this situation.
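If you would rather not pull in an extra package, here is a dplyr-only sketch of the same idea (this assumes dplyr >= 1.0 so that across() is available; it is not part of the original answer):
library(dplyr)
# build the one-row summary of column totals, then append it below the original rows
df %>%
summarise(label = "Total", across(x1:x5, ~ sum(.x, na.rm = TRUE))) %>%
bind_rows(df, .)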

Related

Count consecutive non-NA items

I have a dataset that looks like this:
library(purrr)
library(dplyr)
temp<-as.data.frame(cbind(col_A<-c(1,2,NA,3,4,5,6),col_B<-c(NA,1,2,NA,1,NA,NA)))
names(temp)<-c("col_A","col_B")
col_A col_B
1 NA
2 1
NA 2
3 NA
4 3
5 NA
6 NA
I want to create a new dataframe which contains the count of consecutive non-NA items for each column.
Like the following example:
count_A count_B
1 0
2 1
0 2
1 0
2 1
3 0
4 0
I am struggling to get the count of items.
My closest approximation is this:
count_days<-function(prev,new){
ifelse(!is.na(new),prev+1,0)
}
temp[,"col_A"] %>%
mutate(count_a=accumulate(count_a,count_days))
But I get the following error:
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "c('double', 'numeric')"
Can anyone help me with this code, or point me towards another approach?
I know this piece of code only tries to count, not create the new df; I think that part will be easier once I get the correct counts.
Using rle in a (somewhat nested) lapply approach. We first flag which elements of the data are NA. rle then encodes that flag vector into run values and lengths, each run length is expanded into a sequence with seq_len, sequences belonging to NA runs are set to 0 by multiplication, and the result is unlisted back into a column.
res <- as.data.frame(lapply(lapply(temp, is.na), function(x) {
r <- rle(x)
s <- sapply(r$lengths, seq_len)
s[r$values] <- lapply(s[r$values], `*`, 0)
unlist(s)
}))
res
# col_A col_B
# 1 1 0
# 2 2 1
# 3 0 2
# 4 1 0
# 5 2 1
# 6 3 0
# 7 4 0
We can use rleid from data.table
library(data.table)
setDT(temp)[, lapply(.SD, function(x) rowid(rleid(!is.na(x))) * !is.na(x))]
# col_A col_B
#1: 1 0
#2: 2 1
#3: 0 2
#4: 1 0
#5: 2 1
#6: 3 0
#7: 4 0
You can use sequence and rle (both base R) together with tidyverse verbs.
First set all non-NA values to 1, then let rle count the runs of identical values.
library(tidyverse)
temp %>%
replace(.,!is.na(.),1) %>%
mutate(col_A=case_when(!is.na(col_A)~sequence(rle(col_A)$lengths))) %>%
mutate(col_B=case_when(!is.na(col_B)~sequence(rle(col_B)$lengths))) %>%
replace(.,is.na(.),0)
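For completeness, the accumulate() idea from the question can also be made to work; a hedged sketch (the key fixes are calling mutate on the data frame rather than on a bare vector, and giving accumulate an .init value that is dropped afterwards):
library(dplyr)
library(purrr)
count_days <- function(prev, new) ifelse(!is.na(new), prev + 1, 0)
temp %>%
mutate(count_A = accumulate(col_A, count_days, .init = 0)[-1],
count_B = accumulate(col_B, count_days, .init = 0)[-1])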

If any value is present in a column of a dataframe, change the value to 1, else insert 0

I have a dataframe with about 1000 rows and 1000 columns. If any value is present in a cell of the dataframe, I want to change that cell to 1; otherwise I want to put a 0 in it. I am programming in R, so an R solution would be appreciated. I don't want the values in the T column to change, only the rest of the columns.
For example, I have a dataframe like this:
T A B C D
1 29 90 0 100
2 30 12 76 0
3 0 12 0 32
convert it to :
T A B C D
1 1 1 0 1
2 1 1 1 0
3 0 1 0 1
To ignore the first column, you could combine it with a simple modification of akrun's first solution (shown below). For example,
data.frame(df[, 1, drop=FALSE], +(df[,-1] != 0))
We can convert to a logical matrix and coerce it to integer
df1 <- +(df != 0)
Or with replace
replace(df, df != 0, 1)
If we need to do this without taking the first column
df[-1] <- +(df[-1] != 0)
Or with sapply
+(sapply(df, `!=`, 0))
In tidyverse, we can use mutate_all
library(dplyr)
df <- df %>%
mutate_all(~ as.integer(. != 0))
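mutate_all is superseded in current dplyr; a hedged sketch using across() that also leaves the first column (T in the question) untouched:
library(dplyr)
# convert every column except the first to 0/1
df <- df %>%
mutate(across(-1, ~ as.integer(. != 0)))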

How to add an aggregated variable to an existing dataset in R

How do you add a variable to a dataset using the aggregate and by commands? For example, I have:
num x1
1 1
1 0
2 0
2 0
And I'm looking to create a variable that flags every num group in which any x1 is 1, for example:
num x1 x2
1 1 1
1 0 1
2 0 0
2 0 0
or
num x1 x2
1 1 TRUE
1 0 TRUE
2 0 FALSE
2 0 FALSE
I've tried to use
df$x2 <- aggregate(df$x1, by = list(df$num), FUN = sum)
But I'm getting an error that says the replacement has a different number of rows than the data. Can anyone help?
This can be done by grouping by 'num' and checking whether any element of 'x1' is 1. ave from base R is convenient for this instead of aggregate:
df1$x2 <- with(df1, ave(x1==1, num, FUN = any))
df1$x2
#[1] 1 1 0 0
Or using dplyr, we group by 'num' and create 'x2' by checking whether any 'x1' equals 1. It will be a logical vector unless we wrap it in as.integer to convert it to binary:
library(dplyr)
df1 %>%
group_by(num) %>%
mutate(x2 = as.integer(any(x1==1)))
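For reference, the aggregate() route the question asked about can also work; the original error came from assigning the grouped (and therefore shorter) result straight back into the data frame. A hedged sketch that merges the aggregate back in by 'num':
# one row per num group, flagging whether any x1 equals 1
agg <- aggregate(x1 ~ num, data = df, FUN = function(x) as.integer(any(x == 1)))
names(agg)[2] <- "x2"
# merging recycles the group-level flag back onto every original row
merge(df, agg, by = "num")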

Counting occurrences by row

Imagine I have a data.frame (or matrix) with a few distinct values, such as this:
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values, I would have to write many such lines, which isn't elegant.
I've tried with table(), but the output is not easily usable
apply(test,1, function(x) table(x))
and there is a big problem: if any row doesn't contain any occurrence of some value, the result generated by table() has a different length and can't be bound together.
Is there a way to force table() to take that value into account and report zero occurrences? (A sketch addressing this follows the question.)
I've also thought of using do.call or lapply and merge, but it's too difficult for me.
I've also read about dplyr's count, but I have no clue how to use it here.
Could anyone provide a solution with dplyr or tidyr?
P.S.: What about a data.table solution?
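On the specific question of forcing table() to report zero counts: converting each row to a factor with explicit levels does exactly that. A minimal sketch, assuming the possible values are -1, 0 and 1 as in the example (row_counts is just an illustrative name):
# fixed levels make every row's table the same length, zeros included
row_counts <- t(apply(test, 1, function(x) table(factor(x, levels = c(-1, 0, 1)))))
cbind(test2, row_counts)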
We could melt the dataset to long format after converting to matrix, get the frequency using table and cbind with the original dataset.
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating row id with add_rownames from dplyr
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
gather(Var, Val, -rowname) %>%
group_by(rn= as.numeric(rowname), Val) %>%
summarise(N=n()) %>%
spread(Val, N, fill=0) %>%
bind_cols(test2, .)
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test==x)))
similar to the code in the comment from etienne, but without the call to apply()
Here is an answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
test[paste0("Y",val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1 0 1 1 1 0 -1 -1 1 1 3 5 2
# 2 1 -1 0 1 1 -1 -1 0 0 1 3 4 3
# 3 -1 0 1 0 1 1 1 1 -1 1 2 6 2
# 4 1 1 1 1 0 1 1 0 1 0 0 7 3
# 5 0 -1 1 -1 -1 0 0 1 0 0 3 2 5
# 6 1 1 0 1 1 1 1 1 1 1 0 9 1

Set NA to 0 in R

After merging a dataframe with another, I'm left with random NAs in the occasional row. I'd like to set these NAs to 0 so I can perform calculations with them.
I'm trying to do this with:
bothbeams.data = within(bothbeams.data, {
bothbeams.data$x.x = ifelse(is.na(bothbeams.data$x.x) == TRUE, 0, bothbeams.data$x.x)
bothbeams.data$x.y = ifelse(is.na(bothbeams.data$x.y) == TRUE, 0, bothbeams.data$x.y)
})
Where $x.x is one column and $x.y is the other of course, but this doesn't seem to work.
You can just use the output of is.na to replace directly with subsetting:
bothbeams.data[is.na(bothbeams.data)] <- 0
Or with a reproducible example:
dfr <- data.frame(x=c(1:3,NA),y=c(NA,4:6))
dfr[is.na(dfr)] <- 0
dfr
x y
1 1 0
2 2 4
3 3 5
4 0 6
However, be careful using this method on a data frame containing factors that also have missing values:
> d <- data.frame(x = c(NA,2,3),y = c("a",NA,"c"))
> d[is.na(d)] <- 0
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
invalid factor level, NA generated
It "works":
> d
x y
1 0 a
2 2 <NA>
3 3 c
...but you likely will want to specifically alter only the numeric columns in this case, rather than the whole data frame. See, e.g., the answer below using dplyr::mutate_if.
A solution using mutate_all from dplyr in case you want to add that to your dplyr pipeline:
library(dplyr)
df %>%
mutate_all(funs(ifelse(is.na(.), 0, .)))
Result:
A B C
1 0 0 0
2 1 0 0
3 2 0 2
4 3 0 5
5 0 0 2
6 0 0 1
7 1 0 1
8 2 0 5
9 3 0 2
10 0 0 4
11 0 0 3
12 1 0 5
13 2 0 5
14 3 0 0
15 0 0 1
If you only want to replace the NAs in numeric columns, which may well be the case when modeling, you can use mutate_if:
library(dplyr)
df %>%
mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
or in base R:
replace(df, is.na(df), 0)
Result:
A B C
1 0 0 0
2 1 <NA> 0
3 2 0 2
4 3 <NA> 5
5 0 0 2
6 0 <NA> 1
7 1 0 1
8 2 <NA> 5
9 3 0 2
10 0 <NA> 4
11 0 0 3
12 1 <NA> 5
13 2 0 5
14 3 <NA> 0
15 0 0 1
Update
With dplyr 1.0.0, across() was introduced:
library(dplyr)
# Replace `NA` for all columns
df %>%
mutate(across(everything(), ~ ifelse(is.na(.), 0, .)))
# Replace `NA` for numeric columns
df %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), 0, .)))
Data:
set.seed(123)
df <- data.frame(A=rep(c(0:3, NA), 3),
B=rep(c("0", NA), length.out = 15),
C=sample(c(0:5, NA), 15, replace = TRUE))
You can use replace_na() from the tidyr package:
df %>% replace_na(list(column1 = 0, column2 = 0))
To add to James's example, it seems you always have to create an intermediate when performing calculations on NA-containing data frames.
For instance, adding two columns (A and B) together from a data frame dfr:
temp.df <- data.frame(dfr) # copy the original
temp.df[is.na(temp.df)] <- 0
dfr$C <- temp.df$A + temp.df$B # or any other calculation
remove('temp.df')
When I do this I throw away the intermediate afterwards with remove/rm.
If you only want to replace NAs with 0s in a few select columns, you can also use an lapply solution, e.g.:
data = data.frame(
one = c(NA,0),
two = c(NA,NA),
three = c(1,2),
four = c("A",NA)
)
data[1:2] = lapply(data[1:2],function(x){
x[is.na(x)] = 0
return(x)
})
data
Why not try this?
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
na.zero(df)
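If the data frame mixes numeric and factor/character columns, applying it only to the numeric columns avoids the factor warning shown earlier; a hedged sketch:
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], na.zero)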
