Length of columns excluding NA in R

Suppose that I have a data.frame as follows:
   a  b c
1  5 NA 6
2 NA NA 7
3  6  5 8
I would like to find the length of each column, excluding NA's. The answer should look like
a b c
2 1 3
So far, I've tried:
!is.na() # Gives TRUE/FALSE
length(!is.na()) # 9 -> Length of the whole matrix
dim(!is.na()) # 3 x 3 -> dimension of a matrix
na.omit() # removes rows with any NA in it.
How can I get the required answer?

A fast option is colSums on the logical matrix:
colSums(!is.na(dat))
a b c
2 1 3

Though the sum is probably a faster solution, I think that length(x[!is.na(x)]) is more readable.
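As a minimal sketch of the length-based variant mentioned above, sapply can apply it column by column. The data frame dat is rebuilt here from the question's example, so its name and contents are assumptions, and non_na_len is a hypothetical variable name.

```r
# Rebuild the example data from the question (assumed name: dat)
dat <- data.frame(a = c(5, NA, 6), b = c(NA, NA, 5), c = c(6, 7, 8))

# Count non-NA entries per column by subsetting, then taking the length
non_na_len <- sapply(dat, function(x) length(x[!is.na(x)]))
print(non_na_len)

# The same counts via colSums on the logical matrix
print(colSums(!is.na(dat)))
```

Both calls should report 2, 1 and 3 for columns a, b and c; the sapply version returns integers while colSums returns doubles.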

> apply(dat, 2, function(x){sum(!is.na(x))})
a b c
2 1 3

I tried NCOL instead of ncol and it worked.
> nrow(tsa$Region)
NULL
> NROW(tsa$Region)
[1] 27457
> ncol(tsa$Region)
NULL
> NCOL(tsa$Region)
[1] 1
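The difference shown above comes from how the two pairs of functions treat a plain vector: nrow and ncol read the dim attribute, which a vector does not have, while NROW and NCOL treat a vector as a one-column matrix. A small sketch, using a hypothetical vector region as a stand-in for tsa$Region:

```r
# Hypothetical stand-in for a single data frame column such as tsa$Region
region <- c("north", "south", "east")

# A plain vector has no dim attribute, so nrow() and ncol() return NULL
print(is.null(nrow(region)))
print(is.null(ncol(region)))

# NROW() and NCOL() treat the vector as a one-column matrix
print(NROW(region))
print(NCOL(region))
```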

How do I lag a data.frame?

I'd like to lag a whole data frame in R.
In Python this is very easy with the pandas shift() function
(e.g. df.shift(1)).
However, I could not find a method in R that is as simple as pandas' shift().
How can I do this?
> x = data.frame(a=c(1,2,3),b=c(4,5,6))
> x
a b
1 1 4
2 2 5
3 3 6
What I want is,
> lag(x,1)
a b
1 NA NA
2 1 4
3 2 5
Any good ideas?
Pretty simple in base R:
rbind(NA, head(x, -1))
a b
1 NA NA
2 1 4
3 2 5
head with -1 drops the final row and rbind with NA as the first argument adds a row of NAs.
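The same head-plus-rbind idea can be extended to an arbitrary number of lags. The following is only a sketch under stated assumptions: shift_rows is a hypothetical helper name, k is assumed to be a non-negative whole number no larger than nrow(df), and the padding rows are built by indexing with NA.

```r
# Hypothetical helper: shift all columns of df down by k rows, padding with NA
shift_rows <- function(df, k = 1L) {
  stopifnot(k >= 0, k <= nrow(df))
  if (k == 0) return(df)
  # Indexing rows with NA yields all-NA rows with the original column types
  pad <- df[rep(NA_integer_, k), , drop = FALSE]
  # head(df, -k) drops the last k rows
  out <- rbind(pad, head(df, -k))
  rownames(out) <- NULL  # reset row names to 1..n
  out
}

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
print(shift_rows(x, 1))
print(shift_rows(x, 2))
```

Building the padding from df itself, rather than from a bare NA, keeps each column's type even for columns such as dates.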
You can also use row indexing [, like this
x[c(NA, 1:(nrow(x)-1)),]
a b
NA NA NA
1 1 4
2 2 5
This leaves an NA as the row name of the first row. To "fix" this, you can strip the data.frame class and then reassign it:
data.frame(unclass(x[c(NA, 1:(nrow(x)-1)),]))
a b
1 NA NA
2 1 4
3 2 5
Here, you can use rep to produce the desired lags
data.frame(unclass(x[c(rep(NA, 2), 1:(nrow(x)-2)),]))
a b
1 NA NA
2 NA NA
3 1 4
and even put this into a function
myLag <- function(dat, lag) data.frame(unclass(dat[c(rep(NA, lag), 1:(nrow(dat)-lag)),]))
Give it a try
myLag(x, 2)
a b
1 NA NA
2 NA NA
3 1 4
library(dplyr)
x %>% mutate_all(lag)
a b
1 NA NA
2 1 4
3 2 5
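Just for completeness, this is analogous to how zoo implements it (but for a data.frame, since the zoo lag(...) method doesn't work on data.frame objects). A negative lag shifts values down and a positive lag shifts them up:
lag.df <- function(x, lag) {
  if (lag == 0) return(x)
  # pad with abs(lag) all-NA rows so the result keeps nrow(x) rows
  pad <- x[rep(NA_integer_, abs(lag)), , drop = FALSE]
  out <- if (lag < 0) rbind(pad, head(x, lag)) else rbind(tail(x, -lag), pad)
  rownames(out) <- NULL
  out
}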
and use like this:
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
lag.df(x, -1)
lag.df(x, 1)
or you can just use zoo:
library(zoo)
x <- data.frame(dt=c(as.Date('2019-01-01'), as.Date('2019-01-02'), as.Date('2019-01-03')), a=c(1,2,3),b=c(4,5,6))
x.zoo <- read.zoo(x)
lag(x.zoo, -1)
lag(x.zoo, 1)

Setting values to NA in a dataframe in R

Here is some reproducible code that shows the problem I am trying to solve in another dataset. Suppose I have a data frame df with some "NULL" values in it. I would like to replace these with NAs, as I attempt to do below. But when I print the result, the value comes out as <NA>. The second data frame, d, is what I would like to produce from df: its NA is a regular NA without the angle brackets.
> df = data.frame(a=c(1,2,3,"NULL"),b=c(1,5,4,6))
> df[4,1] = NA
> print(df)
a b
1 1 1
2 2 5
3 3 4
4 <NA> 6
>
> d = data.frame(a=c(1,2,3,NA),b=c(1,5,4,6))
> print(d)
a b
1 1 1
2 2 5
3 3 4
4 NA 6
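A likely explanation, offered as a sketch rather than a definitive answer: because column a contains the string "NULL", the whole column is stored as character (or as a factor in older R versions), and print shows missing values in such columns as <NA>. Converting the column to numeric after replacing the placeholder should give the plain NA. The example sets stringsAsFactors = FALSE explicitly, which is an assumption about how the data is read.

```r
# Mixed numbers and the string "NULL" make column a a character column
df <- data.frame(a = c(1, 2, 3, "NULL"), b = c(1, 5, 4, 6),
                 stringsAsFactors = FALSE)

# Replace the placeholder with NA, then convert the column to numeric
df$a[df$a == "NULL"] <- NA
df$a <- as.numeric(df$a)

print(df)
print(class(df$a))
```

If the column were a factor instead, as.numeric(as.character(df$a)) would be needed so that the labels, not the internal level codes, are converted.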

Convert a list of varying lengths into a dataframe

I am trying to convert a simple list of varying lengths into a data frame as shown below. I would like to populate the missing values with NaN. I tried using ldply, rbind, as.data.frame() but I failed to get it into the format I want. Please help.
x=c(1,2)
y=c(1,2,3)
z=c(1,2,3,4)
a=list(x,y,z)
a
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
[[3]]
[1] 1 2 3 4
Output should be:
  x   y z
  1   1 1
  2   2 2
NaN   3 3
NaN NaN 4
Using rbind.fill.matrix from "plyr" gets you very close to what you're looking for:
> library(plyr)
> t(rbind.fill.matrix(lapply(a, t)))
  [,1] [,2] [,3]
1    1    1    1
2    2    2    2
3   NA    3    3
4   NA   NA    4
This is a lot of code, so not as clean as Ananda's solution, but it's all base R:
maxl <- max(sapply(a,length))
out <- do.call(cbind, lapply(a,function(x) x[1:maxl]))
# out <- matrix(unlist(lapply(a,function(x) x[1:maxl])), nrow=maxl) #another way
out <- as.data.frame(out)
#names(out) <- names(a)
Result:
> out
V1 V2 V3
1 1 1 1
2 2 2 2
3 NA 3 3
4 NA NA 4
Note: names of the resulting df will depend on the names of your list (a), which doesn't currently have names.
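To illustrate the note about names, here is a compact base R sketch that pads each element with the replacement function `length<-` and starts from a named list. Naming the list elements x, y and z is an assumption made for illustration; the original list a is unnamed.

```r
# Same vectors as in the question, but stored in a named list
a <- list(x = c(1, 2), y = c(1, 2, 3), z = c(1, 2, 3, 4))

# Length of the longest element
maxl <- max(lengths(a))

# `length<-` extends each vector to maxl, filling with NA;
# sapply then binds the padded vectors into a matrix, one column per element
out <- as.data.frame(sapply(a, `length<-`, maxl))
print(out)
```

The columns take their names from the list, so no separate names(out) assignment is needed here.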

Check for unique elements

Just a simple question.
I have a data frame (only one column is shown) that looks like:
cln1
A
b
A
A
c
d
A
....
I would like the following output:
cln1
b
c
d
In other words, I would like to remove all items that are replicated. The functions "unique" and "duplicated" both keep one copy of each replicated element in the output. I would like to remove such elements entirely.
You can use setdiff for that :
R> v <- c(1,1,2,2,3,4,5)
R> setdiff(v, v[duplicated(v)])
[1] 3 4 5
You could use count from the plyr package to count the occurrences of each item, and drop every item that occurs more than once.
library(plyr)
l = c(1,2,3,3,4,5,6,6,7)
count_l = count(l)
x freq
1 1 1
2 2 1
3 3 2
4 4 1
5 5 1
6 6 2
7 7 1
l[!l %in% with(count_l, x[freq > 1])]
[1] 1 2 4 5 7
Note the !, which means NOT. You can of course put this in a one-liner:
l[!l %in% with(count(l), x[freq > 1])]
Another way, using table.
With @juba's data:
as.numeric(names(which(table(v) == 1)))
# [1] 3 4 5
For the OP's data, since the output is character, as.numeric is not required.
names(which(table(v) == 1))
# [1] "b" "c" "d"
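Another base R possibility, shown here as a sketch: duplicated only flags the second and later copies of a value, so combining a forward pass with a backward pass (fromLast = TRUE) flags every copy. The vector below is rebuilt from the question's example column, and singles is a hypothetical variable name.

```r
# Example column from the question
v <- c("A", "b", "A", "A", "c", "d", "A")

# A value is replicated if it is a duplicate scanning forward or backward
is_replicated <- duplicated(v) | duplicated(v, fromLast = TRUE)

# Keep only the values that occur exactly once, in their original order
singles <- v[!is_replicated]
print(singles)
```

Unlike the table approach, this keeps the original order and type of the values.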

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously, where there are NAs in column C, I get an NA in variable D. I don't want to exclude all the NA rows, but rather just ignore the NA values in this column (so the value in D would simply be A*B or, where C is available, A*B*C).
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wondering what the simplest way of doing this is.
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As @James said, prod and apply will work, but you don't need to waste memory storing the result in a separate variable, or even cbind it:
Df.1$D = apply(Df.1, 1, prod, na.rm=TRUE)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36
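The question also mentions replacing the NAs with 1s. As a sketch of that idea without modifying column C itself, ifelse can substitute 1 only inside the product. This is a vectorized alternative to the apply call and assumes the columns are named A, B and C as in the question.

```r
Df.1 <- data.frame(A = c(5, 4, 7, 6, 8, 4),
                   B = c(1, 5, 2, 4, 9, 1),
                   C = c(2, 3, NA, 5, NA, 9))

# Treat a missing C as 1 so it drops out of the product; C itself is unchanged
Df.1$D <- with(Df.1, A * B * ifelse(is.na(C), 1, C))
print(Df.1)
```

Because with() evaluates the expression inside the data frame, attach() is not needed.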
