I have a dataset of 392 rows and 156 columns that represents detections and non-detections of a species. Each column represents a 'visit' to the field for a survey, and each row represents a surveyed site. Cells hold 0 or 1 depending on whether the species of interest was recorded during that visit, or NA if no survey was conducted during that time period. I aggregated my visits by month, so each column represents a 'monthly visit', that is, a 30-day interval within a given year. Because I have several years of data, I created consecutive, sequential month periods that span all the years for which I have data. Since most sites were surveyed in different years, many columns (time periods) are unique to a single site, and so I have a LOT of NAs: 1,646 records of either 0/1 and 59,506 NAs.
I want to restructure my data so that I can remove as many NAs as possible, by treating each column not as a specific time period but as a generic time interval. So instead of column 1 being, for example, the specific period 3/2008-4/2018, it would just be 'Survey 1', which would represent a different month and year for each site. By removing all the NAs that precede the actual survey period for each site, I would get a cleaner, smaller dataset with fewer NAs. The idea would be the following:
Go from this df I have:
df <- read.table(text = "3/2008-4/2018 5/2008-6/2008 7/2009-8/2009 9/2009-10/2009 11/2009-12/2009 01/2010-02/2010 03/2010-04/2010 05/2010-06/2010 07/2010-08/2010
1 NA NA NA NA NA NA 1 1 1
2 NA NA NA 1 0 NA NA NA NA
3 NA NA NA 0 0 NA NA NA NA
4 0 1 0 1 1 1 NA NA NA
5 0 1 NA NA NA 1 0 1 1")
To this new df:
df_new <- read.table(text = "v1 v2 v3 v4 V5 V6
1 1 1 1 NA NA NA
2 1 0 NA NA NA NA
3 0 0 NA NA NA NA
4 0 1 0 1 1 1
5 0 1 1 0 1 1")
Could anyone help me write code to do this, please? Thank you!
You can use na.omit and then subset using [ to get vectors of equal length.
x <- apply(unname(df), 1, na.omit)
t(sapply(x, "[", 1:max(lengths(x))))
# [,1] [,2] [,3] [,4] [,5] [,6]
#1 1 1 1 NA NA NA
#2 1 0 NA NA NA NA
#3 0 0 NA NA NA NA
#4 0 1 0 1 1 1
#5 0 1 1 0 1 1
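A caveat on the approach above: apply() simplifies its result to a matrix when every row happens to keep the same number of non-NA values, and the sapply step then no longer behaves as intended. The following is a minimal sketch that avoids relying on that simplification. The function name compact_visits and the v1, v2, ... column labels are made up for illustration, and it assumes numeric 0/1 data with at least one non-NA value somewhere in the table.

```r
# Example detection matrix: rows are sites, columns are calendar periods
m <- rbind(
  c(NA, NA, NA, NA, NA, NA,  1,  1,  1),
  c(NA, NA, NA,  1,  0, NA, NA, NA, NA),
  c(NA, NA, NA,  0,  0, NA, NA, NA, NA),
  c( 0,  1,  0,  1,  1,  1, NA, NA, NA),
  c( 0,  1, NA, NA, NA,  1,  0,  1,  1)
)

# compact_visits() is a hypothetical helper, not part of any package
compact_visits <- function(m) {
  m <- as.matrix(m)
  # Keep the non-NA values of each row, in their original order
  kept <- lapply(seq_len(nrow(m)), function(i) unname(m[i, !is.na(m[i, ])]))
  width <- max(lengths(kept))
  # Start from an all-NA matrix and fill each row from the left
  out <- matrix(NA_real_, nrow = nrow(m), ncol = width)
  for (i in seq_along(kept)) {
    out[i, seq_along(kept[[i]])] <- kept[[i]]
  }
  colnames(out) <- sprintf("v%d", seq_len(width))
  out
}

df_new <- compact_visits(m)
df_new
```

Like the na.omit version, this drops every NA in a row, including gaps between surveys (see site 5), so the columns become "n-th survey actually conducted" rather than consecutive months.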
I am trying to add two columns. My dataframe is like this one:
data <- data.frame(a = c(0,1,NA,0,NA,NA),
x = c(NA,NA,NA,NA,1,0),
t = c(NA,2,NA,NA,2,0))
I want to add some of the columns like this:
yep <- cbind.data.frame( data$a, data$x, rowSums(data[,c(1, 2)], na.rm = TRUE))
However the output looks like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA 0
4 0 NA 0
5 NA 1 1
6 NA 0 0
And I would like an output like this:
> yep
data$a data$x rowSums(data[,c(1, 2)], na.rm = TRUE)
1 0 NA 0
2 1 NA 1
3 NA NA NA
4 0 NA 0
5 NA 1 1
6 NA 0 0
If both columns are NA in a given row, I want the result for that row to stay NA.
How could I achieve this?
Base R:
data <- data.frame("a" = c(0,1,NA,0,NA,NA),
"x" = c(NA,NA,NA,NA,1,0),
"t" = c(NA,2,NA,NA,2,0)
)
yep <- cbind.data.frame( data$a, data$x, rs = rowSums(data[,c(1, 2)], na.rm = TRUE))
yep$rs[is.na(data$a) & is.na(data$x)] <- NA
yep
Base R (ifelse):
cbind(data$a,data$x,ifelse(is.na(data$a) & is.na(data$x),NA,rowSums(data[,1:2],na.rm = TRUE)))
If you want a data frame with column names rather than a plain matrix, replace cbind with cbind.data.frame.
Output:
[,1] [,2] [,3]
[1,] 0 NA 0
[2,] 1 NA 1
[3,] NA NA NA
[4,] 0 NA 0
[5,] NA 1 1
[6,] NA 0 0
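You might try dplyr::coalesce. Note that coalesce returns the first non-missing value in each row rather than a sum, so it matches the desired output only when at most one of the two columns is non-NA per row, as is the case here.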
cbind.data.frame( data$a, data$x, dplyr::coalesce(data$a, data$x))
# data$a data$x dplyr::coalesce(data$a, data$x)
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
Base R (nested ifelse):
data[['rowsum']]<-ifelse(is.na(data$a) & is.na(data$x),NA,ifelse(is.na(data$a),0,data$a)+ifelse(is.na(data$x),0,data$x))
a x t rowsum
1: 0 NA NA 0
2: 1 NA 2 1
3: NA NA NA NA
4: 0 NA NA 0
5: NA 1 2 1
6: NA 0 0 0
Another base R approach.
If all the values in the rows are NA then return NA or else return sum of the row ignoring NA's.
#Select only the columns which we need
sub_df <- data[c("a", "x")]
sub_df$answer <- ifelse(rowSums(is.na(sub_df)) == ncol(sub_df), NA,
rowSums(sub_df, na.rm = TRUE))
sub_df
# a x answer
#1 0 NA 0
#2 1 NA 1
#3 NA NA NA
#4 0 NA 0
#5 NA 1 1
#6 NA 0 0
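The same idea can be wrapped in a small reusable function. This is only a sketch: the name row_sum_or_na is made up for illustration, and it assumes the selected columns are numeric.

```r
# row_sum_or_na() is a hypothetical helper, not part of base R
row_sum_or_na <- function(df, cols) {
  sub <- df[cols]
  # TRUE for rows where every selected column is NA
  all_missing <- rowSums(!is.na(sub)) == 0
  total <- rowSums(sub, na.rm = TRUE)
  total[all_missing] <- NA
  unname(total)
}

data <- data.frame(a = c(0, 1, NA, 0, NA, NA),
                   x = c(NA, NA, NA, NA, 1, 0),
                   t = c(NA, 2, NA, NA, 2, 0))

data$rs <- row_sum_or_na(data, c("a", "x"))
data$rs
# 0 1 NA 0 1 0
```

Selecting columns by name keeps the result stable if columns are later reordered.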
In the below test data, v4 is calculated out of v1, v2 and v3 as follows:
test$v4 <- (test$v1 + test$v2 + test$v3) / 3
As expected, any row with a missing value returns an NA result for v4:
v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA
However, I want R to return an NA only when there are two or three NA values. If there is only one NA, I want R to calculate the mean of the two available values.
Can you please advise as to how I can do that?
Thank you.
You can use ifelse and rowSums(is.na()) to apply different formulas to different rows:
dat <- read.table(text= "v1 v2 v3 v4
1 1 1 2 1.333333
2 1 1 2 1.333333
3 1 2 NA NA
4 0 1 NA NA
5 NA NA 0 NA
6 NA 1 0 NA
7 1 2 NA NA")
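# v4 is already a column of dat, so restrict the NA count and the mean to v1..v3
vars <- c("v1", "v2", "v3")
# NA if two or more of v1..v3 are missing, otherwise the mean of the available values
dat$v4 <- ifelse(rowSums(is.na(dat[vars])) >= 2, NA, rowMeans(dat[vars], na.rm = TRUE))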
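As a rough sketch, the rule can also be written as a function that takes the columns by name and exposes the tolerated number of NAs as a parameter. The name row_mean_max_na and the max_na argument are made up for illustration; the data is the question's test data rebuilt without v4.

```r
# row_mean_max_na() is a hypothetical helper, not part of base R
row_mean_max_na <- function(df, cols, max_na = 1) {
  sub <- df[cols]
  n_missing <- rowSums(is.na(sub))
  means <- rowMeans(sub, na.rm = TRUE)
  # Rows with more missing values than allowed get NA
  means[n_missing > max_na] <- NA
  unname(means)
}

test <- data.frame(v1 = c(1, 1, 1, 0, NA, NA, 1),
                   v2 = c(1, 1, 2, 1, NA, 1, 2),
                   v3 = c(2, 2, NA, NA, 0, 0, NA))

test$v4 <- row_mean_max_na(test, c("v1", "v2", "v3"))
test$v4
# 1.333333 1.333333 1.5 0.5 NA 0.5 1.5
```

Naming the columns explicitly matters here: if v4 were included in the selection, its own NAs would be counted and would also enter the mean.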
I have the following dataframe dat, which presents a row-specific number of NAs at the beginning of some of its rows:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1,NA,2:6,NA)))
dat
# V1 V2 V3 V4 V5 V6 V7 V8
# NA NA 1 3 5 NA NA NA
# NA 1 2 3 6 7 8 NA
# 1 NA 2 3 4 5 6 NA
My aim is to delete all the NAs at the beginning of each row and to left shift the row values (adding NAs at the end of the shifted rows accordingly, in order to keep their length constant).
The following code works as expected:
for (i in 1:nrow(dat)) {
if (is.na(dat[i,1])==TRUE) {
dat1 <- dat[i, min(which(!is.na(dat[i,]))):length(dat[i,])]
dat[i,] <- data.frame( dat1, t(rep(NA, ncol(dat)-length(dat1))) )
}
}
dat
returning:
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 3 5 NA NA NA NA NA
# 1 2 3 6 7 8 NA NA
# 1 NA 2 3 4 5 6 NA
I was wondering whether there is a more direct way to do this without a for-loop, ideally using the tail function.
On that last point: min(which(!is.na(dat[1,]))) returns 3, as expected. But if I then type tail(dat[1,], min(which(!is.na(dat[1,])))), the result is the same initial row, and I don't understand why.
Thank you very much for any suggestions.
If you just want all NAs to be pushed to the end of each row, you could try:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat[3,2] <- NA
> dat
V1 V2 V3 V4 V5 V6 V7 V8
1 NA NA 1 3 5 NA NA NA
2 NA 1 2 3 6 7 8 NA
3 1 NA 3 4 5 6 7 NA
dat.new<-do.call(rbind,lapply(1:nrow(dat),function(x) t(matrix(dat[x,order(is.na(dat[x,]))])) ))
colnames(dat.new)<-colnames(dat)
> dat.new
V1 V2 V3 V4 V5 V6 V7 V8
[1,] 1 3 5 NA NA NA NA NA
[2,] 1 2 3 6 7 8 NA NA
[3,] 1 3 4 5 6 7 NA NA
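A somewhat simpler way to get the same result is to let apply() hand each row over as a plain vector and concatenate its non-NA and NA parts. This is a sketch only; push_na_right is a made-up helper name, and the data frame is assumed to be all numeric.

```r
dat <- as.data.frame(rbind(c(NA, NA, 1, 3, 5, NA, NA, NA),
                           c(NA, 1:3, 6:8, NA),
                           c(1, NA, 3:7, NA)))

# push_na_right() is a hypothetical helper: non-NA values first, NAs last
push_na_right <- function(row) {
  row <- unname(row)
  c(row[!is.na(row)], row[is.na(row)])
}

# apply() returns one column per input row, hence the transpose
dat_new <- as.data.frame(t(apply(dat, 1, push_na_right)))
names(dat_new) <- names(dat)
dat_new
```

This moves every NA, including interior ones such as dat[3, 2], so it is not equivalent to stripping only the leading NAs.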
I don't think you can do this without a loop.
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat[3,2] <- NA
# V1 V2 V3 V4 V5 V6 V7 V8
# 1 NA NA 1 3 5 NA NA NA
# 2 NA 1 2 3 6 7 8 NA
# 3 1 NA 3 4 5 6 7 NA
t(apply(dat, 1, function(x) {
if (is.na(x[1])) {
y <- x[-seq_len(which.min(is.na(x))-1)]
length(y) <- length(x)
y
} else x
}))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#[1,] 1 3 5 NA NA NA NA NA
#[2,] 1 2 3 6 7 8 NA NA
#[3,] 1 NA 3 4 5 6 7 NA
Then turn the matrix into a data.frame if you must.
Here is an answer that uses the tail function:
dat <- as.data.frame(rbind(c(NA,NA,1,3,5,NA,NA,NA), c(NA,1:3,6:8,NA), c(1:7,NA)))
dat
for (i in 1:nrow(dat)) {
if (is.na(dat[i,1])==TRUE) {
# drops initial NAs of the row (if the sequence starts with NAs)
dat1 <- tail(as.integer(dat[i,]), -min(which(!is.na(dat[i,]))-1))
# adds final NAs to keep the row length constant (i.e. conformable with 'dat')
length(dat1) <- ncol(dat)
dat[i,] <- dat1
}
}
dat
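As for why tail(dat[1,], n) returned the unchanged row in the question: dat[1,] is still a data frame, and tail() on a data frame selects rows, not columns. Asking for the last 3 rows of a one-row data frame gives back that single row. Converting the row to a plain vector first makes tail() work element-wise, which is what the as.integer() call above does. A small sketch on the first row only (the variable names are illustrative):

```r
dat <- as.data.frame(rbind(c(NA, NA, 1, 3, 5, NA, NA, NA)))

first_obs <- min(which(!is.na(dat[1, ])))   # position of the first non-NA value

# tail() on a data frame works on rows, so the one-row data frame comes back whole
rows_returned <- nrow(tail(dat[1, ], first_obs))

# On a plain vector, a negative n drops that many leading elements
row_vec <- unlist(dat[1, ], use.names = FALSE)
shifted <- tail(row_vec, -(first_obs - 1))
length(shifted) <- ncol(dat)                # pad with NAs back to full width
shifted
```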
I'm connecting to my Vertica database and retrieving a huge amount of data. There are NAs in every column of the dataset, but I want to find the NAs in specific columns only and replace them with 0.
How should I do that?
Thanks!
To expand on my comment and make it into an answer, here's a minimal reproducible example:
set.seed(1)
mydf <- as.data.frame(matrix(sample(c(1:2, NA), 50, replace = TRUE), ncol = 10))
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 NA NA NA
# 2 2 NA 1 NA 1 1 2 NA 2 1
# 3 2 2 NA NA 2 2 2 1 NA 2
# 4 NA 2 2 2 1 NA 1 NA 2 NA
# 5 1 1 NA NA 1 2 NA 2 2 NA
Now, if we want to replace NA with 0, but only in columns 1, 3, 7, and 8, we can use:
mydf[c(1, 3, 7, 8)][is.na(mydf[c(1, 3, 7, 8)])] <- 0
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 NA 1 2 NA 2 2 0 NA NA
# 2 2 NA 1 NA 1 1 2 0 2 1
# 3 2 2 0 NA 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 NA 1 2 0 2 2 NA
Instead of numeric column positions, you can use a vector of column names, which is safer because it does not depend on column order. Your code will also be easier to read if the column names or positions you are working on are stored in a separate vector. Both ideas are demonstrated below, where we replace NA values in variables "V2", "V4" and "V5" with -999.
changeMe <- c("V2", "V4", "V5")
mydf[changeMe][is.na(mydf[changeMe])] <- -999
mydf
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 1 -999 1 2 -999 2 2 0 NA NA
# 2 2 -999 1 -999 1 1 2 0 2 1
# 3 2 2 0 -999 2 2 2 1 NA 2
# 4 0 2 2 2 1 NA 1 0 2 NA
# 5 1 1 0 -999 1 2 0 2 2 NA
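An alternative sketch does the replacement column by column with lapply, which leaves the untouched columns (including non-numeric ones) exactly as they were. The data frame and the name fill_cols below are invented for illustration and are not from the question.

```r
# Small made-up data frame with NAs in numeric and character columns
mydf2 <- data.frame(id   = 1:4,
                    a    = c(1, NA, 3, NA),
                    b    = c(NA, 2, NA, 4),
                    note = c("x", NA, "y", NA))

fill_cols <- c("a", "b")                    # columns in which NA becomes 0
stopifnot(all(fill_cols %in% names(mydf2))) # fail early on a misspelled name

mydf2[fill_cols] <- lapply(mydf2[fill_cols], function(col) {
  col[is.na(col)] <- 0
  col
})
mydf2
```

The NAs in note are left alone because that column is not listed in fill_cols.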