Convert entire data frame into one long column (vector) - R

I want to turn the entire content of a numeric (incl. NA's) data frame into one column. What would be the smartest way of achieving the following?
> df <- data.frame(C1=c(1,NA,3), C2=c(4,5,NA), C3=c(NA,8,9))
> df
  C1 C2 C3
1  1  4 NA
2 NA  5  8
3  3 NA  9
> x <- mysterious_operation(df)
> x
[1]  1 NA  3  4  5 NA NA  8  9
I want to calculate the mean of this vector, so ideally the NA's would be removed within the mysterious_operation; the data frame I'm working on is very large, so that is probably a good idea.

Here are a couple of ways with purrr:
# using invoke, a wrapper around do.call
purrr::invoke(c, df, use.names = FALSE)
# similar to unlist, reduce list of lists to a single vector
purrr::flatten_dbl(df)
Both return:
[1] 1 NA 3 4 5 NA NA 8 9

The mysterious operation you are looking for is called unlist:
> df <- data.frame(C1=c(1,NA,3),C2=c(4,5,NA),C3=c(NA,8,9))
> unlist(df, use.names = F)
[1] 1 NA 3 4 5 NA NA 8 9
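Since the end goal is the mean with the NA's dropped, here is a short follow-up sketch; na.omit() and the na.rm argument of mean() are base R and not specific to unlist:
x <- unlist(df, use.names = FALSE)
mean(x, na.rm = TRUE)   # let mean() skip the NA's directly
# [1] 5
mean(na.omit(x))        # or drop the NA's first, then average
# [1] 5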

We can use unlist and create a single-column data.frame:
df1 <- data.frame(col = unlist(df))

Just for fun; of course unlist is the most appropriate function.
Alternative 1:
stack(df)[, 1]
Alternative 2:
do.call(c, df)
do.call(c, c(df, use.names = FALSE)) # unnamed version
Maybe they are more mysterious.

Related

Count unique values in Raster Data in R

I have these raster datasets, which look like this:
1 2 3 4 5
1 NA NA NA 10 NA
2 7 3 7 10 10
3 NA 3 7 3 3
4 9 9 NA 3 7
5 3 NA 7 NA NA
I created that table via:
MyRaster1 <- raster("MyRaster_EUNIS1.tif")
head(MyRaster1)
Using unique(MyRaster1) I get 3 7 9 10.
What I need are the counts of these unique values in the raster dataset.
I have tried quite a few approaches; one works, but it is a lot of trouble, and I can't get a loop to work for all the raster datasets I have.
Classes1 <- as.factor(unique(values(MyRaster1)))[!is.na(unique(values(MyRaster1)))]
val1 <- unique(MyRaster1)
Tab1 <- matrix(nrow = length(values(MyRaster1)), ncol = length(val1))
colnames(Tab1) <- levels(unique(Classes1))
Tab1 <- Tab1[!is.na(Tab1[,1]),]
colSums(Tab1)
It seems to work properly until I try to delete the NA values. When I use colSums before that, I get NA as the result for each column; after I delete the NA values, I get 0.
This is my first time using R, so I'm a real novice. I've researched quite a lot, but since I hardly understand the language at all, this is the furthest I have gotten.
Thank you for your help.
Edit:
table(MyRaster1)
gives me this: Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors
The best result would be:
3 7 9 10
6 5 2 3
But I'd also be ok with a different format which I could use in Excel.
Use raster::freq()
Here's an example for the first two rows of your data:
r <- raster(matrix(c(NA,NA,NA,10,NA,7,3,7,10,10), nrow = 2, ncol = 5))
> freq(r)
     value count
[1,]     3     1
[2,]     7     2
[3,]    10     3
[4,]    NA     4
Note that the freq function rounds unless explicitly told not to:
https://www.rdocumentation.org/packages/raster/versions/3.0-7/topics/freq
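The question also mentions looping over several raster datasets. A minimal sketch of that, assuming the files follow the naming pattern of MyRaster_EUNIS1.tif (the second file name below is made up for illustration):
library(raster)
files <- c("MyRaster_EUNIS1.tif", "MyRaster_EUNIS2.tif")   # hypothetical list of files
counts <- lapply(files, function(f) freq(raster(f)))       # one value/count table per raster
names(counts) <- files
counts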

Counting 0's, 1's, 99's and NA's for each variable in a data frame

I have a data frame with 118 variables containing 0's, 1's, 99's and NA's. I need to count, for each variable, how many 99's, NA's, 1's and 0's there are (99 is "not applicable", 0 is "no", 1 is "yes" and NA is "no answer"). I tried to do this with the table function, but it works with vectors; how can I do it for the whole set of variables?
Here is a little reproducible example of the data frame:
forest<-c(1,1,1,1,0,0,0,1,1,1,0,NA,0,NA,0,99,99,1,0,NA)
water<-c(1,NA,NA,NA,NA,99,99,0,0,0,1,1,1,0,0,NA,NA,99,1,0)
rain<-c(1,NA,1,0,1,99,99,0,1,0,1,0,1,0,0,NA,99,99,1,1)
fire<-c(1,0,0,0,1,99,99,NA,NA,NA,1,0,1,0,0,NA,99,99,1,1)
df<-data.frame(forest,water,rain,fire)
And I need to write the result for each variable into a data frame, like this:
forest water rain fire
1 8 5 8 6
0 7 6 6 6
99 2 3 4 4
NA 3 6 2 4
Can't find a good dupe, so here's my comment as an answer:
A data frame is really a list of columns. lapply will apply a function to every item in the input (every column, in the case of a data frame) and return a list with each result:
lapply(df, table)
# $forest
#
# 0 1 99
# 7 8 2
#
# $water
#
# 0 1 99
# 6 5 3
#
# $rain
#
# 0 1 99
# 6 8 4
#
# $fire
#
# 0 1 99
# 6 6 4
sapply is like lapply, but it will attempt to simplify the result instead of always returning a list. In both cases, you can pass along additional arguments to the function being applied, like useNA = "always" to table to have NA included in the output:
sapply(df, table, useNA = "always")
# forest water rain fire
# 0 7 6 6 6
# 1 8 5 8 6
# 99 2 3 4 4
# <NA> 3 6 2 4
For lots more info, check out R Grouping functions: sapply vs. lapply vs. apply vs. tapply vs. by vs. aggregate
To compare with some other answers: apply is similar to lapply and sapply, but it is intended for use with matrices or higher-dimensional arrays. The only time you should use apply on a data.frame is when you need to apply a function to each row. For functions on data frame columns, prefer lapply or sapply. The reason is that apply will coerce the data frame to a matrix first, which can have unintended consequences if you have columns of different classes.
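A quick illustration of that coercion pitfall, using a small made-up data frame with mixed column types:
mixed <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
apply(mixed, 2, class)    # both report "character": the frame is coerced to a character matrix first
sapply(mixed, class)      # reports "integer" and "character": columns are inspected individually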
rbind(sapply(df, table), "NA" = sapply(df, function(y) sum(is.na(y))))
forest water rain fire
0 7 6 6 6
1 8 5 8 6
99 2 3 4 4
NA 3 6 2 4
This should do it:
tables <- apply(df, 2, FUN = table)
There's probably a way to do it in one fell swoop.
apply(df, 2, table)
apply(df, 2, function(x){ sum(is.na(x)) })
As the variables are categorical, you should first turn them into factors:
df <- lapply(df, as.factor)
Then summarize your data.frame:
sapply(df, summary)
The factor method of summary() counts the occurrences of each level, including the NA's.
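For the original example data the two steps give the layout asked for in the question (one-liner form; output reconstructed here for illustration):
sapply(lapply(df, as.factor), summary)
#      forest water rain fire
# 0         7     6    6    6
# 1         8     5    8    6
# 99        2     3    4    4
# NA's      3     6    2    4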

Fill in-between entries in an ID vector

Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NA's that fall within a run of a single number (id[6], id[14]) need to be replaced by that number. However, the NA's that don't meet this condition (those at the start or between runs of two different numbers) need to be left alone (i.e., id[1], id[2], id[8], id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.
This seems to work. The idea is to use zoo::na.locf to fill all the NAs forward and then re-insert NAs where they fall between two different numbers:
id.target <- zoo::na.locf(id, na.rm = FALSE)
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
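For a step-by-step view of what those two lines do, here is the same logic with throwaway names (filled, jump) and the intermediate values for the example id:
library(zoo)
filled <- na.locf(id, na.rm = FALSE)   # carry the last non-NA value forward
filled
## [1] NA NA 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3
jump <- c(diff(filled), 1L) > 0L       # TRUE where the next entry is a larger id
filled[jump & is.na(id)] <- NA         # undo the fill only at those boundary NAs
filled
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3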
Here is a base R option
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)
}))
id[seq_along(id) %in% d1$i1] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3

Add columns that are in a vector but not in the df

I am trying to do the following and was wondering if there is an easier way to use dplyr to achieve this (I'm sure there is):
I want to compare the columns of a dataframe to a vector of names, and if the df does not contain a column corresponding to one of the names in the name vector, add that column to the df and populate its values with NAs.
E.g., in the MWE below:
df <- data.frame(cbind(c(1:6),c(11:16),c(10:15)))
colnames(df) <- c("A","B","C")
names <- c("A","B","C","D","E")
How do I use dplyr to create the two columns D and E (which are in names but not in df) and populate them with NAs?
No need for dplyr; this is just a basic operation in base R. (By the way, try to avoid overriding built-in functions such as names in the future. The reason names still works here is that R looks it up in the base package namespace rather than in the global environment, but it is still bad practice.)
df[setdiff(names, names(df))] <- NA
df
# A B C D E
# 1 1 11 10 NA NA
# 2 2 12 11 NA NA
# 3 3 13 12 NA NA
# 4 4 14 13 NA NA
# 5 5 15 14 NA NA
# 6 6 16 15 NA NA
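Since the question specifically asked about dplyr, here is a rough tidyverse equivalent (assumes dplyr is installed; the base R one-liner above is simpler):
library(dplyr)
missing_cols <- setdiff(names, colnames(df))   # "D" "E"
df <- df %>% mutate(!!!setNames(rep(list(NA), length(missing_cols)), missing_cols))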

Is there a way to identify where NAs are introduced?

I recently went through my fairly large dataset and realized someone had decided to use commas. I'm trying to convert it all to numeric. I used a nice little gsub to get rid of those pesky commas, but I'm still finding NAs introduced by coercion. Is there a way to identify, by row and column, where those NAs are being introduced so I can see why that is occurring?
Use the is.na() function. Consider the following data frame, which contains NA values, as an example:
> df <- data.frame(v1=c(1,2,NA,4), v2=c(NA,6,7,8), v3=c(9,NA,NA,12))
> df
  v1 v2 v3
1  1 NA  9
2  2  6 NA
3 NA  7 NA
4  4  8 12
You can use is.na along with sapply to get the following result:
> sapply(df, function(x) { c(1:length(x))[is.na(x)] })
$v1
[1] 3
$v2
[1] 1
$v3
[1] 2 3
Each column will come back along with the rows where NA values occurred.
I would also use which with arr.ind=TRUE to get the row/column indices ('df' from @Tim Biegeleisen's post):
which(is.na(df), arr.ind=TRUE)
# row col
#[1,] 3 1
#[2,] 1 2
#[3,] 2 3
#[4,] 3 3
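To tie this back to the original problem (NAs introduced while converting character columns to numeric), one option is to keep the data frame from before the conversion and compare it with the converted one; df_raw and df_num below are hypothetical names for those two versions:
df_raw <- data.frame(v1 = c("1,000", "2", "oops"), v2 = c("3", "4,5x", "6"),
                     stringsAsFactors = FALSE)
df_num <- as.data.frame(lapply(df_raw, function(x) as.numeric(gsub(",", "", x))))
# cells that were not NA before but are NA after the coercion
which(is.na(df_num) & !is.na(df_raw), arr.ind = TRUE)   # reports row 3 of v1 and row 2 of v2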
