Constraining on classes of columns in a data frame in R

I am trying to write a function that takes a data frame as input and checks whether its columns contain only integer, character (not factor), and numeric vectors. If so, it should return TRUE. If any column is of another type, it should return FALSE.
For example:
df1 <- data.frame( a = 1:4, b = c("x","y", "z","w"), c = 8:11, stringsAsFactors = FALSE)
df2 <- data.frame(a = 2:5, b = c("m", "n", "o", "p"),c = 11:14, stringsAsFactors = TRUE)
In this case, the function should return TRUE for df1, since it has only integer and character columns. For df2 it should return FALSE, since column b is a factor.
Could someone help?

Since integers are also numeric (is.numeric returns TRUE for integer vectors and FALSE for factors), you can use the condition
is.numeric(x) | is.character(x)
Here's a function:
numOrChar <- function(df) {
  f <- function(x) is.numeric(x) | is.character(x)
  all(vapply(df, f, logical(1L)))
}
numOrChar(df1)
# [1] TRUE
numOrChar(df2)
# [1] FALSE
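If it helps to see which columns fail the check, the same predicate can be reused to list them. This is a minimal sketch rather than part of the original answer; the helper name badCols is hypothetical.

```r
# Example data from the question
df1 <- data.frame(a = 1:4, b = c("x", "y", "z", "w"), c = 8:11,
                  stringsAsFactors = FALSE)
df2 <- data.frame(a = 2:5, b = c("m", "n", "o", "p"), c = 11:14,
                  stringsAsFactors = TRUE)

# Hypothetical helper: names of the columns that are neither numeric nor character
badCols <- function(df) {
  ok <- vapply(df, function(x) is.numeric(x) || is.character(x), logical(1L))
  names(df)[!ok]
}

bad1 <- badCols(df1)  # character(0): every column passes
bad2 <- badCols(df2)  # "b": the factor column
```

A data frame passes the original check exactly when this helper returns an empty character vector.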

Related

Carry forward NA values for multiple columns R

I need to fill NA values by carrying the last non-NA value forward from one column to the next. An example is below:
df <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
b =c(NA,NA,3,4,NA,NA,NA,NA,NA,NA),
c = c(NA,NA,NA,NA,5,6,NA,NA,NA,NA),
d = c(NA,NA,NA,NA,NA,NA,7,8,NA,NA),
e = c(NA,NA,NA,NA,NA,NA,NA,NA,9,10))
I have tried to use a loop with the na.locf function from zoo, but this only carries over the values of the immediately preceding column:
columns <- seq(2,ncol(df))
output <- list()
for (i in columns) {
  output[[i]] <- t(zoo::na.locf(t(df[, (i - 1):i])))[, 2]
}
The expected output would be like
expected_output <- data.frame(a = c(1,2,NA,NA,NA,NA,NA,NA,NA,NA),
b = c(1,2,3,4,NA,NA,NA,NA,NA,NA),
c = c(1,2,3,4,5,6,NA,NA,NA,NA),
d = c(1,2,3,4,5,6,7,8,NA,NA),
e = c(1,2,3,4,5,6,7,8,9,10))
Transpose df, apply na.locf, transpose again and replace df contents with that to make it a data frame with the correct names.
library(zoo)
out <- replace(df, TRUE, t(na.locf(t(df), fill = NA)))
identical(out, expected_output)
## [1] TRUE
This also works and is similar except it applies na.locf0 to each row instead of applying na.locf to the transpose.
out <- replace(df, TRUE, t(apply(df, 1, na.locf0)))
identical(out, expected_output)
## [1] TRUE
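The loop from the question can also be made to work in base R by filling each column from the already filled column to its left, so that values propagate across the whole row. This is a minimal sketch using the df from the question, offered as an alternative to the zoo-based answers above.

```r
df <- data.frame(a = c(1, 2, NA, NA, NA, NA, NA, NA, NA, NA),
                 b = c(NA, NA, 3, 4, NA, NA, NA, NA, NA, NA),
                 c = c(NA, NA, NA, NA, 5, 6, NA, NA, NA, NA),
                 d = c(NA, NA, NA, NA, NA, NA, 7, 8, NA, NA),
                 e = c(NA, NA, NA, NA, NA, NA, NA, NA, 9, 10))

out <- df
# Walk left to right, skipping the first column
for (j in seq_along(out)[-1]) {
  fill <- is.na(out[[j]])               # positions that need a value
  out[[j]][fill] <- out[[j - 1]][fill]  # take it from the (already filled) left neighbour
}
```

Because each column reads from the updated neighbour rather than from the original data, a value in column a can reach column e.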

R Check if multiple variables with the same pattern have the same values

I have some variables in my data frame that show the same pattern, and that should also have the same content. Now I want to check whether all rows show the same values for these variables. In this example, I want to compare all variables that start with "a" and want to get "True" if they are indeed all the same. How do I do that?
df = data.frame(
a1 = c(1,2,3),
nn22 = c(8,9,3),
a2 = c(1,2,3),
nn = c(8,9,3),
u6 = c(8,4,3),
o8 = c(3,9,1),
a3 = c(1,2,3),
a4 = c(1,2,3),
a5 = c(1,2,3),
a6 = c(1,2,3),
b= c(2,2,2))
We could split the data into a list of data frames based on the name prefix, loop over that list with sapply, and use == to compare the first column of each group with all of its columns. Wrapping the comparison in all checks that every value is TRUE.
sapply(split.default(df, sub("\\d+$", "", names(df))), function(x) all(x[,1] == x))
#    a    b   nn    o    u
# TRUE TRUE TRUE TRUE TRUE
If we only need to compare the 'a' columns:
dfa <- df[startsWith(names(df), 'a')]
all(dfa == dfa[,1])
#[1] TRUE
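One caveat with == is that a missing value makes the comparison NA rather than FALSE. As a hedged alternative, each column can be compared to the first with identical, which always returns a single TRUE or FALSE. The sketch below uses a shortened version of the example data; note that identical is also strict about type, so an integer column and a double column holding the same numbers would count as different.

```r
df <- data.frame(a1 = c(1, 2, 3), nn22 = c(8, 9, 3), a2 = c(1, 2, 3),
                 a3 = c(1, 2, 3), b = c(2, 2, 2))
dfa <- df[startsWith(names(df), "a")]

# Compare every remaining 'a' column with the first one
same <- all(vapply(dfa[-1], identical, logical(1L), dfa[[1]]))

# Introduce a missing value to show the difference between the two checks
dfa$a2[2] <- NA
viaEquals    <- all(dfa == dfa[, 1])                                  # NA
viaIdentical <- all(vapply(dfa[-1], identical, logical(1L), dfa[[1]]))  # FALSE
```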

Passing dataframe as argument to function

I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument and read out the information I need from the individual rows. However, when I try to use it as an argument, I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this, I get a list with the column names, but all values are NULL.
Expected output:
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and later also other values from the same row of the data frame).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
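Map iterates over the columns of DF, not its rows, so the function never receives a whole row. Here is one way that iterates over the rows with apply: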
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
apply(DF, 1, function(row) {
  # apply() passes each row as a named character vector
  aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
  row["mean.aa"] <- mean(aa)
  as.list(row)
})
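Because apply converts the data frame to a character matrix first, every element of the resulting lists is a character string, including SN and mean.aa. If the original column types matter, a sketch that indexes the rows directly may be preferable. The variable name rows is hypothetical.

```r
DF <- data.frame(Name = c("A", "B"), SN = 1:2,
                 Age = c("21,34,456,567,23,123,34",
                         "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
                 stringsAsFactors = FALSE)

# One list per row; column types are preserved
rows <- lapply(seq_len(nrow(DF)), function(i) {
  row <- as.list(DF[i, ])
  ages <- as.numeric(strsplit(row$Age, ",", fixed = TRUE)[[1]])
  row$mean.aa <- mean(ages)
  row
})

str(rows[[1]])
```

Here SN stays an integer and mean.aa is stored as a number rather than as text.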

Rowwise maximum of multiple columns (ignoring null values)

How do I apply the max() function to multiple columns, ignoring NULL values
(in this context, NULL corresponds to NA in R)?
My data:
# data
df <- data.frame( a = sample(5), b = sample(5) )
df[2:3,1] <- NA
dbWriteTable(db1, "df", df, overwrite = TRUE )
What I have tried
What I want:
(notice that the column max1 does not contain NA)
I was hoping there was a simple way to do this in SQLite, but maybe there is not?
df$max <- apply(X = df, MARGIN = 1, FUN = max, na.rm = TRUE)
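A vectorised alternative on the R side is pmax, which takes the columns as separate arguments. The sketch below uses fixed values instead of sample(5) so that the result is reproducible, and borrows the column name max1 from the question.

```r
df <- data.frame(a = c(3, NA, NA, 1, 5), b = c(2, 4, 1, 5, 3))

# c(df, na.rm = TRUE) builds the argument list: one entry per column plus na.rm
df$max1 <- do.call(pmax, c(df, na.rm = TRUE))
df$max1
```

The two approaches differ when every value in a row is missing: pmax returns NA for that row, whereas max with na.rm = TRUE returns -Inf and issues a warning.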

data.table loses factor ordering after rbind, R

When rbinding two data.table with ordered factors, the ordering seems to be lost:
dtb1 = data.table(id = factor(c("a", "b"), levels = c("a", "c", "b"), ordered=T), key="id")
dtb2 = data.table(id = factor(c("c"), levels = c("a", "c", "b"), ordered=T), key="id")
test = rbind(dtb1, dtb2)
is.ordered(test$id)
#[1] FALSE
Any thoughts or ideas?
data.table does some fancy footwork that means that data.table:::.rbind.data.table is called when rbind is called on objects including data.tables. .rbind.data.table utilizes the speedups associated with rbindlist, with a bit of extra checking to match by name etc.
.rbind.data.table deals with factor columns by using c to combine them (hence retaining the levels attribute)
# the relevant code is
l = lapply(seq_along(allargs[[1L]]), function(i) do.call("c",
lapply(allargs, "[[", i)))
In base R (before version 4.1.0, which added a factor method for c), using c in this manner does not retain the "ordered" attribute; it doesn't even return a factor!
For example (in base R)
f <- factor(1:2, levels = 2:1, ordered=TRUE)
g <- factor(1:2, levels = 2:1, ordered=TRUE)
# it isn't ordered!
is.ordered(c(f,g))
# [1] FALSE
# no surprise, as it isn't even a factor!
is.factor(c(f,g))
# [1] FALSE
However data.table has an S3 method c.factor, which is used to ensure that a factor is returned and the levels are retained. Unfortunately this method does not retain the ordered attribute.
getAnywhere('c.factor')
# A single object matching ‘c.factor’ was found
# It was found in the following places
# namespace:data.table
# with value
#
# function (...)
# {
# args <- list(...)
# for (i in seq_along(args)) if (!is.factor(args[[i]]))
# args[[i]] = as.factor(args[[i]])
# newlevels = unique(unlist(lapply(args, levels), recursive = TRUE,
# use.names = TRUE))
# ind <- fastorder(list(newlevels))
# newlevels <- newlevels[ind]
# nm <- names(unlist(args, recursive = TRUE, use.names = TRUE))
# ans = unlist(lapply(args, function(x) {
# m = match(levels(x), newlevels)
# m[as.integer(x)]
# }))
# structure(ans, levels = newlevels, names = nm, class = "factor")
# }
# <bytecode: 0x073f7f70>
# <environment: namespace:data.table>
So yes, this is a bug. It is now reported as #5019.
As of version 1.8.11 data.table will combine ordered factors to result in ordered if a global order exists, and will complain and result in a factor if it doesn't exist:
DT1 = data.table(ordered('a', levels = c('a','b','c')))
DT2 = data.table(ordered('a', levels = c('a','d','b')))
rbind(DT1, DT2)$V1
#[1] a a
#Levels: a < d < b < c
DT3 = data.table(ordered('a', levels = c('b','a','c')))
rbind(DT1, DT3)$V1
#[1] a a
#Levels: a b c
#Warning message:
#In rbindlist(lapply(seq_along(allargs), function(x) { :
# ordered factor levels cannot be combined, going to convert to simple factor instead
To contrast, here's what base R does:
rbind(data.frame(DT1), data.frame(DT2))$V1
#[1] a a
#Levels: a < b < c < d
# Notice that the resulting order does not respect the suborder for DT2
rbind(data.frame(DT1), data.frame(DT3))$V1
#[1] a a
#Levels: a < b < c
# Again, suborders are not respected and new order is created
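I ran into the same problem after rbind. One workaround is to re-assign the ordered levels for the column afterwards. Note that levels = letters would impose alphabetical order, so pass the original level order instead:
test$id <- factor(test$id, levels = c("a", "c", "b"), ordered = TRUE)
In other words, it can be simpler to define the ordered factor after rbind.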
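The workaround can be illustrated in base R alone. This is a sketch under the assumption that the combined column came back as a plain factor; the unordered factor id below is a stand-in for that column, and lvl holds the level order from the question.

```r
lvl <- c("a", "c", "b")

# Stand-in for a column whose ordering was lost during rbind
id <- factor(c("a", "b", "c"), levels = lvl)
before <- is.ordered(id)

# Re-create the factor with the intended level order
id <- factor(id, levels = lvl, ordered = TRUE)
after <- is.ordered(id)

# With levels a < c < b, "c" compares as smaller than "b"
cBeforeB <- id[3] < id[2]
```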
