Removing rows in list (with ff) - r

I have a data set where I want to remove every row in which Dataset$a
does not have the value "Right". Dataset$a is a list with three different objects: "Right", "Wrong1" and "Wrong2". I tried to do this by using the code:
Dataset$a <- subset.ffdf(Dataset, a == "Right")
But I get the error
Error in if (any(B < 1)) stop("B too small") : missing value where TRUE/FALSE needed
In addition: Warning message:
In bbatch(n, as.integer(BATCHBYTES/theobytes)) :
NAs introduced by coercion to integer range
What should I do instead?

The warning that is returned has something to do with working with a large data set.
see here
Before filtering, check whether column a is a factor or a character vector using:
str(Dataset$a)
If it is in character format, this should work (note == rather than !=, since you want to keep the "Right" rows):
finalDf <- Dataset[Dataset$a == "Right", ]
Or you can use dplyr like so:
library(dplyr)
newData <- Dataset %>% dplyr::filter(a == "Right")
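As a minimal sketch of the same filter in base R, assuming an ordinary in-memory data.frame rather than an ffdf: the toy Dataset below is invented for illustration. The filtered result is stored in a new object instead of being assigned back to the single column Dataset$a.

```r
# Toy data standing in for the real Dataset (hypothetical values)
Dataset <- data.frame(
  a = factor(c("Right", "Wrong1", "Right", "Wrong2")),
  x = 1:4
)

# Keep only rows where a is "Right"; which() also drops rows where a is NA
kept <- Dataset[which(Dataset$a == "Right"), ]

# Drop the now-unused factor levels "Wrong1" and "Wrong2"
kept$a <- droplevels(kept$a)

nrow(kept)      # 2
levels(kept$a)  # "Right"
```

Dropping unused levels is optional, but it keeps later summaries from listing categories that no longer occur.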

Related

Convert numerical variable to categorical variable

I have a list of columns that contain 0 and 1 as values. Right now they are treated as numerical variables but I want them to be treated as categorical.
I tried
as.factor(df[,"diseasesA":"diseaseM"], exclude = NULL)
but received the following error message:
Error in as.factor(df[,"diseasesA":"diseaseM"], :
unused argument (exclude = NULL)
not using "exclude = NULL" gave me the following error message:
Error in "diseasesA":"diseaseM" : NA/NaN argument
In addition: Warning messages:
1: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
2: In eval(jsub, setattr(as.list(seq_along(x)), "names", names_x), :
NAs introduced by coercion
factor() or as.factor() works on a single column, not a data frame. So you need to apply that function to the columns you want to convert. Here are a few equivalent methods:
cols = paste0("disease", LETTERS[1:13]) # assuming your naming pattern is consistent
## base R with lapply
df[cols] = lapply(df[cols], factor)
## base R with for loop (index by column name, not by position in df)
for (col in cols) {
  df[[col]] = factor(df[[col]])
}
## dplyr
library(dplyr)
df = df %>%
mutate(across(diseaseA:diseaseM, factor))
I will note that your question is inconsistent in its column naming pattern, disease vs diseases. In the base R methods I assumed that's a typo and further assumed you wanted to convert columns diseaseA, diseaseB, diseaseC, ..., diseaseM. In dplyr we can use across() with X:Z to operate on all columns from X through Z, but there are many other ways to select which columns to work on, e.g., starts_with("disease").
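A small sketch of the lapply approach on made-up data, assuming the columns to convert share the prefix "disease". The data frame and its values are hypothetical, and selecting columns with grep() is just one option alongside those mentioned above.

```r
# Hypothetical data: three 0/1 indicator columns plus one unrelated column
df <- data.frame(
  id = 1:4,
  diseaseA = c(0, 1, 0, 1),
  diseaseB = c(1, 1, 0, 0),
  diseaseC = c(0, 0, 0, 1)
)

# Select columns by name prefix instead of typing each name
cols <- grep("^disease", names(df), value = TRUE)

# Convert each selected column to a factor, leaving the others untouched
df[cols] <- lapply(df[cols], factor)

# Inspect the resulting column classes
sapply(df, class)
```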

Select odd rows from a specific column in a dataframe

I have a large df with a specific numeric column named Amount.
df = data.frame(Amount = c(as.numeric(1:14)), stringsAsFactors = FALSE)
I want to select odd rows. So far, I have tried the syntax below, but I always get these error messages:
df$Amount[c(FALSE, TRUE),]
Error in df$Amount[c(FALSE, TRUE), ] : incorrect number of dimensions
seq_len(ncol(df$Amount)) %% 2
Error in seq_len(ncol(df$Amount)) :
argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(ncol(df$Amount)) :
first element used of 'length.out' argument
odd = seq(1,14,1)
df$Amount[odd,1]
Error in P20$Journal.Amount[even, 1] : incorrect number of dimensions
P20$Journal.Amount[seq(2,length(14), 2),]
Error in seq.default(2, length(14), 2) : wrong sign in 'by' argument
My question is: Is there a way I can do this directly? I tried with the solutions of questions previously posted but so far, I keep having these error messages.
BaseR preferably.
The row/column index is used only when an object has a dim attribute. A vector doesn't have one.
is.vector(df$Amount)
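If we extract the vector, then just use the row index. The logical index is recycled, so c(FALSE, TRUE) keeps elements 2, 4, ...; use c(TRUE, FALSE) for elements 1, 3, ...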
df$Amount[c(FALSE, TRUE)]
If we want to subset the rows of the dataset,
df[c(FALSE, TRUE), 'Amount', drop = FALSE]
In the above code, we specify the row index (i), the column index or column name (j), and drop. See ?Extract: for a data.frame, drop defaults to TRUE when a single column is selected, so we need to specify drop = FALSE to keep the dimensions and avoid coercion to a vector.
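A short sketch of both forms for the odd rows, using the df from the question. The names odd_values and odd_rows are introduced here only for illustration.

```r
df <- data.frame(Amount = as.numeric(1:14))

# Odd positions as a plain vector: the logical index is recycled along the vector
odd_values <- df$Amount[c(TRUE, FALSE)]

# Odd rows, keeping the data.frame structure
odd_rows <- df[seq(1, nrow(df), by = 2), "Amount", drop = FALSE]

odd_values      # 1 3 5 7 9 11 13
dim(odd_rows)   # 7 1
```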

How to properly apply RowMeans()? "X is not numeric" error

I have two columns within OtherIncludedClean, and I would like to add another column of OtherIncludedClean$Mean; however, my efforts are in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you.
My columns are "X__1" and "X__2"
When we get the error "'x' must be numeric", it is better to check the column types. An easy option is
str(OtherIncludedClean)
If the columns turn out to be character or factor rather than numeric/integer, we need to convert them to a numeric type. (This assumes most of the values in a column are numeric and that one or two non-numeric elements changed the type of the whole column.)
The way to convert is as.numeric. For a single column, as.numeric(data$columnname) if it is character class and for factor class,
as.numeric(as.character(data$columnname))
Here, we need to change all the columns to numeric (assuming it is character class). For that, loop through the columns with lapply and assign the output back to the dataset
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
and then apply the rowMeans
If only a subset of the columns is of character class, then we only need to loop through those columns
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
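Putting the steps together as a minimal sketch: the data below is a hypothetical stand-in for OtherIncludedClean, with both columns read in as character. Passing values through as.character() first is an added precaution so that the same code is safe for factor columns too.

```r
# Hypothetical stand-in: two columns that were read in as character
OtherIncludedClean <- data.frame(
  X__1 = c("1", "2", "3"),
  X__2 = c("3", "4", NA),
  stringsAsFactors = FALSE
)

# Convert only the non-numeric columns; as.character() first keeps this safe for factors
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1],
                                 function(x) as.numeric(as.character(x)))

# Compute the row means from the two value columns only
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean[c("X__1", "X__2")], na.rm = TRUE)

OtherIncludedClean$mean  # 2 3 3
```

Naming the value columns in the rowMeans() call avoids averaging in any extra column, such as a previously created mean column.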

Function to impute missing values using mean in R

My tibble:
Data in Excel:
impute <- read_excel(choose.files())
imp <- function(df) {
for(i in 1:ncol(df)){
df[is.na(df[,i]),i] <- mean(df[,i],na.rm = T)
}
}
imp(impute)
Warning messages:
1: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
2: In mean.default(df[, i], na.rm = T) :
argument is not numeric or logical: returning NA
The above code works fine if impute is a data.frame, but doesn't work if it's a tibble. Could someone please let me know how to change the code to work with a tibble?
One of the differences between a data.frame and a tibble is that data frames drop dimensions when possible by default and tibbles don't.
That is, if x is a data frame then x[, i] may or may not be a data frame, depending on i. If i is one value, then x[, i] will just be a vector. If i is a vector with multiple values then x[, i] will be a data frame. This can cause bugs when i is a variable that may or may not have multiple values, because the class may be different (with the fix being to use x[, i, drop = FALSE] to guarantee a data.frame return).
Tibbles seek to address this issue by switching the default drop = TRUE to drop = FALSE, so x[, i] is a tibble, regardless of whether i has length 1 or more.
When calculating the mean, you want df[,i] to be treated as a numeric vector, not a tibble with 1 column, so you need to specify it:
df[[i]] # This is the preferred way to extract a single column
df[, i, drop = TRUE] # this will work too (since tibble version 1.4.1)
This is explained in greater detail in the "Tibbles vs data.frames" section of the Tibbles vignette.
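One possible rewrite of the function along these lines, as a sketch rather than a definitive fix. It extracts each column with df[[i]], skips non-numeric columns (an added assumption about the desired behavior), and returns the modified data so the caller can reassign it. The example data is hypothetical and uses a plain data.frame so that it runs without extra packages.

```r
# Impute NAs in every numeric column with that column's mean.
# df[[i]] returns a vector for both data.frames and tibbles.
imp <- function(df) {
  for (i in seq_along(df)) {
    if (is.numeric(df[[i]])) {
      col_mean <- mean(df[[i]], na.rm = TRUE)
      df[[i]][is.na(df[[i]])] <- col_mean
    }
  }
  df  # return the modified copy; a bare for loop returns NULL
}

# Hypothetical data; a tibble could be passed in the same way
impute <- data.frame(x = c(1, NA, 3), y = c("a", NA, "c"), stringsAsFactors = FALSE)
impute <- imp(impute)
impute$x  # 1 2 3
```

Because R passes arguments by value, calling imp(impute) without assigning the result leaves impute unchanged.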

R: How to find the minimum value in row that has both numeric and non-numeric items?

I have a matrix - columns 1-371 are numeric, and columns 372-379 are non-numeric (ie. stores the age, gender information). I want to find the minimum value of each row of the numeric items (for each row, look over the 371 values).
I'm trying to make a count vector, so the code is:
count_a <- 0
for (i in 1:nrow(data)) {
if (min(data[i,][which(data$Age < age & data$Gender == gender)]) <= threshold) {
count_a <- count_a+1
}
}
However I keep getting this error: Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
What should I do? Thanks!
Using the CO2 data set try something like this:
NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric), drop = FALSE]
apply(NUM(CO2), 1, min)
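Applied to the counting problem in the question, a vectorized sketch might look like the following. The data, the column names m1 and m2, and the values of age, gender and threshold are all hypothetical stand-ins for the real 371 numeric columns and filter settings.

```r
# Hypothetical data: two numeric measurement columns plus Age and Gender
data <- data.frame(
  m1 = c(5, 2, 9, 1),
  m2 = c(7, 8, 3, 6),
  Age = c(30, 45, 25, 60),
  Gender = c("F", "M", "F", "F"),
  stringsAsFactors = FALSE
)
age <- 50; gender <- "F"; threshold <- 4   # assumed filter values

# Row-wise minimum over the measurement columns only
measure_cols <- c("m1", "m2")
row_min <- apply(data[measure_cols], 1, min)

# Count rows that pass the Age/Gender filter and whose minimum is within the threshold
count_a <- sum(data$Age < age & data$Gender == gender & row_min <= threshold)
count_a  # 1
```

Computing the row minimum once and combining the conditions with & replaces the explicit loop and keeps Age out of the minimum.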
