I'd like to classify some data into factor levels, so I wrote a function that takes an input and returns the corresponding level from a factor. The problem is that the result I get is the integer code of the factor level, not the factor itself. Here is some sample code.
data <- data.frame(a = 1:10)
find_class <- function(i) {
classes <- factor(c('A', 'B', 'C'))
ifelse(i %in% c(1, 3, 5), classes[1],
ifelse(i %in% c(2, 4, 9), classes[2], classes[3]))
}
data$class <- find_class(data$a)
As a result, data$class is of type int. How can I get data$class to be a factor?
Also, since the breaks are not based on a simple value range, I can't use cut (which would otherwise work fine).
It's the return value of ifelse that is causing the problem. If I use case_when from dplyr instead, it works.
library(dplyr)
data <- data.frame(a = 1:10)
find_class <- function(i) {
classes <- factor(c('A', 'B', 'C'))
case_when(
i %in% c(1,3,5) ~ classes[1],
i %in% c(2,4,9) ~ classes[2],
TRUE ~ classes[3]
)
}
data$class <- find_class(data$a)
str(data)
# 'data.frame': 10 obs. of 2 variables:
# $ a : int 1 2 3 4 5 6 7 8 9 10
# $ class: Factor w/ 3 levels "A","B","C": 1 2 1 2 1 3 3 3 2 3
You can index the levels of the variable classes with the output of the ifelse statement, as follows:
data <- data.frame(a = 1:10)
find_class <- function(i) {
classes <- factor(c('A', 'B', 'C'))
idx <- ifelse(i %in% c(1, 3, 5), classes[1],
ifelse(i %in% c(2, 4, 9), classes[2], classes[3]))
res <- levels(classes)[idx]
factor(res, levels(classes))
}
data$class <- find_class(data$a)
data$class
# [1] A B A B A C C C B C
# Levels: A B C
data
# a class
# 1 1 A
# 2 2 B
# 3 3 A
# 4 4 B
# 5 5 A
# 6 6 C
# 7 7 C
# 8 8 C
# 9 9 B
# 10 10 C
One more option - using a general mapping function as parameter:
factorize = function(
data,
mapping=function(v)
ifelse(v %in% c(1, 3, 5), "A",
ifelse(v %in% c(2, 4, 9), "B", "C"))
) {
as.factor(mapping(data))
}
That gives:
> factorize(1:10)
[1] A B A B A C C C B C
Levels: A B C
And now an option with a mapping vector instead of a mapping function:
factorize = function(
data,
mapping=c("1"="A", "2"="B", "3"="A", "4"="B", "5"="A", "9"="B"),
default="C"
) {
data = mapping[as.character(data)]
data[is.na(data)] = default
names(data) = NULL
as.factor(data)
}
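A quick usage check of the mapping-vector idea, written as a minimal sketch. The helper name classify and its levels argument are hypothetical additions (they are not part of the answer above); they only make the level order explicit instead of relying on the alphabetical sorting done by as.factor().

```r
# Hypothetical variant of the mapping-vector approach with explicit levels.
classify <- function(x, mapping, default,
                     levels = unique(c(mapping, default))) {
  out <- unname(mapping[as.character(x)])  # look up each value by name
  out[is.na(out)] <- default               # unmapped values get the default
  factor(out, levels = levels)
}

mapping <- c("1" = "A", "2" = "B", "3" = "A", "4" = "B", "5" = "A", "9" = "B")
res <- classify(1:10, mapping, default = "C")
res
# [1] A B A B A C C C B C
# Levels: A B C
```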
I think I have figured it out. Take a close look at the source code of "ifelse" by running it without parentheses. You will see a segment of code like the one below:
ans <- test
len <- length(ans)
ypos <- which(test)
npos <- which(!test)
if (length(ypos) > 0L)
ans[ypos] <- rep(yes, length.out = len)[ypos]
if (length(npos) > 0L)
ans[npos] <- rep(no, length.out = len)[npos]
ans
That is, "ifelse" starts from the logical vector "ans" and assigns "rep(yes, length.out = len)[ypos]" into it. However, when the value from "rep()" is a factor, the assignment drops the factor's attributes and keeps only its underlying integer codes, so "ans" becomes an integer vector and ifelse does not give what you want.
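The coercion can be reproduced outside of ifelse. The following is a small sketch that mimics the assignment step by hand; the variable names are only illustrative.

```r
classes <- factor(c("A", "B", "C"))

# `ans` plays the role of the logical `test` vector inside ifelse().
ans <- c(TRUE, FALSE, TRUE)

# Assigning factor values into a plain logical vector keeps only the
# integer codes; the levels and the "factor" class are not carried over.
ans[c(1, 3)] <- rep(classes[2], length.out = 3)[c(1, 3)]

class(ans)
# [1] "integer"
ans
# [1] 2 0 2
```

The untouched FALSE in position 2 becomes 0, and the two "B" values show up as their level code 2.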
Possible solution:
find_class <- function(i) {
  # plain character labels, so ifelse() returns a character vector
  classes <- c("A", "B", "C")
  outcome <- ifelse(i %in% c(1, 3, 5), classes[1],
             ifelse(i %in% c(2, 4, 9), classes[2], classes[3]))
  as.factor(outcome)
}
find_class(data$a)
This works because a logical vector can take character values and convert itself into a character vector, while the one in your function gets coerced to an integer vector.
The latest release of the fct_collapse() function from the forcats package can be used in place of the OP's own find_class() function. Please make sure to install the development version 0.4.0.9000 from GitHub instead of the CRAN version 0.4.0 with
devtools::install_github("tidyverse/forcats")
Then,
data$class <- forcats::fct_collapse(as.factor(data$a),
A = c("1", "3", "5"), B = c("2", "4", "9"),
other_level = "C")
data
returns
a class
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 C
7 7 C
8 8 C
9 9 B
10 10 C
str(data)
'data.frame': 10 obs. of 2 variables:
$ a : int 1 2 3 4 5 6 7 8 9 10
$ class: Factor w/ 3 levels "A","B","C": 1 2 1 2 1 3 3 3 2 3
Another approach is to create a lookup table from a named list:
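find_class <- function(i, classes) {
  # long format: one row per value, with the class name in column L1
  long <- reshape2::melt(classes)
  # unmatched values fall back to the row of the class defined as NA
  as.factor(long$L1[match(i, long$value, nomatch = which(is.na(long$value)))])
}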
data$class <- find_class(data$a, list(A = c(1, 3, 5), B = c(2, 4, 9), C = NA))
data
a class
1 1 A
2 2 B
3 3 A
4 4 B
5 5 A
6 6 C
7 7 C
8 8 C
9 9 B
10 10 C
str(data)
'data.frame': 10 obs. of 2 variables:
$ a : int 1 2 3 4 5 6 7 8 9 10
$ class: Factor w/ 3 levels "A","B","C": 1 2 1 2 1 3 3 3 2 3
The advantage is that the classification is not hard-coded but can be passed in a compact way as an additional parameter. Thus, the number of classes can be modified easily without having to deal with nested ifelse().
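The same lookup-table idea can be sketched in base R without reshape2. This is only an illustration under stated assumptions: the function name find_class_base and the default argument are hypothetical, and the fallback class is passed explicitly rather than encoded as an NA entry in the list.

```r
# Base R sketch of the lookup-table approach (hypothetical helper).
find_class_base <- function(i, classes, default) {
  long <- stack(classes)                       # columns: values, ind
  out <- as.character(long$ind)[match(i, long$values)]
  out[is.na(out)] <- default                   # values not listed anywhere
  factor(out, levels = c(names(classes), default))
}

res <- find_class_base(1:10, list(A = c(1, 3, 5), B = c(2, 4, 9)), default = "C")
res
# [1] A B A B A C C C B C
# Levels: A B C
```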
I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
I would like to extract the previous non-NA value (or the next one, if the NA is in the first row) for every row, and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want.
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which carries the last observation forward. Since the top row is a special case, we do it twice, the second time in reverse. This gives us the previous non-NA value in the column, or the next one when the NAs sit at the top of the column.
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
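If adding zoo is not an option, the two fill passes can be sketched in base R. The helper names fill_forward and fill_both are hypothetical; this is an illustration of the carry-forward idea, not a drop-in replacement for na.locf.

```r
# Carry the last non-NA value forward; leading NAs stay NA.
fill_forward <- function(v) {
  # index of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  idx[idx == 0] <- NA
  v[idx]
}

# Forward fill first, then use a backward fill only for leading NAs.
fill_both <- function(v) {
  fwd <- fill_forward(v)
  bwd <- rev(fill_forward(rev(v)))
  ifelse(is.na(fwd), bwd, fwd)
}

filled <- fill_both(c(NA, NA, 3, 4, NA))
filled
# [1] 3 3 3 4 4
```

Applied column by column (for example with apply(x, 2, fill_both)), this plays the same role as the two na.locf calls in myfunc2.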
I have a dataframe of survey responses (rows = participants, columns = question responses). Participants respond to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 across all 50 questions, as they have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R?
Many thanks,
R
If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
col2 = 1:10,
col3 = rep(1:2, 5))
You could calculate the variance of each column and keep only the columns whose variance is non-zero, or whose variance is at least some threshold, which is close to what nearZeroVar() would do:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
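To make the row-wise version concrete, here is a small sketch on made-up survey data (the data frame below is invented for illustration). Note that a zero-variance filter removes every constant row, so a participant who answered 3 everywhere is dropped as well, not only those who answered 5 throughout.

```r
# Made-up responses: row 2 is all 5s, row 3 is all 3s.
survey <- data.frame(q1 = c(1, 5, 3, 5),
                     q2 = c(2, 5, 3, 4),
                     q3 = c(3, 5, 3, 5))

row_var <- apply(survey, 1, var)   # variance of each participant's answers
kept <- survey[row_var != 0, ]
kept
#   q1 q2 q3
# 1  1  2  3
# 4  5  4  5
```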
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
# use a logical index: negative indexing with an empty which() result would drop every row
all_five <- apply(survey[, -1], 1, function(x) all(x == 5))
survey[!all_five, ]
This will remove rows where all values equal 5.
# Dummy data:
df <- data.frame(
matrix(
sample(1:5, 100000, replace =TRUE),
ncol = 5
)
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
likert1 likert2 likert3 likert4 likert5 id
1 1 2 4 4 5 1
2 5 4 2 2 1 2
3 2 1 2 1 5 3
4 5 1 3 3 2 4
5 4 3 3 5 1 5
6 1 3 3 2 3 6
dim(df)
[1] 20000 6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
[1] 19995
Borrowing @AshOfFire's data, with a small modification since you say you only have answers in the columns and no participant column:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey==survey[[1]],1,all),]
# q1 q2 q3
# 1 1 1 3
# 4 5 5 4
# 6 1 1 5
# 10 2 3 5
The equality test compares every column with the first one and builds a logical matrix; with apply we then keep the rows that aren't TRUE throughout.
I want to recode some numeric values into different numeric values and have had a go using the following code:
survey$KY27PHYc <- revalue(survey$KY27PHY1, c(5=3, 4=2,3=2,2=1,1=1))
I get the following error:
## Error: unexpected '=' in "survey$KY27PHYc <- revalue(survey$KY27PHY1, c(5="
Where am I going wrong?
We can recode numeric values using recode or case_when from dplyr (shown here with version 0.7.0).
library(dplyr)
packageVersion("dplyr")
# [1] ‘0.7.0’
x <- 1:10
# With recode function using backquotes as arguments
dplyr::recode(x, `2` = 20L, `4` = 40L)
# [1] 1 20 3 40 5 6 7 8 9 10
# Note: the replacements need the "L" suffix so that they are integers, like x.
dplyr::recode(x, `2` = 20, `4` = 40)
# [1] NA 20 NA 40 NA NA NA NA NA NA
# Warning message:
# Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default
# With recode function using characters as arguments
as.numeric(dplyr::recode(as.character(x), "2" = "20", "4" = "40"))
# [1] 1 20 3 40 5 6 7 8 9 10
# With case_when function
dplyr::case_when(
x %in% 2 ~ 20,
x %in% 4 ~ 40,
TRUE ~ as.numeric(x)
)
# [1] 1 20 3 40 5 6 7 8 9 10
revalue() (from plyr) does not work on numeric vectors. If you want to use it, you can convert to character and back, as follows:
library(plyr)
x <- 1:10 # your numeric vector
as.numeric(revalue(as.character(x), c("2" = "33", "4" = "88")))
# [1] 1 33 3 88 5 6 7 8 9 10
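The syntax error in the question comes from c(5=3, ...): a bare number is not a valid argument name, so the names have to be quoted (or backquoted). With quoted names, the same mapping can also be applied in base R as a named lookup vector. A minimal sketch, using a made-up input vector x:

```r
x <- c(5, 4, 3, 2, 1, 3)   # made-up input values

# Names are the old values (as strings), elements are the new values.
lookup <- c("5" = 3, "4" = 2, "3" = 2, "2" = 1, "1" = 1)

recoded <- unname(lookup[as.character(x)])
recoded
# [1] 3 2 2 1 1 2
```

Values that have no entry in lookup come back as NA, so the mapping should list every value that can occur.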
Try this:
#sample data
set.seed(123); x <- sample(1:5, size = 10, replace = TRUE)
x
# [1] 2 4 3 5 5 1 3 5 3 3
#recode
x <- c(1, 1, 2, 2, 3)[ x ]
x
# [1] 1 2 2 3 3 1 2 3 2 2
How do I ignore case when using the subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select the columns named x, y, z, w, t, regardless of case. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT If your column names are longer than one letter, the above approach won't work too well. So assuming you can get your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' library is a very neat wrapper for all of this functionality. It has 'ignore.case' option as follows:
Also, you may want to consider using match instead of subset.
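A minimal sketch of the match idea, assuming the goal is a case-insensitive selection that follows the order of the requested names. The data frame and the wanted vector below are made up for illustration.

```r
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5, y = 1:5)
wanted <- c("X", "xy")   # requested names, in the desired output order

# Compare lower-cased names on both sides; match() keeps the order of `wanted`.
cols <- match(tolower(wanted), tolower(names(dat)))
cols <- cols[!is.na(cols)]            # ignore requested names that do not exist

res <- dat[, cols, drop = FALSE]
names(res)
# [1] "x"  "XY"
```

Because match() returns only the first hit, two columns whose names differ only by case would need extra handling.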