I have a dataframe of survey responses (rows = participants, columns = question responses). Participants responded to 50 questions on a 5-point Likert scale. I would like to remove participants who answered 5 on all 50 questions, as their responses have zero variance and are likely to bias my results.
I have seen the nearZeroVar() function, but was wondering if there's a way to do this in base R?
Many thanks,
R
If you had this dataframe:
df <- data.frame(col1 = rep(1, 10),
col2 = 1:10,
col3 = rep(1:2, 5))
You could calculate the variance of each column and keep only those columns where the variance is not 0, or where it is greater than or equal to a certain threshold, which is close to what nearZeroVar() would do:
df[, sapply(df, var) != 0]
df[, sapply(df, var) >= 0.3]
If you wanted to exclude rows, you could do something similar, but loop through the rows instead and then subset:
df[apply(df, 1, var) != 0, ]
df[apply(df, 1, var) >= 0.3, ]
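If the data also has an ID column or missing answers, the row-wise version needs a little more care. Here is a minimal sketch, assuming a made-up `survey` frame (the names `id` and `q1` to `q3` are only for illustration) where NAs should be ignored when computing the variance:

```r
# Hypothetical toy survey: an id column plus three Likert items
survey <- data.frame(id = 1:5,
                     q1 = c(5, 1, 3, 2, 4),
                     q2 = c(5, 2, 3, NA, NA),
                     q3 = c(5, 3, 3, 2, NA))

# Leave the id column out of the variance calculation
items <- survey[, -1]
row_var <- apply(items, 1, var, na.rm = TRUE)

# Keep rows whose variance is defined and non-zero
kept <- survey[!is.na(row_var) & row_var != 0, ]
kept$id
```

Rows whose variance comes out as NA (fewer than two non-missing answers) are dropped here as well; whether that is what you want depends on your data.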
Assuming you have data like this.
survey <- data.frame(participants = c(1:10),
q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
You can do the following.
all_five <- apply(survey[, -1], 1, function(x) all(x == 5))
survey[!all_five, ]
This will remove rows where all values equal 5.
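One caveat worth knowing when dropping rows by position: if no participant answered all 5s, which() returns an empty index, and negative indexing with an empty index returns zero rows rather than all of them. Logical indexing does not have this problem. A small sketch with made-up data in which nobody answered all 5s:

```r
# Hypothetical data: no row consists only of 5s
survey <- data.frame(participants = 1:3,
                     q1 = c(1, 2, 3),
                     q2 = c(1, 2, 4))

all_five <- apply(survey[, -1], 1, function(x) all(x == 5))
idx <- which(all_five)             # empty integer vector

n_negative <- nrow(survey[-idx, ])      # negative indexing with an empty index
n_logical  <- nrow(survey[!all_five, ]) # logical indexing
c(n_negative, n_logical)
```

The first count is 0 (every row is lost), the second is 3 (every row is kept), which is why the logical form is the safer habit.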
# Dummy data:
df <- data.frame(
  matrix(
    sample(1:5, 100000, replace = TRUE),
    ncol = 5
  )
)
names(df) <- paste0("likert", 1:5)
df$id <- 1:nrow(df)
head(df)
likert1 likert2 likert3 likert4 likert5 id
1 1 2 4 4 5 1
2 5 4 2 2 1 2
3 2 1 2 1 5 3
4 5 1 3 3 2 4
5 4 3 3 5 1 5
6 1 3 3 2 3 6
dim(df)
[1] 20000 6
# Clean out rows where all likert values are 5
df <- df[rowSums(df[grepl("likert", names(df))] == 5) != 5, ]
nrow(df)
[1] 19995
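Note that the 5 on the right-hand side of that comparison is the number of Likert columns, which only happens to equal the answer value being tested. A small sketch that counts the columns instead of hard-coding them, so the same line would work for 50 questions (the toy `df` and the names `item_cols`, `n_items` and `cleaned` are made up for illustration):

```r
# Hypothetical toy data with three Likert items
df <- data.frame(id = 1:3,
                 likert1 = c(5, 5, 1),
                 likert2 = c(5, 4, 1),
                 likert3 = c(5, 5, 1))

item_cols <- grepl("^likert", names(df))  # which columns are Likert items
n_items <- sum(item_cols)                 # how many there are

# Drop rows where every item equals 5
cleaned <- df[rowSums(df[item_cols] == 5) != n_items, ]
cleaned$id
```

This assumes there are no missing answers; with NAs, rowSums() would return NA for the affected rows unless na.rm = TRUE is passed.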
Stealing @AshOfFire's data, with a small modification since you say you only have answers in the columns and no participant column:
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
q2 = c(1,2,5,5,5,1,2,3,4,3),
q3 = c(3,2,5,4,5,5,2,3,4,5))
survey[!apply(survey==survey[[1]],1,all),]
# q1 q2 q3
# 1 1 1 3
# 4 5 5 4
# 6 1 1 5
# 10 2 3 5
The equality test builds a logical matrix that is TRUE wherever a value equals the first answer in its row; with apply we then keep the rows that aren't all TRUE.
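The same idea can be written without apply by counting, per row, how many answers differ from the first one. A sketch on the same data (`differs` is just an illustrative name):

```r
survey <- data.frame(q1 = c(1,2,5,5,5,1,2,3,4,2),
                     q2 = c(1,2,5,5,5,1,2,3,4,3),
                     q3 = c(3,2,5,4,5,5,2,3,4,5))

# TRUE for rows where at least one answer differs from the first answer
differs <- rowSums(survey != survey[[1]]) > 0
survey[differs, ]
```

Like the apply version, this removes any participant who gave the same answer to every question, not only those who answered 5 throughout.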
I've got a dataset
> View(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
I would like to extract the previous non-NA value (or the next one, if the NA is in the first row) for every row, and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want:
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.
Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which takes the "last observation carried forward". Since the top row is a special case, we do it twice, the second time in reverse. This gives us the "next non-NA value in the column" (up or down, depending).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]),]
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
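If pulling in zoo is not an option, the same fill can be approximated in base R. This is only a sketch: `locf_base` and `fill_both` are hypothetical helper names, not part of any package, and they assume plain atomic vectors.

```r
# Carry the last non-NA value forward; leading NAs stay NA
locf_base <- function(v) {
  # position of the most recent non-NA value seen so far (0 if none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  idx[idx == 0L] <- NA
  v[idx]
}

# Fill forward first, then backward to cover NAs at the top
fill_both <- function(v) {
  v <- locf_base(v)
  rev(locf_base(rev(v)))
}

filled <- fill_both(c(NA, NA, 3, 4, NA))
filled
```

Applied to the V1 column of dat2, this gives the same filled values as the two na.locf passes, so it could be used column by column in place of the zoo calls.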
I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(c(20, 25)); for ID 2 it would return NA because you can't calculate a variance on just one entry; for ID 3 the variance would be var(c(15, 32, 16)); and for ID 4 it would again return NA because it has no non-zero values at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len = nrow(d)
variances = numeric(len)
for (i in 1:len){
  #get all nonzero values in ith row of data into a vector nonzerodat here
  currentvar = var(nonzerodat)
  variances[i] = currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[x != 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
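To keep the IDs next to the result while staying in base R, the vector returned by apply can be put into a new data frame. A small sketch reusing the same toy data (`result` is just an illustrative name):

```r
Data <- data.frame(ID = 1:4,
                   Value1 = c(20, 2, 15, 0),
                   Value2 = c(25, 0, 32, 0),
                   Value3 = c(0, 0, 16, 0))

# Variance of the non-zero entries; NA when fewer than two remain
var_nonzero <- function(x) var(x[x != 0])

# Pair each ID with its row-wise variance
result <- data.frame(ID = Data$ID,
                     variance = apply(Data[, -1], 1, var_nonzero))
result
```

Because apply works on all columns passed to it, this should scale to 40 or more value columns without changes, as long as the ID column is excluded.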
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
I want to scale my data before doing a PCA, but I found that some columns contain NA and some columns have a variance of 0. I want to delete these columns. This is an example of my data:
df <- data.frame( v1 = 1:10 , v2 = rep( 0 , 10 ) , v3 = sample( c( 1:3 , NA ) , 10 , replace = TRUE ), v4 = 1:10 )
I want to delete the v2 and v3 columns at the same time. How can I do that?
I know how to delete the columns that contain NA, and then delete the columns whose variance equals 0:
colsd <- apply(df, 2, sd)
df2 <- df[!is.na(colsd)]
colsd2 <- apply(df2, 2, sd)
df3 <- df2[!colsd2 == 0]
But this looks redundant. Is there a more efficient way to do this, maybe in just one line? Thank you for any response.
You can try something like:
> df[!sapply(df, var) %in% c(0, NA)]
v1 v4
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
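The one-liner relies on var() returning NA for a column that contains NA. A slightly more explicit sketch of the same idea, using a fixed v3 instead of sample() so the result is reproducible (`keep` and `df_clean` are illustrative names):

```r
df <- data.frame(v1 = 1:10,
                 v2 = rep(0, 10),
                 v3 = c(1, 2, NA, 3, 1, 2, 3, NA, 1, 2),
                 v4 = 1:10)

# TRUE for columns with no missing values and non-zero variance
keep <- vapply(df, function(col) !anyNA(col) && var(col) != 0, logical(1))

df_clean <- df[keep]
names(df_clean)
```

The remaining columns can then be passed to scale() before the PCA.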
I have the following data frame
Type CA AR
alpha 1 5
beta 4 9
gamma 3 8
I want to get the column and row sums such that it looks like this:
Type CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
I am able to do rowSums (as shown above) I guess because they are all numeric.
colSums(df)
However, when I do colSums I get the error 'x must be numeric.' I realize that this is because the "Type" column is not numeric.
If I do the following code such that I try to print the value into the 4th row (and only the 2nd through 4th columns are summed)
df[,4] = colSums(df[c(2:4)])
Then I get an error that replacement isn't same as data size.
Does anyone know how to work around this? I want to print the column sums for columns 2-4, and leave the 1st column total blank or allow me to print "Total"?
Thanks in advance!!
Check out numcolwise() in the plyr package.
library(plyr)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
numcolwise(sum)(df)
Result:
CA AR
1 8 22
Use a matrix:
m <- as.matrix(df[,-1])
rownames(m) <- df$Type
# CA AR
# alpha 1 5
# beta 4 9
# gamma 3 8
Then add margins:
addmargins(m,FUN=c(Total=sum),quiet=TRUE)
# CA AR Total
# alpha 1 5 6
# beta 4 9 13
# gamma 3 8 11
# Total 8 22 30
The simpler addmargins(m) also works, but defaults to labeling the margins with "Sum".
You are right, it is because the first column is not numeric.
Try to use the first column as rownames:
df <- data.frame(row.names = c("alpha", "beta", "gamma"), CA = c(1, 4, 3), AR = c(5, 9, 8))
df$Total <- rowSums(df)
df['Total',] <- colSums(df)
df
The output will be:
CA AR Total
alpha 1 5 6
beta 4 9 13
gamma 3 8 11
Total 8 22 30
If you need the word 'Type', just remove the rownames and add the column back:
Type <- rownames(df)
df <- data.frame(Type, df, row.names=NULL)
df
And its output:
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
Use:
df$Total <- df$CA + df$AR
A more general solution:
data$Total <- Reduce('+',data[, sapply(data, is.numeric)])
EDIT: I realize I completely misunderstood the question. You are indeed looking for the sum of rows, and I gave the sum of columns.
To do rows instead:
data <- data.frame(x = 1:3, y = 4:6, z = as.character(letters[1:3]))
data$z <- as.character(data$z)
rbind(data,sapply(data, function(y) ifelse(test = is.numeric(y), Reduce('+',y), "Total")))
If you do not know which columns are numeric, but rather want the sums across rows then do this:
df$Total = rowSums( df[ sapply(df, is.numeric)] )
The is.numeric function will return a logical value which is valid for selecting columns and sapply will return the logical values as a vector.
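To make that concrete, here is a small sketch on the data from the question (`num_cols` is just an illustrative name) showing the logical vector that sapply produces and how it selects the columns to sum:

```r
df <- data.frame(Type = c("alpha", "beta", "gamma"),
                 CA = c(1, 4, 3),
                 AR = c(5, 9, 8))

# One TRUE/FALSE per column: FALSE for Type, TRUE for CA and AR
num_cols <- sapply(df, is.numeric)
num_cols

# Only the numeric columns take part in the row sums
df$Total <- rowSums(df[num_cols])
df
```

Computing `num_cols` before adding Total keeps the new column out of any later sums that reuse the same selector.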
To add a set of column totals and a grand total we need to rewind to the point where the dataset was created and prevent the "Type" column from being constructed as a factor:
dat <- read.table(text="Type CA AR
alpha 1 5
beta 4 9
gamma 3 8 ", header=TRUE, stringsAsFactors=FALSE)
dat$Total = rowSums( dat[ sapply(dat, is.numeric)] )
rbind( dat, append(c(Type="Total"),
as.list(colSums( dat[ sapply(dat, is.numeric)] ))))
#----------
Type CA AR Total
1 alpha 1 5 6
2 beta 4 9 13
3 gamma 3 8 11
4 Total 8 22 30
That's a data.frame:
> str( rbind( dat, append(c(Type="Total"), as.list(colSums( dat[ sapply(dat, is.numeric)] )))) )
'data.frame': 4 obs. of 4 variables:
$ Type : chr "alpha" "beta" "gamma" "Total"
$ CA : num 1 4 3 8
$ AR : num 5 9 8 22
$ Total: num 6 13 11 30
I think this should solve your problem
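x <- data.frame(type = c('alpha', 'beta', 'gamma'), x = c(1, 2, 3), y = c(4, 5, 6))
x[, 'Total'] <- rowSums(x[, c(2:3)])
# build the totals row as a one-row data frame so the numeric columns stay numeric
x <- rbind(x, data.frame(type = 'Total', t(colSums(x[, c(2:4)]))))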
library(tidyverse)
df <- data.frame(
Type = c("alpha", "beta", "gamma"),
CA = c(1, 4, 3),
AR = c(5, 9, 8)
)
df2 <- colSums(df[, c("CA", "AR")])
# CA AR
# 8 22
I have to call the table() function on 10 variables in R. Is there any way of doing it in one shot, without calling them individually like table(v1), table(v2)... table(v10)?
If your variables are arranged as columns in a data.frame, you could use lapply:
df <- data.frame(aa = rpois(10, 4), bb = rpois(10, 3), c = rpois(10, 7))
tabList <- lapply(df, table)
Then you get a list with the various tables:
> tabList
$aa
1 3 4 5 6 7
2 3 2 1 1 1
$bb
1 2 3 4 5
1 2 4 1 2
$c
3 4 5 6 7 9 11 12
1 1 1 3 1 1 1 1
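If only some of the columns are of interest, the data frame can be subset by name before calling lapply, and individual counts can then be pulled out of the resulting list. A small sketch with made-up, non-random data (the column names `v1` to `v3` and the name `tabs` are only for illustration):

```r
df <- data.frame(v1 = c("a", "b", "a", "c"),
                 v2 = c(1, 1, 2, 2),
                 v3 = c(TRUE, FALSE, TRUE, TRUE))

# Tabulate only the selected columns
tabs <- lapply(df[c("v1", "v3")], table)

names(tabs)        # one table per selected column
tabs$v1[["a"]]     # count of "a" in v1
```

Each list element is an ordinary table object, so it can be indexed by level name as in the last line.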
EDIT:
For variables across multiple data.frames, you might try putting them into a list and then using lapply again:
df2 <- df[sample(rownames(df), 15, replace = TRUE), ]
df3 <- df[sample(rownames(df), 20, replace = TRUE), ]
dfList <- list(df = df, df2 = df2, df3 = df3)
lapply(dfList, function(x) lapply(x, FUN = table))