I'm working with the Pima Indians Diabetes data from Kaggle in RStudio, and instead of NAs the missing values are coded as 0s.
How can I count the number of "0" values in each variable with a single loop instead of typing table(data$variableName == 0) for each column? In other words, I want a single loop for the whole data frame.
We can use colSums on a logical matrix:
colSums(data == 0)
Or loop over the columns with sapply:
sapply(data, function(x) sum(x == 0))
Or with apply over the columns (MARGIN = 2):
apply(data, 2, function(x) sum(x == 0))
Or with an explicit for loop:
count <- numeric(ncol(data))
for(i in seq_along(data)) count[i] <- sum(data[[i]] == 0)
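As a quick check on a toy data frame (the column names here are made up, not the actual Kaggle ones), all of the above return the same named counts; if the data also contains NAs, add na.rm = TRUE to colSums():
toy <- data.frame(glucose = c(148, 0, 183, 0), insulin = c(0, 0, 94, 168))
colSums(toy == 0)
#> glucose insulin
#>       2       2
sapply(toy, function(x) sum(x == 0))
#> glucose insulin
#>       2       2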
Try this:
library(dplyr)
data %>% summarise(across(everything(), ~ sum(. == 0, na.rm = TRUE), .names = "Zeros_in_{.col}"))
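On the same toy data frame as above, this returns a one-row summary with one count per column:
toy %>% summarise(across(everything(), ~ sum(. == 0, na.rm = TRUE), .names = "Zeros_in_{.col}"))
#>   Zeros_in_glucose Zeros_in_insulin
#> 1                2                2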
As the subject suggests, how can I write the following operation with lapply/map/etc to be more efficient in R?
for(i in 1:length(tbl)){
  tbl[[i]] <- filter(tbl[[i]], tbl[[i]][, 7] >= 10)
}
The idea is basically to filter each element (data frame) of the list on its 7th column, so that the output keeps only the rows with a value of 10 or greater in that column of each data frame. I tried something like this, but couldn't get the code working:
lapply(tbl, function(x) filter(x, tbl[[x]][, 7] >= 10))
A base R approach is to subset each data frame directly:
lapply(tbl, function(x) x[x[,7] >= 10,])
Or using dplyr::filter:
lapply(tbl, function(x) filter(x, x[,7] >= 10))
Or using purrr::map:
map(tbl, ~filter(.x, .x[,7] >= 10))
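As a quick check with a made-up list of two small 7-column data frames (the contents are invented here), all three versions keep only the rows whose 7th column is at least 10:
library(dplyr)
library(purrr)
set.seed(42)
tbl <- list(as.data.frame(matrix(sample(1:20, 14), ncol = 7)),
            as.data.frame(matrix(sample(1:20, 14), ncol = 7)))
filtered <- map(tbl, ~ filter(.x, .x[, 7] >= 10))
sapply(filtered, function(x) all(x[, 7] >= 10))
#> [1] TRUE TRUE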
Many previous questions highlight various ways to remove duplicate rows with missing values; however, none deals with the following case. Example starting data:
df <- data.frame(x = c(1, NA, 1), y=c(NA, 1, 1), z=c(0, NA, NA))
print(df)
Desired output:
df2 <- data.frame(x = c(1, 1), y=c(NA, 1), z=c(0, NA))
print(df2)
In this case the second row was removed because it was a perfect subset of row 3. In the real application I want to remove rows whose non-missing values are all duplicated in another row, and keep the row that has fewer missing values overall.
I thought this might be accomplished using dplyr and a rowwise application of distinct(), but to no avail. I could do this with a very slow for loop, but with hundreds of columns and thousands of rows this is a poor option.
Here is another option using data.table:
library(data.table)
#convert to long format (tagging each row with its index rn), drop NAs, and count the non-NA values per row
mDT <- melt(setDT(df)[, rn := .I], id.var = "rn", na.rm = TRUE)[, cnt := .N, rn]
#self-join on (variable, value) and keep only matches between different rows
merged <- mDT[mDT, on = .(variable, value), {
  diffrow <- i.rn != x.rn
  .(irn = i.rn[diffrow], xrn = x.rn[diffrow], icnt = i.cnt[diffrow])
}]
#count the matches per pair of rows and keep rows whose non-NA values are all matched by one other row
ix <- merged[, xcnt := .N, .(irn, xrn)][icnt == xcnt]$irn
#drop those subset rows (df is a data.table after setDT, so negative indexing drops rows)
df[-ix]
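Note that setDT() converts df to a data.table in place and the rn helper column added with := stays on df, so it also shows up in df[-ix]. If needed, the helper column can be dropped and the result converted back to a plain data.frame:
res <- df[-ix][, rn := NULL]   # drop the helper column
setDF(res)                     # back to a plain data.frame, by reference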
I'm not sure how to do it with dplyr, but here is a solution with a loop. I'm also not sure that a dplyr solution would be faster than the loop (in the end it has to loop over something), and here you can at least control the loop flow.
The subsetVector function determines whether vector a is a subset of vector b (returns 1), whether vector b is a subset of vector a (returns 2), or neither (returns 0). I then loop over all rows of the data.frame and remove the subset rows.
subsetVector <- function(a, b){
  na_a <- which(is.na(a))
  na_b <- which(is.na(b))
  if(all(na_a %in% na_b)){
    # every NA position of a is also NA in b, so b can only be a subset of a;
    # compare the positions where b is observed
    keep <- setdiff(seq_along(a), na_b)
    if(all(a[keep] == b[keep])) return(2)
  }else if(all(na_b %in% na_a)){
    # every NA position of b is also NA in a, so a can only be a subset of b
    keep <- setdiff(seq_along(a), na_a)
    if(all(b[keep] == a[keep])) return(1)
  }
  return(0)
}
i <- 1
while(i < nrow(df)){
  remove_rows <- NULL
  for(j in (i+1):nrow(df)){
    p <- subsetVector(df[i,], df[j,])
    if(p == 1){
      # row i is a subset of row j: mark row i and stop comparing it
      remove_rows <- c(remove_rows, i)
      break()
    }else if(p == 2){
      # row j is a subset of row i: mark row j
      remove_rows <- c(remove_rows, j)
    }
  }
  if(length(remove_rows) > 0)
    df <- df[-remove_rows,]
  # only advance when row i itself was not removed
  if(!i %in% remove_rows)
    i <- i + 1
}
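Re-creating the example df (the data.table answer above modified it in place) and running the loop should leave rows 1 and 3, i.e. the desired df2:
df <- data.frame(x = c(1, NA, 1), y = c(NA, 1, 1), z = c(0, NA, NA))
# ...define subsetVector and run the while loop above...
print(df)
#>   x  y  z
#> 1 1 NA  0
#> 3 1  1 NA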
I'm working on a shiny R app in which I need to parse csv files. From them, I build a dataframe. Then, I want to extract some rows from this dataframe and put them in another dataframe.
I found a way to do that using rbind, but it's pretty ugly and seems inadequate.
function(set){ #set is the data.frame containing the data I want to extract
  newTable <- data.frame(
    name = character(1),
    value = numeric(1),
    columnC = character(1),
    stringsAsFactors = FALSE)
  threshold <- 0
  for (i in 1:nrow(set)){
    value <- calculateValue(set$Value[[i]])
    if (value >= threshold){
      name <- set[which(set$Name == "foo"), ]$Name
      columnC <- set[which(set$C == "bar"), ]$C
      v <- c(name, value, columnC)
      newTable <- rbind(newTable, v)
    }
  }
  newTable
}
If I don't initialize my dataframe values with character(1) or numeric(1), I get an error:
Warning: Error in data.frame: arguments imply differing number of rows: 0, 1
  75: stop
  74: data.frame
But then it leaves me with an empty row in my dataframe (empty strings for characters and 0s for numerics).
Since R is a cool language, I assume there's an easier and more efficient way to do this. Can anybody help me?
Rather than looping through each row, you can either subset
function(set, threshold) {
  set[calculateValue(set$Value) >= threshold, c("name", "value", "columnC")]
}
Or use dplyr to filter rows and select columns to get the subset you want.
library(tidyverse)
function(set, threshold) {
  set %>%
    filter(calculateValue(Value) >= threshold) %>%
    select(name, value, columnC)
}
Then assign the result to a new variable if you want a new dataframe
getValueOverThreshold <- function(set, threshold) {
  set %>%
    filter(calculateValue(Value) >= threshold) %>%
    select(name, value, columnC)
}
newDF <- getValueOverThreshold(set, 0)
You might want to check out https://r4ds.had.co.nz/transform.html
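One assumption worth flagging (calculateValue isn't shown in the question): both versions above pass the whole Value column to calculateValue at once, so they require it to be vectorised. If it only handles a single value at a time, it can be applied element-wise instead, e.g.:
getValueOverThreshold <- function(set, threshold) {
  set %>%
    filter(sapply(Value, calculateValue) >= threshold) %>%  # element-wise if calculateValue isn't vectorised
    select(name, value, columnC)
}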
I have this data.frame:
set.seed(1)
df <- data.frame(id1=LETTERS[sample(26,100,replace = T)],id2=LETTERS[sample(26,100,replace = T)],stringsAsFactors = F)
and this vector:
vec <- LETTERS[sample(26,10,replace = F)]
I want to remove from df any row in which either df$id1 or df$id2 is not in vec.
Is there any faster way of finding the row indices which meet this condition than this:
rm.idx <- which(!apply(df,1,function(x) all(x %in% vec)))
I used dplyr with a script like this:
df1 <- df %>% filter(!(df$id1 %in% vec)|!(df$id2 %in% vec))
Looping over the columns might be faster than looping over the rows. Use lapply to loop over the columns and create a list of logical vectors with %in%, then use Reduce with | to check whether each row has a TRUE in any column, and use that to subset the 'df':
df[Reduce(`|`, lapply(df, `%in%`, vec)),]
If we need both elements, then replace | with &
df[Reduce(`&`, lapply(df, `%in%`, vec)),]
Actually
rm.idx <- unique(which(!(df$id1 %in% vec) | !(df$id2 %in% vec)))
is also fast.
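If speed is the main concern, a quick benchmark sketch (assuming the microbenchmark package is installed) compares the row-wise apply with the column-wise and vectorised alternatives; on two character columns like these, the latter two are typically much faster:
library(microbenchmark)
microbenchmark(
  apply_rows  = which(!apply(df, 1, function(x) all(x %in% vec))),
  reduce_cols = which(!Reduce(`&`, lapply(df, `%in%`, vec))),
  vectorised  = which(!(df$id1 %in% vec) | !(df$id2 %in% vec)),
  times = 100
)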
I am trying to remove duplicate rows from an R data frame, but with the condition that, among duplicates, the row with the smaller (or larger; it doesn't matter for this question) value in a certain column should be kept.
I can remove duplicate rows normally (from either side) like this:
df = data.frame(x = c(1,1,2,3,4,5,5,6,1,2,3,3,4,5,6),
                y = c(rnorm(4), NA, rnorm(10)),
                id = c(rep(1,8), rep(2,7)))
splitID <- split(df, df$id)
lapply(splitID, function(x) x[!duplicated(x$x), ])
How can I condition the removal of duplicate rows?
Thanks!
Use ave() to return a logical index to subset your data.frame
idx = as.logical(ave(df$y, df$x, df$id, FUN=fun))
df[idx,, drop=FALSE]
Some possible choices of fun include:
fun1 = function(x)
  !is.na(x) & !duplicated(x) & (x == min(x, na.rm=TRUE))
fun2 = function(x) {
  res = logical(length(x))
  res[which.min(x)] = TRUE
  res
}
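For example, plugging fun2 into the ave() call above keeps, within each (x, id) group, the row with the smallest y (note that which.min() ignores NA, so a group whose y values are all NA is dropped entirely):
idx = as.logical(ave(df$y, df$x, df$id, FUN = fun2))
df[idx, , drop = FALSE]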
The dplyr version of this might be
df %>% group_by(x, id) %>% filter(fun2(y))
We may need to order the rows before applying duplicated, so that the smallest y comes first within each x:
lapply(splitID, function(x) {
  x <- x[order(x$x, x$y), ]   # smallest y first within each x
  x[!duplicated(x$x), ]       # keep the first (smallest-y) row per x
})
and for the reverse, i.e. keeping the larger values, order with decreasing = TRUE
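For example (a sketch along those lines; note that decreasing = TRUE also reverses the order of the x groups in the output):
lapply(splitID, function(x) {
  x <- x[order(x$x, x$y, decreasing = TRUE), ]   # largest y first within each x
  x[!duplicated(x$x), ]                          # keep the largest-y row per x
})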