I have a dataframe as below. I want to get a column of maximums for each row, but that column should ignore the value 9 if it is present in that row.
How can I achieve that efficiently?
df <- data.frame(age=c(5,6,9), marks=c(1,2,7), story=c(2,9,1))
df$max <- apply(df, 1, max)
df
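With the sample data, the apply() line above fills the max column with 5, 9 and 9, whereas the desired result is 5, 6 and 7 (9s ignored):
#   age marks story max
# 1   5     1     2   5
# 2   6     2     9   9
# 3   9     7     1   9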
Here's one possibility:
df$colMax <- apply(df, 1, function(x) max(x[x != 9]))
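One caveat worth noting (my observation, not part of the original answer): if a row consists entirely of 9s, x[x != 9] is empty and max() returns -Inf with a warning:
max(numeric(0))
# [1] -Inf
# Warning message:
# In max(numeric(0)) : no non-missing arguments to max; returning -Inf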
The pmax function would be useful here. The only catch is that it takes a bunch of vectors as parameters. You can convert a data.frame into those parameters with do.call. I also set the 9 values to NA as suggested by others, but do so using the somewhat unconventional `is.na<-` replacement function.
do.call(pmax, c(`is.na<-`(df, df == 9), na.rm = TRUE))
# [1] 5 6 7
Substitute 9 with NA and then use pmax as suggested by @MrFlick in his deleted answer:
df2 <- df #copy df because we are going to change it
df2[df2==9] <- NA
do.call(function(...) pmax(..., na.rm=TRUE), df2)
#[1] 5 6 7
#make a copy of your data.frame
tmp.df <- df
#replace the 9s with NA
tmp.df[tmp.df==9] <- NA
#Use apply to process the data one row at a time through the max function, removing NA values first
apply(tmp.df, 1, max, na.rm = TRUE)
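Like the other approaches, this returns:
# [1] 5 6 7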
Related
Suppose I have a data.frame like THIS (or see my code below). As you can see, after every block of rows there is a row of all NAs.
I was wondering how I could split this data.frame at every such all-NA row.
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames, as there are 2 rows of NAs in the original data.frame.
Here is what I tried, with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index numbers of rows with NAs; here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need the NA rows, create a group with cumsum on the 'study.name' column where it is blank (or NA):
library(dplyr)
DF %>%
  group_split(grp = cumsum(lag(study.name == "", default = FALSE)), keep = FALSE)
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF)) == ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can find the occurrences of b in the row sequence of DF and use cumsum to create groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
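For readers without access to the linked CSV, here is a minimal self-contained sketch of the cumsum grouping idea; the toy data and values are stand-ins for the real file:
# toy stand-in: an all-NA separator row after each block
toy <- data.frame(study.name = c("a", "a", NA, "b", "b", NA, "c"),
                  value = c(1, 2, NA, 3, 4, NA, 5))
i1 <- rowSums(is.na(toy)) == ncol(toy)      # TRUE on the separator rows
split(toy, cumsum(c(FALSE, head(i1, -1))))  # 3 pieces, each separator attached to the preceding block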
I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is the following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
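For example, with some made-up data:
df <- data.frame(col = c(1, NA, 3, NA))
sum(is.na(df$col))
# [1] 2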
If you are looking for NA counts for each column in a dataframe, then:
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
should give you a list with the counts for each column.
na_count <- data.frame(na_count)
should output the counts nicely in a dataframe, with the column names as row names:
          na_count
column_1     count
Try the colSums function:
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy tidyverse solution to get an NA count for all columns is to use summarise_all(), which I think makes a much easier to read solution than using purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
             col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe, you could also use:
sum(is.na(df))
summary() also counts the NAs per column, so it is handy if you want the NA counts for several variables at once.
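For example (illustrative data):
df <- data.frame(x = c(1, 2, NA), y = c(NA, NA, 3))
summary(df)
# each column's printed summary ends with an "NA's" line, here 1 for x and 2 for y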
A tidyverse way to count the number of NA values in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  print(n = 100)
This form, slightly changed from Kevin Ogoros's answer:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
returns the NA counts as a named integer vector.
sapply(df, function(x) sum(is.na(x)))  # replace df with the name of your data
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but it only tells you the number of NA values in the particular column of the data frame that you are passing in. To get the number of NA values for the whole data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 applies the function to each column
This does the trick.
I read a csv file from a local directory. The following code works for me.
# number of rows in which the column is NA
sum(is.na(df[, columnName]))
# number of rows in which the column is not NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function. And the '.' refers to the input for the anonymous function, in this case the data.frame df.
If you're looking for the NA counts of each column printed one after the other, then you can use this. Simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
You can use this to count the number of NAs or blanks in every column:
colSums(is.na(data_set_name) | data_set_name == '')
In the interests of completeness you can also use the useNA argument in table. For example table(df$col, useNA = "always") will count all of the non-NA cases as well as the NA ones.
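For example, with a small illustrative column:
df <- data.frame(col = c("a", "b", NA, "a"))
table(df$col, useNA = "always")
#    a    b <NA> 
#    2    1    1 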
I have two data frames with 2 columns in each. For example:
df.1 = data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"), col.2 = c("b","c","d","e","c","d","e","d","e","e"))
df.2 = data.frame(col.1 = c("b","b","b","a","a","e"), col.2 = c("a","c","e","c","e","c"))
and I'm looking for an efficient way to look up the row index in df.2 of every col.1 col.2 row pair of df.1. Note that a row pair in df.1 may appear in df.2 in reverse order (for example df.1[1,], which is "a","b" appears in df.2[1,] as "b","a"). That doesn't matter to me. In other words, as long as a row pair in df.1 appears in any order in df.2 I want its row index in df.2, otherwise it should return NA. One more note, row pairs in both data frames are unique - meaning each row pair appears only once.
So for these two data frames the return vector would be:
c(1,4,NA,5,2,NA,3,NA,6,NA)
Maybe something using the dplyr package:
first make the reference frame;
use row_number() to record each row's index efficiently;
use select to "flip" the column vars.
This gives two halves:
df_ref_top <- df.2 %>% mutate(n=row_number())
df_ref_btm <- df.2 %>% select(col.1=col.2, col.2=col.1) %>% mutate(n=row_number())
then bind together:
df_ref <- rbind(df_ref_top,df_ref_btm)
Left join and select the n vector to get your answer:
left_join(df.1,df_ref)$n
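Note that left_join() will report the inferred keys with a message such as Joining, by = c("col.1", "col.2"); you can pass the by argument explicitly if you prefer:
left_join(df.1, df_ref, by = c("col.1", "col.2"))$n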
# Per @thelatemail's comment, here's a more elegant approach:
match(apply(df.1, 1, function(x) paste(sort(x), collapse="")),
      apply(df.2, 1, function(x) paste(sort(x), collapse="")))
# My original answer, for reference:
# Check for matches with both orderings of df.2's columns
match.tmp = cbind(match(paste(df.1[,1], df.1[,2]), paste(df.2[,1], df.2[,2])),
                  match(paste(df.1[,1], df.1[,2]), paste(df.2[,2], df.2[,1])))
# Convert to single vector of match indices
match.index = apply(match.tmp, 1,
                    function(x) ifelse(all(is.na(x)), NA, max(x, na.rm=TRUE)))
[1] 1 4 NA 5 2 NA 3 NA 6 NA
Here's a little function that tests a few of the looping options in R (which was not really intentional, but it happened).
check.rows <- function(data1, data2)
{
  df1 <- as.matrix(data1)
  df2 <- as.matrix(data2)
  ll <- vector('list', nrow(df1))
  for (i in seq(nrow(df1))) {
    ll[[i]] <- sapply(seq(nrow(df2)), function(j) df2[j, ] %in% df1[i, ])
  }
  h <- sapply(ll, function(x) which(apply(x, 2, all)))
  sapply(h, function(x) ifelse(is.double(x), NA, x))
}
check.rows(df.1, df.2)
## [1] 1 4 NA 5 2 NA 3 NA 6 NA
And here's a benchmark when row dimensions are increased for both df.1 and df.2. Not too bad I guess, considering the 24 checks on each of 40 rows.
> dim(df.11); dim(df.22)
[1] 40 2
[1] 24 2
> f <- function() check.rows(df.11, df.22)
> microbenchmark(f())
## Unit: milliseconds
## expr min lq median uq max neval
## f() 75.52258 75.94061 76.96523 78.61594 81.00019 100
1) sort/merge. First sort each row of df.2, creating df.2.s, and append a row number column. Then merge this new data frame with df.1 (whose rows are already sorted in the question):
df.2.s <- replace(df.2, TRUE, t(apply(df.2, 1, sort)))
df.2.s$row <- 1:nrow(df.2.s)
merge(df.1, df.2.s, all.x = TRUE)$row
The result is:
[1] 1 4 NA 5 2 NA 3 NA 6 NA
2) sqldf. Since dot is an SQL operator, rename the data frames to df1 and df2. Note that for the same reason the column names will be transformed to col_1 and col_2 when df1 and df2 are automatically uploaded to the backend database. We sort each row of df2 using min and max and left join it to df1 (which is already sorted):
df1 <- df.1
df2 <- df.2
library(sqldf)
sqldf("select b.rowid row
from df1
left join
(select min(col_1, col_2) col_1, max(col_1, col_2) col_2 from df2) b
using (col_1, col_2)")$row
I want to get the number of unique values in each of the columns of a data frame.
Let's say I have the following data frame:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
then it should return that there are 3 distinct values for v1, and 2 for v2.
I tried unique(DF), but it does not work, as the rows are all different.
Using unique:
rapply(DF, function(x) length(unique(x)))
v1 v2
3 2
sapply(DF, function(x) length(unique(x)))
In dplyr (note that funs() is deprecated in newer versions; see the n_distinct variant further down):
DF %>% summarise_all(funs(n_distinct(.)))
Here's one approach:
> lapply(DF, function(x) length(table(x)))
$v1
[1] 3
$v2
[1] 2
This basically tabulates the unique values per column. Using length on that tells you the number. Removing length will show you the actual table of unique values.
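For example, dropping length shows the tables of unique values themselves:
lapply(DF, table)
# $v1
# 
# 1 2 3 
# 1 2 1 
# 
# $v2
# 
# a b 
# 2 2 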
For the sake of completeness: Since CRAN version 1.9.6 of 19 Sep 2015, the data.table package includes the helper function uniqueN() which saves us from writing
function(x) length(unique(x))
when calling one of the siblings of apply():
sapply(DF, data.table::uniqueN)
v1 v2
3 2
Note that neither the data.table package needs to be loaded nor DF coerced to class data.table in order to use uniqueN(), here.
In dplyr (>= 1.0.0, June 2020):
DF %>% summarize_all(n_distinct)
v1 v2
1 3 2
I think a function like this would give you what you are looking for. It shows the number of unique values as well as how many NAs there are in each of the dataframe's columns. Simply plug in your dataframe, and you are good to go.
totaluniquevals <- function(df) {
  x <- data.frame("Row Name" = character(0), "TotalUnique" = numeric(0), "IsNA" = numeric(0))
  result <- sapply(df, function(col) length(unique(col)))
  isnatotals <- sapply(df, function(col) sum(is.na(col)))
  # now fill in one row per column of the input
  for (i in 1:length(colnames(df))) {
    x[i, 1] <- names(result[i])
    x[i, 2] <- result[[i]]
    x[i, 3] <- isnatotals[[i]]
  }
  return(x)
}
Test:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
totaluniquevals(DF)
Row.Name TotalUnique IsNA
1 v1 3 0
2 v2 2 0
You can then use unique on whatever column, to see what the specific unique values are.
unique(DF$v2)
[1] a b
Levels: a b
This should work for getting the number of unique values for a single variable:
length(unique(datasetname$variablename))
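For example, with the DF from the question:
length(unique(DF$v1))
# [1] 3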
This will give you the unique values of column 1 of your dataframe:
unique(DF[, 1])