I want to get the number of unique values in each of the columns of a data frame.
Let's say I have the following data frame:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
then it should return that there are 3 distinct values for v1, and 2 for v2.
I tried unique(DF), but it does not work since the rows are all distinct.
You can apply length(unique(x)) to each column, for example with rapply:
rapply(DF,function(x)length(unique(x)))
v1 v2
3 2
or equivalently with sapply:
sapply(DF, function(x) length(unique(x)))
In dplyr (funs() has since been deprecated, but this was the classic idiom):
DF %>% summarise_all(funs(n_distinct(.)))
Here's one approach:
> lapply(DF, function(x) length(table(x)))
$v1
[1] 3
$v2
[1] 2
This basically tabulates the unique values per column. Calling length on each table gives you the count; dropping the length shows you the actual table of unique values.
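One caveat worth adding here: table() drops NAs by default, so this approach can disagree with length(unique(x)) when a column contains NA:
x <- c(1, 2, NA)
length(unique(x))  # 3 -- unique() keeps NA as a value
length(table(x))   # 2 -- table() drops NA unless useNA is set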
For the sake of completeness: since CRAN version 1.9.6 (19 Sep 2015), the data.table package has included the helper function uniqueN(), which saves us from writing
function(x) length(unique(x))
when calling one of the siblings of apply():
sapply(DF, data.table::uniqueN)
v1 v2
3 2
Note that the data.table package does not need to be loaded, nor DF coerced to class data.table, for uniqueN() to work here.
In dplyr (>= 1.0.0, June 2020):
DF %>% summarize_all(n_distinct)
v1 v2
1 3 2
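For what it's worth, summarize_all() is superseded in recent dplyr versions; the across() equivalent (same output, assuming the same DF) would be:
DF %>% summarise(across(everything(), n_distinct))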
I think a function like this would give you what you are looking for. Besides the unique counts, it also shows how many NAs there are in each of the data frame's columns. Simply plug in your data frame and you are good to go.
totaluniquevals <- function(df) {
  x <- data.frame("Row Name" = character(0), "TotalUnique" = numeric(0), "IsNA" = numeric(0))
  result <- sapply(df, function(x) length(unique(x)))  # unique count per column
  isnatotals <- sapply(df, function(x) sum(is.na(x)))  # NA count per column
  # now fill in one row per column of df
  for (i in 1:length(colnames(df))) {
    x[i, 1] <- names(result[i])
    x[i, 2] <- result[[i]]
    x[i, 3] <- isnatotals[[i]]
  }
  return(x)
}
Test:
DF <- data.frame(v1 = c(1,2,3,2), v2 = c("a","a","b","b"))
totaluniquevals(DF)
Row.Name TotalUnique IsNA
1 v1 3 0
2 v2 2 0
You can then use unique on whatever column, to see what the specific unique values are.
unique(DF$v2)
[1] a b
Levels: a b
This should work for getting the number of unique values for a single variable:
length(unique(datasetname$variablename))
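For instance, with the example data frame from the question:
length(unique(DF$v1))
# [1] 3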
This will give you the unique values of column 1 of the DF data frame:
unique(DF[, 1])
I have a dataframe as below. I want to get a column of maximums for each row, but that maximum should ignore the value 9 if it is present in that row.
How can I achieve that efficiently?
df <- data.frame(age=c(5,6,9), marks=c(1,2,7), story=c(2,9,1))
df$max <- apply(df, 1, max)
df
Here's one possibility:
df$colMax <- apply(df, 1, function(x) max(x[x != 9]))
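Note a possible edge case (not covered in the answer): if a row consists entirely of 9s, x[x != 9] is empty and max() returns -Inf with a warning:
max(numeric(0))
# [1] -Inf  (with a warning about no non-missing arguments)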
The pmax function would be useful here. The only catch is that it takes a bunch of vectors as parameters. You can convert a data.frame into parameters with do.call. I also set the 9 values to NA as suggested by others, but do so using the somewhat unconventional is.na<- replacement function.
do.call(pmax, c(`is.na<-`(df, df == 9), na.rm = TRUE))
# [1] 5 6 7
Substitute 9 with NA and then use pmax as suggested by #MrFlick in his deleted answer:
df2 <- df #copy df because we are going to change it
df2[df2==9] <- NA
do.call(function(...) pmax(..., na.rm=TRUE), df2)
#[1] 5 6 7
#make a copy of your data.frame
tmp.df <- df
#replace the 9s with NA
tmp.df[tmp.df==9] <- NA
#Use apply to process the data one row at a time through the max function, removing NA values first
apply(tmp.df,1,max,na.rm=TRUE)
I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do some simple math: count how many 1s and how many 2s there are in this dataset, but not over the whole dataset; I want my function to count them in blocks of four rows.
The result I want looks like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations, but I couldn't get it to work.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
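For the example data this should print a contingency table along these lines (output reconstructed by hand, so worth re-running to confirm):
#      b
# group 1 2
#     1 1 1
#     2 2 1
#     3 0 3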
Which you prefer depends on what type of result you need. A table works well for a quick visual count, and you can pull the data back out of it, but if it is as simple as your example, you might opt for the data.frame.
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[,c(2,4), with=FALSE]
Or a slightly more compact version, using length as the fun.aggregate:
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one' and 'two':
library(english)
names(res) <- as.character(english(as.numeric(names(res))))
Is there any way to use a string stored in a variable as a column name when creating a new data frame? For example:
col.name <- 'col1'
df <- data.frame(col.name=1:4)
print(df)
# Real output
col.name
1 1
2 2
3 3
4 4
# Expected output
col1
1 1
2 2
3 3
4 4
I'm aware that I can create data frame and then use names() to rename column or use df[, col.name] for existing object, but I'd like to know if there is any other solution which could be used during creating data frame.
You cannot pass a variable into the name of an argument like that.
Instead what you can do is:
df <- data.frame(placeholder_name = 1:4)
names(df)[names(df) == "placeholder_name"] <- col.name
or use the default name (data.frame() derives it from the expression, here "X1.4"):
df <- data.frame(1:4)
names(df)[names(df) == "X1.4"] <- col.name
or assign by position:
df <- data.frame(1:4)
names(df)[1] <- col.name
or if you only have one column just replace the entire names attribute:
df <- data.frame(1:4)
names(df) <- col.name
There's also the set_names function in the magrittr package that you can use to do this last solution in one step:
library(magrittr)
df <- set_names(data.frame(1:4), col.name)
But set_names is just an alias for:
df <- `names<-`(data.frame(1:4), col.name)
which is part of base R. Figuring out why this expression works and makes sense will be a good exercise.
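A short sketch of why it works: names<- is the replacement function behind the names(df) <- ... syntax, and calling it directly returns the modified copy, so:
df <- `names<-`(data.frame(1:4), col.name)
# is the functional form of:
df <- data.frame(1:4)
names(df) <- col.name
Base R's setNames() wraps the same idea into a one-liner:
df <- setNames(data.frame(1:4), col.name)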
In addition to ssdecontrol's answer, there is a second option.
You're looking for mget. First assign the string to a variable, then assign the value to a variable whose name is that string. mget() then looks the string up and returns a named list, which data.frame() turns into a properly named column.
assign("col.name", "col1")
assign(col.name, 1:4)
df <- data.frame(mget(col.name))
print(df)
col1
1 1
2 2
3 3
4 4
I don't recommend you do this, but:
col.name <- 'col1'
eval(parse(text=paste0('data.frame(', col.name, '=1:4)')))
I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is the following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
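This works because is.na() returns a logical vector and sum() coerces TRUE to 1 and FALSE to 0:
is.na(c(1, NA, 3))
# [1] FALSE  TRUE FALSE
sum(is.na(c(1, NA, 3)))
# [1] 1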
If you are looking for NA counts for each column in a dataframe then:
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column.
na_count <- data.frame(na_count)
should output the data nicely in a dataframe like:
           na_count
column_1      count
Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy tidyverse solution to get an NA count for all columns is to use summarise_all(), which I think makes for a much easier to read solution than using purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe, you could also use
sum(is.na(df))
The summary() output also counts NAs, so you can use this function if you want NA counts for several variables.
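For example, with a small made-up data frame, summary() appends an "NA's" row to every column that contains missing values:
df <- data.frame(x = c(1, 2, NA), y = c(NA, NA, 3))
summary(df)
# the summary block for each column ends with a line like
# NA's   :1   (for x) and NA's   :2   (for y)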
A tidyverse way to count the number of NAs in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
map_df(function(x) sum(is.na(x))) %>%
gather(feature, num_nulls) %>%
print(n = 100)
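Note that gather() is superseded in tidyr; with tidyr >= 1.0.0 the same reshape can be written with pivot_longer() (my adaptation, not part of the original answer):
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_nulls")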
This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
returns the NA counts as a named integer vector.
sapply(df, function(x) sum(is.na(x)))  # df is your data frame
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but it only tells you the number of NA values in the particular column of the data frame that you are passing. To get the number of NA values for the whole data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 applies over columns
This does the trick
I read a csv file from a local directory. The following code works for me.
# number of rows where the column is NA
sum(is.na(df[, columnName]))
# number of rows where the column is not NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function, and the '.' refers to the function's input, in this case each column of the data.frame df.
If you want the NA counts for each column printed one after the other, you can use this simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count the number of NAs or blank strings in every column:
colSums(is.na(data_set_name)|data_set_name == '')
In the interests of completeness, you can also use the useNA argument of table(). For example, table(df$col, useNA = "always") counts all of the non-NA cases as well as the NA ones.
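For example, with a hypothetical column containing one NA:
col <- c("a", "b", NA, "a")
table(col, useNA = "always")
# col
#    a    b <NA>
#    2    1    1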