I have a vector a<-c(0, 0). I want to convert this to a dataframe and then remove duplicated rows (as part of a loop).
This is my code:
a<-c(0, 0)
df<-t(as.data.frame(a))
distinct(df)
This isn't working because df isn't a dataframe even though I have converted it to a dataframe in the second step. I'm not sure how to make this work when the dataframe only has one row.
Swap t and as.data.frame like this:
library(dplyr)
a<-c(0, 0)
df<-as.data.frame(t(a))
distinct(df)
Output:
V1 V2
1 0 0
The function t() converts the data.frame back into a matrix. The simplest solution is to change the order of the functions:
a <- c(0,0)
df <- as.data.frame(t(a))
distinct(df)
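To see why the order matters, here is a small check in base R (the names wrong and right are only for illustration). t() always returns a matrix, so it has to be applied before as.data.frame(), not after.

```r
a <- c(0, 0)

# as.data.frame() first, then t(): t() turns the data frame back into a matrix
wrong <- t(as.data.frame(a))
is.data.frame(wrong)   # FALSE
is.matrix(wrong)       # TRUE

# t() first, then as.data.frame(): the final object is a data frame
right <- as.data.frame(t(a))
is.data.frame(right)   # TRUE
dim(right)             # 1 row, 2 columns
```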
You can use rbind.data.frame function as follows:
a <- c(0,0)
df = rbind.data.frame(a)
distinct(df)
Using dplyr and matrix:
library(dplyr)
data.frame(matrix(0, 1, 2)) %>%
distinct
#> X1 X2
#> 1 0 0
I've looked at "R create a vector from conditional operation on matrix", but a similar solution does not yield what I want (and I'm not sure why).
My goal is to evaluate df with the following condition: if df > 2, df -2, else 0
Take df:
a <- seq(1,5)
b <- seq(0,4)
df <- cbind(a,b) %>% as.data.frame()
df is simply:
a b
1 0
2 1
3 2
4 3
5 4
df_final should look like this after a suitable function:
a b
0 0
0 0
1 0
2 1
3 2
I applied the following function with the result, and I'm not sure why it doesn't work (further explanation of a solution would be appreciated)
apply(df,2,function(df){
ifelse(any(df>2),df-2,0)
})
Yielding the following:
a b
-1 -2
Thank you SO community!
Let's fix your function and understand why it didn't work:
apply(df, # apply to df
2, # to each *column* of df
function(df){ # this function. Call the function argument (each column) df
# (confusing because this is the same name as the data frame...)
ifelse( # Looking at each column...
any(df > 2), # if there are any values > 2
df - 2, # then df - 2
0 # otherwise 0
)
})
any() returns a single value. ifelse() returns something the same shape as the test, so by making your test any(df > 2) (a single value), ifelse() will also return a single value.
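A small illustration of that shape rule on a plain vector (x is a stand-in for one column of df, and the names single and elementwise are only for illustration):

```r
x <- 1:5

# Length-one test: ifelse() returns only the first element of x - 2
single <- ifelse(any(x > 2), x - 2, 0)
single        # -1

# Element-wise test: ifelse() returns one value per element of x
elementwise <- ifelse(x > 2, x - 2, 0)
elementwise   # 0 0 1 2 3
```

The -1 here is the same value that showed up for column a in the question's output.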
Let's fix this by (a) changing the function to be of a different name than the input (for readability) and (b) getting rid of the any:
apply(df, # apply to df
2, # to each *column* of df
function(x){ # this function. Call the function argument (each column) x
ifelse( # Looking at each column...
x > 2, # when x is > 2
x - 2, # make it x - 2
0 # otherwise 0
)
})
apply is made for working on matrices. When you give it a data frame, the first thing it does is convert it to a matrix. If you want the result to be a data frame, you need to convert it back to a data frame.
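For example, wrapping the call in as.data.frame() restores the data frame class. A minimal sketch using the example data from the question (df_final is just an illustrative name):

```r
df <- data.frame(a = 1:5, b = 0:4)

# apply() returns a matrix, so convert the result back explicitly
df_final <- as.data.frame(
  apply(df, 2, function(x) ifelse(x > 2, x - 2, 0))
)
is.data.frame(df_final)   # TRUE
df_final
```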
Or we can use lapply instead. lapply returns a list, and by assigning it to the columns of df with df[] <- lapply(), we won't need to convert. (And since lapply doesn't do the matrix conversion, it knows by default to apply the function to each column.)
df[] <- lapply(df, function(x) ifelse(x > 2, x - 2, 0))
As a side note, df <- cbind(a,b) %>% as.data.frame() is a more complicated way of writing df <- data.frame(a, b)
Create the 'out' dataset by subtracting 2, then set every value that is below 0 to 0
out <- df - 2
out[out < 0] <- 0
Or in a single step
(df-2) * ((df - 2) > 0)
Using apply
library(dplyr)  # for %>% and as_tibble()
a <- seq(1, 5)
b <- seq(0, 4)
df <- cbind(a, b) %>% as.data.frame()
# apply() returns a matrix
new_matrix <- apply(df, MARGIN = 2, function(i) ifelse(i > 2, i - 2, 0))
new_matrix
# if you want it to return a tibble/df
new_tibble <- apply(df, MARGIN = 2, function(i) ifelse(i > 2, i - 2, 0)) %>% as_tibble()
I have a data.frame setup like this:
df <- data.frame(units = c(1.5, -1, 1.4),
what = c('Num1', 'Num2', 'Num3'))
Which gives me something like this:
units what
1 1.500000 Num1
2 -1.000000 Num2
3 1.400000 Num3
I want to be able to remove the entire row if the number in the first column is -1. So ideally I would loop through the whole dataframe and remove the rows that have -1 in the units column. I've been trying things like this:
if(CONDITION TO REMOVE) {
print("deleting function...")
df <- df[-c(df[,'Num2']),]
}
But it deletes everything in the rest of the df. I only want to delete that one row (and the entire row).
Thanks in advance.
newdf <- df[-which(df[,1] ==-1),]
newdf is df without the rows containing -1 in the first column.
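One caveat with the -which() form, shown here as a hedged sketch on the question's data: if no row matches, which() returns an empty vector and the negative index then drops every row. Logical indexing does not have this problem. The value -99 is an arbitrary value chosen because it does not occur in the data, and the names no_match and safe are only for illustration.

```r
df <- data.frame(units = c(1.5, -1, 1.4),
                 what = c("Num1", "Num2", "Num3"))

# Works as intended when at least one row matches
newdf <- df[-which(df[, 1] == -1), ]
nrow(newdf)      # 2

# No row has units == -99, so which() is empty and all rows are dropped
no_match <- df[-which(df[, 1] == -99), ]
nrow(no_match)   # 0

# Logical indexing keeps all rows when nothing matches
safe <- df[df[, 1] != -99, ]
nrow(safe)       # 3
```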
You can use dplyr to better suit your needs:
df.new <- df %>% filter(units != -1)
Or you can do this using base R
df.new <- df[df$units != -1, ]
I have a dataframe with something like 90 variables and over 1 million observations. I want to calculate the percentage of NA values in each variable. I have the following code:
sum(is.na(dataframe$variable) / nrow(dataframe) * 100)
My question is, how can I apply this function to all 90 variables, without having to type all variable names in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
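If a named numeric vector is easier to work with than a list, the same calculation can be written with sapply() or colMeans(). A small sketch on made-up data (the data frame and the names pct_na and pct_na2 are only examples):

```r
df <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4), z = 1:4)

# Same function as above, but sapply() simplifies the list to a named vector
pct_na <- sapply(df, function(x) sum(is.na(x)) / nrow(df) * 100)
pct_na

# is.na() on a data frame gives a logical matrix,
# so colMeans() gives the share of NAs per column
pct_na2 <- colMeans(is.na(df)) * 100
pct_na2
```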
If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:
library(dplyr)
df %>%
summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
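summarise_each() and funs() have since been deprecated in dplyr. A sketch of the same calculation with across(), assuming dplyr 1.0.0 or later is installed, using the test data above (pct_na is an illustrative name, and the result is scaled to a percentage):

```r
library(dplyr)

df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# mean() of a logical vector is the share of TRUE values, i.e. the share of NAs
pct_na <- df %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100))
pct_na
```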
I have a data frame with 300 columns that has a string variable somewhere, which I am trying to remove. I found this solution on Stack Overflow using lapply (see below), which does what I want, but I would like to do it with the dplyr package. I have tried using the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs."
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
if (is.numeric(.)) return(.)
as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
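The intermediate as.character() step matters for factor columns such as v2 in the test data: calling as.numeric() directly on a factor returns the internal level codes rather than the labels. A short check (codes and values are illustrative names):

```r
df <- data.frame(v1 = 1:10, v2 = factor(11:20))

# Direct conversion returns the level codes 1..10, not the values
codes <- as.numeric(df$v2)
codes

# Converting to character first recovers the values 11..20
values <- as.numeric(as.character(df$v2))
values
```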
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting, so much as trawling for non-integers.)
StrScrub <- function(x) {
as.integer(gsub("^\\D+$",NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
With reference/help from Tony Ladson.
I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is the following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
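This works because is.na() returns a logical vector and sum() counts each TRUE as 1. A minimal illustration with a made-up column (n_na is an illustrative name):

```r
df <- data.frame(col = c(1, NA, 3, NA, 5))

is.na(df$col)        # FALSE TRUE FALSE TRUE FALSE
n_na <- sum(is.na(df$col))
n_na                 # 2
```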
If you are looking for NA counts for each column in a dataframe then:
na_count <-sapply(x, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column (here x is the data frame).
na_count <- data.frame(na_count)
Should output the data nicely in a dataframe like:
----------------------
| row.names | na_count
------------------------
| column_1 | count
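A runnable sketch of those two steps on a small made-up data frame (x is the data frame being checked, and its column names are only examples):

```r
x <- data.frame(column_1 = c(1, NA, 3), column_2 = c(NA, NA, "a"))

# Named vector with one NA count per column
na_count <- sapply(x, function(y) sum(length(which(is.na(y)))))

# One row per column, with the column names as row names
na_count <- data.frame(na_count)
na_count
```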
Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy Tidyverse solution to get a NA count for all columns is to use summarise_all() which I think makes a much easier to read solution than using purrr or sapply
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe you could also use
sum(is.na(df))
In the summary() output, the function also counts the NAs so one can use this function if one wants the sum of NAs in several variables.
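For example, on a numeric column summary() adds an NA's entry when missing values are present. A small sketch with made-up data follows. Note that for character columns summary() reports only length, class and mode, so this approach is limited to column types whose summary includes an NA count.

```r
df <- data.frame(a = c(1, NA, 3, NA), b = c(10, 20, NA, 40))

# The printed summary shows an "NA's" line for each column with missing values
summary(df)

# For a single numeric vector the count can be extracted by name
n_na_a <- summary(df$a)[["NA's"]]
n_na_a   # 2
```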
A tidyverse way to count the number of nulls in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
map_df(function(x) sum(is.na(x))) %>%
gather(feature, num_nulls) %>%
print(n = 100)
This form, slightly changed from Kevin Ogoros's one:
na_count <-function (x) sapply(x, function(y) sum(is.na(y)))
returns NA counts as named int array
sapply(df, function(x) sum(is.na(x)))  # df is the name of your data
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but it only tells you the number of NA values in the particular column of the data frame that you pass. To get the number of NA values for every column of the data frame, try this:
apply(df, 2, function(x) {sum(is.na(x))})  # df: name of your data frame; 2: apply over columns
This does the trick
I read a CSV file from a local directory. The following code works for me.
# columnName holds the name of the column you want to check
# number of rows where the column is NA
sum(is.na(df[, columnName]))
# number of rows where the column is not NA
sum(!is.na(df[, columnName]))
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function, and the '.' refers to the input for the anonymous function, in this case each column of the data.frame df in turn.
If you're looking for null values in each column to be printed one after the other then you can use this. Simple solution.
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count number of NA or blanks in every column
colSums(is.na(data_set_name)|data_set_name == '')
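A small sketch of that expression on made-up data (the columns a and b and the name blank_or_na are assumptions for illustration). A cell counts if it is either NA or an empty string:

```r
data_set_name <- data.frame(a = c("x", "", NA),
                            b = c(1, NA, 3),
                            stringsAsFactors = FALSE)

# TRUE where a cell is NA or an empty string, summed per column
blank_or_na <- colSums(is.na(data_set_name) | data_set_name == "")
blank_or_na
```

Column a has one blank and one NA, and column b has one NA.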
In the interests of completeness, you can also use the useNA argument in table. For example, table(df$col, useNA="always") will count each of the non-NA values as well as the NA ones.
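A minimal sketch with a made-up column (counts is an illustrative name). With useNA = "always" the NA entry is shown even when its count is zero, whereas useNA = "ifany" shows it only when NAs are present.

```r
df <- data.frame(col = c("a", "b", NA, "a", NA))

# One count per distinct value, plus a final <NA> entry
counts <- table(df$col, useNA = "always")
counts
```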