I need to get the column sums for all the columns and have the result in a data frame, with the column names and their sums as two columns. But if I do this, the column names seem to become row names rather than a column themselves.
demo=data.frame(a=runif(10),b=runif(10,max=2),c=runif(10,max=3))
as.data.frame(colSums(demo))
The undesired result:
colSums(demo)
a 4.083571
b 11.698794
c 14.082574
The desired result:
colname colSums(demo)
a 4.083571
b 11.698794
c 14.082574
How can I add a heading to the column on the left while keeping the shape as it is? Thanks.
One possibility is to transpose the result with t():
data.frame(t(colSums(demo)))
a b c
1 5.782475 10.46739 18.46751
To change the name of the row in the output, we can use rownames, for instance like this:
`rownames<-`(data.frame(t(colSums(demo))), "myColsum")
a b c
myColsum 5.782475 10.46739 18.46751
Your desired output can be achieved with stack() and setNames():
setNames(nm = c('colname', 'colSums(demo)'), stack(colSums(demo))[2:1])
## colname colSums(demo)
## 1 a 4.083571
## 2 b 11.698794
## 3 c 14.082574
demo <- data.frame(a = runif(10), b = runif(10, max = 2), c = runif(10, max = 3))
df <- data.frame(colname = names(demo), colSums_demo = colSums(demo))
print(df, row.names = FALSE)
colname colSums_demo
a 4.754546
b 12.488904
c 18.152095
Try this:
as.data.frame(lapply(demo, sum))
The result (note this is a wide, one-row data frame rather than the two-column shape asked for):
a b c
1 6.400121 10.16047 10.6528
There is now an easy way to do this using enframe() from the tibble package (loaded with the tidyverse).
library(tidyverse)
enframe(colSums(demo))
And if you want to use the column names from your example, you can set them as arguments.
demo %>% colSums() %>% enframe(name = "colname", value = "colSums(demo)")
We can use data.table with melt:
library(data.table)
melt(setDT(demo)[, lapply(.SD, sum)])
demo <- data.frame(a = runif(10), b = runif(10, max = 2), c = runif(10, max = 3))

library(dplyr)
library(tibble)

new_table <- as.data.frame(colSums(demo))
new_table %>%
  rownames_to_column(var = "colname")
A new tidyverse way!
demo %>%
  summarise(across(where(is.numeric), sum))
and then a tidyr::pivot_longer() if you want it long by colname, as sketched below.
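For example, a minimal sketch (the value-column name colsum is just illustrative):
library(dplyr)
library(tidyr)

demo %>%
  summarise(across(where(is.numeric), sum)) %>%
  pivot_longer(everything(), names_to = "colname", values_to = "colsum")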
I have a vector a<-c(0, 0). I want to convert this to a dataframe and then remove duplicated rows (as part of a loop).
This is my code:
a<-c(0, 0)
df<-t(as.data.frame(a))
distinct(df)
This isn't working because df isn't a dataframe even though I have converted it to a dataframe in the second step. I'm not sure how to make this work when the dataframe only has one row.
Swap t and as.data.frame, like this:
library(dplyr)
a<-c(0, 0)
df<-as.data.frame(t(a))
distinct(df)
Output:
V1 V2
1 0 0
The function t() transforms the data.frame into a matrix. The simplest solution is simply to change the order of the functions:
a <- c(0,0)
df <- as.data.frame(t(a))
distinct(df)
You can use the rbind.data.frame function as follows:
a <- c(0,0)
df = rbind.data.frame(a)
distinct(df)
Using dplyr and matrix:
library(dplyr)
data.frame(matrix(0, 1, 2)) %>%
  distinct()
#> X1 X2
#> 1 0 0
I'm using filter on my dataset to select certain values from a column:
df %>%
  filter(col1 %in% c("value1", "value2"))
However, I don't understand how to filter values in a column by a pattern without writing each one out fully. For example, I also want all values which start with "value3" ("value33", "value34", ...) along with "value1" and "value2". Can I add grepl to that vector?
You can use regular expressions to do that:
library(dplyr)
library(stringr)

df %>%
  filter(str_detect(col1, '^value[1-3]'))
If you want to use another tidyverse package to help, you can use str_starts from stringr to find strings that start with a certain value:
dd %>% filter(stringr::str_starts(col1, "value"))
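To also keep the exact values from the question ("value1" and "value2") while adding everything that starts with "value3", you can combine %in% with str_starts; a minimal sketch, assuming the data frame dd has the column col1:
library(dplyr)
library(stringr)

dd %>%
  filter(col1 %in% c("value1", "value2") | str_starts(col1, "value3"))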
Here are a few options in base R.
Using grepl:
subset(df, grepl('^value', b))
# a b
#1 1 value3
#3 3 value123
#4 4 value12
A similar option is grep, which returns the indices of the matches:
df[grep('^value', df$b),]
However, a faster option would be to use startsWith:
subset(df, startsWith(b, "value"))
All of these would select rows where column b starts with "value".
data
df <- data.frame(a = 1:5,
                 b = c('value3', 'abcd', 'value123', 'value12', 'def'),
                 stringsAsFactors = FALSE)
Question:
How can you use R to remove all special characters from a dataframe, quickly and efficiently?
Progress:
This SO post details how to remove special characters. I can apply the gsub function to single columns, but not to the entire dataframe.
Problem:
My dataframe consists of 100+ columns of integers, strings, etc. When I try to run gsub on the whole dataframe, it doesn't return the output I desire; instead of a cleaned dataframe, the result is a character vector.
df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)
(Screenshots omitted: df showed the original dataframe, dfa showed gsub working on a single column, and df_all showed the unwanted result of applying gsub to the whole dataframe.)
Summary:
Is there a way to gsub an entire dataframe? Or should an apply function be used instead?
Here is a possible solution using dplyr:
# Example data
bla <- data.frame(a = c(1, 2, 3),
                  b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                  c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate_all from dplyr
library(dplyr)
bla %>%
  mutate_all(funs(gsub("[[:punct:]]", "", .)))
a b c
1 1 fefa
2 2 fes
3 3 gDEEwfseges gdgd
Update:
mutate_all has been superseded, and funs is deprecated as of dplyr 0.8.0. Here is an updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate() and across() from dplyr
df %>%
  mutate(across(everything(), ~ gsub("[[:punct:]]", "", .x)))
Another solution is to convert the data frame to a matrix first, run the gsub, and then convert back to a data frame, as follows:
as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
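A base R sketch along the same lines (an assumption, not part of the answers above): assigning into df[] with lapply keeps the data frame shape without the matrix round-trip. Note that gsub still coerces every column, including numeric ones, to character:
# Assigning into df[] keeps the data.frame class and column names
df[] <- lapply(df, function(x) gsub("[[:punct:]]", "", x))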
I like Ryan's answer using dplyr. As mutate_all and funs are now deprecated, here is my suggested updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use across() from dplyr
df %>%
  mutate(across(everything(), ~ gsub("[[:punct:]]", "", .x)))
a b c
1 1 fefa
2 2 fes
3 3 gDEEwfseges gdgd
I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
If you are looking for NA counts for each column in a dataframe, then:
na_count <- sapply(df, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column.
na_count <- data.frame(na_count)
This should output the counts nicely in a dataframe like:
| row.names | na_count |
|-----------|----------|
| column_1  | count    |
Try the colSums function:
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy tidyverse solution for getting an NA count for all columns is summarise_all(), which I think makes for a much easier-to-read solution than using purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
             col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe, you could also use:
sum(is.na(df))
summary() also counts NAs per column, so it is another option when you want the NA totals for several variables at once.
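A minimal sketch of that, reusing the df from the colSums example above:
df <- data.frame(x = c(1, 2, NA), y = rep(NA, 3))
summary(df)  # each column's summary includes an "NA's" line when NAs are present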
A tidyverse way to count the number of NAs in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  print(n = 100)
This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
It returns the NA counts as a named integer vector.
sapply(df, function(x) sum(is.na(x)))  # replace df with the name of your data
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but it only tells you the number of NA values in the particular column you pass in. To get the number of NA values for every column of the data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 applies over columns
This does the trick.
I read a csv file from a local directory. The following code works for me:
# number of rows where the column is NA
sum(is.na(df[, columnName]))
# number of rows where the column is not NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer, but using the purrr package. I think this tidyverse approach is simpler than the one proposed by AbiK:
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function. And the '.' refers to the input for the anonymous function, in this case the data.frame df.
If you want the NA count for each column printed one after the other, here is a simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count the number of NAs or blanks in every column:
colSums(is.na(data_set_name)|data_set_name == '')
In the interests of completeness, you can also use the useNA argument in table. For example, table(df$col, useNA = "always") will count all of the non-NA cases as well as the NA ones.
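A minimal sketch (the column values here are made up):
df <- data.frame(col = c("a", "b", "a", NA))
table(df$col, useNA = "always")
#    a    b <NA>
#    2    1    1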
How do I remove all rows in a dataframe where a certain column meets a string match criterion?
For example:
A,B,C
4,3,Foo
2,3,Bar
7,5,Zap
How would I return a dataframe that excludes all rows where C = Foo:
A,B,C
2,3,Bar
7,5,Zap
Just use the == with the negation symbol (!). If dtfm is the name of your data.frame:
dtfm[!dtfm$C == "Foo", ]
Or, to move the negation in the comparison:
dtfm[dtfm$C != "Foo", ]
Or, even shorter using subset():
subset(dtfm, C!="Foo")
You can use the dplyr package to easily remove those particular rows.
library(dplyr)
df <- filter(df, C != "Foo")
I had a column (A) in a data frame with three values in it (yes, no, unknown). I wanted to keep only the rows with the value "yes"; this is the code for that, and I hope it helps you as well:
df <- df[!(df$A == "no") & !(df$A == "unknown"), ]
If you wish to use dplyr to remove the "Foo" rows:
df %>%
  filter(!C == "Foo")
I know this has been answered, but here is another option:
library(dplyr)
df %>% filter(!C == "Foo")
OR
df[!df$C == "Foo", ]
If your exclusion conditions are stored in another data frame you could use rows_delete:
library(dplyr)
removal_df <- data.frame(C = "Foo")
df %>%
  rows_delete(removal_df, by = "C")
A B C
1 2 3 Bar
2 7 5 Zap
This is also handy if you have multiple exclusion conditions, since you do not have to write out a long filter statement; see the sketch below.
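For example, a sketch with two exclusion values:
removal_df <- data.frame(C = c("Foo", "Zap"))
df %>%
  rows_delete(removal_df, by = "C")
# leaves only the Bar row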
Note: rows_delete is only available if you have dplyr >= 1.0.0