Let's say I have the following data frame in R:
df1 <- data.frame(Item_Name = c("test1","test2","test3"), D_1=c(1,0,1),
D_2=c(1,1,1), D_3=c(11,3,1))
I would like to create a function that would delete columns with no variance
(e.g., in this case it would remove column D_2, because it contains only one distinct value).
I know that I could check this by hand, but in reality my data is very large and I would like to automate it. Any ideas?
Filter is a useful function here. It keeps only the columns with more than one unique value,
i.e.
Filter(function(x) length(unique(x)) > 1, df1)
## Item_Name D_1 D_3
## 1 test1 1 11
## 2 test2 0 3
## 3 test3 1 1
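Since the question asks for a function, the same idiom wraps up neatly (a minimal sketch; the function name is my own):
# drop every column that has fewer than two unique values
drop_constant_cols <- function(df) Filter(function(x) length(unique(x)) > 1, df)
drop_constant_cols(df1)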
You can do:
df1[c(TRUE, sapply(df1[-1], var, na.rm = TRUE) != 0)]
# Item_Name D_1 D_3
# 1 test1 1 11
# 2 test2 0 3
# 3 test3 1 1
where the sapply piece tells you which variables have some variance:
sapply(df1[-1], var, na.rm = TRUE) != 0
# D_1 D_2 D_3
# TRUE FALSE TRUE
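Note that var() only makes sense for numeric columns, which is why the first column is excluded by hand above. If the non-numeric columns are not in known positions, here is a sketch that checks column types instead (my own generalization; all-NA numeric columns return NA from var() and are kept here):
keep <- vapply(df1, function(x) {
  if (!is.numeric(x)) return(TRUE) # keep non-numeric columns untouched
  v <- var(x, na.rm = TRUE)
  is.na(v) || v != 0               # treat all-NA columns as "has variance"
}, logical(1))
df1[keep]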
In dplyr, we can use n_distinct to count unique values and select() with where() to keep only the columns with more than one:
library(dplyr)
df1 %>% select(where(~n_distinct(.) > 1))
#For dplyr < 1.0.0
#df1 %>% select_if(~n_distinct(.) > 1)
# Item_Name D_1 D_3
#1 test1 1 11
#2 test2 0 3
#3 test3 1 1
We can use the same logic with purrr's keep and discard
purrr::keep(df1, ~n_distinct(.) > 1)
purrr::discard(df1, ~n_distinct(.) == 1)
Apart from that, a data.table way of doing it could be:
library(data.table)
setDT(df1)
df1[, lapply(df1, uniqueN) > 1, with = FALSE]
Or this, which is probably smarter:
df1[, .SD, .SDcols=lapply(df1, uniqueN) > 1]
In all the approaches above, you could replace n_distinct/uniqueN with var or sd after subsetting to only the numeric columns.
For example,
df1[-1] %>% select_if(~sd(.) != 0)
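sd() would error on the character column, hence the df1[-1]. If you'd rather not hard-code the column position, a hedged variant with where() (dplyr >= 1.0.0) that keeps non-numeric columns automatically:
df1 %>% select(where(~ !is.numeric(.) || sd(., na.rm = TRUE) != 0))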
I have a list of some 500,000 trees in different sites, and I am trying to find out which sites are more endangered by specific pests (n ~ 500). Each pest has a different host range. I have these host ranges in data frames in a list.
I am trying to use these data frames in the list as lookup tables, and will calculate the fraction of suitable trees.
Example code:
#pests with their host ranges
pest1 <- as.data.frame(c("Abies", "Quercus"))
pest2 <- as.data.frame(c("Abies"))
pest3 <- as.data.frame(c("Abies", "Picea"))
pestlist <- as.list(c(pest1, pest2, pest3))
#changing this to any other kind of list would be fine too
df1 <- NULL # this will be the data frame that gets the new columns
df1$genus <- c("Abies", "Picea", "Abies", "Quercus", "Abies")
df1$site <- c("A" , "A" , "B" , "B", "B")
df1 <- as.data.frame(df1)
I tried the following code, but it goes wrong because I don't know how to loop over the list:
library(tidyverse)
df2 <- map2(pestlist, names(df1), mutate(ifelse(df1$genus %in% pestlist , 1,0)))
For clarity, I want to go from df1 above to the same data frame with one 0/1 column per pest, indicating whether each tree's genus is in that pest's host range.
Thanks for your time!
I would recommend using a named list of vectors instead of a list of data.frames. Then we can use map_dfc() inside mutate():
# use vectors instead of data.frames
pest1 <- c("Abies", "Quercus")
pest2 <- c("Abies")
pest3 <- c("Abies", "Picea")
df1 <- NULL # this will be the data frame that gets the new columns
df1$genus <- c("Abies", "Picea", "Abies", "Quercus", "Abies")
df1$site <- c("A" , "A" , "B" , "B", "B")
df1 <- as.data.frame(df1)
library(tidyverse)
# create named list of vectors
pestlist <- tibble::lst(pest1, pest2, pest3)
# use map_dfc
df1 %>%
mutate(map_dfc(pestlist, ~ as.integer(genus %in% .x)))
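With the example data above, this should print something like:
#     genus site pest1 pest2 pest3
# 1   Abies    A     1     1     1
# 2   Picea    A     0     0     1
# 3   Abies    B     1     1     1
# 4 Quercus    B     1     0     0
# 5   Abies    B     1     1     1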
If you fix up your pestlist like this:
pestlist <- do.call(
  rbind, lapply(seq_along(pestlist), \(i) data.frame(pest = i, genus = pestlist[[i]]))
)
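which gives one long lookup table of pest/genus pairs, roughly:
#   pest   genus
# 1    1   Abies
# 2    1 Quercus
# 3    2   Abies
# 4    3   Abies
# 5    3   Picea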
then you can join df1 and a version of pestlist that has been pivoted to wide format:
inner_join(
  df1, pivot_wider(
    mutate(pestlist, v = 1),
    names_from = pest, names_prefix = "pest", values_from = v, values_fill = 0
  )
)
Output:
genus site pest1 pest2 pest3
1 Abies A 1 1 1
2 Picea A 0 0 1
3 Abies B 1 1 1
4 Quercus B 1 0 0
5 Abies B 1 1 1
library(data.table)
setDT(df1, key = 'genus')
# stack the pest host ranges into one long table, tagging each with its list index
pestdt = rbindlist(list(pest1, pest2, pest3), use.names = FALSE, idcol = TRUE)
setnames(pestdt, c('id', 'genus'))
# reshape to wide: one row per genus, one 0/1 indicator column per pest
pestdt = dcast(pestdt, genus ~ paste0('pest', id), value.var = 'genus',
               fun.aggregate = \(x) as.integer(nzchar(x)), fill = 0)
cols = paste0("pest", 1:3)
# keyed join back onto df1, adding the indicator columns by reference
df1[, (cols) := pestdt[.SD, mget(cols)]]
#
# genus site pest1 pest2 pest3
# <char> <char> <int> <int> <int>
# 1: Abies A 1 1 1
# 2: Abies B 1 1 1
# 3: Abies B 1 1 1
# 4: Picea A 0 0 1
# 5: Quercus B 1 0 0
I have a data frame where column "A" has 6 distinct values. Column "B" has float values. By using dplyr, I can group by column "A" and find mean of column "B" of each group as follows:
mydf %>% group_by(A) %>% summarize(Mean = mean(B, na.rm=TRUE))
My ultimate aim is to find the rows in each group whose B values are higher than the group average. How can I achieve this (using base R or dplyr)?
A simple alternative with base R ave would be
df[df$b > ave(df$b, df$a), ]
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
The default function for ave is mean, so there is no need to mention it explicitly. If there are NA values present in b, modify it to:
df[df$b > ave(df$b, df$a, FUN = function(x) mean(x, na.rm = TRUE)), ]
Another solution with subset and ave, as suggested by @Onyambu:
subset(df, b > ave(b, a))
# a b
#4 1 4
#5 1 5
#9 2 9
#10 2 10
data
df <- data.frame(a = rep(c(1, 2), each = 5), b = 1:10)
df
# a b
#1 1 1
#2 1 2
#3 1 3
#4 1 4
#5 1 5
#6 2 6
#7 2 7
#8 2 8
#9 2 9
#10 2 10
You can just group and then filter:
mydf %>%
group_by(A) %>%
filter(B > mean(B, na.rm = TRUE)) %>%
ungroup()
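As an aside, since dplyr 1.1.0 the same can be written with per-operation grouping, assuming that version is available:
mydf %>% filter(B > mean(B, na.rm = TRUE), .by = A)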
Using base R, I would go for this. It is not as elegant as dplyr:
mean.df <- aggregate(mydf$B, by = list(A = mydf$A), FUN = mean)
names(mean.df)[2] <- "mean"
mydf <- merge(mydf, mean.df, by = "A")
# Rows whose values are higher than the group mean
new.df <- subset(mydf, B > mean, select = -mean)
I like working with data.tables, so a data.table solution would be:
mydt <- data.table(mydf)
mydt[, mean := mean(B), by = A]
new.dt <- mydt[B > mean, !"mean"]
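If you'd rather not materialize the helper column at all, a common data.table idiom is a sketch like:
new.dt <- mydt[, .SD[B > mean(B, na.rm = TRUE)], by = A]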
Another way to do it using base R and tapply:
mydf = cbind.data.frame(A = sample(6, 20, rep = TRUE), B = runif(20))
mydf.ave = tapply(mydf$B, mydf$A, mean)
newdf = mydf[mydf$B > mydf.ave[as.character(mydf$A)], ]
Thus, the one-liner would be:
mydf[mydf$B > tapply(mydf$B, mydf$A, mean)[as.character(mydf$A)], ]
I would like to ask if there is a way to remove a group from a data frame using dplyr (or any other way, for that matter). Let's say I have a data frame in the following form, grouped by Variable 1:
Variable 1 Variable 2
1 a
1 b
2 a
2 a
2 b
3 a
3 c
3 a
... ...
I would like to remove only the groups that have two consecutive identical values in Variable 2. That is, in the table above it would remove group 2, because its values are a, a, b, but not group 3, where they are a, c, a. So I would get the table below:
Variable 1 Variable 2
1 a
1 b
3 a
3 c
3 a
... ...
To test for consecutive identical values, you can compare a value to the previous value in that column. In dplyr, this is possible with lag. (You could do the same thing with comparing to the next value, using lead. Result comes out the same.)
Group the data by variable1, get the lag of variable2, then add up how many of these duplicates there are in that group. Then filter for just the groups with no duplicates. After that, feel free to remove the dupesInGroup column.
library(tidyverse)
df %>%
group_by(variable1) %>%
mutate(dupesInGroup = sum(variable2 == lag(variable2), na.rm = T)) %>%
filter(dupesInGroup == 0)
#> # A tibble: 5 x 3
#> # Groups: variable1 [2]
#> variable1 variable2 dupesInGroup
#> <int> <chr> <int>
#> 1 1 a 0
#> 2 1 b 0
#> 3 3 a 0
#> 4 3 c 0
#> 5 3 a 0
Created on 2018-05-10 by the reprex package (v0.2.0).
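If you'd rather skip the helper column entirely, the same test can go straight into filter() (a minimal variant of the code above):
df %>%
  group_by(variable1) %>%
  filter(!any(variable2 == lag(variable2), na.rm = TRUE)) %>%
  ungroup()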
Prepare the data frame:
df <- data.frame("Variable 1" = c(1, 1, 2, 2, 2, 3, 3, 3), "Variable 2" = unlist(strsplit("abaabaca", "")))
Write functions to test whether consecutive repetitions are present:
any.consecutive.p <- function(v) {
  if (length(v) < 2) return(FALSE) # guard: a single value cannot repeat
  for (i in 1:(length(v) - 1)) {
    if (v[i] == v[i + 1]) {
      return(TRUE)
    }
  }
  return(FALSE)
}
any.consecutive.in.col.p <- function(df, col) {
any.consecutive.p(df[, col])
}
any.consecutive.p() returns TRUE as soon as it finds a consecutive repetition in the vector v.
any.consecutive.in.col.p() looks for consecutive repetitions in a column of a data frame.
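As an aside, the loop in any.consecutive.p can be replaced by a vectorized one-liner (a sketch; equivalent for vectors without NAs):
any.consecutive.p <- function(v) any(v[-length(v)] == v[-1])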
Split the data frame by the values of Variable.1:
df.l <- split(df, df$Variable.1)
df.l
$`1`
Variable.1 Variable.2
1 1 a
2 1 b
$`2`
Variable.1 Variable.2
3 2 a
4 2 a
5 2 b
$`3`
Variable.1 Variable.2
6 3 a
7 3 c
8 3 a
Finally, go over this list of data frames and test whether each one contains consecutive duplicates in the Variable.2 column.
If it does, don't collect it.
Bind the collected data frames by rows:
Reduce(rbind, lapply(df.l, function(df) if(!any.consecutive.in.col.p(df, "Variable.2")) {df}))
Variable.1 Variable.2
1 1 a
2 1 b
6 3 a
7 3 c
8 3 a
Say you want to remove all groups of df, grouped by a, where the column b has repeated values. You can do that as below.
set.seed(0)
df <- data.frame(a = rep(1:3, rep(3, 3)), b = sample(1:5, 9, T))
# dplyr
library(dplyr)
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
#data.table
library(data.table)
setDT(df)
df[, if(all(b != shift(b), na.rm = T)) .SD, by = a]
A benchmark shows that data.table is faster:
#Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# use_dplyr() 141.46819 165.03761 201.0975 179.48334 205.82301 539.5643 100
# use_DT() 36.27936 50.23011 64.9218 53.87114 66.73943 345.2863 100
# Method
library(microbenchmark)
set.seed(0)
df <- data.table(a = rep(1:2000, rep(1e3, 2000)), b = sample(1:1e3, 2e6, T))
use_dplyr <- function(x){
df %>%
group_by(a) %>%
filter(all(b != lag(b), na.rm = T))
}
use_DT <- function(x){
df[, if (all(b != shift(b), na.rm = T)) .SD, a]
}
microbenchmark(use_dplyr(), use_DT())
Here is an example of a data set df:
Name L1 L2 L3 L4
Carl 1 NA 0 2
Carl 0 1 4 1
Joe 3 0 3 1
Joe 2 2 1 0
I would like to create a function that would be able to tally up the number of values in columns L2, L3, and L4 that are greater than 0 for a given name. For example:
someFunction(Joe)
# 4
However, I have some NAs in my columns.
I have tried using complete.cases to remove the NAs, but I do not want to remove the entire row. I would like to use aggregate, but I am not exactly sure how. Thanks for your help.
We can use
colSums(df[c("L2", "L3", "L4")] > 0, na.rm = TRUE)
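With the example data this should print, roughly (the NA in L2 is simply skipped):
# L2 L3 L4
#  2  3  3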
Or you may want a sum per person:
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
# L2 L3 L4
#Carl 1 1 2
#Joe 1 2 1
There is something fun here. df[c("L2", "L3", "L4")] > 0 is a logical matrix (with NAs).
Although colSums can work with it without trouble, rowsum cannot. So a fix is to add 0 to this matrix, casting it to a 0/1 numeric matrix;
when adding this 0, we must do (df[c("L2", "L3", "L4")] > 0) + 0, not df[c("L2", "L3", "L4")] > 0 + 0. Operator precedence in R means + binds tighter than >. Try this toy example:
5 > 4 + 0 ## FALSE
(5 > 4) + 0 ## 1
So we need the parentheses to force > to be evaluated first, then +.
If you want the result to be a data frame, just cast the resulting matrix into a data frame by:
data.frame(m)
Follow-up
People stopped responding because your specific question about writing a function is less interesting than producing the summary dataset.
Well, if you still want to take my approach, I would define such a function as:
extract <- function (person) {
m <- rowsum((df[c("L2", "L3", "L4")] > 0) + 0, df[["Name"]], na.rm = TRUE)
rowSums(m)[[person]]
}
Then you can call
extract("Joe")
# 4
extract("Carl")
# 4
Note, this is obviously not the most efficient way to write such a function, because if you only want the sum for one person, there is no need to process all the data. We can do:
extract2 <- function (person) {
## subset data
sub <- subset(df, df$Name == person, select = c("L2", "L3", "L4"))
## get sum
sum(sub > 0, na.rm = TRUE)
}
Then you can call
extract2("Joe")
# 4
extract2("Carl")
# 4
With aggregate, you'll need to set both the na.rm parameter of sum and the na.action parameter of aggregate itself. After that, it's easy to add the three columns:
df_sums <- aggregate(. ~ Name, df, FUN = function(x) {
sum(x > 0, na.rm = TRUE)
}, na.action = na.pass)
df_sums$sum_L2_L3_L4 <- with(df_sums, L2 + L3 + L4)
df_sums
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or in dplyr,
library(dplyr)
df %>% group_by(Name) %>%
summarise_all(~ sum(. > 0, na.rm = TRUE)) %>%
mutate(sum_L2_L3_L4 = L2 + L3 + L4)
## # A tibble: 2 × 6
## Name L1 L2 L3 L4 sum_L2_L3_L4
## <fctr> <int> <int> <int> <int> <int>
## 1 Carl 1 1 1 2 4
## 2 Joe 2 1 2 1 4
or directly,
df %>% group_by(Name) %>% summarise(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE))
## # A tibble: 2 × 2
## Name sum
## <fctr> <int>
## 1 Carl 4
## 2 Joe 4
or data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x){sum(x > 0, na.rm = TRUE)}), by = Name
][, sum_L2_L3_L4 := L2 + L3 + L4, by = Name][]
## Name L1 L2 L3 L4 sum_L2_L3_L4
## 1: Carl 1 1 1 2 4
## 2: Joe 2 1 2 1 4
or directly,
setDT(df)[, .(sum = sum(cbind(L2, L3, L4) > 0, na.rm = TRUE)), by = Name]
## Name sum
## 1: Carl 4
## 2: Joe 4
We can use aggregate with rowSums to get the output
aggregate(cbind(Total = rowSums(df[3:5] > 0, na.rm = TRUE)) ~ cbind(Name = df$Name), FUN = sum)
# Name Total
#1 Carl 4
#2 Joe 4
Or using data.table: convert the data.frame to a data.table (setDT(df)), group by Name, select the columns of interest with .SDcols, unlist the subset of data (.SD), compare against 0, and sum the TRUE values to create the summarised Total column.
library(data.table)
setDT(df)[, .(Total = sum(unlist(.SD)>0, na.rm = TRUE)), Name, .SDcols = L2:L4]
# Name Total
#1: Carl 4
#2: Joe 4
Or another option is dplyr/tidyr: we select the columns of interest, gather into 'long' format, filter for elements greater than 0, then, grouped by Name, count the rows with n():
library(dplyr)
library(tidyr)
df %>%
select(-L1) %>%
gather(Var, Val, -Name) %>%
filter(Val>0) %>%
group_by(Name) %>%
summarise(Total = n())
# A tibble: 2 × 2
# Name Total
# <chr> <int>
#1 Carl 4
#2 Joe 4
With plyr you could do:
library(plyr)
nonZeroDF <- ddply(df[, -2], "Name", .fun = function(x)
  data.frame(nonZeroObs = sum(x[, -1] > 0, na.rm = TRUE)))
# Name nonZeroObs
#1 Carl 4
#2 Joe 4
I have a simple question about aggregating values in R.
Suppose I have a dataframe:
DF <- data.frame(col1=c("Type 1", "Type 1B", "Type 2"), col2=c(1, 2, 3))
which looks like this:
col1 col2
1 Type 1 1
2 Type 1B 2
3 Type 2 3
I notice that I have Type 1 and Type 1B in the data, so I would like to combine Type 1B into Type 1.
So I decide to use dplyr:
filter(DF, col1 == 'Type 1' | col1 == 'Type 1B') %>%
  summarise(n = sum(col2))
But now I need to keep going with it:
DF2 <- data.frame('Type 1', filter(DF, col1 == 'Type 1' | col1 == 'Type 1B') %>%
  summarise(n = sum(col2)))
I guess I want to rbind this new DF2 back onto the original DF, but that means I have to set the column names to be consistent:
names(DF2) <- c('col1', 'col2')
OK, now I can rbind:
rbind(DF2, DF[3,])
The result? It worked....
col1 col2
1 Type 1 3
3 Type 2 3
...but ugh! That was awful! There has to be a better way to simply combine values.
Here's a possible dplyr approach:
library(dplyr)
DF %>%
group_by(col1 = sub("(.*\\d+).*$", "\\1", col1)) %>%
summarise(col2 = sum(col2))
#Source: local data frame [2 x 2]
#
# col1 col2
#1 Type 1 3
#2 Type 2 3
Using sub() with aggregate(), removing anything other than a digit from the end of col1:
do.call("data.frame",
aggregate(col2 ~ cbind(col1 = sub("\\D+$", "", col1)), DF, sum)
)
# col1 col2
# 1 Type 1 3
# 2 Type 2 3
The do.call() wrapper is there so that the first column after aggregate() is properly changed from a matrix to a vector. This way there aren't any surprises later on down the road.
In my opinion, aggregate() is the perfect function for this purpose, but you shouldn't have to do any text processing (e.g. gsub()). I would do this in a two-step process:
Overwrite col1 with the new desired grouping.
Compute the aggregation using the new col1 to specify the grouping.
DF$col1 <- ifelse(DF$col1 %in% c('Type 1', 'Type 1B'), 'Type 1', as.character(DF$col1)) # as.character() works whether or not col1 is a factor
DF
## col1 col2
## 1 Type 1 1
## 2 Type 1 2
## 3 Type 2 3
DF <- aggregate(col2 ~ col1, DF, FUN = sum)
DF
## col1 col2
## 1 Type 1 3
## 2 Type 2 3
You can try:
library(data.table)
setDT(transform(DF, col1 = gsub("(.*)[A-Z]+$", "\\1", DF$col1)))[, list(col2 = sum(col2)), col1]
# col1 col2
# 1: Type 1 3
# 2: Type 2 3
Or even more directly:
setDT(DF)[, .(col2 = sum(col2)), by = .(col1 = sub("[[:alpha:]]+$", "", col1))]
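Assuming the same DF as above, this should again return:
#      col1 col2
# 1: Type 1    3
# 2: Type 2    3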