Removing rows of dataframe based on frequency of a variable - r

I'm working with a dataframe (in R) that contains observations of animals in the wild (recording time/date, location, and species identification). I want to remove rows that contain a certain species if there are fewer than x observations of it in the whole dataframe. As of now I have managed to get it to work with the following code, but I know there must be a more elegant and efficient way to do it.
namelist <- names(table(ind.data$Species))
for (i in 1:length(namelist)) {
  if (table(ind.data$Species)[namelist[i]] <= 2) {
    while (namelist[i] %in% ind.data$Species) {
      j <- match(namelist[i], ind.data$Species)
      ind.data <- ind.data[-j, ]
    }
  }
}
The namelist vector contains all the species names in the data frame ind.data, and the if statement checks whether the frequency of the ith name on the list is at most x (2 in this example).
I'm fully aware that this is not a very clean way to do it, I just threw it together at the end of the day yesterday to see if it would work. Now I'm looking for a better way to do it, or at least for how I could refine it.

You can do this with the dplyr package:
library(dplyr)
new.ind.data <- ind.data %>%
  group_by(Species) %>%
  filter(n() > 2) %>%
  ungroup()
An alternative using built-in functions is ave(); counting over seq_along() keeps the result numeric (ave() applied to the character column itself would return character counts, and comparing those with > 2 goes wrong):
group_sizes <- ave(seq_along(ind.data$Species), ind.data$Species, FUN = length)
new.ind.data <- ind.data[group_sizes > 2, ]
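If you prefer to stay in base R, a table() lookup does the same job without ave(); the ind.data below is a toy stand-in for the real frame:

```r
# Toy stand-in for ind.data
ind.data <- data.frame(Species = c("fox", "fox", "fox", "deer", "deer", "owl"))

# Count each species once, then keep rows whose species clears the threshold
counts <- table(ind.data$Species)
new.ind.data <- ind.data[ind.data$Species %in% names(counts[counts > 2]), , drop = FALSE]
```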

We can use data.table
library(data.table)
setDT(ind.data)[, .SD[.N > 2], Species]
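One caveat: .SD[.N > 2] moves the grouping column to the front, so the column order changes. The original order can be kept with .I, which collects the qualifying row indices (the data below is illustrative):

```r
library(data.table)

ind.data <- data.table(Obs = 1:6,
                       Species = c("fox", "fox", "fox", "deer", "deer", "owl"))

# .I[.N > 2] returns the row numbers of groups with more than 2 rows
idx <- ind.data[, .I[.N > 2], by = Species]$V1
new.ind.data <- ind.data[idx]
```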

Related

For loop in R for creating new data frames with respect to rows of a particular column

Hello, I have created a for loop to split my data with respect to a certain column, like so:
for(i in 1:(nrow(df))) {
  team_[[i]] <- df %>% filter(team == i)
}
R doesn't like this, saying team_ not found. The code does run if I initialize a list first:
team_ <- list()
for(i in 1:(nrow(df))) {
  team_[[i]] <- df %>% filter(team == i)
}
This works... However, I am given a list with thousands of empty items and just a few that contain my filtered data sets.
Is there a simpler way to create the data sets without this list approach?
Thank you.
A simpler option is split from base R, which would be faster than using == to subset in a loop:
team_ <- split(df, df$team)
If we want to do some operations for each row, then in the tidyverse it can be done with rowwise:
library(dplyr)
df %>%
  rowwise %>%
  ... step of operations ...
or with group_by
df %>%
  group_by(team) %>%
  ...
The methods akrun suggests are much better than a loop, but you should understand why this isn't working. Remember for(i in 1:nrow(df)) will give you one list item for i = 1, i = 2, etc., right up until i = nrow(df), which is several thousand by the sound of things. If you don't have any rows where team is 1, you will get an empty data frame as the first item, and the same will be true for every other value of i that isn't represented.
A loop like this would work:
for(i in unique(df$team)) team_[[i]] <- df %>% filter(team == i)
But I would stick to a non-looping method as described by akrun.
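A quick sketch of the difference, using made-up team numbers:

```r
df <- data.frame(team = c(7, 7, 42), score = c(10, 20, 30))

# Indexing the list by the raw team number pads it with empty slots
team_ <- list()
for (i in unique(df$team)) team_[[i]] <- df[df$team == i, ]
length(team_)   # 42, but only slots 7 and 42 hold data

# split() gives exactly one element per team, named after the team
by_team <- split(df, df$team)
length(by_team) # 2
```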

R for loop to filter and print columns of a data frame

Similarly asked questions to mine don't seem to quite apply to what I am trying to accomplish, and at least one of the provided answers in one of the most similar questions didn't provide a solution that actually works.
So I have a data frame that, let's say, is similar to the following.
sn <- 1:6
pn <- letters[1:6]
issue1_note <- c("issue", "# - #", NA, "sue", "# - #", "ISSUE")
issue2_note <- c("# - #", "ISS", "# - #", NA, "Issue", "Tissue")
df <- data.frame(sn, pn, issue1_note, issue2_note)
Here is what I want to do. I want to be able to visually inspect each _note column quickly and easily. I know I can do this on each column by using select() and filter() as in
df %>% select(issue1_note) %>%
  filter(!is.na(issue1_note) & issue1_note != "# - #")
However, I have around 30 columns and 300 rows in my real data and don’t want to do this each time.
I’d like to write a for loop that will do this across all of the columns. I also want each of the columns printed individually. I tried the below to remove just the NAs, but it merely selects and prints the columns. It’s as if it skips over the filtering completely.
col_notes <- df %>% select(ends_with("note")) %>% colnames()
for(col in col_notes){
  df %>% select(col) %>% filter(!is.na(col)) %>% print()
}
Any ideas on how I can get this to also filter?
I was able to figure out a solution through more research, though it doesn’t involve a for loop. I created a custom function and then used lapply. In case anybody is wondering, here is my solution.
my_fn <- function(column){
  tmp <- df %>% select(column)
  tmp %>% filter(!is.na(.data[[column]]) & .data[[column]] != "# - #")
}
lapply(col_notes, my_fn)
Thanks for the consideration.
This can be done all at once with filter/across, or with filter/if_any or filter/if_all, depending on the outcome desired.
library(dplyr)
df %>%
  filter(across(ends_with('note'), ~ !is.na(.) & . != "# - #"))
This will return rows with no NA or "# - #" in any of the columns whose names end in "note". If we want to return rows where at least one of those columns passes the check, use if_any:
df %>%
  filter(if_any(ends_with("note"), ~ !is.na(.) & . != "# - #"))
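As a side note, newer dplyr releases deprecate across() inside filter() in favor of if_all(), which states the "every column must pass" intent directly. A minimal sketch on toy data (column values chosen for illustration):

```r
library(dplyr)

df <- data.frame(sn = 1:3,
                 issue1_note = c("issue", "# - #", NA),
                 issue2_note = c("ok", "ISS", "# - #"))

# Keep rows where every *_note column is non-NA and not "# - #"
res <- df %>% filter(if_all(ends_with("note"), ~ !is.na(.) & . != "# - #"))
```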

Developing Functions to Make New Dataframes in R

I am trying to develop a function that will take data, see if it matches a value in a category (e.g., 'Accident'), and if so, build a new dataframe using the following code.
cat.df <- function(i) {
  sdb.i <- sdb %>%
    filter(Category == i) %>%
    group_by(Year) %>%
    summarise(count = n()}
The name of the dataframe should be sdb.i, where i is the name of the category (e.g., 'Accident'). Unfortunately, I cannot get it to work. I'm notoriously bad with functions and would love some help.
It's not entirely clear what you are after so I am making a guess.
First of all, your function cat.df misses a closing bracket so it would not run.
I think it is good practice to pass all objects as parameters to a function. In my example I use the iris dataset so I pass this explicitly to the function.
You cannot change the name of a data frame in the way you describe. I offer two alternatives: if the number of categories is small, you can just create a separately named object for each; if you have many categories, it is best to combine all result objects into a list.
library(dplyr)
data(iris)
cat.df <- function(data, i) {
  data <- data %>%
    filter(Species == i) %>%
    group_by(Petal.Width) %>%
    summarise(count = n())
  return(data)
}
result.setosa <- cat.df(iris, "setosa") # unique name
Species <- sort(unique(iris$Species))
results_list <- lapply(Species, function(x) cat.df(iris, x)) # combine all df's into a list
names(results_list) <- Species # name the list elements
You can then get the list elements as e.g. results_list$setosa or results_list[[1]].

Create a dataframe looping a function's results

So this is a simplification of my problem.
I have a dataframe like this:
df <- data.frame(name=c("lucas","julio","jack","juan"),number=c(1,15,100,22))
And I have a function that creates new values for every name, like this:
var_number <- function(x) {
  example <- df %>%
    filter(name %in% unique(df$name)[x]) %>%
    select(-name) %>%
    mutate(value1 = number/2^5, value2 = number^5)
  (example)
}
var_number(1)
  number  value1 value2
1      1 0.03125      1
Now I have two new values for every name and I would like to create a loop to save each result in a new dataframe.
I know how to solve this particular problem, but I need a general solution that allows me to save the results of all functions into a dataframe.
I'm looking for an automatic way to do something like this:
result<- bind_rows(var_number(1),var_number(2),var_number(3),var_number(4))
Since I would have to apply var_number around 1000 times, and the length would change with every test I do.
Is there any way I can do something like this? I was thinking about doing it with "for", but I'm not really sure how to do it; I have just started with R and I am a total newbie.
This answers my problem:
library(tidyverse) # contains purrr library
#an arbitrary function that always outputs a dataframe
# with a consistent number of columns, in this case 3
myfunc <- function(x){
  data.frame(a = x*2,
             b = x^2,
             c = log2(x))
}
# iterate over 1:10 as inputs to myfunc, and
# combine the results rowwise into a df
purrr::map_dfr(1:10,
               ~myfunc(.))
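If you'd rather not depend on purrr, the same row-binding works in base R with lapply() plus do.call(rbind, ...), using the same myfunc as above:

```r
# Same arbitrary function as above: always returns a 3-column data frame
myfunc <- function(x){
  data.frame(a = x*2, b = x^2, c = log2(x))
}

# lapply() builds the list of data frames; rbind stacks them rowwise
result <- do.call(rbind, lapply(1:10, myfunc))
```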
Why do you want to apply the var_number function for each name, create a new dataframe for each, and then combine all of them together?
Do it only once on the same dataframe.
library(dplyr)
df1 <- df %>%
  mutate(value1 = number/2^5, value2 = number^5) %>%
  select(-name)
If you want to do it only for specific names, you can filter them first before applying the above.

How to use if-statement in apply function?

Since I have to read over 3 GB of data, I would like to improve my code by changing two for loops and an if statement to the apply function.
Below is a reproducible example of my code. The overall purpose (in this example) is to count the number of positive and negative values in the c column for each combination of a and b values. In the real case I have over 150 files to read.
# Example of initial data set
df1 <- data.frame(a=rep(c(1:5),times=3),b=rep(c(1:3),each=5),c=rnorm(15))
# Another dataframe to keep track of "c" counts
dfOcc <- data.frame(a = rep(c(1:5), times = 3), b = rep(c(1:3), each = 5), positive = 0, negative = 0)
So far I did this code, which works but is really slow:
for (i in 1:nrow(df1)) {
  x = df1[i, "a"]
  y = df1[i, "b"]
  if (df1[i, "c"] >= 0) {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "positive"] + 1
  } else {
    dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] <- dfOcc[which(dfOcc$a == x & dfOcc$b == y), "negative"] + 1
  }
}
I am unsure whether the code is slow due to the size of the files (260k rows each) or due to the for-loop?
So far I managed to improve it in this way:
dfOcc[which(dfOcc$a==df$a & dfOcc$b==df$b),"positive"] <- apply(df,1,function(x){ifelse(x["c"]>0,1,0)})
This works fine in this example but not in my real case:
It only keeps count of the positive c values, and running this code twice might be counterproductive.
My original datasets are 260k rows while my "tracer" is 10k rows (the initial dataset repeats the a and b values with other c values).
Any tip on how to improve those two points would be greatly appreciated!
I think you can simply count and spread the data. This will be easier and will work on any group and dataset. You can change group_by(a) to group_by(a, b) if you want to count grouping by both the a and b columns.
library(dplyr)
library(tidyr)
df1 %>%
  group_by(a) %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(sign) %>%
  spread(sign, n)
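spread() still works but is superseded; with current tidyr the reshaping step would be pivot_wider(). A self-contained sketch (df1 rebuilt with a fixed seed so it runs on its own):

```r
library(dplyr)
library(tidyr)

set.seed(1)
df1 <- data.frame(a = rep(1:5, times = 3), b = rep(1:3, each = 5), c = rnorm(15))

# count(a, sign) replaces the group_by + count pair from above
res <- df1 %>%
  mutate(sign = ifelse(c > 0, "Positive", "Negative")) %>%
  count(a, sign) %>%
  pivot_wider(names_from = sign, values_from = n, values_fill = 0)
```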
The data.table package might help you do this in one line.
df1 <- data.table(data.frame(a = rep(c(1:5), times = 3), b = rep(c(1:3), each = 5), c = rnorm(15)))
posneg <- c("positive", "negative") # list of columns needed
df1[, (posneg) := list(ifelse(c > 0, 1, 0), ifelse(c < 0, 1, 0))] # use list to combine the 2 ifelse conditions
For more information, try
?data.table
If you really want the positive/negative counts to be in a separate dataframe:
dfOcc <- df1[,c("a", "positive","negative")]
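For the original counting task, data.table can also aggregate per (a, b) group directly, skipping the row-by-row updates entirely. A sketch on the toy data (rebuilt with a fixed seed; in the toy example each (a, b) pair occurs only once, so every count is 1, but with repeated pairs the sums accumulate):

```r
library(data.table)

set.seed(1)
df1 <- data.table(a = rep(1:5, times = 3), b = rep(1:3, each = 5), c = rnorm(15))

# One pass: count signs within each (a, b) group
dfOcc <- df1[, .(positive = sum(c >= 0), negative = sum(c < 0)), by = .(a, b)]
```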
