remove rows containing certain data - r

In my data frame the first column is a factor and I want to delete rows that have a certain value of factorname (when the value is present). I tried:
df <- df[-grep("factorname",df$parameters),]
Which works well when the targeted factor name is present. However if the factorname is absent, this command destroys the data frame, leaving it with 0 rows. So I tried:
df <- df[!apply(df, 1, function(x) {df$parameters == "factorname"}),]
But that does not remove the offending lines. How can I test for the presence of factorname and remove the line if factorname is present?

You could use:
df[ which( ! df$parameters %in% "factorname") , ]
(I used %in% since it generalizes better to multiple exclusion criteria.) Also possible:
df[ !grepl("factorname", df$parameters) , ]
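The reason the -grep() version empties the data frame is that grep() returns integer(0) when nothing matches, and negating an empty index still selects nothing. A minimal sketch, using a made-up data frame (the contents of `df` and the `value` column are illustrative only):

```r
# Toy data: no row has the value "factorname" (hypothetical data for illustration)
df <- data.frame(parameters = factor(c("a", "b", "c")), value = 1:3)

hits <- grep("factorname", df$parameters)   # integer(0): no match
n_bad <- nrow(df[-hits, ])                  # -integer(0) is still empty, so 0 rows

# A logical index has one element per row, so "no match" keeps every row
keep <- !grepl("factorname", df$parameters)
n_ok <- nrow(df[keep, ])

c(n_bad = n_bad, n_ok = n_ok)
```

Logical indexing (grepl, %in%, !=) behaves the same whether or not the value is present, which is why both suggestions above are safe.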

l<-sapply(iris,function(x)is.factor(x)) # test for the factor variables
>l
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
FALSE FALSE FALSE FALSE TRUE
m<-iris[,names(which(l)),drop=FALSE] #gives the data frame of factor variables only (drop=FALSE keeps a single column as a data frame)
iris[iris$Species !="setosa",] #generates the data with Species other than setosa
> head(iris[iris$Species!="setosa",])
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor

Related

Function to filter data equal to or greater than a certain value

I have a dataframe containing thousands of rows and columns. The rows contain the names of genes and the columns the names of samples.
I only want to keep the rows that contain a value equal to or greater than 5 in more than 3 samples.
I tried this so far but I can't figure out how to set multiple conditions:
data.frame1 %>% filter_all(all_vars(.>= 5))
I hope I have stated this question correctly.
The way I do it in my gene expression filtering pre-differential gene expression pipeline is as follows:
data.frame1[rowSums(data.frame1 >= 5) > 3, ] -> filtered.counts
And if your first column is your gene identifier, with all the other columns being numeric, you can have the evaluation skip the first column as follows:
data.frame1[rowSums(data.frame1[-1] >= 5) > 3, ] -> filtered.counts
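To see why this works: the comparison produces a logical matrix, and rowSums() counts the TRUE values per row. A small sketch with invented counts (the `counts` table and its column names are assumptions for illustration, not the asker's data):

```r
# Hypothetical counts: 3 genes (rows) x 5 samples (columns)
counts <- data.frame(
  gene = c("g1", "g2", "g3"),
  s1 = c(9, 5, 0), s2 = c(7, 5, 1), s3 = c(6, 4, 8),
  s4 = c(5, 2, 9), s5 = c(0, 5, 2)
)

# Comparing the numeric columns gives a logical matrix; rowSums() counts TRUEs
n_pass <- rowSums(counts[-1] >= 5)

# Keep genes with a value >= 5 in more than 3 samples
filtered <- counts[n_pass > 3, ]
filtered$gene
```

Here g1 passes in four samples and is kept, while g2 passes in exactly three and is dropped because the condition is strictly "more than 3".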
The way to do this in dplyr 1.0.0 is
iris %>%
filter(rowSums(across(where(is.numeric)) > 6) > 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.3 2.9 6.3 1.8 virginica
3 7.2 3.6 6.1 2.5 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.7 2.8 6.7 2.0 virginica
7 7.4 2.8 6.1 1.9 virginica
etc
For your case
data.frame1 %>%
filter(rowSums(across(where(is.numeric)) >= 5) > 3)

Cycling through a list of dataframes with a for loop

New here and not very experienced, and I'm trying to get a project in R shinyapp to work.
I have a list of data frames which have a column labeled 'Gender' containing all/M/F. I want to filter all data frames based on the input, so that if the input is male, only rows containing M or all are kept.
list_tables <- list(adverb,adjective,simplenoun,verber,thingnoun,
personnoun,name_firstpart,name_secondpart)
input$gender <- "male"
if(input$gender == "male"){
for (i in list_tables){
list_tables$i <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
}
Problem is, if I check the list afterwards, nothing has changed. If I do the same, but instead of using a for loop to cycle through the dataframes, I perform the same actions on only one dataframe, it does work. Theoretically, I could make a line of code for each dataframe separately, but it doesn't seem very neat and I have the feeling that the for loop should work but I'm just missing something. Would love to hear tips if anyone has them!
i is not a named-entry within list_tables, so list_tables$i doesn't work. Inside that loop, i is the data.frame you're trying to modify, but you don't update it.
Try either:
for (ind in seq_along(list_tables)) {
i <- list_tables[[ind]] # feels a little sloppy, but it's compact ...
list_tables[[ind]] <- i[which((i$Gender=="M")|(i$Gender=="all")),]
}
or even better
list_tables <- lapply(list_tables, function(i) i[which((i$Gender=="M")|(i$Gender=="all")),])
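A small sketch of what the original loop actually does, compared with the index-based version. The two tables below are stand-ins invented for illustration:

```r
# Two small stand-in tables (hypothetical data)
list_tables <- list(
  data.frame(word = c("a", "b", "c"), Gender = c("M", "F", "all")),
  data.frame(word = c("d", "e"),      Gender = c("F", "F"))
)

# Original pattern: `i` is a copy, and `$i` refers to an element literally named "i"
broken <- list_tables
for (i in broken) {
  broken$i <- i[i$Gender %in% c("M", "all"), ]
}
length(broken)   # 3: the two tables are untouched, plus a new element named "i"

# Index-based loop: writes each filtered table back into its own slot
fixed <- list_tables
for (ind in seq_along(fixed)) {
  tbl <- fixed[[ind]]
  fixed[[ind]] <- tbl[tbl$Gender %in% c("M", "all"), ]
}
sapply(fixed, nrow)
```

So the original loop does change the list, just not in the intended place: it keeps overwriting one extra element called "i" and leaves the real tables alone.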
You could use lapply with subset:
example:
list_tables <- replicate(2,iris[c(1,51,101),],F)
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 51 7.0 3.2 4.7 1.4 versicolor
# 101 6.3 3.3 6.0 2.5 virginica
solution:
lapply(list_tables,subset,Species %in% c("setosa","virginica"))
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
#
# [[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 101 6.3 3.3 6.0 2.5 virginica
In your case that would be:
lapply(list_tables,subset,Gender %in% c("M","all"))

dplyr filter by the first column

Is it possible to filter in dplyr by the position of a column?
I know how to do it without dplyr
iris[iris[,1]>6,]
But how can I do it in dplyr?
Thanks!
Besides the suggestion by @thelatemail, you can also use filter_at and pass the column number to vars parameter:
iris %>% filter_at(1, all_vars(. > 6))
all(iris %>% filter_at(1, all_vars(. > 6)) == iris[iris[,1] > 6, ])
# [1] TRUE
No magic, just use the item column number as per above, rather than the variable (column) name:
library("dplyr")
iris %>%
filter(iris[,1] > 6)
Which, as @eipi10 commented, is better as
iris %>%
filter(.[[1]] > 6)
dplyr >= 1.0.0
Scoped verbs (_if, _at, _all) and by extension all_vars() and any_vars() have been superseded by across(). In the case of filter the functions if_any and if_all have been created to combine logic across multiple columns to aid in subsetting (these verbs are available in dplyr >= 1.0.4):
if_any() and if_all() are used to apply the same predicate function to a selection of columns and combine the results into a single logical vector.
The first argument to across, if_any, and if_all is still tidy-select syntax for column selection, which includes selection by column position.
Single Column
In your single column case you could use any of these with the same result:
iris %>%
filter(across(1, ~ . > 6))
iris %>%
filter(if_any(1, ~ . > 6))
iris %>%
filter(if_all(1, ~ . > 6))
Multiple Columns
If you're applying a predicate function or formula across multiple columns then across might give unexpected results, and in this case you should use if_any and if_all:
iris %>%
filter(if_all(c(2, 4), ~ . > 2.3)) # by column position
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.3 3.3 6.0 2.5 virginica
2 7.2 3.6 6.1 2.5 virginica
3 5.8 2.8 5.1 2.4 virginica
4 6.3 3.4 5.6 2.4 virginica
5 6.7 3.1 5.6 2.4 virginica
6 6.7 3.3 5.7 2.5 virginica
Notice this returns rows where all selected columns have a value greater than 2.3, which is a subset of rows where any of the selected columns meet the logic:
iris %>%
filter(if_any(ends_with("Width"), ~ . > 2.3)) # same columns selection as above
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 6.7 3.3 5.7 2.5 virginica
7 6.7 3.0 5.2 2.3 virginica
8 6.3 2.5 5.0 1.9 virginica
9 6.5 3.0 5.2 2.0 virginica
10 6.2 3.4 5.4 2.3 virginica
11 5.9 3.0 5.1 1.8 virginica
The output above was shortened to be more compact for this example.
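For readers without dplyr at hand, the same all-versus-any distinction can be sketched in base R with rowSums(). This is only an analogue under that assumption, not how dplyr implements if_all() or if_any():

```r
# Logical matrix: TRUE where a Width column exceeds 2.3
width_cols <- c("Sepal.Width", "Petal.Width")
over <- iris[width_cols] > 2.3

all_rows <- iris[rowSums(over) == length(width_cols), ]  # every column passes
any_rows <- iris[rowSums(over) > 0, ]                    # at least one passes

# Every row in the "all" result also appears in the "any" result
all(rownames(all_rows) %in% rownames(any_rows))
```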

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted like so:
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse header rows and dispense with sum rows so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack a tailored solution to this, but I'm wondering whether there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo csv dataset (can't assume an equal number of lines per group..), although I'm hoping the solution can transpose to *.xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There is a variety of ways to present spreadsheets so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded in R. Here's an example with your data. It uses the function na.locf from package zoo.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
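If adding zoo as a dependency is not wanted, the "carry the last observation forward" step can be sketched in base R. The helper name `fill_down` is made up here, and the sketch assumes the first element is not NA (it does not reproduce how na.locf treats leading NAs):

```r
# Fill each NA with the most recent non-NA value (assumes the first element is not NA)
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))   # which non-NA value each position belongs to
  v[!is.na(v)][idx]
}

labels <- c("Setosa", NA, NA, "Versicolor", NA)
filled <- fill_down(labels)
filled
```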
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a function which splits at an index.
split_at <- function(x, index) {
N <- NROW(x)
s <- cumsum(seq_len(N) %in% index)
unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
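library(magrittr) # provides %>%
split_at(iris, index) %>%
.[2:length(.)] %>%
purrr::map_df(function(x) {
Species <- x[1,1]
# Drop the header, Mean and blank rows by content: the last group has no
# trailing blank row, so dropping the last two rows by position would remove a data row
x <- x[!(x[,1] %in% c(Species, "Mean", "")),]
data.frame(x, Species = Species)
})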
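To make the grouping step easier to follow, here is split_at applied to a plain vector (the function is repeated so the snippet runs on its own; the input 1:10 and the split positions are arbitrary):

```r
split_at <- function(x, index) {
  N <- NROW(x)
  s <- cumsum(seq_len(N) %in% index)  # group id increases by 1 at every index position
  unname(split(x, s))
}

# Splitting 1:10 at positions 3 and 7 starts a new chunk at each of those positions
chunks <- split_at(1:10, c(3, 7))
lengths(chunks)
```

Anything before the first index ends up in its own leading chunk, which is why the pipeline above discards the first list element before mapping over the rest.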

convert lapply results to single data frame in r [duplicate]

I am extracting specific rows from a list of data frames in R and would like to have those rows assembled into a new data frame. As an example, I will use the iris data:
data(iris)
a.iris <- split(iris, iris$Species)
b.iris <- lapply(a.iris, function(x) with(x, x[3,]))
I want the return from lapply() to be arranged into a single data frame that is in the same structure as the original data frame (e.g., names(iris)). I have been looking at the plyr package but cannot find the right code to make this work. Any assistance would be greatly appreciated!
Brian
You can use do.call() with rbind() and simplify your lapply() call.
a.iris <- split(iris, iris$Species)
b.iris <- do.call(rbind, lapply(a.iris, `[`, 3, ))
b.iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## setosa 4.7 3.2 1.3 0.2 setosa
## versicolor 6.9 3.1 4.9 1.5 versicolor
## virginica 7.1 3.0 5.9 2.1 virginica
> all.equal(names(iris), names(b.iris))
## [1] TRUE
Of course, you could have also used tapply() to find the third row per group.
iris[tapply(seq_len(nrow(iris)), iris$Species, `[`, 3), ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 3 4.7 3.2 1.3 0.2 setosa
# 53 6.9 3.1 4.9 1.5 versicolor
# 103 7.1 3.0 5.9 2.1 virginica
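The tapply() call is easier to read when split into two steps: first compute the row number of the third row in each group, then index the data frame with it. A short sketch on the built-in iris data (the variable names `third_idx` and `by_split` are just illustrative):

```r
# Row numbers 1..150, grouped by species; `[` with 3 picks the third number in each group
third_idx <- tapply(seq_len(nrow(iris)), iris$Species, `[`, 3)
third_idx

# The same rows via split() + rbind, for comparison
by_split <- do.call(rbind, lapply(split(iris, iris$Species), function(x) x[3, ]))
```

The two approaches select the same rows; the difference is only in the row names, since indexing by row number keeps the original ones while rbind() over a named list uses the group names.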
