I am working with a data set that has over 500 000 observations. It is available here: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/. My goal is to remove some of the rows based on the following buzzwords:
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
If the Description column contains any of these buzzwords (for example, "Apple" contains "A"), the entire row should be removed. I wrote this function:
isUndesirable2 = function(x){
c=FALSE
for (i in 1:length(buzzwords)){
if (str_detect(toupper(x),buzzwords[i])){
c=TRUE
}
break
}
return(c)
}
The function above works well, I tried it on multiple examples. However, when I try to subset the original data set:
dataset43 = dataset2[which(!isUndesirable2(as.character(dataset2$Description))),]
I get this warning:
Warning message:
In if (str_detect(toupper(x), buzzwords[i])) { :
the condition has length > 1 and only the first element will be used
and the new dataset43 is empty. Weird!
If anyone more experienced with R can help - I'd really appreciate it. Thanks! PS Sorry about the formatting, it is my first post.
Consider a more R-like approach
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
dataset2 <- data.frame(Description=c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"))
library(stringr)
I use outer() for an all-to-all comparison of dataset2$Description and buzzwords via str_detect(i, j), which gives a logical matrix with one row per description and one column per buzzword. Then I filter dataset2 to the rows in which no buzzword was found:
is_buzzword_present <- outer(dataset2$Description, buzzwords, function(i, j) str_detect(toupper(i), j))
dplyr::filter(dataset2, !apply(is_buzzword_present, 1, any))
# Description
# 1 Zoo
# 2 You
Using the example above we can do:
subset(dataset2,!grepl(paste(buzzwords,collapse="|"),do.call(paste,dataset2)))
Description
3 Zoo
6 Yoohoo
7 You
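Note that grepl() is case-sensitive by default, which is why "Yoohoo" survives here but not in the str_detect(toupper(...)) version above. Below is a minimal sketch of a case-insensitive variant, assuming only the Description column needs to be searched; pattern, keep and dataset43 are just illustrative names.

```r
buzzwords <- c("A", "B", "C", "D", "E", "F", "G", "H")
dataset2 <- data.frame(
  Description = c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"),
  stringsAsFactors = FALSE
)

# One regular expression that matches any buzzword: "A|B|C|D|E|F|G|H"
pattern <- paste(buzzwords, collapse = "|")

# grepl() is vectorised, so it returns one TRUE/FALSE per row
keep <- !grepl(pattern, dataset2$Description, ignore.case = TRUE)

# Keep only the rows without a buzzword (drop = FALSE keeps a data frame)
dataset43 <- dataset2[keep, , drop = FALSE]
dataset43$Description
```

Because grepl() handles the whole column at once, no if() is ever applied to a vector, so the "condition has length > 1" warning does not arise. If the real buzzwords can contain regex metacharacters, they would need escaping first.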
Related
I have a data frame ('ju') that has three columns and 230 rows. The first two columns hold a pair of objects, and the third column holds one of the objects from that pair. I'd like to add a fourth column containing the other object from the pair, as shown below.
I wrote some code to work out the value for the fourth column (loser), but it does not give me any output when I run it.
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
paste(ju$letter1[i])
} else {
paste (ju$letter2[i])
}
}
I cannot see what is wrong with the code. I would also appreciate a suggestion for how to create this fourth column directly in my data frame, instead of creating a separate vector and then adding it to the data frame. Thanks
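This will do it without a for loop. The comparison has to be row by row (==); %in% would check each winner against the whole letter1 column and give the wrong loser whenever the winner's letter also appears in letter1 of some other row:
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1)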
Gives:
> ju
letter1 letter2 winner loser
1 a c a c
2 c b b c
3 t j j t
4 r k k r
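For reference, here is a self-contained sketch of the vectorised approach. It assumes the three columns are character vectors (if they are factors, wrap each in as.character() first). The sample rows are made up for illustration; the fifth row is one where testing membership in the whole letter1 column would pick the wrong loser.

```r
ju <- data.frame(
  letter1 = c("a", "c", "t", "r", "a"),
  letter2 = c("c", "b", "j", "k", "c"),
  winner  = c("a", "b", "j", "k", "c"),
  stringsAsFactors = FALSE
)

# Row-by-row comparison: if the winner is letter1, the loser is letter2,
# otherwise the loser is letter1
ju$loser <- ifelse(ju$winner == ju$letter1, ju$letter2, ju$letter1)
ju
```

The new column is written straight into the data frame, so no separate vector is needed.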
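Inside a for loop, values are not printed automatically, so calling paste() on its own shows nothing. If you want to print to the console, you'll need to do so explicitly: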
cat(ju$letter1[i])
or
print(ju$letter1[i])
Regarding the new-column question, a possible solution (a for loop is sub-optimal here; see the suggestion from #lab_rat_kid):
ju$NewColumn = NA
for (i in 1:230) {
if (ju$winner[i]==ju$letter2[i]) {
ju$NewColumn[i] <- ju$letter1[i]
} else {
ju$NewColumn[i] <- ju$letter2[i]
}
}
with tidyverse:
library(dplyr)
dt <- tibble(l1 = c("a", "c", "t", "r"),
             l2 = c("c", "b", "j", "k"),
             winner = c("a", "b", "j", "k"))
# the loser is whichever of l1/l2 is not the winner
dt <- dt %>%
  mutate(loser = if_else(winner == l1, l2, l1))
dt
I have a big data frame and a list of values (character). I want to subset the data frame to the rows where one column contains a value from the list. What should I do?
Thanks
Leilei
I assume you want something like this..
Please make a reproducible example next time ://
df <- data.frame(letters = c("a", "b", "c", "d"), value = 1:4)
subset_list <- list("a", "y", "b", "x")
df_subset <- df[df$letters %in% subset_list, ]
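If the values really are stored in a list, flattening it with unlist() first makes the comparison explicit. A small sketch along the same lines (wanted is just an illustrative name):

```r
df <- data.frame(
  letters = c("a", "b", "c", "d"),
  value = 1:4,
  stringsAsFactors = FALSE
)
subset_list <- list("a", "y", "b", "x")

# Flatten the list into a plain character vector
wanted <- unlist(subset_list)

# Keep the rows whose letters value is one of the wanted values
df_subset <- df[df$letters %in% wanted, , drop = FALSE]
df_subset
```

Values in the list that never occur in the column ("y" and "x" here) are simply ignored.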
I have the following code that is taking forever to run on my 80k-row CBP table. Could anyone help me optimize my loop? I am simply trying to find duplicates that share the same values in certain (not all) columns, count how many duplicates there are, and return the ids of the duplicates:
for (row in 1:nrow(CBP)){
subs <- subset(CBP, CBP$Lower_Bound__c == CBP[row,"Lower_Bound__c"] & CBP$Price_Book__c == CBP[row,"Price_Book__c"] & CBP$Price__c == CBP[row,"Price__c"] & CBP$Product__c == CBP[row,"Product__c"] & CBP$Department__c == CBP[row,"Department__c"] & CBP$UOM__c == CBP[row,"UOM__c"] & CBP$Upper_Bound__c == CBP[row,"Upper_Bound__c"])
if (nrow(subs)>1){
CBP[row,]$dup <- nrow(subs)
CBP[row,]$dupids <- paste(subs[,"Id"], collapse = ",")
}
print(row)
}
I'm having a hard time understanding your example. However, here's a simple approach with data.table that might work for your situation. You can create a variable (nsame in the example) that numbers the rows within each combination of several variables (var1 and var2 in the example); any row numbered above 1 repeats an earlier row. Then just grab the row index.
library(data.table)
# generate some example data
dt <- data.table(
var1 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
var2 = c("a", "a", "z", "b", "y", "b", "c", "c", "c"),
var3 = 1:9
)
# counter for each combination of var1-var2
dt[ , nsame := 1:.N, by=.(var1, var2)]
# duplicates are where the counter is > 1
which(dt$nsame > 1)
## 2 6 8 9
Using base R:
dupe_columns = c(
"Lower_Bound__c", "Price_Book__c", "Price__c", "Product__c",
"Department__c", "UOM__c", "Upper_Bound__c"
)
# which rows are duplicated
dupes = which(duplicated(CBP[, dupe_columns]) | duplicated(CBP[, dupe_columns], fromLast = TRUE))
# how many are there
length(dupes)
# IDs that are duplicated
CBP[dupes, "Id"]
# collapse Ids with duplicates by group:
aggregate(CBP$Id, by = CBP[dupe_columns], FUN = paste, collapse = ",")
If any of this doesn't work or you need more help, post 10-20 rows of sample data (use dput() so it is copy/pasteable!!!) so we can test and verify.
Subtle point, but I use CBP[, dupe_columns] in the duplicated() line because duplicated() will work the same whether we give it a data frame or a vector. CBP[, dupe_columns] will be a data frame if you have more than one column to check for dupes, but will be a vector if you give it a single column. However, when we get down to aggregate we need the by argument to be a list (like a data frame). So I use CBP[dupe_columns] (no comma) which will guarantee a data frame even if we are only checking a single column.
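To get the two per-row columns from the question (dup and dupids) without a loop, one option is ave(), which applies a function within groups and returns a result aligned with the original rows. The sketch below uses a small made-up CBP table with only two of the key columns. The key helper and the "\037" separator are illustrative choices, and it assumes that separator never occurs in the data.

```r
# Toy stand-in for the real table (assumed structure)
CBP <- data.frame(
  Id = c("id1", "id2", "id3", "id4"),
  Price__c = c(10, 10, 20, 10),
  Product__c = c("p1", "p1", "p2", "p1"),
  stringsAsFactors = FALSE
)
dupe_columns <- c("Price__c", "Product__c")

# One string per row that identifies its combination of key columns
key <- do.call(paste, c(CBP[dupe_columns], sep = "\037"))

# Number of rows sharing the same key (1 means the row is unique)
CBP$dup <- ave(seq_along(key), key, FUN = length)

# Comma-separated Ids of all rows sharing the same key
CBP$dupids <- ave(CBP$Id, key, FUN = function(ids) paste(ids, collapse = ","))
CBP
```

Unlike the original loop, this also fills dup and dupids for rows that are not duplicated (dup is 1 and dupids is the row's own Id); filter on dup > 1 if only the duplicates are wanted.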
I wish to find the number of times a unique/distinct character occurs across multiple vectors or in a list.
Perhaps it's best to describe this with an example.
In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find the number of unique letters as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code as a loop
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of unique letters", xlab = "Book Number",
main="The occurrence of unique letters as number of books increases")
Note: the real data set is much bigger. Each vector (book_A, book_B, etc.) has a length of about 7000.
I attempted to solve the problem with dplyr and a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it only plots the cumulative sum of letters per book rather than the cumulative number of unique letters. Any hints would be most helpful.
To add to the complexity, it would be nice if the books could be plotted in order of length, shortest first. Something along these lines:
# Example ;
# Find the length of the letters in the book
lapply(book_list, length)
# I know that book_B has the fewest letters (3),
# followed by book_A (4) then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
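For the second part of the question (shortest book first), the list can be reordered by length before accumulating. The following is a sketch, not the only way to do it. It uses union instead of c inside Reduce so the intermediate vectors stay small, and ordered_books, cumulative and n_unique are illustrative names.

```r
book_list <- list(
  book_A = c("a", "b", "c", "z"),
  book_B = c("c", "d", "a"),
  book_C = c("b", "a", "c", "e", "x")
)

# Reorder the books from shortest to longest
ordered_books <- book_list[order(lengths(book_list))]

# Running union: element k holds the letters seen in the first k books
cumulative <- Reduce(union, ordered_books, accumulate = TRUE)

# Count distinct letters at each step (unique() guards against
# repeated letters within the first book, which union() does not touch)
n_unique <- vapply(cumulative, function(v) length(unique(v)), integer(1))
n_unique
#[1] 3 5 7
```

n_unique can then be passed to plot() as the y values, with seq_along(n_unique) on the x axis.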
I am trying to create a function that will return counts of specific adjacent nucleotides (C and G beside each other) within a sliding window over a sequence that I have stored in a vector.
I would like the windows to be 100 nucleotides long and to shift by 10 each time.
The data is set up like this (up to 10k entries):
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
"a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
So far I have tried this:
library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))
But I always get the error
"Error in match.fun(FUN) :
'count(data, 2)' is not a function, character or symbol"
I have also tried:
starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
print (chunkCG)
}
However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.
EDIT: To get the desired output with a window that slides by 10 observations, you can use a for loop. Since we pre-allocate the result, the loop is reasonably fast. I think this is the best way to solve your problem, since I don't think many grouping functions (if any) support a sliding window:
library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))
#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1
#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))
#Loop over each i, paste a leaded version of your sequence onto current sequence and count "cg"s
for (i in 1:n_windows){
res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
subs2<- paste0(as.character(subs),as.character(shift(subs,1L,type="lead")[1:length(subs) - 1]))
res$count_cg[i] <- sum(subs2=="cg")
}
head(res)
window count_cg
1 1-100 10
2 11-110 10
3 21-120 8
4 31-130 9
5 41-140 9
6 51-150 9
Your method doesn't overlap, as you call it with by = 100. Otherwise it looks fine. Just change it to 10.
To save the results from your last attempt, create a numeric vector to collect them (the counts are numbers, so a character vector would turn them into strings), then pull out the count you want by name.
counted_cg <- numeric(0)
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
counted_cg <- c(counted_cg, chunkCG["cg"])
}
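Two side notes. The rollapply() error in the question arises because FUN was given the result of calling count() rather than a function; it would need something like function(w) ... instead. Also, when adjacent letters are pasted together, both vectors should have the same length, because paste0() silently recycles the shorter one. The sketch below avoids both issues in base R by first marking every position where a "c" is followed by a "g" and then summing those marks per window. count_cg_windows is a hypothetical helper name, and the test uses the question's 24-letter sample with a width of 10 and a step of 5.

```r
count_cg_windows <- function(x, width = 100, by = 10) {
  x <- tolower(as.character(x))
  stopifnot(length(x) >= width)

  # is_cg[k] is TRUE when x[k] is "c" and x[k + 1] is "g"
  is_cg <- x[-length(x)] == "c" & x[-1] == "g"

  # First position of every full window
  starts <- seq(1, length(x) - width + 1, by = by)

  data.frame(
    start = starts,
    end = starts + width - 1,
    # A window of `width` letters contains width - 1 adjacent pairs
    count_cg = vapply(
      starts,
      function(s) sum(is_cg[s:(s + width - 2)]),
      integer(1)
    )
  )
}

data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
res <- count_cg_windows(data, width = 10, by = 5)
res
```

For the real data the defaults (width = 100, by = 10) match the window size and step asked for in the question, and res holds the counts in a data frame rather than only printing them.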