Cluster hh:mm in 3 minute intervals in dataframe R - r

I have a dataframe containing tweets from a 1.5 hour time span (from 21:23 to 22:48). One of the columns in my dataframe holds the timestamp as hh:mm (class character). I want to plot my data (e.g. the number of tweets) over time, but grouped in 3 minute intervals, so that a barplot gives a cleaner overview than grouping by the minute. Can someone explain to me how this would work?
In the example dt below, the result I want to achieve is:
a, b, c, and d are clustered in group 1
e, f, g, h, and i are clustered in group 2
j, k, and l are clustered in group 3
dt <- data.frame(text = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
time = c("21:23", "21:23", "21:24", "21:25", "21:26", "21:27", "21:27", "21:28", "21:28", "21:29", "21:30", "21:31"))

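Check out the bins option of geom_histogram, which sets the number of bins (buckets). For example,
ggplot(<your options>) + geom_histogram(bins=30)
I used 30 because 30 bins of 3 minutes cover 90 minutes of data, but you can make it a variable. Note that geom_histogram needs a continuous x, so the character timestamps have to be converted to a time or a number first.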
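If you would rather compute the groups yourself, for example for a barplot, the grouping can be sketched in base R as below. This is only a sketch under some assumptions: the group column and the helper variables are names made up for illustration, the intervals are counted from the earliest timestamp in the data, and the timestamps do not cross midnight.

```r
# Example data from the question (stringsAsFactors = FALSE keeps time as character)
dt <- data.frame(
  text = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
  time = c("21:23", "21:23", "21:24", "21:25", "21:26", "21:27",
           "21:27", "21:28", "21:28", "21:29", "21:30", "21:31"),
  stringsAsFactors = FALSE
)

# Convert "hh:mm" to minutes since midnight
parts <- strsplit(dt$time, ":", fixed = TRUE)
minutes <- vapply(parts,
                  function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
                  integer(1))

# Number the 3-minute intervals, starting at the earliest timestamp
dt$group <- (minutes - min(minutes)) %/% 3L + 1L

# Number of tweets per interval
tweets_per_group <- table(dt$group)
```

With the example data this puts a to d in group 1, e to i in group 2, and j to l in group 3; barplot(tweets_per_group) would then draw one bar per interval.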

Related

Breaking ties based on repeated counts/Subsetting data in R

I'm trying to come up with a reasonable (if not clever) way of subsetting some data. Assume that when I create a table from the original data, it looks like this:
testdat <- data.frame(nom = c("A", "B", "C", "D", "E", "F", "G", "H", "I",
"J", "K"), cts = c(100, 50, 35, 10, 10, 5, 4, 2, 1, 1, 1))
My idea was to cut the data after the first three points here (they all have unique name/count combinations), then take points D and E as a group (they are the first group with repeated counts), and then points I, J, and K (the second group with repeated counts). Just in case it isn't clear what I mean by "repeated counts": there's no difference between D and E except their name, since they both appear 10 times in the data.
This isn't searching for duplicates (since each row is unique), but it is (since there are repeated counts in the second column). We can assume that the order is always either decreasing or repeated; it never increases (the table results were sorted in decreasing order).
How can I find the row (and row number) of the first time cts is repeated n times?
You can get the row number of the first value that is repeated (occurs more than once in a row) by doing:
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 1)[1]])[1]
#> [1] 4
And the first entry that occurs at least three times is at row:
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 2)[1]])[1]
#> [1] 9
And you can get all the rows with repeated counts with:
which(duplicated(testdat$cts) | rev(duplicated(rev(testdat$cts))))
#> [1] 4 5 9 10 11
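The two rle calls above can be wrapped into a small helper for any n. This is a minimal sketch, not the answerer's code: first_run_of is a hypothetical name, and it locates the run by position (summing the lengths of the earlier runs) instead of matching on the value.

```r
testdat <- data.frame(
  nom = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"),
  cts = c(100, 50, 35, 10, 10, 5, 4, 2, 1, 1, 1)
)

# Hypothetical helper: row number where the first run of at least n equal,
# consecutive values starts; NA if there is no such run
first_run_of <- function(x, n) {
  runs <- rle(x)
  idx <- which(runs$lengths >= n)[1]
  if (is.na(idx)) return(NA_integer_)
  # the run starts right after all earlier runs
  sum(runs$lengths[seq_len(idx - 1)]) + 1L
}

first_run_of(testdat$cts, 2)
first_run_of(testdat$cts, 3)
```

For testdat the two calls return 4 and 9, matching the results above. Indexing testdat with the returned row number gives the row itself.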

How to subset a data set based on a specific condition?

I am working with a data set that has over 500 000 observations. It is available here: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/. My goal is to remove some of the rows based on the following buzzwords:
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
If the column Description contains these buzzwords, for example, "Apple" contains "A", I should remove the entire row. I wrote this function:
isUndesirable2 = function(x){
  c=FALSE
  for (i in 1:length(buzzwords)){
    if (str_detect(toupper(x),buzzwords[i])){
      c=TRUE
    }
    break
  }
  return(c)
}
The function above works well, I tried it on multiple examples. However, when I try to subset the original data set:
dataset43 = dataset2[which(!isUndesirable2(as.character(dataset2$Description))),]
I get this warning:
Warning message:
In if (str_detect(toupper(x), buzzwords[i])) { :
the condition has length > 1 and only the first element will be used
and the new dataset43 is empty. Weird!
If anyone more experienced with R can help - I'd really appreciate it. Thanks! PS Sorry about the formatting, it is my first post.
Consider a more R-like approach
buzzwords = c("A", "B", "C", "D", "E", "F", "G", "H")
dataset2 <- data.frame(Description=c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"))
library(stringr)
I use outer for an all-to-all comparison of dataset2$Description and buzzwords via str_detect(i, j), then filter dataset2 based on whether any buzzword was found:
is_buzzword_present <- outer(dataset2$Description, buzzwords, function(i, j) str_detect(toupper(i), j))
dplyr::filter(dataset2, !apply(is_buzzword_present, 1, any))
# Description
# 1 Zoo
# 2 You
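Using the example above, we can also do the following in base R. Note that this match is case-sensitive, so "Yoohoo" is kept; pass ignore.case = TRUE to grepl to mirror the toupper call in the question: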
subset(dataset2,!grepl(paste(buzzwords,collapse="|"),do.call(paste,dataset2)))
Description
3 Zoo
6 Yoohoo
7 You
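For comparison, here is a minimal base R sketch of the case-insensitive variant, which mirrors the toupper call in the question. The names pattern, keep, and kept are made up for illustration, and the data is the small example from the answers, not the real data set.

```r
buzzwords <- c("A", "B", "C", "D", "E", "F", "G", "H")
dataset2 <- data.frame(
  Description = c("Apple", "Bee", "Zoo", "Green", "Hospital", "Yoohoo", "You"),
  stringsAsFactors = FALSE
)

# One regular expression that matches any of the buzzwords
pattern <- paste(buzzwords, collapse = "|")

# grepl is vectorized, so it returns one TRUE/FALSE per row and no loop is needed
keep <- !grepl(pattern, dataset2$Description, ignore.case = TRUE)
kept <- dataset2[keep, , drop = FALSE]
kept
```

Because grepl handles the whole column at once, this also avoids the "condition has length > 1" warning, which comes from passing a whole vector to if.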

R - Compare values in two columns in different rows

I have a dataframe df, shown below, with two features: a departure city and an arrival city. Each pair of rows stores an outbound flight and its return flight.
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
7 K L
8 K L
There is some inconsistency in the data where the same flight is repeated, as can be seen in the last two rows.
How can I compare, for every two rows, the departure city of the first row with the arrival city of the second row, and keep the pairs where they are equal?
The dataset is very big, so a for-loop is not an option.
Thank you in advance.
Here is a method that compares the pairs of rows using head and tail to line them up.
# find Departures that match the Arrival in the next row
sames <- which(head(dat$Departure, -1) == tail(dat$Arrival, -1))
# keep pairs of rows that match, maintaining order with `sort`
dat[sort(unique(c(sames, (sames + 1)))),]
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
Note that the two variables have to be character vectors, not factor variables. You can coerce them to character using as.character if necessary.
data
dat <-
structure(list(Departure = c("A", "B", "F", "G", "U", "V", "K",
"K"), Arrival = c("B", "A", "G", "F", "V", "U", "L", "L")), .Names = c("Departure",
"Arrival"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8"))
So you just want unique flight paths? There are a number of ways to do this. I'd think the fastest would be with data.table, something like:
library(data.table)
df <- as.data.table(df)
uniqueDf <- unique(df)
You can also use the duplicated function, something like
df <- df[!duplicated(df), ]
should do nicely.
You could also do it this way:
right = rep(df[c(TRUE,FALSE),"Arrival"]==df[c(FALSE,TRUE),"Departure"],each=2)
df[right,]
This returns:
Departure Arrival
1 A B
2 B A
3 F G
4 G F
5 U V
6 V U
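A slightly stricter variant of the same idea checks both directions of each pair, so a pair is kept only if the second row is the exact reverse of the first. This is a sketch under the assumption that the data frame has an even number of rows and that rows 1-2, 3-4, and so on belong together; odd, even, is_round_trip, and result are names made up for illustration.

```r
df <- data.frame(
  Departure = c("A", "B", "F", "G", "U", "V", "K", "K"),
  Arrival   = c("B", "A", "G", "F", "V", "U", "L", "L"),
  stringsAsFactors = FALSE
)

# Indices of the first and second row of each pair
odd  <- seq(1, nrow(df), by = 2)
even <- odd + 1

# A pair is a round trip if the second row reverses the first
is_round_trip <- df$Departure[odd] == df$Arrival[even] &
  df$Arrival[odd] == df$Departure[even]

# Repeat each flag twice so that it covers both rows of its pair
result <- df[rep(is_round_trip, each = 2), ]
result
```

On the example data this keeps rows 1 to 6 and drops the repeated K-L flight.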
This answer doesn't look for unique records, it specifically checks if a row is a duplicate of the row before.
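Adding a new column with a 1 if the row repeats the previous one (the column is created first so that df$test[i] can be assigned inside the loop):
df$test = NA
for(i in 2:nrow(df)){df$test[i]=ifelse(df$Departure[i] == df$Departure[i-1] & df$Arrival[i] == df$Arrival[i-1], 1,0)}
Loops can be slow though, so here is a vectorized version using shift from data.table: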
library(data.table)
df$test2 = ifelse(df$Departure == shift(df$Departure) & df$Arrival == shift(df$Arrival), 1,0)
Try the following solution and see if it works for you:
df[!duplicated(paste(df$Departure, df$Arrival)), ]

Finding occurrence of character from multiple vector or list

I wish to find the number of times a unique/distinct character occurs across multiple vectors or in a list.
Perhaps it's best to describe with an example.
In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find the occurrence of unique letters as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
book_B = c("c", "d", "a"),
book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code as a loop
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of unique letters", xlab = "Book Number",
main="The occurrence of unique letters as number of books increases")
Note: The real data set is much bigger. Each vector (book_A, book_B, etc.) is about 7000 in length.
I attempted to solve the problem with dplyr or a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it only plots the cumulative sum rather than what I intended. Any hints would be most helpful.
To add to the complexity, it would be nice if the book with the fewest letters could be plotted first. Something along these lines:
# Example ;
# Find the length of the letters in the book
lapply(book_list, length)
# I know that book_B has the fewest letters (3);
# followed by book_A (4) then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x=c(1,2,3),
y=c(one_book, two_book, three_book),
ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
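The second part of the question, adding the books from shortest to longest, can be handled by reordering the list before accumulating. This is a minimal sketch; ordered_books and cumulative_unique are names made up for illustration, and the list is named with = here.

```r
book_list <- list(
  book_A = c("a", "b", "c", "z"),
  book_B = c("c", "d", "a"),
  book_C = c("b", "a", "c", "e", "x")
)

# Sort the books by their number of letters, shortest first
ordered_books <- book_list[order(lengths(book_list))]

# Accumulate the books one by one and count the distinct letters at each step
cumulative_unique <- sapply(
  Reduce(c, ordered_books, accumulate = TRUE),
  function(i) length(unique(i))
)
cumulative_unique
```

For the example this gives 3, 5, and 7, the same values as one_book, two_book, and three_book in the question. The result can be plotted with plot(seq_along(cumulative_unique), cumulative_unique).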

Finding Specific Vector Entries in a Sliding Window

I am trying to create a function that returns counts of specific adjacent nucleotides (C and G beside each other) within a window over a sequence that I have stored in a vector.
I would like the windows to be 100 nucleotides long and to shift by 10 each time.
The data is setup like this (to 10k entries):
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
"a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
So far I have tried this:
library(zoo)
library(seqinr)
rollapply(data, width=100, by=10, FUN=count(data, wordsize=2))
But I always get the error
"Error in match.fun(FUN) :
'count(data, 2)' is not a function, character or symbol"
I have also tried:
starts <- seq(1, length(data)-100, by = 100)
n <- length(starts)
for (i in 1:n){
chunk <- data[starts[i]:(starts[i]+99)]
chunkCG <- count(chunk,wordsize=2)
print (chunkCG)
}
However, I do not know how to save the data that is returned. This approach also does not allow me to overlap frames.
EDIT: To get the desired output with a window that slides by 10 observations, you can use a for loop. Since we pre-allocate the result, the loop is reasonably fast. I think this is the best way to solve your problem, since I don't think many grouping functions (if any) support a sliding window:
library(data.table)
set.seed(1)
#Sample data
df<-data.frame(var=sample(c("a","g","t","c"),600,replace=T))
#The number of windows you want, shift by 10 each time
n_windows <- ((nrow(df) - 100) / 10) + 1
#Create empty DF, this helps increase speed of below loop
res <- data.frame(window=rep(NA,n_windows),count_cg=rep(NA,n_windows))
#Loop over each window, pair every nucleotide with the next one and count the "cg" pairs
for (i in 1:n_windows){
res$window[i] <- paste0((i-1)*10 + 1,"-",(i-1)*10 + 100)
subs <- df[((i-1)*10 + 1):((i-1)*10 + 100),"var"]
subs2 <- paste0(head(subs, -1), tail(subs, -1)) # adjacent pairs only, no recycling of the last element
res$count_cg[i] <- sum(subs2=="cg")
}
head(res)
window count_cg
1 1-100 10
2 11-110 10
3 21-120 8
4 31-130 9
5 41-140 9
6 51-150 9
Your method doesn't overlap because you call seq with by = 100. Otherwise it looks fine; just change it to 10.
To extract the data from your last attempt, create a vector that collects the counts; you can extract the proper count by name indexing.
counted_cg <- integer(0)
for (i in 1:n){
  chunk <- data[starts[i]:(starts[i]+99)]
  chunkCG <- count(chunk,wordsize=2)
  counted_cg <- c(counted_cg, chunkCG["cg"])
}
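The same sliding-window count can also be sketched in base R without seqinr. This is only a sketch under some assumptions: count_cg_windows and its arguments are hypothetical names, the sequence is a character vector of lowercase letters as in the question, and it is at least as long as one window.

```r
# Hypothetical helper: number of "c" followed by "g" in each sliding window
count_cg_windows <- function(x, width = 100, by = 10) {
  # is_cg[i] is TRUE when x[i] is "c" and x[i + 1] is "g"
  is_cg <- head(x, -1) == "c" & tail(x, -1) == "g"
  # start positions of all windows that fit completely into x
  starts <- seq(1, length(x) - width + 1, by = by)
  # a window of `width` nucleotides contains `width - 1` adjacent pairs
  counts <- vapply(starts,
                   function(s) sum(is_cg[s:(s + width - 2)]),
                   integer(1))
  data.frame(start = starts, end = starts + width - 1, count_cg = counts)
}

# Small example from the question, with a shorter window for illustration
data <- c("a", "g", "t", "t", "g", "t", "t", "a", "g", "t", "c", "t",
          "a", "c", "g", "t", "g", "g", "a", "c", "c", "g", "a", "c")
res <- count_cg_windows(data, width = 10, by = 5)
res
```

The result is a data frame with one row per window, so the counts are kept instead of only printed. For the real data you would call count_cg_windows(data) with the defaults of 100 and 10.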
