Breaking ties based on repeated counts / Subsetting data in R

I'm trying to come up with a reasonable (if not clever) way of subsetting some data. Assume that when I create a table from the original data, it looks like this:
testdat <- data.frame(nom = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"),
                      cts = c(100, 50, 35, 10, 10, 5, 4, 2, 1, 1, 1))
My idea was to cut the data after the first three points here (they all have unique name/count combinations), then take points D, E, F, and G as a group (they start at the first repeated counts), and then points I, J, and K (the second group with repeated counts). Just in case it isn't clear what I mean by "repeated counts": there's no difference between D and E except their name; they both appear 10 times in the data.
This isn't exactly a search for duplicates (each row as a whole is unique), but it almost is (there are repeated counts in the second column). We can assume that the counts always either decrease or repeat; they never increase (the table was sorted in decreasing order).
How can I find the row (and row number) of the first time cts is repeated n times?

You can get the row containing the first value that repeats more than once by doing:
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 1)[1]])[1]
#> [1] 4
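To see why this works, it helps to look at the run-length encoding itself; the one-liner picks the first run longer than 1 and then finds where that run's value first occurs:
rle(testdat$cts)
#> Run Length Encoding
#>   lengths: int [1:8] 1 1 1 2 1 1 1 3
#>   values : num [1:8] 100 50 35 10 5 4 2 1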
And the first entry that repeats three times is
which(testdat$cts == rle(testdat$cts)$values[which(rle(testdat$cts)$lengths > 2)[1]])[1]
#> [1] 9
And you can find all the rows with duplicated counts using:
which(duplicated(testdat$cts) | rev(duplicated(rev(testdat$cts))))
#> [1] 4 5 9 10 11
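Since each one-liner computes the same rle twice, you could also wrap the idea in a small helper that computes it once; a minimal sketch (the name first_repeat is illustrative, not from the original answer):
# Row where a value first occurs in a run of at least n repeats
first_repeat <- function(x, n) {
  r <- rle(x)
  val <- r$values[which(r$lengths >= n)[1]]  # value of the first run of length >= n
  which(x == val)[1]                         # first row holding that value
}
first_repeat(testdat$cts, 2)
#> [1] 4
first_repeat(testdat$cts, 3)
#> [1] 9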

Related

Use index to subset dataframe based on unique values in a column

I have a large dataset with numerous sample IDs. A very simplified version looks something like this:
df <- data.frame(ID = rep(c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
                          times = c(10, 4, 12, 19, 5, 22, 6, 7, 11, 4)),
                 Value = sample(x = 20:30, size = 100, replace = TRUE))
I would like to split my large dataset into multiple smaller dataframes based on ID so that when I plot the data my graph doesn't get too crowded. In this simplified example, I would like to split it into two dataframes/plots, one with data from the first 5 unique IDs (A-E) and the other with data from the next 5 unique IDs (F-J). How can I do this easily using index notation (assuming I have hundreds of IDs)? My code below doesn't work and I don't know what's wrong with it:
subset.1 <- df[unique(df$ID)[1:5]]
subset.2 <- df[unique(df$ID)[6:10]]
You should subset rows with a logical vector. (Your code fails because indexing a data frame with a character vector, as in df[unique(df$ID)[1:5]], tries to select columns by name, not rows.)
df[df$ID %in% unique(df$ID)[1:5], ]
df[df$ID %in% unique(df$ID)[6:10], ]
You can also use split with cut to split your dataframe into n datasets (here, 2) by group.
split(df, cut(as.numeric(as.factor(df$ID)), 2))
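For hundreds of IDs, you won't want to write the index ranges by hand; one sketch, assuming df as above and an illustrative chunk size of 5 unique IDs per subset:
# Split the unique IDs into chunks of 5, then subset the data once per chunk
ids <- unique(df$ID)
chunks <- split(ids, ceiling(seq_along(ids) / 5))
subsets <- lapply(chunks, function(g) df[df$ID %in% g, ])
# subsets[[1]] holds IDs A-E, subsets[[2]] holds IDs F-J, and so on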

Cluster hh:mm in 3 minute intervals in dataframe R

I have a dataframe containing tweets during a 1.5-hour time span (from 21:23 to 22:48). One of the columns indicates the timestamp as hh:mm (of class character). I want to plot my data (e.g. number of tweets) over time, but clustered in 3-minute intervals (so I get a cleaner overview in a barplot than when binning by the minute). Can someone explain to me how this would work?
The result I want to achieve for the example dt below is:
a, b, c, and d are clustered in group 1
e, f, g, h, and i are clustered in group 2
j, k, and l are clustered in group 3
dt <- data.frame(text = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"),
                 time = c("21:23", "21:23", "21:24", "21:25", "21:26", "21:27",
                          "21:27", "21:28", "21:28", "21:29", "21:30", "21:31"))
Check out the bins argument of geom_histogram, which lets you specify the number of bins, that is, the number of buckets. For example,
ggplot(<your options>) + geom_histogram(bins = 30)
I used 30 bins since there are 3 * 30 = 90 minutes of data, so each bin covers 3 minutes. But you can make it a variable.
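If you also want the 3-minute group labels in the data frame itself (the grouping shown in the question), a base-R sketch along these lines should work, assuming dt as defined above (the column name group is illustrative):
# Convert "hh:mm" to minutes since midnight, then cut into 3-minute bins
tm <- as.character(dt$time)
mins <- as.integer(substr(tm, 1, 2)) * 60 + as.integer(substr(tm, 4, 5))
dt$group <- cut(mins, breaks = seq(min(mins), max(mins) + 3, by = 3),
                right = FALSE, labels = FALSE)
dt$group
#> [1] 1 1 1 1 2 2 2 2 2 3 3 3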

Finding occurrence of character from multiple vector or list

I wish to find the number of times a unique/distinct character occurs across multiple vectors or from a list.
Perhaps it's best to describe with an example:
In this example, let's say the "unique characters" are letters, and the multiple "vectors" are books. I wish to find the occurrence of the letters as the number of books increases.
# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))
# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")
# Finding the unique letters in each book
# This is the part I'm struggling to code in a loop fashion
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))
# Plot the desired output
plot(x = c(1, 2, 3),
     y = c(one_book, two_book, three_book),
     ylab = "Number of unique letters", xlab = "Book Number",
     main = "The occurrence of unique letters as the number of books increases")
To Note : The real data set is much bigger. Each vector (book_A, book_B...etc) is about 7000 in length.
I attempted to solve the problem with dplyr on a data frame, but I'm not quite there yet.
# Explore data frame option with an example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")
# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))
# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))
But I know the plot is not correct, as it is only plotting the cumulative sum rather than what I intended. Any hints would be most helpful.
To add to the complexity, it would be nice if the book with the fewest letters could be plotted first. Something along these lines:
# Example:
# Find the number of letters in each book
lapply(book_list, length)
# I know that book_B has the fewest letters (3);
# followed by book_A (4) and then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))
plot(x = c(1, 2, 3),
     y = c(one_book, two_book, three_book),
     ylab = "Number of letters", xlab = "Book Number")
You can use Reduce with accumulate = TRUE, i.e.
sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
#[1] 4 5 7
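The same idea handles the shortest-book-first ordering from the question: sort the list by length before accumulating (a sketch assuming book_list as above):
# Order the books by length, then accumulate the unions as before
sorted_books <- book_list[order(sapply(book_list, length))]
sapply(Reduce(c, sorted_books, accumulate = TRUE), function(i) length(unique(i)))
#[1] 3 5 7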

Lagged difference between rows in R: a different take

My question is similar to a few that have been asked before, but I hope different enough to warrant a separate question.
See here, and here. I'll pull some of the same example data as these questions. For context to my question- I am looking to see how my observed catch-rate (sea creatures) changed over multiple days of sampling the same area.
I want to calculate the difference between the first sample day at a given site (the first row of a given letter in the data below) and the subsequent sample days (the following rows of the same letter).
# Example data
df <- data.frame(id = c("A", "A", "A", "A", "B", "B", "B"),
                 num = c(1, 8, 6, 3, 7, 7, 9),
                 What_I_Want = c(NA, 7, 5, 2, NA, 0, 2))
The first solution that I found calculates a lagged difference between each row. I also wanted this calculation- so it was helpful to find:
# Calculate lagged differences
library(dplyr)
df_new <- df %>%
  # group by condition
  group_by(id) %>%
  # find difference
  mutate(diff = num - lag(num))
Here the difference is between A.1 and A.2; then A.2 and A.3 etc...
What I would like to do now is calculate the difference with respect to the first value of each group. So for letter A, I would like to calculate 8 - 1, then 6 - 1, and finally 3 - 1. Any suggestions?
One clunky solution (linked above) is to create two (or more) columns, one for each lag distance, and somehow merge the results that I want:
df_clunky <- df %>%
  group_by(id) %>%
  mutate(
    deltaLag1 = num - lag(num, 1),
    deltaLag2 = num - lag(num, 2))
Here is a base R method with replace and ave
ave(df$num , df$id, FUN=function(x) replace(x - x[1], 1, NA))
[1] NA 7 5 2 NA 0 2
ave applies the function to each id group. The function subtracts the first element from the whole vector, then replaces the first element (which would otherwise be 0) with NA.
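If you would rather stay in the dplyr pipeline from the question, an equivalent sketch using first() (assuming df as above):
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(diff = replace(num - first(num), 1, NA))
This reproduces the What_I_Want column.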

Subsetting in R based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text = ...)) approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Here is a solution using eval(parse(text = ...)), even if you obviously find it slow:
cond <- c('a==3 & b == "a"', 'a==2', 'a==1 & b=="a" & c=="m"', 'c=="f"')
names(cond) <- cond
results_vector <- lapply(cond, function(x)
  sum(e[eval(parse(text = x)), "d"]))
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition:
results_vector[cond[2]]
$`a==2`
[1] 2
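If you want to avoid eval(parse(text = ...)) entirely, one alternative (not from the original answers) is to store each condition as a predicate function of the data frame rather than as a string; a sketch assuming the data frame e from the question:
# Each condition is a function that returns a logical row index
cond_funs <- list(
  function(d) d$a == 3 & d$b == "a",
  function(d) d$a == 2,
  function(d) d$a == 1 & d$b == "a" & d$c == "m",
  function(d) d$c == "f"
)
sapply(cond_funs, function(f) sum(e[f(e), "d"]))
#[1]  3  2  1 11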
Here is a function that takes one condition per column as arguments (NA if there is no condition on a column) and sums over a selected column of a selected data.frame:
conds.by.col <- function(..., sumcol, DF)  # NA if no condition on a column
{
  conds.ls <- list(...)
  res.ls <- vector("list", length(conds.ls))
  for (i in seq_along(conds.ls))
  {
    res.ls[[i]] <- which(DF[, i] == conds.ls[[i]])
  }
  res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
  which_rows <- Reduce(intersect, res.ls)
  return(sum(DF[which_rows, sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
# all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
                      con2 = c(NA, "a", NA),
                      con3 = c(1, NA, "f"),
                      stringsAsFactors = FALSE)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first word that comes to mind looking at this, though...
