Wordclouds with absolute word sizes in R

I'm trying to make several wordclouds to compare terms, which themselves are nested within groups. I would like to make one wordcloud per group. The wordcloud package in R can make the wordclouds I need, but each new wordcloud has the size of the words scaled relative to the maximum and minimum word frequency in that cloud. This scaling is controlled by the scale parameter.
My aim is to make wordclouds where the size of the word is absolutely related to the frequency of the word, enabling different wordclouds to be visually compared.
library(wordcloud)
dat <- data.frame(word = rep(LETTERS[1:3], 2),
                  freq = c(10, 5, 3, 20, 10, 6),
                  group = c(1, 1, 1, 2, 2, 2))
dat
#  word freq group
#1    A   10     1
#2    B    5     1
#3    C    3     1
#4    A   20     2
#5    B   10     2
#6    C    6     2
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1])
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2]) # Currently the same
Both groups in the MWE currently produce the same wordcloud with the commands above (although the exact placement varies randomly with each run). I would like each letter in the second group's wordcloud to be twice as large as in the first, in line with the data (or at least some sensible scaled difference, even if it is not linear).
How can this be achieved?

Hm, this might be a roundabout way, but what if we set the scale for every group using a single anchor?
anchor <- max(dat$freq)
wordcloud(dat$word[dat$group == 1], dat$freq[dat$group == 1], scale = c(8*max(dat$freq[dat$group == 1])/anchor, 0.5))
wordcloud(dat$word[dat$group == 2], dat$freq[dat$group == 2], scale = c(8*max(dat$freq[dat$group == 2])/anchor, 0.5))
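The same anchor idea can be wrapped in a loop over the groups (a minimal sketch building on the code above; the 8 and 0.5 are simply the scale endpoints assumed there):
anchor <- max(dat$freq) # global maximum across all groups
for (grp in unique(dat$group)) {
  sub <- dat[dat$group == grp, ]
  # shrink the upper scale limit in proportion to this group's largest frequency
  wordcloud(sub$word, sub$freq, scale = c(8 * max(sub$freq) / anchor, 0.5))
}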

Related

Cumulative sum with a threshold window in R data.table

I want to calculate the rolling sum of n rows in my dataset, where the window size n depends on the sum itself. For example, I want to slide the window as soon as the rolling sum of time exceeds 5 minutes. Basically, I want to calculate how much distance the person traveled in the last 5 minutes, but the time steps are not equally spaced. Here's a dummy table for clarity (the last two columns of the desired output are what I need):
I am looking for a data.table solution in R.
Input data table:
ID  Distance  Time
 1         2     2
 1         4     1
 1         2     1
 1         2     2
 1         3     3
 1         6     3
 1         1     1
Desired Output:
ID  Distance  Time  5.min.rolling.distance  5.min.rolling.time
 1         2     2                      NA                  NA
 1         4     1                      NA                  NA
 1         2     1                      NA                  NA
 1         2     2                      10                   6
 1         3     3                       5                   5
 1         6     3                       9                   6
 1         1     1                      10                   7
Here is a solution that works with double time units as well as a simpler solution that will work with integer time units. I tested the double solution on 10,000 records and on my 2015 laptop it executed instantly. I can't make any guarantees about performance on 40 GB of data.
If you wanted to generalize this code, I'd look at the RcppRoll package and learn how to implement C++ code in R.
Solution with double time units
I broke this down into two problems. First, figure out the window size by looking back until we get to at least 5 minutes (or run out of data). Second, take the sum of distances and times from the current observation back to that look-back position.
Bad loop code in R usually tries to 'grow' a vector; it's a huge efficiency gain to pre-allocate the vector to its full length and then modify its elements in place.
input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)

var_window_cumsum <- function(input, MIN_TIME) {
  if (is.null(input$time) || is.null(input$dist)) {
    stop("input must have variables time and dist that record the row's duration and distance traveled.")
  }
  n <- nrow(input)
  # First, figure out how far we need to look back: this vector will store
  # the position of the first record that gets our target record up to 5 min or
  # more. If we can't look back far enough to reach 5 min, we leave it as NA.
  time_indx <- rep(NA_integer_, length.out = n) # always preallocate your vector!
  for (time in 1:n) {
    prior <- time # start at self in case the observation is already >= MIN_TIME
    while (sum(input$time[time:prior]) < MIN_TIME && prior > 1) {
      prior <- prior - 1
    }
    # if we can't look back to our minimum time, leave the indx as NA
    if (sum(input$time[time:prior]) >= MIN_TIME) {
      time_indx[time] <- prior
    }
  }
  # Now that we know how far to look back, it's easy to find the total distance
  # and total time over the window.
  dist5 <- rep(NA_real_, n)
  time5 <- rep(NA_real_, n)
  for (i in 1:n) {
    dist5[i] <- ifelse(!is.na(time_indx[i]),
                       sum(input$dist[i:time_indx[i]]),
                       NA)
    time5[i] <- ifelse(!is.na(time_indx[i]),
                       sum(input$time[i:time_indx[i]]),
                       NA)
  }
  cbind(input,
        window_dist = dist5,
        window_time = time5,
        window_start = time_indx)
}
# output looks good
# Warning: example data does not include exhaustive cases
# I have not set up thorough testing
var_window_cumsum(input, 5)
# Test on a larger dataset, 10k records
set.seed(1234)
n <- 10000
med_input <- data.frame(
dist = sample(1:5, n, replace = TRUE),
time = sample(1:60, n, replace = TRUE) / 10
)
# you should inspect this to make sure there are no errors
med_output <- var_window_cumsum(med_input, 5)
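As a quick sanity check (my addition, not part of the original answer), the function's output on the example data can be compared against the desired rolling columns from the question; all.equal() treats NAs in matching positions as equal, unlike ==:
out <- var_window_cumsum(input, 5)
isTRUE(all.equal(out$window_dist, c(NA, NA, NA, 10, 5, 9, 10)))
#> [1] TRUE
isTRUE(all.equal(out$window_time, c(NA, NA, NA, 6, 5, 6, 7)))
#> [1] TRUE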
Solution with integer time units
If your time unit is integer-valued and your data isn't too big, it may work to complete your dataset. This is a little bit of a hack, but here I create a continuous timeid variable that runs from the starting time to the maximum time, with one row for each integer unit of time. From there it's easy to calculate a rolling cumulative sum over the last five time units. Finally, we get rid of all the fake rows we added (you want to make sure to do that, because they carry invalid cumulative sums). Also note that I use roll_sumr and not roll_sum; roll_sumr pads the left side of the output vector with NAs for the first 4 units.
library(tidyverse)
library(RcppRoll)

input <- data.frame(
  dist = c(2, 4, 2, 2, 3, 6, 1),
  time = c(2, 1, 1, 2, 3, 3, 1)
)
desired_dist5 <- c(NA, NA, NA, 10, 5, 9, 10)
desired_time5 <- c(NA, NA, NA, 6, 5, 6, 7)

output <- input %>%
  mutate(timeid = cumsum(time),
         realrow = TRUE) %>%
  complete(timeid = 1:max(timeid)) %>%
  mutate(dist5 = roll_sumr(dist, 5, na.rm = TRUE),
         time5 = roll_sumr(time, 5, na.rm = TRUE)) %>%
  filter(realrow) %>%
  select(-c(realrow, timeid))

# Check against the example table
output$dist5 == desired_dist5
output$time5 == desired_time5
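One caveat (my addition): the == checks above return NA rather than TRUE in the positions where the desired values are NA. An NA-aware comparison could look like this:
isTRUE(all.equal(output$dist5, desired_dist5))
isTRUE(all.equal(output$time5, desired_time5))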

Control which nodes change size in igraph

This may look simple, but I am not able to do it. I want to plot two types of nodes, small and big, using a cutoff. The values are in the attribute degree.
Here is a small toy example
library(igraph)

g1 <- graph(edges = c(1,2, 2,3, 3,1, 4,2), n = 4, directed = FALSE) %>%
  set_vertex_attr("names", value = LETTERS[1:4])
g1_degree <- degree(g1, mode = "total")
g1_degree
#[1] 2 3 2 1
g1 <- set_vertex_attr(g1, "degree", value = g1_degree)
plot(g1, vertex.size = V(g1)$degree)
This sizes every node according to its degree, but I want nodes of degree 2 and 3 to be big and nodes of degree 1 to be small.
So I tried to edit the values within V(g1)$degree
ifelse(V(g1)$degree < 2, yes = V(g1)$degree==1, no = V(g1)$degree==5)
#[1] FALSE FALSE FALSE TRUE
Ok, I checked my degree values, but how can I overwrite the TRUE or FALSE using the cutoffs I need?
Here are two solutions.
One with ifelse, like in the question.
g1 <- set_vertex_attr(g1, "degree", value = ifelse(V(g1)$degree < 2, 1, 5))
V(g1)$degree
#[1] 5 5 5 1
And another with findInterval. This has better performance than ifelse, which can be important in large data sets.
i <- findInterval(V(g1)$degree, c(0, 2, Inf))
g1 <- set_vertex_attr(g1, "degree", value = c(1, 5)[i])
V(g1)$degree
#[1] 5 5 5 1
With different new sizes set with the findInterval index, c(10, 50)[i], the graph would look like below.
g1 <- set_vertex_attr(g1, "degree", value = c(10, 50)[i])
plot(g1, vertex.size = V(g1)$degree)
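The same indexing trick extends to more than two sizes. A small sketch (my addition, reusing the original g1_degree vector; the break points and sizes are purely illustrative):
# three buckets: degree 1, degree 2, and degree 3 or more
i3 <- findInterval(g1_degree, c(0, 2, 3, Inf))
plot(g1, vertex.size = c(10, 30, 50)[i3])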

Finding number of identical groups

Consider a clustering problem, where the true class labels are known (say g).
Suppose p denotes the predicted cluster labels (obtained by any clustering approach).
So both g and p split the data set into groups, though the number of groups need not be the same in the two cases.
Among these two sets of groups, in some cases a group under g will be identical to a group under p, even though their labels may differ. I want to find the number of such groups, i.e. the number of cases where the clustering method detects a class perfectly.
I understand this is not a standard way to evaluate clustering (Rand Index, Dunn Index, etc. are recommended), but I am interested in it. I also understand that this number will be very small for most real-life data, maybe even 0, but the data set I am currently working with has a large number of classes (around 1500), with at most 15 observations in any one class. So, in this case, this number is likely to be quite high.
Here is a reproducible example and my attempt (working) at the solution:
# true labels
g <- c(1, 1, 2, 2, 2, 1, 3, 3, 3, 4)
# predicted labels
p <- c(3, 3, 1, 1, 1, 3, 4, 4, 1, 2)
# correctly detected groups
n_correct <- 2 # (1st class and 3rd cluster), (4th class and 2nd cluster)
# attempt
distinct_class_labels <- unique(x = g)
counter <- 0
for (i in seq_along(along.with = distinct_class_labels))
{
  cluster_labels_of_obs_in_ith_class <- subset(x = p,
                                               subset = (g == distinct_class_labels[i]))
  unique_cluster_labels_of_obs_in_ith_class <- unique(x = cluster_labels_of_obs_in_ith_class)
  if (length(x = unique_cluster_labels_of_obs_in_ith_class) == 1)
  {
    class_labels_of_obs_in_this_cluster <- subset(x = g,
                                                  subset = (p == unique_cluster_labels_of_obs_in_ith_class))
    if (length(x = unique(x = class_labels_of_obs_in_this_cluster)) == 1)
    {
      counter <- (counter + 1)
    }
  }
}
counter
#> [1] 2
Created on 2019-05-22 by the reprex package (v0.3.0)
This works correctly, but it takes time (and I do not like this method). I suppose one could use dplyr::group_by with both g and p separately and somehow compare the groups of the two objects. I guess there are better approaches, and I would highly appreciate such answers.
Thanks.
If you are also interested in which combinations of classes and clusters were correctly detected, you can try this:
library(tidyverse)
tibble(g = g, p = p) %>%
  distinct(g, p) %>%                 # unique combinations of g and p
  add_count(g, name = "g_count") %>% # count how often each class/label occurs in g and p;
  add_count(p, name = "p_count") %>% # when it is unambiguously assigned, the count is 1
  filter(g_count == 1 & p_count == 1) %>%
  select(g, p)
# A tibble: 2 x 2
      g     p
  <dbl> <dbl>
1     1     3
2     4     2
The number of rows (you can use nrow()) gives you the number of correctly detected groups.
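For just the count, the same pipeline can simply end in nrow() (a small sketch of that):
tibble(g = g, p = p) %>%
  distinct(g, p) %>%
  add_count(g, name = "g_count") %>%
  add_count(p, name = "p_count") %>%
  filter(g_count == 1 & p_count == 1) %>%
  nrow()
#> [1] 2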
Convert g and p to factor with levels specified based on their occurrence in the vector and count the frequencies that match.
sum(table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g))))
#[1] 2
To understand, see
table(factor(p, levels = unique(p)))
#3 1 4 2
#3 4 2 1
table(factor(g, levels = unique(g)))
#1 2 3 4
#3 3 3 1
We can ignore the labels (the group labels are not the same in g and p) and focus only on the frequencies. The first and fourth values have the same frequency, hence the count of 2.
If you want to find out which groups are similar, you can do
inds <- table(factor(p, levels = unique(p))) == table(factor(g, levels = unique(g)))
unique(p)[inds]
#[1] 3 2
unique(g)[inds]
#[1] 1 4
This says that group 3 in p is similar to group 1 in g, and likewise group 2 in p corresponds to group 4 in g.
Before solving it with table, I did it with split, although the underlying logic is the same.
sum(lengths(split(p, factor(p, levels = unique(p)))) ==
lengths(split(g, factor(g, levels = unique(g)))))
EDIT
If g and p can contain different sets of labels (for example, a different number of groups), we need to combine the levels so that both tables cover all of them. For example,
g1 <- c(g, 5)
p1 <- c(p, 1)
sum(table(factor(p1, levels = unique(c(p1, g1)))) ==
table(factor(g1, levels = unique(c(g1, p1)))))
#[1] 2

Plot every 10 data points in a vector in a different color in R

I have a one-dimensional vector in R which I would like to plot so that every 10 data points have a different color. How do I do this in R with the normal plot function, with ggplot, and with plotly?
In base R you can try this. I changed the data a little bit compared to the other answer.
# The data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));
nCol <- 10;
df$col <- rep(1:10, each = 10);
# base R plot
plot(df[1:2]) #add `type="n"` to remove the points
sapply(1:nrow(df), function(x) lines(df[x+0:1,1:2], col=df$col[x], lwd=2))
Because the col parameter of lines() does not vary the colour along a single line (it would just be recycled), you have to use a loop (here sapply) over the rows and plot the segments individually.
Here is a ggplot solution; unfortunately you don't provide sample data, so I'm generating some random data.
# Sample data
set.seed(2017);
df <- data.frame(x = 1:100, y = 0.001 * 1:100 + runif(100));

# The number of different colours
nCol <- 5;
df$col <- rep(1:nCol, each = 10);

# ggplot
library(tidyverse);
ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
  geom_line();
For plotly just wrap the ggplot call within ggplotly.
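For example (a minimal sketch; p is just a name I give the ggplot object built above):
library(plotly)
p <- ggplot(df, aes(x = x, y = y, col = as.factor(col), group = 1)) +
  geom_line()
ggplotly(p)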
This answer doesn't show you how to do it in a specific plotting package, but instead shows how to assign random colors to your data according to your specifications. The benefit of this approach is that it gives you control over which colors you use if you choose.
library(dplyr) # assumed okay given ggplot2 mention
df = data_frame(v1=rnorm(100))
n = nrow(df)
df$group = (1:n - (1:n %% -10)) / 10
colors = sample(colors(), max(df$group), replace=FALSE)
df$color = colors[df$group]
df %>% group_by(group) %>% filter(row_number() <= 2) %>% ungroup()
# A tibble: 20 x 3
              v1 group           color
           <dbl> <dbl>           <chr>
 1 -0.6941434087     1 lightsteelblue2
 2 -0.4559695973     1 lightsteelblue2
 3  0.7567737300     2  darkgoldenrod2
 4  0.9478937275     2  darkgoldenrod2
 5 -1.2358486079     3      slategray3
 6 -0.7068140340     3      slategray3
 7  1.3625895045     4        cornsilk
 8 -2.0416315923     4        cornsilk
 9 -0.6273386846     5  darkgoldenrod4
10 -0.5884521130     5  darkgoldenrod4
11  0.0645078975     6   antiquewhite1
12  1.3176727205     6   antiquewhite1
13 -1.9082708004     7           khaki
14  0.2898018693     7           khaki
15  0.7276799336     8     greenyellow
16  0.2601492048     8     greenyellow
17 -0.0514811315     9       seagreen1
18  0.8122600269     9       seagreen1
19  0.0004641533    10   darkseagreen4
20 -0.9032770589    10   darkseagreen4
The above code first creates a fake dataset with 100 rows of data and sets n equal to 100. df$group is set by taking the row numbers (1:n) and performing a rather convoluted calculation to get a vector of numbers like c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, ..., 10). It then samples the colors available in base R, returning as many colors as there are groups (max(df$group)), and uses the group variable to index the color vector to get each row's color. The final output is just the first two rows of each group, to show that the colors are the same within a group but different between groups. The color column can now be passed as a variable to your various plotting environments.
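As a side note (my addition), the convoluted group expression is just a ceiling division into blocks of 10, and the assigned colors can be passed straight to base plot:
# same blocks of 10 as the expression above
all(df$group == ceiling(1:n / 10))
#> [1] TRUE
# color each point by its block
plot(df$v1, col = df$color, pch = 19)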

How to add a column that gives the result of an operation on every row relative to current row?

I have a data frame with a set of x and y points. I need to calculate the Euclidean distance of every point relative to every other point. Then I have to figure out, for each row, how many points are within a given range.
For example, if I had this data frame:
x y
- -
1 2
2 2
9 9
I should add a column that indicates how many points (if we consider these points to be in a Cartesian plane) are within a distance of 3 units of each point.
x y n
- - -
1 2 1
2 2 1
9 9 0
Thus, the first point (1,2) has one other point (2,2) that is within that range, whereas the point (9,9) has 0 points at a distance of 3 units.
I could do this with a couple of nested for loops, but I am interested in solving this in R in an idiomatic way, preferably using dplyr or another library.
This is what I have:
ddply(.data=mydataframe, .variables('x', 'y'), .fun=count.in.range)
count.in.range <- function(df) {
  xp <- df$x
  yp <- df$y
  return(nrow(filter(df, dist(rbind(c(x, y), c(xp, yp))) < 3)))
}
But, for some reason, this doesn't work. I think it has to do with filter.
Given
df_ <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
You can use the function "dist":
matrix_dist <- as.matrix(dist(df_))
df_$n <- rowSums(matrix_dist <= 3) - 1 # subtract 1 so a point does not count itself
This is a base R approach with a straightforward application of a "distance function", but only on a row-by-row basis:
apply(df_, 1, function(x) sum((x[1] - df_[['x']])^2 + (x[2] - df_[['y']])^2 <= 9) - 1)
#[1] 1 1 0
It's also really a "sweep" operation, although I wouldn't really expect a performance improvement.
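For illustration (my addition, not part of the answer), the same all-pairs count can also be written in a vectorised way with outer():
# squared distances between all pairs of points
d2 <- outer(df_$x, df_$x, "-")^2 + outer(df_$y, df_$y, "-")^2
rowSums(d2 <= 9) - 1 # within distance 3, excluding the point itself
#[1] 1 1 0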
I would suggest you work with pairs of points in the long format and then use a data.table solution, which is probably one of the fastest alternatives for working with large datasets.
library(data.table)
library(reshape)
df <- data.frame(x = c(1, 2, 9),
y = c(2, 2, 9))
The first thing you need to do is to reshape your data to long format with all possible combinations of pairs of points:
df_long <- expand.grid.df(df, df)
# rename columns
setDT(df_long)
setnames(df_long, c("x", "y", "x1", "y1"))
Now you only need to do this:
# calculate the distance between each pair of points
df_long[, mydist := dist(matrix(c(x, x1, y, y1), ncol = 2, nrow = 2)), by = .(x, y, x1, y1)]
# count how many points are within a distance of 3 units
df_long[mydist < 3, .(count = .N), by = .(x, y)]
#>    x y count
#> 1: 1 2     2
#> 2: 2 2     2
#> 3: 9 9     1
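Note that this count includes each point itself (its distance to itself is 0). If, as in the question's expected output, the point itself should not be counted, subtracting 1 from .N gives n = 1, 1, 0:
df_long[mydist < 3, .(count = .N - 1), by = .(x, y)]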
