R Display by Unique Value and Frequency

I have a dataset like the one below, except with around 5 million observations. In previous code I have already filtered the dates based on the time they were recorded, keeping only calls made during working hours. Now I want to split the dates by WORKERCALL_ID so I can see, for each worker, a list of all unique dates and the number of times each WORKERCALL_ID shows up on each date (the number of calls per date, separated by WORKERCALL_ID). I tried to do this using a contingency matrix and then converting it to a data frame, but the file is so large that my R session always aborts. Does anyone have any idea how to accomplish this?
WORKERCALL_ID DATE
124789244 02-01-2014
128324834 05-01-2014
124184728 06-10-2014
An example of the desired output is below, for each WORKERCALL_ID and date. My end goal is to be able to subset the result and remove the rows/IDs with a high frequency of calls.
WORKERCALL_ID DATE FREQ
124789244 02-01-2014 4
124789244 02-23-2014 1

Two options:
table(df$WORKERCALL_ID, df$DATE)
Or, using dplyr (also including the requested filtering out of IDs that have any frequency higher than 5):
df %>%
  group_by(WORKERCALL_ID, DATE) %>%
  summarize(freq = n()) %>%
  group_by(WORKERCALL_ID) %>%
  filter(!any(freq > 5))
Example:
rbind(as.data.frame(df),
      data.frame(WORKERCALL_ID = 128324834, DATE = "Moose", freq = 6, stringsAsFactors = FALSE)) %>%
  group_by(WORKERCALL_ID) %>%
  filter(!any(freq > 5))
# A tibble: 2 x 3
# Groups: WORKERCALL_ID [2]
WORKERCALL_ID DATE freq
<dbl> <chr> <dbl>
1 124184728. 06-10-2014 1.
2 124789244. 02-01-2014 1.
Note how ID 128324834 is removed from the final result.

I would use dplyr::count
library(dplyr)
count(df,WORKERCALL_ID,DATE)
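If you also need the follow-up step from the question (dropping IDs with a high call frequency), here is a sketch building on count(); the threshold of 5 is only an assumed example value:
library(dplyr)

df %>%
  count(WORKERCALL_ID, DATE, name = "FREQ") %>%   # one row per ID/date pair with its frequency
  group_by(WORKERCALL_ID) %>%
  filter(!any(FREQ > 5)) %>%                      # drop IDs that ever exceed the threshold
  ungroup()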

Related

R language: how to return and print a list of missing entries based on two columns

I'm struggling to write R code that prints a "list of dates that do not have data between given start and end dates for all the possible values of another variable / column in a table". It's a little difficult to explain in words, so I'll give a very simplified example that will hopefully make it clear what I'm trying to do.
You are the manager of a pet store and in charge of checking the quality of pet food sales data. The data comes in a csv file with four columns; date, type of animal food, sales price, and quantity sold. The animal_type column can have 3 possible values; dog, cat, or bird in string format.
I've simulated the first three days worth of data for the month of December in a very simplified manner below. The price and quantity columns aren't relevant and so I've left them blank.
date         animal_type   price   quantity
2021-12-01   dog
2021-12-01   dog
2021-12-01   cat
2021-12-01   bird
2021-12-02   dog
2021-12-02   bird
2021-12-03   cat
2021-12-03   cat
2021-12-03   cat
What I'm trying to do is print out / return the dates that don't have entries for all the possible values in the animal_type column. So for my example, what I'm looking to print out is something like...
2021-12-02 : ['cat']
2021-12-03 : ['dog', 'bird']
Because [2021-12-02] doesn't have an entry for 'cat' and [2021-12-03] doesn't have entries for 'dog' or 'bird' in the data. However, I've only been able to get a count of the number of unique animal_type values for each date so far with the following functions.
library(tidyverse)
library(dplyr)
df %>% group_by(date) %>% summarise(n = n_distinct(unique(animal_type))) # sums the number of unique animal_type appearing in all the entries for every date
df %>% group_by(animal_type) %>% summarise(n = n_distinct(unique(date))) # sums the number of unique dates that appear in all the entries for every animal_type
# output for "sums the number of unique animal_type appearing in all the entries for every date"
date n
<date> <int>
1 2021-12-01 3
2 2021-12-02 2
3 2021-12-03 1
# output for "sums the number of unique dates that appear in all the entries for every animal_type"
animal_type num_dates
<chr> <int>
1 dog 2
2 cat 2
3 bird 2
This can tell me which dates have missing animal_type values, but not which one(s) specifically. I've tried looking around but couldn't find many similar problems, so I'm wondering how feasible this is. I'm also rusty with R and relearning much of the syntax, packages, and libraries, so I could be missing something simple. I'm open to both tidyverse/dplyr and base R advice, as you can likely see from my code. I would appreciate any help and thank you for your time!
You can use both the tidyr::complete function and an anti-join.
First you have to complete the implicit missing values and then anti-join the completed tibble with the one you currently have.
See the example below
library(tidyverse)
example <- crossing("Date" = c("2021-12-01", "2021-12-02", "2021-12-03"),
                    "Pet"  = c("Bird", "Cat", "Dog"))
op_example <- example %>% slice(-c(5, 7, 9))

op_example %>%
  complete(Date, Pet) %>%
  anti_join(op_example)
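Applied to the question's own column names (date and animal_type, with the sales table assumed to be in df), a sketch that also collapses the result into the requested one-line-per-date format:
library(tidyverse)

missing_combos <- df %>%
  distinct(date, animal_type) %>%
  complete(date, animal_type) %>%                  # add the implicit missing combinations
  anti_join(df, by = c("date", "animal_type"))     # keep only combinations absent from the data

# One row per date listing the missing animal types
missing_combos %>%
  group_by(date) %>%
  summarise(missing = paste(animal_type, collapse = ", "))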

How can I get a conditional statement to select the most recent timestamp?

I have a data frame consisting of ~1,000,000 rows and am classifying some data.
Where there are two or more dates present against a record, I want to use the first date in a new field called Day1 and the second date in a field called Day2.
I achieve this thus:
df %>%
  group_by(pii, cn) %>%
  summarise(Day1 = min(TestDate, na.rm = TRUE),         # Selects the first available date
            Day2 = sort(TestDate, na.last = TRUE)[2])   # Selects the second available date
However, I have come across a problem affecting around 1.6% of the records (~14,000) where there are only two dates listed, which are identical.
In this case, I want to be able to look at the time listed against each date (recorded in df$time) to determine which came first, still with the intention of taking the first (earlier) date as Day1 and the second as Day2.
How can I incorporate this into my current structure?
For the sake of an illustrative example (albeit non-functioning), I am thinking that it could be something like this:
if_else(sort(TestDate,na.last = TRUE)[2] == Day1, [CHECK TIMES HERE], sort(TestDate, na.last = TRUE)[2])
As such, I would hope for something like this as an output:
id Day1 D1_Time Day2 D2_Time
1 2021-01-02 NA 2021-01-04 NA
2 2021-01-01 04.45 2021-01-01 04.48
3 2021-01-03 NA 2021-01-08 NA
In this output example, the record with id value 2 has two identical dates listed, so the df$time field was consulted to determine which came first.
I think this would solve your problem (though it might not answer your specific question). I'd do something like this:
library(dplyr)
library(tidyr)
df %>%
  group_by(pii, cn) %>%
  arrange(TestDate, time) %>%   # order within the groups by date and time
  mutate(rownum = 1:n()) %>%    # number the rows in order
  filter(rownum <= 2) %>%       # only keep the top 2 in each group. slice_head(n = 2) would be an alternative to the last 2 steps, but I want the row number below
  ungroup() %>%
  pivot_wider(names_from = rownum,                 # spread to match your desired output
              values_from = c(TestDate, time)) %>%
  select(pii, cn, TestDate_1, time_1, TestDate_2, time_2)   # reorder the columns to match your sample
Or, if you don't like that approach, could you combine each date/time pair into a single datetime field, and use your original logic?
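A sketch of that second idea, assuming the times are stored as "HH.MM" strings as in the sample output (adjust the format string if your real data differs); missing times are treated as "00.00" so the datetime is still defined:
library(dplyr)

df %>%
  mutate(TestDateTime = as.POSIXct(paste(TestDate, coalesce(time, "00.00")),
                                   format = "%Y-%m-%d %H.%M")) %>%
  group_by(pii, cn) %>%
  summarise(Day1 = min(TestDateTime, na.rm = TRUE),          # earliest date-time
            Day2 = sort(TestDateTime, na.last = TRUE)[2])    # second earliest date-time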

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer"  = c(1, 1, 1, 2, 3, 3),
                 "Visit"     = c(1, 2, 3, 1, 1, 2),   # e.g. customer ID #1 has visited the site three times.
                 "Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25",
                                 "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
library(dplyr)

df %>%
  mutate(Timestamp = as.POSIXct(as.character(Timestamp))) %>%   # convert from character/factor first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(Mean_Interval = mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer will be NA (including customers with only one visit), so it is dropped by na.rm = TRUE when taking the mean.
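For the other goals in the question (expressing the result in minutes and computing a grand mean across customers), a sketch along the same lines; column names are taken from the example data frame:
library(dplyr)

per_customer <- df %>%
  mutate(Timestamp = as.POSIXct(as.character(Timestamp))) %>%
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  summarise(Mean_Interval = mean(as.numeric(difftime(Timestamp, lag(Timestamp),
                                                     units = "mins")), na.rm = TRUE))

mean(per_customer$Mean_Interval, na.rm = TRUE)   # grand mean in minutes; single-visit customers come out as NaN and are dropped here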
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the averages.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
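For example (a sketch only; substitute whatever layout your real strings actually use):
library(lubridate)

df$Timestamp <- ymd_hms(df$Timestamp)   # for strings like "2019-12-31 12:13:25"

# If several layouts are mixed together, parse_date_time() can try them in turn:
# df$Timestamp <- parse_date_time(df$Timestamp, orders = c("ymd HMS", "dmy HMS", "mdy HM"))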

Filtering Data based on another dataframe based on two rows

I have two Datasets.
The first dataset includes Companies, the Quarter and the corresponding value from the whole timespan.
Quarter Date Company value
2012.1 2012-12-28 x 1
2013.1 2013-01-02 y 2
2013.1 2013-01-03 z 3
Companies again are in the dataset over the whole time and show up multiple times.
The other dataset is an index which includes a company identifier and the quarter in which it existed in the index (Companies can be in the index in multiple quarters).
Quarter Date Company value
2012.1 2012-12-28 x 1
2014.1 2013-01-02 y 2
2013.1 2013-01-03 x 3
Now I need to only select the companies which are in the index at the same time (quarter) as I have data from the first dataset.
In the example above I would need the data from company x in both quarters, but company y needs to get kicked out because the data is available in the wrong quarter.
I tried multiple functions including filter, subset and match but never got the desired result. It always filters either too much or too little.
data %>% filter(Company == index$Company & Quarter == index$Quarter)
or
data[Company == index$Company & Quarter = index$Quarter,]
Something with my conditions doesn't seem right. Any help is appreciated!
Have a look at dplyr's powerful join functions. Here, inner_join might help you:
dplyr::inner_join(df1, df2, by=c("Company", "Quarter"))
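If you only want to filter the first dataset without pulling in any columns from the index, a filtering join does the same check; this sketch assumes df1 is the data and df2 is the index, as above:
library(dplyr)

# Keep only the rows of df1 whose Company/Quarter pair also appears in df2
semi_join(df1, df2, by = c("Company", "Quarter"))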

Group dates by time doing the mean in the rest of the columns

Hi and thanks in advance,
I need to group the rows of this data set by date. I imported it with read.table, and one added problem is that all of the variables are factors:
Date; Time; Global_active_power; Global_reactive_power; Voltage
16/12/2006; 00:00:00; 4.216; 0.418; 234.840
16/12/2006; 00:01:00; 5.360; 0.436; 233.630
16/12/2006; 00:02:00; 5.360; 0.436; 233.630
.....
17/12/2006; 00:00:00; 1.044; 0.152; 242.730
As well as grouping by date, I need to calculate the mean of every column, so that all of the records for a day are summarized in just one row, like this:
Date; Time; Global_active_power; Global_reactive_power; Voltage
16/12/2006; - MEAN ALL MEASURES OF THE DAY
After doing that, I'll delete the Time column, since I just need the mean of the measures for each day over a period of time.
Thanks again !
You can do this using the dplyr package assuming that your data is in a data frame df:
library(dplyr)

result <- df %>% group_by(Date) %>%     ## 1.
  select(-Time) %>%                     ## 2.
  mutate_each(funs(as.numeric)) %>%     ## 3.
  summarise_each(funs(mean))            ## 4.
In fact, the commands reflect what you want to accomplish.
Notes:
First group_by the Date column so that the subsequent mean is computed with respect to values over all times for the date.
Then select all other columns except for the Time column using select(-Time).
As you pointed out, the columns to be averaged need to be numeric instead of factors, so convert each to numeric as necessary. This uses mutate_each to apply the as.numeric function to each selected column.
Finally, summarise_each of these selected columns applying the mean function to each column.
Using the data you provided:
print(result)
## # A tibble: 2 x 4
## Date Global_active_power Global_reactive_power Voltage
## <chr> <dbl> <dbl> <dbl>
##1 16/12/2006 4.978667 0.430 234.0333
##2 17/12/2006 1.044000 0.152 242.7300
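On current dplyr releases, mutate_each()/summarise_each() are superseded; here is a sketch of the same pipeline using across() (the as.character() step is there because the columns arrive as factors):
library(dplyr)

result <- df %>%
  group_by(Date) %>%
  select(-Time) %>%
  summarise(across(everything(), ~ mean(as.numeric(as.character(.x)))))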
Hope this helps.
