Grouping data based on repetitive records using R

I have a dataset which contains repetitive records/common records. It looks something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| B | P | 150 |
| C | Q | 300 |
| A | P | 290 |
I need to group similar records together, but I do not want to summarize my amount. I want each amount value to be represented individually. The output should look something like this:
| Vendor | Buyer | Amount |
|--------|:-----:|-------:|
| A | P | 100 |
| A | P | 290 |
| | | |
| B | P | 150 |
| | | |
| C | Q | 300 |
I thought of using split(), but my original data has so many records that split() creates too many lists, and it becomes tedious to build new datasets from them. How can I achieve the output above with another method?
EDIT:
Let us assume that we have an additional column called Date and the dataset now looks like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 100 | 3/6/2019 |
| B | P | 150 | 7/6/2018 |
| C | Q | 300 | 4/21/2018 |
| A | P | 290 | 6/5/2018 |
Once each vendor and buyer is grouped together, I need to arrange the dates in ascending order within each group, so that it looks something like this:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|-----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |
| | | | |
| B | P | 150 | 7/6/2018 |
| | | | |
| C | Q | 300 | 4/21/2018 |
and then remove the single transactions so that the final table contains only:
| Vendor | Buyer | Amount | Date |
|--------|:-----:|-------:|----------|
| A | P | 290 | 6/5/2018 |
| A | P | 100 | 3/6/2019 |

In the following approaches we sort the data frame and add a group column, which makes subsequent processing of individual groups easy. For example, the groups can be processed without creating a large split of DF:
for(g in unique(DFout$group)) {
  DFsub <- subset(DFout, group == g)  # rows of the current Vendor/Buyer group
  ... process DFsub ...
}
1) Base R Sort the data and then assign the group column using cumsum on the non-duplicated elements.
o <- with(DF, order(Vendor, Buyer))
DFo <- DF[o, ]
DFout <- transform(DFo, group = cumsum(!duplicated(data.frame(Vendor, Buyer))))
DFout
giving:
Vendor Buyer Amount group
1 A P 100 1
4 A P 290 1
2 B P 150 2
3 C Q 300 3
I am not sure this is such a good idea in the first place, but if you really want to add a row of NAs after each group:
# append an NA marker after each run of group numbers
ix <- unname(unlist(tapply(DFout$group, DFout$group, function(x) c(x, NA))))
# replace the non-NA entries with the corresponding row positions
ix[!is.na(ix)] <- seq_len(nrow(DFout))
# indexing with NA produces an all-NA separator row after each group
DFout[ix, ]
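To also cover the edit (sorting each Vendor/Buyer group by Date and dropping groups that have only one transaction), here is a minimal base R sketch. It assumes the edited data frame is called DF2 and that its Date column is in month/day/year format; both are assumptions, not part of the original answer.
DF2$Date <- as.Date(DF2$Date, format = "%m/%d/%Y")   # assumed month/day/year dates
o <- with(DF2, order(Vendor, Buyer, Date))           # sort by group, then ascending date
DFo <- DF2[o, ]
DFo$group <- cumsum(!duplicated(DFo[c("Vendor", "Buyer")]))
keep <- ave(DFo$group, DFo$group, FUN = length) > 1  # TRUE only for groups with more than one row
DFo[keep, ]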
2) data.table Convert to data.table, set the key (which sorts it) and use rleid to assign the group number.
library(data.table)
DT <- data.table(DF)
setkey(DT, Vendor, Buyer)
DT[, group := rleid(Vendor, Buyer)]
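If the edited data with the Date column is used instead, a data.table sketch of the follow-up steps (assuming DT was built from that edited data frame and the dates are month/day/year) could look like:
DT[, Date := as.IDate(Date, format = "%m/%d/%Y")]  # assumed month/day/year dates
setkey(DT, Vendor, Buyer, Date)                     # sorts by group, then ascending date
DT[, if (.N > 1) .SD, by = .(Vendor, Buyer)]        # keep only Vendor/Buyer pairs with more than one transaction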
3) sqldf Another approach is to use SQL. This requires the development version of RSQLite from GitHub. Here dense_rank acts similarly to rleid above.
library(sqldf)
sqldf("select *, dense_rank() over (order by Vendor, Buyer) as [group]
       from DF
       order by Vendor, Buyer")
giving:
Vendor Buyer Amount group
1 A P 100 1
2 A P 290 1
3 B P 150 2
4 C Q 300 3
Note
DF <- structure(list(Vendor = structure(c(1L, 2L, 3L, 1L), .Label = c("A",
"B", "C"), class = "factor"), Buyer = structure(c(1L, 1L, 2L,
1L), .Label = c("P", "Q"), class = "factor"), Amount = c(100L,
150L, 300L, 290L)), class = "data.frame", row.names = c(NA, -4L
))

Related

How do I merge 2 dataframes without a corresponding column to match by?

I'm trying to use the merge() function in RStudio. I have two tables with 5000+ rows each; they have the same number of rows but no corresponding columns to merge by. However, the rows are in order and correspond: the first row of dataframe1 should merge with the first row of dataframe2, the 2nd row of dataframe1 with the 2nd row of dataframe2, and so on.
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio; here's what I've done:
DFMerged <- merge(df1, df2)
This, however, increases both the rows and columns: it returns 16 rows and 5 columns for this example. What am I missing from this function? I know there is a merge(x, y, by=) argument, but I don't have a column to match them by.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+-------------------------------------+--------------------+
I've considered adding an extra column to each data frame, say a row number, and matching them by that.
You could use cbind:
cbind(df1, df2)
If you want to use merge you could use:
merge(df1, df2, by=0)
You could use:
cbind(df1,df2)
This requires the two data frames to have the same number of rows.
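For context, merge() with no columns in common performs a cross join, pairing every row of df1 with every row of df2, which is why the 4-row example returns 16 rows. A small reproducible sketch of the cbind approach, assuming the question's example data is constructed as below:
df1 <- data.frame(Name = c("Rod", "Kelly", "Sam", "Joyce"),
                  Sales = c(123, 142, 183, 99),
                  Location = c("USA", "CAN", "USA", "NED"))
df2 <- data.frame(Sex = c("M", "M", "M", "F"),
                  Age = c(23, 33, 31, 45))
cbind(df1, df2)  # glues the columns side by side, matching row i with row i
Note that merge(df1, df2, by = 0) matches on row names and adds a Row.names column to the result, so cbind() is usually the simpler choice here.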

How can I select cases based on time data?

I am new to using R and I'm stumbling upon a few problems which I can't seem to solve on my own. I can't figure out how I can select cases based on time units.
I want to select cases where Time_D - Time_A is equal or above 5 seconds (for the same individual).
For instance my data frame consists of the following data:
+-------------------+--------------+---------------+
| | Individual | Time_A | Time_D |
+-------------------+--------------+---------------+
| 1 | A | 09:21:27 | 09:21:28 |
| 2 | A | 09:21:29 | 09:21:40 |
| 3 | A | 09:21:30 | 09:21:36 |
| 4 | B | 09:32:14 | 09:32:23 |
| 5 | B | 09:32:18 | 09:32:22 |
+-------------------+--------------+---------------+
And I want to only select the cases where Time_D - Time_A >= 5 seconds to get the following data frame:
+----------------+------------+-------------+
| | Individual | Time_A | Time_D |
+----------------+------------+-------------+
| 2 | A | 09:21:29 | 09:21:40 |
| 3 | A | 09:21:30 | 09:21:36 |
| 4 | B | 09:32:14 | 09:32:23 |
+----------------+------------+-------------+
I have already coded for time:
DT <- as.data.table(df3)[, Time_A := as.ITime(Time_A)][, Time_D := as.ITime(Time_D)]
After converting the columns to ITime you can subtract Time_D - Time_A and keep rows where the difference is greater than or equal to 5 seconds.
library(data.table)
cols <- c('Time_A', 'Time_D')
setDT(df)[, (cols) := lapply(.SD, as.ITime), .SDcols = cols]
df[(Time_D - Time_A) >= 5]
# Individual Time_A Time_D
#1: A 09:21:29 09:21:40
#2: A 09:21:30 09:21:36
#3: B 09:32:14 09:32:23
In base R, you can do this with as.POSIXct.
subset(df, as.POSIXct(Time_D, format = '%T') -
           as.POSIXct(Time_A, format = '%T') >= 5)
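When POSIXct times are subtracted with `-`, the units of the resulting difftime are chosen automatically, so a variant that forces the comparison into seconds is a bit more defensive (same idea, just explicit about units):
subset(df, as.numeric(difftime(as.POSIXct(Time_D, format = '%T'),
                               as.POSIXct(Time_A, format = '%T'),
                               units = 'secs')) >= 5)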
We can use the tidyverse:
library(dplyr)
library(lubridate)
df1 %>%
  filter(period_to_seconds(hms(Time_D)) - period_to_seconds(hms(Time_A)) >= 5)
# Individual Time_A Time_D
#1 A 09:21:29 09:21:40
#2 A 09:21:30 09:21:36
#3 B 09:32:14 09:32:23
data
df1 <- structure(list(Individual = c("A", "A", "A", "B", "B"),
Time_A = c("09:21:27",
"09:21:29", "09:21:30", "09:32:14", "09:32:18"), Time_D = c("09:21:28",
"09:21:40", "09:21:36", "09:32:23", "09:32:22")), class = "data.frame",
row.names = c(NA,
-5L))

How to merge attributes on a frequency table in R?

Assume that I have two variables. See the dummy data below:
Out of 250 records:
SEX
Male : 100
Female : 150
HAIR
Short : 110
Long : 140
The code I currently use is provided below; a different table is created for each variable:
sexTable <- table(myDataSet$Sex)
hairTable <- table(myDataSet$Hair)
View(sexTable):
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Male | 100 |
| Female | 150 |
|------------------|------------------|
View(hairTable)
|------------------|------------------|
| Level | Frequency |
|------------------|------------------|
| Short | 110 |
| Long | 140 |
|------------------|------------------|
My question is how to merge the two tables in R into the following format, and how to calculate the percentage frequency for each level within each group:
|---------------------|------------------|------------------|
| Variables | Level | Frequency |
|---------------------|------------------|------------------|
| Sex(N=250) | Male | 100 (40%) |
| | Female | 150 (60%) |
| Hair(N=250) | Short | 110 (44%) |
| | Long | 140 (56%) |
|---------------------|------------------|------------------|
We can use bind_rows after converting each table to a data.frame:
library(dplyr)
bind_rows(list(sex = as.data.frame(sexTable),
               Hair = as.data.frame(hairTable)), .id = 'Variables')
Using a reproducible example
tbl1 <- table(mtcars$cyl)
tbl2 <- table(mtcars$vs)
bind_rows(list(sex = as.data.frame(tbl1),
               Hair = as.data.frame(tbl2)), .id = 'Variables') %>%
  mutate(Variables = replace(Variables, duplicated(Variables), ""))
If we also need the percentages
dat1 <- transform(as.data.frame(tbl1),
                  Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl1) * 100)))
dat2 <- transform(as.data.frame(tbl2),
                  Freq = sprintf('%d (%0.2f%%)', Freq, as.numeric(prop.table(tbl2) * 100)))
bind_rows(list(sex = dat1, Hair = dat2), .id = 'Variables')
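Putting those two pieces together, a sketch that adds the percentages and blanks the repeated group labels in a single pipeline (using the same mtcars tables as above; the helper fmt() is just for illustration, not part of the original answer):
library(dplyr)
tbl1 <- table(mtcars$cyl)
tbl2 <- table(mtcars$vs)
fmt <- function(tbl) transform(as.data.frame(tbl),
                               Freq = sprintf('%d (%0.2f%%)', Freq,
                                              as.numeric(prop.table(tbl) * 100)))
bind_rows(list(sex = fmt(tbl1), Hair = fmt(tbl2)), .id = 'Variables') %>%
  mutate(Variables = replace(Variables, duplicated(Variables), ""))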

Combine dplyr mutate function with a search through the whole table

I'm quite new to R and especially to the tidyverse. I'm trying to write a script with which we can rewrite a list of taxa. We already have one that uses quite a lot of for and if loops, and I want to try to simplify it with the tidyverse, but I'm kind of stuck on how to do that.
What I have is a table that looks something like this (really simplified):
taxon_file <- tibble(name = c("cockroach", "cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
                     Id = c(445, 448, 446, 778, 543, 200, 400, 300),
                     parent_ID = c(200, 200, 200, 300, 400, 200, 400, 300),
                     rank = c("genus", "genus", "genus", "genus", "genus", "order", "order", "order"))
+-------------+-----+-----------+----------+
| name | Id | parent_ID | rank |
+=============+=====+===========+==========+
| cockroach | 445 | 200 | genus |
| cockroach2 | 448 | 200 | genus |
| grasshopper | 446 | 200 | genus |
| spider | 778 | 300 | genus |
| lobster | 543 | 400 | genus |
| insect | 200 | 200 | order |
| crustacea | 400 | 400 | order |
| arachnid | 300 | 300 | order |
+-------------+-----+-----------+----------+
Now I want to rearrange it so that I get a new column where I can add the order that matches the parent_ID (so when parent_ID == Id, write name into the order column). The end result should look kind of like this:
+-------------+------------+------+-----------+
| name | order | Id | parent_ID |
+=============+============+======+===========+
| cockroach | insect | 445 | 200 |
| cockroach2 | insect | 448 | 200 |
| grasshopper | insect | 446 | 200 |
| spider | arachnid | 778 | 300 |
| lobster | crustacea | 543 | 400 |
+-------------+------------+------+-----------+
I tried to combine mutate with an ifelse statement, but this just adds NAs to the whole order column.
The tibble is named taxon_file:
taxon_file %>%
  mutate(order = ifelse(parent_ID == Id, name, NA))
I know this will not work because it doesn't search the whole dataset for the correct row (that's what I did before with all the for loops). Maybe someone can point me in the right direction?
One way is to filter each rank type into two separate data frames, subset the columns with select, and merge the two.
df <- tibble(name = c("cockroach", "cockroach2", "grasshopper", "spider", "lobster", "insect", "crustacea", "arachnid"),
             Id = c(445, 448, 446, 778, 543, 200, 400, 300),
             parent_ID = c(200, 200, 200, 300, 400, 200, 400, 300),
             rank = c("genus", "genus", "genus", "genus", "genus", "order", "order", "order"))
library(tidyverse)
df_order <- df %>%
  filter(rank == "order") %>%
  select(order = name, parent_ID)
df_genus <- df %>%
  filter(rank == "genus") %>%
  select(name, Id, parent_ID) %>%
  merge(df_order, by = "parent_ID")
Result:
parent_ID name Id order
1 200 cockroach 445 insect
2 200 cockroach2 448 insect
3 200 grasshopper 446 insect
4 300 spider 778 arachnid
5 400 lobster 543 crustacea
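The same lookup can also be written with a tidyverse join, which is what the question was aiming for; a sketch using left_join on parent_ID (same df as above):
library(dplyr)
df %>%
  filter(rank == "genus") %>%
  select(name, Id, parent_ID) %>%
  left_join(df %>% filter(rank == "order") %>% select(order = name, parent_ID),
            by = "parent_ID")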

Copy column data when function unaggregates a single row into multiple in R

I need help taking an annual total (for each of many initiatives) and breaking it down to each month using a simple division formula. I need to do this for each distinct combination of a few columns, copying those columns down as the annual totals are broken out into monthly ones. The loop should apply the formula to two columns and iterate over each distinct group. I've tried to explain with an example below, as it's somewhat complex.
What I have :
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2015 | TotalD | TotalD |
| A | Mike | 2015 | TotalE | TotalE |
| A | Rob | 2015 | TotalF | TotalF |
| B | John | 2015 | TotalG | TotalG |
| B | Mike | 2015 | TotalH | TotalH |
......
| Init | Name | Date |Total Savings|Total Costs|
| A | John | 2016 | TotalI | TotalI |
| A | Mike | 2016 | TotalJ | TotalJ |
| A | Rob | 2016 | TotalK | TotalK |
| B | John | 2016 | TotalL | TotalL |
| B | Mike | 2016 | TotalM | TotalM |
I'm going to loop a function over each row to take the "Total Savings" and "Total Costs" and divide by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to September), creating an individual row for each month. I'm essentially breaking an annual total in one row out into a row for each month of the year. I need help running that loop so that it also copies down the "Init" and "Name" columns for each distinct "Init"/"Name" combination. Note that the divisor differs by year. I suppose I could separate the 2015 and 2016 datasets, use two different functions, and merge them if that would be easier. Below is the desired output:
| Init | Name | Date |Monthly Savings|Monthly Costs|
| A | John | 01-01-2015 | TotalD/12* | MonthD |
| A | John | 02-01-2015 | MonthD | MonthD |
| A | John | 03-01-2015 | MonthD | MonthD |
...
| A | Mike | 01-01-2016 | TotalE/9* | TotalE |
| A | Mike | 02-01-2016 | TotalE | TotalE |
| A | Mike | 03-01-2016 | TotalE | TotalE |
...
| B | John | 01-01-2015 | TotalG/12* | MonthD |
| B | John | 02-01-2015 | MonthG | MonthD |
| B | John | 03-01-2015 | MonthG | MonthD |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
  data.frame(
    Init = rep(LETTERS[1:3], each = 4)
    , Name = rep(c("John", "Mike"), each = 2)
    , Date = 2015:2016
    , Savings = (1:12) * 1200
    , Cost = (1:12) * 2400
  )
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe as I split the data up into single rows, then looping through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months. Finally, dplyr's bind_rows stitches the rows back together.
library(dplyr)

myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
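A loop-free alternative is also possible; this is just a sketch under the same assumptions (the myData and toDivide objects defined above), repeating each row once per month and dividing the totals by that row's number of months:
n <- toDivide[as.character(myData$Date)]              # months per row: 12 for 2015, 9 for 2016
expanded <- myData[rep(seq_len(nrow(myData)), n), ]   # repeat each row n times
expanded$Date <- as.Date(paste(expanded$Date, sprintf("%02d", sequence(n)), "01", sep = "-"))
expanded$Savings <- expanded$Savings / rep(n, n)
expanded$Cost <- expanded$Cost / rep(n, n)
head(expanded)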
