Automate detection of start and end row number of phrases - r

I have a dataframe like this:
df = data.frame(main_name = c("google","yahoo","google","amazon","yahoo","google"),
volume = c(32,43,412,45,12,54))
I would like to sort it by main_name, as in the example below.
The aim is to know the first and last row at which each phrase occurs, so that I can use those row ranges in a for loop.
main_name volume
amazon 45
google 32
google 412
google 54
yahoo 43
yahoo 12
Is there an automatic way to do this without knowing the specific phrases in advance, i.e. just detect where the value changes and report the start and end row numbers?
amazon [1]
google [2:4]
yahoo [5:6]

With tidyverse:
library(dplyr)
df%>%
arrange(main_name)%>%
mutate(row=row_number())%>%
group_by(main_name)%>%
summarise(start=first(row),
end=last(row))%>%
mutate(res=glue::glue("[{start}:{end}]"))
# A tibble: 3 x 4
main_name start end res
<fct> <int> <int> <chr>
1 amazon 1 1 [1:1]
2 google 2 4 [2:4]
3 yahoo 5 6 [5:6]

Here is an alternative base R solution using rle
with(rle(as.character(df$main_name)), setNames(mapply(
function(x, y) sprintf("[%s:%s]", x, y),
cumsum(lengths) - lengths + 1, cumsum(lengths)), values))
# amazon google yahoo
#"[1:1]" "[2:4]" "[5:6]"
Sample data
df <- read.table(text =
"main_name volume
amazon 45
google 32
google 412
google 54
yahoo 43
yahoo 12", header = T)

Here is another base R option
with(df, tapply(seq_along(main_name), main_name, FUN =
function(x) do.call(sprintf, c(fmt = "[%d:%d]", as.list(range(x))))))
# amazon google yahoo
# "[1:1]" "[2:4]" "[5:6]"
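Since the question mentions using the ranges in a for loop, here is a minimal base R sketch of that step. It keeps the start and end rows as numbers rather than formatted strings; the names sorted, idx, ranges and group_totals are only illustrative, and summing volume is just a stand-in for whatever the loop body should do.

```r
# Sample data from the question
df <- data.frame(
  main_name = c("google", "yahoo", "google", "amazon", "yahoo", "google"),
  volume    = c(32, 43, 412, 45, 12, 54)
)

# Sort by main_name so that equal names occupy consecutive rows
sorted <- df[order(df$main_name), ]

# Row numbers grouped by name, reduced to the first and last row of each group
idx    <- split(seq_len(nrow(sorted)), sorted$main_name)
ranges <- t(sapply(idx, range))
colnames(ranges) <- c("start", "end")

# Loop over the groups using the detected ranges
# (the per-group sum is only a placeholder for the real loop body)
group_totals <- numeric(0)
for (nm in rownames(ranges)) {
  rows <- ranges[nm, "start"]:ranges[nm, "end"]
  group_totals[nm] <- sum(sorted$volume[rows])
}
```

Here ranges is a numeric matrix with one row per phrase, so ranges["google", "start"] and ranges["google", "end"] can be used directly as loop bounds.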

Related

Function for writing an automated report in R

So I am trying to write an automated report in R with functions. One of the questions I am trying to answer is this: "During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views." To do this I wrote the following function:
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
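One concrete reason the classes matter here, shown as a small base R sketch (the variable names are only illustrative): the dates in the question's code are written without quotes, so R evaluates them as arithmetic instead of treating them as dates.

```r
# Unquoted, 2020-02-01 is parsed as the subtraction 2020 - 2 - 1
not_a_date <- 2020-02-01

# A quoted string converted with as.Date() gives a real Date object
first_day <- as.Date("2020-02-01")

# Date arithmetic works in days, so the end of the first week is 6 days later
last_day <- first_day + 6
```

With proper Date objects, comparisons such as date >= first_day & date <= last_day behave as expected.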
simulate data
You will find many friends on Stack Overflow if you provide a reproducible example.
Your question suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide an example).
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in as character - lubridate::ymd() converts it to a date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
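As one possible way to wrap the workflow, here is a hedged sketch of a single function. The name top_viewed_first_week and the data frame views_df are hypothetical, the column names follow the simulated data, and arrange() plus head() is swapped in for top_n() to keep the example short.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: top n products by total views in the first 7 days of a month.
# Assumes columns date (class Date), identifier, category and views.
top_viewed_first_week <- function(data, n = 10) {
  data %>%
    filter(day(date) <= 7) %>%
    group_by(identifier, category) %>%
    summarise(total_views = sum(views), .groups = "drop") %>%
    arrange(desc(total_views)) %>%
    head(n)
}

# Same values as the simulated data, built as a plain data frame
views_df <- data.frame(
  date       = as.Date(c("2020-02-01", "2020-02-02", "2020-02-03",
                         "2020-02-03", "2020-02-08")),
  identifier = c(1, 2, 1, 2, 3),
  category   = c("TV", "PC", "TV", "PC", "UV"),
  views      = c(27, 40, 12, 2, 200)
)

result <- top_viewed_first_week(views_df)
```

Calling the function with parentheses and an argument returns the table. print(most_viewed_products_per_week), with no call, only prints the function's source, which is what happened in the question. Note that head(n) cuts ties at exactly n rows, unlike top_n().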

Group by name and add up the columns count in r

I have a dataset with 405 observations and 39 variables, but only two columns are important for further analysis.
I would like to group similar names in the first column together and add up their numbers from the second column.
Reproducible dataset looks like this:
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
Outcome should be in an new data.frame and look like this:
df2 <- data.frame (name=c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft","Others"),
value=c(30,70,50,60,70,80,190))
A tidyverse way of doing it:
First store all valid names in a vector, say valid_names.
Then create a new column, say all_names, in df1 by:
splitting each string at the space ' ' using str_split, and
using purrr::map_chr() to check whether any of the split pieces matches valid_names; if so, keep that piece, otherwise return 'others'.
Then group_by on this field. (I skipped a separate mutate step and created the new field directly in the group_by statement, which works.)
Finally summarise your important values as desired.
valid_names =c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
valid_names
#> [1] "Google" "Facebook" "Twitter" "Flurry" "Amazon" "Microsoft"
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
df1
#> name value unimportant
#> 1 Google Ads 10 1
#> 2 Google Doubleclick 20 2
#> 3 Facebook Login 30 3
#> 4 Facebook Ads 40 4
#> 5 Twitter MoPub 50 5
#> 6 Flurry 60 6
#> 7 Amazon advertisment 70 7
#> 8 Microsoft 80 8
#> 9 Ad4screen 90 9
#> 10 imobi 100 10
library(tidyverse)
df1 %>% group_by(all_names = str_split(name, ' '),
all_names = map_chr(all_names, ~ ifelse(any(.x %in% valid_names),.x[.x %in% valid_names], 'others'))) %>%
summarise(value = sum(value), .groups = 'drop')
#> # A tibble: 7 x 2
#> all_names value
#> <chr> <dbl>
#> 1 Amazon 70
#> 2 Facebook 70
#> 3 Flurry 60
#> 4 Google 30
#> 5 Microsoft 80
#> 6 others 190
#> 7 Twitter 50
Created on 2021-06-22 by the reprex package (v2.0.0)
This works on the sample data using the adist function with partial=TRUE to look at partial string matches. It requires defining the known groups up front, rather than trying to discover them. I think this legwork is worth doing, though, as it simplifies the problem a lot once the desired output is known.
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facbook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
# types we want to map. known is the groupings
types <- unique(df1$name)
known <- c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
# use distance measures, and look for matches on partial strings, e.g.
# ignore the Doubleclick part when matching on Google
distance <- adist(known, types, partial=TRUE)
# cap controls leniency in matching, e.g. Facbook and Facebook have a distance of 1
# whilst Facebook and Facebook is a perfect match with a score of 0
# Raise to be more lenient
cap <- 1
# loop through the types
map_all <- sapply(seq_along(types), function(i){
# find minimum value, check if its below the cap. If so, assign to the closest
# group, else assign to others
v <- min(distance[,i])
if(v <= cap){
map_i <- known[which.min(distance[,i])]
}else{
map_i <- "Others"
}
map_i
})
# now merge in to df1, then sum out using your preferred method
df_map <- data.frame(name=types, group=map_all)
df_merged <- merge(df1, df_map, by="name")
df2 <- aggregate(value ~ group, sum, data=df_merged)
df2
group value
1 Amazon 70
2 Facebook 70
3 Flurry 60
4 Google 30
5 Microsoft 80
6 Others 190
7 Twitter 50
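If the known names always appear at the start of the string and misspellings are not a concern, the same mapping can be sketched in base R without distance measures. This is an assumption-laden simplification, not a replacement for the fuzzy match above: a misspelled entry such as "Facbook Ads" would fall into "Others" here. The names group and totals are illustrative.

```r
df1 <- data.frame(
  name  = c("Google Ads", "Google Doubleclick", "Facebook Login", "Facebook Ads",
            "Twitter MoPub", "Flurry", "Amazon advertisment", "Microsoft ",
            "Ad4screen", "imobi"),
  value = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
  stringsAsFactors = FALSE
)
known <- c("Google", "Facebook", "Twitter", "Flurry", "Amazon", "Microsoft")

# Start with every row in "Others", then overwrite the rows whose name
# begins with one of the known group names (exact prefix match only)
group <- rep("Others", nrow(df1))
for (k in known) {
  group[startsWith(df1$name, k)] <- k
}

# Sum value per group
df2 <- aggregate(value ~ group,
                 data = data.frame(group = group, value = df1$value),
                 FUN = sum)
totals <- setNames(df2$value, df2$group)
```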

R dplyr - running total with row-wise calculations

I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial"). If type is "in", that means a cash deposit. In this example each deposit is $1,000. If type is "out", that means a withdrawal from the account. In this example each withdrawal is 10% of the account balance.
data <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000))
Is there a dplyr solution to keep track of the balance after each activity? I have tried several ways but I can't seem to find a way to efficiently calculate running totals and the withdrawal amount (which depends on the running total).
For this example the output should be:
result <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000),
balance=c(5000,6000,5400,4860,5860,6860))
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
mutate(balance = accumulate2(input, type[-1], .f = function(x, y, type) if(type == "out") x - x * y/100 else x + y)) %>%
unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860
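To make the recurrence explicit, here is a plain base R loop that computes the same balances. It is only a sketch for checking the logic, not a dplyr solution; the function name running_balance is hypothetical, and it assumes the first row has type "initial".

```r
# Hypothetical helper: balance after each activity.
# "initial" sets the balance, "in" adds input, "out" removes input percent
# of the current balance. Assumes the first element of type is "initial".
running_balance <- function(type, input) {
  balance <- numeric(length(input))
  for (i in seq_along(input)) {
    if (type[i] == "initial") {
      balance[i] <- input[i]
    } else if (type[i] == "in") {
      balance[i] <- balance[i - 1] + input[i]
    } else {
      balance[i] <- balance[i - 1] * (1 - input[i] / 100)
    }
  }
  balance
}

type    <- c("initial", "in", "out", "out", "in", "in")
input   <- c(5000, 1000, 10, 10, 1000, 1000)
balance <- running_balance(type, input)
```

Each step depends on the previous balance, which is why a cumulative sum alone is not enough and an accumulating function is needed.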

How to create an edge list for each user mentioned in a tweet when there are observations containing several user mentioned

I want to do a network analysis of the tweets of some users of interest and the users mentioned in their tweets.
I retrieved the tweets (no retweets) from several user timelines using the rtweet package in R and want to see who they mention in their tweets.
There is even a variable with the screen names of the mentioned users, which will serve as the target column of my edge list. When a tweet mentions several users, the observation looks like this: c('luigidimaio', 'giuseppeconteit'), whereas when only one user is mentioned the observation is just that one name (e.g. agorarai). I want to split the observations that contain several mentioned users into one observation per user. So an observation containing two mentioned users as a vector should become two observations, each containing one of them.
The code looks like this so far:
# get user timelines of the most active italian parties (excluding retweets)
tmls_nort <- get_timelines(c("Mov5Stelle", "pdnetwork", "LegaSalvini"),
n = 3200, include_rts = FALSE
)
# create an edge list
tmls_el = as.data.frame(cbind(Source = tolower(tmls_nort$screen_name), Target = tolower(tmls_nort$mentions_screen_name)))
Here is an extract of my dataframe:
Source Target n
<fct> <fct> <int>
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 2
6 legasalvini "c(\"raiportaaporta\", \"brunovespa\")" 7
We can start from this: first you could clean up your columns, tidy up the data and plot your network.
The data I used are:
tmls_el
Source Target n
1 legasalvini circomassimo 2
2 legasalvini 1giornodapecora 2
3 legasalvini 24mattino 2
4 legasalvini agorarai 28
5 legasalvini ariachetira 26
6 legasalvini c("raiportaaporta", "brunovespa") 7
7 movimento5stelle c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8") 20
Now what I've done:
# replace the unneeded characters with nothing
tmls_el$Target <- gsub("c\\(\"", "", tmls_el$Target)
tmls_el$Target <- gsub("\\)", "", tmls_el$Target)
tmls_el$Target <- gsub("\"", "", tmls_el$Target)
library(stringr)
temp <- data.frame(str_split_fixed(tmls_el$Target, ", ", 8))
tmls_el_2 <- data.frame(
Source = c(rep(as.character(tmls_el$Source),8))
, Target = c(as.character(temp$X1),as.character(temp$X2),as.character(temp$X3),
as.character(temp$X4),as.character(temp$X5),as.character(temp$X6),
as.character(temp$X7),as.character(temp$X8))
, n = rep(tmls_el$n, 8)) # keep n numeric so that el$n/2 works in plot() below
Note: this works with the example you gave. If a row has more than 8 targets, replace the 8 in str_split_fixed() and rep() with the maximum number of targets k, add the extra columns (temp$X9, ..., temp$Xk) to Target, and repeat Source and n k times. Surely there is a more elegant way, but this works.
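One way to avoid the fixed column count is to split each Target into a variable number of pieces and repeat Source and n accordingly. The following is a base R sketch under the assumption that Target has the c("a", "b") text form shown above; the function name split_targets is hypothetical.

```r
# Hypothetical helper: one row per mentioned user, for any number of targets
split_targets <- function(el) {
  # Strip the c( ... ) wrapper and the quotes, then split on commas
  cleaned <- gsub('c\\(|\\)|"', "", as.character(el$Target))
  parts   <- strsplit(cleaned, ",\\s*")
  k       <- lengths(parts)  # number of targets in each original row
  data.frame(
    Source = rep(as.character(el$Source), k),
    Target = unlist(parts),
    n      = rep(el$n, k),
    stringsAsFactors = FALSE
  )
}

tmls_el <- data.frame(
  Source = c(rep("legasalvini", 6), "movimento5stelle"),
  Target = c("circomassimo", "1giornodapecora", "24mattino", "agorarai",
             "ariachetira", "c(\"raiportaaporta\", \"brunovespa\")",
             "c(\"test1\", \"test2\", \"test3\", \"test4\", \"test5\", \"test6\", \"test7\", \"test8\")"),
  n = c(2, 2, 2, 28, 26, 7, 20),
  stringsAsFactors = FALSE
)

edges <- split_targets(tmls_el)
```

Because each row is repeated only as often as it has targets, no empty Target rows are created and the later filter(Target != '') step becomes unnecessary for this variant.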
Here you can create edges and nodes:
library(dplyr)
el <- tmls_el_2 %>% filter(Target !='')
no <- data.frame(name = unique(c(as.character(el$Source),as.character(el$Target))))
Now you can use igraph to plot the results:
library(igraph)
g <- graph_from_data_frame(el, directed=TRUE, vertices=no)
plot(g, edge.width = el$n/2)
With data:
tmls_el <- data.frame(Source = c("legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","legasalvini","movimento5stelle"),
Target = c("circomassimo","1giornodapecora","24mattino","agorarai","ariachetira","c(\"raiportaaporta\", \"brunovespa\")","c(\"test1\", \"test2\", \"test3\", \"test4\", \"test5\", \"test6\", \"test7\", \"test8\")"),
n = c(2,2,2,28,26,7,20))

Combining rows in the data set into categories in R

I'm trying to write a script which combines similar entries into a common category.
I have the dataset:
product <- c('Laptops','13" Laptops','Apple Laptops', '10 inch laptop','Laptop 13','TV','Big TV')
volume <- c(100,10,20,2,1,200,10)
dataset <- data.frame(product,volume)
Looks like:
product volume
1 Laptops 100
2 13" Laptops 10
3 Apple Laptops 20
4 10 inch laptop 2
5 Laptop 13 1
6 TV 200
7 Big TV 10
What I want to do is combine all categories together, so for example after running the script I want the dataset to be:
product volume
1 Laptops 113
2 Apple Laptop 20
3 TV 210
Since Apple is a brand, I want it to remain separate from categories. I don't know how to get started but I figure I need a for loop to go through every row, and check if a Brand name is in the product name. E.g.
brandlist <- 'Apple|Samsung'
if ( grepl(brandlist, dataset$product[i])) { Skip this row }
Now I need to define category names, which I do by looking at the products with the most searches, since people tend to search for categories. Let's say a row is a category if the volume is >100.
categories <- c()
for ( i in seq_len(nrow(dataset)) ) {
if ( dataset$volume[i] > 100 ) { categories <- c(categories, as.character(dataset$product[i])) }}
Now I need to check if every row name has a somewhat partial match... I'm thinking of some sort of regex with number + " + category or the other way around. I was also considering some sort of algorithm to check how many letters are different, e.g. allow 4 characters to differ and at least 5 must match exactly to the category, so laptops and 13" laptops will be grouped together since they have 7 characters in common and differ in 4.
EDIT:
I'm currently thinking along the lines of the following solution:
I made a list of categories, and I created a new data frame such as:
category <- c ('other', 'category 1', 'category 2')
volume <- c(0,0,0)
df <- data.frame(category,volume)
category volume
1 other 0
2 category 1 0
3 category 2 0
Now I want to go through the results in the previous table using a loop, match all results (based on the restriction on brands and matching: a row must have 1 word in common with the category and may differ in some ways), and put the result in the new data frame.
You can try the following. First remove all digits and double quotes, then trim the surrounding whitespace.
Then flag the rows that contain a brand and extract the last word of each product name; where a brand was found, keep the full product name instead. Convert everything to lower case and finally strip the plural s. Group and summarise in the last step. Of course this is a hardcoded solution for the provided data.frame, but I see no other way.
library(stringi)
library(tidyverse)
dataset %>%
mutate(p2=gsub("[[:digit:]]|\"","",product),
p2=stri_trim(p2)) %>%
mutate(p3=grepl(brandlist, p2)) %>%
mutate(p4=stri_extract_last_words(p2),
p4=ifelse(p3, grep(brandlist, p2, value=T), p4),
p4=tolower(p4),
p4=stri_replace_last_fixed(p4, "s","")) %>%
group_by(p4) %>%
summarise(volume=sum(volume)) %>%
select(product=p4, volume)
# A tibble: 3 x 2
product volume
<chr> <dbl>
1 laptop 113
2 tv 210
3 apple laptop 20
Edits:
You can also set up a function, but then you have to define the categories yourself. Make sure to write them in the singular and in lower case.
library(stringr)
foo <- function(data, product=product, volume=volume, brandlist, categories){
data %>%
mutate(p1=tolower(product)) %>%
mutate(p2=str_extract(p1, brandlist),
p2=ifelse(is.na(p2),"",p2)) %>%
mutate(p3=str_extract(p1, categories)) %>%
unite(Product, p2, p3, sep = " ") %>%
mutate(Product=str_trim(Product)) %>%
group_by(Product) %>%
summarise(volume=sum(volume))
}
foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv")
# A tibble: 3 x 2
Product volume
<chr> <dbl>
1 apple laptop 20
2 laptop 113
3 tv 210
foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv|big tv")
# A tibble: 4 x 2
Product volume
<chr> <dbl>
1 apple laptop 20
2 big tv 10
3 laptop 113
4 tv 200
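For the first part, you can define a list of categories and a list of brands, sum the volume per category, and then subtract the rows that also match a brand so that they are reported separately: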
Categories <- c("Laptop","TV")
Brands <- c("Apple")
Aggregated.df <- do.call(rbind,lapply(1:length(Categories),function(x){
SumRow <- sum(dataset[grepl(Categories[x],dataset$product,ignore.case=TRUE),"volume"])
Excluded <- sapply(1:length(Brands),function(y){
SumCol <- sum(dataset[grepl(Categories[x],dataset$product,ignore.case=TRUE) & grepl(Brands[y],dataset$product,ignore.case=TRUE),"volume"])
})
SumRow <- ifelse((SumRow - sum(Excluded)) < 0, 0, (SumRow - sum(Excluded)))
Excluded.df <- NULL
if(any(Excluded>0)){
Which <- which(Excluded>0)
Excluded.df <- data.frame(Product=paste(Brands[Which],Categories[x],sep=" "), volume = Excluded[Which])
}
Row.df <- data.frame(Product=Categories[x], volume = SumRow)
DataFrame <- rbind(Row.df,Excluded.df)
}))
Now I need to define category names, which I do by looking at the products with the most searches, since people tend to search for categories. Let's say a row is a category if the volume is >100.
Min.volume <- 100
Categories <- unique(Aggregated.df$Product[Aggregated.df$volume > Min.volume])

Resources