Group by name and add up the columns count in r - r

I have a dataset with 405 observations and 39 variables. But just two columns are important for further analysis.
I would like to group the first row with similar names together and add up their number from the second column.
Reproducible dataset looks like this:
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
Outcome should be in an new data.frame and look like this:
df2 <- data.frame (name=c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft","Others"),
value=c(30,70,50,60,70,80,190))

A tidyverse way of doing it.
First store all valid_names in a vector say valid_names
Thereafter create a new column say all_names in df1 by -
first splitting all strings at space ' ' using str_split
thereafter use purrr::map_chr() to check if any of the split string matches with your valid_names and if yes, retrieve that string only otherwise get others
Thereafter group_by on this field. (I omitted one step of mutate first and then group_by and directly created the new field in group_by statement, that works)
Now summarise your important values as desired.
valid_names =c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
valid_names
#> [1] "Google" "Facebook" "Twitter" "Flurry" "Amazon" "Microsoft"
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facebook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
df1
#> name value unimportant
#> 1 Google Ads 10 1
#> 2 Google Doubleclick 20 2
#> 3 Facebook Login 30 3
#> 4 Facebook Ads 40 4
#> 5 Twitter MoPub 50 5
#> 6 Flurry 60 6
#> 7 Amazon advertisment 70 7
#> 8 Microsoft 80 8
#> 9 Ad4screen 90 9
#> 10 imobi 100 10
library(tidyverse)
df1 %>% group_by(all_names = str_split(name, ' '),
all_names = map_chr(all_names, ~ ifelse(any(.x %in% valid_names),.x[.x %in% valid_names], 'others'))) %>%
summarise(value = sum(value), .groups = 'drop')
#> # A tibble: 7 x 2
#> all_names value
#> <chr> <dbl>
#> 1 Amazon 70
#> 2 Facebook 70
#> 3 Flurry 60
#> 4 Google 30
#> 5 Microsoft 80
#> 6 others 190
#> 7 Twitter 50
Created on 2021-06-22 by the reprex package (v2.0.0)

This works on the sample data using the adist function and with partial=TRUE to look at partial string matches. It requires defining the known groups though, rather than trying to find them. I think this leg work is worth doing though as it simplifies the problem a lot once the output is known
df1 <- data.frame(name=c("Google Ads", "Google Doubleclick","Facebook Login",
"Facbook Ads","Twitter MoPub","Flurry","Amazon advertisment","Microsoft ","Ad4screen","imobi"),
value=c(10,20,30,40,50,60,70,80,90,100),unimportant=c(1,2,3,4,5,6,7,8,9,10))
# types we want to map. known is the groupings
types <- unique(df1$name)
known <- c("Google","Facebook","Twitter","Flurry","Amazon","Microsoft")
# use distrance measures, and look for matches on partial strings eg
# ignore the Doubleclick part when matching on Google
distance <- adist(known, types, partial=TRUE)
# cap controls leniancy in matching e.g. Facbook and Facebook have a dist of 1
# whilst Facebook and Facebook is a perfect match with score of 0
# Raise to be more leniant
cap <- 1
# loop through the types
map_all <- sapply(seq_along(types), function(i){
# find minimum value, check if its below the cap. If so, assign to the closest
# group, else assign to others
v <- min(distance[,i])
if(v <= cap){
map_i <- known[which.min(distance[,i])]
}else{
map_i <- "Others"
}
map_i
})
# now merge in to df1, then sum out using your preferred method
df_map <- data.frame(name=types, group=map_all)
df_merged <- merge(df1, df_map, by="name")
df2 <- aggregate(value ~ group, sum, data=df_merged)
df2
group value
1 Amazon 70
2 Facebook 70
3 Flurry 60
4 Google 30
5 Microsoft 80
6 Others 190
7 Twitter 50

Related

Function for writing an automated report in R

So I am trying to write an automated report in R with Functions. One of the questions I am trying to answer is this " During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views.". To to this I wrote the following function
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
As mentioned - you will find many friends on Stackoverflow, if you provide a reproducible example.
Your questions suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide example)
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in a character - lubridate::ymd() for date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39

Obtaining Percentage for Date Observations

I am very new to R and am struggling with this concept. I have a data frame that looks like this:
enter image description here
I have used summary(FoodFacilityInspections$DateRecent) to get the observations for each "date" listed. I have 3932 observations, though, and wanted to get a summary of:
Dates with the most observations and the percentage for that
Percentage of observations for the Date Recent category
I have tried:
*
> count(FoodFacilityInspections$DateRecent) Error in UseMethod("count")
> : no applicable method for 'count' applied to an object of class
> "factor"
Using built in data as you did not provide example data
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count"=.N, "percent"=.N/dtcars[, .N]*100),
by=cyl]
You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (date in your case) and divide it by the total number of rows like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))){
percent[i] <- table(mtcars$cyl)[i]/nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
percent
4 11 34.375
6 7 21.875
8 14 43.750
A one-liner using table and proportions in within.
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750
An updated solution with total, percent and cumulative percent table based on your data.
library(data.table)
data<-data.frame("ScoreRecent"=c(100,100,100,100,100,100,100,100,100),
"DateRecent"=c("7/23/2021", "7/8/2021","5/25/2021","5/19/2021","5/20/2021","5/13/2021","5/17/2021","5/18/2021","5/18/2021"),
"Facility_Type_Description"=c("Retail Food Stores", "Retail Food Stores","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment","Food Service Establishment"),
"Premise_zip"=c(40207,40207,40207,40206,40207,40206,40207,40206,40206),
"Opening_Date"=c("6/27/1988","6/29/1988","10/20/2009","2/28/1989","10/20/2009","10/20/2009","10/20/2009","10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
dataset %>%
group_by({{var}}) %>%
summarise(n=n()) %>%
mutate(total = cumsum(n),
percent = n / sum(n) * 100,
cumulativepercent = cumsum(n / sum(n) * 100))
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100

Cleaning a column in a dataset R [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
So I got a dataset with a column that I need to clean.
The column has objects with stuff like: "$10,000 - $19,999", "$40,000 and over."
How do I code this so for example "$10,000 - $19,999" becomes 15000 instead, and "$40,000 and over" becomes 40000 in a new column?
I am new to R so I have no idea how to start. I need to do a regression analysis on this but it doesn't work if I don't get this fixed.
I have been told that some basic string/regex operations are what I need. How should I proceed?
Here's a solution using the tidyverse.
Load packages
library(dplyr) # for general cleaning functions
library(stringr) # for string manipulations
library(magrittr) # for the '%<>% function
Make a dummy dataset based on your example.
df <- data_frame(price = sample(c(rep('$40,000 and over', 10),
rep('$10,000', 10),
rep('$19,999', 10),
rep('$9,000', 10),
rep('$28,000', 10))))
Inspect the new dataframe
print(df)
#> # A tibble: 50 x 1
#> price
#> <chr>
#> 1 $9,000
#> 2 $40,000 and over
#> 3 $28,000
#> 4 $10,000
#> 5 $10,000
#> 6 $9,000
#> 7 $19,999
#> 8 $10,000
#> 9 $19,999
#> 10 $40,000 and over
#> # ... with 40 more rows
Clean-up the the format of the price strings by removing the $ symbol and ,. Note the use of the '\\' before the $ symbol. This formatting is used within R to escape special characters (the second \ is a standard regex escape switch, the first \ is tells R to escape the second \).
df %<>%
mutate(price = str_remove(string = price, pattern = '\\$'), # remove $ sign
price = str_remove(string = price, pattern = ',')) # remove comma
Quick check of the data.
head(df)
#> # A tibble: 6 x 1
#> price
#> <chr>
#> 1 9000
#> 2 40000 and over
#> 3 28000
#> 4 10000
#> 5 10000
#> 6 9000
Process the number strings into numerics. First convert 40000 and over to 40000, then convert all the strings to numerics, then use logic statements to convert the numbers to the values you want. The functions ifelse() and case_when() are interchangeable, but I tend to use ifelse() for single rules, and case_when() when there are multiple rules because of the more compact format of the case_when().
df %<>%
mutate(price = ifelse(price == '40000 and over', # convert 40000+ to 40000
yes = '40000',
no = price),
price = as.numeric(price), # convert all to numeric
price = case_when( # use logic statements to change values to desired value
price == 40000 ~ 40000,
price >= 30000 & price < 40000 ~ 35000,
price >= 20000 & price < 30000 ~ 25000,
price >= 10000 & price < 20000 ~ 15000,
price >= 0 & price < 10000 ~ 5000
))
Have a final look.
print(df)
#> # A tibble: 50 x 1
#> price
#> <dbl>
#> 1 5000
#> 2 40000
#> 3 25000
#> 4 15000
#> 5 15000
#> 6 5000
#> 7 15000
#> 8 15000
#> 9 15000
#> 10 40000
#> # ... with 40 more rows
```
Created on 2018-11-18 by the reprex package (v0.2.1)
First you should see what exactly your data is composed of- use the table() function on data$column to see how many unique entries you must account for.
table(data$column)
If whoever was entering this data was consistent about their wording, it may be easiest to hard code for substitution for each unique entry. So if unique(data$column)[1]== "$10,000 - $19,999", and unique(data$column)[2]== "$40,000 and over."
data$column[which(data$column==unique(data$column)[1])] <- "15000"
data$column[which(data$column==unique(data$column)[2])] <- "40000"
...
If you have too many unique entries for this approach to be viable, I'd suggest looking for consistencies in character sequences that can be used to make replacements. If you found that whoever entered this data was inconsistent about how they would write "$40,000 and over" such that you had:
data$column==unique(data$column)[2]
>"$40,000 and over."
data$column==unique(data$column)[3]
>"$40,000 and over"
data$column==unique(data$column)[4]
>"above $40,000"
...
If there weren't instances of "$40,000" that belonged to other categories, you could combine these entries for substitution a la:
data$column[which(grepl("$40,000",data$column))] <- "40000"
Inconsistency in qualitative data entry is a very human problem and requires exploring your data to search for trends and easy ways to consolidate your replacements. I think it's a fine idea to use R to identify and replace for patterns you find to save time, but ultimately it will require a fine touch as you get down to individual cases where you have to interpret/correct someone's entries to include them in your desired bins. Depending on your data quality standards, you can always throw out these entries that don't seem to fit your observed patterns.

Automate detection of start and end row number of phrases

I have a dataframe like this:
df = data.frame(main_name = c("google","yahoo","google","amazon","yahoo","google"),
volume = c(32,43,412,45,12,54))
I would like to sort it accordind to main_name, example
Aiming to know from which start row there is the specific phrase until which one in order to use it into a for loop.
main_name volume
amazon 45
google 32
google 412
google 54
yahoo 43
yahoo 12
In there any "auto" to make need without the need to know the specific phrase. Just to check if it is changed and know the start and end row number?
amazon [1]
google [2:4]
yahoo [5:6]
With tidyverse:
df%>%
arrange(main_name)%>%
mutate(row=row_number())%>%
group_by(main_name)%>%
summarise(start=first(row),
end=last(row))%>%
mutate(res=glue::glue("[{start}:{end}]"))
# A tibble: 3 x 4
main_name start end res
<fct> <int> <int> <chr>
1 amazon 1 1 [1:1]
2 google 2 4 [2:4]
3 yahoo 5 6 [5:6]
Here is an alternative base R solution using rle
with(rle(as.character(df$main_name)), setNames(mapply(
function(x, y) sprintf("[%s:%s]", x, y),
cumsum(lengths) - lengths + 1, cumsum(lengths)), values))
# amazon google yahoo
#"[1:1]" "[2:4]" "[5:6]"
Sample data
df <- read.table(text =
"main_name volume
amazon 45
google 32
google 412
google 54
yahoo 43
yahoo 12", header = T)
Here is another base R option
with(df, tapply(seq_along(main_name), main_name, FUN =
function(x) do.call(sprintf, c(fmt = "[%d:%d]", as.list(range(x))))))
# amazon google yahoo
# "[1:1]" "[2:4]" "[5:6]"

Using the Nested List Column Approach and Purrr Together with Tidytext::Unnest_Tokens

I have a dataframe that contains survey responses with each row representing a different person. One column - "Text" - is an open-ended text question. I would like to use Tidytext::unnest_tokens so that I do text analysis by each row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction<-c ("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)
I then turned the Text column into character...
df$Text<-as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)
Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?
I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
"Satisfied",
"Dissatisfied",
"Satisfied",
"Dissatisfied")
Text <- c("I'm very satisfied with the services",
"Your service providers are always late which causes me a lot of frustration",
"You should improve your staff training, service providers have bad customer service",
"Everything is great!",
"Service is bad")
Gender <- c("M","M","F","M","F")
df <- data_frame(Satisfaction, Text, Gender)
tidy_df <- df %>%
mutate(id = row_number()) %>%
unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
group_by(id) %>%
mutate(num_words = n()) %>%
ungroup
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.

Resources