Remove commas within numbers only - r

I have a list of product sales (and their cost) which have frustratingly been concatenated into a single string, separated by commas. I ultimately need to separate out each product into unique rows which is easy enough with stringr::str_split.
However, each product's cost uses a comma as a thousands separator, e.g. 1,000.00 or 38,647.89. As a result, str_split splits products incorrectly whenever it hits a comma inside a cost.
I was wondering what the best tidyverse solution would be to remove all commas which are surrounded by numbers so that 1,000.00 becomes 1000.00 and 38,647.89 becomes 38647.89. Once these commas are removed I can str_split on the commas which delimit the products and thus split each unique product into its own row.
Here is a dummy dataset:
df<-data.frame(id = c(1, 2), product = c("1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78", "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78"))
df
id product
1 1 1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78
2 2 1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78
Expected outcome:
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78

library(dplyr)
library(stringr)
df %>%
  mutate(product = str_replace_all(product, "([0-9]),([0-9])", "\\1\\2"))
Result
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78

> apply(df,1,function(x){gsub(",([0-9])","\\1",x[2])})
[1] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"
[2] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"
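One caveat, shown as a small sketch: the pattern `,([0-9])` only checks the character after the comma, so it also removes a delimiting comma that happens to be followed directly by a digit. The input string below is hypothetical (it is not from the question's data) and the lookaround pattern is used purely for comparison.

```r
# Hypothetical input: the first delimiter has no space after it.
x <- "Item A,1 Truck at $5,000"

# Checks only the character after the comma, so the delimiter is removed too.
after_only <- gsub(",([0-9])", "\\1", x)

# Requires a digit on both sides, so the delimiter survives.
both_sides <- gsub("(?<=[0-9]),(?=[0-9])", "", x, perl = TRUE)

print(after_only)  # "Item A1 Truck at $5000"
print(both_sides)  # "Item A,1 Truck at $5000"
```

For the dummy dataset in the question both patterns give the same result, because every delimiting comma is followed by a space.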

One way in base R:
sapply(strsplit(as.character(df$product), ' '), function(i)paste(sub(',', '', i), collapse = ' '))
#[1] "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78" "1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78"

library(tidyverse)
df$product <- str_replace_all(df$product, "(?<=\\d),(?=\\d)", "")
df
id product
1 1 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
2 2 1 Car at $38678.49, 1 Truck at $78468.00, 1 Motorbike at $5634.78
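To connect this back to the original goal (one product per row), here is a minimal base R sketch of the full pipeline: strip the thousands separators, then split on the remaining commas. The names `cleaned`, `parts` and `long` are illustrative only, and the sketch assumes every delimiting comma is followed by whitespace or a non-digit.

```r
df <- data.frame(
  id = c(1, 2),
  product = c(
    "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78",
    "1 Car at $38,678.49, 1 Truck at $78,468.00, 1 Motorbike at $5,634.78"
  ),
  stringsAsFactors = FALSE
)

# Step 1: drop only the commas that sit between two digits.
cleaned <- gsub("(?<=\\d),(?=\\d)", "", df$product, perl = TRUE)

# Step 2: split on the remaining commas and any whitespace after them.
parts <- strsplit(cleaned, ",\\s*")

# Step 3: repeat each id once per product to build the long table.
long <- data.frame(
  id = rep(df$id, lengths(parts)),
  product = unlist(parts),
  stringsAsFactors = FALSE
)

print(long)
```

With the dummy data this yields six rows, three per id, and no commas left in the product column.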

Related

Aggregating observations based on category in R

I have a set of agricultural data in R that looks something like this:
State District Year Crop Production Area
1 State A District 1 2000 Banana 1254.00 2000.00
2 State A District 1 2000 Apple 175.00 176.00
3 State A District 1 2000 Wheat 321.00 641.00
4 State A District 1 2000 Rice 1438.00 175.00
5 State A District 1 2000 Cashew 154.00 1845.00
6 State A District 1 2000 Peanut 2076.00 439.00
7 State B District 2 2000 Banana 3089.00 1987.00
8 State B District 2 2000 Apple 309.00 302.00
9 State B District 2 2000 Wheat 401.00 230.00
10 State B District 2 2000 Rice 1832.00 2134.00
11 State B District 2 2000 Cashew 991.00 1845.00
12 State B District 2 2000 Peanut 2311.00 1032.00
I want to aggregate the area and production values by crop type, but keep the state, district and year details, so that it would look something like:
State District Year Crop Production Area
1 State A District 1 2000 Fruit 1429.00 2176.00
2 State A District 1 2000 Grain 1759.00 816.00
3 State A District 1 2000 Nut 2230.00 2284.00
4 State B District 2 2000 Fruit 3398.00 2289.00
5 State B District 2 2000 Grain 2233.00 2364.00
6 State B District 2 2000 Nut 3302.00 2877.00
What's the best way to go about this?
Using dplyr & forcats:
library(dplyr)
library(forcats)
df %>%
  # Collapse the individual crops into broader categories
  mutate(Crop = fct_recode(Crop, Fruit = "Apple", Fruit = "Banana",
                           Grain = "Wheat", Grain = "Rice",
                           Nut = "Cashew", Nut = "Peanut")) %>%
  # Group by the recoded category and sum, matching the desired output
  group_by(State, District, Year, Crop) %>%
  summarize(Production = sum(Production),
            Area = sum(Area))
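For comparison, a base R sketch that reproduces the totals in the desired output (for example 1254 + 175 = 1429 for State A fruit production). It maps each crop to a category through a named lookup vector and then sums with aggregate(). The data frame is retyped from the question; the names `crop_type` and `result` are illustrative.

```r
df <- data.frame(
  State = rep(c("State A", "State B"), each = 6),
  District = rep(c("District 1", "District 2"), each = 6),
  Year = 2000,
  Crop = rep(c("Banana", "Apple", "Wheat", "Rice", "Cashew", "Peanut"), times = 2),
  Production = c(1254, 175, 321, 1438, 154, 2076, 3089, 309, 401, 1832, 991, 2311),
  Area = c(2000, 176, 641, 175, 1845, 439, 1987, 302, 230, 2134, 1845, 1032),
  stringsAsFactors = FALSE
)

# Named lookup vector: crop name -> category.
crop_type <- c(Banana = "Fruit", Apple = "Fruit",
               Wheat = "Grain", Rice = "Grain",
               Cashew = "Nut", Peanut = "Nut")
df$Crop <- unname(crop_type[as.character(df$Crop)])

# Sum both measures within each State/District/Year/category group.
result <- aggregate(cbind(Production, Area) ~ State + District + Year + Crop,
                    data = df, FUN = sum)
result <- result[order(result$State, result$Crop), ]

print(result)
```

The lookup vector keeps the crop-to-category mapping in one place, which makes it easy to extend when more crops appear.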

Return and count value from one column based on another column [duplicate]

In R, I can return the count results using the specific column names I am interested in as an array as below.
require("plyr")
bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
count(bevs, c("name", "drink"))
# produces
name drink freq
1 Bill cocoa 2
2 Bill coffee 2
3 Llib tea 2
4 Llib water 2
How can I get the count result of two specific column names in a matrix which has columns: all unique drinks, rows: all unique names and cells: freqs (like below)?
cocoa coffee tea water
Bill 2 2 0 0
Llib 0 0 2 2
P.S: Obviously, the solution does not need to use plyr.
You want a contingency table, which you can create using table:
table(bevs[, c("name", "drink")])
# drink
#name cocoa coffee tea water
# Bill 2 2 0 0
# Llib 0 0 2 2
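If a plain data frame is preferred over a table object, one option (a sketch, not the only way) is to convert the contingency table with as.data.frame.matrix(). The data frame below is built directly rather than via cbind(), which is an assumption made here to keep the columns typed; the name `wide` is illustrative.

```r
# Same name/drink pairs as in the question, built without cbind().
bevs <- data.frame(
  name = rep(c("Bill", "Llib"), times = 4),
  drink = rep(c("coffee", "tea", "cocoa", "water"), times = 2),
  cost = 1:8,
  stringsAsFactors = FALSE
)

# Cross-tabulate names (rows) against drinks (columns).
tab <- table(bevs$name, bevs$drink)

# Convert the table to a wide data frame, keeping its row and column names.
wide <- as.data.frame.matrix(tab)

print(wide)
```

Combinations that never occur, such as Bill and tea, appear as 0 rather than being dropped.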

Cross tabulation with n>2 categories in R: hide rows with zero cases

I'm trying to make a cross tabulation in R, and having its output resemble as much as possible what I'd get in an Excel pivot table. So, given this code:
set.seed(2)
df <- data.frame("ministry" = paste("ministry ", sample(1:3, 20, replace = TRUE)),
                 "department" = paste("department ", sample(1:3, 20, replace = TRUE)),
                 "program" = paste("program ", sample(letters[1:20], 20, replace = FALSE)),
                 "budget" = runif(20) * 1e6)
library(tables)
library(dplyr)
arrange(df,ministry,department,program)
tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
which yields:
Avg Total
ministry department Count budget budget
ministry 1 department 1 5 479871 2399356
department 2 1 770028 770028
department 3 1 184673 184673
ministry 2 department 1 2 170818 341637
department 2 1 183373 183373
department 3 3 415480 1246440
ministry 3 department 1 0 NaN 0 <---- LOOK HERE
department 2 5 680102 3400509
department 3 2 165118 330235
How do I get the output to hide the rows with zero frequencies?
I'm using tables::tabular, but any other package is fine as long as there is a way, even an indirect one, to output HTML. This is for generating HTML or LaTeX with R Markdown and displaying my script's results in a pivot-table-like form, as Excel would and as in the example above, but without the superfluous row.
Thanks!
Why not just use dplyr?
df %>%
  group_by(ministry, department) %>%
  summarise(count = n(),
            avg_budget = mean(budget, na.rm = TRUE),
            tot_budget = sum(budget, na.rm = TRUE))
ministry department count avg_budget tot_budget
1 ministry 1 department 1 5 479871.1 2399355.6
2 ministry 1 department 2 1 770027.9 770027.9
3 ministry 1 department 3 1 184673.5 184673.5
4 ministry 2 department 1 2 170818.3 341636.5
5 ministry 2 department 2 1 183373.2 183373.2
6 ministry 2 department 3 3 415479.9 1246439.7
7 ministry 3 department 2 5 680101.8 3400508.8
8 ministry 3 department 3 2 165117.6 330235.3
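The same idea, summarising only the combinations that actually occur, can be sketched in base R with aggregate(). The data frame here is a simplified stand-in for the one in the question (no program column), and the names `agg` and `result` are illustrative. The exact values depend on the random draw, so no output is shown.

```r
set.seed(2)
df <- data.frame(
  ministry = paste("ministry", sample(1:3, 20, replace = TRUE)),
  department = paste("department", sample(1:3, 20, replace = TRUE)),
  budget = runif(20) * 1e6,
  stringsAsFactors = FALSE
)

# aggregate() only creates groups for combinations present in the data,
# so ministry/department pairs with zero cases never appear.
agg <- aggregate(
  budget ~ ministry + department, data = df,
  FUN = function(x) c(count = length(x), avg = mean(x), total = sum(x))
)

# The summary comes back as a matrix column; flatten it into plain columns.
result <- do.call(data.frame, agg)
result <- result[order(result$ministry, result$department), ]

print(result)
```

The flattened columns are named budget.count, budget.avg and budget.total, and the resulting data frame can then be passed to whichever table renderer the R Markdown document uses.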
While I don't understand at all how the tabular object is made (it says it's a list but seems to behave like a data frame), you can select cells as usual, so
> results <-tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
> results[results[,1]!=0,]
Avg Total
ministry department Count budget budget
ministry 1 department 1 5 479871 2399356
department 2 1 770028 770028
department 3 1 184673 184673
ministry 2 department 1 2 170818 341637
department 2 1 183373 183373
department 3 3 415480 1246440
ministry 3 department 2 5 680102 3400509
department 3 2 165118 330235
That's the solution. I found it thanks to a reply on another question by this user: https://stackoverflow.com/users/516548/g-grothendieck

Count of Row Frequency in R

Let's say that I have the following data frame with three columns.
data = data.frame(id=c(1:10), interest_1=c("food","","","drugs","beer","soda","","","drugs","sports"),
interest_2=c("fruits","car","jeans","","","","soda","shoes","","drugs"),
interest_3=c("","","","","soda","sports","","","",""))
data
I want to get a count of each row.
The following incident, where food is interest_1, fruits is interest_2, and nothing is interest_3 occurs only once.
id interest_1 interest_2 interest_3
1 1 food fruits
The following incident, where drugs are interest_1 and nothing is interest_2 or interest_3 occurs twice.
id interest_1 interest_2 interest_3
4 drugs
9 drugs
I want to count the number of times that each incidence occurs. How would I go about doing this?
Output should look like:
interest_1 interest_2 interest_3 count
food fruits 1
car 1
jeans 1
drugs 2
> aggregate(id~.,data,length)
interest_1 interest_2 interest_3 id
1 drugs 2
2 car 1
3 sports drugs 1
4 food fruits 1
5 jeans 1
6 shoes 1
7 soda 1
8 beer soda 1
9 soda sports 1
Basically, this means: apply the function length to the vector of id values for each combination of the other columns.
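As a small sketch of that explanation, the same aggregate() call can be written with the grouping columns spelled out and the result column renamed to count, as in the requested output. The name `counts` is illustrative; empty strings are treated as ordinary values, so rows with blank interests are kept.

```r
data <- data.frame(
  id = 1:10,
  interest_1 = c("food", "", "", "drugs", "beer", "soda", "", "", "drugs", "sports"),
  interest_2 = c("fruits", "car", "jeans", "", "", "", "soda", "shoes", "", "drugs"),
  interest_3 = c("", "", "", "", "soda", "sports", "", "", "", ""),
  stringsAsFactors = FALSE
)

# length() of the id vector in each group is the number of rows in that group.
counts <- aggregate(id ~ interest_1 + interest_2 + interest_3,
                    data = data, FUN = length)

# Rename the aggregated column so it reads as a count rather than an id.
names(counts)[names(counts) == "id"] <- "count"

print(counts)
```

The counts sum to the number of rows in data, which is a quick sanity check that no row was dropped.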
require(plyr)
ddply(data, .(interest_1, interest_2, interest_3), c("nrow"))
