Add category column to a data set - r

I've a data table like this
+------------+-------+
| Model | Price |
+------------+-------+
| Apple-1 | 10 |
+------------+-------+
| New Apple | 11 |
+------------+-------+
| Orange | 13 |
+------------+-------+
| Orange2019| 15 |
+------------+-------+
| Cat | 19 |
+------------+-------+
I'want to define a list of base model tags that I want to add to any single row that matches certain condition/value. So for example defined a data frame for tagging like this
+------------+--------+
| Model | Tag |
+------------+------ -+
| Apple-1 | A |
+------------+------ -+
| New Apple | A |
+------------+------ -+
| Orange | B |
+------------+------ -+
| Cat | B |
+------------+--------+
I would like to find some way to get this results:
+------------+-------+--------+
| Model | Price | Tag |
+------------+-------+--------+
| Apple-1 | 10 | A |
+------------+-------+--------|
| New Apple | 11 | A |
+------------+-------+--------|
| Orange | 13 | B |
+------------+-------+--------|
| Orange2019| 15 | B |
+------------+-------+--------|
| Cat | 19 | B |
+------------+-------+--------|
I'm don't mind to use a table to managed the tagging data, and I know that I could write very "ad-hoc" mutate statement to achieve the results I want, just wondering if there is more elegant way to tagging a string based on a pattern match.

One idea is to use the Levenshtein distances to cluster the words you have. You would need to provide with a number of clusters. Once you have this clusters, just add the number of each one as a category tag to your table. Check out this answer which goes into detail of Levenshtein distance clustering. Text clustering with Levenshtein distances
edit
I think I totally misunderstood your question... try this
df=data.frame("Model"=c("Apple-1","New Apple","Organe","Orange2019","Cat"),
"Price"=c(10,11,13,15,19),stringsAsFactors = FALSE)
tags=data.frame("Model"=c("Apple-1","New Apple","Orange","Cat"),
"Tag"=c("A","A","B","B"),stringsAsFactors = FALSE)
df%>%rowwise()%>%mutate(Tag=if_else(!is.na(tags$Tag[which(!is.na(str_extract(Model,tags$Model)))[1]]),
tags$Tag[which(!is.na(str_extract(Model,tags$Model)))[1]],false="None"))
Model Price Tag
<chr> <dbl> <chr>
1 Apple-1 10 A
2 New Apple 11 A
3 Organe 13 None
4 Orange2019 15 B
5 Cat 19 B
I actually changed Orange for Organe so that you see what happens if there is not match ( none is returned)

Related

how to query/add filter attribute in dynamodb query?

I am creating a composite key with artist and song in dynamo db. artist as a primary key and song as an sort key. i'm aware that query might be slow and expensive given how dynamo db works and how filter is applied, is it possible to apply filter to multiple attribute and how does it look like , for example - query all the records by an artist (joe) in a particular genre (country) with rating higher than 7
id | artist | song | albumTitle | Price | Genre | rating
1 | TI | hello | south | 90 | Rap | 7.0
2 | joe | good | free | 87 | pop | 8
3 | joe | bye | one | 99 | country| 7
4 | joe | beat | one | 99 | country| 5
5 | joe | sun | one | 99 | country| 9

Is there a way in R to create a column based on order of multiple values in one another column in dataframe? [duplicate]

This question already has answers here:
Aggregating all unique values of each column of data frame
(2 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 1 year ago.
I would like to create a column in my R data frame based on the order in which multiple values occur in one column.
For example, my data frame has an id column and an item type column, and the values of the order column is what I would like to add. Is there a way to tell R to look at the order of values in the item column so that it can spit out "ABCD" or "ADCB" (any other order) as the cell value under the 3rd column?
| id | item | order |
| 11 | A | ABCD |
| 11 | A | ABCD |
| 11 | B | ABCD |
| 11 | B | ABCD |
| 11 | C | ABCD |
| 11 | C | ABCD |
| 11 | D | ABCD |
| 11 | D | ABCD |
| 12 | A | ADCB |
| 12 | A | ADCB |
| 12 | D | ADCB |
| 12 | D | ADCB |
| 12 | C | ADCB |
| 12 | C | ADCB |
| 12 | B | ADCB |
| 12 | B | ADCB |
...

How to get a query result into a key value form in HiveQL

I have tried different things, but none succeeded. I have the following issue, and would be very gratefull if someone could help me.
I get the data from a view as several billions of records, for different measures
A)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to aggregate it by each measure. And so long so fine. I got this figured out.
B)
| s_c_m1 | s_c_m2 | s_c_m3 | s_c_m4 | s_p_m1 | s_p_m2 | s_p_m3 | s_p_m4 |
|--------+--------+--------+--------+--------+--------+--------+--------|
| 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 |
|--------+--------+--------+--------+--------+--------+--------+--------|
Then I need to get the data in the following form. I need to turn it into a key-value form.
C)
| measure | c | p |
|---------+----+----|
| m1 | 3 | 15 |
| m2 | 6 | 18 |
| m3 | 9 | 21 |
| m4 | 12 | 24 |
|---------+----+----|
The first 4 columns from B) would form in C) the first column, and the second 4 columns would form another column.
Is there an elegant way, that could be easily maintainable? The perfect solution would be if another measure would be introduced in A) and B), there no modification would be required and it would automatically pick up the difference.
I know how to get this done in SqlServer and Postgres, but here I am missing the expirience.
I think you should use map for this

R : Create a factor from two variables containing ranks and levels

Everything is in the title, I got from a database many columns, paired two-by-two containing codes and labels for some variables, I want an easy way to create half as many factors, with, for each factor levels/codes matching to the original two variables.
Here is an exemple of original data for two factors
| customer_type | customer_type_name | customer_status | customer_status_name |
|----------------------|----------------------|----------------------|----------------------|
| 1 | A | 2 | Beta |
| 2 | B | 2 | Beta |
| 3 | C | 1 | Alpha |
| 2 | B | 3 | Gamma |
| 1 | A | 4 | Delta |
| 3 | C | 2 | Beta |
i.e. a simpler way (simpler to call in a function for lots of variables) to do from dataframe "accounts"
a<-accounts[,c("customertypecode","customertypecodename")]
a<-a[!duplicated(a),]
a<-a[order(a$customertypecode),]
accounts$customertypecode<-factor(accounts$customertypecode,labels=a$customertypecodename[!is.na(a$customertypecodename)])

Select single row per unique field value with SQL Developer

I have thousands of rows of data, a segment of which looks like:
+-------------+-----------+-------+
| Customer ID | Company | Sales |
+-------------+-----------+-------+
| 45678293 | Sears | 45 |
| 01928573 | Walmart | 6 |
| 29385068 | Fortinoes | 2 |
| 49582015 | Walmart | 1 |
| 49582015 | Joe's | 1 |
| 19285740 | Target | 56 |
| 39506783 | Target | 4 |
| 39506783 | H&M | 4 |
+-------------+-----------+-------+
In every case that a customer ID occurs more than once, the value in 'Sales' is also the same but the value in 'Company' is different (this is true throughout the entire table). I need for each value in 'Customer ID to only appear once, so I need a single row for each customer ID.
In other words, I'd like for the above table to look like:
+-------------+-----------+-------+
| Customer ID | Company | Sales |
+-------------+-----------+-------+
| 45678293 | Sears | 45 |
| 01928573 | Walmart | 6 |
| 29385068 | Fortinoes | 2 |
| 49582015 | Walmart | 1 |
| 19285740 | Target | 56 |
| 39506783 | Target | 4 |
+-------------+-----------+-------+
If anyone knows how I can go about doing this, I'd much appreciate some help.
Thanks!
Well it would have been helpful, if you have put your sql generate that data.
but it might go something like;
SELECT customer_id, Max(Company) as company, Count(sales.*) From Customers <your joins and where clause> GROUP BY customer_id
Assumes; there are many company and picks out the most number of occurance and the sales data to be in a different table.
Hope this helps.

Resources