I need to order values from a string range index by the percentage of their words that match the query.
For example, if the search query is aaa and the values are:
aaa bbb ccc
aaa
ccc ddd aaa ppp
The output should be
aaa (100% match)
aaa bbb ccc (33% match)
ccc ddd aaa ppp (25% match)
I can pull all values from the index and loop through them, but I'm looking for a more efficient approach.
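To make the scoring concrete, here is a minimal sketch in R of the brute-force approach mentioned above (pull the values, score each one, sort). It is not an index-side solution. The names `values` and `match_share` are hypothetical, and the sketch assumes words are separated by whitespace and compared exactly.

```r
# Hypothetical values pulled from the index, plus the search query
values <- c("aaa bbb ccc", "aaa", "ccc ddd aaa ppp")
query <- "aaa"

# Share of whitespace-separated words that equal the query exactly
match_share <- function(value, query) {
  words <- strsplit(value, "[[:space:]]+")[[1]]
  mean(words == query)
}

scores <- vapply(values, match_share, numeric(1), query = query, USE.NAMES = FALSE)

# Sort by descending share and report it as a rounded percentage
ranked <- data.frame(
  value = values,
  percent = round(100 * scores),
  stringsAsFactors = FALSE
)[order(-scores), ]
rownames(ranked) <- NULL
print(ranked)
```

This still visits every value once, so it only illustrates the arithmetic behind the 100%, 33%, and 25% figures in the example.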
I have a data frame with two columns. The first one identifies the document and is called "doc_id". The second column contains text that I have assembled from several columns and is called "text". The assembled text contains several sentences. There is nothing else in the data frame.
My problem is that when I use the aggregate function (from stats v3.6.2), it doesn't seem to aggregate properly. Some lines are not aggregated, and sometimes they are aggregated as a list rather than as a single string. A sample looks like this:
Starting Situation:
doc_id text
1 1015328_1999 aaa aaa aaa.
2 1015328_2000 bbb bbb bbb.
3 1015328_2003 ccc ccc ccc.
4 1015328_2004 ddd ddd ddd.
5 1015328_2005 eee eee eee.
6 1015328_2006 fff fff fff.
7 1015328_2006 ggg ggg ggg.
8 1015328_2006 hhh hhh hhh.
9 1015328_2006 iii iii iii.
10 1015328_2006 jjj jjj jjj.
Result I want to obtain:
doc_id text
1 1015328_1999 aaa aaa aaa.
2 1015328_2000 bbb bbb bbb.
3 1015328_2003 ccc ccc ccc.
4 1015328_2004 ddd ddd ddd.
5 1015328_2005 eee eee eee.
6 1015328_2006 fff fff fff. ggg ggg ggg. hhh hhh hhh. iii iii iii. jjj jjj jjj.
I tried to use the following code:
> mydataframe <- aggregate(text ~ doc_id, data = mydataframe, paste, collapse = " ")
But this is what I get instead:
doc_id text
1 1015328_1999 aaa aaa aaa.
2 1015328_2000 bbb bbb bbb.
3 1015328_2003 ccc ccc ccc.
4 1015328_2004 ddd ddd ddd.
5 1015328_2005 eee eee eee.
6 1015328_2006
Although there is data to aggregate, the code does not seem to aggregate it. Sometimes it aggregates correctly, sometimes it aggregates but puts a list in a row instead of a string, and sometimes the values are missing completely.
Any help would be highly appreciated!
Thank you very much in advance.
EDIT:
Test data as code:
library(tidyverse)
test <- tribble(
~doc_id, ~text,
"1015328_1999", "aaa aaa aaa.",
"1015328_2000", "bbb bbb bbb.",
"1015328_2003", "ccc ccc ccc.",
"1015328_2004", "ddd ddd ddd.",
"1015328_2005", "eee eee eee.",
"1015328_2006", "fff fff fff.",
"1015328_2006", "ggg ggg ggg.",
"1015328_2006", "hhh hhh hhh.",
"1015328_2006", "iii iii iii.",
"1015328_2006", "jjj jjj jjj.",)
I've come up with a solution based on tidyverse rather than stats. If that is a problem, or if the solution doesn't do what you're trying to achieve, please let me know.
library(tidyverse)
test <- tribble(
~doc_id, ~text,
"1015328_1999", "aaa aaa aaa.",
"1015328_2000", "bbb bbb bbb.",
"1015328_2003", "ccc ccc ccc.",
"1015328_2004", "ddd ddd ddd.",
"1015328_2005", "eee eee eee.",
"1015328_2006", "fff fff fff.",
"1015328_2006", "ggg ggg ggg.",
"1015328_2006", "hhh hhh hhh.",
"1015328_2006", "iii iii iii.",
"1015328_2006", "jjj jjj jjj.",)
group_by(test, doc_id) %>% summarise(text = paste(text, collapse = " "))
Output:
# A tibble: 6 x 2
doc_id text
<chr> <chr>
1 1015328_1999 aaa aaa aaa.
2 1015328_2000 bbb bbb bbb.
3 1015328_2003 ccc ccc ccc.
4 1015328_2004 ddd ddd ddd.
5 1015328_2005 eee eee eee.
6 1015328_2006 fff fff fff. ggg ggg ggg. hhh hhh hhh. iii iii iii. jjj jjj jjj.
I was able to solve the problem. It wasn't a coding error, but a display error. The RStudio viewer was not working properly: it displayed empty rows or cells when there was actually data in them.
When I used utils::View(name of dataframe) instead, I was able to see the full content.
Weird though. But thanks for all your help!
Best regards!
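For anyone who suspects the same kind of display problem, a minimal sketch of checking the aggregated result without relying on any viewer might look like this. It uses base R only and rebuilds the test data as a plain data frame; the name `res` is an assumption for illustration.

```r
# Test data from the question, rebuilt as a base R data frame
test <- data.frame(
  doc_id = c("1015328_1999", "1015328_2000", "1015328_2003", "1015328_2004",
             "1015328_2005", rep("1015328_2006", 5)),
  text = c("aaa aaa aaa.", "bbb bbb bbb.", "ccc ccc ccc.", "ddd ddd ddd.",
           "eee eee eee.", "fff fff fff.", "ggg ggg ggg.", "hhh hhh hhh.",
           "iii iii iii.", "jjj jjj jjj."),
  stringsAsFactors = FALSE
)

# Same aggregate() call as in the question
res <- aggregate(text ~ doc_id, data = test, FUN = paste, collapse = " ")

# Inspect the object itself instead of the viewer
str(res)            # column types: a list column would show up here
nchar(res$text)     # an empty cell would have length 0
print(res$text[res$doc_id == "1015328_2006"])
```

If `str()` reports a character column and `nchar()` shows no zeros, the data is there and the problem is in how it is displayed.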
I imported a JSON file with the structure below:
link
I would like to transform it into a data frame with 3 columns: ID, group_name, and date_joined,
where ID is the element number from the "data" list.
It should look like this:
ID group_name date_joined
1 aaa dttm
1 bbb dttm
1 ccc dttm
1 ddd dttm
2 eee dttm
2 aaa dttm
2 bbb dttm
2 fff dttm
2 ggg dttm
3 bbb dttm
3 ccc dttm
3 ggg dttm
3 mmm dttm
After several attempts with the code below, I only get a data frame with 2 columns, group_name and date_joined:
train2 <- do.call("rbind", train2)
sample file link
The following should work:
library(jsonlite)
train2 <- fromJSON("sample.json")
train2 <- train2[[1]]$groups$data
idx <- seq_along(train2)
df <- data.frame(
  # Repeat each element number once per group it contains
  ID = unlist(lapply(idx, function(x) rep.int(x, length(train2[[x]]$group_name)))),
  group_name = unlist(lapply(idx, function(x) train2[[x]]$group_name)),
  date_joined = unlist(lapply(idx, function(x) train2[[x]]$date_joined)))
output:
> df
ID group_name date_joined
1 1 Let's excercise together and lose a few kilo quicker - everyone is welcome! (Piastow) 2008-09-05 09:55:18.730066
2 1 Strongman competition 2008-05-22 21:25:22.572365
3 1 Fast food 4 life 2012-02-02 05:26:01.293628
4 1 alternative medicine - Hypnosis and bioenergotheraphy 2008-07-05 05:47:12.254848
5 2 Tom Cruise group 2009-06-14 16:48:28.606142
6 2 Babysitters (Sokoka) 2010-09-25 03:21:01.944684
7 2 Work abroad - join to find well paid work and enjoy the experience (Sokoka) 2010-09-21 23:44:39.499240
8 2 Tennis, Squash, Badminton, table tennis - looking for sparring partner (Sokoka) 2007-10-09 17:15:13.896508
9 2 Lost&Found (Sokoka) 2007-01-03 04:49:01.499555
10 3 Polish wildlife - best places 2007-07-29 18:15:49.603727
11 3 Politics and politicians 2010-10-03 21:00:27.154597
12 3 Pizza ! Best recipes 2010-08-25 22:26:48.331266
13 3 Animal rights group - join us if you care! 2010-11-02 12:41:37.753989
14 4 The Aspiring Writer 2009-09-08 15:49:57.132171
15 4 Nutrition & food advices 2010-12-02 18:19:30.887307
16 4 Game of thrones 2009-09-18 10:00:16.190795
17 5 The ultimate house and electro group 2008-01-02 14:57:39.269135
18 5 Pirates of the Carribean 2012-03-05 03:28:37.972484
19 5 Musicians Available Poland (Osieczna) 2009-12-21 13:48:10.887986
20 5 Housekeeping - looking for a housekeeper ? Join the group! (Osieczna) 2008-10-28 23:22:26.159789
21 5 Rooms for rent (Osieczna) 2012-08-09 12:14:34.190438
22 5 Counter strike - global ladderboard 2008-11-28 03:33:43.272435
23 5 Nutrition & food advices 2011-02-08 19:38:58.932003
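The reason the plain do.call("rbind", ...) in the question loses the ID is that the element number lives in the list position, not in the rows. A minimal sketch of attaching it before binding, assuming each list element is a data frame with group_name and date_joined columns (the `groups` list below is a hypothetical stand-in, not the real file):

```r
# Hypothetical stand-in for the parsed list: one data frame per element of "data"
groups <- list(
  data.frame(group_name = c("aaa", "bbb"),
             date_joined = c("2008-09-05", "2008-05-22"),
             stringsAsFactors = FALSE),
  data.frame(group_name = "eee",
             date_joined = "2009-06-14",
             stringsAsFactors = FALSE)
)

# Add the list position as an ID column to each piece, then bind the pieces
with_id <- Map(function(i, g) cbind(ID = i, g), seq_along(groups), groups)
df <- do.call(rbind, with_id)
print(df)
```

Whether the real parsed object has exactly this shape depends on the JSON file, so the list may need to be unpacked first, as in the answer above.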
I wasn't sure how to title this properly, but here is my question.
I have the following data:
ID Name Type Date Amount
1 AAAA First 2009/7/20 100
1 AAAA First 2010/2/3 200
2 BBBB First 2015/3/10 250
2 CCC Second 2009/2/23 300
2 CCC Second 2010/1/25 400
2 CCC Third 2015/4/9 500
2 CCC Third 2016/6/25 700
For rows that share the same ID, Name, and Type, I want to remove those with an earlier Date. In other words, I want to keep only the row with the latest Date in each group.
The result is like:
ID Name Type Date Amount
1 AAAA First 2010/2/3 300
2 BBBB First 2015/3/10 250
2 CCC Second 2010/1/25 700
2 CCC Third 2016/6/25 1200
I know I can use duplicated() to find which observations are duplicates.
library(data.table)
dt <- fread("
ID Name Type Date Amount
1 AAAA First 2009/7/20 100
1 AAAA First 2010/2/3 200
2 BBBB First 2015/3/10 250
2 CCC Second 2009/2/23 300
2 CCC Second 2010/1/25 400
2 CCC Third 2015/4/9 500
2 CCC Third 2016/6/25 700
")
dt$Date <- as.Date(dt$Date)
dt[duplicated(ID) & duplicated(Name) & duplicated(Type)]
ID Name Type Date Amount
1: 1 AAAA First 2010/2/3 200
2: 2 CCC Second 2010/1/25 400
3: 2 CCC Third 2016/6/25 700
However, this is not what I want. Although it removes the rows with the earlier Date, it does not keep the third observation (ID=2, Name=BBBB, Type=First). Also, I still need to sum Amount.
How can I do this?
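One possible way to get the desired result, sketched in base R rather than data.table: compute the group total and the group's latest date with ave(), then keep only the rows that carry that latest date. The names `key` and `result` are assumptions for illustration, and if two rows in a group tie on the latest date, both are kept.

```r
# Data from the question as a base R data frame
dt <- data.frame(
  ID = c(1, 1, 2, 2, 2, 2, 2),
  Name = c("AAAA", "AAAA", "BBBB", "CCC", "CCC", "CCC", "CCC"),
  Type = c("First", "First", "First", "Second", "Second", "Third", "Third"),
  Date = as.Date(c("2009/7/20", "2010/2/3", "2015/3/10", "2009/2/23",
                   "2010/1/25", "2015/4/9", "2016/6/25"), format = "%Y/%m/%d"),
  Amount = c(100, 200, 250, 300, 400, 500, 700),
  stringsAsFactors = FALSE
)

# One key per (ID, Name, Type) combination
key <- paste(dt$ID, dt$Name, dt$Type, sep = "|")

# Replace Amount by the group total on every row
dt$Amount <- ave(dt$Amount, key, FUN = sum)

# Keep only the rows whose Date is the latest within their group
date_num <- as.numeric(dt$Date)
result <- dt[date_num == ave(date_num, key, FUN = max), ]
rownames(result) <- NULL
print(result)
```

Unlike the duplicated() attempt, a group with a single row (ID=2, Name=BBBB, Type=First) survives, because its only date is also its latest date.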
I have a df
ID <- c('DX154','DX154','DX155','DX155','DX156','DX157','DX158','DX159')
Country <- c('US','US','US','US')
Level <- c('Level_1A','Level_1A','Level_1B','Level_1B','Level_1A','Level_1B','Level_1B','Level_1A')
Type_A <- c('Iphone','Iphone','Android','Android','aaa','bbb','ccc','ddd')
Type_B <- c("Iphone,Ipad,Ipod,Mac","Gmail,Android,Drive,Maps","Iphone,Ipad,Ipod,Mac","Gmail,Android,Drive,Maps","ALL","ALL","ALL","ALL")
df <- data.frame(ID ,Country ,Level ,Type_A,Type_B)
df
ID Country Level Type_A Type_B
1 DX154 US Level_1A Iphone Iphone,Ipad,Ipod,Mac
2 DX154 US Level_1A Iphone Gmail,Android,Drive,Maps
3 DX155 US Level_1B Android Iphone,Ipad,Ipod,Mac
4 DX155 US Level_1B Android Gmail,Android,Drive,Maps
5 DX156 US Level_1A aaa ALL
6 DX157 US Level_1B bbb ALL
7 DX158 US Level_1B ccc ALL
8 DX159 US Level_1A ddd ALL
I am trying to filter this data frame by matching the Type_A column against Type_B, but I don't know how to parse the commas. Could someone please help me with this?
My desired output is:
ID Country Level Type_A Type_B
1 DX154 US Level_1A Iphone Iphone,Ipad,Ipod,Mac
2 DX155 US Level_1B Android Gmail,Android,Drive,Maps
3 DX156 US Level_1A aaa ALL
4 DX157 US Level_1B bbb ALL
5 DX158 US Level_1B ccc ALL
6 DX159 US Level_1A ddd ALL
Here's one solution. It's kind of gimmicky, but someone will be along to give you the super clever and speedy version soon. This does it row-wise, but Akrun's answer shows you how to do it by id only.
library(dplyr)
df <- df %>%
mutate(row_id = 1:n()) %>%
group_by(row_id) %>%
filter(grepl(Type_A, Type_B) | Type_B == "ALL")
We group by 'ID' and use grepl, building the pattern by pasting the 'Type_A' values together with '|', to filter the rows. (In this example, Type_A[1L] should also work as the pattern, since the 'Type_A' elements are duplicated within each ID. A better example would be nice.) We also use grepl to keep those elements of 'Type_B' that contain no comma from the start (^) to the end ($) of the string.
library(dplyr)
df %>%
group_by(ID) %>%
filter(grepl(paste(Type_A, collapse='|'),
Type_B)|grepl('^[^,]+$', Type_B))
# ID Country Level Type_A Type_B
#1 DX154 US Level_1A Iphone Iphone,Ipad,Ipod,Mac
#2 DX155 US Level_1B Android Gmail,Android,Drive,Maps
#3 DX156 US Level_1A aaa ALL
#4 DX157 US Level_1B bbb ALL
#5 DX158 US Level_1B ccc ALL
#6 DX159 US Level_1A ddd ALL
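Both answers rely on grepl, which matches substrings. As a hedged alternative, here is a base R sketch that splits Type_B on commas and tests for exact membership, so a Type_A value would not match merely because it is part of a longer entry. Only the columns needed for the filter are rebuilt, and the names `tokens`, `in_list`, and `result` are assumptions.

```r
# Relevant columns from the question
df <- data.frame(
  ID = c("DX154", "DX154", "DX155", "DX155", "DX156", "DX157", "DX158", "DX159"),
  Type_A = c("Iphone", "Iphone", "Android", "Android", "aaa", "bbb", "ccc", "ddd"),
  Type_B = c("Iphone,Ipad,Ipod,Mac", "Gmail,Android,Drive,Maps",
             "Iphone,Ipad,Ipod,Mac", "Gmail,Android,Drive,Maps",
             "ALL", "ALL", "ALL", "ALL"),
  stringsAsFactors = FALSE
)

# Split each Type_B string into its comma-separated entries
tokens <- strsplit(df$Type_B, ",", fixed = TRUE)

# TRUE where the row's Type_A is exactly one of its Type_B entries
in_list <- mapply(function(a, b) a %in% b, df$Type_A, tokens, USE.NAMES = FALSE)

# Keep exact matches and the catch-all "ALL" rows
result <- df[in_list | df$Type_B == "ALL", ]
print(result)
```

On this data the result contains the same six rows as the desired output. The sketch assumes the entries in Type_B carry no spaces around the commas.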
Basically I have a dataset which has a large number of columns, and it might even grow in the future.
Now before I analyse the data, in most cases it makes sense to group by all columns. I can manually type everything, I know, but I was wondering if there is a way to make it automatic.
As an example, think of list of invoice items where many attributes actually just further describe the product (data is heavily denormalised), eg:
InvoiceId ProductId Price CustomerName SomeOtherProductAttribute...
123 ABC 32.11 CustA xyz
123 BBB 99.99 CustA xyzy
444 ABC 32.11 CustB xyz
444 CCC 12.99 CustB ttt
and I want to summarise the price
DT[, sum(Price), by = list(InvoiceId, ProductId, CustomerName, SomeOtherProductAttribute)]
You could use setdiff:
DT[, sum(Price), by = setdiff(names(DT), "Price")]
InvoiceId ProductId CustomerName SomeOtherProductAttribute... V1
1: 123 ABC CustA xyz 32.11
2: 123 BBB CustA xyzy 99.99
3: 444 ABC CustB xyz 32.11
4: 444 CCC CustB ttt 12.99
Use ddply from the plyr package:
library(plyr)
var_group <- colnames(data)[!(colnames(data) %in% "price")]
ddply(data, var_group, summarise, price_sum = sum(price))
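The same idea of deriving the grouping columns with setdiff also works without extra packages. A minimal base R sketch follows, using the example data with the elided attribute column left out and one hypothetical repeated row added so that the sum is visible. The names `group_cols` and `res` are assumptions.

```r
# Example data from the question; the last row is a hypothetical repeat
# of the first one, added only to show that prices are actually summed
DT <- data.frame(
  InvoiceId = c(123, 123, 444, 444, 123),
  ProductId = c("ABC", "BBB", "ABC", "CCC", "ABC"),
  Price = c(32.11, 99.99, 32.11, 12.99, 32.11),
  CustomerName = c("CustA", "CustA", "CustB", "CustB", "CustA"),
  stringsAsFactors = FALSE
)

# Every column except the one being summed becomes a grouping column
group_cols <- setdiff(names(DT), "Price")

# aggregate() accepts the grouping columns as a list (a data frame is one)
res <- aggregate(DT["Price"], by = DT[group_cols], FUN = sum)
print(res)
```

New columns added to the data later are picked up as grouping columns automatically, which is what the question asks for.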