I'm not sure how to title this properly, but here is my question.
I have this data:
ID Name Type Date Amount
1 AAAA First 2009/7/20 100
1 AAAA First 2010/2/3 200
2 BBBB First 2015/3/10 250
2 CCC Second 2009/2/23 300
2 CCC Second 2010/1/25 400
2 CCC Third 2015/4/9 500
2 CCC Third 2016/6/25 700
I want to collapse the rows that share the same ID, Name, and Type: keep only the latest Date and sum Amount over the collapsed rows.
The result should look like this:
ID Name Type Date Amount
1 AAAA First 2010/2/3 300
2 BBBB First 2015/3/10 250
2 CCC Second 2010/1/25 700
2 CCC Third 2016/6/25 1200
I know I can use duplicated() to find which observations are duplicates.
library(data.table)
dt <- fread("
ID Name Type Date Amount
1 AAAA First 2009/7/20 100
1 AAAA First 2010/2/3 200
2 BBBB First 2015/3/10 250
2 CCC Second 2009/2/23 300
2 CCC Second 2010/1/25 400
2 CCC Third 2015/4/9 500
2 CCC Third 2016/6/25 700
")
dt$Date <- as.Date(dt$Date)
dt[duplicated(ID) & duplicated(Name) & duplicated(Type)]
ID Name Type Date Amount
1: 1 AAAA First 2010/2/3 200
2: 2 CCC Second 2010/1/25 400
3: 2 CCC Third 2016/6/25 700
However, this is not what I want. Although it removes the smaller Date, it does not keep the third observation (ID=2, Name=BBBB, Type=First). I also still need to sum Amount.
How can I do this?
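One way to get the result shown above is to group by the three key columns and aggregate within each group. The following is a minimal sketch, assuming the data.table package is installed and that Amount is read in together with the other columns; the name `result` is only illustrative.

```r
library(data.table)

dt <- fread("
ID Name Type Date Amount
1 AAAA First 2009/7/20 100
1 AAAA First 2010/2/3 200
2 BBBB First 2015/3/10 250
2 CCC Second 2009/2/23 300
2 CCC Second 2010/1/25 400
2 CCC Third 2015/4/9 500
2 CCC Third 2016/6/25 700
")

# Parse the slash-separated dates so that max() compares dates, not strings
dt[, Date := as.Date(Date, format = "%Y/%m/%d")]

# One row per (ID, Name, Type): latest Date and total Amount
result <- dt[, .(Date = max(Date), Amount = sum(Amount)), by = .(ID, Name, Type)]
print(result)
```

Because the grouping is on all three columns, the single BBBB row forms its own group and is kept, which the duplicated() approach could not do.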
I imported a JSON file with the structure below:
link
I would like to transform it into a data frame with 3 columns: ID, group_name, date_joined,
where ID is the element number from the "data" list.
It should look like this:
ID group_name date_joined
1 aaa dttm
1 bbb dttm
1 ccc dttm
1 ddd dttm
2 eee dttm
2 aaa dttm
2 bbb dttm
2 fff dttm
2 ggg dttm
3 bbb dttm
3 ccc dttm
3 ggg dttm
3 mmm dttm
Using the code below a few times, I get a data frame with just 2 columns: group_name and date_joined
train2 <- do.call("rbind", train2)
sample file link
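The following should work:
library(jsonlite)
# Parse the file; fromJSON simplifies nested arrays where it can
train2 <- fromJSON("sample.json")
# Drill down to the per-element group records
train2 <- train2[[1]]$groups$data
# Repeat each element number once per group it contains, then flatten the two fields
df <- data.frame(
ID = unlist(lapply(seq_along(train2), function(x) rep.int(x, length(train2[[x]]$group_name)))),
group_name = unlist(lapply(seq_along(train2), function(x) train2[[x]]$group_name)),
date_joined = unlist(lapply(seq_along(train2), function(x) train2[[x]]$date_joined)))
Output: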
> df
ID group_name date_joined
1 1 Let's excercise together and lose a few kilo quicker - everyone is welcome! (Piastow) 2008-09-05 09:55:18.730066
2 1 Strongman competition 2008-05-22 21:25:22.572365
3 1 Fast food 4 life 2012-02-02 05:26:01.293628
4 1 alternative medicine - Hypnosis and bioenergotheraphy 2008-07-05 05:47:12.254848
5 2 Tom Cruise group 2009-06-14 16:48:28.606142
6 2 Babysitters (Sokoka) 2010-09-25 03:21:01.944684
7 2 Work abroad - join to find well paid work and enjoy the experience (Sokoka) 2010-09-21 23:44:39.499240
8 2 Tennis, Squash, Badminton, table tennis - looking for sparring partner (Sokoka) 2007-10-09 17:15:13.896508
9 2 Lost&Found (Sokoka) 2007-01-03 04:49:01.499555
10 3 Polish wildlife - best places 2007-07-29 18:15:49.603727
11 3 Politics and politicians 2010-10-03 21:00:27.154597
12 3 Pizza ! Best recipes 2010-08-25 22:26:48.331266
13 3 Animal rights group - join us if you care! 2010-11-02 12:41:37.753989
14 4 The Aspiring Writer 2009-09-08 15:49:57.132171
15 4 Nutrition & food advices 2010-12-02 18:19:30.887307
16 4 Game of thrones 2009-09-18 10:00:16.190795
17 5 The ultimate house and electro group 2008-01-02 14:57:39.269135
18 5 Pirates of the Carribean 2012-03-05 03:28:37.972484
19 5 Musicians Available Poland (Osieczna) 2009-12-21 13:48:10.887986
20 5 Housekeeping - looking for a housekeeper ? Join the group! (Osieczna) 2008-10-28 23:22:26.159789
21 5 Rooms for rent (Osieczna) 2012-08-09 12:14:34.190438
22 5 Counter strike - global ladderboard 2008-11-28 03:33:43.272435
23 5 Nutrition & food advices 2011-02-08 19:38:58.932003
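The same idea can be tried without the JSON file. The sketch below uses a small hand-made list of data frames as a stand-in for `train2` (the list `groups` and its contents are hypothetical, not taken from the sample file) and builds the ID column by repeating each element number once per row.

```r
# Hypothetical stand-in for the parsed "data" list: one data frame per element
groups <- list(
  data.frame(group_name = c("aaa", "bbb"),
             date_joined = c("2008-09-05", "2008-05-22"),
             stringsAsFactors = FALSE),
  data.frame(group_name = c("eee", "aaa", "bbb"),
             date_joined = c("2009-06-14", "2010-09-25", "2010-09-21"),
             stringsAsFactors = FALSE)
)

# Element number repeated once for every row of that element
ids <- rep(seq_along(groups), times = vapply(groups, nrow, integer(1)))

# Stack the pieces and prepend the ID column
df <- data.frame(ID = ids, do.call(rbind, groups), stringsAsFactors = FALSE)
print(df)
```

Plain `do.call("rbind", ...)` loses the element number because it only stacks the rows; computing `ids` before binding is what carries it over.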
I need to order values from a string range index by the percentage of their words that match the query.
For example, if the search query is aaa and the values are:
aaa bbb ccc
aaa
ccc ddd aaa ppp
The output should be
aaa (100% match)
aaa bbb ccc (33% match)
ccc ddd aaa ppp (25% match)
I can pull all values from the index and loop through them, but I'm looking for a more efficient approach.
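To make the ranking rule concrete, here is a small R sketch of the scoring itself: the share of a value's words that equal the query. It only illustrates the rule on values already pulled into memory and is not an index-side solution; the names `values`, `query`, and `ranked` are illustrative.

```r
values <- c("aaa bbb ccc", "aaa", "ccc ddd aaa ppp")
query <- "aaa"

# Split each value into words, then take the fraction of words equal to the query
words <- strsplit(values, " ", fixed = TRUE)
score <- vapply(words, function(w) mean(w == query), numeric(1))

# Sort by descending score and report a rounded percentage
ranked <- data.frame(value = values, match_pct = round(100 * score),
                     stringsAsFactors = FALSE)
ranked <- ranked[order(-score), ]
print(ranked)
```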
I have a data frame named Records with 2 columns, Rank and Name:
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
I want every name to be tagged 1, 2, 3, or 4 depending on its occurrence and uniqueness.
I want to create a new vector named Tagging.
So the output should be:
Rank 1 has three unique elements (Mark, Spencer, and Ashish), so the tag is 1 for all three.
Rank 2 has one unique record, Mary; Ashish has already been assigned tag 1, so only Mary is tagged 2.
Rank 3 has no unique records, as Spencer and Mark have already been assigned 1, so I cannot tag anybody 3.
Rank 4 has one unique record, Joseph, so he gets tagged 4.
Let me know which function can help me do this.
I do not want to use a loop, as this is a 1,000,000-row dataset.
The solution below follows the principle that a person's tag is the best (numerically lowest) Rank that person appears with.
tbl <- read.table(header=TRUE, text='
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
')
Ordering the 'tbl' data frame by Rank
tbl_ord <- tbl[with(tbl,order(Rank)),]
Removing repeated occurrences of a name within the same Rank
> name_ord <- tbl_ord[!duplicated(tbl_ord[c("Rank", "Name")]),]
> name_ord
Rank Name
1 1 Ashish
6 1 Mark
9 1 Spencer
3 2 Ashish
8 2 Spencer
10 2 Mary
4 3 Mark
7 3 Spencer
5 4 Mark
11 4 Joseph
Keeping the first (lowest-Rank) row for each Name
> name_ord[!duplicated(name_ord$Name),]
Rank Name
1 1 Ashish
6 1 Mark
9 1 Spencer
10 2 Mary
11 4 Joseph
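To attach the tag to every row, as the question asks, the per-name minimum can be broadcast back onto the original rows with base R's ave(), which avoids an explicit loop. A minimal sketch, rebuilding the question's Records data frame inline:

```r
Records <- data.frame(
  Rank = c(1, 1, 2, 3, 4, 1, 3, 2, 1, 2, 4),
  Name = c(rep("Ashish", 3), rep("Mark", 3), rep("Spencer", 3), "Mary", "Joseph"),
  stringsAsFactors = FALSE
)

# For each row, the lowest Rank seen anywhere for that row's Name
Records$Tagging <- ave(Records$Rank, Records$Name, FUN = min)
print(Records)
```

Every Ashish, Mark, and Spencer row gets tag 1, Mary gets 2, and Joseph gets 4, so no row ends up with tag 3.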
Using setkey and unique from the data.table package:
library(data.table)
dt<-data.table(Rank=c(1,1,2,3,4,1,3,2,1,2,4), Name=c(rep("Ashish", 3), rep("Mark", 3), rep("Spencer", 3), "Mary", "Joseph"))
setkey(dt, Rank, Name) # sort by Rank, then Name
dt<-unique(dt) # drop fully duplicated rows
setkey(dt, Name) # stable sort, so the lowest Rank stays first within each Name
dt<-unique(dt, by="Name") # first row per Name; since data.table 1.9.8 unique() no longer defaults to the key, so by= is needed
setkey(dt, Rank) # if you want to order them by Rank again
I have the data frame below:
data<-data.frame(names= c("Bob","Bob", "Fred","Fred","Tom"), id =c(1,1,2,2,3),amount = c(100,200,400,500,700), status = c("Active","Not Active","Active","Retired","Active"))
data
names id amount status
1 Bob 1 100 Active
2 Bob 1 200 Not Active
3 Fred 2 400 Active
4 Fred 2 500 Retired
5 Tom 3 700 Active
I would like to pivot the "status" column so that the "amount" values appear under new status columns, like this:
names id Active Not Active Retired
Bob 1 100 200
Fred 2 400 500
Tom 3 700
Is this possible? What is the best way?
I am now compelled to turn a comment into an answer. Here's the Hadleyverse version:
library(tidyr)
spread(data, status, amount)
## names id Active Not Active Retired
## 1 Bob 1 100 200 NA
## 2 Fred 2 400 NA 500
## 3 Tom 3 700 NA NA
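In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The sketch below shows the equivalent call on the same data, assuming a recent tidyr is installed; the name `wide` is only illustrative.

```r
library(tidyr)

data <- data.frame(
  names = c("Bob", "Bob", "Fred", "Fred", "Tom"),
  id = c(1, 1, 2, 2, 3),
  amount = c(100, 200, 400, 500, 700),
  status = c("Active", "Not Active", "Active", "Retired", "Active"),
  stringsAsFactors = FALSE
)

# One column per status value, filled with amount; missing combinations become NA
wide <- pivot_wider(data, names_from = status, values_from = amount)
print(wide)
```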
Here is a solution using dcast from the package reshape2:
library(reshape2)
dcast(data, names + id ~ status, value.var="amount")
# names id Active Not Active Retired
# 1 Bob 1 100 200 NA
# 2 Fred 2 400 NA 500
# 3 Tom 3 700 NA NA
This would be the base method:
> xtabs(amount~names+status, data=data)
status
names Active Not Active Retired
Bob 100 200 0
Fred 400 0 500
Tom 700 0 0
Here is another base R option
reshape(data, idvar=c('names', 'id'), timevar='status', direction='wide')
# names id amount.Active amount.Not Active amount.Retired
#1 Bob 1 100 200 NA
#3 Fred 2 400 NA 500
#5 Tom 3 700 NA NA
I have data of the form:
month price name
1 200 xyz
1 300 abc
2 500 xyz
3 300 abc
4 400 cde
5 200 cde
5 100 abc
5 200 xyz
I want to create a month-wise cumulative sum graph. Can anyone please help me with that?
Try:
ts.plot(cumsum(as.vector(unlist(tapply(df$price,df$month,sum)))),
main="cumulative month wise",
xlab="month",ylab="cumulative",lty=3,col="purple",type="o")
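For reference, the same computation can be split into steps with the question's data built inline. This is a sketch using only base R; the names `monthly` and `cumulative` are illustrative.

```r
df <- data.frame(
  month = c(1, 1, 2, 3, 4, 5, 5, 5),
  price = c(200, 300, 500, 300, 400, 200, 100, 200),
  name = c("xyz", "abc", "xyz", "abc", "cde", "cde", "abc", "xyz"),
  stringsAsFactors = FALSE
)

# Total price per month (tapply returns the groups in sorted month order)
months <- sort(unique(df$month))
monthly <- as.vector(tapply(df$price, df$month, sum))

# Running total across months
cumulative <- cumsum(monthly)

plot(months, cumulative, type = "o",
     main = "cumulative month wise", xlab = "month", ylab = "cumulative")
```

Plotting against `months` rather than the vector index keeps the x-axis correct even if some month numbers are missing from the data.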