So I have a dataframe with about 500,000 obs that looks like this:
ID MonthYear Group
123 200811 Blue
345 201102 Red
678 201110 Blue
910 201303 Green
I would like to convert this to a panel that counts the number of occurrences for each group in each month. So it would look like this:
MonthYear Group Count
200801 Blue 521
200802 400
....
200801 Red 521
200802 600
....
I guess it doesn't need to look exactly like that, but just some way to turn this into a useful panel. Aggregate doesn't seem to be sufficient in and of itself.
aggregate(dfrm$ID, dfrm[,c("MonthYear","Group")], length)
If you want to reverse the grouping, just reverse the order of the columns passed as the by argument.
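A small self-contained sketch of that call, assuming a toy dfrm (the fifth row is made up so that one group has a count above 1). Note that aggregate() only returns month/group combinations that actually occur. If the panel should also carry zero counts, as.data.frame(table(...)) is one option.

```r
# Toy data: the first four rows come from the question, the fifth is made up.
dfrm <- data.frame(
  ID        = c(123, 345, 678, 910, 911),
  MonthYear = c(200811, 201102, 201110, 201303, 200811),
  Group     = c("Blue", "Red", "Blue", "Green", "Blue")
)

# Count rows per MonthYear/Group combination that occurs in the data.
counts <- aggregate(dfrm$ID, dfrm[, c("MonthYear", "Group")], length)
names(counts)[3] <- "Count"   # aggregate() names the result column "x"

# Balanced panel: every MonthYear x Group combination, zeros included.
panel <- as.data.frame(
  table(MonthYear = dfrm$MonthYear, Group = dfrm$Group),
  responseName = "Count"
)
```

counts has one row per observed combination, while panel has one row for each of the 4 x 3 possible combinations.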
Let's say the data looks like this:
cust_id visit_dt Purchase_dt item FIRST_Purchase_dt
1234 1/11/2017 1/12/2017 Big 1/1/2015
1234 1/18/2018 1/19/2018 Big 1/1/2015
1567 1/11/2008 1/12/2008 Big 3/27/2007
1345 1/3/2006 Small 1/2/2006
1345 1/24/2008 1/24/2008 Big 1/2/2006
1579 1/24/2009 Medium 5/6/2006
I want to calculate the number of days between events, as follows:
a) If the cust_id has no duplicates, or for the first row of a duplicated cust_id when sorted by visit_dt, it is visit_dt - FIRST_Purchase_dt.
b) For later rows of a duplicated cust_id, it is visit_dt - previous Purchase_dt if that exists, otherwise visit_dt - previous visit_dt.
The fallback is needed because there is no Purchase_dt when the item is Small or Medium.
select cust_id,visit_dt,Purchase_dt, item,FIRST_Purchase_dt,visit_dt-FIRST_Purchase_dt as Days_BTW from table
This works only for the first condition; I'm not sure how to implement the second condition.
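A rough sketch of the two rules in R rather than SQL, under one reading of them: the first row per cust_id (sorted by visit_dt) uses FIRST_Purchase_dt, and later rows use the previous Purchase_dt, falling back to the previous visit_dt. The data frame visits and the helper lag_within are names made up for this illustration.

```r
# The six rows from the question; blank Purchase_dt values become NA.
visits <- data.frame(
  cust_id = c(1234, 1234, 1567, 1345, 1345, 1579),
  visit_dt = as.Date(c("2017-01-11", "2018-01-18", "2008-01-11",
                       "2006-01-03", "2008-01-24", "2009-01-24")),
  purchase_dt = as.Date(c("2017-01-12", "2018-01-19", "2008-01-12",
                          NA, "2008-01-24", NA)),
  first_purchase_dt = as.Date(c("2015-01-01", "2015-01-01", "2007-03-27",
                                "2006-01-02", "2006-01-02", "2006-05-06"))
)
visits <- visits[order(visits$cust_id, visits$visit_dt), ]

# Hypothetical helper: previous value within each group, as day numbers.
lag_within <- function(x, g) {
  ave(as.numeric(x), g, FUN = function(v) c(NA, head(v, -1)))
}

first_row     <- !duplicated(visits$cust_id)
prev_purchase <- lag_within(visits$purchase_dt, visits$cust_id)
prev_visit    <- lag_within(visits$visit_dt, visits$cust_id)

# Rule (a) for the first row per customer, rule (b) otherwise.
reference <- ifelse(first_row,
                    as.numeric(visits$first_purchase_dt),
                    ifelse(!is.na(prev_purchase), prev_purchase, prev_visit))
visits$days_btw <- as.numeric(visits$visit_dt) - reference
```

In SQL the same idea would typically be expressed with window functions, but whether and how they are available depends on the database in use.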
I'm looking to identify duplicate records in my data set based on multiple columns, review the records, and keep the ones with the most complete data in R. I would like to keep the row(s) associated with each name that have the maximum number of data points populated. In the case of date columns, I would also like to treat invalid dates as missing. My data looks like this:
df<-data.frame(Record=c(1,2,3,4,5),
First=c("Ed","Sue","Ed","Sue","Ed"),
Last=c("Bee","Cord","Bee","Cord","Bee"),
Address=c(123,NA,NA,456,789),
DOB=c("12/6/1995","0056/12/5",NA,"12/5/1956","10/4/1980"))
Record First Last Address DOB
1 Ed Bee 123 12/6/1995
2 Sue Cord 0056/12/5
3 Ed Bee
4 Sue Cord 456 12/5/1956
5 Ed Bee 789 10/4/1980
So in this case I would keep records 1, 4, and 5. There are approximately 85000 records and 130 variables, so if there is a way to do this systematically, I'd appreciate the help. Also, I'm a total R newbie (as if you couldn't tell), so any explanation is also appreciated. Thanks!
#Add a new column to the dataframe containing the number of NA values in each row.
df$nMissing <- apply(df,MARGIN=1,FUN=function(x) {return(length(x[which(is.na(x))]))})
#Using ave, find the indices of the rows for each name with min nMissing
#value and use them to filter your data
deduped_df <-
df[which(df$nMissing==ave(df$nMissing,paste(df$First,df$Last),FUN=min)),]
#If you like, remove the nMissing column
df$nMissing<-deduped_df$nMissing<-NULL
deduped_df
Record First Last Address DOB
1 1 Ed Bee 123 12/6/1995
4 4 Sue Cord 456 12/5/1956
5 5 Ed Bee 789 10/4/1980
Edit: Per your comment, if you also want to filter on invalid DOBs, you can start by converting the column to date format, which turns any value that doesn't parse with the given format into NA (missing data). Do this before computing nMissing so that those values are counted as missing.
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
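Putting the pieces together on the sample data, as a sketch: the date conversion runs first so that the unparseable DOB counts as missing, and rowSums(is.na(df)) is used here as a shorter way to count NAs per row. The names n_missing and keep are just for this illustration.

```r
df <- data.frame(Record  = c(1, 2, 3, 4, 5),
                 First   = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last    = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB     = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))

# Values that do not match month/day/year become NA.
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")

# Number of missing fields in each row.
n_missing <- rowSums(is.na(df))

# Keep the row(s) per name with the fewest missing fields.
keep <- n_missing == ave(n_missing, paste(df$First, df$Last), FUN = min)
deduped_df <- df[keep, ]
```

With 130 variables the same code applies unchanged, since rowSums(is.na(df)) looks at every column.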
This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive me for my poor formatting of separating the datasets. Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
Between the code above and the code below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fine and dandy for two data sets, but I am having very annoying difficulties adding more players to this combined data set.
For example, maybe I would like to add 5, 10, or 15 more players to this set, such as Kramnik, Anand, and Gelfand (all famous chess players). As you'd expect, for 5 players the dataframe would have 6 columns, for 10 it would have 11, and for 15 it would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to be able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern='^Player\\d+'))
list2env(lapply(lst,`[`,-2), envir =.GlobalEnv)
lst <- mget(ls(pattern='^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
names(lst[[i]]) [names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to build a table merged by Year, so that it combines (cbind, bind_cols, merge, etc.) each of the Player"i" dataframes in my list, which are not necessarily equal in length, into a combined/merged set like the one produced by merge(Player1, Player2) above?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list, and just directly do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which would rename the Rating column in every dataframe whose name starts with "Player", followed by
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the Player"i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of Player1, Player2, ..., Player i to the list, I just joined all of the sets contained in the list by Year.
For loop that generates all of unique datasets.
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Puts them into a list
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join by common value
library(plyr)  # join_all() comes from the plyr package
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets...., all= TRUE), it drops certain observations for an unknown reason, will have to see why this happens.
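A likely explanation for the dropped rows: plyr's join_all() defaults to type = "left", so only the years present in the first data frame of the list survive. Passing type = "full" should give the same behaviour as all = TRUE. A base R alternative is to fold merge() over the list with Reduce(), sketched below on the first few rows of the two tables above (truncated so the year ranges only partly overlap, with the rating columns renamed to Nakamura and Carlsen for brevity).

```r
Player1 <- data.frame(
  Year     = as.Date(c("2001-01-01", "2002-01-01", "2003-01-01")),
  Nakamura = c(2364, 2430, 2520)
)
Player2 <- data.frame(
  Year    = as.Date(c("2002-01-01", "2003-01-01", "2004-01-01")),
  Carlsen = c(2127, 2279, 2484)
)
lst <- list(Player1 = Player1, Player2 = Player2)

# Left join: keeps only the years present in the first data frame.
left_joined <- Reduce(function(x, y) merge(x, y, by = "Year", all.x = TRUE), lst)

# Full outer join: keeps every year that appears in any data frame.
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE), lst)
```

The same Reduce() call works for a list of any length, so lst can hold all of the Player"i" data frames returned by mget().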
I have two datasets: one called domain (d), which has general information about a gene, and a table called mutation (m). Both tables have a common column called Gene.name, which I'll use for the lookup. The two datasets do not have the same number of columns or rows.
I want to go through all the data in the file mutation and check to see whether the data found in column gene.name also exists in the file domain. If it does, I want it to check whether the data in column mutation is between the column "Start" and "End" (they can be equal to Start or End). If it is, I want to print it out to a new table with the merged column: Gene.Name, Mutation, and the domain information. If it doesn't exist, ignore it.
So this is what I have so far:
d<-read.table("domains.txt", header=TRUE)
d
Gene.name Domain Start End
ABCF1 low_complexity_region 2 13
DKK1 low_complexity_region 25 39
ABCF1 AAA 328 532
F2 coiled_coil_region 499 558
m<-read.table("mutations.tx", header=TRUE)
m
Gene.name Mutation
ABCF1 10
DKK1 21
ABCF1 335
xyz 15
F2 499
newfile<-m[, list(new=findInterval(d(c(d$Start,
d$End)),by'=Gene.Name']
My code isn't working and I'm reading a lot of different questions/answers and I'm much more confused. Any help would be great.
I'd like my final data to look like this:
Gene.name Mutation Domain
DKK1 21 low_complexity_region
ABCF1 335 AAA
F2 499 coiled_coil_region
A merge and subset should get you there (though I think your intended result doesn't match your description of what you want):
result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]
# Gene.name Domain Start End Mutation
#1 ABCF1 low_complexity_region 2 13 10
#4 ABCF1 AAA 328 532 335
#6 F2 coiled_coil_region 499 558 499
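For anyone who wants to run this without the input files, here is the same approach with the two tables typed in directly. The variable name hits is just for this sketch. It also shows why DKK1 drops out: its mutation at 21 lies outside the 25 to 39 range of its only domain.

```r
d <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
  Domain    = c("low_complexity_region", "low_complexity_region",
                "AAA", "coiled_coil_region"),
  Start     = c(2, 25, 328, 499),
  End       = c(13, 39, 532, 558)
)
m <- data.frame(
  Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
  Mutation  = c(10, 21, 335, 15, 499)
)

# Inner join on gene name: genes missing from d (here "xyz") are dropped,
# and each mutation is paired with every domain of its gene.
result <- merge(d, m, by = "Gene.name")

# Keep only the pairs where the mutation lies inside the domain (inclusive).
hits <- result[result$Mutation >= result$Start & result$Mutation <= result$End,
               c("Gene.name", "Mutation", "Domain")]
```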
Hello and thank you in advance for your assistance,
(PLEASE note the Comments section for additional insight: the COST column in the example below was added to this question later. Simon provides a great answer, but the COST column is not shown in the output he posted, although the function he provides works with the COST column.)
I have a data set, let's call it 'data', which looks like this:
NAME DATE COLOR PAID COST
Jim 1/1/2013 GREEN 150 100
Jim 1/2/2013 GREEN 50 25
Joe 1/1/2013 GREEN 200 150
Joe 1/2/2013 GREEN 25 10
What I would like to do is sum the PAID (and COST) elements of the records with the same NAME value and reduce the number of rows (as in this example) to 2, such that my new data frame looks like this:
NAME DATE COLOR PAID COST
Jim 1/2/2013 GREEN 200 125
Joe 1/2/2013 GREEN 225 160
As far as the dates are concerned, I don't really care about which one survives the summation process.
I've gotten as far as rowSums(data), but I'm not exactly certain how to use it. Any help would be greatly appreciated.
aggregate is the function you are looking for:
aggregate( cbind( PAID , COST ) ~ NAME + COLOR , data = data , FUN = sum )
# NAME PAID
# 1 Jim 200
# 2 Joe 225
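For reference, a self-contained run of that call on the four rows from the question. With both columns inside cbind(), the result carries COST as well as PAID, and because DATE is not in the formula it is dropped from the output. The name totals is just for this sketch.

```r
data <- data.frame(
  NAME  = c("Jim", "Jim", "Joe", "Joe"),
  DATE  = c("1/1/2013", "1/2/2013", "1/1/2013", "1/2/2013"),
  COLOR = c("GREEN", "GREEN", "GREEN", "GREEN"),
  PAID  = c(150, 50, 200, 25),
  COST  = c(100, 25, 150, 10)
)

# Sum PAID and COST within each NAME/COLOR combination.
totals <- aggregate(cbind(PAID, COST) ~ NAME + COLOR, data = data, FUN = sum)
```

If a DATE should survive, one option is to aggregate it separately (for example with FUN = max after converting it with as.Date) and merge the result back on NAME.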