Removing Duplicates From a Dataframe in R - r

I am trying to clean up a data set of student results for processing, and I'm having trouble completely removing duplicates: I only want to look at "first attempts", but some students have taken the course multiple times. An example of the data, using one of the duplicated students, is:
id period desc
632 1507 1101 90714 Research a contemporary biological issue
633 1507 1101 6317 Explain the process of speciation
634 1507 1101 8931 Describe gene expression
14448 1507 1201 8931 Describe gene expression
14449 1507 1201 6317 Explain the process of speciation
14450 1507 1201 90714 Research a contemporary biological issue
25884 1507 1301 6317 Explain the process of speciation
25885 1507 1301 8931 Describe gene expression
25886 1507 1301 90714 Research a contemporary biological issue
The first 2 digits of reg_period are the year they sat the paper. As can be seen, I want to keep the rows where id is 1507 and reg_period is 1101. So far, my code to get the values I want to trim is:
unique.rows <- unique(df[c("id", "period")])
dups <- (unique.rows[duplicated(unique.rows$id),])
However, I am then running into a couple of problems. This only works because the data is ordered by id and reg_period, which isn't guaranteed in future. I also don't know how to take this list of duplicate entries and select the rows that are not in it, because %in% doesn't seem to work with it and a loop with rbind runs out of memory.
What's the best way to handle this?

I would probably use dplyr. Calling your data df:
library(dplyr)
result = df %>% group_by(id) %>%
filter(period == min(period))
If you prefer base, I would pull the id/period combinations to keep into a separate data frame and then do an inner join with the original data:
id_pd = df[order(df$id, df$period), c("id", "period")]
id_pd = id_pd[!duplicated(id_pd$id), ]
result = merge(df, id_pd)
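As a minimal sketch of the same "keep the earliest period per id" idea without a join, base R's ave can compute the minimum period for each id and the rows can then be filtered against it. The small data frame below is made up for illustration (id 1600 and the short desc values are hypothetical, not from the question), and the rows are deliberately unordered to show that sorting is not required.

```r
# Toy data, deliberately NOT sorted by id/period.
# id 1600 and the desc values are hypothetical illustration data.
df <- data.frame(
  id     = c(1507, 1507, 1507, 1507, 1600, 1600),
  period = c(1201, 1101, 1101, 1301, 1301, 1201),
  desc   = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# For every row, the earliest period seen for that row's id
first_period <- ave(df$period, df$id, FUN = min)

# Keep only the rows that belong to each id's first attempt
first_attempts <- df[df$period == first_period, ]
print(first_attempts)
```

This keeps every row of the first attempt (all papers sat in that period), not just one row per id.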

Try this, it works for me with your data:
dd <- read.csv("a.csv", colClasses=c("numeric","numeric","character"), header=TRUE)
print (dd)
dd <- dd[order(dd$id, dd$period), ]
dd <- dd[!duplicated(dd[, c("id","period")]), ]
print (dd)
Output:
id period desc
1 1507 1101 90714 Research a contemporary biological issue
4 1507 1201 8931 Describe gene expression
7 1507 1301 6317 Explain the process of speciation

Related

Cartesian product of two large dataframes, keeping the values that fulfil a condition

So my problem may be naïve, but I've been searching for a long while and I still can’t find the answer. I have two large data sets:
One is a census file with more than 700,000 records.
Lastname Census 1stname Census census_year
C2last C2first 1880
C3last C3first 1850
C4last C4first 1850
The other one is a sample of civil registers composed of 80,000 observations.
Lastname Reg 1stname reg birth_year
P2Last P2first 1818
P3last P3first 1879
P4last P4first 1903
I need to carry out the Cartesian product of both data sets, which is obviously a huge file (700,000 x 80,000), where for each row of the census we should be adding the 80,000 civil registers with an extra variable.
The values for this extra variable depend on a condition. The condition is that the census year (a variable of the census) is larger than the variable 'year of birth' of the civil registers (in other words, the census year is later than the birth year on the register).
As I said, the goal is to make the Cartesian product, but adding an extra variable (flag) that gives a '1' when the condition is fulfilled (census year > birth year) or '0', when it's not:
LastNCens 1stNCens cens_year LastNamReg 1stNamReg birth Flag
C2last C2first 1880 P2Last P2first 1818 1
P3last P3first 1879 1
P4last P4first 1903 0
C3last C3first 1850 P2Last P2first 1818 1
P3last P3first 1879 0
P4last P4first 1903 0
C4last C4first 1860 P2Last P2first 1818 1
P3last P3first 1879 0
P4last P4first 1903 0
All this, keeping in mind that the product is too big.
I have tried many things (compare, diff, intersect) and I've read also other things that I couldn't apply (df.where, pd.merge), but they don't do what I need and I can't use them here. My simple approach would have been:
cp <- merge(census, register,by=NULL);
final.dataframe <- cp [which (cp$census_year > cp$birth_year_hsn ),]
But R runs out of memory.
It goes without saying that the resulting data frame (the Cartesian product) would also be valid with only those records that are flagged as '1' (getting rid of those with Flag='0').
I hope this is well explained and also useful for other people… Thanks a million for any tip. It's very welcome.
Following the comments on the question, you could achieve what you're looking for with the data.table package. The package modifies by reference, which can help reduce the memory used for subsets, merges and calculations. For more information about the package I suggest the wiki on their GitHub page, which contains a quick cheat sheet for most computations.
Below is an example of how one could perform the kind of merge you are looking for using data.table. It is referred to as a non-equi join.
A few notes. It seems that a bug is present in the data.table package, which has not yet been noted. by = .EACHI seems necessary when you output both of the joined columns, in order to obtain the original values of the left part of the join. However it is a small cost.
library(data.table)
df1 <- fread("Lastname_Census firstname_Census census_year
C2last C2first 1880
C3last C3first 1850
C4last C4first 1850", key = "census_year")
df2 <- fread("Lastname_Reg firstname_reg birth_year
P2Last P2first 1818
P3last P3first 1879
P4last P4first 1903", key = "birth_year")
cart_join <-
df2[df1, #join df1 on df2
on = .(birth_year >= census_year), #join criteria
#Force keep all columns to keep (i.var, indicates to keep var from df1)
j = .(i.Lastname_Census,
i.firstname_Census,
Lastname_Reg,
firstname_reg,
birth_year,
i.census_year,
Flag = birth_year >= i.census_year),
#Force evaluation on each i. This will keep the correct birth_year (seems to be a bug)
by = .EACHI,
#Let the table grow beyond nrow(df1) + nrow(df2)
allow.cartesian = TRUE][,-1] #Remove the first column. It is a merge column
Edit (A few possible bugs)
After playing around with the join, I noticed a few irregularities and opened an issue here. Be careful with the answer suggested above: it seems to work fine when returning values from both tables (other than the ones used in the on statement), but it is not foolproof.
Please refer to my open issue for more information.
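A different way to stay within memory, sketched here in base R rather than data.table, is to build the Cartesian product one chunk of census rows at a time and keep only the pairs that satisfy census_year > birth_year. This is not the method from the answer above; the names chunk_size, pieces and matches are assumptions for illustration, and the chunk size would need tuning on real data.

```r
# Sample data from the question
census <- data.frame(
  Lastname_Census  = c("C2last", "C3last", "C4last"),
  firstname_Census = c("C2first", "C3first", "C4first"),
  census_year      = c(1880, 1850, 1850),
  stringsAsFactors = FALSE
)
register <- data.frame(
  Lastname_Reg  = c("P2Last", "P3last", "P4last"),
  firstname_reg = c("P2first", "P3first", "P4first"),
  birth_year    = c(1818, 1879, 1903),
  stringsAsFactors = FALSE
)

chunk_size <- 2  # hypothetical; pick something that fits in memory
starts <- seq(1, nrow(census), by = chunk_size)

pieces <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(census))
  # by = NULL gives the Cartesian product of this chunk with the register
  cp <- merge(census[rows, , drop = FALSE], register, by = NULL)
  # Keep only the pairs that would have Flag = 1
  cp[cp$census_year > cp$birth_year, ]
})

matches <- do.call(rbind, pieces)
print(matches)
```

Only one chunk's product exists in memory at a time, so the peak size is chunk_size x nrow(register) rows plus the accumulated matches. The set of matches itself can still be large if most pairs satisfy the condition.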

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive me for my poor formatting of separating the datasets. Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
You can download the two sets here:
Download Player2
Download Player1
Between the data above and the result below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fine and dandy for two data sets, but I am having very annoying difficulties adding more players to this combined data set.
For example, maybe I would like to add 5, 10, 15 more players to this set. Examples of these players would be Kramnik, Anand, Gelfand ( Examples of famous chess players). As you'd expect, for 5 players, the dataframe would have 6 columns, 10 would have 11, 15 would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to be able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern='^Player\\d+'))
list2env(lapply(lst,`[`,-2), envir =.GlobalEnv)
lst <- mget(ls(pattern='^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
names(lst[[i]]) [names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to build a table merged by Year, so that it combines (cbind, bind_cols, merge, etc.) each of the Player"i" dataframes in my list, which are not necessarily equal in length, in such a way that I get a combined/merged set like the one you saw below the merge(Player1, Player2) call?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list function, and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of the Player1, Player 2....Player i into the list, I just joined all of the sets contained in the list by Year.
For loop that generates all of unique datasets.
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Puts them into a list
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join by common value
library(plyr)
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets...., all= TRUE), it drops certain observations for an unknown reason, will have to see why this happens.
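A likely explanation for the dropped rows is that plyr's join_all defaults to type = "left", so any Year missing from the first data frame in the list is discarded. As a hedged sketch, a full outer merge over a list can be written in base R with Reduce and merge(..., all = TRUE). The first two data frames below use a few rows from the question; Player3 and its ratings are made up purely for illustration.

```r
# A few rows from the question's data
Player1 <- data.frame(Year = as.Date(c("2001-01-01", "2002-01-01", "2003-01-01")),
                      Nakamura = c(2364, 2430, 2520))
Player2 <- data.frame(Year = as.Date(c("2002-01-01", "2003-01-01")),
                      Carlsen = c(2127, 2279))
# Hypothetical third player with made-up ratings, for illustration only
Player3 <- data.frame(Year = as.Date(c("2003-01-01", "2004-01-01")),
                      PlayerX = c(2700, 2710))

lst <- list(Player1, Player2, Player3)

# Fold the list pairwise with a full outer merge on Year;
# years missing for a player are filled with NA
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE), lst)
print(merged)
```

The same list can come straight from mget(ls(pattern = '^Player\\d+')) as in the answer, provided each data frame has already been reduced to Year plus one uniquely named rating column.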

Test performing on counts

In R I have a dataset data1 that contains game and times. There are 6 games, and times simply tells us how many times a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similar for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2 but I do not think that information is relevant.
I want to compare the two datasets and see whether there is a statistically significant difference, and which game "causes" that difference.
Which test should I use to compare these? I don't think Pearson's chisq.test is the right choice in this case; maybe wilcox.test is the right one to choose?

Matching Data from Different columns / dataframes - Working in R

Here is some sample data
Dataset A
id name reasonforlogin
123 Tom work
246 Timmy work
789 Mark play
Dataset B
id name reasonforlogin
789 Mark work
313 Sasha interview
000 Meryl interview
987 Dara play
789 Mark play
246 Timmy work
Two datasets. Same columns. Uneven number of rows.
I want to be able to say something like
1)"I want all of id numbers that appear in both datasetA and datasetB"
or
2)"I want to know how many times any one ID logs in on a day, say day 2."
So the answer to
1) So a list like
[246, 789]
2) So a data.frame with a "header" of ids, and then a "row" of their login numbers.
123, 246, 789, 313, 000, 987
0, 1, 2, 1, 1, 1
It seems easy, but I think it's non-trivial to do this quickly with large data. Originally I planned on doing loops-in-loops, but I'm sure there has to be a term for these kinds of comparisons and likely packages that already do similar things.
If we have A as the first data set and B the second, and id as a character column in both so as to keep 000 from being printed as 0, we can do ...
id common to both data sets:
intersect(A$id, B$id)
# [1] "246" "789"
Times an id logged in on the second day (B), including those that were not logged in at all:
table(factor(B$id, levels = unique(c(A$id, B$id))))
# 123 246 789 313 000 987
# 0 1 2 1 1 1
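To make the two calls above reproducible, here is a small self-contained sketch that builds A and B from the sample data in the question, with id stored as character as the answer assumes. The variable names common and logins are introduced only for this example.

```r
# Sample data from the question; id is character so "000" keeps its zeros
A <- data.frame(id = c("123", "246", "789"),
                name = c("Tom", "Timmy", "Mark"),
                reasonforlogin = c("work", "work", "play"),
                stringsAsFactors = FALSE)
B <- data.frame(id = c("789", "313", "000", "987", "789", "246"),
                name = c("Mark", "Sasha", "Meryl", "Dara", "Mark", "Timmy"),
                reasonforlogin = c("work", "interview", "interview",
                                   "play", "play", "work"),
                stringsAsFactors = FALSE)

# 1) ids present in both data sets
common <- intersect(A$id, B$id)
print(common)

# 2) logins per id in B, with zero counts for ids that appear only in A
logins <- table(factor(B$id, levels = unique(c(A$id, B$id))))
print(logins)
```

Fixing the factor levels to the union of both id columns is what produces the explicit zero for id 123.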
You can do both with dplyr
1
A %>% select(id) %>%
inner_join(B %>% select(id)) %>%
distinct()
2
B %>% count(id)
You need which and table.
1) Find which ids are in both data.frames
common_ids <- unique(df1[which(df1$id %in% df2$id), "id"])
Using intersect as in the other answers is much more elegant in this simple case. However, which provides more flexibility when the comparison you need is more complicated than simple equality, and is worth knowing.
2) Find how many times any ID logs in
table(df1$id)

Comparing multiple columns in different data sets to find values within range R

I have two datasets: one called domain (d), which has general information about a gene, and a table called mutation (m). Both tables have a similar column called Gene.name, which I'll use for the lookup. The two datasets do not have the same number of columns or rows.
I want to go through all the data in the file mutation and check to see whether the data found in column gene.name also exists in the file domain. If it does, I want it to check whether the data in column mutation is between the column "Start" and "End" (they can be equal to Start or End). If it is, I want to print it out to a new table with the merged column: Gene.Name, Mutation, and the domain information. If it doesn't exist, ignore it.
So this is what I have so far:
d<-read.table("domains.txt")
d
Gene.name Domain Start End
ABCF1 low_complexity_region 2 13
DKK1 low_complexity_region 25 39
ABCF1 AAA 328 532
F2 coiled_coil_region 499 558
m<-read.table("mutations.tx")
m
Gene.name Mutation
ABCF1 10
DKK1 21
ABCF1 335
xyz 15
F2 499
newfile<-m[, list(new=findInterval(d(c(d$Start,
d$End)),by'=Gene.Name']
My code isn't working and I'm reading a lot of different questions/answers and I'm much more confused. Any help would be great.
I'd like my final data to look like this:
Gene.name Mutation Domain
DKK1 21 low_complexity_region
ABCF1 335 AAA
F2 499 coiled_coil_region
A merge and subset should get you there (though I think your intended result doesn't match your description of what you want):
result <- merge(d,m,by="Gene.name")
result[with(result,Mutation >= Start & Mutation <= End),]
# Gene.name Domain Start End Mutation
#1 ABCF1 low_complexity_region 2 13 10
#4 ABCF1 AAA 328 532 335
#6 F2 coiled_coil_region 499 558 499
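For a version that runs without the input files, the sketch below rebuilds d and m from the sample data in the question and applies the same merge-then-subset approach, selecting only the three columns asked for. The name hits is introduced here for illustration.

```r
# Sample data from the question
d <- data.frame(Gene.name = c("ABCF1", "DKK1", "ABCF1", "F2"),
                Domain = c("low_complexity_region", "low_complexity_region",
                           "AAA", "coiled_coil_region"),
                Start = c(2, 25, 328, 499),
                End = c(13, 39, 532, 558),
                stringsAsFactors = FALSE)
m <- data.frame(Gene.name = c("ABCF1", "DKK1", "ABCF1", "xyz", "F2"),
                Mutation = c(10, 21, 335, 15, 499),
                stringsAsFactors = FALSE)

# Inner join on Gene.name: genes missing from d (e.g. "xyz") drop out
result <- merge(d, m, by = "Gene.name")

# Keep mutations inside [Start, End], inclusive at both ends
in_range <- result$Mutation >= result$Start & result$Mutation <= result$End
hits <- result[in_range, c("Gene.name", "Mutation", "Domain")]
print(hits)
```

DKK1 is absent from the result because its mutation (21) falls outside the 25 to 39 interval, which is the mismatch with the intended output that the answer points out.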

Resources