Remove duplicates where values are swapped across 2 columns in R [duplicate]

This question already has answers here:
pair-wise duplicate removal from dataframe [duplicate]
(4 answers)
Closed 6 years ago.
I have a simple dataframe like this:
| id1 | id2 | location | comment |
|-----|-----|------------|-----------|
| 1 | 2 | Alaska | cold |
| 2 | 1 | Alaska | freezing! |
| 3 | 4 | California | nice |
| 4 | 5 | Kansas | boring |
| 9 | 10 | Alaska | cold |
The first two rows are duplicates because id1 and id2 both went to Alaska. It doesn't matter that their comments are different.
How can I remove one of these duplicates? Either one would be fine to remove.
I was first trying to sort id1 and id2, then get the index where they are duplicated, then go back and use the index to subset the original df. But I can't seem to pull this off.
df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))

We can use apply with MARGIN = 1 to sort the 'id' columns within each row, cbind the result with 'location', and then use duplicated to get a logical index for removing/keeping the rows.
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
# id1 id2 location comment
#1 1 2 Alaska cold
#3 3 4 California nice
#4 4 5 Kansas boring
#5 9 10 Alaska cold
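A vectorized alternative (a sketch, assuming the id columns are numeric) keys each pair with pmin/pmax instead of a row-wise apply:
# pmin/pmax take the element-wise min and max of the two id columns,
# so each (id1, id2) pair gets the same key regardless of which column holds which id
key <- data.frame(lo  = pmin(df$id1, df$id2),
                  hi  = pmax(df$id1, df$id2),
                  loc = df$location)
df[!duplicated(key), ]
This returns the same four rows as above.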

Related

Combining 3 datasets with a lot of columns (100+), no overlapping rows, and some unknown overlapping columns

I have 3 datasets with varying rows and columns. In the end result all rows should be there, and all non-overlapping (unique) columns should be there.
a <- data.frame(a=c(0,1,2), b=c(3,4,5), c=c(6,7,8))
b <- data.frame(a=c(9,10,11), c=c(12,13,14), d=c(15,16,17))
Needs to be
c <- data.frame(a=c(0,1,2,9,10,11), b=c(3,4,5,NA,NA,NA), c=c(6,7,8,12,13,14), d=c(NA,NA,NA,15,16,17))
But imagine that instead of a through d you have the whole alphabet four times. (Edit: and you don't know in advance which column names overlap, e.g. that a in dataset a and a in dataset b are the same column.)
The default behavior of all the dplyr join commands is to join on every column the two datasets have in common. Since you want to keep all rows, even when there is nothing to join on, you want a full outer join.
Probably something like the following (dplyr calls its full outer join full_join):
output = input_df1 %>%
  dplyr::full_join(input_df2) %>%
  dplyr::full_join(input_df3)
This will begin by joining the first two dataframes using all columns they have in common. Then it will join on the third dataframe using all columns in common.
This assumes that wherever dataframes have the same columns they have the same entries. Consider the following example:
df1:
a | b | z
---+---+---
1 | NA| 7
df2:
a | b | y
---+---+---
1 | 2 | 8
df3:
a | b | x
---+---+---
1 | 2 | 9
This will not produce
output:
a | b | z | y | x
---+---+---+---+---
1 | 2 | 7 | 8 | 9
Because the first table does not have the same b value, even though it has the same a value. Instead this will produce:
output:
a | b | z | y | x
---+---+---+---+---
1 | 2 | NA| 8 | 9
1 | NA| 7 | NA| NA
If you need to handle this type of case, you will have to put more effort into your joins. Perhaps start with:
colnames1 = colnames(input_df1)
colnames2 = colnames(input_df2)
common_colnames = colnames1[colnames1 %in% colnames2]
To get all common column names and decide from there how to join.
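For example (a sketch, reusing common_colnames from above as the join keys):
# join only on the columns you have decided are trustworthy keys
output = dplyr::full_join(input_df1, input_df2, by = common_colnames)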

Add new column to dataframe, based on values at specific rows within that dataframe [duplicate]

This question already has answers here:
Matching up two vectors in R
(2 answers)
Closed 4 years ago.
Suppose I have a dataframe such as the below
people.dat <- data.frame("ID" = c(2001, 1001, 2005, 2001, 5000),
                         "Data" = c(100, 300, 500, 900, 200))
Which looks something like this
+------+------+
| ID | Data |
+------+------+
| 2001 | 100 |
| 1001 | 300 |
| 2005 | 500 |
| 2001 | 900 |
| 5000 | 200 |
+------+------+
Suppose the first thing I do is work out how many unique ID values are in the dataframe (this is necessary, due to the size of the real dataset in question)
unique_ids <- sort(c(unique(people.dat$ID)))
Which gives
[1] 1001 2001 2005 5000
Where I get stuck: I would like to add a new column, say "new_id", that looks at each row's "ID" value, finds its position in unique_ids, and assigns that position (so "new_id" holds values ranging from 1 to length(unique_ids)).
An example of the output would be as follows
+------+------+--------+
| ID | Data | new_id |
+------+------+--------+
| 2001 | 100 | 2 |
| 1001 | 300 | 1 |
| 2005 | 500 | 3 |
| 2001 | 900 | 2 |
| 5000 | 200 | 4 |
+------+------+--------+
I thought about using a for loop with if statements, but my first attempts didn't quite hit the mark. If I just wanted to replace "ID" with a sequential value, the following code would work, but I want to keep ID and add a separate "new_id" column:
for (i in seq_along(unique_ids)) {
  people.dat$ID[people.dat$ID == unique_ids[i]] <- i
}
Thank you for any help. Hope I have made the question as clear as possible (although I struggled to phrase some of it, so please let me know if there is anything specific that needs clarifying)
This is more of a 'rank' problem:
people.dat$rank <- as.numeric(factor(people.dat$ID))
people.dat
ID Data rank
1 2001 100 2
2 1001 300 1
3 2005 500 3
4 2001 900 2
5 5000 200 4
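Alternatively, if you want new_id to be the position of each ID in your precomputed unique_ids, base R's match does exactly that (a sketch using the objects defined above):
# match returns the index of each ID within the sorted unique_ids vector
people.dat$new_id <- match(people.dat$ID, unique_ids)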

R data.table add new column with query for each row

I have 2 data.tables in R like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that:
- first, last uniquely defines a person
- truncation of the last name does not introduce ambiguity
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in table_1, I need to find a row that back fills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
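For a reproducible setup, here is a sketch that builds the two tables from the listings above (values copied from the question; both objects assumed to be data.tables):
library(data.table)
first_table <- data.table(id    = 1:4,
                          first = c("Bob", "Sue", "Sue", "Bob"),
                          trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
                          val1  = c(10, 20, 30, 40))
second_table <- data.table(id    = 1:8,
                           first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
                           last  = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
                           val2  = rep(c("A", "B"), 4))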
One approach is to join on first, then filter based on the substring-match
first_table[
unique(second_table[, .(first, last)])
, on = "first"
, nomatch = 0
][
substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or, do the truncation on second_table to match the first (the trunc values here are all 5 characters wide), then join on both columns
first_table[
unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
, on = c("first", "trunc")
, nomatch = 0
]
## yields the same answer

How to apply functions in columns for data frames with different sizes in nested list?

In R, to apply some function to a column, you can do:
df$col <- someFunction(df$col)
Now my question is, how do you do the same when the data frames live in a nested list?
Say I have the following list, where the data frames sit two levels below the root:
list
+-- year1
|   +-- type1: data frame with columns (id, name)
|   +-- type2: data frame with columns (meta1, meta2, name)
+-- year2
|   +-- type1: data frame with columns (id, name)
|   +-- type2: data frame with columns (meta1, meta2, name)
+-- year3
    +-- type1: data frame with columns (id, name)
    +-- type2: data frame with columns (meta1, meta2, name)
And I want to modify the "name" column in each of the leaf data frames with some function and store the results there. How do you do that?
Here is the example data:
data<-list()
data$yr2001$type1 <- df_2001_1 <- data.frame(index=1:3,name=c("jack","king","larry"))
data$yr2001$type2 <- df_2001_2 <- data.frame(index=1:5,name=c("man","women","oliver","jack","jill"))
data$yr2002$type1 <- df_2002_1 <- data.frame(index=1:3,name=c("janet","king","larry"))
data$yr2002$type2 <- df_2002_2 <- data.frame(index=1:5,name=c("alboyr","king","larry","rachel","sam"))
data$yr2003$type1 <- df_2003_1 <- data.frame(index=1:3,name=c("dan","jay","zang"))
data$yr2003$type2 <- df_2003_2 <- data.frame(index=1:5,name=c("zang","king","larry","kim","fran"))
say I want to uppercase all of the names in the name column in each data frame stored in the list
I agree with @joran's comment above: this is begging to be consolidated by adding type as a column. But here is one way with rapply. This assumes that the name column is the only factor column in each nested data.frame. As in @josilber's answer, my function of choice is toupper.
rapply(data, function(x) toupper(as.character(x)), classes='factor', how='replace')
This will drop the data.frame class, but the essential structure is preserved. If your name columns are already character, then you would use:
rapply(data, toupper, classes='character', how='replace')
To illustrate (using your simplified example):
library(reshape2)
dat1 <- melt(data, id.vars = c("index", "name"))
dat1$NAME <- toupper(dat1$name)
You can nest the lapply function twice to get at the inner data frames. Here, I apply toupper to each name variable:
result <- lapply(data, function(x) {
lapply(x, function(y) {
y$name = toupper(y$name)
return(y)
})
})
result
# $yr2001
# $yr2001$type1
# index name
# 1 1 JACK
# 2 2 KING
# 3 3 LARRY
#
# $yr2001$type2
# index name
# 1 1 MAN
# 2 2 WOMEN
# 3 3 OLIVER
# 4 4 JACK
# 5 5 JILL
#
#
# $yr2002
# $yr2002$type1
# index name
# 1 1 JANET
# 2 2 KING
# 3 3 LARRY
#
# $yr2002$type2
# index name
# 1 1 ALBOYR
# 2 2 KING
# 3 3 LARRY
# 4 4 RACHEL
# 5 5 SAM
#
#
# $yr2003
# $yr2003$type1
# index name
# 1 1 DAN
# 2 2 JAY
# 3 3 ZANG
#
# $yr2003$type2
# index name
# 1 1 ZANG
# 2 2 KING
# 3 3 LARRY
# 4 4 KIM
# 5 5 FRAN
Here is a truly recursive version based on lapply (i.e. it will work with deeper nesting) that makes no assumptions other than that the terminal leaves are data frames. Unfortunately rapply won't stop the recursion at data frames, so you have to use lapply if you want to operate on the data frames themselves (otherwise Matthew's answer is perfect).
samp.recur <- function(x) {
  lapply(x, function(y) {
    if (is.data.frame(y)) transform(y, name = toupper(name)) else samp.recur(y)
  })
}
This produces:
samp.recur(data)
# $yr2001
# $yr2001$type1
# index name
# 1 1 JACK
# 2 2 KING
# 3 3 LARRY
# $yr2001$type2
# index name
# 1 1 MAN
# 2 2 WOMEN
# 3 3 OLIVER
# 4 4 JACK
# 5 5 JILL
# etc...
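Because the function recurses until it hits a data frame, the same call handles deeper nesting unchanged. For instance (a hypothetical extra level of nesting):
deeper <- list(extra = data)   # wrap the whole list one level deeper
samp.recur(deeper)$extra$yr2001$type1
#   index  name
# 1     1  JACK
# 2     2  KING
# 3     3  LARRY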
Though I do also agree with others you may want to consider re-structuring your data.

In DGET function, how to use multiple cell ranges as search criteria?

I am using the DGET function in LibreOffice. I have the first table as shown below (top), and I want to make the second table (bottom). I can use the DGET function where Database is the cell range containing the top table and Database Field is "Winner".
Is it possible to use different cell ranges as the Search Criteria, so that each cell in the row for Case #1 can have a separate formula with different search criteria, as given in the first row of the bottom table?
If I have to use separate contiguous cell ranges for the search criteria, then there would be n*Chances cell ranges, where n = the total number of cases (~150 in my case) and Chances = the number of possible Chance# values (50 in my case).
Case | Chance# | Winner
-------------------------
1 | 7 | Joe
1 | 9 | Emil
1 | 10 | Harry
1 | 11 | Kate
2 | 1 | Tom
2 | 3 | Jerry
2 | 4 | Mike
2 | 7 | John
Case |Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|
|="=1" |="=2" |="=3" |="=4" |="=5" |="=6" |="=7" |="=8" |="=9" |="=10" |="=11" | ---- |="=50"
1 | | | | | | | Joe | |Emil |Harry | Kate | ---- |
2 | Tom | |Jerry |Mike | | | John | | | | | ---- |
To do so, you need to change your approach: instead of using DGET, I'm using a rather more involved method.
Considering your example:
A B C D
1 # Case Chance# Winner
2 1 1 7 Joe
3 2 1 9 Emil
4 3 1 10 Harry
5 4 1 11 Kate
6 5 2 1 Tom
7 6 2 3 Jerry
8 7 2 4 Mike
9 8 2 7 John
10
11 Case\Chance# 1 2 3 4
12 1
13 2 Tom Jerry Mike
I use the following:
=IF(SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9)) > 0, INDEX($D$2:$D$9, SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))), "")
Let's ignore the IF and focus on the real deal here:
First, get the row that matches your conditions: $B$2:$B$9=$A12 and $C$2:$C$9=B$11 each produce a TRUE/FALSE array; multiply them to get a 0/1 array with a single 1 at the match, then multiply by the # column to turn that 1 into the row number within the table.
SUMPRODUCT collapses the result array into that single value (the row number).
Finally, INDEX retrieves the desired value from the Winner column.
The IF statement tests whether a match exists (SUMPRODUCT > 0), filtering out the cells with no match.
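A worked example for the cell at Case 2, Chance# 1 (cell B13 in the grid above, which should yield Tom):
$B$2:$B$9=$A13 -> {0;0;0;0;1;1;1;1} (rows where Case = 2)
$C$2:$C$9=B$11 -> {0;0;0;0;1;0;0;0} (rows where Chance# = 1)
product * $A$2:$A$9 -> {0;0;0;0;5;0;0;0}
SUMPRODUCT(...) -> 5, and INDEX($D$2:$D$9, 5) -> "Tom"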
