Comparing two Excel files in Python to output the records missing from either file, plus counts

Hi, I am trying to compare two Excel workbooks with one sheet each; both files have the same data labels. I want the output to contain the records that are not present in both files, and also a count of how many files each record appears in.
Example
Excel1 Sheet1
Column A Column B Column C Column D Column E
100 10100 Cash G 1.0004
101 10101 Cash G 1.0001
102 10102 Cash G 1.0002
100 10103 Cash G 1.0003
Excel2 Sheet1
Column A Column B Column C Column D Column E
100 10100 Cash G 1.0004
101 10101 Cash G 1.0001
100 10103 Cash G 1.0003
Needed output
Output1
Column A Column B Column C Column D Column E
102 10102 Cash G 1.0002
Output2
Column A Column B Column C Column D Column E Count
102 10102 Cash G 1.0002 1
100 10100 Cash G 1.0004 2
101 10101 Cash G 1.0001 2
100 10103 Cash G 1.0003 2
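No answer is recorded in this thread, but one way to get both outputs with pandas is an outer merge with an indicator column. This is a sketch, not from the original thread; the example rows are built inline, and in practice `df1`/`df2` would come from `pd.read_excel` on the two workbooks.

```python
import pandas as pd

cols = ["Column A", "Column B", "Column C", "Column D", "Column E"]
# In practice: df1 = pd.read_excel("Excel1.xlsx"), df2 = pd.read_excel("Excel2.xlsx")
df1 = pd.DataFrame([[100, 10100, "Cash", "G", 1.0004],
                    [101, 10101, "Cash", "G", 1.0001],
                    [102, 10102, "Cash", "G", 1.0002],
                    [100, 10103, "Cash", "G", 1.0003]], columns=cols)
df2 = pd.DataFrame([[100, 10100, "Cash", "G", 1.0004],
                    [101, 10101, "Cash", "G", 1.0001],
                    [100, 10103, "Cash", "G", 1.0003]], columns=cols)

# Outer merge on all columns; the indicator column records whether each
# row came from the left file, the right file, or both.
merged = df1.merge(df2, how="outer", on=cols, indicator=True)

# Output1: records present in only one of the two files.
output1 = merged.loc[merged["_merge"] != "both", cols]

# Output2: every record plus a count of how many files it appears in.
merged["Count"] = merged["_merge"].map({"left_only": 1, "right_only": 1, "both": 2})
output2 = merged[cols + ["Count"]].sort_values("Count")
```

With the example data, `output1` holds only the 102/10102 row, and `output2` lists that row with Count 1 followed by the shared rows with Count 2.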


Remove auto-generated index row numbers in R table

I am loading a table from SQL into R and calling it dataframe:
dataframe <- sqlQuery(dbo, dataFinal)
My dataframe then returns the below, which includes a numeric row index that I do not want.
Column1 Column2 Score
1 a e 5
2 b f 5
3 c g 8
4 d h 7
What do I need to convert dataframe to such that I return the below:
Column1 Column2 Score
a e 5
b f 5
c g 8
d h 7
I want to be able to call this table, so a print() will not work.
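No answer is recorded for this question, but a common approach (a sketch, not from the thread): the leading numbers are the data frame's row names, which R prints by default. They are not a column you can drop, but you can reset them after subsetting or suppress them when displaying.

```r
# Stand-in for the frame returned by sqlQuery() in the question.
dataframe <- data.frame(Column1 = c("a", "b", "c", "d"),
                        Column2 = c("e", "f", "g", "h"),
                        Score   = c(5, 5, 8, 7))

rownames(dataframe) <- NULL          # reset row names to the default 1..n
print(dataframe, row.names = FALSE)  # display the table without row names
```

For rendered output, `knitr::kable(dataframe, row.names = FALSE)` has the same effect; the row names still exist internally either way.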

R: find identical combinations over multiple columns

I have a hard time explaining what I am looking for, but I will try my best, so bear with me. I have the following data, which contains pairs of individuals and a value per pair:
Col1 Col2 Value
A B 90
E F 90
B A 50
C D 50
F E 90
What I want to do is find identical combinations (i.e. both A & B and B & A) and their respective values, and put them together. However, not all combinations appear twice (in my example, there is only C & D, no D & C).
I have tried copying the data into a second data frame, swapping Col1 and Col2, and then sorting by Col1. That gives me the following:
Col1 Col2 Value Dummy
A B 90 1
A B 50 2
C D 50 1
E F 90 1
E F 90 2
B A 50 1
B A 90 2
D C 50 2
F E 90 1
F E 90 2
But I then still have both A&B and B&A in my data. Ideally I would like to end up with this:
Col1 Col2 Value
A B 90
B A 50
C D 50
E F 90
F E 90
I hope my question is clear, but otherwise I am happy to try to explain myself better!
Using base R, we order first by the pairwise minimum, then by the pairwise maximum, then by the first column:
df[with(df,order(pmin(Col1,Col2),pmax(Col1,Col2),Col1)),]
# Col1 Col2 Value
# 1 A B 90
# 3 B A 50
# 4 C D 50
# 2 E F 90
# 5 F E 90
Thanks @akrun for the hint.
The tidyverse solution would be:
library(dplyr)
df %>% arrange(pmin(Col1,Col2),pmax(Col1,Col2),Col1)
The previous solution:
df[order(
  apply(df[1:2], 1, function(x) paste(sort(x), collapse = "")),
  df$Col1), ]
data
df <- read.table(text =
"Col1 Col2 Value
A B 90
E F 90
B A 50
C D 50
F E 90", header = TRUE, stringsAsFactors = FALSE)
You can try something like data %>% distinct(Col1, Col2, Value) %>% arrange(Col1)

Filtering a Dataset by another Dataset in R

The task I am trying to accomplish is essentially filtering one dataset by the entries in another dataset's "id" column. The datasets I am working with are quite large, with tens of thousands of entries and 30 or so variables. I have made toy datasets to help explain what I want to do.
The first dataset contains a list of entries, and each entry has its own unique accession number (this is the id).
Data1 = data.frame(accession_number = c('a','b','c','d','e','f'), values =c('1','3','4','2','3','12'))
>Data1
accession_number values
1 a 1
2 b 3
3 c 4
4 d 2
5 e 3
6 f 12
I am only interested in the entries that have the accession numbers 'c', 'd', and 'e'. (In reality my list is around 100 unique accession numbers.) Next, I created a data frame with only the unique accession numbers and no other values.
>SubsetData1
accession_number
1 c
2 d
3 e
The second dataset, which I am looking to filter, contains multiple entries, some of which share an accession number.
>Data2
accession_number values Intensity col4 col6
1 a 1 -0.0251304 a -0.4816370
2 a 2 -0.4308735 b -1.0335971
3 c 3 -1.9001321 c 0.6416735
4 c 4 0.1163934 d -0.4489048
5 c 5 0.7586820 e 0.5408650
6 b 6 0.4294415 f 0.6828412
7 b 7 -0.8045201 g 0.6677730
8 b 8 -0.9898947 h 0.3948412
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
18 f 18 -0.6660976 r 0.6159731
19 f 19 0.2089186 s -0.8222399
20 f 20 -1.5809582 t 1.5567113
21 f 21 0.3610700 u 0.3264431
22 f 22 1.2923324 v 0.9636267
What I'm looking to do is compare the subsetted list from the first dataset (SubsetData1) with the second dataset (Data2) to create a filtered dataset that only contains the entries whose accession numbers appear in the subsetted list. The filtered dataset should look something like this.
accession_number values Intensity col4 col6
9 c 9 -0.6004642 i -0.3323932
10 c 10 1.1367578 j 0.9151915
11 c 11 0.7084980 k -0.3424039
12 c 12 -0.9618102 l 0.2386307
13 c 13 0.2693441 m -1.3861064
14 d 14 1.6059971 n 1.3801924
15 e 15 2.4166472 o -1.1806929
16 e 16 -0.7834619 p 0.1880451
17 e 17 1.3856535 q -0.7826357
I don't know if I need to start making loops in order to tackle this problem, or if there is a simple R command that would help me accomplish this task. Any help is much appreciated.
Thank You
Try this:
WantedData = Data2[Data2$accession_number %in% SubsetData1$accession_number, ]
You can also use inner_join from the dplyr package:
dat = inner_join(Data2, SubsetData1)
The subset function is designed for basic subsetting:
subset(Data2,accession_number %in% SubsetData1$accession_number)
Alternately, here you could merge:
merge(Data2,SubsetData1)
The other solutions seem fine, but I like the readability of dplyr, so here's a dplyr solution.
library(dplyr)
new_dataset <- Data2 %>%
  filter(accession_number %in% SubsetData1$accession_number)

Sort data by row

I have a data frame like
Id A B C D E F
a 1 2 9 4 7 6
b 4 5 1 3 6 10
c 1 6 0 3 4 5
I want a data frame like
Id
a C E F D B A # for a, C has the highest value, then E, then F, and so on; similarly for the other rows
b F E B A D C
c B F E D A C
Basically, I am sorting each row of the data frame and then replacing the row values with the respective column names.
Is there any nice way to do this?
Use order with apply, extracting the names in the process, like this:
data.frame(
  mydf[1],
  t(apply(mydf[-1], 1, function(x)
    names(x)[order(x, decreasing = TRUE)])))
# Id X1 X2 X3 X4 X5 X6
# 1 a C E F D B A
# 2 b F E B A D C
# 3 c B F E D A C
The result of apply needs to be transposed before it is recombined with the Id column.

Expand Records to create Edges for igraph

I have a dataset with multiple data points I want to map. igraph uses pairwise (1-1) relationships, though, so I'm looking for a way to turn one long record into many 1-1 records. For example:
test <- data.frame(
  drug1 = c("A","B","C","D","E","F","G","H","I","J","K"),
  drug2 = c("P","O","R","T","L","A","N","D","R","A","D"),
  drug3 = c("B","O","R","I","S","B","E","C","K","E","R"),
  age   = c(15,20,35,1,35,58,51,21,54,80,75))
Which gives this output
drug1 drug2 drug3 age
1 A P B 15
2 B O O 20
3 C R R 35
4 D T I 1
5 E L S 35
6 F A B 58
7 G N E 51
8 H D C 21
9 I R K 54
10 J A E 80
11 K D R 75
I'd like to make a new table with the drug1-drug2 pairs and then stack the drug2-drug3 pairs underneath them. So it would look like this.
drug1 drug2 age
1 A P 15
2 P B 15
3 C R 20
4 R R 20
5 E L 35
drug2 is held in the drug1 spot and drug3 is moved to drug2. I realize I can do this in multiple smaller steps, but I was wondering if anyone knew of a way to loop this process. I have up to 11 fields.
Here are the smaller steps.
a <- test[,c("drug1","drug2","age")]
b <- test[,c("drug2","drug3","age")]
names(b) <- c("drug1","drug2","age")
test2 <- rbind(a,b)
drug1 drug2 age
1 A P 15
2 B O 20
3 C R 35
4 D T 1
5 E L 35
6 F A 58
7 G N 51
8 H D 21
9 I R 54
10 J A 80
11 K D 75
12 P B 15
13 O O 20
14 R R 35
15 T I 1
16 L S 35
17 A B 58
18 N E 51
19 D C 21
20 R K 54
21 A E 80
22 D R 75
So if you have many fields, here's a helper function that pulls the data down into pairs.
pulldown <- function(data, cols = 1:(min(attr) - 1),
                     attr = ncol(data),
                     newnames = names(data)[c(cols[1:2], attr)]) {
  if (is.character(attr)) attr <- match(attr, names(data))
  if (is.character(cols)) cols <- match(cols, names(data))
  do.call(rbind, lapply(unname(data.frame(t(embed(cols, 2)))), function(x) {
    `colnames<-`(data[, c(sort(x), attr)], newnames)
  }))
}
You can run it with your data with
pulldown(test)
It has a parameter called attr where you can specify the columns (by index or name) that you would like repeated in every row (here it defaults to the last column). The cols parameter is a vector of all the columns that you would like to turn into pairs (the default is everything before the first attr column). You can also specify a vector of newnames for the columns as they come out.
With three columns your method is pretty simple, this might be a better choice for 11 columns.
Slightly more compact and a one-liner would be:
test2 <- rbind(test[c("drug1", "drug2", "age")],
               setNames(test[c("drug2", "drug3", "age")], c("drug1", "drug2", "age")))
The setNames function can be useful when column names are missing or need to be coerced to something else.
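Since the asker has up to 11 drug columns, the same rbind/setNames idea can be looped over consecutive column pairs. A sketch under that assumption; `stack_pairs` is a hypothetical helper name, shown here on a cut-down version of the question's data:

```r
# Cut-down version of the question's test data.
test <- data.frame(drug1 = c("A", "B", "C"),
                   drug2 = c("P", "O", "R"),
                   drug3 = c("B", "O", "R"),
                   age   = c(15, 20, 35))

stack_pairs <- function(data, drug_cols, keep = "age") {
  # One two-column block per consecutive pair of drug columns, each
  # renamed to the same names, then stacked with rbind.
  pieces <- lapply(seq_len(length(drug_cols) - 1), function(i) {
    setNames(data[, c(drug_cols[i], drug_cols[i + 1], keep)],
             c("drug1", "drug2", keep))
  })
  do.call(rbind, pieces)
}

test2 <- stack_pairs(test, c("drug1", "drug2", "drug3"))
```

With 11 fields you would pass `paste0("drug", 1:11)` as `drug_cols`; each adjacent pair contributes one block of rows, so n columns yield n-1 stacked blocks.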
