Matching two datasets by variable [duplicate] - r

I have two datasets for buy and sell orders on a trading platform which look like this:
buy[45:50,]
NO SECCODE BUYSELL TIME ORDERNO ACTION PRICE VOLUME TRADENO TRADEPRICE
45 7880 SU25077RMFS7 B 1e+08 7880 1 98.4001 250 NA NA
46 7976 SU24018RMFS2 B 1e+08 7976 1 101.9989 4 NA NA
47 8314 SU52001RMFS3 B 1e+08 8314 1 94.6000 200 NA NA
48 8607 SU29009RMFS6 B 1e+08 8607 1 101.4000 22 NA NA
49 8735 SU29009RMFS6 B 1e+08 8735 1 101.4000 2 NA NA
50 8915 SU26206RMFS1 B 1e+08 8915 1 91.0002 225 NA NA
and
sell[45:50,]
NO SECCODE BUYSELL TIME ORDERNO ACTION PRICE VOLUME TRADENO TRADEPRICE
45 18767 SU26215RMFS2 S 100004130 13929 1 77.7410 6 NA NA
46 18831 SU26205RMFS3 S 100004156 13959 1 84.4680 3 NA NA
47 30345 SU26211RMFS1 S 100009446 19505 1 82.1999 7 NA NA
48 48387 SU24018RMFS2 S 100015879 3865 2 101.9989 4 2516559570 101.9989
49 54854 SU26212RMFS9 S 100019214 8920 0 77.2499 58 NA NA
50 55493 SU26212RMFS9 S 100019734 31671 1 74.6999 58 NA NA
I need to find all matches in PRICE by comparing all rows in "buy" with all rows in "sell". For example, the PRICE in row 46 of the first dataset coincides with the PRICE in row 48 of the second one.
The expected output is a data frame that combines the corresponding rows into a single row, i.e., one row made out of row 46 from the first dataset and row 48 from the second one (to be honest, it doesn't even matter to me what the output looks like, I just need to find the TIME and PRICE for the corresponding orders).
I've tried something like
d <- data[data$PRICE %in% intersect (sell$PRICE, buy$PRICE),]
where data includes both "buy" and "sell" orders, but it doesn't work. As I understand, match and find.matches compare only the corresponding rows, but I need to compare all rows from "buy" with all rows from "sell".
My apologies if someone already asked something like this, but I couldn't find a similar question.

You can use the merge() function with all = TRUE:
alldata <- merge(x= buy, y= sell , by = "PRICE", all = TRUE)
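As a minimal sketch of the join above, assuming two small made-up data frames (the names buy_demo and sell_demo and their values are hypothetical, not the asker's data): merge() with its default all = FALSE keeps only the rows whose PRICE occurs in both tables, while all = TRUE also keeps the unmatched rows and fills the missing side with NA. The match on a numeric column is exact, so prices that differ by rounding error would not pair up.

```r
# Hypothetical miniature versions of the buy and sell tables
buy_demo <- data.frame(
  SECCODE = c("SU25077RMFS7", "SU24018RMFS2"),
  TIME    = c(100000000, 100000000),
  PRICE   = c(98.4001, 101.9989)
)
sell_demo <- data.frame(
  SECCODE = c("SU24018RMFS2", "SU26212RMFS9"),
  TIME    = c(100015879, 100019214),
  PRICE   = c(101.9989, 77.2499)
)

# Inner join: only prices present in both tables
matched <- merge(buy_demo, sell_demo, by = "PRICE",
                 suffixes = c(".buy", ".sell"))

# Full outer join: every price from either table, NA where there is no partner
everything <- merge(buy_demo, sell_demo, by = "PRICE", all = TRUE,
                    suffixes = c(".buy", ".sell"))

# TIME and PRICE of the matching orders
matched[, c("PRICE", "TIME.buy", "TIME.sell")]
```

If only the matching orders are needed, the inner join is probably the closer fit to the question; the outer join is useful when the unmatched orders should stay visible.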

Related

Looping through a vector, creating a new variable which is the first vector minus 2 other vectors when none of them are NA

Assuming the following dataset:
Company Sales COGS Staff
A 100 50 25
B 200 NA 100
C NA 50 25
D 75 50 25
E 125 100 NA
I would like to create a new variable called Profit which is Sales - COGS - Staff, if none of those variables is NA. The desired output would be as follows:
Company Sales COGS Staff Profit
A 100 50 25 25
B 200 NA 100 NA
C NA 50 25 NA
D 75 50 25 0
E 125 100 NA NA
I started with something like:
# Creating the profit column (should be unnecessary right?)
df$Profit <- NA
# For each row in the sales column/vector
for(i in df$Sales){
# If all are not NA
if(!is.na(df$Sales) & !is.na(df$COGS) & !is.na(df$Staff)){
# Do calculation for profit
df$Profit <- df$Sales - (df$COGS + df$Staff)
# If calculation not possible
} else {
df$Profit <- NA
}}
Which does not give an error, but it makes R go a bit haywire. Is there a more efficient way to do this?
As simple as what you see ...
df$Sales-df$COGS-df$Staff
[1] 25 NA NA 0 NA
If there is any NA in Sales, COGS, or Staff, the result becomes NA. It is just like sum(), which has an na.rm argument: the plain arithmetic operators behave as if na.rm = FALSE.
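A small sketch of that behaviour, using made-up vectors (sales, cogs, and staff are illustrative names, not the asker's data frame): subtraction propagates NA element by element, while sum() only skips NA when asked to.

```r
sales <- c(100, 200, NA)
cogs  <- c(50, NA, 50)
staff <- c(25, 100, 25)

# Vectorised subtraction: any NA operand makes that element NA
profit <- sales - cogs - staff

# sum() propagates NA by default and drops it only with na.rm = TRUE
total_default <- sum(c(100, NA, 25))
total_na_rm   <- sum(c(100, NA, 25), na.rm = TRUE)
```

This is why no explicit loop or is.na() check is needed for the Profit column.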
This seems a job for within.
df <- within(df, Profit <- Sales - COGS - Staff)
df
# Company Sales COGS Staff Profit
#1 A 100 50 25 25
#2 B 200 NA 100 NA
#3 C NA 50 25 NA
#4 D 75 50 25 0
#5 E 125 100 NA NA
DATA.
df <- read.table(text = "
Company Sales COGS Staff
A 100 50 25
B 200 NA 100
C NA 50 25
D 75 50 25
E 125 100 NA
", header = TRUE)
We create a logical index with rowSums that is TRUE for the rows with no NA in the selected columns, then do the subtraction only for those rows and assign the result to 'Profit'.
i1 <- !rowSums(is.na(df1[-1]))
df1$Profit[i1] <- with(df1, (Sales-COGS-Staff)[i1])
df1
# Company Sales COGS Staff Profit
#1 A 100 50 25 25
#2 B 200 NA 100 NA
#3 C NA 50 25 NA
#4 D 75 50 25 0
#5 E 125 100 NA NA
NOTE: This is a general way to exclude the NA rows, so that the calculation is done on only a subset of rows instead of the whole dataset.
But any subtraction involving NA returns NA, so using
df1$Profit <- with(df1, (Sales - COGS - Staff))
should also work
Or another option if there are many columns,
rowSums(df1[-1] * c(1, -1, -1)[col(df1[-1])])
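The sign-weighting idea in the last line can also be spelled out with sweep(), which some readers may find easier to follow. A sketch, assuming the same df1 as above, rebuilt here so the block runs on its own (the signs vector is a name introduced for illustration):

```r
df1 <- read.table(text = "
Company Sales COGS Staff
A 100 50 25
B 200 NA 100
C NA 50 25
D 75 50 25
E 125 100 NA
", header = TRUE)

# +1 for the column that is added, -1 for the columns that are subtracted
signs <- c(Sales = 1, COGS = -1, Staff = -1)

# Multiply each column by its sign, then add up each row;
# rowSums() keeps NA because na.rm defaults to FALSE
weighted <- sweep(as.matrix(df1[names(signs)]), 2, signs, "*")
df1$Profit <- rowSums(weighted)
df1
```

Selecting the columns by name rather than with df1[-1] keeps the calculation from picking up a Profit column that already exists.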

R: tapply(x,y,sum) returns NA instead of 0

I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type
I wanted to get the number of Hits per type and quarter for all of the data. Given that the Hits are either 0 or 1, a simple sum() function using tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For some of the types that have no occurrences in a given quarter, tapply sometimes returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA with tapply using just sum() I get values I expect:
> sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you, depending on exactly what you are looking for. If you are only interested in the number of positive Hits per Type and Quarter and don't need a record of the combinations with no Hits, you can get an answer with
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
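If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
# take the factor levels from the full data so that empty combinations are counted as 0
dataHit[["Type"]] <- factor(dataHit[["Type"]], levels = levels(factor(data[["Type"]])))
dataHit[["Quarter"]] <- factor(dataHit[["Quarter"]], levels = levels(factor(data[["Quarter"]])))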
table(dataHit[["Type"]], dataHit[["Quarter"]])
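One likely explanation for the NA cells, offered as a sketch rather than a diagnosis of the asker's data: tapply() fills the cells for group combinations that have no rows at all with NA, whereas sum() over an empty selection returns 0. In R 3.4.0 and later, tapply() takes a default argument for those empty cells. The hits data frame below is made up for illustration.

```r
# Hypothetical data: no RAIN rows in quarter 4 and no SNOW rows in quarter 2
hits <- data.frame(
  Type    = c("SNOW", "SNOW", "RAIN"),
  Quarter = c(4, 4, 2),
  Hit     = c(1, 0, 1)
)

# Empty Type/Quarter combinations come back as NA
with_na <- tapply(hits$Hit, list(hits$Type, hits$Quarter), sum)

# Same call, but empty combinations are filled with 0 (needs R >= 3.4.0)
with_zero <- tapply(hits$Hit, list(hits$Type, hits$Quarter), sum, default = 0)

# Summing an empty selection gives 0, which is why the manual checks look fine
manual <- sum(hits$Hit[hits$Quarter == 4 & hits$Type == "RAIN"])
```

Combinations that do have rows but no hits (such as a quarter with only misses) already sum to 0 in both versions.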

Issue with NA values when removing rows from data frame in R

This is my data frame:
ID <- c('TZ1','TZ2','TZ3','TZ4')
hr <- c(56,32,38,NA)
cr <- c(1,4,5,2)
data <- data.frame(ID,hr,cr)
ID hr cr
1 TZ1 56 1
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
I want to remove the rows where data$hr is equal to 56. This is what I want the end product to be:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
This is what I thought would work:
data = data[data$hr !=56,]
However the resulting data frame looks like this:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
NA <NA> NA NA
How can I modify my code to take the NA value into account so this doesn't happen? Thank you for your help, I can't figure it out.
EDIT: I also want to keep the NA value in the data frame.
The issue is that when we use == or != and there are NA values, the comparison returns NA for them, and indexing with an NA creates an all-NA row for each one. So one way to make a logical index with only TRUE/FALSE values is to also use is.na in the comparison.
data[!(data$hr==56 & !is.na(data$hr)),]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
We could also apply the reverse logic
subset(data, hr!=56|is.na(hr))
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
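Another option, sketched here with the asker's data frame rebuilt so the block runs on its own, is %in%, which returns FALSE rather than NA when the left-hand value is NA and the lookup set contains no NA. The names naive_index and kept are introduced only for this example.

```r
data <- data.frame(
  ID = c("TZ1", "TZ2", "TZ3", "TZ4"),
  hr = c(56, 32, 38, NA),
  cr = c(1, 4, 5, 2)
)

# The plain comparison carries the NA through: FALSE TRUE TRUE NA
naive_index <- data$hr != 56

# %in% yields only TRUE/FALSE, so negating it keeps the NA row intact
kept <- data[!(data$hr %in% 56), ]
kept
```

The same pattern extends to several values at once, for example !(data$hr %in% c(56, 38)).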

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not to my current knowledge of R, which is growing, but still in its infancy). Is there another command which I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
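A short sketch of working with the resulting list, assuming a cut-down reefdata with only two of the original columns (the values are copied from the sample, the reduced layout is an assumption made to keep the example small):

```r
reefdata <- data.frame(
  site = c("alice", "alice", "steps", "steps", "andrea1"),
  rate = c(9, 9, 9, 9, 5)
)

# One data frame per site, named after the site
splitdat <- split(reefdata, reefdata$site)
names(splitdat)

# Apply the same summary to every site without naming each one
rows_per_site <- sapply(splitdat, nrow)
mean_rate     <- sapply(splitdat, function(d) mean(d$rate))
```

Keeping the subsets in one list, rather than in 100 separate variables, is what makes the sapply() step possible.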
I would use the plyr package.
library(plyr)
ll <- dlply(df,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names (as.character also handles reefdata$site being a factor)
sites <- as.character( unique( reefdata$site ) )
# create an empty list with one named slot per site
dives <- vector( "list", length( sites ) )
names( dives ) <- sites
# collect data per site into the list
for( i in seq_along( sites ) )
{
  # subset the rows for this site and store the resulting data.frame
  dives[[ i ]] <- reefdata[ reefdata$site == sites[ i ] , ]
}

Removing certain values from the dataframe in R

I am not sure how to do this: I need to cluster the data frame mydf, but first I want to omit the Inf (infinite) values and the values greater than 50. How can I get a table that contains no Inf and no value greater than 50 (maybe by nullifying those cells)? The clustering part itself is not a problem, because I can do it with the mfuzz package. The only problem is that I want to keep the clustered values within the 0-50 range.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Anything you do downstream in R should have a way of handling NA, even if it's just na.omit.
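A sketch of the same replacement restricted to the measurement columns, so that the s.no identifier is never touched. It assumes mydf has the layout shown in the question; the names value_cols, vals, and complete_rows are introduced here for illustration. Since Inf is greater than 50, the single comparison covers both conditions (it would not catch -Inf).

```r
mydf <- data.frame(
  s.no = 1:4,
  A = c(Inf, 0.43, 34, 3),
  B = c(Inf, 30, 22, 43),
  C = c(999.9, 23, 233, 45)
)

# Only these columns hold measurements; s.no is an identifier
value_cols <- c("A", "B", "C")

# Blank out everything above 50 (this includes Inf) in the measurement columns
vals <- mydf[value_cols]
vals[vals > 50] <- NA
mydf[value_cols] <- vals

# One simple downstream choice: keep only the rows with no NA left
complete_rows <- na.omit(mydf)
```

Whether to drop incomplete rows with na.omit or to keep them depends on what the clustering step can handle.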
