Removing certain values from the dataframe in R - r

I am not sure how I can do this, but what I need is I need to form a cluster of this dataframe mydf where I want to omit the inf(infitive) values and the values greater than 50. I need to get the table that has no inf and no values greater than 50. How can I get a table that contains no inf and no value greater than 50(may be by nullifying those cells)? However, For clustering part, I don't have any problem because I can do this using mfuzz package. So the only problem I have is that I want to scale the cluster within 0-50 margin.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45

You can use NA, the built in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Any stuff you do downstream in R should have NA handling methods, even if it's just na.omit

Related

Matching two datasets by variable [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 3 years ago.
I have two datasets for buy and sell orders on a trading platform which look like this:
buy[45:50,]
NO SECCODE BUYSELL TIME ORDERNO ACTION PRICE VOLUME TRADENO TRADEPRICE
45 7880 SU25077RMFS7 B 1e+08 7880 1 98.4001 250 NA NA
46 7976 SU24018RMFS2 B 1e+08 7976 1 101.9989 4 NA NA
47 8314 SU52001RMFS3 B 1e+08 8314 1 94.6000 200 NA NA
48 8607 SU29009RMFS6 B 1e+08 8607 1 101.4000 22 NA NA
49 8735 SU29009RMFS6 B 1e+08 8735 1 101.4000 2 NA NA
50 8915 SU26206RMFS1 B 1e+08 8915 1 91.0002 225 NA NA
and
sell[45:50,]
NO SECCODE BUYSELL TIME ORDERNO ACTION PRICE VOLUME TRADENO TRADEPRICE
45 18767 SU26215RMFS2 S 100004130 13929 1 77.7410 6 NA NA
46 18831 SU26205RMFS3 S 100004156 13959 1 84.4680 3 NA NA
47 30345 SU26211RMFS1 S 100009446 19505 1 82.1999 7 NA NA
48 48387 SU24018RMFS2 S 100015879 3865 2 101.9989 4 2516559570 101.9989
49 54854 SU26212RMFS9 S 100019214 8920 0 77.2499 58 NA NA
50 55493 SU26212RMFS9 S 100019734 31671 1 74.6999 58 NA NA
I need to find all matches in PRICE by comparing all rows in "buy" with all rows in "sell". For example, the PRICE in the row 46 i the first dataset coincides with the PRICE in the row 48 in the second one.
The expected output is the dataframe combining the corresponding rows into single row, i.e., making one row out of row 46 from the first dataset and row 48 from the second one (to be honest, it doesn't even matter to me how the output will look like, I just need to find the TIME and PRICE for the corresponding orders).
I've tried something like
d <- data[data$PRICE %in% intersect (sell$PRICE, buy$PRICE),]
where data includes both "buy" and "sell" orders, but it doesn't work. As I understand, match and find.matches compare only the corresponding rows, but I need to compare all rows from "buy" with all rows from "sell".
My apologies if someone already asked something like this, but I couldn't find a similar question.
You can use merge() function with all = True
alldata <- merge(x= buy, y= sell , by = "PRICE", all = TRUE)

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset==2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior OUD, and those who had MDD after OUD with regards to which group has a greater physical disability score). I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I did the t test twice, once where MDDOnset = 1 and another when it equaled 2, the mean for y (Comorbidity$PhysDis) was the same, and when I looked into the original csv file, it turned out that this mean was the mean of the entire column, and not just cases where MDDOnset had a value of one or two. If there is a different way to run the t-tests that would have the mean of PhysDis only when MDDOnset = 1, and another with the mean of PhysDis only when MDDOnset == 2 that does not require making new columns, then please tell me.. Sorry if there are any similar questions or if my approach is way off, I'm new to R and programming in general, and thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
A B C D
1 7 25 NA 25
2 9 34 NA 34
3 4 14 14 NA
4 2 76 76 NA
5 5 56 56 NA
6 10 34 NA 34
7 8 23 NA 23
8 6 12 12 NA
9 1 89 89 NA
10 3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
I would use the mutate command from the dplyr package.
Comorbidity <- mutate(Comorbidity, newColumn = (ifelse(MDDOnset == 1, PhysDis, "")), newColumn2 = (ifelse(MDDOnset == 2, PhysDis, "")))

R: tapply(x,y,sum) returns NA instead of 0

I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Type
quarter.number<-data$Quarter
syno.wrng<- data$Type
I wanted to get the amount of Hits per type, and quarter for all of the data. Given that the Hits are either 0 or 1, then a simple sum() function using tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For a some of the types that have no occurrences in a given quarter, tapply sometimes returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA with tapply using just sum() I get values I expect:
sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you depending on exactly what you are looking for. If you just are interested in your number of positive Hits per Type and Quarter and don't need a record of when no Hits exist, you can get an answer as
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
dataHit[["Type"]] <- factor(data[["Type"]])
dataHit[["Quarter"]] <- factor(data[["Quarter"]])
table(dataHit[["Type"]], dataHit[["Quarter"]])

Issue with NA values when removing rows from data frame in R

This is my data frame:
ID <- c('TZ1','TZ2','TZ3','TZ4')
hr <- c(56,32,38,NA)
cr <- c(1,4,5,2)
data <- data.frame(ID,hr,cr)
ID hr cr
1 TZ1 56 1
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
I want to remove the rows where data$hr = 56. This is what I want the end product to be:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
This is what I thought would work:
data = data[data$hr !=56,]
However the resulting data frame looks like this:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
NA <NA> NA NA
How can I mofify my code to encorporate the NA value so this doesn't happen? Thank you for your help, I can't figure it out.
EDIT: I also want to keep the NA value in the data frame.
The issue is that when we do the == or !=, if there are NA values, it will remain as such and create an NA row for that corresponding NA value. So one way to make the logical index with only TRUE/FALSE values will be to use is.na also in the comparison.
data[!(data$hr==56 & !is.na(data$hr)),]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
We could also apply the reverse logic
subset(data, hr!=56|is.na(hr))
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical. I have created a data frame where each alphabetical identifier is associated with a number
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need to replace ZYO with the number 64 in my main data frame (g3) and like wise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now on a small scale I can write a code to change it like I did with ATR
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increased chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function which I feel like may work but I am not sure how to logically build this argument, it was posted on the questions board here
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x <0,0)})
I have tried to work my data into this by
df <- as.data.frame(lapply(g4, function(g3){replace(x, x <0,0)})
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA

Resources