Excluding a number of answers from an R dataframe - r

I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?

We can use colSums,
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4

You can use sum to add up the counts. If the NAs were in the column being summed, you could pass na.rm = TRUE; here, however, the NA is in a different column (S), so you first need to remove the row containing it.
The solution is as follows: remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S), ], then select the second column, freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
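One caveat on the sum approach: it adds up the counts, so it matches the number of distinct answers only when every answer occurs once, as in the example data. A minimal sketch of the difference, assuming a hypothetical vector S2 (not from the question) in which the value 3 occurs twice:

```r
# S2 is a made-up variant of df$S with a repeated answer
S2 <- c(3, 3, 7, NA, 9, 5)

# Tabulate against the 0..10 scale; NA is dropped because it is not a level
tab <- table(factor(S2, levels = 0:10))

n_obs <- sum(tab)            # number of non-missing observations
n_distinct <- sum(tab != 0)  # number of distinct answers that occur

n_obs       # 5
n_distinct  # 4
```

If the goal is the number of different answers given, counting the non-zero frequencies is the safer choice.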

You can try na.omit (to remove NAs) and subset (to get rid of all rows where freq equals 0):
subset(na.omit(S.freq), freq != 0)
   S freq
4  3    1
6  5    1
8  7    1
10 9    1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?

Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])

You're looking for values with frequency > 0, which means you're looking for the unique values. You can get this information directly from the vector S:
length(unique(df$S))
Leaving NA aside, you get the answer 4 with:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already did in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine several conditions into one logical vector and use it for subsetting, e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
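For something closer to the SQL "not in" from the question, %in% can be negated. This is only a sketch; the name excluded is made up for illustration. Unlike ==, %in% treats NA as a value that can be matched, so NA can be listed among the excluded answers:

```r
# Same S as in the question
S <- c(3, 7, NA, 9, 5)

# Answers to leave out (hypothetical name)
excluded <- c(0, NA)

# Keep everything that is not in the excluded set
kept <- S[!(S %in% excluded)]

n_answers <- length(unique(kept))
n_answers  # 4
```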

Related

How to find the least frequent value in a column of dataframe in R

I have a dataframe column userID that contains duplicates. I was asked to find the userID that appears least frequently. What are the possible methods to achieve this, using only base R or the dplyr package?
Something like this
userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3)
Expected Output would be 3 in this case.
If this is based on the lengths of runs of identical adjacent values:
with(rle(userID), values[which.min(lengths)])
#[1] 3
Or if it is based on the full data values
names(which.min(table(userID)))
#[1] "3"
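Another possibility is a helper that returns the least frequent value together with its count (despite its name, Mode here returns the entry with the minimum count, not the mode):
# example dataframe
df <- data.frame(userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3))
# define Mode function: tabulate x and return the entry with the smallest count
Mode <- function(x){
  a = table(x) # x is a column
  return(a[which.min(a)])
}
Mode(df$userID)
# Output:
3 #value
1 #count
This gives the value 3 and the count 1.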
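Note that which.min returns only the first minimum, so ties are silently dropped. If several userIDs might share the lowest count, a base R sketch along these lines returns all of them (the names counts and least are made up here):

```r
userID <- c(1, 1, 1, 1, 2, 2, 1, 1, 4, 4, 4, 4, 3)

# Frequency of each userID
counts <- table(userID)

# All userIDs whose count equals the smallest count
least <- as.numeric(names(counts)[as.vector(counts) == min(counts)])
least  # 3
```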

How to find a total of row values in R

I am trying to find the total of rows that have a column value of 3 or 4. That being said, the first row has only one value of 3 so if I create a new column
currentdx_count1$TotalDiagnoses
That new column called TotalDiagnoses should only have a value of 1 under it for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need as expected because it literally sums up the whole row. That being said, is there an existing function that does what I want to do or will I have to make one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
Gets me closer but it is finding the total count for each column separately which is not what I need. I want a single column to encompass counts for each SID.
You can compare the dataframe with the values 3 and 4 and then use rowSums to count the matches in each row:
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
                                           currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
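If the columns can contain NA, the count for that row becomes NA unless na.rm = TRUE is passed to rowSums. A small sketch on invented data (dx and its column names are hypothetical, not the asker's real data):

```r
# Made-up stand-in for currentdx_count1: an id column plus value columns
dx <- data.frame(
  SID            = c("a", "b", "c"),
  AnxietyCurrent = c(3, 1, NA),
  MoodCurrent    = c(0, 4, 4),
  SleepCurrent   = c(2, 3, 3)
)

vals <- dx[-1]  # drop the id column before comparing

# Count cells equal to 3 or 4 per row, ignoring missing cells
dx$TotalDiagnoses <- rowSums(vals == 3 | vals == 4, na.rm = TRUE)
dx$TotalDiagnoses  # 1 2 2
```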

group data by tolerance via index list

I don't know how to explain it briefly, so I'll try my best.
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to column C in Data. I want to start with the first Ind value and find the corresponding row in Data (via column C). From that row I want to go up and down and find the values in column A that are within a tolerance of 1. I want to write these rows into a result dataframe, add a group id column, and delete them from Data (where I found them). Then I continue with the next entry in Ind, and so on until Data is empty. I know how to match Ind against column C of Data, and how to do the writing, deleting, and other bookkeeping in a for loop, but I don't know the main point, which is my question here:
Once I have found my row in Data, how can I look up the values of column A within the tolerance range above and below that entry to get my group id?
What I want to get is this result:
A   B    C  Group
1   5.3  1  2
2   9.2  2  2
3   5    3  2
5   8    4  3
8   10   5  1
9   9.5  6  1
10  4    7  4
Maybe somebody could help me with the critical point in my question or even how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame and set it to TRUE for "extracted" rows, like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# loop until they're all done.
for(idx in Ind$I) {
  # skip this iteration if idx is NA.
  if(is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ( (Data$A - target) >= -mytol) &
    ( (Data$A - target) <= mytol) &
    ( ! Data$extracted)
  # if dat.rows is all false, then nothing met the criteria.
  # skip the rest of the loop
  if( ! any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A=Data[dat.rows,"A"],
    B=Data[dat.rows,"B"],
    C=Data[dat.rows,"C"],
    Group=group.counter # this value will be recycled to match length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
# "growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
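The critical step, finding the rows whose A lies within the tolerance of the target, can also be written with abs(), which should be equivalent to the two-sided comparison in the loop. A minimal sketch for the first index value, 5 (the name in_tol is made up here):

```r
Data <- data.frame(A = c(1, 2, 3, 5, 8, 9, 10),
                   B = c(5.3, 9.2, 5, 8, 10, 9.5, 4),
                   C = 1:7)

mytol <- 1
# Value of A in the row whose C matches the first index entry
target <- Data$A[Data$C == 5]

# TRUE for rows whose A is within mytol of target, in either direction
in_tol <- abs(Data$A - target) <= mytol

Data$C[in_tol]  # 5 6
```

Inside the loop the same vector would still need to be combined with !Data$extracted so that rows already assigned to a group are skipped.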

R - How to compare values across more than two columns

I'm trying to write code to compare the values of several columns, and I don't know ahead of time how many columns I will have. The data will look like this:
X  Val1  Val2  Val3  Val4
A  1     1     1     2
B  NA    2     2     2
C  3     3     3     3
The code should return a Fail for rows A and B, and a Pass for row C, but needs to be able to handle a changing number of columns. I can't figure out how to do this without nesting a couple of for loops, but there has to be some way to use apply or sapply to iterate through columns 2:length(df)
EDIT: I want to see if the values (which will be numbers) are equal
Assuming that the first column is excluded from the comparison and that all the other columns are not, you can try:
which(rowSums(df[,2]==df[,3:ncol(df)])==(ncol(df)-2))
You can use apply with a custom function length(unique(x)) to count the number of unique values in each row across columns 2:ncol(df). You can then wrap the whole thing in ifelse to return a TRUE/FALSE vector.
ifelse(apply(df[ , 2:ncol(df)], MARGIN=1, function(x) length(unique(x))) == 1, TRUE, FALSE)
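As a sketch of how this plays out on the example data from the question, with the labels Pass and Fail instead of TRUE and FALSE (the column name Result and the helper name n_distinct are made up here). Because unique() keeps NA as its own value, row B counts two distinct values and fails:

```r
df <- data.frame(X    = c("A", "B", "C"),
                 Val1 = c(1, NA, 3),
                 Val2 = c(1, 2, 3),
                 Val3 = c(1, 2, 3),
                 Val4 = c(2, 2, 3))

# All columns except the first, however many there are
vals <- df[, -1, drop = FALSE]

# Number of distinct values per row (NA counts as a value)
n_distinct <- apply(vals, 1, function(x) length(unique(x)))

df$Result <- ifelse(n_distinct == 1, "Pass", "Fail")
df$Result  # "Fail" "Fail" "Pass"
```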

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview):
     time1  time2  time3  time4  slice1  slice2  slice3  slice4
pt1  1      3      NA     NA     NA      1       3       5
pt2  NA     1      3      5      5       2       2       4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice 1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger table of this format), it only returns the rows without any NAs, where all the values are filled in. This makes sense given the use of '&'.
My question is: is there a way to find the rows that satisfy my condition while ignoring the NAs, i.e. to return the rows where the comparison (time1>slice1, time2>slice2, etc.) holds for every pair of columns in which both values are present?
Any help is appreciated. Thanks.
You can write a function that takes a logical value (possibly NA) and maps it to TRUE if it is NA and to its own value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace the condition in your subset call with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially this checks if either time or slice are NA and forces a TRUE, and if not, check whether time is greater than slice. (Individually for each index.) It becomes clearer if you print out conds to see what it looks like.
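The two answers can be combined without eval(parse()): apply na.true to each time/slice pair with Map and fold the results together with Reduce. This is a sketch on the two preview rows from the question; the names per_pair, keep and all.larger are made up here:

```r
patientdata <- data.frame(
  time1  = c(1, NA), time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

# NA comparisons count as TRUE, everything else keeps its value
na.true <- function(x) ifelse(is.na(x), TRUE, x)

n <- 1:4
# One logical vector per time/slice pair
per_pair <- Map(function(tcol, scol) na.true(patientdata[[tcol]] > patientdata[[scol]]),
                paste0("time", n), paste0("slice", n))

# A row is kept only if every pair is TRUE (or NA)
keep <- Reduce(`&`, per_pair)
all.larger <- patientdata[keep, ]
rownames(all.larger)  # "pt1"
```

Here pt2 is dropped because time2 > slice2 is FALSE for it. Be aware that a row in which every pair contains an NA would also be kept under this rule.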

Resources