R - How to compare values across more than two columns

I'm trying to write code to compare the values of several columns, and I don't know ahead of time how many columns I will have. The data will look like this:
X Val1 Val2 Val3 Val4
A 1 1 1 2
B NA 2 2 2
C 3 3 3 3
The code should return a Fail for rows A and B, and a Pass for row C, but needs to be able to handle a changing number of columns. I can't figure out how to do this without nesting a couple of for loops, but there has to be some way to use apply or sapply to iterate through columns 2:length(df).
EDIT: I want to see if the values (which will be numbers) are equal.

Assuming that the first column is excluded from the comparison and all the other columns are included, you can try the following, which returns the indices of the rows whose values are all equal:
which(rowSums(df[,2]==df[,3:ncol(df)])==(ncol(df)-2))

You can use apply with a custom function length(unique(x)) to count the number of unique values per row across columns 2:ncol(df). You can then throw the whole thing into an ifelse function to return a TRUE/FALSE vector.
ifelse(apply(df[ , 2:ncol(df)], MARGIN=1, function(x) length(unique(x))) == 1, TRUE, FALSE)
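As a minimal sketch of the apply approach on the question's sample data (the Result column name and the Pass/Fail labels are assumptions chosen for illustration, not part of the answers above):

```r
# Sample data from the question; column X is the row label.
df <- data.frame(
  X    = c("A", "B", "C"),
  Val1 = c(1, NA, 3),
  Val2 = c(1, 2, 3),
  Val3 = c(1, 2, 3),
  Val4 = c(2, 2, 3)
)

# Drop the label column; drop = FALSE keeps a data frame even if only
# one value column remains.
vals <- df[, -1, drop = FALSE]

# A row passes when it holds exactly one distinct value.
# unique() treats NA as a value of its own, so row B (NA, 2, 2, 2) fails.
all_equal <- unname(apply(vals, 1, function(x) length(unique(x)) == 1))

df$Result <- ifelse(all_equal, "Pass", "Fail")
print(df)
```

Because the columns are selected by position relative to the first one, the same code runs unchanged when the number of Val columns changes.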

Related

group data by tolerance via index list

I don't know how to explain this briefly, but I'll try my best:
I have the following example data:
Data<-data.frame(A=c(1,2,3,5,8,9,10),B=c(5.3,9.2,5,8,10,9.5,4),C=c(1:7))
and an index
Ind<-data.frame(I=c(5,6,2,4,1,3,7))
The values in Ind correspond to column C in Data. I want to start with the first Ind value and find the corresponding row in Data (via column C). From that row I want to look up and down for values in column A that are within a tolerance of 1. I want to write these rows into a result data frame, add a group ID column, and delete them from Data (where I found them). Then I continue with the next entry in Ind, and so on until Data is empty. I know how to match Ind against column C of Data, and how to do the writing, deleting, and other bookkeeping in a for loop, but I don't know the main point, which is my question here:
Once I have found my row in Data, how can I look up the values of column A that fall within the tolerance range above and below that entry, to get my group ID?
What I want to get is this result:
A B C Group
1 5.3 1 2
2 9.2 2 2
3 5 3 2
5 8 4 3
8 10 5 1
9 9.5 6 1
10 4 7 4
Maybe somebody could help me with the critical point in my question or even how to solve this issue in a fast way.
Many thanks!
Generally: avoid deleting or growing a data frame row by row inside a loop. R's memory management means that every time you add or delete a row, another copy of the data frame is made. Garbage collection will eventually discard the "old" copies of the data frame, but garbage can quickly accumulate and reduce performance. Instead, add a logical column to the Data data frame, and set "extracted" rows to TRUE. So like this:
Data$extracted <- rep(FALSE,nrow(Data))
As for your problem: I get a different set of grouping numbers, but the groups are identical.
There might be a more elegant way to do this, but this will get it done.
# store results in a separate list
res <- list()
group.counter <- 1
# loop until they're all done.
for(idx in Ind$I) {
  # skip this iteration if idx is NA.
  if(is.na(idx)) {
    next
  }
  # dat.rows is a logical vector which shows the rows where
  # "A" meets the tolerance requirement.
  # specify the tolerance here.
  mytol <- 1
  # the next only works for integer compare.
  # also not covered: what if multiple values of C
  # match idx? do we loop over each corresponding value of A,
  # i.e. loop over each value of 'target'?
  target <- Data$A[Data$C == idx]
  # use the magic of vectorized logical compare.
  dat.rows <-
    ( (Data$A - target) >= -mytol) &
    ( (Data$A - target) <= mytol) &
    ( ! Data$extracted)
  # if dat.rows is all false, then nothing met the criteria.
  # skip the rest of the loop
  if( ! any(dat.rows)) {
    next
  }
  # copy the rows to the result list.
  res[[length(res) + 1]] <- data.frame(
    A=Data[dat.rows,"A"],
    B=Data[dat.rows,"B"],
    C=Data[dat.rows,"C"],
    Group=group.counter # this value will be recycled to match length of A, B, C.
  )
  # flag the extraction.
  Data$extracted[dat.rows] <- TRUE
  # increment the group counter
  group.counter <- group.counter + 1
}
# now make a data.frame from the results.
# this is the last step in how we avoid
# "growing" a data.frame inside a loop.
resData <- do.call(rbind, res)
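To isolate the step the question asks about, here is a small sketch of the tolerance lookup for a single index value. It uses abs() as an equivalent way of writing the two-sided comparison from the loop; the names in_range and hits are assumptions introduced for illustration.

```r
Data <- data.frame(A = c(1, 2, 3, 5, 8, 9, 10),
                   B = c(5.3, 9.2, 5, 8, 10, 9.5, 4),
                   C = 1:7)

mytol <- 1   # tolerance on column A
idx   <- 5   # first entry of Ind$I in the question

# Value of A in the row whose C matches the index.
target <- Data$A[Data$C == idx]

# One vectorized comparison flags every row within the tolerance,
# both above and below the target.
in_range <- abs(Data$A - target) <= mytol

hits <- Data[in_range, ]
print(hits)
```

For idx 5 the target is 8, so the rows with A equal to 8 and 9 are selected, while 10 falls outside the tolerance.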

Excluding a number of answers from a R dataframe

I'm looking for a way to exclude a number of answers from a length function.
This is a follow-on question from "Getting R Frequency counts for all possible answers". In SQL the syntax could be
select * from someTable
where variableName not in ( 0, null )
Given
Id <- c(1,2,3,4,5)
ClassA <- c(1,NA,3,1,1)
ClassB <- c(2,1,1,3,3)
R <- c(5,5,7,NA,9)
S <- c(3,7,NA,9,5)
df <- data.frame(Id,ClassA,ClassB,R,S)
ZeroTenNAScale <- c(0:10,NA);
R.freq = setNames(nm=c('R','freq'),data.frame(table(factor(df$R,levels=ZeroTenNAScale,exclude=NULL))));
S.freq = setNames(nm=c('S','freq'),data.frame(table(factor(df$S,levels=ZeroTenNAScale,exclude=NULL))));
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums,
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to add up the frequencies. If the NAs were in the freq column itself you could pass na.rm=TRUE to sum; however, because the NA is located in a different column (S), you first need to remove the row containing it.
Our solution is as follows: we remove the rows containing NA by subsetting S.freq[!is.na(S.freq$S),], and then select the second column, freq. Note that this adds up the frequencies, so it equals the number of distinct answers only when each answer occurs once, as it does here:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all rows where freq equals 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0, that means you're looking for unique values. You get this information directly from vector S:
length(unique(df$S))
and leaving NA aside you get answer 4 by:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value:
In R this is easily done with logical vectors, as you already did in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting, e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
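Putting the combined-condition approach together on the question's S column, as a sketch (keep and n_answers are hypothetical names introduced here):

```r
# Only the S column from the question is needed for this count.
df <- data.frame(S = c(3, 7, NA, 9, 5))
ZeroTenNAScale <- c(0:10, NA)

# Frequency table over every possible answer, including NA.
S.freq <- setNames(
  data.frame(table(factor(df$S, levels = ZeroTenNAScale, exclude = NULL))),
  c("S", "freq")
)

# Keep answers that occurred at least once and are not NA.
keep <- S.freq$freq != 0 & !is.na(S.freq$S)

# Summing a logical vector counts its TRUE entries.
n_answers <- sum(keep)
print(n_answers)
```

Counting with sum(keep) is the same as taking the length of the subsetted freq column, and further exclusions can be added to keep with another & condition.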

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1 1 3 NA NA NA 1 3 5
pt2 NA 1 3 5 5 2 2 4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger expanded table of this format), it only returns the rows without any NAs, where all the values are added in. This makes sense given the use of '&'.
My question is: Is there a way to run this conditional search so that it ignores the NAs, returning the rows where every comparison that has values on both sides holds (time1>slice1, time2>slice2, etc.)?
Any help is appreciated. Thanks.
You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace your subset with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
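A minimal sketch of this on the two preview rows from the question (the data frame is rebuilt by hand from the preview, and keep is a name introduced for illustration):

```r
# Preview data from the question, with patient IDs as row names.
patientdata <- data.frame(
  time1  = c(1, NA), time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

# Map NA to TRUE so a missing comparison never rules a row out.
na.true <- function(x) ifelse(is.na(x), TRUE, x)

keep <- with(patientdata,
             na.true(time1 > slice1) & na.true(time2 > slice2) &
             na.true(time3 > slice3) & na.true(time4 > slice4))

print(patientdata[keep, ])
```

Here pt1 is kept because its only complete comparison (time2 > slice2) holds, while pt2 is dropped because time2 > slice2 is FALSE.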
You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially this checks if either time or slice are NA and forces a TRUE, and if not, check whether time is greater than slice. (Individually for each index.) It becomes clearer if you print out conds to see what it looks like.
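If building code as a string feels fragile, the same logic can probably be written as a matrix comparison. This is an alternative sketch, not what either answer above proposes; it assumes the columns follow the time1..timeN and slice1..sliceN naming from the preview, and time_cols, slice_cols, cmp and keep are names introduced here.

```r
patientdata <- data.frame(
  time1  = c(1, NA), time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

n <- 1:4
time_cols  <- paste0("time", n)
slice_cols <- paste0("slice", n)

# Element-wise comparison of the two blocks of columns; NA wherever
# either side is missing.
cmp <- as.matrix(patientdata[time_cols]) > as.matrix(patientdata[slice_cols])

# Treat missing comparisons as passing.
cmp[is.na(cmp)] <- TRUE

# A row is kept when every comparison in it passes.
keep <- apply(cmp, 1, all)
all.smaller <- patientdata[keep, ]
print(all.smaller)
```

Changing n is enough to handle a different number of time/slice pairs.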

R applying to a line

I have a data frame that contains multiple rows and multiple columns.
I have a character vector that contains the names of some of the columns in the data frame. The number of columns can vary.
For each line, for each of these columns, I have to identify if one of them is not NA (basically any(!is.na(df[namescolumns])) for each line), to then do a subset for the ones that are TRUE.
Actually, any(!is.na(df[1,][namescolumns])) works well, but it's only for the first line.
I could easily do a for loop, which is my first reflex as a programmer and because it works for the first line, but I'm sure it's not the R way and that there is a way to do this with an "apply" (lapply, mapply, sapply, tapply or other), but I can't figure out which one and how.
Thank you.
Try using apply over the first dimension (rows):
apply(df, 1, function(x) any(!is.na(x[namescolumns])))
Because the function returns a single value per row, the result is a logical vector with one element per row, so it can be used directly for subsetting and does not need to be wrapped in t(.).
You can use a combination of lapply and Reduce
complete.in.cols <- Reduce(`&`, lapply(colnames, function (name) !is.na(df[name])))
to get a logical vector that is TRUE for the rows with no NA values in any of the columns in colnames, which can in turn be used to subset the data. (Use `|` in place of `&` to keep the rows where at least one of those columns is not NA.)
df[complete.in.cols,]
For example, given:
df <- data.frame(a = c(1,2,3,4,NA,6,7),
b = c(2,4,6,8,10,12,14),
c = c("one","two","three","four","five","six","seven"),
d = c("a",NA,"c","d","e","f","g")
)
colnames <- c("a","d")
You can get:
> df[Reduce(`&`, lapply(colnames, function (name) !is.na(df[name]))),]
a b c d
1 1 2 one a
3 3 6 three c
4 4 8 four d
6 6 12 six f
7 7 14 seven g
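For the condition as the question states it (keep a row when at least one of the named columns is not NA), a loop-free sketch with rowSums might look like this. The small data frame and the names any_present and result are made up for illustration.

```r
# Hypothetical data: rows 2 and 4 are NA in both columns of interest.
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c("x", NA, NA, NA),
                 c = c(10, 20, 30, 40))
namescolumns <- c("a", "b")

# is.na() on a data frame returns a logical matrix, so rowSums()
# counts the non-missing entries per row across the chosen columns.
any_present <- rowSums(!is.na(df[namescolumns])) > 0

result <- df[any_present, ]
print(result)
```

Replacing `> 0` with `== length(namescolumns)` would instead keep only the rows that are complete in all of the named columns.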

Trying to use user-defined function to populate new column in dataframe. What is going wrong?

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This TestDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order that employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
  LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
  return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
  with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and/or the use of which()
If you just want the sequence of visits by employee try this:
(Changed first argument to Month since that is what determines the sequence of locations; this assumes the rows are already sorted by Month within each employee)
with(TestDF, ave(Month, Employee, FUN=seq))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is: take the months, look at them in groups by employee, and return the rank order of the months within each group (where they fall in order).
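A small sketch of why rank is a reasonable choice here: with the rows shuffled (the shuffled data frame and its row order are made up for illustration), ave with FUN=rank still numbers each employee's visits by month rather than by row position.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Shuffle the rows so they are no longer in month order.
shuffled <- TestDF[c(9, 2, 5, 1, 7, 4, 6, 3, 8), ]

# ave() applies rank() to Month separately within each Employee and
# puts each result back in the row it came from.
shuffled$ELN <- ave(shuffled$Month, shuffled$Employee, FUN = rank)

# Sort for display only; the ELN values were computed on the shuffled rows.
restored <- shuffled[order(shuffled$Employee, shuffled$Month), ]
print(restored)
```

Location 8 gets ELN 3 here, since it is the third location visited by employee 3.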
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
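As a sketch of how logical indexing could replace the subset calls inside the question's function, and of the difference between one call on the whole vector and one call per value: the function body below is a rewrite for illustration, not the asker's original, and single and per_row are names introduced here.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Rewritten with logical indexing; like the original, it reads TestDF
# from the enclosing environment and returns a single number.
EmployeeLocationNumber <- function(Site) {
  CurrentEmployee <- TestDF$Employee[TestDF$Location == Site][1]
  LocationDate    <- TestDF$Month[TestDF$Location == Site][1]
  # Count this employee's visits up to and including that month.
  sum(TestDF$Employee == CurrentEmployee & TestDF$Month <= LocationDate)
}

# One call on the whole column: a single value comes back.
single <- EmployeeLocationNumber(TestDF$Location)

# One call per location: vapply() collects one integer per row.
per_row <- vapply(TestDF$Location, EmployeeLocationNumber, integer(1))
TestDF$ELN <- per_row
print(TestDF)
```

vapply is used instead of sapply so that a call returning anything other than one integer fails loudly rather than silently changing the shape of the result.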
