Count with conditional - dataframe - r

I would like to count how many times an observation appears, subject to the condition that one column is greater than another.
For example, how many times "A", "B" and "C" appear, counting only the rows where column B is greater than column C.
set.seed(20170524)
A <- rep(c("A","B","C"),5)
B <- round(runif(15,0,20),0)
C <- round(runif(15,1,5),0) + B
D <- as.data.frame(cbind(A,B,C))
D <- D[order(B),]
Thank you!

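# First, B and C are no longer numeric after cbind(): they became factors (character strings in R >= 4.0), which is a problem.
# Going through as.character() recovers the actual values rather than the factor level codes.
D$B <- as.numeric(as.character(D$B))
D$C <- as.numeric(as.character(D$C))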
# Then get the count for category "A", keeping only rows where B is greater than C, as asked:
countA <- sum(D$A == "A" & D$B > D$C)
Similarly for "B" and "C".
If there are many more categories than just "A", "B" and "C", you might want to use data.table for its by= option, but someone will probably be along to say that's overkill.
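For more than a handful of categories, one base R option is a grouped sum over the logical condition. The sketch below uses tapply() on a small made-up data frame (the values are illustrative, not the question's random sample) and assumes B and C are already numeric.

```r
# Illustrative data (an assumption, not the question's sample)
D <- data.frame(
  A = c("A", "B", "C", "A", "B", "C"),
  B = c(5, 2, 9, 7, 1, 3),
  C = c(3, 4, 8, 6, 0, 3)
)

# TRUE counts as 1, so summing the condition within each level of A
# gives the number of rows per category where B > C
counts <- tapply(D$B > D$C, D$A, sum)
print(counts)
```

Because every category is visited, a category with no matching rows shows up with a count of 0 instead of disappearing.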

You can use: table(D$A[which(D$B>D$C)])
Note that D <- as.data.frame(cbind(A,B,C)) gives you factors (character strings in R 4.0.0 and later), because cbind() first builds a character matrix. Either convert B and C to numeric afterwards, or create the data.frame directly without going through a matrix:
D <- data.frame(A,B,C)
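Putting the pieces together on the question's own data, as a rough sketch: in that sample C is built as B plus a rounded value between 1 and 5, so B > C never holds and every count comes out as 0. Wrapping A in factor() with explicit levels is an addition made here so that the zero counts stay visible in the table.

```r
set.seed(20170524)
A <- rep(c("A", "B", "C"), 5)
B <- round(runif(15, 0, 20), 0)
C <- round(runif(15, 1, 5), 0) + B

# Build the data frame directly so B and C stay numeric
D <- data.frame(A, B, C)

# factor() with explicit levels keeps categories that have a zero count
groups <- factor(D$A, levels = c("A", "B", "C"))
counts <- table(groups[D$B > D$C])
print(counts)
```

Flipping the comparison to D$B < D$C counts the opposite case with the same code.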

Related

how to put many rows in a dataframe by looping in r

I am looping over a list, for example ["A", "B", "C"].
On each pass of the for loop
I compute a value v, so the three runs give three different values v1, v2, v3.
On each pass I want to use cbind("A", "v1") to build one row, and after the three iterations put the three rows together to form a dataframe.
At the end, I want a dataframe in this format:
"A" v1
"B" v2
"C" v3
How can I get this output? Thanks!
I may have misunderstood the request, but is the following what you are looking for?
input <- c("A", "B", "C")
data.frame(x=input, y=paste0("v", seq_along(input)))
# x y
# 1 A v1
# 2 B v2
# 3 C v3
Note that the approach you mentioned in your question (iteratively building a row and combining it with the existing data via rbind) is a bad idea for two reasons. It takes a lot more typing (the operation above fits in one line), and it is inefficient (you can read more about that in the second circle of The R Inferno).
The part I have been stuck on is that I have to start with an empty dataframe:
output <- data.frame()
for (e in mylist) {
  v <- get_value(e)               # get_value() is a placeholder for the function that computes v from e
  one_row <- cbind(e, v)          # one row: e and its corresponding v
  new_f <- data.frame(one_row)
  output <- rbind(output, new_f)  # append the row to the accumulated result
}
At the end, I get the right output.
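When each value really does have to come from a function call, one common way to avoid growing the data frame inside the loop is to collect the rows in a list and bind them once at the end. This is only a sketch: get_value() and the contents of mylist are hypothetical stand-ins for the asker's own function and input.

```r
mylist <- c("A", "B", "C")

# Hypothetical stand-in for the real computation on each element
get_value <- function(e) tolower(e)

# Build one single-row data frame per element ...
rows <- lapply(mylist, function(e) {
  data.frame(e = e, v = get_value(e), stringsAsFactors = FALSE)
})

# ... and combine them with a single rbind() call
output <- do.call(rbind, rows)
print(output)
```

The result has one row per element of mylist, and rbind() runs once rather than once per iteration.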

assign multiple categorical values to series of data frame variables in r

Say I have the following data input into R
G <- c(1,1,0,0,0,0,0,0,0)
H <- c(0,1,1,0,0,0,0,0,0)
I <- c(0,0,0,0,1,1,0,0,0)
J <- c(0,0,0,1,0,1,0,0,0)
K <- c(0,0,0,0,0,0,1,1,0)
L <- c(0,0,0,0,0,0,0,1,1)
list <- data.frame(G,H,I,J,K,L)
I want to assign
'a' value to any observation where 1 appears in either G or H or appears in both
'b' to observations where 1 appears in either/both of I and J.
'c' to observations where 1 appears in either/both K and L.
This is a simple solution: create a variable, then assign values to it using subsets. Is this sufficient for your purpose?
list$Z <- NA
list$Z[list$G|list$H] <- "a"
list$Z[list$I|list$J] <- "b"
list$Z[list$K|list$L] <- "c"
list
EDIT:
As per the suggestion by David Arenburg, the code gets cleaner and more readable (and probably more efficient) by using within():
list$Z <- NA
list <- within(list, { Z[G|H] <- "a"; Z[I|J] <- "b"; Z[K|L] <- "c" })
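A quick check of the subset-assignment approach on the question's data. The data frame is named dat here (an assumption, chosen to avoid masking base R's list()), and the comparisons are written as == 1 to make the condition explicit. If a row matched more than one group, the later assignment would overwrite the earlier one.

```r
G <- c(1, 1, 0, 0, 0, 0, 0, 0, 0)
H <- c(0, 1, 1, 0, 0, 0, 0, 0, 0)
I <- c(0, 0, 0, 0, 1, 1, 0, 0, 0)
J <- c(0, 0, 0, 1, 0, 1, 0, 0, 0)
K <- c(0, 0, 0, 0, 0, 0, 1, 1, 0)
L <- c(0, 0, 0, 0, 0, 0, 0, 1, 1)
dat <- data.frame(G, H, I, J, K, L)

# Start with missing values, then fill each group in turn
dat$Z <- NA_character_
dat$Z[dat$G == 1 | dat$H == 1] <- "a"
dat$Z[dat$I == 1 | dat$J == 1] <- "b"
dat$Z[dat$K == 1 | dat$L == 1] <- "c"
print(dat$Z)
```

Rows that match none of the three conditions would keep NA in Z.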

as.data.frame and cbind results in factor columns

I have a big data.frame with a mix of integer and character (string) columns. I need to order the data.frame by a numeric column.
When I combine the original columns into a data.frame all the columns change to factor, including the column I need for the sort. So the sort gives something like 1, 10, 100... instead of 1, 2, 3...
Here is an example of my problem.
a <- 1:10
b <- c(1,3,5,6,2,10,100,110,7,4)
c <- LETTERS[1:10]
d <- as.data.frame(cbind(a, b, c)) # I am using this construction
e <- d[with(d, order(b)), ]
How can I fix this?
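cbind() builds a character matrix first, so every column, including the numeric ones, is stored as text and then converted to a factor. Build the data frame directly instead:
d <- data.frame(a, b, c, stringsAsFactors=FALSE)
The last part, stringsAsFactors=FALSE, prevents column d$c from being converted to a factor. Include it, and your strings will stay as strings. (Since R 4.0.0 this is already the default.)
Don't forget stringsAsFactors=FALSE - it will save you untold misery, trust me!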
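As a minimal sketch using the question's vectors, the following builds the data frame directly and sorts it numerically. It also shows one way to repair a data frame that already went through cbind(); the name d_bad is introduced here purely for illustration.

```r
a <- 1:10
b <- c(1, 3, 5, 6, 2, 10, 100, 110, 7, 4)
c <- LETTERS[1:10]

# Direct construction keeps a and b numeric and c as character
d <- data.frame(a, b, c, stringsAsFactors = FALSE)
e <- d[order(d$b), ]
print(e$b)

# Repairing a frame that was built through cbind():
# as.character() first, so factor level codes are not used as values
d_bad <- as.data.frame(cbind(a, b, c))
d_bad$b <- as.numeric(as.character(d_bad$b))
e_bad <- d_bad[order(d_bad$b), ]
```

Both e and e_bad are ordered 1, 2, 3, ... rather than 1, 10, 100, ...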

Finding maximum value for column among a subset of a data frame

Given a data frame df with columns d, c, v. How do I find the value of d for the maximum value of v among the subset of records where c == "foo"?
I tried this:
df[df$v==max(df$v) & df$c == "foo","d"]
But I got:
character(0)
You can do as follows:
with(df, d[v== max(v[c=="foo"])])
EDITED:
If you want to get the value of d for all the levels of c:
library(plyr)
ddply(df, "c", subset, v==max(v))
While Manuel's answer will work most of the time, I believe a more correct version would be:
with(df, d[v== max(v[c=="foo"]) & c=="foo"])
Otherwise it's possible to match a row whose v equals that maximum but which is not in the c=="foo" subset.
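The difference between the three expressions is easiest to see on a small example. The data below is made up for illustration: the overall maximum of v sits in a "bar" row, which is one plausible reason the original attempt returned character(0), and a "bar" row happens to share the "foo" maximum.

```r
# Made-up data for illustration only
df <- data.frame(
  d = c("x", "y", "z", "w"),
  c = c("foo", "bar", "foo", "bar"),
  v = c(3, 10, 7, 7),
  stringsAsFactors = FALSE
)

# Original attempt: the global maximum (10) is not in the "foo" rows
attempt <- df[df$v == max(df$v) & df$c == "foo", "d"]
print(attempt)   # character(0)

# Maximum taken within "foo", but matched against all rows
loose <- with(df, d[v == max(v[c == "foo"])])
print(loose)     # "z" "w"

# Maximum taken within "foo" and matched only against "foo" rows
strict <- with(df, d[v == max(v[c == "foo"]) & c == "foo"])
print(strict)    # "z"
```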

Subtracting two dataset

I have 2 datasets. One is the parent dataset (A) and the other is a subset (B) of it. I want to create a dataset from A which does not contain the rows from B. It should be something like
C=A-B
Both datasets A and B have the same number of columns and the same column names.
If B is an actual subset of A, you can use setdiff on rownames:
sset <- subset(mtcars,cyl==4)
mtcars[setdiff(rownames(mtcars),rownames(sset)),]
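Spelled out with the question's names, the row-name approach might look like the sketch below. It assumes B was obtained by subsetting A, so that B still carries A's row names; mtcars stands in for the parent dataset.

```r
A <- mtcars                  # parent dataset (built-in example data)
B <- subset(A, cyl == 4)     # subset that keeps A's row names

# Keep the rows of A whose names do not occur in B
C <- A[setdiff(rownames(A), rownames(B)), ]

# The same selection written as a logical index
C2 <- A[!(rownames(A) %in% rownames(B)), ]
print(nrow(C))
```

If B was read in separately or had its row names reset, the names no longer line up and a comparison of the row contents is needed instead.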
If you do not want to convert the rows into strings for comparison, i.e. you want exact matches, you can try this:
a <- data.frame(t(matrix(1:12,3,4)))
b <- data.frame(t(matrix(7:21,3,5)))
a[!apply(a,1,FUN=function(y){any(apply(b,1,FUN=function(x){all(x==y)}))}),]
Something like the following might do the trick:
C <- A[!(apply(A, 1, toString) %in% apply(B, 1, toString)), ]
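For comparison, here is the string-key approach applied to the small a and b data frames from the previous answer, as a rough sketch. The names key_a, key_b and only_in_a are introduced here for illustration; all columns are integers, so each row collapses to a key such as "1, 2, 3".

```r
a <- data.frame(t(matrix(1:12, 3, 4)))   # rows (1,2,3) ... (10,11,12)
b <- data.frame(t(matrix(7:21, 3, 5)))   # rows (7,8,9) ... (19,20,21)

# Collapse every row into one string key
key_a <- apply(a, 1, toString)
key_b <- apply(b, 1, toString)

# Keep the rows of a whose key does not occur among the keys of b
only_in_a <- a[!(key_a %in% key_b), ]
print(only_in_a)
```

On this data the rows (1, 2, 3) and (4, 5, 6) remain, which matches the result of the nested apply() version.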
