I have a dataframe a that has columns id and date and a second dataframe b that has id as its first column. For each row in b, I'm trying to find all rows in a with the same id, and then find the minimum of the dates. I'm using the code below, but when I run this, I'm getting a numeric as opposed to dates. I'm wondering if someone can help me with this.
class(a$date)
# "Date"
funP <- function(x){
b <- subset(a, id==x[1])
min(b$date)
}
f <- apply(b, 1, funP)
class(f)
# "numeric"
Apparently, apply drops the Date class and returns the underlying numbers (days since 1970-01-01). The manual (?apply) mentions:
Value:
[...]
In all cases the result is coerced by ‘as.vector’ to one of the basic
vector types [...]
You could convert it back to the Date class:
f <- as.Date(f, origin="1970-01-01")
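If you would rather not go through numeric at all, one alternative is to skip apply. This is a minimal sketch with small made-up versions of a and b (the values are illustrative, not from the question): lapply keeps each minimum as a Date, and do.call(c, ...) combines them using the Date method of c().

```r
# Illustrative data (assumed; not from the original question)
a <- data.frame(
  id   = c(1, 1, 2, 2),
  date = as.Date(c("2020-03-01", "2020-01-15", "2021-06-30", "2021-02-10"))
)
b <- data.frame(id = c(1, 2))

# One Date per id in b; lapply does not coerce its results
min_list <- lapply(b$id, function(i) min(a$date[a$id == i]))

# c() dispatches to its Date method, so the class survives
f <- do.call(c, min_list)
class(f)
# "Date"
```

An id in b with no matching rows in a would need separate handling, since min() of an empty vector is not a usable date.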
Ok, so I am having this issue right now. I have a matrix A whose rownames are the values of a field in another matrix B. I want to find the indices of my rownames in B, so I am trying which(B$field == rowname_A). Two things get in the way. First, the rowname_A variable is of class character, in the format "X12345". Second, the values of B$field are of type factor. Is there a way to remove the prepended X from the character, convert it to factor, and do the comparison? Or should I convert the factor values of B$field to character and then do the comparison?
Help will be appreciated.
Thanks.
This is fairly straightforward. The example below should help you out.
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(1:3))
# Remove "X" from rownames(A) and check equality
B$field %in% substr(rownames(A), 2, nchar(rownames(A)))
# Add "X" to B$field and check equality
paste0("X", B$field) %in% rownames(A)
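Since the question asks for indices rather than TRUE/FALSE, match() may be closer to what is wanted. This is a sketch on the same kind of toy data; B$field is shuffled here, an assumption made only so the positions are not trivially 1, 2, 3.

```r
A <- matrix(1:3)
rownames(A) <- paste0("X", 1:3)
B <- data.frame(field = factor(c(3, 1, 2)))

# Strip the leading "X" and compare as character, not as factor codes
keys <- sub("^X", "", rownames(A))
idx  <- match(keys, as.character(B$field))
idx
# 2 3 1
```

match() returns NA for a rowname that does not occur in B$field, which makes missing keys easy to spot.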
df is a frequency table, where each value in a was reported as many times as recorded in columns x, y, and z. I'm trying to convert the frequency table back to the original data, so I use the rep() function.
How do I loop the rep() function to give me the original data for x, y, z without having to repeat the function several times like I did below?
Also, can I input the result into a data frame, bearing in mind that the output will have different column lengths:
a <- (1:10)
x <- (6:15)
y <- (11:20)
z <- (16:25)
df <- data.frame(a,x,y,z)
df
rep(df[,1], df[,2])
rep(df[,1], df[,3])
rep(df[,1], df[,4])
If you don't want to repeat the rep() call for every column, you can use one of the apply functions. Note that you cannot store the result directly in a data.frame because the vectors have different lengths, but you can store it in a list and access the elements in a similar way to a data.frame. Something like this works (sapply returns a list here precisely because the lengths differ):
df2<-sapply(df[,2:4],function(x) rep(df[,1],x))
What this sapply function is saying is for each column in df[,2:4], apply the rep(df[,1],x) function to it where x is one of your columns ( df[,2], df[,3], or df[,4]).
The below code just makes sure the apply function is giving the same result as your original way.
identical(df2$x,rep(df[,1], df[,2]))
[1] TRUE
identical(df2$y,rep(df[,1], df[,3]))
[1] TRUE
identical(df2$z,rep(df[,1], df[,4]))
[1] TRUE
EDIT:
If you want it as a data.frame object you can do this:
res<-as.data.frame(sapply(df2, '[', seq(max(sapply(df2, length)))))
Note this introduces NAs into your data.frame so be careful!
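If the NA padding is a concern, a long format avoids it. This is a sketch rather than the answer's own method, using lapply (which always returns a list) and stack(); the names expanded and long are made up for illustration.

```r
a <- 1:10
df <- data.frame(a, x = 6:15, y = 11:20, z = 16:25)

# lapply always returns a list, even if the lengths happen to match
expanded <- lapply(df[2:4], function(n) rep(df$a, n))
lengths(expanded)
# x: 105, y: 155, z: 205

# Long format: one row per observation, 'ind' records the source column
long <- stack(expanded)
table(long$ind)
```

Each original column can then be recovered with long$values[long$ind == "x"] and so on, with no NAs involved.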
I have a data frame of 15 columns where the first column is an integer and others are numeric. I have to generate a one-liner summary of the sum of all columns except the last one. I need to generate mean of the last column. So, I am doing something as below:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then appears with values to two decimal places even for the integer column (the first one). I have been trying the round function to fix this. I could understand this if different types were being added, e.g. 1 + 1.0, but in this case shouldn't the summation keep the data type?
Please let me know what I am missing.
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with @Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
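As for why the integer sum shows decimals in the original c(...) call: c() builds an atomic vector, and an atomic vector has a single type, so an integer sum placed next to a double mean is promoted to double. A small sketch with made-up values (x_int and x_num are hypothetical names):

```r
x_int <- 1:4                 # integer column
x_num <- c(2, 2.5, 3)        # numeric (double) column

class(sum(x_int))
# "integer": the sum itself keeps the type

combined <- c(sum(x_int), mean(x_num))
class(combined)
# "numeric": c() promoted the integer to double

# A list keeps each element's own type
kept <- list(total = sum(x_int), avg = mean(x_num))
sapply(kept, class)
```

This is why the answer above collects the results in a list before calling as.data.frame.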
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))
What is the best practice for handling this particular problem when it comes up? For example, I have created a data frame:
dat<- sqlQuery(con,"select * from mytable")
in which my table looks like:
ID RESULT GROUP
-- ------ -----
 1 Y      A
 2 N      A
 3 N      B
 4 Y      B
 5 N      A
Here ID is an int, and RESULT and GROUP are both factors.
The problem is that when I want to do something like:
tapply(dat$RESULT,dat$GROUP,sum)
I get complaints about columns being a factor:
Error in Summary.factor(c(2L,2L,2L,2L,1L,2L,1L,2L,2L,1L,1L, :
sum not meaningful for factors
Given that factors are essential for use in things like ggplot, how does everyone else handle this?
Setting stringsAsFactors=FALSE and rerunning gives
tapply(dat$RESULT,dat$GROUP,sum)
Error in FUN(X[[1L]], ...) : invalid 'type' (character) of argument
so I'm not sure merely setting stringsAsFactors=FALSE is the right approach
I assume you want to sum up the "Y"s in the RESULT column.
As suggested by @akrun, one possibility is to use table()
with(dat,table(GROUP,RESULT))
If you want to stick with tapply(), you can change the type of the RESULT column to logical:
dat$RESULT <- dat$RESULT=="Y"
tapply(dat$RESULT,dat$GROUP,sum)
If your goal is to have some columns as factors and others as strings, you can convert only selected columns of the result to factors, e.g. with
dat<- sqlQuery(con,"select ID,RESULT,GROUP from mytable",as.is=2)
As the read.table man page (referenced by the sqlQuery man page) puts it: as.is is either a vector of logicals (values are recycled if necessary), or a vector of numeric or character indices which specify which columns should not be converted to factors.
But then again, you need either to use table() or to turn the result into a logical.
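If RESULT should stay a factor (for plotting, say), the comparison can also be done inline so the column is never overwritten. This is a sketch on a hand-built copy of the table from the question, not on the real query result:

```r
# Hand-built stand-in for the sqlQuery result (assumed data)
dat <- data.frame(
  ID     = 1:5,
  RESULT = factor(c("Y", "N", "N", "Y", "N")),
  GROUP  = factor(c("A", "A", "B", "B", "A"))
)

# Compare inline: the logical vector is summed, the factor is untouched
counts <- tapply(dat$RESULT == "Y", dat$GROUP, sum)
counts
# A: 1, B: 1

is.factor(dat$RESULT)
# TRUE
```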
I'm not clear what your question is, either. If you're just trying to sum the Y's, how about:
library(dplyr)
df <- data.frame(ID = 1:5,
RESULT = as.factor(c("Y","N","N","Y","N")),
GROUP = as.factor(c("A", "A", "B", "B", "A")))
df %>% mutate(logRes = (RESULT == "Y")) %>%
summarise(sum=sum(logRes))
When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the Three Stooges' ages. When I try to add 2 to a value that had been numeric, R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
print(er['_age_'])
print(er[2])
print(is.numeric(er[2]))
print(class(er[2]))
return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
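The coercion is easy to see directly. A small sketch with plain column names (who and age are assumed here for readability; the question uses '_who_' and '_age_'):

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"), age = c(20, 30, 50))

# Mixed columns: everything becomes character
m <- as.matrix(df)
is.character(m)
# TRUE

# Numeric columns only: the matrix stays numeric
m_num <- as.matrix(df["age"])
is.numeric(m_num)
# TRUE
```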
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
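The drop=FALSE is needed to keep the result a data frame; without it, a single column would be returned as a plain vector.
-1 means all but the first column; you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')]. Because dropping a column shifts the positions, dfunc should then index by name (er['_age_']) rather than by position (er[2]).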
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.
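With either option, keep in mind that df[i, ] is a one-row data frame, so er[2] is a one-column data frame rather than a number. The sketch below uses the data from the question and pulls the value out with [[ ]]; add_two_years is a hypothetical replacement for dfunc.

```r
df <- data.frame(who = c("Larry", "Curly", "Mo"), age = c(20, 30, 50))
colnames(df) <- c("_who_", "_age_")

# [[ ]] extracts the bare value, so the result is a plain number
add_two_years <- function(row) row[["_age_"]] + 2

older <- sapply(seq_len(nrow(df)), function(i) add_two_years(df[i, ]))
older
# 22 32 52
```

For something this simple, df[["_age_"]] + 2 gives the same result without any looping.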