This is making me feel dumb, but I am trying to produce a single vector/df/list/etc (anything but a matrix) concatenating two factors. Here's the scenario. I have a 100k line dataset. I used the top half to predict the bottom half and vice versa using knn. So now I have 2 objects created by knn predict().
> head(pred11)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(pred12)
[1] 0 1 1 0 0 0
Levels: 0 1
> class(pred11)
[1] "factor"
> class(pred12)
[1] "factor"
Here's where my problem starts:
> pred13 <- rbind(pred11, pred12)
> class(pred13)
[1] "matrix"
There are 2 problems. First, it changes the 0's and 1's to 1's and 2's, and second, it seems to create a huge matrix that eats all my memory. I've tried messing with as.numeric(), data.frame(), etc., but can't get it to just combine the two 50k factors into one 100k factor. Any suggestions?
James presented one way, I'll chip in with another (shorter):
set.seed(42)
x1 <- factor(sample(0:1, 10, replace=TRUE))
x2 <- factor(sample(0:1, 10, replace=TRUE))
unlist(list(x1, x2))
# [1] 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1
# Levels: 0 1
This might seem a bit like magic, but unlist has special support for factors for exactly this purpose! All elements in the list must be factors for this to work.
rbind will create a 2 x 50000 matrix in your case, which isn't what you want. c is the correct function to combine 2 vectors into a single longer vector. However, when you use rbind on a factor (or c in R versions before 4.1.0), it uses the underlying integer codes that map to the levels, which is why your 0's and 1's became 1's and 2's. Since R 4.1.0, c() on factors returns a factor. In general you need to combine as character before re-factoring:
x1 <- factor(sample(0:1, 10, replace=TRUE))
x2 <- factor(sample(0:1, 10, replace=TRUE))
factor(c(as.character(x1), as.character(x2)))
[1] 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0
Levels: 0 1
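For what it's worth, here is a small sketch comparing the two approaches on fixed data, so the result doesn't depend on the random seed. The names f1, f2, via_unlist and via_char are made up for illustration; they stand in for pred11, pred12 and the combined result. It also shows how to get back to plain numeric 0/1 if that is what you need downstream.

```r
# Two small factors standing in for pred11 and pred12 (illustrative data)
f1 <- factor(c(0, 0, 1))
f2 <- factor(c(1, 0, 0))

# Approach 1: unlist() on a list of factors returns a single factor
via_unlist <- unlist(list(f1, f2))

# Approach 2: go through character, then re-create the factor
via_char <- factor(c(as.character(f1), as.character(f2)))

# Both keep the labels "0" and "1" rather than the integer codes 1 and 2
print(via_unlist)
print(via_char)

# To recover numeric 0/1, convert via character first;
# as.numeric() directly on a factor would return the codes 1 and 2
as_numbers <- as.numeric(as.character(via_unlist))
print(as_numbers)
```

Note that when the two factors do not share the same levels, the order of the levels in the result can differ between the two approaches, since factor() sorts the labels.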
    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
 1    ppl_100 act2_1734928       0        1            0           0       1                0                 NA
 2    ppl_100 act2_2434093       0        1            0           0       2                0                  0
 3    ppl_100 act2_3404049       0        1            0           0       3                0                  0
 4    ppl_100 act2_3651215       0        1            0           0       4                0                  0
 5    ppl_100 act2_4109017       0        1            0           0       5                0                  0
 6    ppl_100  act2_898576       0        1            0           0       6                0                  0
 7 ppl_100002 act2_1233489       1        1            1           1       1                1                  1
 8 ppl_100002 act2_1623405       1        1            1           2       2                1                  0
 9 ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0
I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that I want to compute it for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:
success_rate_trend <- vector(mode="numeric", length=nrow(succ_rate_df)-1)
for (i in 2:nrow(succ_rate_df)) {
  if (succ_rate_df[i,1] != succ_rate_df[i-1,1]) {
    success_rate_trend[i] = NA
  } else {
    success_rate_trend[i] <- succ_rate_df[i,8] - succ_rate_df[i-1,8]
  }
}
It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?
Use vectorization:
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
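Note:
people_id is a factor variable (fctr). To use diff() on it we must first remove the factor class with as.integer() or unclass(). The comparison assumes that all rows for a given people_id are adjacent, as they are in your sample.
The result has nrow(succ_rate_df) - 1 elements, one for each row from the second onward; prepend NA if you want to store it as a column.
You do not have an ordinary data frame but a tbl_df from dplyr, so matrix-like indexing does not behave as it does for a data.frame: succ_rate_df[, 1] returns a one-column tbl_df, not a vector. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead.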
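Here is a minimal sketch of the diff() approach on a toy data frame, in case it helps to see the alignment. The data frame toy and its values are invented for illustration; only the column names people_id and cum_success_rate come from the question. It assumes rows are already sorted so that each person's rows are contiguous.

```r
# Toy data standing in for succ_rate_df (values are made up)
toy <- data.frame(
  people_id = factor(c("p1", "p1", "p1", "p2", "p2")),
  cum_success_rate = c(0, 0.5, 0.5, 1, 0.75)
)

# Row-to-row change in the cumulative rate (length is nrow - 1)
trend <- diff(toy$cum_success_rate)

# Blank out differences that straddle two different people
trend[diff(as.integer(toy$people_id)) != 0] <- NA_real_

# Prepend NA for the very first row so the vector lines up with the rows
toy$success_rate_trend <- c(NA_real_, trend)
print(toy)
```

The first row of every person ends up as NA, and every other row holds the change from the previous row of the same person.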
You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.
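n = nrow(succ_rate_df)
id = succ_rate_df$people_id            # extract columns as plain vectors
rate = succ_rate_df$cum_success_rate
same_person = id[2:n] == id[1:(n-1)]   # TRUE where a row has the same id as the row before
success_rate_trend = rep(NA_real_, n - 1)   # NA wherever the person changes
is_true = which(same_person)
success_rate_trend[is_true] = rate[is_true+1] - rate[is_true]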
The answer by Zheyuan Li is neater.
I'm going to offer an answer based on a dataframe version of this data. You SHOULD learn to post the output of dput, so that objects with special properties, like the tibble you have printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my dataframe dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to the groups defined by one or more grouping vectors. I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are worked examples on SO that use ave with two grouping factors.
success_rate_trend <- with( dat,
ave( cum_success_rate, people_id, FUN= function(x) c(NA, diff(x) ) ) )
success_rate_trend
[1] NA 0 0 0 0 0 NA 0 NA 0
# not a very interesting result
Is it possible to set the colnames of a matrix in R to numeric?
I know that they are character or NULL by default.
But if there's a way to transform them to numeric, it would be very helpful for me.
Any clarification would be welcome.
EDIT
I'll explain myself more clearly:
I have a dataframe that contains two numeric columns, for example:
> xy
     x y
[1,] 1 1
[2,] 2 2
[3,] 3 5
[4,] 4 7
> xy = as.data.table(xy)
> xy_cast = dcast(xy, x~y, value.var="y", fun.aggregate=length)
> xy_cast
  x 1 2 5 7
1 1 1 0 0 0
2 2 0 1 0 0
3 3 0 0 1 0
4 4 0 0 0 1
> xy_cast = xy_cast[-1]
> xy_cast
  1 2 5 7
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
> class(colnames(xy_cast))
[1] "character"
As you can see, my colnames are numbers, but they are coerced to character. If I could transform the colnames to numeric, it would reduce the execution time of the rest of the algorithm.
But, I'm not sure that's possible.
SOLUTION
I approached the problem from a different angle:
which( colnames(df)=="b" )
This expression gives me the column number for a given column name, which let me work with column positions instead of names and reduced the execution time.
The first answer to this question helped me solve my problem:
Get the column number in R given the column name
Thank you for responses.
By default you can refer to columns by their position:
df[,3]
will return the third column, or you can also use
df[,"155"]
which will return the column named "155" (the name is a character string).
If you're after the column name as a numeric, you can extract it and then coerce it, e.g.
as.numeric(names(df)[3])
will return the name of the third column as a numeric, provided the name is a number stored as a character string. So you can easily go back and forth.
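A short sketch of going back and forth between names and numbers, using a made-up matrix m (not the asker's data). It also shows that assigning numbers as colnames of a matrix does not make them numeric: they are stored as character either way.

```r
# Illustrative matrix with number-like column names
m <- matrix(0, nrow = 2, ncol = 3,
            dimnames = list(NULL, c("1", "5", "7")))

# Assigning numeric values as column names still stores them as character
colnames(m) <- c(1, 5, 7)
name_class <- class(colnames(m))
print(name_class)

# Recover the numeric values from the names when they are needed
col_values <- as.numeric(colnames(m))
print(col_values)

# Look up the position of a column by name once, then index by position
idx <- match("5", colnames(m))
print(idx)
print(m[, idx])
```

match() returns the first matching position (or NA if there is none), so it is a close relative of the which(colnames(df) == "b") idiom.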
Suppose I have the following data.frame:
df=data.frame(cat=c("a","b","c"),y=c(1,2,3))
Taking the model.matrix of the categories (cat) converts them to dummy variables as follows:
model.matrix(~0+cat,df)
  cata catb catc
1    1    0    0
2    0    1    0
3    0    0    1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cat
[1] "contr.treatment"
However, I wish to have those dummy variables be assigned the values in df$y instead. One possible solution I can think of is to row multiply with y.
However, I'm guessing there are better purpose built functions for this?
So basically, what is the most efficient way of converting dummy variables to a given vector?
Maybe we can try
library(reshape2)
acast(df, cat~y, value.var="y", fill=0)
#  1 2 3
#a 1 0 0
#b 0 2 0
#c 0 0 3
Or using model.matrix
model.matrix(~0 + cat, df) *df$y
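To spell out why the plain multiplication works, here is a small sketch using the df from the question. The names dummies and scaled are mine. R recycles df$y down the columns of the matrix, and since the matrix has as many rows as y has elements, every entry in row i is multiplied by y[i].

```r
df <- data.frame(cat = c("a", "b", "c"), y = c(1, 2, 3))

# One indicator column per category, no intercept
dummies <- model.matrix(~ 0 + cat, df)

# Element-wise product: y is recycled column by column,
# so row i of the matrix is scaled by y[i]
scaled <- dummies * df$y
print(scaled)
```

This relies on length(df$y) being equal to nrow(dummies); with any other length the recycling would no longer line up with the rows.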
I want to style the output of table(). Suppose I have the following:
dat$a <- c(1,2,3,4,4,3,4,2,2,2)
dat$b <- c(1,2,3,4,1,2,4,3,2,2)
table(dat$a,dat$b)
      1   2   3   4
  1  50   0   0   0
  2   0 150  50   0
  3   0  50  50   0
  4  50   0   0 100
There are two problems with this. First, it doesn't give me the correct frequencies. Additionally, it has no row or column labels. I found this, and the table works for both frequency counts and axis labels. Is there an issue because this way subsets from a data frame? I would appreciate any tips on both fixing the frequency counts and adding style to the table.
The only problem is the way that you are inputting arguments to table. To get the desired output (with labels), use the data frame as argument, not 2 vectors (the columns). If you have a larger data frame, use only the subset that you want.
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
dat <- data.frame(a,b)
table(dat)
Gives me the output:
   b
a   1 2 3 4
  1 1 0 0 0
  2 0 3 1 0
  3 0 1 1 0
  4 1 0 0 2
It shouldn't give the wrong frequencies, even with your approach. You could try restarting your R session to check this.
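If you would rather keep passing the two columns separately, one option is to name the arguments to table(); the names are then used as the axis labels. A minimal sketch with the same data (tab is just an illustrative name):

```r
a <- c(1, 2, 3, 4, 4, 3, 4, 2, 2, 2)
b <- c(1, 2, 3, 4, 1, 2, 4, 3, 2, 2)
dat <- data.frame(a, b)

# Named arguments become the names of the table's dimensions
tab <- table(a = dat$a, b = dat$b)
print(tab)

# The dimension labels can be inspected directly
print(names(dimnames(tab)))
```

The counts are the same as with table(dat); only the way the labels are supplied differs.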
I have a data frame, that I am wanting to use to generate a design matrix.
> ct <- read.delim(filename, skip=0, as.is=TRUE, sep="\t", row.names = 1)
> ct
      s2 s6 S10 S14 S3 S7 S11 S15 S4 S8 S12 S16
group  1  1   1   1  2  2   2   2  3  3   3   3
donor  1  2   3   4  1  2   3   4  1  2   3   4
> factotum <- apply(ct,1,as.factor) # to turn rows into factors.
> design <- model.matrix(~0 + factotum[,1] + factotum[,2])
Eventually, I'll generate a string and use as.formula() instead of hard coding the formula. Anyway, this works and produces a design matrix. It leaves a column out though.
> design
   factotum[, 1]1 factotum[, 1]2 factotum[, 1]3 factotum[, 2]2 factotum[, 2]3 factotum[, 2]4
 1              1              0              0              0              0              0
 2              1              0              0              1              0              0
 3              1              0              0              0              1              0
 4              1              0              0              0              0              1
 5              0              1              0              0              0              0
 6              0              1              0              1              0              0
 7              0              1              0              0              1              0
 8              0              1              0              0              0              1
 9              0              0              1              0              0              0
10              0              0              1              1              0              0
11              0              0              1              0              1              0
12              0              0              1              0              0              1
By my reasoning, the column names should be:
factotum[, 1]1, factotum[, 1]2, factotum[, 1]3, factotum[, 2]1, factotum[, 2]2, factotum[, 2]3, factotum[, 2]4. These would be renamed as group1, group2, group3, donor1, donor2, donor3, donor4.
Which means that factotum[, 2]1, or donor1, is missing. What am I doing that causes it to be missing? Any help would be appreciated.
Cheers
Ben.
There are several things here.
(1) apply(ct,1,as.factor) doesn't turn the rows into factors: apply() simplifies its result to a matrix, and a matrix cannot hold factors. Try str(factotum) and you'll see that it failed. I'm not sure what the fastest way is, but this should work:
factotum <- data.frame(lapply(data.frame(t(ct)), as.factor))
(2) Since you are working with factors, model.matrix creates dummy coding. In this case, donor has four values. If you are 2, then you get a 1 in the column factotum[,2]2. If you are 3 or 4, you get a 1 in their respective columns. So what if you are a 1? Well, that simply means that you are 0 in all three columns. In this way, you only need three columns to create four groups. The value 1 for donor is called the reference group here, which is the group with which the other groups are compared.
(3) So now the question is... Why doesn't group (or factotum[,1]) have only TWO columns? We could easily code three levels with two columns, right? Well... yes, this is exactly what happens when you use:
design <- model.matrix(~ factotum[,1] + factotum[,2])
However, since you specify that there is no intercept, you'll get an extra column for group.
(4) Usually you don't have to create the design matrix yourself. I'm not sure what function you want to use next, but in most cases the functions take care of it for you.
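To see points (2) and (3) side by side, here is a small sketch. It builds group and donor directly as factors with the same layout as your data; the names group, donor, with_intercept and no_intercept are mine and stand in for factotum[,1], factotum[,2] and the two design matrices.

```r
# Same layout as in the question: 3 groups x 4 donors
group <- factor(rep(1:3, each = 4))
donor <- factor(rep(1:4, times = 3))

# With an intercept, both factors drop their first level (the reference)
with_intercept <- model.matrix(~ group + donor)
print(colnames(with_intercept))
# "(Intercept)" "group2" "group3" "donor2" "donor3" "donor4"

# Without an intercept, the first factor keeps all of its levels,
# but the second factor still drops its reference level
no_intercept <- model.matrix(~ 0 + group + donor)
print(colnames(no_intercept))
# "group1" "group2" "group3" "donor2" "donor3" "donor4"
```

Either way the matrix has six columns. A seventh column for donor1 would be redundant, because the donor columns would then add up to the same thing as the group columns (or the intercept).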