model.matrix with assigned values - r

Suppose I have the following data.frame:
df=data.frame(cat=c("a","b","c"),y=c(1,2,3))
Taking the model.matrix of the categories (cat) converts them to dummy variables as follows:
model.matrix(~0+cat,df)
  cata catb catc
1    1    0    0
2    0    1    0
3    0    0    1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$cat
[1] "contr.treatment"
However, I wish to have those dummy variables carry the values in df$y instead of 1. One possible solution I can think of is to multiply each row by y, but I'm guessing there are better purpose-built functions for this?
So basically, what is the most efficient way of replacing the 1s in the dummy variables with the values of a given vector?

Maybe we can try
library(reshape2)
acast(df, cat~y, value.var="y", fill=0)
# 1 2 3
#a 1 0 0
#b 0 2 0
#c 0 0 3
Or using model.matrix
model.matrix(~0 + cat, df) * df$y
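The row-wise multiplication works because the vector is recycled down the columns of the matrix, so when the matrix has as many rows as length(df$y), row i is scaled by y[i]. A quick sketch of what that should give with the example data:
mm <- model.matrix(~ 0 + cat, df)
mm * df$y          # row i of mm is multiplied by df$y[i]
#   cata catb catc
# 1    1    0    0
# 2    0    2    0
# 3    0    0    3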

Related

Extract columns from df by subset of column id characters

I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (for example, OHC_112 or IHC_123). I want to make a new data frame containing only the columns containing "OHC". How can I do this?
I am struggling to make a workable example data frame... but this is the best I was able to do.
Data frame "DF"
      OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1     1     1     0     1     1     0
Gene2     0     0     0     1     1     0
Gene3     1     1     1     0     0     1
Gene4     1     1     1     0     0     0
I got close by using the following subset command
newDF <- subset(DF, select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!
Just subset the columns with names that match using grepl?
> DF[, grepl("OHC", names(DF))]
  OHC_1 OHC_2 OHC_3 OHC_6
1     1     1     0     0
2     0     0     0     0
3     1     1     1     1
4     1     1     1     0
You can make a shorter call that is also more generalizable with negative-grep:
df.2 <- DF[, -grep("^OHC_[1-3]$", names(DF))]
Since grep returns numeric indices, you can use negative vector indexing to remove columns. You could add further numbers or more complex patterns.
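A couple of variations on the same idea, assuming the column names shown in the question: keep the OHC_ columns directly, or drop the IHC_ ones.
DF[, grep("^OHC_", names(DF))]     # positive indexing: keep columns starting with OHC_
DF[, -grep("^IHC_", names(DF))]    # negative indexing: drop the IHC_ columns instead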
We can use select with matches from tidyverse
library(tidyverse)
DF %>%
  select(matches("^OHC"))
#      OHC_1 OHC_2 OHC_3 OHC_6
#Gene1     1     1     0     0
#Gene2     0     0     0     0
#Gene3     1     1     1     1
#Gene4     1     1     1     0

For loop in R takes forever to run

    people_id  activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
       (fctr)       (fctr)   (int)    (int)        (dbl)       (int)   (int)            (dbl)              (dbl)
1     ppl_100 act2_1734928       0        1            0           0       1                0                 NA
2     ppl_100 act2_2434093       0        1            0           0       2                0                  0
3     ppl_100 act2_3404049       0        1            0           0       3                0                  0
4     ppl_100 act2_3651215       0        1            0           0       4                0                  0
5     ppl_100 act2_4109017       0        1            0           0       5                0                  0
6     ppl_100  act2_898576       0        1            0           0       6                0                  0
7  ppl_100002 act2_1233489       1        1            1           1       1                1                  1
8  ppl_100002 act2_1623405       1        1            1           2       2                1                  0
9  ppl_100003 act2_1111598       1        1            1           1       1                1                  0
10 ppl_100003 act2_1177453       1        1            1           2       2                1                  0
I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that I want it computed for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:
success_rate_trend <- vector(mode = "numeric", length = nrow(succ_rate_df) - 1)
for (i in 2:nrow(succ_rate_df)) {
  if (succ_rate_df[i, 1] != succ_rate_df[i - 1, 1]) {
    success_rate_trend[i] <- NA
  } else {
    success_rate_trend[i] <- succ_rate_df[i, 8] - succ_rate_df[i - 1, 8]
  }
}
It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?
Use vectorization:
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
Note:
people_id is a factor variable (fctr). To use diff() we must use as.integer() or unclass() to remove the factor class.
You do not have an ordinary data frame but a tbl_df from dplyr, so matrix-like indexing does not return a plain vector. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead of succ_rate_df[, 1].
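Since the data is already a dplyr tbl_df, a grouped mutate is another option; a minimal sketch, assuming the column names printed above:
library(dplyr)
succ_rate_df <- succ_rate_df %>%
  group_by(people_id) %>%
  mutate(success_rate_trend = cum_success_rate - lag(cum_success_rate)) %>%
  ungroup()
# lag() yields NA for the first activity of each person, matching the intent of the loop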
You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.
n <- nrow(succ_rate_df)
# TRUE where row i+1 belongs to the same person as row i (assumes a plain data frame)
same_person <- succ_rate_df[2:n, 1] == succ_rate_df[1:(n - 1), 1]
is_true <- which(same_person)
success_rate_trend <- rep(NA_real_, n - 1)
success_rate_trend[is_true] <- succ_rate_df[is_true + 1, 8] - succ_rate_df[is_true, 8]
The answer by Zheyuan Li is neater.
I'm going to offer an answer based on a data frame version of this data. You should learn to post the output of dput so that objects with special properties, like the tibble printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to grouping vector(s). I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are SO worked examples with two factors for grouping with ave.
success_rate_trend <- with(dat,
  ave(cum_success_rate, people_id, FUN = function(x) c(NA, diff(x))))
success_rate_trend
[1] NA 0 0 0 0 0 NA 0 NA 0
# not a very interesting result
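If you do need two grouping variables, ave accepts several grouping arguments before FUN; a sketch with a hypothetical second grouping column grp2:
success_rate_trend <- with(dat,
  ave(cum_success_rate, people_id, grp2,   # grp2 is an illustrative second grouping column
      FUN = function(x) c(NA, diff(x))))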

finding if boolean is ever true by groups in R

I want a simple way to create a new variable indicating whether a boolean is ever true within a group in an R data frame.
Here is an example:
Suppose the dataset has two relevant variables (among others that don't matter here): 'a', which defines a group, and 'b', a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', also a boolean, which is 1 for all entries in groups where 'b' is TRUE at least once, and 0 for all entries in groups where 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact I am doing this only in order to later remove all the observations for which 'c' is 0, so any other suggestions would be fine as well. The application relates to multinomial logit estimation, where alternatives that are never chosen need to be removed from the dataset before estimation.
If X is your data frame:
library(dplyr)
X <- X %>%
  group_by(a) %>%
  mutate(c = any(b == 1))
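Since the stated goal is to then drop the groups where b is never TRUE, the same grouped pipeline can filter directly instead of creating c first; a sketch (X_kept is just an illustrative name):
library(dplyr)
X_kept <- X %>%
  group_by(a) %>%
  filter(any(b == 1)) %>%   # keep only groups where b is TRUE at least once
  ungroup()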
A base R option would be
df1$c <- with(df1, ave(b, a, FUN=any))
Or
library(sqldf)
sqldf('select * from df1
       left join(select a, b,
                 (sum(b))>0 as c
                 from df1
                 group by a)
       using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Even though logical and numeric (0-1) columns behave identically for all intents and purposes, if you'd like a numeric result you can simply wrap the call to any with as.numeric.
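That would look like:
data[, c := as.numeric(any(b)), by = a]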
An answer with base R, assuming a and b are in data frame x.
The c value is a 1-to-1 mapping with a, so I create that mapping here:
cmap <- ifelse(sapply(split(x, x$a), function(x) sum(x[, "b"])) > 0, 1, 0)
Then just add the mapped values into the data frame:
x$c <- cmap[x$a]
Final output
> x
   a b c
1  1 1 1
2  2 0 0
3  1 0 1
4  1 0 1
5  1 1 1
6  2 0 0
7  2 0 0
8  3 0 1
9  3 1 1
10 3 0 1
edited to change call to split.

Two-way frequency table row and column labels

I want to style the output of table(). Suppose I have the following:
dat$a <- c(1,2,3,4,4,3,4,2,2,2)
dat$b <- c(1,2,3,4,1,2,4,3,2,2)
table(dat$a,dat$b)
     1   2   3   4
  1 50   0   0   0
  2  0 150  50   0
  3  0  50  50   0
  4 50   0   0 100
There are two problems with this. First, it doesn't give me the correct frequencies. Additionally, it has no row or column labels. I found this, and the table works for both frequency counts and axis labels. Is there an issue because this way subsets from a data frame? I would appreciate any tips on both fixing the frequency counts and adding style to the table.
The only problem is the way that you are inputting arguments to table. To get the desired output (with labels), use the data frame as argument, not 2 vectors (the columns). If you have a larger data frame, use only the subset that you want.
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
dat <- data.frame(a,b)
table(dat)
Gives me the output:
   b
a   1 2 3 4
  1 1 0 0 0
  2 0 3 1 0
  3 0 1 1 0
  4 1 0 0 2
It shouldn't give the wrong frequencies, even with your approach. You could try restarting your R session to check this.
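If you do want to keep passing the two vectors, table() can also label the dimensions itself, either through named arguments or the dnn argument:
table(a = dat$a, b = dat$b)
# or equivalently
table(dat$a, dat$b, dnn = c("a", "b"))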

R concatenating two factors

This is making me feel dumb, but I am trying to produce a single vector/df/list/etc (anything but a matrix) concatenating two factors. Here's the scenario. I have a 100k line dataset. I used the top half to predict the bottom half and vice versa using knn. So now I have 2 objects created by knn predict().
> head(pred11)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(pred12)
[1] 0 1 1 0 0 0
Levels: 0 1
> class(pred11)
[1] "factor"
> class(pred12)
[1] "factor"
Here's where my problem starts:
> pred13 <- rbind(pred11, pred12)
> class(pred13)
[1] "matrix"
There are 2 problems. First, it changes the 0's and 1's to 1's and 2's, and second, it seems to create a huge matrix that eats all my memory. I've tried messing with as.numeric(), data.frame(), etc., but can't get it to just combine the 2 50k factors into 1 100k one. Any suggestions?
@James presented one way, I'll chip in with another (shorter):
set.seed(42)
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
unlist(list(x1,x2))
# [1] 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1
#Levels: 0 1
...This might seem a bit like magic, but unlist has special support for factors for this particular purpose! All elements in the list must be factors for this to work.
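Applied to the objects in the question, that would be:
pred13 <- unlist(list(pred11, pred12))   # one factor of length 100k, levels 0 and 1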
rbind will create a 2 x 50000 matrix in your case, which isn't what you want. c is the correct function to combine 2 vectors into a single longer vector. When you use rbind or c on a factor, it will use the underlying integers that map to the levels (note that since R 4.1.0, c() does combine factors by their levels). In general you need to combine as character before refactoring:
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
factor(c(as.character(x1),as.character(x2)))
[1] 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0
Levels: 0 1
