people_id activity_id success totl_act success_rate cum_success cum_act cum_success_rate success_rate_trend
(fctr) (fctr) (int) (int) (dbl) (int) (int) (dbl) (dbl)
1 ppl_100 act2_1734928 0 1 0 0 1 0 NA
2 ppl_100 act2_2434093 0 1 0 0 2 0 0
3 ppl_100 act2_3404049 0 1 0 0 3 0 0
4 ppl_100 act2_3651215 0 1 0 0 4 0 0
5 ppl_100 act2_4109017 0 1 0 0 5 0 0
6 ppl_100 act2_898576 0 1 0 0 6 0 0
7 ppl_100002 act2_1233489 1 1 1 1 1 1 1
8 ppl_100002 act2_1623405 1 1 1 2 2 1 0
9 ppl_100003 act2_1111598 1 1 1 1 1 1 0
10 ppl_100003 act2_1177453 1 1 1 2 2 1 0
I have this sample data frame. I want to create a variable success_rate_trend from the cum_success_rate variable. The challenge is that I want to compute it for every activity_id except the first activity of each unique people_id, i.e. I want to capture the success trend within each people_id. I'm using the code below:
success_rate_trend<-vector(mode="numeric", length=nrow(succ_rate_df)-1)
for(i in 2:nrow(succ_rate_df)){
if(succ_rate_df[i,1]!=succ_rate_df[i-1,1]){
success_rate_trend[i] = NA
}
else {
success_rate_trend[i]<-succ_rate_df[i,8]-succ_rate_df[i-1,8]
}}
It takes forever to run. I have close to a million rows in the succ_rate_df data frame. Can anyone suggest how to simplify the code and reduce the run time?
Use vectorization:
success_rate_trend <- diff(succ_rate_df$cum_success_rate)
success_rate_trend[diff(as.integer(succ_rate_df$people_id)) != 0] <- NA_real_
Note:
people_id is a factor variable (fctr). To use diff() on it we must first call as.integer() or unclass() to get the underlying integer codes.
You do not have an ordinary data frame but a tbl_df from dplyr, so matrix-like indexing does not behave as you expect: succ_rate_df[, 1] returns a one-column tbl_df rather than a vector. Use succ_rate_df$people_id or succ_rate_df[["people_id"]] instead.
You should be able to do this calculation using a vectorised approach. This will be orders of magnitude quicker.
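n <- nrow(succ_rate_df)
id <- succ_rate_df$people_id
rate <- succ_rate_df$cum_success_rate
# TRUE where a row belongs to the same person as the row before it
same_person <- id[2:n] == id[1:(n - 1)]
is_true <- which(same_person)
# NA by default, so rows that start a new person stay NA
success_rate <- rep(NA_real_, n - 1)
success_rate[is_true] <- rate[is_true + 1] - rate[is_true]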
The answer by Zheyuan Li is neater.
I'm going to offer an answer based on a data frame version of this data. You should learn to post the output of dput, so that objects with special properties, like the tibble you have printed above, can be copied into other users' consoles without loss of attributes. I'm also going to name my data frame dat. The ave function is appropriate for calculating numeric vectors when you want them to be the same length as an input vector but want the calculations restricted to the groups defined by one or more grouping vectors. I only used one grouping factor, although your English-language description of the problem suggested you wanted two. There are worked examples on SO that use two grouping factors with ave.
success_rate_trend <- with( dat,
ave( cum_success_rate, people_id, FUN= function(x) c(NA, diff(x) ) ) )
success_rate_trend
[1] NA 0 0 0 0 0 NA 0 NA 0
# not a very interesting result
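To make the two vectorised ideas above easier to compare, here is a minimal sketch on a small made-up data frame (the values are hypothetical and chosen so the result is not all zeros). It assumes the rows are already sorted so that each people_id forms one contiguous block, and it pads the diff() result with a leading NA so the output has one entry per row.

```r
# Hypothetical toy data: two people, rows already grouped by people_id
dat <- data.frame(
  people_id = factor(c("ppl_100", "ppl_100", "ppl_100",
                       "ppl_100002", "ppl_100002")),
  cum_success_rate = c(0, 0.5, 0.25, 1, 1)
)

# Approach 1: one global diff(), then blank out the first row of each person
trend <- c(NA_real_, diff(dat$cum_success_rate))
is_first <- c(TRUE, diff(as.integer(dat$people_id)) != 0)
trend[is_first] <- NA_real_

# Approach 2: ave() applies the function within each people_id group
trend_ave <- ave(dat$cum_success_rate, dat$people_id,
                 FUN = function(x) c(NA, diff(x)))

trend
trend_ave
```

Both vectors should agree here. The ave() version does not rely on the rows being contiguous per person, while the diff() version does.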
I am new to R and still learning.
I have a dataset like this
county chemicalA chemicalB chemicalC chemicalD
A 1 0 1 0
B 0 0 0 0
C 1 0 0 0
D 0 1 1 0
I generate these binary variables by using code:
chemicalA=ifelse(Mean.Value_chemicalA>0,1,0)
chemicalA[is.na(chemicalA)]=0
Now I would like to sum up the 1s and see how many chemicals are detected in each place. My ideal result looks like this:
county chemicalA chemicalB chemicalC chemicalD detection
A 1 0 1 0 2
B 0 0 0 0 0
C 1 0 0 0 1
D 0 1 1 1 3
I have tried
data$detection=chemicalA+chemicalB+chemicalC+chemicalD
But the result is only 2 and 0 and I don't know why. At first, I thought the chemicalX might not be numeric data and I used class(). All the chemicalX variables return as numeric.
Can someone help me with this? Thanks!
We can use rowSums on the column names that startsWith the prefix 'chemical'
data$detection <- rowSums(data[startsWith(names(data), "chemical")])
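I think rowSums works best when the column names share the same prefix. If they do not, we can use apply on the relevant columns instead (here columns 2 to 5, assuming county is the first column and must be skipped):
data$detection=apply(data[,c(2:5)], 1, sum)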
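As a minimal sketch of the rowSums approach, assuming the data frame has the layout of the first table in the question (the values below are copied from it), the indicator columns can be picked by prefix and summed per row:

```r
# Data frame rebuilt from the first table in the question
data <- data.frame(
  county    = c("A", "B", "C", "D"),
  chemicalA = c(1, 0, 1, 0),
  chemicalB = c(0, 0, 0, 1),
  chemicalC = c(1, 0, 0, 1),
  chemicalD = c(0, 0, 0, 0)
)

# Logical mask of the columns whose names start with "chemical"
chem_cols <- startsWith(names(data), "chemical")

# Row-wise count of detected chemicals, computed from the columns
# stored in the data frame itself
data$detection <- rowSums(data[chem_cols])
data
```

Summing the columns stored in the data frame, rather than separate vectors in the workspace, keeps the result aligned with the rows of data.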
Suppose I have a matrix D that holds death counts per year for specific ages. I want to fill this matrix with the appropriate death counts, which are stored in the vector Age, but the following code gives me the wrong answer. How should I write the code without using a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
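You need to use table. Assignment with repeated indices writes each index only once, so repeated ages are not accumulated: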
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0
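The difference between the two behaviours can be sketched on a plain named vector instead of the data frame. The names counts, naive and tab are only illustrative. Turning Age into a factor with all ages as levels is an extra step not in the answer above; it makes table return a count (possibly zero) for every age, so the result can be added to existing counts.

```r
Ages <- 60:70
Age  <- c(60, 61, 62, 65, 65, 65, 68, 69, 60)

# One counter per age, all starting at zero
counts <- setNames(numeric(length(Ages)), Ages)

# Indexed assignment with duplicates: every duplicated name is set to 0 + 1
naive <- counts
naive[as.character(Age)] <- naive[as.character(Age)] + 1

# table() on a factor with fixed levels counts each age, including zeros
tab <- table(factor(Age, levels = Ages))
counts <- counts + as.vector(tab)

naive
counts
```

Adding the tabulated counts, rather than overwriting, also works when the target column already contains non-zero values.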
I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (for example OHC_112 or IHC_123). I want to make a new data frame containing only the columns whose names contain "OHC". How can I do this?
I am struggling to make a workable example data frame, but this is the best I was able to do.
Data frame "DF"
OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1 1 1 0 1 1 0
Gene2 0 0 0 1 1 0
Gene3 1 1 1 0 0 1
Gene4 1 1 1 0 0 0
I got close by using the following subset command
newDF <- subset(DF, ,select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!
Just subset the columns whose names match, using grepl:
> DF[, grepl("OHC",names(DF))]
OHC_1 OHC_2 OHC_3 OHC_6
1 1 1 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 0
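You can make a shorter call that is also more generalizable with a negative grep. For example, to drop the columns OHC_1 to OHC_3:
df.2 <- df[, -grep("^OHC_[1-3]$", names(df) )]
Since grep returns numeric indices, you can use negative vector indexing to remove the matching columns. You could add further digits or more complex patterns. Note that if nothing matches, grep returns an empty vector, and the negative index then selects no columns at all.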
We can use select with matches from tidyverse
library(tidyverse)
DF %>%
select(matches("^OHC"))
# OHC_1 OHC_2 OHC_3 OHC_6
#Gene1 1 1 0 0
#Gene2 0 0 0 0
#Gene3 1 1 1 1
#Gene4 1 1 1 0
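For comparison, here is a base R sketch of the same prefix selection on the example data from the question. The name keep is only illustrative, and drop = FALSE is an addition not in the answers above: it keeps the result a data frame even when only one column matches.

```r
# Example data frame rebuilt from the question
DF <- data.frame(
  OHC_1 = c(1, 0, 1, 1),
  OHC_2 = c(1, 0, 1, 1),
  OHC_3 = c(0, 0, 1, 1),
  IHC_4 = c(1, 1, 0, 0),
  IHC_5 = c(1, 1, 0, 0),
  OHC_6 = c(0, 0, 1, 0),
  row.names = paste0("Gene", 1:4)
)

# Logical mask: TRUE for column names that begin with "OHC_"
keep <- grepl("^OHC_", names(DF))

# drop = FALSE keeps a data frame even for a single matching column
newDF  <- DF[, keep, drop = FALSE]
single <- DF[, grepl("^IHC_4$", names(DF)), drop = FALSE]

names(newDF)
```

Anchoring the pattern with ^ avoids matching names that merely contain "OHC" somewhere in the middle.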
Reference:
Transpose and create categorical values in R
Follow-up to this question. While both model.matrix and data.table work very well with values already in it, how can we use them to simulate a column?
Meaning, from data in the same data frame,
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
If I were to simulate the case statement with OR condition from SQL in R, how do I go about it? In SQL I would do:
case when (sex = 'F' OR sex = 'M') AND CONTROL IS NOT NULL THEN 1 ELSE 0 END AS F_M_CONTROL
case when (sex = 'F' OR sex = 'M') AND COND1 IS NOT NULL THEN 1 ELSE 0 END AS F_M_COND1
bringing the output to:
subject weight control_F_M control_M control_F cond1_F_M cond1_F cond1_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
Any idea how I can generate the "Control_F_M" and Cond1_F_M columns in R?
Thanks in advance,
Bee
Edit:
To generate the aforementioned output, I'm using data.table and dcast as suggested before.
I could use if-else if I knew all the values in the column test. I apologize for not clarifying this earlier. The challenge, of course, is that the column is dynamic, so I'm hoping to generate that many columns dynamically, as an extension of the code below or using a similar approach.
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
This is making me feel dumb, but I am trying to produce a single vector/df/list/etc (anything but a matrix) concatenating two factors. Here's the scenario. I have a 100k line dataset. I used the top half to predict the bottom half and vice versa using knn. So now I have 2 objects created by knn predict().
> head(pred11)
[1] 0 0 0 0 0 0
Levels: 0 1
> head(pred12)
[1] 0 1 1 0 0 0
Levels: 0 1
> class(pred11)
[1] "factor"
> class(pred12)
[1] "factor"
Here's where my problem starts:
> pred13 <- rbind(pred11, pred12)
> class(pred13)
[1] "matrix"
There are 2 problems. First, it changes the 0's and 1's to 1's and 2's, and second, it seems to create a huge matrix that eats all my memory. I've tried messing with as.numeric(), data.frame(), etc., but can't get it to just combine the two 50k factors into one 100k factor. Any suggestions?
@James presented one way; I'll chip in with another (shorter):
set.seed(42)
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
unlist(list(x1,x2))
# [1] 1 1 0 1 1 1 1 0 1 1 0 1 1 0 0 1 1 0 0 1
#Levels: 0 1
...This might seem a bit like magic, but unlist has special support for factors for this particular purpose! All elements in the list must be factors for this to work.
rbind will create a 2 x 50000 matrix in your case, which isn't what you want. c is the correct function to combine 2 vectors into a single longer vector. When you use rbind on factors (or c in R versions before 4.1.0, which added a c() method for factors), it will use the underlying integer codes that map to the levels. In general you need to combine as character before converting back to a factor:
x1 <- factor(sample(0:1,10,replace=T))
x2 <- factor(sample(0:1,10,replace=T))
factor(c(as.character(x1),as.character(x2)))
[1] 1 1 1 0 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 0
Levels: 0 1
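The two approaches above can be checked side by side on small fixed factors, which avoids depending on the random seed. The names pred_a and pred_b are stand-ins for the two prediction vectors and are only illustrative.

```r
# Two small factors standing in for the two halves of the predictions
pred_a <- factor(c(0, 0, 1), levels = c(0, 1))
pred_b <- factor(c(1, 0, 1), levels = c(0, 1))

# rbind() works on the integer codes and returns a matrix, not a factor
codes <- rbind(pred_a, pred_b)

# unlist() on a list of factors returns one factor with the combined levels
combined_unlist <- unlist(list(pred_a, pred_b))

# Going through character avoids any dependence on the integer codes
combined_char <- factor(c(as.character(pred_a), as.character(pred_b)))

combined_unlist
combined_char
```

Both results are single factors of length 6 with levels 0 and 1, so either can be used where one long prediction vector is needed.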