I am wondering how to subset my data based on the appearance of triplicates in a column.
t <- c(1,1,2,2,3,3,4,4,5,5,5,6,6,7,7,7,8,8)
mydf <- data.frame(t, 1:18)
I want to be able to grab only the rows that correspond to a triplicate in column t, so that I can form a new dataframe of only those rows. That would look like this where p is the vector of rows I'm looking for:
p <- c(9,10,11,14,15,16)
myidealdf <- mydf[p,]
Sorry if this isn't clear, it's my first post
This should do it
counts <- table(t)  # number of occurrences of each value of t
keeps <- as.character(t) %in% names(counts)[counts == 3]  # TRUE where the value occurs exactly 3 times
mydf <- mydf[keeps, ]
Using the rle function (this counts runs, so it assumes equal values of t are adjacent, as they are here):
which(t %in% with(rle(t), values[lengths==3]))
[1] 9 10 11 14 15 16
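If equal values of t are not guaranteed to sit next to each other, one possible alternative (a sketch, not taken from the answers above) is to count group sizes with ave(), which does not depend on row order. The column name value is an assumption made for illustration.

```r
# Example data from the question; `value` is an illustrative column name
t_vals <- c(1,1,2,2,3,3,4,4,5,5,5,6,6,7,7,7,8,8)
mydf <- data.frame(t = t_vals, value = 1:18)

# ave() returns, for every row, the size of the group its t value belongs to
group_size <- ave(mydf$t, mydf$t, FUN = length)

# keep only the rows whose value of t occurs exactly three times
triplicates <- mydf[group_size == 3, ]
rownames(triplicates)
```

Because the count is computed per value rather than per run, this should give the same rows even if mydf is shuffled first.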
I am trying to create a series of dataframes which are subset from a larger dataframe by a date range (2-year blocks), in order to do a separate survival analysis for each new dataframe. I cannot use "split" to split the dataframe based on one factor, as the data will need to be present in more than one subset.
I have some example data as follows:
Patient <- c(1:10)
First.Appt <- c("2014-01-01","2014-03-02","2015-05-17","2015-06-03","2016-01-12","2016-11-07","2017-07-08","2017-09-09","2018-04-12","2018-05-13")
DOD <- c("2014-01-29","2014-03-30","2015-06-14","2015-07-01","2016-02-09","2016-12-05","2017-08-05","2017-10-07","2018-05-10","2018-06-10")
First.Appt.Year <- c(2014,2014,2015,2015,2016,2016,2017,2017,2018,2018)
library(dplyr)  # provides %>% and mutate_at
df <- as.data.frame(cbind(Patient, First.Appt, DOD, First.Appt.Year)) %>%
  mutate_at("First.Appt.Year", as.numeric)
I have created a start date (the minimum First.Appt.Year), the final start date (maximum First.Appt.Year - 1), and then a vector containing all my start dates from which to subset full 2-year blocks as follows:
Start.year <- as.numeric(min(df$First.Appt.Year))
Final.start.year <- max(df$First.Appt.Year) - 1
Start.vec <- c(Start.year:Final.start.year)
I thought to use a for loop with lapply to create a subset based on First.Appt.Year falling within the range of Start.vec and Start.vec + 1, for each value of Start.vec as follows:
for (i in 1:length(Start.vec)){
new.df = lapply(Start.vec, function(x)
subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1))
}
This almost works, but instead of creating four different dataframes (e.g. 2014-2015, 2015-2016, 2016-2017 and 2017-2018), all four of the dataframes in the output list only contain 2017-2018 values as below.
Patient  First.Appt  DOD         First.Appt.Year
7        08/07/2017  05/08/2017  2017
8        09/09/2017  07/10/2017  2017
9        12/04/2018  10/05/2018  2018
10       13/05/2018  10/06/2018  2018
Can anyone help me with what I am doing wrong and how to return the different subsets into each list object?
If there are other ways of doing this that seem more logical then please let me know.
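It looks like a simple misunderstanding about the use of lapply. You don't need to wrap it in a for loop: the function you pass to lapply ignores its argument x and uses Start.vec[i] instead, so every element of the list is the same subset, and each pass of the loop overwrites new.df until only the result for the last i (2017-2018) remains. Just replace your last block with: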
new.df = lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And that should work. At least, it does on my side.
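To see why the original loop returned four identical subsets, here is a stripped-down sketch of the same pattern. It returns only the year that would have been used for subsetting, and the name res is made up for illustration.

```r
Start.vec <- 2014:2017

for (i in seq_along(Start.vec)) {
  # the anonymous function never uses x; it only looks at the loop index i
  res <- lapply(Start.vec, function(x) Start.vec[i])
}

# res was overwritten on every pass, so only the result for the last i remains
unlist(res)
```

Every element of res holds the last start year, which matches the 2017-2018 subsets seen in the question.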
You are close! Instead of using both the for loop and the lapply, you need only one.
For example, with the lapply:
new.df <- lapply(Start.vec, function(x) subset(df, First.Appt.Year == x | First.Appt.Year == x + 1))
And using only the for loop:
df_list <- list()
for (i in 1:length(Start.vec)){
new.df <- subset(df, First.Appt.Year == Start.vec[i] | First.Appt.Year == Start.vec[i] + 1)
df_list <- c(df_list, list(new.df))
}
df_list
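Since each subset will feed a separate survival analysis, it may help to name the list elements by their year range. A minimal sketch, assuming a cut-down version of the example data (only Patient and First.Appt.Year) and using %in% in place of the two == comparisons; the name blocks is made up for illustration.

```r
# Cut-down example data: two patients per first-appointment year
df <- data.frame(Patient = 1:10,
                 First.Appt.Year = rep(2014:2018, each = 2))

Start.vec <- min(df$First.Appt.Year):(max(df$First.Appt.Year) - 1)

# one subset per start year, covering that year and the next
blocks <- lapply(Start.vec, function(x) df[df$First.Appt.Year %in% c(x, x + 1), ])

# label each subset, e.g. "2014-2015"
names(blocks) <- paste(Start.vec, Start.vec + 1, sep = "-")

names(blocks)
blocks[["2015-2016"]]
```

Patients from 2015 appear in both the 2014-2015 and 2015-2016 blocks, which is the overlap that split() cannot produce.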
Background
Here's a toy df:
df <- data.frame(ID = c("a","b","c","d","e","f"),
gender = c("f","f","m","f","m","m"),
zip = c(48601,NA,29910,54220,NA,44663),stringsAsFactors=FALSE)
As you can see, I've got a couple of NA values in the zip column.
Problem
I'm trying to randomly sample 2 entire rows from df -- but I want them to be rows for which zip is not null.
What I've tried
This code gets me a basic (i.e. non-conditional) random sample:
df2 <- df[sample(nrow(df), 2), ]
But of course, that only gets me halfway to my goal -- a bunch of the time it's going to return a row with an NA value in zip. This code attempts to add the condition:
df2 <- df[sample(nrow(df$zip != NA), 2), ]
I think I'm close, but this yields the error "invalid first argument".
Any ideas?
We can use is.na
tmp <- df[!is.na(df$zip),]
> tmp[sample(nrow(tmp), 2),]
We can use rownames + na.omit to sample the rows
> df[sample(rownames(na.omit(df["zip"])), 2),]
  ID gender   zip
3  c      m 29910
4  d      f 54220
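As for why the attempt in the question fails: comparing anything with NA gives NA rather than TRUE or FALSE, and nrow() of a plain vector is NULL, so sample() presumably receives NULL as its first argument. A short sketch with the zip values from the question (the variable names are made up for illustration):

```r
zip <- c(48601, NA, 29910, 54220, NA, 44663)

na_compare <- zip != NA     # every element is NA: a comparison with NA is never TRUE/FALSE
n_rows <- nrow(na_compare)  # NULL: a plain vector has no dim attribute

has_zip <- !is.na(zip)      # TRUE where zip is present
sum(has_zip)                # number of rows available for sampling
```

This is why is.na() is the usual way to test for missing values.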
Here is a base R solution with complete.cases()
# logical vector: TRUE for rows that have no NA in any column
x <- complete.cases(df)
# keep only the rows without NA
df_no_na <- df[x,]
# do the sample
df_no_na[sample(nrow(df_no_na), 2),]
Output:
  ID gender   zip
3  c      m 29910
6  f      m 44663
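Note that complete.cases(df) looks at every column, so a row with a valid zip but a missing value elsewhere would also be dropped. A hedged sketch of restricting the check to zip, using a small made-up data frame in which gender has an NA:

```r
# Made-up variant of the toy data: row 2 lacks gender, row 3 lacks zip
df <- data.frame(ID = c("a", "b", "c", "d"),
                 gender = c("f", NA, "m", "f"),
                 zip = c(48601, 29910, NA, 54220),
                 stringsAsFactors = FALSE)

all_cols <- complete.cases(df)         # drops rows 2 and 3
zip_only <- complete.cases(df["zip"])  # drops only row 3

df_zip <- df[zip_only, ]
df_zip[sample(nrow(df_zip), 2), ]
```

With the original toy df the two checks agree, because zip is the only column containing NA.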
For the tidyverse lovers out there...
library("dplyr")
df %>%
tidyr::drop_na() %>%
dplyr::slice_sample(n = 2)
If it is only NA in the zip column that you care about, then:
df %>%
tidyr::drop_na(zip) %>%
dplyr::slice_sample(n = 2)
The important thing here is to avoid creating an unnecessary second data frame with the NA values dropped. You could use the solution using na.omit given in another answer, but alternatively you can use which to return a vector of valid row indices to sample from. For example:
nsamp <- 2  # must not exceed the number of rows that satisfy the condition
df[sample(which(!is.na(df$zip)), nsamp), ]
The advantage to doing it this way is that the condition inside the which can be anything you like, whether or not it involves missing values. For example this version will sample from all the rows with female gender in zip codes starting with 336:
df[sample(which(df$gender=='f' & grepl('^336', df$zip)), nsamp), ]
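One caveat with sample(which(...), n), offered as a hedged aside: when which() returns a single index k, sample(k, 1) draws from 1:k instead of returning k. The helper below, named resample here for illustration (the same idea appears in the examples of ?sample), indexes through sample.int to avoid that.

```r
df <- data.frame(ID = c("a", "b", "c", "d", "e", "f"),
                 gender = c("f", "f", "m", "f", "m", "m"),
                 zip = c(48601, NA, 29910, 54220, NA, 44663),
                 stringsAsFactors = FALSE)

# Illustrative helper: always samples from the elements of x,
# even when x has length 1
resample <- function(x, ...) x[sample.int(length(x), ...)]

valid <- which(!is.na(df$zip))      # rows 1, 3, 4, 6
picked <- df[resample(valid, 2), ]  # two random rows with a non-missing zip

# With a single candidate row, that row itself is returned
only_row <- resample(6L, 1)
```

The helper still raises an error if more rows are requested than are available, which is usually the behaviour you want.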
Let's take an example data frame and remove a column whose name is held in a variable:
frame <- data.frame("a" = 1:5, "b" = 2:6, "c" = 3:7, "d" = 4:8)
rem <- readline()
frame <- subset(frame, select = -c(rem))
How do I get the variable column to be removed? This is not my real code, just wanted to present my problem in a simple code. Thanks!
Edit: I am so sorry, I am really sleepy and don't know what I typed into my code, I edited it now.
1) Do both at once. We assume that ix contains at least one column number.
ix <- 1:2
frame[-ix]
## c d
## 1 3 4
## 2 4 5
## 3 5 6
## 4 6 7
## 5 7 8
1a) or, if the case where ix has zero length (ix <- c()) is important, we can do this. The output of this and of all the remaining alternatives is the same as for (1), so we won't repeat it.
ix <- 1:2
frame[setdiff(seq_along(frame), ix)]
1b) or if we have names rather than column numbers. This works even if nms is a zero length vector in which case it returns the original data frame.
nms <- c("a", "b")
frame[setdiff(names(frame), nms)]
2) or, if you need to do it iteratively, remove the largest column number first: if it were done in ascending order, then after the first column is removed the second column is no longer the second but the first. If we knew that ix was already sorted we could omit the sort. We have used frame_out to hold the result so that the input is not destroyed. This works even if ix is the empty vector.
ix <- 1:2
frame_out <- frame
for(i in rev(sort(ix))) frame_out <- frame_out[-i]
frame_out
3) One way to do it independently of order is to do it by name. In this case it would be possible to remove them in ascending order. This works even if ix is the empty vector.
ix <- 1:2
nms <- names(frame)[ix]
frame_out <- frame
for(nm in nms) frame_out <- frame_out[-match(nm, names(frame_out))]
frame_out
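For the situation in the question, where the column name arrives as a string from readline(), approach (1b) can be applied directly. A minimal sketch, with the typed input hard-coded (an assumption, so that the example also runs non-interactively) and rem_many, frame_out2 as made-up names:

```r
frame <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8)

# Stand-in for the string returned by readline()
rem <- "b"
frame_out <- frame[setdiff(names(frame), rem)]
names(frame_out)

# Several names typed as one comma-separated string
rem_many <- trimws(strsplit("a, d", ",", fixed = TRUE)[[1]])
frame_out2 <- frame[setdiff(names(frame), rem_many)]
names(frame_out2)
```

A name that does not match any column is simply ignored by setdiff, so a typo leaves the data frame unchanged rather than raising an error.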
I am attempting to replace NA values in my data frame based on the logical return of one of the columns in the data frame.
#Creating random example data frame
a <- rbinom(1000,1,.5)
b <- rbinom(1000,1,.75)
c <- rbinom(1000,1,.25)
d <- rbinom(1000,1,.5)
e <- rbinom(1000,1,.5) # Will be the logical column
df <- cbind(a,b,c,d)
for(i in 1:1000){
if(sum(df[i,1:4]) >2){
df[i,1:4] <- NA
}
}
# randomly replacing some of the NA to represent the observation data
df[sample(1:length(df), 100, replace=F)] <- 1
df <- cbind(df, e)
I am attempting to fill in the NAs with 0 when e == 1 while still retaining the random 1s I placed in the other 4 columns (especially those where the rest of the values are NA).
I've tried creating loops like:
for(i in 1:nrow(df)){
if(df[,'e']==1){
df[i,is.na(df[i,1:4])] <- 0
}
}
however that clears both my logical column and my observation data.
The data frame that I want to apply this to is large (2.8 million rows X 23 col) containing metadata and observation data so something that takes speed into account would be great.
We can do this with data.table
library(data.table)
df1 <- as.data.frame(df)
setDT(df1)
for(j in 1:4){
set(df1, i = which(df1[['e']]==1 & is.na(df1[[j]])), j = j, value = 0)
}
This should be more efficient because we are using set: according to its help page (?set), calling set avoids the overhead of [.data.table.
As @thelatemail mentioned, a compact base R option would be
df[,1:4][df[,"e"]==1 & is.na(df[,1:4])] <- 0
If the matrix is very big, the logical matrix would be big as well and that could potentially create memory-related issues.
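A small check of the base R one-liner on a hand-made matrix may make the behaviour easier to see. The values and the names m and obs are chosen purely for illustration.

```r
# Three rows: e == 1 for the first two, e == 0 for the last
m <- cbind(a = c(NA, 1, NA),
           b = c(NA, NA, 0),
           e = c(1, 1, 0))
obs <- c("a", "b")  # the observation columns

# the length-3 condition on e is recycled down each column of the NA mask
m[, obs][m[, "e"] == 1 & is.na(m[, obs])] <- 0
m
```

NAs in rows where e is 1 become 0, the existing 1 in row 2 is kept, the NA in row 3 (where e is 0) stays, and column e itself is untouched.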
I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would stick with packages for this, here is a solution
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
process
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in 1:length(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the rows that contain the target symbol (the header rows), then split the data on that index. cumsum(ind1) coerces the logical vector to numeric and so gives every row the number of the most recent header; indexing the headers with it repeats each header for its rows, and [!ind1,] then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
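The same cumsum grouping can also be used to concatenate the sequence lines that belong to each header, which is what the question asks for. A minimal base R sketch, assuming every header line is followed by at least one sequence line; the shortened sequences and the names lines, is_header, group and fasta are made up for illustration.

```r
lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
           "MTESS",
           "QLLQT",
           ">PROKKA_00003 lipoprotein",
           "MRTII")

is_header <- startsWith(lines, ">")  # TRUE for the ">" lines
group <- cumsum(is_header)           # 1 1 1 2 2: which header each line belongs to

# paste together the non-header lines of each group
seqs <- tapply(lines[!is_header], group[!is_header], paste, collapse = "")

fasta <- data.frame(name = lines[is_header],
                    sequence = as.vector(seqs),
                    stringsAsFactors = FALSE)
fasta
```

If a header could appear without any sequence lines, the two columns would differ in length and this sketch would need an extra step to handle the empty group.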
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[-1], collapse='')) # [-1] drops the header row, even if no sequence lines follow it
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view t, I believe this is what you were looking for in your original post.