I am working with a data frame 'df' that has millions of rows and four columns (Chromosome, Position, Allele1, Allele2). I want to concatenate the characters in these columns into a separate vector 'cc'. This is my first try:
myfunc = function(CHR) {
  chr = subset(df, df$Chromosome == CHR)
  cc = data.frame(No = seq.int(nrow(chr)), pos_al1_al2 = NA)
  for (i in 1:nrow(chr)) {
    cc$pos_al1_al2[i] = paste(CHR, chr$Position[i], ".", chr$Allele1[i], chr$Allele2[i])
    cc = cc[, -1] # remove the column 'No'
  }
}
# Run my code
myfunc(7)
where CHR is the chromosome number of interest that I pass to the function (e.g., 1, 2, 3, ..., 22). Of course, CHR must be in the range 1 to 22, matching the values in the Chromosome column of 'df'.
My idea is this: I first created a data frame called cc with the same number of rows as the chromosome subset of 'df'.
Then I created a new column in cc called pos_al1_al2, each row of which holds the concatenated characters, as you can see in the function.
The computation time is very slow. I guess it comes from the for loop, but I have no idea how to optimize my function.
Any help is appreciated! Thanks in advance.
Is there any reason why you can't use paste() in vectorized mode:
myfunc <- function(CHR) {
  chr <- subset(df, df$Chromosome == CHR)
  cc <- data.frame(No = seq.int(nrow(chr)), pos_al1_al2 = NA)
  cc$pos_al1_al2 <- paste(CHR, chr$Position, ".", chr$Allele1, chr$Allele2)
  cc <- cc[, -1] # remove the column 'No'
}
My data is imported into R as a list of 60 tibbles, each with 13 columns and 8 rows. I want to detect outliers, defined as values more than 2*sd from the mean, by comparing each value in column "2" to the mean of all the values of column "2" that sit in the same row across the tibbles.
I know that I am on the wrong path with these lines, as I am not comparing the single values:
lapply(list, function(x) {
  if (x$"2" > (mean(x$"2")) + (2*sd(x$"2")) || x$"2" < (mean(x$"2")) - (2*sd(x$"2"))) {}
})
I was also hoping to replace every value identified as an outlier with the corresponding mean calculated from the 60 values in the same position, while keeping everything else unchanged, but I am quite unsure how to do that either.
Thank you!
You haven't included example data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. That lets you do everything you want in a single dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)

tibble1 <- tibble(colA = c(seq(1, 20, 1), 150),
                  colB = seq(0.1, 2.1, 0.1),
                  id = 1:21)

tibble2 <- tibble(colA = c(seq(101, 120, 1), -150),
                  colB = seq(21, 41, 1),
                  id = 1:21)

# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one;
# the 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')

res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
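The question also asks about replacing the flagged values with the corresponding mean; a minimal sketch continuing the same pipe (assuming you want to overwrite colA with meanA wherever outlier is 1, then drop the helper columns):
res_clean <- res %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%
  ungroup() %>%
  select(-meanA, -sdA, -lowThresh, -uppThresh, -outlier)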
I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5))
And I would like to create a cumulative sum of "Val". I know how to do the simple cumulative sum
df <- df %>% group_by(id) %>% mutate(cumval=cumsum(Val))
However, I would like my final data to look like this
final <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1971,1972,1973,1974,1977,1978,1990),
"Group" = c(1,NA,1,NA,NA,2,2,NA),
"Val" = c(2,3,3,5,2,5,3,5),
"cumval" = c(2,5,6,11,2,7,5,10))
The basic idea is that when two "Val"s belong to the same "Group", the one happening later (by Year) replaces the previous one.
For instance, in the sample dataset, observation 3 has a "cumval" of 6 rather than 8 because the "Val" at time 1972 replaced the "Val" at time 1970. Similarly for Beta.
I thank you in advance for your help
In my head, this requires a for loop. First we split the data frame by the id column into a list of two. Then we create two empty lists. In the og list, we will store the row where the first unique non-NA group identifier occurs. For Alpha this is the first row and for Beta this is the second row. We will use this to subtract from the cumulative sum when a value gets substituted.
library(dplyr) # for %>% and bind_rows() used below

mylist <- split(df, f = df$id)
og <- list()
vals <- list()
df_num <- 1
We use a nested loop: the outer loop iterates over each object (here a data frame) in the list, and the inner loop iterates over each value in the Group column.
We need to keep track of the row numbers, which we do with the r variable. We initially set it to 0 before the inner loop and add 1 at the start of each iteration. First we check whether we are in the first row of the data frame, in which case the cumulative sum is simply the value in the first row of the Val column. Within that if test, we use another if test to check whether the Group id is NA. If it isn't, this is the first occurrence of the number that will signal a substitution of the current value if it appears again, so we save that number to the temporary variable temp. We also extract the row that contains the value and save it to the og list.
After this, it goes to the next iteration. We check whether the current Group value is NA. If it is, we just add the value to the cumulative sum. If it isn't NA, we check whether it equals the value stored in temp. If so, we need to substitute: we extract the original value stored in the og list and save it as old, subtract the old value from the cumulative sum, and add the current value. We also replace the original value in og with the current replacement value, because if the value has to be replaced yet again, we will need to subtract the current value and not the original one.
If j is not NA but is not equal to temp, then this is a new instance of Group. So we save the row with the original value to the og list, and save the Group. The sum continues as normal, as this is not an instance of replacing a value. Note that the variable x used to count the elements in the og list is only incremented when a new occurrence is added to the list; thus og[[x-1]] will always hold the value to subtract when a substitution happens.
for (my_df in mylist) {
  x <- 1
  r <- 0
  for (j in my_df$Group) {
    r <- r + 1
    if (r == 1) {
      # first row: the cumulative sum starts at the first Val
      vals[[1]] <- my_df$Val[1]
      if (is.na(j) == FALSE) {
        og[[x]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
        temp <- j
        x <- x + 1
      }
      next
    }
    if (is.na(j) == TRUE) {
      # no group: just accumulate
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
    } else if (is.na(j) == FALSE & j == temp) {
      # same group seen before: substitute the stored value
      old <- og[[x-1]]
      old <- old[, 2]
      vals[[r]] <- vals[[r-1]] - old + my_df$Val[r]
      og[[x-1]] <- my_df[r, c('Group', 'Val'), drop = FALSE]
    } else {
      # first occurrence of a new group
      vals[[r]] <- vals[[r-1]] + my_df$Val[r]
      og[[x]] <- my_df[r, c('Group', 'Val')]
      temp <- j
      x <- x + 1
    }
  }
  cumval <- unlist(vals) %>% as.data.frame()
  colnames(cumval) <- 'cumval'
  my_df <- cbind(my_df, cumval)
  mylist[[df_num]] <- my_df
  df_num <- df_num + 1
}
Lastly, we combine the two data frames in the list by binding their rows with bind_rows from the dplyr package. Then I check whether the final data frame is identical to your desired output with identical(), and it evaluates to TRUE.
final_df <- bind_rows(mylist)
identical(final_df, final)
[1] TRUE
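For comparison, here is a more compact vectorized sketch of the same bookkeeping (not the approach above, just an alternative using dplyr): within each id and non-NA Group, each new Val contributes Val minus the previous Val of that group, so a plain cumsum over those contributions reproduces the substitution behaviour on this example.
library(dplyr)

final_alt <- df %>%
  group_by(id, Group) %>%
  # for non-NA groups, subtract the previous Val of the same group (0 if none)
  mutate(prev = ifelse(is.na(Group), 0, lag(Val, default = 0))) %>%
  group_by(id) %>%
  mutate(cumval = cumsum(Val - prev)) %>%
  ungroup() %>%
  select(-prev)

final_alt$cumval
# 2 5 6 11 2 7 5 10, matching the desired output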
I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values; the first column contains string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns hold either a header such as a percentage-change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract the data for each category. So, for instance, I'd like to extract all the rows between my "credit" keyword and my "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
e.g.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1,2,3,4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1,2,3,4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
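If duplicates do turn out to be a possibility, one small guard (a sketch only, keeping the same approach) is to take just the first occurrence of each keyword:
start <- which(df$column1 == "credit")[1]
end <- which(df$column1 == "FX")[1]
df[start:(end - 1), ]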
Using what Brian suggested:
in_between <- function(df, start, end){
  return(df[start:(end - 1), ])
}
Then loop over the indices in
dividers <- which(df$column1 %in% keywords)
where keywords is your vector of section keywords, and save the function outputs however you like:
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. The data is messy, so I still have the annoying case where I need to keep the offset rows.
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)

set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)
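The result is a named list of data frames, one per block. As a usage sketch (assuming at least one of the keywords occurs in x), you can pull out a single block by its group number, or drop the helper columns afterwards:
groups <- split(df2, df2$group)
groups[["1"]] # rows from the first keyword up to (but not including) the next
lapply(groups, function(g) g[, c("x", "y", "z")]) # drop the helper columns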
I have a data frame like the following:
sampleid <- c("patient_sdlkfjd_2354_CSF_CD19+", "control_sdlkfjd_2632_CSF_CD8+", "control_sdlkfjd_2632_CSF")
values = rnorm(3, 8, 3)
df <- data.frame(sampleid, values)
I also have a vector like the following:
matches <- c("632_CSF_CD8+", "632_CSF")
I want to extract the rows of this data frame whose sampleid ends with one of the matches. From this example, you can see why the end of the string is important, as I have two samples that contain "632_CSF", but they are distinct samples. If I change matches to only:
matches <- c("632_CSF")
then I want only the third row of the data frame to be returned, because it is the only one where the match occurs at the end of the sampleid.
How can this be achieved?
Thanks!
Just use $ in your pattern to indicate that it occurs at the end of the string.
grep("632_CSF$", sampleid, value=TRUE)
[1] "control_sdlkfjd_2632_CSF"
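To pull the matching rows out of the data frame rather than just the values, a small sketch using the df from the question (note that + is a regex metacharacter, so it needs escaping if a pattern ends with it):
df[grepl("632_CSF$", df$sampleid), ]
# a pattern such as "632_CSF_CD8+" needs the "+" escaped:
df[grepl("632_CSF_CD8\\+$", df$sampleid), ]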
You can do this with stringr and some manipulation.
You need to escape the regex metacharacters in each match, which is what the quotemeta function does.
The next step is to append $ to ensure the match occurs at the end of the string, and then concatenate all the patterns into one with the regex OR operator, |.
The result can then be used with str_detect to get boolean indices.
library(stringr)
# taken from here
# https://stackoverflow.com/a/14838753/1030110
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
matches_with_end <- sapply(matches, function(x) { paste0(quotemeta(x), '$') })
joined_matches <- paste(matches_with_end, collapse = '|')
ind <- str_detect(df$sampleid, joined_matches)
# [1] FALSE TRUE TRUE
df[ind, ]
# sampleid values
# 2 control_sdlkfjd_2632_CSF_CD8+ 10.712634
# 3 control_sdlkfjd_2632_CSF 7.001628
I suggest making your dataset more regular.
library(tidyverse)

df_regular <- df %>%
  separate(
    sampleid,
    into = c("patient_type",
             "test_number",
             "patient_group",
             "patient_id"),
    extra = "merge") %>%
  mutate(patient_id = str_pad(patient_id, 9, side = c("left"), pad = "0"))

df_regular

df_regular %>%
  filter(patient_group %in% "2632" & patient_id %in% "000000CSF")
I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with the lines starting with ">" in the first column and the corresponding lines of letters concatenated into one sequence in the second column. This is what I've tried so far:
y <- matrix(0, 5836, 2) # empty matrix with 5836 rows and two columns
z <- 0
for (i in 1:nrow(df)) {
  if ((grepl(pattern = "^>", x = df)) == TRUE) { # tried to set the conditional "if a line starts with '>', execute code"
    z <- z + 1
    y[z, 1] <- paste(df[i])
  } else {
    y[z, 2] <- paste(df[i], collapse = "")
  }
}
I would eventually convert the matrix y back to a data frame using as.data.frame, but my loop keeps giving Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would normally stick with packages, here is a base R solution.
Initialize the data:
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
Process:
ind <- grep(">", mydf$x)
temp <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], nrow(mydf)))
seqs <- rep(NA, length(ind))
for (i in 1:length(ind)) {
  seqs[i] <- paste(mydf$x[temp$from[i]:temp$to[i]], collapse = "")
}
fastatable <- data.frame(name = gsub(">", "", mydf[ind, 1]), sequence = seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the rows that contain the ">" header symbol, then split the data on that index. Indexing with cumsum(ind1) first creates a row id by coercing the logical vector to numeric, so every row is labelled with its header, and subsetting with [!ind1, ] then eliminates the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
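As a small follow-up sketch (not part of the answer above): if you want one row per header with its sequence lines pasted together, as in the original question, you can aggregate newdf by Name with base R:
seqs <- aggregate(Value ~ Name, data = newdf, FUN = paste, collapse = "")
seqs$Name <- gsub(">", "", seqs$Name) # drop the leading ">" if you don't want it
seqs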
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)

df <- data.frame(v1 = c(">PROKKA_00002 Alpha-ketoglutarate permease",
                        "MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
                        "QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
                        ">PROKKA_00003 lipoprotein",
                        "MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))

df$hasMark <- ifelse(grepl(">", df$v1, fixed = TRUE), 1, 0)
df$section <- cumsum(df$hasMark)

t <- ddply(df, "section", function(x){
  data.frame(v2 = head(x, 1), v3 = paste(x$v1[2:nrow(x)], collapse = ''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view 't', I believe this is what you were looking for in your original post.