I have classic output from the BLAST tool, like the table below. To make the table easier to read, I have reduced the number of columns.
query  subject  startinsubject  endinsubject
1      SRR      50              100
1      SRR      500             450
What I need is to create another column, called "strand". When the query is forward, as in the first row, and therefore startinsubject is less than endinsubject, the new column should contain F.
Conversely, when the query is in reverse, as in the second row, where startinsubject is greater than endinsubject, the new "strand" column should contain R.
I would like to get a new table like the one below. Could anyone help me? Many thanks.
query  subject  startinsubject  endinsubject  strand
1      SRR      50              100           F
1      SRR      500             450           R
This is an ifelse option. You can use the following code:
df <- data.frame(query = c(1, 1),
                 subject = c("SRR", "SRR"),
                 startinsubject = c(50, 500),
                 endinsubject = c(100, 450))

library(dplyr)

df %>%
  mutate(strand = ifelse(startinsubject > endinsubject, "R", "F"))
Output:
  query subject startinsubject endinsubject strand
1     1     SRR             50          100      F
2     1     SRR            500          450      R
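The same logic also works in base R without dplyr; a minimal sketch, assuming the df above:

# Base R: assign the new column directly with vectorized ifelse
df$strand <- ifelse(df$startinsubject > df$endinsubject, "R", "F")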
We may either use ifelse/case_when or just convert the logical to a numeric index for replacement (a case_when version is sketched after the data below).
library(dplyr)
df1 <- df1 %>%
  mutate(strand = c("R", "F")[1 + (startinsubject < endinsubject)])
Output:
df1
  query subject startinsubject endinsubject strand
1     1     SRR             50          100      F
2     1     SRR            500          450      R
data
df1 <- structure(list(query = c(1L, 1L), subject = c("SRR", "SRR"),
                      startinsubject = c(50L, 500L),
                      endinsubject = c(100L, 450L)),
                 class = "data.frame", row.names = c(NA, -2L))
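A minimal sketch of the case_when route mentioned above, assuming the same df1 (rows where the two positions are equal would get NA):

library(dplyr)

df1 %>%
  mutate(strand = case_when(
    startinsubject > endinsubject ~ "R",
    startinsubject < endinsubject ~ "F"
  ))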
I have a dataframe with distance in one column and scores in another column, e.g.
Distance Scores
1000. 1
1500. 1
etc.
I have a piecewise function that says:
If Distance >= 1000, change Scores to 0
If Distance is between 300 and 1000, change Scores to 0.5 * (1000 - Distance)
If Distance is less than 300, change Scores to 0.5 * (1000 - 300)
I tried the following:
DF$Scores[DF$Distance>=1000] <- 0
DF$Scores[DF$Distance>300 & DF$Distance<1000] <-0.5(1000-DF$Distance)
DF$Scores[DF$Distance<=300]<- 0.5*(1000 -300 )
However, this is not working: the Scores that have been changed to zero are then later altered by the less-than-300 condition. Also, the replacement of the scoring values for distances between 300 and 1000 gives 'Error: attempt to apply non-function'.
I would suggest an approach like this:
#Data
df <- structure(list(Distance = c(1000, 1500), Scores = c(1L, 1L)),
                class = "data.frame", row.names = c(NA, -2L))
The code:
df$Scores <- ifelse(df$Distance >= 1000, 0,
                    ifelse(df$Distance > 300 & df$Distance < 1000, 0.5 * (1000 - df$Distance),
                           ifelse(df$Distance <= 300, 0.5 * (1000 - 300), NA)))
Output:
  Distance Scores
1     1000      0
2     1500      0
Just be careful: if you have many conditions, nesting ifelse calls can get complex.
Adding the other cases that might occur, and providing a dplyr solution, which is sometimes easier for humans to read...
df <- structure(list(Distance = c(1000, 1500, 500, 250),
Scores = c(1L, 1L, 2L, 3L)),
class = "data.frame",
row.names = c(NA, -4L))
library(dplyr)
df %>%
  mutate(Scores = case_when(
    Distance >= 1000 ~ 0,
    Distance < 1000 & Distance > 300 ~ 0.5 * (1000 - Distance),
    Distance <= 300 ~ 0.5 * (1000 - 300),
    TRUE ~ as.numeric(Scores)
  ))
#>   Distance Scores
#> 1     1000      0
#> 2     1500      0
#> 3      500    250
#> 4      250    350
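Since the three branches form a clamped linear ramp, the same piecewise function can also be written without any branching; a minimal sketch, assuming the same df:

# 0.5 * (1000 - Distance), clamped at 0 from below (Distance >= 1000)
# and at 0.5 * (1000 - 300) = 350 from above (Distance <= 300)
df$Scores <- 0.5 * pmin(pmax(1000 - df$Distance, 0), 1000 - 300)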
I am having some trouble cleaning up my data. It consists of a list of sold houses: the sale price, the number of rooms, the m2, and the address.
As seen below, the address is one single string.
head(DF, 4)
Address                        Price   m2   Rooms
Petersvej 1772900 Hoersholm    10.000  210  5
Annasvej 2B2900 Hoersholm      15.000  230  4
Krænsvej 125800 Lyngby C       10.000  210  5
A Mivs Alle 119800 Hjoerring    1.300   70  3
The syntax of the address column is: road name, road number, followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Using separate from tidyr, split the subfields of Address into 3 fields, merging anything left over into the last; then use separate again to split off the last 4 digits from the Number column generated by the first separate.
library(dplyr)
library(tidyr)
DF %>%
  separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
  separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
       Road StreetNo Postal      City Price  m2 Rooms      CITY
1 Petersvej       77   2900 Hoersholm    10 210     5 Hoersholm
2  Annasvej     121B   2900 Hoersholm    15 230     4 Hoersholm
3  Krænsvej       12   5800  Lyngby C    10 210     5         C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
  mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
  separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <- structure(list(Address = structure(c(3L, 1L, 2L),
                       .Label = c("Annasvej 121B2900 Hoersholm",
                                  "Krænsvej 125800 Lyngby C",
                                  "Petersvej 772900 Hoersholm"),
                       class = "factor"),
                     Price = c(10, 15, 10), m2 = c(210L, 230L, 210L),
                     Rooms = c(5L, 4L, 5L),
                     CITY = structure(c(2L, 2L, 1L),
                                      .Label = c("C", "Hoersholm"),
                                      class = "factor")),
                class = "data.frame", row.names = c(NA, -3L))
Update
Added and fixed (2).
Check out the cSplit function from the splitstackshape package.
library(splitstackshape)

# Split the Address column into 4 columns at each space
df_new <- cSplit(df, splitCols = "Address", sep = " ")

# Then combine the last two columns with an ifelse to rebuild two-word city names
df_new$City <- ifelse(is.na(df_new$Address_4),
                      as.character(df_new$Address_3),
                      paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with a regex.
In this instance you can use a simple regular expression that matches all alphabetical characters and spaces leading up to the end of the string, then trim off the whitespace.
library(stringr)
DF <- data.frame(Address = c("Petersvej 772900 Hoersholm",
                             "Annasvej 121B2900 Hoersholm",
                             "Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
                      Address      CITY
1  Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3    Krænsvej 125800 Lyngby C  Lyngby C
In R, the stringr package is handy for regex because it supports multiple-group capture, which in this example lets you separate every component of the address with a single expression.
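For example, a single str_match() call with four capture groups could pull out every component at once; a sketch (not the answer's own method), assuming the DF above. The lazy \\d+? keeps the street number from swallowing the 4-digit postal code:

library(stringr)

# Groups: road name, street number (optional letter), 4-digit postal code, city
m <- str_match(DF$Address, "^(.*?) (\\d+?[A-Za-z]?)(\\d{4}) (.+)$")
DF$Road     <- m[, 2]
DF$StreetNo <- m[, 3]
DF$Postal   <- m[, 4]
DF$City     <- m[, 5]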
I have a dataframe that has OrderDate and MajorCategory as the two variables. OrderDates range from 2005-01-01 to 2007-12-31, and MajorCategory runs from 1 to 73 with around 35.5 million entries. Each OrderDate references a specific order, which has an ID number and also is attributed to a specific MajorCategory. I am trying to create a dataframe to show each unique OrderDate and the count of each MajorCategory that was ordered on that date.
The dataset currently looks something like:
OrderDate MajorCategory
2005-12-12 66
2005-12-12 66
2006-03-28 43
2006-05-16 66
I have separated the unique OrderDate (after changing the class to Date) into its own dataframe by using:
OD <- as.data.frame(unique(DMEFLines3Dataset2$OrderDate))
OD <- as.data.frame(sort(OD$`unique(DMEFLines3Dataset2$OrderDate)`))
I'm not sure how to get the MajorCategory to show me a count for each date. So the desired output would be something like:
OD MC_1 MC_2
2005-01-01 4 6
2005-01-02 7 45
2005-01-03 3 23
where OD is the Order Date and MC_X is the MajorCategory's order count per date (MC_1 to MC_73).
I tried using for loops, frequency, and count, but I can't seem to figure it out.
I am not an R expert, and given the option I would aggregate the data as needed in a different language and then load the aggregated data into an R data frame for any further analysis.
I have done something close to what you are asking when calculating ROC graphs from the output of a third-party naive Bayes model, which consisted of appointment details grouped by department. Tweaking my code a bit, I was able to get a data frame with counts of an identifier grouped by date, which seems to be structured the way you are asking for.
library(RODBC)
dbConnection <- 'Driver={SQL Server};Server=SERVERNAME;Database=DBName;Trusted_Connection=yes'
channel <- odbcDriverConnect(dbConnection)
InputDataSet <- sqlQuery(channel, "
SELECT OrderID, OrderDate, MajorCategory from [dbo].[myDataSet];"
)
# Empty results frame to accumulate one row per date
results <- data.frame(date = character(0), ordCount = integer(0))

for (dt in unique(InputDataSet$OrderDate)) {
  filteredSet <- InputDataSet[InputDataSet$OrderDate == dt, ]
  # Count the distinct MajorCategory values seen on this date
  ordCount <- length(unique(filteredSet$MajorCategory))
  results <- rbind(data.frame(date = dt, ordCount = ordCount), results)
}
results
library(tidyverse)
df1 <- df %>%
  group_by(OrderDate, MajorCategory) %>%
  tally() %>%
  mutate(MajorCategory = paste("MC", MajorCategory, sep = "_")) %>%
  spread(MajorCategory, n)
df1
Output is:
   OrderDate MC_43 MC_66 MC_67
1 2005-12-12    NA     2     1
2 2006-03-28     1    NA    NA
3 2006-05-16    NA     1    NA
Sample data:
df <- structure(list(OrderDate = c("2005-12-12", "2005-12-12", "2005-12-12",
                                   "2006-03-28", "2006-05-16"),
                     MajorCategory = c(66L, 66L, 67L, 43L, 66L)),
                .Names = c("OrderDate", "MajorCategory"),
                class = "data.frame", row.names = c(NA, -5L))
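On tidyr 1.0+, spread() is superseded; the same table can be produced with count() and pivot_wider(). A sketch, assuming the same df:

library(dplyr)
library(tidyr)

df %>%
  count(OrderDate, MajorCategory) %>%                       # tally per date/category
  mutate(MajorCategory = paste0("MC_", MajorCategory)) %>%  # build the MC_x names
  pivot_wider(names_from = MajorCategory, values_from = n)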
OrderDate <- as.Date(c('2005-12-12', '2005-12-12', '2006-03-28', '2006-05-16',
                       '2005-03-04', '2005-12-12'))
MajorCategory <- as.numeric(c(66, 66, 43, 66, 43, 1))
OD <- data.frame(OrderDate, MajorCategory)

out <- split(OD, OD$MajorCategory)
count <- lapply(out, function(x) aggregate(x$MajorCategory, FUN = length,
                                           by = list(x$OrderDate)))
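If a single date-by-category table is acceptable instead of a per-category list, base R's table() gives the whole cross-tabulation in one call; a sketch on the same OD:

# Rows are dates, columns are MajorCategory values, cells are order counts
tab <- as.data.frame.matrix(table(OD$OrderDate, OD$MajorCategory))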
I'm a trying-to-be R user. I never learned to code properly and have been getting by on what I find online.
I have encountered a problem for which I need some of you experts' help.
I have two data files:
Particulate matter (PM) concentrations (~20,000 observations)
Coefficient combinations to use with the particulate matter concentrations to calculate final concentrations.
For example:
Data set 1.
ID PM
1 5
2 10
... ...
1500 25
Data set 2.
alpha beta
5 6
1 2
... ...
I ultimately have to use every coefficient combination (alpha and beta) with each of the IDs from data set 1. For example, if I have 10 observations in data set 1 and 10 coefficient combinations in data set 2, my output table should have 100 different values (10 * 10 = 100).
for (i in cmaq$FID) {
mean=cmaq$PM*IER$alpha*IER$beta
}
I used the above code to attempt this, but it only gave me 10 output values rather than 100. I think using the split function first, and somehow combining that with the second dataset, would work, but I have not figured out how.
It may be a very simple problem, but after spending hours on it I thought the better strategy was to get help from R experts.
Thank you in advance!
You can do:
df1 = data.frame(
ID = c(1, 2, 1500),
PM = c(5, 10, 25)
)
df2 = data.frame(
alpha = c(5, 6),
beta = c(1, 2)
)
library(tidyverse)

df1 %>%
  group_by(ID) %>%
  do(data.frame(result = .$PM * df2$alpha * df2$beta,
                alpha = df2$alpha,
                beta = df2$beta))
Look for the term 'cross join' or 'cartesian join' (e.g., How to do cross join in R?).
If that doesn't address the issue, please see https://stackoverflow.com/help/mcve. I think there is a mistake inside the loop: beta is free-floating and not connected to the IER data.frame.
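For illustration, a minimal base R cross join, assuming the df1 and df2 from the answer above:

# merge() with by = NULL pairs every row of df1 with every row of df2
combos <- merge(df1, df2, by = NULL)
combos$result <- combos$PM * combos$alpha * combos$beta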
We can do this with outer
data.frame(ID = rep(df1$ID, each = nrow(df2)), alpha = df2$alpha,
beta = df2$beta, result = c(t(outer(df1$PM, df2$alpha*df2$beta))))
#     ID alpha beta result
# 1    1     5    1     25
# 2    1     6    2     60
# 3    2     5    1     50
# 4    2     6    2    120
# 5 1500     5    1    125
# 6 1500     6    2    300
data
df1 <- structure(list(ID = c(1, 2, 1500), PM = c(5, 10, 25)),
                 .Names = c("ID", "PM"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(alpha = c(5, 6), beta = c(1, 2)),
                 .Names = c("alpha", "beta"), row.names = c(NA, -2L), class = "data.frame")
I've been trying to solve this with mapply, but I believe I will have to use several nested applies to make it work, and it has gotten really confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
#Dataframe1 df1
keyword  category
cat      A
dog      A
pig      A
crow     B
pigeon   B
hawk     B
catfish  C
carp     C
...

#Dataframe2 df2
description    A B C ....
false cat      1 0 0 ....
smiling pig    1 0 0 ....
shady pigeon   0 1 0 ....
dogged dog     2 0 0 ....
sad catfish    0 0 1 ....
hawkward carp  0 1 1 ....
....
I tried to use mapply to get this to work, but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get the category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? Thank you very much. I am new to R, so it would also help if someone could mention some rules of thumb for turning loops into apply functions. I cannot afford to use loops to solve this, as it would take far too much time.
There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
  data.frame(description = description,
             value = vapply(stringr::str_extract_all(description, df1$keyword),
                            length, numeric(1)),
             category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf <- do.call('rbind', outList) %>%
  group_by(description, category) %>%
  dplyr::summarise(value = sum(value))

## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category, value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether that is what you want.
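If only whole-word matches are wanted, one fix is to wrap each keyword in word boundaries before matching; a sketch reusing the objects above:

library(stringr)

# "\\bpig\\b" no longer matches inside "pigeon"
patterns <- paste0("\\b", df1$keyword, "\\b")
vapply(str_extract_all("shady pigeon", patterns), length, numeric(1))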
What you are looking for is a so-called document-term matrix (dtm for short), a concept from NLP (natural language processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it outperformed the other solutions here by a large margin), especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B
Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to make this problem fast and memory-light:
Use vectorized operations. These run a lot quicker than for loops. Note that lapply, mapply, and vapply are just shorthand for for loops. I parallelize (see next) over the keywords, so that the vectorization can be over the descriptions, which is the largest dimension.
Use parallelization. Making good use of multiple cores speeds up the process at the cost of increased memory (since every core needs its own copy of the data).
Example:
library(stringr)   # str_detect()
library(parallel)  # mclapply()

# Simulated data: 400 random keywords in 15 categories, 3 million descriptions
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)

# For one keyword, flag which descriptions contain it (vectorized over descriptions)
keyword_occurance <- function(word, list_of_descriptions) {
  str_detect(list_of_descriptions, word)
}

# For one category, sum the flags of all keywords that belong to it
category_occurance <- function(category, mat) {
  rowSums(mat[, keyword_categories == category])
}

# Parallelize over the keywords; each worker scans all descriptions
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)

list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
On my computer this takes 140 seconds and 14 GB of RAM to match 400 keywords in 15 categories against 3 million descriptions.