How to remove rows from a dataframe if condition?

I am trying to achieve this with a while loop, but it's too complex for me; there must be a way using the dplyr library.
I have a warehouse with:
product_id amount
1 1001 1
2 4911 100
3 4014 32
I am writing a function that takes a product_id and an amount and removes the required amount from the warehouse. If the product_id does not exist, or the amount is higher than what is available, it should return an error.
So, if I ran the function:
remove_warehouse(1001,1)
Result should be:
product_id amount
1 4911 100
2 4014 32
And if I run either:
remove_warehouse(240,1)
or
remove_warehouse(4014,60)
I should get a generic error "not enough amount or product not present"

One way of writing the function could be
remove_warehouse <- function(df, product_id, amount) {
  # Logical index of the row holding this product
  id = df$product_id == product_id
  if (any(id))
    amount_base = df$amount[id]
  else
    stop("No id present")
  # Refuse to take out more than is in stock
  if (amount > amount_base)
    stop("No sufficient amount")
  else
    df$amount[id] = df$amount[id] - amount
  df
}
remove_warehouse(df, 4911, 90)
# product_id amount
#1 1001 1
#2 4911 10
#3 4014 32
remove_warehouse(df, 1234, 12)
#Error in remove_warehouse(df, 1234, 12) : No id present
remove_warehouse(df, 1001, 100)
#Error in remove_warehouse(df, 1001, 100) : No sufficient amount
This is assuming you will have only one product_id in your df.
data
df <- structure(list(product_id = c(1001L, 4911L, 4014L), amount = c(1L,
100L, 32L)), .Names = c("product_id", "amount"), class = "data.frame",
row.names = c("1", "2", "3"))
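If rows should actually disappear once their stock is used up, as in the expected output of the question, a variant along these lines may help. This is only a sketch: the name remove_warehouse_drop is made up for illustration, it assumes each product_id appears once, and it collapses both failures into the single generic message the question asks for.

```r
# Hypothetical variant: subtract the amount, then drop rows whose stock hits zero.
# Assumes product_id is unique in the data frame.
remove_warehouse_drop <- function(df, product_id, amount) {
  id <- df$product_id == product_id
  # Unknown product or not enough stock: one generic error
  if (!any(id) || amount > df$amount[id]) {
    stop("not enough amount or product not present")
  }
  df$amount[id] <- df$amount[id] - amount
  out <- df[df$amount > 0, , drop = FALSE]
  rownames(out) <- NULL  # renumber rows after dropping
  out
}

warehouse <- data.frame(product_id = c(1001L, 4911L, 4014L),
                        amount = c(1L, 100L, 32L))
remove_warehouse_drop(warehouse, 1001, 1)
```

With the sample warehouse, taking out the single unit of product 1001 leaves only the rows for 4911 and 4014.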

Related

Using R Base to sum a column of a dataframe for each value of a list

I have a dataframe named 2022_Rev that looks sort of like this:
Name Vendor Sales
Steve 6 80,000
Annie 4 95,000
Bill 6 45,000
Steve 3 25,000
Bill 2 40,000
Sam 5 5,000
... ... ...
I also have a list of each sales person:
Employees ['Steve', 'Annie', 'Bill', 'Sam', ...]
I want to apply mean() to column sales for each item in the list "Employee". I am supposed to use base R to create a loop that goes through each value in "Employees" and then creates a vector showing the mean for each employee. So far I have:
avgSales = rep(NA, 10)
for (i in length(Employees)){
if(Employees[i] == 2022_Rev$Name){
avgSales[i] = mean(2022_Rev$Sales[i])
}
}
This is erroring apparently because if can only check one value? I'm not sure how to fix it.

This is not normally the approach we would take in R (i.e. there are better ways to get the mean of a column by group). However, if you want an example of a for loop over the names of the Employees in your list, here is one base R approach. First preallocate a named vector as long as your Employees list, and then fill it using a for loop:
sales_means = setNames(vector("numeric", length = length(Employees)), Employees)
for(e in Employees) {
  # Mean of Sales over the rows belonging to employee e
  sales_means[e] = mean(`2022_Rev`[`2022_Rev`$Name == e, "Sales"], na.rm = TRUE)
}
Output:
Steve Annie Bill Sam
52500 95000 42500 5000
Input:
`2022_Rev` = structure(list(Name = c("Steve", "Annie", "Bill", "Steve", "Bill",
"Sam"), Vendor = c(6L, 4L, 6L, 3L, 2L, 5L), Sales = c(80000L,
95000L, 45000L, 25000L, 40000L, 5000L)), row.names = c(NA, -6L
), class = "data.frame")
Employees = list('Steve', 'Annie', 'Bill', 'Sam')
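For comparison, the loop-free grouping mentioned at the start of this answer could look roughly like the sketch below. The names rev_2022 and employees are stand-ins chosen here (a syntactic name instead of the backticked `2022_Rev`, and a plain character vector instead of a list); the data is the same as in the input above.

```r
# Same data as above, under a syntactic name chosen for this sketch
rev_2022 <- data.frame(
  Name  = c("Steve", "Annie", "Bill", "Steve", "Bill", "Sam"),
  Sales = c(80000L, 95000L, 45000L, 25000L, 40000L, 5000L)
)
employees <- c("Steve", "Annie", "Bill", "Sam")

# Split Sales by Name, take the mean of each piece,
# then reorder the result to follow the employees vector
sales_means <- vapply(split(rev_2022$Sales, rev_2022$Name), mean, numeric(1))[employees]
sales_means
```

This should give the same four means as the loop, in the order of employees.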

We could use the subset option in aggregate from base R
aggregate(Sales ~ Name, data = `2022_Rev`, subset = Name %in% Employees, mean)
Name Sales
1 Annie 95000
2 Bill 42500
3 Sam 5000
4 Steve 52500

We can use aggregate to calculate the mean of Sales by Name, then convert your Employees list to a data.frame and merge it with the aggregate result to keep only the names in the list:
aggregate(Sales ~ Name , `2022_Rev` , mean) |>
merge(do.call(rbind , Employees) |>
data.frame(Name = _) , by.y = "Name")
Output
Name Sales
1 Annie 95000
2 Bill 42500
3 Sam 5000
4 Steve 52500

Sort table rows by column values in R

I have a classic output of the BLAST tool that it is like the table below. To make the table easier to read, I reduced the number of columns.
query subject startinsubject endinsubject
1 SRR 50 100
1 SRR 500 450
What I need is to create another column, called "strand". When the query is forward, as in the first row (startinsubject is less than endinsubject), the new column should contain F.
When the query is reversed, as in the second row (startinsubject is higher than endinsubject), the new "strand" column should contain R.
I would like to get a new table like the one below. Could anyone help me? Many thanks.
query subject startinsubject endinsubject strand
1 SRR 50 100 F
1 SRR 500 450 R

This is an ifelse option. You can use the following code:
df <- data.frame(query = c(1,1),
subject = c("SRR", "SRR"),
startinsubject = c(50, 500),
endinsubject = c(100, 450))
library(dplyr)
df %>%
mutate(strand = ifelse(startinsubject > endinsubject, "R", "F"))
Output:
query subject startinsubject endinsubject strand
1 1 SRR 50 100 F
2 1 SRR 500 450 R

We may either use ifelse/case_when or just convert the logical to numeric index for replacement
library(dplyr)
df1 <- df1 %>%
mutate(strand = c("R", "F")[1 + (startinsubject < endinsubject)])
-output
df1
query subject startinsubject endinsubject strand
1 1 SRR 50 100 F
2 1 SRR 500 450 R
data
df1 <- structure(list(query = c(1L, 1L), subject = c("SRR", "SRR"),
startinsubject = c(50L, 500L), endinsubject = c(100L, 450L
)), class = "data.frame", row.names = c(NA, -2L))
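To see why the indexing trick works, it may help to spell out the intermediate steps. The sketch below uses plain vectors with the same two rows; the names start, end, is_forward and idx are only illustrative.

```r
start <- c(50, 500)
end   <- c(100, 450)

# TRUE where the hit is on the forward strand
is_forward <- start < end

# Adding 1 coerces the logical to numeric: FALSE -> 1, TRUE -> 2
idx <- 1 + is_forward

# Position 1 of the lookup vector is "R", position 2 is "F"
strand <- c("R", "F")[idx]
strand
```

Note that rows where start equals end would be labelled "R" by this comparison, the opposite of the ifelse version in the other answer, which labels them "F".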

Splitting column with differing syntax in R

I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
Head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.

1) Using separate from tidyr, split Address into three fields (Road, Number, City), merging anything left over into the last field. Then use separate again to split the last 4 characters (the postal code) off the Number column generated by the first separate.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))

Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))

One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
In R the stringr package is preferred for regex because it allows for multiple-group capture, which in this example could allow you to separate each component of the address with one expression.
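As a rough sketch of the single-expression idea, the example below captures all four components at once. It deliberately uses base R's regexec and regmatches instead of stringr so that it runs without extra packages; stringr::str_match returns a similar matrix of capture groups. The pattern and the column names are assumptions made for this example and are only checked against the three sample addresses.

```r
addresses <- c("Petersvej 772900 Hoersholm",
               "Annasvej 121B2900 Hoersholm",
               "Kr\u00e6nsvej 125800 Lyngby C")  # \u00e6 is the letter in Kraensvej's original spelling

# Groups: road name, street number, 4-digit postal code, city
pattern <- "^(.*) (\\S+)(\\d{4}) (.+)$"

# One character vector per address: full match followed by the four groups
matches <- regmatches(addresses, regexec(pattern, addresses, perl = TRUE))

# Bind into a matrix and drop the full-match column
# (assumes every address matched; a non-match would yield character(0))
parts <- as.data.frame(do.call(rbind, matches)[, -1, drop = FALSE],
                       stringsAsFactors = FALSE)
names(parts) <- c("Road", "StreetNo", "Postal", "City")
parts
```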

Checking if keyword in one table is within a string in another table using R

I've been trying to solve this issue with mapply, but I believe I will have to use several nested applies to make this work, and it has gotten real confusing.
The problem is as follows:
Dataframe one contains around 400 keywords. These fall into roughly 15 categories.
Dataframe two contains a string description field, and 15 additional columns, each named to correspond to the categories mentioned in dataframe one. This has millions of rows.
If a keyword from dataframe 1 exists in the string field in dataframe 2, the category in which the keyword exists should be flagged in dataframe 2.
What I want should look something like this:
> #Dataframe1 df1
>> keyword category
>> cat A
>> dog A
>> pig A
>> crow B
>> pigeon B
>> hawk B
>> catfish C
>> carp C
>> ...
>>
> #Dataframe2 df2
>> description A B C ....
>> false cat 1 0 0 ....
>> smiling pig 1 0 0 ....
>> shady pigeon 0 1 0 ....
>> dogged dog 2 0 0 ....
>> sad catfish 0 0 1 ....
>> hawkward carp 0 1 1 ....
>> ....
I tried to use mapply to get this to work but it fails, giving me the error "longer argument not a multiple of length of shorter". It also computes this only for the first string in df2. I haven't proceeded beyond this stage, i.e. attempting to get category flags.
> mapply(grepl, pattern = df1$keyword, x = df2$description)
Could anyone be of help? I thank you very much. I am new to R so it would also help if someone could mention some 'thumb rules' for turning loops into apply functions. I cannot afford to use loops to solve this as it would take way too much time.

There might be a more elegant way to do this but this is what I came up with:
## Your sample data:
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
## Load packages:
library(stringr)
library(dplyr)
library(tidyr)
## For each entry in df2$description count how many times each keyword
## is contained in it:
outList <- lapply(df2$description, function(description){
outDf <- data.frame(description = description,
value = vapply(stringr::str_extract_all(description, df1$keyword),
length, numeric(1)), category = df1$category)
})
## Combine to one long data frame and aggregate by category:
outLongDf<- do.call('rbind', outList) %>%
group_by(description, category) %>%
dplyr::summarise(value = sum(value))
## Reshape from long to wide format:
outWideDf <- tidyr::spread(data = outLongDf, key = category,
value = value)
outWideDf
# Source: local data frame [6 x 4]
# Groups: description [6]
#
# description A B C
# * <fctr> <dbl> <dbl> <dbl>
# 1 dogged dog 2 0 0
# 2 false cat 1 0 0
# 3 hawkward carp 0 1 1
# 4 sad catfish 1 0 1
# 5 shady pigeon 1 1 0
# 6 smiling pig 1 0 0
This approach, however, also catches the "pig" in "pigeon" and the "cat" in "catfish". I don't know whether that is what you want.
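If only whole words should count, one option is to wrap each keyword in word boundaries (\\b). The base R sketch below illustrates that idea on the sample data; the helper name count_whole_word is invented for this example. Be aware that this changes the counts compared with the question's expected table: "dogged dog" then counts 1 for category A, and "hawkward carp" no longer matches "hawk".

```r
df1 <- data.frame(
  keyword  = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
  category = c("A", "A", "A", "B", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)
descriptions <- c("false cat", "smiling pig", "shady pigeon",
                  "dogged dog", "sad catfish", "hawkward carp")

# Hypothetical helper: whole-word matches of one keyword in each text
count_whole_word <- function(keyword, text) {
  hits <- gregexpr(paste0("\\b", keyword, "\\b"), text, perl = TRUE)
  # gregexpr returns -1 when there is no match, so count positive positions
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Matrix with one row per description and one column per keyword
keyword_counts <- sapply(df1$keyword, count_whole_word, text = descriptions)

# Sum the keyword columns within each category
category_counts <- t(rowsum(t(keyword_counts), group = df1$category))
rownames(category_counts) <- descriptions
category_counts
```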

What you are looking for is a so-called document-term-matrix (or dtm in short), which stems from NLP (Natural Language Processing). There are many options available. I prefer text2vec. This package is blazingly fast (I wouldn't be surprised if it would outperform the other solutions here by a large magnitude) especially in combination with tokenizers.
In your case the code would look something like this:
# Create the data
df1 <- structure(list(keyword = c("cat", "dog", "pig", "crow", "pigeon", "hawk", "catfish", "carp"),
category = c("A", "A", "A", "B", "B", "B", "C", "C")),
.Names = c("keyword", "category"),
class = "data.frame", row.names = c(NA,-8L))
df2 <- structure(list(description = structure(c(2L, 6L, 5L, 1L, 4L,3L),
.Label = c("dogged dog", "false cat", "hawkward carp", "sad catfish", "shady pigeon", "smiling pig"), class = "factor")),
.Names = "description", row.names = c(NA, -6L), class = "data.frame")
# load the libraries
library(text2vec) # to create the dtm
library(tokenizers) # to help creating the dtm
library(reshape2) # to reshape the data from wide to long
# 1. create the vocabulary from the keywords
vocabulary <- vocab_vectorizer(create_vocabulary(itoken(df1$keyword)))
# 2. create the dtm
dtm <- create_dtm(itoken(as.character(df2$description)), vocabulary)
# 3. convert the sparse-matrix to a data.frame
dtm_df <- as.data.frame(as.matrix(dtm))
dtm_df$description <- df2$description
# 4. melt to long format
df_result <- melt(dtm_df, id.vars = "description", variable.name = "keyword")
df_result <- df_result[df_result$value == 1, ]
# 5. combine the data, i.e., add category
df_final <- merge(df_result, df1, by = "keyword")
# keyword description value category
# 1 carp hawkward carp 1 C
# 2 cat false cat 1 A
# 3 catfish sad catfish 1 C
# 4 dog dogged dog 1 A
# 5 pig smiling pig 1 A
# 6 pigeon shady pigeon 1 B

Whatever the implementation, counting the number of matches per category needs k x d comparisons, where k is the number of keywords and d the number of descriptions.
There are a few tricks to solve this problem quickly and without a lot of memory:
Use vectorized operations. These run a lot quicker than for loops. Note that lapply, mapply and vapply are just shorthand for for loops. I parallelize (see next) over the keywords so that the vectorization can be over the descriptions, which is the largest dimension.
Use parallelization. Making good use of multiple cores speeds up the process at the cost of an increase in memory (since every core needs its own copy).
Example:
library(stringr)   # str_detect
library(parallel)  # mclapply (relies on forking; mc.cores > 1 is not supported on Windows)
keywords <- stringi::stri_rand_strings(400, 2)
categories <- letters[1:15]
keyword_categories <- sample(categories, 400, TRUE)
descriptions <- stringi::stri_rand_strings(3e6, 20)
# Logical vector: does each description contain the keyword?
keyword_occurance <- function(word, list_of_descriptions) {
  str_detect(list_of_descriptions, word)
}
# Number of keyword hits per description within one category
category_occurance <- function(category, mat) {
  rowSums(mat[, keyword_categories == category, drop = FALSE])
}
list_keywords <- mclapply(keywords, keyword_occurance, descriptions, mc.cores = 8)
df_keywords <- do.call(cbind, list_keywords)
list_categories <- mclapply(categories, category_occurance, df_keywords, mc.cores = 8)
df_categories <- do.call(cbind, list_categories)
With my computer this takes 140 seconds and 14GB RAM to match 400 keywords in 15 categories to 3 million descriptions.

Add consecutive temp values above threshold to create "degree hours"

I am working with a dataset of hourly temperatures and I need to calculate "degree hours" above a heat threshold for each extreme event. I intend to run stats on the intensities (combined magnitude and duration) of each event to compare multiple sites over the same time period.
Example of data:
Temp
1 14.026
2 13.714
3 13.25
.....
21189 12.437
21190 12.558
21191 12.703
21192 12.896
Data after selecting only hours above the threshold of 18 degrees and then subtracting 18 to reveal degrees above 18:
Temp
5297 0.010
5468 0.010
5469 0.343
5470 0.081
5866 0.010
5868 0.319
5869 0.652
After this step I need help to sum consecutive hours during which the reading exceeded my specified threshold.
What I am hoping to produce out of above sample:
Temp
1 0.010
2 0.434
3 0.010
4 0.971
I've debated manipulating these data within a time series or by adding additional columns, but I do not want multiple rows for each warming event. I would immensely appreciate any advice.

This is an alternative solution in base R.
You have some data that walks around, and you want to sum up the points above a cutoff. For example:
set.seed(99999)
x <- cumsum(rnorm(30))
plot(x, type='b')
abline(h=2, lty='dashed')
This plots the random walk as connected points, with a dashed horizontal line at the cutoff of 2.
First, we want to split the data into groups based on when they cross the cutoff. We can use run length encoding on the indicator to get a compressed version:
x.rle <- rle(x > 2)
which has the value:
Run Length Encoding
lengths: int [1:8] 5 2 3 1 9 4 5 1
values : logi [1:8] FALSE TRUE FALSE TRUE FALSE TRUE ...
The first group is the first 5 points where x > 2 is FALSE; the second group is the two following points, and so on.
We can create a group id by replacing the values in the rle object, and then back transforming:
x.rle$values <- seq_along(x.rle$values)
group <- inverse.rle(x.rle)
Finally, we aggregate by group, keeping only the data above the cut off:
aggregate(x~group, subset = x > 2, FUN=sum)
Which produces:
group x
1 2 5.113291213
2 4 2.124118005
3 6 11.775435706
4 8 2.175868979
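Applied to the temperature question, the same steps could be sketched as follows. The hourly series here is invented for illustration: it only reuses the above-threshold readings implied by the question (18.010, then 18.010, 18.343, 18.081, and so on) with made-up cooler hours in between.

```r
# Invented hourly series; values above 18 mirror the question's sample
temp <- c(17.5, 18.010, 17.0, 18.010, 18.343, 18.081,
          16.0, 18.010, 17.9, 18.319, 18.652, 15.0)
threshold <- 18
above <- temp > threshold

# Give every run of consecutive TRUE/FALSE values its own id
runs <- rle(above)
runs$values <- seq_along(runs$values)
event_id <- inverse.rle(runs)

# Sum the excess over the threshold within each above-threshold run
degree_hours <- tapply(temp[above] - threshold, event_id[above], sum)
round(degree_hours, 3)
```

Each element is one warming event, so there is a single value per event rather than one row per hour.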

I'd use data.table for this, although there are certainly other ways.
library( data.table )
setDT( df )
temp.threshold <- 18
First make a column showing the previous value from each one in your data. This will help to find the point at which the temperature rose above your threshold value.
df[ , lag := shift( Temp, fill = 0, type = "lag" ) ]
Now use that previous value column to compare with the Temp column. Mark every point at which the temperature rose above the threshold with a 1, and all other points as 0.
df[ , group := 0L
][ Temp > temp.threshold & lag <= temp.threshold, group := 1L ]
Now we can get cumsum of that new column, which will give each sequence after the temperature rose above the threshold its own group ID.
df[ , group := cumsum( group ) ]
Now we can get rid of every value not above the threshold.
df <- df[ Temp > temp.threshold, ]
And summarise what's left by finding the "degree hours" of each "group".
bygroup <- df[ , sum( Temp - temp.threshold ), by = group ]
I modified your input data a little to provide a couple of test events where the data rose above threshold:
df <- structure(list(num = c(1L, 2L, 3L, 4L, 5L, 21189L, 21190L, 21191L,
21192L, 21193L, 21194L), Temp = c(14.026, 13.714, 13.25, 20,
19, 12.437, 12.558, 12.703, 12.896, 21, 21)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -11L), .Names = c("num",
"Temp"), spec = structure(list(cols = structure(list(num = structure(list(), class = c("collector_integer",
"collector")), Temp = structure(list(), class = c("collector_double",
"collector"))), .Names = c("num", "Temp")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
With that data, here's the output of the code above (note $V1 is in "degree hours"):
> bygroup
group V1
1: 1 3
2: 2 6
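For readers without data.table, the lag-and-cumsum idea can be sketched in base R on the same test temperatures. The variable names below (previous, starts, event_id) are chosen for this example only.

```r
temp <- c(14.026, 13.714, 13.25, 20, 19, 12.437, 12.558, 12.703, 12.896, 21, 21)
threshold <- 18

# Previous hour's reading, with 0 standing in before the first hour
previous <- c(0, head(temp, -1))

# TRUE at the hour where the temperature first rises above the threshold
starts <- temp > threshold & previous <= threshold

# Running count of starts gives each event its own id
event_id <- cumsum(starts)

# Keep only hours above the threshold and sum the excess per event
above <- temp > threshold
degree_hours <- tapply(temp[above] - threshold, event_id[above], sum)
degree_hours
```

On this data the two events come out as 3 and 6 degree hours, matching the bygroup table above.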
