How to extract the number beside a specific string in a cell? - R

I would like to extract the number that sits beside a specific string within each cell. My data looks like this:
item stock
PRE 24GUSSETX4SX15G 200
PLS 12KLRX10SX15G 200
ADU 24SBX200ML 200
NIS 18BNDX40SX11G 200
REF 500GX12BTL 200
I want to extract the numbers located beside the strings 'GUSSET', 'KLR', 'SB', 'BND' and 'BTL', and then multiply them by the stock, for example like this:
item stock pcs total
PRE 24GUSSETX4SX15G 200 24 4800
PLS 12KLRX10SX15G 200 12 2400
ADU 24SBX200ML 200 24 4800
NIS 18BNDX40SX11G 200 18 3600
REF 500GX12BTL 200 12 2400
Does anyone know how to extract the numbers? Thanks very much in advance.

One way using base R is to use sub to extract the number beside those groups and multiply it by stock to get total.
df$pcs <- as.numeric(sub(".*?(\\d+)(GUSSET|KLR|SB|BND|BTL).*", "\\1", df$item))
df$total <- df$stock * df$pcs
df
# item stock pcs total
#PRE 24GUSSETX4SX15G 200 24 4800
#PLS 12KLRX10SX15G 200 12 2400
#ADU 24SBX200ML 200 24 4800
#NIS 18BNDX40SX11G 200 18 3600
#REF 500GX12BTL 200 12 2400
Or do everything in one pipe:
library(dplyr)
df %>%
  mutate(pcs = as.numeric(sub(".*?(\\d+)(GUSSET|KLR|SB|BND|BTL).*", "\\1", item)),
         total = stock * pcs)
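As an aside (a base R sketch that is not part of the original answer), the same number can be pulled out with regmatches()/regexpr() and a lookahead; this assumes every item contains one of those unit strings, since non-matching rows would be dropped:
df$pcs <- as.numeric(regmatches(df$item,
                                regexpr("\\d+(?=(GUSSET|KLR|SB|BND|BTL))", df$item, perl = TRUE)))
df$total <- df$stock * df$pcs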

We can do this in the tidyverse:
library(tidyverse)
df %>%
  mutate(pcs = as.numeric(str_extract(item, "(\\d+)(?=(GUSSET|KLR|SB|BND|BTL))")),
         total = pcs * stock)
# item stock pcs total
#1 PRE 24GUSSETX4SX15G 200 24 4800
#2 PLS 12KLRX10SX15G 200 12 2400
#3 ADU 24SBX200ML 200 24 4800
#4 NIS 18BNDX40SX11G 200 18 3600
#5 REF 500GX12BTL 200 12 2400
data
df <- structure(list(item = c("PRE 24GUSSETX4SX15G", "PLS 12KLRX10SX15G",
"ADU 24SBX200ML", "NIS 18BNDX40SX11G", "REF 500GX12BTL"), stock = c(200L,
200L, 200L, 200L, 200L)), class = "data.frame", row.names = c(NA,
-5L))

Related

Maximum based on custom order

I want to calculate the highest price based on a custom ordering of units, as follows (largest first):
Hundred Million
Million
THO
Hundred
Data is as below -
df <- read.table(text = "Price Unit
1445 Million
620 THO
830 Million
661 Million
783 Hundred
349 'Hundred Million'
", header= T)
If you wish to also calculate the "actual price", we can:
first create a data frame of "Unit" and "Value" (for example, price_unit in my answer);
then left_join this price_unit with your original data frame, which will match on the "Unit" column;
then do the calculation using mutate;
and finally sort by the new column.
library(tidyverse)
df <- read.table(text = "Price Unit
1445 Million
620 THO
830 Million
661 Million
783 Hundred
349 'Hundred Million'
", header= T)
price_unit <- tibble(Unit = c("THO", "Hundred", "Million", "Hundred Million"),
                     Value = c(10^3, 10^2, 10^6, 10^8))

left_join(df, price_unit, by = "Unit") %>%
  mutate(actual_price = Price * Value) %>%
  arrange(desc(actual_price))
Price Unit Value actual_price
1 349 Hundred Million 1e+08 3.490e+10
2 1445 Million 1e+06 1.445e+09
3 830 Million 1e+06 8.300e+08
4 661 Million 1e+06 6.610e+08
5 620 THO 1e+03 6.200e+05
6 783 Hundred 1e+02 7.830e+04
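As an aside (a base R sketch, not from the original answer), the same lookup can be done with a named vector instead of a join:
unit_value <- c(THO = 1e3, Hundred = 1e2, Million = 1e6, "Hundred Million" = 1e8)
df$actual_price <- df$Price * unit_value[as.character(df$Unit)]
df[order(-df$actual_price), ]  # largest actual price first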
First you can create a factor for your Unit variable by ordering the levels argument:
df$Unit <- factor(df$Unit,
                  levels = c("THO",
                             "Hundred",
                             "Million",
                             "Hundred Million"))
Then just arrange by Unit, which arranges them from the smallest unit to the largest:
df %>%
  arrange(Unit,
          Price)
Which gives you this output:
Price Unit
1 620 THO
2 783 Hundred
3 661 Million
4 830 Million
5 1445 Million
6 349 Hundred Million
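Since the question asks for the largest first, the same factor can also be sorted in descending order (a small addition using dplyr's desc()):
df %>%
  arrange(desc(Unit), desc(Price))  # Hundred Million first, THO last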

Divide a data-frame into x roughly equal groups -- sequentially

I want to divide a df into x roughly equal groups, sequentially.
I was basically doing it like this:
df_1 <- df[1:10,]
df_2 <- df[11:21,]
df_3..
Is there a simpler way to do this, using split or slice? The important thing is, I want to maintain the order of the df, not sample from it.
Imagine I had 7000 observations, and I wanted 19 roughly equal groups.
Best!
I don't know if it counts as roughly equal, but you can do this:
nobs <- 7000
ngroups <- 17
df <- data.frame(x = sample(nobs))
set.seed(1)
df$grp <- sort(sample(1:ngroups, nobs, replace = TRUE)) # added the sort so the order of your df is maintained
table(df$grp)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 436 407 410 369 417 411 440 401 431 411 356 398 390 414 443 418 448
Then split(df, df$grp).
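If the groups should be deterministic and as close to equal as possible, an alternative sketch (not part of the original answer) is to cut the row index into the desired number of intervals, which keeps the original row order; here using the 19 groups from the question:
ngroups <- 19
grp <- cut(seq_len(nrow(df)), breaks = ngroups, labels = FALSE)
table(grp)             # group sizes differ by at most one
out <- split(df, grp)  # a list of 19 data frames in the original order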

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. compare L_min with columns 2-4 and T_min with columns 5-7. This needs to be done for each city (row) and, if significant, return which column it differs from significantly.
It would help if someone could guide me on how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that are most different, based on columns 2:4, at the top. Columns 5:7 can be handled in an analogous way.
And some tips for statistical tests:
Prefer t.test (parametric, based on the mean) over the Wilcoxon / Mann-Whitney U test (non-parametric, based on the median), since it has more power; HOWEVER:
- Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test will work fine when you take 100 people from each city, so we have height measurements of 200 people, 100 vs 100.
- The distribution should be close to normal in all samples, or both samples should have a similar distribution that is far from normal (it may be binomial, for instance). Either way, we can't use this test when one sample has a normal distribution and the other doesn't.
- The sizes of both samples should be equal, so 100 vs 100 is fine, but 87 vs 234 is not ideal; the p-value may come out below 0.05 but can be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
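As a literal sketch of the per-row test the question asks for (not part of the original answer), you could run a one-sample t-test of the three level columns against the row minimum, and likewise for the trend columns; with only three observations per row the test has very little power, and it only gives an overall p-value rather than identifying the offending column:
df$L_pval <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
df$T_pval <- apply(df[, 5:7], 1, function(x) t.test(x, mu = min(x))$p.value)
head(df[, c("City", "L_min", "L_pval", "T_min", "T_pval")])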

In R, how do I extract minutes and seconds from characters

I'm trying to split a column from R into minutes and seconds. Problem is, the column is simply numeric: for example it will have "752","843","823", "956", (up to about 2000 being the highest) etc... which stands for 7 mins and 52 seconds, 8 minutes and 43 seconds, 8 minutes and 23 seconds, etc... I'd like to split it into two columns. One column for the number of minutes, one for the number of seconds. I'll then use those columns to create a third, "totalSeconds" which would have "472" for the 7 minutes and 52 seconds.
I've been searching all over, checking out regular expressions, etc. I just can't seem to figure it out. Another similar question on here pointed me towards the function 'substr', but I am stuck because the values do not always have the same number of characters, i.e. 752 vs 1145. Any help? Or at least maybe somebody could point me in the right direction?
If the last two characters represent seconds and the remaining leading one or two characters represent minutes, then try the following:
v <- c("752", "843", "823", "956")
res <- data.frame(v = v, minutes = substr(v, 1, nchar(v)-2), seconds = substr(v, nchar(v)-1, nchar(v)))
> res
v minutes seconds
1 752 7 52
2 843 8 43
3 823 8 23
4 956 9 56
In order to calculate the total amount of seconds:
res <- as.data.frame(apply(res, 2, function(x) as.double(as.character(x))))
res$tot.sec <- res$minutes*60 + res$seconds
> res
v minutes seconds tot.sec
1 752 7 52 472
2 843 8 43 523
3 823 8 23 503
4 956 9 56 596
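Since these values are really numbers of the form MMSS, the same split can also be done with plain integer arithmetic (a sketch, assuming the seconds part never exceeds 59):
num <- as.numeric(v)      # v is the character vector from above
minutes <- num %/% 100    # integer division drops the last two digits
seconds <- num %% 100     # remainder keeps the last two digits
minutes * 60 + seconds
# [1] 472 523 503 596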
You could consider separate from "tidyr". Here, I use it in conjunction with mutate from "dplyr" to get the output you are looking for.
Note that separate lets you specify either from the left of the string or from the right of the string depending on whether the sep value is positive or negative. This would allow us to deal with cases like "1000" appropriately.
library(dplyr)
library(tidyr)

df %>%
  separate(secs, into = c("min", "sec"), sep = -3) %>%
  mutate(tot = as.numeric(min)*60 + as.numeric(sec))
# min sec tot
# 1 7 52 472
# 2 8 43 523
# 3 8 23 503
# 4 9 56 596
# 5 10 00 600
Sample data:
df <- data.frame(secs = c("752","843","823", "956", "1000"))
In this example:
df <- data.frame(D = round(1000 * runif(100)))
D is the column that has your strings. If you do:
df$MIN <- ifelse(nchar(df$D) >= 3, substr(df$D, 1, nchar(df$D) - 2), 0) # 0 if there are only seconds
this would return the minutes
and
df$SEC <- substr(df$D, nchar(df$D) - 1, nchar(df$D))
would return the seconds.
Is this what you want?

adding as integers instead of list elements in R

I am getting:
> total = 0
> for (qty in a[5]){
+ total = total + as.numeric(unlist(qty))
+ print(total)
+ }
[1] 400 400 400 400 400 400 400 400 400 400
What I really want is:
> total = 0
> for (qty in a[5]){
+ total = total + as.numeric(unlist(qty))
+ print(total)
+ }
[1] 400 800 1200 1600 2000 2400 2800 3200 3600 4000
To refine this a little with a more specific scenario:
price buy_sell qty
100 B 100
100 B 200
90 S 300
100 S 400
I want to make a fourth column:
price buy_sell qty net
100 B 100 10000
100 B 200 30000
90 S 300 3000
100 S 400 -37000
Note that if a is a list, you want to use double brackets; otherwise you get back a list of length one, where the first element contains the values you are looking for.
Try:
total <- cumsum(a[[5]])
a <- list()
a[[5]] <- rep(400, 10)
cumsum(a[[5]])
# [1] 400 800 1200 1600 2000 2400 2800 3200 3600 4000
Compare:
a[5]
a[[5]]
a[5][[1]]
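For the refined scenario, the same cumsum idea works once buys and sells are given opposite signs (a sketch, assuming net is the running total of price * qty, positive for B and negative for S):
a <- data.frame(price = c(100, 100, 90, 100),
                buy_sell = c("B", "B", "S", "S"),
                qty = c(100, 200, 300, 400))
a$net <- cumsum(ifelse(a$buy_sell == "B", 1, -1) * a$price * a$qty)
a
#   price buy_sell qty    net
# 1   100        B 100  10000
# 2   100        B 200  30000
# 3    90        S 300   3000
# 4   100        S 400 -37000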
