R - Split String with conditions

I have a string-splitting problem. I have a huge number of files whose names are structured like this:
filenames = c("NO2_Place1_123_456789.dat", "NO2_Nice_Place1_123_456789.dat", "NO2_Nice_Place2_123_456789.dat", "NO2_Place2_123_456789.dat")
I need to extract the station names, e.g. Place1, Nice_Place1 and so on. Each is either "Place" and a number or "Nice_Place" and a number.
I tried this to get the station names for "Place" and a number and it works great, but it doesn't give me the correct name in the case of "Nice_Place", because it treats it as two words.
Station = strsplit(filenames[1], "_")[[1]][2] #Works
Station = strsplit(filenames[2], "_")[[1]][2] #Doesnt work
My idea now is to use if...else: if the station name in the example above is "Nice", append the third part of the split string with an underscore. Unfortunately I am totally new to if...else conditions.
Can somebody please help?
EDIT:
Expected output:
Station = strsplit(filenames[1], "_")[[1]][2] #Station = Place1
Station = strsplit(filenames[2], "_")[[1]][2] #Station = Nice -- not correct, I want to have "Nice_Place"
So when I get
Station = strsplit(filenames[2], "_")[[1]][2] #Station = Nice
I want to add a condition: if Station is "Nice", it should append strsplit(filenames[2], "_")[[1]][3] with an underscore!
EDIT2:
I have now found a way to get what I want:
filenames = c("NO2_Place1_123_456789.dat", "NO2_Nice_Place1_123_456789.dat", "NO2_Nice_Place2_123_456789.dat", "NO2_Place2_123_456789.dat")
Station = strsplit(filenames[2], "_")[[1]][2]
if (Station == "Nice"){
Station = paste(Station, strsplit(filenames[2], "_")[[1]][3], sep = "_")
}
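The if...else idea above handles one file name at a time. A minimal sketch of how it could be wrapped in a function and applied to every file name follows; the helper name get_station is hypothetical, and it assumes the only two-part station prefix is "Nice", as described in the question.

```r
filenames <- c("NO2_Place1_123_456789.dat", "NO2_Nice_Place1_123_456789.dat",
               "NO2_Nice_Place2_123_456789.dat", "NO2_Place2_123_456789.dat")

# Hypothetical helper: returns the station part of one file name.
get_station <- function(filename) {
  parts <- strsplit(filename, "_", fixed = TRUE)[[1]]
  if (parts[2] == "Nice") {
    # Two-word station: glue the second and third pieces back together
    paste(parts[2], parts[3], sep = "_")
  } else {
    parts[2]
  }
}

# Apply the helper to all file names; character(1) enforces one string per file
stations <- vapply(filenames, get_station, character(1), USE.NAMES = FALSE)
```

vapply is used instead of a for loop so the result is a plain character vector of the same length as filenames.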

We can use sub
sub("^[^_]+_(.*Place\\d*).*", "\\1", filenames[2])
#[1] "Nice_Place1"
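Because sub is vectorised, the same pattern can be applied to the whole vector in one call. The second pattern below is only an alternative sketch: it assumes every file name ends in two numeric fields followed by ".dat", which holds for the examples in the question but may not hold for all real files.

```r
filenames <- c("NO2_Place1_123_456789.dat", "NO2_Nice_Place1_123_456789.dat",
               "NO2_Nice_Place2_123_456789.dat", "NO2_Place2_123_456789.dat")

# Same pattern as above, applied to every element at once
stations_sub <- sub("^[^_]+_(.*Place\\d*).*", "\\1", filenames)

# Alternative (assumption: names always end in _<digits>_<digits>.dat):
# keep everything between the first field and the two trailing numeric fields
stations_tail <- sub("^[^_]+_(.*)_[0-9]+_[0-9]+\\.dat$", "\\1", filenames)
```

The second form does not depend on the word "Place", so it may be the more robust choice if other station names appear later.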

Related

How to run a function (many times) that changes a variable (tibble) in the global env

I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create a tibble that holds a "Full Name" (of a person, who may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical male and female names.
Below I present a minimal working example.
My problem: I can call get_name() multiple times (in a for loop of 10,000 iterations!) and get the right answer, but I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector, which makes it unusable.
My doubts: I know I have some issues (very few, right?), like the if statement that is evaluated on every call (which is redundant), but I can't find another way to do it. Any suggestions?
Any other suggestions about code structure are also welcome.
Thank you very much in advance for your help.
library(tidyverse) # provides tribble(), tibble(), add_row() and %>%
# Dummy name list
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
# Get the Number of 'Unit-names' to compose a 'Full-name'
nbr_names <- sample(2:4, 1, replace = TRUE)
# Randomize the Gender
gender <- sample(c("Women", "Man"), 1, replace = TRUE)
if (gender == "Women") {
lim_names <- sum( !is.na(unit_names$"Women"))
} else {
lim_names <- sum( !is.na(unit_names$"Man"))
}
# Sample the Fem/Man List names (may have duplicate)
sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
# Form a Full-name
paste ( . , collapse = " ") %>%
# Add it to the tibble (INCLUDE the Gender)
add_row(full_name, "Full Name" = . , "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a larger number than 1 to sample this problem becomes easier to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren’t: hence they shouldn’t be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
A note on style: unlike your function, the above doesn’t use any global variables. Furthermore, don’t "quote" variable names (you do this in unit_names$"Women" and for the arguments of add_row). R allows this, but it is arguably a mistake in the language specification: these are not strings, they’re variable names, and making them look like strings is misleading. You don’t quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space. However, the use of backticks, rather than quotes, signifies that this is a variable name.
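For readers who want to try the idea without loading any packages, here is a rough base-R sketch of the same approach. The function name generate_names_base and the output column names are my own choices, not part of the answer above; mapply plays the role of Map, and a data.frame stands in for the tibble.

```r
name_list <- list(
  Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
  Men   = c("male1", "male2", "male3", "male4", "male5")
)

# Hypothetical base-R variant: draws n full names in one vectorised pass
generate_names_base <- function(n, name_list) {
  name_length <- sample(2:4, n, replace = TRUE)
  genders <- sample(names(name_list), n, replace = TRUE)
  # For each draw, sample k unit names from the pool of the drawn gender
  full <- mapply(
    function(pool, k) paste(sample(pool, k, replace = TRUE), collapse = " "),
    name_list[genders], name_length,
    USE.NAMES = FALSE
  )
  data.frame(full_name = full, gender = genders, stringsAsFactors = FALSE)
}

set.seed(12345)
res <- generate_names_base(50, name_list)
```

Nothing here touches a global variable: the name pools are passed in as an argument and the result is returned, so the caller decides where to store it.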
I am not 100% sure what you are trying to get, but if I understood correctly: did you try mutate from dplyr? For example (my_data and the column names are placeholders):
result = mutate(my_data,
concated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, I arrived at the elegant (and vectorized ... and fast) solution that I was looking for. map2 does the necessary trick.
Here is the full working example in case someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble.)
Once again, thanks to Konrad.
library(tidyverse) # provides tribble(), tibble(), map2() and %>%
# Dummy name list
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
name_length = sample(2 : 4, n, replace = TRUE)
genders = sample(c('Women', 'Men'), n, replace = TRUE)
#names = lapply(name_list[genders], sample, name_length) %>%
# replace = TRUE allows repeated unit names, as in the original get_name()
names = map2(name_list[genders], name_length, sample, replace = TRUE) %>%
lapply(paste, collapse = ' ') %>%
unlist()
tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

R - extracting multiple patterns from string using gregexpr

I am working with a dataset where I have a column describing different products. In the product description is also the weight of the product, which is what I'd like to extract. My problem is that some products come in dual-packs, meaning that the description starts with '2x', while the actual weight is at the end of the description. For example:
x = '2x pet food brand 12kg'
What I'd like to do is to shorten this to just 2x12kg.
I'm not great at using regexp in R and was hoping that someone here could help me.
I have tried doing this using gregexpr in the following way:
m <- gregexpr("(^[0-9]+x [0-9]+kg)", x)
Unfortunately this only gives me '12kg', not including the '2x'.
I would appreciate any help at all with this.
EDIT ----
After sorting out my initial problem, I found that there were a few instances in the data of a different format, which I would also like to extract:
x = 'Pet food brand 15x85g'
# Should be:
x = '15x85g'
I have tried to play around with OR statements in gsub, like:
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+kg)|([0-9]+x)?[^0-9]*([0-9.]+g)', '\\1\\2', x)
#And
m <- gsub('^([0-9]+x)?[^0-9]*([0-9.]+(kg|g))', '\\1\\2', x)
While this still extracts the kilos, it only removes the instances with grams and leaves the rest of the string, like:
x = 'Pet food brand '
Or running gsub a second time using:
m <- gsub('([0-9]+x[0-9]+g)', '\\1', x)
The latter option does not extract the product weights at all, and just leaves the string intact.
Sorry for not noticing that the strings were formatted differently earlier. Again, any help would be appreciated.
You could use this regular expression:
m = gregexpr("([0-9]+x|[0-9.]+kg)", string, ignore.case = TRUE)
result = regmatches(string, m)
r = paste0(unlist(result),collapse = "")
For string = "2x pet food brand 12kg" you get "2x12kg"
This also works if kilograms have decimals:
For string = "23x pet food 23.5Kg" you get "23x23.5Kg"
(edited to correct mistake pointed out by @R. Schifini)
You can use regex like this:
x <- '2x pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "2x12kg"
This would get you the weight even if there is no "2x" in the beginning of the string:
x <- 'pet food brand 12kg'
gsub('^([0-9]+x)?[^0-9]*([0-9]+kg)', '\\1\\2', x)
## "12kg"
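Neither pattern above covers the gram format from the question's edit ('Pet food brand 15x85g'). One possible sketch, building on the gregexpr/regmatches approach, is to match either a pack count ("15x") or a weight in kg or g and paste the matches together. The helper name extract_weight is hypothetical, and the pattern assumes that digits followed by "x", "kg" or "g" only ever appear as pack counts and weights in the descriptions.

```r
# Hypothetical helper: keeps only the "<count>x" and "<number>kg" / "<number>g"
# pieces of each description and concatenates them in order of appearance.
extract_weight <- function(x) {
  pattern <- "[0-9]+x|[0-9.]+k?g"
  matches <- regmatches(x, gregexpr(pattern, x, ignore.case = TRUE))
  # One string per input; descriptions without a match give ""
  vapply(matches, paste0, character(1), collapse = "")
}

descriptions <- c("2x pet food brand 12kg",
                  "Pet food brand 15x85g",
                  "pet food brand 12kg",
                  "23x pet food 23.5Kg")
weights <- extract_weight(descriptions)
```

Because the function is vectorised over x, it can be applied directly to a whole column of product descriptions.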

Need help writing data from a table in R for unique values using a loop

I'm trying to figure out why, when I run this code, all the information from the columns is being written to the first file only. What I want is for only the rows belonging to each unique MO number to be written out. I believe the problem is in the third line, but I am not sure how to divide the data by each unique number.
Thanks for the help,
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
df = MOs_Interest[MOs_Interest$MO_NUMBER == MO, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
submit.df <- data.frame(df)
filename = paste("Variance", "Report",MO, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)
}
If you are trying to write out a separate csv for each unique MO number, then something like this may work to accomplish that.
unique.mos <- unique(MOs_Interest$MO_NUMBER)
for (mo in unique.mos){
submit.df <- MOs_Interest[MOs_Interest$MO_NUMBER == mo, c("ITEM_NUMBER", "OPER_NO", "OPER_DESC", "STDRUNHRS", "ACTRUNHRS","Difference", "Sum")]
filename <- paste("Variance", "Report", mo, ".csv", sep="")
write.csv(submit.df, file = filename, row.names = FALSE)
}
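An alternative to subsetting inside the loop is to let split() divide the data frame by MO number first and then write each piece. The sketch below uses made-up sample data (only MO_NUMBER, ITEM_NUMBER and STDRUNHRS are borrowed from the question's column names) and writes to a temporary directory so it can be run safely.

```r
# Hypothetical sample data standing in for MOs_Interest
MOs_Interest <- data.frame(
  MO_NUMBER   = c(101, 101, 102, 103),
  ITEM_NUMBER = c("A1", "A2", "B1", "C1"),
  STDRUNHRS   = c(1.0, 2.5, 3.0, 0.5)
)

out_dir <- tempdir()  # assumption: replace with the real output folder

# One data frame per unique MO number, named by that number
by_mo <- split(MOs_Interest, MOs_Interest$MO_NUMBER)

# Write each piece to its own csv and collect the file paths
written <- vapply(names(by_mo), function(mo) {
  path <- file.path(out_dir, paste0("VarianceReport", mo, ".csv"))
  write.csv(by_mo[[mo]], file = path, row.names = FALSE)
  path
}, character(1))
```

Each file then contains only the rows for its own MO number, for example two rows in the file for MO 101.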
It's hard to answer fully without example data (what are the columns of MOs_InterestDF1?) but I think your issue is in the df line. Are you trying to subset the dataframe to only the data matching the MO? If so, try which as in df = MOs_Interest[which(MOs_Interest$MO_NUMBER == MO),].
I wasn't sure if you actually had two separate dfs (MOs_Interest and MOs_InterestDF1); if not, make sure the df line points to the correct data frame.
I tried to create some simplified sample data:
MOs_InterestDF1 <- data.frame("MO_NUMBER" = c(1,2,3), "Item_No" = c(142,423,214), "Desc" = c("Plate","Book","Table"))
for (i in 1:nrow(MOs_InterestDF1)) {
MO = MOs_InterestDF1[i,1]
mydf = data.frame(MOs_InterestDF1[which(MOs_InterestDF1$MO_NUMBER == MO),])
filename = paste("This is number ",MO,".csv", sep="")
write.csv(mydf, file = filename, row.names=FALSE)
}
This outputs three different csv files, each with exactly one row of data. For example, "This is number 1.csv" had the following data:
MO_NUMBER Item_No Desc
1 142 Plate

How to use a character string as direction to produce a data.frame?

I'm sure this is simple, but I didn't find a solution.
I want to pass my string, called Data,
Data
[1] "as.numeric(dataset$a),as.numeric(dataset1$a)"
to the function data.frame to create a data frame. I tried:
DB<-data.frame(Data)
but the output is just my string. If I print DB, the output is in fact:
Data
1 as.numeric(dataset$a),as.numeric(dataset1$a)
not the values in dataset$a and dataset1$a.
Thanks
Surely there is a better way to do whatever it is you want to do. But if you really want to run a string as if it were code you can use an eval(parse(text = string)) construction. However, it is generally a bad way to write code. Nonetheless here is a solution:
# a test dataframe
df = data.frame(a = 1:10, b = 11:20)
# string with code to run
string = "as.numeric(df$a),as.numeric(df$b)"
# split on , since those are separate expressions
str = unlist(strsplit(string, ','))
# put it in a dataframe
df2 = data.frame(sapply(str, function(string) eval(parse(text = string))))
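If the data frames themselves are available (rather than only a string of code), the same result can usually be built without eval(parse()). The following is only a sketch under that assumption: dataset and dataset1 are made-up stand-ins for the question's objects, and the column of interest is assumed to be called a in both.

```r
# Hypothetical stand-ins for the question's data
dataset  <- data.frame(a = c("1", "2", "3"), stringsAsFactors = FALSE)
dataset1 <- data.frame(a = c("4", "5", "6"), stringsAsFactors = FALSE)

# Keep the sources in a named list instead of encoding them in a string
sources <- list(dataset = dataset, dataset1 = dataset1)

# Pull column "a" from each source, convert it, and bind the results as columns
DB <- as.data.frame(lapply(sources, function(d) as.numeric(d[["a"]])))
```

This also avoids a limitation of splitting the code string on commas, which would break as soon as one of the expressions contained a comma of its own.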

LOOP not working in R

This doesn't work and I'm not sure why.
look_up <- data.frame(flat=c("160","130"),
street=c("stamford%20street", "doddington%20grove"),
city = c("London", "London"),
postcode = c("SE1%20", "se17%20"))
new <- data.frame()
for(i in 1:nrow(look_up)){
new <- rbind(new,look_up$flat[i])
}
Grateful if someone could tell me why please! My result should be a data frame with one column called 'flat' and the values of 160 and 130 on each row. Once I understand this I can move onto the real thing I'm trying to do!
No need for a loop:
look_up[,"flat",drop=FALSE]
As mentioned, the problem with your loop is automatic conversion to factors. You can put options(stringsAsFactors=FALSE) at the start of your script to avoid that (since R 4.0.0, data.frame no longer converts strings to factors by default).
However, it's almost certain that you are approaching your actual problem in the wrong way. You should probably ask a new question, where you tell us what you actually want to achieve.
You need to look into the stringsAsFactors argument of data.frame.
look_up <- data.frame(flat=c("160","130"),
street=c("stamford%20street", "doddington%20grove"),
city = c("London", "London"),
postcode = c("SE1%20", "se17%20"),
stringsAsFactors = FALSE)
look_up[, "flat", drop = FALSE ]
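If the real task does need a loop, a common pattern is to collect the pieces in a list and bind them once at the end, instead of growing a data frame with rbind on every iteration. This is only a sketch of that pattern with a shortened version of the question's data; the per-row step here is trivial and stands in for whatever the real loop body would do.

```r
look_up <- data.frame(flat = c("160", "130"),
                      street = c("stamford%20street", "doddington%20grove"),
                      stringsAsFactors = FALSE)

# Build one single-row data frame per iteration, keeping the column name
pieces <- lapply(seq_len(nrow(look_up)), function(i) {
  look_up[i, "flat", drop = FALSE]
})

# Bind all pieces into one data frame in a single step
new <- do.call(rbind, pieces)
```

Because each piece is already a data frame with a flat column, the combined result keeps that column name, which the original loop lost.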
You could also do something like:
> look_up <- data.frame(flat=c("160","130"),
+ street=c("stamford%20street", "doddington%20grove"),
+ city = c("London", "London"),
+ postcode = c("SE1%20", "se17%20"))
>
> new <- look_up[,1,drop=FALSE]
> new
flat
1 160
2 130
> class(new)
[1] "data.frame"
This shows that the final output is a data frame with 160 and 130 in a single column.
If you don't include drop=FALSE here, the result will be a plain vector (a factor, when strings are converted to factors) rather than a data frame.
Hope this helps.
