Find string and create additional column - r

I have a list of data that contains a bunch of strings that contain currency codes. The location of the code varies within the string, and I am looking for a way to separate the code out.
I've tried searching, but all the suggestions I can find centre around the string being in the same location or separated by a similar character (eg. _ or -)
My input looks something like this:
input = structure(list(V1 = c("asdf23.USD123", "DKK1234", "1dCNY_d",
"fgdUSD33", "912#NZD")), class = "data.frame", row.names = c(NA,
-5L))
and I have a list of currencies I'm looking for like this:
fx = c("CNY", "DKK", "NZD", "USD")
I am trying to search the V1 column for values that match the list, and create a new column with the corresponding currency, eg:
output = structure(list(V1 = c("asdf23.USD123", "DKK1234", "1dCNY_d",
"fgdUSD33", "912#NZD"), V2 = c("USD", "DKK", "CNY", "USD", "NZD"
)), class = "data.frame", row.names = c(NA, -5L))
I don't know where I'd begin to look. Can anyone suggest what I should be searching for?

One option is to extract the substring based on the values in 'fx', by pasting the elements into a single string separated by "|" so that it works as a regex alternation:
library(dplyr)
library(stringr)
input %>%
mutate(V2 = str_extract(V1, str_c(fx, collapse="|")))
# V1 V2
#1 asdf23.USD123 USD
#2 DKK1234 DKK
#3 1dCNY_d CNY
#4 fgdUSD33 USD
#5 912#NZD NZD
Or in base R
input$V2 <- regmatches(input$V1, regexpr(paste(fx, collapse="|"), input$V1))
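One caveat with the base R line above: regmatches() combined with regexpr() returns only the strings that matched, so if any row contains none of the codes, the result is shorter than the column and the assignment fails. str_extract() returns NA in that case. A minimal base R sketch that keeps NA for non-matching rows, assuming a made-up row "no_code_here" that is not in the original data:

```r
# Example data; the third row is hypothetical and contains no currency code
input <- data.frame(
  V1 = c("asdf23.USD123", "DKK1234", "no_code_here", "912#NZD"),
  stringsAsFactors = FALSE
)
fx <- c("CNY", "DKK", "NZD", "USD")

# Build the alternation "CNY|DKK|NZD|USD" and locate the first match per row
pattern <- paste(fx, collapse = "|")
m <- regexpr(pattern, input$V1)

# Start with NA everywhere, then fill only the rows that matched
input$V2 <- NA_character_
input$V2[m > 0] <- regmatches(input$V1, m)

input$V2
```

regexpr() returns -1 where nothing matched, which is why m > 0 selects exactly the rows that regmatches() returns values for.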

Related

pull subject ids after detecting characters matching a string

Please help me pull subject id's after determining a list of participants who do not contain specified characters. e.g:
data:
df <- structure (list(subject_id = c("191-5467", "191-6784", "191-3457", "191-0987", "191-1245", "191-2365"), edta_codes = c("4EDTA-3M783316", "4EDTA-3M2897865", "4EDTA-M280934", "4EDTA-3M286549","MCF -3M289684", "NA")), class = "data.frame", row.names = c (NA, -6L))
Code to test if character is in string:
df$edta_codes[!grepl("4EDTA-3", df$edta_codes)]
Different method:
str_detect(df$edta_codes,"4EDTA-3")
Both give me the result I want but from here I want to show the subject ids that do not have the specified string, including those with NA (i.e. in this case - 191-3457, 191-1245, 191-2365 are all different from the specified characters). I have tried using pull after each of the above codes and they both did not work.
Please help.
You can simply do,
df[!grepl("4EDTA-3", df$edta_codes),'subject_id']
#[1] "191-3457" "191-1245" "191-2365"
If you want to return also the codes, then,
df[!grepl("4EDTA-3", df$edta_codes),]
# subject_id edta_codes
#3 191-3457 4EDTA-M280934
#5 191-1245 MCF -3M289684
#6 191-2365 NA
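In the example data the last code is the string "NA", not a missing value. If the real data holds genuine NA values, the same approach should still work, because grepl() returns FALSE for NA input. A small sketch on shortened, partly made-up data (the NA in the last row is an assumption):

```r
# Shortened version of the example data with a real missing value
df <- data.frame(
  subject_id = c("191-5467", "191-3457", "191-2365"),
  edta_codes = c("4EDTA-3M783316", "4EDTA-M280934", NA),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats "4EDTA-3" as a literal string rather than a regex
keep <- !grepl("4EDTA-3", df$edta_codes, fixed = TRUE)

# Subject ids whose code does not contain the string, including the NA row
ids <- df$subject_id[keep]
ids
```

str_detect() behaves differently here: it returns NA for a missing value, so its result would need an extra is.na() check before being used as a row filter.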

How to remove additional numbers in each cell in a dataframe

I am doing some data analyzing with R. I read a csv file. I would like to eliminate 000,000,000 from each cell. How can I get rid of only 000? I tried to use grep(), but it dropped rows.
This is the dataframe:
You can try this. I have included dummy data based on your screenshot (and please pay attention to the comment by @andrew_reece):
#Code
df$NewVar <- trimws(gsub('000','',df$VIOLATIONS_RAW),whitespace = ',')
Output:
VIOLATIONS_RAW NewVar
1 202,403,506,000 202,403,506
2 213,145,123 213,145,123
3 212,000 212
4 123,000,000,000 123
Some data used:
#Data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")
We could also do this in a more general way that removes a group made of any number of 0's:
df$VIOLATIONS_RAW <- trimws(gsub("(?<=,)0+(?=(,|$))", "",
df$VIOLATIONS_RAW, perl = TRUE), whitespace=",")
df$VIOLATIONS_RAW
#[1] "202,403,506" "213,145,123" "212" "123"
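Both regex approaches above target all-zero groups at the end of the string. If an all-zero group could also appear in the middle (for example "202,000,506", a made-up value that is not in the original data), splitting on the comma and filtering may be easier to reason about. A sketch, where drop_zero_groups() is a hypothetical helper name:

```r
# Hypothetical helper: remove every comma-separated group that is all zeros
drop_zero_groups <- function(x) {
  vapply(
    strsplit(x, ",", fixed = TRUE),
    function(parts) paste(parts[!grepl("^0+$", parts)], collapse = ","),
    character(1),
    USE.NAMES = FALSE
  )
}

violations <- c("202,403,506,000", "213,145,123", "212,000",
                "123,000,000,000", "202,000,506")
drop_zero_groups(violations)
```

Because whole groups are dropped before the string is rebuilt, no stray or doubled commas are left behind and no trimws() step is needed.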
data
df <- structure(list(VIOLATIONS_RAW = c("202,403,506,000", "213,145,123",
"212,000", "123,000,000,000")), row.names = c(NA, -4L), class = "data.frame")

R not producing the same result when the data set source is changed

If I manually create the two data frames, the code does what it was intended to do:
library(dplyr)
library(splitstackshape) # provides cSplit()
df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank","tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, -5L))
test <- df2 %>%
rowwise() %>%
mutate(CompanyName = as.character(Filter(length,
lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case=TRUE)])))) %>%
group_by(CompanyName) %>%
summarise(Variation = paste(CompanyVariationsNames, collapse=",")) %>%
cSplit("Variation", ",")
this produces the following result:
CompanyName Variation_1 Variation_2 Variation_3
1: Google google plc google finance google play
2: Tesco tesco bank tesco insurance NA
But if I import a data set (using read.csv), I get the following error: Error in mutate_impl(.data, dots) : Column CompanyName must be length 1 (the group size), not 0. My data sets are rather large: df1 has 1000 rows and df2 has 54k rows.
Is there a specific reason why the code works when the data set is created manually but not when the data is imported?
df1 contains company names and df2 contains variations of those company names.
Help please!
Importing from CSV can be tricky. Check whether the default separator (comma) applies to your file in particular. If not, you can change it by setting the sep argument to the character your file uses (e.g. read.csv(file_path, sep = ";")), which is a common problem in my country due to our local conventions.
In fact, if your standard is semicolons, read.csv2(file_path) will suffice.
Also (to avoid further trouble), it is very common for CSV files to cause problems in columns with decimal values, because here we use commas as decimal separators rather than dots. So it would be worth checking whether this is a problem in your file too, in any of the other columns.
If that is your case, you can set the appropriate parameter in either read.csv or read.csv2 with dec = "," (e.g. read.csv(file_path, sep = ";", dec = ",")).
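To make the separator and decimal settings concrete, here is a small sketch that reads semicolon-separated text with decimal commas. The column names and values are made up for illustration, and the text argument stands in for a real file path:

```r
# Made-up CSV content: semicolon as field separator, comma as decimal mark
csv_text <- "CompanyName;Revenue\nGoogle;1,5\nTesco;2,25"

# Explicit separator and decimal mark
semi <- read.csv(text = csv_text, sep = ";", dec = ",")

# read.csv2() uses sep = ";" and dec = "," by default
semi2 <- read.csv2(text = csv_text)

str(semi)
```

Running str() on the imported data frame and on the manually created one is a quick way to spot differences in column names or column types between the two.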

odd behavior when substituting parts of a string within a for loop

I'm trying to replace a series of numbers in a character string with information that comes from a dataframe.
My string comes from a text file that I imported using the readr package as follows: read_file("Human.txt")
I've checked the class, it is character. The string contains the following information (I've named it treeString):
"(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
My dataframe (labels.csv) was originally in factor format, but I changed the format of the second column to character using the following command: labels[,2] = as.character(labels[,2]). It looks like this
v1 v2
1 1 name1
2 2 name2
3 3 name3
My goal is to substitute every number in the string with the corresponding name (i.e. v2) in the dataframe. This should result in the following:
"(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Here is the code I am using to accomplish this:
for(i in 1:nrow(labels)){
gsub(as.character(i), labels[i,2], treeString)
}
The weird thing is that if I run the gsub() command on its own (with specified numbers - eg. 2) it does the substitution, however, when I run it in a loop it does not substitute the numbers.
As pointed out by Kumar Manglam in the comments, you forgot to assign the result of gsub() back to treeString.
There is something else you should be aware of: with the regular expression as specified in your question, the loop will also turn patterns like "(241)" into "(name24name1)". To avoid this behaviour, you should check that the number you want to replace is preceded by a comma or opening parenthesis and followed by a comma or closing parenthesis:
# Option1
for(i in 1:nrow(labelnames)){
reg_pattern <- paste0("(?<=[(,])(", i, ")(?=[),])")
treeString <- gsub(reg_pattern, labelnames$v2[i], treeString, perl=T)
}
Another, nicer, option is to drop the for-loop and do it all at once:
# Option2
reg_pattern <- paste0("(?<=[(,])([1-", nrow(labelnames), "])(?=[),])")
treeString <- gsub(reg_pattern, "name\\1", treeString, perl=T)
# Result
treeString
# "(name1,name2,((((name3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
Data
treeString <- "(1,2,((((3),884),(((((519,((516,517),(515,(518,(513,514))))),((((((((458,(457,(455,456))),459),(502,(454,(453,(451,452)))))"
labelnames <- structure(list(v1 = 1:3, v2 = c("name1", "name2", "name3")), .Names = c("v1", "v2"), class = "data.frame", row.names = c(NA, -3L))
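Option 2 relies on two properties of the example: the labels are literally "name" followed by the number, and there are at most nine of them, since a character class such as [1-3] matches a single digit. For arbitrary labels, one possible alternative is to look every number up in the data frame and write the results back with regmatches<-. A sketch on a shortened string with made-up labels ("alpha", "beta", "gamma" are assumptions):

```r
# Shortened example string and hypothetical labels
treeString <- "(1,2,((3),884),(12,241))"
labelnames <- data.frame(v1 = 1:3, v2 = c("alpha", "beta", "gamma"),
                         stringsAsFactors = FALSE)

# Find whole numbers that sit between "(" or "," and ")" or ","
m <- gregexpr("(?<=[(,])[0-9]+(?=[),])", treeString, perl = TRUE)
tokens <- regmatches(treeString, m)[[1]]

# Look each number up; numbers without a label are kept unchanged
idx <- match(tokens, as.character(labelnames$v1))
replacement <- ifelse(is.na(idx), tokens, labelnames$v2[idx])

# Write the replacements back into the original string
regmatches(treeString, m) <- list(replacement)
treeString
```

Because each number is matched as a whole token, "12" and "241" are left alone instead of being partially replaced.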

Paste value after certain delimiter

I have data in the following format:
In Column A:
String1__String2__String3
In Column B:
Value
I would like to paste the Value into the String after the first delimiter like this:
String1__Value__String2__String3
The crucial part of the code I am using now (where I paste the value) is the following line:
df2 <-cbind(df[1],apply(df[,2:ncol(df)],2,function(i)ifelse(is.na(i), NA, paste(df[,1],i,sep="_"))))
With this code, the value is appended after the string, like this:
String1__String2__String3__Value
Is there an easy way to rearrange this so the Values will be pasted at the correct place. Or do I have to redo the complete code ?
Thanks
Update, Example:
Column A:
Jennifer__DoesSomething__inaCity
Column B:
2
Result now:
Jennifer__DoesSomething__inaCity__2
Desired result:
Jennifer__2__DoesSomething__inaCity
The strings Jennifer, DoesSomething, inaCity change and are not the same length. Only the delimiter stays the same. I want to paste after the first delimiter.
Thanks !
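Here is an idea. sub() replaces only the first occurrence of a pattern, so we can replace the first '__' with '__value__'. Using mapply, we apply this row by row, inserting each value from the second column after the first delimiter of the corresponding string in the first column.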
mapply(function(x, y) sub('__', paste0('__', y, '__'), x), df$v1, df$v2)
# atsfs__dsfgg__sdgsdg eeee__FFFF__GGGG
#"atsfs__3__dsfgg__sdgsdg" "eeee__5__FFFF__GGGG"
DATA
dput(df)
structure(list(v1 = c("atsfs__dsfgg__sdgsdg", "eeee__FFFF__GGGG"
), v2 = c(3, 5)), .Names = c("v1", "v2"), row.names = c(NA, -2L
), class = "data.frame")
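The same result can be sketched without mapply by locating the first delimiter with regexpr() and cutting the string there, which is vectorised over both columns. insert_after_first() is a hypothetical helper name, and returning NA for missing values mirrors the ifelse(is.na(i), NA, ...) in the question:

```r
# Hypothetical helper: insert `values` after the first `delim` in `strings`
insert_after_first <- function(strings, values, delim = "__") {
  pos <- regexpr(delim, strings, fixed = TRUE)  # start of the first delimiter
  out <- paste0(substr(strings, 1, pos - 1), delim, values,
                substring(strings, pos))
  # No delimiter found, or missing input: return NA for that element
  bad <- is.na(values) | is.na(pos) | pos < 0
  out[bad] <- NA_character_
  out
}

res <- insert_after_first(
  c("atsfs__dsfgg__sdgsdg", "eeee__FFFF__GGGG", "Jennifer__DoesSomething__inaCity"),
  c(3, 5, NA)
)
res
```

Unlike the mapply result, the returned vector carries no names, which can be convenient when assigning it to a data frame column.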
