I would like to remove the characters "(B)" from the code column so that I can then summarise stock_need. My data looks like this:
code stock_need
(B)1234 200
(B)5678 240
1234 700
5678 200
0123 200
I want it to look like this:
code stock_need
1234 200
5678 240
1234 700
5678 200
0123 200
How can these "(B)" prefixes be removed? Thanks in advance.
What other patterns does your data have? If it's always "(B)", you can do
sub("\\(B\\)", "", df$code)
#[1] "1234" "5678" "1234" "5678" "0123"
Or, if it could be any single uppercase letter in parentheses, do
sub("\\([A-Z]\\)", "", df$code)
You could also extract only the digits from code
sub(".*?(\\d+).*", "\\1", df$code)
You might want to wrap the output of sub in as.numeric or as.integer to get numeric/integer output. Note that this drops leading zeros, so "0123" becomes 123.
We can also use readr
readr::parse_number(df$code)
Basically, you need to do two things:
remove the unnecessary part of the string
convert the string to numeric.
Say, we load your data frame:
df <- read.table(header=TRUE, text="code stock_need
(B)1234 200
(B)5678 240
1234 700
5678 200
0123 200 ")
First, we replace the column "code" with something without the parentheses:
df$code <- gsub("\\(B\\)", "", df$code)
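Explanation: why the weird \\? Parentheses have a special meaning in regular expressions, and the first argument to gsub is a regular expression. If we wrote (B), gsub would treat the parentheses as a group and match only the letter B, leaving "()1234" behind. The backslash makes each parenthesis match literally, and it has to be doubled because a backslash must itself be escaped inside an R string.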
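A quick way to see the difference is to try the patterns on a single made-up value. This is only a small sketch; the variable names are illustrative, and fixed = TRUE is a base R alternative that treats the pattern as a literal string instead of a regular expression:

```r
x <- "(B)1234"

# Unescaped: the parentheses form a group, so only the letter B is removed
unescaped <- sub("(B)", "", x)

# Escaped: the parentheses are matched literally
escaped <- sub("\\(B\\)", "", x)

# fixed = TRUE: no regular expression at all, the pattern is a plain string
literal <- sub("(B)", "", x, fixed = TRUE)

print(c(unescaped = unescaped, escaped = escaped, literal = literal))
```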
Next, we make a number vector out of it:
df$code <- as.numeric(df$code)
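To get from there to the summary the question asks for, here is a minimal sketch in base R. It assumes the goal is the total stock_need per code, and as a variation on the step above it keeps code as a character column so that "0123" retains its leading zero. The name totals is only illustrative.

```r
# Same data as in the question, built directly so the example is self-contained
df <- data.frame(
  code = c("(B)1234", "(B)5678", "1234", "5678", "0123"),
  stock_need = c(200, 240, 700, 200, 200),
  stringsAsFactors = FALSE
)

# Strip a leading "(B)" from each code
df$code <- sub("^\\(B\\)", "", df$code)

# Sum stock_need within each code
totals <- aggregate(stock_need ~ code, data = df, FUN = sum)
print(totals)
```

The same grouping could also be done with dplyr's group_by and summarise if you prefer that style.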
What is the best way to count how many of my words occur in each sentence? Here is a small sample:
words <- c("apple", "pear", "grape")
sentences <- c("I have an apple and a pear", "Grape is my favorite", "I don't like pear")
Ideally, the output would look like:
count sentence
2 "I have an apple and a pear"
1 "Grape is my favorite"
1 "I don't like pear"
I have tried using str_count but to no avail. Any help is appreciated!
library(stringr)
str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b"))
[1] 2 1 1
How this works:
(?i): this makes sure the pattern match is case-insensitive
\\b and \\b make sure the words are matched as whole words with word boundaries (if \\b is not used, you may end up matching something that merely contains one of your words but is itself a different word, such as grapple, which contains apple)
( and ) form a group (a capturing one; (?:...) would be the non-capturing form), the content of which is the words separated by the pipe |, a metacharacter for alternation signifying 'OR'.
If you want to have this inside a dataframe:
df <- data.frame(
sentences = sentences,
count = str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b")))
Result:
df
sentences count
1 I have an apple and a pear 2
2 Grape is my favorite 1
3 I don't like pear 1
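If stringr is not available, roughly the same count can be sketched in base R with gregexpr and regmatches. The helper name count_words is hypothetical, and the sketch assumes the words contain no regex metacharacters:

```r
words <- c("apple", "pear", "grape")
sentences <- c("I have an apple and a pear", "Grape is my favorite", "I don't like pear")

# count_words is a hypothetical helper, not part of any package
count_words <- function(sentences, words) {
  # Build \b(apple|pear|grape)\b from the word list
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  # Find every case-insensitive match in each sentence
  hits <- gregexpr(pattern, sentences, ignore.case = TRUE, perl = TRUE)
  # regmatches returns one character vector per sentence; count its elements
  lengths(regmatches(sentences, hits))
}

counts <- count_words(sentences, words)
print(data.frame(count = counts, sentence = sentences))
```

Sentences without any match get a count of 0, because regmatches returns an empty vector for them.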
I might be missing something very obvious, but how can I write efficient code to get all matches of the singular version of a noun but NOT its plural? For example, I want to match
angel investor
angel
BUT NOT
angels
try angels
If I try
grep("angel ", string)
Then a string with JUST the word
angel
won't match.
Please help!
Use word-boundary markers \\b:
x <- c("angel investor", "angel","angels", "try angels")
grep("\\bangel\\b", x, value = TRUE)
[1] "angel investor" "angel"
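To see what the boundaries change, here is a small sketch comparing a plain substring match with the whole-word match. The helper whole_word is hypothetical and assumes the word contains no regex metacharacters:

```r
x <- c("angel investor", "angel", "angels", "try angels")

# whole_word is a hypothetical helper that wraps a word in \b markers
whole_word <- function(word, x) grepl(paste0("\\b", word, "\\b"), x)

loose <- grepl("angel", x)        # substring match, also hits the plurals
strict <- whole_word("angel", x)  # whole-word match only

print(data.frame(x, loose, strict))
```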
You can try the following approach, which drops every row containing a word that ends in s after one of the listed letters. I still believe there are other excellent ways to solve this problem.
library(dplyr)
library(stringr)
df <- data.frame(obs = 1:4, words = c("angle", "try angles", "angle investor", "angles"))
df %>%
filter(!str_detect(words, "(?<=[ertkgwmnl])s\\b"))
# obs words
# 1 1 angle
# 2 3 angle investor
I have a .csv-file with a column containing book descriptions scraped from the web which I import into R for further analysis. My goal is to extract the protagonists' ages from this column in R, so what I imagine is this:
Match strings like "age" and "-year-old" with a regex
Copy the sentences containing these strings into a new column (so that I can make sure that the sentence is not, for example, "In the middle ages 50 people lived in xy")
Extract the numbers (and, if possible some number words) from this column into a new column.
The resulting table (or probably data.frame) would then hopefully look like this
|Description              |Sentence            |Age
|YY is a novel by Mr. X   |The 12-year-old boy |12
|about a boy. The 12-year |is named Dave.      |
|-old boy is named Dave.  |                    |
If you could help me out, that would be great, since my R skills are still very limited and I have not found a solution for this problem!
Another option if the string contains other numbers/descriptions besides just age, but you only want age.
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."
sentences <- str_split(description, "\\.")[[1]]  # split the text at every period
sentence <- sentences[grepl("-year-old", sentences)]  # keep the piece that mentions an age
> sentence
[1] " The 12-year-old boy is named Dave"
age <- as.numeric(str_extract(description, "\\d+(?=-year-old)"))
> age
[1] 12
Here we use the string "-year-old" to tell us which sentence to pull and then we extract the age that is followed by that string.
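Since the question asks for new columns in a data frame, here is a rough sketch of the same idea applied to a whole column, written in base R. The data frame books and the helpers age_sentence and age_number are made up for illustration, and the second description is there only to show that text without "-year-old" yields NA:

```r
# Hypothetical example data; the real descriptions would come from the .csv file
books <- data.frame(
  Description = c(
    "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy.",
    "In the middle ages 50 people lived in xy."
  ),
  stringsAsFactors = FALSE
)

# Return the first period-delimited piece containing "-year-old", or NA
age_sentence <- function(text) {
  parts <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])
  hit <- parts[grepl("-year-old", parts, fixed = TRUE)]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# Return the number directly followed by "-year-old", or NA
age_number <- function(text) {
  m <- regmatches(text, regexpr("[0-9]+(?=-year-old)", text, perl = TRUE))
  if (length(m) == 0) NA_real_ else as.numeric(m)
}

books$Sentence <- vapply(books$Description, age_sentence, character(1), USE.NAMES = FALSE)
books$Age <- vapply(books$Description, age_number, numeric(1), USE.NAMES = FALSE)
print(books[, c("Sentence", "Age")])
```

Splitting on every period also cuts at abbreviations such as "Mr.", which is harmless here but worth keeping in mind for real descriptions.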
You can try the following
library(stringr)
description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."
sentence <- str_extract(description, pattern = "\\.[^\\.]*[0-9]+[^\\.]*\\.") %>%
str_replace("^\\. ", "")
> sentence
[1] "The 12-year-old boy is named Dave."
age <- str_extract(sentence, pattern = "[0-9]+")
> age
[1] "12"
Input data is of format A=integer, B=text [word count max 500]
Importing this data set into R truncates the second column to fit chr. Is there a different class that will ensure no truncation or a method to increase the size of chr to accommodate the entire text? (conceptually equivalent to a TEXT vs VARCHAR in sql)
xdoc <- read.csv("./data/abtest2.csv", header = TRUE, sep = ",", as.is = TRUE)
head(xdoc)
A 1 601004351600
B 1
adsfj al;ds fj;sd jf;klsdj f dsfdfsdf sdf sdf sdf as a dag dfgh tyutr
erigkdj fajklsdf j;sdkl ;klajdfsiljuaeiodgjdfl;gdASo ri[3iocvjilgjdfi
gjksjfl jgeoutoihjkvhlkasj;aljdsgkjdfghkdm,gfn;lkja;ja;drfjgkihyuirhl
jkjfdkl hjgasdhgdfjkgksdjkj r...
I think it's something about the way in which you're viewing the files.
longwords <- replicate(10,paste(
sample(letters,600,replace=TRUE),collapse=""))
nchar(longwords) ## 600 600 600 600 ...
dd <- data.frame(n=1:10,w=longwords)
write.csv(dd,file="tmp.csv",row.names=FALSE)
Now read the data file back in -- it's the same as when it was written out
xdoc <- read.csv("tmp.csv",as.is=TRUE)
nchar(xdoc$w)
## [1] 600 600 600 600 600 ...
I don't know what kind of limits there are on string length in R other than memory size, but they're long. Perhaps this note from ?as.character is relevant ... ?
> ‘as.character’ breaks lines in language objects at 500 characters,
and inserts newlines. Prior to 2.15.0 lines were truncated.
So something else, either in your viewing procedure or in the way you've processed the data, is messing you up.
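One way to separate a display problem from real truncation is to compare nchar() with what a viewer prints. The following is only a sketch with made-up data and names (txt, tmp, back, shown): it writes one long text to a temporary CSV, reads it back, and shows that str() shortens what it displays while the stored string is intact.

```r
# A made-up text of 500 words, 2499 characters in total
txt <- paste(rep("word", 500), collapse = " ")

# Round trip through a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(A = 1L, B = txt), file = tmp, row.names = FALSE)
back <- read.csv(tmp, stringsAsFactors = FALSE)

# The stored string has its full length
print(nchar(back$B))

# str() only shows the beginning of long strings
shown <- capture.output(str(back$B))
print(shown)
```

If nchar() reports the full length for your own data, the text was imported completely and only the display is abbreviated.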
head(xdoc)
n
1 1
2 2
3 3
4 4
5 5
6 6
w
1 llscwhauaiqfqcftzfqujwqefathrchnneqwkcoktrpnebpylyjkoiqyscegbmdwmiegivulxnqxjlrcjiwrsfbltdrcymcmpeolxpexxcjhrggqjuphahysgocgjtsafueqzrnvcsofeuxfworytsnfrclsxozrmoitlpfunvmoomgijudjrjngynbrpfotbxzktjbctyafofvyjeegwuiavxrzhropgdtkbwsszwetxcgrrsymcjwstrmrqkaqlwuccikpbtjjwssvxvrrldzfjdqtythlhhzslxvhxrojskaxxuhcnmqppbymxvmqzbyhtzqfgljelvcmsmwsdbytqkvhkgyhreomxohpjtcbiffeuqgwrolwqgmmxevifadnqkxgbentgxazfspzztpuulvpqrbioelzhimyxzhrmdltlmynfpkaqldvwhaicmykjmlxmffrqlukqiwdmhrwygkricdozrggopnsknwduqxrmzovnrzcumddwtqzipfwmdijqgnclenqemecguxqfvbfyxcwpswmzrcvnuqohruphgkzljxgovddliiwdsrfobimtcboljtkxcmzfqwi
2 xuevtjfterzujzmauuvbwkszsbvcmyllddxnebwxgbwnqzlxhsppyxfnynjqkbzzuypxqaselnvwciusswranngvzmxgoxpjuawyaxxgtuisnifdcuqukluqlpwaqznbvlgltryvliwpqwmzrssadzocbiputgsyvfatwdhrbpjnhawdfqcssfkpqimyebfihcmkphsaybnyukzdjlggbkmjkogszslcossstvcehuyunrqapaggmvosouccuzpwjcyyqyizkyzqbcbsnsuewjkeicclfbxhlmishlxggnpluoovhlhcvxqqebzihrhtwjsbvrstddpqqpevjxvmprgthqkdiqgzbzvxjthnjuxvmbpijyvnxuwgemztexcpvouuasdikegxfiqdscjsgpjuvkxeweelfrvfuhllswebmxktpofxusqaqzdrbrybytufvuavknulcnikckayqhoxxsbjhwxcidtpxiwjwqpecmseutimbkfyjfbslhbvdrquefmeqggtbfogjoozbrcfsucxokbdvinnuoolriszkrgbeplswmrujgejsolidvyrdutqnejgrlkeoqqpguks
3 ohhbcsacskcpfjptbbvddwuzwbguedjqyowktvrinuzifawboyqgomhqrxahkbbuoyvsfbwwqstreomtzmdlszdndeurvehobdkzzqffxqgpgkcnqbwrrdcewlfbouveqpbwruoqnmbbodjbhetantlffwzpiefnwreimkoxjwswhdpncqgyvaulwehcuyyngidtdpscxysjqcydwbrqvhpjejudsondgltrrmmydrlnbqjaamdfnivundbupuaialqhuvivfiwtzmdahrtsgvaooardpdiwcinxzvrjrfufmjpsmtugrzqfibdyzgznahftzhlraqubtgnbbrrlursixsgzggbxqrjaqpzgmekqrtyawavhbmlcfcluhvwxfwcvjmxmlwkkzsleayftbxiufysupsygpoklqckxcwfpscleyidikrqvudpjzsqebwodmjkndzagemlofmznaoamedremdtrtbvrqmncxcjoydarnqfukqrapgcewncmhrdmpehiosurelobpqxhfiqksimmvcllcsdnefsvkpcwpokzgnpyluvescbztdlsnyduaxnjlrqgtpgkhclexnbd
4 njpjvhthxdkwrhjvzgnjmceketvjoxeaorxyasibcdhgallwbtvdixviamkrjgrgrwmnkxnihclcuxwoyitwnstlfpqqdwaqtilbmihzshpreexixbrqqhzblmkiptpieqhptczxocchzhbdweualevdoqdzbjdcxlosbgvexcbgwopmrvlqoquknwgcoulqdpmvnlsaxchtqxzzdqnnxukbrfvlfyhssidxsmyqkwmghzdkleccscagvkdioydhjyihgesczherzyoiolgmgyefriokqrxvhbpbzszugnogafoonprykardrjhuqrtdacydaefhrhrgvelehknavjuspgvulgaixgfjrgnmzsagbrxekwwegidduogyxohrfsvcahohggbhabwzkgxpqqrabwnkdeprfkrzlqvqwlqocfohhokxgjjvixvszkdhvszunsdqzzcgezdgvluholijbuitornmpjvggkqsqxhlnxsbujtjpriksthpmfqvhcnhvrnxxpjfrrulzjnfbmlemtvlemhtwfzdypabgcljgegdiehklzfgocsfbfmammpceocxddwpqlrmcvjbldkx
5 hawfcjfxgucbgcjggkfplsgcsncipmjnrwatlhwkrjokunomffyvmrvdkenbwahirvimlauvtefealzgkxihtfitevmffqtizbkvdidmgyshuvvwugpddwxxijtexrlnelbhftpczkxlwecmzxwpzfmaosixyzejbgandcuuiknattwgnopcrpfdhgdxdgnvumacvhnwgvlwmplnjroenogsjlrqroivbvibicxprylsoamxmhcumsbdqhvhwsmizemfnvxvlpbrhdqjyotgteomiymxqsyvcimxyxdyiplmohjnoxamibvselbbujdfnvwmycggsvqmhdrcwddpmqlgtuujqaadtinfuwiyghofqkxbgqdqqvqknhfehxhnamlwvingtaqdwmtgvsxplthzhlzolsjlwuvnxrzioxjvxlwcyssfrxljmikbqjfhevynsetwysnevxsczqbekfrpbbomvpphewrhprpabefhssuooubmxjhksqkljgglkewjkxafrorjuwlwjxyvioywztmaaruyekwuwlajfybievzchqviuueoaxosoeglxgbvlrehhnrmgmljruvygkvp
6 wirtvzltqsseidfrlezfrmaakmroyeztniyoiwwumqhuzqehlymaumrxqupxsfxmgmvoesvcgnavlamsqxbnzhesqsdsjajpowlevkwpifqlyinnifvsmyymrpfbmobrealrommitauwzxzkoohoppqwhfgfyqkdienrejptrvmaaoxwvdkmxeddfzynbiayrpfvrayjuvvcekbnfjtqyohyvkivoggovrodqyqxzbzyplmisqcreigwbjvabwoyfjfkgxssnafhicpercfievxgbbgpbqvfeeduletbmanmfckimsbeegeqrtfdmsqftqtmfwkfnjikxzipsjpbjcjncssmajqisellewvhunzgnmncplslsiuqngxecktxwzuyvwvlhdolkoarzcemluebjcvxckolwyebtxodqsbaleppqdluinwlafciqbfgfawcpsgocliyzeqxlkcwvptgicrtuffqdypeqojtfooaapvstolguhdgrwinzwxiglsxenkeghjdpitkxowqdtmekbqfpvtfrhpmebnrkvwdytzrzuigzyesyhssdaoircggxozljfrtoylsmnkkvfxk
>
Here is the web page with the data to be parsed:
url="http://www.treasury.gov/resource-center/data-chart-center/tic/Documents/mfh.txt"
download.file(url, destfile="/tmp/data")
I can download it from the web in txt format. How can I get the data as a data frame?
I think this is a very interesting question that unfortunately will probably be closed, since the OP doesn't show any effort to solve it. The question is about extracting a numeric table from a text file.
First detect the start and end of your table within the text using grep
Use read.fwf to read the fixed-width data
Change the double header to a single header using a regular expression and toString
Here is my code:
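# read the downloaded text file line by line
ll <- readLines('allo1.txt')
# locate the 'Country' header line and the 'Grand Total' line
i1 <- grep('Country',ll)
i2 <- grep('Grand Total',ll)
# read the table body as fixed-width fields (negative widths skip separator columns)
dat <- read.fwf(textConnection(ll[seq(i1+3,i2)]),
widths = c(20,-1,rep(c(7,-1),13)))
# read the two header lines with the same layout
dat.h <- read.fwf(textConnection(ll[c(i1-1,i1)]),
widths = c(20,-1,rep(c(7,-1),13)))
# merge each pair of header cells into one name, dropping whitespace and asterisks
nn <- unlist(lapply(dat.h,function(x)gsub('\\s|[*]','',toString(rev(unlist(x))))))
names(dat) <- nn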
Country, 2013,Apr 2013,Mar 2013,Feb 2013,Jan 2012,Dec 2012,Nov 2012,Oct 2012,Sep 2012,Aug 2012,Jul 2012,Jun 2012,May 2012,Apr
1 China, Mainland 1264.9 1270.3 1251.9 1214.2 1220.4 1183.1 1169.9 1153.6 1155.2 1160.0 1147.0 1164.0 1164.4
2 Japan 1100.3 1114.3 1105.5 1103.9 1111.2 1117.7 1131.9 1128.5 1120.9 1119.8 1108.4 1107.2 1087.9
3 Carib Bnkng Ctrs 4/ 273.1 283.9 280.3 271.8 266.2 263.5 273.5 261.1 263.9 247.6 244.6 243.2 237.3
4 Oil Exporters 3/ 272.7 265.1 256.8 261.6 262.0 259.1 262.2 267.2 269.1 268.4 270.2 260.6 262.2
5 Brazil 252.6 257.9 256.5 254.1 253.3 255.9 254.1 251.2 259.8 256.5 244.3 245.8 245.9
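The same three steps can be tried without downloading anything. The sketch below uses a tiny made-up fixed-width text (built with sprintf so the column widths are exact) instead of the Treasury file, so the widths, line offsets and names are assumptions for illustration only:

```r
# Made-up stand-in for the downloaded file: 12-char name column, two 7-char numeric columns
ll <- c(
  "Some title line",
  sprintf("%-12s%7s%7s", "Country", "Apr", "Mar"),
  sprintf("%-12s%7.1f%7.1f", "China", 1264.9, 1270.3),
  sprintf("%-12s%7.1f%7.1f", "Japan", 1100.3, 1114.3),
  sprintf("%-12s%7.1f%7.1f", "Grand Total", 2365.2, 2384.6),
  "Footnote text"
)
widths <- c(12, 7, 7)

# Step 1: find where the table starts and ends
i1 <- grep("^Country", ll)
i2 <- grep("^Grand Total", ll)

# Step 2: read the body as fixed-width fields
dat <- read.fwf(textConnection(ll[seq(i1 + 1, i2)]), widths = widths,
                strip.white = TRUE, stringsAsFactors = FALSE)

# Step 3: read the header line with the same widths and use it for the names
header <- read.fwf(textConnection(ll[i1]), widths = widths,
                   strip.white = TRUE, stringsAsFactors = FALSE)
names(dat) <- unname(unlist(header[1, ]))
print(dat)
```

For the real file, the widths and the number of lines between the header and the first data row have to be read off the text itself, as in the code above.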