I want to replace several strings with one. From what I've found, gsub can do the replacement, but only for one pattern at a time.
If I pass a vector of patterns, as below, I get a warning saying that only the first one was used.
data$EVTYPE <- gsub( c("x","y") , "xy", data$EVTYPE)
I am now trying with sapply:
data$EVTYPE <- sapply(data$EVTYPE, gsub, c("x", "y"), "xy")
but it has been running for more than 5 minutes and is still processing. I expect a stack overflow message any time now. :-/ Is there an elegant, short solution for this? Is there a package I can use for this? It needs to be compact because I need to do this for several cases where I have duplicate names.
Thanks for your useful comments. I did it as Frank suggested, using a single pattern with | instead of a vector:
gsub("x|y", "xy", data$EVTYPE)
For the cold temperature case, you could use
gsub("COLD TEMPERATURES?", "COLD", data$EVTYPE)
It's worth spending a little time getting your head around the basics of regular expressions. There are lots of tutorials available.
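As a minimal sketch of the alternation idea, assuming a made-up vector evtype standing in for data$EVTYPE (the values and the dupes labels are illustrative, not from the original data), a vector of duplicate names can be collapsed into one pattern with paste():

```r
# Hypothetical stand-in for data$EVTYPE; values are made up for illustration.
evtype <- c("COLD TEMPERATURE", "COLD TEMPERATURES", "COLD", "HEAT")

# The optional trailing "S" covers singular and plural in one pattern.
cold_fixed <- gsub("COLD TEMPERATURES?", "COLD", evtype)

# Several duplicate names joined into a single anchored alternation.
dupes <- c("EXTREME HEAT", "EXCESSIVE HEAT")   # assumed duplicate labels
pattern <- paste0("^(", paste(dupes, collapse = "|"), ")$")
heat_fixed <- gsub(pattern, "HEAT",
                   c("EXTREME HEAT", "EXCESSIVE HEAT", "HEAT WAVE"))
```

The ^ and $ anchors keep the replacement from touching longer labels that merely contain one of the names. If the labels contain regex metacharacters such as ( or ., they would need escaping before being pasted into the pattern.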
I am working with a large data frame in R which includes a column containing the text content of a number of tweets. Each value starts with "RT #(account which is retweeted): ", for example "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...". I need to change each value in this column to only include the account name ("#RosannaXia"). How would I be able to do this? I understand that I may be able to do this with gsub and regular expressions (a lookbehind and a lookahead), but when I tried the following lookahead code it did not do anything (or show an error):
Unnested_rts$rt_user <- gsub("[a-z](?=:)", "", Unnested_rts$rt_user, perl=TRUE)
Is there a better way to do this? I am not sure what went wrong, but I am still a very inexperienced coder. Any help would be greatly appreciated!
You can extract everything from # till a colon (:).
x <- "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet..."
sub('RT (#.*?):.*', '\\1', x)
#[1] "#RosannaXia"
For your case, it would be:
Unnested_rts$rt_user <- sub('RT (#.*?):.*', '\\1', Unnested_rts$rt_user)
A few things:
according to Twitter, a handle can include alphanumeric characters ([A-Za-z0-9]) and underscores, so this needs to be in your pattern;
your pattern needs to capture the handle and preserve it while discarding everything else. Since we don't always know how to match everything else, we'll stick with matching what we know and use .* on either side.
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...", perl=TRUE)
# [1] "#RosannaXia"
Since you want this for the entire column, you can probably just do
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", Unnested_rts$rt_user, perl=TRUE)
The only catch is that if there is a failed match (pattern is not found), then the entire string is returned, which may not be what you want. If you want to extract what you found, then there are several techniques that use gregexpr and regmatches, or perhaps stringr::str_extract.
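A minimal sketch of the extraction route with base R's regexpr and regmatches, assuming a made-up second value that has no handle in it. Unmatched strings end up as NA instead of being returned whole:

```r
x <- c(
  "RT #RosannaXia: Here's some deep ocean wonder",
  "no retweet prefix here"   # made-up value with nothing to extract
)

# regexpr() locates the first match in each string; -1 means "no match".
m <- regexpr("#[A-Za-z0-9_]+(?=:)", x, perl = TRUE)

# Start from all-NA, then fill in only the positions that matched.
# regmatches() returns one element per successful match, in order.
handle <- rep(NA_character_, length(x))
handle[m != -1] <- regmatches(x, m)
```

For a data frame column, the same steps would apply with Unnested_rts$rt_user in place of x.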
I started with R a few weeks ago at university. We were given a problem to solve. However, I find that there are two answers that fit the question:
Verify that you created lo_heval correctly (incl. missing values). Store your verification in the object proof2.
So I find this is correct:
proof2 <- soep[1:100, c("heval", "lo_heval")]
But I think that this answer is also correct:
proof2 <- table(soep$heval, soep$lo_heval, useNA = "always")
Instead of having to choose one answer, how do I combine both into the object? I tried to use &, but I get an error. I may be using it wrong.
Prof. if you're seeing this, please don't fail me. I just can't decide between them.
Thanks in advance!
R lists can hold any arbitrary objects in them, so you could use
proof2 <- list(
soep[1:100, c("heval", "lo_heval")],
table(soep$heval, soep$lo_heval, useNA = "always")
)
However, to my mind 100 rows of two columns isn't proof - it's an exercise to look through those and verify things are right. (And what about the rows past 100? It's a decent spot check, but if there are more rows in the data it is strong evidence rather than proof.) The table approach, on the other hand, seems succinct and effective.
I am not too great at regexes and have been stuck on this problem for a while now. I have biological taxonomic information stored as strings in a "taxonomyString" column in a dataframe. The strings look like this:
“domain;kingdom;phylum;class;order;genus;species”
My goal is to split each string into its taxonomic levels, one column per level (e.g., “domain” into a "Domain" column). I have accomplished this using the following (very long) code:
taxa_data_six <- taxa_data %>% filter(str_count(taxonomyString, pattern = ";") == 6) %>%
  tidyr::extract(taxonomyString, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), regex = "([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+)")
I had to include a lot of different possible characters in between the semicolons because some of the taxa had [brackets] around the name, etc.
Besides being cumbersome, after running through my code, I have found there to be some errors in the taxonomyString, which I would like to clean up.
Sometimes, a class name is broken up by semicolons, e.g., what should be incertae sedis; is actually incertae;sedis;. These kinds of errors are throwing off my code, which assumes that the first semicolon always denotes the domain, the second, the kingdom, and so on.
In any case, my question is simple, but has been giving me a lot of grief. I would like to be able to group each taxonomyString by semicolons, e.g., group 1 is domain;, group 2 is kingdom;, so that I can refer back to them in another call and correct the errors. In the case of incertae;sedis;, I should be able to call group 4 and merge it with group 5. I have looked online for how to refer back to capture groups in R, and from what I've seen str_match seems to be the most efficient way to do this; however, I am uncertain why my ([:alnum:]*;) regex is not capturing the groups in str_match. I have tried different variations of this regex (with parentheses in different places), but I am stuck.
I am wondering if someone can help me write the str_match() function that will accomplish my goal.
Any help would be appreciated.
Edit
At this point, it seems like I should go with Wiktor's recommendation and simply split the strings on the semicolons, and then fix the errors. Would anyone be able to split the strings into their own columns?
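One possible way to do the split in base R is sketched below. It assumes made-up placeholder strings (d1, k1, ...) rather than real taxa, a hypothetical helper fix_incertae, and seven levels named after the example string in the question:

```r
# Made-up strings; the second has "incertae sedis" wrongly split in two.
taxonomy <- c(
  "d1;k1;p1;c1;o1;g1;s1",
  "d2;k2;p2;incertae;sedis;o2;g2;s2"
)

# One character vector of fields per string.
parts <- strsplit(taxonomy, ";", fixed = TRUE)

# Hypothetical repair step: glue "incertae" and a following "sedis" together.
fix_incertae <- function(p) {
  i <- which(p == "incertae" & c(p[-1], "") == "sedis")
  if (length(i) > 0) {
    p[i] <- "incertae sedis"
    p <- p[-(i + 1)]
  }
  p
}
parts <- lapply(parts, fix_incertae)

# Pad every vector with NA to a common length, then bind one row per string.
n <- max(lengths(parts))
mat <- t(vapply(parts, function(p) { length(p) <- n; p }, character(n)))
taxa <- as.data.frame(mat, stringsAsFactors = FALSE)

# Level names assumed from the example string; this requires n == 7.
names(taxa) <- c("Domain", "Kingdom", "Phylum", "Class",
                 "Order", "Genus", "Species")
```

Splitting first and repairing the pieces afterwards avoids having to list every permitted character in a regex, since anything other than a semicolon stays inside its field.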
New to R, taking a very accelerated class with very minimal instruction. So I apologize in advance if this is a rookie question.
The assignment I have is to take a specific column that has 21 levels from a dataframe, and condense them into 4 levels, using an if, or ifelse statement. I've tried what feels like hundreds of combinations, but this is the code that seemed most promising:
> b2$LANDFORM=ifelse(b2$LANDFORM=="af","af_type",
ifelse(b2$LANDFORM=="aflb","af_type",
ifelse(b2$LANDFORM=="afub","af_type",
ifelse(b2$LANDFORD=="afwb","af_type",
ifelse(b2$LANDFORM=="afws","af_type",
ifelse(b2$LANDFORM=="bfr","bf_type",
ifelse(b2$LANDFORM=="bfrlb","bf_type",
ifelse(b2$LANDFORM=="bfrwb","bf_type",
ifelse(b2$LANDFORM=="bfrwbws","bf_type",
ifelse(b2$LANDFORM=="bfrws","bf_type",
ifelse(b2$LANDFORM=="lb","lb_type",
ifelse(bs$LANDFORM=="lbaf","lb_type",
ifelse(b2$LANDFORM=="lbub","lb_type",
ifelse(b2$LANDFORM=="lbwb","lb_type","ws_type"))))))))))))))
LANDFORM is a factor, but I tried changing it to character too, and the code still didn't work.
"ws_type" is the catch-all for the remaining values.
The code runs without errors, but when I check it, all I get is:
> unique(b2$LANDFORM)
[1] NA "af_type"
Am I even on the right path? Any suggestions? Should I bite the bullet and make a new column with substr()? Thanks in advance.
If your new levels are just the first two letters of the old ones followed by _type you can easily achieve what you want through:
#prototype of your column
mycol<-factor(sample(c("aflb","afub","afwb","afws","bfrlb","bfrwb","bfrws","lb","lbwb","lbws","wslb","wsub"), replace=TRUE, size=100))
as.factor(paste(sep="",substr(mycol,1,2),"_type"))
After a great deal of experimenting, I consulted a co-worker, and he was able to simplify a huge amount of this. Basically, I should have made a new column composed of the first two letters of the variables in LANDFORM, and then sample from that new column and replace values in LANDFORM, in order to make the ifelse() statement much shorter. The code is:
> b2$index=as.factor(substring(b2$LANDFORM,1,2))
b2$LANDFORM=ifelse(b2$index=="af","af_type",
ifelse(b2$index=="bf","bf_type",
ifelse(b2$index=="lb","lb_type",
ifelse(b2$index=="wb","wb_type",
ifelse(b2$index=="ws","ws_type","ub_type")))))
b2$LANDFORM=as.factor(b2$LANDFORM)
Thanks to everyone who gave me some guidance!
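As far as can be told from the code shown, a likely cause of the NA values in the original nested call is the misspelled b2$LANDFORD. A column that does not exist is NULL, so that inner ifelse() gets a zero-length test, and every row not already matched by an earlier condition is filled with NA. A named lookup vector avoids the nesting altogether. The sketch below assumes a made-up landform vector standing in for b2$LANDFORM and the four target levels from the original question:

```r
# Made-up stand-in for b2$LANDFORM.
landform <- factor(c("af", "aflb", "bfr", "bfrws", "lb", "lbub", "ws", "wsub"))

# Names are the two-letter prefixes, values are the new levels.
lookup <- c(af = "af_type", bf = "bf_type", lb = "lb_type")

# Index the lookup by prefix; prefixes not in the lookup come back as NA.
prefix <- substr(as.character(landform), 1, 2)
new_level <- unname(lookup[prefix])

# Everything else falls into the catch-all, as in the question.
new_level[is.na(new_level)] <- "ws_type"
landform4 <- factor(new_level)
```

Adding a group then means adding one entry to lookup rather than another nested ifelse().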
The background to this is that I'm mostly a Python programmer who has some passing familiarity with R. I've been tasked to look at an R script that was written by a Perl programmer who used for and while loops a lot, to see if I can make it more R-like and get it to run faster.
For example purposes, I have the following list:
> lnums <- list(1:5, 6:7, 8:12)
For the elements that have a length less than 5 (lnums[[2]]), I want to change the length to be 5. The original code uses a for loop to tack NA values to the end of any shorter vectors, and I know that there's got to be a better way than that. I was playing around with ways to get to it and came up with
> sapply(lnums, FUN=function(x) length(x) < 5)
which gets the right element, but I'm unable to figure out how to incorporate this into the subscript of a length(lnums[]) <- 5 statement. I know this is probably a really novice question, but I'd appreciate any help I can get.
Additionally, the reason that I want to increase the length of the shorter list elements is so that I can put the list into a data frame. It would be great if there was a way to do that without messing around with lengths, although I still wouldn't mind an answer to my first question to satisfy my curiosity if nothing else.
Thanks all. I've been digging through some topics in here and you've already helped me out quite a bit!
Here's one way:
lapply(lnums, 'length<-', 5)
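To sketch how this connects to the data frame goal: the replacement function `length<-` pads a vector with NA when the new length is larger, so the padded list can be converted directly. The column names V1 to V3 and the use of the longest element as the target length are assumptions made for illustration:

```r
lnums <- list(1:5, 6:7, 8:12)

# Pad each element with NA up to the length of the longest one.
max_len <- max(lengths(lnums))
padded <- lapply(lnums, `length<-`, max_len)

# Name the elements so as.data.frame() has column names to use.
names(padded) <- paste0("V", seq_along(padded))
df <- as.data.frame(padded)
```

Using max(lengths(lnums)) instead of a hard-coded 5 keeps the padding correct if a longer element shows up later.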