R extract numbers while leaving blanks [duplicate] - r

This question already has answers here:
Extracting numbers from vectors of strings
(12 answers)
Closed 8 years ago.
Consider the input c("foo 1", "bar 2", "baz"). I'd like to turn this into c(1,2,NA) (basically extract the numbers from each string, or if none exist turn it into an NA). My first pass looks like this:
funNums = as.numeric(
  regmatches(x$Fun,
             regexpr('\\d+', x$Fun, perl = TRUE)))
where x$Fun is my input vector. The output I get from this though is c(1,2) since regmatches throws away things which don't match. How can I get it to include NAs?

X <- c("foo 1", "bar 2", "baz")
as.numeric(gsub("([^[:digit:]]*)", "", X))
# [1] 1 2 NA
(Do be aware that when passed a string like "1 to 2", this will return the number 12, which may not be what you'd like it to do.)
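If the "1 to 2" behaviour is a concern, one possible variant keeps the regmatches approach from the question but writes the matches back into an NA-filled vector. This is only a sketch, assuming that the first run of digits in each string is the number wanted. The names x, m and out are illustrative, and the "1 to 2" element is added here purely to show the difference.

```r
x <- c("foo 1", "bar 2", "baz", "1 to 2")

# regexpr() returns -1 for elements that contain no match
m <- regexpr("[0-9]+", x)

# Start from an all-NA vector, then fill in only the positions that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)

nums <- as.numeric(out)
print(nums)
```

With this input, nums should hold 1, 2, NA and 1, so the non-matching string stays NA and "1 to 2" yields 1 rather than 12.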


How to print data frame column names containing parentheses in R [duplicate]

Let's say I have a data.frame, like so:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
head(df,3)
returns:
  Label.1    Label.2
1       1  1.9825458
2       2 -0.4515584
3       3  0.6397516
How do I get R to stop automagically replacing the space with a period in the column name? ie, "Label 1" instead of "Label.1".
You may set check.names = FALSE in data.frame (as well as in read.table):
df <- data.frame("Label 1" = 1:3, "Label 2" = rnorm(3), check.names = FALSE)
returns:
  Label 1    Label 2
1       1  0.2013347
2       2  1.8823111
3       3 -0.5233811
From ?data.frame:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
From ?make.names:
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as ".2way" are not valid, and neither are the reserved words.
All invalid characters are translated to "."
Also, if you need to subset a variable with an 'invalid' name using $, you can use backticks `. For example:
df$`Label 1`
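The rules quoted from ?make.names can be checked directly. The sketch below is only an illustration: the input names in raw are made up to exercise each rule (a space, a leading dot followed by a digit, a reserved word, and an already valid name), and the small data frame is hypothetical.

```r
# Names chosen to exercise the rules quoted from ?make.names
raw <- c("Label 1", ".2way", "if", "ok_name")
fixed <- make.names(raw)
print(fixed)

# With check.names = FALSE the original name survives, and backticks
# (or [[ ]] with a string) are needed to reach the column
df <- data.frame("Label 1" = 1:3, check.names = FALSE)
print(names(df))
print(df$`Label 1`)
print(df[["Label 1"]])
```

The space becomes a dot, the invalid leading ".2" gets an "X" prepended, and the reserved word gets a trailing dot, while the valid name is left alone.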
You don't.
With the space you want, the name would no longer satisfy the requirements for an identifier, which come into play when you write df$column.1; that syntax cannot cope with a space. See the make.names() function for details, or this example:
> make.names(c("Foo Bar", "tic tac"))
[1] "Foo.Bar" "tic.tac"
>
Edit eleven years later: The answer still stands that R prefers column names that are valid variable names. But R is flexible: if you insist, you can use the other form, but you then need to quote the not-otherwise-valid-within-the-language column names explicitly:
> x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
> df <- data.frame("Label 1"=x,"Label 2"=rnorm(100), check.names=FALSE)
> summary( df$`Label 2` )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.2719 -0.7148 -0.0971 -0.0275  0.6559  2.5820
>
So by saying check.names=FALSE we override the default (and sensible) check, and by wrapping the identifier in backticks we can access the column.
You can change an existing data frame's names to contain spaces, e.g. using your example:
x <- c(1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10,1:10)
df <- data.frame("Label 1"=x,"Label 2"=rnorm(100))
colnames(df) <- c("Label 1", "Label 2")
head(df, 3)
returns
  Label 1    Label 2
1       1  0.2013347
2       2  1.8823111
3       3 -0.5233811
and you can still access the columns using the $ operator; you just need to use double quotes, e.g.
df$"Label 2"[1:3]
returns
[1] 0.2013347 1.8823111 -0.5233811
It seems rather inconsistent to me to auto-convert column names upon data.frame creation but not to do the same during column name alteration, but that's how R works at the moment.
names(df) <- c('Label 1', 'Label 2')

Separate a row into many using separated by [duplicate]

This question already has an answer here:
Split Strings into values in long dataframe format [duplicate]
(1 answer)
Closed 3 years ago.
Given a data frame in which a single row holds data like this:
data.frame(text = c("in this line ???? another line and ???? one more", "more lines ???? another row"))
how can each row be separated into many rows, using ???? as the separator? Here is the expected output:
data.frame(text = c("in this line", "another line and", "one more", "more lines", "another row"))
Here is a base R solution
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " \\?{4} ")))
or a more efficient version (thanks to comments by @Sotos)
dfout <- data.frame(text = unlist(strsplit(as.character(df$text),split = " ???? ", fixed = TRUE)))
such that
> dfout
              text
1     in this line
2 another line and
3         one more
4       more lines
5      another row
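Both answers above assume exactly one space on each side of the ???? separator. A slightly more tolerant sketch, which is a variation and not part of the original answers, lets the regular expression absorb any surrounding whitespace. The df below simply recreates the question's input.

```r
df <- data.frame(
  text = c("in this line ???? another line and ???? one more",
           "more lines ???? another row"),
  stringsAsFactors = FALSE
)

# Literal split, as in the answer: needs exactly " ???? "
parts_fixed <- unlist(strsplit(df$text, " ???? ", fixed = TRUE))

# Regex split: four literal question marks with optional whitespace around them
parts_regex <- unlist(strsplit(df$text, "\\s*\\?{4}\\s*"))

dfout <- data.frame(text = parts_regex, stringsAsFactors = FALSE)
print(dfout)
```

On this input the two splits should agree. The regex form only matters when the spacing around the separator is inconsistent, at the cost of losing the speed advantage of fixed = TRUE.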

Keep rows with have a specific word [duplicate]

This question already has answers here:
Filter rows which contain a certain string
(5 answers)
Closed 3 years ago.
This command keeps the rows whose value equals a specific word:
df[df$ID == "interesting", ]
If the word appears in the row but with other words around it, how can I check that the word is present and keep the row?
Example input
data.frame(text = c("interesting", " I am interesting for this", "remove"))
Expected output
data.frame(text = c("interesting", " I am interesting for this"))
Example data:
df <- data.frame(text = c("interesting", " I am interesting for this", "remove"),
                 stringsAsFactors = FALSE)
Solution using base R. Indexing using grepl:
df[grepl("interesting", df$text), ]
This returns:
[1] "interesting" " I am interesting for this"
Edit 1
Change code so that it returns a data.frame and not a vector.
df[grep("interesting", df$text), , drop = FALSE]
This now returns:
                       text
1               interesting
2  I am interesting for this
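One thing the plain pattern does not guard against is matching inside a longer word. The sketch below is an assumption about what "specific word" should mean: it adds a hypothetical "uninteresting" row to the example data and uses word boundaries so that only the standalone word is kept. drop = FALSE keeps the result a data frame, as in the edit above.

```r
df <- data.frame(
  text = c("interesting", " I am interesting for this", "remove", "uninteresting"),
  stringsAsFactors = FALSE
)

# Plain substring match: also keeps "uninteresting"
loose <- df[grepl("interesting", df$text), , drop = FALSE]

# Whole-word match: \\b marks a word boundary on each side
strict <- df[grepl("\\binteresting\\b", df$text, perl = TRUE), , drop = FALSE]

print(loose)
print(strict)
```

Whether the loose or the strict version is wanted depends on the data; the question itself does not say.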

How do I remove a word and what follows after it in an R vector?

I'm new to R and I need to remove a word and what follows it in quotation marks from a vector in a dataframe.
Here's a bit of what I have:
c("'character': 'Ted the Bellhop', 'credit_id': '52fe420dc3a36847f80001b7', 2",
"'character': 'Man', 'credit_id': '52fe420dc3a36847f800018b', 2",
"'character': 'Angela', 'credit_id': '52fe420dc3a36847f8000183', 1")
I'm working with a large dataset so I need to find a way to be able to remove 'character': and what comes after it ('Ted the Bellhop', 'Man', etc.)
I tried using fromJSON for this but it wouldn't work so that's why I chose to remove things manually.
I was able to remove a field with only numbers in it using:
x <- gsub("'cast_id': [[:digit:]]+,", "", x)
This should do it:
x <- gsub("'character': '[^']*',", "", x)
It's pretty much the same thing you did for the cast_id field, except it will remove values matching the regular expression '[^']*' instead of digits. Read this as:
[^']: any character other than '
[^']*: same as above, repeated 0 or more times
'[^']*': same as above, wrapped in single quotes
Hope this makes sense.
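Applied to the sample strings from the question, the pattern can be tried as follows. This is a minimal sketch: the second call, which also swallows the whitespace after the comma, is an assumption about the desired output and not part of the answer above.

```r
x <- c("'character': 'Ted the Bellhop', 'credit_id': '52fe420dc3a36847f80001b7', 2",
       "'character': 'Man', 'credit_id': '52fe420dc3a36847f800018b', 2",
       "'character': 'Angela', 'credit_id': '52fe420dc3a36847f8000183', 1")

# Pattern from the answer: the space that followed the comma is left behind
as_given <- gsub("'character': '[^']*',", "", x)

# Variant: \\s* also removes the whitespace after the comma
trimmed <- gsub("'character': '[^']*',\\s*", "", x)

print(as_given)
print(trimmed)
```

Note that [^']* stops at the first single quote, so a character name that itself contains an apostrophe would probably not be removed as intended and would need separate handling.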
I'm still not clear on your expected output; is this what you're after?
sub("^.+\\s(?='credit_id')", "", ss, perl = T)
#[1] "'credit_id': '52fe420dc3a36847f80001b7', 2"
#[2] "'credit_id': '52fe420dc3a36847f800018b', 2"
#[3] "'credit_id': '52fe420dc3a36847f8000183', 1"
Or perhaps this?
sub("^.+\\s('credit_id': '\\w+'),.+$", "\\1", ss, perl = T)
#[1] "'credit_id': '52fe420dc3a36847f80001b7'"
#[2] "'credit_id': '52fe420dc3a36847f800018b'"
#[3] "'credit_id': '52fe420dc3a36847f8000183'"
Sample data
ss <- c("'character': 'Ted the Bellhop', 'credit_id': '52fe420dc3a36847f80001b7', 2",
"'character': 'Man', 'credit_id': '52fe420dc3a36847f800018b', 2",
"'character': 'Angela', 'credit_id': '52fe420dc3a36847f8000183', 1")

Subsetting in R using OR condition with strings

I have a data frame with about 40 columns; the second column, data[2], contains the name of the company that the rest of the row describes. However, the names of the companies differ depending on the year (a trailing 09 for 2009 data, nothing for 2010).
I would like to be able to subset the data such that I can pull in both years at once. Here is an example of what I'm trying to do...
subset(data, data[2] == "Company Name 09" | "Company Name", drop = T)
Essentially, I'm having difficulty using the OR operator within the subset function.
However, I have tried other alternatives:
subset(data, data[[2]] == grep("Company Name", data[[2]]))
Perhaps there's an easier way to do it using a string function?
Any thoughts would be appreciated.
First of all (as Jonathan noted in his comment), to reference the second column you should use either data[[2]] or data[,2]. But if you are using subset, you can use the column name: subset(data, CompanyName == ...).
As for your question, I would do one of:
subset(data, data[[2]] %in% c("Company Name 09", "Company Name"), drop = TRUE)
subset(data, grepl("^Company Name", data[[2]]), drop = TRUE)
In the second I use grepl (introduced in R version 2.9), which returns a logical vector with TRUE for each match.
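Since the question came without data, here is a small self-contained sketch of both approaches on mock data. The column names value and name and the "Other Co" rows are invented for illustration. The regex is anchored at both ends, a slightly stricter choice than the "^Company Name" prefix above, so that only the bare name and its " 09" variant match.

```r
data <- data.frame(
  value = 1:4,
  name = c("Company Name 09", "Other Co", "Company Name", "Other Co 09"),
  stringsAsFactors = FALSE
)

# Membership test: list every acceptable spelling explicitly
by_set <- subset(data, name %in% c("Company Name 09", "Company Name"))

# Pattern test: the name, optionally followed by " 09", and nothing else
by_regex <- subset(data, grepl("^Company Name( 09)?$", name))

print(by_set)
print(by_regex)
```

The original attempt fails because | combines logical vectors, so each side must be a full comparison; %in% expresses the same "either of these" test in a single call.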
A couple of things:
1) Mock-up data is useful as we don't know exactly what you're faced with. Please supply data if possible. Maybe I misunderstood in what follows?
2) Don't use [[2]] to index your data.frame, I think [,"colname"] is much clearer
3) If the only difference is a trailing ' 09' in the name, then simply regexp that out:
R> x1 <- c("foo 09", "bar", "bar 09", "foo")
R> x2 <- gsub(" 09$", "", x1)
R> x2
[1] "foo" "bar" "bar" "foo"
R>
Now you should be able to do your subset on the on-the-fly transformed data:
R> data <- data.frame(value=1:4, name=x1)
R> subset(data, gsub(" 09$", "", name)=="foo")
  value   name
1     1 foo 09
4     4    foo
R>
You could also have replaced the name column with the regexp'ed value.
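A minimal sketch of that last suggestion, reusing the mock data from above: the year column is an assumption added here so that the information carried by the " 09" suffix (2009 versus 2010, as described in the question) is not lost when the name is cleaned.

```r
data <- data.frame(
  value = 1:4,
  name = c("foo 09", "bar", "bar 09", "foo"),
  stringsAsFactors = FALSE
)

# Record the year before stripping the suffix (hypothetical extra column)
data$year <- ifelse(grepl(" 09$", data$name), 2009, 2010)

# Overwrite the name column with the cleaned value
data$name <- sub(" 09$", "", data$name)

# An ordinary equality test now picks up both years
foo_rows <- subset(data, name == "foo")
print(foo_rows)
```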
