I have a mystery character in my dataframe in R:
df <- structure(list(`ID21` = c("23", "44"),
ID22 = c("53", "23"), `Drug-na�ve_D22` = c("53",
"45")), row.names = 1:2, class = "data.frame")
> df
ID21 ID22 Drug-na�ve_D22
1 23 53 53
2 44 23 45
What's the best way to remove this character? Would some sort of gsub with regular expression work?
In this example I've replaced it with the letter i:
> df
ID21 ID22 Drug-naive_D22
1 23 53 53
2 44 23 45
To remove any non-word characters (anything other than letters, numbers and underscore) from your column names, which also strips the hyphen:
names(df) <- gsub("\\W", "", names(df))
If you want to replace those characters with a different character instead, put it in the second argument.
To match any non-ASCII character you can use this pattern:
[^ -~]
So, for example, if you want to replace the char by i, you can use sub thus:
sub("[^ -~]", "i", names(df))
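Note that sub only returns the modified names; to change the data frame, the result has to be assigned back. A minimal sketch, assuming the mystery character is a single non-ASCII character (written here as the escape \u00ef, which is only a guess at the original data) and using a hypothetical data frame df_demo:

```r
# Hypothetical data frame; \u00ef stands in for the unknown non-ASCII character
df_demo <- data.frame(a = c("23", "44"), b = c("53", "23"), c = c("53", "45"),
                      stringsAsFactors = FALSE)
names(df_demo) <- c("ID21", "ID22", "Drug-na\u00efve_D22")

# Replace every character outside the printable ASCII range (space to ~) with "i"
# and assign the cleaned names back to the data frame
names(df_demo) <- gsub("[^ -~]", "i", names(df_demo))
names(df_demo)
```

If the name holds bytes that are invalid in the current encoding, gsub may complain about an invalid input string; passing useBytes = TRUE is one thing to try in that case.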
I have the following data
v1
19956673-1
20043747-23
20056956-1
36628-2
45820-4
478
115
I need to add leading zeros to both string fields (before and after the dash) so that the desired output (v2) has 8 digits before the dash and 2 digits after. Data with no dash should be passed through as is.
v1 v2
19956673-1 19956673-01
20043747-23 20043747-23
20056956-1 20056956-01
36628-2 00036628-02
45820-4 00045820-04
478 478
115 115
Here is an option to extract the part after the -, then use sprintf to zero-pad it to two digits (the part before the dash is left unchanged)
i1 <- grep('-', df1$v1)
df1$v2 <- df1$v1
df1$v2[i1] <- sprintf('%s-%02d', sub('-.*', '', df1$v1[i1]),
as.numeric(sub('.*-', '', df1$v1[i1])))
-output
df1
# v1 v2
#1 19956673-1 19956673-01
#2 20043747-23 20043747-23
#3 20056956-1 20056956-01
#4 36628-2 36628-02
#5 45820-4 45820-04
#6 478 478
#7 115 115
Another option is a regex with capture groups: match the digits (\\d+) from the start (^) of the string and capture them as a group ((...)), followed by a -, then capture the single digit (\\d) at the end ($). Replace with the backreferences to the captured groups, inserting a 0 before the second one
df1$v2 <- sub('^(\\d+)-(\\d)$', '\\1-0\\2', df1$v1)
data
df1 <- structure(list(v1 = c("19956673-1", "20043747-23", "20056956-1",
"36628-2", "45820-4", "478", "115")), row.names = c(NA, -7L),
class = "data.frame")
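Both options above pad only the part after the dash. A minimal sketch of padding both parts, assuming each part is purely numeric and small enough to fit in an integer (the names has_dash, left and right are illustrative):

```r
v1 <- c("19956673-1", "20043747-23", "20056956-1",
        "36628-2", "45820-4", "478", "115")

# Only entries that contain a dash are reformatted
has_dash <- grepl("-", v1, fixed = TRUE)

# Split each dashed entry into the part before and the part after the dash
left  <- as.integer(sub("-.*", "", v1[has_dash]))
right <- as.integer(sub(".*-", "", v1[has_dash]))

# %08d and %02d zero-pad to 8 and 2 digits respectively
v2 <- v1
v2[has_dash] <- sprintf("%08d-%02d", left, right)
v2
```

Entries without a dash are copied over untouched, as the question asks.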
A solution with sub and positive lookbehind:
v2 <- sub("(?<=-)(\\d)$", "0\\1", v1, perl = TRUE)
Result:
v2
[1] "19956673-01" "20043747-23" "20056956-01" "36628-02" "45820-04"
How this works:
(?<=-): positive lookbehind: "if you see a - on the left ...
(\\d)$: ... then remember (\\1) the single digit ((\\d)) right at the end of the string ($) and add a 0 to the left of it"
Data:
v1 <- c("19956673-1", "20043747-23", "20056956-1", "36628-2", "45820-4")
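As a quick check, the lookbehind version and the capture-group version should agree, including on entries that already have two digits after the dash or have no dash at all. A small sketch (the extra "478" entry and the variable names are added here for illustration):

```r
v1 <- c("19956673-1", "20043747-23", "20056956-1", "36628-2", "45820-4", "478")

# Capture-group version: rewrite "digits-digit" as "digits-0digit"
by_group <- sub("^(\\d+)-(\\d)$", "\\1-0\\2", v1)

# Lookbehind version: only the final digit is matched, the dash is just checked
by_lookbehind <- sub("(?<=-)(\\d)$", "0\\1", v1, perl = TRUE)

identical(by_group, by_lookbehind)
```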
I have a data table like this:
id number
1 5562,4024,...,1213
2 4244,4214,...,244
3 424,4213
4 1213,441
...
And I want to keep only the last value in each entry of the number column, which should look like:
id number
1 1213
2 244
3 4213
4 441
...
So what should I do to achieve that?
One option is to capture the digits at the end ($) of the string that follow a , as a group and replace the whole string with the backreference (\\1) of the captured group
df$number <- as.numeric(sub(".*,(\\d+)$", "\\1", df$number))
Or match the characters (.*) until the , and replace it with blank ("")
df$number <- as.numeric(sub(".*,", "", df$number))
data
df <- structure(list(id = 1:4, number = c("5562,4024,...,1213",
"4244,4214,...,244",
"424,4213", "1213,441")), class = "data.frame", row.names = c(NA,
-4L))
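The second pattern works because .* is greedy: it consumes as much as possible, so ".*," matches up to the last comma. A minimal sketch comparing it with an explicit split, on made-up values (number, last_greedy and last_split are illustrative names):

```r
number <- c("5562,4024,1213", "424,4213", "441")

# Greedy .* runs to the last comma; an entry without a comma is left unchanged
last_greedy <- sub(".*,", "", number)

# Same result by splitting on commas and keeping the final piece
pieces <- strsplit(number, ",", fixed = TRUE)
last_split <- vapply(pieces, function(p) p[length(p)], character(1))

as.numeric(last_greedy)
```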
I have a set of data where some elements are preceded by "<" and I need to remove "<" so that I can perform some data analysis. The data is saved in a .txt file and I'm bringing it into R using read.table. Below is an example of what the text file looks like.
Background: 18 <10 27 22 <3
Site: 30 44 23 <16 13
I used x=read.table to make a dataframe, then tried gsub("<","",x) to remove the "<", and the result is something completely unexpected, at least to me. This is what I get as a result.
[1] "1:2" "c(18, 30)" "1:2" "c(27, 23)" "c(2, 1)" "1:2"
I have no idea what that means or why it's happening. I would greatly appreciate explanation both of what is going on here, and how I should go about accomplishing my goal.
df <- read.table(header = TRUE, text = "Background Site
18 30
<10 44
27 23
22 <16
<3 13", stringsAsFactors = FALSE)
You can use mutate_at to apply gsub to the variables (i.e. Background and Site) from which you wish to remove the leading < sign.
library(dplyr)
df %>% mutate_at(vars(Background, Site),
funs(as.numeric(gsub("^<", "", .))))
The output is:
Background Site
1 18 30
2 10 44
3 27 23
4 22 16
5 3 13
Read the file with readLines, perform the gsub and then re-read it with read.table. No packages are used:
read.table(text = gsub("<", "", readLines("myfile")), as.is = TRUE)
If the data does not come from a file but is already in a data frame DF, then define a clean function which cleans one column of DF and apply it to each data column (every column except the first):
clean <- function(x) as.numeric(gsub("<", "", x))
DF[-1] <- lapply(DF[-1], clean)
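As for why gsub on the whole data frame gave such odd output: gsub coerces its input with as.character, and for a data frame (a list of columns) that produces one deparsed string per column rather than one string per cell. A minimal sketch on the example data, assuming the columns were read as character (x and collapsed are illustrative names):

```r
x <- data.frame(Background = c("18", "<10", "27", "22", "<3"),
                Site = c("30", "44", "23", "<16", "13"),
                stringsAsFactors = FALSE)

# gsub on the data frame itself: each column collapses into a single string
collapsed <- gsub("<", "", x)
length(collapsed)  # one element per column, not per cell

# Applying gsub column by column keeps the shape of the data
x[] <- lapply(x, function(col) as.numeric(gsub("<", "", col, fixed = TRUE)))
x
```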
I have a string having value as given below separated by vertical bar.
String1 <- "5|10|25|25|10|10|10|5"
String2 <- "5|10|25|25"
Is there a direct function to get the sum of the numbers in each string?
In this case it should be 100 for String1 and 65 for String2. I also have a character vector of such strings:
>chk
chk
1 5|10|25|25|10|10|10|5
2 5|55|20|5|5|5|5
3 6
4 Not Available
> sum(scan(text=gsub("\\Not Available\\b", "NA", chk$chk), sep="|", what = numeric(), quiet=TRUE), na.rm = TRUE)
[1] 206
when it should be
[1] 100 100 6 NA
We can do a scan and then sum
sum(scan(text=String1, sep="|", what = numeric(), quiet=TRUE))
For multiple vectors, place it in a list and do the same operation
sapply(mget(paste0("String", 1:2)), function(x)
sum(scan(text=x, sep="|", what=numeric(), quiet=TRUE)))
# String1 String2
# 100 65
Another option is eval(parse( (not recommended though) after replacing the | with +
eval(parse(text=gsub("[|]", "+", String1)))
#[1] 100
Or, as @thelatemail mentioned in the comments, rebind | to + with an assignment (<-) and then do the eval(parse(..
`|` <- `+`
eval(parse(text=String1))
#[1] 100
If we have a data.frame column with strings, then it may be better to split by | into a list of vectors, convert the vectors to numeric (all the non-numeric elements are coerced to NA with a warning), and get the sum with na.rm=TRUE. A row with no numbers at all therefore sums to 0
sapply(strsplit(as.character(chk$chk), "[|]"),
function(x) sum(as.numeric(x), na.rm=TRUE))
#[1] 100 100 6 0
NOTE: The as.character is not needed if the 'chk' column is already a character class
Otherwise, if we are using scan or eval(parse, it should be done for each element.
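The question asks for NA rather than 0 when an entry has no numbers. A minimal sketch of one way to get that, assuming entries are either |-separated numbers or free text (sum_or_na is a hypothetical helper, not part of the answers above):

```r
chk <- c("5|10|25|25|10|10|10|5", "5|55|20|5|5|5|5", "6", "Not Available")

# Hypothetical helper: sum the numeric pieces of one string,
# returning NA when none of the pieces is numeric
sum_or_na <- function(s) {
  vals <- suppressWarnings(as.numeric(strsplit(s, "|", fixed = TRUE)[[1]]))
  if (all(is.na(vals))) NA_real_ else sum(vals, na.rm = TRUE)
}

res <- vapply(chk, sum_or_na, numeric(1), USE.NAMES = FALSE)
res
```

Using fixed = TRUE avoids having to escape |, which is the alternation operator in a regular expression.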
We can extract all the numbers from the string and then sum over it
library(stringr)
sum(as.numeric(unlist(str_match_all(String1, "[0-9]+"))))
#[1] 100
sum(as.numeric(unlist(str_match_all(String2, "[0-9]+"))))
#[1] 65
For multiple vectors we can keep it in a list
sapply(list(String1, String2), function(x)
sum(as.numeric(unlist(str_match_all(x, "[0-9]+")))))
#[1] 100 65
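The same extraction can be done without stringr. A minimal base R sketch using gregexpr and regmatches (sum_digits is a hypothetical helper name):

```r
String1 <- "5|10|25|25|10|10|10|5"
String2 <- "5|10|25|25"

# Hypothetical helper: pull out every run of digits and add them up
sum_digits <- function(s) {
  runs <- regmatches(s, gregexpr("[0-9]+", s))[[1]]
  sum(as.numeric(runs))
}

totals <- vapply(c(String1, String2), sum_digits, numeric(1), USE.NAMES = FALSE)
totals
```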
I have a column of data with the following types of dates and number entries:
16-Jun
21-01A
7-04
Aug-99
5-09
I want to convert these all into numbers, by doing two things. First, where the data have a number before a dash (as in the first three examples), I want to trim the data from the dash onwards. So the entries would appear 16, 21 and 7.
Second, where the entry is written in month-date format (e.g. Aug-99), I want to convert that to the number of the month and then trim it. So this example would be converted to 8-99, then trimmed to just 8.
How can I do this in R? When I use grep, sub and match commands, as in the answer below, I get:
[1] 16 21 7 5 8
When I am after: [1] 16 21 7 8 5
We use grepl to flag the elements that start with a letter ('i1'), and sub to remove the substring from the - to the end of the string ('v2'). We then convert the elements of 'v2' that do not start with a letter to numeric, match the ones that do against month.abb to get the month index, and concatenate the two results. Note that concatenating puts all the month entries last, so the original order is lost.
i1 <- grepl("^[A-Z]", v1)
v2 <- sub("-.*", "", v1)
c(as.numeric(v2[!i1]), match(v2[i1], month.abb))
#[1] 16 21 7 8
For the new dataset, we can use ifelse
i1 <- grepl("^[A-Z]", df1$v1)
v2 <- sub("-.*", "", df1$v1)
as.numeric(ifelse(i1, match(v2, month.abb), v2))
#[1] 16 21 7 8 5
data
v1 <- c('16-Jun','21-01A','7-04','Aug-99')
df1 <- structure(list(v1 = c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09"
)), .Names = "v1", class = "data.frame", row.names = c(NA, -5L))
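An alternative that keeps the original order without ifelse is to fill a result vector by position. A minimal sketch, assuming every prefix is either an English month abbreviation from month.abb or a whole number (prefix, is_month and out are illustrative names):

```r
v1 <- c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09")

# Keep only the part before the dash
prefix <- sub("-.*", "", v1)

# Flag prefixes that are month abbreviations such as "Aug"
is_month <- prefix %in% month.abb

# Fill the result in place so each value stays at its original position
out <- integer(length(v1))
out[is_month]  <- match(prefix[is_month], month.abb)
out[!is_month] <- as.integer(prefix[!is_month])
out
```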