I have a column of data with the following types of dates and number entries:
16-Jun
21-01A
7-04
Aug-99
5-09
I want to convert these all into numbers by doing two things. First, where the data have a number before a dash (as in the first three examples), I want to trim the data from the dash onwards, so those entries become 16, 21 and 7.
Second, where the entry is in month-date format (e.g. Aug-99), I want to convert the month to its number and then trim, so this example would first become 8-99 and then just 8.
How can I do this in R? When I use the grep, sub and match commands, as in the answer below, I get:
[1] 16 21 7 5 8
when I am after:
[1] 16 21 7 8 5
We use grepl to find the elements that start with a letter ('i1'). We remove the substring from the - to the end of the string with sub. Then we subset 'v2' based on 'i1': convert the numeric-first entries to numeric, match the letter-first ones against month.abb to get the month number, and concatenate the output:
i1 <- grepl("^[A-Z]", v1)
v2 <- sub("-.*", "", v1)
c(as.numeric(v2[!i1]), match(v2[i1], month.abb))
#[1] 16 21 7 8
For the new dataset, we can use ifelse
i1 <- grepl("^[A-Z]", df1$v1)
v2 <- sub("-.*", "", df1$v1)
as.numeric(ifelse(i1, match(v2, month.abb), v2))
#[1] 16 21 7 8 5
data
v1 <- c('16-Jun','21-01A','7-04','Aug-99')
df1 <- structure(list(v1 = c("16-Jun", "21-01A", "7-04", "Aug-99", "5-09"
)), .Names = "v1", class = "data.frame", row.names = c(NA, -5L))
Related
I have a dataframe that looks like the below:
BaseRating contRating Participant
5,4,6,3,2,4 5 01
4 4 01
I would first like to run some code that checks whether there are any commas in the dataframe and returns the column number where they occur. I have tried some of the solutions in the questions below, which don't seem to work when looking for a comma instead of a whole string/value. I'm probably missing something simple here, but any help is appreciated!
Selecting data frame rows based on partial string match in a column
Filter rows which contain a certain string
Check if value is in data frame
Having determined whether there are commas in my data, I then want to extract just the last number in the comma-separated list in that entry and replace the entry with that value. For instance, I want the first row of the BaseRating column to become '4', because it is last in that list.
Is there a way to do this in R without manually changing the number?
A possible solution is below.
EXPLANATION
In what follows, I will explain the regular expression used in the str_extract function, as asked for by @milsandhills:
The symbol | in the middle means the logical OR operator.
We use that because BaseRating can have multiple numbers or only one number, hence the need for | to treat each case separately.
The left-hand side of | matches a number formed by one or more digits (\\d+) that spans the entire string (anchored by ^ and $).
The right-hand side of | matches a number formed by one or more digits (\\d+) at the end of the string ($), where the lookbehind (?<=\\,) guarantees that the number is preceded by a comma.
You can find more details at the stringr cheat sheet.
library(tidyverse)
df <- data.frame(
BaseRating = c("5,4,6,3,2,4", "4"),
contRating = c(5L, 4L),
Participant = c(1L, 1L)
)
df %>%
mutate(BaseRating = sapply(BaseRating,
function(x) str_extract(x, "^\\d+$|(?<=\\,)\\d+$") %>% as.integer))
#> BaseRating contRating Participant
#> 1 4 5 1
#> 2 4 4 1
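Since str_extract() is vectorised over its input, the sapply() wrapper is not strictly necessary; a slightly shorter equivalent (a sketch of the same idea) is:

```r
library(tidyverse)

df <- data.frame(
  BaseRating = c("5,4,6,3,2,4", "4"),
  contRating = c(5L, 4L),
  Participant = c(1L, 1L)
)

# str_extract() works directly on the whole column, no sapply() needed
df %>%
  mutate(BaseRating = as.integer(str_extract(BaseRating, "^\\d+$|(?<=\\,)\\d+$")))
#>   BaseRating contRating Participant
#> 1          4          5           1
#> 2          4          4           1
```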
Or:
library(tidyverse)
df %>%
separate_rows(BaseRating, sep = ",", convert = TRUE) %>%
group_by(contRating, Participant) %>%
summarise(BaseRating = last(BaseRating), .groups = "drop") %>%
relocate(BaseRating, .before = 1)
#> # A tibble: 2 × 3
#> BaseRating contRating Participant
#> <int> <int> <int>
#> 1 4 4 1
#> 2 4 5 1
If we want a quick option, we can use trimws from base R
df$BaseRating <- as.numeric(trimws(df$BaseRating, whitespace = ".*,"))
Output:
> df
BaseRating contRating Participant
1 4 5 1
2 4 4 1
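Under the hood, trimws() with a custom whitespace pattern builds a sub() call, so an equivalent base R one-liner (a sketch of the same idea) is:

```r
df <- data.frame(
  BaseRating = c("5,4,6,3,2,4", "4"),
  contRating = c(5L, 4L),
  Participant = c(1L, 1L)
)

# the greedy ".*," consumes everything up to the last comma, leaving the final number
df$BaseRating <- as.numeric(sub("^.*,", "", df$BaseRating))
df$BaseRating
# [1] 4 4
```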
Or another option is stri_extract_last_regex from stringi:
library(stringi)
df$BaseRating <- as.numeric(stri_extract_last_regex(df$BaseRating, "\\d+"))
data
df <- structure(list(BaseRating = c("5,4,6,3,2,4", "4"), contRating = 5:4,
Participant = c(1L, 1L)), class = "data.frame", row.names = c(NA,
-2L))
['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, i.e. 4 here as output. I want to do this for an entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or we may extract the alphanumeric characters into a list and get the lengths:
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
I have two data frames, d1 and d2, containing census data from 2010. I want to merge them using a common attribute:
merge(d1, d2, by.x = "GEOID", by.y = "GISJOIN")
d1 has the common id as GEOID (e.g. 310019654001) while d2 has the same id attribute as GISJOIN (e.g. 31000109654001). I need to remove the "0" at the 3rd and 7th positions of the GISJOIN attribute. How can I do that in R?
I split the values using
splitted <- as.data.frame(t(sapply(d2$GISJOIN, function(x) substring(x, first=c(1,4,8), last=c(2,6,14)))))
splitted$v4 <- (paste(splitted$V1, splitted$V2, splitted$V3))
v4 contains character values; when I apply as.numeric it gives me:
Warning message:
NAs introduced by coercion
Too long to type as a comment. Using the only example you have provided, plus another invented example, this shows that you don't need the sapply():
d2 = data.frame(GISJOIN=c("31000109654001","12345678910112"))
d2$GISJOIN = as.character(d2$GISJOIN)
What you have now:
splitted <- as.data.frame(t(sapply(d2$GISJOIN, function(x) substring(x, first=c(1,4,8), last=c(2,6,14)))))
splitted$v4 <- (paste(splitted$V1, splitted$V2, splitted$V3))
V1 V2 V3 v4
31000109654001 31 001 9654001 31 001 9654001
12345678910112 12 456 8910112 12 456 8910112
The new string still has spaces in between (paste defaults to sep = " "), hence converting with as.numeric() gives NA. Below I just split the string into characters and exclude positions 3 and 7:
d2$new = lapply(strsplit(d2$GISJOIN,""),function(i){
paste(i[-c(3,7)],collapse="")
})
as.numeric(d2$new)
[1] 310019654001 124568910112
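If the characters to drop always sit at the same fixed positions, a capture-group sub() avoids the split entirely. This is a sketch that assumes every GISJOIN has the surplus characters at exactly positions 3 and 7:

```r
d2 <- data.frame(GISJOIN = c("31000109654001", "12345678910112"),
                 stringsAsFactors = FALSE)

# keep chars 1-2, drop char 3, keep chars 4-6, drop char 7, keep the rest
d2$new <- sub("^(.{2}).(.{3}).(.*)$", "\\1\\2\\3", d2$GISJOIN)
d2$new
# [1] "310019654001" "124568910112"
```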
I'm trying to learn R, and a sample problem asks to reverse only the part of a string that is in alphabetical order:
String: "abctextdefgtext"
StringNew: "cbatextgfedtext"
Is there a way to identify alphabetical patterns to do this?
Here is one approach with base R, based on the pattern shown in the example. We split the string into individual characters ('v1') and use match to find the position of each character in the alphabet (letters). We take the difference of those indices, check whether it equals 1, and apply cumsum to the resulting logical vector to create a grouping variable ('i1'). We then reverse (rev) 'v1' within each group using ave and, finally, paste the characters together to get the expected output:
v1 <- strsplit(str1, "")[[1]]
i1 <- cumsum(c(TRUE, diff(match(v1, letters)) != 1L))
paste(ave(v1, i1, FUN = rev), collapse="")
#[1] "cbatextgfedtext"
Or, as @alexislaz mentioned in the comments:
v1 = as.integer(charToRaw(str1))
rawToChar(as.raw(ave(v1, cumsum(c(TRUE, diff(v1) != 1L)), FUN = rev)))
#[1] "cbatextgfedtext"
EDIT:
1) A mistake was corrected based on @alexislaz's comments.
2) Updated with another method suggested by @alexislaz in the comments.
data
str1 <- "abctextdefgtext"
You could do this in base R
vec <- match(unlist(strsplit(s, "")), letters)
x <- c(0, which(diff(vec) != 1), length(vec))
newvec <- unlist(sapply(seq(length(x) - 1), function(i) rev(vec[(x[i]+1):x[i+1]])))
paste0(letters[newvec], collapse = "")
#[1] "cbatextgfedtext"
Where s <- "abctextdefgtext"
First you find the position of each letter in the alphabet ([1] 1 2 3 20 5 24 20 4 5 6 7 20 5 24 20).
Having the positions in hand, you look for consecutive numbers and, when found, reverse that sequence. ([1] 3 2 1 20 5 24 20 7 6 5 4 20 5 24 20)
Finally, you get the letters back in the last line.
I asked a question earlier which was probably not that clear, so I will try to explain it in an understandable way. My data looks like this:
# V1 V2 V3
#1 Q9UNZ5 Q9Y2W1
#2 Q9ULV4;Q6QEF8
#3 Q9UNZ5
#4 Q9H6F5
#5 Q9H2K0 Q9ULV4;Q6QEF8
#6 Q9GZZ1 Q9UKD2
#7 Q9H6F5 Q9GZZ1 Q9GZZ1
#8 Q9GZZ1 Q9NYF8
#9 Q9BWS9
I want to remove the duplicated strings across all of the columns.
For example, in V1 every string appears for the first time, so we don't remove anything; we just arrange them to get:
Q9ULV4
Q6QEF8
Q9H6F5
Q9GZZ1
Q9BWS9
Then we check the second column's strings against the first column and remove those that are repeated, again arranging the result. For the third column we check the strings against the first and second columns; if a string has already appeared, we remove it and then arrange. So the output should look like below.
Q9ULV4 Q9UNZ5 Q9Y2W1
Q6QEF8 Q9H2K0 Q9UKD2
Q9H6F5 Q9NYF8
Q9GZZ1
Q9BWS9
It is not similar to any question I have asked before; if it is still not clear, please just comment and I will try to explain further.
I would approach this in two steps:
1) get unique elements per column and convert to list:
l <- lapply(df, function(x) unique(unlist(strsplit(as.character(x), ";"))))
2) remove duplicates that appear in any previous columns
for(i in seq_along(l)) {
l[[i]] <- setdiff(l[[i]], unlist(l[seq_len(i-1L)]))
}
The reason why I use a list instead of a data.frame is because data.frames require all columns to have the same number of rows, which is not the case here (unless you fill them with NA or empty strings). In such cases, a list structure is the way to go.
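If a rectangular result is needed after all, the list can be padded with NA up to the longest column via `length<-` (a sketch, using a list shaped like the one produced above):

```r
# list with columns of unequal length, as produced by the two steps above
l <- list(V1 = c("Q9ULV4", "Q6QEF8", "Q9H6F5", "Q9GZZ1", "Q9BWS9"),
          V2 = c("Q9UNZ5", "Q9H2K0"),
          V3 = c("Q9Y2W1", "Q9UKD2", "Q9NYF8"))

# pad every element to the maximum length with NA, then bind into a data.frame
padded <- lapply(l, `length<-`, max(lengths(l)))
as.data.frame(padded, stringsAsFactors = FALSE)
# the shorter columns V2 and V3 are filled out with NA
```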
The first line converts df to a list L. The second line creates a long-form data frame long containing the values in column 1 and the df column names in column 2 as a factor. Making it a factor is needed because the levels preserve all column names, including the ones subsequently eliminated for containing only duplicates, and it also preserves the order of the column names. The last line removes duplicates, producing long0. No packages are used.
L <- lapply(df, function(x) unlist(strsplit(as.character(x), ";")))
long <- transform(stack(L), ind = factor(as.character(ind), levels = names(df)))
long0 <- subset(long, !duplicated(values))
Now we consider three possible forms of output:
1) long form data frame
> long0
values ind
1 Q9ULV4 V1
2 Q6QEF8 V1
3 Q9H6F5 V1
4 Q9GZZ1 V1
5 Q9BWS9 V1
6 Q9UNZ5 V2
8 Q9H2K0 V2
11 Q9Y2W1 V3
15 Q9UKD2 V3
17 Q9NYF8 V3
2) list
L0 <- unstack(long0)
giving:
> L0
$V1
[1] "Q9ULV4" "Q6QEF8" "Q9H6F5" "Q9GZZ1" "Q9BWS9"
$V2
[1] "Q9UNZ5" "Q9H2K0"
$V3
[1] "Q9Y2W1" "Q9UKD2" "Q9NYF8"
3) character matrix
Create a version of L0 that replaces each zero-length component with NA and then expand each component to the maximum length, reforming into a matrix at the same time via sapply:
lens <- lengths(L0)
m0 <- sapply(replace(L0, !lens, NA), "length<-", max(lens))
giving:
> m0
V1 V2 V3
[1,] "Q9ULV4" "Q9UNZ5" "Q9Y2W1"
[2,] "Q6QEF8" "Q9H2K0" "Q9UKD2"
[3,] "Q9H6F5" NA "Q9NYF8"
[4,] "Q9GZZ1" NA NA
[5,] "Q9BWS9" NA NA
Update: Some fixes and clarifications.
Note 1: The input df in reproducible form is:
df <-
structure(list(V1 = c("", "Q9ULV4;Q6QEF8", "", "", "", "", "Q9H6F5",
"Q9GZZ1", "Q9BWS9"), V2 = c("Q9UNZ5", "", "", "Q9H6F5", "Q9H2K0",
"Q9GZZ1", "Q9GZZ1", "", ""), V3 = c("Q9Y2W1", "", "Q9UNZ5", "",
"Q9ULV4;Q6QEF8", "Q9UKD2", "Q9GZZ1", "Q9NYF8", "")), .Names = c("V1",
"V2", "V3"), row.names = c(NA, -9L), class = "data.frame")
Note 2: In the most recent development version of R, "R Under development (unstable) (2016-07-05 r70861)", the long <- line near the top could be simplified to just long <- stack(L) since stack creates a factor with all levels in that version of R.
I would do it in plain R, based on the duplicated function, in this way:
lst <- lapply(df, function(x) unlist(strsplit(as.character(x), ";", fixed = TRUE)))
cols <- colnames(df)
seen_entries <- NULL
for (i in (1:ncol(df))) {
n_seen_before <- length(seen_entries)
seen_entries <- c(seen_entries, lst[[cols[i]]])
lst[[cols[i]]] <- lst[[cols[i]]][(!duplicated(seen_entries))[
(n_seen_before+1):length(seen_entries)]]
}
Output is:
> lst
$V1
[1] "Q9ULV4" "Q6QEF8" "Q9H6F5" "Q9GZZ1" "Q9BWS9"
$V2
[1] "Q9UNZ5" "Q9H2K0"
$V3
[1] "Q9Y2W1" "Q9UKD2" "Q9NYF8"
Probably there are more elegant solutions using e.g. data.table or something similar.
We can try
lst <- lapply(df, function(x) unique(unlist(strsplit(as.character(x), ";"))))
lapply(seq_along(lst), function(i) {
v1 <- unlist(lst[seq(i)])
setdiff(lst[[i]], v1[duplicated(v1)])})
#[[1]]
#[1] "Q9ULV4" "Q6QEF8" "Q9H6F5" "Q9GZZ1" "Q9BWS9"
#[[2]]
#[1] "Q9UNZ5" "Q9H2K0"
#[[3]]
#[1] "Q9Y2W1" "Q9UKD2" "Q9NYF8"