I have the same question as here but using R:
Sort numbers with colons
I have a data frame A with a column like this one:
1:5
11:36
2:1
2:14
2:8
I'd like to sort A based on that column, in this way:
1:5
2:1
2:8
2:14
11:36
We can separate the data into different columns, arrange the data by all columns and combine them again.
library(dplyr)
library(tidyr)
df %>%
separate(V1, into = c("A", "B"), sep = ":", convert = TRUE) %>%
arrange_all() %>%
unite(A, A, B, sep = ":")
# A
#1 1:5
#2 2:1
#3 2:8
#4 2:14
#5 11:36
data
df <- structure(list(V1 = c("1:5", "11:36", "2:1", "2:14", "2:8")),
row.names = c(NA, -5L), class = "data.frame")
Here is a base R solution using order + gsub, i.e.,
r <- v[order(as.numeric(gsub(":.*","",v)),
as.numeric(gsub(".*:","",v)))]
such that
> r
[1] "1:5" "2:1" "2:8" "2:14" "11:36"
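The same idea carries over to a data frame column. This is a minimal sketch, assuming the data frame is called A and the column is named V1 (the column name is hypothetical, since the question does not give one):

```r
# Hypothetical data frame; the column name V1 is an assumption
A <- data.frame(V1 = c("1:5", "11:36", "2:1", "2:14", "2:8"),
                stringsAsFactors = FALSE)

# Numeric parts before and after the colon
left  <- as.numeric(sub(":.*", "", A$V1))
right <- as.numeric(sub(".*:", "", A$V1))

# Reorder the rows by the left part, breaking ties with the right part
A_sorted <- A[order(left, right), , drop = FALSE]
A_sorted$V1
```

Passing drop = FALSE keeps the result a data frame even when A has only one column.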
1) gtools. mixedsort and mixedorder in the gtools package can do that. We show how to do it for a vector v and for an entire data frame DF, which may have additional columns that should move along with the v column. (The test data is defined reproducibly in the Note at the end. If the v column in DF were a factor rather than character, use as.character(DF$v) in place of DF$v.)
library(gtools)
mixedsort(v)
## [1] "1:5" "2:1" "2:8" "2:14" "11:36"
DF[mixedorder(DF$v), ]
## v x
## 1 1:5 1
## 3 2:1 3
## 5 2:8 5
## 4 2:14 4
## 2 11:36 2
2) Base R. This alternative is slightly longer but uses only base R. It gives the same answers as (1). The comment about factors in (1) applies here too.
o <- do.call("order", read.table(text = v, sep = ":"))
v[o]
o <- do.call("order", read.table(text = DF$v, sep = ":"))
DF[o, ]
Note
Test data used
v <- c("1:5", "11:36", "2:1", "2:14", "2:8")
DF <- data.frame(v, x = seq_along(v), stringsAsFactors = FALSE)
Related
Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
"\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
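The two base R options agree on the sample data, but they are not equivalent in general: the backreference pattern only collapses letters that end up next to each other, while unique drops every repeat. A small sketch with a hypothetical three-part value (not from the question) shows the difference:

```r
# Hypothetical input with a repeated letter that is not adjacent
x <- "T1-L2-T3"

# Regex approach: strip digits and dashes, then collapse adjacent repeats
by_regex <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", x))

# strsplit approach: keep each letter once, in order of first appearance
by_unique <- sapply(strsplit(x, "[0-9-]+"),
                    function(p) paste(unique(p), collapse = ""))

by_regex   # "TLT"
by_unique  # "TL"
```

Which behaviour is wanted depends on the data; for two-part values such as those in the question the results are the same.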
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
b = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
This is my current data set
I want to take the numbers after "narrow" (e.g. 20) and make another vector. Any idea how I can do that?
We can use sub to remove the substring "Narrow" followed by a comma and zero or more spaces (\\s*), replacing it with an empty string (""), and then convert the result to numeric
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
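Since the data set itself is not shown, here is a minimal sketch with made-up values. The column name Stimulus and the exact spacing after the comma are assumptions:

```r
# Hypothetical data; the real data set is not shown in the question
df1 <- data.frame(Stimulus = c("Narrow, 20", "Narrow,35", "Narrow,  5"),
                  stringsAsFactors = FALSE)

# Drop the "Narrow," prefix plus any spaces, then convert what is left
df1$New <- as.numeric(sub("Narrow,\\s*", "", df1$Stimulus))
df1$New
```

Because the pattern uses \\s*, it copes with zero, one, or several spaces after the comma.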
You could use separate to separate the stimulus column into two vectors.
library(tidyr)
df %>%
separate(col = stimulus,
sep = ", ",
into = c("Text","Number"))
Maybe you can try the code below, using regmatches
df$new <- with(df, as.numeric(unlist(regmatches(stimulus,gregexpr("\\d+",stimulus)))))
You want separate from the tidyr package.
library(dplyr)
library(tidyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df %>% separate(x, c("A", "B"))
#> A B
#> 1 <NA> <NA>
#> 2 a b
#> 3 a d
#> 4 b c
I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces the strings that start with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
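If several prefixes have to be merged at once, one option is a named lookup vector keyed by the first two letters. This is only a sketch: the data and the prefix-to-group mapping below are hypothetical.

```r
# Hypothetical data and prefix-to-group mapping
f2016 <- data.frame(SECT16 = c("AABC", "AAUCC", "BBXY", "CDEF"),
                    stringsAsFactors = FALSE)
groups <- c(AA = "A", BB = "B")

# First two letters of every code
prefix <- substr(f2016$SECT16, 1, 2)

# Replace only the codes whose prefix appears in the lookup
hit <- prefix %in% names(groups)
f2016$SECT16[hit] <- unname(groups[prefix[hit]])
f2016$SECT16
```

Codes whose prefix is not in the lookup (here "CDEF") are left unchanged.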
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
I have one dataframe that basically looks like this (contains data):
t <- data.frame(x1 = 1:5, x2 = 1:5, stringsAsFactors = FALSE)
I have another dataframe that contains the original column names and a replacement for each
n <- data.frame(abb = c("x1", "x2"), erf = c("XX1", "XX2"), stringsAsFactors = FALSE)
What I would like to do is rename columns in dataframe t according to the specification in dataframe n. My problem is I can't figure out how to do that with map. Why is the following wrong:
map2_dfr(n$abb, n$erf, function(x, y) rename(t, !!y := x))
We can use rename_at
library(dplyr)
t %>%
rename_at(n$abb, ~ n$erf)
Here is a one-liner in base R using match,
names(t) <- n$erf[match(names(t), n$abb)]
t
# XX1 XX2
#1 1 1
#2 2 2
#3 3 3
#4 4 4
#5 5 5
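Note that match returns NA for any column of t that has no entry in n, so the one-liner would turn such names into NA. A sketch that leaves unmatched columns alone, assuming a hypothetical extra column x3:

```r
# x3 is a hypothetical extra column with no replacement in n
t <- data.frame(x1 = 1:5, x2 = 1:5, x3 = 1:5)
n <- data.frame(abb = c("x1", "x2"), erf = c("XX1", "XX2"),
                stringsAsFactors = FALSE)

# Position of each column name in the lookup (NA when absent)
idx <- match(names(t), n$abb)

# Rename only the columns that have a replacement
names(t)[!is.na(idx)] <- n$erf[idx[!is.na(idx)]]
names(t)
```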
I am using R and I need to format the number within a dataframe, notably by imposing the number of digits before the decimal separator as well as after. E.g. 3.56 must become "0003,56000".
So I built my own function:
format <- function(x, nbr_before_comma, nbr_after_comma){
x= round(x, nbr_after_comma)
x = toString(x)
l = strsplit(x, "[.]")[[1]]
#print(l)
#print(nchar(l[2]))
before_comma = paste0(strrep("0",nbr_before_comma - nchar(l[1])),l[1])
after_comma = ifelse(length(l) > 1,
paste0(l[2],strrep("0",nbr_after_comma - nchar(l[2]))),
strrep("0", nbr_after_comma))
res = paste0(before_comma, ",", after_comma)
return(res)
}
Trying this on a single number will work. Now I am trying to apply this to a dataframe. Let's take the toy example:
df <- data.frame("a" = c(2.5,3.56,4.5))
I define more precisely what I want:
format44 <- function(x){
return(format(x,4,4))
}
I have tried several possibilities:
df[] <- lapply(df, format44)
with dplyr:
df <- df %>%
mutate(a = format44(a))
and finally:
df["a"] <- lapply(df["a"],format44)
None of them works. I actually get the same output every time:
a
1 0002,5, 3
2 0002,5, 3
3 0002,5, 3
Any idea what the problem is?
Use sprintf and then translate the decimal points to comma:
before <- after <- 4
fmt <- sprintf("%%0%d.%df", before + after + 1, after)
transform(df, a = chartr(".", ",", sprintf(fmt, a)))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
or writing this with dplyr:
library(dplyr)
before <- after <- 4
df %>%
mutate(a = "%%0%d.%df" %>%
sprintf(before + after + 1, after) %>%
sprintf(a) %>%
chartr(".", ",", .))
giving:
a
1 0002,5000
2 0003,5600
3 0004,5000
In this case, mapply suits you better:
df$b <- mapply(format44, df$a)
You do not even need the format44 wrapper. You can use:
df$c <- mapply(format, df$a, 4,4)
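The reason the original function fails on a whole column is that toString collapses the vector into a single string, so every row is built from the same text. mapply avoids this by calling the function once per element. As an alternative, here is a sketch of a vectorized helper built on the sprintf approach from the earlier answer. The name pad_number is hypothetical, and it is chosen so that base::format is not masked:

```r
x <- c(2.5, 3.56, 4.5)

# toString() returns ONE string for the whole vector
toString(x)   # "2.5, 3.56, 4.5"

# Hypothetical helper: zero-pad to `before` integer digits and `after` decimals
pad_number <- function(x, before, after) {
  # Total width = integer digits + decimal point + decimals
  fmt <- sprintf("%%0%d.%df", before + after + 1, after)
  # Swap the decimal point for a comma
  chartr(".", ",", sprintf(fmt, x))
}

pad_number(x, 4, 5)
```

With 4 digits before and 5 after, 3.56 becomes "0003,56000", which is the format asked for in the question. The helper assumes non-negative numbers.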