After hours of searching for what should be a simple task, I need help.
What I want to do:
Ensure that all strings are padded to the same length of 26 characters.
Dataset:
library(stringr)
names <- structure(
  list(names = c(
    "A",
    "ABC",
    "ABCDEFG",
    "ABCDEFGHIJKLMNOP",
    "AB",
    "ABCDEFGHI",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "ABCDEFGHIJKL",
    "ABCDEFGHIJKLMNOPQR",
    "ABCDEFGHIJKLMNOP",
    "ABCDEFGHIJKLMNO"
  )),
  class = "data.frame",
  row.names = c(NA, -11L)
)
Step 1:
Find the max character length and the number of characters to pad:
max <- as.numeric(max(nchar(names$names)))
max
n <- as.numeric(nchar(names$names))
n
pad <- max - n
pad
#add columns to the dataset to check how many characters are to be padded for each name
names$max <- as.numeric(max(nchar(names$names)))
names$n <- as.numeric(nchar(names$names))
names$pad <- as.numeric(max - n)
Step 2: Pad
names$names <-
str_pad(names$names,
pad,
side = "right",
pad = "0")
But this approach doesn't appear to be working for me. Can someone point me in the right direction? I am getting strings of different lengths:
names max n pad
1 A000000000000000000000000 26 1 25
2 ABC00000000000000000000 26 3 23
3 ABCDEFG000000000000 26 7 19
4 ABCDEFGHIJKLMNOP 26 16 10
5 AB0000000000000000000000 26 2 24
6 ABCDEFGHI00000000 26 9 17
7 ABCDEFGHIJKLMNOPQRSTUVWXYZ 26 26 0
8 ABCDEFGHIJKL00 26 12 14
9 ABCDEFGHIJKLMNOPQR 26 18 8
10 ABCDEFGHIJKLMNOP 26 16 10
11 ABCDEFGHIJKLMNO 26 15 11
Help would be greatly appreciated.
The width argument of str_pad is the total target width, not the number of characters to add, so here we need just
library(stringr)
mx <- max(nchar(names$names))
names$names <- str_pad(names$names, mx, side = "right", pad = "0")
names$names
Output:
#[1] "A0000000000000000000000000" "ABC00000000000000000000000" "ABCDEFG0000000000000000000" "ABCDEFGHIJKLMNOP0000000000"
#[5] "AB000000000000000000000000" "ABCDEFGHI00000000000000000" "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "ABCDEFGHIJKL00000000000000"
#[9] "ABCDEFGHIJKLMNOPQR00000000" "ABCDEFGHIJKLMNOP0000000000" "ABCDEFGHIJKLMNO00000000000"
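A quick check that every padded string is now exactly 26 characters:
nchar(names$names)
# [1] 26 26 26 26 26 26 26 26 26 26 26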
NOTE: It is better not to give objects names that are already function names or argument names (here names masks the base function names(), and pad is an argument name of str_pad()).
I think you want the format function. You set the width and then justify left, right or center:
format(names, width = 26, justify = "left")
#    names
# 1 A
# 2 ABC
# 3 ABCDEFG
# 4 ABCDEFGHIJKLMNOP
# 5 AB
# 6 ABCDEFGHI
# 7 ABCDEFGHIJKLMNOPQRSTUVWXYZ
# 8 ABCDEFGHIJKL
# 9 ABCDEFGHIJKLMNOPQR
# 10 ABCDEFGHIJKLMNOP
# 11 ABCDEFGHIJKLMNO
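Note that format pads with spaces, which the printed data frame hides; a quick nchar check on the column (my addition, run on the unpadded data) confirms each string is width 26:
nchar(format(names$names, width = 26, justify = "left"))
# [1] 26 26 26 26 26 26 26 26 26 26 26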
Using rep and paste(..., collapse = "") (roughly Python's join for a vector of strings), together with Vectorize() and a closure over pad (the inner function simply grabs pad from the enclosing argument list), one can quickly create a pad-string generator reps.
Using paste0, one can then join the character vectors element-wise.
pad_strings <- function(char_vec, max_len = NULL, pad = "0") {
  # vectorized helper: builds a string of n copies of pad
  reps <- Vectorize(function(n) paste(rep(pad, n), collapse = ""))
  lengths <- nchar(char_vec)
  if (is.null(max_len)) max_len <- max(lengths)
  diffs <- max_len - lengths
  # append the right amount of padding to each string
  paste0(char_vec, reps(diffs))
}
char_vec <- names$names
pad_strings(char_vec)
[1] "A0000000000000000000000000" "ABC00000000000000000000000"
[3] "ABCDEFG0000000000000000000" "ABCDEFGHIJKLMNOP0000000000"
[5] "AB000000000000000000000000" "ABCDEFGHI00000000000000000"
[7] "ABCDEFGHIJKLMNOPQRSTUVWXYZ" "ABCDEFGHIJKL00000000000000"
[9] "ABCDEFGHIJKLMNOPQR00000000" "ABCDEFGHIJKLMNOP0000000000"
[11] "ABCDEFGHIJKLMNO00000000000"
If no max_len= argument is given, the strings are padded to the length of the longest string; otherwise each string is padded out to max_len.
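As an aside (my addition, not part of the original answer), base R's strrep (available since R 3.3.0) is vectorized over its second argument, so the Vectorize helper can be dropped entirely:
pad_strings2 <- function(char_vec, max_len = max(nchar(char_vec)), pad = "0") {
  # strrep builds all the padding strings in one vectorized call
  paste0(char_vec, strrep(pad, max_len - nchar(char_vec)))
}
pad_strings2(names$names)  # same result as pad_strings above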
Related
I have a data frame:
df = read.table(text="index Htype
3 AAAABABBAAAAAABBAAHBUUAUAABBAABA
4 AAAABABBAAABABBABBAAHBBBBAABAABB
7 AAAABABBAAAAAABBAAABUBAUAABBAABA
8 BBBABABAAAAAAAABBABBAUAUAABBAAAA
9 BBHABABAAAAAAAABBABBABAUAABBAAAA", header=T, stringsAsFactors=F)
I would like to find out the positions of the characters "U" or "H" in the "Htype" column. So the expected result:
index Htype pos
3 AAAABABBAAAAAABBAAHBUUAUAABBAABA 19 21 22 24
4 AAAABABBAAABABBABBAAHBBBBAABAABB 21
7 AAAABABBAAAAAABBAAABUBAUAABBAABA 21 24
8 BBBABABAAAAAAAABBABBAUAUAABBAAAA 22 24
9 BBHABABAAAAAAAABBABBABAUAABBAAAA 3 24
I tried the following script, but it isn't working:
df$pos <- apply(df$Htype,1,function(x) unlist(gregexpr(pattern ='U|H',x)))
Any help would be appreciated, thanks.
We can use gregexpr to either create a string column
df$pos <- sapply(gregexpr("H|U", df$Htype), toString)
or a list column
df$pos <- sapply(gregexpr("H|U", df$Htype), as.integer)
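For the sample data, the string-column version reproduces the positions from the expected output (my addition; toString separates them with commas rather than spaces):
sapply(gregexpr("H|U", df$Htype), toString)
# [1] "19, 21, 22, 24" "21"             "21, 24"         "22, 24"         "3, 24"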
You need to paste together the string of positions (note also that apply() expects a matrix or data frame, which is why apply(df$Htype, 1, ...) fails on a plain vector). The following works for me:
df$pos <- apply(df,1,function(x) paste(unlist(gregexpr(pattern ='U|H',x[2])), collapse = " "))
I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have an unique id column to help me match them. I have an x column and I have a y column which, combined, can make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original data frame with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there were any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
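A quick check (my addition) shows the result matches the desired original_updated:
final
#     x  y value
# 1   1 23   120
# 2   2 24    51
# 3   3 25   122
# 4   4 26   123
# 5   5 27   124
# 6   6 28   125
# 7   7 29   126
# 8   8 30    54
# 9   9 31   128
# 10 10 32   129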
The match function is used to generate indices; it needs a nomatch argument to prevent NA on the left-hand side of the [<-.data.frame assignment. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster:
original[match(update$x, original$x)[
           match(update$x, original$x, nomatch = 0) ==
           match(update$y, original$y, nomatch = 0)],
         "value"] <-
  update[which(match(update$x, original$x) == match(update$y, original$y)),
         "value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
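Both answers produce value = c(120, 51, 122:126, 54, 128:129). For completeness (my addition, not from the answers above), the pasted-key idea floated in the question also gives a compact base-R solution with logical indexing:
key_orig <- paste(original$x, original$y)  # "x y" key for each original row
key_upd  <- paste(update$x, update$y)      # key for each update row
idx <- match(key_orig, key_upd)            # matching update row per original row, or NA
hit <- !is.na(idx)                         # boolean vector marking rows to overwrite
original$value[hit] <- update$value[idx[hit]]
This assumes the pasted pairs form unambiguous keys, which holds here because x and y are plain integers.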
Let's say I want to select the 1st, 3rd, and 12th element from a data frame or a matrix:
m = matrix(1:12, 3, 4)
m[c(1,3,12)] # as expected: selects the 1st, 3rd, and 12th element
However, this does not seem to work for data frames:
df = data.frame(m)
df[c(1,3,12)] # doesn't select the elements (indexes columns instead)
What I'm using is:
as.vector(df)[c(1,3,12)] # works as expected
Is there a simpler way to achieve the same result?
EDIT:
as.vector(df)[c(1,3,12)] # does not work
As Richard Scriven pointed out:
unlist(df, use.names=FALSE)[c(1, 3, 12)] # does work
But I'm still looking for a shorter notation (if possible).
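One shorter option (my suggestion, not from the thread) is to convert back to a matrix, since matrix-style linear indexing is exactly what is wanted here:
as.matrix(df)[c(1, 3, 12)]
# [1]  1  3 12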
You can also index a data frame with a logical matrix, which pulls out values by value rather than by position:
df <- data.frame(11:13, 14:16, 17:19, 20:22)
df
#   X11.13 X14.16 X17.19 X20.22
# 1     11     14     17     20
# 2     12     15     18     21
# 3     13     16     19     22
c(df[df == 11], df[df == 13], df[df == 22])
# [1] 11 13 22
First make some example data:
df = data.frame(matrix(rnorm(200), nrow=100))
df1=data.frame(t(c(25,34)))
The starting row is different in each column. For example, in X1 I would like to start from the 25th row, while in X2 from row 34. Then I want to calculate the mean of every 5 values over the next 50 rows, for all the columns in df.
I am new to R, so this is probably very obvious. Can anyone provide some suggestions on how I can do this?
You could try Map.
lst <- Map(function(x, y) {
  # drop the rows before this column's start row
  x1 <- x[y:length(x)]
  # group the remaining values into blocks of 5 and take each block's mean
  tapply(x1, as.numeric(gl(length(x1), 5, length(x1))), FUN = mean)
}, df, df1)
lst
# $X1
# 1 2 3 4 5 6
#-0.16500158 0.11339623 -0.86961872 -0.54985564 0.19958461 0.35234983
# 7 8 9 10 11 12
#0.32792769 0.65989801 -0.30409184 -0.53264725 -0.45792792 -0.59139844
# 13 14 15 16
# 0.03934133 -0.38068187 0.10100007 1.21017392
#$X2
# 1 2 3 4 5 6
# 0.24525622 0.07367300 0.18733973 -0.43784202 -0.45756095 -0.45740178
# 7 8 9 10 11 12
#-0.54086152 0.10439072 0.65660937 0.70623380 -0.51640088 0.46506135
# 13 14
#-0.09428336 -0.86295101
Because of the length difference, it might be better to keep it as a list. But, if you need it in a matrix/data.frame, you can make the lengths equal by padding with NAs.
do.call(cbind,lapply(lst, `length<-`,(max(sapply(lst, length)))))
Update
If you need only 50 rows, then change y:length(x) to y:(y + 49) in the Map code, as sketched below.
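A minimal sketch of that change (my rewrite of the Map call above, assuming each start row leaves at least 50 rows available):
lst50 <- Map(function(x, y) {
  x1 <- x[y:(y + 49)]                # exactly 50 values from the start row
  tapply(x1, gl(10, 5), FUN = mean)  # ten means, one per block of 5
}, df, df1)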
data
set.seed(24)
df <- data.frame(matrix(rnorm(200), nrow=100))
df1 <- data.frame(t(c(25,34)))
Not entirely clear, especially the second line of your code, but I think this might be close to what you want to do:
every_fifth_row <- df[seq(1, nrow(df), 5), ]
every_fifth_row
# X1 X2
# 1 -0.09490455 -0.28417104
# 6 -0.14949662 0.12857284
# 11 0.15297366 -0.84428186
# 16 -1.03397309 0.04775516
# 21 -1.95735213 -1.03750794
# 26 1.61135194 1.10189370
# 31 0.12447365 1.80792719
# 36 -0.92344017 0.66639710
# 41 -0.88764143 0.10858376
# 46 0.27761464 0.98382526
# 51 -0.14503359 -0.66868956
# 56 -1.70208187 0.05993688
# 61 0.33828525 1.00208639
# 66 -0.41427863 1.07969341
# 71 0.35027994 -1.46920059
# 76 1.38943839 0.01844205
# 81 -0.81560917 -0.32133221
# 86 1.38188423 -0.77755471
# 91 1.53247872 -0.98660308
# 96 0.45721909 -0.22855622
rowMeans(every_fifth_row)
colMeans(every_fifth_row)
# Alternative
# apply(every_fifth_row, 1, mean) # Row-wise mean
# apply(every_fifth_row, 2, mean) # Column-wise mean
I have a list of numerical vectors, and I need to create a list containing only one copy of each vector. There isn't a list method for the identical function, so I wrote a function that checks every vector against every other.
F1 <- function(x){
  to_remove <- c()
  for(i in 1:length(x)){
    for(j in 1:length(x)){
      # flag only the later copy (j > i) so the first occurrence is kept
      if(i < j && identical(x[[i]], x[[j]])) to_remove <- c(to_remove, j)
    }
  }
  if(is.null(to_remove)) x else x[-c(to_remove)]
}
The problem is that this function becomes very slow as the size of the input list x increases, partly due to repeatedly growing the to_remove vector inside the nested for loops. I'm hoping for a method that will run in under one minute for a list of length 1.5 million with vectors of length 15, but that might be optimistic.
Does anyone know a more efficient way of comparing each vector in a list with every other vector? The vectors themselves are guaranteed to be equal in length.
Sample output is shown below.
x = list(1:4, 1:4, 2:5, 3:6)
F1(x)
# list(1:4, 2:5, 3:6)
As per #JoshuaUlrich and #thelatemail, ll[!duplicated(ll)] works just fine.
And thus, so should unique(ll)
I previously suggested a method using sapply with the idea of not checking every element in the list (I deleted that answer, as I think using unique makes more sense).
Since efficiency is a goal, we should benchmark these.
# Let's create some sample data
xx <- lapply(rep(100,15), sample)
ll <- as.list(sample(xx, 1000, T))
ll
Putting it up against some benchmarks:
fun1 <- function(ll) {
ll[c(TRUE, !sapply(2:length(ll), function(i) ll[i] %in% ll[1:(i-1)]))]
}
library(digest)  # fun2 hashes each vector with digest()
fun2 <- function(ll) {
  ll[!duplicated(sapply(ll, digest))]
}
fun3 <- function(ll) {
ll[!duplicated(ll)]
}
fun4 <- function(ll) {
unique(ll)
}
# Make sure all functions return the same result
all(identical(fun1(ll), fun2(ll)), identical(fun2(ll), fun3(ll)),
identical(fun3(ll), fun4(ll)), identical(fun4(ll), fun1(ll)))
# [1] TRUE
library(rbenchmark)
benchmark(digest=fun2(ll), duplicated=fun3(ll), unique=fun4(ll), replications=100, order="relative")[, c(1, 3:6)]
test elapsed relative user.self sys.self
3 unique 0.048 1.000 0.049 0.000
2 duplicated 0.050 1.042 0.050 0.000
1 digest 8.427 175.563 8.415 0.038
# I took out fun1, since it ran extremely slowly when ll is large
Fastest Option:
unique(ll)
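Applied to the question's sample data:
x <- list(1:4, 1:4, 2:5, 3:6)
unique(x)
# [[1]]
# [1] 1 2 3 4
#
# [[2]]
# [1] 2 3 4 5
#
# [[3]]
# [1] 3 4 5 6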
You could hash each of the vectors and then use !duplicated() to identify unique elements of the resultant character vector:
library(digest)
## Some example data
x <- 1:44
y <- 2:10
z <- rnorm(10)
ll <- list(x,y,x,x,x,z,y)
ll[!duplicated(sapply(ll, digest))]
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
# [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 1.24573610 -0.48894189 -0.18799758 -1.30696395 -0.05052373 0.94088670
# [7] -0.20254574 -1.08275938 -0.32937153 0.49454570
To see at a glance why this works, here's what the hashes look like:
sapply(ll, digest)
[1] "efe1bc7b6eca82ad78ac732d6f1507e7" "fd61b0fff79f76586ad840c9c0f497d1"
[3] "efe1bc7b6eca82ad78ac732d6f1507e7" "efe1bc7b6eca82ad78ac732d6f1507e7"
[5] "efe1bc7b6eca82ad78ac732d6f1507e7" "592e2e533582b2bbaf0bb460e558d0a5"
[7] "fd61b0fff79f76586ad840c9c0f497d1"