I am trying to format numbers as shown below (abbreviating them with K/M suffixes). The function works fine, but after formatting, the numeric column no longer sorts numerically because the values have become character strings:
df <- data.frame(x = c(12345,35666,345,5646575))
format_numbers <- function (df, column_name){
df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5, paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df,"x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers still sort correctly in ascending/descending order after formatting?
Note: this data frame df is to be displayed in a DT table.
The problem is the formatting. If you format the values while keeping the underlying data numeric, everything else falls into place. Here I will demonstrate using an S3 class:
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))
format.my_numbers <- function(x,..., d = 1, L = c('', 'K', 'M', 'B', 'T')){
ifelse(abs(x) >= 1000, Recall(x/1000, d = d + 1),
sprintf('%.1f%s', x, L[d]))
}
print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)
'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x, as it is still numeric.
For example, cbinding it with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x x
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
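If defining a class feels like too much, the same principle (sort on the number, display the label) can be sketched in base R with a separate display column. This is only an illustration: `fmt_si` and the `label` column are hypothetical names that do not appear in the question, and NA handling is left out.

```r
df <- data.frame(x = c(12345, 35666, 345, 5646575))

# Hypothetical helper: abbreviate with K/M/B/T suffixes, one decimal place.
fmt_si <- function(x) {
  units <- c("", "K", "M", "B", "T")
  # Number of full groups of three digits, clamped to the available suffixes
  idx <- pmin(pmax(floor(log10(abs(x)) / 3), 0), length(units) - 1)
  sprintf("%.1f%s", x / 1000^idx, units[idx + 1])
}

df$label <- fmt_si(df$x)          # character column, for display only
sorted <- df[order(df$x, decreasing = TRUE), ]  # order by the numeric column
sorted$label
```

Ordering by `x` rather than by `label` is what keeps 5.6M above 35.7K; sorting the labels themselves would compare them as text.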
I have a dataframe ("md") containing several variables, of which one is binary ("adopter"). I would like to mean center three of the other (continuous) variables, let's say X, Y, and Z, but only for the rows where adopter = 1. The others, for which adopter = 0, should remain unchanged.
In the end I would like to end up with a new dataframe containing all variables as before, but with the X, Y, and Z for which adopter = 1 being mean centered, while leaving the X, Y, and Z for which adopter = 0 being unchanged.
My dataframe looks like this (117 observations in total):
adopter   X     Y     Z      A      B
0         0.5   2.3   4.5    3      4.7
1         1.5   6.5   -2.3   69.3   -2.5
...       ...   ...   ...
So the new dataframe should contain the mean-centered values of X, Y, and Z for the second row in this example, since adopter = 1, and leave the rest unchanged.
I know how to mean center all X, Y, and Z:
md_cen <- md
covs_to_center <- c("X", "Y", "Z")
md_cen[covs_to_center] <- scale(md_cen[covs_to_center],
scale = FALSE)
But I cannot figure out how to restrict this to the rows where adopter == 1. I also tried applying a function:
center_apply <- function(x) {
apply(x, 2, function(y) y - mean(y))}
However, this leaves me again with the mean centered versions for all X, Y, Z, of course, and on top the new dataset contains only those three variables.
Can anyone help me out here, please?
The basic way to accomplish what you're trying to do is to use the split-apply-combine workflow. That is:
1. Split your data frame up into coherent and useful sub-parts.
2. Do the thing you want to each sub-part.
3. Reconstitute the parts into the whole.
First, here's a toy dataset:
covs_to_center <- c("X", "Y", "Z")
set.seed(123)
md <- data.frame(
adopter = sample(0:1, 10, replace = T),
X = rnorm(10, 2, 1),
Y = rnorm(10, 3, 2),
Z = rnorm(10, 5, 10),
A = rnorm(10, 40, 50),
B = rnorm(10, 0, 2)
)
md
## adopter X Y Z A B
## 1 0 3.7150650 6.5738263 -11.866933 74.432013 -2.24621717
## 2 0 2.4609162 3.9957010 13.377870 67.695883 -0.80576967
## 3 0 0.7349388 -0.9332343 6.533731 36.904414 -0.93331071
## 4 1 1.3131471 4.4027118 -6.381369 24.701867 1.55993024
## 5 0 1.5543380 2.0544172 17.538149 20.976450 -0.16673813
## 6 1 3.2240818 0.8643526 9.264642 5.264651 0.50663703
## 7 1 2.3598138 2.5640502 2.049285 29.604136 -0.05709351
## 8 1 2.4007715 0.9479911 13.951257 -23.269818 -0.08574091
## 9 0 2.1106827 1.5422175 13.781335 148.447798 2.73720457
## 10 0 1.4441589 1.7499215 13.215811 100.398100 -0.45154197
A base R solution:
md_base <- data.frame(row_num = 1:nrow(md), md)
# append column of row numbers to make it easier to recombine things later
md_split <- split(md_base, md_base$adopter)
# this is a list of 2 data frames, corresponding to the 2 possible outcomes
# of the adopter variable
md_split$`1`[, covs_to_center] <-
apply(md_split$`1`[, covs_to_center], 2, function(y) y - mean(y))
# grab the data frame that had a 1 in the response column; apply the centering
# function to the correct variables in that data frame
md_new <- do.call(rbind, md_split)
# glue the data frame back together; it will be ordered by adopter
rownames(md_new) <- NULL
# remove row name artifact created by joining
md_new <- md_new[order(md_new$row_num), names(md_new) != "row_num"]
# sort by the row_num column, then drop it
This is pretty clunky, and I'm sure it could be improved. Here's a tidyverse equivalent that produces the same output:
library(tidyverse)
md %>%
group_by(adopter) %>%
mutate(across(all_of(covs_to_center), function(y) y - adopter * mean(y))) %>%
ungroup()
The idea behind this is: group by adopter (much like the split() approach), calculate the mean() of the relevant variables within each group, and then subtract the mean of the subgroup multiplied by the adopter variable (meaning when adopter == 0, nothing will be subtracted).
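For comparison, here is a sketch that skips the split and the grouping and uses plain logical indexing instead. It is a different technique from the two shown above, and it uses a small made-up data frame (the `adopter` pattern and column values are assumptions for illustration only).

```r
set.seed(1)
md <- data.frame(
  adopter = rep(0:1, 5),
  X = rnorm(10, 2, 1),
  Y = rnorm(10, 3, 2),
  Z = rnorm(10, 5, 10),
  A = rnorm(10, 40, 50)
)
covs_to_center <- c("X", "Y", "Z")

md_cen <- md
is_adopter <- md_cen$adopter == 1

# Center each chosen column using only the adopter rows; other rows are untouched
for (v in covs_to_center) {
  md_cen[[v]][is_adopter] <- md[[v]][is_adopter] - mean(md[[v]][is_adopter])
}

colMeans(md_cen[is_adopter, covs_to_center])  # approximately zero
```

Because rows are modified in place, the original row order is preserved and no helper column of row numbers is needed.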
I want to replace the values of a data frame column that contains only four distinct numbers with specific new numbers, as shown below:
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character strings (they are the argument names), while the new values are numeric:
library(tidyverse)
df %>%
mutate(tt = recode(tt, '1'= 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and provide for a shorter example we use the input in the Note at the end. Here are several alternatives. nos defined in (1) is used in some of the others too. No packages are used.
1) indexing To get the result since the input is 1 to 4 we can use indexing. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
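As a variation on the indexing idea in (1) and (1a), the mapping can be stored in a named vector and looked up by name, which also works when the old values are not 1:4. This is a sketch only; `lookup` and `out` are hypothetical names not used above.

```r
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))

# Names are the old values (as character), elements are the new values
lookup <- c("1" = 10, "2" = 200, "3" = 458, "4" = -0.1)

out <- transform(df, tt = unname(lookup[as.character(tt)]))
out$tt
```

Any value of tt that has no entry in `lookup` comes back as NA, which makes unmapped values easy to spot.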
I have a matrix X, two data frames A and B, and two vectors of indices vec_a and vec_b. A and B each contain an index variable whose values correspond to the values in vec_a and vec_b. Other than that, A and B contain as many value columns as there are columns in X:
# original data
X <- matrix(rnorm(200),100,2)
# values to subtract, stored in data.frames
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
# indices: which values to subtract (one for each row of X)
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
What I want to achieve is the following: For each row iii in X get the values value1 and value2 from A and B based on elements iii in the vectors vec_a and vec_b. Then, subtract these values from the corresponding row in X. May sound a bit confusing, but I hope the following solution makes it more clear what the goal is:
# iterate over all rows of X
for(iii in 1:nrow(X)){
# get correct values
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
# this intermediate step is necessary, otherwise we subtract a data.frame from a matrix
X_clean <- as.numeric(X_clean)
# subtract from X
X[iii,] = X[iii,] - X_clean
}
Note that we have to convert to numeric in my loop solution, otherwise X loses class matrix as we subtract a data.frame from a matrix. My solution works fine, until you need to do that for many matrices like A and B and for millions of observations. Is there a solution that does not rely on looping over all rows?
EDIT
Thanks, both answers improve the speed of the code massively. I chose the answer by StupidWolf as it was more efficient than using data.table:
Unit: microseconds
expr min lq mean median uq max neval cld
datatable 5557.355 5754.931 6052.402 5881.729 5975.386 14154.040 100 b
stupid.wolf 818.529 1172.840 1311.784 1187.593 1221.164 4777.743 100 a
loop 111748.790 115141.149 116677.528 116109.571 117085.048 156497.999 100 c
You can just match the rows:
set.seed(111)
# original data
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
newX <- X - as.matrix(A[match(vec_a,A$index_a),-1]-B[match(vec_b,B$index_b),-1])
Then we run your loop:
for(iii in 1:nrow(X)){
X_clean <- A[which(A$index_a == vec_a[iii]),-1] - # subtract correct A value
B[which(B$index_b == vec_b[iii]),-1] # subtract correct B value
X_clean <- as.numeric(X_clean)
X[iii,] = X[iii,] - X_clean
}
And check the values are equal:
all.equal(c(newX),c(X))
[1] TRUE
match() should be pretty fast, but if it is still too slow, you can index the rows of A directly with vec_a, as in A[vec_a, ]. This works here because index_a is simply 1:4 in row order.
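A sketch of that direct-indexing variant, under the assumption that index_a and index_b are exactly the row numbers of A and B (true for the example data, but not guaranteed in general). Converting A and B to matrices once up front is an extra step added here to avoid repeated data frame subsetting; `A_mat`, `B_mat`, `newX_idx` and `newX_match` are illustrative names.

```r
set.seed(111)
X <- matrix(rnorm(200), 100, 2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = TRUE)
vec_b <- sample(1:4, nrow(X), replace = TRUE)

# Drop the index columns and convert to plain numeric matrices once
A_mat <- as.matrix(A[, -1])
B_mat <- as.matrix(B[, -1])

# Row i of X gets A's row vec_a[i] minus B's row vec_b[i] subtracted
newX_idx <- X - (A_mat[vec_a, ] - B_mat[vec_b, ])

# Same computation via match(), for comparison
newX_match <- X - as.matrix(A[match(vec_a, A$index_a), -1] -
                            B[match(vec_b, B$index_b), -1])
all.equal(c(newX_idx), c(newX_match))
```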
This approach uses data.table for easy joining.
library(data.table)
set.seed(111)
X <- matrix(rnorm(200),100,2)
A <- data.frame(index_a = 1:4, value1 = rnorm(4), value2 = rnorm(4))
B <- data.frame(index_b = 1:4, value1 = rnorm(4), value2 = rnorm(4))
vec_a <- sample(1:4, nrow(X), replace = T)
vec_b <- sample(1:4, nrow(X), replace = T)
setDT(A);setDT(B)
dtX <- as.data.table(cbind(1:nrow(X),X,vec_a,vec_b))
as.matrix(
dtX[A, on = .(vec_a = index_a)][B,
on = .(vec_b = index_b)][order(V1),
.(V2 - (value1 - i.value1), V3 - (value2 - i.value2))]
)
V1 V2
[1,] 0.22746 0.7069
[2,] 1.84340 -0.1258
[3,] -0.70038 1.2494
...
[98,] 2.04666 0.6767
[99,] 0.02451 1.0473
[100,] -2.72553 -0.6595
Hopefully this will be pretty fast for very large matrices.
Consider two strings of the form below:
101001
010001
How can I OR these two strings and report the number of ones?
My goal is to just report 4 for the two strings above.
Thanks very much for your help
There's probably a more elegant way, but how about this:
x = "101001"
y = "010001"
dat = c(strsplit(x, split=""), strsplit(y, split=""))
sum(dat[[1]] == 1 | dat[[2]] == 1)
or this:
sum(unlist(strsplit(x, split="")) == 1 | unlist(strsplit(y, split="")) == 1)
or, per @jbaums's comment:
sum(as.numeric(strsplit(x, '')[[1]]) | as.numeric(strsplit(y, '')[[1]]))
If you're only dealing with binary, you can convert the strings to numerics, add them, and count the number of non-zeros. (Edited to incorporate Julius's recommendation)
x = "101001"
y = "010001"
xy <- as.numeric(x) + as.numeric(y)
length(gregexpr("(1|2)", xy)[[1]])
You can write this to run over a vector pretty easily too.
#* function to generate sample data
make_binary_string <- function(n = 10, len = 6)
{
vapply(1:n,
function(i, n, len) paste0(sample(0:1, len, replace = TRUE), collapse = ""),
character(1),
n = n,
len = len)
}
set.seed(pi)
x <- make_binary_string(n = 10)
y <- make_binary_string(n = 10)
xy <- as.numeric(x) + as.numeric(y)
nchar(gsub("0", "", xy))
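One caveat with the numeric-sum approach: it assumes each string can be represented exactly as a double (roughly 15 digits) and that the sum prints without scientific notation. For longer strings, a character-wise comparison avoids the conversion altogether. The sketch below is one way to do that; `count_or_ones` is a hypothetical helper name, and it assumes both strings have the same length and contain only the characters 0 and 1.

```r
# Count positions where either string has a "1" (code point 49)
count_or_ones <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(utf8ToInt(a) == 49L | utf8ToInt(b) == 49L)
}

count_or_ones("101001", "010001")

# Applied pairwise over two character vectors
x <- c("101001", "000000")
y <- c("010001", "000010")
res <- mapply(count_or_ones, x, y, USE.NAMES = FALSE)
res
```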
Here is what I tried.
str1 <- "101001"
str2 <- "010001"
df <- data.frame(strsplit(str1,split = ""), strsplit(str2,split = ""))
names(df) <- c('x1', 'x2')
This will convert strings into dataframe like this
x1 x2
1 1 0
2 0 1
3 1 0
4 0 0
5 0 0
6 1 1
And then count the number of rows that have at least one 1:
nrow(df[df$x1 == 1 | df$x2 == 1,])
Or
sum(bitwOr(as.numeric(strsplit(str1,split = "")[[1]]) , as.numeric(strsplit(str2,split = "")[[1]])))
We can define a function to.bool() that converts a string to a sequence of boolean values:
to.bool <- function(boolstr) as.logical(as.integer(unlist(strsplit(boolstr,""))))
sum(to.bool("101001") | to.bool("010001"))
#[1] 4
I want to calculate the distance between two linked sets of spatial coordinates (program and admin in my fake dataset). The data are in a wide format, so both pairs of coordinates are in the same row.
library(sp)
set.seed(1)
n <- 100
program.id <- seq(1, n)
c1 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
c2 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
dat <- data.frame(cbind(program.id, c1, c2))
names(dat) <- c("program.id", "program.lat", "program.long", "admin.lat", "admin.long")
head(dat)
# program.id program.lat program.long admin.lat admin.long
# 1 1 -42.20844 55.70061 -41.848523 62.536404
# 2 2 -23.01770 -52.84898 -50.643849 -145.851172
# 3 3 13.11361 -82.70635 3.023431 -2.665397
# 4 4 73.47740 177.36626 -41.588893 -13.841337
# 5 5 -53.69725 48.05758 -57.389701 -44.922049
# 6 6 71.71014 -103.24507 3.343705 176.795719
I know how to create a matrix of distances among program or admin using the sp package:
ll <- c("program.lat", "program.long")
coords <- dat[ll]
dist <- apply(coords, 1,
function(eachPoint) spDistsN1(as.matrix(coords),
eachPoint, longlat=TRUE))
But what I want to do is create an n x 1 vector of distances (dist.km) between each pair of coordinates and add it to dat.
# program.id program.lat program.long admin.lat admin.long dist.km
# 1 1 -42.20844 55.70061 -41.848523 62.536404 567.35
# 2 2 -23.01770 -52.84898 -50.643849 -145.851172 8267.86
# ...
Any suggestions? I've spent a while going through old SO questions, but nothing seems quite right. Happy to be proven wrong.
Update
@Amit's solution works for my toy dataset:
apply(dat,1,function(x) spDistsN1(matrix(x[2:3],nrow=1),x[3:4],longlat=TRUE))
But I think I need to swap the order of the lat and long columns so that long comes before lat. From ?spDistsN1:
pts: A matrix of 2D points, first column x/longitude, second column y/latitude, or a SpatialPoints or SpatialPointsDataFrame object
Also, unless I've misunderstood the logic, I think Amit's solution should grab cols [2:3] and [4:5], not [2:3] and [3:4].
My challenge now is applying this to my actual data. I've reproduced a portion below.
library(sp)
dat <- structure(list(ID = 1:4,
subcounty = c("a", "b", "c", "d"),
pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984),
pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981),
sub.long = c(33.47552, 31.78307, 31.53083, 31.53083),
sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)),
.Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),
row.names = c(NA, 4L), class = "data.frame")
head(dat)
# ID subcounty pro.long pro.lat sub.long sub.lat
# 1 1 a 33.47628 2.739970 33.47552 2.740362
# 2 2 b 31.73605 3.265301 31.78307 3.391209
# 3 3 c 31.54073 3.213276 31.53083 3.208736
# 4 4 d 31.51749 3.177850 31.53083 3.208736
apply(dat, 1, function(x) spDistsN1(matrix(x[3:4], nrow=1),
x[5:6],
longlat=TRUE))
I get the error: Error in spDistsN1(matrix(x[3:4], nrow = 1), x[5:6], longlat = TRUE) : pts must be numeric
I'm confused because these columns are numeric:
> is.numeric(dat$pro.long)
[1] TRUE
> is.numeric(dat$pro.lat)
[1] TRUE
> is.numeric(dat$sub.long)
[1] TRUE
> is.numeric(dat$sub.lat)
[1] TRUE
The problem you're having is that apply(...) coerces its first argument to a matrix. By definition, all elements of a matrix must have the same data type. Since one of the columns in dat (dat$subcounty) is character, apply(...) coerces everything to character. In your test dataset, everything was numeric, so you didn't have this problem.
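The coercion is easy to see on a cut-down data frame. This sketch uses a two-row stand-in with rounded values (not the real data) and calls as.matrix() directly, which is the conversion apply() performs on a data frame.

```r
dat_small <- data.frame(
  ID = 1:2,
  subcounty = c("a", "b"),
  pro.long = c(33.47, 31.73)
)

m <- as.matrix(dat_small)
is.character(m)    # the single character column makes every cell a string

# Restricting to the numeric columns first keeps the matrix numeric
num <- as.matrix(dat_small[sapply(dat_small, is.numeric)])
is.numeric(num)
```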
This should work:
dat$dist.km <- sapply(1:nrow(dat),function(i)
spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T))
There is a much faster solution using data.table and geosphere.
library(data.table)
library(geosphere)
setDT(dat)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2),
matrix(c(sub.long, sub.lat), ncol = 2))/1000]
Benchmark:
library(sp)
jlhoward <- function(dat) { dat$dist.km <- sapply(1:nrow(dat),function(i)
spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T)) }
rafa.pereira <- function(dat2) { setDT(dat2)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2),
matrix(c(sub.long, sub.lat), ncol = 2))/1000] }
> system.time( jlhoward(dat) )
user system elapsed
8.94 0.00 8.94
> system.time( rafa.pereira(dat) )
user system elapsed
0.07 0.00 0.08
Data
dat <- structure(list(ID = 1:4,
subcounty = c("a", "b", "c", "d"),
pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984),
pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981),
sub.long = c(33.47552, 31.78307, 31.53083, 31.53083),
sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)),
.Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),
row.names = c(NA, 4L), class = "data.frame")
# enlarge dataset to 40,000 pairs
dat <- dat[rep(seq_len(nrow(dat)), 10000), ]
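If neither sp nor geosphere is available, a vectorised haversine formula in base R is another option. This is a different method from both answers above: it treats the Earth as a sphere (the radius of 6371 km is an assumption), so results will differ slightly from ellipsoid-based distances. `haversine_km` and `dat4` are hypothetical names used only for this sketch.

```r
# Great-circle distance in km between points given in decimal degrees
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

dat4 <- data.frame(
  pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984),
  pro.lat  = c(2.73996953, 3.26530095, 3.21327597, 3.17784981),
  sub.long = c(33.47552, 31.78307, 31.53083, 31.53083),
  sub.lat  = c(2.740362, 3.391209, 3.208736, 3.208736)
)

# One distance per row, computed on whole columns at once
dat4$dist.km <- with(dat4, haversine_km(pro.long, pro.lat, sub.long, sub.lat))
```

Since every operation works on whole columns, no row-wise apply or sapply is needed.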