I want to replace the values in a dataframe column that contains only 4 distinct numbers with specific numbers, as shown below:
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character strings, while the new values are supplied as numerics, matching the numeric type of the original column. (The output below is shown for the shorter 8-row input given in the Note at the end.)
library(tidyverse)
df %>%
  mutate(tt = recode(tt, '1' = 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
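In recent dplyr versions (1.1.0 and later), case_match is offered as a successor to recode; a minimal sketch of the same mapping, assuming such a version is installed:
df %>%
  mutate(tt = case_match(tt, 1 ~ 10, 2 ~ 200, 3 ~ 458, 4 ~ -0.1))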
To correct the error in the code in the question and to keep the example short, we use the input shown in the Note at the end. Here are several alternatives; nos, defined in (1), is used in some of the others too. No packages are used.
1) indexing Since the original values of tt are exactly 1:4, we can use them directly as indexes into a replacement vector. This is probably the simplest solution for this input.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization:
transform(df, tt = nos[match(tt, 1:4)])
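For instance, if the codes had been 100, 200, 300, 400 rather than 1:4 (a hypothetical input, not the question's data), only the lookup vector passed to match changes:
transform(df, tt = nos[match(tt, c(100, 200, 300, 400))])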
2) arithmetic Another approach is to use arithmetic:
transform(df, tt =   10 * (tt == 1) +
                    200 * (tt == 2) +
                    458 * (tt == 3) +
                   -0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
Related
I am trying to format numbers as shown below (abbreviating thousands and millions as K and M). The function works fine, but after formatting, the numeric column no longer sorts numerically since it now contains character values.
df <- data.frame(x = c(12345,35666,345,5646575))
format_numbers <- function(df, column_name) {
  df[[column_name]] <- ifelse(nchar(df[[column_name]]) <= 5,
                              paste(format(round(df[[column_name]] / 1e3, 1), trim = TRUE), "K"),
                              paste(format(round(df[[column_name]] / 1e6, 1), trim = TRUE), "M"))
}
df$x <- format_numbers(df,"x")
> df
x
1 12.3 K
2 35.7 K
3 0.3 K
4 5.6 M
Can we make sure the numbers still sort in ascending/descending order after formatting?
Note: this data frame df is to be incorporated into a DT table.
The problem is the formatting part. If you do it correctly, i.e. while keeping your data numeric, then everything else will fall into place. Here I will demonstrate using an S3 class:
# Constructor: tag a numeric vector with class 'my_numbers'
my_numbers <- function(x) structure(x, class = c('my_numbers', 'numeric'))

# Recursively divide by 1000, moving up the unit labels as we go
format.my_numbers <- function(x, ..., d = 1, L = c('', 'K', 'M', 'B', 'T')) {
  ifelse(abs(x) >= 1000, Recall(x / 1000, d = d + 1),
         sprintf('%.1f%s', x, L[d]))
}

# Print the formatted representation without quotes
print.my_numbers <- function(x, ...) print(format(x), quote = FALSE)

# Preserve the class when subsetting
'[.my_numbers' <- function(x, ..., drop = FALSE) my_numbers(NextMethod('['))
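As a quick check of the format method on its own (my example, not from the original answer):
format(my_numbers(c(500, 1500, 2500000)))
## [1] "500.0" "1.5K"  "2.5M"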
Now you can run your code:
df <- data.frame(x = c(12345,35666,345,5646575))
df$x <- my_numbers(df$x)
df
x
1 12.3K
2 35.7K
3 345.0
4 5.6M
You can use any mathematical operation on column x as it is numeric.
For example, cbinding the column with its double and ordering from smallest to largest:
cbind(x = df, y = df*2)[order(df$x),]
x      y
3 345.0 690.0 # smallest
1 12.3K 24.7K
2 35.7K 71.3K
4 5.6M 11.3M # largest ie Millions
Note that under the hood, x does not change:
unclass(df$x)
[1] 12345 35666 345 5646575 # Same as given
I want to create random mock data that looks like this:
__ID__|__Amount__
1 20
1 14
1 9
1 3
2 11
2 5
2 2
Each ID starts from a random number, but the second number for the same ID should be less than the first one, and the third less than the second. The maximum starting number should be 20.
You can just create the data first and then sort it as needed, using the tidyverse:
set.seed(0)
df <- data.frame(id = rep(1:3,10), amt = sample(1:20, 30, replace = TRUE))
df %>%
group_by(id) %>%
arrange(id, desc(amt))
This is a tricky one. If you want the Amount column to hold truly random values, you can use a function that calls sample recursively:
## Recursively sampling from a uniform distribution
recursive.sample <- function(start, end, length, results = NA, counter = 0) {
  ## To enter the recursion, counter must be smaller than the length out
  ## and the last result must be greater than the starting point (except the first time)
  if (counter < length && ifelse(counter != 0, results[counter] > start, TRUE)) {
    ## Increment the counter
    counter <- counter + 1
    ## Sample between start and the last result, or between start and end the first time
    results[counter] <- ifelse(counter != 1, sample(start:results[counter - 1], 1), sample(start:end, 1))
    ## Recursive call
    return(recursive.sample(start = start, end = end, length = length, results = results, counter = counter))
  } else {
    ## Exit the recursion
    return(results)
  }
}
## Example
set.seed(0)
recursive.sample(start = 1, end = 20, length = 3, results = NA, counter = 0)
#[1] 18 5 2
Alternatively (and way easier) you can use sort(sample()):
set.seed(0)
sort(sample(1:20, 3), decreasing = TRUE)
#[1] 18 7 6
Note that the results differ due to the lower probability of sampling higher values in the recursive function.
You can then easily create your table with your chosen function as follows:
set.seed(123)
## The ID column
ID <- c(rep(1, 4), rep(2,3))
## The Amount column
Amount <- c(recursive.sample(1, 20, 4, NA, 0), recursive.sample(1, 11, 3, NA, 0))
## The table
cbind(ID, Amount)
# ID Amount
#[1,] 1 18
#[2,] 1 5
#[3,] 1 2
#[4,] 1 2
#[5,] 2 10
#[6,] 2 3
#[7,] 2 3
Or, again, with the simple sort(sample()) function for a higher probability of picking larger numbers.
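For completeness, here is a sketch of the full table built with the sort(sample()) approach, mirroring the ID layout above (my construction, so the drawn values will differ from the recursive version):
set.seed(123)
ID <- c(rep(1, 4), rep(2, 3))
Amount <- c(sort(sample(1:20, 4), decreasing = TRUE),
            sort(sample(1:11, 3), decreasing = TRUE))
cbind(ID, Amount)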
Two methods, one using dplyr and one using only base R functions. These are slightly different from the two previous solutions. I used a sorted ID column, but this is not necessary.
Method 1
rm(list = ls())
library(dplyr)
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df %>% group_by(ID) %>%
mutate(Amount = sort(sample(1 : 20, n(), replace = T), decreasing = TRUE))
Method 2
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
uniq_ID <- unique(df$ID)
index_lst <- lapply(uniq_ID, function(x) which(df$ID == x))
res <- lapply(index_lst, function(x) sort(sample(1 : 20, length(x)),
decreasing = TRUE))
df$Amount[unlist(index_lst)] <- unlist(res)
Method 2.5
This is more convoluted than the 2nd method.
rm(list = ls())
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
df$Amount <- NA
tab <- as.data.frame(table(df$ID))
lapply(1 : nrow(tab), function(x) df$Amount[which(df$ID == tab$Var1[x])] <<-
sort(sample(1 : 20, tab$Freq[x]), decreasing = TRUE))
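A more compact base R variant of the same per-group idea, using ave to fill each ID's rows in one call (my sketch, not part of the original answers):
set.seed(1)
df <- data.frame(ID = rep(1:3, each = 5))
# ave() splits the placeholder vector by ID, applies FUN per group,
# and returns the results in the original row order
df$Amount <- ave(integer(nrow(df)), df$ID,
                 FUN = function(v) sort(sample(1:20, length(v)), decreasing = TRUE))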
I have two dataframes which contain xy coordinates for different IDs at different timepoints. What I would like to do is identify which point in the previous year is closest to each point in the current year, and store that data in a list. So for this example data:
oldnames <- c('A', 'B', 'C')
oldx <- c(0,5,10)
oldy <- c(0,5,10)
olddf <- data.frame(oldnames, oldx, oldy)
newnames <- c('D','E','F')
newx <- c(1, 6, 11)
newy <- c(1, 6, 11)
newdf <- data.frame(newnames, newx, newy)
I would like to produce a list that looks like this:
names closest
D A
E B
F C
I've been trying to do this using apply (as below), but at the moment it gives me an error message:
(Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator)
Does anyone have any ideas?
closestdf <- data.frame()
apply(newdf, 1, function(row) {
  name <- row["names"]
  xID <- row["x"]
  yID <- row["y"]
  closest <- olddf %>%
    mutate(length = sqrt((xID - oldx)^2 + (yID - oldy)^2)) %>%
    mutate(rank = min_rank(length)) %>%
    filter(rank == '1') %>%
    mutate(total = '1')
  closestdf <- rbind(closest, closestdf)
})
Cheers!
No need for apply calls; we can use purrr inside the mutate instead:
library(tidyverse)
newdf %>%
mutate(closest =
map2_chr(newx, newy,
~as.character(olddf$oldnames)[which.min((.x - olddf$oldx) ^ 2 + (.y - olddf$oldy) ^ 2)]
)
)
Gives:
newnames newx newy closest
1 D 1 1 A
2 E 6 6 B
3 F 11 11 C
There is no reason to perform the square root operation if we don't need the actual distance.
Or, clearer and more verbose, with intermediate steps:
newdf %>%
mutate(dists = map2(newx, newy, ~(.x - olddf$oldx) ^ 2 + (.y - olddf$oldy) ^ 2),
ids = map_dbl(dists, which.min),
closest = olddf$oldnames[ids])
Gives:
newnames newx newy dists ids closest
1 D 1 1 2, 32, 162 1 A
2 E 6 6 72, 2, 32 2 B
3 F 11 11 242, 72, 2 3 C
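The same result can also be reproduced in base R with a squared-distance matrix and a row-wise which.min (a sketch of mine, not part of the original answer):
# rows = new points, columns = old points
d2 <- outer(newdf$newx, olddf$oldx, "-")^2 + outer(newdf$newy, olddf$oldy, "-")^2
data.frame(names = newdf$newnames, closest = olddf$oldnames[apply(d2, 1, which.min)])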
I have a large R data.frame with close to 500 columns. I want to apply the existing scale function, and also try out different normalization functions, in a column-wise fashion.
With the existing scale function:
library(dplyr)
set.seed(1234)
dat <- data.frame(x = rnorm(10, 30, .2),
y = runif(10, 3, 5),
z = runif(10, 10, 20), k = runif(10, 5, 10))
dat %>% mutate_each_(funs(scale),vars=c("y","z"))
Question 1:
In this case vars holds only two columns, but when you have 500 columns to normalize, what's the best way?
I tried the following:
dnot <- c("y", "z")
dat %>% mutate_each_(funs(scale),vars=!(names(dat) %in% dnot))
Error:
Error in UseMethod("as.lazy_dots") :
no applicable method for 'as.lazy_dots' applied to an object of class "logical"
Question 2: Instead of using the built-in scale function, I want to apply my own function to normalize the data frame.
For example, I have the following function:
normalized_columns <- function(x) {
  r <- (x / sum(x))
}
How can I efficiently apply this to all the columns while leaving out only 3 or 4 of them?
As the OP used dplyr methods, one option is to use setdiff with mutate_each_:
dat %>%
mutate_each_(funs(scale), setdiff(names(dat), dnot))
# x y z k
#1 -0.8273937 3.633225 14.56091 0.22934964
#2 0.6633811 3.605387 12.65187 0.76742806
#3 1.4738069 3.318092 13.04672 -1.16688369
#4 -1.9708424 3.079992 15.07307 0.62528427
#5 0.8157183 3.437599 11.81096 -1.06313355
#6 0.8929749 4.621197 17.59671 -0.06743894
#7 -0.1923930 4.051395 12.01248 0.94484655
#8 -0.1641660 4.829316 12.58810 -0.16575678
#9 -0.1820615 4.662690 19.92150 -1.55940662
#10 -0.5090247 3.091541 18.07352 1.45571106
Or subset the names based on the logical index
dat %>%
mutate_each_(funs(scale), names(dat)[!names(dat) %in% dnot])
# x y z k
#1 -0.8273937 3.633225 14.56091 0.22934964
#2 0.6633811 3.605387 12.65187 0.76742806
#3 1.4738069 3.318092 13.04672 -1.16688369
#4 -1.9708424 3.079992 15.07307 0.62528427
#5 0.8157183 3.437599 11.81096 -1.06313355
#6 0.8929749 4.621197 17.59671 -0.06743894
#7 -0.1923930 4.051395 12.01248 0.94484655
#8 -0.1641660 4.829316 12.58810 -0.16575678
#9 -0.1820615 4.662690 19.92150 -1.55940662
#10 -0.5090247 3.091541 18.07352 1.45571106
If we are using mutate_each, another option is one_of
dat %>%
mutate_each(funs(scale), -one_of(dnot))
# x y z k
#1 -0.8273937 3.633225 14.56091 0.22934964
#2 0.6633811 3.605387 12.65187 0.76742806
#3 1.4738069 3.318092 13.04672 -1.16688369
#4 -1.9708424 3.079992 15.07307 0.62528427
#5 0.8157183 3.437599 11.81096 -1.06313355
#6 0.8929749 4.621197 17.59671 -0.06743894
#7 -0.1923930 4.051395 12.01248 0.94484655
#8 -0.1641660 4.829316 12.58810 -0.16575678
#9 -0.1820615 4.662690 19.92150 -1.55940662
#10 -0.5090247 3.091541 18.07352 1.45571106
The setdiff option with data.table would be
library(data.table)
nm1 <- setdiff(names(dat), dnot)
setDT(dat)[, (nm1) := lapply(.SD, scale), .SDcols = nm1]
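In current dplyr (1.0.0 and later), mutate_each/mutate_each_ have been retired in favor of across; the same exclusion, including the OP's own normalizing function, could be written as follows (a sketch assuming a recent dplyr, applied to the original data frame):
dat %>%
  mutate(across(-all_of(dnot), ~ .x / sum(.x)))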
There are better approaches, but I usually do something like:
set.seed(1234)
x = rnorm(10, 30, .2)
y = runif(10, 3, 5)
z = runif(10, 10, 20)
k = runif(10, 5, 10)
a = rnorm(10, 30, .2)
b = runif(10, 3, 5)
c = runif(10, 10, 20)
d = runif(10, 5, 10)
normalized_columns <- function(x)
{
x/sum(x)
}
dat<-data.frame(x,y,z,k,a,b,c,d)
dat[,c(1,4,6:8)]<-sapply(dat[,c(1,4,6:8)], normalized_columns)
Edit: as far as efficiency goes, this is pretty fast:
set.seed(100)
dat<-data.frame(matrix(rnorm(50000, 5, 2), nrow = 100, ncol = 500))
cols<-sample.int(500, 495, replace = F)
system.time(dat[,cols]<-sapply(dat[,cols], normalized_columns))
##  user  system elapsed
##  0.03    0.00    0.03
I have a dataset like this:
df <- data.frame(ID=1:10, baseline = c(1.8,2.4,3.2,2.3,2.1,2.2,3,2.8,2,2.9))
I want to create a new column called "response"; it should be computed from the following equation:
individual response = individual baseline + 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
In this equation, t is generated from this vector:
t=rep(seq(0,24,by=0.1))
so for each ID, there should be 241 responses generated. How could I generate the new dataset containing ID, baseline, time, and response?
Another approach:
t <- rep(seq(0, 24, by = 0.1), each = nrow(df))
vals <- 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
new_df <- cbind(df, t, response = df$baseline + vals)
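A related base R variant (my sketch) relies on merge producing a Cartesian product when the two frames share no column names:
grid <- merge(df, data.frame(t = seq(0, 24, by = 0.1)))  # cross join: 10 x 241 rows
grid$response <- grid$baseline + 0.5 * sin(2 * 3.14 * (grid$t - 7.5) / 24)
dim(grid)
## [1] 2410    4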
Try
library(reshape2)
res <- melt(apply(df[, 2, drop = FALSE], 1,
                  function(x) x + 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)))
indx <- rep(1:nrow(df), each=241)
df1 <- cbind(df[indx,], time= rep(t, nrow(df)), response=res[,3])
row.names(df1) <- NULL
dim(df1)
#[1] 2410 4
head(df1,3)
# ID baseline time response
#1 1 1.8 0.0 1.337870
#2 1 1.8 0.1 1.333034
#3 1 1.8 0.2 1.328518
Or
t <- seq(0,24, by=0.1)
indx <- rep(1:nrow(df), each=length(t))
df2 <- within(df[indx, ], {
  response <- baseline + 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
  time <- t
})
row.names(df2) <- NULL
all.equal(df1, df2)
#[1] TRUE