Assume I have a data frame df with variables A, B and C in it. I would like to create 3 more corresponding columns with names A_ranked, B_ranked and C_ranked. It doesn't matter how I will fill them for the sake of this question, so let's assume that I will set them all to 5. I tried the following code:
for (i in 1:length(df)){
df%>%mutate(
paste(colnames(df)[i],"ranked", sep="_")) = 5
}
I also tried:
for (i in 1:length(df)){
df%>%mutate(
as.vector(paste(colnames(df)[i],"ranked", sep="_")) = 5
}
And:
for (i in 1:length(df)){
df$paste(colnames(df)[i],"ranked", sep="_")) = 5
}
None of them seems to work. Can somebody please tell me the correct way to do this?
Here is a data.table option using the iris data set (here we create 4 more columns based on colnames of existing columns).
# data
df <- iris[, 1:4]
str(df)
# new columns
library(data.table)
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := 5][]
# output
     Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_ranked
  1:          5.1         3.5          1.4         0.2                   5
  2:          4.9         3.0          1.4         0.2                   5
  3:          4.7         3.2          1.3         0.2                   5
  4:          4.6         3.1          1.5         0.2                   5
  5:          5.0         3.6          1.4         0.2                   5
 ---
146:          6.7         3.0          5.2         2.3                   5
147:          6.3         2.5          5.0         1.9                   5
148:          6.5         3.0          5.2         2.0                   5
149:          6.2         3.4          5.4         2.3                   5
150:          5.9         3.0          5.1         1.8                   5
     Sepal.Width_ranked Petal.Length_ranked Petal.Width_ranked
  1:                  5                   5                  5
  2:                  5                   5                  5
  3:                  5                   5                  5
  4:                  5                   5                  5
  5:                  5                   5                  5
 ---
146:                  5                   5                  5
147:                  5                   5                  5
148:                  5                   5                  5
149:                  5                   5                  5
150:                  5                   5                  5
# If you want to fill new columns with different values you can try something like
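# (start again from df <- iris[, 1:4] so that colnames(df) holds only the four original columns)
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := list(Sepal.Length / 2,
                                                              Sepal.Width / 2,
                                                              Petal.Length / 2,
                                                              Petal.Width / 2)][]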
This should work:
df[paste(names(df), "ranked", sep = "_")] <- 5
df
# A B C A_ranked B_ranked C_ranked
# 1 1 2 3 5 5 5
Data:
df <- data.frame(A = 1, B = 2, C = 3)
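If you would rather keep the loop from the question, here is a minimal sketch in base R. The point is that $ expects a literal name, while [[ accepts a string built at run time. The names nm and new_name are my own and not from the question.

```r
df <- data.frame(A = 1, B = 2, C = 3)

# names(df) is evaluated once, before the loop adds any columns
for (nm in names(df)) {
  new_name <- paste(nm, "ranked", sep = "_")
  df[[new_name]] <- 5   # [[ ]] accepts a computed column name, $ does not
}

names(df)
```

The same pattern works for any right-hand side, for example rank(df[[nm]]) instead of 5.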
Does this help?
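library(dplyr)
dat <- data.frame(A=5,B=5,C=5)
dat %>%
  # mutate_each() and funs() are deprecated in current dplyr; across() replaces them
  mutate(across(everything(), sum, .names = "{.col}_ranked")) %>%
  head()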
Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
Seems to be best handled with the vectorized function is.na<-
is.na(df1$amb) <- df1$sensor %in% c(1) # that c() isn't needed
But to be fully general and to compare floating-point numbers for equality properly, test the absolute difference against a small tolerance instead:
is.na(df1$amb) <- abs(df1$sensor - 1) < 1e-8
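To see why a tolerance can matter when the grouping column holds computed doubles, here is a small illustration. The vector x and the tolerance 1e-8 are arbitrary choices of mine. For integer-like sensor codes such as 1, 2, 3 the exact comparison is fine.

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in double precision
x <- c(0.1 + 0.2, 0.3, 1)

exact   <- x == 0.3             # exact comparison misses the first element
within  <- abs(x - 0.3) < 1e-8  # tolerance comparison catches it

exact
within
```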
So I have a dataframe (my.df) which I have grouped by the variable "strat". Each row consists of numerous variables. Example of what it looks like is below - I've simplified my.df for this example since it is quite large. What I want to do next is draw a simple random sample from each group. If I wanted to draw 5 observations from each group I would use this code:
new_df <- my.df %>% group_by(strat) %>% sample_n(5)
However, I have a different specified sample size that I want to sample for each group. I have these sample sizes in a vector nj.
nj <- c(3, 4, 2)
So ideally, I would want 3 observations from my first stratum, 4 observations from my second stratum and 2 observations from my last stratum. I'm not sure if I can sample by group using each unique sample size (without having to write out "sample" however many times I need to)? Thanks in advance!
my.df looks like:
var1 var2 strat
15 3 1
13 5 3
8 6 2
12 70 3
11 10 1
14 4 2
You can use stratified from my "splitstackshape" package.
Here's some sample data:
set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
var2 = runif(20),
strat = sample(3, 20, TRUE))
table(my.df$strat)
#
# 1 2 3
# 5 9 6
Here's how you can use stratified:
library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
# var1 var2 strat
# 1: 72 0.7942399 1
# 2: 39 0.1862176 1
# 3: 50 0.6684667 1
# 4: 21 0.2672207 2
# 5: 69 0.4935413 2
# 6: 91 0.1255551 2
# 7: 78 0.4112744 2
# 8: 7 0.3403490 3
# 9: 27 0.9347052 3
table(.Last.value$strat)
#
# 1 2 3
# 3 4 2
Since your data is too small to sample from, let us consider this example on the iris dataset:
library(tidyverse)
nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))
# A tibble: 14 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.6 3.1 1.5 0.2 setosa
2 4.4 3 1.3 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 6 2.7 5.1 1.6 versicolor
5 6.3 2.5 4.9 1.5 versicolor
6 5.8 2.6 4 1.2 versicolor
7 6.1 2.9 4.7 1.4 versicolor
8 5.8 2.7 4.1 1 versicolor
9 6.4 2.8 5.6 2.2 virginica
10 6.9 3.2 5.7 2.3 virginica
11 6.2 3.4 5.4 2.3 virginica
12 6.9 3.1 5.1 2.3 virginica
13 6.7 3 5.2 2.3 virginica
14 7.2 3.6 6.1 2.5 virginica
You can bring nj values to sample in the dataframe and then use sample_n by group.
library(dplyr)
df %>%
mutate(nj = nj[strat]) %>%
group_by(strat) %>%
sample_n(size = min(first(nj), n()))
Note that the above works because strat has value 1, 2, 3. For a general solution when the group does not have such values you could use :
df %>%
mutate(nj = nj[match(strat, unique(strat))]) %>%
group_by(strat) %>%
sample_n(size = min(first(nj), n()))
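For comparison, the same idea can be sketched in base R without any packages: split by group, sample each piece with its own size, and bind the pieces back together. The mock data below is mine (group sizes fixed at 5, 9 and 6), and nj is named by stratum so the lookup does not depend on group order.

```r
set.seed(1)
my.df <- data.frame(var1  = sample(100, 20, TRUE),
                    var2  = runif(20),
                    strat = rep(1:3, times = c(5, 9, 6)))
nj <- c("1" = 3, "2" = 4, "3" = 2)

groups <- split(my.df, my.df$strat)

# sample each group with its own size, capped at the group's row count
sampled <- Map(function(g, n) g[sample(nrow(g), min(n, nrow(g))), , drop = FALSE],
               groups, nj[names(groups)])

new_df <- do.call(rbind, sampled)
table(new_df$strat)
```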
I am trying to do an inner join using data.table that has multiple, fairly dynamic conditions, and I am getting tripped up on the syntax. First, I create two objects, x and x2, that I want to inner join.
library(data.table)
set.seed(1)
#generate data
x = data.table(CJ(t=1:10, d=1:3,p1s=seq(1,3,by=0.1),p1sLAST=seq(1,3,by=0.1)))
x[d==1,p1sLAST:=3]
x=x[p1s<=p1sLAST]
x2 = data.table(CJ(tprime=1:10, p1sLASTprm=seq(1,3,by=0.1)))
With the objects:
> x
t d p1s p1sLAST
1: 1 1 1.0 3.0
2: 1 1 1.0 3.0
3: 1 1 1.0 3.0
4: 1 1 1.0 3.0
5: 1 1 1.0 3.0
---
9026: 10 3 2.8 2.9
9027: 10 3 2.8 3.0
9028: 10 3 2.9 2.9
9029: 10 3 2.9 3.0
9030: 10 3 3.0 3.0
> x2
tprime p1sLASTprm
1: 1 1.0
2: 1 1.1
3: 1 1.2
4: 1 1.3
5: 1 1.4
---
206: 10 2.6
207: 10 2.7
208: 10 2.8
209: 10 2.9
210: 10 3.0
Now, I want to do these last three steps in a single inner join.
joined = x[,x2[],by=names(x)]
joined=joined[p1sLASTprm==p1s & d!=3 | d==3 & p1sLASTprm==3]
joined=joined[tprime==t+1]
Resulting in the final output:
> joined
t d p1s p1sLAST tprime p1sLASTprm
1: 1 1 1.0 3.0 2 1.0
2: 1 1 1.1 3.0 2 1.1
3: 1 1 1.2 3.0 2 1.2
4: 1 1 1.3 3.0 2 1.3
5: 1 1 1.4 3.0 2 1.4
---
4343: 9 3 2.8 2.9 10 3.0
4344: 9 3 2.8 3.0 10 3.0
4345: 9 3 2.9 2.9 10 3.0
4346: 9 3 2.9 3.0 10 3.0
4347: 9 3 3.0 3.0 10 3.0
I do not think a single inner join can accomplish those 3 steps, since the condition contains a | and a union of results will most likely be required.
A more memory efficient approach could be:
ux <- unique(x)[, upt := t+1]
rbindlist(list(
ux[d!=3][x2,
c(mget(names(ux)), mget(names(x2))),
on=c("p1s"="p1sLASTprm", "upt"="tprime"),
nomatch=0L],
ux[d==3][x2[p1sLASTprm==3],
c(mget(names(ux)), mget(names(x2))),
on=c("upt"="tprime"),
nomatch=0L]
))
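To illustrate the "one join per branch of the |, then stack" idea on something small, here is a base R sketch on toy data of my own, using merge rather than data.table. It only checks that the two approaches return the same number of rows and is not a drop-in replacement for the code above.

```r
x  <- data.frame(t = c(1, 1, 2), d = c(1, 3, 3), p1s = c(1, 2, 2))
x2 <- data.frame(tprime = c(2, 2, 3, 3), p1sLASTprm = c(1, 3, 2, 3))

# Approach 1: full cross join, then filter (memory hungry on large tables)
cross <- merge(x, x2, by = NULL)
cross <- cross[(cross$p1sLASTprm == cross$p1s & cross$d != 3 |
                cross$d == 3 & cross$p1sLASTprm == 3) &
               cross$tprime == cross$t + 1, ]

# Approach 2: one equi-join per branch of the OR, then count the stacked rows
x$upt <- x$t + 1
part1 <- merge(x[x$d != 3, ], x2,
               by.x = c("upt", "p1s"), by.y = c("tprime", "p1sLASTprm"))
part2 <- merge(x[x$d == 3, ], x2[x2$p1sLASTprm == 3, ],
               by.x = "upt", by.y = "tprime")

nrow(cross) == nrow(part1) + nrow(part2)
```

The two branches are mutually exclusive here (d != 3 versus d == 3), so stacking them cannot produce duplicates.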
I have a dataframe df of 2M+ rows that looks like this:
LOCUS POS COUNT
1: CP007539.1 1 4
2: CP007539.1 2 7
3: CP007539.1 3 10
4: CP007539.1 4 15
5: CP007539.1 5 21
6: CP007539.1 6 28
Currently I am using this in order to remove the first and last 1000 rows:
> df_adj = df[head(df$POS,-1000),]
> df_adj = df_adj[tail(df_adj$POS,-1000),]
But even I can see that this can be done in a better way. Suggestions?
You can do this by specifying the range of rows you want to keep in the final dataset:
df_adj <- df[1001:( nrow(df) - 1000 ),]
Just make sure you have enough rows to perform this. A safer approach might be:
df_adj <- if( nrow(df) > 2000 ) df[1001:( nrow(df) - 1000 ),] else df
Another way would be to combine seq_len with nrow and range. Note that this drops only the first and the last row, as shown here on a 10-row example:
df <- head(iris, 10)
df[-range(seq_len(nrow(df))), ]
which will generate the output below:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5.0 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
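To drop n rows from each end rather than a single row, one option is to chain tail and head with negative counts. This is a small sketch on 10 rows of iris with n = 2. The names n and trimmed are mine.

```r
df <- head(iris, 10)
n  <- 2

# tail(df, -n) drops the first n rows, head(..., -n) then drops the last n
trimmed <- head(tail(df, -n), -n)

rownames(trimmed)
```

If df has 2 * n rows or fewer, this returns a data frame with zero rows instead of failing.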
I want to transform a dataset based on certain conditions. These conditions are given in another dataset. Let me explain it using an example.
Suppose I've a dataset in the following format:
Date Var1 Var2
3/1/2016 8 14
3/2/2016 7 8
3/3/2016 7 6
3/4/2016 10 8
3/5/2016 5 10
3/6/2016 9 15
3/7/2016 2 5
3/8/2016 6 14
3/9/2016 8 15
3/10/2016 8 8
And the following dataset has the transformation conditions and is in the following format:
Variable Trans1 Trans2
Var1 1||2 0.5||0.7
Var2 1||2 0.3||0.8
Now, I want to extract the first conditions from the transformation table for Var1, 1 and 0.5, then add 1 to Var1 and multiply the result by 0.5. I'll do the same for Var2: add 1 and multiply by 0.3. This transformation will give me the new variables Var1_1 and Var2_1. I'll do the same thing for the other transformation, which will give me Var1_2 and Var2_2. For Var1_2, the transformation is Var1 plus 2, multiplied by 0.7.
After the transformation, the dataset will look like the following:
Date Var1 Var2 Var1_1 Var2_1 Var1_2 Var2_2
3/1/2016 8 14 4.5 4.5 7 11.2
3/2/2016 7 8 4 2.7 6.3 7
3/3/2016 7 6 4 2.1 6.3 5.6
3/4/2016 10 8 5.5 2.7 8.4 7
3/5/2016 5 10 3 3.3 4.9 8.4
3/6/2016 9 15 5 4.8 7.7 11.9
3/7/2016 2 5 1.5 1.8 2.8 4.9
3/8/2016 6 14 3.5 4.5 5.6 11.2
3/9/2016 8 15 4.5 4.8 7 11.9
3/10/2016 8 8 4.5 2.7 7 7
Given that your original data.frame is called df and your conditions table cond1 then we can create a custom function,
funV1Cond1 <- function(x){
  t1 <- as.numeric(gsub("[||].*", "", cond1$Trans1[cond1$Variable == "Var1"]))
  t2 <- as.numeric(gsub("[||].*", "", cond1$Trans2[cond1$Variable == "Var1"]))
  result <- (x$Var1 + t1)*t2
  return(result)
}
funV1Cond1(df)
#[1] 4.5 4.0 4.0 5.5 3.0 5.0 1.5 3.5 4.5 4.5
The second function works the same way, keeping the part after the delimiter instead:
funV1Cond2 <- function(x){
  t1 <- as.numeric(gsub(".*[||]", "", cond1$Trans1[cond1$Variable == "Var1"]))
  t2 <- as.numeric(gsub(".*[||]", "", cond1$Trans2[cond1$Variable == "Var1"]))
  result <- (x$Var1 + t1)*t2
  return(result)
}
funV1Cond2(df)
#[1] 7.0 6.3 6.3 8.4 4.9 7.7 2.8 5.6 7.0 7.0
Assuming that the Trans1 column has 3 conditions, i.e. 1, 2, 3, then (str_split comes from the stringr package),
library(stringr)
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[2]))
#[1] 2
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[1]))
#[1] 1
as.numeric(sapply(str_split(cond1$Trans1[cond1$Variable == "Var1"], ','),function(x) x[3]))
#[1] 3
Note that I changed the delimiter to a ','
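If you do not want one function per variable and condition, the whole thing can be sketched as a loop over the conditions table. This assumes df and cond1 look like the tables in the question (only the first four rows are reproduced here), that the original "||" delimiter is kept, and that Trans1 and Trans2 always hold the same number of entries. The names add and mul are my own.

```r
df <- data.frame(Var1 = c(8, 7, 7, 10), Var2 = c(14, 8, 6, 8))
cond1 <- data.frame(Variable = c("Var1", "Var2"),
                    Trans1   = c("1||2", "1||2"),
                    Trans2   = c("0.5||0.7", "0.3||0.8"),
                    stringsAsFactors = FALSE)

for (i in seq_len(nrow(cond1))) {
  v   <- cond1$Variable[i]
  # fixed = TRUE treats "||" literally instead of as a regular expression
  add <- as.numeric(strsplit(cond1$Trans1[i], "||", fixed = TRUE)[[1]])
  mul <- as.numeric(strsplit(cond1$Trans2[i], "||", fixed = TRUE)[[1]])
  for (k in seq_along(add)) {
    df[[paste(v, k, sep = "_")]] <- (df[[v]] + add[k]) * mul[k]
  }
}

df
```

Note that Var2_2 here uses 0.8 as given in the conditions table, so it will not match the Var2_2 column in the question's expected output, whose values correspond to a multiplier of 0.7.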