For example: I have a data frame with 4 columns, and I want to divide columns A and B by C while leaving the ID column unchanged.
A B C ID
4 8 23 1
5 12 325 2
6 23 56 3
73 234 21 4
23 23 213 5
The result I expect is:
A B C ID
0,173913043 0,347826087 1 1
0,015384615 0,036923077 1 2
0,107142857 0,410714286 1 3
3,476190476 11,14285714 1 4
0,107981221 0,107981221 1 5
or the same without column C; it doesn't matter.
The code I have so far returns only columns A and B, without the 'ID' column:
columns_to_divide <- c(1,2)
results <- results[,columns_to_divide ]/results[,3]
We can use mutate, which creates or modifies columns. across selects columns A and B, and the function we pass divides each of them by C. All other columns, including ID, are left untouched.
library(dplyr)
dat %>% mutate(across(c(A, B), function(x) x/C))
A B C ID
1: 0.17391304 0.34782609 23 1
2: 0.01538462 0.03692308 325 2
3: 0.10714286 0.41071429 56 3
4: 3.47619048 11.14285714 21 4
5: 0.10798122 0.10798122 213 5
div = c("A", "B")
div_by = "C"
DF[div] <- DF[div] / DF[[div_by]]
# A B C
# 1 0.17391304 0.34782609 23
# 2 0.01538462 0.03692308 325
# 3 0.10714286 0.41071429 56
# 4 3.47619048 11.14285714 21
# 5 0.10798122 0.10798122 213
Data
DF <- data.frame(
A = c(4, 5, 6, 73, 23), B = c(8, 12, 23, 234, 23), C = c(23, 325, 56, 21, 213)
)
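The DF above has no ID column, so it does not show that ID survives. A minimal sketch, assuming the question's data with its ID values added: only the columns named in div are overwritten, so ID and C stay as they were.

```r
# Question's data, including the ID column that must stay unchanged
DF <- data.frame(
  A = c(4, 5, 6, 73, 23), B = c(8, 12, 23, 234, 23),
  C = c(23, 325, 56, 21, 213), ID = 1:5
)
div <- c("A", "B")
div_by <- "C"
# DF[[div_by]] is a plain vector, so each selected column is divided row by row
DF[div] <- DF[div] / DF[[div_by]]
DF
```

Using `[[` for the divisor matters here: `DF[div_by]` would be a one-column data frame rather than a vector.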
Create Columns
A <- c(4, 5, 6, 73, 23)
B <- c(8, 12, 23, 234, 23)
C <- c(23, 325, 56, 21, 213)
ID <- c(1, 2, 3, 4, 5)
Add to data frame
df = data.frame(A, B, C, ID)
divide by and print
df$A <- df$A / df$C
df$B <- df$B / df$C
df$C <- df$C / df$C
print(df)
I have a dataset like this:
risk earthquake  platarea  magnitude  area
0.4              no        5          30
0.5              no        6          20
5.5              yes       6          20
I would like to create a new column, so I wrote this code:
df$newrisk <- 0.5*df$magnitude + 0.6*df$area + 3*df$platarea
I got an error message for df$platarea.
The risk should only increase when platarea is "yes".
How can I code that? The code works if I omit df$platarea, but I would also like to include it and don't know how.
We can create a logical vector
i1 <- df$platarea == "yes"
df$newrisk[i1] <- with(df, 0.5 * magnitude[i1] + 0.6 * area[i1] + 3)
If the only change is the + 3 term, multiply it by the logical vector: FALSE is treated as 0 and returns 0, while TRUE (for 'yes') is treated as 1 and returns 3, since 3 * 1 = 3.
df$newrisk <- with(df, 0.5 * magnitude + 0.6 * area + 3 * i1)
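To see the logical-vector trick end to end, here is a small sketch built on the three rows from the question (the risk column is left out because it is not used in the formula):

```r
# The three rows from the question
df <- data.frame(
  platarea  = c("no", "no", "yes"),
  magnitude = c(5, 6, 6),
  area      = c(30, 20, 20)
)
i1 <- df$platarea == "yes"   # FALSE FALSE TRUE
as.numeric(i1)               # 0 0 1: how the logical behaves in arithmetic
# The + 3 term only contributes where platarea is "yes"
df$newrisk <- with(df, 0.5 * magnitude + 0.6 * area + 3 * i1)
df$newrisk
# 20.5 15.0 18.0
```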
There are three common ways to add a new column to a data frame in R:
Use the $ Operator
df$new <- c(3, 3, 6, 7, 8, 12)
Use Brackets
df['new'] <- c(3, 3, 6, 7, 8, 12)
Use Cbind
df_new <- cbind(df, new)
I leave some examples for further explanation:
#create data frame
df <- data.frame(a = c('A', 'B', 'C', 'D', 'E'),
b = c(45, 56, 54, 57, 59))
#view data frame
df
a b
1 A 45
2 B 56
3 C 54
4 D 57
5 E 59
Example 1: Use the $ Operator
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df$new <- new
#view new data frame
df
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8
Example 2: Use Brackets
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df['new'] <- new
#view new data frame
df
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8
Example 3: Use Cbind
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df_new <- cbind(df, new)
#view new data frame
df_new
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8
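One thing the examples above do not show is what happens when the new vector does not match the number of rows. A short sketch, reusing the same df: a six-value vector is rejected for a five-row data frame, a single value is recycled, and cbind can name the column through its argument.

```r
df <- data.frame(a = c("A", "B", "C", "D", "E"),
                 b = c(45, 56, 54, 57, 59))

# Six values for five rows: the assignment fails, so catch the error
res <- tryCatch({
  df$new <- c(3, 3, 6, 7, 8, 12)
  "ok"
}, error = function(e) "error")
res

# A single value is recycled to every row
df$flag <- 0

# cbind takes the column name from the argument name
df_new <- cbind(df, new = c(3, 3, 6, 7, 8))
names(df_new)
```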
I have many numerical scalars/vectors, like:
a <- 1
b <- c(2,4)
c <- c(5,6,7)
d <- c(60, 556, 30, 4, 5556, 111232)
Now I need to add 1 to every number in each scalar/vector and insert the result directly after that number. The solution should work with any numerical scalar or vector. So the result should look like:
a <- c(1,2)
b <- c(2,3,4,5)
c <- c(5,6,6,7,7,8)
d <- c(60, 61, 556, 557, 30, 31, 4, 5, 5556, 5557, 111232, 111233)
How can this be done?
lst <- list(
a = 1,
b = c(2,4),
c = c(5,6,7),
d = c(60, 556, 30, 4, 5556, 111232))
lapply(lst, function(x) as.vector(rbind(x, x + 1)))
# $a
# [1] 1 2
#
# $b
# [1] 2 3 4 5
#
# $c
# [1] 5 6 6 7 7 8
#
# $d
# [1] 60 61 556 557 30 31 4 5 5556 5557 111232 111233
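Why this interleaves: rbind stacks x and x + 1 as the two rows of a matrix, and as.vector reads a matrix column by column, so each original value is immediately followed by its successor. A small sketch of the intermediate step for one vector:

```r
x <- c(5, 6, 7)
m <- rbind(x, x + 1)          # 2 x 3 matrix: row 1 is x, row 2 is x + 1
m
interleaved <- as.vector(m)   # matrices are read in column-major order
interleaved
# 5 6 6 7 7 8
```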
This is pretty much a dupe of this, but not exactly so I'll let someone else make the call.
I have a dataframe like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
I want to calculate the means of the value column over specific sets of rows.
The pattern of the rows is pretty complicated:
Rows of MeanA1: 1, 5, 9
Rows of MeanA2: 2, 6, 10
Rows of MeanA3: 3, 7, 11
Rows of MeanA4: 4, 8, 12
Rows of MeanB1: 13, 17, 21
Rows of MeanB2: 14, 18, 22
Rows of MeanB3: 15, 19, 23
Rows of MeanB4: 16, 20, 24
Rows of MeanC1: 25, 29, 33
Rows of MeanC2: 26, 30, 34
Rows of MeanC3: 27, 31, 35
Rows of MeanC4: 28, 32, 36
Rows of MeanD1: 37, 41, 45
Rows of MeanD2: 38, 42, 46
Rows of MeanD3: 39, 43, 47
Rows of MeanD4: 40, 44, 48
As you can see, the groups start at 4 different points (1, 13, 25, 37), each mean takes every 4th row from its start, and the following 3 means in a block just step one row further down.
I would like to have an output of all these means in one list.
Any ideas? NOTE: In this example the mean is of course always the middle number, but my real df is different.
I'm not quite sure about the output format you require, but the following code calculates what you want.
calc_mean1 <- function(x) mean(test$value[seq(x, by = 4, length.out = 3)])
calc_mean2 <- function(x){sapply(x:(x+3), calc_mean1)}
output <- lapply(seq(1, 37, 12), calc_mean2)
names(output) <- paste0('Mean', LETTERS[seq_along(output)]) # remove this line if more than 26 groups.
output
## $MeanA
## [1] 5 6 7 8
## $MeanB
## [1] 17 18 19 20
## $MeanC
## [1] 29 30 31 32
## $MeanD
## [1] 41 42 43 44
An idea via base R is to create a grouping variable for every 4 rows, split the data every 12 rows (nrow(test) / 4) and aggregate to find the mean, i.e.
test$new = rep(1:4, nrow(test)%/%4)
lapply(split(test, rep(1:4, each = nrow(test) %/% 4)), function(i)
aggregate(value ~ new, i, mean))
# $`1`
# new value
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
# $`2`
# new value
# 1 1 17
# 2 2 18
# 3 3 19
# 4 4 20
# $`3`
# new value
# 1 1 29
# 2 2 30
# 3 3 31
# 4 4 32
# $`4`
# new value
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
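The row pattern can also be written as plain index arithmetic, which may make the grouping easier to check. A sketch, assuming blocks of 12 rows and a step of 4 as in the question; the names block, offset and means are only illustrative:

```r
test <- data.frame(name = paste0("AB", 1:48), value = 1:48)

i <- seq_len(nrow(test))
block  <- LETTERS[(i - 1) %/% 12 + 1]   # A, B, C, D: one letter per 12 rows
offset <- (i - 1) %% 4 + 1              # 1..4: position within each run of 4

# One mean per block/offset pair, named like the question's MeanA1 ... MeanD4
means <- tapply(test$value, paste0("Mean", block, offset), mean)
means[["MeanA1"]]   # rows 1, 5, 9
as.list(means)      # the same 16 means as a named list
```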
And yet another way.
fun <- function(DF, col, step = 4){
  run <- nrow(DF)/step^2
  res <- lapply(seq_len(step), function(inc){
    inx <- seq_len(run*step) + (inc - 1)*run*step
    dftmp <- DF[inx, ]
    tapply(dftmp[[col]], rep(seq_len(step), run), mean, na.rm = TRUE)
  })
  names(res) <- sprintf("Mean%s", LETTERS[seq_len(step)])
  res
}
fun(test, 2, 4)
#$MeanA
#1 2 3 4
#5 6 7 8
#
#$MeanB
# 1 2 3 4
#17 18 19 20
#
#$MeanC
# 1 2 3 4
#29 30 31 32
#
#$MeanD
# 1 2 3 4
#41 42 43 44
Since you said you wanted a long list of the means, I assumed it could also be a vector where you just have all these values. You would get that like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
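meanVector <- NULL
# Start rows are 1-4, 13-16, 25-28 and 37-40; each mean uses rows i, i+4, i+8
starts <- c(outer(1:4, seq(0, nrow(test) - 12, by = 12), "+"))
for (i in starts) {
  x <- c(test$value[i], test$value[i+4], test$value[i+8])
  m <- mean(x)
  meanVector <- c(meanVector, m)
}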
Let's say I want to create a column in a data.table, in which the value in each row is equal to the standard deviation of the values in three other cells in the same row. E.g., if I make
DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
DT
a b c d
1: 1 5 13 25
2: 2 7 16 29
3: 3 9 19 33
4: 4 11 22 37
and I'd like to add a column that contains the standard deviation of a, b, and d for each row, like this:
a b c d abdSD
1: 1 5 13 25 12.86
2: 2 7 16 29 14.36
3: 3 9 19 33 15.87
4: 4 11 22 37 17.39
I could of course write a for-loop or use an apply function to calculate this. Unfortunately, what I actually want to do needs to be applied to millions of rows, isn't as simple a function as calculating a standard deviation, and needs to finish within a fraction of a second, so I really need a vectorized solution. I want to write something like
DT[, abdSD := sd(c(a, b, d))]
but unfortunately that doesn't give the right answer. Is there any data.table syntax that can create a vector out of different values within the same row, and make that vector accessible to a function populating a new cell within that row? Any help would be greatly appreciated. @Arun
Depending on the size of your data, you might want to convert the data into a long format, then calculate the result as follows:
complexFunc <- function(x) sd(x)
cols <- c("a", "b", "d")
rowres <- melt(DT[, rn:=.I], id.vars="rn", variable.factor=FALSE)[,
list(abdRes=complexFunc(value[variable %chin% cols])), by=.(rn)]
DT[rowres, on=.(rn)]
or if your complex function has 3 arguments, you can do something like
DT[, abdSD := mapply(complexFunc, a, b, d)]
As @Frank mentioned, I could avoid adding a column by using by=1:nrow(DT):
DT[, abdSD:=sd(c(a,b,d)),by=1:nrow(DT)]
output:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
If you add a row id column, it becomes very easy:
DT$row_id<-row.names(DT)
Then simply grouping with by=row_id gets you the result you want:
DT[, abdSD:=sd(c(a,b,d)),by=row_id]
Result would have:
a b c d row_id abdSD
1: 1 5 13 25 1 12.85820
2: 2 7 16 29 2 14.36431
3: 3 9 19 33 3 15.87451
4: 4 11 22 37 4 17.38774
If you want row_id removed afterwards, simply append [,row_id:=NULL]:
DT[, abdSD:=sd(c(a,b,d)),by=row_id][,row_id:=NULL]
This single line gives you everything you want:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
You just have to do it by row.
Without by, sd(c(a, b, d)) is evaluated once on the whole columns, so every row gets the same pooled value; grouping by a unique row id makes it run once per row. It's a bit tricky.
Hope this helps
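The difference between the pooled and the per-row calculation can be reproduced without data.table. A base R sketch using the question's a, b and d columns (pooled and per_row are illustrative names):

```r
df <- data.frame(a = 1:4, b = c(5, 7, 9, 11), d = c(25, 29, 33, 37))

# One call over whole columns: a single sd of all 12 values
pooled <- with(df, sd(c(a, b, d)))
length(pooled)

# One call per row: mapply hands over the i-th element of each column
per_row <- with(df, mapply(function(...) sd(c(...)), a, b, d))
round(per_row, 5)
# 12.85820 14.36431 15.87451 17.38774
```

This is still one R-level function call per row, so it illustrates the logic rather than solving the speed requirement from the question.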
I think you should try the matrixStats package:
library(data.table)
library(matrixStats)
#sample data
dt <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=TRUE)), .SDcols=c('a','b','d')]
dt
Output is:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
Not an answer, but just trying to show the difference between using apply and the solution provided by Prem above:
I have blown up the sample data to 40,000 rows to show solid time differences:
library(data.table)
library(matrixStats)
#sample data
dt <- data.table(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
df <- data.frame(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
t0 = Sys.time()
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
print(paste("Time taken for data table operation = ",Sys.time() - t0))
# [1] "Time taken for data table operation = 0.117115020751953"
t0 = Sys.time()
df$abdSD <- apply(df[,c("a","b","d")],1, function(x){sd(x)})
print(paste("Time taken for apply operation = ",Sys.time() - t0))
# [1] "Time taken for apply operation = 2.93488311767578"
Using data.table and matrixStats clearly wins the race.
It's not hard to vectorize the sd for this situation:
vecSD = function(x) {
n = ncol(x)
sqrt((n/(n-1)) * (Reduce(`+`, x*x)/n - (Reduce(`+`, x)/n)^2))
}
DT[, vecSD(.SD), .SDcols = c('a', 'b', 'd')]
#[1] 12.85820 14.36431 15.87451 17.38774
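The function relies on the identity sd = sqrt(n/(n-1) * (mean(x^2) - mean(x)^2)), with the sums taken across columns so that every row is handled at once. A sketch that checks it against a row-wise apply on a plain data frame; centeredSD is a hypothetical alternative, added here because the one-pass formula can lose precision when the values are large relative to their spread:

```r
vecSD <- function(x) {
  n <- ncol(x)
  sqrt((n / (n - 1)) * (Reduce(`+`, x * x) / n - (Reduce(`+`, x) / n)^2))
}

# Hypothetical two-pass variant: subtract each row mean before squaring
centeredSD <- function(x) {
  m <- as.matrix(x)
  sqrt(rowSums((m - rowMeans(m))^2) / (ncol(m) - 1))
}

df <- data.frame(a = 1:4, b = c(5, 7, 9, 11), d = c(25, 29, 33, 37))
row_sd <- unname(apply(df, 1, sd))   # slow reference: one sd() call per row

all.equal(vecSD(df), row_sd)
all.equal(unname(centeredSD(df)), row_sd)
```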
My dummy input vector looks like this:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
What I want: add a group factor to each number. The group is assigned according to the difference between neighbouring numbers.
Example:
The (absolute) difference between 10 and 20 is 10, hence they belong to the same group.
The difference between 30 and 20 is 10, so they belong to the same group.
The difference between 30 and 70 is 40, so they belong to different groups.
Given a maximal difference of 20, the wanted result is:
x group
10 1
20 1
30 1
70 4
80 4
90 4
130 7
190 8
200 8
My code:
library(data.table)
library(foreach)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
x <- data.table(x, group = 1)
y <- nrow(x)
maxGap <- 20
g <- 1
groups <-
foreach(i = 2:y, .combine = rbind) %do% {
if (x[i, x] - x[i - 1, x] < maxGap) {
g
} else {
g <- i
g
}
}
x[2:y]$group <- as.vector(groups)
My question
The given code works, but it is too slow with large data (more than 10 million rows). Is there a simpler and quicker solution that does not use a loop?
library(IRanges)
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
# If the distance between two integers is larger than 30,
# then they would be in two groups. Otherwise, they would
# be in the same group.
ther <- 15
df.1 <- data.frame(val=x, left=x-ther, right=x+ther)
df.ir <- IRanges(df.1$left, df.1$right)
df.ir.re <- findOverlaps(df.ir, reduce(df.ir))
df.1$group <- subjectHits(df.ir.re)
df.1
# val left right group
# 1 10 -5 25 1
# 2 20 5 35 1
# 3 30 15 45 1
# 4 70 55 85 2
# 5 80 65 95 2
# 6 90 75 105 2
# 7 130 115 145 3
# 8 190 175 205 4
# 9 200 185 215 4
An implementation which uses the rleid and shift functions of data.table:
x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
DT <- data.table(x)
DT[, grp := rleid(cumsum(x - shift(x,1L,0) > 20))]
which gives:
> DT
x grp
1: 10 1
2: 20 1
3: 30 1
4: 70 2
5: 80 2
6: 90 2
7: 130 3
8: 190 4
9: 200 4
Explanation: With x - shift(x,1L,0) you calculate the difference with the previous observation of x. By comparing it to 20 (i.e.: the > 20 part) and wrapping that in cumsum and rleid a runlength id is created.
In response to @Roland's comments: you can leave the rleid-part out if you set the fill parameter in shift to -Inf:
DT[, grp := cumsum((x - shift(x, 1L, -Inf)) > 20)]
test <- c(TRUE, diff(x) > 20) #test the differences
res <- factor(cumsum(test)) #groups
#[1] 1 1 1 2 2 2 3 4 4
#Levels: 1 2 3 4
levels(res) <- which(test) #fix levels
res
#[1] 1 1 1 4 4 4 7 8 8
#Levels: 1 4 7 8
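The steps above can be wrapped into one small function. This is only a sketch: group_by_gap and max_gap are hypothetical names, and it treats a difference of exactly max_gap as staying in the same group (the loop in the question starts a new group in that case).

```r
group_by_gap <- function(x, max_gap) {
  starts <- c(TRUE, diff(x) > max_gap)   # TRUE where a new group begins
  which(starts)[cumsum(starts)]          # label each element with its group's first index
}

x <- c(10, 20, 30, 70, 80, 90, 130, 190, 200)
group_by_gap(x, 20)
# 1 1 1 4 4 4 7 8 8
```

Everything here is vectorized (diff, cumsum, which), so no explicit loop over the rows is needed.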