How to create new columns based on other columns' values using R - r

I have a data frame (df) in R, and I am interested in two columns, df$LEFT and df$RIGHT. I would like to create two new columns: df$BEST, holding the smaller of LEFT and RIGHT for each row, and df$WORST, holding the larger.
ID LEFT RIGHT
1 20 70
2 65 15
3 25 65
I would like to obtain this:
ID LEFT RIGHT BEST WORST
1 20 70 20 70
2 65 15 15 65
3 25 65 25 65
How can I do that?

We can use pmin/pmax to get the corresponding minimum, maximum values of the two columns
transform(df, BEST = pmin(LEFT, RIGHT), WORST = pmax(LEFT, RIGHT))
# ID LEFT RIGHT BEST WORST
#1 1 20 70 20 70
#2 2 65 15 15 65
#3 3 25 65 25 65
data
df <- structure(list(ID = 1:3, LEFT = c(20L, 65L, 25L), RIGHT = c(70L,
15L, 65L)), class = "data.frame", row.names = c(NA, -3L))
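Note that transform() returns a modified copy and leaves df itself unchanged unless the result is assigned back. As a minimal sketch (the data frame below simply re-creates the question's example), the same columns can be added in place with direct assignment:

```r
# Example data re-created from the question
df <- data.frame(ID = 1:3, LEFT = c(20, 65, 25), RIGHT = c(70, 15, 65))

# pmin()/pmax() compare element-wise, so each row is handled independently
df$BEST  <- pmin(df$LEFT, df$RIGHT)
df$WORST <- pmax(df$LEFT, df$RIGHT)

# If a value is missing, na.rm = TRUE keeps the non-missing value of the pair
with_na <- pmin(c(20, NA), c(70, 15), na.rm = TRUE)
```

Without na.rm = TRUE, a missing value in either column gives NA for that row.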

An alternative is to use apply over the two columns of interest. Since BEST is the smaller value, it takes min and WORST takes max:
> cols <- c("LEFT", "RIGHT")
> df$BEST <- apply(df[, cols], 1, min)
> df$WORST <- apply(df[, cols], 1, max)
> df
ID LEFT RIGHT BEST WORST
1 1 20 70 20 70
2 2 65 15 15 65
3 3 25 65 25 65
Using @akrun's approach with transform:
> transform(df,
BEST = apply(df[, cols], 1, min),
WORST = apply(df[, cols], 1, max))
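If more than two columns need to be compared, one possible sketch is to pass the selected columns to pmin/pmax through do.call, which avoids the row-by-row loop that apply performs. The vector name value_cols is only an illustration:

```r
# Example data re-created from the question
df <- data.frame(ID = 1:3, LEFT = c(20, 65, 25), RIGHT = c(70, 15, 65))

# Columns to compare; more names can be appended here (illustrative name)
value_cols <- c("LEFT", "RIGHT")

# do.call() hands each selected column to pmin()/pmax() as a separate argument
df$BEST  <- do.call(pmin, df[value_cols])
df$WORST <- do.call(pmax, df[value_cols])
```

Selecting the columns by name also keeps ID, and any column added later, out of the comparison.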

Related

Subtract an observation from another in different column and add a specific value to the result to create a new observation in the first column in R

I have something like this,
A B C
100 24
18
16
21
14
I am trying to write a function that calculates C = A-B for the respective row and then adds 20 to C which is A for the next row and repeats the step and it should be like this at the end.
A B C
100 24 76
96 18 78
98 16 82
102 21 81
101 14 87
I am doing it manually atm like
df$C[1] = df$A[1] - df$B[1] and then
df$A[2] = df$C[1]+20 and repeating it.
I would like to create a function instead of doing this way. Any help would be appreciated.
Here is another approach using a for loop:
data
df <- data.frame(A=NA, B = c(24L, 18L, 16L, 21L, 14L),C=NA)
Initialize first row of df
df$A[1] <- 100
df$C[1] <- df$A[1]-df$B[1]
Populate the remaining rows of df
for (i in 1:(length(df$B)-1)){
df$C[i+1] <- df$C[i]-df$B[i+1]+20
df$A[i+1] <- df$C[i]+20
}
Output
df
A B C
1 100 24 76
2 96 18 78
3 98 16 82
4 102 21 81
5 101 14 87
We can start with only B column and then calculate A and C respectively.
start_value <- 100
df$A <- c(start_value, start_value - cumsum(df$B) + 20 * 1:nrow(df))[-(nrow(df) + 1)]
df$C <- df$A - df$B
df
# B A C
#1 24 100 76
#2 18 96 78
#3 16 98 82
#4 21 102 81
#5 14 101 87
data
df <- structure(list(B = c(24L, 18L, 16L, 21L, 14L)),
class = "data.frame", row.names = c(NA, -5L))
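Since the question asks for a function, the cumsum idea can be wrapped up as follows. This is only a sketch: the name fill_columns and its arguments start and increment are made up for illustration, with defaults taken from the numbers in the question.

```r
# Hypothetical helper: builds A and C from B, a starting value and a fixed increment
fill_columns <- function(B, start = 100, increment = 20) {
  n <- length(B)
  # C[i] = start - (B[1] + ... + B[i]) + increment * (i - 1)
  C <- start - cumsum(B) + increment * (seq_len(n) - 1)
  # Because C = A - B, A can be recovered as C + B
  A <- C + B
  data.frame(A = A, B = B, C = C)
}

res <- fill_columns(c(24, 18, 16, 21, 14))
```

For the question's B values, res should match the expected table, with A starting at 100 and C ending at 87.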

How to randomly select rows from a data frame for which the row skewness is larger than a given value in R

I am trying to select random rows from a data frame with 1000 lines (and six columns) where the skewness of the line is larger than a given value (say Sk > 0.3).
I've generated the following data frame
df=data.frame(replicate(6,sample(10:100,1000,rep=TRUE)))
I can get row skewness from the fBasics package:
rowSkewness(df) gives:
[8] -0.2243295435 0.5306809351 0.0707122386 0.0341447417 0.3339384838 -0.3910593364 -0.6443905090
[15] 0.5603809206 0.4406091534 -0.3736108832 0.0397860038 0.9970040772 -0.7702547535 0.2065830354
But now I need to select, say, 10 rows of the df which have a row skewness greater than, say, 0.1. Maybe with
for (a in 1:10) {
sample.data[a,] = sample(x=df[which(rowSkewness(df[sample(1:nrow(df),1)>0.1),], size = 1, replace = TRUE)
}
or something like this?
Any thoughts on this will be appreciated.
thanks in advance.
You can use the sample_n() or sample_frac() function from dplyr, which makes your version a little shorter:
library(dplyr) # provides %>%, filter() and sample_n()
library(fBasics)
df <- data.frame(replicate(6, sample(10:100, 1000, replace = TRUE)))
x <- df %>% dplyr::filter(rowSkewness(df) > 0.1) %>% dplyr::sample_n(10)
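Got it:
x <- df %>% dplyr::filter(rowSkewness(df) > 0.1)
samplesize <- 10
# sample row indices; calling sample() on the data frame itself would pick columns, not rows
sample.data <- x[sample(nrow(x), samplesize, replace = TRUE), ]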
Just do a subset:
res1 <- DF[fBasics::rowSkewness(DF) > .1, ]
head(res1)
# X1 X2 X3 X4 X5 X6
# 7 56 28 21 93 74 24
# 8 33 56 23 44 10 12
# 12 29 19 29 38 94 95
# 13 35 51 54 98 66 10
# 14 12 51 24 23 36 68
# 15 50 37 81 22 55 97
Or with e1071::skewness:
res2 <- DF[apply(as.matrix(DF), 1, e1071::skewness) > .1, ]
stopifnot(all.equal(res1, res2))
Data
set.seed(42); DF <- data.frame(replicate(6, sample(10:100, 1000, rep=TRUE)))
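The subset keeps every qualifying row; the random draw the question asks for is one more indexing step. The sketch below avoids package dependencies by using a hand-written helper, row_skew, which is an assumption for illustration only. It uses a plain moment-based formula, so its values may differ slightly from those of fBasics::rowSkewness or e1071::skewness.

```r
# Hypothetical helper: moment-based skewness of one numeric vector
row_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
DF <- data.frame(replicate(6, sample(10:100, 1000, replace = TRUE)))

# Keep the rows whose skewness exceeds the threshold; which() drops any NA results
skew <- apply(as.matrix(DF), 1, row_skew)
skewed <- DF[which(skew > 0.1), ]

# Draw 10 distinct rows at random from the qualifying rows
picked <- skewed[sample(nrow(skewed), 10), ]
```

Passing replace = TRUE to sample() would allow the same row to be drawn more than once.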

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am trying to get a generalized solution where the number of columns can be large, so I want to multiply every n.x column by its matching n.y column. While x and y are fixed, n has to be figured out from the data frame.
For simplicity, let's assume n is also fixed, but it is a large number.
One thing I can try is:
abc[,c(6,7)]=abc[,c(2,3)]*abc[,c(4,5)]
It will work only if col positions are contiguous. This is good enough for me. If anyone can have more generalized solution, it will benefit us all.
If there are only a couple of variables to multiply, we can do this with transform by multiplying the columns of interest
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, then one approach is to split the dataset into a list of data.frames by the prefix of the column names, then loop through the list and multiply the columns of each piece element-wise with do.call
abc[paste0("new", 1:2)] <- lapply(split.default(abc[-1],
sub("\\.[a-z]+$", "", names(abc)[-1])), function(x) do.call(`*`, x))
Or another option is (based on the pairwise column multiplication)
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
c(3, 1, 2)), 3, matrixStats::colProds)
mutate() preserves the original variables, and mutate_all() applies a function to every column in your data frame.
abc %>%
mutate(new_vary1 = `1.x` * `1.y`,
new_vary2 = `2.x` * `2.y`) %>%
mutate_all(funs(. * `1.x`))
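When the columns are not contiguous, the pairs can be matched by name instead of by position. The following is a minimal base R sketch, assuming every n.x column has an n.y partner; the variable names x_cols and prefixes are illustrative.

```r
# Example data re-created from the question
abc <- data.frame(ID = c(4, 16, 42, 91, 103),
                  `1.x` = c(10, 5, 16, 20, 103), `2.x` = c(20, 10, 17, 20, 42),
                  `1.y` = c(30, 5, 18, 20, 56),  `2.y` = c(40, 10, 19, 20, 84),
                  check.names = FALSE)

# Derive the prefixes ("1", "2", ...) from the names of the .x columns
x_cols <- grep("\\.x$", names(abc), value = TRUE)
prefixes <- sub("\\.x$", "", x_cols)

# Multiply each <prefix>.x column by its <prefix>.y partner, matched by name
abc[prefixes] <- abc[paste0(prefixes, ".x")] * abc[paste0(prefixes, ".y")]
```

The new columns are named after the prefixes, so abc[["1"]] holds 1.x * 1.y and abc[["2"]] holds 2.x * 2.y.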

Replacing rows of a column in a dataframe conditional to another column in R [duplicate]

Suppose I have a data frame (df) like this:
df:
header1 header2
------ -------
45 76
54 89
- 12
45 32
12 34
- 5
45 34
65 54
I want to get such a dataframe
header1 header2
------ -------
45 76
54 89
- -
45 32
12 34
- -
45 34
65 54
Namely, I want to replace the values in the header2 column with "-" in the rows where header1 is "-".
How can I do that in R? I will be very glad for any help. Thanks a lot.
If both columns of your df are character vectors, you could do:
# You can convert your columns to character with
df[,1:2] <- lapply(df[,1:2], as.character)
df$header2[df$header1 == "-"] <- "-" # Replace values
> df
# header1 header2
#1 45 76
#2 54 89
#3 - -
#4 45 32
#5 12 34
#6 - -
#7 45 34
#8 65 54
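The conversion to character matters because assigning "-" into a factor column that lacks "-" as a level produces NA with a warning. A self-contained sketch of the replacement, with the question's data re-created as character columns and an illustrative index name is_blank:

```r
# Example data from the question; "-" marks a missing entry
df <- data.frame(header1 = c("45", "54", "-", "45", "12", "-", "45", "65"),
                 header2 = c("76", "89", "12", "32", "34", "5", "34", "54"),
                 stringsAsFactors = FALSE)

# Logical index of the rows where header1 holds the placeholder
is_blank <- df$header1 == "-"

# Overwrite header2 only in those rows
df$header2[is_blank] <- "-"
```

All other rows of header2 are left as they were.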
Generally, I would suggest making use of dplyr, as it produces a readable workflow when working on data frames.
set.seed(1)
dta <- data.frame(colA = c(12,22,34,"-",23,"-"),
colB = round(runif(n = 6, min = 1, max = 100),0))
library(dplyr)
library(magrittr) # for the %<>% assignment pipe
dta %<>%
mutate(colB = ifelse(colA == "-", "-", colB)) # keep colB unless colA is "-"
This would give you the following results:
> head(dta)
colA colB
1 12 27
2 22 38
3 34 58
4 - -
5 23 21
6 - -
Side notes
This is a very flexible mechanism, but if you expect the column classes to matter you can simply run mutate_each(funs(as.character)) before applying any other transformations.

Applying function to multiple rows using values from multiple rows

I have created the following simple function in R:
fun <- function(a,b,c,d,e){b+(c-a)*((e-b)/(d-a))}
I want to apply this function to a data.frame that looks something like:
> data.frame("x1"=seq(55,75,5),"x2"=round(rnorm(5,50,10),0),"x3"=seq(30,10,-5))
x1 x2 x3
1 55 51 30
2 60 45 25
3 65 43 20
4 70 57 15
5 75 58 10
I want to apply fun to each separate row to create a new variable x4, but now comes the difficult part (to me at least..): for the arguments d and e I want to use the values x2 and x3 from the next row. So for the first row of the example that would mean: fun(a=55,b=51,c=30,d=45,e=25). I know that I can use mapply() to apply a function to each row, but I have no clue on how to tell mapply that it should use some values from the next row, or whether I should be looking for a different approach than mapply()?
Many thanks in advance!
Use mapply, but shift the fourth and fifth arguments by one row. You can do it manually, or use taRifx::shift.
> dat
x1 x2 x3
1 55 25 30
2 60 58 25
3 65 59 20
4 70 68 15
5 75 43 10
library(taRifx)
> shift(dat$x2)
[1] 58 59 68 43 25
> mapply( dat$x1, dat$x2, dat$x3, shift(dat$x2), shift(dat$x3) , FUN=fun )
[1] 25.00000 -1272.00000 719.00000 -50.14815 26.10000
If you want the last row to be NA rather than wrapping, use wrap=FALSE,pad=TRUE:
> shift(dat$x2,wrap=FALSE,pad=TRUE)
[1] 58 59 68 43 NA
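The manual shift mentioned above can be sketched in base R without taRifx. The helper name next_row is made up for illustration; it pads the last position with NA, so it behaves like the non-wrapping variant.

```r
fun <- function(a, b, c, d, e) b + (c - a) * ((e - b) / (d - a))

# Same values as the dat shown in the answer
dat <- data.frame(x1 = seq(55, 75, 5),
                  x2 = c(25, 58, 59, 68, 43),
                  x3 = seq(30, 10, -5))

# Hypothetical helper: value from the next row, NA for the last row
next_row <- function(x) c(x[-1], NA)

# d and e come from the following row's x2 and x3
dat$x4 <- mapply(fun, dat$x1, dat$x2, dat$x3,
                 next_row(dat$x2), next_row(dat$x3))
```

Because fun uses only vectorized arithmetic, calling fun(dat$x1, dat$x2, dat$x3, next_row(dat$x2), next_row(dat$x3)) directly gives the same result without mapply.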
