Indexing within a dataframe


The task is to multiply all negative numbers in 'df' by 10.
So far I can only multiply everything by 10; as soon as I add an if-statement, everything stops working.
The data frame df:
x <- c('a', 'b', 'c', 'd', 'e')
y <- c(-4,-2,0,2,4)
z <- c(3, 4, -5, 6, -8)
# Join the variables to create a data frame
df <- data.frame(x,y,z)
df
##
x y z
1 a -4 3
2 b -2 4
3 c 0 -5
4 d 2 6
5 e 4 -8
My code so far:
df2 <- df
df2
for(i in 2:ncol(df2)) {
df2[ , i] <- df2[ , i] *10
}
df2

cbind(df[1], 10^(df[-1] < 0) * df[-1])
x y z
1 a -40 3
2 b -20 4
3 c 0 -50
4 d 2 6
5 e 4 -80
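The one-liner works because R coerces the logical matrix df[-1] < 0 to 0/1 when it is used as an exponent, so 10^TRUE is 10 and 10^FALSE is 1. A small sketch of the intermediate steps, using the df from the question (the names neg, mult and res are only illustrative and not part of the answer):

```r
df <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                 y = c(-4, -2, 0, 2, 4),
                 z = c(3, 4, -5, 6, -8))

neg  <- df[-1] < 0                   # logical matrix: TRUE where a value is negative
mult <- 10^neg                       # TRUE -> 10, FALSE -> 1
res  <- cbind(df[1], mult * df[-1])  # scale the numeric columns, keep x as is
res
```

Splitting it up like this makes it easier to inspect neg and mult when the result is not what you expect.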

You can achieve this using the dplyr functions mutate and across.
library(dplyr) # install if required
df %>%
mutate(across(-x, ~ifelse(. < 0, 10 * ., .)))
This says "for all columns except x, multiply by 10 where the value is < 0, otherwise leave the value as is".
Result:
x y z
1 a -40 3
2 b -20 4
3 c 0 -50
4 d 2 6
5 e 4 -80
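If you would rather keep the loop from the question, here is a minimal base R sketch. The failing if-version is not shown in the question, so this assumes the problem was that if() expects a single TRUE/FALSE while a whole column was passed to it; ifelse() tests every element instead.

```r
df <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                 y = c(-4, -2, 0, 2, 4),
                 z = c(3, 4, -5, 6, -8))

df2 <- df
for (i in 2:ncol(df2)) {
  col <- df2[, i]
  # ifelse() is vectorised: each element is tested on its own
  df2[, i] <- ifelse(col < 0, col * 10, col)
}
df2
```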

Related

Returning positive values in R using only vectorization and indexes

I have created a data frame which contains strings and integers. The integers are both positive and negative.
I have to change all the integers to be positive without using for loops or if statements, using only vectorization and indexing. I have written a version with a for loop, but I am a bit stuck on the next part.
df <- data.frame(x = letters[1:5],
y = seq(-4,4,2),
z = c(3,4,-5,6,-8))
This is my loop to convert to positive.
loop_df_fn <- function(data){
for(i in names(data)){
if(is.numeric(data[[i]])){
data[[i]][data[[i]]<0] <- abs(data[[i]][data[[i]]< 0])*10
}
}
return(data)
}
print((loop_df_fn(df)))

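You can use the following. Note that it applies abs(x)*10 to every numeric value, so positive values are multiplied by 10 as well: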
df[] <- lapply(df , \(x) if(is.numeric(x)) abs(x)*10 else x)
Output
x y z
1 a 40 30
2 b 20 40
3 c 0 50
4 d 20 60
5 e 40 80

A tidy solution:
library(dplyr)
df1 <- df %>%
mutate(across(where(is.numeric), ~if_else(.<0, .*-10, .)))

With rapply (note that this first version turns the 0 in y into 1, because 0^0 is 1 in R):
rapply(df, \(x) (x*-10)^(x<0)*x^(x>0), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 1 50
4 d 2 6
5 e 4 80
rapply(df, \(x) replace(x, x<0, x[x<0]*-10), 'numeric', how='replace')
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
Lastly:
ind <- sapply(df, is.numeric)
df[ind][df[ind]<0] <- df[ind][df[ind]<0] * -10
df
x y z
1 a 40 3
2 b 20 4
3 c 0 50
4 d 2 6
5 e 4 80
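The last approach can be wrapped in a small helper. This is only a sketch: the name scale_negatives and its factor argument are made up here, and it assumes the numeric columns contain no NA values (an NA in the logical index would make the assignment fail).

```r
scale_negatives <- function(data, factor = -10) {
  ind <- vapply(data, is.numeric, logical(1))  # which columns are numeric
  neg <- data[ind] < 0                         # logical matrix marking negatives
  data[ind][neg] <- data[ind][neg] * factor    # rescale only those cells
  data
}

df <- data.frame(x = letters[1:5],
                 y = seq(-4, 4, 2),
                 z = c(3, 4, -5, 6, -8))
out <- scale_negatives(df)
out
```

Because the function works on a copy, the original df is left untouched.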

How to change specific values in a dataframe

Could anyone explain how to change the negative values in the dataframe below?
We have been asked to create a data structure that gives the output below.
# > df
# x y z
# 1 a -2 3
# 2 b 0 4
# 3 c 2 -5
# 4 d 4 6
Then we have to use control flow operators and/or vectorisation to multiply only the negative values by 10.
I have tried many different ways but cannot get this to work. When I try to use a loop I get an error because of the letters in column x.

Create indices of the negative values and multiply by 10, i.e.
i1 <- which(df < 0, arr.ind = TRUE)
df[i1] <- as.numeric(df[i1]) * 10
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
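To see what the index looks like, here is a sketch of a variant that restricts the comparison to the numeric columns, so the letters are never compared with 0. The helper name num is an assumption for illustration; note that the col values in i1 are then relative to df[num], not to df.

```r
df <- data.frame(x = c("a", "b", "c", "d"),
                 y = c(-2, 0, 2, 4),
                 z = c(3, 4, -5, 6))

num <- sapply(df, is.numeric)             # FALSE for x, TRUE for y and z
i1  <- which(df[num] < 0, arr.ind = TRUE)
i1                                        # one row per negative cell: (row, col)

# a two-column matrix index addresses individual cells
df[num][i1] <- df[num][i1] * 10
df
```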

First find out the numeric columns of the dataframe and multiply the negative values by 10.
cols <- sapply(df, is.numeric)
# Multiply negative values by 10 and all other values by 1
df[cols] <- df[cols] * ifelse(sign(df[cols]) == -1, 10, 1)
df
# x y z
#1 a -20 3
#2 b 0 4
#3 c 2 -50
#4 d 4 6
Using dplyr -
library(dplyr)
df <- df %>% mutate(across(where(is.numeric), ~. * ifelse(sign(.) == -1, 10, 1)))

How to create a scorecard for a dataframe in R?

I'm trying to create a scorecard for the values relative to the scorecard (both below).
values <- data.frame(A= c(-200,-78,-100,0,-30),
B= c(100,0,-101,-199,-300),
C= c(-400,400,500,-500,250),
D= c(NA,NA,-1000,-1000,-1000),
E= c(1000,1000,1,-1000,-2000))
scorecard <- data.frame(Names = c("A","B","C","D","E"),
"Score5" = c(-100,-200,-300,-400,-500),
"Score3" = c(-50,-100,-150,-200,-250),
"Score1" = c(-25,-50,-75,-100,-125))
values
A B C D E
1 -200 100 -400 NA 1000
2 -78 0 400 NA 1000
3 -100 -101 500 -1000 1
4 0 -199 -500 -1000 -1000
5 -30 -300 250 -1000 -2000
scorecard
Names Score5 Score3 Score1
1 A -100 -50 -25
2 B -200 -100 -50
3 C -300 -150 -75
4 D -400 -200 -100
5 E -500 -250 -125
For my scorecard, if the value:
is < its respective Score5, it gets awarded 5
is > its respective Score5 AND < Score3, but closer to Score5 than it is to Score3, it gets awarded 5
is > its respective Score5 AND < Score3, but closer to Score3 than it is to Score5, it gets awarded 3
is > its respective Score3 AND < Score1, but closer to Score3 than it is to Score1, it gets awarded 3
is > its respective Score3 AND < Score1, but closer to Score1 than it is to Score3, it gets awarded 1
all other values get 0
The desired result is: (image: desired result)
I've tried the following, which requires the package xts (install.packages("xts")), but I didn't quite get there.
pointsfunction <- function(value) {
points <- c()
for(i in names) {
index = which(colnames(value)==i)
data_start <- which(!is.na(value))[1]
points[1:(data_start -1)] <- NA
for(a in (data_start):(length(value))) {
if(value[a] < scorecard[index, 2]) {
points[a] <- -5
} else {
points[a] <- 0
}
}
}
points <- reclass(points, value)
return(points)
}
scorecardpoints <- as.data.frame(lapply(values, pointsfunction))
I got the following error:
Error in if (value[a] < scorecard[index, 2]) { : argument is of length zero
Called from: FUN(X[[i]], ...)
Any ideas?

Here's a dplyr solution. We pivot to long format, join to the scorecard, do comparisons, and pivot the result back to wide. I added an ID column, but you can drop it at the end, if you like.
library(dplyr)
library(tidyr)
values %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = "Names") %>%
left_join(scorecard, by = "Names") %>%
mutate(
result = case_when(
value < (Score5 + Score3) / 2 ~ 5,
value < (Score3 + Score1) / 2 ~ 3,
value < Score1 ~ 1,
is.na(value) ~ NA_real_,
TRUE ~ 0
)
) %>%
pivot_wider(id_cols = id, names_from = Names, values_from = result)
# # A tibble: 5 x 6
# id A B C D E
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 5 NA 0
# 2 2 5 0 0 NA 0
# 3 3 5 3 0 5 0
# 4 4 0 5 5 5 5
# 5 5 0 5 0 5 5
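The case_when conditions rely on the fact that "between Score5 and Score3 but closer to Score5" is the same as "below the midpoint (Score5 + Score3) / 2", and likewise for Score3 and Score1. A minimal sketch of that rule for a single value (score_one and its argument names are hypothetical, introduced only for illustration):

```r
score_one <- function(value, s5, s3, s1) {
  if (is.na(value)) return(NA_real_)
  if (value < (s5 + s3) / 2) return(5)  # below the Score5/Score3 midpoint
  if (value < (s3 + s1) / 2) return(3)  # below the Score3/Score1 midpoint
  if (value < s1) return(1)             # between that midpoint and Score1
  0                                     # everything else
}

# thresholds for column A: Score5 = -100, Score3 = -50, Score1 = -25
score_one(-78, -100, -50, -25)  # -78 < -75, so 5
score_one(-30, -100, -50, -25)  # -37.5 <= -30 < -25, so 1
score_one(0,   -100, -50, -25)  # 0
```

A value that sits exactly on a midpoint falls through to the lower score here, which matches the strict < comparisons in the pipeline.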

The values in your printed example of the values object are not the same as the values in the data.frame you assign to values. E.g. look at the 5th value of A.
You could use a base R approach like this:
# Look up the scorecard values for a name from the scorecard data.frame
get_scorecard_values <- function(name, card) {
as.numeric(card[card$Names == name, c(2,3,4)])
}
# translate scorecard values into breakpoints for scoring intervals
get_breaks <- function(x){
c((x[1]+x[2])/2, (x[2]+x[3])/2, x[3])
}
# the value to assign to each scoring interval
my_scores <- c(5,3,1,0)
# given a vector of values, assign a score value to each based on
# the interval that it falls into
get_scores <- function(x, intervals, scores) {
scores[(findInterval(x, get_breaks(intervals)) + 1L)]
}
# go across the list of names of variables of the values object.
# for each name, get the values and corresponding scorecard values
# and calculate the score values.
sapply(
names(values),
function(val, values, card, scores) {
get_scores(
x = values[[val]],
intervals = get_scorecard_values(name = val, card = card),
scores = scores
)
},
values = values,
card = scorecard,
scores = my_scores
)
A B C D E
[1,] 5 0 5 NA 0
[2,] 5 0 0 NA 0
[3,] 5 3 0 5 0
[4,] 0 5 5 5 5
[5,] 0 5 0 5 5
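To make the findInterval step more concrete, here is a small sketch for column A only. The breakpoints are the midpoints derived from the scorecard row for A, and the x values are those from the values data.frame as defined in the question:

```r
breaks <- c(-75, -37.5, -25)         # (Score5+Score3)/2, (Score3+Score1)/2, Score1
scores <- c(5, 3, 1, 0)              # one score per interval
x      <- c(-200, -78, -100, 0, -30)

idx <- findInterval(x, breaks)       # number of breakpoints <= x
idx                                  # 0 0 0 3 2
res <- scores[idx + 1L]              # shift to 1-based positions in scores
res                                  # 5 5 5 0 1
```

findInterval returns NA for NA input, so missing values stay NA without extra handling.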

I used the dataframe with A5 = -30. Here is a base R solution
scoremat <- as.matrix(scorecard[, -1L])
dimnames(scoremat) <- list(scorecard$Names, names(scorecard)[-1L])
vscore <- function(x, nm, scoremat) {
scores <- c("Score5" = 5, "Score3" = 3, "Score1" = 1)[dimnames(scoremat)[[2L]]]
conds <- scoremat[rep(nm, length(x)), ]
i <- as.integer(apply(abs(x - conds), 1L, which.min))
unname(ifelse(x > conds[, "Score1"] , 0, scores[i]))
}
dscore <- function(df, scoremat) {
as.data.frame(vapply(
names(df),
function(nm, mat) vscore(df[[nm]], nm, mat),
numeric(nrow(df)),
scoremat
))
}
Output
> dscore(values, scoremat)
A B C D E
1 5 0 5 NA 0
2 5 0 0 NA 0
3 5 3 0 5 0
4 0 5 5 5 5
5 1 5 0 5 5
We first create a score matrix as follows
> scoremat
Score5 Score3 Score1
A -100 -50 -25
B -200 -100 -50
C -300 -150 -75
D -400 -200 -100
E -500 -250 -125
Note that your logic simplifies to
for any x in, for example, column A
if x > -25 (i.e. scoremat["A", "Score1"]) then
return 0
else
calculate distance = abs(x - values in row A of scoremat)
return the score where the minimum distance is
That's basically how vscore works. First match the scores
scores <- c("Score5" = 5, "Score3" = 3, "Score1" = 1)[dimnames(scoremat)[[2L]]]
Then, match and repeat the row so that the conds matrix has the same number of rows as the length of vector x.
conds <- scoremat[rep(nm, length(x)), ]
Next, calculate abs(x - conds) and get where the minimum is for each row. For example,
Let x = values$A. Then abs(x - conds) gives the distances and which.min gives i:
   x    conds                   distance                i
        Score5 Score3 Score1    Score5 Score3 Score1
-200      -100    -50    -25       100    150    175    1
-150      -100    -50    -25        50    100    125    1
-100      -100    -50    -25         0     50     75    1
   0      -100    -50    -25       100     50     25    3
 -30      -100    -50    -25        70     20      5    3
Use as.integer to convert no match (this happens when there are NA values in x) into NA values.
i <- as.integer(apply(abs(x - conds), 1L, which.min))
Finally, return the results based on the logic shown above
unname(ifelse(x > conds[, "Score1"] , 0, scores[i]))
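The steps above can be tried in isolation for column A. This sketch uses the x values from the worked table and a one-row score matrix; it contains no NA values, so the as.integer() conversion is not needed here.

```r
scoremat <- matrix(c(-100, -50, -25), nrow = 1,
                   dimnames = list("A", c("Score5", "Score3", "Score1")))
x <- c(-200, -150, -100, 0, -30)

scores <- c(Score5 = 5, Score3 = 3, Score1 = 1)[colnames(scoremat)]
conds  <- scoremat[rep("A", length(x)), , drop = FALSE]  # one row per element of x
dist   <- abs(x - conds)                                 # distance to each threshold
i      <- apply(dist, 1L, which.min)                     # column of the nearest threshold
res    <- unname(ifelse(x > conds[, "Score1"], 0, scores[i]))
res
```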

How to apply a formula one row at a time in R - row 2's values from calculated values of row 1

I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For eg,
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to create columns 'c' and 'd' as follows. The R1 (row 1) value of column 'c' is fixed as 5, but from R2 onwards the value of 'c' is calculated as c (from the previous row) - b (from the previous row). The R1 value of column 'd' is fixed as 10, but from R2 onwards 'd' is calculated as 'c' (from the current row) - d (from the previous row).
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
df <- within(df, {
v1 <- shift(c)
v2 <- shift(d)
c <- v1-shift(b)
d <- c-v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Thank you for your help.

df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in (2:nrow(df))) {
df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define a function to multiply the rows of the dataframes
sumProd <- function(vect1, vect2) {
return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original dataframe
for (i in (2:nrow(df))) {
for(e in 1:nrow(equations)) {
df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
sumProd(equations[e, paste0(c('a','b','c','d'),'_previous')], df[i - 1, c('a','b','c','d')])
}
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
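For this particular simplified pair of formulas the loop can also be avoided. This is only a sketch and will not carry over to arbitrary recurrences: c is a running subtraction of the lagged b, which cumsum() handles, and d depends only on its own previous value and the current c, which Reduce() with accumulate = TRUE handles.

```r
df <- data.frame(a = 1:10, b = 21:30, c = 5, d = 10)
n  <- nrow(df)

# c[i] = c[i-1] - b[i-1]  ->  c[1] minus the running sum of the lagged b
df$c <- df$c[1] - cumsum(c(0, df$b[-n]))

# d[i] = c[i] - d[i-1], starting from d[1]
df$d <- Reduce(function(d_prev, c_i) c_i - d_prev,
               df$c[-1], init = df$d[1], accumulate = TRUE)

df[1:3, ]
```

With 18 columns that depend on each other, the explicit row loop above is probably easier to maintain than a chain of such rewrites.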

A for loop might not be the most elegant way to do it, but it works. Your column c sounds like a simple sequence to me.
This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5,5-(dim(df)[1]-1))
# Use for loop to calculate d
for(i in 2:nrow(df))
{
df$d[i] <- df$c[i] - df$d[i-1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10

Getting NULL values for a multiple column IF statement passed to MAPPLY

I have a data frame of data:
df <- data.frame(x = c(11, 3, 2, 7, 9, 4, 6, 1, 6, 7),
y = c(rep("a",5), rep("b",5)))
df
x y
1 11 a
2 3 a
3 2 a
4 7 a
5 9 a
6 4 b
7 6 b
8 1 b
9 6 b
10 7 b
What I'm trying to do is an IF statement on both columns x and y, where it assigns a new value (z) based on meeting the criteria of x and y.
myfun <- function(x,y) {
if(x < 3 & y=="a") z <- 1
if(x>=3 & x <=7 & y=="a") z <- 2
if(x>7 & y=="a") z <- 3
if(x<3 & y=="b") z <-4
if(x>=3 & x<=1 & y=="b") z <-5
if(x>7 & y=="b") z<-6
}
I am trying to get the following result based on that logic above:
df
x y z
1 11 a 3
2 3 a 2
3 2 a 1
4 7 a 2
5 9 a 3
6 4 b 5
7 6 b 5
8 1 b 4
9 6 b 5
10 7 b 5
I call the function like this:
df$z <- mapply(myfun, df$x, df$x)
This results in:
x y z
1 11 a NULL
2 3 a NULL
3 2 a NULL
4 7 a NULL
5 9 a NULL
6 4 b NULL
7 6 b NULL
8 1 b NULL
9 6 b NULL
10 7 b NULL
I have no idea why. Can someone explain where I am going wrong?

if() is not vectorised: it expects a single TRUE/FALSE condition, as in if(switch=="on"), and is not meant for whole vectors (or columns). For column-wise conditions you should use the ifelse() function. Your first three conditions would become:
myfun <- function(df) {
df$z <- with(df, ifelse(x < 3 & y=="a",1,NA))
df$z <- with(df, ifelse(x>=3 & x <=7 & y=="a",2,df$z))
df$z <- with(df, ifelse(x>7 & y=="a",3,df$z))
...
}
Edit: passing df$x and df$y separately in the function call is probably not necessary. "result <- myfun(df)" would be enough, unless you want x and y to come from different sources.
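If you want to keep the mapply approach, two details in the question's code seem to explain the NULLs: the call passes df$x twice, so y is never "a" or "b", and the function never returns z, so its value is that of the last if(), which is NULL whenever that condition is FALSE. A sketch of a corrected version follows; it assumes that x<=1 in the fifth condition was a typo for x<=7, since that is what the desired output implies.

```r
df <- data.frame(x = c(11, 3, 2, 7, 9, 4, 6, 1, 6, 7),
                 y = c(rep("a", 5), rep("b", 5)))

myfun <- function(x, y) {
  z <- NA_real_                      # fallback if y is neither "a" nor "b"
  if (y == "a") {
    if (x < 3) z <- 1 else if (x <= 7) z <- 2 else z <- 3
  } else if (y == "b") {
    if (x < 3) z <- 4 else if (x <= 7) z <- 5 else z <- 6
  }
  z                                  # return the value explicitly
}

df$z <- mapply(myfun, df$x, df$y)    # second argument is df$y, not df$x
df
```

Here if() is fine because mapply hands the function one x and one y at a time.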