Replace, based on other value in DF - r

I have a data frame and would like to replace certain values if other values in the same row meet a specific condition, e.g.:
DF <- data.frame(a= c(2,4,67),
b= c("TSS",".","TSS"),
c= c(3,46,5),
d= c(45,"-",47))
resulting in:
a b c d
1 2 TSS 3 45
2 4 . 46 -
3 67 TSS 5 47
Now I'd like to replace values in row 2 column c and d with "." and [2,c], respectively, if the value of [2,b] is ".". The result would look like this:
a b c d
1 2 TSS 3 45
2 4 . . 46
3 67 TSS 5 47
I tried using a for loop, but since I have a huge dataset this takes too much time. Is there a better way to solve this problem?

This should work:
DF <- data.frame(
a = c(2, 4, 67),
b = c("TSS", ".", "TSS"),
c = c(3, 46, 5),
d = c(45, "-", 47),
stringsAsFactors = FALSE
)
DF$d[DF$b == "."] <- DF$c[DF$b == "."]
DF$c[DF$b == "."] <- "."
First, in rows where b is ".", we replace the value in d with the value from c. The second line then replaces the value in c with "." in those same rows.
> DF
a b c d
1 2 TSS 3 45
2 4 . . 46
3 67 TSS 5 47
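A minimal sketch of the same replacement written with ifelse(), assuming the DF from the question. The names idx, new_c and new_d are introduced here for illustration only. Both replacement columns are built from the original values before either is assigned, so the order of the assignments no longer matters. Note that column c ends up as character, because "." cannot be stored in a numeric column.

```r
DF <- data.frame(
  a = c(2, 4, 67),
  b = c("TSS", ".", "TSS"),
  c = c(3, 46, 5),
  d = c(45, "-", 47),
  stringsAsFactors = FALSE
)

# Rows where b is "." (hypothetical helper name)
idx <- DF$b == "."

# Build both replacement columns from the untouched originals
new_d <- ifelse(idx, as.character(DF$c), DF$d)
new_c <- ifelse(idx, ".", as.character(DF$c))

# Assign afterwards; the order of these two lines is irrelevant
DF$d <- new_d
DF$c <- new_c
DF
```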

Related

Can't define a new column because it is a factor

I have a dataset like this:
risk earthquake  platarea  magnitude  area
0.4              no        5          30
0.5              no        6          20
5.5              yes       6          20
I would like to create a new column, so I tried this code:
df$newrisk <- 0.5*df$magnitude + 0.6*df$area + 3*df$platarea
I got an error message for df$platarea.
But platarea should only increase the risk when it is "yes".
How can I code that? The code works if I omit df$platarea, but I would also like to include it and don't know how.
We can create a logical vector
i1 <- df$platarea == "yes"
df$newrisk[i1] <- with(df, 0.5 * magnitude[i1] + 0.6 * area[i1] + 3)
If the only change is the + 3 term, multiply it by the logical vector: FALSE is coerced to 0 and contributes nothing, while TRUE (for 'yes') is coerced to 1 and contributes 3, since 3 * 1 = 3.
df$newrisk <- with(df, 0.5 * magnitude + 0.6 * area + 3 *i1)
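As a rough end-to-end sketch, assuming the data from the question with platarea stored as a factor (the column name risk is a stand-in for "risk earthquake", and is_yes is a name made up here): arithmetic on a factor fails, but comparing it to "yes" gives a logical vector that R coerces to 0/1 inside the formula.

```r
# Data from the question; `risk` is an assumed name for "risk earthquake"
df <- data.frame(
  risk      = c(0.4, 0.5, 5.5),
  platarea  = factor(c("no", "no", "yes")),
  magnitude = c(5, 6, 6),
  area      = c(30, 20, 20)
)

# TRUE where platarea is "yes"; coerced to 1/0 when multiplied by 3
is_yes <- df$platarea == "yes"

df$newrisk <- 0.5 * df$magnitude + 0.6 * df$area + 3 * is_yes
df
```

Only the third row has platarea equal to "yes", so it is the only row that receives the extra 3.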
There are three common ways to add a new column to a data frame in R:
Use the $ Operator
df$new <- c(3, 3, 6, 7, 8, 12)
Use Brackets
df['new'] <- c(3, 3, 6, 7, 8, 12)
Use Cbind
df_new <- cbind(df, new)
I leave some examples for further explanation:
#create data frame
df <- data.frame(a = c('A', 'B', 'C', 'D', 'E'),
b = c(45, 56, 54, 57, 59))
#view data frame
df
a b
1 A 45
2 B 56
3 C 54
4 D 57
5 E 59
Example 1: Use the $ Operator
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df$new <- new
#view new data frame
df
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8
Example 2: Use Brackets
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df['new'] <- new
#view new data frame
df
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8
Example 3: Use Cbind
#define new column to add
new <- c(3, 3, 6, 7, 8)
#add column called 'new'
df_new <- cbind(df, new)
#view new data frame
df_new
a b new
1 A 45 3
2 B 56 3
3 C 54 6
4 D 57 7
5 E 59 8

How to apply a formula one row at a time in R - row 2's values from calculated values of row 1

I have a data frame where I need to apply a formula to create new columns. The catch is, I need to calculate these numbers one row at a time. For eg,
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
I now need to create columns 'c' and 'd' as follows. The R1 value of column 'c' is fixed at 5, but from R2 onwards 'c' is calculated as c (from the previous row) - b (from the previous row). The R1 value of column 'd' is fixed at 10, but from R2 onwards 'd' is calculated as 'c' from the current row - 'd' from the previous row.
I want my output to look like this:
A B C D
1 21 5 10
2 22 -16 -26
3 23 -38 -12
And so on. My actual data has over 1000 rows and 18 columns. For every row, 5 of the column values come from different columns of the previous row only (no other rows). And the rest of the column values are calculated from these newly calculated row values. I am quite at a loss in creating a formula that will apply my formulae to each row, calculate values for that row and then move to the next row. I know that I have simplified the problem a bit here, but this captures the essence of what I am attempting.
This is what I attempted:
df <- within(df, {
v1 <- shift(c)
v2 <- shift(d)
c <- v1-shift(b)
d <- c-v2
})
However, I need to apply this only from row 2 onwards and I have no idea how to do that. Because of that, I get something like this:
a b c d v2 v1
1 21 NA NA NA NA
2 22 4 -6 10 5
3 23 4 -6 10 5
I only get these values repeatedly for c, and d (4, -6, 10, 5).
Thank you for your help.
df <- data.frame(a = 1:10, b = 21:30, c = 5:-4, d = 10)
for (i in (2:nrow(df))) {
df[i, "c"] <- df[i - 1, "c"] - df[i - 1, "b"]
df[i, "d"] <- df[i, "c"] - df[i - 1, "d"]
}
df[1:3, ]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
Edit: adapting to your comment
# Let's define the coefficients of the equations into a dataframe
equation1 <- c("c", 0, 0, 0, 0, 0, -1, 1, 0) # c (from previous row) - b(from previous row)
equation2 <- c("d", 0, 0, 1, 0, 0, 0, 0, -1) # d is calculated as 'c' from R2 - d from previous row
equations <- data.frame(rbind(equation1,equation2), stringsAsFactors = F)
names(equations) <- c("y","a","b","c","d","a_previous","b_previous","c_previous","d_previous")
equations
# y a b c d a_previous b_previous c_previous d_previous
# "c" 0 0 0 0 0 -1 1 0
# "d" 0 0 1 0 0 0 0 -1
# define a function to multiply the rows of the data frames (dot product)
sumProd <- function(vect1, vect2) {
return(as.numeric(as.numeric(vect1) %*% as.numeric(vect2)))
}
# Apply the formulas to the original data frame
for (i in (2:nrow(df))) {
for(e in 1:nrow(equations)) {
df[i, equations[e, 'y']] <- sumProd(equations[e, c('a','b','c','d')], df[i, c('a','b','c','d')]) +
sumProd(equations[e, paste0(c('a','b','c','d'),'_previous')], df[i - 1, c('a','b','c','d')])
}
}
df[1:3,]
a b c d
1 1 21 5 10
2 2 22 -16 -26
3 3 23 -38 -12
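For this particular simplified recurrence, the loop can also be expressed without explicit indexing. This is only a sketch under the assumption that 'c' depends solely on the previous 'c' and 'b', and 'd' on the current 'c' and previous 'd', as stated in the question; it will not generalise to arbitrary cross-column dependencies. It uses cumsum() for 'c' and Reduce() with accumulate = TRUE for 'd'.

```r
df <- data.frame(a = 1:10, b = 21:30, c = 5, d = 10)
n <- nrow(df)

# c[i] = c[1] - (b[1] + ... + b[i - 1]), so a running sum of b is enough
df$c <- c(df$c[1], df$c[1] - cumsum(df$b)[-n])

# d[i] = c[i] - d[i - 1]; Reduce carries d[i - 1] forward as the accumulator
df$d <- Reduce(
  function(d_prev, c_now) c_now - d_prev,
  x = df$c[-1],
  init = df$d[1],
  accumulate = TRUE
)

df[1:3, ]
```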
It might not be the most elegant way to do it with a for loop, but it works. Your column c sounds like a simple sequence to me.
This is what I would do:
df <- data.frame(c(1:10),c(21:30),5,10)
names(df) <- c('a','b','c','d')
# Use a simple sequence for c
df$c <- seq(5,5-(dim(df)[1]-1))
# Use for loop to calculate d
# Loop up to and including the last row
for(i in 2:length(df$d))
{
df$d[i] <- df$c[i] - df$d[i-1]
}
> df
a b c d
1 1 21 5 10
2 2 22 4 -6
3 3 23 3 9
4 4 24 2 -7
5 5 25 1 8
6 6 26 0 -8
7 7 27 -1 7
8 8 28 -2 -9
9 9 29 -3 6
10 10 30 -4 -10

R - subtracting different columns if condition is met

I have a huge data frame that is like:
df = data.frame(A = c(1,54,23,2), B=c(1,2,4,65), C=c("+","-","-","+"))
> df
A B C
1 1 1 +
2 54 2 -
3 23 4 -
4 2 65 +
I need to subtract the columns based on different conditions, and add the results in a new column:
A - B if C == +
B - A if C == -
So, my output would be:
> new_df
A B C D
1 1 1 + 0
2 54 2 - -52
3 23 4 - -19
4 2 65 + -63
This assumes that only two conditions, + and -, are in column C.
df$D <- with(df, ifelse(C %in% "+", A - B, B - A))
df
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
Better to add stringsAsFactors = FALSE when you create a data frame. Also, I don't like to use df for variable names since there is a df() function:
df1 <- data.frame(A = c(1, 54, 23, 2),
B = c(1, 2, 4, 65),
C = c("+", "-", "-", "+"),
stringsAsFactors = FALSE)
Assuming that C is only + or -, you can use dplyr::mutate() and test using ifelse():
library(dplyr)
df1 %>%
mutate(D = ifelse(C == "+", A - B, B - A))
using dplyr:
If there are definitely only + and - in the C column you can do:
library(dplyr)
df2 <- df %>%
mutate(D = ifelse(C == '+', A - B, B - A))
I would generally do:
df2 <- df %>%
mutate(D = ifelse(C == '+', A - B,
ifelse(C == '-', B - A, NA)))
Just in case there are some that do not have + or -.
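Another way to guard against unexpected symbols, sketched here in base R, is a named lookup vector: any value of C that is not a name in the lookup yields NA automatically. The names sign_lookup and multiplier are made up for this example.

```r
df <- data.frame(A = c(1, 54, 23, 2),
                 B = c(1, 2, 4, 65),
                 C = c("+", "-", "-", "+"),
                 stringsAsFactors = FALSE)

# +1 keeps A - B, -1 flips it to B - A; unknown symbols map to NA
sign_lookup <- c("+" = 1, "-" = -1)
multiplier <- unname(sign_lookup[as.character(df$C)])

df$D <- (df$A - df$B) * multiplier
df
```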
Alternatively, if you want to evaluate the arithmetic information in column C (as in addition or subtraction), you can use eval(parse(text = ...)) (more about that here: Evaluate expression given as a string).
## Transforming into a matrix (simplifies everything into characters)
df_mat <- as.matrix(df)
## Function for evaluation the rows
eval.row <- function(row) {
eval(parse(text= paste(row[1], row[3], row[2])))
}
## For the first row
eval.row(df_mat[1,])
# [1] 2
## For the whole data frame
apply(df_mat, 1, eval.row)
# [1] 2 52 19 67
## Updating the data.frame
df$D <- apply(df_mat, 1, eval.row)
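If evaluating strings feels too permissive, a possible alternative is to look the operator up as a function with match.fun() and call it directly. This is a sketch, not the answer's method; apply_op is a hypothetical helper, and the whitelist of operators is an assumption added so that arbitrary function names in column C are rejected.

```r
df <- data.frame(A = c(1, 54, 23, 2),
                 B = c(1, 2, 4, 65),
                 C = c("+", "-", "-", "+"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: apply the operator named in `op` to a and b
apply_op <- function(a, op, b) {
  stopifnot(op %in% c("+", "-"))  # only allow the expected operators
  match.fun(op)(a, b)
}

# mapply walks the three columns in parallel, one row at a time
df$D <- mapply(apply_op, df$A, as.character(df$C), df$B)
df
```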
This answer should work for you
https://stackoverflow.com/a/19000310/6395612
You can use with like this:
df['D'] = with(df, ifelse(C=='+', A - B, B - A))
A base solution:
df$D = (df$B-df$A)*sign((df$C=="-")-0.5)
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
can also be written df <- transform(df, D = (B-A)*sign((C=="-")-0.5))

Compare vector to a dataframe

I have a dataframe that looks something like -
test A B C
28 67 4 23
45 82 43 56
34 8 24 42
I need to compare test to the other three columns: for each row, I need the number of values in the other columns that are less than the corresponding value in the test column.
So the desired output is -
test A B C result
28 67 4 23 2
45 82 43 56 1
34 8 24 42 2
When I tried -
comp_vec = "test"
name_vec = c("A", "B", "C")
rowSums(df[, comp_vec] > df[, name_vec])
I get the error -
Error in Ops.data.frame(df[, comp_vec], df[, name_vec]) :
‘>’ only defined for equally-sized data frames
I am looking for a way without replicating test to match size of dataframe.
You can use sapply to compare the df$test column against each of the other three columns. That returns a TRUE/FALSE matrix; taking rowSums of it gives the counts, which you can set as your result column.
df <- data.frame(test = c(28, 45, 34), A = c(67, 82, 8), B = c(4, 43, 24), C = c(23, 56, 42))
df$result <- rowSums(sapply(df[,2:4], function(x) df$test > x))
> df
test A B C result
1 28 67 4 23 2
2 45 82 43 56 1
3 34 8 24 42 2
I noticed your expected result has 82 for the second row of A, whereas it is 5 in your starting example.
df$result <- apply(df, 1, function(x) sum(x < x[1]))
Use apply, specify 1 to indicate by row. x < x[1] will give a vector of TRUE/FALSE if the value at each position in the row is smaller than the first column's value. Use sum to give the number of TRUE values.
# test A B C result
# 1 28 67 4 23 2
# 2 45 82 43 56 1
# 3 34 8 24 42 2
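A short sketch that keeps the comp_vec and name_vec variables from the question: extracting the test column with [[ guarantees a plain vector, and comparing a data frame with a vector of length nrow(df) recycles the vector down each column, so no replication is needed.

```r
df <- data.frame(test = c(28, 45, 34),
                 A = c(67, 82, 8),
                 B = c(4, 43, 24),
                 C = c(23, 56, 42))

comp_vec <- "test"
name_vec <- c("A", "B", "C")

# df[[comp_vec]] is a vector; the comparison yields a logical matrix
df$result <- rowSums(df[name_vec] < df[[comp_vec]])
df
```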

Change the index number of a dataframe

After some manipulation of a data frame, I got a result data frame, but its index is not numbered sequentially, as shown below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
161 AM 86 30.13
171 CM 1 104
18 CO 27 1244.81
19 US 23 1369.61
20 VK 2 245
21 VS 11 1273.82
112 fqa 78 1752.22
24 SN 78 1752.22
I would like to get the result below.
MsgType/Cxr NoOfMsgs AvgElpsdTime(ms)
1 AM 86 30.13
2 CM 1 104
3 CO 27 1244.81
4 US 23 1369.61
5 VK 2 245
6 VS 11 1273.82
7 fqa 78 1752.22
8 SN 78 1752.22
How can I get this?
These are the rownames of your dataframe, which by default are 1:nrow(dfr). When you reordered the dataframe, the original rownames were reordered along with the rows. To have the rows listed sequentially in the new order, just use:
rownames(dfr) <- 1:nrow(dfr)
Or, simply
rownames(df) <- NULL
gives what you want.
> d <- data.frame(x = LETTERS[1:5], y = letters[1:5])[sample(5, 5), ]
> d
x y
5 E e
4 D d
3 C c
2 B b
1 A a
> rownames(d) <- NULL
> d
x y
1 E e
2 D d
3 C c
4 B b
5 A a
The index is actually the data frame row names. To change them, you can do something like:
rownames(dd) = 1:dim(dd)[1]
or
rownames(dd) = 1:nrow(dd)
Personally, I never use rownames.
In your example, I suspect that you don't need to worry about them either, since you are just renaming them 1 to n. In particular, when you subset your data frame the rownames will again be incorrect. For example,
##Simple data frame
R> dd = data.frame(a = rnorm(6))
R> dd$type = c("A", "B")
R> rownames(dd) = 1:nrow(dd)
R> dd
a type
1 2.1434 A
2 -1.1067 B
3 0.7451 A
4 -0.1711 B
5 1.4348 A
6 -1.3777 B
##Basic subsetting
R> dd_sub = dd[dd$type=="A",]
##Rownames are "wrong"
R> dd_sub
a type
1 2.1434 A
3 0.7451 A
5 1.4348 A
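To make the point concrete, here is a small sketch with fixed values instead of rnorm() so the result is reproducible: the subset keeps the original row names, and assigning NULL resets them to 1..n.

```r
# Fixed values stand in for the random numbers used above
dd <- data.frame(a = c(2.1, -1.1, 0.7, -0.2, 1.4, -1.4),
                 type = rep(c("A", "B"), 3))

dd_sub <- dd[dd$type == "A", ]
rownames(dd_sub)          # row names carried over from dd

# Reset to sequential row names after subsetting
rownames(dd_sub) <- NULL
rownames(dd_sub)
```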
