I have a huge data frame that looks like this:
df = data.frame(A = c(1,54,23,2), B=c(1,2,4,65), C=c("+","-","-","+"))
> df
A B C
1 1 1 +
2 54 2 -
3 23 4 -
4 2 65 +
I need to subtract one column from the other, with the direction depending on a condition, and store the result in a new column:
A - B if C == "+"
B - A if C == "-"
So, my output would be:
> new_df
A B C D
1 1 1 + 0
2 54 2 - -52
3 23 4 - -19
4 2 65 + -63
This assumes that only two conditions, + and -, are in column C.
df$D <- with(df, ifelse(C %in% "+", A - B, B - A))
df
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
It is better to add stringsAsFactors = FALSE when you create a data frame (this has been the default since R 4.0.0). Also, I avoid df as a variable name, since there is already a df() function:
df1 <- data.frame(A = c(1, 54, 23, 2),
                  B = c(1, 2, 4, 65),
                  C = c("+", "-", "-", "+"),
                  stringsAsFactors = FALSE)
Assuming that C is only + or -, you can use dplyr::mutate() and test using ifelse():
library(dplyr)
df1 %>%
  mutate(D = ifelse(C == "+", A - B, B - A))
Using dplyr:
If there are definitely only + and - in the C column you can do:
library(dplyr)
df2 <- df %>%
mutate(D = ifelse(C == '+', A - B, B - A))
To be safe, I would generally write:
df2 <- df %>%
  mutate(D = ifelse(C == '+', A - B,
             ifelse(C == '-', B - A, NA)))
This returns NA for any row that has neither + nor -.
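The same guard can be sketched in base R with a named lookup vector instead of nested ifelse() calls. This is only an illustration: the sign_of vector and the extra "*" row are assumptions added here to show what happens to an unexpected symbol.

```r
# Example data with one extra, unexpected symbol ("*") added for illustration
df <- data.frame(A = c(1, 54, 23, 2, 7),
                 B = c(1, 2, 4, 65, 3),
                 C = c("+", "-", "-", "+", "*"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: "+" keeps A - B, "-" flips it to B - A
sign_of <- c("+" = 1, "-" = -1)

# Symbols missing from the lookup give NA, so those rows end up NA in D
df$D <- unname(sign_of[as.character(df$C)]) * (df$A - df$B)
df$D
```

The as.character() call matters if C is a factor: indexing a named vector by a factor would use the integer codes rather than the labels.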
Alternatively, if you want to evaluate the arithmetic operator stored in column C (as in addition or subtraction), you can use eval(parse(text = ...)) (for more on that, see the question "Evaluate expression given as a string").
## Transforming into a matrix (simplifies everything into characters)
df_mat <- as.matrix(df)
## Function for evaluating a single row: builds e.g. "1 + 1" and evaluates it
eval.row <- function(row) {
  eval(parse(text = paste(row[1], row[3], row[2])))
}
## For the first row
eval.row(df_mat[1,])
# [1] 2
## For the whole data frame
apply(df_mat, 1, eval.row)
# [1] 2 52 19 67
## Updating the data.frame
df$D <- apply(df_mat, 1, eval.row)
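Note that eval(parse()) will run whatever text is in the column, so if the operators come from untrusted input a more restrictive variant may be preferable. A minimal sketch, assuming only + and - are allowed; apply_op is a hypothetical helper name, not part of the answer above:

```r
df <- data.frame(A = c(1, 54, 23, 2),
                 B = c(1, 2, 4, 65),
                 C = c("+", "-", "-", "+"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: look the operator up by name instead of parsing text
apply_op <- function(a, op, b) {
  stopifnot(op %in% c("+", "-"))   # refuse anything outside the whitelist
  match.fun(op)(a, b)
}

# Call apply_op once per row; the result is a plain numeric vector
df$D <- unname(mapply(apply_op, df$A, df$C, df$B))
df$D
# [1]  2 52 19 67
```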
This answer should work for you
https://stackoverflow.com/a/19000310/6395612
You can use with like this:
df['D'] = with(df, ifelse(C=='+', A - B, B - A))
A base solution:
df$D = (df$B-df$A)*sign((df$C=="-")-0.5)
# A B C D
# 1 1 1 + 0
# 2 54 2 - -52
# 3 23 4 - -19
# 4 2 65 + -63
This can also be written as df <- transform(df, D = (B-A)*sign((C=="-")-0.5))
Related
I want to find which values in df2 are also present in df1, within a certain range. One value consists of both a and b in the data frames (a and b can't be split up). For example, can I find 9,1 (the first row of df1) in df2? It doesn't have to be in the same position. Also, we can allow a difference of, for example, 1 for "a" and 1 for "b", so I want to find all values 9±1, 1±1 in df2. "a" and "b" always go together; the values in each row stay together. Does anyone have a suggestion for how to code this? Many thanks!
set.seed(1)
a <- sample(10,5)
set.seed(1)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df1 <- data.frame(feature,a,b)
df1
> df1
feature a b
A 9 1
B 4 4
C 7 1
D 1 2
E 2 5
set.seed(2)
a <- sample(10,5)
b <- sample(5,5, replace=T)
feature <- LETTERS[1:5]
df2 <- data.frame(feature,a,b)
df2
df2
feature a b
A 5 1
B 6 4
C 9 5
D 1 1
E 10 2
This is not correct, but I imagine it can be done with a for loop somehow!
for(i in df1[,1]) {
for(j in df1[,2]){
s<- c(s,(df1[i,1] & df1[j,2]== df2[,1] & df2[,2]))# how to add certain allowed diff levels?
}
}
s
Output wanted:
feature_df1 <- LETTERS[1:5]
match <- c(1,0,0,1,0)
feature_df2 <- c("E","","","D", "")
df <- data.frame(feature_df1, match, feature_df2)
df
feature_df1 match feature_df2
A 1 E
B 0
C 0
D 1 D
E 0
I love data.table, which is (imo) the weapon of choice for this kind of problem.
library( data.table )
#make df1 and df2 a data.table
setDT(df1, key = "feature"); setDT(df2)
#now perform a join operation on each row of df1,
# creating an on-the-fly subset of df2
df1[ df1, c( "match", "feature_df2") := {
val = df2[ a %between% c( i.a - 1, i.a + 1) & b %between% c(i.b - 1, i.b + 1 ), ]
unique_val = sort( unique( val$feature ) )
num_val = length( unique_val )
list( num_val, paste0( unique_val, collapse = ";" ) )
}, by = .EACHI ][]
# feature a b match feature_df2
# 1: A 9 1 1 E
# 2: B 4 4 0
# 3: C 7 1 0
# 4: D 1 2 1 D
# 5: E 2 5 0
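For comparison, the same tolerance match can be sketched in base R with outer(), which compares every row of df1 with every row of df2 at once. The names tol_a, tol_b, hit and res are introduced here for illustration, and the data frames are typed in by hand from the printed output in the question rather than regenerated with sample().

```r
df1 <- data.frame(feature = LETTERS[1:5],
                  a = c(9, 4, 7, 1, 2),
                  b = c(1, 4, 1, 2, 5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(feature = LETTERS[1:5],
                  a = c(5, 6, 9, 1, 10),
                  b = c(1, 4, 5, 1, 2),
                  stringsAsFactors = FALSE)

tol_a <- 1   # allowed difference in a (assumed value from the question)
tol_b <- 1   # allowed difference in b

# hit[i, j] is TRUE when row i of df1 matches row j of df2 in both a and b
hit <- abs(outer(df1$a, df2$a, "-")) <= tol_a &
       abs(outer(df1$b, df2$b, "-")) <= tol_b

res <- data.frame(
  feature_df1 = df1$feature,
  match       = rowSums(hit),
  feature_df2 = apply(hit, 1, function(r) paste(df2$feature[r], collapse = ";")),
  stringsAsFactors = FALSE
)
res
```

The outer() calls build full nrow(df1) by nrow(df2) matrices, so this sketch is only practical when that product fits comfortably in memory.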
One way to go about this in base R is to split each data.frame into a list of rows, compute the absolute difference between every pair of row vectors, and then check whether each difference is larger than a given threshold.
Code
# Find the absolute difference of all row vectors
listdif <- lapply(l1, function(x){
lapply(l2, function(y){
abs(x - y)
})
})
# Then flatten the list to a list of data.frames
listdifflat <- lapply(listdif, function(x){
do.call(rbind, x)
})
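# Finally flag each pair: TRUE means at least one difference exceeds its threshold
# (m1 for the first column, m2 for the second), FALSE means the pair is within both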
m1 <- 2
m2 <- 3
listfin <- Map(function(x){
x[1] > m1 | x[2] > m2
},
listdifflat)
head(listfin, 1)
[[1]]
V1
[1,] TRUE
[2,] FALSE
[3,] TRUE
[4,] TRUE
[5,] TRUE
[6,] TRUE
[7,] TRUE
[8,] TRUE
[9,] TRUE
[10,] TRUE
Data
df1 <- read.table(text = "
4 1
7 5
1 5
2 10
13 6
19 10
11 7
17 9
14 5
3 5")
df2 <- read.table(text = "
15 1
6 3
19 6
8 2
1 3
13 7
16 8
12 7
9 1
2 6")
# convert df to list of row vectors
l1<- lapply(1:nrow(df1), function(x){
df1[x, ]
})
l2 <- lapply(1:nrow(df2), function(x){
df2[x, ]
})
I am trying to compute percentages grouped by another variable.
I used sapply to get the percentage for each column relative to another one, but I don't know how to group these values by type (another variable).
x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
A B C type yes
1 0 0 1 x 0
2 0 1 0 x 0
3 1 0 1 x 1
4 1 1 1 y 1
5 1 0 0 y 0
6 1 1 0 y 1
7 1 1 1 x 1
I need the following percentage: sum(A == 1 & yes == 1) / sum(A == 1). For this I use the following code:
result <- as.data.frame(sapply(x[,1:3],
function(i) (sum(i & x$yes)/sum(i))*100))
result
sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A 80
B 75
C 75
Now I need the same calculation, but taking the variable "type" into account, i.e. the same percentage broken down by type. So my expected table is:
type sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A x 40
A y 40
B x 25
B y 50
C x 50
C y 25
In the example you can see that, for each letter, the percentages summed over the types equal the value from the first result; here it is just broken down by type.
Thanks a lot.
You can do the following using data.table:
Code
setDT(df)
cols = c('A', 'B', 'C')
mat = df[yes == 1, lapply(.SD, function(x){
100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]
# Here, the numerator is sum(x | yes == 1) for x == columns A, B, C
# If we look at the denominator, it equals sum(x) for x == columns A, B, C
# The reason why we need to apply substitute(x) is because df[, lapply(.SD, sum)]
# generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
# Hence, for each x in the column names we must subset the list above using [[substitute(x)]]
# Ultimately, the operation equals sum(x | yes == 1)/sum(x) for A, B, C.
}), .(type), .SDcols = cols]
# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each ABC column.
# The dot is just shorthand for 'list()'.
# .SDcols assigns the subset that I want to apply my lapply statement onto.
Result
> mat
type A B C
1: x 40 25 50
2: y 40 50 25
Long format (your example)
> melt(mat)
type variable value
1: x A 40
2: y A 40
3: x B 25
4: y B 50
5: x C 50
6: y C 25
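For readers who prefer base R, the same table can be sketched with rowsum() and sweep(). This is an illustrative alternative, not the answerer's method; the names m, hits and pct are introduced here.

```r
x <- data.frame(A = c(0, 0, 1, 1, 1, 1, 1),
                B = c(0, 1, 0, 1, 0, 1, 1),
                C = c(1, 0, 1, 1, 0, 0, 1),
                type = c("x", "x", "x", "y", "y", "y", "x"),
                yes = c(0, 0, 1, 1, 0, 1, 1),
                stringsAsFactors = FALSE)

m <- as.matrix(x[c("A", "B", "C")])

# Numerator: per type, count the rows where the column and yes are both 1
hits <- rowsum(m * x$yes, group = x$type)

# Denominator: the overall column totals, so the per-type values of each
# column add up to the ungrouped percentage
pct <- 100 * sweep(hits, 2, colSums(m), "/")
pct
```

The result is a matrix with one row per type and one column per letter, matching the wide table above.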
Data
df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
"type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
I want to subtract 1 from the values of column A if column B is <= 20.
A = c(1,2,3,4,5)
B = c(10,20,30,40,50)
df = data.frame(A,B)
output
A B
1 0 10
2 1 20
3 3 30
4 4 40
5 5 50
My data is very huge so I prefer not to use a loop. Is there any computationally efficient method in R?
You can do
df$A[df$B <= 20] <- df$A[df$B <= 20] - 1
# A B
#1 0 10
#2 1 20
#3 3 30
#4 4 40
#5 5 50
We can break this down step-by-step to understand how this works.
First we check which numbers in B are less than or equal to 20, which gives us a logical vector
df$B <= 20
#[1] TRUE TRUE FALSE FALSE FALSE
Using that logical vector we can select the numbers in A
df$A[df$B <= 20]
#[1] 1 2
Subtract 1 from those numbers
df$A[df$B <= 20] - 1
#[1] 0 1
and replace these values for the same indices in A.
With dplyr we can also use case_when
library(dplyr)
df %>%
mutate(A = case_when(B <= 20 ~ A - 1,
TRUE ~ A))
Another possibility:
df$A <- ifelse(df$B <= 20, df$A - 1, df$A)
And here is a data.table solution:
library(data.table)
setDT(df)
df[B <= 20, A := A - 1]
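Since a logical vector behaves like 0/1 in arithmetic, the whole update can also be sketched as a single vectorised subtraction. This is a small base R illustration on the question's example data:

```r
df <- data.frame(A = c(1, 2, 3, 4, 5), B = c(10, 20, 30, 40, 50))

# (df$B <= 20) is TRUE/FALSE; in arithmetic TRUE counts as 1 and FALSE as 0,
# so 1 is subtracted only where the condition holds
df$A <- df$A - (df$B <= 20)
df$A
# [1] 0 1 3 4 5
```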
I have an array of data that can be modelled roughly as follows:
x=data.frame(c(2,2,2),c(3,4,6),c(3,4,6), c("x/-","x/x","-/x"))
names(x)=c("A","B","C","D")
I wish to change the values of B to (C + 1) only if the first character in D is -.
I have tried using the following and iterating over the rows:
if(substring(x$D, 1,1) == "-")
{
x$B <- x$C + 1
}
However this method does not seem to work. Is there a way to do this using sapply?
Thanks,
Matt
You can use ifelse and within
within(x, B <- ifelse(substr(D, 1, 1) == "-", C + 1, B))
# A B C D
# 1 2 3 3 x/-
# 2 2 4 4 x/x
# 3 2 7 6 -/x
Or instead of substr, you could use grepl
within(x, B <- ifelse(grepl("^[-]", D), C + 1, B))
# A B C D
# 1 2 3 3 x/-
# 2 2 4 4 x/x
# 3 2 7 6 -/x
data.table solution.
require(data.table)
x <- data.table(c(2,2,2), c(3,4,6), c(3,4,6), c("x/-","x/x","-/x"))
setnames(x, c("A","B","C","D"))
x[grepl("^[-]", D), B := C + 1]
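The original if() attempt fails because if() expects a single TRUE or FALSE, while substring(x$D, 1, 1) == "-" returns one value per row. A minimal base R sketch of a vectorised alternative, using logical indexing and startsWith() (available in R 3.3.0 and later); the name first_is_dash is introduced here for illustration:

```r
x <- data.frame(A = c(2, 2, 2),
                B = c(3, 4, 6),
                C = c(3, 4, 6),
                D = c("x/-", "x/x", "-/x"),
                stringsAsFactors = FALSE)

# One TRUE/FALSE per row: does D begin with "-"?
first_is_dash <- startsWith(x$D, "-")

# Overwrite B only in the rows where the condition holds
x$B[first_is_dash] <- x$C[first_is_dash] + 1
x$B
# [1] 3 4 7
```

startsWith() needs a character vector, so a factor column would have to be wrapped in as.character() first.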
Given the following data.frame
d <- rep(c("a", "b"), each=5)
l <- rep(1:5, 2)
v <- 1:10
df <- data.frame(d=d, l=l, v=v*v)
df
d l v
1 a 1 1
2 a 2 4
3 a 3 9
4 a 4 16
5 a 5 25
6 b 1 36
7 b 2 49
8 b 3 64
9 b 4 81
10 b 5 100
Now I want to add another column after grouping by l. The extra column should contain the value of v_b - v_a
d l v e
1 a 1 1 35 (36-1)
2 a 2 4 45 (49-4)
3 a 3 9 55 (64-9)
4 a 4 16 65 (81-16)
5 a 5 25 75 (100-25)
6 b 1 36 35 (36-1)
7 b 2 49 45 (49-4)
8 b 3 64 55 (64-9)
9 b 4 81 65 (81-16)
10 b 5 100 75 (100-25)
The parentheses show how each value is calculated.
I'm looking for a way using dplyr. So I started with something like this
df %.%
group_by(l) %.%
mutate(e=myCustomFunction)
But how should I define myCustomFunction? I thought grouping of the data.frame produces another (sub-)data.frame which is a parameter to this function. But it isn't...
I guess this is the dplyr equivalent to @jlhoward's data.table solution:
df %>%
group_by(l) %>%
mutate(e = v[d == "b"] - v[d == "a"])
Edit after comment by OP:
If you want to use a custom function, here's a possible way:
myfunc <- function(x) {
with(x, v[d == "b"] - v[d == "a"])
}
df %>%
  group_by(l) %>%
  do(data.frame(. , e = myfunc(.))) %>%
  arrange(d, l) # <- just to get it back in the original order
Edit after comment by @hadley:
As hadley commented below, it would be better in this case to define the function as
f <- function(v, d) v[d == "b"] - v[d == "a"]
and then use the custom function f inside a mutate:
df %>%
group_by(l) %>%
mutate(e = f(v, d))
Thanks @hadley for the comment.
Using dplyr:
df %>%
  group_by(l) %>%
  mutate(e = diff(v))
# d l v e
# 1 a 1 1 35
# 2 a 2 4 45
# 3 a 3 9 55
# 4 a 4 16 65
# 5 a 5 25 75
# 6 b 1 36 35
# 7 b 2 49 45
# 8 b 3 64 55
# 9 b 4 81 65
# 10 b 5 100 75
Here's an approach using data tables.
library(data.table)
DT <- as.data.table(df)
DT[,e := diff(v), by=l]
These approaches using diff(...) assume your data frame is sorted as in your example. If not, this is a more reliable way to do the same thing.
DT[, e := .SD[d == "b", v] - .SD[d == "a", v], by=l]
or, even more directly:
DT[, e := v[d == "b"] - v[d == "a"], by=l]
But if you want to access the entire subset of data and pass it to your custom function, you can use .SD. Also make sure you read about .SDcols in ?data.table.
If you want to consider a non-dplyr option
df$e <- with(df, ave(v, l, FUN=function(x) diff(x)))
will do the trick. The ave function is useful for calculating values for groups of observations.
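Like the diff() versions, the ave() call relies on a coming before b within each l. A base R sketch that does not depend on row order, using named lookup vectors, is shown below. It assumes exactly one a row and one b row per value of l; va, vb and key are names introduced here, and the rows are deliberately reversed to show that ordering does not matter.

```r
df <- data.frame(d = rep(c("a", "b"), each = 5),
                 l = rep(1:5, 2),
                 v = (1:10)^2,
                 stringsAsFactors = FALSE)
df <- df[nrow(df):1, ]   # reverse the rows on purpose

# One value of v per l for each group, named by l
va <- setNames(df$v[df$d == "a"], df$l[df$d == "a"])
vb <- setNames(df$v[df$d == "b"], df$l[df$d == "b"])

# Look both values up by l and subtract: v_b - v_a
key <- as.character(df$l)
df$e <- unname(vb[key] - va[key])

df[order(df$d, df$l), ]   # print in the original order
```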