Subtract a row of values across many rows - r

I have a dataset that simplifies to something like this, let's call that dataset B
V1 V2 V3 V4
sample1 1 2 3
sample2 4 5 6
sample3 7 8 9
then I have another separate row (on its own) called blank,
it would look something like this.
V1 V2 V3 V4
blank 0.5 1.0 1.5
I would like to subtract blank from all the rows of B.
So far I've tried:
B[,2:ncol(B)] <- lapply(B[,2:ncol(B)], function(x) x - blank[,2:ncol(blank)])
B[,2:ncol(B)] <- sweep(B[,2:ncol(B)], 1, blank[,2:ncol(blank)])
B[,2:ncol(B)] <- B[,2:ncol(B)] - blank[,2:ncol(blank)]
B[,2:ncol(B)] <- for(i in 1:nrow(B)){B[ i ,2:ncol(B)] - blank[,2:ncol(B)]}
None of these work. The first one tells me that "replacement element 1 is a matrix/data frame of 1 row, need 3". The second one tells me "STATS is longer than the extent of 'dim(x)[MARGIN]'", and changing the margin to 2 does not solve the problem. The third one says "‘-’ only defined for equally-sized data frames". The fourth one returns an empty matrix.
I've looked through the forum to the best of my ability, but the answers I found only cover applying a single value across the entire dataset. I would like to subtract a whole row of values from every other row in a dataset.
The end result should look like this (no rounding required).
V1 V2 V3 V4
sample1 0.5 1.0 1.5
sample2 3.5 4.0 4.5
sample3 6.5 7.0 7.5

You can subtract the single row from every row of the other data frame by repeating it as many times as that data frame has rows and then subtracting the two, as below.
df1 <- t(data.frame(c(1,2,3), c(4,5,6), c(7,8,9)))
df2 <- data.frame(.5, 1, 1.5)
# rep(1, nrow(df1)) repeats row 1 of df2 once per row of df1,
# so this works for any number of rows
df1[,] - df2[rep(1, nrow(df1)),]

We can use sweep:
B[-1] <- sweep(B[-1], 2, unlist(blank[-1]), `-`)
B
# V1 V2 V3 V4
#1 sample1 0.5 1 1.5
#2 sample2 3.5 4 4.5
#3 sample3 6.5 7 7.5
Or using transpose:
B[-1] <- t(t(B[-1]) - unlist(blank[-1]))
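To see why both forms work, here is a minimal self-contained sketch. It rebuilds B and blank from the question (values copied from the example; stats, by_sweep and by_transpose are names made up for illustration). sweep with MARGIN = 2 subtracts one value per column, while the double transpose relies on R recycling a vector down the columns of a matrix.

```r
# Rebuild the example data from the question
B <- data.frame(V1 = c("sample1", "sample2", "sample3"),
                V2 = c(1, 4, 7), V3 = c(2, 5, 8), V4 = c(3, 6, 9))
blank <- data.frame(V1 = "blank", V2 = 0.5, V3 = 1.0, V4 = 1.5)

# One value per numeric column, as a plain vector (hypothetical helper name)
stats <- unlist(blank[-1])

# MARGIN = 2: stats[j] is subtracted from every entry of column j
by_sweep <- sweep(B[-1], 2, stats, `-`)

# After t(), each column holds one sample, so the vector recycles correctly;
# the outer t() restores the original orientation
by_transpose <- t(t(B[-1]) - stats)

print(by_sweep)
print(by_transpose)
```

Note that by_sweep is still a data frame while by_transpose is a matrix; assigning either back into B[-1] gives the same columns.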


Calculating weight ratios in the presence of empty cells

I have a sample which needs to be weighted in order to represent the population.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
V1 V2 V3 V4
1: 1 0 2 2
2: 3 4 3 0
population <- fread("
10,20,20,10
30,40,20,10
")
This weight would simply be:
weights <- population/sample
V1 V2 V3 V4
1: 10 Inf 10.000000 5
2: 10 10 6.666667 Inf
However, because V2 in row 1 of the sample has no observations, it receives an infinite weight (Note that also V4 in row 2 receives an Inf, but this is easier to solve, because the weight is irrelevant, as there are no observations in either the sample or the population).
A solution to the problem would be to count V1 and V2 together in the sample and the population.
EDIT:
After some thought I realised that, for the weights to be correct, only the population values have to be adapted. If V1 and V2 in row 1 of population are added together in V1 of population, this already leads to the correct weight for the sample observation of V1 in row 1. The value of V2 becomes irrelevant because there is no observation in the sample to receive that weight.
End of EDIT
The observation would then get a weight of:
(population[1,1]+population[1,2])/(sample[1,1]+sample[1,2])
(10+20)/(1+0)=30
In my actual data there are, however, many more rows, with here and there a 0 in the sample. I am trying to figure out if there is a way to write my code so that I do not have to do this manually.
Desired outcome (notice that the weight of V1 row 1 is now 30):
weights
V1 V2 V3 V4
1: 30 0 10.000000 5
2: 10 10 6.666667 0
Attempt
I was thinking of doing something like:
for (i in seq_along(ncol(sample))) {
lapply(population, (ifelse(sample[i]==0), population[i]<-population[i+1], population[i])
}
Where the values in the population of the cell to right will be added when the value in the sample is zero. However I am having trouble getting the syntax right, and even if it did, it does not solve the case where V4 is 0.
Here is a rather verbose solution. If there were more columns that should be aggregated when the sample contains zeros, I would have proposed a more flexible approach, but this seems sufficient for your example.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
population <- fread("
10,20,20,10
30,40,20,10
")
# aggregate Values if sample is zero
population[sample$V1 == 0, `:=`(V1 = 0,
V2 = V1 + V2)]
population[sample$V2 == 0, `:=`(V1 = V1 + V2,
V2 = 0)]
weights <- population/sample
# Fix NaNs
weights[is.na(weights), ] <- 0
weights
#> V1 V2 V3 V4
#> 1: 30 0 10.000000 5
#> 2: 10 10 6.666667 Inf
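Note that the output above still shows Inf for V4 in row 2, while the desired outcome in the question has 0 there. That is because is.na() is TRUE for NaN (0/0) but not for Inf (10/0). Here is a small base R sketch of the difference, using a made-up vector w rather than the data.table from the answer:

```r
# A made-up weight vector containing both problem cases
w <- c(30, 0 / 0, 10, 10 / 0)   # 0/0 gives NaN, 10/0 gives Inf

print(is.na(w))       # TRUE only for the NaN entry
print(is.finite(w))   # FALSE for both the NaN and the Inf entry

# Zero out every non-finite weight (NaN, Inf, -Inf and NA alike)
w_fixed <- w
w_fixed[!is.finite(w_fixed)] <- 0
print(w_fixed)
```

Whether zeroing the Inf weights is appropriate depends on the data; it does match the desired table in the question, where cells without sample observations get a weight of 0.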

R Difference with previous column across multiple columns

I have a dataframe like this that resulted from a cumsum of variables:
id v1 v2 v3
1 4 5 9
2 1 1 4
I would like to get the difference between consecutive columns, so that the dataframe is transformed into:
id v1 v2 v3
1 4 1 4
2 1 0 3
So effectively "de-accumulating" the values by taking the difference. This is a small example; the original df has around 150 columns.
Thx!
x <- read.table(header=TRUE, text="
id v1 v2 v3
1 4 5 9
2 1 1 4")
x[,c("v1","v2","v3")] <- cbind(x[,"v1"], t(apply(x[,c("v1","v2","v3")], 1, diff)))
x
# id v1 v2 v3
# 1 1 4 1 4
# 2 2 1 0 3
Explanation:
Up front, a note: when using apply on a data.frame, it converts the argument to a matrix. This means that if you have any character columns in the argument passed to apply, then the entire matrix will be character, likely not what you want. Because of this, it is safer to only select columns you need (and reassign them specifically).
apply(.., MARGIN=1, ...) returns its output in an orientation transposed from what you might expect, so I have to wrap it in t(...).
I'm using diff, which returns a vector of length one shorter than the input, so I'm cbinding the original column to the return from t(apply(...)).
Just as I had to be specific about which columns to pass to apply, I'm similarly specific about which columns will be replaced by the return value.
A simple for loop might do the trick, but for larger data it will be slower than other approaches.
df <- data.frame(id = c(1,2), v1 = c(4,1), v2 = c(5,1))
df2 <- df
for(i in 3:ncol(df)){
df2[,i] <- df[,i] - df[,i-1]
}
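For a wide data frame (the question mentions around 150 columns), the same de-accumulation can also be sketched without apply or an explicit loop by subtracting a shifted copy of the value columns. This is only a sketch under the assumption that every column except the first (id) is numeric; vals and n are illustrative names that do not appear in the answers above.

```r
# Example data from the question
x <- data.frame(id = 1:2, v1 = c(4, 1), v2 = c(5, 1), v3 = c(9, 4))

# Work on the numeric columns as a matrix (assumes column 1 is the id)
vals <- as.matrix(x[-1])
n <- ncol(vals)

# Columns 2..n minus columns 1..(n-1); the first value column stays as is
vals[, -1] <- vals[, -1] - vals[, -n]

# Write the differenced values back into the data frame
x[-1] <- vals
print(x)
```

Because the subtraction happens on whole matrices at once, the column names never have to be listed by hand.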

Add values to a vector to make a consecutive vector in R

I have several vectors that look like this:
v1 <- c(1,2,4)
v2 <- c(3,5,8)
v3 <- c(4)
This is just a small sample of them. I'm trying to figure out a way to add values to each of them to make them all consecutive vectors. So that at the end, they look like this:
v1 <- c(1,2,3,4)
v2 <- c(1,2,3,4,5,6,7,8)
v3 <- c(1,2,3,4)
So "3" is added to the first vector, "1","2","4","6","7" are added to the second, and so forth. I have several hundred vectors that look like this, so I'm trying to figure out a solution that would scale/be automated.
You can use seq and max
seq(max(v1))
For multiple vectors, we can loop
lapply(mget(paste0('v',1:3)), function(x) seq(max(x)))
#$v1
#[1] 1 2 3 4
#$v2
#[1] 1 2 3 4 5 6 7 8
#$v3
#[1] 1 2 3 4
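With several hundred vectors it may be easier to keep them in a named list than to look them up by name with mget. A minimal sketch, assuming the vectors only contain positive whole numbers (vecs, filled and added are illustrative names):

```r
# The example vectors from the question, stored in one list
vecs <- list(v1 = c(1, 2, 4), v2 = c(3, 5, 8), v3 = c(4))

# Fill each vector up to its maximum
filled <- lapply(vecs, function(x) seq_len(max(x)))

# Optionally, list which values had to be added to each vector
added <- Map(setdiff, filled, vecs)

print(filled)
print(added)
```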

Get row with highest value from one column after chunking with plyr - R

Suppose I have a dataframe that looks like this:
v1 v2 v3 v4 v5 v6
r1 1 2 2 4 5 9
r2 1 2 2 4 5 10
r3 1 2 2 4 5 7
r4 1 2 2 4 5 12
r5 2 2 2 4 5 9
r6 2 2 2 4 5 10
I would like to get the row with the highest value in v6 that has the value 1 in v1.
I know how to get all rows where v1 = 1 and select the first row of that, thanks to this answer to a previous question:
ddply( df , .variables = "v1" , .fun = function(x) x[1,] )
How can I change the function so that I get the row with the highest value in v6?
Building on the previous results, I'd use [ to subset on your first condition with a logical comparison, and then subset again on your second condition because, as @sgibb points out in the comments, the max value of v6 might not be in a row where v1 == 1.
# Subset to those rows where v1 == 1
tmp <- df[ df$v1 == 1 , ]
# Then select those rows where the max value of v6 appears
tmp[ tmp$v6 == max( tmp$v6 ) , ]
If you want the first occurrence, use which.max().
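As a minimal sketch of the which.max() variant, here is the example data frame from the question rebuilt by hand (first_max is an illustrative name):

```r
# Rebuild the example data frame from the question
df <- data.frame(v1 = c(1, 1, 1, 1, 2, 2), v2 = 2, v3 = 2, v4 = 4, v5 = 5,
                 v6 = c(9, 10, 7, 12, 9, 10),
                 row.names = paste0("r", 1:6))

# Subset to the rows where v1 == 1
tmp <- df[df$v1 == 1, ]

# which.max() returns the index of the first maximum, so this is always one row
first_max <- tmp[which.max(tmp$v6), ]
print(first_max)
```

Unlike the == max() comparison, this never returns more than one row when the maximum is tied.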
We could also use the subset function, like:
x_sub= subset(x, state == "C" & chainlength == 5 & segment == "C2C_REG")
where x is the data frame and the other parameter is a logical expression.

Loop over rows of dataframe applying function with if-statement

I'm new to R and I'm trying to sum 2 columns of a given dataframe, if both the elements to be summed satisfy a given condition. To make things clear, what I want to do is:
> t.d<-as.data.frame(matrix(1:9,ncol=3))
> t.d
V1 V2 V3
1 4 7
2 5 8
3 6 9
> t.d$V4<-rep(0,nrow(t.d))
> for (i in 1:nrow(t.d)){
+ if (t.d$V1[i]>1 && t.d$V3[i]<9){
+ t.d$V4[i]<-t.d$V1[i]+t.d$V3[i]}
+ }
> t.d
V1 V2 V3 V4
1 4 7 0
2 5 8 10
3 6 9 0
I need efficient code, as my real dataframe has about 150000 rows and 200 columns. This gives an error:
t.d$V4<-t.d$V1[t.d$V1>1]+ t.d$V3[t.d$V3>9]
Is "apply" an option? I tried this:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d$V4<-rep(0,nrow(t.d))
my.fun<-function(x,y){
if(x>1 && y<9){
x+y}
}
t.d$V4<-apply(X=t.d,MAR=1,FUN=my.fun,x=t.d$V1,y=t.d$V3)
but it gives an error as well.
Thanks very much for your help.
This operation doesn't require loops, apply statements or if statements. Vectorised operations and subsetting are all you need:
t.d <- within(t.d, V4 <- V1 + V3)
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0
t.d
V1 V2 V3 V4
1 1 4 7 0
2 2 5 8 10
3 3 6 9 0
Why does this work?
In the first step I create a new column that is the straight sum of columns V1 and V3. I use within as a convenient way of referring to the columns of t.d without having to write t.d$V all the time.
In the second step I subset all of the rows that don't fulfill your conditions and set V4 for these to 0.
ifelse is your friend here:
t.d$V4<-ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0)
I'll chip in and provide yet another version. Since you want zero if the condition doesn't match, and TRUE/FALSE are glorified versions of 1/0, simply multiplying by the condition also works:
t.d<-as.data.frame(matrix(1:9,ncol=3))
t.d <- within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9))
...and it happens to be faster than the other solutions ;-)
t.d <- data.frame(V1=runif(2e7, 1, 2), V2=1:2e7, V3=runif(2e7, 5, 10))
system.time( within(t.d, V4 <- (V1+V3)*(V1>1 & V3<9)) ) # 3.06 seconds
system.time( ifelse((t.d$V1>1)&(t.d$V3<9), t.d$V1+ t.d$V3, 0) ) # 5.08 seconds
system.time( { t.d <- within(t.d, V4 <- V1 + V3);
t.d[!(t.d$V1>1 & t.d$V3<9), "V4"] <- 0 } ) # 4.50 seconds
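As a quick sanity check, the three approaches from the answers above can be run side by side on the small example from the question. This is only a sketch; cond, v_subset, v_ifelse and v_mult are illustrative names.

```r
# Example data from the question
t.d <- as.data.frame(matrix(1:9, ncol = 3))

# The condition is computed once as a logical vector (note & rather than &&)
cond <- t.d$V1 > 1 & t.d$V3 < 9

# 1. Sum everything, then zero the rows that fail the condition
v_subset <- t.d$V1 + t.d$V3
v_subset[!cond] <- 0

# 2. ifelse picks the sum or 0 element by element
v_ifelse <- ifelse(cond, t.d$V1 + t.d$V3, 0)

# 3. Multiplying by the condition treats TRUE as 1 and FALSE as 0
v_mult <- (t.d$V1 + t.d$V3) * cond

print(cbind(v_subset, v_ifelse, v_mult))
```

All three columns hold the same values; only the storage type differs (the multiplication keeps integers, the other two produce doubles).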
