How to count the number of columns meeting a condition on another column - R

I have a dataframe that looks like this:
data <- as.data.frame(cbind('01-01-2018' = c(1.2, 3.1, 0.7, -0.3, 2.0),
                            '02-01-2018' = c(-0.1, 2.4, 4.9, -3.3, -2.7),
                            '03-01-2018' = c(3.4, -2.6, -1.8, 0.1, 0.3)))
  01-01-2018 02-01-2018 03-01-2018
1        1.2       -0.1        3.4
2        3.1        2.4       -2.6
3        0.7        4.9       -1.8
4       -0.3       -3.3        0.1
5        2.0       -2.7        0.3
For each row, I want to count how many values are bigger than that row's average.
data$mn <- apply(data, 1, mean)
  01-01-2018 02-01-2018 03-01-2018         mn
1        1.2       -0.1        3.4  1.5000000
2        3.1        2.4       -2.6  0.9666667
3        0.7        4.9       -1.8  1.2666667
4       -0.3       -3.3        0.1 -1.1666667
5        2.0       -2.7        0.3 -0.1333333
My last attempt was the following:
df$events <- apply(data, 1, function(x) sum(x > data$mn))

uhi_events <- numeric(nrow(data))
for (i in 1:nrow(data)) {
  uhi <- data[[6]][[i]][["values"]]
  uhi_events[i] <- sum(uhi)
}
data$uhi_events <- uhi_events
Is there a more efficient option?
EDIT:
What if the condition is on another column, let's say data$c1, that is not obtained through a simple formula?
data$md <- apply(data, 1, median)
  01-01-2018 02-01-2018 03-01-2018   md
1        1.2       -0.1        3.4  1.2
2        3.1        2.4       -2.6  2.4
3        0.7        4.9       -1.8  0.7
4       -0.3       -3.3        0.1 -0.3
5        2.0       -2.7        0.3  0.3

Using rowMeans and rowSums:
data$cnt <- rowSums(data > rowMeans(data))
data
#   01-01-2018 02-01-2018 03-01-2018 cnt
# 1        1.2       -0.1        3.4   1
# 2        3.1        2.4       -2.6   2
# 3        0.7        4.9       -1.8   1
# 4       -0.3       -3.3        0.1   2
# 5        2.0       -2.7        0.3   2
If the comparison column (say data$c1) was already computed, replace rowMeans with that existing column and exclude it from the comparison:
# get the column indices excluding "c1":
ix <- grep("c1", colnames(data), invert = TRUE)
data$cnt <- rowSums(data[, ix] > data$c1)

Using a user-defined function to sum a logical comparison (sum() coerces a logical vector to integer, so TRUE counts as 1 and FALSE as 0):
data$uhi_events <- apply(data, 1, function(i) {
  sum(i > mean(i))
})
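Note that if data already contains the mn column added earlier, apply will include it in both mean(i) and the count. A minimal sketch restricting the comparison to the value columns (assuming the date columns are the first three):
# compare only the first three (date) columns, row by row
value_cols <- 1:3
data$uhi_events <- apply(data[, value_cols], 1, function(i) sum(i > mean(i)))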

library(data.table)
setDT(data)
data[, above_mean := rowSums(.SD > rowMeans(.SD))]
#    01-01-2018 02-01-2018 03-01-2018 above_mean
# 1:        1.2       -0.1        3.4          1
# 2:        3.1        2.4       -2.6          2
# 3:        0.7        4.9       -1.8          1
# 4:       -0.3       -3.3        0.1          2
# 5:        2.0       -2.7        0.3          2
Edit for a question in the comments:
Compare to the value in the first column:
data[, above_col1 := rowSums(.SD > `01-01-2018`)]
#    01-01-2018 02-01-2018 03-01-2018 above_col1
# 1:        1.2       -0.1        3.4          1
# 2:        3.1        2.4       -2.6          0
# 3:        0.7        4.9       -1.8          1
# 4:       -0.3       -3.3        0.1          1
# 5:        2.0       -2.7        0.3          0
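If helper columns such as above_mean are already present, .SD would include them as well; restricting .SD with .SDcols keeps the comparison on the date columns only. A minimal sketch, assuming the column names from the example:
# limit .SD to the three date columns so helper columns are ignored
date_cols <- c("01-01-2018", "02-01-2018", "03-01-2018")
data[, above_mean := rowSums(.SD > rowMeans(.SD)), .SDcols = date_cols]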

Using a dplyr approach:
library(dplyr)
data <- as.data.frame(cbind('01-01-2018' = c(1.2,3.1,0.7,-0.3,2.0), '02-01-2018' = c(-0.1, 2.4, 4.9,-3.3,-2.7), '03-01-2018' = c(3.4, -2.6, -1.8, 0.1, 0.3)))
data$mm <- apply(data,1,median)
data %>%
  rowwise %>%
  mutate(count = sum(c_across(1:3) > mm))
#> # A tibble: 5 × 5
#> # Rowwise:
#>   `01-01-2018` `02-01-2018` `03-01-2018`    mm count
#>          <dbl>        <dbl>        <dbl> <dbl> <int>
#> 1          1.2         -0.1          3.4   1.2     1
#> 2          3.1          2.4         -2.6   2.4     1
#> 3          0.7          4.9         -1.8   0.7     1
#> 4         -0.3         -3.3          0.1  -0.3     1
#> 5          2           -2.7          0.3   0.3     1
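rowwise() can be slow on larger data; the vectorised rowSums idiom from the first answer works here as well. A minimal sketch, assuming the same mm column:
# vectorised alternative without rowwise()
data$count <- rowSums(data[1:3] > data$mm)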

Related

How can I sort these data in R [duplicate]

This question already has answers here: Multiply rows of matrix by vector? (6 answers)
A small sample of my data is as follows:
A=c(0.1, 0.3, 0.6, 0.1)
dat<-read.table (text=" D1 D2 D3 D4
10 11 13 14
9 8 8 0
70 100 2 3
4 3 3 200
1 2 3 4
", header=TRUE)
The logic is 0.1 x D1, 0.3 x D2, 0.6 x D3 and 0.1 x D4, i.e. each column of dat is multiplied by the corresponding element of A.
Here is the expected outcome:
 D1   D2  D3   D4
1.0  3.3 7.8  1.4
0.9  2.4 4.8  0.0
7.0 30.0 1.2  0.3
0.4  0.9 1.8 20.0
0.1  0.6 1.8  0.4
Please assume I have more than four D columns.
A possible solution, using dplyr:
library(dplyr)
dat %>%
  mutate(across(everything(), ~ .x * A[which(names(dat) == cur_column())]))
#>    D1   D2  D3   D4
#> 1 1.0  3.3 7.8  1.4
#> 2 0.9  2.4 4.8  0.0
#> 3 7.0 30.0 1.2  0.3
#> 4 0.4  0.9 1.8 20.0
#> 5 0.1  0.6 1.8  0.4
Another possible solution, in base R:
as.data.frame(t(apply(dat, 1, \(x) x * A)))
Yet another possible solution, using purrr::map2_df:
purrr::map2_df(dat, A, `*`)
Or even:
mapply(`*`, dat, A)
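For completeness, base R's sweep() expresses the same column-wise multiplication directly; a small sketch, not taken from the original answers:
# multiply across columns (MARGIN = 2) by the matching element of A
sweep(dat, 2, A, `*`)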

Create lower triangle genetic distance matrix

I have a distance matrix like this:
    1   2   3   4   5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1.0 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2.0 2.1
E 2.2 2.3 2.4 2.5 2.6
and now I want to create a lower triangle matrix like this:
    1   2   3   4   5   A   B   C   D   E
1 0.0
2 0.1 0.0
3 0.2 0.1 0.0
4 0.4 0.3 0.2 0.0
5 0.5 0.4 0.3 0.1 0.0
A 0.1 0.2 0.3 0.5 0.6 0.0
B 0.7 0.8 0.9 1.0 1.1 0.6 0.0
C 1.2 1.3 1.4 1.5 1.6 1.1 0.5 0.0
D 1.7 1.8 1.9 2.0 2.1 1.6 1.0 0.5 0.0
E 2.2 2.3 2.4 2.5 2.6 2.1 1.5 1.0 0.5 0.0
To get the genetic distance between 1 and 2, I subtracted the value for 1 from the value for 2 in the first table (0.2 - 0.1 = 0.1), and did the same for the rest of the entries; I do not know whether doing it like this is correct. After that calculation I made the lower triangle matrix. I tried this in R:
x <- read.csv("AD2.csv", header = FALSE, sep = ",")
b <- lower.tri(x, diag = FALSE)
but I am getting only TRUE and FALSE as output, not a distance matrix.
Can anyone help me solve this problem? Here is a link to my example data.
You can make use of dist to calculate the sub-matrices, use cbind to create the top and bottom halves, rbind the two halves, and then set the upper triangle to NA to get the desired output.
mat <- rbind(
  cbind(as.matrix(dist(tbl[1, ])), tbl),
  cbind(tbl, as.matrix(dist(tbl[, 1])))
)
mat[upper.tri(mat, diag = FALSE)] <- NA
mat
Hope it helps.
data:
tbl <- as.matrix(read.table(text="1 2 3 4 5
A 0.1 0.2 0.3 0.5 0.6
B 0.7 0.8 0.9 1 1.1
C 1.2 1.3 1.4 1.5 1.6
D 1.7 1.8 1.9 2 2.1
E 2.2 2.3 2.4 2.5 2.6", header=TRUE, check.names=FALSE, row.names=1))

Multiply values depending on values of certains columns

I have two data frames, df and cf. I want to multiply each value of A in df by the coefficient in cf selected by the values of B and C in that row of df.
For example, in the first row of df, A = 20, B = 4 and C = 2, so the correct coefficient is cf[4, 2] = 0.3,
and the result is 20 * 0.3 = 6.
Is there a simple way to do that in R? Thanks in advance!
df
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
cf (rows indexed by B, columns by C)
B\C   1   2   3   4   5
  1 0.2 0.3 0.5 0.6 0.7
  2 0.1 0.5 0.3 0.3 0.4
  3 0.9 0.1 0.6 0.6 0.8
  4 0.7 0.3 0.7 0.4 0.6
One solution with apply:
# iterate over df's rows
apply(df, 1, function(x) {
  x[1] * cf[x[2], x[3]]
})
#[1]  6.0 18.0 17.5 14.4  4.3
Try this vectorized:
df[,1] * cf[as.matrix(df[,2:3])]
#[1] 6.0 18.0 17.5 14.4 4.3
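The trick here is matrix indexing: subsetting with a two-column matrix of (row, column) pairs returns one element per pair rather than a block of rows and columns. A small illustration, assuming the cf table above:
# each row of the index matrix picks a single element: cf[4, 2] and cf[4, 5]
cf[cbind(c(4, 4), c(2, 5))]
#[1] 0.3 0.6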
A solution using dplyr and a vectorised function:
df = read.table(text = "
A B C
20 4 2
30 4 5
35 2 2
24 3 3
43 2 1
", header=T, stringsAsFactors=F)
cf = read.table(text = "
0.2 0.3 0.5 0.6 0.7
0.1 0.5 0.3 0.3 0.4
0.9 0.1 0.6 0.6 0.8
0.7 0.3 0.7 0.4 0.6
")
library(dplyr)
# function to get the correct element of cf
# vectorised version
f = function(x,y) cf[x,y]
f = Vectorize(f)
df %>%
  mutate(val = f(B, C),
         result = val * A)
# A B C val result
# 1 20 4 2 0.3 6.0
# 2 30 4 5 0.6 18.0
# 3 35 2 2 0.5 17.5
# 4 24 3 3 0.6 14.4
# 5 43 2 1 0.1 4.3
The final dataset has both result and val in order to check which value from cf was used each time.
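Why Vectorize is needed: called with vectors, cf[x, y] returns the whole block of all row/column combinations rather than one element per (x, y) pair, so the element-wise lookup requires either Vectorize or the matrix indexing shown above. A quick illustration, assuming the cf defined in this answer:
cf[c(4, 2), c(2, 5)]  # a 2 x 2 block, not the two coefficients we want
f(c(4, 2), c(2, 5))   # element-wise lookup: cf[4, 2] and cf[2, 5]
#[1] 0.3 0.4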

expand data.frame and insert average in another column

If the data set is
date CPI
2000/ 1 1.2
2000/ 2 3.2
2000/ 3 1.6
then I want to get a weekly CPI.
So this is my expected result.
date CPI Average
2000/ 1 1.2 0.3
2000/ 1 1.2 0.3
2000/ 1 1.2 0.3
2000/ 1 1.2 0.3
2000/ 2 3.2 0.8
2000/ 2 3.2 0.8
2000/ 2 3.2 0.8
2000/ 2 3.2 0.8
2000/ 3 1.6 0.4
2000/ 3 1.6 0.4
2000/ 3 1.6 0.4
2000/ 3 1.6 0.4
How can I do this in R?
Please help me; my monthly CPI data runs to almost 200 rows.
Maybe this helps:
n <- 4
mydf1 <- transform(mydf, Average = CPI/n)        # create a new column `Average` by dividing CPI by n
mydf2 <- mydf1[rep(1:nrow(mydf1), each = n), ]   # repeat each row index n times and use it to expand the rows of `mydf1`
row.names(mydf2) <- 1:nrow(mydf2)                # reset the rownames
mydf2
# date CPI Average
#1 2000/ 1 1.2 0.3
#2 2000/ 1 1.2 0.3
#3 2000/ 1 1.2 0.3
#4 2000/ 1 1.2 0.3
#5 2000/ 2 3.2 0.8
#6 2000/ 2 3.2 0.8
#7 2000/ 2 3.2 0.8
#8 2000/ 2 3.2 0.8
#9 2000/ 3 1.6 0.4
#10 2000/ 3 1.6 0.4
#11 2000/ 3 1.6 0.4
#12 2000/ 3 1.6 0.4
Or using data.table:
Here the idea is similar to the above. First convert the data.frame to a data.table with setDT, create a new column with Average := CPI/n, then rep the row numbers of the dataset n times each and use that numeric index to expand the rows of mydf.
library(data.table)
setDT(mydf)[mydf[, Average:=CPI/n][,rep(seq_len(.N), each=n)]]
# date CPI Average
# 1: 2000/ 1 1.2 0.3
# 2: 2000/ 1 1.2 0.3
# 3: 2000/ 1 1.2 0.3
# 4: 2000/ 1 1.2 0.3
# 5: 2000/ 2 3.2 0.8
# 6: 2000/ 2 3.2 0.8
# 7: 2000/ 2 3.2 0.8
# 8: 2000/ 2 3.2 0.8
# 9: 2000/ 3 1.6 0.4
#10: 2000/ 3 1.6 0.4
#11: 2000/ 3 1.6 0.4
#12: 2000/ 3 1.6 0.4
If you need to separate the date into year and quarter as shown in @KFB's post, you could use cSplit along with data.table. In the code below, setnames is used to rename the columns after the split; the rest of the procedure is the same as above. cSplit is available at https://gist.github.com/mrdwab/11380733.
library(devtools)
source_gist(11380733)
DT1 <- setnames(cSplit(mydf, "date", '[/]', fixed = FALSE, direction = 'wide'),
                c("CPI", "year", "Quarter"))
DT1[DT1[, Average := CPI/n][, rep(seq_len(.N), each = n)]]
# CPI year Quarter Average
#1: 1.2 2000 1 0.3
#2: 1.2 2000 1 0.3
#3: 1.2 2000 1 0.3
#4: 1.2 2000 1 0.3
#5: 3.2 2000 2 0.8
#6: 3.2 2000 2 0.8
#7: 3.2 2000 2 0.8
#8: 3.2 2000 2 0.8
#9: 1.6 2000 3 0.4
#10: 1.6 2000 3 0.4
#11: 1.6 2000 3 0.4
#12: 1.6 2000 3 0.4
data
mydf <- structure(list(date = c("2000/ 1", "2000/ 2", "2000/ 3"),
                       CPI = c(1.2, 3.2, 1.6)),
                  .Names = c("date", "CPI"), class = "data.frame",
                  row.names = c("1", "2", "3"))
Another data.table solution using @akrun's mydf:
mydt <- data.table(mydf)
mydt2 <- mydt[, data.table(apply(.SD, 2, function(x) rep(x, 4)))]
mydt2$CPI <- as.numeric(mydt2$CPI)
mydt2[, Average := CPI/4]
mydt2
date CPI Average
1: 2000/ 1 1.2 0.3
2: 2000/ 2 3.2 0.8
3: 2000/ 3 1.6 0.4
4: 2000/ 1 1.2 0.3
5: 2000/ 2 3.2 0.8
6: 2000/ 3 1.6 0.4
7: 2000/ 1 1.2 0.3
8: 2000/ 2 3.2 0.8
9: 2000/ 3 1.6 0.4
10: 2000/ 1 1.2 0.3
11: 2000/ 2 3.2 0.8
12: 2000/ 3 1.6 0.4

Shifting (+ wrap around) a data frame in R

I have a data frame like this:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
What I want to do is create a function that shifts this data frame by a given value, with wrap-around. For example,
if the shift value is 1, the data frame becomes:
A B value
3 2 0.412
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
etc.
The function should look something like this:
shift <- function(dataframe, shiftvalue)
Is there a simple way to do this in R without writing a lot of loops?
You can do it many ways, but one way is to use head and tail:
df <- data.frame(a = 1:10, b = 11:20)
shift <- function(d, k) rbind(tail(d, k), head(d, -k), deparse.level = 0)
> shift(df,3)
a b
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
1 1 11
2 2 12
3 3 13
I prefer plain old modulo ;-)
shift <- function(df, offset) df[((1:nrow(df)) - 1 - offset) %% nrow(df) + 1, ]
It is pretty straightforward; the only quirk is R's one-based indexing. It also works for offsets like 0, -7 or 7*nrow(df).
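A quick check on the toy data frame from the first answer (df <- data.frame(a = 1:10, b = 11:20)), just to illustrate the wrap-around behaviour:
shift(df, 1)   # last row wraps around to the top
shift(df, -1)  # first row wraps around to the bottom
shift(df, 0)   # unchanged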
Here is my implementation:
> shift <- function(df, sv = 1) df[c((sv+1):nrow(df), 1:sv),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Updated:
> shift <- function(df, sv = 1) df[c((nrow(df)-sv+1):nrow(df), 1:(nrow(df)-sv)),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
There's a shift function in taRifx that works on vectors. Applying it results in coercion of all columns to character if any are character, so we'll use a trick from plyr. I'll likely write a data.frame method for it soon:
dd <- data.frame(b = seq(4),
                 x = c("A", "D", "A", "C"), y = c('a', 'b', 'c', 'd'),
                 z = c(1, 1, 1, 2), stringsAsFactors = FALSE)
> dd
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
4 4 C d 2
library(taRifx)
library(plyr)
shift.data.frame <- colwise(shift)
> shift.data.frame(dd)
b x y z
1 2 D b 1
2 3 A c 1
3 4 C d 2
4 1 A a 1
> shift(dd,n=-1)
b x y z
1 4 C d 2
2 1 A a 1
3 2 D b 1
4 3 A c 1
> shift(dd,n=-1,wrap=FALSE)
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
> shift(dd,n=-1,wrap=FALSE,pad=TRUE)
b x y z
1 NA <NA> <NA> NA
2 1 A a 1
3 2 D b 1
4 3 A c 1
The advantage of shift is that it takes a bunch of options:
n can be positive or negative to wrap from left/right
wrap can be turned on or off
If wrap is turned off, pad can be turned on to pad with NAs so the vector remains the same length
dplyr's lag() and lead() also shift vectors, but they pad with NA instead of wrapping around: https://dplyr.tidyverse.org/reference/lead-lag.html
lag(1:5, n = 1)
#> [1] NA 1 2 3 4
lag(1:5, n = 2)
#> [1] NA NA 1 2 3
