Conditional standard deviation in R

I have a large data set and I need to get the standard deviation for the Main column based on the number of rows in other columns. Here is a sample data set:
df1 <- data.frame(
  Main = c(0.33, 0.57, 0.60, 0.51),
  B = c(NA, NA, 0.09, 0.19),
  C = c(NA, 0.05, 0.07, 0.05),
  D = c(0.23, 0.26, 0.23, 0.26)
)
View(df1)
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example: since rows 1 and 2 are NA, its standard deviation is sd(df1[3:4, 1]); columns C and D give sd(df1[2:4, 1]) and sd(df1[1:4, 1]). Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I did the following, but it only returned one number, 0.0636:
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!

Try:
# for each non-Main column, take the sd of Main over that column's non-NA rows
sapply(df1[, -1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387

x <- colnames(df1) # all the columns you want the sd of
value <- sapply(x, function(col) sd(df1[[col]], na.rm = TRUE))
value
# Note: this gives each column's own standard deviation,
# not the sd of Main conditioned on the other columns.
# Main B C D
# 0.12093387 0.07071068 0.01154701 0.01732051

We can get this with colSds from matrixStats:
library(matrixStats)
# replace each non-NA cell of B:D with the matching Main value, then take column sds
m <- `dim<-`(df1[, 1][NA^is.na(df1[-1]) * row(df1[-1])], dim(df1[, -1]))
colSds(m, na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387

Related

How to sort 'paired' vectors in R

Suppose I have two independent vectors `x` and `y` of the same length:
x y
1 0.12
2 0.50
3 0.07
4 0.10
5 0.02
I want to sort the elements in y in decreasing order, and sort the values in x in a way that allows me to keep the correspondence between the two vectors, which would lead to this:
x y
2 0.50
1 0.12
4 0.10
3 0.07
5 0.02
I'm new to R, and although I know it has a built-in sort function that lets me sort the elements in y, I don't know how to sort both. The only thing I can think of is a for loop that "manually" sorts x by checking the original location of the elements in y:
for (i in 1:length(ysorted)) {
  xsorted[i] <- x[which(ysorted[i] == y)]
}
which is very inefficient.
In dplyr:
dat <- structure(list(x = 1:5,
                      y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))
library(dplyr)
dat %>% arrange(desc(y))
x y
1 2 0.50
2 1 0.12
3 4 0.10
4 3 0.07
5 5 0.02
In data.table:
library(data.table)
as.data.table(dat)[order(-y)]
x y
1: 2 0.50
2: 1 0.12
3: 4 0.10
4: 3 0.07
5: 5 0.02
Speed Comparison
Three solutions have already been offered in the answers: base, dplyr, and data.table. As in many other cases in R programming, you can achieve exactly the same result with different approaches.
If you need to compare the approaches by how fast each one executes, you can use microbenchmark from the {microbenchmark} package (again, there are other ways to do this as well).
Here is an example. Each approach is run 1000 times, and the summaries of the required times are reported:
library(microbenchmark)
microbenchmark(
  base_order = dat[order(-dat$y), ],
  dplyr_order = dat %>% arrange(desc(y)),
  dt_order = as.data.table(dat)[order(-y)],
  times = 1000
)
#Unit: microseconds
# expr min lq mean median uq max neval
# base_order 42.0 63.25 97.2585 79.45 100.35 6761.8 1000
# dplyr_order 1244.5 1503.45 1996.4406 1689.85 2065.30 16868.4 1000
# dt_order 261.3 395.85 583.9086 487.35 587.70 39294.6 1000
The results show that, for your case, base_order is the fastest. It executed the column ordering about 20 times faster than dplyr_order did, and about 6 times faster than dt_order did.
We can use order in base R
df2 <- df1[order(-df1$y),]
Output:
df2
x y
2 2 0.50
1 1 0.12
4 4 0.10
3 3 0.07
5 5 0.02
data
df1 <- structure(list(x = 1:5, y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))

How to use mutate result as input to calc another column in R dplyr

I'd like to calculate two new columns in a data.frame, where the results depend on the value of the previous row. However, the previous row also needs to be calculated first, which means there is a dependency between the two columns (the input of one calculation is the output of the other). I could do it with a for loop, but maybe that's not the right way.
This is a sample for this case:
df <- data.frame(A = c(0.91, 0.98, 1, 1.1),
                 B = c(0.81, 1.11, 0.83, 0.92),
                 C = c(0.09, 0.06, 0.09, 0.08))
df$D <- NA
df$E <- NA
df[1, ]$D <- 0.0
I've been trying it with dplyr::mutate:
df %>%
  mutate(D = ifelse(lag(A) < 1, lag(E), lag(E) - lag(E) * lag(A)),
         E = B - (B - D) * exp(-C))
This is how the output should be:
> df
A B C D E
1 0.91 0.81 0.09 0.00000000 0.06971574
2 0.98 1.11 0.06 0.06971574 0.13029718
3 1.00 0.83 0.09 0.13029718 0.19051977
4 1.10 0.92 0.08 0.00000000 0.07073296
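Since mutate() computes D and E as whole columns, it cannot see E values that have not been produced yet. One workable option is the for loop the question mentions; this is a minimal sketch that implements the recurrence exactly as described above (df is the sample data from the question):

```r
df <- data.frame(A = c(0.91, 0.98, 1, 1.1),
                 B = c(0.81, 1.11, 0.83, 0.92),
                 C = c(0.09, 0.06, 0.09, 0.08))
n <- nrow(df)
D <- numeric(n)
E <- numeric(n)
D[1] <- 0
E[1] <- df$B[1] - (df$B[1] - D[1]) * exp(-df$C[1])
for (i in 2:n) {
  # D depends on the previous row's A and E; E depends on the current row's D
  D[i] <- if (df$A[i - 1] < 1) E[i - 1] else E[i - 1] - E[i - 1] * df$A[i - 1]
  E[i] <- df$B[i] - (df$B[i] - D[i]) * exp(-df$C[i])
}
df$D <- D
df$E <- E
```

Functions like Reduce() or purrr::accumulate() can express the same recurrence without an explicit loop, but the loop makes the row-by-row dependency easy to see.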

Looping through the data frame and match the values from another file in R

I need some help with r.
I have a data frame:
ant <- data.frame(n_scale = c(0.62, 0.29, -0.9),
                  aa = c('A', 'B', 'C'))
It looks like this:
0.62 A
0.29 B
-0.90 C
Then I read a file with a dataframe2 which looks like:
-1 0 1 2
C B A A
I want to achieve this:
-1 0 1 2
C B A A
-0.9 0.29 0.62 0.62
How can I loop through the dataframe2 to get values from the ant data frame?
Thank you very much for your help! :)
Using merge. After that, you can match the hyd column of the result with that of df2:
res <- merge(ant, df2)
res <- res[match(df2$hyd, res$hyd), ]
res
# aa n_scale hyd
# 4 C -0.90 -1
# 3 B 0.29 0
# 1 A 0.62 1
# 2 A 0.62 2
Please, next time you ask, provide your data as I do below.
Data:
ant <- data.frame(n_scale = c(0.62, 0.29, -0.9),
                  aa = c('A', 'B', 'C'))
df2 <- data.frame(hyd = c(-1, 0, 1, 2),
                  aa = c("C", "B", "A", "A"))
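A shorter base-R alternative is match(), which avoids the merge-then-reorder step entirely (a sketch using the same ant and df2 as in the Data block):

```r
ant <- data.frame(n_scale = c(0.62, 0.29, -0.9),
                  aa = c('A', 'B', 'C'))
df2 <- data.frame(hyd = c(-1, 0, 1, 2),
                  aa = c("C", "B", "A", "A"))
# match() returns, for each df2$aa, the position of that letter in ant$aa,
# so the lookup preserves df2's original row order (no re-sorting needed)
df2$n_scale <- ant$n_scale[match(df2$aa, ant$aa)]
df2
#   hyd aa n_scale
# 1  -1  C   -0.90
# 2   0  B    0.29
# 3   1  A    0.62
# 4   2  A    0.62
```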

What is an alternative to ifelse() in R?

I have a variable (say, VarX) with values 1:4 in a dataset with approximately 2000 rows. There are other variables in the dataset too. I would like to create a new variable (NewVar) so that if the value of VarX is 1, the value of NewVar is 0.32 (the value from myMat[1, 1]); if the value of VarX is 2, the value of NewVar is 0.05 (the value from myMat[2, 1]); and so on...
myMat
VarA VarB VarC
[1,] 0.32 0.34 0.27
[2,] 0.05 0.02 0.11
[3,] 0.11 0.11 0.17
[4,] 0.52 0.52 0.45
I have tried the following and it works fine:
df$NewVar <- ifelse(df$VarX == 1, 0.32,
ifelse(df$VarX == 2, 0.05,
ifelse(df$VarX == 3, 0.11,
ifelse(df$VarX == 4, 0.52, 0))))
However, I have another variable (say, VarY) which has 182 values and another matrix with 182 different values. So, using ifelse() would be quite tedious. Is there another way to perform the task in R? Thank you!
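A common alternative is to index the matrix directly with the variable, since a numeric vector used as a row index performs a vectorized lookup. A minimal sketch (the small df here is made up for illustration; myMat is the matrix from the question):

```r
myMat <- matrix(c(0.32, 0.05, 0.11, 0.52,
                  0.34, 0.02, 0.11, 0.52,
                  0.27, 0.11, 0.17, 0.45),
                ncol = 3,
                dimnames = list(NULL, c("VarA", "VarB", "VarC")))
df <- data.frame(VarX = c(1, 3, 2, 4, 1))
# look up row VarX, column "VarA" for every observation at once
df$NewVar <- myMat[df$VarX, "VarA"]
df$NewVar
# [1] 0.32 0.11 0.05 0.52 0.32
```

The same one-liner handles the 182-value VarY case, because the result length is simply the length of the index vector, not the size of the matrix.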

How to delete a duplicate row in R

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing shows. The data has many rows like this. How can I delete the rows where z has nothing, such as the second row?
output should be
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would deduce that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably on the first two columns only), and then, if the third column is empty, drop that row. By those rules, row 5 remains (column "z" is not blank) and row 6 remains (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
Those are the rows we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
Some example data:
df <- data.frame(x = runif(100),
                 y = runif(100),
                 z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a vector of logicals (TRUE/FALSE):
df[df$z != "",]
And as #AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
