Suppose I have two independent vectors `x` and `y` of the same length:
x y
1 0.12
2 0.50
3 0.07
4 0.10
5 0.02
I want to sort the elements of y in decreasing order and reorder x so that the correspondence between the two vectors is kept, which would lead to this:
x y
2 0.50
1 0.12
4 0.10
3 0.07
5 0.02
I'm new to R, and although I know it has a built-in sort function that lets me sort the elements of y, I don't know how to sort both together. The only thing I can think of is a for loop that "manually" sorts x by looking up the original location of each element of y:
ysorted <- sort(y, decreasing = TRUE)
xsorted <- numeric(length(ysorted))
for (i in 1:length(ysorted)) {
  xsorted[i] <- x[which(ysorted[i] == y)]
}
which is very inefficient.
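For what it's worth, here is a minimal loop-free sketch using order(), which is also the idea the answers below build on: order(y, decreasing = TRUE) returns the permutation that sorts y, and the same permutation reorders x:
ord <- order(y, decreasing = TRUE) # permutation that sorts y in decreasing order
ysorted <- y[ord]
xsorted <- x[ord]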
In dplyr:
dat <- structure(list(x = 1:5,
                      y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))
library(dplyr)
dat %>% arrange(desc(y))
x y
1 2 0.50
2 1 0.12
3 4 0.10
4 3 0.07
5 5 0.02
In data.table:
library(data.table)
as.data.table(dat)[order(-y)]
x y
1: 2 0.50
2: 1 0.12
3: 4 0.10
4: 3 0.07
5: 5 0.02
Speed Comparison
Three solutions have already been offered in the answers, namely base R, dplyr, and data.table. As is often the case in R programming, you can achieve exactly the same result with several different approaches.
In case you need to compare how fast each approach executes, you can use microbenchmark() from the {microbenchmark} package (again, there are other ways to do this too).
Here is an example: each approach is run 1000 times, and summaries of the elapsed times are reported.
library(microbenchmark)
microbenchmark(
  base_order  = dat[order(-dat$y), ],
  dplyr_order = dat %>% arrange(desc(y)),
  dt_order    = as.data.table(dat)[order(-y)],
  times = 1000
)
#Unit: microseconds
# expr min lq mean median uq max neval
# base_order 42.0 63.25 97.2585 79.45 100.35 6761.8 1000
# dplyr_order 1244.5 1503.45 1996.4406 1689.85 2065.30 16868.4 1000
# dt_order 261.3 395.85 583.9086 487.35 587.70 39294.6 1000
The results show that, for your case, base_order is the fastest: it ordered the rows about 20 times faster than dplyr_order did, and about 6 times faster than dt_order did. (With only five rows these timings mostly reflect per-call overhead, so the ranking may change on larger data.)
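If you want to check how the ranking scales, the same comparison can be rerun on larger data; this sketch uses a hypothetical million-row data frame, not data from the question:
dat_big <- data.frame(x = seq_len(1e6), y = runif(1e6))
microbenchmark(
  base_order  = dat_big[order(-dat_big$y), ],
  dplyr_order = dat_big %>% arrange(desc(y)),
  dt_order    = as.data.table(dat_big)[order(-y)],
  times = 10
)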
We can use order in base R:
df2 <- df1[order(-df1$y),]
Output:
df2
x y
2 2 0.50
1 1 0.12
4 4 0.10
3 3 0.07
5 5 0.02
Data:
df1 <- structure(list(x = 1:5, y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))
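A side note on the minus sign: negating works here because y is numeric; for columns that cannot be negated (e.g. character), order(..., decreasing = TRUE) gives the same result:
df2 <- df1[order(df1$y, decreasing = TRUE), ]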
Related
I have a data frame constituted of two columns:
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
df
positionx pvalue
1 1 0.100
2 2 0.040
3 3 0.030
4 4 0.020
5 5 0.001
6 6 0.200
7 7 0.500
8 8 0.600
9 9 0.001
10 10 0.002
I would like to find in which intervals of values of positionx my pvalue is below a certain threshold, let's say 0.05.
Using which() I can find the indices of the rows, and from those I could go back to the values of positionx.
which(df[,2]<0.05)
[1] 2 3 4 5 9 10
However, what I would like are the edges of the intervals; by that I mean a result like 2-5, 9-10.
I also tried to use the findInterval function, as below:
int <- c(-10, 0.05, 10)
separation <- findInterval(pvalue,int)
separation
[1] 2 1 1 1 1 2 2 2 1 1
df_sep <- data.frame(cbind(df, separation))
df_sep
positionx pvalue separation
1 1 0.100 2
2 2 0.040 1
3 3 0.030 1
4 4 0.020 1
5 5 0.001 1
6 6 0.200 2
7 7 0.500 2
8 8 0.600 2
9 9 0.001 1
10 10 0.002 1
However, I am stuck again with a column of numbers, while what I want are the edges of the intervals that contain 1 in the separation column.
Is there a way to do that?
This is a simplified example; in reality I have many plots, and for each plot one data frame of this type (just much longer and with p-values not as easy to judge at a glance).
The reason I think I need the edges of my intervals is that I would like to colour the background of my ggplot according to the pvalue. I know I can use geom_rect for it, but I think I need the edges of the intervals in order to build the coloured rectangles.
Is there a way to do this in an automated way instead of manually?
This seems like a great use case for run length encoding.
Example below:
library(ggplot2)
# Data from question
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
# Sort data (just to be sure)
df <- df[order(df$positionx),]
# Do run length encoding magic
threshold <- 0.05
rle <- rle(df$pvalue < threshold)
ends <- cumsum(rle$lengths)      # last index of each run
starts <- ends - rle$lengths + 1 # first index of each run
df2 <- data.frame(
xmin = df$positionx[starts],
xmax = df$positionx[ends],
type = rle$values
)
# Filter on type
df2 <- df2[df2$type == TRUE, ] # Keep runs that satisfied the threshold criterion
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1)) +
geom_rect()
Created on 2020-05-22 by the reprex package (v0.3.0)
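For reference, running the code above on the question's data leaves df2 with exactly the intervals asked for, 2-5 and 9-10:
df2
#   xmin xmax type
# 2    2    5 TRUE
# 4    9   10 TRUE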
I have a large data set and I need to get the standard deviation for the Main column based on the number of rows in other columns. Here is a sample data set:
df1 <- data.frame(
Main = c(0.33, 0.57, 0.60, 0.51),
B = c(NA, NA, 0.09,0.19),
C = c(NA, 0.05, 0.07, 0.05),
D = c(0.23, 0.26, 0.23, 0.26)
)
df1
# Main B C D
# 1 0.33 NA NA 0.23
# 2 0.57 NA 0.05 0.26
# 3 0.60 0.09 0.07 0.23
# 4 0.51 0.19 0.05 0.26
Take column B as an example: since rows 1 & 2 are NA, its standard deviation will be sd(df1[3:4, 1]); for columns C & D it will be sd(df1[2:4, 1]) and sd(df1[1:4, 1]). Therefore, the result will be:
# B C D
# 1 0.06 0.05 0.12
I did the following, but it only returned one number: 0.0636.
df2 <- df1[,-1]!=0
sd(df1[df2,1], na.rm = T)
My data set has many more columns, and I'm wondering if there is a more efficient way to get it done? Many thanks!
Try sapply over the non-Main columns: for each one, keep the rows where that column is not NA and take the standard deviation of Main:
sapply(df1[,-1], function(x) sd(df1[!is.na(x), 1]))
# B C D
# 0.06363961 0.04582576 0.12093387
A different approach is to compute the standard deviation of every column directly (note this gives the sd of each column itself, ignoring NAs, not the sd of Main over each column's non-NA rows):
x <- colnames(df1) # list all columns you want to calculate the sd of
value <- sapply(seq_along(x), function(i) sd(df1[, x[i], drop = TRUE], na.rm = TRUE))
names(value) <- x
value
#       Main          B          C          D
# 0.12093387 0.07071068 0.01154701 0.01732051
We can get this with colSds from matrixStats, after building a matrix that holds the Main value wherever df1[-1] is non-NA and NA elsewhere:
library(matrixStats)
# NA^is.na(.) is NA where a cell is NA and 1 otherwise; multiplying by row(.)
# gives row indices (or NA), which then index the Main column
colSds(`dim<-`(df1[, 1][NA^is.na(df1[-1]) * row(df1[-1])], dim(df1[, -1])), na.rm = TRUE)
#[1] 0.06363961 0.04582576 0.12093387
I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is: I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements; I kept getting an error when the price was NA, so I added the is.na() check:
for (i in fake.list) {
  sum = 0
  for (j in fakedata$fakeprice) {
    if (is.na(fakedata$fakeprice[j]) == TRUE) {
      NULL
    } else {
      if (fakedata$fakeprice[j] == fake.list[i]) {
        sum <- sum + fakedata$fakeconversion[j]
      }
    }
  }
  fake.sum[i] = sum
}
sumdata <- data.frame(fake.list, fake.sum)
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
The above will deal with the NA in fakeprice 4. (Your loop goes wrong because for (j in fakedata$fakeprice) iterates over the values of fakeprice rather than the row indices, and those values are then used as indices into the data, so rows get counted the wrong number of times.)
The aggregate function works by splitting your data by something and then running a function, FUN.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on; by can be given a list if you wish to split the data by multiple columns.
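Applied to the question's data, this reproduces the desired sums (note that aggregate drops the NA group automatically):
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
#   price    x
# 1     1 0.45
# 2     2 0.22
# 3     3 1.09
# 4     4 0.36
# 5     5 0.40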
I want to interlace two vectors of the same mode and equal length. Say:
a <- rpois(lambda=3,n=5e5)
b <- rpois(lambda=4,n=5e5)
I would like to interweave or interlace these two vectors, creating a vector equivalent to c(a[1], b[1], a[2], b[2], ..., a[length(a)], b[length(b)]).
My first attempt was this:
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1)
but it requires rpois to be called far more times than needed.
My best attempt so far has been to combine them into a matrix and convert that back into a vector:
d <- c(rbind(rpois(lambda=3,n=5e5),rpois(lambda=4,n=5e5)))
d <- c(rbind(a,b))
Is there a better way to go about doing it? Or is there a function in base R that accomplishes the same thing?
Your rbind method should work well: rbind(a, b) stacks the two vectors as the rows of a 2 x n matrix, and c() then reads that matrix column by column, which interleaves the elements. You could also use
rpois(lambda=c(3,4),n=1e6)
because R will automatically recycle the vector of lambda values to the required length. There's not much difference in speed:
library(rbenchmark)
benchmark(rpois(1e6,c(3,4)),
c(rbind(rpois(5e5,3),rpois(5e5,4))))
# test replications elapsed relative
# 2 c(rbind(rpois(5e+05, 3), rpois(5e+05, 4))) 100 23.390 1.112168
# 1 rpois(1e+06, c(3, 4)) 100 21.031 1.000000
and elegance is in the eye of the beholder ... of course, the c(rbind(...)) method works in general for constructing alternating vectors, while the other solution is specific to rpois or other functions that replicate their arguments in that way.
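As a footnote, here is a toy example (made-up values, not from the question) showing why c(rbind(a, b)) interleaves:
a <- c(1, 2, 3)
b <- c(10, 20, 30)
rbind(a, b)    # 2 x 3 matrix: column i holds a[i] over b[i]
c(rbind(a, b)) # matrices are read column-wise: 1 10 2 20 3 30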
Some speed tests, incorporating Ben Bolker's answer:
benchmark(
c(rbind(rpois(lambda=3,n=5e5),rpois(lambda=4,n=5e5))),
c(t(sapply(X=list(3,4),FUN=rpois,n=5e5))),
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1),
rpois(lambda=c(3,4),n=1e6),
rpois(lambda=rep.int(c(3,4),times=5e5),n=1e6)
)
test
1 c(rbind(rpois(lambda = 3, n = 5e+05), rpois(lambda = 4, n = 5e+05)))
2 c(t(sapply(X = list(3, 4), FUN = rpois, n = 5e+05)))
4 rpois(lambda = c(3, 4), n = 1e+06)
5 rpois(lambda = rep.int(c(3, 4), times = 5e+05), n = 1e+06)
3 sapply(X = rep.int(c(3, 4), times = 5e+05), FUN = rpois, n = 1)
replications elapsed relative user.self sys.self user.child sys.child
1 100 6.14 1.000000 5.93 0.15 NA NA
2 100 7.11 1.157980 7.02 0.02 NA NA
4 100 14.09 2.294788 13.61 0.05 NA NA
5 100 14.24 2.319218 13.73 0.21 NA NA
3 100 700.84 114.143322 683.51 0.50 NA NA
I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing shows.
The data has a lot of rows like this. How do I delete the rows where z has nothing, such as the second row?
The output should be:
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would deduce that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably on the first two columns only), and then, if the third column is empty, drop that row. By those rules, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z" so rows with non-blank z come first
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
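As a note on why the copy is "sorted" by z first: on the unsorted data, duplicated() would treat the blank row 3 as the first occurrence of (.33, .44), so it would never be flagged for removal:
duplicated(mydf[1:2])
# [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE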
Some example data:
df = data.frame(x = runif(100),
y = runif(100),
z = sample(c(letters[0:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a vector of logicals (TRUE/FALSE):
df[df$z != "",]
And as #AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
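One caveat worth knowing: != returns NA for missing values, so df[df$z != "", ] would let rows with NA in z slip through as all-NA rows, whereas %in% returns FALSE for NA and is therefore safer when z can contain NA. A quick illustration:
c("a", NA, "") != ""   # TRUE    NA FALSE
c("a", NA, "") %in% "" # FALSE FALSE  TRUE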