I'm relatively new to R and am having trouble creating a vector that sums certain values based on other values. I'm not quite sure what the problem is. I don't receive an error, but the output is not what I was looking for. Here is a reproducible example:
fakeprice <- c(1, 2, 2, 1, NA, 5, 4, 4, 3, 3, NA)
fakeconversion <-c(.2, .15, .07, .25, NA, .4, .36, NA, .67, .42, .01)
fakedata <- data.frame(fakeprice, fakeconversion)
fake.list <- sort(unique(fakedata$fakeprice))
fake.sum <- vector(,5)
So, fakedata looks like:
fakeprice fakeconversion
1 1 0.20
2 2 0.15
3 2 0.07
4 1 0.25
5 NA NA
6 5 0.40
7 4 0.36
8 4 NA
9 3 0.67
10 3 0.42
11 NA 0.01
I think the problem lies in the NAs, but I'm not quite sure (there are quite a few in the original data set). Here are the for loops with nested if statements. I kept getting an error when the price was NA, so I added the is.na() check:
for(i in fake.list){
  sum = 0
  for(j in fakedata$fakeprice){
    if(is.na(fakedata$fakeprice[j]) == TRUE){
      NULL
    } else {
      if(fakedata$fakeprice[j] == fake.list[i]){
        sum <- sum + fakedata$fakeconversion[j]
      }
    }
  }
  fake.sum[i] = sum
}
sumdata <- data.frame(fake.list, fake.sum)
I'm looking for an output that adds up fakeconversion for each unique price. So, for fakeprice=1, fake.sum=0.45. The resulting data I am looking for would look like:
fake.list fake.sum
1 1 0.45
2 2 0.22
3 3 1.09
4 4 0.36
5 5 0.40
What I get, however, is:
sumdata
fake.list fake.sum
1 1 0.90
2 2 0.44
3 3 0.00
4 4 0.00
5 5 0.00
Any help is very much appreciated!
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
The na.rm = TRUE deals with the NA conversion value in the fakeprice 4 group; rows where fakeprice itself is NA are dropped from the grouping.
The aggregate function works by subsetting your data by something and then running a function, FUN.
So:
aggregate(x, by, FUN, ...)
x is what you wish to run FUN on. by can be given a list if you wish to split the data by multiple columns.
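Run on the example data above, that call should give the sums you're after (the sum column is named x by default and can be renamed afterwards):
aggregate(fakedata$fakeconversion, list(price = fakedata$fakeprice), sum, na.rm = TRUE)
#   price    x
# 1     1 0.45
# 2     2 0.22
# 3     3 1.09
# 4     4 0.36
# 5     5 0.40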
Suppose I have two independent vectors `x` and `y` of the same length:
x y
1 0.12
2 0.50
3 0.07
4 0.10
5 0.02
I want to sort the elements in y in decreasing order, and sort the values in x in a way that allows me to keep the correspondence between the two vectors, which would lead to this:
x y
2 0.50
1 0.12
4 0.10
3 0.07
5 0.02
I'm new to R, and although I know it has a built-in sort function that allows me to sort the elements in y, I don't know how to sort both together. The only thing I can think of involves a for loop that "manually" sorts x by checking the original location of the elements in y:
for(i in 1:length(ysorted)){
  xsorted[i] = x[which(ysorted[i] == y)]
}
which is very inefficient.
In dplyr:
dat <- structure(list(x = 1:5,
                      y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))
library(dplyr)
dat %>% arrange(desc(y))
x y
1 2 0.50
2 1 0.12
3 4 0.10
4 3 0.07
5 5 0.02
In data.table:
library(data.table)
as.data.table(dat)[order(-y)]
x y
1: 2 0.50
2: 1 0.12
3: 4 0.10
4: 3 0.07
5: 5 0.02
Speed Comparison
Three solutions have already been offered in the answers: base R, dplyr, and data.table. As in this case, in R programming you can often achieve exactly the same result with different approaches.
If you need to compare the approaches by how fast each one executes, you can use microbenchmark() from the {microbenchmark} package (again, there are other ways to do this as well).
Here is an example in which each approach is run 1000 times and summaries of the required times are reported.
library(microbenchmark)

microbenchmark(
  base_order = dat[order(-dat$y), ],
  dplyr_order = dat %>% arrange(desc(y)),
  dt_order = as.data.table(dat)[order(-y)],
  times = 1000
)
#Unit: microseconds
# expr min lq mean median uq max neval
# base_order 42.0 63.25 97.2585 79.45 100.35 6761.8 1000
# dplyr_order 1244.5 1503.45 1996.4406 1689.85 2065.30 16868.4 1000
# dt_order 261.3 395.85 583.9086 487.35 587.70 39294.6 1000
The results show that, for your case, base_order is the fastest: it performed the row ordering about 20 times faster than dplyr_order and about 6 times faster than dt_order.
We can use order() in base R:
df2 <- df1[order(-df1$y),]
-output
df2
x y
2 2 0.50
1 1 0.12
4 4 0.10
3 3 0.07
5 5 0.02
data
df1 <- structure(list(x = 1:5, y = c(0.12, 0.5, 0.07, 0.1, 0.02)),
                 class = "data.frame", row.names = c(NA, -5L))
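If x and y are plain vectors rather than columns of a data frame, the same idea works with order() directly; a minimal sketch:
ord <- order(y, decreasing = TRUE)
xsorted <- x[ord]
ysorted <- y[ord]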
I am trying to change a data frame so that I only keep the columns whose value in the first row is among the n largest.
For example, let's assume I only want to keep the columns whose value in row 1 is among the top 2 largest.
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
such that a and d are removed, because 0.1 and NA are not among the 2 largest values in row 1. Here 0.6 and 0.12 are larger than 0.1 (column a) and NA (column d).
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
Also, this way, the columns of the returned data.frame won't be reordered.
get_nth <- function(df, n) {
  df[] <- lapply(df, as.numeric)
  cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
  cols <- cols[seq(n)]
  df <- df[names(df) %in% cols]
  return(df)
}
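For example, with the dat1 from the question, the call below should keep only columns b and c:
get_nth(dat1, 2)
#     b    c
# 1 0.6 0.12
# 2 0.7 0.13
# 3 0.8 0.14
# 4 0.9 0.15
# 5 0.1 0.16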
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can get an inverted rank of the first row and take the top n columns:
> r <- rank(-dat1[1,], na.last=T)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Currently I'm working on the following code:
data <- rep(1:3, times = c(10, 4, 6))

for(i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)
  print(samp)
  for(i in 1:3) {
    prsamp <- sum(samp == i)/4
    print(prsamp)
  }
  if (length(data) == 0) {
    break
  }
}
This currently prints out five vectors of length four, with the corresponding probabilities of each number occurring in each vector.
> source("buffoon.R")
> buffoon(20, 4, 3, c(10,4,6))
[1] 1 1 2 3
[1] 0.5
[1] 0.25
[1] 0.25
[1] 1 3 3 2
[1] 0.25
[1] 0.25
[1] 0.5
[1] 2 1 1 1
[1] 0.75
[1] 0.25
[1] 0
[1] 3 1 2 3
[1] 0.25
[1] 0.25
[1] 0.5
[1] 1 3 1 1
[1] 0.75
[1] 0
[1] 0.25
So, for instance, the first vector 1123 gives us a 0.5 prob of 1, 0.25 of 2, and 0.25 of 3. I would like to turn the output into a nice data frame that lists each sample vector in column 1 and, in column 2, the corresponding probabilities of each value occurring, but I'm running into many errors. I've been researching this issue for a few hours now with no success. Any help is appreciated.
My ideal data frame would look like this:
Sample Probability Dist
1 1123 0.5 0.25 0.25
2 1332 0.25 0.25 0.5
and so on, down to row 5.
The first thing you will have to do is create an empty data frame. Secondly, you will want your for loop to write into this data frame instead of simply printing the results directly. Also, you don't want the inner for loop to use i as its index when the outer loop already uses i. I suggest you try the following:
data <- rep(1:3, times = c(10, 4, 6))
datafr <- data.frame(Sample = rep(NA, 5), Probability.Dist = rep(NA, 5))

for(i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)
  datafr$Sample[i] <- samp[1]*1000 + samp[2]*100 + samp[3]*10 + samp[4] # easy way of getting your wanted sample layout
  prsamp <- rep(0, 3)
  for(j in 1:3) {
    prsamp[j] <- sum(samp == j)/4
  }
  datafr$Probability.Dist[i] <- toString(prsamp)
  if (length(data) == 0) {
    break
  }
}
datafr
# Sample Probability.Dist
#1 1231 0.5, 0.25, 0.25
#2 2132 0.25, 0.5, 0.25
#3 1313 0.5, 0, 0.5
#4 2111 0.75, 0.25, 0
#5 3131 0.5, 0, 0.5
I also have to advise against storing 3 values in a single column of a data frame. For further analysis, and even for readability, it would be much better to give each value its own column, as sketched below.
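A minimal sketch of that variant, assuming the same setup as above (exclude() is the function from your own code, not base R):
data <- rep(1:3, times = c(10, 4, 6))
datafr <- data.frame(Sample = rep(NA_character_, 5),
                     p1 = rep(NA_real_, 5),
                     p2 = rep(NA_real_, 5),
                     p3 = rep(NA_real_, 5),
                     stringsAsFactors = FALSE)

for(i in 1:5) {
  samp <- sample(data, 4)
  data <- exclude(data, samp)              # exclude() comes from your own code
  datafr$Sample[i] <- paste(samp, collapse = "")
  for(j in 1:3) {
    datafr[i, j + 1] <- sum(samp == j)/4   # p1, p2, p3 hold the probability of each value
  }
}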
I have the following data:
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
1) I need to repeat the rows for each id according to n, rounded up. For example, n=2.63 for id=1, so the id=1 row needs to be replicated three times. If n=0.5, the row needs to be replicated only once.
2) Create a new variable called t, where the sum of t for each id must equal n.
3) Create another new variable called accumulated.t, the running total of t within each id.
Here is how the output should look:
id n t accumulated.t
1 2.63 1 1
1 2.63 1 2
1 2.63 0.63 2.63
2 1.5 1 1
2 1.5 0.5 1.5
3 0.5 0.5 0.5
4 3.5 1 1
4 3.5 1 2
4 3.5 1 3
4 3.5 0.5 3.5
5 4 1 1
5 4 1 2
5 4 1 3
5 4 1 4
Get the ceiling of the 'n' column and use that to expand the rows of 'mydata' (rep(1:nrow(mydata), ceiling(mydata$n))).
Using data.table, we convert the 'data.frame' to 'data.table' (setDT(mydata1)). Grouped by the 'id' column, we replicate (rep) 1 with times given by the truncated first value of 'n' (rep(1, trunc(n[1]))). We then take the difference between the unique value of 'n' per group and the sum of 'tmp' (n[1] - sum(tmp)). If the difference is greater than 0, we concatenate 'tmp' and 'tmp2' (c(tmp, tmp2)); if it is 0, we take only 'tmp'. This is placed in a list to create the two columns: 't' and the cumulative sum of 'tmp3' (cumsum(tmp3)).
library(data.table)
mydata1 <- mydata[rep(1:nrow(mydata), ceiling(mydata$n)), ]
setDT(mydata1)[, c('t', 'taccum') := {
  tmp <- rep(1, trunc(n[1]))
  tmp2 <- n[1] - sum(tmp)
  tmp3 <- if(tmp2 == 0) tmp else c(tmp, tmp2)
  list(tmp3, cumsum(tmp3))
}, by = id]
mydata1
# id n t taccum
# 1: 1 2.63 1.00 1.00
# 2: 1 2.63 1.00 2.00
# 3: 1 2.63 0.63 2.63
# 4: 2 1.50 1.00 1.00
# 5: 2 1.50 0.50 1.50
# 6: 3 0.50 0.50 0.50
# 7: 4 3.50 1.00 1.00
# 8: 4 3.50 1.00 2.00
# 9: 4 3.50 1.00 3.00
#10: 4 3.50 0.50 3.50
#11: 5 4.00 1.00 1.00
#12: 5 4.00 1.00 2.00
#13: 5 4.00 1.00 3.00
#14: 5 4.00 1.00 4.00
An alternative that utilizes base R.
mydata <- data.frame(id=c(1,2,3,4,5), n=c(2.63, 1.5, 0.5, 3.5, 4))
mynewdata <- data.frame(id = rep(x = mydata$id, times = ceiling(x = mydata$n)),
                        n = mydata$n[match(x = rep(x = mydata$id, ceiling(mydata$n)), table = mydata$id)],
                        t = rep(x = mydata$n / ceiling(mydata$n), times = ceiling(mydata$n)))
mynewdata$t.accum <- unlist(x = by(data = mynewdata$t, INDICES = mynewdata$id, FUN = cumsum))
We start by creating a data.frame with three columns: id, n, and t. id is calculated using rep and ceiling to repeat the ID variable the appropriate number of times. n is obtained by using match to look up the right value in mydata$n. t is obtained by taking the ratio of n to ceiling(n) and repeating it ceiling(n) times (note that this spreads n evenly across the repeated rows rather than using 1s with a remainder, though the per-id sums still equal n).
Then we use cumsum to get the cumulative sum, called via by to allow by-group processing for each group of IDs. You could probably use tapply() here as well, as sketched below.
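For instance, a minimal sketch of the tapply() variant (this assumes, as here, that the rows are already sorted by id so the unlisted result lines up with the rows):
mynewdata$t.accum <- unlist(tapply(mynewdata$t, mynewdata$id, FUN = cumsum))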
I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing shows.
The data has a lot of rows like this. How do I delete the rows where z has nothing?
I mean deleting rows such as the second row.
The output should be:
x y z
1 2 a
OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would guess that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably on the first two columns only) and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
Some example data:
df = data.frame(x = runif(100),
                y = runif(100),
                z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Note row six with the missing value. You can subset using a vector of logicals (TRUE/FALSE):
df[df$z != "",]
And as #AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
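If some of the blanks are genuine NA values rather than empty strings, a combined check along these lines should cover both cases (a sketch, assuming the same df as above):
df[!(df$z %in% c("", " ") | is.na(df$z)), ]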