I would like to do some calculations with the following dataframe. There are some values in specific cells of a column, and I would like to have them replicated based on a second column value, and store these in a new, third column:
x <- c("1", "2", "3", "4")
z <- rep(x, 5)
batch <- sort(z)
NDF <- rnorm(20, 10, 1); NDF <- signif(NDF, digits = 3)
Fibre_analysis <- data.frame(batch, NDF)
Fibre_analysis$NDF[[1]] <- 10
Fibre_analysis$NDF[[6]] <- 100
Fibre_analysis$NDF[[11]] <- 1000
Fibre_analysis$NDF[[16]] <- 10000
This is the table that I would like:
batch NDF NEW_column
1 1 10.00 10
2 1 10.80 10
3 1 9.44 10
4 1 10.30 10
5 1 11.60 10
6 2 100.00 100
7 2 8.26 100
8 2 9.15 100
9 2 9.40 100
10 2 8.53 100
11 3 1000.00 1000
12 3 9.41 1000
13 3 9.20 1000
14 3 10.30 1000
15 3 9.32 1000
16 4 10000.00 10000
17 4 11.20 10000
18 4 7.33 10000
19 4 9.34 10000
20 4 11.00 10000
I would like this to create a new column in the dataframe, containing absolute cell values from $NDF that change for each value of $batch.
Because I need to use this process more than once I created the following function:
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
         y[[16]])))
  print(y)
}
When I call the function:
Fibre_analysis$NEW_column <- batch_Function(Fibre_analysis$batch, Fibre_analysis$NDF)
I expect $NEW_column to look like this:
x <- c(10,100,1000,10000)
NEW_column <- rep(x, each=5)
whereas instead it is an exact copy of $NDF.
The only necessary change is to drop print(y), because it prevents the function from returning the desired result: an R function returns the value of its last evaluated expression, and print(y) returns y itself, which is why $NEW_column ended up as a copy of $NDF.
batch_Function <- function(x, y) {
  ifelse(x == "1", y[[1]],
  ifelse(x == "2", y[[6]],
  ifelse(x == "3", y[[11]],
         y[[16]])))
}
batch_Function(Fibre_analysis$batch, Fibre_analysis$NDF)
# [1] 10 10 10 10 10 100 100 100 100 100 1000 1000 1000 1000
# [15] 1000 10000 10000 10000 10000 10000
In case you still want print(y), you may put it at the beginning of batch_Function.
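If this has to scale to more batches, a more general sketch (my own assumption: the reference value for each batch is always the first row of that batch, as in positions 1, 6, 11 and 16 above) is to look the value up instead of hard-coding the positions:
batch_Function2 <- function(x, y) {
  # first occurrence of each batch value indexes its reference row
  # (assumes the data are sorted by batch, as Fibre_analysis is)
  y[match(x, x)]
}
Fibre_analysis$NEW_column <- batch_Function2(Fibre_analysis$batch, Fibre_analysis$NDF)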
I have two dataframes: the first contains a single column with 180k rows (i.e. 180k x 1) and the other has a single row with 13 columns containing 13 growth rates (i.e. 1 x 13).
I am trying to multiply these dataframes so that I get a single dataframe that shows the growth of these values over time.
I can multiply them, but I can't work out how to make the growth compound over time.
Effectively the dataframe I want will have the existing values in the first column, the second column will have the first column multiplied by the first growth rate, the third column will have the second column multiplied by the second growth rate etc.
Note: my growth rates are expressed as decimals (i.e. 0.05 for 5%).
I have this, but I am not sure how to reflect compounding in it.
LandValuesForecast <- LandValues[,1] %*% (1+t(unlist(GrowthRates[1,])))
You can loop over the columns of both dataframes, applying each rate to the value computed in the previous iteration.
# example data
values <- data.frame(x0 = 1:10 * 100)
rates <- data.frame(r1 = .1, r2 = .01, r3 = .05)
for (i in seq(ncol(rates))) {
values[[paste0("x", i)]] <- values[, i] * (1 + rates[, i])
}
values
x0 x1 x2 x3
1 100 110 111.1 116.655
2 200 220 222.2 233.310
3 300 330 333.3 349.965
4 400 440 444.4 466.620
5 500 550 555.5 583.275
6 600 660 666.6 699.930
7 700 770 777.7 816.585
8 800 880 888.8 933.240
9 900 990 999.9 1049.895
10 1000 1100 1111.0 1166.550
You can use Reduce(), borrowing @zephryl's data:
values <- data.frame(x0 = 1:10 * 100)
rates <- data.frame(r1 = .1, r2 = .01, r3 = .05)
data.frame(Reduce(`*`, rates + 1, init = values, accumulate = TRUE))
x0 x0.1 x0.2 x0.3
1 100 110 111.1 116.655
2 200 220 222.2 233.310
3 300 330 333.3 349.965
4 400 440 444.4 466.620
5 500 550 555.5 583.275
6 600 660 666.6 699.930
7 700 770 777.7 816.585
8 800 880 888.8 933.240
9 900 990 999.9 1049.895
10 1000 1100 1111.0 1166.550
Or same thing with purrr::accumulate():
library(purrr)
data.frame(accumulate(rates + 1, `*`, .init = values))
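Since compounding is just a cumulative product of the growth factors, an equivalent sketch (using the same values and rates as above) computes the factors once with cumprod() and scales the starting column:
factors <- cumprod(1 + unlist(rates))    # r1, then r1*r2, then r1*r2*r3, as cumulative factors
cbind(values, outer(values$x0, factors)) # each new column is x0 scaled by one cumulative factor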
If I understood your question correctly, I would convert the dataframes to matrices and multiply them using the outer() function, which should be fast.
library(dplyr)
df1 <- data.frame(aaa = c(1:10))
df2 <- data.frame(a1 = 1, a2 = 2, a3 = 3)
outer(as.matrix(df1, ncol = 1),
as.matrix(df2, nrow = 1),
`*`) %>% as.data.frame
This script will return:
aaa.1.a1 aaa.1.a2 aaa.1.a3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
6 6 12 18
7 7 14 21
8 8 16 24
9 9 18 27
10 10 20 30
I would like to reduce this code:
carro$custo_tprivate = with(carro, ifelse(decile_renda == 1,
renda_fa*0.116,
ifelse(decile_renda == 2, renda_fa*0.106,
ifelse(decile_renda == 3, renda_fa*0.102,
ifelse(decile_renda == 4, renda_fa*0.115,
ifelse(decile_renda == 5, renda_fa*0.124,
ifelse(decile_renda == 6, renda_fa*0.125,
ifelse(decile_renda == 7, renda_fa*0.137,
ifelse(decile_renda == 8, renda_fa*0.141,
ifelse(decile_renda == 9, renda_fa*0.156,
ifelse(decile_renda == 10, renda_fa*0.131, 0)))))))))))
Could someone teach me how to do that?
Thank you very much!
You can use match() to reduce your nested ifelse() statements, as below:
# put your case data into a data frame
data <- data.frame(x = seq(1, 10, by = 1),
                   y = runif(10))
# create your actual data
carro <- data.frame(decile_renda = sample(1:10, 10, replace = TRUE),
                    renda_fa = runif(10))
# match to get positions in the case table
pos <- with(carro, match(decile_renda, data$x, 0))
# multiply to get results
data$y[pos] * carro$renda_fa
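One caveat, in case some decile_renda values ever fall outside data$x (an assumption; the sample above always matches): with nomatch = 0, match() returns position 0, and data$y[0] silently drops elements, so the product can become shorter than carro$renda_fa. Keeping the NA preserves alignment:
pos <- with(carro, match(decile_renda, data$x))  # NA where there is no match
out <- data$y[pos] * carro$renda_fa              # NA propagates through the product
out[is.na(out)] <- 0                             # default to 0, as in the original ifelse chain
out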
Alternatively, this can be solved by left-joining with a lookup table.
If I understand the nested ifelse() construct correctly, the OP wants to multiply renda_fa by a factor which depends on the value of decile_renda. The factors are given for 10 distinct values of decile_renda; in all other cases, the result must be zero.
A left join will find matching values of decile_renda in the lookup table. Non-matching rows in carro will get an NA value, which needs to be replaced by zero afterwards.
The lookup table treats the individual cases as data instead of hard-coding them. This gives the flexibility to add or change cases without changing the code.
Create lookup table and test dataset
# create lookup table
lut <- data.frame(
decile_renda = 1:10,
fa = c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)
)
lut
decile_renda fa
1 1 0.116
2 2 0.106
3 3 0.102
4 4 0.115
5 5 0.124
6 6 0.125
7 7 0.137
8 8 0.141
9 9 0.156
10 10 0.131
# create test dataset
carro <- data.frame(decile_renda = 0:11, renda_fa = 100)
# randomize row order
set.seed(1L) # required for reproducible data
carro <- carro[sample(nrow(carro)), ]
carro
decile_renda renda_fa
9 8 100
4 3 100
7 6 100
1 0 100
2 1 100
5 4 100
3 2 100
8 7 100
6 5 100
11 10 100
12 11 100
10 9 100
Note that the test dataset has been chosen to allow for easy verification of the results.
Base R: merge()
carro <- merge(carro, lut, all.x = TRUE, by = "decile_renda")
carro$custo_tprivate <- with(carro, ifelse(is.na(fa), 0, renda_fa * fa))
carro
decile_renda renda_fa fa custo_tprivate
1 0 100 NA 0.0
2 1 100 0.116 11.6
3 2 100 0.106 10.6
4 3 100 0.102 10.2
5 4 100 0.115 11.5
6 5 100 0.124 12.4
7 6 100 0.125 12.5
8 7 100 0.137 13.7
9 8 100 0.141 14.1
10 9 100 0.156 15.6
11 10 100 0.131 13.1
12 11 100 NA 0.0
Note that the result column custo_tprivate shows 0.0 for the rows with non-matching decile_renda values of 0 and 11 as requested.
However, the drawback here is that merge() does not maintain the original row order (this is why the test dataset uses a random row order for demonstration). Also, the result contains the fa column which is no longer needed.
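Both drawbacks can be worked around in base R if needed. A sketch, assuming we start again from the original carro (before the merge above): carry a row id through the merge, reorder by it, and drop the helper columns at the end.
carro$row_id <- seq_len(nrow(carro))  # remember the original order
m <- merge(carro, lut, all.x = TRUE, by = "decile_renda")
m <- m[order(m$row_id), ]             # restore the original row order
m$custo_tprivate <- with(m, ifelse(is.na(fa), 0, renda_fa * fa))
m[c("row_id", "fa")] <- NULL          # drop the helper columns
m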
dplyr
library(dplyr)
carro %>%
left_join(lut, by = "decile_renda") %>%
mutate(custo_tprivate = if_else(is.na(fa), 0, renda_fa * fa)) %>%
select(-fa)
decile_renda renda_fa custo_tprivate
1 8 100 14.1
2 3 100 10.2
3 6 100 12.5
4 0 100 0.0
5 1 100 11.6
6 4 100 11.5
7 2 100 10.6
8 7 100 13.7
9 5 100 12.4
10 10 100 13.1
11 11 100 0.0
12 9 100 15.6
Here, the original row order is kept and the fa column has been removed.
data.table
With data.table we can do an update join where the matching rows of carro are being updated by reference, i.e., without copying the whole object. Only the result column custo_tprivate is appended to carro but not fa which would have to be removed afterwards. Also, for replacing the NA values only the affected rows are updated in place.
This might be an advantage in terms of speed and memory consumption in case of large datasets.
library(data.table)
setDT(carro)[lut, on = .(decile_renda), custo_tprivate := renda_fa * fa]
carro[is.na(custo_tprivate), custo_tprivate := 0]
carro
decile_renda renda_fa custo_tprivate
1: 8 100 14.1
2: 3 100 10.2
3: 6 100 12.5
4: 0 100 0.0
5: 1 100 11.6
6: 4 100 11.5
7: 2 100 10.6
8: 7 100 13.7
9: 5 100 12.4
10: 10 100 13.1
11: 11 100 0.0
12: 9 100 15.6
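If you would rather avoid the second pass over the NA rows, an alternative sketch (same carro and lut as above) is to initialise the result column with 0 and let the update join overwrite only the matching rows:
setDT(carro)[, custo_tprivate := 0]  # default for non-matching rows
carro[lut, on = .(decile_renda), custo_tprivate := renda_fa * fa]  # overwrite matches by reference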
I have this dataframe
r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816
I need to calculate the mean of column r2 for every block of rows where the difference between two values in the column distance is equal to or less than 100000.
For this example the desired output would be:
mean_r2 diff_of_distance
1 17.86 75527 ## mean of rows 1 to 5; distance 5 - distance 1
2 13.91 66164 ## mean of rows 2 to 5; distance 5 - distance 2
3 13.84 68551 ## mean of rows 3 to 6; distance 6 - distance 3
4 13.14 74161 ## mean of rows 4 to 7; distance 7 - distance 4
5 9.40 73047 ## mean of rows 5 to 7; distance 7 - distance 5
6 11.04 76081 ## mean of rows 6 to 8; distance 8 - distance 6
and so on.
Edit 1: I have more than 100,000 rows.
Thanks.
Loop through each value of distance, subtract it from the values in the distance vector, and test whether the result is less than 100000. This creates a boolean vector; because distance is sorted, its sum identifies the last index at which the difference is still below 100000 (i.e. the point where the bool becomes FALSE). Use this index to identify your block, then take the mean of r2 in each block.
To speed up the code, define your vectors' type and length up front (to avoid growing vectors on each iteration):
means <- vector("numeric", length = nrow(df))
rows <- vector("character", length = nrow(df))
distance_diff <- vector("numeric", length = nrow(df))

for (i in seq_along(df$distance)) {
  dis_val <- df$distance[i]                  # the ith distance value
  bools <- (df$distance - dis_val) < 100000  # TRUE while the difference stays below 100000
  block_range <- sum(bools)                  # index of the last row in the block (distance is sorted)
  rows[i] <- paste(i, "-", block_range)
  means[i] <- mean(df$r2[i:block_range])     # mean of r2 over the block
  distance_diff[i] <- df$distance[block_range] - dis_val
}
data.frame(mean_r2 = means, rows = rows, diff_of_distance = distance_diff)
mean_r2 rows diff_of_distance
1 17.860000 1 - 5 75527
2 13.915000 2 - 5 66164
3 13.842500 3 - 6 68551
4 13.140000 4 - 7 74161
5 9.403333 5 - 7 73047
6 11.043333 6 - 8 76081
7 17.076667 7 - 9 66116
8 26.016667 8 - 10 67065
9 32.695000 9 - 10 38260
10 39.080000 10 - 10 0
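Given the edit mentioning more than 100,000 rows, note that the subtraction inside the loop scans the whole vector on every iteration, which is O(n^2) overall. A vectorized sketch with base findInterval(), assuming distance is sorted ascending and using "equal to or less than 100000" as the question states:
ends <- findInterval(df$distance + 100000, df$distance)  # last row within range of each start row
starts <- seq_len(nrow(df))
cum_r2 <- c(0, cumsum(df$r2))                            # padded cumulative sums of r2
means <- (cum_r2[ends + 1] - cum_r2[starts]) / (ends - starts + 1)
data.frame(mean_r2 = means,
           rows = paste(starts, "-", ends),
           diff_of_distance = df$distance[ends] - df$distance)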
You can try:
# your data
d <- read.table(text="r2 distance
1 33.64 67866
2 8.50 77229
3 15.07 109119
4 24.35 142279
5 7.74 143393
6 8.21 177670
7 12.26 216440
8 12.66 253751
9 26.31 282556
10 39.08 320816", header=T)
library(tidyverse) #dplyr_0.7.2
d %>%
mutate(index=1:n()) %>% # add row index
group_by(index) %>% # group by this index
# calculate difference and find max row where diff < 100000
mutate(max_row=max(which(.$distance - distance < 100000, arr.ind=T))) %>%
# calculate mean
mutate(mean_r2=mean(.$r2[index:max_row])) %>%
# calculate the difference
mutate(diff_of_distance=.$distance[max_row] - .$distance[index]) %>%
# unite the columns
unite(rows, index, max_row, sep = "-")
# A tibble: 10 x 5
r2 distance rows mean_r2 diff_of_distance
* <dbl> <int> <chr> <dbl> <int>
1 33.64 67866 1-5 17.860000 75527
2 8.50 77229 2-5 13.915000 66164
3 15.07 109119 3-6 13.842500 68551
4 24.35 142279 4-7 13.140000 74161
5 7.74 143393 5-7 9.403333 73047
6 8.21 177670 6-8 11.043333 76081
7 12.26 216440 7-9 17.076667 66116
8 12.66 253751 8-10 26.016667 67065
9 26.31 282556 9-10 32.695000 38260
10 39.08 320816 10-10 39.080000 0
This works because group_by() subsets the dataframe, so within mutate() you can access the respective distance value per group, and you can calculate the difference against the complete vector using .$distance, since .$ accesses the complete column regardless of the group_by().
I'm working with data that looks similar to this:
cat value n
1 100 18
2 0 19
3 -100 15
4 100 13
5 0 17
6 -100 18
In the real data there are many values of cat, and value can be any number between -100 and 100 (no NAs).
What I want to do is to calculate the mean of value based on terciles defined by n
So, for example, since sum(n) = 100, what I want to do is get groups of n as close as possible to 33 and calculate the mean of value over each. So for the first tercile, 18 isn't quite 33, so I need to take 15 values from cat=2. The mean for the first tercile should then be (100*18 + 0*15)/(18+15). The second tercile would take the remaining n's from cat=2, then as many as are needed to reach 33: (0*4 + (-100)*15 + 100*13 + 0*1)/(4+15+13+1). Similarly for the last tercile.
I got started writing this, but ended up with lots of nasty for loops and if statements. I'm hoping that you see an easier way to deal with this than I do. Thanks in advance!
A solution with data.table:
setDT(df)[rep(1:.N, n)
          ][, indx := c(rep("a", 33), rep("b", 33), rep("c", 34))
            ][, .(mean_val_indx = mean(value)), by = indx]
this gives:
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
Which are the means of value for the three parts of the data.
Broken down in the intermediate steps:
1: replicate the rows according to n
setDT(df)[rep(1:.N,n)]
this gives (shortened):
cat value n
1: 1 100 18
2: 1 100 18
....
17: 1 100 18
18: 1 100 18
19: 2 0 19
20: 2 0 19
....
36: 2 0 19
37: 2 0 19
38: 3 -100 15
....
99: 6 -100 18
100: 6 -100 18
2: create an index with [,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
setDT(df)[rep(1:.N,n)
][,indx:=c(rep("a",33),rep("b",33),rep("c",34))]
this gives:
> dt
cat value n indx
1: 1 100 18 a
2: 1 100 18 a
....
17: 1 100 18 a
18: 1 100 18 a
19: 2 0 19 a
20: 2 0 19 a
....
32: 2 0 19 a
33: 2 0 19 a
34: 2 0 19 b
35: 2 0 19 b
....
99: 6 -100 18 c
100: 6 -100 18 c
3: summarise value by indx with [,.(mean_val_indx=mean(value)),by=indx]
You could try something like this, data being your example dataframe:
longData <- unlist(apply(data[, c("value", "n")], 1, function(x) {
  rep(x["value"], x["n"])
}))
aggregate(longData, list(cut(seq_along(longData), breaks = 3, right = FALSE)), mean)
longData will be a vector of length 100 with, using your example, 18 repetitions of 100, 19 repetitions of 0, etc.
The cut in the aggregate will divide longData into three groups, and the mean of each group will be calculated.
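To see the grouping this produces, you can tabulate the cut() directly; on a length-100 sequence the bins come out as 33, 33 and 34 values (the same interval labels show up in the timing comparison further below):
table(cut(1:100, breaks = 3, right = FALSE))
# [0.901,34)   [34,67)  [67,100)
#         33        33        34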
If the data is already very long, replicating the rows by "n" is perhaps unwanted. The following solution doesn't do this. Moreover, one third of the sum of the "n" values is not rounded to the nearest integer.
"i" is the vector of row numbers at which terciles end. Since it is possible that several terciles end at the same row, those row numbers are replicated; the result is the vector "k".
For each index "j", the cumulative sum of "data$value" * "data$n" up to "k[j]" covers "ms[k[j]]" terciles, so "ms[k[j]] - j" terciles have to be subtracted to get the cumulative sum up to the "j"th tercile.
m <- 3                              # number of quantile groups (terciles)
sn <- sum(data$n)                   # total of the weights
ms <- m * cumsum(data$n) / sn       # cumulative weight, in units of terciles
d <- diff(c(0, floor(ms)))          # number of tercile boundaries ending at each row
i <- which(d > 0)                   # rows where at least one tercile ends
k <- rep(i, d[i])                   # one row number per tercile boundary
vn <- data$value * data$n
sums <- cumsum(vn)[k] - (ms[k] - (1:m)) * data$value[k] * sn / m  # cumulative sums up to each boundary
means <- m * diff(c(0, sums)) / sn  # tercile means
The means of the terciles are:
> means
[1] 54 -6 -54
In this example "i" is equal to "k". But if terciles are replaced by deciles,
i.e. "m" is not 3 but 10, they are distinct:
> m
[1] 10
> i
[1] 1 2 3 4 5 6
> k
[1] 1 2 2 3 3 4 5 5 6 6
> means
[1] 100 80 0 -30 -100 60 50 0 -80 -100
I compared the speed of the 4 answers, using our small example with 8 rows:
> ##### "longData"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A1 <- f1(data) } )
user system elapsed
3.48 0.00 3.49
> ##### "sapply"-Answer #####
>
> system.time( for ( i in 1:1000 ) { A2 <- f2(data) } )
user system elapsed
1.00 0.00 0.99
> ##### "data.table"Answer #####
>
> system.time( for ( i in 1:1000 ) { A3 <- f3(data) } )
user system elapsed
4.73 0.00 4.79
> ##### this Answer #####
>
> system.time( for ( i in 1:1000 ) { A4 <- f4(data) } )
user system elapsed
0.43 0.00 0.44
The "sapply"-Answer is even false:
> A1
Group.1 x
1 [0.901,34) 54.545455
2 [34,67) -6.060606
3 [67,100) -52.941176
> A2
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
> A3
indx mean_val_indx
1: a 54.545455
2: b -6.060606
3: c -52.941176
> A4
[1] 54 -6 -54
>
This is basically the same as NicE's answer, although perhaps useful as a different way of assembling the rep and cutting operations:
sapply(split(sort(unlist(mapply(rep, res$value, res$n))),
             cut(seq(sum(res$n)), breaks = c(0, 33, 67, 100))),
       mean)
(0,33] (33,67] (67,100]
-100.00000 0.00000 93.93939
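As the timing comparison above points out, this gives different numbers: the sort() reorders the replicated values before they are split into groups. Dropping the sort() brings the result in line with the other answers (a sketch; note the bin sizes here are 33/34/33 rather than 33/33/34, so the means can still differ slightly):
sapply(split(unlist(mapply(rep, res$value, res$n)),
             cut(seq(sum(res$n)), breaks = c(0, 33, 67, 100))),
       mean)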
So basically my code above takes the numbers in groups of 5 and calculates the standard deviation of the values in each group. So if I have sample data like this:
Number STD
1 11.15
2 11.18
3 11.21
4 11.24
5 11.3
10 11.36
11 11.42
12 11.48
13 11.54
14 11.6
15 11.66
16 11.72
17 11.78
18 11.84
19 11.9
20 11.96
When I run my code, I'll get this output
Number STD
1 1 0.05770615
2 2 NA
3 3 0.09486833
4 4 0.09486833
So what I want to do is simply replace the NA with 0. Also, instead of getting factor labels like 1, 2, 3, 4, etc., I want to get 5, 10, 15, 20, 25, etc.
Another way of doing it:
# Generate data
number <- c(1:5, 10:20)
val <- c(11.15, 11.18, 11.21, 11.24, 11.30, 11.36, 11.42,
11.48, 11.54, 11.60, 11.66, 11.72, 11.78, 11.84, 11.90, 11.96)
data <- data.frame(number, val)
# Calculate SD
breaks <- seq(0, 20, 5)
splitted.data <- split(data$val, f=cut(data$number, breaks, labels=F))
err <- sapply(splitted.data, sd)
err[is.na(err)] <- 0
res <- cbind(Number = breaks[-1], STD = err)
Resulting in:
> res
Number STD
1 5 0.05770615
2 10 0.00000000
3 15 0.09486833
4 20 0.09486833
I haven't tried to rewrite what you are trying to do, but just for the sake of continuity:
You can use the labels argument of cut() to set the labels of the resulting categories.
Change NA to 0 using spread[is.na(spread)] <- 0.
The whole code is:
hunter <- lapply(hunt, function(i) {
  random <- cut(value[, i], seq(0, max(value[i]), 5),
                labels = seq(5, max(value[i]), 5))
  spread <- tapply(value[, i + 1], random, sd, na.rm = TRUE)
  spread[is.na(spread)] <- 0
  Number <- levels(as.factor(random))
  d <- data.frame(Number = Number, STD = spread)
})
Number STD
5 5 0.05770615
10 10 0.00000000
15 15 0.09486833
20 20 0.09486833
Using the data.table package, you can accomplish this in one call:
library(data.table)
DT <- data.table(value)
As a single call:
DT[, list(SD = ifelse(is.na(sd(STD)), 0, sd(STD)))
, by=list("Group" = factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5))]
Group SD
1: 5 0.05770615
2: 10 0.00000000
3: 15 0.09486833
4: 20 0.09486833
Breaking it down:
# you can create your groupings by
(Number - 1) %/% 5  # (i.e., the integer quotient when divided by 5)
# you can create your factor levels by
5 * ((Number - 1) %/% 5 + 1)
# calculate the Group:
DT[, grp := factor(G <- (Number-1) %/% 5, labels=(unique(G) + 1)*5)]
# calculate the SD by Group, replacing NA's with 0:
DT[, SD := ifelse(is.na(sd(STD)), 0, sd(STD)), by=grp]
unique(DT[, list(grp, SD)])
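A slightly more compact variant of the same idea (a sketch, using the DT built above): compute the group label directly in by= and handle the NA from single-value groups inline.
DT[, .(SD = {s <- sd(STD); ifelse(is.na(s), 0, s)}),
   by = .(Group = ((Number - 1) %/% 5 + 1) * 5)]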