r join two lists and sum their values - r

I have two lists: x, y
> x
carlo monte simulation model quantum
31 31 9 6 6
> y
model system temperature quantum simulation problem
15 15 15 13 13 12
What function should I use to obtain:
simulation model quantum
22 21 19
I tried to merge them like in example but it gives me an error:
merge(x,y,by=intersect(names(x),names(y))) produces:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
There's no argument in that function what to do with values. What would be the best function to use?
intersect(names(x),names(y)) will give the names of resulting list, but how to summarize values together??

You can use Map in base R to return a list.
Map("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
$simulation
[1] 22
$model
[1] 21
$quantum
[1] 19
or mapply to return a named vector which may be more useful.
mapply("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
simulation model quantum
22 21 19
Using [intersect(names(x), names(y))] will not only be subset the contents of x and y to those with intersecting names, but will also properly sort the elements for the operation.
data
x <- list(carlo=1, monte=2, simulation=9, model=6, quantum=6)
y <-list(model=15, system=8, temperature=10, quantum=13, simulation=13, problem="no")

simple names matching does the trick :
# subset those from x which have names in y also
x1 = x[names(x)[names(x) %in% names(y)]]
# x1
# simulation model quantum
# 9 6 6
# similarily do it for y. note the order of names might be different from that in x
y1 = y[names(y)%in%names(x1)]
# y1
# model quantum simulation
# 15 13 13
# now order the names in both and then add.
x1[order(names(x1))]+y1[order(names(y1))]
# model quantum simulation
# 21 19 22

Base function merge() should do this with no issue so long as your fields make sense, but you need to include merge(..., all=TRUE), as in:
y <- data.frame(rbind(c(15,15,15,13,13,12)))
names(y) <- c("model","system","temperature","quantum","simulation","problem")
x <- data.frame(rbind(c(31,31,9,6,6)))
names(x) <- c("carlo","monte","simulation","model","quantum")
merge(x, y, by = c("simulation","model","quantum"), all = TRUE)
results in:
simulation model quantum carlo monte system temperature problem
1 9 6 6 31 31 NA NA NA
2 13 15 13 NA NA 15 15 12
Here you actually have data frames of length 1, not lists.

Related

Is there a way to create a permutation of a vector without using the sample() function in R?

I hope you are having a nice day. I would like to know if there is a way to create a permutation (rearrangement) of the values in a vector in R?
My professor provided with an assignment in which we are supposed create functions for a randomization test, one while using sample() to create a permutation and one not using the sample() function. So far all of my efforts have been fruitless, as any answer that I can find always resorts in the use of the sample() function. I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample function()
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P){
sample.t.value <- t.test(a, b)$statistic
perm.t.values<-matrix(rep(0,P),P,1)
N <-length(a)
M <-length(b)
for (i in 1:P)
{
permsample <-sample(c(a,b),replace=FALSE)
pgroup1 <- permsample[1:N]
pgroup2 <- permsample[(N+1) : (N+M)]
perm.t.values[i]<- t.test(pgroup1, pgroup2)$statistic
}
return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.
You can use runif() to generate a value between 1.0 and the length of the final array. The floor() function returns the integer part of that number. At each iteration, i decrease the range of the random number to choose, append the element in the rn'th position of the original array to the new one and remove it.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
c<-c(a,b)
index<-length(c)
perm<-c()
for(i in 1:length(c)){
rn = floor(runif(1, min=1, max=index))
perm<-append(perm,c[rn])
c=c[-rn]
index=index-1
}
It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.

Why is the for loop returning NA vectors in some positions (in R)?

Following a youtube tutorial, I have created a vector x [-3,6,2,5,9].
Then I create an empty variable of length 5 with the function 'numeric(5)'
I want to store the squares of my vector x in 'Storage2' with a for loop.
When I do the for loop and update my variable, it returns a very strange thing:
[1] 9 4 0 9 25 36 NA NA 81
I can see all numbers in x have been squared, but the order is so random, and there's more than 5.
Also, why are there NAs?? If it's because the last number of x is 9 (and so this number defines the length??), and there's no 7 and 8 position, I would understand, but then I'm also missing positions 1, 3 and 4, so there should be more NAs...
I'm just starting with R, so please keep it simple, and correct me if I'm wrong during my thought process! Thank you!!
x <- c(-3,6,2,5,9)
Storage2 <- numeric(5)
for(i in x){
Storage2[i] <- i^2
}
Storage2
# [1] 9 4 0 9 25 36 NA NA 81
You're looping over the elements of x not over the positions as probably intended. You need to change your loop like so:
for(i in 1:length(x)) {
Storage2[i] <- x[i]^2
}
Storage2
# [1] 9 36 4 25 81
(Note: 1:length(x) can also be expressed as seq_along(x), as pointed out by #NelsonGon in comments and might be faster.)
However, R is a vectorized language so you can simply do that:
Storage2 <- x^2
Storage2
# [1] 9 36 4 25 81

How to use BoxCoxTrans function in R?

I want to use BoxCoxTrans function in R to resolve problem of skewness.
But, I have a problem that couldn't get result as data frame. This is my R code.
df<-read.csv("dataSetNA1.csv",header=TRUE)
dd1<-apply(df[2:61],2,BoxCoxTrans) #Except independent variable that located first column, All variables are numeric variable.
dd1
$LT1Y_MXOD_AMT
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 0 19594 0 1600000
Lambda could not be estimated; no transformation is applied
$MOBL_PRIN
Box-Cox Transformation
96249 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0 100000 191229 320000 1100000
Lambda could not be estimated; no transformation is applied
str(dd1)
I don't know how to get result as data frame.
If I use as.data.frame function, this error message is posted.
dd2<-as.data.frame(dd1)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
클래스 ""BoxCoxTrans""를 data.frame으로 강제형변환 할 수 없습니다
please help me.
Here is one way to accomplish what you are after (I assume you are transforming the features):
library(caret)
data(cars)
#create a list with the BoxCox objects
g <- apply(cars, 2, BoxCoxTrans)
#use map2 from purr to apply the models to new data
z <- purrr::map2(g, cars, function(x, y) predict(x, y))
#here the transformation is performed on the same data on
#which I estimated the BoxCox lambda for
B_trans = as.data.frame(do.call(cbind, z)) #to convert to data frame
head(data.frame(B_trans, cars), 20)
#outpout
speed dist speed.1 dist.1
1 4 0.8284271 4 2
2 4 4.3245553 4 10
3 7 2.0000000 7 4
4 7 7.3808315 7 22
5 8 6.0000000 8 16
6 9 4.3245553 9 10
7 10 6.4852814 10 18
8 10 8.1980390 10 26
9 10 9.6619038 10 34
10 11 6.2462113 11 17
11 11 8.5830052 11 28
12 12 5.4833148 12 14
13 12 6.9442719 12 20
14 12 7.7979590 12 24
15 12 8.5830052 12 28
16 13 8.1980390 13 26
17 13 9.6619038 13 34
18 13 9.6619038 13 34
19 13 11.5646600 13 46
20 14 8.1980390 14 26
First two columns are transformed data and 2nd two are original data.
Another way is to incorporate the transformation of features during the training:
train(....preProcess = "BoxCox"...)
more on the matter: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/train
In order to perform a Box Cox transformation your data has to be positive. Hence, the values should be greater than 0.
The reason for this is, that the logarithm of 0 is -Inf.
If your data contains values of 0 you can just add 1 to each observation. This won't change your distribution/skewness.
A BoxCox transformation is a transformation on your response variable. You could use the Boxcox function of the MASS package to find out what transformation is needed. Boxcox returns a lambda value. U should raise your response, say y, to the power lambda and this results in a new response variable, y*.
Then just replace the y-column in your old data frame by y*.
Note that if the resulting lambda is 0, you should apply a logarithmic transformation ln(y).

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] = sum(x[x$num_groups <= x$num_groups[i], ]$count)
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in data.frame DF then following should do
do.call(rbind, lapply(split(DF, DF$type), FUN=cumsum))
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.

Construct a vector containing intervals between the values of two other vectors

I have two vectors of integers, like so:
c(1,5,14,24)
c(3,9,22,30)
I need to construct from these the vector containing the ranges between each value, concatenated together, like so:
c(1:3,5:9,14:22,24:30)
What is the best way to do this? I couldn't find another question addressing this on the site. I tried some stuff using higher order functions (Map, Fold, etc.) but they all seem to take only one list argument.
you could use mapply here to get your ranges.
mySeq <- mapply(seq, A, B)
dput(mySeq)
# list(1:3, 5:9, 14:22, 24:30)
As #señor points out, if you want the ranges as a single vector, use unlist as well:
unlist(mapply(seq, A, B))
# [1] 1 2 3 5 6 7 8 9 14 15 16 17 18 19 20 21 22 24 25 26 27 28 29 30

Resources