Use one variable conditioned on another - R

I am new to R, so I am not very adept with it. I am trying to use the values of one variable, conditioned on the corresponding value of another variable. For example,
x:  1  2  3 10 20 30
y: 45 60 20 78 65 27
I need to calculate a variable, say m, where
m = 5 * (value of y, given the value of x)
So, given x = 3, the corresponding y is 20, and m = 5*(20 | x=3) = 100,
and if x = 30, the corresponding y is 27, so m = 5*(27 | x=30) = 135.
Could you please tell me how to define m in this case?
Thanks

Try this
5*y[x == 3]
## [1] 100
And
5*y[x == 30]
## [1] 135
Edit: based on your new explanation, it looks like you are looking for match, i.e.,
m <- c(0, 1, 15, 20, 3)
y[match(m, x)]*5
## [1] NA 225 NA 325 100
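If you need m for several x values at once, you can wrap the same match idiom in a small helper (just a sketch; get_m is a made-up name for illustration):
x <- c(1, 2, 3, 10, 20, 30)
y <- c(45, 60, 20, 78, 65, 27)

# look up y for each requested x (NA when that x value is absent), then scale by 5
get_m <- function(x_query, x, y) 5 * y[match(x_query, x)]

get_m(c(3, 30), x, y)
## [1] 100 135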

Variable FOR LOOP in R

I have a question about creating vectors. If I do a <- 1:10, "a" has the values 1,2,3,4,5,6,7,8,9,10.
My question is: how do you create a vector with specific intervals between its elements? For example, I would like to create a vector that has the values from 1 to 100, but counting in intervals of 5, so that I get a vector with the values 5, 10, 15, 20, ..., 95, 100.
I think that in Matlab we can do 1:5:100, how do we do this using R?
I could do 5*(1:20), but is there a shorter way? (In this case I would need to know the endpoint (100) and divide by the size of the interval (5) to get the 20.)
In R, the equivalent function is seq, and you can use it with the by argument:
seq(from = 5, to = 100, by = 5)
# [1] 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
In addition to by, you can also use other options such as length.out and along.with.
length.out: if you want a total of 10 numbers between 0 and 1, for example:
seq(0, 1, length.out = 10)
# gives 10 equally spaced numbers: 0, 0.111..., 0.222..., ..., 0.888..., 1
along.with: It takes the length of the vector you supply as input and provides a vector from 1:length(input).
seq(along.with=c(10,20,30))
# [1] 1 2 3
However, instead of using the along.with option, it is recommended to use seq_along in this case. From the documentation for ?seq:
seq is generic, and only the default method is described here. Note that it dispatches on the class of the first argument irrespective of argument names. This can have unintended consequences if it is called with just one argument intending this to be taken as along.with: it is much better to use seq_along in that case.
seq_along: instead of seq(along.with = .), use seq_along(.):
seq_along(c(10,20,30))
# [1] 1 2 3
Use the code:
x = seq(0, 100, 5)  # this means (starting number, ending number, interval)
the output will be
[1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75
[17] 80 85 90 95 100
Usually, we want to divide our vector into a number of intervals.
In that case, you can use a function where a is the vector and
b is the number of intervals (let's suppose you want 4 intervals):
a <- 1:10
b <- 4
FunctionIntervalM <- function(a,b) {
seq(from=min(a), to = max(a), by = (max(a)-min(a))/b)
}
FunctionIntervalM(a,b)
# 1.00 3.25 5.50 7.75 10.00
Therefore you have 4 intervals:
1.00 - 3.25
3.25 - 5.50
5.50 - 7.75
7.75 - 10.00
You can also use the cut function:
cut(a, 4)
# (0.991,3.25] (0.991,3.25] (0.991,3.25] (3.25,5.5] (3.25,5.5] (5.5,7.75]
# (5.5,7.75] (7.75,10] (7.75,10] (7.75,10]
#Levels: (0.991,3.25] (3.25,5.5] (5.5,7.75] (7.75,10]
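If what you ultimately want is how many elements fall into each interval, a quick follow-up (not part of the answer above, just a suggestion) is to tabulate the factor that cut returns:
a <- 1:10
table(cut(a, 4))
# (0.991,3.25]   (3.25,5.5]   (5.5,7.75]    (7.75,10]
#            3            2            2            3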

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have an unique id column to help me match them. I have an x column and I have a y column which, combined, can make up a unique id.
My question is this: what is an elegant way of merging these two data sets, replacing the values in the original data frame with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, the rows in original where the x and y coordinates both match the corresponding coordinates in update, should have their value replaced with the values in the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried for some time to come up with a vectorised solution using indexing, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids, but the two columns are non-unique.
One solution would be to treat them as strings or tuples, combine them into one column as a coordinate pair, and then use %in%.
But I was curious whether there is a solution to this problem involving indexing with boolean vectors. Any suggestions?
First, merge in a way that guarantees all rows from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
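The same two steps can be written a little more compactly with dplyr::coalesce, which picks the first non-NA value (a minor variant of the above, assuming a dplyr version that provides coalesce):
library(dplyr)

final <- merge(original, update, by = c("x", "y"), all.x = TRUE) %>%
  mutate(value = coalesce(value.y, value.x)) %>%  # prefer the updated value when present
  select(x, y, value)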
The match function can be used to generate indices. It needs a nomatch argument to prevent NA indices on the left-hand side of the assignment. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
            match(update$x, original$x, nomatch = 0) ==
            match(update$y, original$y, nomatch = 0) ],
          "value" ] <-
  update[ which( match(update$x, original$x) == match(update$y, original$y) ),
          "value" ]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
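For completeness, the combine-into-one-key idea mentioned in the question also works with a plain match, assuming the pasted x/y strings uniquely identify a coordinate pair (they do here, since both columns are numeric):
# build a single key per row from the coordinate pair
key_original <- paste(original$x, original$y)
key_update   <- paste(update$x, update$y)

# for each original row, find the matching update row (NA means no change)
idx <- match(key_original, key_update)
original$value <- ifelse(is.na(idx), original$value, update$value[idx])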

Calculate Total Sum of Square Inconsistency

I am attempting to write my own functions for the total sum of squares, within sum of squares, and between sum of squares in RStudio, for my own implementation of k-means.
I've successfully written the function for the within sum of squares, but I'm having difficulty with the total sum of squares (and thus the bss). The result I get is significantly larger than what R's own kmeans function computes. I'm confused, because I am following exactly what the formulas specify. Here is my data:
A =
36 3
73 3
30 3
49 3
47 11
47 11
0 7
46 5
16 3
52 4
0 8
21 3
0 4
57 6
31 5
0 6
40 3
31 5
38 4
0 5
59 4
61 6
48 7
29 2
0 4
19 4
19 3
48 9
48 4
21 5
where each column is a feature. This is the function I've created thus far for tss:
tot_sumoSq <- function(data){
avg = mean( as.matrix(data) )
r = matrix(avg, nrow(data), ncol(data))
tot_sumoSq = sum( (data - r)^2 )
}
I receive the result 24342.4, but R gives 13244.8. Am I completely missing something?
The latter value is calculated using the column means. If you use those when calculating the means, you'll get the same answer.
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow = TRUE)
sum((data - r)^2)
# [1] 13244.8
There may be something wrong in your program: you subtract a matrix from a data frame. Use the following instead:
tot_sumoSq <- function(data){
data = as.matrix(data)
x = sum((data - mean(data))^2)
return(x)
}
From my side it gives the correct answer.
I found a solution to my issue by combining the solutions provided by the first two answerers. I now see what my previous mistake was and would like to clear up any confusion for future readers.
tot_sumoSq <- function(data){
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow = T)
data = as.matrix(data)
return( sum( (data - r)^2 ) )
}
Each column is the entire sample for a different feature, so when we calculate the mean of each column, we get that feature's mean over the entire sample. My conceptual mistake earlier was to combine both features and calculate a single overall mean.
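As a quick sanity check (a sketch, assuming A has been read into a data frame as shown above), the corrected function should agree with the totss component that kmeans reports, whatever number of centers is used:
km <- kmeans(A, centers = 3)
km$totss        # total sum of squares as computed by kmeans
tot_sumoSq(A)   # should give the same value (13244.8 for this data)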

Sorting a vector with predetermined order

I have a vector x of length 10 that I would like to sort based on the order of values in vector y (1:10). Say:
x <- c(188,43,56,3,67,89,12,33,123,345)
y <- c(3,4,5,7,6,9,8,2,1,10)
The vector y will always consist of numbers from 1 to 10, but in different orders. I'd like to match the lowest value in x with 1 and the highest value with 10 so that the output will be something like
x_new <- c(33, 43, 56, 67, 89, 123, 188, 12, 3, 345)
How can I do this? I appreciate any input!
sort(x)[y]
[1] 33 43 56 89 67 188 123 12 3 345
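An equivalent way to write this (just a variation, not from the answer above) is to index with order, since x[order(x)] is the same as sort(x) for a vector without NAs:
x[order(x)][y]
[1] 33 43 56 89 67 188 123 12 3 345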

Create categorical variable in R based on range

I have a data frame with a column of integers that I would like to use as a reference to make a new categorical variable. I want to divide the variable into three groups and set the ranges myself (i.e. 0-5, 6-10, etc.). I tried cut, but that divides the variable into groups based on a normal distribution, and my data is right-skewed. I have also tried to use if/then statements, but this outputs a true/false value, and I would like to keep my original variable. I am sure that there is a simple way to do this, but I cannot seem to figure it out. Any advice on a simple way to do this quickly?
I had something in mind like this:
 x  x.range
 3  0-5
 4  0-5
 6  6-10
12  11-15
x <- rnorm(100,10,10)
cut(x,c(-Inf,0,5,6,10,Inf))
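If you want the labels to look exactly like the ranges in the question (0-5, 6-10, 11-15), base cut also takes a labels argument; a small sketch, assuming integer data so that the interval (5,10] corresponds to the label 6-10:
vals <- c(3, 4, 6, 12)   # the example values from the question
cut(vals, breaks = c(0, 5, 10, 15), labels = c("0-5", "6-10", "11-15"))
# [1] 0-5   0-5   6-10  11-15
# Levels: 0-5 6-10 11-15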
Ian's answer (cut) is the most common way to do this, as far as I know.
I prefer to use shingle, from the lattice package;
the argument that specifies the binning intervals seems a little more intuitive to me.
You use shingle like so:
library(lattice)  # shingle() comes from the lattice package

# mock some data
data = sample(0:40, 200, replace=T)
a = c(0, 5); b = c(5, 9); c = c(9, 19); d = c(19, 33); e = c(33, 41)
my_bins = matrix(rbind(a, b, c, d, e), ncol=2)
# returns: (the binning intervals i've set)
[,1] [,2]
[1,] 0 5
[2,] 5 9
[3,] 9 19
[4,] 19 33
[5,] 33 41
shx = shingle(data, intervals=my_bins)
#'shx' at the interactive prompt will give you a nice frequency table:
# Intervals:
min max count
1 0 5 23
2 5 9 17
3 9 19 56
4 19 33 76
5 33 41 46
We can use smart_cut from package cutr:
devtools::install_github("moodymudskipper/cutr")
library(cutr)
x <- c(3,4,6,12)
To cut with intervals of length 5 starting at 1:
smart_cut(x,list(5,1),"width" , simplify=FALSE)
# [1] [1,6) [1,6) [6,11) [11,16]
# Levels: [1,6) < [6,11) < [11,16]
To get exactly your requested output:
smart_cut(x,c(0,6,11,16), labels = ~paste0(.y[1],'-',.y[2]-1), simplify=FALSE, open_end = TRUE)
# [1] 0-5 0-5 6-10 11-15
# Levels: 0-5 < 6-10 < 11-15
more on cutr and smart_cut

Resources