Sorting a vector with predetermined order - r

I have a vector x of length 10 that I would like to sort based on the order of values in vector y (1:10). Say:
x <- c(188,43,56,3,67,89,12,33,123,345)
y <- c(3,4,5,7,6,9,8,2,1,10)
The vector y will always consist of numbers from 1 to 10, but in different orders. I'd like to match the lowest value in x with 1 and the highest value with 10 so that the output will be something like
x_new <- c(33,43,56,67,89,123,188,12,3,345)
How can I do this? I appreciate any input!

sort(x)[y]
[1] 33 43 56 89 67 188 123 12 3 345
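To see why this works: sort(x) puts the k-th smallest value at position k, so indexing it with y places the value of rank y[i] at position i. A minimal sketch that checks this, assuming x has no ties and y is a permutation of 1..length(x):

```r
x <- c(188, 43, 56, 3, 67, 89, 12, 33, 123, 345)
y <- c(3, 4, 5, 7, 6, 9, 8, 2, 1, 10)

# Position k of x_sorted holds the k-th smallest value of x
x_sorted <- sort(x)

# Pick the y[i]-th smallest value for each position i
x_new <- x_sorted[y]

# The rank pattern of the result should now equal y
ranks_match <- all(rank(x_new) == y)
print(x_new)
print(ranks_match)
```

If x could contain ties, the rank check would need a tie-breaking rule, so treat this as an illustration rather than a general test.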

Related

Percentile rank of column values - R

I am looking for a percentage rank for each value in a column.
It is quite easy in Excel, for example:
=RANK.EQ(A1,$A$1:$A$100,1)/COUNT($A$1:$A$100)
Returns a percent value in a new column that ranks the column I referred to above.
I have no problem finding quantiles in R, but I have not been able to find anything that gives the percentile rank of every single value in a column.
Try this using the data in your picture:
> Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each=2), 3507400)
> Cost.Per.Kilo
[1] 6045170 6045170 5412330 5412330 3719760 3719760 3589220 3589220 3507400
> CPK.rank <- rank(Cost.Per.Kilo, ties.method="min")
> CPK.rank
[1] 8 8 6 6 4 4 2 2 1
> round(CPK.rank/length(CPK.rank) * 100)
[1] 89 89 67 67 44 44 22 22 11
In your picture you seem to have divided the ranks by 10, but there are only 9 values. That is why these percentages do not match.
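The steps above can be wrapped in a small helper. This is only a sketch: the name percent_rank_min is made up for illustration, and dividing by the number of non-missing values (so that NA entries stay NA) is an assumption about how missing data should be treated.

```r
# Hypothetical helper: percentile rank using the "min" rule for ties,
# analogous to RANK.EQ(..., 1) / COUNT(...) in the Excel formula
percent_rank_min <- function(v) {
  r <- rank(v, ties.method = "min", na.last = "keep")  # NA stays NA
  r / sum(!is.na(v)) * 100
}

Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each = 2),
                   3507400)
pct <- percent_rank_min(Cost.Per.Kilo)
print(round(pct))
```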

How can I calculate the inter-pair correlation of a variable according to id in the whole dataframe?

I have a twin-dataset, in which there is one column called wpsum, another column is family-id, which is the same for corresponding twin pairs.
wpsum family-id
twin 1 14 220
twin 2 18 220
I want to calculate the correlation between the wpsum values of those with the same family-id. There are also some family ids that occur only once, where one twin did not take part in the re-survey. family-id is a character.
There's no correlation between wpsum of those with the same family-id, as you put it, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment), but you can get the difference in wpsum scores within the groups. Maybe that's what you meant by correlation. Here's how to get those (I changed and expanded your example):
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
family_id = c("220","220","221","221","222","222","223"))
dat
wpsum family_id
1 14 220
2 18 220
3 20 221
4 5 221
5 10 222
6 NA 222
7 1 223
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA
You can make a data.frame with this new variable of differences like so:
diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
diffs family_id
1 4 220
2 15 221
3 NA 222
4 NA 223
Note that neither missing values nor missing observations are a (coding) problem here - they just result in missing differences without error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
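One possible way to guard against families that do not have exactly two rows is to make the check explicit. The sketch below uses tapply instead of by; the helper name pair_diff is hypothetical, and returning NA for anything other than a pair is an assumption about the desired behaviour.

```r
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220", "220", "221", "221", "222", "222", "223"))

# Hypothetical helper: absolute difference for a pair, NA otherwise
pair_diff <- function(v) {
  if (length(v) == 2) abs(v[1] - v[2]) else NA_real_
}

# One value per family, named by family_id
diffs <- tapply(dat$wpsum, dat$family_id, pair_diff)

diff.frame <- data.frame(diffs = as.vector(diffs), family_id = names(diffs))
print(diff.frame)
```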

Merge with replacement based on multiple non-unique columns

I have two data frames. The first one contains the original state of an image with all the data available to reconstruct the image from scratch (the entire coordinate set and their color values).
I then have a second data frame. This one is smaller and contains only data about the differences (the changes made) between the updated state and the original state. Sort of like video encoding with key frames.
Unfortunately I don't have a unique id column to help me match them. I have an x column and I have a y column which, combined, can make up a unique id.
My question is this: What is an elegant way of merging these two data sets, replacing the values in the original data frame with the values in the "differenced" data frame whose x and y coordinates match?
Here's some example data to illustrate:
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
x y value
1 1 23 120
2 2 24 121
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 127
9 9 31 128
10 10 32 129
And the dataframe with updated differences:
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)
x y value
1 1 2 50
2 2 24 51
3 3 17 52
4 4 23 53
5 8 30 54
The desired final output should contain all the rows in the original data frame. However, rows in original whose x and y coordinates both match the corresponding coordinates in update should have their value replaced with the value from the update data frame. Here's the desired output:
original_updated <- data.frame(x = 1:10, y = 23:32,
value = c(120, 51, 122:126, 54, 128:129))
x y value
1 1 23 120
2 2 24 51
3 3 25 122
4 4 26 123
5 5 27 124
6 6 28 125
7 7 29 126
8 8 30 54
9 9 31 128
10 10 32 129
I've tried to come up with a vectorised solution with indexing for some time, but I can't figure it out. Usually I'd use %in% if it were just one column with unique ids. But the two columns are non unique.
One solution would be to treat them as strings or tuples and combine them to one column as a coordinate pair, and then use %in%.
But I was curious whether there were any solution to this problem involving indexing with boolean vectors. Any suggestions?
First merge in a way which guarantees all values from the original will be present:
merged = merge(original, update, by = c("x","y"), all.x = TRUE)
Then use dplyr to choose update's values where possible, and original's value otherwise:
library(dplyr)
middle = mutate(merged, value = ifelse(is.na(value.y), value.x, value.y))
final = select(middle, x, y, value)
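The same two steps can also be sketched in base R alone, in case dplyr is not available. The column names value.x and value.y are the default suffixes that merge adds when both inputs have a non-key column called value.

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# Left join on both coordinates: every row of original is kept
merged <- merge(original, update, by = c("x", "y"), all.x = TRUE)

# Prefer the updated value where one exists, else keep the original
merged$value <- ifelse(is.na(merged$value.y), merged$value.x, merged$value.y)

# Drop the helper columns and restore a predictable row order
final <- merged[order(merged$x, merged$y), c("x", "y", "value")]
print(final)
```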
Alternatively, the match function can be used to generate the row indices. It needs the nomatch = 0 argument to prevent NA indices on the left-hand side of the [<-.data.frame assignment. I don't think it is as transparent as a merge followed by a replace, but I'm guessing it will be faster:
original[ match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)] ,
"value"] <-
update[ which( match(update$x, original$x) == match(update$y, original$y)),
"value"]
You can see the difference:
> match(update$x, original$x)[
match(update$x, original$x) ==
match(update$y, original$y) ]
[1] NA 2 NA 8
> match(update$x, original$x)[
match(update$x, original$x, nomatch=0) ==
match(update$y, original$y,nomatch=0)]
[1] 2 8
The "interior" match functions are returning:
> match(update$y, original$y)
[1] NA 2 NA 1 8
> match(update$x, original$x)
[1] 1 2 3 4 8
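The coordinate-pair idea mentioned in the question can be sketched as follows: paste x and y into one key per row, then match on that key. This assumes each (x, y) pair occurs at most once in original, and that a space is a safe separator for the key (true for numeric coordinates).

```r
original <- data.frame(x = 1:10, y = 23:32, value = 120:129)
update <- data.frame(x = c(1:4, 8), y = c(2, 24, 17, 23, 30), value = 50:54)

# Build one composite key per row from the two coordinates
key_original <- paste(original$x, original$y)
key_update <- paste(update$x, update$y)

# Row of original that each update row refers to (NA if none)
idx <- match(key_update, key_original)
hit <- !is.na(idx)

# Overwrite only the rows that actually matched
original$value[idx[hit]] <- update$value[hit]
print(original)
```

Unlike comparing two separate match results, this does not depend on the first occurrence of an x value and the first occurrence of a y value falling in the same row.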

Replace value in a column based on a Frequency Count using R

I have a dataset with multiple columns. Many of these columns are factors with more than 32 levels, so to run a Random Forest (for example), I want to replace values in the column based on their frequency count.
One of the column reads like this:
$ country
: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...
What I would like to do is only retain the top N (where N is a value between 5 and 20) countries, and replace the remaining values with "Other".
I know how to calculate the frequency of the values using the table function, but I can't seem to find a solution for replacing values on the basis of such a rule. How can this be done?
Some example data:
set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1 2 3 4 5
# 4 26 30 13 27
Replace all the levels other than the top 3 (Levels 2/3/5) with "Other":
levels(x)[rank(table(x)) < 3] <- "Other"
table(x)
#Other 2 3 5
# 17 26 30 27
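For an arbitrary N, the same idea can be written as a small function. This is a sketch only: the name lump_top_n and its arguments are hypothetical, and ties at the cut-off are broken by whatever order sort returns.

```r
# Hypothetical helper: keep the n most frequent levels, lump the rest
lump_top_n <- function(f, n, other = "Other") {
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  # Renaming several levels to the same label merges them
  levels(f)[!(levels(f) %in% keep)] <- other
  f
}

# Deterministic example with the same counts as the table above
x <- factor(rep(c("a", "b", "c", "d", "e"), times = c(4, 26, 30, 13, 27)))
x_top3 <- lump_top_n(x, 3)
print(table(x_top3))
```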

use one variable conditioned on another

I am new to R, so I am not very adept at it. I am trying to use the values of one variable, conditioned on the corresponding value in the other variable. For example,
x 1 2 3 10 20 30
y 45 60 20 78 65 27
I need to calculate a variable, say m, where
m= 5 * (value of y, given value of x)
So, given x=3, corresponding y=20 then m = 5*(20|x=3) = 100
and, if x=30, corresponding y=27, then m = 5*(27|x=30) = 135
Could you please tell me how to define m in this case?
Thanks
Try this
5*y[x == 3]
## [1] 100
And
5*y[x == 30]
## [1] 135
Edit: based on your new explanation, it looks like you are looking for match, i.e.,
m <- c(0, 1, 15, 20, 3)
y[match(m, x)]*5
## [1] NA 225 NA 325 100
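The lookup can be wrapped in a function so that m is defined for any requested x value. A minimal sketch, assuming the values in x are unique; the name m_given_x is made up for illustration.

```r
x <- c(1, 2, 3, 10, 20, 30)
y <- c(45, 60, 20, 78, 65, 27)

# Hypothetical helper: 5 * (value of y at the position where x == x_value)
# match() returns NA for values not present in x, so those give NA
m_given_x <- function(x_value, x, y) {
  5 * y[match(x_value, x)]
}

res_single <- m_given_x(3, x, y)
res_many <- m_given_x(c(0, 1, 15, 20, 3), x, y)
print(res_single)
print(res_many)
```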
