Convert "yymm" numeric to months since 9101 (Jan 1991) - r

I have a data set that runs from 1991-01 to 1996-12 in R. In order to calculate some time-dependent metrics on part of the data set, I am trying to build a "months since first entry" column. To do so I need to convert a number like 9107 into 7, 9207 into 12+7=19, and 9301 into 12+12+1=25. I.e., the first two digits specify the year (12 months each), and the last two digits specify the month counted from January (01). How would I go about this?
Thank you!

Maybe we can use substr (split across multiple lines for clarity, with as.integer to convert the pieces to numbers):
yrs <- as.integer(substr(v1, 1, 2))
mths <- as.integer(substr(v1, 3, 4))
12 * (yrs - yrs[1]) + mths
#[1] 7 19 25
Or without substr
yrs <- v1 %/% 100
12 * (yrs - yrs[1]) + (v1 %% 100)
data
v1 <- c(9107, 9207, 9301)
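Both versions above rely on yrs[1] belonging to the first year of the series. A minimal sketch that takes the baseline year explicitly instead could look like the following; months_since and its base_year argument are hypothetical names introduced here, not part of the answer.

```r
# Hypothetical helper: month count relative to January of base_year,
# with that January counted as month 1 (so 9107 -> 7 when base_year is 91).
months_since <- function(yymm, base_year = 91) {
  yrs  <- yymm %/% 100   # leading two digits: year
  mths <- yymm %% 100    # trailing two digits: month (1-12)
  12 * (yrs - base_year) + mths
}

v1 <- c(9107, 9207, 9301)
months_since(v1)
# [1]  7 19 25
```

Subtracting 1 from the result gives the zero-based variant, where 9101 maps to 0.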

Using the substrings.
(as.numeric(substr(x, 2, 2)) - 1)*12 + as.numeric(substr(x, 3, 4)) - 1
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
# [24] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
# [47] 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
# [70] 69 70 71
I believe 9101 should map to 0, since it is the distance from itself.
Data:
x <- c(9101, 9102, 9103, 9104, 9105, 9106, 9107, 9108, 9109, 9110,
9111, 9112, 9201, 9202, 9203, 9204, 9205, 9206, 9207, 9208, 9209,
9210, 9211, 9212, 9301, 9302, 9303, 9304, 9305, 9306, 9307, 9308,
9309, 9310, 9311, 9312, 9401, 9402, 9403, 9404, 9405, 9406, 9407,
9408, 9409, 9410, 9411, 9412, 9501, 9502, 9503, 9504, 9505, 9506,
9507, 9508, 9509, 9510, 9511, 9512, 9601, 9602, 9603, 9604, 9605,
9606, 9607, 9608, 9609, 9610, 9611, 9612)
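The substring formula above reads only the second digit of the year, which is fine for 1991 to 1996. As a hedged alternative, the codes can be turned into real dates first; the sketch below assumes every code is a 20th-century "yymm" value (so "19" is prefixed), and x_demo is just a small illustrative subset of the data.

```r
# Assumption: all codes are 20th-century "yymm", so we prefix "19".
x_demo <- c(9101, 9107, 9201, 9612)
d  <- as.Date(sprintf("19%04d01", as.integer(x_demo)), format = "%Y%m%d")
lt <- as.POSIXlt(d)
# lt$year counts years since 1900 and lt$mon runs from 0 to 11
months_from_start <- (lt$year - 91) * 12 + lt$mon
months_from_start
# [1]  0  6 12 71
```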

Related

How to use characters in variables summing in R?

I have a data frame. Here is a small example:
a <- rnorm(100, 5, 2)
b <- rnorm(100, 10, 3)
c <- rnorm(100, 15, 4)
df <- data.frame(a, b, c)
And I have a character variable vect <- "c('a','b')"
When I try to calculate the sum of the variables using the command
df$d <- df[vect]
which should be equivalent to
df$d <- df[c('a','b')]
I get an error instead:
Error in `[.data.frame`(df, vect) : undefined columns selected
Your assumption that
vect <- "c('a','b')"
df$d <- df[vect]
is equivalent to
df$d <- df[c('a','b')]
is incorrect.
As @Karthik points out, you should remove the outer quotation marks in the assignment to vect, so that it is a character vector of column names rather than a single string.
However, from your question it sounds like you then want to sum the columns specified in vect and assign the result to d. To do this you need to change your code slightly:
vect <- c('a','b')
df$d <- apply(X = df[vect], MARGIN = 1, FUN = sum)
This computes an elementwise sum of the columns in df specified by vect. MARGIN = 1 specifies that we want to apply sum row-wise rather than column-wise.
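For plain row sums, base R's rowSums gives the same result as the apply call and is usually the more direct choice. A small sketch, using a hypothetical five-row data frame (df_demo and vect_demo are illustration names only):

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df_demo <- data.frame(a = rnorm(5, 5, 2),
                      b = rnorm(5, 10, 3),
                      c = rnorm(5, 15, 4))
vect_demo <- c("a", "b")

# rowSums() adds the selected columns row by row, like apply(..., 1, sum)
df_demo$d <- rowSums(df_demo[vect_demo])

# Same values as adding the two columns directly
all.equal(df_demo$d, df_demo$a + df_demo$b)
```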
EDIT:
As @ThomasIsCoding points out below, if for some reason vect has to be a string, you can parse the string into an R expression using str2lang and then evaluate it:
vect <- "c('a','b')"
parsed_vect <- eval(str2lang(vect))
df$d <- apply(X = df[parsed_vect], MARGIN = 1, FUN = sum)
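Evaluating a string runs whatever code it contains, which may be undesirable if the string comes from outside. As a hedged alternative, the column names can be pulled out of the string with a regular expression, assuming the names are always wrapped in single quotes as in the example (vect_str, quoted and cols are illustration names):

```r
vect_str <- "c('a','b')"

# Grab each single-quoted token, then drop the quotes; nothing is evaluated
quoted <- regmatches(vect_str, gregexpr("'[^']*'", vect_str))[[1]]
cols   <- gsub("'", "", quoted, fixed = TRUE)
cols
# [1] "a" "b"
```

The resulting cols vector can then be used in df[cols] exactly like the parsed version.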
Perhaps you can try
> df[eval(str2lang(vect))]
a b
1 8.1588519 9.0617818
2 3.9361214 13.2752377
3 5.5370983 8.8739725
4 8.4542050 8.5704234
5 3.9044461 13.2642793
6 5.6679639 12.9529061
7 4.0183808 6.4746806
8 3.6415608 11.0308990
9 4.5237453 7.3255129
10 6.9379168 9.4594150
11 5.1557935 11.6776181
12 2.3829337 3.5170335
13 4.3556430 7.9706624
14 7.3274615 8.1852829
15 -0.5650641 2.8109197
16 7.1742283 6.8161200
17 3.3412044 11.6298940
18 2.5388981 10.1289533
19 3.8845686 14.1517643
20 2.4431608 6.8374837
21 4.8731053 12.7258259
22 6.9534912 6.5069513
23 4.4394807 14.5320225
24 2.0427553 12.1786148
25 7.1563978 11.9671603
26 2.4231207 6.1801862
27 6.5830372 0.9814878
28 2.5443326 9.8774632
29 1.1260322 9.4804636
30 4.0078436 12.9909014
31 9.3599808 12.2178596
32 3.5362245 8.6758910
33 4.6462337 8.6647953
34 2.0698037 7.2750532
35 7.0727970 8.9386798
36 4.8465248 8.0565347
37 5.6084462 7.5676308
38 6.7617479 9.5357666
39 5.2138482 13.6822924
40 3.6259103 13.8659939
41 5.8586547 6.5087016
42 4.3490281 9.5367522
43 7.5130701 8.1699117
44 3.7933813 9.3241308
45 4.9466813 9.4432584
46 -0.3730035 6.4695187
47 2.0646458 10.6511916
48 4.6027309 4.9207746
49 5.9919348 7.1946723
50 6.0148330 13.4702419
51 5.5354452 9.0193366
52 5.2621651 12.8856488
53 6.8580210 6.3526151
54 8.0812166 14.4659778
55 3.6039030 5.9857886
56 9.8548553 15.9081336
57 3.3675037 14.7207681
58 3.9935336 14.3186175
59 3.4308085 10.6024579
60 3.9609624 6.6595521
61 4.2358603 10.6600581
62 5.1791856 9.3241118
63 4.6976289 13.2833055
64 5.1868906 7.1323826
65 3.1810915 12.8402472
66 6.0258287 9.3805249
67 5.3768112 6.3805096
68 5.7072092 7.1130150
69 6.5789349 8.0092541
70 5.3175820 17.3377234
71 9.7706112 10.8648956
72 5.2332127 12.3418373
73 4.7626124 13.8816910
74 3.9395911 6.5270785
75 6.4394724 10.6344965
76 2.6803695 10.4501753
77 3.5577834 8.2323369
78 5.8431140 7.7932460
79 2.8596818 8.9581837
80 2.7365174 10.2902512
81 4.7560973 6.4555758
82 4.6519084 8.9786777
83 4.9467471 11.2818536
84 5.6167284 5.2641380
85 9.4700525 2.9904731
86 4.7392906 11.3572521
87 3.1221908 6.3881556
88 5.6949432 7.4518023
89 5.1435241 10.8912283
90 2.1628966 10.5080671
91 3.6380837 15.0594135
92 5.3434709 7.4034042
93 -0.1298439 0.4832707
94 7.8759390 2.7411723
95 2.0898649 9.7687250
96 4.2131549 9.3175228
97 5.0648105 11.3943350
98 7.7225193 11.4180456
99 3.1018895 12.8890257
100 4.4166832 10.4901303

Conversion of strings to numbers

Hello, I'm looking for a way to turn user-entered strings into matrices of numbers, using this mapping: 28 = SPACE, 27 = ?, 26 = 0, 25 = A, 24 = B, 23 = C, 22 = D, 21 = E, 20 = F, 19 = G, 18 = H, 17 = I, 16 = J, 15 = K, 14 = L, 13 = M, 12 = N, 11 = O, 10 = P, 9 = Q, 8 = R, 7 = S, 6 = T, 5 = U, 4 = V, 3 = W, 2 = X, 1 = Y, 0 = Z. For example:
"HI HOW ARE YOU?" -> "[18 17 28 18 11][3 28 25 8 21][28 1 11 5 27]"
Each letter or symbol of the string is converted to a numerical value (the space character is the part I really don't know how to handle). I'll be using these matrices to make a cryptograph.
You could use utf8ToInt
x <- "HI HOW ARE YOU?"
We need pmin to get your condition 28 = SPACE right.
pmin(abs(utf8ToInt("HI HOW ARE YOU?") - utf8ToInt("Z")), 28)
# [1] 18 17 28 18 11 3 28 25 8 21 28 1 11 5 27
From ?utf8ToInt :
Conversion of UTF-8 encoded character vectors to and from integer vectors representing a UTF-32 encoding.
First step is
utf8ToInt("HI HOW ARE YOU?")
[1] 72 73 32 72 79 87 32 65 82 69 32 89 79 85 63
from which we subtract utf8ToInt("Z"), i.e. 90, because you wrote 0=Z.
Calling abs on the result gives positive numbers.
abs(utf8ToInt("HI HOW ARE YOU?") - utf8ToInt("Z"))
# [1] 18 17 58 18 11 3 58 25 8 21 58 1 11 5 27
The last piece is your condition 28 = SPACE, which is where pmin helps you out.
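The pmin trick maps every character whose distance from "Z" exceeds 28 to 28, so the character "0" (which the question assigns to 26) would also become 28. A hedged sketch of an explicit lookup table that follows the question's mapping symbol by symbol, and then arranges the codes in rows of five as in the example, might look like this (symbols, encode and codes are hypothetical names):

```r
# Symbols ordered so that (position - 1) is the code:
# Z = 0, Y = 1, ..., A = 25, "0" = 26, "?" = 27, " " = 28
symbols <- c(rev(LETTERS), "0", "?", " ")

# Split the string into characters and look each one up in the table;
# characters outside the table come back as NA
encode <- function(s) match(strsplit(s, "")[[1]], symbols) - 1L

codes <- encode("HI HOW ARE YOU?")
codes
# [1] 18 17 28 18 11  3 28 25  8 21 28  1 11  5 27

# Rows of five, as in the question's example
matrix(codes, ncol = 5, byrow = TRUE)
```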

Function with a for loop to create a column with values 1:n conditioned by intervals matched by another column

I have a data frame like the following
my_df=data.frame(x=runif(100, min = 0,max = 60),
y=runif(100, min = 0,max = 60)) #x and y in cm
With this I need a new column with values from 1 to 36 that match x and y every 10 cm. For example, if 0<=x<=10 & 0<=y<=10, put 1, then if 10<=x<=20 & 0<=y<=10, put 2 and so on up to 6, then 0<=x<=10 & 10<=y<=20 starting with 7 up to 12, etc. I tried to make a function with an if repeating the interval for x 6 times, and increasing by 10 the interval for y every iteration. Here is the function
#my miscarried function 'zones'
>zones= function(x,y) {
i=vector(length = 6)
n=vector(length = 6)
z=vector(length = 36)
i[1]=0
z[1]=0
n[1]=1
for (t in 1:6) {
if (0<=x & x<10 & i[t]<=y & y<i[t]+10) { z[t] = n[t]} else
if (10<=x & x<20 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+1} else
if (20<=x & x<30 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+2} else
if (30<=x & x<40 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+3} else
if (40<=x & x<50 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+4}else
if (50<=x & x<=60 & i[t]<=y & y<i[t]+10) {z[t]=n[t]+5}
else {i[t+1]=i[t]+10
n[t+1]=n[t]+6}
}
return(z)
}
>xy$z=zones(x=xy$x,y=xy$y)
and I got
There were 31 warnings (use warnings() to see them)
>xy$z
[1] 0 0 0 0 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Please, help me before I die alone!
I think this does the trick.
a <- cut(my_df$x, (0:6) * 10)
b <- cut(my_df$y, (0:6) * 10)
z <- interaction(a, b)
levels(z)
[1] "(0,10].(0,10]" "(10,20].(0,10]" "(20,30].(0,10]" "(30,40].(0,10]"
[5] "(40,50].(0,10]" "(50,60].(0,10]" "(0,10].(10,20]" "(10,20].(10,20]"
[9] "(20,30].(10,20]" "(30,40].(10,20]" "(40,50].(10,20]" "(50,60].(10,20]"
[13] "(0,10].(20,30]" "(10,20].(20,30]" "(20,30].(20,30]" "(30,40].(20,30]"
[17] "(40,50].(20,30]" "(50,60].(20,30]" "(0,10].(30,40]" "(10,20].(30,40]"
[21] "(20,30].(30,40]" "(30,40].(30,40]" "(40,50].(30,40]" "(50,60].(30,40]"
[25] "(0,10].(40,50]" "(10,20].(40,50]" "(20,30].(40,50]" "(30,40].(40,50]"
[29] "(40,50].(40,50]" "(50,60].(40,50]" "(0,10].(50,60]" "(10,20].(50,60]"
[33] "(20,30].(50,60]" "(30,40].(50,60]" "(40,50].(50,60]" "(50,60].(50,60]"
If these level labels aren't to your taste, relabel them as below:
levels(z) <- 1:36
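Note that cut uses intervals that are open on the left by default, so a value of exactly 0 would fall outside the first bin. As an arithmetic alternative that uses the question's left-closed intervals, the zone number can be computed directly with integer division. This is only a sketch; zone_id, width and n are hypothetical names.

```r
# Hypothetical helper: zone number for 10 cm cells on a 6 x 6 grid,
# numbered left to right along x, then upward along y.
zone_id <- function(x, y, width = 10, n = 6) {
  cx <- pmin(x %/% width, n - 1)  # 0-based column; x == 60 stays in the last cell
  cy <- pmin(y %/% width, n - 1)  # 0-based row
  cy * n + cx + 1
}

zone_id(c(5, 15, 5, 60), c(5, 5, 15, 60))
# [1]  1  2  7 36
```

Applied to the data frame, this would be my_df$z <- zone_id(my_df$x, my_df$y).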
Is this what you're after? The resulting numbers are in column res:
# Get bin index for x values and y values
my_df$bin1 <- as.numeric(cut(my_df$x, breaks = seq(0, max(my_df$x) + 10, by = 10)))
my_df$bin2 <- as.numeric(cut(my_df$y, breaks = seq(0, max(my_df$y) + 10, by = 10)))
# Combine the bin indices into a zone number from 1 to 36 (6 zones per y band)
my_df$res <- (my_df$bin2 - 1) * 6 + my_df$bin1
> head(my_df)
x y bin1 bin2 res
1 49.887499 47.302849 5 5 29
2 43.169773 50.931357 5 6 35
3 10.626466 43.673533 2 5 26
4 43.401454 3.397009 5 1 5
5 7.080386 22.870539 1 3 13
6 39.094724 24.672907 4 3 16
I've broken down the steps for illustration purposes; you probably don't want to keep the intermediate columns bin1 and bin2.
We probably need a lookup table showing the relationship between x, y, and z. After that, we can define a function to do the join.
This solution is related to and inspired by this post (R dplyr join by range or virtual column). You may also find the other solutions there useful.
# Set seed for reproducibility
set.seed(1)
# Create example data frame
my_df <- data.frame(x=runif(100, min = 0,max = 60),
y=runif(100, min = 0,max = 60))
# Load the dplyr package
library(dplyr)
# Create a table to show the relationship between x, y, and z
r <- expand.grid(x_from = seq(0, 50, 10), y_from = seq(0, 50, 10)) %>%
  mutate(x_to = x_from + 10, y_to = y_from + 10, z = 1:n())
# Define a function for dynamic join
dynamic_join <- function(d, r){
  if (!("z" %in% colnames(d))){
    d[["z"]] <- NA_integer_
  }
  d <- d %>%
    mutate(z = ifelse(x >= r$x_from & x < r$x_to & y >= r$y_from & y < r$y_to,
                      r$z, z))
  return(d)
}
re_dynamic_join <- function(d, r){
  r_list <- split(r, r$z)
  for (i in 1:length(r_list)){
    d <- dynamic_join(d, r_list[[i]])
  }
  return(d)
}
# Apply the function
re_dynamic_join(my_df, r)
x y z
1 15.930520 39.2834357 20
2 22.327434 21.1918363 15
3 34.371202 16.2156088 10
4 54.492467 59.5610437 36
5 12.100916 38.0095959 20
6 53.903381 12.7924881 12
7 56.680516 7.7623409 6
8 39.647868 28.6870821 16
9 37.746843 55.4444682 34
10 3.707176 35.9256580 19
11 12.358474 58.5702417 32
12 10.593405 43.9075507 26
13 41.221371 21.4036147 17
14 23.046223 25.8884214 15
15 46.190485 8.8926936 5
16 29.861955 0.7846545 3
17 43.057110 42.9339640 29
18 59.514366 6.1910541 6
19 22.802111 26.7770609 15
20 46.646713 38.4060627 23
21 56.082314 59.5103172 36
22 12.728551 29.7356147 14
23 39.100426 29.0609715 16
24 7.533306 10.4065401 7
25 16.033240 45.2892567 26
26 23.166846 27.2337294 15
27 0.803420 30.6701870 19
28 22.943277 12.4527068 9
29 52.181451 13.7194886 12
30 20.420940 35.7427198 21
31 28.924807 34.4923319 21
32 35.973950 4.6238628 4
33 29.612478 2.1324348 3
34 11.173056 38.5677295 20
35 49.642399 55.7169120 35
36 40.108004 35.8855453 23
37 47.654392 33.6540449 23
38 6.476618 31.5616634 19
39 43.422657 59.1057134 35
40 24.676466 30.4585093 21
41 49.256778 40.9672847 29
42 38.823612 36.0924731 22
43 46.975966 14.3321207 11
44 33.182179 15.4899556 10
45 31.783175 43.7585774 28
46 47.361374 27.1542499 17
47 1.399872 10.5076061 7
48 28.633804 44.8018962 27
49 43.938824 6.2992584 5
50 41.563893 51.8726969 35
51 28.657177 36.8786983 21
52 51.672569 33.4295723 24
53 26.285826 19.7266391 9
54 14.687837 27.1878867 14
55 4.240743 30.0264584 19
56 5.967970 10.8519817 7
57 18.976302 31.7778362 20
58 31.118056 4.5165447 4
59 39.720305 16.6653560 10
60 24.409811 12.7619712 9
61 54.772555 17.0874289 12
62 17.616202 53.7056462 32
63 27.543944 26.7741194 15
64 19.943680 46.7990934 26
65 39.052228 52.8371421 34
66 15.481007 24.7874526 14
67 28.712715 3.8285088 3
68 45.978640 20.1292495 17
69 5.054815 43.4235568 25
70 52.519280 20.2569200 18
71 20.344376 37.8248473 21
72 50.366421 50.4368732 36
73 20.801009 51.3678999 33
74 20.026496 23.4815569 15
75 28.581075 22.8296331 15
76 53.531900 53.7267256 36
77 51.860368 38.6589458 24
78 23.399373 44.4647189 27
79 46.639242 36.3182068 23
80 57.637080 54.1848967 36
81 26.079569 17.6238093 9
82 42.750881 11.4756066 11
83 23.999662 53.1870566 33
84 19.521129 30.2003691 20
85 45.425229 52.6234526 35
86 12.161535 11.3516173 8
87 42.667273 45.4861831 29
88 7.301515 43.4699336 25
89 14.729311 56.6234891 32
90 8.598263 32.8587952 19
91 14.377765 42.7046321 26
92 3.536063 23.3343060 13
93 38.537296 6.0523876 4
94 52.576153 55.6381253 36
95 46.734881 16.9939500 11
96 47.838530 35.4343895 23
97 27.316467 6.6216363 3
98 24.605045 50.4304219 33
99 48.652215 19.0778211 11
100 36.295997 46.9710802 28

Use vector to make probability table

In the form of a probability table, I'd like to illustrate a vector of quantiles divisible by 7 and 5, for marginal probability distributions, and 5 given 7, for conditional probability.
Let's assume this is my data:
> prop.table(table(x)) # discrete number and its probability
20 22 23 24 25 26 27 28 29 30 31
0.000152 0.000625 0.000796 0.001224 0.003138 0.003043 0.004549 0.006444 0.005938 0.009301 0.009456
32 33 34 35 36 37 38 39 40 41 42
0.013448 0.019839 0.018596 0.026613 0.028902 0.027377 0.035156 0.041379 0.041092 0.047733 0.055827
43 44 45 46 47 48 49 50 51 52 53
0.046099 0.051624 0.055131 0.049779 0.056992 0.049801 0.052912 0.031924 0.049114 0.022880 0.042279
54 55 56 57 58 59 61 63 65
0.013946 0.032340 0.003466 0.021240 0.001227 0.011734 0.005115 0.001491 0.000278
How can I turn this into a two-way probability table that shows which numbers are divisible by 7 and/or 5 for marginal and conditional probability?
This is what I'd hope the table to look like
       Yes       No        # columns: number divisible by 7
Yes    0.02754   0.02886
No     0.02656   0.02831
# rows: number divisible by 5
x <- sample(1:100, 100, replace = TRUE)
# %% is the mod operator, which gives the remainder after the division of the left-hand side by the right-hand side. x %% y == 0 therefore returns TRUE if x is divisible by y
db5 <- x %% 5 == 0
db7 <- x %% 7 == 0
table(db5, db7) / length(x)
# db7
# db5 FALSE TRUE
# FALSE 0.62 0.13
# TRUE 0.24 0.01
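The table above counts a random sample. When the probabilities themselves are already available as a named vector, as in the output of prop.table(table(x)), the mass can be summed per cell instead. The sketch below uses a toy distribution (values 1 to 70, equally likely), which is an assumption made purely for illustration; p, vals, joint and p_5_given_7 are hypothetical names.

```r
# Toy distribution (assumption): values 1..70, each with probability 1/70.
# The names hold the values, as in the output of prop.table(table(x)).
p    <- setNames(rep(1 / 70, 70), 1:70)
vals <- as.integer(names(p))

db5 <- vals %% 5 == 0
db7 <- vals %% 7 == 0

# Joint probabilities: sum the probability mass in each (db5, db7) cell
joint <- xtabs(p ~ db5 + db7)

# Marginal probabilities appear in the added "Sum" row and column
addmargins(joint)

# Conditional probability: divisible by 5 given divisible by 7
p_5_given_7 <- joint["TRUE", "TRUE"] / sum(joint[, "TRUE"])
```

In this toy case 10 of the 70 values are divisible by 7 and 2 of those are also divisible by 5, so the conditional probability is 2/10.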

Given data points and y value, give x value

Given a set of (x,y) coordinates, how can I solve for x given y? If you were to plot the coordinates, they would be non-linear, but pretty close to exponential. I tried approx(), but it is way off. Here is example data. In this scenario, how could I solve for y == 50?
V1 V3
1 5.35 11.7906
2 10.70 15.0451
3 16.05 19.4243
4 21.40 20.7885
5 26.75 22.0584
6 32.10 25.4367
7 37.45 28.6701
8 42.80 30.7500
9 48.15 34.5084
10 53.50 37.0096
11 58.85 39.3423
12 64.20 41.5023
13 69.55 43.4599
14 74.90 44.7299
15 80.25 46.5738
16 85.60 47.7548
17 90.95 49.9749
18 96.30 51.0331
19 101.65 52.0207
20 107.00 52.9781
21 112.35 53.8730
22 117.70 54.2907
23 123.05 56.3025
24 128.40 56.6949
25 133.75 57.0830
26 139.10 58.5051
27 144.45 59.1440
28 149.80 60.0687
29 155.15 60.6627
30 160.50 61.2313
31 165.85 61.7748
32 171.20 62.5587
33 176.55 63.2684
34 181.90 63.7085
35 187.25 64.0788
36 192.60 64.5807
37 197.95 65.2233
38 203.30 65.5331
39 208.65 66.1200
40 214.00 66.6208
41 219.35 67.1952
42 224.70 67.5270
43 230.05 68.0175
44 235.40 68.3869
45 240.75 68.7485
46 246.10 69.1878
47 251.45 69.3980
48 256.80 69.5899
49 262.15 69.7382
50 267.50 69.7693
51 272.85 69.7693
52 278.20 69.7693
53 283.55 69.7693
54 288.90 69.7693
I suppose the problem you have is that approx solves for y given x, while you are talking about solving for x given y. So you need to switch your variables x and y when using approx:
df <- read.table(textConnection("
V1 V3
85.60 47.7548
90.95 49.9749
96.30 51.0331
101.65 52.0207
"), header = TRUE)
approx(x = df$V3, y = df$V1, xout = 50)
# $x
# [1] 50
#
# $y
# [1] 91.0769
Also, if y is exponential with respect to x, then you have a linear relationship between x and log(y), so it makes more sense to use a linear interpolator between x and log(y), then take the exponential to get back to y:
exp(approx(x = df$V3, y = log(df$V1), xout = 50)$y)
# [1] 91.07339
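Swapping the variables in approx works as long as the y values are strictly increasing; the full data set has repeated values at the end (69.7693), where the inverse is no longer unique. Another way to express the same idea, sketched here under the assumption that the target is bracketed by the data, is to build the forward interpolant with approxfun and solve f(x) = 50 numerically with uniroot (V1_demo, V3_demo, f and root are illustration names):

```r
# Same four rows as in the answer above
V1_demo <- c(85.60, 90.95, 96.30, 101.65)
V3_demo <- c(47.7548, 49.9749, 51.0331, 52.0207)

# Forward interpolant y = f(x), piecewise linear
f <- approxfun(V1_demo, V3_demo)

# Solve f(x) - 50 = 0 on the range of x; the sign changes across the interval
root <- uniroot(function(x) f(x) - 50, interval = range(V1_demo), tol = 1e-9)$root
root
```

With linear interpolation this agrees with the approx result above, about 91.0769.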

Resources