I have a data.frame like so:
df <- data.frame(x = c(998,994,992,990,989,988), y = seq(0.5, 3, by = 0.5))
df
x y
1 998 0.5
2 994 1.0
3 992 1.5
4 990 2.0
5 989 2.5
6 988 3.0
I would like to expand it so the values in x are exactly 1 apart so the final data.frame looks like this:
x y
1 998 0.5
2 997 0.5
3 996 0.5
4 995 0.5
5 994 1.0
6 993 1.0
7 992 1.5
8 991 1.5
9 990 2.0
10 989 2.5
11 988 3.0
You can also use approx:
data.frame(approx(df, xout=max(df$x):min(df$x), method="constant", f=1))
x y
1 998 0.5
2 997 0.5
3 996 0.5
4 995 0.5
5 994 1.0
6 993 1.0
7 992 1.5
8 991 1.5
9 990 2.0
10 989 2.5
11 988 3.0
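The f=1 argument is what makes each new x value take the y of the knot to its right (a right-continuous step function); with the default f=0 it would take the value to the left. A minimal illustration (toy knots, not from the question's data):

```r
# With knots at x = 1 (y = 10) and x = 3 (y = 30), xout = 2 takes
# the right-hand value when f = 1 and the left-hand value when f = 0.
approx(x = c(1, 3), y = c(10, 30), xout = 2, method = "constant", f = 1)$y
# [1] 30
approx(x = c(1, 3), y = c(10, 30), xout = 2, method = "constant", f = 0)$y
# [1] 10
```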
You can try the na.locf function from the zoo package:
library(zoo)
all_values <- max(df$x):min(df$x)
na.locf(merge(df, x=all_values, all=TRUE)[rev(seq(all_values)),])
# x y
# 11 998 0.5
# 10 997 0.5
# 9 996 0.5
# 8 995 0.5
# 7 994 1.0
# 6 993 1.0
# 5 992 1.5
# 4 991 1.5
# 3 990 2.0
# 2 989 2.5
# 1 988 3.0
NB: As suggested by @Ananta and @ProcrastinatusMaximus, another option is to set fromLast=TRUE in the na.locf call (if you need to have x in descending order, you'll need to sort the data.frame afterwards):
na.locf(merge(df, x=all_values, all=TRUE), fromLast=TRUE)
This question already has answers here:
Categorize numeric variable into group/ bins/ breaks
(4 answers)
Closed 1 year ago.
I am attempting to add a new column to the state sample data frame in R. I am hoping for this column to cluster the ID of states into broader categories (1-4). My code is close to what I am looking for, but I am not getting it quite right. I know I could enter each state ID line by line, but is there a quicker way? Thank you!
library(tidyverse)
#Add column to denote each state
States=state.x77
States=data.frame(States)
States <- tibble::rowid_to_column(States, "ID")
States
#Create new variable for state buckets
States <- States %>%
mutate(WAGE_BUCKET=case_when(ID <= c(1,12) ~ '1',
ID <= c(13,24) ~ '2',
ID <= c(25,37) ~ '3',
ID <= c(38,50) ~ '4',
TRUE ~ 'NA'))
View(States) #It is not grouping the states in the way I want/I am still getting some NA values but unsure why!
You can use cut or findInterval if all of your groups will be using contiguous ID values:
findInterval(States$ID, c(0, 12, 24, 37, 51))
# [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
If you want to make it a bit more verbose, you can use dplyr::between in your case_when:
States %>%
mutate(
WAGE_BUCKET = case_when(
between(ID, 1, 12) ~ "1",
between(ID, 13, 24) ~ "2",
between(ID, 25, 37) ~ "3",
between(ID, 38, 50) ~ "4",
TRUE ~ NA_character_)
)
# ID Population Income Illiteracy Life Exp Murder HS Grad Frost Area WAGE_BUCKET
# 1 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 1
# 2 2 365 6315 1.5 69.31 11.3 66.7 152 566432 1
# 3 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 1
# 4 4 2110 3378 1.9 70.66 10.1 39.9 65 51945 1
# 5 5 21198 5114 1.1 71.71 10.3 62.6 20 156361 1
# 6 6 2541 4884 0.7 72.06 6.8 63.9 166 103766 1
# 7 7 3100 5348 1.1 72.48 3.1 56.0 139 4862 1
# 8 8 579 4809 0.9 70.06 6.2 54.6 103 1982 1
# 9 9 8277 4815 1.3 70.66 10.7 52.6 11 54090 1
# 10 10 4931 4091 2.0 68.54 13.9 40.6 60 58073 1
# 11 11 868 4963 1.9 73.60 6.2 61.9 0 6425 1
# 12 12 813 4119 0.6 71.87 5.3 59.5 126 82677 1
# 13 13 11197 5107 0.9 70.14 10.3 52.6 127 55748 2
# 14 14 5313 4458 0.7 70.88 7.1 52.9 122 36097 2
# 15 15 2861 4628 0.5 72.56 2.3 59.0 140 55941 2
# 16 16 2280 4669 0.6 72.58 4.5 59.9 114 81787 2
# 17 17 3387 3712 1.6 70.10 10.6 38.5 95 39650 2
# 18 18 3806 3545 2.8 68.76 13.2 42.2 12 44930 2
# 19 19 1058 3694 0.7 70.39 2.7 54.7 161 30920 2
# 20 20 4122 5299 0.9 70.22 8.5 52.3 101 9891 2
# 21 21 5814 4755 1.1 71.83 3.3 58.5 103 7826 2
# 22 22 9111 4751 0.9 70.63 11.1 52.8 125 56817 2
# 23 23 3921 4675 0.6 72.96 2.3 57.6 160 79289 2
# 24 24 2341 3098 2.4 68.09 12.5 41.0 50 47296 2
# 25 25 4767 4254 0.8 70.69 9.3 48.8 108 68995 3
# 26 26 746 4347 0.6 70.56 5.0 59.2 155 145587 3
# 27 27 1544 4508 0.6 72.60 2.9 59.3 139 76483 3
# 28 28 590 5149 0.5 69.03 11.5 65.2 188 109889 3
# 29 29 812 4281 0.7 71.23 3.3 57.6 174 9027 3
# 30 30 7333 5237 1.1 70.93 5.2 52.5 115 7521 3
# 31 31 1144 3601 2.2 70.32 9.7 55.2 120 121412 3
# 32 32 18076 4903 1.4 70.55 10.9 52.7 82 47831 3
# 33 33 5441 3875 1.8 69.21 11.1 38.5 80 48798 3
# 34 34 637 5087 0.8 72.78 1.4 50.3 186 69273 3
# 35 35 10735 4561 0.8 70.82 7.4 53.2 124 40975 3
# 36 36 2715 3983 1.1 71.42 6.4 51.6 82 68782 3
# 37 37 2284 4660 0.6 72.13 4.2 60.0 44 96184 3
# 38 38 11860 4449 1.0 70.43 6.1 50.2 126 44966 4
# 39 39 931 4558 1.3 71.90 2.4 46.4 127 1049 4
# 40 40 2816 3635 2.3 67.96 11.6 37.8 65 30225 4
# 41 41 681 4167 0.5 72.08 1.7 53.3 172 75955 4
# 42 42 4173 3821 1.7 70.11 11.0 41.8 70 41328 4
# 43 43 12237 4188 2.2 70.90 12.2 47.4 35 262134 4
# 44 44 1203 4022 0.6 72.90 4.5 67.3 137 82096 4
# 45 45 472 3907 0.6 71.64 5.5 57.1 168 9267 4
# 46 46 4981 4701 1.4 70.08 9.5 47.8 85 39780 4
# 47 47 3559 4864 0.6 71.72 4.3 63.5 32 66570 4
# 48 48 1799 3617 1.4 69.48 6.7 41.6 100 24070 4
# 49 49 4589 4468 0.7 72.48 3.0 54.5 149 54464 4
# 50 50 376 4566 0.6 70.29 6.9 62.9 173 97203 4
c(1,12) is a vector of length > 1, and the comparison operators work element-wise with recycling rather than treating the pair as a range. We could use between:
library(dplyr)
States <- States %>%
mutate(WAGE_BUCKET=case_when(between(ID, 1, 12) ~ '1',
between(ID, 13,24) ~ '2',
between(ID, 25,37) ~ '3',
between(ID, 38,50) ~ '4',
TRUE ~ NA_character_))
Or another option is to use & with >= and <=:
States %>%
mutate(WAGE_BUCKET=case_when(ID >= 1 & ID <= 12 ~ '1',
ID >= 13 & ID <= 24 ~ '2',
ID >= 25 & ID <= 37 ~ '3',
ID >= 38 & ID <= 50 ~ '4',
TRUE ~ NA_character_))
Or maybe the OP meant to use %in% (which matches only the listed values, not the range between them):
States %>%
mutate(WAGE_BUCKET=case_when(ID %in% c(1,12) ~ '1',
ID %in% c(13,24) ~ '2',
ID %in% c(25,37) ~ '3',
ID %in% c(38,50) ~ '4',
TRUE ~ NA_character_))
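For completeness, the recycling that made the original ID <= c(1,12) misbehave can be seen directly on a small toy vector: positions 1, 3, 5, ... of the left vector are compared against 1 and positions 2, 4, 6, ... against 12.

```r
# Recycling: 1<=1, 2<=12, 3<=1, 4<=12
c(1, 2, 3, 4) <= c(1, 12)
# [1]  TRUE  TRUE FALSE  TRUE
```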
I have two data frames, Table1 and Table2.
Table1:
code
CM171
CM114
CM129
CM131
CM154
CM197
CM42
CM54
CM55
Table2:
code;y;diff_y
CM60;1060;2.9
CM55;255;0.7
CM54;1182;3.2
CM53;1046;2.9
CM47;589;1.6
CM42;992;2.7
CM39;1596;4.4
CM36;1113;3
CM34;1975;5.4
CM226;155;0.4
CM224;46;0.1
CM212;43;0.1
CM197;726;2
CM154;1122;3.1
CM150;206;0.6
CM144;620;1.7
CM132;8;0
CM131;618;1.7
CM129;479;1.3
CM121;634;1.7
CM114;15;0
CM109;1050;2.9
CM107;1165;3.2
CM103;194;0.5
I want to add a column to Table2 based on the values in Table1. I tried to do this using dplyr:
result <-Table2 %>%
mutate (fbp = case_when(
code == Table1$code ~"y",))
But this only works for a few rows. Does anyone know why it doesn't add all rows? The values are not repeated.
Try this. The == operator compares element-wise (recycling the shorter vector), so it only flags rows where the values happen to line up by position. Use %in% instead to test each code against all values in Table1. Here is the code:
#Code
result <-Table2 %>%
mutate (fbp = case_when(
code %in% Table1$code ~"y",))
Output:
code y diff_y fbp
1 CM60 1060 2.9 <NA>
2 CM55 255 0.7 y
3 CM54 1182 3.2 y
4 CM53 1046 2.9 <NA>
5 CM47 589 1.6 <NA>
6 CM42 992 2.7 y
7 CM39 1596 4.4 <NA>
8 CM36 1113 3.0 <NA>
9 CM34 1975 5.4 <NA>
10 CM226 155 0.4 <NA>
11 CM224 46 0.1 <NA>
12 CM212 43 0.1 <NA>
13 CM197 726 2.0 y
14 CM154 1122 3.1 y
15 CM150 206 0.6 <NA>
16 CM144 620 1.7 <NA>
17 CM132 8 0.0 <NA>
18 CM131 618 1.7 y
19 CM129 479 1.3 y
20 CM121 634 1.7 <NA>
21 CM114 15 0.0 y
22 CM109 1050 2.9 <NA>
23 CM107 1165 3.2 <NA>
24 CM103 194 0.5 <NA>
This question already has answers here:
Selecting only integers from a vector [duplicate]
(2 answers)
Closed 5 years ago.
I would like to filter my data frame based on integer values from the first column v:
v P_el
1 2.5 0
2 3.0 78
3 3.5 172
4 4.0 287
5 4.5 426
6 5.0 601
7 5.5 814
8 6.0 1069
9 6.5 1367
10 7.0 1717
11 7.5 2110
12 8.0 2546
13 8.5 3002
14 9.0 3427
15 9.5 3751
16 10.0 3922
The output should look like this:
v P_el
2 3 78
4 4 287
6 5 601
8 6 1069
10 7 1717
12 8 2546
14 9 3427
16 10 3922
We can check whether the values leave a remainder of 0 when divided by 1, using the modulo operator %%.
dat[dat$v %% 1 == 0, ]
v P_el
2 3 78
4 4 287
6 5 601
8 6 1069
10 7 1717
12 8 2546
14 9 3427
16 10 3922
DATA
dat <- read.table(text = " v P_el
1 2.5 0
2 3.0 78
3 3.5 172
4 4.0 287
5 4.5 426
6 5.0 601
7 5.5 814
8 6.0 1069
9 6.5 1367
10 7.0 1717
11 7.5 2110
12 8.0 2546
13 8.5 3002
14 9.0 3427
15 9.5 3751
16 10.0 3922",
header = TRUE)
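One caveat worth noting: the %% 1 == 0 test is exact, so values that are only approximately whole because of floating-point arithmetic would be dropped. A tolerance-based check is safer in that case (a toy example, not from the question's data):

```r
x <- (0.1 + 0.2) * 10      # mathematically 3, but not stored exactly as 3
x %% 1 == 0                # FALSE, due to floating-point representation
abs(x - round(x)) < 1e-8   # TRUE, tolerance-based whole-number check
```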
You can use seq( ) function if you have an idea of sequence in column v
dat
# v P_el
# 1 2.5 0
# 2 3.0 78
# 3 3.5 172
# 4 4.0 287
# 5 4.5 426
# 6 5.0 601
# 7 5.5 814
# 8 6.0 1069
# 9 6.5 1367
# 10 7.0 1717
# 11 7.5 2110
# 12 8.0 2546
# 13 8.5 3002
# 14 9.0 3427
# 15 9.5 3751
# 16 10.0 3922
dat[seq(2,16,by = 2),]
# v P_el
# 2 3 78
# 4 4 287
# 6 5 601
# 8 6 1069
# 10 7 1717
# 12 8 2546
# 14 9 3427
# 16 10 3922
I have two Data Frames. One is an Eye Tracking data frame with subject, condition, timestamp, xposition, and yposition. It has over 400,000 rows. Here's a toy data set for an example:
subid condition time xpos ypos
1 1 1 1.40 195 140
2 1 1 2.50 138 147
3 1 1 3.40 140 162
4 1 1 4.10 188 150
5 1 2 1.10 131 194
6 1 2 2.10 149 111
eyedata <- data.frame(subid = rep(1:2, each = 8),
condition = rep(rep(1:2, each = 4),2),
time = c(1.4, 2.5, 3.4, 4.1,
1.1, 2.1, 3.23, 4.44,
1.33, 2.3, 3.11, 4.1,
.49, 1.99, 3.01, 4.2),
xpos = round(runif(n = 16, min = 100, max = 200)),
ypos = round(runif(n = 16, min = 100, max = 200)))
Then I have a Data Frame with subject, condition, a trial number, and a trial begin and end time. It looks like this:
subid condition trial begin end
1 1 1 1 1.40 2.4
2 1 1 2 2.50 3.2
3 1 1 2 3.21 4.5
4 1 2 1 1.10 1.6
5 1 2 2 2.10 3.3
6 1 2 2 3.40 4.1
7 2 1 1 0.50 1.1
8 2 1 1 1.44 2.9
9 2 1 2 2.97 3.3
10 2 2 1 0.35 1.9
11 2 2 1 2.12 4.5
12 2 2 2 3.20 6.3
trials <- data.frame(subid = rep(1:2, each = 6),
condition = rep(rep(1:2, each = 3),2),
trial= c(rep(c(1,rep(2,2)),2),rep(c(rep(1,2),2),2)),
begin = c(1.4, 2.5, 3.21,
1.10, 2.10, 3.4, .50,
1.44,2.97,.35,2.12,3.20),
end = c(2.4,3.2,4.5,1.6,
3.3,4.1,1.1,2.9,
3.3,1.9,4.5,6.3))
The number of trials in a condition are variable, and I want to add a column to my eyetracking dataframe that specifies the correct trial based upon whether the timestamp falls within the time interval. The time intervals do not overlap, but there will be many rows for the eyetracking data in between trials. In the end I'd like a dataframe like this:
subid condition trial time xpos ypos
1 1 1 1.40 198 106
1 1 2 2.50 166 139
1 1 2 3.40 162 120
1 1 2 4.10 113 164
1 2 1 1.10 162 120
1 2 2 2.10 162 120
I've seen data.table rolling joins, but would prefer a solution with dplyr or fuzzyjoin. Thanks in advance.
Here's what I tried, but I can't fully resolve the discrepancies, so it is likely an incomplete answer. Rows 12 and 13 of this result may come from an overlap in time. Also, when using random generation functions such as runif, please set.seed() -- here xpos and ypos have no bearing on the result, so it is not an issue.
eyedata %>%
left_join(trials, by = c("subid", "condition")) %>%
filter( (time >= begin & time <= end))
# subid condition time xpos ypos trial begin end
# 1 1 1 1.40 143 101 1 1.40 2.4
# 2 1 1 2.50 152 173 2 2.50 3.2
# 3 1 1 3.40 185 172 2 3.21 4.5
# 4 1 1 4.10 106 119 2 3.21 4.5
# 5 1 2 1.10 155 165 1 1.10 1.6
# 6 1 2 2.10 169 154 2 2.10 3.3
# 7 1 2 3.23 166 134 2 2.10 3.3
# 8 2 1 2.30 197 171 1 1.44 2.9
# 9 2 1 3.11 140 135 2 2.97 3.3
# 10 2 2 0.49 176 139 1 0.35 1.9
# 11 2 2 3.01 187 180 1 2.12 4.5
# 12 2 2 4.20 147 176 1 2.12 4.5
# 13 2 2 4.20 147 176 2 3.20 6.3
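Since the question mentions fuzzyjoin: a sketch with fuzzy_inner_join (assuming the fuzzyjoin package is available) would match subid and condition exactly and keep only rows where time falls inside [begin, end]. The same overlap caveat for rows 12 and 13 applies here.

```r
library(dplyr)
library(fuzzyjoin)

# Exact match on subid/condition, interval match of time against [begin, end]
eyedata %>%
  fuzzy_inner_join(trials,
                   by = c("subid" = "subid",
                          "condition" = "condition",
                          "time" = "begin",
                          "time" = "end"),
                   match_fun = list(`==`, `==`, `>=`, `<=`)) %>%
  select(subid = subid.x, condition = condition.x, trial, time, xpos, ypos)
```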
Trying to plot a time series chart with ggplot2, using the alpha value to make the lines darker/lighter, as per ggplot2. Got it working in one function, but when I try with another dataset the alpha doesn't work. I guess I am calling something incorrectly, because I have the alpha variable set at 0.2 but the lines still come out dark.
Here is the code and some sample data
tsplot <- ggplot(xall, aes(x=Var1, y=value)) +
geom_line(size=.01) + guides(colour=FALSE) + xlab(x.lab) +ylab("Time Series")
tsplot <- tsplot + aes(alpha=alpha, group= factor(Var2)) +guides(alpha=F)
Sample data for xall
Var1 Var2 value alpha row
1 1 657 0 0.2 Other Rows
2 2 657 -0.006748957 0.2 Other Rows
3 3 657 -0.00088561 0.2 Other Rows
4 4 657 0.009399679 0.2 Other Rows
5 5 657 0.020216333 0.2 Other Rows
6 6 657 0.035222838 0.2 Other Rows
7 7 657 0.038869107 0.2 Other Rows
8 8 657 0.034068491 0.2 Other Rows
9 9 657 0.044237734 0.2 Other Rows
81 1 553 0 0.2 Other Rows
82 2 553 -0.006172511 0.2 Other Rows
83 3 553 -0.004779576 0.2 Other Rows
84 4 553 0.000116964 0.2 Other Rows
85 5 553 -0.013408332 0.2 Other Rows
86 6 553 -0.003200561 0.2 Other Rows
87 7 553 0.000574187 0.2 Other Rows
88 8 553 0.025227017 0.2 Other Rows
89 9 553 0.019984901 0.2 Other Rows
241 1 876 0 0.2 Other Rows
242 2 876 0.006348487 0.2 Other Rows
243 3 876 0.020292484 0.2 Other Rows
244 4 876 0.030155311 0.2 Other Rows
245 5 876 0.02664097 0.2 Other Rows
246 6 876 0.021992971 0.2 Other Rows
247 7 876 0.015871216 0.2 Other Rows
248 8 876 0.020519216 0.2 Other Rows
249 9 876 0.017004875 0.2 Other Rows
250 10 876 0.029588482 0.2 Other Rows
Any help would be greatly appreciated.
You need to add alpha to the global aesthetic. You should also add the group mapping:
ggplot(xall, aes(x=Var1, y=value, alpha=alpha, group= factor(Var2))) +
geom_line(size=.01) + guides(colour=FALSE) + xlab(x.lab) +ylab("Time Series")
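One possible caveat (an assumption about the intent here): because alpha is a mapped aesthetic, ggplot2 passes it through a scale and rescales it to the default alpha range, so a column of constant 0.2 may not render at literally 0.2. If the values in the alpha column should be used verbatim, add an identity scale:

```r
ggplot(xall, aes(x = Var1, y = value, alpha = alpha, group = factor(Var2))) +
  geom_line(size = .01) +
  scale_alpha_identity() +   # use the alpha values in the data as-is
  xlab(x.lab) + ylab("Time Series")
```

With scale_alpha_identity() no alpha legend is generated, so guides(alpha=F) is no longer needed.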