This question already has answers here:
Reshape horizontal to to long format using pivot_longer
(3 answers)
Closed 2 years ago.
Thank you all for your answers, I thought I was smarter than I am and hoped I would've understood any of it. I think I messed up my visualisation of my data aswell. I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now i would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would i tend to get there? I have looked up some things about wide to longformat, but it doesn't seem to do the trick. Am reletively new to Rstudio and Stackoverflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from column names based on the names_pattern argument. The names_pattern argument takes a special regex input. In this case, here is the breakdown:
(.+) # match everything - anything noted like this becomes the ".values"
[0-9] # numeric characters - tells the pattern that the numbers
# at the end are excluded from ".values". If you have multiple digit
# numbers, use [0-9*]
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays seperately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
Which keeps measurements and Tdays neatly together but leaves us without pid which we can add using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
As I said above, this is not the order you expected and you can try to sort this data.frame, if that is of any concern:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
tobind = df[, c(1,j,j+1)]
finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
Related
I have 4 data frames that all look like this:
Product 2018
Number
Minimum
Maximum
1
56
1
5
2
42
12
16
3
6523
23
56
4
123
23
102
5
56
23
64
6
245623
56
87
7
546
25
540
8
54566
253
560
Product 2019
Number
Minimum
Maximum
1
56
32
53
2
642
423
620
3
56423
432
560
4
3
431
802
5
2
2
6
6
4523
43
68
7
555
23
54
8
55646
3
6
Product 2020
Number
Minimum
Maximum
1
23
2
5
2
342
4
16
3
223
3
5
4
13
4
12
5
2
4
7
6
223
7
8
7
5
34
50
8
46
3
6
Product 2021
Number
Minimum
Maximum
1
234
3
5
2
3242
4
16
3
2423
43
56
4
123
43
102
5
24
4
6
6
2423
4
18
7
565
234
540
8
5646
23
56
I want to join all the tables so I get a table that looks like this:
Products
Number 2021
Min-Max 2021
Number 2020
Min-Max 2020
Number 2019
Min-Max 2019
Number 2018
Min-Max 2018
1
234
3 to 5
23
2 to 5
...
...
...
...
2
3242
4 to 16
342
4 to 16
...
...
...
...
3
2423
43 to 56
223
3 to 5
...
...
...
...
4
123
43 to 102
13
4 to 12
...
...
...
...
5
24
4 to 6
2
4 to 7
...
...
...
...
6
2423
4 to 18
223
7 to 8
...
...
...
...
7
565
234 to 540
5
34 to 50
...
...
...
...
8
5646
23 to 56
46
3 to 6
...
...
...
...
The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56
I used this method to gather mean and sd result successly before here .And then, I tried to use this methond to gather my gene counts DEG data with "logFC","cil","cir","ajustP_value" .But I failed because something wrong with my result.
Just like this:
data_1<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_1) <- c(paste0("Gene_", 1:25))
rownames(data_1)<-NULL
head(data_1)
A<-paste0(1:48,"_logFC")
data_logFC<-data.frame(A=A,data_1)
#
data_2<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_2) <- c(paste0("Gene_", 1:25))
rownames(data_1)<-NULL
B_L<-paste0(1:48,"_CI.L")
data_CIL<-data.frame(A=B_L,data_2)
data_CIL[1:48,1:6]
#
data_3<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_3) <- c(paste0("Gene_", 1:25))
rownames(data_3)<-NULL
C_R<-paste0(1:48,"_CI.R")
data_CIR<-data.frame(A=C_R,data_3)
data_CIR[1:48,1:6]
#
data_4<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_4) <- c(paste0("Gene_", 1:25))
rownames(data_4)<-NULL
D<-paste0(1:48,"_adj.P.Val")
data_ajustP<-data.frame(A=D,data_4)
data_ajustP[1:48,1:6]
# combine data_logFC data_CIL data_CIR data_ajustP
data <- bind_rows(list(
logFC = data_logFC,
CIL = data_CIL,
CIR =data_CIR,
AJSTP=data_ajustP
), .id = "stat")
data[1:10,1:6]
data_DEG<- data %>%
pivot_longer(-c(stat,A), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,100)
str(data_DEG$CIL)
> head(data_DEG,100)
# A tibble: 100 x 6
A Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1_logFC Gene_1 504 NA NA NA
2 1_logFC Gene_2 100 NA NA NA
3 1_logFC Gene_3 689 NA NA NA
4 1_logFC Gene_4 779 NA NA NA
5 1_logFC Gene_5 397 NA NA NA
6 1_logFC Gene_6 1152 NA NA NA
7 1_logFC Gene_7 780 NA NA NA
8 1_logFC Gene_8 155 NA NA NA
9 1_logFC Gene_9 142 NA NA NA
10 1_logFC Gene_10 1150 NA NA NA
# … with 90 more rows
Why is there so many NAs ?
Can somebody help me ? Vary thankful.
EDITE:
I confused the real sample group of my data. So I reshape my data without a right index.
Here is my right method:
data[1:10,1:6]
data<-separate(data,A,c("Name","stat2"),"_")
data<-data[,-3]
data_DEG<- data %>%
pivot_longer(-c(stat,Name), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,10)
tail(data_DEG,10)
> head(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1 Gene_1 504 1116 774 278
2 1 Gene_2 100 936 448 887
3 1 Gene_3 689 189 718 933
4 1 Gene_4 779 943 690 19
5 1 Gene_5 397 976 40 135
6 1 Gene_6 1152 304 343 647
7 1 Gene_7 780 1076 796 1024
8 1 Gene_8 155 645 469 180
9 1 Gene_9 142 256 889 1047
10 1 Gene_10 1150 976 1194 670
> tail(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 48 Gene_16 448 633 1080 1122
2 48 Gene_17 73 772 14 388
3 48 Gene_18 652 999 699 912
4 48 Gene_19 600 1163 512 241
5 48 Gene_20 428 1119 1142 348
6 48 Gene_21 66 553 240 82
7 48 Gene_22 753 1119 630 117
8 48 Gene_23 1017 305 1120 447
9 48 Gene_24 432 1175 447 670
10 48 Gene_25 482 394 371 696
It's a perfect result!!
I have a large dataset, 4666972 obs. of 5 variables.
I want to impute one column, MPR, with Kalman method based on each groups.
> str(dt)
Classes ‘data.table’ and 'data.frame': 4666972 obs. of 5 variables:
$ Year : int 1999 2000 2001 1999 2000 2001 1999 2000 2001 1999 ...
$ State: int 1 1 1 1 1 1 1 1 1 1 ...
$ CC : int 1 1 1 1 1 1 1 1 1 1 ...
$ ID : chr "1" "1" "1" "2" ...
$ MPR : num 54 54 55 52 52 53 60 60 65 70 ...
I tried the code below but it crashed after a while.
> library(imputeTS)
> data.table::setDT(dt)[, MPR_kalman := with(dt, ave(MPR, State, CC, ID, FUN=na_kalman))]
I don't know how to improve the time efficiency and impute successfully without crashed.
Is it better to split the dataset with ID to list and impute each of them with for loop?
> length(unique(hpms_S3$Section_ID))
[1] 668184
> split(dt, dt$ID)
However, I think this will not save too much of memory use or avoid crashed since when I split the dataset to 668184 lists and impute, I need to do multiple times and then combine to one dataset at last.
Is there any great way to do or how can I optimize code I did?
I provide the simple sample here:
# dt
Year State CC ID MPR
2002 15 3 3 NA
2003 15 3 3 NA
2004 15 3 3 193
2005 15 3 3 193
2006 15 3 3 348
2007 15 3 3 388
2008 15 3 3 388
1999 53 33 1 NA
2000 53 33 1 NA
2002 53 33 1 NA
2003 53 33 1 NA
2004 53 33 1 NA
2005 53 33 1 170
2006 53 33 1 170
2007 53 33 1 330
2008 53 33 1 330
EDIT:
As #r2evans mentioned in comment, I modified the code:
> setDT(dt)[, MPR_kalman := ave(MPR, State, CC, ID, FUN=na_kalman), by = .(State, CC, ID)]
Error in optim(init[mask], getLike, method = "L-BFGS-B", lower = rep(0, :
L-BFGS-B needs finite values of 'fn'
I got the error above. I found the post here for this error discussions. However, even I use na_kalman(MPR, type = 'level'), I still got error. I think there might be some repeated values within groups so that it produced error.
Perhaps splitting should be done using data.table's by= operator, perhaps more efficient.
Since I don't have imputeTS installed (there are several nested dependencies I don't have), I'll fake imputation using zoo::na.locf, both forward/backwards. I'm not suggesting this be your imputation mechanism, I'm using it to demonstrate a more-common pattern with data.table.
myimpute <- function(z) zoo::na.locf(zoo::na.locf(z, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
Now some equivalent calls, one with your with(dt, ...) and my alternatives (which are really walk-throughs until my ultimate suggestion of 5):
dt[, MPR_kalman1 := with(dt, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman2 := with(.SD, ave(MPR, State, CC, ID, FUN = myimpute))]
dt[, MPR_kalman3 := with(.SD, ave(MPR, FUN = myimpute)), by = .(State, CC, ID)]
dt[, MPR_kalman4 := ave(MPR, FUN = myimpute), by = .(State, CC, ID)]
dt[, MPR_kalman5 := myimpute(MPR), by = .(State, CC, ID)]
# Year State CC ID MPR MPR_kalman1 MPR_kalman2 MPR_kalman3 MPR_kalman4 MPR_kalman5
# 1: 2002 15 3 3 NA 193 193 193 193 193
# 2: 2003 15 3 3 NA 193 193 193 193 193
# 3: 2004 15 3 3 193 193 193 193 193 193
# 4: 2005 15 3 3 193 193 193 193 193 193
# 5: 2006 15 3 3 348 348 348 348 348 348
# 6: 2007 15 3 3 388 388 388 388 388 388
# 7: 2008 15 3 3 388 388 388 388 388 388
# 8: 1999 53 33 1 NA 170 170 170 170 170
# 9: 2000 53 33 1 NA 170 170 170 170 170
# 10: 2002 53 33 1 NA 170 170 170 170 170
# 11: 2003 53 33 1 NA 170 170 170 170 170
# 12: 2004 53 33 1 NA 170 170 170 170 170
# 13: 2005 53 33 1 170 170 170 170 170 170
# 14: 2006 53 33 1 170 170 170 170 170 170
# 15: 2007 53 33 1 330 330 330 330 330 330
# 16: 2008 53 33 1 330 330 330 330 330 330
The two methods produce the same results, but the latter preserves many of the memory-efficiencies that can make data.table preferred.
The use of with(dt, ...) is an anti-pattern in one case, and a strong risk in another. For the "risk" part, realize that data.table can do a lot of things behind-the-scenes so that the calculations/function-calls within the j= component (second argument) only sees data that is relevant. A clear example is grouping, but another (unrelated to this) data.table example is conditional replacement, as in dt[is.na(x), x := -1]. With the reference to the enter table dt inside of this, if there is ever something in the first argument (conditional replacement) or a by= argument, then it fails.
MPR_kalman2 mitigates this by using .SD, which is data.table's way of replacing the data-to-be-used with the "Subset of the Data" (ref). But it's still not taking advantage of data.table's significant efficiencies in dealing in-memory with groups.
MPR_kalman3 works on this by grouping outside, still using with but not (as in 2) in a more friendly way.
MPR_kalman4 removes the use of with, since really the MPR visible to ave is only within each group anyway. And then when you think about it, since ave is given no grouping variables, it really just passes all of the MPR data straight-through to myimpute. From this, we have MPR_kalman5, a direct method that is along the normal patterns of data.table.
While I don't know that it will mitigate your crashing, it is intended very much to be memory-efficient (in data.table's ways).
I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and learned what you were doing except pen_rate. It seems that pen_rate is calculated dividing cum_chart by total. I download the original data set and did the following. I called the data set foo. Whenclosed_pub), I combined Charter and ClosedDate. I checked if ClosedDate is NA or not, and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub). I guess this would ask you to do less typing. Since the dates are in character, I extracted year using substr(). If you have a date object, you need to do something else. Once you have year, you group the data with it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then, Convert the output to a wide-format data with spread() and did the rest of the calculation as you demonstrated in your codes. The final output seems different from what you have in your question, but my outcome was identical to one that I obtained by running your codes. I hope this will help you.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, star = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
Task:
I have to check that if the value in the data vector is above from the given threshold,
If in my data vector, I found 5 consecutive values greater then the given threshold then I keep these values as they are.
If I have less then 5 values (not 5 consecutive values) then I will replace these values with NA's.
The sample data and required output is shown below. In this example the threshold value is 1000. X is input data variable and the desired output is: Y = X(Threshold > 1000)
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R for my desired output but unable to get the appropriate one:
for (i in 1:nrow(X)) # X is my data vector
{counter=0
if (X[i]>10000)
{
for (j in i:(i+4))
{
if (X[j]>10000)
{counter=counter+1}
}
ifelse (counter < 5, NA, X[j])
}
X[i]<- NA
}
X
I am sure that the above code contain some error. I need help in the form of either a new code or modifying this code or any package in R.
Here is an approach using dplyr, using a cumulative sum of diff(x > 1000) to group the values.
library(dplyr)
df <- data.frame(x)
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>% mutate(group = cumsum(c(0, abs(diff(x>1000))))) %>%
group_by(group) %>%
mutate(count = n()) %>%
ungroup() %>%
mutate(y = ifelse(x<1000 | count > 5, x, NA))
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
Another approach :
Y<-rep(NA,nrow(X))
for (i in 1:nrow(X)) {
if (X[i,1]<1000) {Y[i]<-X[i,1]} else if (sum(X[i:min((i+4),nrow(X)),1]>1000)>=5) {
Y[i:min((i+4),nrow(X))]<-X[i:min((i+4),nrow(X)),1]}
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a dataframe named X.
It then creates Y with NAand only change the values if the criteria are met.