R: expand and process data with nested for loops and tail

I need to expand some data and then restrict which data remains through tail.
Example of data:
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
str(short_lists)
List of 3
$ :List of 1
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 1
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 1
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And these are the tail lengths I want to take from each of list_1, list_2 and list_3:
how_long <-
c(4,2,5,3,6,4,7,5,8,6,9,7,10,8,2,4,6,8,10,12,14,10,9,7,11)
And I expand through nested for loops and try to get the tail of the expanded lists, but just get the expanded lists.
for (i in 1:length(how_long)) {
for (j in 1:length(short_lists)) {
tail_temp[[j]][i] <- tail(short_lists2[[j]], n = how_long[i])
}
}
And this yields:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
[snip]
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 25
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
[snip]
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 25
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
[snip]
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And I'm happy the j's were expanded, but I never get to the tail call and what I'm seeking:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:4] 12 13 14 15
..$ : int [1:2] 14 15
..$ : int [1:5] 11 12 13 14 15
[snip]
So what simple thing am I missing? Any help appreciated. Thanks.

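Very close indeed.
In R, I prefer to store plain vectors inside lists rather than lists inside lists.
If you're familiar with Python,
R vectors behave roughly like Python lists,
whereas R lists behave more like dictionaries.
Each short_lists[[j]] is a list holding a single vector, so tail() sees only one element and returns all of it, whatever n is.
Therefore, you just need to unlist first (into a vector)
and then assign the result to an item of a list using double brackets at both levels,
i.e. assign to tail_temp[[i]][[j]] instead of tail_temp[[j]][i] (note that i and j swap roles relative to the loop in the question):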
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists = list()
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
how_long <- c(4,2,5,3,6,4,7,5,8,6,9,7,10,
8,2,4,6,8,10,12,14,10,9,7,11)
tail_temp = list()
for (i in 1:length(short_lists)){
tail_temp[[i]] = list()
for (j in 1:length(how_long)){
tail_temp[[i]][[j]] <- tail(unlist(short_lists[[i]]), n = how_long[j])
}
}
Output
[[1]]
[[1]][[1]]
[1] 12 13 14 15
[[1]][[2]]
[1] 14 15
[[1]][[3]]
[1] 11 12 13 14 15
…
[[3]][[23]]
[1] 37 38 39 40 41 42 43 44 45
[[3]][[24]]
[1] 39 40 41 42 43 44 45
[[3]][[25]]
[1] 35 36 37 38 39 40 41 42 43 44 45
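The same result can be sketched without explicit loops. This is only an alternative illustration in base R, not the answerer's code; the shortened how_long vector and the name tail_temp are reused purely for the example. The first lines also show why tail() on a one-element list gives back the whole element.

```r
# A list holding a single vector has length 1, so tail() can return
# at most that one element, regardless of n.
one_element <- list(1:15)
whole <- tail(one_element, n = 4)   # still a list of length 1 holding 1:15

# Same data as in the question, with a shortened how_long for brevity.
short_lists <- list(list(1:15), list(16:30), list(31:45))
how_long <- c(4, 2, 5)

# Outer lapply walks the lists, inner lapply walks the tail lengths.
tail_temp <- lapply(short_lists, function(x) {
  v <- unlist(x)                          # flatten to a plain vector first
  lapply(how_long, function(n) tail(v, n))
})

str(tail_temp)
```

Because lapply() builds the nested list itself, there is no need to pre-allocate tail_temp or to keep track of which index uses [[ ]] and which uses [ ].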

Related

List being added to a dataframe

Why is a list being added to my dataframe here?
Here's my dataframe
df <- data.frame(ch = rep(1:10, each = 12), # care home id
year_id = rep(2018),
month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month...etc.)
totaladministrations = rbinom(n=120, size = 1000, prob = 0.6), # administrations that were scheduled to have been given in the month
missed = rbinom(n=120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
rating = rep(rbinom(n= 10, size = 4, prob = 0.5), each = 12)) # latest inspection rating (1. Inadequate, 2. Requires Improving, 3. Good, 4 Outstanding)
df <- arrange(df, df$ch, df$year_id, df$month_id)
str(df)
> str(df)
'data.frame': 120 obs. of 7 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations: int 576 598 608 576 608 637 611 613 593 626 ...
$ missed : int 18 18 19 16 16 13 17 16 15 17 ...
$ beds : int 38 38 38 38 38 38 38 38 38 38 ...
$ rating : int 2 2 2 2 2 2 2 2 2 2 ...
All good so far.
I just want to add another column that sequences the month number within the ch group (this equates to the actual month_id in this example but ignore that, my real life data is different), so I'm using:
df <- df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n())
This appears to add a bunch of stuff I don't really understand, want or need, such as a list ...
str(df)
> str(df)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations : int 601 590 593 599 615 611 628 587 604 600 ...
$ missed : int 16 14 17 16 18 16 15 18 15 20 ...
$ beds : int 35 35 35 35 35 35 35 35 35 35 ...
$ rating : int 3 3 3 3 3 3 3 3 3 3 ...
$ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "groups")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
..$ ch : int 1 2 3 4 5 6 7 8 9 10
..$ .rows:List of 10
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 37 38 39 40 41 42 43 44 45 46 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
.. ..$ : int 73 74 75 76 77 78 79 80 81 82 ...
.. ..$ : int 85 86 87 88 89 90 91 92 93 94 ...
.. ..$ : int 97 98 99 100 101 102 103 104 105 106 ...
.. ..$ : int 109 110 111 112 113 114 115 116 117 118 ...
..- attr(*, ".drop")= logi TRUE
What's going on here? I just want a dataframe. Why is there all that additional output after $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ..., and more importantly, can I ignore it and just keep treating it as a normal dataframe (I'll be running some generalised linear mixed models on the df)?
The attribute "groups" is where dplyr stores the grouping information added when you did group_by(ch). It doesn't hurt anything, and it will disappear if you ungroup():
df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n()) %>%
ungroup %>%
str
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
# $ ch : int 1 1 1 1 1 1 1 1 1 1 ...
# $ year_id : num 2018 2018 2018 2018 2018 ...
# $ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ totaladministrations : int 575 597 579 605 582 599 577 604 630 632 ...
# $ missed : int 18 16 16 18 18 11 10 13 17 16 ...
# $ beds : int 33 33 33 33 33 33 33 33 33 33 ...
# $ rating : int 3 3 3 3 3 3 3 3 3 3 ...
# $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
As a side-note, you should use bare column names inside dplyr verbs, not data$column. With arrange, it doesn't much matter, but in grouped operations it will cause bugs. You should get in the habit of using arrange(df, ch, year_id, month_id) instead of arrange(df, df$ch, df$year_id, df$month_id).
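If the grouped tibble is not wanted at all, a within-group counter can also be built in base R. The following is a minimal sketch under the assumption that the rows are already sorted within each ch; the small df here is made-up stand-in data, not the care home data from the question.

```r
# Small stand-in data frame: two care homes, three months each.
df <- data.frame(ch = rep(1:2, each = 3),
                 month_id = rep(1:3, times = 2))

# ave() applies FUN to the values of its first argument within each
# level of ch and returns a vector aligned with the original rows.
df$sequential_month_counter <- ave(seq_along(df$ch), df$ch, FUN = seq_along)

str(df)   # still a plain data.frame, with no "groups" attribute
```

This keeps the object a plain data.frame throughout, so nothing has to be ungrouped before it is passed to a modelling function.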

How to combine three date columns in a data frame into a single variable?

I have a data frame that looks a bit like this:
Type Size `Jul-17` `Aug-17` `Sep-17`
1 A Large 35 24 80
2 B Medium 81 13 38
3 C Small 30 64 45
4 D Large 97 68 65
5 E Medium 31 69 33
6 F Small 84 74 12
I use the ddply function a lot, and instead of summing the three columns together like below...
result <- ddply(Example, .(Type), (summarize),
Q3sum = sum(`Jul-17`, `Aug-17`, `Sep-17`))
I'd like to be able to reference a single variable that contains those three columns and call it "Q3". Is there a way to do this that will still allow the data to work with ddply? I've tried setting the three columns to a single variable using Q3<- c(`Jul-17`, `Aug-17`, `Sep-17`), but it doesn't seem to work.
Any suggestions would be greatly appreciated.
Reproducible data frame:
read.table(check.names = FALSE, text="Type Size Jul-17 Aug-17 Sep-17
A Large 35 24 80
B Medium 81 13 38
C Small 30 64 45
D Large 97 68 65
E Medium 31 69 33
F Small 84 74 12", header=TRUE, stringsAsFactors=FALSE) -> xdf
xdf
## Type Size Jul-17 Aug-17 Sep-17
## 1 A Large 35 24 80
## 2 B Medium 81 13 38
## 3 C Small 30 64 45
## 4 D Large 97 68 65
## 5 E Medium 31 69 33
## 6 F Small 84 74 12
If you just want the sum of the columns into one Q3 column:
xdf$Q3 <- rowSums(xdf[,3:5])
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3
## 1 A Large 35 24 80 139
## 2 B Medium 81 13 38 132
## 3 C Small 30 64 45 139
## 4 D Large 97 68 65 230
## 5 E Medium 31 69 33 133
## 6 F Small 84 74 12 170
If you want the 3 months making up "Q3" nested into one column:
xdf$q3_alt <- apply(xdf, 1, function(x) { list(as.numeric(x[3:5])) })
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3 q3_alt
## 1 A Large 35 24 80 139 35, 24, 80
## 2 B Medium 81 13 38 132 81, 13, 38
## 3 C Small 30 64 45 139 30, 64, 45
## 4 D Large 97 68 65 230 97, 68, 65
## 5 E Medium 31 69 33 133 31, 69, 33
## 6 F Small 84 74 12 170 84, 74, 12
str(xdf)
## 'data.frame': 6 obs. of 7 variables:
## $ Type : chr "A" "B" "C" "D" ...
## $ Size : chr "Large" "Medium" "Small" "Large" ...
## $ Jul-17: int 35 81 30 97 31 84
## $ Aug-17: int 24 13 64 68 69 74
## $ Sep-17: int 80 38 45 65 33 12
## $ Q3 : num 139 132 139 230 133 170
## $ q3_alt:List of 6
## ..$ :List of 1
## .. ..$ : num 35 24 80
## ..$ :List of 1
## .. ..$ : num 81 13 38
## ..$ :List of 1
## .. ..$ : num 30 64 45
## ..$ :List of 1
## .. ..$ : num 97 68 65
## ..$ :List of 1
## .. ..$ : num 31 69 33
## ..$ :List of 1
## .. ..$ : num 84 74 12
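One way to get a single name that stands for the three month columns is to store the column names, not the columns, in a vector. The sketch below uses only base R; q3_cols and by_size are names chosen for the example, and aggregate() is swapped in for ddply, so treat it as an illustration rather than a drop-in replacement.

```r
# Rebuild the example data; check.names = FALSE keeps the "Jul-17" style names.
xdf <- data.frame(Type = c("A", "B", "C", "D", "E", "F"),
                  Size = c("Large", "Medium", "Small", "Large", "Medium", "Small"),
                  `Jul-17` = c(35, 81, 30, 97, 31, 84),
                  `Aug-17` = c(24, 13, 64, 68, 69, 74),
                  `Sep-17` = c(80, 38, 45, 65, 33, 12),
                  check.names = FALSE, stringsAsFactors = FALSE)

# "Q3" as a reusable set of column names (hypothetical helper variable).
q3_cols <- c("Jul-17", "Aug-17", "Sep-17")

# Row-wise quarter total, then a per-group summary on that single column.
xdf$Q3 <- rowSums(xdf[, q3_cols])
by_size <- aggregate(Q3 ~ Size, data = xdf, FUN = sum)
by_size
```

Once Q3 exists as an ordinary numeric column, any grouping tool can summarise it by name.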
The solution is the gather function from tidyr. If you use dplyr, you can do it in one line of code.
> library(dplyr)
> library(tidyr)
> df%>%
+ gather(key = Q3,value = values,Jul_17:Sep_17)
type size Q3 values
1 1 A Large Jul_17 35
2 2 B Medium Jul_17 81
3 3 C Small Jul_17 30
4 4 D Large Jul_17 97
5 5 E Medium Jul_17 31
6 6 F Small Jul_17 84
7 1 A Large Aug_17 24
8 2 B Medium Aug_17 13
9 3 C Small Aug_17 64
10 4 D Large Aug_17 68
11 5 E Medium Aug_17 69
12 6 F Small Aug_17 74
13 1 A Large Sep_17 80
14 2 B Medium Sep_17 38
15 3 C Small Sep_17 45
16 4 D Large Sep_17 65
17 5 E Medium Sep_17 33
18 6 F Small Sep_17 12
Sounds to me like you want something along the lines of melt from the reshape2 package or gather from the tidyr package. They will make your data.frame longer, with all the Jul-17, Aug-17, and Sep-17 values in one column and another column declaring which month each data point came from.
Check out this nice primer on data tidying.

R: Subsetting returns "0 obs."

I'm trying to subset my dataset 'eggdat' for daytime and nighttime hours. This is its structure:
'data.frame': 54847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Subsetting for daytime works fine:
daytime <- eggdat[eggdat$hour >= 7 & eggdat$hour <= 20, ]
'data.frame': 18847 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 20 20 20 20 20 20 20 20 20 20 ...
$ minute: int 5 5 5 5 5 5 5 5 5 5 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num -159 179 -164 -155 -137 ...
$ Pitch : num -31.36 -41.05 -23.85 -6.62 -9.13 ...
$ Yaw : num -71.8 -113.3 -67.2 -140.2 -78.2 ...
$ temp1 : num 25 33.5 34 34 34 34 34 34 34 34 ...
Doing exactly the same thing for nighttime, however, returns a subset with 0 observations:
nighttime <- eggdat[eggdat$hour <= 7 & eggdat$hour >= 21, ]
'data.frame': 0 obs. of 10 variables:
$ year : int
$ month : int
$ day : int
$ hour : int
$ minute: int
$ second: int
$ Roll : num
$ Pitch : num
$ Yaw : num
$ temp1 : num
I really don't know what to do. I tried using subset, but without success. I also tried eggdat$hour <- as.factor(eggdat$hour), but couldn't get it to work either.
Even more confusingly, adding the quotation marks in the subset function (daytime <- eggdat[eggdat$hour >= '7' & eggdat$hour <= '20', ] and nighttime <- eggdat[eggdat$hour <= '7' & eggdat$hour >= '21', ]) resulted in the daytime subset containing '0 obs.', but the nighttime subset working fine, so it's just the other way around!
Daytime: 'data.frame': 0 obs. of 10 variables:
Nighttime:
'data.frame': 28800 obs. of 10 variables:
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 31 31 31 31 31 31 31 31 31 31 ...
$ hour : int 21 21 21 21 21 21 21 21 21 21 ...
$ minute: int 0 0 0 0 0 0 0 0 0 0 ...
$ second: int 0 1 2 3 4 5 6 7 8 9 ...
$ Roll : num 65.8 65.8 66.1 65.6 65.6 ...
$ Pitch : num 6.35 6.34 6.24 6.4 6.27 ...
$ Yaw : num 171 172 174 176 176 ...
$ temp1 : num 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 ...
I really don't know what to do, I'm very confused by all of this..
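You want eggdat[eggdat$hour < 7 | eggdat$hour >= 21, ] (use < 7 rather than <= 7, otherwise hour 7 lands in both the daytime and the nighttime subset).
x <= 7 & x >= 21 translates to: x is at most 7 AND at least 21, which no number satisfies.
x < 7 | x >= 21 translates to: x is smaller than 7 OR at least 21.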
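The quoted version from the question behaves differently because comparing a number with a string makes R compare both as text, character by character. Here is a small sketch on a stand-in vector of hours (not the eggdat data); it uses < 7 for the night so the two subsets do not share hour 7, and it assumes the usual collation in which digit characters sort in numeric order.

```r
hours <- 0:23   # stand-in for eggdat$hour

# Numeric comparisons
night_and <- hours[hours <= 7 & hours >= 21]   # impossible condition: empty
night_or  <- hours[hours <  7 | hours >= 21]   # 0 to 6 and 21 to 23

# Quoted comparisons: hours is coerced to character, so "10" sorts
# before "7" because "1" comes before "7".
night_chr <- hours[hours <= '7' & hours >= '21']
day_chr   <- hours[hours >= '7' & hours <= '20']

night_or
night_chr
```

Under that assumption night_chr holds eight distinct hours (3 to 7 and 21 to 23), which is consistent with the 28800 observations reported in the question (8 hours of one-second readings), while day_chr is empty.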

Error in ncol(xj) : object 'xj' not found when using R matplot()

Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of airquality data.frame after dividing these 3 columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Starting with airquality before you mess with it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
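For this particular task apply() is not strictly needed, since dividing a data frame by a vector with one value per row works column by column. A minimal sketch on a copy of the built-in dataset (aq is a name chosen for the example):

```r
# Work on a copy so the built-in airquality stays untouched.
aq <- datasets::airquality

# aq[, 1] is a plain vector with one value per row, so each of the
# three selected columns is divided elementwise by Ozone.
aq[, 2:4] <- aq[, 2:4] / aq[, 1]

sapply(aq, class)          # no embedded data frames
m <- as.matrix(aq[, -1])   # a numeric matrix, suitable as y for matplot()
```

With the columns back to plain numeric vectors, the matplot() call from the question no longer hits the 'xj' error.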

Throw away first and last n rows

I have a data.table in R from which I want to throw away the first and the last n rows. I want to apply some filtering first and then truncate the results. I know I can do it this way:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
e2[100:(nrow(e2)-100)]
Is there a possibility of doing this in one line? I thought of something like:
example[row1%%2==0][100:-100]
This of course does not work, but is there a simpler solution which does not require an additional variable?
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
n = 5
str(example[!rownames(example) %in%
c( head(rownames(example), n), tail(rownames(example), n)), ])
Classes ‘data.table’ and 'data.frame': 990 obs. of 2 variables:
$ row1: num 6 7 8 9 10 11 12 13 14 15 ...
$ row2: num 17 20 23 26 29 32 35 38 41 44 ...
- attr(*, ".internal.selfref")=<externalptr>
Added a one-liner version with the selection criterion
str(
(res <- example[row1 %% 2 == 0])[ n:( nrow(res)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
And further added this version that does not use an intermediate named value
str(
example[row1 %% 2 == 0][n:(sum( row1 %% 2==0)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
In this case you know the name of one column (row1) that exists, so using length(<any column>) returns the number of rows within the unnamed temporary data.table:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
ans1 = e2[100:(nrow(e2)-100)]
ans2 = example[row1%%2==0][100:(length(row1)-100)]
identical(ans1,ans2)
[1] TRUE
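Another option, sketched here on a plain data.frame so that it runs without extra packages, is to use negative n with head() and tail(): tail(x, -n) drops the first n rows and head(x, -n) drops the last n. data.table provides its own head and tail methods, so the same pattern should carry over, but that part is an assumption left untested here. Note that an index range of the form n:(nrow - n) keeps row n itself, so it removes n - 1 rows from the front.

```r
example <- data.frame(row1 = seq(1, 1000, 1), row2 = seq(2, 3000, 3))
n <- 5

# Filter first, then drop exactly n rows from each end.
e2 <- example[example$row1 %% 2 == 0, ]
trimmed <- head(tail(e2, -n), -n)

nrow(trimmed)       # 500 filtered rows minus 2 * n
range(trimmed$row1)
```

The filtered table is named e2 here for readability; the two calls can equally be nested around the filter expression if a one-liner is preferred.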
