I have a data.table in R where I want to throw away the first and the last n rows. I want to apply some filtering first and then truncate the results. I know I can do it this way:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
e2[100:(nrow(e2)-100)]
Is there a possibility of doing this in one line? I thought of something like:
example[row1%%2==0][100:-100]
This of course does not work, but is there a simpler solution which does not require an additional variable?
library(data.table)
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
n = 5
str(example[!rownames(example) %in%
c( head(rownames(example), n), tail(rownames(example), n)), ])
Classes ‘data.table’ and 'data.frame': 990 obs. of 2 variables:
$ row1: num 6 7 8 9 10 11 12 13 14 15 ...
$ row2: num 17 20 23 26 29 32 35 38 41 44 ...
- attr(*, ".internal.selfref")=<externalptr>
Added a one-liner version with the selection criterion
str(
(res <- example[row1 %% 2 == 0])[ n:( nrow(res)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
And further added this version that does not use an intermediate named value
str(
example[row1 %% 2 == 0][n:(sum( row1 %% 2==0)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
In this case you know the name of one column (row1) that exists, so using length(<any column>) returns the number of rows within the unnamed temporary data.table:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
ans1 = e2[100:(nrow(e2)-100)]
ans2 = example[row1%%2==0][100:(length(row1)-100)]
identical(ans1,ans2)
[1] TRUE
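As a minimal sketch of two other one-liners, assuming a reasonably recent data.table: the special symbol .N can be used inside i as the row count of the table being subset, and head()/tail() accept a negative n to drop rows. Unlike 100:(nrow(e2)-100) above, which keeps row 100 itself, this sketch drops exactly n rows from each end by starting at n + 1. The names trimmed_N and trimmed_ht are only illustrative.

```r
library(data.table)

example <- data.table(row1 = seq(1, 1000, 1), row2 = seq(2, 3000, 3))
n <- 5

# .N inside i refers to the number of rows of the filtered (temporary) table
trimmed_N <- example[row1 %% 2 == 0][(n + 1):(.N - n)]

# tail(x, -n) drops the first n rows, head(x, -n) drops the last n rows
trimmed_ht <- head(tail(example[row1 %% 2 == 0], -n), -n)
```

The head()/tail() form is likely the safer choice when the filtered table may have fewer than 2 * n rows, because (n + 1):(.N - n) would then count backwards instead of returning an empty result.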
Why is a list being added to my dataframe here?
Here's my dataframe
df <- data.frame(ch = rep(1:10, each = 12), # care home id
year_id = rep(2018),
month_id = rep(1:12), # month using the system over the course of a year (1 = first month, 2 = second month...etc.)
totaladministrations = rbinom(n=120, size = 1000, prob = 0.6), # administrations that were scheduled to have been given in the month
missed = rbinom(n=120, size = 20, prob = 0.8), # administrations that weren't given in the month (these are bad!)
beds = rep(rbinom(n = 10, size = 60, prob = 0.6), each = 12), # number of beds in the care home
rating = rep(rbinom(n= 10, size = 4, prob = 0.5), each = 12)) # latest inspection rating (1. Inadequate, 2. Requires Improving, 3. Good, 4 Outstanding)
df <- arrange(df, df$ch, df$year_id, df$month_id)
str(df)
'data.frame': 120 obs. of 7 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations: int 576 598 608 576 608 637 611 613 593 626 ...
$ missed : int 18 18 19 16 16 13 17 16 15 17 ...
$ beds : int 38 38 38 38 38 38 38 38 38 38 ...
$ rating : int 2 2 2 2 2 2 2 2 2 2 ...
All good so far.
I just want to add another column that sequences the month number within the ch group (this equates to the actual month_id in this example but ignore that, my real life data is different), so I'm using:
df <- df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n())
This appears to add a bunch of stuff I don't really understand, want, or need, such as a list ...
str(df)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
$ ch : int 1 1 1 1 1 1 1 1 1 1 ...
$ year_id : num 2018 2018 2018 2018 2018 ...
$ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ totaladministrations : int 601 590 593 599 615 611 628 587 604 600 ...
$ missed : int 16 14 17 16 18 16 15 18 15 20 ...
$ beds : int 35 35 35 35 35 35 35 35 35 35 ...
$ rating : int 3 3 3 3 3 3 3 3 3 3 ...
$ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "groups")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
..$ ch : int 1 2 3 4 5 6 7 8 9 10
..$ .rows:List of 10
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 37 38 39 40 41 42 43 44 45 46 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
.. ..$ : int 73 74 75 76 77 78 79 80 81 82 ...
.. ..$ : int 85 86 87 88 89 90 91 92 93 94 ...
.. ..$ : int 97 98 99 100 101 102 103 104 105 106 ...
.. ..$ : int 109 110 111 112 113 114 115 116 117 118 ...
..- attr(*, ".drop")= logi TRUE
What's going on here? I just want a data frame. Why is there all that additional output after $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ..., and more importantly, can I ignore it and just keep treating it as a normal data frame (I'll be running some generalised linear mixed models on the df)?
The attribute "groups" is where dplyr stores the grouping information added when you did group_by(ch). It doesn't hurt anything, and it will disappear if you ungroup():
df %>% group_by(ch) %>%
mutate(sequential_month_counter = 1:n()) %>%
ungroup %>%
str
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 120 obs. of 8 variables:
# $ ch : int 1 1 1 1 1 1 1 1 1 1 ...
# $ year_id : num 2018 2018 2018 2018 2018 ...
# $ month_id : int 1 2 3 4 5 6 7 8 9 10 ...
# $ totaladministrations : int 575 597 579 605 582 599 577 604 630 632 ...
# $ missed : int 18 16 16 18 18 11 10 13 17 16 ...
# $ beds : int 33 33 33 33 33 33 33 33 33 33 ...
# $ rating : int 3 3 3 3 3 3 3 3 3 3 ...
# $ sequential_month_counter: int 1 2 3 4 5 6 7 8 9 10 ...
As a side-note, you should use bare column names inside dplyr verbs, not data$column. With arrange, it doesn't much matter, but in grouped operations it will cause bugs. You should get in the habit of using arrange(df, ch, year_id, month_id) instead of arrange(df, df$ch, df$year_id, df$month_id).
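A minimal sketch of both points, using a made-up toy data frame (toy, g, x and the new column names are assumptions for illustration only): ungroup() followed by as.data.frame() gives back a plain data frame, and a data$column reference inside a grouped mutate() ignores the grouping.

```r
library(dplyr)

toy <- data.frame(g = rep(c("a", "b"), each = 3),
                  x = c(1, 2, 3, 10, 20, 30))

res <- toy %>%
  group_by(g) %>%
  mutate(
    counter     = row_number(),  # position within each group, like 1:n()
    mean_bare   = mean(x),       # bare name: mean computed per group
    mean_dollar = mean(toy$x)    # toy$x: whole column, grouping is ignored
  ) %>%
  ungroup() %>%                  # drops the "groups" attribute
  as.data.frame()                # drops the tibble classes as well

class(res)
```

Here mean_bare differs between the two groups while mean_dollar is the same overall mean in every row, which is the kind of silent bug the side-note warns about.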
I have a data frame. Calling str(data) shows the following structure:
> str(data)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
However, when I try to subset the rows with Ozone above 14, the following code gives me an error:
> data[data$Ozone > 14 ]
Error in `[.data.frame`(data, data$Ozone > 14) : undefined columns selected
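You want the rows where that condition is true, so you need a comma. With a single index and no comma, a data frame is indexed by column, which is why R complains about undefined columns: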
data[data$Ozone > 14, ]
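One thing to watch for with this data: Ozone contains NA values, and a logical row index that is NA produces a row filled with NA. A small sketch, assuming data is a copy of the built-in airquality data set (it is called aq here, and with_na and clean are illustrative names):

```r
aq <- datasets::airquality

# The condition is NA wherever Ozone is NA
cond <- aq$Ozone > 14

with_na <- aq[cond, ]          # keeps one all-NA row per NA in cond
clean   <- aq[which(cond), ]   # which() keeps only the TRUE positions
```

subset(aq, Ozone > 14) is another way to get the same rows as clean, since subset() treats NA in the condition as FALSE.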
I need to expand some data and then restrict which data remains through tail.
Example of data:
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
str(short_lists)
List of 3
$ :List of 1
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 1
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 1
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And these are the tail lengths I want to take from each of list_1, list_2, list_3:
how_long <-
c(4,2,5,3,6,4,7,5,8,6,9,7,10,8,2,4,6,8,10,12,14,10,9,7,11)
And I expand through nested for loops and try to get the tail of the expanded lists, but just get the expanded lists.
for (i in 1:length(how_long)) {
for (j in 1:length(short_lists)) {
tail_temp[[j]][i] <- tail(short_lists2[[j]], n = how_long[i])
}
}
And this yields:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
[snip]
..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ...
$ :List of 25
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
[snip]
..$ : int [1:15] 16 17 18 19 20 21 22 23 24 25 ...
$ :List of 25
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
[snip]
..$ : int [1:15] 31 32 33 34 35 36 37 38 39 40 ...
And I'm happy the j's were expanded, but I never get to the tail call and what I'm seeking:
str(tail_temp)
List of 3
$ :List of 25
..$ : int [1:4] 12 13 14 15
..$ : int [1:2] 14 15
..$ : int [1:5] 11 12 13 14 15
[snip]
So what simple thing am I missing? Any help appreciated. Thanks.
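Very close indeed.
I prefer vectors over lists in R. If you're familiar with Python, R vectors behave roughly like Python lists, whereas R lists behave more like dictionaries.
Each element of short_lists is itself a list of length one, so tail() on it returns that whole one-element list regardless of n, which is why you only see the expanded lists.
Therefore, you just need to unlist first (into a vector), take the tail, and then assign the result to an item of a list, hence it should be assigned to:
tail_temp[[i]][[j]] instead of tail_temp[i][[j]]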
list_1 <- list(1:15)
list_2 <- list(16:30)
list_3 <- list(31:45)
short_lists = list()
short_lists[[1]] <- list_1
short_lists[[2]] <- list_2
short_lists[[3]] <- list_3
how_long <- c(4,2,5,3,6,4,7,5,8,6,9,7,10,
8,2,4,6,8,10,12,14,10,9,7,11)
tail_temp = list()
for (i in 1:length(short_lists)){
tail_temp[[i]] = list()
for (j in 1:length(how_long)){
tail_temp[[i]][[j]] <- tail(unlist(short_lists[[i]]), n = how_long[j])
}
}
Output
[[1]]
[[1]][[1]]
[1] 12 13 14 15
[[1]][[2]]
[1] 14 15
[[1]][[3]]
[1] 11 12 13 14 15
…
[[3]][[23]]
[1] 37 38 39 40 41 42 43 44 45
[[3]][[24]]
[1] 39 40 41 42 43 44 45
[[3]][[25]]
[1] 35 36 37 38 39 40 41 42 43 44 45
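The same result can be sketched without explicit loops or pre-allocation by nesting lapply(). This is only an alternative under the same assumptions as the answer's code; how_long is shortened to three values here to keep the example small.

```r
short_lists <- list(list(1:15), list(16:30), list(31:45))
how_long <- c(4, 2, 5)

# tail() on a one-element list returns the whole list, whatever n is
still_wrapped <- tail(short_lists[[1]], n = 4)

# Unlist each entry to a vector first, then take one tail per length
tail_temp <- lapply(short_lists, function(s) {
  v <- unlist(s)
  lapply(how_long, function(n) tail(v, n = n))
})

str(tail_temp[[1]])
```

still_wrapped shows the behaviour from the question: it still holds the full 15-element vector, because the thing being tailed has length one.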
Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of airquality data.frame after dividing these 3 columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Start with airquality before you mess with it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
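A short sketch of the difference between the indexing forms, working on a copy (aq and scaled are illustrative names) so the built-in airquality is left untouched. It also shows that, for this particular task, the columns can be divided directly without apply(), since dividing a data frame by a vector of length nrow is applied to each column.

```r
aq <- datasets::airquality

class(aq[1])     # "data.frame": a one-column data frame (a list of length one)
class(aq[, 1])   # "integer": the column as a plain vector
class(aq[[1]])   # "integer": same vector via list-style extraction

# Divide columns 2 to 4 by the Ozone vector, column by column
scaled <- aq
scaled[2:4] <- aq[2:4] / aq[[1]]

sapply(scaled, class)
```

After this, every column of scaled is still an atomic vector, so as.matrix(scaled[, -1]) gives the numeric matrix that matplot() expects.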
I have 2142 rows and 9 columns in my data frame. When I call head(df), the data frame appears fine, something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable? Storage Unit Order Number
2209 NEZ0037-76 FreezerWorks NEZ0037 BoxPos 1 N 76
2210 NEZ0037-77 FreezerWorks NEZ0037 BoxPos 1 N 77
2211 NEZ0037-78 FreezerWorks NEZ0037 BoxPos 1 N 78
2212 NEZ0037-79 FreezerWorks NEZ0037 BoxPos 1 N 79
2213 NEZ0037-80 FreezerWorks NEZ0037 BoxPos 1 N 80
2214 NEZ0037-81 FreezerWorks NEZ0037 BoxPos 1 N 81
Description Storage.Label
2209 I4
2210 I5
2211 I6
2212 I7
2213 I8
2214 I9
However, when I call write.csv or write.table, I get an incoherent output. Something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable
1 NEZ0011 FreezerWorks NEZ0011 Box-9X9 81 Y
39 40 41 42 43 44 45
80 81 "Box-9X9 NEZ0014" 1 2 3 4
38 39 40 41 42 43 44
79 80 81 "Box-9X9 NEZ0017" 1 2 3
37 38 39 40 41 42 43
78 79 80 81 "Box-9X9 NEZ0020" 1 2
36 37 38 39 40 41 42
77 78 79 80 81 "Box-9X9 NEZ0023" 1
35 36 37 38 39 40 41
76 77 78 79 80 81 "Box-9X9 NEZ0026"
Calling sapply(df, class) reveals that all columns in the data frame are [1] "factor" except for $Storage.Level, which is [1] "data.table" "data.frame". When I called unlist on $Storage.Level, the output was better, but it changed the values in the column. I also tried df <- data.frame(df, stringsAsFactors=FALSE) without success. Neither data.frame(lapply(df, factor)), as suggested in the thread here, nor as.data.frame, as suggested in the thread here, worked. Is there a way to unlist $Storage.Level without tampering with the values in the column? Or maybe there is a way to change the class from "data.table" "data.frame" to factor and output the data safely.
R version 3.0.3 (2014-03-06)
It sounds like you have something like this:
library(data.table)
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.table(df)
str(df)
# 'data.frame': 2 obs. of 3 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC:Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# ..$ A: int 1 2
# ..$ C: int 3 4
# ..- attr(*, ".internal.selfref")=<externalptr>
sapply(df, class)
# $A
# [1] "integer"
#
# $C
# [1] "integer"
#
# $AC
# [1] "data.table" "data.frame"
If that's the case, you will have trouble writing to a csv file.
Try first calling do.call(data.frame, your_data_frame) to see if that sufficiently "flattens" your data.frame, as it does with this example.
str(do.call(data.frame, df))
# 'data.frame': 2 obs. of 4 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC.A: int 1 2
# $ AC.C: int 3 4
You should be able to write this to a csv file without any problems.
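A small end-to-end sketch of that flatten-then-write step. To avoid extra packages it nests a plain data.frame instead of a data.table, on the assumption that the nested column behaves the same way for this purpose; out_file is just a temporary path used for illustration.

```r
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.frame(A = 1:2, C = 3:4)   # nested data frame column (stand-in)

# Flatten: every nested column becomes a top-level column named AC.<name>
flat <- do.call(data.frame, df)

# Write and read back to confirm the file is a regular rectangular CSV
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
back <- read.csv(out_file)

str(back)
```

If a column is still not atomic after flattening, sapply(flat, class) should show which one needs further attention before writing.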