Will head() and tail() functions in R change the order of output?

I know the head() and tail() functions return the first or last parts of a dataset, but I want to know whether the two functions sort the output or just return it in its existing order. Many thanks in advance!

As you can see below, they do keep the original order:
df <- data.frame(number = 1:26, letter = letters[1:26])
> head(df)
  number letter
1      1      a
2      2      b
3      3      c
4      4      d
5      5      e
6      6      f
> tail(df)
   number letter
21     21      u
22     22      v
23     23      w
24     24      x
25     25      y
26     26      z
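Neither function sorts: each simply takes the first or last n rows in whatever order the data frame already has. A minimal sketch with a shuffled copy of df illustrates this (set.seed is only an assumption for reproducibility):
set.seed(1)                        # assumed seed, reproducibility only
shuffled <- df[sample(nrow(df)), ] # randomize the row order
head(shuffled, 3)                  # returns the first 3 rows as-is, still shuffled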

Related

How to extract diagonal elements from dataframe and store in a variable?

I have a simple 9 element dataframe.
   A  B  C
1  8 21  1
2 40 25 32
3 10 15 49
I want to extract the diagonal elements and store them in a variable. Is there an easier way to do this than taking one number out at a time?
In this case, as they are all numeric, you can use:
df <- data.frame(a = c(4,8,10), b = c(25,24,15), c = c(1,32,49))
df
   a  b  c
1  4 25  1
2  8 24 32
3 10 15 49
Converting to a matrix then lets diag() take the diagonal:
diag(as.matrix(df))
[1]  4 24 49
You can use the diag function which extracts the diagonal of a matrix:
Data <- data.frame(a = c(1,2,3), b = c(11,12,13), c = c(111,112,113))
Data2 <- as.matrix(Data)
Result <- diag(Data2)
Result # Returns 1 12 113
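Note that as.matrix() coerces every column to a common type, so on a data frame with mixed column types diag(as.matrix(...)) would return characters. A minimal sketch that avoids the coercion by indexing element-by-element (an alternative offered as an assumption, not required for the all-numeric frames above):
# Extract Data[i, i] for each i without converting the frame to a matrix
diag_vals <- sapply(seq_len(min(dim(Data))), function(i) Data[i, i])
diag_vals # same result as diag(as.matrix(Data)) here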

Group-specific ID numbers using group_indices or similar in R

I am trying to group a series of observations by two columns, and then create a third column with an id number. I've tried group_indices, but that gives every combination of observations a unique number; I want the numbering to restart at 1 for the first observation within each group.
In my data there are a series of Sites with a number of rows showing the calendar Day when an observation was collected. I want to calculate the chronological day within a Site.
library(dplyr)
# Make some data
df <- data.frame(Site = rep(c("A", "B", "C"), each = 70),
                 Day = as.integer(rep(c(21,22,23,24,25,26,27, 1,2,3,4,5,6,7,
                                        24,25,26,27,28,29,30), each = 10)))
# Create Day Number column (this doesn't actually work, but is the sort
# of thing I'm looking for...)
df <- df %>%
  group_by(Site, Day) %>%
  mutate(Day.Number = group_indices(Day))
# Desired output
    Site Day Day.Number
1      A  21          1
2      A  21          1
3      A  21          1
...
11     A  22          2
12     A  22          2
13     A  22          2
14     A  22          2
15     A  22          2
...
141    C  24          1
142    C  24          1
143    C  24          1
144    C  24          1
...
151    C  25          2
152    C  25          2
153    C  25          2
154    C  25          2
155    C  25          2
...
This is just a toy dataset to demonstrate the problem. Although most sites will have ten observations on each of seven days, that is not guaranteed, so I can't just build the column with a sequence of rep() calls.
There is a bit of a discussion about this on GitHub here and here, but it doesn't seem to have been resolved. Any suggestions for workarounds are much appreciated.
Here's one way to do it:
df <- df %>%
  left_join(unique(df) %>% group_by(Site) %>% mutate(Day.Number = 1:n()))
head(df)
#   Site Day Day.Number
# 1    A  21          1
# 2    A  21          1
# 3    A  21          1
# 4    A  21          1
# 5    A  21          1
# 6    A  21          1
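A hedged alternative that skips the join entirely: within each Site, number each distinct Day by its order of first appearance. This is a sketch of the same idea, assuming the rows of a site are already in chronological order:
library(dplyr)
# match() against the unique Day values gives 1 for the first distinct
# day seen in each Site, 2 for the second, and so on
df <- df %>%
  group_by(Site) %>%
  mutate(Day.Number = match(Day, unique(Day))) %>%
  ungroup()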

aggregate using "factors" that are NA

I'm struggling to aggregate a data frame into the format I want. The data frame contains a series of parts, along with a list of tests that are performed (Length and Width), and a lower and upper limit (LL and UL) for each measurement. Some of the tests don't have one or the other limit. I'm trying to get a count of how many parts have a given "test-LL-UL" combination, including those tests with NA as one of the limits.
What I've tried so far is the following:
df <- read.table(header = TRUE, text = "
Part Test LL UL
A    L    20 40
A    W     5  7
B    L    20 NA
B    W     5  7
C    L    20 40
C    W    10 30
")
aggregate(data = df, Part ~ Test + LL + UL, FUN = length, na.action = na.pass)
This gives the following output:
  Test LL UL Part
1    W  5  7    2
2    W 10 30    1
3    L 20 40    2
What I was expecting to get was:
  Test LL UL Part
1    W  5  7    2
2    W 10 30    1
3    L 20 40    2
4    L 20 NA    1
Any help would be greatly appreciated!
dplyr handles this quite nicely:
library(dplyr)
df %>% group_by(Test, LL, UL) %>% summarise(n())
The {dplyr} package's group_by() and summarize() functions handle this:
df <- data.frame(Part = c("A","A","B","B","C","C"),
                 Test = c("L","W","L","W","L","W"),
                 LL = c(20,5,20,5,20,10),
                 UL = c(40,7,NA,7,40,30))
grouped <- dplyr::group_by(df, Test, LL, UL)
summarize(grouped, count = n())
##     Test    LL    UL count
##   (fctr) (dbl) (dbl) (int)
## 1      L    20    40     2
## 2      L    20    NA     1
## 3      W     5     7     2
## 4      W    10    30     1
In line with Jimbou's suggestion, the following works (but feels a little messy):
df <- read.table(header = TRUE, text = "
Part Test LL UL
A    L    20 40
A    W     5  7
B    L    20 NA
B    W     5  7
C    L    20 40
C    W    10 30
")
df[is.na(df)] <- "NA"
df <- aggregate(data = df, Part ~ Test + LL + UL, FUN = length, na.action = na.pass)
df$UL <- as.numeric(df$UL)
I think the appropriate thing to do is to set the Upper Limits to Inf and the Lower Limits to -Inf (this more accurately reflects the meaning of the limits). In this case, the aggregate works as I'd expect.
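A minimal sketch of that approach, starting again from a freshly read df (before the string workaround above mutates it): replacing each missing bound with the corresponding infinity keeps the limit columns numeric, and aggregate() then counts all four combinations with no na.action workaround needed.
df$LL[is.na(df$LL)] <- -Inf # a missing lower limit means 'no lower bound'
df$UL[is.na(df$UL)] <- Inf  # a missing upper limit means 'no upper bound'
aggregate(data = df, Part ~ Test + LL + UL, FUN = length)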

automating a normal transformation function in R over multiple columns

I have a data frame m with:
> m
  id  w  y  z
1  1  2  5  8
2  2 18  5 98
3  3  1 25  5
4  4 52 25  8
5  5  5  5  4
6  6  3  3  5
Below is a general function for normally transforming a variable, which I need to apply to columns w, y, and z:
y <- qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n <- qnorm((rank(m$w, na.last = "keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame; the one I have is much larger, and I have more letter columns than w, y, z to run this function on.
Thanks!
Probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6))
df
  id  w  z
1  1 39 40
2  2 20 26
3  3 43 11
4  4  4 37
5  5 36 24
6  6 27 14
transCols <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
  id  w  z        w_n        z_n
1  1 39 40 -0.2104284 -1.3829941
2  2 20 26  1.3829941  1.3829941
3  3 43 11  0.2104284  0.6744898
4  4  4 37 -1.3829941  0.2104284
5  5 36 24  0.6744898 -0.6744898
6  6 27 14 -0.6744898 -0.2104284
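For a true single step, a hedged sketch using dplyr's across() (assuming dplyr >= 1.0; the lambda is the same rank-based transform as above):
library(dplyr)
# Apply the transform to every column except id, naming the results "<col>_n"
m_final <- m %>%
  mutate(across(-id,
                ~ qnorm((rank(.x, na.last = "keep") - 0.5) / sum(!is.na(.x))),
                .names = "{.col}_n"))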

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col == something] <- adataframe[adataframe$col == something] + 1
This is long and tedious to type. Is there some way to reference the object I am trying to change, something like
adataframe[adataframe$col == something] <- $self + 1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they were variables; i.e., there is no need for $. The second part, col2:=col3+1, then assigns the RHS to the LHS within that subset, where column names can likewise be assigned to as if they were variables. := is assignment by reference: no copies of any object are taken, so it is faster than <-, =, within, and transform.
Also, one end goal of j's syntax allowing := like that is to combine it with by (soon to be implemented in v1.8.1); see this question: when should I use the := operator in data.table?
UPDATE: That was indeed released (:= by group) in July 2012.
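A minimal sketch of := by group (the table and column names here are hypothetical, and data.table >= 1.8.1 is assumed):
library(data.table)
# Add a per-group running total by reference; no copy of DT is made
DT <- data.table(col1 = c("a", "a", "b", "b"), col3 = 1:4)
DT[, col2 := cumsum(col3), by = col1]
DT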
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc = 1) {
  eval.parent(substitute(x <- x + inc))
}
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took on the challenge of adding a further function argument. Supplied with a portion of a dataframe as its single data argument, it will still increment the range of values by one. So far it has only been lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function that accepts exactly two arguments:
transfn <- function(x, func = "+", inc = 1) {
  eval.parent(substitute(x <- do.call(func, list(x, inc))))
}
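For instance, a hypothetical call such as the following would rescale a column in place (using the df defined just below):
transfn(df$a1, "*", 2) # multiplies df$a1 by 2, assigning back into df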
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 = 1:10, a2 = 21:30, b = 1:2)
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
# ---- examples ---->
> inc(df$a1) # works on whole columns
> df
   a1 a2 b
1   2 21 1
2   3 22 2
3   4 23 1
4   5 24 2
5   6 25 1
6   7 26 2
7   8 27 1
8   9 28 2
9  10 29 1
10 11 30 2
> inc(df$a1[df$a1 > 5]) # testing on a restricted range of one column
> df
   a1 a2 b
1   2 21 1
2   3 22 2
3   4 23 1
4   5 24 2
5   7 25 1
6   8 26 2
7   9 27 1
8  10 28 2
9  11 29 1
10 12 30 2
> inc(df[df$a1 > 5, ]) # testing on a range of rows, all columns transformed
> df
   a1 a2 b
1   2 21 1
2   3 22 2
3   4 23 1
4   5 24 2
5   8 26 2
6   9 27 3
7  10 28 2
8  11 29 3
9  12 30 2
10 13 31 3
# and even on selected rows and grepped names of columns meeting a criterion
> inc(df[df$a1 <= 3, grep("a", names(df))])
> df
   a1 a2 b
1   3 22 1
2   4 23 2
3   4 23 1
4   5 24 2
5   8 26 2
6   9 27 3
7  10 28 2
8  11 29 3
9  12 30 2
10 13 31 3
Here is what you can do. Say you have a dataframe
df <- data.frame(x = 1:10, y = rnorm(10))
and you want to increment all the y values by 1. You can do this easily using transform:
df <- transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col == something
adataframe[ridx, ] <- adataframe[ridx, ] + 1
which doesn't rely on any fancy or fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. It also tends to break lines into nicely human-parseable units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.
