I have a data table which looks like this:
require(data.table)
df <- data.table(Day = seq(as.Date('2014-01-01'), as.Date('2014-12-31'), by = 'days'), Number = 1:365)
I want to subset my data table so that it returns just those of the first 110 rows whose Number value is higher than 10. When I use
df2 <- subset(df[1:110,], df$Number[1:110] > 10)
everything works well. However, if I subset using
df2 <- subset(df[1:110,], df[1:110,2] > 10)
R returns the following error:
Error in `[.data.table`(x, r, vars, with = FALSE) :
i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please report to data.table issue tracker if you'd like this, or add your comments to FR #657.
Shouldn't both ways of subsetting behave the same? The problem is that I want to use this subset in an apply command, and therefore the names of the data table change. Hence, I cannot use the column name with the $ operator to refer to the second column and want to use the index number instead, but it does not work. I could rename the data table columns, or read out the column names and use the $ operator, but my apply function runs over lots of entries and I want to minimize its workload.
So how do I make subsetting by index number work, and why do I get this error in the first place? I would like to understand what my mistake is. Thanks!
First, let's understand why it doesn't work in your case. When you do
df[1:110,2] > 10
# Number
# [1,] FALSE
# [2,] FALSE
# [3,] FALSE
# [4,] FALSE
# [5,] FALSE
# [6,] FALSE
# [7,] FALSE
#....
it returns a one-column matrix, which is then used for subsetting.
class(df[1:110,2] > 10)
#[1] "matrix"
This works fine on a data frame:
df1 <- data.frame(df)
subset(df1[1:110,], df1[1:110,2] > 10)
# Day Number
#11 2014-01-11 11
#12 2014-01-12 12
#13 2014-01-13 13
#14 2014-01-14 14
#15 2014-01-15 15
#....
but not on a data.table; unfortunately, subsetting with a matrix doesn't work there. You can extract the column as a vector instead of a matrix and then use it for subsetting:
subset(df[1:110,], df[1:110][[2]] > 10)
# Day Number
# 1: 2014-01-11 11
# 2: 2014-01-12 12
# 3: 2014-01-13 13
# 4: 2014-01-14 14
# 5: 2014-01-15 15
#...
The difference becomes clearer when you compare the results of
df[matrix(TRUE), ]
vs
df1[matrix(TRUE), ]
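A quick check of both (the exact data.table behavior may vary by version):
try(df[matrix(TRUE), ])    # data.table: errors, i may not be a matrix
nrow(df1[matrix(TRUE), ])  # data.frame: the single TRUE is recycled, all 365 rows return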
PS - in the first case doing
subset(df[1:110,], Number > 10)
would also have worked.
I have been struggling with R for 2 days without finding any solution!
Here is my problem:
I have a list of symbols extracted from one data frame: annotation$"SYMBOL".
I would like to bind it to another data frame, called "matrix", and assign the symbols as rownames.
I extracted the column and bound it without problems. However, I realized that once this was done, turning the symbols into rownames doesn't work, because ~5000 genes out of 15000 are changed to "NA".
I realized that it is actually all the genes with "NA" in their symbol that are treated as missing values.
I tried converting them with as.character(annotation$"SYMBOL"), but that doesn't change anything...
Here is what I tried:
X=as.character(annotation$"SYMBOL")
summary(X)
Length Class Mode
16978 character character
unique(unlist(lapply(as.character(annotation$"SYMBOL"), function(x) which(is.na(x)))))
[1] 1
Y=na.exclude(X)
summary(Y)
Length Class Mode
9954 character character
U=na.exclude(annotation$"SYMBOL")
Error in `$<-.data.frame`(`*tmp*`, "SYMBOL", value = c("SCYL3", "C1orf112", :
replacement has 9954 rows, data has 16978
And I know that all the genes with "NA" in their names are being replaced by NA...
Does someone have an idea how to get around this?
For example, numbers 11 and 15 in this image are deleted when I use the na.omit function...
To set your NA values, use df[df == "NA"] <- NA. I used this with your test dataset and it produced the desired results. You can then use na.omit() on your df to remove the newly set NA data. I don't have working code from you, so I will supply an outline of what your code should look like:
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
df
X1 X2
1 1 SCYL3
2 2 C1orf112
3 3 FGR
4 4 CFH
5 5 STPG1
6 6 NIPAL3
7 7 AK2
8 8 KDM1A
9 9 TTC22
10 10 ST7L
11 11 DNAJC11
12 12 FMO3
13 13 E2F2
14 14 CDK11A
15 15 NADK
16 16 CSDE1
17 17 MASP2
df[df == "NA"] <- NA
Before the replacement, is.na(df) returns FALSE everywhere, because the "NA" entries are just strings. After it, those cells hold real missing values, and na.omit(df) will drop their rows.
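A minimal, self-contained sketch of that flow (with a few made-up rows rather than your full dataset):
df <- data.frame(X1 = 1:4,
                 X2 = c("SCYL3", "NA", "FGR", "NA"),
                 stringsAsFactors = FALSE)
df[df == "NA"] <- NA  # the literal string "NA" becomes a real missing value
na.omit(df)           # rows 2 and 4 are now dropped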
I have a time series dataset with 1000 columns. Each row is, of course, a different record. There are some NA values scattered throughout the dataset.
I would like to replace each NA with either the adjacent left value or the adjacent right value; it doesn't matter which.
A neat solution, and the one I was going for, is to replace each NA with the value to its right, unless it is in the last column, in which case replace it with the value to its left.
I was just going to do a for loop, but I assume a function would be more efficient. Essentially, I wasn't sure how to reference the adjacent values.
Here is what I was trying:
for (entry in dataset) {
  if (any(is.na(entry)) == TRUE && entry[, 1:999]) {
    entry = entry[, 1]
  }
  else if (any(is.na(entry)) == TRUE && entry[, 1000]) {
    entry = cell[, -1]
  }
}
As you can tell, I'm inexperienced with R :) Not really sure how you index the values to the left or right.
I would suggest using na.locf on the transpose of your dataset.
The na.locf function from the zoo package replaces each NA with the closest non-NA observation, working down each column (or upward with fromLast = TRUE). Since you want the fill to run along rows, we simply transpose the dataset first:
library(zoo)
df=matrix(c(1,3,4,10,NA,52,NA, 11, 100), ncol=3)
step1 <- t(na.locf(t(df), fromLast=T))
step2 <- t(na.locf(t(step1), fromLast=F))
print(df)
#### [1,] 1 10 NA
#### [2,] 3 NA 11
#### [3,] 4 52 100
print(step2)
#### [1,] 1 10 10
#### [2,] 3 11 11
#### [3,] 4 52 100
I do it in two steps since the last column needs a different treatment from the inner columns. If you know the dplyr package, it's even more straightforward to turn this into a function:
library(dplyr)
MyReplace <- function(data) {data %>% t %>% na.locf(fromLast = TRUE) %>% na.locf %>% t}
MyReplace(df)
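A quick sanity check, assuming df and step2 from the snippet above are still in the workspace:
identical(MyReplace(df), step2)  # should be TRUE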
I have two similar data frames with the same number of columns, but different numbers of rows. Most of the entries between the two are the same, but in a few places there are differences, and these are what I care about. The first column in both data frames serves as a key.
Ideally, I'd like to be able to see whether they've changed, as well as the values from each of the two data frames. My first solution was to create a merged dataframe and re-organize the columns side-by-side like so:
df1<-data.frame(gene=c('cyp1a1','cyp2a6','srd5a','slc5a5','cox15'), updated=c(TRUE,TRUE,FALSE,TRUE,FALSE),version=c(2,3,1,2,1))
df2<-data.frame(gene=c('cyp1a1','cyp2a6','srd5a','slc5a5'), updated=c(FALSE,TRUE,FALSE,FALSE),version=c(1,2,1,1))
#merge data frames
comp<-merge(df1,df2, by="gene", all=TRUE)
#re-order columns side-by-side
#probably a better way to do this
ordList<-c(1,2,4,3,5)
comp<-comp[ordList]
So now I have a side-by-side comparison data frame. I am unsure how to iterate over the data frames to perform the comparison. Eventually I would like to create a new data frame that uses the comparison to blank out identical data (replacing it with an empty string) and keep the data that differs between the first df and the second.
This is what comp looks like now:
gene updated.x updated.y version.x version.y
1 cox15 FALSE NA 1 NA
2 cyp1a1 TRUE FALSE 2 1
3 cyp2a6 TRUE TRUE 3 2
4 slc5a5 TRUE FALSE 2 1
5 srd5a FALSE FALSE 1 1
This is what I want it to look like:
gene updated.x updated.y version.x version.y
1 cox15 FALSE NA 1 NA
2 cyp1a1 TRUE FALSE 2 1
3 cyp2a6 3 2
4 slc5a5 TRUE FALSE 2 1
5 srd5a
In my actual data, I have 14 columns in each data frame, and hundreds of rows. I may be doing similar comparisons in the future, so having a functional way of executing this task would be ideal.
Here is my suggestion, considering you have 14 columns:
library(data.table)
library(magrittr)
z = rbindlist(list(df1,df2), idcol=TRUE)
z[, lapply(.SD, . %>% unique %>% paste(collapse=";")), keyby=gene]
# gene .id updated version
# 1: cox15 1 FALSE 1
# 2: cyp1a1 1;2 TRUE;FALSE 2;1
# 3: cyp2a6 1;2 TRUE 3;2
# 4: slc5a5 1;2 TRUE;FALSE 2;1
# 5: srd5a 1;2 FALSE 1
This shows you which data frame each gene appears in (.id) as well as the attributes (updated and version). This display extends naturally to additional tables, like list(df1,df2,df3).
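For instance, with a hypothetical third table df3 (not part of the question's data), the same two lines work unchanged:
df3 <- data.frame(gene=c('cyp1a1','cox15'), updated=c(TRUE,FALSE), version=c(3,2))
z3 <- rbindlist(list(df1, df2, df3), idcol=TRUE)
z3[, lapply(.SD, . %>% unique %>% paste(collapse=";")), keyby=gene]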
If you really are not interested in unchanged values, you can hide them with an if test:
z[, lapply(.SD, function(x)
if (uniqueN(x)>1) x %>% unique %>% paste(collapse=";")
else ""
), keyby=gene]
# gene .id updated version
# 1: cox15
# 2: cyp1a1 1;2 TRUE;FALSE 2;1
# 3: cyp2a6 1;2 3;2
# 4: slc5a5 1;2 TRUE;FALSE 2;1
# 5: srd5a 1;2
This also hides .id for genes only showing up once, but that can be tweaked.
Explanation. z contains all the data, "stacked" or stored in "long" format.
To make the summary table, we use z[, j, keyby=gene] where j works on the Subset of Data, .SD, associated with each keyby=gene group and returns a list of column vectors for the result.
The . %>% unique %>% paste(collapse=";") uses a feature of magrittr: a pipeline that starts with the placeholder . becomes a function, so this is just an easy-to-read version of function(y) paste(unique(y), collapse=";"). (Started with a value x instead, the pipeline would be applied to x immediately.) You can replace it if you prefer to write these in the standard way.
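To see the equivalence, a quick check (assuming magrittr is loaded):
f1 <- . %>% unique %>% paste(collapse=";")
f2 <- function(y) paste(unique(y), collapse=";")
f1(c("a","b","b"))  # "a;b"
f2(c("a","b","b"))  # "a;b"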
Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. The { has to be escaped, and in R that requires two backslashes. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data:
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
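A compact sketch of just the rep() mechanics, rebuilt from the example data:
dat <- data.frame(id = 1:4,
                  interest_string = c("YI{Z0{ZI{", "ZO{", "<NA>", "ZT{"))
out <- strsplit(as.character(dat$interest_string), "{", fixed = TRUE)
sapply(out, length)                       # 3 1 1 1
rep(dat$id, times = sapply(out, length))  # 1 1 1 2 3 4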
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
  list(interest = unlist(strsplit(interest_string, "{", fixed = TRUE)))
}, by = id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
I'm trying to get a handle on the ubiquitous which function. Until I started reading questions/answers on SO I never found the need for it. And I still don't.
As I understand it, which takes a Boolean vector and returns a weakly shorter vector containing the indices of the elements which were true:
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> x <- seq(10)
> tf <- (x == 6 | x == 8)
> tf
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
> w <- which(tf)
> w
[1] 6 8
So why would I ever use which instead of just using the Boolean vector directly? I could maybe see some memory issues with huge vectors, since length(w) << length(tf), but that's hardly compelling. And there are some options in the help file which don't add much to my understanding of possible uses of this function. The examples in the help file aren't of much help either.
Edit for clarity: I understand that which returns the indices. My question is about two things: 1) why would you ever need the indices instead of just using the Boolean selector vector? and 2) what interesting behaviors of which might make it preferable to a vectorized Boolean comparison?
Okay, here is something where it proved useful last night:
In a given vector of values, what is the index of the 3rd non-NA value?
> x <- c(1,NA,2,NA,3)
> which(!is.na(x))[3]
[1] 5
A little different from DWin's use, although I'd say his is compelling too!
The title of the man page ?which provides a motivation. The title is:
Which indices are TRUE?
Which I interpret as being the function one might use if you want to know which elements of a logical vector are TRUE. This is inherently different from just using the logical vector itself: that would select the elements that are TRUE, not tell you which of them were TRUE.
Common use cases were to get the position of the maximum or minimum values in a vector:
> set.seed(2)
> x <- runif(10)
> which(x == max(x))
[1] 5
> which(x == min(x))
[1] 7
Those were so commonly used that which.max() and which.min() were created:
> which.max(x)
[1] 5
> which.min(x)
[1] 7
However, note that the specific forms are not exact replacements for the generic form. See ?which.min for details. One example is below:
> x <- c(4,1,1)
> which.min(x)
[1] 2
> which(x==min(x))
[1] 2 3
Two very compelling reasons not to forget which:
1) When you use "[" to extract from a data frame, any calculation in the row position that produces NA returns a junk row. Wrapping the condition in which removes the NAs. You can also use subset or %in%, which do not create the same problem.
> dfrm <- data.frame( a=sample(c(1:3, NA), 20, replace=TRUE), b=1:20)
> dfrm[dfrm$a >0, ]
a b
1 1 1
2 3 2
NA NA NA
NA.1 NA NA
NA.2 NA NA
6 1 6
NA.3 NA NA
8 3 8
# Snipped remaining rows
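Wrapping the same condition in which() drops those junk rows (a sketch against the dfrm built above):
dfrm[which(dfrm$a > 0), ]  # only rows where a is a positive, non-NA value
subset(dfrm, a > 0)        # subset() treats the NAs as FALSE, with the same effect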
2) When you need the array indices.
which can be useful (saving both computer and human resources), e.g. if you have to filter the elements of a data frame/matrix by a given variable/column and update other variables/columns based on that. Example:
df <- mtcars
Instead of:
df$gear[df$hp > 150] <- mean(df$gear[df$hp > 150])
You could do:
p <- which(df$hp > 150)
df$gear[p] <- mean(df$gear[p])
An extra case is when you have to filter already-filtered elements, which cannot be done with a simple & or |, e.g. when you have to update some parts of a data frame based on other data tables. Then you need to store (at least temporarily) the indexes of the filtered elements.
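A sketch of such a two-stage filter (reusing the mtcars copy from above; the cut-offs are made up):
p <- which(df$hp > 150)   # first filter, stored as indexes
p2 <- p[df$mpg[p] < 15]   # second filter applied on top of the first
df$gear[p2] <- mean(df$gear[p2])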
Another use that comes to my mind is when you have to loop through part of a data frame/matrix, or do other kinds of transformations that require knowing the indexes of several cases. Example:
> urban <- which(USArrests$UrbanPop > 80)
> USArrests[urban, ] - USArrests[urban - 1, ]
Murder Assault UrbanPop Rape
California 0.2 86 41 21.1
Hawaii -12.1 -165 23 -5.6
Illinois 7.8 129 29 9.8
Massachusetts -6.9 -151 18 -11.5
Nevada 7.9 150 19 29.5
New Jersey 5.3 102 33 9.3
New York -0.3 -31 16 -6.0
Rhode Island -2.9 68 15 -6.6
Sorry for the dummy examples; I know it doesn't make much sense to compare the most urbanized US states with the states just before them in the alphabet, but I hope the idea comes across :)
Checking out which.min and which.max also gives a clue, as they save you some typing. Example:
> row.names(mtcars)[which.max(mtcars$hp)]
[1] "Maserati Bora"
Well, I found one possible reason. At first I thought it might be the useNames option, but it turns out that simple Boolean selection does that too.
However, if your object of interest is a matrix, you can use the arr.ind option to return the result as (row, column) ordered pairs:
> x <- matrix(seq(10),ncol=2)
> x
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> which((x == 6 | x == 8),arr.ind=TRUE)
row col
[1,] 1 2
[2,] 3 2
> which((x == 6 | x == 8))
[1] 6 8
That's a handy trick to know about, but hardly seems to justify its constant use.
Surprised no one has answered this: how about memory efficiency?
If you have a long vector of very sparse TRUEs, then keeping track of only the indices of the TRUE values is probably much more compact.
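A rough illustration (the exact sizes are approximate and platform-dependent):
x <- rep(FALSE, 1e6)
x[c(10, 5e5)] <- TRUE
object.size(x)         # roughly 4 MB for the logical vector
object.size(which(x))  # a few dozen bytes for the two indices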
I use it quite often in data exploration. For example, if I have a dataset of kids' data and see from summary that the max age is 23 (when it should be 18), I might go:
sum(dat$age>18)
If that were 67 and I wanted a closer look, I might use:
dat[which(dat$age>18)[1:10], ]
Also useful if you're making a presentation and want to pull out a snippet of data to demonstrate a certain oddity or whatnot.