dplyr::filter nested dataframe - r
Question:
How to filter rows based on a nested data frame using dplyr::filter?
Problem:
The following code builds the dataset needed for a working example.
Using it I can subset with which, but I am having problems using dplyr because of the nested data frames.
Now, I appreciate that I could flatten the data frame using jsonlite; however, I am interested to know if, and how, I might harness dplyr without flattening the data frame.
All help gratefully received and appreciated.
requiredPackages <- c("devtools","dplyr","tidyr","data.table","ggplot2","ggvis","RMySQL", "jsonlite", "psych", "plyr", "knitr")
ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
ipak(requiredPackages)
dataDir <- "./data"
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/yelp_dataset_challenge_academic_dataset.zip"
filePath <- file.path(dataDir)
# Does the directory exist? If it doesn't, create it
if (!file.exists(dataDir)) {
  dir.create(dataDir)
}
# Now we check whether the data have already been downloaded into
# "./data/yelp_dataset_challenge_academic_dataset". If not, we download
# the zip file and extract it under the data directory.
if (!file.exists(file.path(dataDir, "yelp_dataset_challenge_academic_dataset"))) {
  temp <- tempfile()
  download.file(fileUrl, temp, mode = "wb", method = "curl")
  unzip(temp, exdir = dataDir)
  unlink(temp)
}
if (!exists("yelpBusinessData")) {
  if (file.exists(file.path(dataDir, "yelpBusinessData.rds"))) {
    yelpBusinessData <- readRDS(file.path(dataDir, "yelpBusinessData.rds"))
  } else {
    yelpBusinessDataFilePath <- file.path(dataDir,
      "yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json")
    yelpBusinessData <- fromJSON(sprintf("[%s]",
                                         paste(readLines(yelpBusinessDataFilePath),
                                               collapse = ",")),
                                 flatten = FALSE)
    str(yelpBusinessData, max.level = 1)
    # Fix the column-name duplication issue:
    # if and when you flatten the data, you create two columns with the
    # same column id, e.g. yelpBusinessData$attributes.Good.for.kids.
    # Renaming the offending attribute columns avoids the clash.
    colnames(yelpBusinessData$attributes)[6] <- "Price_Range"
    colnames(yelpBusinessData$attributes)[7] <- "Good_For_Kids"
    saveRDS(yelpBusinessData, file.path(dataDir, "yelpBusinessData.rds"))
  }
}
The above code loads the example dataframe.
Here is an example of the problem I mentioned above. The first code block works: it harnesses which to select four records. The problem is how to do the same with dplyr::filter. What am I missing? Specifically, how do you dereference nested data frames?
# Extract the Phoenix subset using `which`
yelpBusinessData.PA <- yelpBusinessData[which(yelpBusinessData$city == "Phoenix"),]
yelpBusinessData.PA.rest <- yelpBusinessData.PA[which(grepl("Restaurants",
yelpBusinessData.PA$categories)),]
Exp <- yelpBusinessData.PA.rest[which(yelpBusinessData.PA.rest$attributes$Price_Range == 4),]
dim(Exp)
Result - Four records selected :-)
> dim(Exp)
[1] 4 15
Question: How to do this with dplyr?
yelpBusinessData.PA.rest <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes$Price_Range == 4)
The above code fails. Now, if I flatten the data frame I can get this to work correctly, but...
Note the subtle change from "attributes$Price_Range" to "attributes.Price_Range".
yelpBusinessData2 <- flatten(yelpBusinessData, recursive = TRUE)
dim(yelpBusinessData2)
Exp2 <- yelpBusinessData2 %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter(attributes.Price_Range == 4)
dim(Exp2)
My goal, however, is to understand how to do this without flattening the nested data frames.
I.e. **How to use dplyr with nested data frames?**
What am I missing here? :-)
One potential answer that I have tried is to index the nested data frame using [[]]. This does work, but you lose the elegance of dplyr...
Is there a better way?
Exp2 <- yelpBusinessData %>%
filter(city == "Phoenix") %>%
filter(grepl("Restaurants", categories)) %>%
filter( attributes[[6]][] == 4)
The above indexes into attributes$Price_Range and returns the correct result with nested data frames, i.e. Price_Range is the 6th column of the attributes data frame...
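A slightly cleaner variant of the same workaround (still plain base-R [[ indexing, so it does not recover dplyr's elegance either) is to index by name rather than by position, which at least removes the magic number. Exp3 is just an illustrative name:

Exp3 <- yelpBusinessData %>%
  filter(city == "Phoenix") %>%
  filter(grepl("Restaurants", categories)) %>%
  filter(attributes[["Price_Range"]] == 4)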
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitcitations_1.0.6 pander_0.5.2 plyr_1.8.3 jsonlite_0.9.16 ggvis_0.4.2.9000
[6] tidyr_0.2.0 devtools_1.8.0 qmap_1.0-3 fitdistrplus_1.0-4 knitr_1.11
[11] dplyr_0.4.3.9000 data.table_1.9.4 psych_1.5.6 mapproj_1.2-4 maptools_0.8-36
[16] rworldmap_1.3-1 sp_1.1-1 maps_2.3-11 ggmap_2.5.2 ggplot2_1.0.1
[21] RMySQL_0.10.5 DBI_0.3.1 setwidth_1.0-4 colorout_1.1-1 vimcom_1.2-3
loaded via a namespace (and not attached):
[1] httr_1.0.0 splines_3.2.2 shiny_0.12.2 assertthat_0.1 highr_0.5
[6] yaml_2.1.13 lattice_0.20-33 chron_2.3-47 digest_0.6.8 RefManageR_0.8.63
[11] colorspace_1.2-6 htmltools_0.2.6 httpuv_1.3.3 XML_3.98-1.3 bibtex_0.4.0
[16] xtable_1.7-4 scales_0.3.0 jpeg_0.1-8 git2r_0.11.0 lazyeval_0.1.10.9000
[21] mnormt_1.5-3 proto_0.3-10 survival_2.38-3 RJSONIO_1.3-0 magrittr_1.5
[26] mime_0.3 memoise_0.2.1 evaluate_0.7.2 MASS_7.3-43 xml2_0.1.1
[31] foreign_0.8-66 ggthemes_2.2.1 rsconnect_0.4.1.4 tools_3.2.2 geosphere_1.4-3
[36] RgoogleMaps_1.2.0.7 formatR_1.2 stringr_1.0.0 munsell_0.4.2 rversions_1.0.2
[41] grid_3.2.2 RCurl_1.95-4.7 rstudioapi_0.3.1 rjson_0.2.15 spam_1.0-1
[46] bitops_1.0-6 labeling_0.3 rmarkdown_0.7 gtable_0.1.2 curl_0.9.3
[51] reshape2_1.4.1 R6_2.1.1 lubridate_1.3.3 stringi_0.5-5 parallel_3.2.2
[56] Rcpp_0.12.0 fields_8.2-1 png_0.1-7
There are at least 3 different parts to this question, each of which has very likely been answered well (& thoroughly) elsewhere on SO.
These are:
1. How to work with a "messy" data.frame in R/dplyr? (The example you give here is messier than a 'nested' data.frame, since it contains list-columns as well as data-frame-columns that themselves contain data-frame-columns.)
2. How to clean up a "messy" data.frame in R/dplyr?
3. Is there a better way to work with these data, maintaining their hierarchy?
Working with a "messy" data frame in R/dplyr?
Generally, and particularly when starting out, I take an approach of iteratively cleaning my data. This means I first identify the columns I most need to work with and the columns that are most problematic, then clean only those in the intersection.
Specifically:
1. Filter out any column that is problematic but not important
2. Focus my effort on any column that is problematic AND important
3. Keep any column that is important and not problematic
Aside: This leaves a fourth group of columns that are BOTH unimportant and not problematic. What you do with these depends on the problem. For example, if I'm preparing a production database, I will exclude them and only include the "cleaned" columns (#2 and #3 above). If I'm doing an exploratory analysis I'll include them since I may change my mind about their importance down the line.
In the example you give, the problematic columns are those that contain data.frames. (These are problematic because they break compatibility with dplyr, not because they are messy.)
You can filter them out using dplyr::select_if:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::filter(city == 'Phoenix')
After that, the other dplyr operators in your example will work, provided they don't reference data in columns that are data.frames (for example, the attributes). This brings me to part II...
How to clean a "messy" data.frame in R/dplyr?
One way to handle the "messy" data-frame columns in this data would be to flatten each one & join it back to the original data frame.
Taking the attributes column as an example, we can use jsonlite::flatten on this sub-data-frame & then join it back to our original:
yelpBusinessData %>%
dplyr::select_if(purrr::negate(is.data.frame)) %>%
dplyr::bind_cols(jsonlite::flatten(yelpBusinessData$attributes, recursive = T)) %>%
dplyr::filter(city == 'Phoenix') %>%
dplyr::filter(grepl("Restaurants", categories)) %>%
dplyr::filter(Price_Range == 4)
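An aside that is not part of the original pipeline: if you skip the manual renames from the question, the flattened attribute names contain spaces (e.g. "attributes.Good for Kids"), which cannot be referenced bare inside filter(). A minimal base-R sketch (attrs is an illustrative name) to make the names syntactically valid and unique before binding:

attrs <- jsonlite::flatten(yelpBusinessData$attributes, recursive = TRUE)
# make.names() turns e.g. "Good for Kids" into "Good.for.Kids";
# make.unique() suffixes any exact repeats so the names stay distinct
names(attrs) <- make.unique(make.names(names(attrs)))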
The hours component, however, you might want to handle differently. In this example, the hours data.frame contains a data.frame for each day of the week, each with two fields ("open" and "close"). Here I use purrr::map to apply a function to each column, simplifying each day's data.frame into a character vector.
hours <-
yelpBusinessData$hours %>%
purrr::map(. %>%
dplyr::transmute(hours = stringr::str_c(open, close, sep = ' - ')) %>%
unlist()) %>%
tibble::as_tibble()
This produces a data.frame with the start - stop time for each day of the week:
> str(hours)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 61184 obs. of 7 variables:
$ Tuesday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Friday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Monday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Wednesday: chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Thursday : chr "08:00 - 17:00" NA NA "10:00 - 21:00" ...
$ Sunday : chr NA NA NA "11:00 - 18:00" ...
$ Saturday : chr NA NA NA "10:00 - 21:00" ...
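To use these per-day strings alongside the rest of the data, you could bind them back to the non-nested columns, mirroring the attributes approach above (a sketch; yelpBusinessDataHours is just an illustrative name):

yelpBusinessDataHours <- yelpBusinessData %>%
  dplyr::select_if(purrr::negate(is.data.frame)) %>%
  dplyr::bind_cols(hours)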
Similarly, you could use map2_dfc (which automagically calls bind_cols after mapping) to collapse the data-frames of this object yourself:
hours <- yelpBusinessData$hours %>%
purrr::map2_dfc(.x = .,
.y = names(.),
.f = ~ .x %>%
dplyr::rename_all(funs(stringr::str_c(.y, ., sep = '_'))))
This produces a single data.frame with the day-specific start & stop times:
> str(hours)
'data.frame': 61184 obs. of 14 variables:
$ Tuesday_close : chr "17:00" NA NA "21:00" ...
$ Tuesday_open : chr "08:00" NA NA "10:00" ...
$ Friday_close : chr "17:00" NA NA "21:00" ...
$ Friday_open : chr "08:00" NA NA "10:00" ...
$ Monday_close : chr "17:00" NA NA "21:00" ...
$ Monday_open : chr "08:00" NA NA "10:00" ...
$ Wednesday_close: chr "17:00" NA NA "21:00" ...
$ Wednesday_open : chr "08:00" NA NA "10:00" ...
$ Thursday_close : chr "17:00" NA NA "21:00" ...
$ Thursday_open : chr "08:00" NA NA "10:00" ...
$ Sunday_close : chr NA NA NA "18:00" ...
$ Sunday_open : chr NA NA NA "11:00" ...
$ Saturday_close : chr NA NA NA "21:00" ...
$ Saturday_open : chr NA NA NA "10:00" ...
Rather than put important information in the field names, however, you might prefer to "denormalize" some data to produce a tidier structure:
> purrr::flatten_dfr(yelpBusinessData$hours, .id = 'day')
# A tibble: 61,184 x 3
day close open
<chr> <chr> <chr>
1 1 NA NA
2 1 NA NA
3 1 NA NA
4 1 21:00 10:00
5 1 16:00 10:00
6 1 NA NA
7 1 NA NA
8 1 NA NA
9 1 NA NA
10 1 02:00 08:00
# ... with 61,174 more rows
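Note that flatten_dfr drops the link back to each business. A sketch (hours_long and the use of business_id here are illustrative, not from the original answer) that stacks the days while carrying business_id along, so day keeps the weekday name rather than an index:

hours_long <- yelpBusinessData$hours %>%
  purrr::map_dfr(~ tibble::tibble(business_id = yelpBusinessData$business_id,
                                  open  = .x$open,
                                  close = .x$close),
                 .id = "day")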
Is there a better way to filter these data, maintaining their original hierarchy?
At the end of the day, there is a fundamental problem in your original data structure. A data.frame in R is implemented as a list of columns, and yet your data are stored as a data.frame of data.frames. This leads to confusion when indexing into the various parts of the structure.
This is a little unorthodox, but one option is to keep your data as a list of lists rather than converting to a data.frame right away. Using the tools in the purrr package, you can work with lists fairly easily to filter/flatten your data and then construct a data.frame from the filtered results.
For example:
> ## read in yelpBusinessData without simplifying to a data.frame
> yelpBusinessData2 <- fromJSON(sprintf("[%s]",
                                        paste(readLines(yelpBusinessDataFilePath),
                                              collapse = ",")),
                                flatten = FALSE,
                                simplifyVector = FALSE)
# filter to Phoenix restaurants _before_ converting to a data.frame
> yelpBusinessData2 %>%
purrr::keep(~ .$'city' == 'Phoenix'
&& grepl("Restaurants", .$categories)) %>%
jsonlite:::simplify(., flatten = T) %>%
dplyr::select(business_id, full_address, contains('kids')) %>%
str()
'data.frame': 8410 obs. of 5 variables:
$ business_id : chr "vcNAWiLM4dR7D2nwwJ7nCA" "x5Mv61CnZLohZWxfCVCPTQ" "2ZnCITVa0abGce4gZ6RhIw" "EmzaQR5hQlF0WIl24NxAZA" ...
$ full_address : chr "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" "2819 N Central Ave\nPhoenix, AZ 85004" "1850 N Central Ave\nPhoenix, AZ 85004" "132 E Washington St\nPhoenix, AZ 85004" ...
$ attributes.Good for Kids : logi NA FALSE TRUE FALSE NA NA ...
$ attributes.Good For Kids : logi NA NA NA NA NA NA ...
$ attributes.Hair Types Specialized In.kids: logi NA NA NA NA NA NA ...
As a final thought: if you're still left with variable-naming issues, take a look at the janitor package, specifically its clean_names() function. The package has some nice features for working with messy data, particularly data read in from Excel.
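For instance, a hedged sketch applying clean_names() to the flattened attributes from Part II; it lower-snake-cases the names and de-duplicates any that then clash, such as "Good for Kids" vs "Good For Kids":

yelpBusinessData$attributes %>%
  jsonlite::flatten(recursive = TRUE) %>%
  janitor::clean_names() %>%   # e.g. good_for_kids, good_for_kids_2, ...
  names()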