Total R noob here.
I am having difficulty creating a list of stock tickers.
Here's the situation:
I've created a dataframe of tickers pulled in from Quandl's API.
x1 <- Quandl.datatable('SHARADAR/SF1', paginate=TRUE,
                       qopts.columns=c('ticker'))
I then try to put this dataframe into a list.
x2<-as.list(x1)
So that I can then use the API to pull data for all the tickers in the list.
x3 <- Quandl.datatable('SHARADAR/SF1', paginate=TRUE,
                       qopts.columns=c('ticker','dimension','datekey','revenue'),
                       dimension='ART', calendardate='2015-12-31', ticker=c(x2))
But, alas, this doesn't work.
Compare this, however, with when I pull specific tickers directly:
Quandl.datatable('SHARADAR/SF1', ticker=c('AAPL', 'TSLA'))
or when I build the list by hand:
z = list('AAPL','TSLA')
The code behaves itself:
x3 <- Quandl.datatable('SHARADAR/SF1', paginate=TRUE,
                       qopts.columns=c('ticker','dimension','datekey','revenue'),
                       dimension='ART', calendardate='2015-12-31', ticker=z)
This is because each ticker is its own component in the list z:
[[1]]
[1] "AAPL"
[[2]]
[1] "TSLA"
Whereas for x2 all the tickers are stored as a single list component:
[1] "AAPL", "TSLA", etc.
Therefore it'd be swell if I could find a way to convert x2 into a list where each element is its own component.
Thanks a bunch (and for your patience as well!)
This should work:
x = sapply(1:5000, list)
The length is 5000:
length(x)
[1] 5000
All elements are integers:
all(sapply(x, is.integer) == TRUE)
[1] TRUE
This also works with character vectors:
sapply(c('AAPL', 'MSFT', 'AMZN'), list)
$AAPL
[1] "AAPL"
$MSFT
[1] "MSFT"
$AMZN
[1] "AMZN"
One option could be:
x1 <- c(list(),1:5000)
str(x1)
# List of 5000
# $ : int 1
# $ : int 2
# $ : int 3
# $ : int 4
# $ : int 5
# $ : int 6
# $ : int 7
# $ : int 8
# ...
This works because c() combines its arguments into a single list when one of them is a list, so each element of 1:5000 becomes its own component.
x1 is a one column data frame. Because a data.frame really is a list under the hood, as.list() just gives you a list of columns, in this case list(x1$column1).
You need to run as.list on a vector to get the result you want. Either of these will work:
as.list(x1$your_column_name)
as.list(x1[["your_column_name"]])
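For instance, with a hypothetical one-column data frame of tickers (the column name ticker is assumed here; use whatever name Quandl actually returned):
x1 <- data.frame(ticker = c('AAPL', 'TSLA', 'MSFT'))
as.list(x1)          # one component holding the whole column
as.list(x1$ticker)   # three components, one per ticker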
I am reading several SAS files from a server and loading them all into a list in R. I removed one of the datasets because I didn't need it in the final analysis (dataset #31):
mylist<-list.files("path" , pattern = ".sas7bdat")
mylist <- mylist[- 31]
Then I used lapply to read all the datasets in the list (mylist) at the same time:
read.all <- lapply(mylist, read_sas)
The code works well. However, when I run View(read.all) to see the datasets, I can only see a number (e.g. 1, 2, etc.) instead of the names of the original datasets.
Does anyone know how I can keep the names of the datasets in the final list?
Also, can anyone tell me how I can work with this list in R? Is it an object? Can I read one of the datasets in the list? And how can I join some of the datasets in the list?
Use basename and tools::file_path_sans_ext:
filenames <- head(list.files("~/StackOverflow", pattern = "^[^#].*\\.R", recursive = TRUE, full.names = TRUE))
filenames
# [1] "C:\\Users\\r2/StackOverflow/1000343/61469332.R" "C:\\Users\\r2/StackOverflow/10087004/61857346.R"
# [3] "C:\\Users\\r2/StackOverflow/10097832/60589834.R" "C:\\Users\\r2/StackOverflow/10214507/60837843.R"
# [5] "C:\\Users\\r2/StackOverflow/10215127/61720149.R" "C:\\Users\\r2/StackOverflow/10226369/60778116.R"
basename(filenames)
# [1] "61469332.R" "61857346.R" "60589834.R" "60837843.R" "61720149.R" "60778116.R"
tools::file_path_sans_ext(basename(filenames))
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
somedat <- setNames(lapply(filenames, readLines, n=2),
                    tools::file_path_sans_ext(basename(filenames)))
names(somedat)
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
str(somedat)
# List of 6
# $ 61469332: chr [1:2] "# https://stackoverflow.com/questions/61469332/determine-function-name-within-that-function/61469380" ""
# $ 61857346: chr [1:2] "# https://stackoverflow.com/questions/61857346/how-to-use-apply-family-instead-of-nested-for-loop-for-my-problem?noredirect=1" ""
# $ 60589834: chr [1:2] "# https://stackoverflow.com/questions/60589834/add-columns-to-data-frame-based-on-function-argument" ""
# $ 60837843: chr [1:2] "# https://stackoverflow.com/questions/60837843/how-to-remove-all-parentheses-from-a-vector-of-string-except-whe"| __truncated__ ""
# $ 61720149: chr [1:2] "# https://stackoverflow.com/questions/61720149/extracting-the-original-data-based-on-filtering-criteria" ""
# $ 60778116: chr [1:2] "# https://stackoverflow.com/questions/60778116/how-to-shift-data-by-a-factor-of-two-months-in-r" ""
Each "name" is the character representation of (in this case) the stackoverflow question number, with the ".R" removed. (And since I typically include the normal URL as the first line then an empty line in the files I use to test/play and answer SO questions, all of these files look similar at the top two lines.)
I'm having a problem using lapply and xml_find_first from the xml2 package to pull nodes from a list of xml objects. I'm pulling a few thousand records from the Scopus API. Since I can only get 25 records at a time, I run it so I get a list with 100+ elements of 25 records each. I know a few of the records have missing values, so my goal is to shuffle things around until I get a list where each record is its own element, then use lapply and xml_find_first so that I'll get null values where appropriate. The problem is that I end up pulling repeated values as if everything were still nested in the initial lists.
Here's a reproducible example with a list of 2 elements with 2 records each, with citedby-count missing from the last one:
```{r}
library(xml2)
# Simulate how data come in from Scopus
# Build 2 list elements, 2 entries each
el1 <- read_xml(
  "<feed>
     <blah>Bunch of stuff I don't need</blah>
     <blah>Bunch of other stuff I don't need</blah>
     <entry>
       <eid>2-s2.0-1542382496</eid>
       <citedby-count>9385</citedby-count>
     </entry>
     <entry>
       <eid>2-s2.0-0032721879</eid>
       <citedby-count>4040</citedby-count>
     </entry>
   </feed>"
)
el2 <- read_xml(  # This one's missing citedby-count for the last entry
  "<feed>
     <blah>Bunch of stuff I don't need</blah>
     <blah>Bunch of other stuff I don't need</blah>
     <entry>
       <eid>2-s2.0-0041751098</eid>
       <citedby-count>3793</citedby-count>
     </entry>
     <entry>
       <eid>2-s2.0-73449149291</eid>
     </entry>
   </feed>"
)
# Combine into list
lst <- list(el1,el2)
# Check
lst
```
This gives me the two feeds as a two-element list.
My goal is to pull out the entries so they are list items. This way, xml_find_first should stick a null value in for the entry where citedby-count is missing.
```{r}
# Pull entry nodes
lst2 <- lapply(lst, xml_find_all, "//entry")
# Unlist
lst2 <- unlist(lst2, recursive=FALSE)
# Check - each entry is its own element
lst2
```
The hangup is when I try to extract a node that I know is missing in some of the entries in a way that will leave a null where it's missing. xml_find_first should do that. But...
```{r}
cbc <- lapply(lst2, xml_find_first, "//citedby-count")
cbc <- lapply(cbc, xml_text)
cbc # Repeats the first values of original nesting
```
So I checked what would happen with xml_find_all:
```{r}
cbc2 <- lapply(lst2, xml_find_all, "//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2 # Elements contain all values from initial nesting
```
Which makes no sense in comparison with the output of lst2 above. For some reason, pulling the text retains the values from the initial nesting, even though it doesn't show up when looking at the final list of xml objects. I'm stumped.
Indeed, as @Dave2e comments, do not simply use the "anywhere" XPath search (specifically, the descendant-or-self search) with // for child elements, as the search will run over the entire document.
How can this be if I do not explicitly call the original document? If you run str() on any of your xml_find results, you will see each object carries Rcpp external pointers to the current node and the document, available for recall as needed. In fact, I believe the node pointer displays when calling the list.
str(lst2)
# List of 4
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
lst2[[1]]$doc
# <pointer: 0x000000000ca7ff90>
typeof(lst2[[1]]$doc)
# [1] "externalptr"
Therefore, be careful of context when searching. You can use the dot prefix (as @Dave2e advises), .//, or no slashes at all to retrieve child elements; in this case the two forms are equivalent.
cbc2 <- lapply(lst2, xml_find_all, "citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
cbc2 <- lapply(lst2, xml_find_all, ".//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
Do note that .// will search ALL descendants (i.e., children, grandchildren, etc.) starting at the current node. See What is the difference between .// and //* in XPath?
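With the relative search in place, xml_find_first should now also behave the way the question wanted, returning a missing node (NA after xml_text) for the entry that has no citedby-count:
cbc <- lapply(lst2, xml_find_first, ".//citedby-count")
sapply(cbc, xml_text)
# [1] "9385" "4040" "3793" NA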
R accepts only alphanumeric characters, "dot", and "underscore" in (syntactic) variable names. I had names like tmax_60_days_Dec13-Feb13_mean or tmax_60_days_Dec13-Feb13_tmax:>=:-5. I used such a system so I could easily parse out substrings, and also because I was calculating rolling means and used the conditions themselves as names :o
Until recently I got away with it, using get or manually removing the 'apostrophe' which knitr added.
But when I tried to use these variable/column names of data frames in functions like party or randomForest, it backfired: they were not recognised.
I can change the colon and hyphen to dot or underscore, though I would prefer some other possibility, and change ">=" to "ge" and "<=" to "le". But how do people code the "negative" or "minus" sign if you want to have it in a variable name or column name of a data frame?
I thought of prefixing the number with "neg" or "minus", but wanted to ask around whether there are more elegant ways of doing it, or simply to learn how other people manage it.
Thanks
You can use the comment function:
x <- 1:10
comment(x) <- "this is a comment"
y <- 1:10
comment(y) <- "this is another comment"
xy <- data.frame(x=x,y=y)
str(xy)
#----------------
'data.frame': 10 obs. of 2 variables:
$ x: atomic 1 2 3 4 5 6 7 8 9 10
..- attr(*, "comment")= chr "this is a comment"
$ y: atomic 1 2 3 4 5 6 7 8 9 10
..- attr(*, "comment")= chr "this is another comment"
#--------------
comment(xy$x) <- "prod"
comment(xy$y) <- "sum"
interpret <- function(x) eval(parse(text = paste0(comment(x), "(", quote(x), ")")))
lapply(xy, interpret)
#-----------------
$x
[1] 3628800
$y
[1] 55
A more expansive response would need a data-object that warrants further testing.
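As a side note (separate from the comment() approach): R does allow non-syntactic names if you quote them with backticks, which may be enough to keep a minus sign in a column name:
df <- data.frame(`tmax:>=:-5` = 1:3, check.names = FALSE)
df$`tmax:>=:-5`
# [1] 1 2 3
Formula-based functions such as party or randomForest may still choke on such names, though, which is exactly the backfiring described above.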
I am new to R and have some problems with looping and the grepl function.
I have a data frame like:
str(peptidesFilter)
'data.frame': 78389 obs. of 130 variables:
$ Sequence : chr "AAAAAIGGR" "AAAAAIGGRPNYYGNEGGR" "AAAAASSNPGGGPEMVR" "AAAAAVGGR" ...
$ First.amino.acid : chr "A" "A" "A" "A" ...
$ Protein.group.IDs : chr "1" "1;2;4" "2;5" "3" "4;80" ...
I want to filter the data according to $Protein.group.IDs by using the grepl function below:
peptidesFilter.new <- peptidesFilter[grepl('(^|;)2($|;)',
                                     peptidesFilter$Protein.group.IDs),]
I want to do this in a loop for every individual ID (e.g. 1, 2, 3, etc.), and rename the resulting data frame so the name contains the variable, peptidesFilter.i:
i = 1
while (i <= N) {
  peptidesFilter.[[i]] <- peptidesFilter[grepl('(^|;)i($|;)',
                          peptidesFilter$Protein.group.IDs),]
  i = i + 1
}
I have two problems. The main one is that the i inside the grepl pattern is not recognized as a variable. The other is how to rename the filtered data so the name contains the variable.
Any ideas?
For the grepl problem you can use paste0, for example:
paste0('(^|;)',i,'($|;)')
For the loop, you can do something like this:
ll <- lapply(1:4, function(x)
  peptidesFilter[grepl(paste0('(^|;)', x, '($|;)'),
                       peptidesFilter$Protein.group.IDs),])
Then you can combine it into a single data.frame:
do.call(rbind,ll)
Sequence First.amino.acid Protein.group.IDs
1 AAAAAIGGR A 1
2 AAAAAIGGRPNYYGNEGGR A 1;2;4
21 AAAAAIGGRPNYYGNEGGR A 1;2;4
3 AAAAASSNPGGGPEMVR A 2;5
4 AAAAAVGGR A 3
22 AAAAAIGGRPNYYGNEGGR A 1;2;4
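For the naming part, rather than creating separate objects named peptidesFilter.1, peptidesFilter.2, and so on, you can name the list elements and extract by name; a small sketch:
names(ll) <- paste0('peptidesFilter.', seq_along(ll))
ll[['peptidesFilter.2']]   # the subset for ID 2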
I have this csv file (fm.file):
Date,FM1,FM2
28/02/2011,14.571611,11.469457
01/03/2011,14.572203,11.457512
02/03/2011,14.574798,11.487183
03/03/2011,14.575558,11.487802
04/03/2011,14.576863,11.490246
And so on.
I run these commands:
fm.data <- as.xts(read.zoo(file=fm.file,format='%d/%m/%Y',tz='',header=TRUE,sep=','))
is.character(fm.data)
And I get the following:
[1] TRUE
How do I get fm.data to be numeric without losing its date index? I want to perform some statistical operations that require the data to be numeric.
I was puzzled by two things: it didn't seem that read.zoo should give you a character matrix, and it didn't seem that changing its class would affect the index values, since the data type should be separate from the indices. So I tried to replicate the problem and got a different result:
txt <- "Date,FM1,FM2
28/02/2011,14.571611,11.469457
01/03/2011,14.572203,11.457512
02/03/2011,14.574798,11.487183
03/03/2011,14.575558,11.487802
04/03/2011,14.576863,11.490246"
require(xts)
fm.data <- as.xts(read.zoo(file=textConnection(txt),format='%d/%m/%Y',tz='',header=TRUE,sep=','))
is.character(fm.data)
#[1] FALSE
str(fm.data)
#-------------
An ‘xts’ object from 2011-02-28 to 2011-03-04 containing:
Data: num [1:5, 1:2] 14.6 14.6 14.6 14.6 14.6 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "FM1" "FM2"
Indexed by objects of class: [POSIXct,POSIXt] TZ:
xts Attributes:
List of 2
$ tclass: chr [1:2] "POSIXct" "POSIXt"
$ tzone : chr ""
zoo- and xts-objects have their data in a matrix accessed with coredata and their indices are a separate set of attributes.
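For instance, with the fm.data built above:
coredata(fm.data)   # just the data matrix
index(fm.data)      # just the date index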
I think the problem is you have some dirty data in your csv file. In other words, the FM1 or FM2 column contains a character somewhere that stops it being interpreted as a numeric column. When that happens, xts (which is a matrix underneath) will force the whole thing to character type.
Here is one way to use R to find suspicious data:
s <- scan(fm.file,what="character")
# s is now a vector of character strings, one entry per line
s <- s[-1] #Chop off the header row
all(grepl('^[-0-9,.]*$',s,perl=T)) #True means all your data is clean
s[ !grepl('^[-0-9,.]*$',s,perl=T) ]
which( !grepl('^[-0-9,.]*$',s,perl=T) ) + 1
The second-to-last line prints out all the csv rows that contain characters you did not expect. The last line tells you which rows in the file they are (+1 because we removed the header row).
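If, after cleaning (or deciding to tolerate NAs), you want to coerce the object you already have back to numeric while keeping its date index, one sketch is to rebuild the xts from a converted matrix; any remaining dirty entries become NA with a warning:
fm.num <- xts(apply(fm.data, 2, as.numeric), order.by = index(fm.data))
is.numeric(coredata(fm.num))
# [1] TRUE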
Why not simply use read.csv and then convert the first column to a Date object using as.Date?
> x <- read.csv(fm.file, header=T)
> x$Date <- as.Date(x$Date, format="%d/%m/%Y")
> x
Date FM1 FM2
1 2011-02-28 14.57161 11.46946
2 2011-03-01 14.57220 11.45751
3 2011-03-02 14.57480 11.48718
4 2011-03-03 14.57556 11.48780
5 2011-03-04 14.57686 11.49025
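If you still want an xts object for the time-series statistics, a short sketch building one from x:
library(xts)
fm.data <- xts(x[, -1], order.by = x$Date)
is.numeric(coredata(fm.data))
# [1] TRUE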