I am looking for another way to achieve the same result because the for statement is too slow.
I have the following data frame.
'data.frame': 50000 obs. of 2 variables:
$ user_id: chr "user1#test.com" "user2#test.com" ......
$ result : logi NA NA ......
Function f takes a user ID and returns a specific result.
f <- function(user_id){
......
return(json_result)
}
The result I want is as follows.
'data.frame': 50000 obs. of 2 variables:
$ user_id: chr "user1#test.com" "user2#test.com" ......
$ result : chr "{....}" "{....}" ......
I am running a loop like the code below, but the speed is too slow.
for (t in df$user_id) {
print(t)
df$result[df$user_id==t] <- f(t)
}
It takes about 3 seconds per user, so roughly 3 * 50,000 = 150,000 seconds (over 41 hours) for all 50,000 users.
Is there any other way to get results faster?
You're looking for the lapply function:
df$result <- lapply(df$user_id, f)
Alternatively, you can use purrr's map functions.
library(tidyverse)
purrr::map(df$user_id, f)
This will output a list where each element is the output of the function call. Depending on the output of your function, you could use a map variant to output a vector of some type. You can read about this in the docs: https://purrr.tidyverse.org/reference/map.html
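Since the desired result column is a character vector rather than a list, vapply (or purrr::map_chr) may be a closer fit. Below is a minimal sketch; the body of f is a made-up stand-in, because the real function is not shown. Note that this mainly tidies the code: if f itself takes about 3 seconds per call, the total run time is still dominated by f.

```r
# Hypothetical stand-in for f(): returns one JSON string per user ID.
f <- function(user_id) sprintf('{"user":"%s"}', user_id)

df <- data.frame(user_id = c("user1#test.com", "user2#test.com"),
                 result = NA,
                 stringsAsFactors = FALSE)

# vapply() checks that every call returns exactly one character string,
# so the column comes back as chr instead of a list.
df$result <- vapply(df$user_id, f, character(1), USE.NAMES = FALSE)
str(df)
```

With purrr loaded, purrr::map_chr(df$user_id, f) should give the same character vector.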
I'm scraping data from a large online database (GBIF), which requires three steps: (1) match a GBIF "key" identifier to a species name, (2) send a query to the database, getting a download key ("res") in return, and (3) download, import, and filter the data associated with that species. I've written a function for each of these (not including the actual code here, since it's unfortunately very long and requires login credentials):
get_gbif_key <- function(species) {}
get_gbif_res <- function(gbifkey) {}
get_gbif_dat <- function(gbifres) {}
I have a list of several hundred species to which I want to apply these three functions in order. I know they work individually, but I can't figure out how to feed them into each other (probably using purrr?) and reference the correct inputs from the nested outputs of the previous function.
So, for example:
> testlist <- c('Gadus morhua','Caretta caretta')
> testkey <- map(testlist, get_gbif_key)
> testkey
[[1]]
[1] 8084280
[[2]]
[1] 8894817
Here's where I'm stuck. I want to feed the keys in this list structure into the next function, but I don't know how to properly reference them using map or other functions. I can do it by manually creating a new list for the next function:
> testlist2 <- c('8084280','8894817')
> testres <- map(testlist2, get_gbif_res)
> testres
[[1]]
<<gbif download>>
Username: XXXX
E-mail: XXXX#gmail.com
Download key: 0001342-180412121330197
[[2]]
<<gbif download>>
Username: XXXX
E-mail: XXXX#gmail.com
Download key: 0001343-180412121330197
EDIT: the structure of this output may be posing a problem here. When I run listviewer::jsonedit(testres), it just looks like a normal nested list with entries 0 and 1 holding the two download keys. However, when I run str(testres), I get the following:
> str(testres)
List of 2
$ :Class 'occ_download' atomic [1:1] 0001342-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
$ :Class 'occ_download' atomic [1:1] 0001343-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
And, again, for the third one:
> testlist3 <- c('0001342-180412121330197','0001343-180412121330197')
> testdat <- map(testlist3, get_gbif_dat)
Which successfully loads a list object with the desired data into R (it has two unnamed elements, 0 and 1, each of which is a list of 28 requested variables for each species). Any advice for scripting this get_gbif_key %>% get_gbif_res %>% get_gbif_dat workflow in a way that unpacks the preceding list structures correctly?
Here's what you should try based on the evidence provided so far. Basically, the results suggest you should be able to succeed with nested map-ping:
yourData <- map( unlist( # to make same class as your single func version
map(
map(testlist,
get_gbif_key), # returns gbifkeys
get_gbif_res)), # returns gbif_res's
get_gbif_dat) # returns data items
The last item that you showed the structure for is just a list of atomic character vectors with some extra attributes, and your functions seem to handle that without difficulty, so mapping should succeed.
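Another way to look at it is to chain the three steps per species and map once over the species list, which avoids unpacking the intermediate lists at all. A minimal sketch, assuming each function accepts the previous function's output directly; the three function bodies below are made-up stand-ins (the real ones need GBIF credentials), and get_gbif_all is a hypothetical wrapper name:

```r
# Stand-ins for the real functions, for illustration only.
get_gbif_key <- function(species) nchar(species)  # fake numeric key
get_gbif_res <- function(gbifkey) {
  structure(paste0("dl-", gbifkey), class = "occ_download")  # fake download key
}
get_gbif_dat <- function(gbifres) list(download = unclass(gbifres))  # fake data

# Hypothetical wrapper: run the whole pipeline for one species.
get_gbif_all <- function(species) {
  get_gbif_dat(get_gbif_res(get_gbif_key(species)))
}

testlist <- c("Gadus morhua", "Caretta caretta")

# One pass over the species; name the results so they are easy to look up.
results <- setNames(lapply(testlist, get_gbif_all), testlist)
str(results)
```

purrr::map(testlist, get_gbif_all) would do the same job as the lapply call here.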
I have the following dataframe:
str(dat2)
data.frame: 29081 obs. of 105 variables:
$ id: int 20 34 46 109 158....
$ reddit_id: chr "t1_cnas90f" "t1_cnas90t" "t1_cnas90g"....
$ subreddit_id: chr "t5_cnas90f" "t5_cnas90t" "t5_cnas90g"....
$ link_id: chr "t3_c2qy171" "t3_c2qy172" "t3_c2qy17f"....
$ created_utc: chr "2015-01-01" "2015-01-01" "2015-01-01"....
$ ups: int 3 1 0 1 2....
...
How can I change the data type of reddit_id, subreddit_id and link_id from character to factor? I know how to do it one column at a time, but as this is tedious work, I am looking for a faster way to do it.
I have tried the following, without success:
dat2[2:4] <- data.frame(lapply(dat2[2:4], factor))
I took this from another approach, but it ends up giving me an error message: invalid "length" argument
Another approach was to do it this way:
dat2 <- as.factor(data.frame(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Result: Error in sort.list(y): "x" must be atomic for "sort.
After reading the error I also tried it the other way around:
dat2 <- data.frame(as.factor(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Also without success.
If any information is missing, I am sorry. I am a newbie to R and Stack Overflow. Thank you for your help!
Try with:
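library("tidyverse")
dat2 <- dat2 %>%
  mutate_at(.vars = vars(reddit_id, subreddit_id, link_id),
            .funs = factor)
To select columns by a name pattern instead of listing each one, use a select helper such as contains() (here "reddit" matches both reddit_id and subreddit_id):
dat2 <- dat2 %>%
  mutate_at(.vars = vars(contains("reddit"), link_id),
            .funs = factor)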
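The same conversion also works in base R without any packages. A minimal sketch on a three-row mock of dat2 (the values are copied from the str() output in the question; the real data frame has far more rows and columns):

```r
# Small mock of dat2, for illustration only.
dat2 <- data.frame(
  id           = c(20L, 34L, 46L),
  reddit_id    = c("t1_cnas90f", "t1_cnas90t", "t1_cnas90g"),
  subreddit_id = c("t5_cnas90f", "t5_cnas90t", "t5_cnas90g"),
  link_id      = c("t3_c2qy171", "t3_c2qy172", "t3_c2qy17f"),
  stringsAsFactors = FALSE
)

# Convert only the named columns; every other column is left untouched.
cols <- c("reddit_id", "subreddit_id", "link_id")
dat2[cols] <- lapply(dat2[cols], factor)
str(dat2)
```

In dplyr 1.0.0 and later, mutate(across(all_of(cols), factor)) is the current replacement for mutate_at.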
I am trying to achieve the following
stocks <- c('AXP', 'VZ', 'V')
library('quantmod')
getSymbols(stocks)
Above command creates 3 data variables named AXP, VZ, and V
prices <- data.frame(stringsAsFactors=FALSE)
Here I am trying to create a column named after each ticker (e.g. AXP).
The following should add 3 columns to the frame, named AXP, VZ, and V, holding the data in
AXP$AXP.Adjusted, VZ$VZ.Adjusted, V$V.Adjusted
for (ticker in stocks)
{
prices$ticker <- ticker$ticker.Adjusted
}
How do I achieve this? R gives the following error when I try this:
Error in ticker$ticker.Adjusted :
$ operator is invalid for atomic vectors
Any ideas?
Thanks in advance
Here is a simpler way to do this
do.call('cbind', lapply(mget(stocks), function(d) d[,6]))
Explanation:
1. mget(stocks) gets the three objects created by getSymbols (xts time series) as a list.
2. lapply extracts the 6th column of each, which contains the variable of interest.
3. do.call passes the list from (2) to cbind, which binds them together as columns.
NOTE: This solution does not take care of the different number of columns in the data frames.
I did not understand your question before; now I think I understand what you want.
What you wrote does not work because the object ticker is a character string. If you want to get the object named after that string, you have to evaluate the parsed text (or, more simply, look it up with get(ticker)).
Try this:
for (ticker in stocks){
prices <- cbind(prices, eval(parse(text=ticker))[,paste0(ticker, ".", "Adjusted")])
}
This will give you:
An ‘xts’ object on 2007-01-03/2014-01-28 containing:
Data: num [1:1780, 1:4] 53.4 53 52.3 52.8 52.5 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:4] "AXP.Adjusted" "AXP.Adjusted.1" "VZ.Adjusted" "V.Adjusted"
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 2
$ src : chr "yahoo"
$ updated: POSIXct[1:1], format: "2014-01-29 01:06:51"
One problem you're going to have is that the three downloads have different numbers of rows, so binding them all into a single data frame will fail.
The code below uses the last 1000 rows of each file (most recent), and does not use loops.
stocks <- c('AXP', 'VZ', 'V')
library('quantmod')
getSymbols(stocks)
prices=do.call(data.frame,
lapply(stocks,
function(s)tail(get(s)[,paste0(s,".Adjusted")],1000)))
colnames(prices)=stocks
head(prices)
# AXP VZ V
# 2010-02-08 34.70 21.72 80.58
# 2010-02-09 35.40 22.01 80.79
# 2010-02-10 35.60 22.10 81.27
# 2010-02-11 36.11 22.23 82.73
# 2010-02-12 36.23 22.15 82.38
# 2010-02-16 37.37 22.34 83.45
Working from the inside out, s is the ticker (so, e.g., "AXP"); get(s) returns the object with that name, so AXP; get(s)[,paste0(s,".Adjusted")] is equivalent to AXP[,"AXP.Adjusted"]; tail(...,1000) returns the last 1000 rows of .... So when s="AXP", the function returns the last 1000 rows of AXP$AXP.Adjusted.
lapply(...) applies that function to each element in stocks.
do.call(data.frame,...) invokes the data.frame function with the list of columns returned by lapply(...).
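To try the get() lookup without downloading anything, here is a minimal sketch on mock data. The two small data frames are made-up stand-ins for the xts objects that getSymbols() would create, and only the last 2 rows are kept instead of 1000:

```r
# Mock stand-ins for the objects created by getSymbols(); values are invented.
stocks <- c("AXP", "VZ")
AXP <- data.frame(AXP.Close = c(10, 11, 12),
                  AXP.Adjusted = c(9.5, 10.5, 11.5))
VZ  <- data.frame(VZ.Close = c(20, 21, 22, 23),
                  VZ.Adjusted = c(19, 20, 21, 22))

# For each ticker name, look up the object of that name and keep
# the last 2 values of its "<ticker>.Adjusted" column.
cols <- lapply(stocks, function(s) tail(get(s)[, paste0(s, ".Adjusted")], 2))

# Name the list elements so data.frame() uses the tickers as column names.
prices <- do.call(data.frame, setNames(cols, stocks))
prices
```

Taking the same number of trailing rows from each object is what lets series of different lengths fit into one data frame.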
I am new to R and have some problems with looping and the grepl function.
I have a data frame like:
str(peptidesFilter)
'data.frame': 78389 obs. of 130 variables:
$ Sequence : chr "AAAAAIGGR" "AAAAAIGGRPNYYGNEGGR" "AAAAASSNPGGGPEMVR" "AAAAAVGGR" ...
$ First.amino.acid : chr "A" "A" "A" "A" ...
$ Protein.group.IDs : chr "1" "1;2;4" "2;5" "3" "4;80" ...
I want to filter the data according to $ Protein.group.IDs using the grepl call below
peptidesFilter.new <- peptidesFilter[grepl('(^|;)2($|;)',
peptidesFilter$Protein.group.IDs),]
I want to do this in a loop for every individual ID (e.g. 1, 2, 3, etc.)
and name each resulting data frame after the ID, as in peptidesFilter.i
i =1
while( i <= N) { peptidesFilter.[[i]] <-
peptidesFilter[grepl('(^|;)i($|;)',
peptidesFilter$Protein.group.IDs),]
i=i+1 }
I have two problems:
mainly, the i inside the grepl pattern is not recognized as a variable; and I don't know how to name each filtered data frame so that its name contains the variable.
Any ideas?
For the grepl problem you can build the pattern with paste0, for example:
paste0('(^|;)',i,'($|;)')
For the loop, you can do something like this:
ll <- lapply(1:4,function(x)
peptidesFilter[grepl(paste0('(^|;)',x,'($|;)'),
peptidesFilter$Protein.group.IDs),])
Then you can combine the results into a single data.frame:
do.call(rbind,ll)
Sequence First.amino.acid Protein.group.IDs
1 AAAAAIGGR A 1
2 AAAAAIGGRPNYYGNEGGR A 1;2;4
21 AAAAAIGGRPNYYGNEGGR A 1;2;4
3 AAAAASSNPGGGPEMVR A 2;5
4 AAAAAVGGR A 3
22 AAAAAIGGRPNYYGNEGGR A 1;2;4
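If the goal is one filtered data frame per ID that can be looked up by name, a named list is usually easier to work with than separate peptidesFilter.1, peptidesFilter.2, ... variables. A minimal sketch on a four-row mock of the data (rows taken from the question; the ID range 1:4 is an assumption):

```r
# Small mock of peptidesFilter, for illustration only.
peptidesFilter <- data.frame(
  Sequence = c("AAAAAIGGR", "AAAAAIGGRPNYYGNEGGR",
               "AAAAASSNPGGGPEMVR", "AAAAAVGGR"),
  Protein.group.IDs = c("1", "1;2;4", "2;5", "3"),
  stringsAsFactors = FALSE
)

ids <- 1:4
filtered <- lapply(ids, function(i) {
  # Build the pattern from the loop variable, e.g. "(^|;)2($|;)" for i = 2.
  pattern <- paste0("(^|;)", i, "($|;)")
  peptidesFilter[grepl(pattern, peptidesFilter$Protein.group.IDs), ]
})

# Name each element after its ID so it can be retrieved by name.
names(filtered) <- paste0("peptidesFilter.", ids)
filtered[["peptidesFilter.2"]]
```

Each element is a regular data frame, so filtered[["peptidesFilter.2"]] can be used wherever a separate variable would have been.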
For starters: I have searched for hours on this problem, so if the answer turns out to be trivial, please forgive me.
What I want to do is delete a row (no. 101) from a data.frame. It contains test data and should not appear in my analyses. My problem is: Whenever I subset from the data.frame, the attributes (esp. comments) are lost.
str(x)
# x has comments for each variable
x <- x[1:100,]
str(x)
# now x has lost all comments
It is well documented that subsetting will drop all attributes - so far, it's perfectly clear. The manual (e.g. http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.data.frame.html) even suggests a way to preserve the attributes:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
d <- data.frame(i= 0:7, f= gl(2,4),
u= structure(11:18, unit = "kg", class="avector"))
str(d[2:4, -1]) # 'u' keeps its "unit"
I am not yet far enough into R to understand exactly what happens here. However, simply running these lines (except the last three) does not change the behavior of my subsetting. Using subset() with an appropriate logical vector (100 times TRUE plus 1 FALSE) gives me the same result. And simply storing the attributes in a variable and restoring them after the subset does not work either.
# Does not work...
tmp <- attributes(x)
x <- x[1:100,]
attributes(x) <- tmp
Of course, I could write all comments to a vector (var=>comment), subset and write them back using a loop - but that does not seem a well-founded solution. And I am quite sure I will encounter datasets with other relevant attributes in future analyses.
So this is where my efforts with Stack Overflow, Google, and brain power got stuck. I would very much appreciate it if anyone could help me out with a hint. Thanks!
If I understand you correctly, you have some data in a data.frame, and the columns of the data.frame have comments associated with them. Perhaps something like the following?
set.seed(1)
mydf<-data.frame(aa=rpois(100,4),bb=sample(LETTERS[1:5],
100,replace=TRUE))
comment(mydf$aa)<-"Don't drop me!"
comment(mydf$bb)<-"Me either!"
So this would give you something like
> str(mydf)
'data.frame': 100 obs. of 2 variables:
$ aa: atomic 3 3 4 7 2 7 7 5 5 1 ...
..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2 2 5 4 2 1 3 5 3 ...
..- attr(*, "comment")= chr "Me either!"
And when you subset this, the comments are dropped:
> str(mydf[1:2,]) # comment dropped.
'data.frame': 2 obs. of 2 variables:
$ aa: num 3 3
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
To preserve the comments, define the function [.avector, as you did above (from the documentation) then add the appropriate class attributes to each of the columns in your data.frame (EDIT: to keep the factor levels of bb, add "factor" to the class of bb.):
mydf$aa<-structure(mydf$aa, class="avector")
mydf$bb<-structure(mydf$bb, class=c("avector","factor"))
So that the comments are preserved:
> str(mydf[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Class 'avector' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
EDIT:
If there are many columns in your data.frame that have attributes you want to preserve, you could use lapply (EDITED to include original column class):
mydf2 <- data.frame( lapply( mydf, function(x) {
structure( x, class = c("avector", class(x) ) )
} ) )
However, this drops comments associated with the data.frame itself (such as comment(mydf)<-"I'm a data.frame"), so if you have any, assign them to the new data.frame:
comment(mydf2)<-comment(mydf)
And then you have
> str(mydf2[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Classes 'avector', 'numeric' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
- attr(*, "comment")= chr "I'm a data.frame"
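If the comments are the only attributes that matter, a lighter alternative to a custom class may be to copy them back after subsetting. A minimal sketch; subset_keep_comments is a hypothetical helper written for this answer, not part of base R:

```r
# Hypothetical helper: subset rows, then restore each column's comment
# and the comment on the data.frame itself.
subset_keep_comments <- function(df, rows) {
  out <- df[rows, , drop = FALSE]
  for (nm in names(df)) {
    # Assigning NULL is harmless when a column has no comment.
    comment(out[[nm]]) <- comment(df[[nm]])
  }
  comment(out) <- comment(df)
  out
}

mydf <- data.frame(aa = 1:5, bb = letters[1:5])
comment(mydf$aa) <- "Don't drop me!"
comment(mydf) <- "I'm a data.frame"

res <- subset_keep_comments(mydf, 1:2)
comment(res$aa)
```

Unlike the avector class, this only handles the comment attribute and has to be called explicitly each time you subset.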
For those looking for the "all-in" solution based on BenBarnes's explanation, here it is.
(Give your upvote to BenBarnes's post if this works for you.)
# Define the avector-subselection method (from the manual)
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x,i,...) {
r <- NextMethod("[")
mostattributes(r) <- attributes(x)
r
}
# Assign each column in the data.frame the (additional) class avector
# Note that this will "lose" the data.frame's attributes, therefore write to a copy
df2 <- data.frame(
lapply(df, function(x) {
structure( x, class = c("avector", class(x) ) )
} )
)
# Finally copy the attribute for the original data.frame if necessary
mostattributes(df2) <- attributes(df)
# Now subselects work without losing attributes :)
df2 <- df2[1:100,]
str(df2)
The good thing: once the class has been attached to all of the data.frame's columns, subsetting no longer drops their attributes.
Okay, sometimes I am stunned at how complicated it is to do the simplest operations in R. But I surely would not have learned about the "classes" feature if I had just marked and deleted the case in SPSS ;)
This is solved by the sticky package. (Full disclosure: I am the package author.) Apply sticky() to your vectors and the attributes are preserved through subset operations. For example:
> df <- data.frame(
+ sticky = sticky( structure(1:5, comment="sticky attribute") ),
+ nonstick = structure( letters[1:5], comment="non-sticky attribute" )
+ )
>
> comment(df[1:3, "nonstick"])
NULL
> comment(df[1:3, "sticky"])
[1] "sticky attribute"
This works for any attribute and not only comment.
See the sticky package on GitHub and on CRAN for details.
I spent hours trying to figure out how to retain attribute data (specifically variable labels) when subsetting a dataframe (removing columns). The answer was so simple, I couldn't believe it. Just use the function spss.get from the Hmisc package, and then no matter how you subset, the variable labels are retained.