Concatenate multiple strings in R - r

I want to concatenate the below urls, I have written a below function to concatenate all the urls:
library(datetime)
library(lubridate)
get_thredds_url<- function(mon, hr){
a <-"http://abc.co.in/"
b <-"thredds/path/"
c <-paste0("%02d", ymd_h(mon))
d <-paste0(strftime(datetime_group, format="%Y%m%d%H"))
e <-paste0("/gfs.t%sz.pgrb2.0p25.f%03d",(c, hr))
url <-paste0(a,b,b,d)
return (url)
}
mon = datetime(2017, 9, 26, 0)
hr = 240
url = get_thredds_url(mon,hr)
print (url)
But I am getting below error when I execute the definition of get_thredds_url():
Error: unexpected ',' in:
" d<-paste0(strftime(datetime_group, format="%Y%m%d%H"))
e<-paste0("/gfs.t%sz.pgrb2.0p25.f%03d",(c,"
url <-paste0(a,b,b,d)
Error in paste0(a, b, b, d) : object 'a' not found
return (url)
Error: no function to return from, jumping to top level
}
Error: unexpected '}' in "}"
What is wrong with my function and how can I solve this?
The final output should be:
http://abc.co.in/thredds/path/2017092600/gfs.t00z.pgrb2.0p25.f240

Using sprintf allows more control of values being inserted into string
library(lubridate)
get_thredds_url<- function(mon, hr){
sprintf("http://abc.co.in/thredds/path/%s/gfs.t%02dz.pgrb2.0p25.f%03d",
strftime(mon, format = "%Y%m%d%H", tz = "UTC"),
hour(mon),
hr)
}
mon <- make_datetime(2017, 9, 26, 0, tz = "UTC")
hr <- 240
get_thredds_url(mon, hr)
[1] "http://abc.co.in/thredds/path/2017092600/gfs.t00z.pgrb2.0p25.f240"

It was a bit messy to figure out what it is, you're trying to do. There seem to be quite a couple of contradicting pieces in your code, especially compared to your wanted final output. Therefore, I decided to focus on the wanted output and the inputs you provided in your variables.
get_thredds_url <- function(yr, mnth, day, hrs1, hrs2){
part1 <- "http://abc.co.in/"
part2 <- "thredds/path/"
ymdh <- c(yr, formatC(c(mnth, day, hrs1), width=2, flag="0"))
part3 <- paste0(ymdh, collapse="")
pre4 <- formatC(hrs1, width=2, flag="0")
part4 <- paste0("/gfs.t", pre4, "z.pgrb2.0p25.f", hrs2)
return(paste0(part1, part2, part3, part4))
}
get_thredds_url(2017, 9, 26, 0, 240)
# [1] "http://abc.co.in/thredds/path/2017092600/gfs.t00z.pgrb2.0p25.f240"
The key is using paste0() appropriately and I think formatC() may be new to some people (including me).
formatC() is used here to pad zeros in front of the number you provide, and thus makes sure that 9 is converted to 09, whereas 12 remains 12.
Note that this answer is in base R and does not require additional packages.
Also note that you should not use url and c as variable names. These names are already reserved for other functionalities in R. By using them as variable names, you are overwriting their actual purpose, which can (will) lead to problems at some point down the road

Related

Transaction problem in RStudio for tweet apriori analysis

I want to use the apriori algorithm to apply association rules between words on the tweet database I have with RStudio. However, the code below gives an error on a million rows of data, while working on a small number of data. I needed your help as I couldn't understand what caused the error.
TweetTrans <- read.transactions("../input/tweets/output.csv",
rm.duplicates=FALSE,
format = "basket",
sep = ",",
encoding = "UTF-8")
The Error is:
Error in validObject(.Object): invalid class “ngCMatrix” object: row indices are not sorted within columns
Traceback:
1. read.transactions("../input/tweets/output.csv", rm.duplicates = FALSE,
. format = "basket", sep = ",", encoding = "UTF-8")
2. as(data, "transactions")
3. asMethod(object)
4. new("transactions", as(from, "itemMatrix"), itemsetInfo = data.frame(transactionID = names(from),
. stringsAsFactors = FALSE))
5. initialize(value, ...)
6. initialize(value, ...)
7. callNextMethod()
8. .nextMethod(.Object = .Object, ... = ...)
9. callNextMethod()
10. .nextMethod(.Object = .Object, ... = ...)
11. as(from, "itemMatrix")
12. asMethod(object)
13. new("ngCMatrix", p = c(0L, p), i = as.integer(i) - 1L, Dim = c(length(levels(i)),
. length(p)))
14. initialize(value, ...)
15. initialize(value, ...)
16. callNextMethod()
17. .nextMethod(.Object = .Object, ... = ...)
18. validObject(.Object)
19. stop(msg, ": ", errors, domain = NA)
Here are some ideas for how to find a rogue line in the data file. The input to read.transactions should be a text file the looks something like
A, B, C
B, C
C, D, E
D, A, B, F
where A, B ,C, etc are the names of the items (probably longer than one character each!)
So you could read in the file using readLines...
data <- readLines("../input/tweets/output.csv")
Each element of data (one per line of the file) should be a string of the form "A, B, C" etc, as above.
You could then use functions (e.g. from the stringr package) to check if any lines contain unusual characters, or have an odd format. Without seeing your file, it is hard to say how to do this, but you might, for example, look for quotes in odd places (str_detect(data, '\\"')) or characters that are not letters, digits , spaces or commas (str_detect(data, "[^\\w\\d\\s,]")).
Another thing you could try is to write a for loop to take each element of data (or perhaps larger chunks if that is too slow), save it as a file, try reading it with read.transactions, and see where it crashes.
for(i in seq_along(data)){
writeLines(data[i], "dummyfile.csv")
trans <- read.transactions("dummyfile.csv",
rm.duplicates=FALSE,
format = "basket",
sep = ",",
encoding = "UTF-8")
}
The value of i when it crashes will give you the problem row number. It might take a long time to run, though!
I ran into a very similar problem: the same error got triggered when trying to cast a list to a transaction object.
I also couldn't easily figure out what lines in the data caused the issue, as it seems to be triggered by a combination of transactions and not necessarily by any individual one, but I managed to track down the source of the problem in this assignment (source):
p <- new("ngCMatrix", p = c(0L, p),
i = as.integer(i) - 1L,
Dim = c(length(levels(i)), length(p)))
My R got pretty rusty over time and I couldn't find an immediate way to patch the code, but I came up with an alternative solution for constructing the ngCMatrix object:
Assume you have the data in a data.frame following some sort of (user, item) format - in your case it would most likely be (tweet_id, term/word)
Create a unique incremental ID for every user and item and add it to your data.frame
Use those ID to create the sparse matrix and - optionally - enrich it with the labels for item and user to make it more interpretable
Finally, cast the sparse matrix to a transaction object
Example (I implemented mine with data.table, but a traditional dataframe implementation would be very similar):
library(Matrix)
library(data.table)
library(arules)
DT <- data.table(user = c('A','A','B','B','A','C','D'),
item = c('AAB','AAA','AAB','BBB','ABA','BBB','AAB'))
# Create user_ids
unique_users <- unique(DT$user)
users <- data.table(user=unique_users,
user_id=c(1:length(unique_users)))
# Repeat for items
unique_items <- unique(DT$item)
items <- data.table(item=unique_items,
item_id=c(1:length(unique_items)))
# Add indexes to original data table (setting keys helps with performance)
DT <- merge.data.table(x=DT, y=users, by='user')
DT <- merge.data.table(x=DT, y=items, by='item')
# Create the sparse matrix
mat <- sparseMatrix(
i = DT$item_id,
j = DT$user_id,
dims = c(nrow(items), nrow(users)),
dimnames = list(items$item, users$user)
)
# transform to arules 'transactions'
txn <- as(op, "transactions")
Please note that this doesn't help understanding what caused the issue, but rather provides a workaround to solve it. In my data.table implementation the code is pretty performant, taking only a few seconds to process over 30M transactions on a laptop-sized machine (2 CPUs, 16gb RAM).

Convert R JSON Twitter data to list

When using SearchTwitter, I converted to dataframe and then exported to JSON. However, all the text is in one line, etc (sample below). I need to separate so that each tweet is its own.
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "phishdf.json")
json_phishdf <- fromJSON(file="phishdf.json")
I tried converting to a list and am wondering if maybe converting to a data frame is a mistake.
However, for a list, I tried:
newlist['text']=phish[[1]]$getText()
But this will just give me the text for the first tweet. Is there a way to iterate over the entire data set, maybe in a for loop?
{"text":["#ilazer #abbijacobson I do feel compelled to say that I phind phish awphul... sorry, Abbi!","#phish This on-sale was an embarrassment. Something needs to change.","FS: Have 2 Tix To Phish In Chula Vista #Phish #facevaluetickets #phish #facevalue GO: https://t.co/dFdrpyaotp","RT #WKUPhiDelt: Come unwind from a busy week of class and kick off the weekend with a Phish Fry! 4:30-7:30 at the Phi Delt house. Cost is $\u2026","RT #phish: Tickets for Phish's July 15 & 16 shows at The Gorge go on sale in fifteen minutes at 1PM ET: https://t.co/tEKLNjI5u7 https://t.c\u2026"],
"favorited":[false,false,false,false,false],
"favoriteCount":[0,0,0,0,0],
"replyToSN":["rAlexandria","phish","NA","NA","NA"],
"created":[1456521159,1456521114,1456521022,1456521016,1456520988],
"truncated":[false,false,false,false,false],
"replyToSID":["703326502629277696","703304948990222337","NA","NA","NA"],
"id":["703326837720662016","703326646074343424","703326261045829632","703326236722991105","703326119328686080"],
"replyToUID":["26152867","14503997","NA","NA","NA"],"statusSource":["Mobile Web (M5)","Twitter for iPhone","CashorTrade - Face Value Tickets","Twitter for iPhone","Twitter for Android"],
"screenName":["rAlexandria","adamgelvan","CashorTrade","Kyle_Smith1087","timogrennell"],
"retweetCount":[0,0,0,2,5],
"isRetweet":[false,false,false,true,true],
"retweeted":[false,false,false,false,false],
"longitude":["NA","NA","NA","NA","NA"],
"latitude":["NA","NA","NA","NA","NA"]}
I followed your code and don't have the issue you're describing. Are you using library(twitteR) and library(jsonlite)?
Here is the code, and a screenshot of it working
library(twitteR)
library(jsonlite)
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <-toJSON(phishdf)
write(exportJson, file = "./../phishdf.json")
## note the `txt` argument, as opposed to `file` used in the question
json_phishdf <- fromJSON(txt="./../phishdf.json")

Truncate decimal to specified places

This seems like it should be a fairly easy problem to solve but I am having some trouble locating an answer.
I have a vector which contains long decimals and I want to truncate it to a specific number of decimals. I do not wish to round it, but rather just remove the values beyond my desired number of decimals.
For example I would like 0.123456789 to return 0.1234 if I desired 4 decimal digits. This is not an issue of printing a specific number of digits but rather returning the original value truncated to a given number.
Thanks.
trunc(x*10^4)/10^4
yields 0.1234 like expected.
More generally,
trunc <- function(x, ..., prec = 0) base::trunc(x * 10^prec, ...) / 10^prec;
print(trunc(0.123456789, prec = 4) # 0.1234
print(trunc(14035, prec = -2), # 14000
I used the technics above for a long time. One day I had some issues when I was copying the results to a text file and I solved my problem in this way:
trunc_number_n_decimals <- function(numberToTrunc, nDecimals){
numberToTrunc <- numberToTrunc + (10^-(nDecimals+5))
splitNumber <- strsplit(x=format(numberToTrunc, digits=20, format=f), split="\\.")[[1]]
decimalPartTrunc <- substr(x=splitNumber[2], start=1, stop=nDecimals)
truncatedNumber <- as.numeric(paste0(splitNumber[1], ".", decimalPartTrunc))
return(truncatedNumber)
}
print(trunc_number_n_decimals(9.1762034354551236, 6), digits=14)
[1] 9.176203
print(trunc_number_n_decimals(9.1762034354551236, 7), digits=14)
[1] 9.1762034
print(trunc_number_n_decimals(9.1762034354551236, 8), digits=14)
[1] 9.17620343
print(trunc_number_n_decimals(9.1762034354551236, 9), digits=14)
[1] 9.176203435
This solution is very handy in cases when its necessary to write to a file the number with many decimals, such as 16.
Just remember to convert the number to string before writing to the file, using format()
numberToWrite <- format(trunc_number_n_decimals(9.1762034354551236, 9), digits=20)
Not the most elegant way, but it'll work.
string_it<-sprintf("%06.9f", old_numbers)
pos_list<-gregexpr(pattern="\\.", string_it)
pos<-unlist(lapply(pos_list, '[[', 1)) # This returns a vector with the first
#elements
#you're probably going to have to play around with the pos- numbers here
new_number<-as.numeric(substring(string_it, pos-1,pos+4))

Converting geo coordinates from degree to decimal

I want to convert my geographic coordinates from degrees to decimals, my data are as follows:
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281
I have this code but I can not see why it is does not work:
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"\\.")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
return(dec[1]+dec[2]/60+dec[3]/3600)
}
don_convert=don1
for(i in 1:nrow(don1)){don_convert[i,2]=convert(as.character(don1[i,2])); don_convert[i,3]=convert(as.character(don1[i,3]))}
The convert function works but the code where I am asking the loop to do the job for me does not work.
Any suggestion is apperciated.
Use the measurements package from CRAN which has a unit conversion function already so you don't need to make your own:
x = read.table(text = "
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header = TRUE, stringsAsFactors = FALSE)
Once your data.frame is set up then:
# change the degree symbol to a space
x$lat = gsub('°', ' ', x$lat)
x$long = gsub('°', ' ', x$long)
# convert from decimal minutes to decimal degrees
x$lat = measurements::conv_unit(x$lat, from = 'deg_dec_min', to = 'dec_deg')
x$long = measurements::conv_unit(x$long, from = 'deg_dec_min', to = 'dec_deg')
Resulting in the end product:
lat long
105252 30.4210666666667 9.02218333333333
105253 30.65395 8.18018333333333
105255 31.6293333333333 8.10066666666667
105258 31.6865 8.10928333333333
105259 31.68715 8.11036666666667
105260 31.6481833333333 8.10468333333333
Try using the char2dms function in the sp library. It has other functions that will additionally do decimal conversion.
library("sp")
?char2dms
A bit of vectorization and matrix manipulation will make your function much simpler:
x <- read.table(text="
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header=TRUE, stringsAsFactors=FALSE)
x
The function itself makes use of:
strsplit() with the regex pattern "[°\\.]" - this does the string split in one step
sapply to loop over the vector
Try this:
convert<-function(x){
z <- sapply((strsplit(x, "[°\\.]")), as.numeric)
z[1, ] + z[2, ]/60 + z[3, ]/3600
}
Try it:
convert(x$long)
[1] 9.108611 8.391944 8.111111 8.254722 8.272778 8.178056
Disclaimer: I didn't check your math. Use at your own discretion.
Thanks for answers by #Gord Stephen and #CephBirk. Sure helped me out.
I thought I'd just mention that I also found that measurements::conv_unit doesn't deal with "E/W" "N/S" entries, it requires positive/negative degrees.
My coordinates comes as character strings "1 1 1W" and needs to first be converted to "-1 1 1".
I thought I'd share my solution for that.
df <- c("1 1 1E", "1 1 1W", "2 2 2N","2 2 2S")
measurements::conv_unit(df, from = 'deg_min_sec', to = 'dec_deg')
[1] "1.01694444444444" NA NA NA
Warning message:
In split(as.numeric(unlist(strsplit(x, " "))) * c(3600, 60, 1), :
NAs introduced by coercion
ewns <- ifelse( str_extract(df,"\\(?[EWNS,.]+\\)?") %in% c("E","N"),"+","-")
dms <- str_sub(df,1,str_length(df)-1)
df2 <- paste0(ewns,dms)
df_dec <- measurements::conv_unit(df2,
from = 'deg_min_sec',
to = 'dec_deg'))
df_dec
[1] "1.01694444444444" "-1.01694444444444" "2.03388888888889" "-2.03388888888889"
as.numeric(df_dec)
[1] 1.016944 -1.016944 2.033889 -2.033889
Have a look at the command degree in the package OSMscale.
As Jim Lewis commented before it seems your are using floating point minutes. Then you only concatenate two elements on
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
Having degrees, minutes and seconds in the form 43°21'8.02 which as.character() returns "43°21'8.02\"", I updated your function to
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"'")
tmp3=strsplit(tmp2[[1]][2],"\"")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]][1]),as.numeric(tmp3[[1]]))
c<-abs(dec[1])+dec[2]/60+dec[3]/3600
c<-ifelse(dec[1]<0,-c,c)
return(c)
}
adding the alternative for negative coordinates, and works great for me . I still don't get why char2dms function in the sp library didn't work for me.
Thanks
Another less elegant option using substring instead of strsplit. This will only work if all your positions have the same number of digits. For negative co-ordinates just multiply by -1 for the correct decimal degree.
x$LatDD<-(as.numeric(substring(x$lat, 1,2))
+ (as.numeric(substring(x$lat, 4,9))/60))
x$LongDD<-(as.numeric(substring(x$long, 1,1))
+ (as.numeric(substring(x$long, 3,8))/60))

creating time object in R

I have the following data
>d
2010-07-02
>t
835
I issue the following command
>dt<-paste(d,t)
>dt
"2010-07-02 835"
i then issue the following command, and it returns NA as below:
>dtt<-as.POSIXlt(strptime(dt,'%Y-%m-%d %H%M'))
>dtt
NA
so I made the following change
>t=1001
and now when I run
dt followed by dtt, it works fine, returning
dtt
"2010-07-02 10:01:00"
so, it seems to me that it is having a problem with the first digit of the hour being a 0, which is why when HHMM is less than 1000, it generates NA. can anyone please suggest how I fix this. thanks!
Use sprintf to format your time string before you pass it to as.POSIXlt:
d <- "2010-07-02"
h <- 835
dtt <- sprintf("%s %04d", d, h)
as.POSIXlt(dtt, format="%Y-%m-%d %H%M")
[1] "2010-07-02 08:35:00"
The string "%s %04d" tells sprintf to concatenate d as a string (%s) and h as a fixed width string of length 4, with leading zeroes (%04d).

Resources