Trying to aggregate queried data using a for loop in R

I am having a little trouble creating a loop that (1) runs sqlQuery for every device in the variable "sensorname" (roughly 30 elements now, though this will increase in the future), and (2) takes the data table returned for each device and appends it to a single data frame, "data1".
Below are my sample loop and a sample of what data1 looks like, which is "correct" but not complete. LSF20_3a0925 is the last element of sensorname, so essentially the loop runs 30 times, overwriting data1 on each pass, and only the final iteration survives.
library(RODBC)
ch <- odbcConnect("SweetLab", uid='---', pwd='------')
sqlQuery(ch, "use SweetDatabase")
sensorname <- sqlQuery(ch,paste("SELECT site_device.code
FROM site_device, device
WHERE site_device.did=device.id AND
device.name='LSF20'
LIMIT 0, 1000;",
sep="")
)
for (k in 1:length(sensorname[[1]])) {
  sqlQuery(ch, "use SweetAnalysis")
  sql <- na.omit(sqlQuery(ch, paste("SELECT * FROM ", sensorname[[1]][k], "_Events", sep = "")))
  if (is.null(sql)) {
    return(NULL)
  }
  data1 <- merge(sensorname[[1]][k], sql)
}
#############################################
data1
             x row_names PeaksP1Time  PeaksP1
1 LSF20_3a0925        24  1346781683 5.076920
2 LSF20_3a0925        31  1358444323 0.043240
3 LSF20_3a0925        13  1358444463 0.133170
4 LSF20_3a0925        12  1358445120 5.286443
Any help would be most appreciated. I am new to writing code in general, so please excuse me if this is a dumb question; I searched around for a bit but honestly wasn't sure how to phrase the search.

After some tweaking this looks great!
library(RODBC)
ch <- odbcConnect("SweetLab", uid='***', pwd='******')
sqlQuery(ch, "use SweetDatabase")
sensorname <- sqlQuery(ch,paste("select site_device.code from site_device, device where site_device.did=device.id and device.name='LSF20' LIMIT 0, 1000;",sep=""));
sqlQuery(ch, "use SweetAnalysis")
Datalist <- lapply(sensorname[[1]], function(x) {
  query <- paste("SELECT PeaksP1Time,PeaksP1 FROM ", x, "_Events", sep = "")
  dat <- na.omit(sqlQuery(ch, query))
  list(dat)                          # each element: a one-item list holding the data frame
})
names(Datalist) <- sensorname$code   # name each element after its device code
close(ch)
Looking at the structure of this list I get
> str(Datalist[1:3])
List of 3
$ LSF20_39ecf7:List of 1
..$ :'data.frame': 306 obs. of 2 variables:
.. ..$ PeaksP1Time: num [1:306] 1.35e+09 1.35e+09 1.35e+09 1.35e+09 1.35e+09 ...
.. ..$ PeaksP1 : num [1:306] 4.5 4.379 0.706 3 0 ...
$ LSF20_39cd3e:List of 1
..$ :'data.frame': 202 obs. of 2 variables:
.. ..$ PeaksP1Time: num [1:202] 1.35e+09 1.35e+09 1.35e+09 1.35e+09 1.35e+09 ...
.. ..$ PeaksP1 : num [1:202] 0.664 3.235 5.765 4.636 2.936 ...
.. ..- attr(*, "na.action")=Class 'omit' Named int [1:24] 203 204 205 206 207 208 209 210 211 212 ...
.. .. .. ..- attr(*, "names")= chr [1:24] "203" "204" "205" "206" ...
$ LSF20_3a09ac:List of 1
..$ :'data.frame': 42 obs. of 2 variables:
.. ..$ PeaksP1Time: num [1:42] 1.35e+09 1.35e+09 1.35e+09 1.35e+09 1.35e+09 ...
.. ..$ PeaksP1 : num [1:42] 5.589 2.897 2.713 1.706 0.831 ...
Now I'm moving on to the next phase, which is graphing multiple sets at the same time.
My problem is how to tell R to graph the data within each list element, or within specific elements. I have a save file of the work history if anyone wants something reproducible.
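For reference, here is a minimal sketch of the kind of per-device plot I mean, with the element name taken from the str() output above; what I can't work out is how to do this across Datalist or a chosen subset of it:
## Plot the events of a single device; "LSF20_39ecf7" is the first
## element shown in the str() output above
d <- Datalist[["LSF20_39ecf7"]][[1]]
plot(d$PeaksP1Time, d$PeaksP1,
     xlab = "PeaksP1Time", ylab = "PeaksP1",
     main = "LSF20_39ecf7")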

OK, without a reproducible example it is not entirely clear what you are asking. Here is how I would do it:
1. Open the connection.
2. Loop using lapply to create a list.
3. Close the connection.
4. Bind the list elements into a single data.frame. I assume the different sensor tables have the same columns.
ch <- odbcConnect("SweetLab", uid='---', pwd='------')
ll <- lapply(sensorname[[1]], function(x) {
  query <- paste("SELECT * FROM ", x, "_Events", sep = "")
  dat <- na.omit(sqlQuery(ch, query))
  data.frame(sensor = x, dat)  # tag each row with its sensor name
})
close(ch)
data1 <- do.call(rbind,ll)

This is resource-intensive, but it will work.
# before your for loop
results <- list()

# inside your for loop (header taken from your question)
for (k in 1:length(sensorname[[1]])) {
  ## ... your existing query code ...
  results[[k]] <- sql
}

# after your for loop
Data1 <- do.call(rbind, results)  # if all tables share the same schema
# OR
Data1 <- Reduce(merge, results)   # if the schemas differ; merge pairwise,
                                  # since merge() only takes two frames at a time

Related

R Function behaves differently than the code entered line by line

I am at a loss. Googling has failed me because I'm not sure I know the right question to ask.
I have a data frame (df1) and my goal is to use a function to get a moving average using forecast::ma.
Here is str(df1):
'data.frame': 934334 obs. of 6 variables:
 $ clname  : chr ...
 $ dos     : Date, format: "2011-10-05" ...
 $ subpCode: chr ...
 $ ch1     : chr ...
 $ prov    : chr ...
 $ ledger  : chr ...
Here is the function I am trying to write:
process <- function(df, y, sub, ...) {
  prog <- df %>%
    filter(subpCode == sub) %>%
    group_by(dos, subpCode) %>%
    summarise(services = n())
  prog$count_ts <- ts(prog[, c('services')])
}
The problem is that when I run the function, the final result is a 1x1798 object that is just a time series. If I run the code line by line I get what I need, but my function, which hypothetically does the same thing, won't work.
Here is my desired result
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1718 obs. of 4 variables:
$ dos : Date, format: "2010-09-21" "2010-11-18" "2010-11-19" "2010-11-30" ...
$ subpCode: chr "CII " "CII " "CII " "CII " ...
$ services: int 1 1 2 2 2 2 1 2 1 3 ...
$ count_ts: Time-Series [1:1718, 1] from 1 to 1718: 1 1 2 2 2 2 1 2 1 3 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "services"
- attr(*, "vars")= chr "dos"
- attr(*, "drop")= logi TRU
And here is the code that gets it.
CII <- df1 %>%
  filter(subpCode == "CII ") %>%
  group_by(dos, subpCode) %>%
  summarise(services = n())
CII$count_ts <- ts(CII[, c('services')])
Could someone point me in the right direction? I've exhausted my usual sources.
Thanks!
Following the vignette pointed out by @CalumYou, you should use something more like this. Note that your original function returned the value of its last expression, the prog$count_ts <- ... assignment, which is just the time series; that is why you got a 1x1798 object. The explicit return(prog) below fixes that.
process <- function(df, sub) {
  ## Enquote sub so it can be injected into filter()
  sub <- enquo(sub)
  ## Piping stuff
  prog <- df %>%
    filter(subpCode == !!sub) %>%
    group_by(dos, subpCode) %>%
    summarise(services = n())
  prog$count_ts <- ts(prog[, c('services')])
  ## Return the prog object, not just the last assignment
  return(prog)
}
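A hypothetical call, mirroring the line-by-line code from the question:
## Should reproduce the CII data frame built line by line above
CII <- process(df1, "CII ")
str(CII)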

Number of records limited to 1000 (R client on ElasticSearch)

I'm querying ElasticSearch from R (using the elastic client), and it seems I can't retrieve more than 1000 records, even though $hits$total returns 4187.
Here are the first lines of str() applied to the returned list:
List of 4
$ took : int 1844
$ timed_out: logi FALSE
$ _shards :List of 3
..$ total : int 692
..$ successful: int 692
..$ failed : int 0
$ hits :List of 3
..$ total : int 4187
..$ max_score: num 1
..$ hits :List of 1000
If I limit my query to fewer records, I don't get this problem. Here are the first lines of str() applied to the returned list, where $hits$total equals the length of the returned $hits$hits list:
List of 4
$ took : int 157
$ timed_out: logi FALSE
$ _shards :List of 3
..$ total : int 692
..$ successful: int 692
..$ failed : int 0
$ hits :List of 3
..$ total : int 13
..$ max_score: num 1
..$ hits :List of 13
I guess this could be due to some configuration parameter, since the limit is so exact. How can I avoid this limitation and access all of the records?
EDIT: (information added)
The body parameter is:
bdy <- '{
"id": "getKpiHistMetric",
"params": {"KpiKey":"Agg:Net|SL,Wind:10min,Net:PT,SL:1,Metric:Rate",
"from": "2016-11-01T00:00:00",
"to": "2016-11-30T23:59:59"}
}'
The error appears when iterating over the whole list. First I create an empty data.frame:
df <- data.frame(DATE = integer(),
TICK = integer(),
VALUE = double(), stringsAsFactors = FALSE)
And then I fill it:
for (i in 1:q$hits$total) {
  a <- as.Date(as.POSIXct(q$hits$hits[[i]]$`_source`$Timestamp/1000,
                          origin = "1970-01-01"), format = "%m/%d/%y")
  b <- strftime(as.POSIXct(round(q$hits$hits[[i]]$`_source`$Timestamp/1000, -1),
                           origin = "1970-01-01"), format = "%H:%M")
  c <- q$hits$hits[[i]]$`_source`$Value
  df.row <- data.frame(DATE = a, TICK = b, VALUE = c, stringsAsFactors = FALSE)
  df <- rbind(df, df.row)
}
At this point I receive the following error:
Error in q$hits$hits[[i]] : subscript out of bounds
At this point, i = 1001.
The number of records was limited to 1000 by the ElasticSearch configuration; this was the answer I got from the database maintenance team.
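Separately, the subscript error itself can be avoided by iterating only over the hits actually returned rather than over $hits$total. A minimal sketch, reusing the extraction code from the question above:
## Iterate only over the hits actually returned, so the code can never
## index past the end of q$hits$hits
rows <- lapply(q$hits$hits, function(h) {
  ts_sec <- h$`_source`$Timestamp / 1000  # epoch milliseconds -> seconds
  data.frame(
    DATE  = as.Date(as.POSIXct(ts_sec, origin = "1970-01-01")),
    TICK  = strftime(as.POSIXct(round(ts_sec, -1), origin = "1970-01-01"),
                     format = "%H:%M"),
    VALUE = h$`_source`$Value,
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)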

Dynamically Changing Data Type for a Data Frame

I have a set of data frames, one per country, each consisting of 3 variables (year, AI, OAD). The example for Zimbabwe is shown below:
>str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: chr "1955" "1956" "1957" "1958" ...
$ AI : chr "11.61568161" "11.34114927" "11.23639317" "11.18841409" ...
$ OAD : chr "5.740789488" "5.775882473" "5.800441036" "5.822536579" ...
I am trying to change the data types of the variables to the following, so that I can fit a linear model using lm(dframe_Zimbabwe_1955_1970$AI ~ dframe_Zimbabwe_1955_1970$year):
>str(dframe_Zimbabwe_1955_1970)
'data.frame': 16 obs. of 3 variables:
$ year: int 1955 1956 1957 1958 ...
$ AI : num 11.61568161 11.34114927 11.23639317 11.18841409 ...
$ OAD : num 5.740789488 5.775882473 5.800441036 5.822536579 ...
The static code below is able to change AI from character (chr) to numeric (num):
dframe_Zimbabwe_1955_1970$AI <- as.numeric(dframe_Zimbabwe_1955_1970$AI)
However, when I try to automate the code as below, AI still remains character (chr):
countries <- c('Zimbabwe', 'Afghanistan', ...)
for (country in countries) {
  assign(paste('dframe_', country, '_1955_1970$AI', sep = ''),
         eval(parse(text = paste('as.numeric(dframe_', country, '_1955_1970$AI)', sep = ''))))
}
Can you advise on what I could have done wrong?
Thanks.
42: Your code doesn't work as written, but with some edits it will. In addition to the missing parentheses and the wrong sep, you can't use $'column name' in assign, but you don't need it anyway:
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps the data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
  remove(new_val)
}
Proof that it works:
dframe_Zimbabwe_1955_1970 <- data.frame(year = c("1955", "1956", "1957"),
                                        AI = c("11.61568161", "11.34114927", "11.23639317"),
                                        OAD = c("5.740789488", "5.775882473", "5.800441036"),
                                        stringsAsFactors = FALSE)
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: chr "1955" "1956" "1957"
$ AI : chr "11.61568161" "11.34114927" "11.23639317"
$ OAD : chr "5.740789488" "5.775882473" "5.800441036"
countries <- 'Zimbabwe'
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps the data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
  remove(new_val)
}
str(dframe_Zimbabwe_1955_1970)
'data.frame': 3 obs. of 3 variables:
$ year: num 1955 1956 1957
$ AI : num 11.6 11.3 11.2
$ OAD : num 5.74 5.78 5.8
It's going to be considered fairly ugly code by the purists, but perhaps this:
for (country in countries) {
  new_val <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  new_val[] <- lapply(new_val, as.numeric)  # the '[]' on the LHS keeps the data frame
  assign(paste('dframe_', country, '_1955_1970', sep = ''), new_val)
}
Using the get('obj_name') function is considered cleaner than eval(parse(text = ...)). This would all be handled more naturally in R had you assembled these data frames in a list, as sketched below.
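For instance, a minimal sketch of the list-based approach, assuming the same dframe_<country>_1955_1970 naming scheme as above:
## Gather the per-country data frames into one named list, converting as we go
dframes <- lapply(setNames(countries, countries), function(country) {
  df <- get(paste('dframe_', country, '_1955_1970', sep = ''))
  df[] <- lapply(df, as.numeric)  # convert every column, keep the data frame
  df
})
str(dframes$Zimbabwe)  # access each country by name from here on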

replacement has x rows, data has y - paste() function

I am trying to aggregate the following sample data:
latitude  | longitude | TotalGreenhouseGases | Amount   | Branch           | End Date
-37.80144 | 144.95402 | 42965.9868           | 32549.99 | Arts and Culture | 07/31/2013 12:00:00 AM
-37.80144 | 144.95402 | 43246.6716           | 32762.63 | Arts and Culture | 08/30/2013 12:00:00 AM
-37.80144 | 144.95402 | 21374.1264           | 16192.52 | Arts and Culture | 09/31/2013 12:00:00 AM
mapdata <- aggregate(cbind(TotalGreenhouseGases, Amount) ~ latitude + longitude,
                     data = dt2,
                     FUN = function(v) c(mn = sum(v), n = length(v)))
This produces 163 obs. of 4 variables. Now, to plot it on a map using plot.ly, I am trying to add text for hovering:
mapdata$hover <- paste(mapdata$TotalGreenhouseGases, "CO2 Emission ", '<br>', "Resource Consumption ", mapdata$Amount)
but this results in the following error:
Error in `$<-.data.frame`(`*tmp*`, "hover", value = c("264.06428571 CO2 Emission <br> Resource Consumption 200", :
replacement has 326 rows, data has 163
Can anyone let me know where I am going wrong? If this has been solved before, please provide a link.
I think the problem is that the way you created mapdata, both TotalGreenhouseGases and Amount end up as matrix columns holding two values each (mn and n).
> str(mapdata)
'data.frame': 1 obs. of 5 variables:
$ latitude : num -37.8
$ longitude : num 145
$ TotalGreenhouseGases: num [1, 1:2] 107587 3
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ Amount : num [1, 1:2] 81505 3
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
So if you want to use the sums in your paste() call, you need to index the "mn" column; if you need the sample size, use the "n" column. For example:
mapdata$hover <- paste(mapdata$TotalGreenhouseGases[, "mn"],
                       "CO2 Emission ", '<br>', "Resource Consumption ",
                       mapdata$Amount[, "mn"])
will give you
[1] "107586.7848 CO2 Emission <br> Resource Consumption 81505.14"

Reading durations

I have a CSV file containing times per competitor of each section of a triathlon. I am having trouble reading the data so that R can use it. Here is an example of how the data looks (I've removed some columns for clarity):
"Place","Division","Gender","Swim","T1","Bike","T2","Run","Finish"
1, "40-49","M","7:45","0:55","27:07","0:29","18:53","55:07"
2, "UNDER 18","M","5:41","0:28","30:41","0:28","18:38","55:55"
3, "40-49","M","6:27","0:26","29:24","0:40","20:16","57:11"
4, "40-49","M","7:57","0:35","29:19","0:23","19:20","57:32"
5, "40-49","M","6:28","0:32","31:00","0:34","19:19","57:51"
6, "40-49","M","7:42","0:30","30:02","0:37","19:11","58:02"
....
250 ,"18-29","F","13:20","3:23","1:06:40","1:19","38:00","2:02:40"
251 ,"30-39","F","13:01","2:42","1:02:12","1:20","43:45","2:02:58"
252 ,50 ,"F","20:45","1:33","58:09","3:17","40:14","2:03:56"
253 ,"30-39","M","13:14","1:14","DNF","1:11","25:10","DNF bike"
254 ,"40-49","M","10:04","1:41","56:36","2:32",,"D.N.F"
My first naive attempt to plot the data went like this.
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> pairs(~ Bike + Run + Swim, data=tri)
The times are not imported in a sensible way, so the charts don't make sense.
I have found the difftime type and have tried to use it to parse the times in the data file.
Some rows have DNF or similar in place of times; I'm happy for rows with unparseable times to be discarded. The times come in two formats, "%M:%S" and "%H:%M:%S".
I think I need to create a new data frame from the data, but I am having trouble parsing the times. This is what I have so far:
> tri <- read.csv(file.choose(), header=TRUE, as.is=TRUE)
> str(tri)
'data.frame': 254 obs. of 12 variables:
$ Place : num 1 2 3 4 5 6 7 8 9 10 ...
$ Race.. : num 237 274 268 226 267 247 264 257 273 272 ...
$ First.Name: chr ** removed names ** ...
$ Last.Name : chr ** removed names ** ...
$ Division : chr "40-49" "UNDER 18" "40-49" "40-49" ...
$ Gender : chr "M" "M" "M" "M" ...
$ Swim : chr "7:45" "5:41" "6:27" "7:57" ...
$ T1 : chr "0:55" "0:28" "0:26" "0:35" ...
$ Bike : chr "27:07" "30:41" "29:24" "29:19" ...
$ T2 : chr "0:29" "0:28" "0:40" "0:23" ...
$ Run : chr "18:53" "18:38" "20:16" "19:20" ...
$ Finish : chr "55:07" "55:55" "57:11" "57:32" ...
> as.numeric(as.difftime(tri$Bike, format="%M:%S"), units="secs")
This converts all the times that are under one hour, but the hours are interpreted as minutes for any times over an hour. Substituting "%H:%M:%S" for "%M:%S" parses times over an hour but produces NA otherwise. What is the best way to convert both types of times?
EDIT: Adding a simple example as requested.
> times <- c("27:07", "1:02:12", "DNF")
> as.numeric(as.difftime(times, format="%M:%S"), units="secs")
[1] 1627 62 NA
> as.numeric(as.difftime(times, format="%H:%M:%S"), units="secs")
[1] NA 3732 NA
The output I would like would be 1627 3732 NA
Here's a quick hack at a solution, although there may be a better one:
cdifftime <- function(x) {
  x2 <- gsub("^([0-9]+:[0-9]+)$", "00:\\1", x)  ## prepend 00: to %M:%S elements
  res <- as.difftime(x2, format = "%H:%M:%S")
  units(res) <- "secs"
  as.numeric(res)
}
times <- c("27:07", "1:02:12", "DNF")
cdifftime(times)
## [1] 1627 3732 NA
You can apply this to the relevant columns:
tri[4:9] <- lapply(tri[4:9],cdifftime)
A couple of notes from trying to replicate your example:
you may want to use na.strings = "DNF" to set "did not finish" values to NA automatically on import
you need to make sure strings are not read in as factors: (1) set options(stringsAsFactors = FALSE); (2) pass stringsAsFactors = FALSE to read.csv; or (3) pass as.is = TRUE, ditto. A sketch putting the pieces together follows below.
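Putting the pieces together, a sketch of the whole read-and-plot pipeline under those assumptions (the na.strings values are taken from the sample data in the question):
## Read the results; "did not finish" variants become NA on import
tri <- read.csv(file.choose(), as.is = TRUE,
                na.strings = c("DNF", "D.N.F", "DNF bike"))

## Convert the time columns to seconds using cdifftime() from above
time_cols <- c("Swim", "T1", "Bike", "T2", "Run", "Finish")
tri[time_cols] <- lapply(tri[time_cols], cdifftime)

pairs(~ Bike + Run + Swim, data = tri)  # the original plot, now on numeric seconds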
