I'm converting a local R script to make use of the RevoScaleR functions in the Revolution R (aka Microsoft R Client/Server) package, in order to scale better with large amounts of data.
The goal is to create a new column that numbers the rows per group. Using data.table this would be achieved using the following code:
library(data.table)
eventlog[,ActivityNumber := seq(from=1, to=.N, by=1), by=Case.ID]
For illustration purposes, the output is something like this:
  Case.ID ActivityNumber
1       A              1
2       A              2
3       B              1
4       C              1
5       C              2
6       C              3
After some research into doing this with the rx functions, I found the package dplyrXdf, which is basically a wrapper for using dplyr functions on Xdf-stored data while still benefitting from the optimized functions of RevoScaleR (see http://blog.revolutionanalytics.com/2015/10/using-the-dplyrxdf-package.html).
In my case, this would lead to the following:
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_len(n()))
However, this leads to the following error:
ERROR: Attempting to add a variable without a name to an analysis.
Caught exception in file: CxAnalysis.cpp, line: 3756. ThreadID: 1248 Rethrowing.
Caught exception in file: CxAnalysis.cpp, line: 5249. ThreadID: 1248 Rethrowing.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Error in executing R code: ERROR: Attempting to add a variable without a name to an analysis.
Any ideas how to solve this error? Or other (better?) approaches to get the requested result?
Thanks to @Matt-parker for pointing me to this question.
Note that n() is not a regular R function, although it looks like one. It needs to be implemented specially for each data source, and maybe also separately for each of mutate, summarise and filter.
Right now, the only usage of n that is supported for xdf files is within summarise, to count the number of rows. Implementing it for the other verbs is actually nontrivial.
In particular, there is a problem with Matt's use of seq_along to implement n's functionality. Remember that xdf files are block-structured: each chunk of rows is read in and processed independently of other chunks. This means that the sequence generated is for that chunk of rows only, and not for all the rows in a group. If a group spans more than one chunk, the sequence numbers will restart in the middle.
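Here is a plain-R sketch (no RevoScaleR needed) of that failure mode, with a chunk size of 3 so that group "A" spans two chunks:
ids <- c("A", "A", "A", "A", "B")                  # group "A" has 4 rows
chunks <- split(ids, ceiling(seq_along(ids) / 3))  # chunk size 3, like xdf blocks
lapply(chunks, function(ch) ave(seq_along(ch), ch, FUN = seq_along))
# chunk 1: 1 2 3    chunk 2: 1 1  -- "A"'s 4th row restarts at 1 instead of 4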
The way to get correct sequence numbers is to keep a running count of how many rows you've read in for that group, and update it each time a chunk is processed. You can do this with a transformFunc, which you pass to transmute via the .rxArgs argument:
ev <- eventlog %>% group_by(Case.ID) %>% transmute(.rxArgs = list(
    transformFunc = function(varList) {
        n <- .n + seq_along(varList[[1]])
        if(!.rxIsTestChunk)  # needed because rxDataStep does a test run on the 1st 10 rows
            .n <<- n[length(n)]
        list(n = n)
    },
    transformObjects = list(.n = 0)))
This should work with the local, localpar and foreach compute contexts. It may not work (or at least won't give a reproducible result) with any context where you can't guarantee that rxDataStep will process the rows in a deterministic order -- so MapReduce, Spark, Teradata or similar.
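For reference, a hedged sketch (assuming the standard RevoScaleR API) of pinning the computation to a deterministic context before running the pipeline:
# Process chunks locally and sequentially, so row order is deterministic
rxSetComputeContext("local")  # equivalently: rxSetComputeContext(RxLocalSeq())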
I'm not sure why this works, but try using seq_along(Case.ID) instead of seq_len(n()):
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
It seems to be some problem with n(). Here's my exploratory code, in case anyone else wants to experiment:
options(stringsAsFactors = FALSE)
library(dplyrXdf)
# Set up some test data
eventlog_df <- data.frame(Case.ID = c("A", "A", "A", "A", "A", "B", "C", "C", "C"))
# Add a variable for artificially splitting the XDF into small chunks
eventlog_df$Chunk.ID <- factor((seq_len(nrow(eventlog_df)) + 2) %/% 3)
# Check the results
eventlog_df
# Now read it into an XDF file. I'm going to read just three rows in at a time
# so that the XDF file has several chunks, so we can be confident this works
# across chunks
eventlog <- tempfile(fileext = ".xdf")
for(i in 1:3) {
rxImport(inData = eventlog_df[eventlog_df$Chunk.ID %in% i, ],
outFile = eventlog,
colInfo = list(Case.ID = list(type = "factor",
levels = c("A", "B", "C"))),
append = if (file.exists(eventlog)) "rows" else "none")
}
# Convert to a proper data source
eventlog <- RxXdfData(eventlog)
rxGetInfo(eventlog, getVarInfo = TRUE, numRows = 10)
# Now to dplyr. First, let's make sure it can count up the records
# in each group without any trouble.
result <- eventlog %>%
group_by(Case.ID) %>%
summarise(ActivityNumber = n())
# It can:
rxDataStep(result)
# Now if we switch to mutate, does n() still work?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = n())
# No - and it seems to be complaining about missing variables. So what if
# we try to refer to a variable we *know* exists?
result <- eventlog %>%
group_by(Case.ID) %>%
mutate(ActivityNumber = seq_along(Case.ID))
# It works
rxDataStep(result)
dplyr and dplyrXdf have a tally method that counts items per group:
result <- eventlog %>%
group_by(Case.ID) %>%
tally()
If you want to do more than just tabulate the records per group, you can use summarize (since you didn't show your data, I'm using a hypothetical column called delay, which I'm assuming is numeric for illustrative purposes):
result <- eventlog %>%
group_by(Case.ID) %>%
summarize(counts = n(),
ave_delay = mean(delay))
You could do the above with regular RevoScaleR functions:
rxCrossTabs(~ Case.ID, data = eventlog)
and for the second example:
rxCube(delay ~ Case.ID, data = eventlog)
In a user-created function I would like to do some data.table transformations; in particular, I would like to create a new column with the ':=' command.
Assume I want to make a new column called Sex that capitalizes the first letter of the column df$sex in my example data.frame df.
The output of my prepare function should be a data.table with the same name as before, but with the additional "capitalized" column.
I have tried several approaches; however, I always get the following warning (and no correct output):
Warning message:
In `[.data.table`(x, , `:=`(Sex, stringr::str_to_title(sex))) :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.
library(data.table)
library(magrittr)
library(stringr)
df <- data.frame("age" = c(17, 04),
sex = c("m", "f"))
df %>% setDT()
is.data.table(df)
This is the easiest way to write my function:
prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(df)
#--> WARNING. (as block quoted above)
prepare2<-function(x){
x[, `:=`(Sex, stringr::str_to_title(sex))]
}
prepare2(df)
#--> WARNING. (as block quoted above)
prepare3<-function(x){
require(data.table)
y <-as.data.table(list(x))
y <- y[,Sex:=stringr::str_to_title(sex)]
x <<- y
}
prepare3(df)
The last version does NOT throw the warning, but it creates a new dataset called x. However, I wanted to overwrite the dataset I passed into the function (if I have to go that way at all).
From the ':=' help file I also know I can use set; however, I am not able to adapt the command appropriately. In case that could cure my problem, I am happy to receive help on that, too! set(x, i = NULL, Sex, str_to_title(sex)) is apparently wrong ...
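From the signature on the help page, set(x, i, j, value), I gather the column would have to be given as a string in j; a sketch of what I think the call should look like (untested on my side):
set(df, i = NULL, j = "Sex", value = stringr::str_to_title(df$sex))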
Upon request, and to make the discussion in the comments clearer, here is how my code produces the problem:
library(data.table)
library(stringr)
df <- data.frame("age" = c(17, 04),
sex = c("m", "f"))
GetLastAssigned <- function(match = "<- *data.frame",
remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
get(sub(remove, "", match[length(match)]))
}
#ok, no need for magrittr
setDT(GetLastAssigned())
#check the last function worked
is.data.table(df)
prepare1<-function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
prepare1(GetLastAssigned())
# I get a warning and it does not work.
prepare1(df)
# I get a warning and it does not work, either.
#If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation.
A workaround along the OP's lines:
library(data.table)
library(stringr)
GetLastAssigned2 <- function(match = "<- *data.frame", remove = " *<-.*") {
f <- tempfile()
savehistory(f)
history <- readLines(f)
unlink(f)
match <- grep(match, history, value = TRUE)
nm <- sub(remove, "", match[length(match)])
list(nm = as.name(nm), addr = address(get(nm)))
}
prepit <- function(x){
x[,Sex:=stringr::str_to_title(sex)]
}
# usage
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
str(df) # it seemingly works, since there is a selfref
# usage 2
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
setDT(df)
prepit(df)
str(df) # works
# usage 3
df <- data.frame("age" = c(17, 04), sex = c("m", "f"))
z <- GetLastAssigned2()
eval(substitute(setDT(x), list(x=z$nm)))
eval(substitute(prepit(x), list(x=z$nm)))
str(df) # works
Some big caveats:
savehistory is only effective in interactive use, based on my reading of the docs
using regex on human input (code typed in interactively) is complicated and risky
even this workaround will fail if the data.table x passed to prepit has not had enough space pre-allocated for extra columns
The data.table interface is based on passing the name/symbol of the data.frame or data.table rather than the value (which is what get provides), as explained by Arun, one of the data.table authors. Note that the address cannot be passed around either: z$addr soon fails to match address(df) in all of the examples above.
If I manually type setDT(df) everything works fine but I cannot type the "right" dfs at all the places where I need to do this transformation.
One idea:
# helper to compose expressions
subit = function(cmd, df_nm)
do.call("substitute", list(cmd, list(x=as.name(df_nm))))
# list of expressions with x where the df name belongs
my_cmds = list(
setDT = quote(setDT(x)),
prepit = quote(x[,Sex:=stringr::str_to_title(sex)])
)
# usage 4
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df" # somehow get this... hopefully not via regex of command history
eval(subit(my_cmds$setDT, df_nm))
eval(subit(my_cmds$prepit, df_nm))
# usage 5
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
df_nm = "df"
for(ex in lapply(my_cmds, subit, df_nm = df_nm)) eval(ex)
I think this is more aligned with recommended programmatic usage of data.table.
There is probably some way to wrap this in a function by altering the envir= argument to eval() but I'm not knowledgeable about that.
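For what it's worth, a hedged sketch of that idea: evaluate the composed calls with envir = parent.frame(), so the data.frame name resolves in the caller's environment where the object actually lives.
# relies on subit() and my_cmds defined above
prep_by_name = function(df_nm, env = parent.frame()) {
  for(ex in lapply(my_cmds, subit, df_nm = df_nm)) eval(ex, envir = env)
}

# usage 6
df = data.frame("age" = c(17, 04), sex = c("m", "f"))
prep_by_name("df")
str(df) # should work for the same reason usage 5 does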
Regarding how to get the name of the assignment target in nm <- data.frame(...), it looks like there are no good options. Maybe see "How do I access the name of the variable assigned to the result of a function within the function?" or "Get name of x when defining the `(<-` operator".
I am trying to create a large list of file URLs by concatenating various pieces together. (Say, ~40 file URLs which represent multiple data types for each of the 50 states.) Eventually, I will download and then unzip/unrar these files. (I have working code for that portion of it.)
I'm very much an R noob, so please bear with me, here.
I have a set of data frames:
states - list of 50 state abbreviations
partial_url - a partial URL for the 50 states
url_parts - a list of each of the remaining URL pieces (5 file types to download)
year
filetype
I need a URL that looks like this:
http://partial_url/state_urlpart_2017_file.csv.gz
I was able to build the partial_url data frame with the following:
for (i in seq_along(states)) {
url_part1 <- as.data.frame(paste0(url,states[[i]],"/",dir,"/"))
}
I was hoping that some kind of nested loop might work to do the rest, like so:
for (i in 1:partial_url){
for (j in 1:url_parts){
for(k in 1:states){
url_part2 <- as.data.frame(paste0(partial_url[[i]],"/",url_parts[[j]],states[[k]],year,filetype))
}}}
Can anyone suggest how to proceed with the final step?
As far as I understand, everything the OP needs can be handled by the paste0 function itself: paste0 is vectorised, so the for-loops shown by the OP are not needed. The data in my example are stored as vectors, but they could equally be columns of a data.frame.
For example:
states <- c("Alabama", "Colorado", "Georgia")
partial_url <- c("URL_1", "URL_2", "URL_3")
url_parts <- c("PART_1", "PART_2", "PART_3")
year <- 2017
fileType <- "xls"
#Now use paste0 will list out all the URLS
paste0(partial_url,"/",url_parts,states,year,fileType)
#[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls"
#[3] "URL_3/PART_3Georgia2017xls"
EDIT: handling multiple values of fileType, based on feedback from @Onyambu.
We can use rep(fileType, each = length(states)) to support multiple file types.
The solution looks like this:
fileType <- c("xls", "doc")
paste0(partial_url,"/",url_parts,states,year,rep(fileType,each = length(states)))
[1] "URL_1/PART_1Alabama2017xls" "URL_2/PART_2Colorado2017xls" "URL_3/PART_3Georgia2017xls"
[4] "URL_1/PART_1Alabama2017doc" "URL_2/PART_2Colorado2017doc" "URL_3/PART_3Georgia2017doc"
Here is a tidyverse solution with some simple example data. The approach is to use complete to give yourself a data frame with all possible combinations of your variables. This works because if you make each variable a factor, complete will include all possible factor levels, even ones that don't appear in the data. This makes it easy to combine your five url parts even though they have different numbers of rows (e.g. 50 states but only 5 file types). unite allows you to join columns together as strings, so we call it three times to get the right separators, and then finally add the http:// prefix with mutate.
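A micro-demonstration (toy data) of that factor behaviour, since it is the crux of the approach: complete() expands to all factor levels, not just the values that appear.
library(tidyr)
d <- tibble::tibble(f = factor("a", levels = c("a", "b")))
complete(d, f)
# 2 rows: "a" and "b", even though "b" never appears in the data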
Re: your for loop, I find nested for-loop logic hard to work through at the best of times. But at least two issues with it as written are that you have 1:partial_url where you mean 1:length(partial_url) (and similarly for the inner loops), and that you are simply overwriting the same object on every pass of the loop. I prefer to avoid loops unless a problem absolutely requires them.
library(tidyverse)
states <- tibble(state = c("AK", "AZ", "AR", "CA", "NY"))
partial_url <- tibble(part = c("part1", "part2"))
url_parts <- tibble(urlpart = c("urlpart1", "urlpart2"))
year <- tibble(year = 2007:2010)
filetype <- tibble(filetype = c("csv", "txt", "tar"))
urls <- bind_cols(
states = states[[1]] %>% factor() %>% head(2),
partial_url = partial_url[[1]] %>% factor() %>% head(2),
url_parts = url_parts[[1]] %>% factor() %>% head(2),
year = year[[1]] %>% factor() %>% head(2),
filetype = filetype[[1]] %>% factor() %>% head(2)
) %>%
complete(states, partial_url, url_parts, year, filetype) %>%
unite("middle", states, url_parts, year, sep = "_") %>%
unite("end", middle, filetype, sep = ".") %>%
unite("url", partial_url, end, sep = "/") %>%
mutate(url = str_c("http://", url))
print(urls)
# A tibble: 160 x 1
url
<chr>
1 http://part1/AK_urlpart1_2007.csv
2 http://part1/AK_urlpart1_2007.txt
3 http://part1/AK_urlpart1_2008.csv
4 http://part1/AK_urlpart1_2008.txt
5 http://part1/AK_urlpart1_2009.csv
6 http://part1/AK_urlpart1_2009.txt
7 http://part1/AK_urlpart1_2010.csv
8 http://part1/AK_urlpart1_2010.txt
9 http://part1/AK_urlpart2_2007.csv
10 http://part1/AK_urlpart2_2007.txt
# ... with 150 more rows
Created on 2018-02-22 by the reprex package (v0.2.0).
I need help to modify my code to do the following tasks... I've used help from the following questions and answers thus far
Opening all files in a folder, and applying a function
How to assign a unique ID number to each group of identical values in a column
Here are the things I hope to be able to do with my code:
I need to read in several files from a folder
I would like to use the name of each file in the folder to add a column. I was able to do this simply with 'mutate', but only for a single file
I would like to save the result for each file separately and also combine them into a single file
I also want to keep the code for reading the files separate from the function, so I can apply it to other projects
I'm trying to avoid using 'loop' statements
Here is a sample of my incomplete code, which gives an error:
library(tidyverse)
library(readr)
cleaningdata<- function(data){
data$Label<-gsub(".tif", "", data$Label)
data %>% select(Label:Solidity) %>%group_by(Label)%>%
mutate(view = seq_along(Label), Station="T1-1")%>%
rename(Species = Label)%>%
mutate(view = recode(view, "1" = "a","2" = "b","3" = "c"))
}
filenames <- list.files("Data", pattern="*.txt", full.names=TRUE)
ldf <- lapply(filenames, read_tsv)
res <- lapply(ldf, cleaningdata)
Here is a sample of my dataset (Data Folder), and below is my work thus far.
The fs package contains the useful dir_map function, which applies a function to each file in the path. If you need more control over the files to use, you could alternatively pipe a vector of the filenames into purrr::map() instead.
Your error (Warning message: Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively or supply .default) occurred because you were recoding 1, 2, 3 to a, b, c, but one of the Species had 6 rows, so 4, 5, 6 were recoded to NA. I've used letters[n] to avoid this problem.
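The failure mode in miniature (hypothetical values): any value without a listed replacement becomes NA.
dplyr::recode(1:4, "1" = "a", "2" = "b", "3" = "c")
# [1] "a" "b" "c" NA   -- plus the warning quoted above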
library(tidyverse)
library(fs)
result <- dir_map(path = 'Data', fun = function(filepath) {
read_tsv(filepath) %>%
select(-1) %>%
rename(Species = Label) %>%
mutate(Species = sub('.tif$', '', Species)) %>%
group_by(Species) %>%
mutate(
View = seq_along(Species),
View = letters[View], # a, b, c, etc. instead of 1, 2, 3, etc.
Station = sub('.txt$', '', basename(filepath))
)
})
# get rows from second file
result[[2]]
# bind rows from all files
result %>% bind_rows()
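A sketch of the purrr::map() route mentioned above, in case you want to build and filter the file list yourself (same cleaning pipeline assumed as in the dir_map call):
filenames <- list.files("Data", pattern = "\\.txt$", full.names = TRUE)
result <- map(filenames, function(filepath) {
  read_tsv(filepath) %>%
    select(-1) %>%
    rename(Species = Label)
  # ... then the same group_by()/mutate() steps as above
})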
Short version: when executing the following command qtm(World, "amount") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value =
c(653989.801201595, : replacement has 177 rows, data has 175
Disclaimer: this is the same problem I used to have in this question, but if I'm not wrong, in that one the problem was that one value in the left dataframe's key column matched several rows in the right one, and hence I needed to group the right dataframe. In this case, I am pretty sure that I do not have the same problem, as can be seen from the code below:
library(tmap)
library(tidyr)
# Read tmap's world map.
data("World")
# Load my dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/ad106eda807f710a6f331084ea091513/raw/dc9b51bfc73f09610f199a5a3267621874606aec/tmap.sample.dataframe.csv",
na = "")
# Compare the countries in df that do not match with World's
# SpatialPolygons.
df$iso_a3 %in% World$iso_a3
# Return rows which do not match
selected.countries = df$iso_a3[!df$iso_a3 %in% World$iso_a3]
df.f = filter(df, !(iso_a3 %in% selected.countries))
# Verification.
df.f$iso_a3[!df.f$iso_a3 %in% World$iso_a3]
World@data = World@data %>%
left_join(df.f, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
qtm(World, "amount")
My guess is that the clue may be the fact that the column I am using when joining both dataframes has different levels (hence it is converted to string), but I'm ashamed to admit that I still don't understand the error that I am having here. I'm assuming I have something wrong with my dataframe, although I have to admit that it didn't work even with a smaller dataframe:
selected.countries2 = c("USA", "FRA", "ITA", "ESP")
df.f2 = filter(df, iso_a3 %in% selected.countries2)
df.f2$iso_a3 = droplevels(df.f2$iso_a3)
World@data = World@data %>%
left_join(df.f2, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
World$iso_a3 = droplevels(World$iso_a3)
qtm(World, "amount")
Can anyone help me by pointing out what's causing this error? (Providing a solution would also be much appreciated.)
Edited: it is again your data. Check the frequency of each country code:
table(df$iso_a3)
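Country codes appearing more than once are the ones that inflate the join:
counts <- table(df$iso_a3)
counts[counts > 1]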
Short version: when executing the following command qtm(countries, "freq") I get the following error message:
Error in `$<-.data.frame`(`*tmp*`, "SHAPE_AREAS", value =
c(652270.070308042, : replacement has 177 rows, data has 210
Disclaimer: I have already checked other answers like this one or this one, as well as this explanation stating that this error usually comes from misspelled object names, but I could not find an answer to my problem.
Reproducible code:
library(rgdal)
library(dplyr)
library(tmap)
# Load JSON file with countries.
countries = readOGR(dsn = "https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/a6f69b6c3b4a75b02858e966b9d36c85982cbd32/countries.geojson")
# Load dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/fc26d8bb7e777488b446fbaad1e6ea63/raw/754ea37e4aba1b7ed88eaebd2c75fd4afcc54c51/sample-dataframe.csv")
countries@data = left_join(countries@data, df, by = c("iso_a2" = "country_code"))
qtm(countries, "freq")
Your error is in the data - the code works fine.
What is happening right now is:
1) you attempt a 1:1 match
2) but your .csv data contains several rows per id
3) the left join then multiplies each left-hand row by all of its matches on the right-hand side (see the small illustration after this list)
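A tiny illustration of point 3, with hypothetical toy data:
left  <- data.frame(id = c("ES", "FR"))
right <- data.frame(id = c("ES", "ES", "FR"), freq = 1:3)
dplyr::left_join(left, right, by = "id")  # 3 rows, not 2: "ES" matched twice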
To avoid this issue you have to aggregate your data one more time, like:
library(dplyr)
df_unique = df %>%
group_by(country_code, country_name) %>%
summarize(total = sum(total), freq = sum(freq))
#after that you should be fine - as long as just adding up the data is okay.
countries@data = left_join(countries@data, df_unique,
                           by = c("iso_a2" = "country_code"))
qtm(countries, "freq")