How to speed up the process of transposing a data set - r

For my thesis I'm using an sql dump that is very large. So far, I managed to open small segments of the sql dump in R. However, all the data is structured as follows:
X
X
X
X
X
Y
Y
Y
X
Z
Z
Z
If I want to interpret the data more efficiently, the data should look like this:
XXXX
XYYY
XZZZ
To accomplish this, I've written a for loop that transposes the data. Unfortunately, due to the size of the data set (and my memory), this loop is really slow. Does anyone have an idea how to speed up the for loop or speed up the process in general? I've tried to use dcast/reshape, but it seems that these functions will not do the trick.
Right now, my code looks like this:
DATAclean <- data.table()
for (i in c(1:100)){
vector <- DATAtransed[,..i]
vector <- na.omit(vector)
StartCol <- seq(from = 1, to = (nrow(vector)), by = 4)
Sys.sleep(0.001)
print(i)
flush.console()
for (j in StartCol){
new_data <- vector[c(j:(j+3))]
new_data <- t(new_data)
DATAclean <- rbind(DATAclean, new_data)
}
}

Maybe you can try the base R option below
do.call(
rbind,
lapply(
unname(split(seq_along(data), ceiling(seq_along(data) / 4))),
function(k) data[k]
)
)
which gives
[,1] [,2] [,3] [,4]
[1,] "X" "X" "X" "X"
[2,] "X" "Y" "Y" "Y"
[3,] "X" "Z" "Z" "Z"
DATA
data <- structure(c("X", "X", "X", "X", "X", "Y", "Y", "Y", "X", "Z",
"Z", "Z"), .Dim = c(12L, 1L))
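If the values are already in a single vector and the number of non-missing values is a multiple of 4 (an assumption taken from the example, not something I can check against the real dump), filling a matrix row by row might avoid the repeated rbind calls entirely. A minimal sketch, with x as a made-up stand-in for one column of the data:

```r
# Toy input mirroring the question: one value per row, with an NA to drop.
x <- c("X", "X", "X", "X", NA, "X", "Y", "Y", "Y", "X", "Z", "Z", "Z")

# Drop missing values (same role as na.omit() in the original loop).
x <- x[!is.na(x)]

# Assumption: the remaining values form complete groups of 4.
stopifnot(length(x) %% 4 == 0)

# Fill a 4-column matrix row by row; the result is allocated once.
wide <- matrix(x, ncol = 4, byrow = TRUE)
print(wide)
```

Growing DATAclean with rbind inside a loop copies the accumulated object on every iteration, which is usually what makes such loops slow; matrix() builds the whole result in one step.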

Sometimes it can be more efficient (especially regarding memory) to work on large files in a line-based manner, e.g. using the usual suspects on the unix/linux command line, or perl, python...
Assuming your data has one column as in your example, and you want to concatenate every 4 rows without any separator, removing empty rows, you could do:
awk 'NF' infile | paste -d '' - - - - > outfile
Obviously, you could also wrap that in a system call in R to get the entire result into R:
system("awk 'NF' filename | paste -d '' - - - - ", intern=TRUE)
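If awk and paste are not available (for example on Windows), roughly the same steps could be done in R alone. This is only a sketch: the temporary file and its contents are made up for illustration, and it assumes the non-empty lines come in complete groups of 4.

```r
# Hypothetical input file: one value per line, with an empty line to skip.
infile <- tempfile(fileext = ".txt")
writeLines(c("X", "X", "", "X", "X", "X", "Y", "Y", "Y"), infile)

# Read all lines and drop the empty ones (the role of awk 'NF').
lines <- readLines(infile)
lines <- lines[nzchar(trimws(lines))]

# Assumption: complete groups of 4 lines.
stopifnot(length(lines) %% 4 == 0)

# Join every 4 lines without a separator (the role of paste -d '' - - - -).
groups <- matrix(lines, ncol = 4, byrow = TRUE)
joined <- apply(groups, 1, paste, collapse = "")
print(joined)
```

Unlike the command-line version, this reads the whole file into memory, so it gives up the memory advantage of line-based processing.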

Related

Combine names of variables (in a formula) retrieved by grep in R

I'd like to use the output of grep directly in a formula.
In other words, I use grep to retrieve the variables I want to select and store them in a vector.
The cool thing would be to be able to use this vector in a formula.
As to say
var.to.retrieve <- grep(pattern="V", x=data)
lm(var.dep~var.to.retrieve)
but this doesn't work...
I've tried the solution paste(var.to.retrieve, collapse="+") but this doesn't work either.
EDIT
The solution could be
formula <- as.formula(paste(var.dep, paste(var.to.retrieve, collapse="+"), sep="~"))
but I cannot imagine there is no more elegant way to do it
reformulate(var.to.retrieve, response = var.dep) is basically this.
var.dep <- "y"
var.to.retrieve <- LETTERS[1:10]
r1 <- reformulate(var.to.retrieve, response = var.dep)
r2 <- as.formula(
paste(var.dep,
paste(var.to.retrieve, collapse = "+"),
sep = "~")
)
identical(r1,r2) ## TRUE
var_to_retrieve <- colnames(data)[grep(pattern = "V", x = colnames(data))]
lm(formula(paste(var.dep, paste(var_to_retrieve, collapse = "+"), sep = "~")),
data = data)
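Putting the pieces together, here is a small self-contained sketch. The data frame dat and its column names are invented for illustration; grep(..., value = TRUE) returns the matching names directly, so the extra indexing step is not needed.

```r
# Made-up data: a response y and some predictors, two of which start with "V".
set.seed(1)
dat <- data.frame(y = rnorm(20), V1 = rnorm(20), V2 = rnorm(20), other = rnorm(20))

# value = TRUE returns the matching column names instead of their positions.
predictors <- grep("^V", names(dat), value = TRUE)

# Build y ~ V1 + V2 from the character vector.
f <- reformulate(predictors, response = "y")

fit <- lm(f, data = dat)
print(f)
print(coef(fit))
```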

Beginner using pipes

I am a beginner and I'm trying to find the most efficient way to change the name of the first column for many CSV files that I will be creating. Once I have created the CSV files, I am loading them into R as follows:
data <- read.csv('filename.csv')
I have used the names() function to do the name change of a single file:
names(data)[1] <- 'Y'
However, I would like to find the most efficient way of combining/piping this name change to read.csv so the same name change is applied to every file when they are opened. I tried to write a 'simple' function to do this:
addName <- function(data) {
names(data)[1] <- 'Y'
data
}
However, I do not yet fully understand the syntax for writing a function and I can't get this to work.
Note
If you were expecting your original addName function to "mutate" an existing object like so
x <- data.frame(Column_1 = c(1, 2, 3), Column_2 = c("a", "b", "c"))
# Try (unsuccessfully) to change title of "Column_1" to "Y" in x.
addName(x)
# Print x.
x
please be aware that R passes by value rather than by reference, so x itself would remain unchanged:
Column_1 Column_2
1 1 a
2 2 b
3 3 c
Any "mutation" would be achieved by overwriting x with the return value of the function
x <- addName(x)
# Print x.
x
in which case x itself would obviously be changed:
Y Column_2
1 1 a
2 2 b
3 3 c
Answer
Anyway, here's a solution that compactly incorporates pipes (%>% from the magrittr package) and a custom function. Please note that without the linebreaks and comments, which I have added for clarity, this could be condensed to only a few lines of code.
# The dplyr package helps with easy renaming, and it includes the magrittr pipe.
library(dplyr)
# ...
filenames <- c("filename1.csv", "filename2.csv", "filename3.csv")
# A function to take a CSV filename and give back a renamed dataset taken from that file.
addName <- function(filename) {
return(# Read in the named file as a data.frame.
read.csv(file = filename) %>%
# Take the resulting data.frame, and rename its first column as "Y";
# quotes are optional, unless the name contains spaces: "My Column"
# or `My Column` are needed then.
dplyr::rename(Y = 1))
}
# Get a list of all the renamed datasets, as taken by addName() from each of the filenames.
all_files <- sapply(filenames, FUN = addName,
# Keep the list structure, in which each element is a
# data.frame.
simplify = FALSE,
# Name each list element by its filename, to help keep track.
USE.NAMES = TRUE)
In fact, you could easily rename any columns you desire, all in one fell swoop:
dplyr::rename(Y = 1, 'X' = 2, "Z" = 3, "Column 4" = 4, `Column 5` = 5)
This will read a vector of filenames, change the name of the first column of each one to "Y" and store all of the files in a list.
filenames <- c("filename1.csv","filename2.csv")
addName <- function(filename) {
data <- read.csv(filename)
names(data)[1] <- 'Y'
data
}
files <- list()
for (i in 1:length(filenames)) {
files[[i]] <- addName(filenames[i])
}
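The same idea can be written with lapply instead of an explicit loop. The sketch below creates two throwaway CSV files in a temporary directory so that it runs as is; the file names, the helper name read_renamed and its new_name argument are all made up for illustration.

```r
# Create two small example files in a temporary directory.
tmp <- tempfile()
dir.create(tmp)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(tmp, "f1.csv"), row.names = FALSE)
write.csv(data.frame(c = 5:6, d = 7:8), file.path(tmp, "f2.csv"), row.names = FALSE)

# Collect every CSV file in that directory.
filenames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read one file and rename its first column.
read_renamed <- function(filename, new_name = "Y") {
  data <- read.csv(filename)
  names(data)[1] <- new_name
  data
}

# One data.frame per file, named by file name to help keep track.
files <- lapply(filenames, read_renamed)
names(files) <- basename(filenames)
str(files)
```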

How to convert several characters to vectors in R?

I am struggling with converting several characters to vectors and making them as a list in R.
The converting rule is as follows:
Assign a number to each character. ex. A=1, B=2, C=3,...
Make a vector when the length of characters is ">=2". ex. AB = c(1,2), ABC = c(1,2,3)
Make lists containing several vectors.
For example, suppose that there is an object ex with three components. I want to turn each component into a list object: list1, list2, and list3.
ex = c("(A,B,C,D)", "(AB,BC,CD)","(AB,C)")
# 3 lists to be returned from ex object
list1 = "list(1,2,3,4)" # from (A,B,C,D)
list2 = "list(c(1,2), c(2,3), c(3,4))" # from (AB,BC,CD)
list3 = "list(c(1,2), c(3))" # from (AB,C)
Please let me know a good R function to solve the example above.
* The minor change is reflected.
lookUpTable = as.numeric(1:4) #map numbers to their respective strings
names(lookUpTable) = LETTERS[1:4]
step1<- #get rid of parentheses and split by ",".
strsplit(gsub("[()]", "", ex), ",")
result<- #split again to make things like "AB" into "A", "B", also convert the strings to numbers acc. to lookUpTable
lapply(step1, function(x){ lapply(strsplit(x, ""), function(u) unname(lookUpTable[u])) })
# assign to the global environment.
invisible(
lapply(seq_along(result), function(x) {assign(paste0("list", x), result[[x]], envir = globalenv()); NULL})
)
# get it as strings:
invisible(
lapply(seq_along(result), function(x) {assign(paste0("list_string", x), capture.output(dput(result[[x]])), envir = globalenv()); NULL})
)
data:
ex = c("(A,B,C,D)", "(AB,BC,CD)","(AB,C)")
tips and tricks:
I make use of regular expressions in gsub (and strsplit). Learn regex!
I made a lookUpTable that maps the individual strings to numbers. Make sure your lookUpTable is set up analogously.
Have a look at the apply family of functions, in this case ?lapply.
Lastly, I assign the results to the global environment. I don't recommend this step, but it's what you requested.
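If writing into the global environment is not strictly required, one alternative is to keep everything in a single named list. This is a sketch rather than a drop-in replacement: the helper name to_codes is made up, it assumes only the capital letters A to Z occur, and it returns integers where the lookUpTable above returns numerics.

```r
ex <- c("(A,B,C,D)", "(AB,BC,CD)", "(AB,C)")

# Convert one string such as "(AB,C)" into list(c(1, 2), 3).
to_codes <- function(s) {
  # Drop the parentheses, then split on commas.
  tokens <- strsplit(gsub("[()]", "", s), ",", fixed = TRUE)[[1]]
  # Split each token into single letters and look up their positions in LETTERS.
  lapply(strsplit(tokens, ""), function(ch) match(ch, LETTERS))
}

# One named element per input string instead of separate global objects.
result <- setNames(lapply(ex, to_codes), paste0("list", seq_along(ex)))
str(result)
```

The individual pieces are then available as result$list1, result$list2 and result$list3.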

Manipulating the quotes on strings when coding in R

This is actually a series of questions about referencing character values in R. I will add more bullets if I recall other related questions. For simplicity, I use some simple made-up examples to explain my questions:
Suppose I am building a set of datasets in a for loop and want to output a series of objects whose names are stored in a vector called name_list = c("a", "b", "c", "d", "e", "f"). In the loop I would like to define them as
for(i in 1:4){
a <- data[data$Year == 2010,]
b <- unique(data$Name)
c <- summarise(group_by(data,Year,Name), avg = mean(quantity))
...
f <- left_join(data,data1, by = c("Year", "Names"))
}
Is there any function that allows me to use function(name_list[1]) through function(name_list[6]) to replace the a through f in the for loop? This question also goes for trying to create columns using column names in some tables/data frames embedded a chunk of code. (as.name and noquote function work when just referencing the vector/dataset but don't work when attempting to assign values to the target variable, if possible could anyone share why this happens?)
When we extract information from SQL or other data sources, several values are sometimes stored in one variable, separated by commas or some other delimiter. How can we test whether a certain value is among the comma-separated values? See the example below:
1567 %in% c(1567,1456,123)
TRUE
a <- "c(1567,1456,123)"
noquote(a)
c(1567,1456,123)
1567 %in% noquote(a)
FALSE
1567 %in% list(noquote(a))
FALSE
b <- "1567,1456,123"
noquote(b)
1567,1456,123
1567 %in% noquote(strsplit(b,","))
FALSE
1567 %in% list(noquote(strsplit(b,",")))
FALSE
I kind of get why %in% doesn't work here: it seems R is treating 1567,1456,123 as a single element. So I used strsplit to separate the values, but it still doesn't work. Is there any way to get R to treat the string as a command?
If all you need to do is convert comma-separated lists like "1567,1456,123" into R vectors like c(1567, 1456, 123), you definitely do not need to wrap them in c(...) and try to evaluate them directly as vectors. You should just use strsplit to split the data:
data_str <- "1567,1456,123"
data_vec <- as.integer(strsplit(data_str, ",")[[1]])
stopifnot(1567 %in% data_vec)
Note that strsplit returns a list (hence the [[1]] above), because it can also split character vectors of length greater than one:
stopifnot(
all.equal(
list(c("a", "b"), c("x", "y")),
strsplit(c("a,b", "x,y"), ",")) == TRUE)
which makes it useful for operating on columns of SQL output:
| id | concatenated_field |
|----|--------------------|
| 1 | 5362,395,9000,7 |
| 2 | 319,75624,63 |
(etc.)
d <- data.frame(
id = c(1, 2),
concatenated_field = c("5362,395,9000,7", "319,75624,63"))
d$split_field <- strsplit(d$concatenated_field, ",")
sapply(d, class)
# id concatenated_field split_field
# "numeric" "character" "list"
d$split_field[[1]]
# [1] "5362" "395" "9000" "7"
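To come back to the original membership question, the split column can then be tested row by row. A minimal sketch, reusing the example table above and assuming all fields are whole numbers; the value 395 is just an arbitrary one to look for:

```r
d <- data.frame(
  id = c(1, 2),
  concatenated_field = c("5362,395,9000,7", "319,75624,63"),
  stringsAsFactors = FALSE)

# Split each row's string and convert the pieces to integers.
split_field <- lapply(strsplit(d$concatenated_field, ",", fixed = TRUE), as.integer)

# For each row, test whether 395 is among its values.
has_395 <- vapply(split_field, function(v) 395L %in% v, logical(1))
print(has_395)
```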
Alternatively, if you're reading in one big stream of comma-separated data, you can use scan:
data_vec <- scan(
what = 0, # arcane way to say "expect numeric input"
sep = ",",
text = "1,2,3,4,5,6,7,8,9,10")
stopifnot(all.equal(data_vec, 1:10) == TRUE)
scan is more heavy-duty than strsplit and can handle more complicated inputs as well, such as data with quoted fields:
weird_data <- scan(what="", sep=",", text='marvin,ruby,"joe,joseph",dean')
print(weird_data)
# [1] "marvin" "ruby" "joe,joseph" "dean"
If you are really really sure you need to be able to accept and evaluate R code passed as an input (this can be VERY DANGEROUS since it means you will be executing arbitrary unverified R code), you can use
r_code_string <- 'list(c("a", "b"), c("x", "y"))'
stopifnot(
all.equal(
list(c("a", "b"), c("x", "y")),
# text = is required; the first argument of parse is a file.
eval(parse(text = r_code_string))) == TRUE)
parse converts raw text into an unevaluated "expression", which is a representation of R code in the form of a special R object; eval then passes the expression to the interpreter for execution.
As for noquote, it doesn't do what you think it does. It doesn't actually modify the string, it just adds a flag to the variable so that it will print without quotation marks. You can emulate this behavior with print(..., quote = FALSE).
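A short sketch of that last point, using the string from the question: the object returned by noquote still holds the same single string, so %in% only succeeds once the string has actually been split and converted.

```r
a <- "1567,1456,123"
b <- noquote(a)

# noquote only attaches a class that changes how the value is printed.
print(class(b))
print(identical(unclass(b), a))

# Splitting and converting is what makes the membership test work.
values <- as.integer(strsplit(a, ",", fixed = TRUE)[[1]])
found <- 1567 %in% values
print(found)
```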

Column names of a data.frame separated with comma

I want to get the column names of a data.frame separated by commas (,). I remember getting this result in the past, but I have forgotten the command.
df<- data.frame(x=1:10, y=11:20)
names(df)
Output
"x" "y"
Desired Output
c("x", "y")
The easiest way to get exactly what it sounds like you're asking for (without knowing exactly how you plan to use this information) is to use dput:
dput(names(df))
# c("x", "y")
By extension, without fussing with paste:
x <- capture.output(dput(names(df)))
x
# [1] "c(\"x\", \"y\")"
cat(x)
# c("x", "y")
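Although @Jilber deleted his answer, you can use shQuote to go from what he had started with to the output of "x" above. Note that type = "cmd" is needed to get double quotes on every platform; the default on Unix-alikes produces single quotes:
paste("c(", paste(shQuote(names(df), type = "cmd"), collapse = ", "), ")", sep = "")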
# [1] "c(\"x\", \"y\")"
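Another option that might be worth a try is deparse, which returns the same text as a character string without going through capture.output. A minimal sketch; the paste(..., collapse = "") is there because deparse can split long vectors across several strings:

```r
df <- data.frame(x = 1:10, y = 11:20)

# deparse gives the R code for the names vector as text.
s <- paste(deparse(names(df)), collapse = "")
cat(s, "\n")

# The text can be parsed back into the original names if needed.
round_trip <- eval(parse(text = s))
print(round_trip)
```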
