For my thesis I'm using an SQL dump that is very large. So far, I've managed to open small segments of the SQL dump in R. However, all the data is structured as follows:
X
X
X
X
X
Y
Y
Y
X
Z
Z
Z
If I want to interpret the data more efficiently, the data should look like this:
XXXX
XYYY
XZZZ
To accomplish this, I've written a for loop that transposes the data. Unfortunately, due to the size of the data set (and my memory), this loop is really slow. Does anyone have an idea how to speed up the for loop or speed up the process in general? I've tried to use dcast/reshape, but it seems that these functions will not do the trick.
Right now, my code looks like this:
DATAclean <- data.table()
for (i in c(1:100)) {
  vector <- DATAtransed[, ..i]          # take column i
  vector <- na.omit(vector)
  StartCol <- seq(from = 1, to = nrow(vector), by = 4)
  Sys.sleep(0.001)                      # progress display
  print(i)
  flush.console()
  for (j in StartCol) {
    new_data <- vector[c(j:(j + 3))]    # rows j..j+3 become one output row
    new_data <- t(new_data)
    DATAclean <- rbind(DATAclean, new_data)
  }
}
Maybe you can try the base R option below
do.call(
  rbind,
  lapply(
    unname(split(seq_along(data), ceiling(seq_along(data) / 4))),
    function(k) data[k]
  )
)
which gives
[,1] [,2] [,3] [,4]
[1,] "X" "X" "X" "X"
[2,] "X" "Y" "Y" "Y"
[3,] "X" "Z" "Z" "Z"
DATA
data <- structure(c("X", "X", "X", "X", "X", "Y", "Y", "Y", "X", "Z",
                    "Z", "Z"), .Dim = c(12L, 1L))
Sometimes it can be more efficient (especially regarding memory) to work on large files in a line-based manner, e.g. using the usual suspects on the Unix/Linux command line, or Perl, Python, etc.
Assuming your data has one column as in your example, and you want to concatenate every 4 rows without any separator, removing empty rows, you could do:
awk 'NF' infile | paste -d '' - - - - > outfile
Obviously, you could also wrap that in a system call in R to get the entire result into R:
system("awk 'NF' filename | paste -d '' - - - - ", intern=TRUE)
One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
The rest of the data in the vector follows the same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other code online, but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full, I haven't been able to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting using regex, the scan() function, and too many other variations. Now I am all confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse).
library(tidyverse)

# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
                  stringsAsFactors = FALSE)
dat
#                                          Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)

# Separate the Data column
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
#      City State   Zip  Latitude  Longitude
# 1 Potomac    MD 20854 39.038266 -77.203413
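If the separator-based split feels brittle, tidyr::extract is a close alternative that pulls the pieces out with a single capturing-group regex (a sketch against the same dat; the regex assumes the exact layout of the example string):
dat3 <- extract(dat, Data,
                into = c("City", "State", "Zip", "Latitude", "Longitude"),
                regex = "(.+), ([A-Z]{2}) (\\d+)\\n\\(([-0-9.]+), ([-0-9.]+)\\)")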
You can try strsplit or data.table::tstrsplit (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
Here is an option using base R
read.table(text = trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
           header = FALSE,
           col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
           stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
This process might be a little longer, but for me it makes things clearer. Rather than splitting on separators, below I identify each value with its own regex. I make a vector of regexes to extract each value and a vector of the variable names, then use a loop to extract the values and build the data frame from those vectors.
library(stringi)
library(dplyr)
library(purrr)

# example input string
value <- "Potomac, MD 20854\n(39.038266, -77.203413)"

# one regex per value to extract
rgexVec <- c("[\\w\\s-]+(?=,)",     # city
             "[A-Z]{2}",            # state
             "\\d+(?=\\n)",         # zip
             "[\\d-\\.]+(?=,)",     # latitude
             "[\\d-\\.]+(?=\\))")   # longitude
varNames <- c("city", "state", "zip", "lat", "long")

map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
\\1 is a backreference in regex. It refers to whatever the first capturing group (the first pair of parentheses in the pattern) matched, so using "\\1" as the replacement in gsub swaps the entire match for just that captured piece.
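For example, in the pattern from the question the parentheses capture the coordinate pair, and '\\1' in the replacement swaps the entire match for just that captured text:
gsub('.*\\n\\((.*)\\)', '\\1', 'Potomac, MD 20854\n(39.038266, -77.203413)')
[1] "39.038266, -77.203413"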
I have an issue with replacing strings with new ones conditionally.
I've put a short version of my real problem here; my approach works so far, but I need a better solution since the real data has many rows.
strings <- c("ca_A33","cb_A32","cc_A31","cd_A30")
Basically, I want to replace strings with replace_strings: the first item in strings is replaced with the first item in replace_strings, and so on.
replace_strings <- c("A1","A2","A3","A4")
So the final strings should look like:
final_strings <- c("ca_A1", "cb_A2", "cc_A3", "cd_A4")
I wrote a simple function, assign_new:
assign_new <- function(x){
  ifelse(grepl("A33", x), gsub("A33", "A1", x),
         ifelse(grepl("A32", x), gsub("A32", "A2", x),
                ifelse(grepl("A31", x), gsub("A31", "A3", x),
                       ifelse(grepl("A30", x), gsub("A30", "A4", x), x))))
}
assign_new(strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
OK, it seems we have a solution. But say I had A1000 down to A1 and wanted to replace them with A1 to A1000: I'd need a thousand rows of ifelse statements. How can we tackle that?
If your vectors are ordered to be matched, then you can keep everything up to and including the underscore (captured as \\1) and append the replacements:
> paste0(gsub("(.*_)(.*)","\\1", strings ), replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
You can use regmatches. First obtain the characters that follow _ using regexpr, then replace them as shown below:
`regmatches<-`(strings, regexpr("(?<=_).*", strings, perl = TRUE), value = replace_strings)
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Not the fastest, but very tractable and easy to maintain:
for (i in seq_along(strings)) {
  strings[i] <- gsub("\\d+$", i, strings[i])
}
"\\d+$" just matches any number at the end of the string.
EDIT: Per @Onyambu's comment, removing map2_chr, as paste is a vectorized function.
foo <- function(x, y){
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings, replace_strings)
Here x is strings and y is replace_strings. You first split each element of strings at the _ character, keep the prefix, and paste it to the corresponding replace_strings element.
EDIT:
For objects where there is no positional relationship you could create a reference table (dataframe, list, etc.) and match your values.
reference_tbl <- data.frame(strings, replace_strings)
foo <- function(x){
  y <- reference_tbl$replace_strings[match(x, reference_tbl$strings)]
  x <- unlist(lapply(strsplit(x, "_"), '[', 1))
  paste(x, y, sep = "_")
}
foo(strings)
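When there is no positional relationship, a named lookup vector keyed by the old suffix works as well (a sketch; the names are the suffixes from the example):
lookup <- setNames(replace_strings, c("A33", "A32", "A31", "A30"))
old_suffix <- sub(".*_", "", strings)   # part after the "_"
paste(sub("_.*", "", strings), lookup[old_suffix], sep = "_")
[1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"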
Using the dplyr package:
library(dplyr)

strings <- c("ca_A33", "cb_A32", "cc_A31", "cd_A30")
replace_strings <- c("A1", "A2", "A3", "A4")

df <- data.frame(strings, replace_strings)

df <- mutate(rowwise(df),
             strings = gsub("_.*",
                            paste0("_", replace_strings),
                            strings))

df <- select(df, strings)
Output:
# A tibble: 4 x 1
strings
<chr>
1 ca_A1
2 cb_A2
3 cc_A3
4 cd_A4
Yet another way:
mapply(function(x, y) gsub("(\\w\\w_).*", paste0("\\1", y), x),
       strings, replace_strings, USE.NAMES = FALSE)
# [1] "ca_A1" "cb_A2" "cc_A3" "cd_A4"
Currently I'm using nested ifelse functions with grepl to check for matches to a vector of strings in a data frame, for example:
# vector of possible words to match
x <- c("Action", "Adventure", "Animation")
# data
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
my_text <- as.data.frame(my_text)
my_text$new_column <- ifelse(
  grepl("Action", my_text$my_text),
  "Action",
  ifelse(
    grepl("Adventure", my_text$my_text),
    "Adventure",
    ifelse(
      grepl("Animation", my_text$my_text),
      "Animation", NA)))
> my_text$new_column
[1] "Animation" NA "Adventure"
This is fine for just a few elements (e.g., the three here), but what do I do when the set of possible matches is much larger (e.g., 150)? Nested ifelse seems crazy. I know I can grepl multiple things at once as in the code below, but this returns a logical telling me only whether the string was matched, not which term matched it. I'd like to know what was matched (in the case of multiple matches, any one of them is fine).
x <- c("Action", "Adventure", "Animation")
my_text <- c("This one has Animation.", "This has none.", "Here is Adventure.")
grepl(paste(x, collapse = "|"), my_text)
returns: [1] TRUE FALSE TRUE
What I'd like it to return: "Animation" "" (or FALSE) "Adventure"
Following the pattern here, a base solution.
x <- c("ActionABC", "AdventureDEF", "AnimationGHI")
regmatches(x, regexpr("(Action|Adventure|Animation)", x))
stringr has an easier way to do this
library(stringr)
str_extract(x, "(Action|Adventure|Animation)")
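For the 150-term case you would build the alternation from the vector rather than typing it out; str_extract returns NA where nothing matches, which drops straight into a column (a sketch using the question's x and my_text):
library(stringr)
my_text$new_column <- str_extract(as.character(my_text$my_text), paste(x, collapse = "|"))
my_text$new_column
[1] "Animation" NA          "Adventure"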
Building on Benjamin's base solution, use lapply so that you will have a character(0) value when there is no match.
Using regmatches directly on your sample code will give you the following error:
my_text$new_column <- regmatches(x = my_text$my_text,
                                 m = regexpr(pattern = paste(x, collapse = "|"),
                                             text = my_text$my_text))
Error in `$<-.data.frame`(`*tmp*`, new_column, value = c("Animation", :
  replacement has 2 rows, data has 3
This is because there are only 2 matches, and R tries to fit those matched values into a data frame column that has 3 rows.
To fill non-matches with a special value so that this operation can be done directly we can use lapply.
my_text$new_column <-
  lapply(X = my_text$my_text, FUN = function(X){
    regmatches(x = X, m = regexpr(pattern = paste(x, collapse = "|"), text = X))
  })
This will put character(0) where there is no match.
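If you would rather have NA than character(0), so that the result drops cleanly into a plain character column, you can wrap the extraction (a sketch):
my_text$new_column <- vapply(my_text$my_text, function(s) {
  m <- regmatches(s, regexpr(paste(x, collapse = "|"), s))
  if (length(m) == 0) NA_character_ else m
}, character(1), USE.NAMES = FALSE)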
Hope this helps.
This will do it...
my_text$new_column <- unlist(
  apply(
    sapply(x, grepl, my_text$my_text),
    1,
    function(y) paste("", x[y])))
The sapply produces a logical matrix showing which of the x terms appears in each element of your column. The apply then runs through this row-by-row and pastes together all of the values of x corresponding to TRUE values. (It pastes a "" at the start to avoid NAs and keep the length of the output the same as the original data.) If there are two terms in x matched for a row, they will be pasted together in the output.
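A variant of the same idea that collapses multiple hits explicitly and avoids the leading space (a sketch):
hits <- sapply(x, grepl, my_text$my_text)   # logical matrix: one row per text, one column per term
my_text$new_column <- apply(hits, 1, function(y) paste(x[y], collapse = " "))
# rows with no match come back as "" rather than NA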
I have a dataframe called df with column names in the following format:
"A Agarwal" "A Agrawal" "A Balachandran"
"A.Brush" "A.Casavant" "A.Chakrabarti"
They are first initial and last name. However, some of them are separated with a space, while others are separated with a period. I need to replace the space with a period. (The first column is called author.ID, and I excluded it from the following code.)
I have tried the following code, but the resulting colnames still do not change.
colnames(df[, -1]) = gsub("\\s", "\\.", colnames(df[, -1]))
colnames(df[, -1]) = gsub(" ", ".", colnames(df[, -1]))
What am I doing wrong?
Thanks.
Note that df[, -1] returns a copy of all rows and columns except the first column (see this reference), so assigning to colnames(df[, -1]) only changes that temporary copy. In order to modify the column names you should assign to colnames(df).
To replace the first literal space with a dot, use
colnames(df) <- sub(" ", ".", colnames(df), fixed=TRUE)
If there can be more than one whitespace, use a regex:
colnames(df) <- sub("\\s+", ".", colnames(df))
If you need to replace all whitespace sequences with a single dot each in the column names, use gsub:
colnames(df) <- gsub("\\s+", ".", colnames(df))
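A quick end-to-end check with column names like those in the question (a toy data frame for illustration):
df <- data.frame(1, 2, 3, 4)
colnames(df) <- c("author.ID", "A Agarwal", "A Agrawal", "A.Brush")
colnames(df) <- gsub("\\s+", ".", colnames(df))
colnames(df)
[1] "author.ID" "A.Agarwal" "A.Agrawal" "A.Brush"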