Reshaping a data frame by "combining" columns - r

I have a data frame of URL strings, and I am using the stringr package in R to produce new columns containing a Boolean flag for whether each string contains a given element.
library(stringr)
library(plyr)  # join() comes from plyr

url = data.frame(u = c("http://www.subaru.com/vehicles/impreza/index.html",
                       "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
                       "http://www.subaru.com/customer-support.html",
                       "http://www.subaru.com/",
                       "http://www.subaru.com/vehicles/forester/index.html"))
url

cs = c("customer-support")
f = c("forester")
one_match <- str_c(cs, collapse = "|")
two_match <- str_c(f, collapse = "|")

main <- function(df) {
  df$customer_support <- as.numeric(str_detect(df$u, one_match))
  df
}
d1 = main(url)

main <- function(df) {
  df$forester <- as.numeric(str_detect(df$u, two_match))
  df
}
d2 = main(url)

mydt = join(d1, d2)
mydt
The above code produces the following results.
mydt
u
1 http://www.subaru.com/vehicles/impreza/index.html
2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
3 http://www.subaru.com/customer-support.html
4 http://www.subaru.com/
5 http://www.subaru.com/vehicles/forester/index.html
  customer_support forester
1                0        0
2                0        0
3                1        0
4                0        0
5                0        1
What I want to do is reshape the data frame so that columns 2 and 3 are combined into a single column that no longer holds Boolean values.
It should look like:
page
0
0
customer_support
0
forester
I've tried many different things, including variations of reshape, transform, dcast, etc., and nothing seems to get the job done. Can anyone help me get the desired output?

You don't need to write such complicated functions. You can simply use the grepl and ifelse functions, as below:
urldata = data.frame(u = c("http://www.subaru.com/vehicles/impreza/index.html",
                           "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
                           "http://www.subaru.com/customer-support.html",
                           "http://www.subaru.com/",
                           "http://www.subaru.com/vehicles/forester/index.html"))
cs = c("customer-support")
f = c("forester")
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
urldata$page <- ifelse(grepl(cs, urldata$u), cs, ifelse(grepl(f, urldata$u), f, 0))
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
## page
## 1 0
## 2 0
## 3 customer-support
## 4 0
## 5 forester
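If the list of patterns grows, a more scalable sketch is to extract the first matching pattern directly with str_extract from the stringr package the question already loads (the combined pattern and the NA clean-up are my additions, not part of the original answer):
library(stringr)
patterns <- c(cs, f)  # "customer-support", "forester"
urldata$page <- str_extract(urldata$u, str_c(patterns, collapse = "|"))
urldata$page[is.na(urldata$page)] <- "0"  # str_extract returns NA where nothing matches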

Related

Pass a data.table and a variable into a function

I have just started learning data science, and I have a question that is probably easy for you.
I have a dataset that looks something like this:
df <- data.frame(id= c(1,1,1,2,2,2,3,3,3), time=c(1,2,3,1,2,3,1,2,3),y = rnorm(9), x1 = LETTERS[seq( from = 1, to = 9 )], x2 = c(0,0,0,0,1,0,1,1,1),c2 = rnorm(9))
df
# id time y x1 x2 c2
# 1 1 1 0.6364831 A 0 -0.066480473
# 2 1 2 0.4476390 B 0 0.161372575
# 3 1 3 1.5113458 C 0 0.343956178
# 4 2 1 0.3532957 D 0 0.279987147
# 5 2 2 0.3401402 E 1 -0.462635393
# 6 2 3 -0.3160222 F 0 0.338454940
# 7 3 1 -1.3797158 G 1 -0.621169576
# 8 3 2 1.4026640 H 1 -0.005690801
# 9 3 3 0.2958363 I 1 -0.176488132
I am writing a function with multiple steps. I would like to feed the function two elements: the dataset and the variable of interest.
However, the function breaks down at an intermediate step, when I try to filter my data using data.table. The crucial step of the function looks something like this:
testfun <- function(dataset, var){
  intermediatedf <- unique(setDT(dataset)[var == 1 & c2 > 0, .(y)])
  return(intermediatedf)
}
However, running df2 <- testfun(df, y) fails.
Can anyone help me and explain how I can create a function where I pass in both a dataset and a variable?
Thank you in advance for your help.
You can use substitute and eval:
testfun <- function(dataset, var) {
  var <- substitute(var)
  intermediatedf <- unique(dataset[eval(var) == 1 & c2 > 0, .(y)])
  return(intermediatedf)
}
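A quick usage sketch (my example call, not from the original answer); note this version of testfun assumes dataset is already a data.table, so convert first:
library(data.table)
setDT(df)               # convert the data.frame once, by reference
df2 <- testfun(df, x2)  # unique y values for rows where x2 == 1 and c2 > 0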

Run if loop in parallel

I have a data set with ~4 million rows that I need to loop over. The data structure is that there are repeated IDs whose rows depend on each other, but the data is independent across IDs. For each ID, row [i+1] depends on row [i]. Here is a reproducible example. I realize this example is not practical in terms of the inner functions, but it simply demonstrates the structure I have.
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
for (i in 1:nrow(df)){
  if (i > 1 && df[i,1] == df[i-1,1]){
    # Main functions go here
    df[i,4] = df[i-1,4]^2 + df[i,3]
  } else {
    df[i,4] = 1
  }
}
The issue is that, in reality, 1,000 iterations of the real function take ~90 seconds, so 4 million rows would take days. Running it this way isn't feasible for me. However, the IDs are independent and don't need to run together. My question is: is there a way to run this type of loop in parallel? A very inelegant solution would be to split the file into 50 sections without splitting an ID and simply run the same code on the 50 sub-files, but I figure there should be a way to code this properly.
EDIT: Added month column to show why the rows are dependent on each other. To address two comments below:
1) There are actually 6-7 lines of functions to run. Could I use ifelse() with multiple functions?
2) The desired output would be the full data frame. In reality there are more columns but I need each row in a data frame.
ids month x y
1 1 1 -1 1
2 1 2 1 2
3 1 3 10 14
4 1 4 2 198
5 1 5 3 39207
6 2 1 11 1
7 2 2 4 5
8 2 3 -4 21
9 2 4 -1 440
10 2 5 0 193600
11 3 1 8 1
12 3 2 4 5
13 3 3 4 29
14 3 4 3 844
15 3 5 -1 712335
EDIT2: I've tried applying the foreach() package from another post, but it doesn't seem to work. This code will run, but I think the issue is the way the rows are distributed among cores. If each row is sequentially sent to a different core, the same ID will never end up on the same core.
library(foreach)
library(doParallel)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
y = rep(0,15)
df = as.data.frame(cbind(ids,month,x,y))
#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)
finalMatrix <- foreach(i=1:nrow(df), .combine=cbind) %dopar% {
  for (i in 1:nrow(df)){
    if (i > 1 && df[i,1] == df[i-1,1]){
      # Main functions go here
      df[i,4] = df[i-1,4]^2 + df[i,3]
    } else {
      df[i,4] = 1
    }
  }
}
#stop cluster
stopCluster(cl)
So, simply recode your loop with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fill_y(const NumericVector& x) {
  int n = x.length();
  NumericVector y(n); y[0] = 1;
  for (int i = 1; i < n; i++) {
    y[i] = pow(y[i - 1], 2) + x[i];
  }
  return y;
}
And, to apply it on each group, use dplyr:
df %>%
group_by(ids) %>%
mutate(y2 = fill_y(x))
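One setup step worth making explicit (a standard Rcpp workflow note, assuming the C++ snippet above is saved as fill_y.cpp): the function has to be compiled before the dplyr call will run.
library(Rcpp)
library(dplyr)
sourceCpp("fill_y.cpp")  # compiles fill_y() and makes it callable from R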
I think this should be fast enough so that you don't need parallelism.
Actually I ran it on #Val's testdat and it took only 2 seconds (with an old computer).
Tell me if it's okay. Otherwise, I'll make a parallel version.
Here's a solution using foreach. Hard to say how it would work in your real life example, at least it works with the testdata ...
First I generate some testdata:
# function to generate testdata
genDat <- function(id){
  # observations per id, fixed or random
  n <- 50
  # n <- round(runif(1,5,1000))
  return(
    data.frame(id=id, month=rep(1:12, ceiling(n/12))[1:n], x=round(rnorm(n,2,5)), y=rep(0,n))
  )
}
#generate testdata
testdat <- do.call(rbind,lapply(1:90000,genDat))
> head(testdat)
id month x y
1 1 1 7 0
2 1 2 6 0
3 1 3 -9 0
4 1 4 3 0
5 1 5 -9 0
6 1 6 8 0
> str(testdat)
'data.frame': 4500000 obs. of 4 variables:
$ id : int 1 1 1 1 1 1 1 1 1 1 ...
$ month: int 1 2 3 4 5 6 7 8 9 10 ...
$ x : num 7 6 -9 3 -9 8 -4 13 0 5 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
So the testdata has ~ 4.5 million rows with 90k unique ids.
Now since your calculations are independent between the IDs, the idea would be to ship off data with unique IDs to each core ... this would ultimately also get rid of the necessity for an if or ifelse condition.
To do this, I first generate a matrix with start and stop row indices, to split the dataset in unique IDs:
id_len <- rle(testdat$id)
ixmat <- cbind(c(1,head(cumsum(id_len$lengths)+1,-1)),cumsum(id_len$lengths))
Each row of this matrix holds the first and last row index of a single id; the matrix can then be passed on to foreach for running the id-specific parts in parallel.
In this example I modify your calculations slightly to avoid astronomical values leading to Inf.
library(parallel)
library(doParallel)
library(iterators)
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl) #create a cluster
r <- foreach (i = iter(ixmat, by='row')) %dopar% {
  x <- testdat$x[i[1,1]:i[1,2]]
  y <- testdat$y[i[1,1]:i[1,2]]
  y[1] <- 1
  for (j in 2:length(y)){
    # y[j] <- (y[j-1]^2) + x[j] ## gets Inf
    y[j] <- y[j-1] + x[j]
  }
  return(y)
}
parallel::stopCluster(cl)
Finally you could replace the values in the original dataframe:
testdat$y <- unlist(r)
As for the time, the foreach loop runs in about 40 seconds on my 8 core machine.
Base R Matrix operations and melt/dcast from data.table
As discussed in the comments above, this solution is very specific to the use case in the example, but perhaps might be applicable to your use case.
Using matrix operations and the dcast.data.table and melt.data.table functions from the data.table package to make fast transitions from a long to wide format and back is pretty efficient.
All things considered, the bigger constraint with these methods will likely be how much RAM you have available rather than processing time.
library(data.table)
set.seed(123)
id1 = rep(1,5)
id2 = rep(2,5)
id3 = rep(3,5)
ids = c(id1,id2,id3)
month = rep(seq(1,5),3)
x = round(rnorm(15,2,5))
# y = rep(0,15) ## no need to pre-define y with this method
df = as.data.frame(cbind(ids,month,x))
setDT(df) ## Convert to data.table by reference
wide <- dcast.data.table(df, month ~ ids, value.var = "x") ## pivot to 'wide' format
mat <- data.matrix(wide[,-c("month")]) ## Convert to matrix
print(mat)
gives
1 2 3
[1,] -1 11 8
[2,] 1 4 4
[3,] 10 -4 4
[4,] 2 -1 3
[5,] 3 0 -1
Then operating on it as a matrix:
mat[1,] <- 1 ## fill the first row with 1's as in your example
for (i in 2:nrow(mat)){
  mat[i,] = mat[i-1L,]^2 + mat[i,]
}
print(mat)
gives
1 2 3
[1,] 1 1 1
[2,] 2 5 5
[3,] 14 21 29
[4,] 198 440 844
[5,] 39207 193600 712335
Next, melt back to a long format and then join back to the original data on key columns ids and month:
yresult <- as.data.table(mat) ## convert back to data.table format
yresult[,month := wide[,month]] ## Add back the month column
ylong <- melt.data.table(yresult,
id.vars = "month",
variable.factor = FALSE,
variable.name = "ids",
value.name = "y") ## Pivot back to 'long' format
ylong[,ids := as.numeric(ids)] ## reclass ids to match input ids
setkey(ylong, ids, month) ## set keys for join on 'ids' and 'month'
setkey(df, ids,month)
merge(df,ylong) ## join data.table with the result
yields the final result:
ids month x y
1: 1 1 -1 1
2: 1 2 1 2
3: 1 3 10 14
4: 1 4 2 198
5: 1 5 3 39207
6: 2 1 11 1
7: 2 2 4 5
8: 2 3 -4 21
9: 2 4 -1 440
10: 2 5 0 193600
11: 3 1 8 1
12: 3 2 4 5
13: 3 3 4 29
14: 3 4 3 844
15: 3 5 -1 712335
Scale Testing
To test and illustrate scaling, the function testData below generates a data set by cross joining a given number of ids and a given number of months. Then, the function testFunc performs the recursive row-wise matrix operations.
testData <- function(id_count, month_count) {
  id_vector <- as.numeric(seq_len(id_count))
  months_vector <- seq_len(month_count)
  df <- CJ(ids = id_vector, month = months_vector)
  df[, x := rnorm(.N, 0, 0.1)]
  return(df)
}

testFunc <- function(df) {
  wide <- dcast.data.table(df, month ~ ids, value.var = "x")
  mat <- data.matrix(wide[, -c("month")])
  mat[1,] <- 1
  for (i in 2:nrow(mat)){
    mat[i,] = mat[i-1L,]^2 + mat[i,]
  }
  yresult <- as.data.table(mat)
  yresult[, month := wide[, month]]
  ylong <- melt.data.table(yresult,
                           id.vars = "month",
                           variable.factor = FALSE,
                           variable.name = "ids",
                           value.name = "y")
  ylong[, ids := as.numeric(ids)]
  setkey(ylong, ids, month)
  setkey(df, ids, month)
  merge(df, ylong)
}
With 90,000 ids and 45 months:
foo <- testData(90000,45)
system.time({
testFunc(foo)
})
user system elapsed
8.186 0.013 8.201
Run-time comes in under 10 seconds with a single thread.
With 100,000 ids and 1,000 months:
This three-column input data.table is ~1.9 GB.
foo <- testData(1e5,1e3)
system.time({
testFunc(foo)
})
user system elapsed
52.790 4.046 57.031
A single-threaded run-time of less than a minute seems pretty manageable, depending on how many times this needs to be run. As always, this could be sped up further by improvements to my code or by converting the recursive portion to C++ using Rcpp, but avoiding the mental overhead of learning C++ and switching between languages in your workflow is always nice!

R - setting a value based on a function applied to other values in the same row

I have a dataframe containing (surprise) data. I have one column which I wish to populated on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need apply, or one of its close relatives. Unfortunately, I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function(ar1, ar2, ar3){
  if (ar1==1 && ar2==1 && ar3==1){
    return(1)
  } else if (ar1==0 && ar2==0 && ar3==0){
    return(0)
  }
  return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply would operate along the dataframe itself (i.e. apply(df, 1, someFunc)), and it's more idiomatic to access columns by name using the $ operator; so if you have a dataframe named df with a column named a, access it with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
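Since getCode takes the three columns as separate arguments, base R's mapply does the same row-wise iteration without manual indexing; a minimal equivalent sketch:
df$x <- mapply(getCode, df$a, df$b, df$c)  # applies getCode to a, b, c element by element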
As #devmacrile mentioned above, I would just modify the function to accept a vector of 3 elements as input and use it within an apply call, as you mentioned.
#Example function
getCode <- function(x){
  ifelse(x[1]==1 & x[2]==1 & x[3]==1,
         1,
         ifelse(x[1]==0 & x[2]==0 & x[3]==0,
                0,
                2))
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0

Subsetting character vector to create columns in data.frame

I am attempting to create a new data.frame object that is composed of the columns of the old data.frame with every row set at a given value (for this example, I will use 7). I am running into problems, however, naming the variables of the new data.frame object. Here is what I'm trying to do:
my.df <- data.frame(x = 1:3, y123=4:6)
my.df
x y123
1 1 4
2 2 5
3 3 6
Then when I attempt to assign these variable names to the new data.frame:
the.names <- names(my.df)
for (i in 1:length(the.names)){
  data.frame(
    the.names[i] = 7
  )
}
But this throws errors with unexpected =, ), and }. This would usually make me suspect a typo, but I've reviewed this several times and can't find anything. Any ideas?
The easiest way is probably to just copy the old data.frame in its entirety, and then assign every cell to your new value:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df;
df2[] <- 7;
df2;
## x y123
## 1 7 7
## 2 7 7
## 3 7 7
If you only want one row in the new data.frame, you can index only the top row when making the copy:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df[1,];
df2[] <- 7;
df2;
## x y123
## 1 7 7
Edit: Here's how you can set each column to a different value:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df;
df2[] <- rep(c(7,31),each=nrow(df2));
df2;
## x y123
## 1 7 31
## 2 7 31
## 3 7 31
You can use within:
> within(my.df, {
+ assign(the.names[1], 0)
+ assign(the.names[2], 1)
+ })
x y123
1 0 1
2 0 1
3 0 1
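For completeness, a one-liner sketch (my variant, not from the answers above) that builds a one-row data.frame directly from the names vector, assuming every column gets the same value:
df2 <- as.data.frame(as.list(setNames(rep(7, length(the.names)), the.names)))
df2
##   x y123
## 1 7    7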

R Undo Dummy Variables

I have a data set where a bunch of categorical variables were converted to dummy variables (all classes used, NOT n-1) and some were not. I'm trying to recode them into a single column each.
For instance
Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1
Is there a simple way to convert this to:
Q1 Q2 Q3
1 3 2
2 4 1
3 2 2
Right now I'm just using strsplit() (as all the dummied variable names contain '.') with a couple of loops, but I feel like there should be a better way. Any suggestions?
I wrote a function a while back that did this sort of thing.
MultChoiceCondense <- function(vars, indata){
  tempvar <- matrix(NaN, ncol=1, nrow=length(indata[,1]))
  dat <- indata[,vars]
  for (i in 1:length(vars)){
    for (j in 1:length(indata[,1])){
      if (dat[j,i]==1) tempvar[j] = i
    }
  }
  return(tempvar)
}
If your data is called Dat, then:
Dat$Q1<-MultChoiceCondense(c("Q1.1","Q1.2","Q1.3"),Dat)
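Q2 is already a single column, so only the dummied groups need condensing; the call for Q3 (my example, following the same pattern) would be:
Dat$Q3 <- MultChoiceCondense(c("Q3.1","Q3.2"), Dat)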
Here's an approach that uses melt from "reshape2" and cSplit from my "splitstackshape" package along with some "data.table" fun. I've loaded dplyr so that we can pipe all the things.
library(splitstackshape)
library(reshape2)
library(dplyr)
mydf %>%
as.data.table(keep.rownames = TRUE) %>% # Convert to data.table. Keep rownames
melt(id.vars = "rn", variable.name = "V") %>% # Melt the dataset by rownames
.[value > 0] %>% # Subset for all non-zero values
cSplit("V", ".") %>% # Split the "V" column (names) by "."
.[is.na(V_2), V_2 := value] %>% # Replace NA values with actual values
dcast.data.table(rn ~ V_1, value.var = "V_2") # Go wide.
# rn Q1 Q2 Q3
# 1: 1 1 3 2
# 2: 2 2 4 1
# 3: 3 3 2 2
Here's a possible base R approach:
## Which columns are binary?
Bins <- sapply(mydf, function(x) {
  all(x %in% c(0, 1))
})
## Two vectors -- part after the dot and before
X <- gsub(".*\\.(.*)$", "\\1", names(mydf)[Bins])
Y <- unique(gsub("(.*)\\..*$", "\\1", names(mydf)[Bins]))
## Use `apply` to subset the X value based on the
## logical version of the binary variable
cbind(mydf[!Bins],
      `colnames<-`(t(apply(mydf[Bins], 1, function(z) {
        X[as.logical(z)]
      })), Y))
# Q2 Q1 Q3
# 1 3 1 2
# 2 4 2 1
# 3 2 3 2
At the end, you can just reorder the columns as required. You may also need to convert them to numeric since in this case, Q1 and Q3 will be factors.
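A short sketch of that conversion, assuming the cbind result above was assigned to a variable (here called out, a hypothetical name):
out$Q1 <- as.numeric(as.character(out$Q1))  # convert via character first,
out$Q3 <- as.numeric(as.character(out$Q3))  # otherwise you get the factor level codes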
Another base R approach:
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")
## this will take all the unique questions; Q1, Q2, Q3; test if
## they are dummies; and return the column if so or find which
## dummy column is a 1 otherwise
res <- lapply(unique(gsub('\\..*', '', names(dat))), function(x) {
  tmp <- dat[, grep(x, names(dat)), drop = FALSE]
  if (ncol(tmp) == 1) unlist(tmp, use.names = FALSE) else max.col(tmp)
})
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 3 4 2
#
# [[3]]
# [1] 2 1 2
do.call('cbind', res)
# [,1] [,2] [,3]
# [1,] 1 3 2
# [2,] 2 4 1
# [3,] 3 2 2
I'm assuming your data looks like this, where the categorical columns are encoded using a dot at the end. You may also have a case where all of the values in a row are zero, which indicates a base level (such as how dummyVars in caret works with fullRank=FALSE). If so, here is a vectorized solution.
library(dplyr)
dummyVars.undo = function(df, col_prefix) {
  if (!endsWith(col_prefix, '.')) {
    # If col_prefix doesn't end with a period, include one, but save the
    # "pretty name" as the one without a period
    pretty_col_prefix = col_prefix
    col_prefix = paste0(col_prefix, '.')
  } else {
    # Otherwise, strip the period for the pretty column name
    pretty_col_prefix = substr(col_prefix, 1, nchar(col_prefix)-1)
  }
  # Get all columns with that encoding prefix
  cols = names(df)[names(df) %>% startsWith(col_prefix)]
  # Find the rows where all values are zero. If this isn't the case
  # with your data there's no worry, it won't hurt anything.
  base_level.idx = rowSums(df[cols]) == 0
  # Set the column value to a base value of zero
  df[base_level.idx, pretty_col_prefix] = 0
  # Go through the remaining columns and find where the maximum value (1) occurs
  df[!base_level.idx, pretty_col_prefix] = cols[apply(df[!base_level.idx, cols], 1, which.max)] %>%
    strsplit('\\.') %>%
    sapply(tail, 1)
  # Drop the encoded columns
  df[cols] = NULL
  return(df)
}
Usage:
# Collapse Q1
df = dummyVars.undo(df, 'Q1')
# Collapse Q3
df = dummyVars.undo(df, 'Q3')
This uses dplyr, but only for the pipe operator %>%. You could certainly remove that if you'd prefer to do base R instead.
