I'm trying to loop this sequence of steps in R for a data frame.
Here is my data:
ID Height Weight
a  100    80
b  80     90
c  NA     70
d  120    NA
...
Here is my code so far
winsorize2 <- function(x) {
  Min <- which(x == min(x))  # positions of the minimum value(s)
  Max <- which(x == max(x))  # positions of the maximum value(s)
  ord <- order(x)
  x[Min] <- x[ord][length(Min) + 1]          # replace minima with the next value up
  x[Max] <- x[ord][length(x) - length(Max)]  # replace maxima with the next value down
  x
}
df <- read.csv("data.csv")
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)  # standardize all but the ID column
id <- df$ID
full <- data.frame(id, df2)
full[is.na(full)] <- 0                              # replace NAs with 0
full[, -1] <- sapply(full[,-1], winsorize2)         # winsorize each column
What I'm trying to do is this: standardize a data frame, then winsorize the standardized data frame using the function winsorize2, i.e. replace the most extreme values with the second least extreme values. This is then repeated 10 times. How do I write a loop for this? I'm also confused because the sequence above already replaces the NAs with 0s, so should that step be removed from the loop too?
Edit: After discussion with @ekstroem, we decided to change the code to introduce the boundaries:
df <- read.csv("data.csv")
id <- df$ID
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)
df2[is.na(df2)] <- 0
df2[df2 <= -3] <- -3   # clamp at the -3/+3 boundaries
df2[df2 >= 3] <- 3
df3 <- df2             # trying to loop again
df3 <- scale(df3, center = TRUE, scale = TRUE)
df3[is.na(df3)] <- 0
df3[df3 <= -3] <- -3
df3[df3 >= 3] <- 3
There are some boundary issues that are not fully specified in your code, but maybe the following can be used (base R, and not super efficient):
wins2 <- function(x, n = 1) {
  xx <- sort(unique(x))
  x[x <= xx[n]] <- xx[n + 1]                        # raise the n smallest unique values
  x[x >= xx[length(xx) - n]] <- xx[length(xx) - n]  # cap values at the (n+1)-th largest
  x
}
This yields:
x <- 1:11
wins2(x, 1)
# [1]  2  2  3  4  5  6  7  8  9 10 10
wins2(x, 3)
# [1]  4  4  4  4  5  6  7  8  8  8  8
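To answer the looping part directly, here is a minimal sketch of the repeated standardize-then-winsorize cycle, assuming the NAs are zeroed once before the loop (as discussed) and reusing wins2 from above; the object name mat and the use of apply are my own choices:
df <- read.csv("data.csv")
id <- df$ID
mat <- scale(df[,-1], center = TRUE, scale = TRUE)
mat[is.na(mat)] <- 0                  # replace NAs once, outside the loop
for (i in 1:10) {
  mat <- apply(mat, 2, wins2)         # winsorize each column
  mat <- scale(mat, center = TRUE, scale = TRUE)  # re-standardize
}
full <- data.frame(id, mat)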
My dataframe contains a column with various touch points, numbered 1 to 18. I want to know which touch point results in touch point 10, so I want to create a new column showing the touch point that occurred immediately before touch point 10 in each customer journey (PurchaseId). If touch point 10 doesn't occur in a customer journey, the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2,500,000 observations. Does anyone know how to solve my problem? Thanks in advance.
Firstly, it is better to give complete, reproducible sample code. I suggest you look at the data.table library, which is nice for handling large datasets.
library(data.table)
# simulate 15 journeys in wide format: one column per touch position
mdata <- matrix(sample(x = c(1:20, 21), size = 15*10, replace = TRUE), ncol = 10)
mdata[mdata == 21] <- NA
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using for: record the column index when touch point 10 occurs in that column, else 0
res <- rep(0, 10)
for(i in 1:10){
  cat(i, "\n")
  res[i] <- i * df_touch[, (10 %in% get(paste0("cj", i)))]
  cat(res[i], "\n")
}
# -- using lapply: the same membership check applied to every column at once
dfun <- function(x, k = 10){ return(k %in% x) }
df_touch[, lapply(.SD, dfun)]
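For the question as actually stated (the touch point immediately before touch point 10 within each journey), here is a minimal data.table sketch of my own, using the dd data from the question and assuming rows are already ordered within each journey (the column name outcome is mine):
library(data.table)
setDT(dd)
# per journey: locate the first occurrence of touch point 10 and take the
# preceding touch point; use 0 if 10 is absent or is the first touch point
dd[, outcome := {
  idx <- match(10, TouchPoint)
  if (is.na(idx) || idx == 1L) 0L else TouchPoint[idx - 1L]
}, by = PurchaseId]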
I have two vectors of data, x and y. Let's say the first one is the distance and the second the temperature.
How can I remove from both x and y all points whose distance to the previous point (xi - xi-1) is lower than a constant distance d?
x = (1, 2, 3, 8, 12)
y = (10, 12, 11, 9, 12)
Remove points with a distance smaller than 5:
x = 1, 2 (out, as 2-1 < 5), 3 (out, as 3-1 < 5), 8, 12 (fine as the last point, even though 12-8 < 5)
x = (1, 8, 12)
y = (10, 9, 12)
Here is one idea, assuming that your first and last elements are never removed:
v1 <- setNames(x, y)[c(TRUE, (diff(x) >= 5)[-(length(x)-1)], TRUE)]
#10  9 12
# 1  8 12
# To make it a bit more clear how the named vector is structured (it is still a vector):
names(v1)
#[1] "10" "9" "12"   <- Note: I get 9 whereas you get 11
unname(v1)
#[1]  1  8 12
Or you can make it a function:
rm_elements <- function(x, y, n){
  v1 <- setNames(x, y)[c(TRUE, (diff(x) >= n)[-(length(x)-1)], TRUE)]
  return(list(x = unname(v1), y = as.numeric(names(v1))))
}
rm_elements(x, y, 5)
#$x
#[1]  1  8 12
#$y
#[1] 10  9 12
EDIT: To accommodate your comment about having them in a data frame, we can alter the function a bit to accept a data frame (however you name the variables) and return a subset of that data frame, i.e.
rm_elements <- function(df, n){
  v1 <- df[c(TRUE, (diff(df[[1]]) >= n)[-(nrow(df)-1)], TRUE),]
  return(v1)
}
#Make a data frame from the vectors,
d1 <- data.frame(x=x, y=y)
rm_elements(d1, 5)
which gives,
x y
1 1 10
4 8 9
5 12 12
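One caveat of my own: diff(x) compares each point to its immediate predecessor, whereas the example "3 (out, as 3-1 < 5)" compares each point to the last point that was kept; the two readings happen to coincide for this data. A loop-based sketch of the last-kept-point reading, again always keeping the first and last elements:
keep_from_last <- function(x, d) {
  keep <- 1L                 # always keep the first point
  last <- x[1]
  for (i in 2:length(x)) {
    if (x[i] - last >= d) {  # compare to the last *kept* point
      keep <- c(keep, i)
      last <- x[i]
    }
  }
  if (keep[length(keep)] != length(x)) keep <- c(keep, length(x))  # always keep the last point
  keep
}
idx <- keep_from_last(x, 5)
x[idx]  # 1 8 12
y[idx]  # 10 9 12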
I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
ID value weight
1 1 I 0.1
2 1 G 0.1
3 1 O 0.1
4 2 B 0.1
5 2 M 0.1
6 2 M 0.1
7 3 V 0.5
8 3 J 0.5
9 3 O 0.5
10 4 E 0.2
11 4 Q 0.2
12 4 W 0.2
13 5 H 0.1
14 5 K 0.1
15 5 T 0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for(i in 1:nrow(probs)){
  probs$prob[i] <- df[df$ID %in% probs$ID[i],]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs, 10)
# [1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1],]
for(i in 2:length(sampledIDs)){
  dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
}
but that's painfully slow with a large dataset.
Any help would be much appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)
The code is slow because you resize the data frame in every iteration of the for loop. Here is my suggestion: before the for loop, create a data frame with the final size that dfW will have (this assumes 3 rows per sampled ID, as in the toy data), then assign the values from df to dfW inside the loop. You may replace the last part of your code with this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df)  # make the column names the same
for(i in 1:length(sampledIDs)){  # notice the start index is changed from 2 to 1
  # dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])  # the old, slow version
  dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
Your code should run much faster with this change. Let me know how it goes!
If you don't know the final size, you can resize the data frame whenever needed, but this requires an extra if condition in the for loop. First define a function to resize the data frame as follows:
double_rowsize <- function(df) {
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  df <- rbind(df, mdf)
  return(df)
}
Then start dfW with an initial size, say 12 (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if condition in the for loop to resize the dataframe whenever needed:
for(i in 1:length(sampledIDs)){
  if (3*i > nrow(dfW))
    dfW <- double_rowsize(dfW)
  dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
You can change the details of double_rowsize to grow the data frame by a factor other than 2 if something else works better; doubling is the common choice because it gives amortized constant-time resizing.
Good luck!
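If the preallocated loop is still too slow, a fully vectorized alternative (a sketch of my own, not part of the answer above) is to look up all sampled rows at once; this also handles a variable number of rows per ID and answers the side question about simplifying the probability setup:
# one row per ID gives the sampling weights directly
probs <- unique(df[, c("ID", "weight")])
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE,
                     prob = probs$weight / sum(probs$weight))
# split the row indices of df by ID, then index them all in one shot
rows <- split(seq_len(nrow(df)), df$ID)
dfW <- df[unlist(rows[as.character(sampledIDs)], use.names = FALSE), ]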
My problem is that I want to use a function to change a random value to NA in a global data frame.
df is a data frame with 230 rows and 2 columns.
Abstract code:
emptychange <- function(x){
  placenumber <- round(runif(1, min = 1, max = nrow(x)))
  x[placenumber, 2] <<- NA
}
emptychange(df)
The error is: "Error in x[placenumber, 2] <<- NA : object 'x' not found".
I think the mistake is that R searches for a global variable 'x' and doesn't use the function's x argument (in this case df). How can I fix this? Thanks!
This works. The problem was the <<- assignment: double arrows are used when you want to assign a value to an object outside the function, but in your case x only exists inside the function.
df1 <- data.frame(x = 1, y = 1:10)
emptychange <- function(x){
  placenumber <- round(runif(1, min = 1, max = nrow(x)))
  x[placenumber, 2] <- NA   # regular assignment; return the modified copy
  return(x)
}
emptychange(df1)
If you want to do this at the console, you can just sample from the row count inside the [<- function:
> df1 <-data.frame(x = 1, y = 1:10)
> df1[sample(nrow(df1), 1) , 2] <- NA
> df1
x y
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 NA
7 1 7
8 1 8
9 1 9
10 1 10
If you want to destructively change the data frame passed to a function, you should instead assign the returned value back to the original name:
> randNA.secCol <- function(df) {df[sample(nrow(df), 1) , 2] <- NA; df}
> df1 <-data.frame(x = 1, y = 1:10)
> df1 <- randNA.secCol(df1)
Best practice in R is to avoid using <<-.
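As a side note on why the original code failed: a complex assignment like x[placenumber, 2] <<- NA first has to read x from an enclosing environment, so it only works when such a variable already exists there; the function argument was named x and no global x existed, hence "object 'x' not found". A sketch of my own, for illustration only:
df1 <- data.frame(x = 1, y = 1:10)
emptychange_global <- function() {
  # works (though not recommended) because a global df1 exists for <<- to find
  df1[sample(nrow(df1), 1), 2] <<- NA
}
emptychange_global()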
I am attempting to create new variables using a function and lapply rather than working directly in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is difficult, or at least awkward, in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using one for loop to do the renaming prior to merging and another for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
                   "v1" = rep(c("A", "B"), c(10, 20)),
                   "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat,]
  df <- df[order(-df$v2),]
  #Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20 - p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  #Keep only those customers in the top groups
  df <- df[1:p20,]
  #Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  #Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10+1):p20] <- 20
  #Add this variable to the data frame
  df$top_pct <- top_pct
  #Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[,g:={
  x = rep(0,.N)
  x[r>.8] = 20
  x[r>.9] = 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR.)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try this dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
Finding quantiles is done with quantile.
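To go from this list of per-level data frames to the single wide data frame shown in the other answers, here is one possible follow-up of my own, assuming the list returned above was saved as res (note this answer names the ID column id rather than custID):
# rename the top_pct column per level, then merge everything by id
for (v in names(res)) names(res[[v]])[2] <- paste0(v, "_top_pct")
Reduce(function(a, b) merge(a, b, by = "id", all = TRUE), res)
#   id A_top_pct B_top_pct
# 1  1        10        10
# 2  2        20        20
# 3  6        NA        10
# 4  7        NA        20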