I am writing R and want to add a new column WITHOUT using a for loop.
Here is what I want to do:
I want to calculate the mean from the first value up to the current value.
With a for loop I would do it this way:
for (i in seq_len(nrow(data))) {
  data$Xn_bar[i] <- mean(data$Xn[1:i])
}
Is there another way (e.g. with map)?
Here's the data:
a <- data.frame(
  n = 1:10,
  Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117, 0.319, -0.582, 0.715, -0.825, -0.360)
)
You can do this with dplyr::cummean() or calculate it in base R by dividing the cumulative sum by the number of values so far:
cumsum(a$Xn) / seq_along(a$Xn) # base R
dplyr::cummean(a$Xn) # dplyr
# Output in both cases
# [1] -0.50200000 -0.18500000 -0.14966667 0.10950000 0.11100000 0.14566667 0.04171429
# [8] 0.12587500 0.02022222 -0.01780000
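To attach the running mean as the new column the question asks for (a short sketch; Xn_bar is the column name from the question's loop):
a$Xn_bar <- cumsum(a$Xn) / seq_along(a$Xn)           # base R
a <- dplyr::mutate(a, Xn_bar = dplyr::cummean(Xn))   # dplyr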
Here is one solution using dplyr's row_number() and the mapply() function:
library(dplyr)
df <- data.frame(n = c(1, 2, 3, 4, 5),
                 Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117))
# add a row_index column containing the current row number
df <- df %>%
  mutate(row_index = row_number())
# add the Xxn column: mean of Xn from row 1 up to the current row, rounded to 3 digits
df$Xxn <- mapply(function(x, y) round(mean(df$Xn[1:y]), 3),
                 df$Xn,
                 df$row_index,
                 USE.NAMES = FALSE)
# now remove the row_index column
df <- df %>% select(-row_index)
df
# > df
# n Xn Xxn
# 1 1 -0.502 -0.502
# 2 2 0.132 -0.185
# 3 3 -0.079 -0.150
# 4 4 0.887 0.110
# 5 5 0.117 0.111
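Note that the anonymous function never uses its x argument, so the helper column isn't needed at all; sapply() over the row indices gives the same result (an equivalent sketch):
df$Xxn <- sapply(seq_len(nrow(df)), function(i) round(mean(df$Xn[1:i]), 3))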
I'm trying to map across a vector, where the output for each element of the input vector is itself a vector: the first element should produce a whole vector, and likewise the second, third, fourth, etc.
Minimal reproducible example
library(purrr)
library(tidyverse)
set.seed(123)
input <- c(1, 2, 3)
rand <- runif(3)
data.frame(input = input) %>%
  map_df(function(x) { x * rand })
# input
# <dbl>
# 1 0.288
# 2 1.58
# 3 1.23
The actual output has only a single value per row, whereas the desired output is a nested vector, list, or data.frame (something sensible with 3 elements per row instead of 1).
How can this be done?
One option is to use map within mutate to create a list-column within the dataframe.
data.frame(input = input) %>%
  mutate(rand_3 = map(input, function(x) x * rand))
#------
input rand_3
1 1 0.2875775, 0.7883051, 0.4089769
2 2 0.5751550, 1.5766103, 0.8179538
3 3 0.8627326, 2.3649154, 1.2269308
You can use rowwise:
library(dplyr)
df <- data.frame(input)
df %>%
  rowwise() %>%
  mutate(rand_3 = list(input * rand))
# input rand_3
#1 1 0.288, 0.788, 0.409
#2 2 0.575, 1.577, 0.818
#3 3 0.863, 2.365, 1.227
Or lapply in base R:
df$rand_3 <- lapply(df$input, `*`, rand)
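If you later want one row per element instead of a list-column, tidyr can expand it (a sketch, assuming one of the rand_3 list-columns built above):
library(tidyr)
unnest(df, rand_3)   # 9 rows: three rand_3 values for each input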
My imported data comes with varying numbers of rows and columns. I need to convert text percentages (e.g. 32%) into decimals (0.32). Some columns contain these text percentages; others are ordinary numerics and must be left unchanged.
I can convert the strings to decimals across a single column and apply this across the data frame, but I have no elegant way of applying the conversion only to the relevant columns. I solved the problem in a clunky way by building a vector that flags columns containing % strings, then looping over the data frame and checking that vector to decide which columns to convert. I'm looking for a cleaner solution.
# Example structure of data on a small scale
df <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# desc val perc
# 1 a 10 23.01%
# 2 b 3 11.0%
# 3 c 100 2.33%
# the below converts everything which is not desired
sapply(df, function(x) as.numeric(sub("%","",x))/100)
# desc val perc
# [1,] NA 0.10 0.2301
# [2,] NA 0.03 0.1100
# [3,] NA 1.00 0.0233
# my (clunky) solution
aa <- rep(0,ncol(df))
for(i in 1:ncol(df)){aa[i] <- length(grep("%",df[,i]))}
# [1] 0 0 3
for(i in 1:ncol(df)){if (aa[i]>0) {df[,i] <- as.numeric(sub("%", "",df[,i],fixed=TRUE))/100 } }
# desc val perc
# 1 a 10 0.2301
# 2 b 3 0.1100
# 3 c 100 0.0233
A tidyverse solution would be the following:
library(dplyr)
library(stringr)
df %>%
  mutate_if(~ sum(str_detect(., "%")) > 0,
            ~ as.numeric(str_remove(., "%")) / 100)
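With current dplyr (1.0+), the superseded mutate_if can be written with across() and where(); the as.character() wrapper is my addition to keep the predicate safe on factor and numeric columns (a sketch):
df %>%
  mutate(across(where(~ any(str_detect(as.character(.x), "%"))),
                ~ as.numeric(str_remove(as.character(.x), "%")) / 100))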
What I would do is find the columns that contain a %, convert them to character (so you don't have to work with factors, which are a pain in this case), then remove the % signs and divide the numbers by 100.
xy <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# find which columns have a % - this assumes % is used only to denote percentages
perc.index <- sapply(xy, grepl, pattern = "%")
# convert columns that have at least one % to character
# this step can be also done manually or on import (stringsAsFactors = FALSE)
xy[, colSums(perc.index) > 0] <- sapply(xy[, colSums(perc.index) > 0, drop = FALSE], as.character)
xy[perc.index] <- as.numeric(gsub("%", "", xy[perc.index])) / 100
xy
desc val perc
1 a 10 0.2301
2 b 3 0.11
3 c 100 0.0233
# position of the last character in each value
tmp <- nchar(as.character(df$perc))
# rows whose last character is "%"
tmp2 <- which(substr(df$perc, tmp, tmp) == "%")
# rows without a trailing "%"
tmp3 <- which(!substr(df$perc, tmp, tmp) == "%")
df$perc2 <- NA
# strip the "%" and divide by 100 for the percentage rows
df$perc2[tmp2] <- as.numeric(gsub("%", "", df$perc[tmp2])) / 100
# convert the plain numeric rows directly
df$perc2[tmp3] <- as.numeric(as.character(df$perc[tmp3]))
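The same branching can be written without the index bookkeeping, using grepl() and ifelse() (an equivalent vectorized sketch; suppressWarnings() hides the NAs that the unused branch produces):
has_pct <- grepl("%", df$perc, fixed = TRUE)
df$perc2 <- ifelse(has_pct,
                   as.numeric(sub("%", "", df$perc, fixed = TRUE)) / 100,
                   suppressWarnings(as.numeric(as.character(df$perc))))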
I have a dataframe with unique values in $Number identifying specific points where a polygon intersects. Some points (e.g. 56) have 3 intersecting polygons. I want to extract the three rows which start with 56.
df <- cbind(Number = rownames(check), check)
df
# (screenshot of the df table omitted)
The issue going forward is that I will be applying this to 10,000 points and won't know the repeating number, such as "56". Is there a general expression that selects matching rows without knowing that value?
You can achieve the desired output with:
subset2 <- function(n) df[floor(df$Number) == n,]
where df is the name of your dataset and Number is the name of the target column. We can fill in n as needed:
#Example
df <- data.frame(Number=c(1,3,24,56.65,56.99,56.14,66),y=sample(LETTERS,7))
df
# Number y
# 1 1.00 J
# 2 3.00 B
# 3 24.00 D
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
# 7 66.00 V
subset2(56)
# Number y
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
I simply changed the $Number column into a numeric field, then rounded down to integer data.
numeric <- as.numeric(as.character(df$Number))
Id <- floor(numeric)
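Since the repeating values aren't known in advance, every group can also be extracted at once without naming any value (a sketch using split(); the as.character() step mirrors the conversion above):
groups <- split(df, floor(as.numeric(as.character(df$Number))))
groups[["56"]]   # the same rows subset2(56) returns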
If we only want values of $Number that occur at least 3 times, we can use dplyr to group by $Number and then retain $Number if it has at least 3 counts
library(dplyr)
# Data
df <- data.frame(Number = c(1,1,1,2,2,3,3))
# Filtering
df %>% group_by(Number) %>% filter(n() >= 3)
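A base R equivalent keeps the rows whose Number occurs at least three times (a sketch using ave()):
df[ave(df$Number, df$Number, FUN = length) >= 3, ]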
I am attempting to create new variables using a function and lapply rather than working directly in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
                   "v1" = rep(c("A", "B"), c(10, 20)),
                   "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat, ]
  df <- df[order(-df$v2), ]
  # Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20 - p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  # Keep only those customers in the top groups
  df <- df[1:p20, ]
  # Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  # Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10 + 1):p20] <- 20
  # Add this variable to the data frame
  df$top_pct <- top_pct
  # Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for (i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for (i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[, g := {
  x = rep(0, .N)
  x[r > .8] = 20
  x[r > .9] = 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
  DT[, r := rank(v2)/.N, by = v1][, g := {x = rep(0, .N); x[r > .8] = 20; x[r > .9] = 10; x}][g > 0],
  custID ~ v1)
dcast for data.tables originally required the development version of data.table from GitHub; it has since become part of the CRAN release.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
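With current tidyr (1.0+), spread() is superseded by pivot_wider(); the same pipe works with only the last step changed (a sketch; the ungroup() is added because select() keeps the grouping column):
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  ungroup() %>%
  pivot_wider(names_from = v1, values_from = top_pct)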
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
Finding quantiles is done with quantile.
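To get one combined data frame back instead of a list, the pieces can be bound together afterwards (a minimal sketch, saving the lapply() result first):
res <- lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
do.call(rbind, res)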
I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which are basically percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
The resulting dataframe must look like the following, with an additional column giving the percentile bucket each score falls into, looked up in the corresponding Big_group entry of the lookup list:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much
You can create a function like this:
myFun <- function(x) {
  names(Look_Up_List[[as.character(x[2])]])[
    findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply (mydf here being your data frame):
apply(mydf, 1, myFun)
# [1] "A" "B" "A"
# reproducible input data (the list elements must be named with =, not <-,
# so that Look_Up_List[["1"]] and Look_Up_List[["2"]] work)
Look_Up_List <- list('1' = c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655),
                     '2' = c(A = 0.000, B = 0.506, C = 0.649, D = 0.774, E = 1.192))
Current <- data.frame(Score = c(0.1, 0.4, 0.3),
                      Big_group = c(1, 1, 2))
# Solution 1: pick the highest bucket whose lower bound the score exceeds,
# looking in the row's own Big_group (not hard-coded to the first list element)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lk <- Look_Up_List[[as.character(Current$Big_group[i])]]
  max(names(lk[Current$Score[i] > lk]))
})
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lk <- Look_Up_List[[as.character(Current$Big_group[i])]]
  names(lk)[findInterval(Current$Score[i], lk)]
})
# show result
Current