I am writing R and want to add a new column WITHOUT using a for loop.
Here is what I want to do:
I want to calculate the mean from the first value up to the current value.
With a for loop I would do it this way:
for (i in seq_len(nrow(data))) {
  data$Xn_bar[i] <- mean(data$Xn[1:i])
}
Is there another way (e.g. with map)?
Here's the data:
a <- data.frame(
  n = 1:10,
  Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117, 0.319, -0.582, 0.715, -0.825, -0.360)
)
You can do this with dplyr::cummean() or calculate it in base R by dividing the cumulative sum by the number of values so far:
cumsum(a$Xn) / seq_along(a$Xn) # base R
dplyr::cummean(a$Xn) # dplyr
# Output in both cases
# [1] -0.50200000 -0.18500000 -0.14966667 0.10950000 0.11100000 0.14566667 0.04171429
# [8] 0.12587500 0.02022222 -0.01780000
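To attach the running mean as the new column the question asks for (a short sketch; Xn_bar is the column name from the question's loop):
a$Xn_bar <- cumsum(a$Xn) / seq_along(a$Xn)           # base R
a <- dplyr::mutate(a, Xn_bar = dplyr::cummean(Xn))   # dplyr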
Here is one solution using dplyr's row_number() and the mapply() function:
library(dplyr)
df <- data.frame(n = c(1, 2, 3, 4, 5),
                 Xn = c(-0.502, 0.132, -0.079, 0.887, 0.117))
# add a row_index column containing the current row number
df <- df %>%
  mutate(row_index = row_number())
# add the Xxn column: mean of Xn from row 1 up to the current row, rounded to 3 digits
df$Xxn <- mapply(function(x, y) round(mean(df$Xn[1:y]), 3),
                 df$Xn,
                 df$row_index,
                 USE.NAMES = FALSE)
# now remove the row_index column
df <- df %>% select(-row_index)
df
# > df
# n Xn Xxn
# 1 1 -0.502 -0.502
# 2 2 0.132 -0.185
# 3 3 -0.079 -0.150
# 4 4 0.887 0.110
# 5 5 0.117 0.111
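Note that the anonymous function never uses its x argument, so the helper column isn't needed at all; sapply() over the row indices gives the same result (an equivalent sketch):
df$Xxn <- sapply(seq_len(nrow(df)), function(i) round(mean(df$Xn[1:i]), 3))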
I'm trying to map across a vector, where the output for each element of the input vector is itself a vector: the first element should produce a whole vector, and likewise the second, third, fourth, etc.
Minimal reproducible example
library(purrr)
library(tidyverse)
set.seed(123)
input <- c(1, 2, 3)
rand <- runif(3)
data.frame(input = input) %>%
  map_df(function(x) { x * rand })
# input
# <dbl>
# 1 0.288
# 2 1.58
# 3 1.23
The actual output has only a single value per row, whereas the desired output is a nested vector, list, or data.frame (something sensible with 3 elements per row instead of 1).
How can this be done?
One option is to use map within mutate to create a list-column within the dataframe.
data.frame(input = input) %>%
  mutate(rand_3 = map(input, function(x) x * rand))
#------
input rand_3
1 1 0.2875775, 0.7883051, 0.4089769
2 2 0.5751550, 1.5766103, 0.8179538
3 3 0.8627326, 2.3649154, 1.2269308
You can use rowwise:
library(dplyr)
df <- data.frame(input)
df %>%
  rowwise() %>%
  mutate(rand_3 = list(input * rand))
# input rand_3
#1 1 0.288, 0.788, 0.409
#2 2 0.575, 1.577, 0.818
#3 3 0.863, 2.365, 1.227
Or lapply in base R:
df$rand_3 <- lapply(df$input, `*`, rand)
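If you later want one row per element instead of a list-column, tidyr can expand it (a sketch, assuming one of the rand_3 list-columns built above):
library(tidyr)
unnest(df, rand_3)   # 9 rows: three rand_3 values for each input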
My imported data comes with varying numbers of rows and columns. I need to convert text percentages (e.g. 32%) into decimals (0.32). Some columns contain these text percentages; others are ordinary numerics and must be left unchanged.
I can convert the strings to decimals across a single column and apply this across the data frame, but I have no elegant way of applying the conversion only to the relevant columns. I solved the problem in a clunky way by building a vector that flags columns containing % strings, then looping over the data frame and checking that vector to decide which columns to convert. I'm looking for a cleaner solution.
# Example structure of data on a small scale
df <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# desc val perc
# 1 a 10 23.01%
# 2 b 3 11.0%
# 3 c 100 2.33%
# the below converts everything which is not desired
sapply(df, function(x) as.numeric(sub("%","",x))/100)
# desc val perc
# [1,] NA 0.10 0.2301
# [2,] NA 0.03 0.1100
# [3,] NA 1.00 0.0233
# my (clunky) solution
aa <- rep(0,ncol(df))
for(i in 1:ncol(df)){aa[i] <- length(grep("%",df[,i]))}
# [1] 0 0 3
for(i in 1:ncol(df)){if (aa[i]>0) {df[,i] <- as.numeric(sub("%", "",df[,i],fixed=TRUE))/100 } }
# desc val perc
# 1 a 10 0.2301
# 2 b 3 0.1100
# 3 c 100 0.0233
A tidyverse solution would be the following:
library(dplyr)
library(stringr)
df %>%
  mutate_if(~ sum(str_detect(., "%")) > 0,
            ~ as.numeric(str_remove(., "%")) / 100)
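With current dplyr (1.0+), the superseded mutate_if can be written with across() and where(); the as.character() wrapper is my addition to keep the predicate safe on factor and numeric columns (a sketch):
df %>%
  mutate(across(where(~ any(str_detect(as.character(.x), "%"))),
                ~ as.numeric(str_remove(as.character(.x), "%")) / 100))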
What I would do is find the columns that contain a %, convert them to character (so you don't have to work with factors, which are a pain in this case), then remove the % signs and divide the numbers by 100.
xy <- data.frame(desc = c('a','b','c'),val = c(10, 3, 100), perc = c('23.01%', '11.0%','2.33%'))
# find which columns have a % - this assumes % is used only to denote percentages
perc.index <- sapply(xy, grepl, pattern = "%")
# convert columns that have at least one % to character
# this step can be also done manually or on import (stringsAsFactors = FALSE)
xy[, colSums(perc.index) > 0] <- sapply(xy[, colSums(perc.index) > 0, drop = FALSE], as.character)
xy[perc.index] <- as.numeric(gsub("%", "", xy[perc.index])) / 100
xy
desc val perc
1 a 10 0.2301
2 b 3 0.11
3 c 100 0.0233
# position of the last character in each value
tmp <- nchar(as.character(df$perc))
# rows whose last character is "%"
tmp2 <- which(substr(df$perc, tmp, tmp) == "%")
# rows without a trailing "%"
tmp3 <- which(!substr(df$perc, tmp, tmp) == "%")
df$perc2 <- NA
# strip the "%" and divide by 100 for the percentage rows
df$perc2[tmp2] <- as.numeric(gsub("%", "", df$perc[tmp2])) / 100
# convert the plain numeric rows directly
df$perc2[tmp3] <- as.numeric(as.character(df$perc[tmp3]))
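The same branching can be written without the index bookkeeping, using grepl() and ifelse() (an equivalent vectorized sketch; suppressWarnings() hides the NAs that the unused branch produces):
has_pct <- grepl("%", df$perc, fixed = TRUE)
df$perc2 <- ifelse(has_pct,
                   as.numeric(sub("%", "", df$perc, fixed = TRUE)) / 100,
                   suppressWarnings(as.numeric(as.character(df$perc))))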
I have a dataframe with unique values in $Number identifying specific points where a polygon intersects. Some points (e.g. 56) have 3 intersecting polygons. I want to extract the three rows which start with 56.
df <- cbind(Number = rownames(check), check)
df
# (screenshot of the df table omitted)
The issue going forward is that I will be applying this to 10,000 points and won't know the repeating number, such as "56". Is there a general expression that selects matching rows without knowing that value?
You can achieve the desired output with:
subset2 <- function(n) df[floor(df$Number) == n,]
where df is the name of your dataset and Number is the name of the target column. We can fill in n as needed:
#Example
df <- data.frame(Number=c(1,3,24,56.65,56.99,56.14,66),y=sample(LETTERS,7))
df
# Number y
# 1 1.00 J
# 2 3.00 B
# 3 24.00 D
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
# 7 66.00 V
subset2(56)
# Number y
# 4 56.65 R
# 5 56.99 I
# 6 56.14 H
I simply changed the $Number column into a numeric field, then rounded down to integer data.
numeric <- as.numeric(as.character(df$Number))
Id <- floor(numeric)
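Since the repeating values aren't known in advance, every group can also be extracted at once without naming any value (a sketch using split(); the as.character() step mirrors the conversion above):
groups <- split(df, floor(as.numeric(as.character(df$Number))))
groups[["56"]]   # the same rows subset2(56) returns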
If we only want values of $Number that occur at least 3 times, we can use dplyr to group by $Number and then retain $Number if it has at least 3 counts
library(dplyr)
# Data
df <- data.frame(Number = c(1,1,1,2,2,3,3))
# Filtering
df %>% group_by(Number) %>% filter(n() >= 3)
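A base R equivalent keeps the rows whose Number occurs at least three times (a sketch using ave()):
df[ave(df$Number, df$Number, FUN = length) >= 3, ]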
I am attempting to create new variables using a function and lapply rather than working directly in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
                   "v1" = rep(c("A", "B"), c(10, 20)),
                   "v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat, ]
  df <- df[order(-df$v2), ]
  # Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20 - p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  # Keep only those customers in the top groups
  df <- df[1:p20, ]
  # Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  # Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10 + 1):p20] <- 20
  # Add this variable to the data frame
  df$top_pct <- top_pct
  # Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for (i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for (i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[, g := {
  x = rep(0, .N)
  x[r > .8] = 20
  x[r > .9] = 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
  DT[, r := rank(v2)/.N, by = v1][, g := {x = rep(0, .N); x[r > .8] = 20; x[r > .9] = 10; x}][g > 0],
  custID ~ v1)
dcast for data.tables originally required the development version of data.table from GitHub; it has since become part of the CRAN release.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
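With current tidyr (1.0+), spread() is superseded by pivot_wider(); the same pipe works with only the last step changed (a sketch; the ungroup() is added because select() keeps the grouping column):
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  ungroup() %>%
  pivot_wider(names_from = v1, values_from = top_pct)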
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
Finding quantiles is done with quantile.
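To get one combined data frame back instead of a list, the pieces can be bound together afterwards (a minimal sketch, saving the lapply() result first):
res <- lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
do.call(rbind, res)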
I am trying to vectorize the following task with one of the apply functions, but in vain.
I have a list and a dataframe. What I am trying to accomplish is to create subgroups in a dataframe using a lookup list.
The lookup list (which are basically percentile groups) looks like the following:
Look_Up_List
$`1`
A B C D E
0.000 0.370 0.544 0.698 9.655
$`2`
A B C D E
0.000 0.506 0.649 0.774 1.192
The current dataframe looks like this:
Score Big_group
0.1 1
0.4 1
0.3 2
The resulting dataframe must look like the following, with an additional column giving the percentile bucket each score falls into, looked up in the corresponding Big_group entry of the lookup list:
Score Big_group Sub_Group
0.1 1 A
0.4 1 B
0.3 2 A
Thanks so much
You can create a function like this:
myFun <- function(x) {
  names(Look_Up_List[[as.character(x[2])]])[
    findInterval(x[1], Look_Up_List[[as.character(x[2])]])]
}
And apply it by row with apply (mydf here being your data frame):
apply(mydf, 1, myFun)
# [1] "A" "B" "A"
# reproducible input data (the list elements must be named with =, not <-,
# so that Look_Up_List[["1"]] and Look_Up_List[["2"]] work)
Look_Up_List <- list('1' = c(A = 0.000, B = 0.370, C = 0.544, D = 0.698, E = 9.655),
                     '2' = c(A = 0.000, B = 0.506, C = 0.649, D = 0.774, E = 1.192))
Current <- data.frame(Score = c(0.1, 0.4, 0.3),
                      Big_group = c(1, 1, 2))
# Solution 1: pick the highest bucket whose lower bound the score exceeds,
# looking in the row's own Big_group (not hard-coded to the first list element)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lk <- Look_Up_List[[as.character(Current$Big_group[i])]]
  max(names(lk[Current$Score[i] > lk]))
})
# Alternative solution (using findInterval, slightly slower at least for this dataset)
Current$Sub_Group <- sapply(seq_len(nrow(Current)), function(i) {
  lk <- Look_Up_List[[as.character(Current$Big_group[i])]]
  names(lk)[findInterval(Current$Score[i], lk)]
})
# show result
Current