How to add columns to a dataframe based on indexes in R? (See example)

I'm working with a self-made infix function which simply calculates the percentage growth between observations in columns.
options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301), b=c(123,214,199), h=c(134, 217, 205))
Then I use lapply on my toy data frame to add two new columns.
test[,4:5] <- lapply(1:(ncol(test)-1), function(i) test[,i] %grow% test[,(i+1)])
test
#Output
a b h V4 V5
1 101 123 134 21.78 8.94
2 202 214 217 5.94 1.40
3 301 199 205 -33.89 3.02
This is easy considering I have just three columns and can simply write test[,4:5]. Now, in general terms: how can I do this if we have n columns, using column indexes?
What I mean is I want to add n-1 columns to a given data frame, starting after the last existing one. Something like:
test[,(last_current_column+1):(last_column_created_using_function)]
Considering what I've read in some other posts, using my example, test[,(last_current_column+1): could be written as:
test[,(ncol(test)+1):]
but the second part is still missing and I have no idea how to write it.
I hope I made myself clear. I fully appreciate any comment or advice.
Happy 2019 :)
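For the indexing itself: with n = ncol(test) existing columns you create n - 1 new ones, so the target range runs from ncol(test) + 1 to 2*ncol(test) - 1. A minimal sketch of that direct completion (computing n first, so the bounds aren't affected as columns are added):
n <- ncol(test)
test[, (n + 1):(2*n - 1)] <- lapply(1:(n - 1), function(i) test[, i] %grow% test[, i + 1])
With the three-column example this is exactly test[,4:5].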

Another way would be:
#options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301),
b=c(123,214,199),
h=c(134, 217, 205),
d=c(156,234,235))
# a b h d
# 1 101 123 134 156
# 2 202 214 217 234
# 3 301 199 205 235
seqcols <- seq_along(test) # saved just to improve readability
test[,seqcols[-length(seqcols)] + max(seqcols)] <- lapply(seqcols[-length(seqcols)],
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h d V5 V6 V7
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
Similar to the second solution from @Ronak Shah, just with the use of map2_df from purrr:
cbind(test,
new=purrr::map2_df(test[seqcols[-length(seqcols)]], test[seqcols[-1]], `%grow%`),
deparse.level=1)
# a b h d new.a new.b new.h
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63

You would always have ncol(test) - 1 new columns. Using this logic, there are multiple ways to do this.
One way would be to construct a character vector of column names with some prefix value.
test[paste0("new_col", seq_len(ncol(test) - 1))] <- lapply(1:(ncol(test)-1),
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h new_col1 new_col2
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
Another option is to use mapply and transform, creating subsets of the data frame:
transform(test,
new_col = mapply(`%grow%`, test[1:(ncol(test)- 1)], test[2:ncol(test)]))
# a b h new_col.a new_col.b
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
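If the new_col.a style names are unwanted, one possible tweak (my own addition, not part of the original answer) is to name each growth column after the pair of columns it compares:
growth <- mapply(`%grow%`, test[1:(ncol(test) - 1)], test[2:ncol(test)])
# label the results like "a_to_b", "b_to_h"
colnames(growth) <- paste(names(test)[-ncol(test)], names(test)[-1], sep = "_to_")
cbind(test, growth)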

Related

R - Percentage of whole dataframe per column

I have a data frame reporting the count of answers per question (this is just a part of it), and I'd like to obtain the answer percentage for each question. I've found adorn_percentages, but it computes each percentage by dividing by the total of the whole data frame, whereas I just want the percentage within each column. Each column has a total of 2230 answers.
I was thinking to use something like (x/2230)*100 but I don't know how to go on.
df<-data.frame(q1=c(159,139,1048,571,93), q2=c(106,284,1043,672,125), q3=c(99,222,981,843,94))
q1 q2 q3
1 159 106 99
2 139 284 222
3 1048 1043 981
4 571 672 843
5 93 125 94
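For the literal (x/2230)*100 idea: a data frame divides element-wise by a scalar, so it is a one-liner. (Note the sample columns above don't all sum to exactly 2230, since this is only part of the data, which is why the answers below use the actual column sums instead.)
# divide every count by the known per-question total, then convert to percent
100 * df / 2230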
We may use colSums to do the division, recycling the column sums to the full length with col():
100 * df/colSums(df)[col(df)]
or use sweep
100 * sweep(df, 2, colSums(df), `/`)
Or use proportions
df[paste0(names(df), "_prop")] <- 100 * proportions(as.matrix(df), 2)
Output:
> df
q1 q2 q3 q1_prop q2_prop q3_prop
1 159 106 99 7.910448 4.753363 4.421617
2 139 284 222 6.915423 12.735426 9.915141
3 1048 1043 981 52.139303 46.771300 43.814203
4 571 672 843 28.407960 30.134529 37.650737
5 93 125 94 4.626866 5.605381 4.198303
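A quick sanity check (a sketch, not part of the answer above): each _prop column should sum to 100.
colSums(df[grep("_prop$", names(df))])
# q1_prop q2_prop q3_prop
#     100     100     100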
You can apply prop.table to each column:
library(dplyr)
df %>% mutate(across(.fns = prop.table, .names = '{col}_prop') * 100)
# q1 q2 q3 q1_prop q2_prop q3_prop
#1 159 106 99 7.910448 4.753363 4.421617
#2 139 284 222 6.915423 12.735426 9.915141
#3 1048 1043 981 52.139303 46.771300 43.814203
#4 571 672 843 28.407960 30.134529 37.650737
#5 93 125 94 4.626866 5.605381 4.198303

Prevent duplicates in R

I have a column in a data table which has entries in non-decreasing order. But there can be duplicate entries.
library(data.table)
labels <- c(123,123,124,125,126,126,128)
time <- data.table(labels, unique_labels = "")
time
labels unique_labels
1: 123
2: 123
3: 124
4: 125
5: 126
6: 126
7: 128
I want to make all entries unique, so the output will be
time
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
Following is a loop implementation for this:
prev_label <- 0
unique_counter <- 0
for (i in 1:length(time$labels)) {
  if (time$labels[i] != prev_label)
    prev_label <- time$labels[i]
  else
    unique_counter <- unique_counter + 1
  time$unique_labels[i] <- time$labels[i] + unique_counter
}
There's a vectorized solution that completely avoids the for loop.
Since time is an R function, I've changed the name of your data.table to tm.
cumsum(duplicated(tm$labels)) + tm$labels
[1] 123 124 125 126 127 128 130
tm$unique_labels <- cumsum(duplicated(tm$labels)) + tm$labels
tm
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
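To see why this works with non-decreasing labels, look at the intermediate steps:
duplicated(tm$labels)
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
cumsum(duplicated(tm$labels)) # running count of duplicates seen so far
# [1] 0 1 1 1 1 2 2
Adding that running offset to the non-decreasing labels pushes every duplicate past the value it collided with.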
tank <- paste("t", 1:NROW(labels), sep = "")
time$unique_labels <- ifelse(duplicated(time$labels), tank, time$labels)
The duplicated function returns a logical vector flagging rows that duplicate earlier ones; just replace those entries with placeholder values you are sure are not used in your set.

Comparing variables with values in another dataframe and replacing them with another value

I have a data.frame with:
Height <- c(169,176,173,172,176,158,168,162,178)
and another with reference heights and weights.
heights_f <- c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6)
weights_f <- c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84)
weightfactor_f <- data.frame(heights_f, weights_f)
I now need to match each height from the first data.frame with the closest reference height in the second one and get the corresponding reference weight.
I haven't had any success yet, as I haven't been able to find anything about matching values that are not exactly the same.
If I understand your goal, instead of taking the nearest value, consider interpolating through the approx function. For instance:
approx(weightfactor_f$heights_f,weightfactor_f$weights_f,xout=Height)$y
#[1] 60.23846 66.44400 63.85385 62.95600 66.44400 50.36000 59.35385 53.96923
#[9] 68.28400
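One caveat worth knowing: approx returns NA for any height outside the reference range; passing rule = 2 would clamp to the nearest endpoint instead, e.g.:
approx(weightfactor_f$heights_f, weightfactor_f$weights_f, xout = Height, rule = 2)$y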
You could do:
Height<- c(169,176,173,172,176,158,168,162,178)
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
df = data.frame(Height=Height, match_weight=
sapply(Height, function(x) {weights_f[which.min(abs(heights_f-x))]}))
i.e. for each entry in Height, find the closest element in the heights_f vector via which.min(abs(heights_f - x)) and fetch the corresponding entry from the weights_f vector.
Output:
Height match_weight
1 169 61.3
2 176 65.8
3 173 63.6
4 172 63.6
5 176 65.8
6 158 49.9
7 168 59.0
8 162 54.5
9 178 68.1
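A vectorized alternative to the sapply call (my own sketch using base findInterval, not from the original answer): locate each height in the sorted reference vector, then keep whichever neighbour, below or above, is closer.
idx <- findInterval(Height, heights_f) # last reference height <= each query
lower <- pmax(idx, 1) # clamp at both ends of the reference vector
upper <- pmin(idx + 1, length(heights_f))
nearest <- ifelse(abs(Height - heights_f[lower]) <= abs(heights_f[upper] - Height), lower, upper)
weights_f[nearest]
This returns the same nearest-match weights as the sapply version above.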
library(dplyr)
Slightly different structure for the reproducible example:
Height <- data.frame(height = as.numeric(c(169,176,173,172,176,158,168,162,178)))
The rest is the same:
heights_f<- as.numeric(c(144.8,147.3,149.9,152.4,154.9,157.5,160,162.6,165.1,167.6,170.2,172.7,175.3,177.8,180.3,182.9,185.4,188,190.5,193,195.6))
weights_f<- as.numeric(c(38.6,40.9,43.1,45.4,47.7,49.9,52.2,54.5,56.8,59,61.3,63.6,65.8,68.1,70.4,72.6,74.9,77.2,79.5,81.7,84))
weightfactor_f<- data.frame(heights_f,weights_f)
Then, round to the nearest whole number:
weightfactor_f$heights_f <- round(weightfactor_f$heights_f, 0)
Then just:
left_join(Height, weightfactor_f, by = c("height" = "heights_f"))
Output (note the NAs: after rounding, only some heights find an exact match in the reference table):
height weights_f
1 169 NA
2 176 NA
3 173 63.6
4 172 NA
5 176 NA
6 158 49.9
7 168 59.0
8 162 NA
9 178 68.1
# Base R loop: for each height, take the weight at the nearest reference height
z <- vector()
for (i in seq_along(Height)) {
  z[i] <- weightfactor_f$weights_f[which.min(abs(Height[i] - weightfactor_f$heights_f))]
}

R One sample test for set of columns for each row

I have a data set where I have the Levels and Trends for say 50 cities for 3 scenarios. Below is the sample data -
City <- paste0("City",1:50)
L1 <- sample(100:500,50,replace = T)
L2 <- sample(100:500,50,replace = T)
L3 <- sample(100:500,50,replace = T)
T1 <- runif(50,0,3)
T2 <- runif(50,0,3)
T3 <- runif(50,0,3)
df <- data.frame(City,L1,L2,L3,T1,T2,T3)
Now, across the 3 scenarios I find the minimum Level and Minimum Trend using the below code -
df$L_min <- apply(df[,2:4],1,min)
df$T_min <- apply(df[,5:7],1,min)
Now I want to check whether these minimum values are significantly different from the levels and trends respectively, i.e. check L_min against columns 2-4 and T_min against columns 5-7. This needs to be done for each city (row), and if significant, return which column it differs from.
It would help if some one could guide how this can be done.
Thank you!!
I'll put my idea here; nevertheless, I'm looking forward to ideas from others.
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min
1 City1 251 176 263 1.162313 0.07196579 2.0925715 176 0.07196579
2 City2 385 406 264 0.353124 0.66089524 2.5613980 264 0.35312402
3 City3 437 333 426 2.625795 1.43547766 1.7667891 333 1.43547766
4 City4 431 405 493 2.042905 0.93041254 1.3872058 405 0.93041254
5 City5 101 429 100 1.731004 2.89794314 0.3535423 100 0.35354230
6 City6 374 394 465 1.854794 0.57909775 2.7485841 374 0.57909775
> df$FC <- rowMeans(df[,2:4])/df[,8]
> df <- df[order(-df$FC), ]
> head(df)
City L1 L2 L3 T1 T2 T3 L_min T_min FC
18 City18 461 425 117 2.7786757 2.6577894 0.75974121 117 0.75974121 2.857550
38 City38 370 117 445 0.1103141 2.6890014 2.26174542 117 0.11031411 2.655271
44 City44 101 473 222 1.2754675 0.8667007 0.04057544 101 0.04057544 2.627063
10 City10 459 361 132 0.1529519 2.4678493 2.23373484 132 0.15295194 2.404040
16 City16 232 393 110 0.8628494 1.3995549 1.01689217 110 0.86284938 2.227273
15 City15 499 475 182 0.3679611 0.2519497 2.82647041 182 0.25194969 2.117216
Now you have the rows that differ most, based on columns 2:4, at the top. Columns 5:7 can be handled in an analogous way.
And some tips for statistical tests:
Prefer t.test (parametric, based on the mean) over the Wilcoxon / Mann-Whitney U test (non-parametric, based on the median), as it has more power; HOWEVER:
-Data sets should be big. Example hypothesis: Montreal has taller citizens than Quebec; t.test will work fine when you take 100 people from each city, so we have height measurements of 200 people, 100 vs 100.
-The distribution should be close to normal in all samples, or both samples should have a similar distribution far from normal (it may be binomial). Either way, we can't use this test when one sample has a normal distribution and the second hasn't.
-The sizes of both samples should be equal, so 100 vs 100 is OK, but 87 vs 234 is not; the p-value may fall below 0.05 yet be misleading.
If your data doesn't meet the above conditions, I prefer a non-parametric test: less power, but more robust.
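As for the per-row test itself, here is a minimal sketch of a literal one-sample t.test per row, treating the three scenario values as the sample and testing their mean against the row minimum (my reading of the question, not part of the ranking above; with only three observations per row such a test has very little power):
df$L_pval <- apply(df[, 2:4], 1, function(x) t.test(x, mu = min(x))$p.value)
df$T_pval <- apply(df[, 5:7], 1, function(x) t.test(x, mu = min(x))$p.value)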

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from the same data frame by matching the first 5 letters of the column names with each other; if they are equal, subset those columns and store them in a new variable.
Here is a small explanation of my required output, described below.
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and store the subset of columns after matching the first 5 letters of the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function subsets by a given string correctly, but I am confused about how to do the matching among the column names in the dataset.
Edit: I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 0, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select_()
library(tidyverse)
map(v, ~select_(eatable, ~matches(.)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 0, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
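As a side note (my addition, not from the answers here): base R's split.default can build the same named list in one line, splitting the columns by the first five letters of their names:
split.default(eatable, substr(names(eatable), 1, 5))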
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want; there was just a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
