How to use the apply function in R to normalize a data frame? - r

I have a data frame and I want to normalize each row (row-wise normalization).
For example:
row1_new = (row1_old - mean_of_row1)/standard_dev_of_row1.
I wrote the following code for that:
normalize_df <- function(x){
mean1<- mean(unlist(as.list(x)))
std1<- sd(unlist(as.list(x)))
y = (x - mean1)/std1
return(y)
}
n_rows <- nrow(query_data)
for (i in seq_len(n_rows)) {
  # Normalize one row at a time
  query_data[i, ] <- normalize_df(query_data[i, ])
}
But the loop is very slow, and I didn't succeed in using the apply function.
How can I use apply to normalize a data frame row-wise?

As an alternative, use the built-in function scale: t(scale(t(df))).
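The double transpose is needed because scale() standardizes columns, not rows: transposing first turns each original row into a column, and transposing back restores the layout. A minimal sketch of that idea, assuming an all-numeric data frame (mtcars is used here only for illustration):

```r
# scale() centers and scales columns, so transpose, scale, transpose back
normalized <- t(scale(t(mtcars)))

# Each row should now have mean 0 and standard deviation 1
row_means <- rowMeans(normalized)
row_sds <- apply(normalized, 1, sd)
all(abs(row_means) < 1e-8)
all(abs(row_sds - 1) < 1e-8)
```

Note that the result is a matrix; wrap it in as.data.frame() if a data frame is needed downstream.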

Here is another option using rowMeans (base R) and rowSds (from the matrixStats package).
library(matrixStats)
df <- mtcars
(df - rowMeans(df))/rowSds(as.matrix(df))
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 -0.16637 -0.447 2.43 1.496 -0.486 -0.510 -0.25117 -0.559 -0.540 -0.484 -0.484
#Mazda RX4 Wag -0.16784 -0.448 2.43 1.495 -0.487 -0.507 -0.24221 -0.560 -0.542 -0.486 -0.486
#Datsun 710 -0.02053 -0.504 2.17 1.785 -0.508 -0.547 -0.12833 -0.581 -0.581 -0.504 -0.581
#Hornet 4 Drive -0.21836 -0.412 2.76 0.897 -0.449 -0.447 -0.24304 -0.475 -0.488 -0.450 -0.475
#Hornet Sportabout -0.30751 -0.402 2.69 1.067 -0.444 -0.442 -0.32228 -0.472 -0.472 -0.446 -0.454
#Valiant -0.24228 -0.415 2.72 1.000 -0.462 -0.452 -0.21197 -0.487 -0.501 -0.458 -0.487
#...
#...

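It is simple to use an anonymous function with apply. Since apply returns the result for each row as a column, wrap the call in t() to get the original orientation back:
t(apply(mtcars, 1, function(x) (x-mean(x))/sd(x)))
# Truncated output of the inner apply() call, before transposing (cars as columns):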
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360 Merc 240D Merc 230 Merc 280 Merc 280C
mpg -0.1663702 -0.1678380 -0.02053465 -0.2183565 -0.3075069 -0.2422777 -0.3696702 -0.005278277 -0.09496285 -0.2208754 -0.2439523
cyl -0.4465404 -0.4481484 -0.50419828 -0.4122884 -0.4016114 -0.4152404 -0.4209455 -0.464365595 -0.49763495 -0.4511720 -0.4497564
disp 2.4298741 2.4297052 2.17138772 2.7611422 2.6941650 2.7152412 2.4439581 2.746995205 2.43244710 2.3682166 2.3687127
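The orientation issue with apply can be checked directly. The sketch below, in which row_z is a hypothetical helper name and mtcars is just sample data, compares the dimensions before and after t() and confirms that the result agrees with the scale()-based approach:

```r
# Hypothetical helper: z-score a numeric vector
row_z <- function(x) (x - mean(x)) / sd(x)

raw <- apply(mtcars, 1, row_z)   # one column per car: 11 x 32
fixed <- t(raw)                  # back to one row per car: 32 x 11
dim(raw)
dim(fixed)

# Should agree with the scale()-based approach, ignoring extra attributes
all.equal(fixed, t(scale(t(mtcars))), check.attributes = FALSE)
```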

Using dapply from the collapse package (df is a numeric data frame such as mtcars):
library(collapse)
dapply(df, MARGIN = 1, FUN = function(x) (x-mean(x))/sd(x))

Related

Table reformatting

I have this table with two columns, major_activity_area and word_stem.
major_activity_area                      word_stem
Youth Development                        program
Youth Development                        girl
Youth Development                        youth
Youth Development                        school
Religion Related Spiritual Development   service
Religion Related Spiritual Development   provid
Religion Related Spiritual Development   program
Religion Related Spiritual Development   hous
What I want is to turn each major_activity_area value into a new column, with its word_stem values listed underneath, such as:
Youth Development   Religion Related Spiritual Development
program             servic
girl                provid
youth               program
school              hous
I would appreciate any help! :)
Try the transpose function t().
Since you did not share a sample dataset (for example via dput()), I'll create a dummy data frame from mtcars:
library(dplyr)  # provides the %>% pipe
df.1 <- mtcars %>%
  head(10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Using the transpose function in R, t(dataframe):
df.2 <- t(df.1)
This turns the row names into column headers (and the column names into row names), and gives the result below:
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360 Merc 240D
mpg 21.00 21.000 22.80 21.400 18.70 18.10 14.30 24.40
cyl 6.00 6.000 4.00 6.000 8.00 6.00 8.00 4.00
disp 160.00 160.000 108.00 258.000 360.00 225.00 360.00 146.70
hp 110.00 110.000 93.00 110.000 175.00 105.00 245.00 62.00
drat 3.90 3.900 3.85 3.080 3.15 2.76 3.21 3.69
wt 2.62 2.875 2.32 3.215 3.44 3.46 3.57 3.19
qsec 16.46 17.020 18.61 19.440 17.02 20.22 15.84 20.00
vs 0.00 0.000 1.00 1.000 0.00 1.00 0.00 1.00
am 1.00 1.000 1.00 0.000 0.00 0.00 0.00 0.00
gear 4.00 4.000 4.00 3.000 3.00 3.00 3.00 4.00
carb 4.00 4.000 1.00 1.000 2.00 1.00 4.00 2.00
Merc 230 Merc 280
mpg 22.80 19.20
cyl 4.00 6.00
disp 140.80 167.60
hp 95.00 123.00
drat 3.92 3.92
wt 3.15 3.44
qsec 22.90 18.30
vs 1.00 1.00
am 0.00 0.00
gear 4.00 4.00
carb 2.00 4.00
If this is not what you're looking for, kindly share a sample dataset using dput() so that I can produce exactly what you want.
Assuming your data is a tibble or data frame, the function pivot_wider from the tidyverse (tidyr package) would be a solution. I call your data df below. Because each area has several word stems, the values are first collected into list-columns and then unnested:
library(tidyverse)
dfw <- df %>%
  pivot_wider(names_from = major_activity_area,
              values_from = word_stem,
              values_fn = list) %>%  # one list of stems per area
  unnest(cols = everything())        # expand the lists into rows
dfw
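For comparison, the same reshaping can be sketched in base R with split(). This is an illustration rather than part of the answers above: the data frame is rebuilt by hand from the table in the question, and the approach assumes every area has the same number of word stems (otherwise data.frame() cannot bind the columns).

```r
# Data rebuilt from the question's table
df <- data.frame(
  major_activity_area = rep(c("Youth Development",
                              "Religion Related Spiritual Development"),
                            each = 4),
  word_stem = c("program", "girl", "youth", "school",
                "service", "provid", "program", "hous"),
  stringsAsFactors = FALSE
)

# Keep the areas in order of first appearance instead of alphabetical order
area <- factor(df$major_activity_area, levels = unique(df$major_activity_area))

# split() gives one character vector per area; bind them as columns
wide <- data.frame(split(df$word_stem, area),
                   check.names = FALSE, stringsAsFactors = FALSE)
wide
```

check.names = FALSE keeps the spaces in the column names instead of converting them to dots.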

Calculate the mean value for each variable in multiple dataframes stored in single list

I was wondering if it's possible to loop through the contents of a list, and calculate the mean values of all variables, across multiple dataframes. Here I've created an example list, containing three dataframes.
datalist <- list(a=mtcars[1:4, 1:2],b=mtcars[ 5:9 ,3:7],c=mtcars[10:15,8:11])
> datalist
$a
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
$b
disp hp drat wt qsec
Hornet Sportabout 360.0 175 3.15 3.44 17.02
Valiant 225.0 105 2.76 3.46 20.22
Duster 360 360.0 245 3.21 3.57 15.84
Merc 240D 146.7 62 3.69 3.19 20.00
Merc 230 140.8 95 3.92 3.15 22.90
$c
vs am gear carb
Merc 280 1 0 4 4
Merc 280C 1 0 4 4
Merc 450SE 0 0 3 3
Merc 450SL 0 0 3 3
Merc 450SLC 0 0 3 3
Cadillac Fleetwood 0 0 3 4
I need the mean of each variable in each data frame. I'd like the result as a list with one element per data frame, containing the means of its variables. Any help is greatly appreciated. Thanks.
You can apply colMeans on each dataframe with the help of lapply -
lapply(datalist, colMeans)
#$a
# mpg cyl
#21.55 5.50
#$b
# disp hp drat wt qsec
#246.500 136.400 3.346 3.362 19.196
#$c
# vs am gear carb
#0.3333333 0.0000000 3.3333333 3.5000000
Using tidyverse:
library(tidyverse)
map(datalist, ~ summarise(.x, across(everything(), mean)))
#> $a
#> mpg cyl
#> 1 21.55 5.5
#>
#> $b
#> disp hp drat wt qsec
#> 1 246.5 136.4 3.346 3.362 19.196
#>
#> $c
#> vs am gear carb
#> 1 0.3333333 0 3.333333 3.5
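Both answers assume every column is numeric; colMeans stops with an error if a data frame contains, say, a character column. A minimal sketch of a guard, where numeric_means is a hypothetical helper name and the label column is added only to demonstrate the filtering:

```r
datalist <- list(a = mtcars[1:4, 1:2],
                 b = mtcars[5:9, 3:7],
                 c = mtcars[10:15, 8:11])

# Hypothetical extra column to show the guard in action
datalist$a$label <- c("w", "x", "y", "z")

# Keep only numeric columns before averaging
numeric_means <- function(d) {
  num <- d[vapply(d, is.numeric, logical(1))]
  colMeans(num, na.rm = TRUE)
}

means <- lapply(datalist, numeric_means)
means$a
```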

Scale & replace in Dataframe

I need to scale some columns out of my dataframe and I found a function which did it perfectly. So my question is how to replace the scaled columns in my dataframe? Is there a specific function for that or am I supposed to do it in a different way?
df <- raw_data %>%
select(CLIENTNUM:Avg_Utilization_Ratio) %>%
rename(Customer_Nr = CLIENTNUM,
Customer_Act = Attrition_Flag,
Total_Product_Count = Total_Relationship_Count)
scaled <- scale(select(df, Customer_Nr, Customer_Age, Dependent_count, Months_on_book,
Total_Product_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit,
Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt,
Total_Trans_Ct)
, center = T, scale = T)
Create a character vector of column names and apply scale only to those columns, assigning the result back to the same columns.
cols <- c('Customer_Nr', 'Customer_Age', 'Dependent_count' .....)
df[cols] <- scale(df[cols])
Using the mtcars dataset as an example:
df <- mtcars
cols <- c('mpg', 'disp')
df[cols] <- scale(df[cols])
df
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 0.1509 6 -0.5706 110 3.90 2.62 16.5 0 1 4 4
#Mazda RX4 Wag 0.1509 6 -0.5706 110 3.90 2.88 17.0 0 1 4 4
#Datsun 710 0.4495 4 -0.9902 93 3.85 2.32 18.6 1 1 4 1
#Hornet 4 Drive 0.2173 6 0.2201 110 3.08 3.21 19.4 1 0 3 1
#Hornet Sportabout -0.2307 8 1.0431 175 3.15 3.44 17.0 0 0 3 2
#Valiant -0.3303 6 -0.0462 105 2.76 3.46 20.2 1 0 3 1
#...
#...
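One detail worth knowing when overwriting columns this way: scale() returns a matrix that carries the column means and standard deviations as attributes, which are lost once the values are stored back in the data frame. The sketch below (variable names are illustrative) saves them first so the scaling can be reversed later:

```r
df <- mtcars
cols <- c("mpg", "disp")

scaled <- scale(df[cols])                  # numeric matrix with attributes
centers <- attr(scaled, "scaled:center")   # column means used for centering
scales <- attr(scaled, "scaled:scale")     # column standard deviations
df[cols] <- scaled                         # overwrite only the selected columns

# Undo the scaling: multiply by the sd, then add the mean back
restored <- sweep(sweep(as.matrix(df[cols]), 2, scales, "*"), 2, centers, "+")
all.equal(restored, as.matrix(mtcars[cols]), check.attributes = FALSE)
```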

Cannot coerce to a table when doing correlation

I have a data set with 33 rows (linked in the original post).
I used the code below to compute the correlation coefficients:
mpg.df <- as.data.frame(mpg)
cor(mpg, method = "pearson", use = "complete.obs")
However, I am getting the following error:
Error in cor(mpg, method = "pearson", use = "complete.obs") :
'x' must be numeric
I also checked typeof(mpg.df), and the result was still "list".
Any suggestions would be highly appreciated!
It is likely that your example data frame has one or more columns that are not numeric. Since your example data frame looks like a subset of the pre-defined mtcars data frame in R, I will just create this subset in R as follows.
# Select some columns
mtcars2 <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt")]
# View the first six rows
head(mtcars2)
# mpg cyl disp hp drat wt
# Mazda RX4 21.0 6 160 110 3.90 2.620
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
# Datsun 710 22.8 4 108 93 3.85 2.320
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215
# Hornet Sportabout 18.7 8 360 175 3.15 3.440
# Valiant 18.1 6 225 105 2.76 3.460
Now take a look at the structure of mtcars2; you can see that all columns are numeric.
# Show the class of each column
str(mtcars2)
# 'data.frame': 32 obs. of 6 variables:
# $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
# $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
# $ disp: num 160 160 108 258 360 ...
# $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
# $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
# $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
Since all columns are numeric, we can thus use the cor function to do correlation analysis.
# Do correlation analysis
cor(mtcars2)
# mpg cyl disp hp drat wt
# mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594
# cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381 0.7824958
# disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139 0.8879799
# hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591 0.6587479
# drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000 -0.7124406
# wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406 1.0000000
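When a data frame does contain non-numeric columns, one common way out is to drop them before calling cor. The sketch below assumes a made-up data frame (cars, with a character column added purely for illustration). It also shows that typeof() returns "list" for any data frame, so that result is not a sign of a problem.

```r
cars <- mtcars[, c("mpg", "cyl", "disp")]
cars$model <- rownames(mtcars)   # character column, added for illustration

typeof(cars)          # "list" for any data frame
sapply(cars, class)   # shows which column is not numeric

# Keep only the numeric columns, then correlate
num_cars <- cars[sapply(cars, is.numeric)]
cor_mat <- cor(num_cars, method = "pearson", use = "complete.obs")
cor_mat
```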

calculate the percentage change of every nth row in the dataframe

I am trying to calculate the percent change in the values of a data frame. The row offset used for the calculation is entered by the user: if the user enters 2, each value is compared with the value 2 rows earlier. I have the following function:
change <- function(DF, change) {
numCols <- ncol(DF)
for (i in 1: numCols) {
pctChange <-rep(NA,nrow(DF))
x<-DF[,i]
y<-c(DF[(change+1):nrow(DF),i], rep(NA,change))
pctChange <-round((y-x)*100/x,2)
DF$pct<-pctChange
colnames(DF)[ncol(DF)]<-paste(colnames(DF[i]),"pctChangeby",
change,sep = "")
}
return (DF)
}
When I tested this function:
change(mtcars[1:5], 1)
I got the following output
(showing only the first rows of the first column):
mpg mpgpctChangeby1
Mazda RX4 22.8 7.02
Mazda RX4 Wag 24.4 -6.56
Datsun 710 22.8 42.11
Hornet 4 Drive 32.4 -6.17
Expected Output:
mpg mpgpctChangeby1
Mazda RX4 22.8 NA
Mazda RX4 Wag 24.4 7.02
Datsun 710 22.8 -6.56
Hornet 4 Drive 32.4 42.11
For change = 2,
Expected Output:
mpg mpgpctChangeby2
Mazda RX4 21.0 NA
Mazda RX4 Wag 21.0 NA
Datsun 710 22.8 8.57
Hornet 4 Drive 21.4 1.90
Hornet Sportabout 18.7 -17.98
By running your function change, I am getting a different output than the one you showed.
change(mtcars[1],1)[1:4,]
# mpg mpgpctChangeby1
#Mazda RX4 21.0 0.00
#Mazda RX4 Wag 21.0 8.57
#Datsun 710 22.8 -6.14
#Hornet 4 Drive 21.4 -12.62
I guess this would help in getting your expected output:
For change=2
mtcars1 <- mtcars[,1] #1 column
x <- c(rep(NA,2), mtcars1[-((length(mtcars1)-1):length(mtcars1))])
y <- c(rep(NA,2), mtcars1[-(1:2)])
round(100*(y-x)/x,2)
# [1] NA NA 8.57 1.90 -17.98 -15.42 -23.53 34.81 59.44 -21.31
#[11] -21.93 -14.58 -2.81 -7.32 -39.88 -31.58 41.35 211.54 106.80 4.63
#[21] -29.28 -54.28 -29.30 -14.19 26.32 105.26 35.42 11.36 -39.23 -35.20
#[31] -5.06 8.63
For change=3
x <- c(rep(NA,3), mtcars1[-((length(mtcars1)-2):length(mtcars1))])
y <- c(rep(NA,3), mtcars1[-(1:3)])
round(100*(y-x)/x,2)
# [1] NA NA NA 1.90 -10.95 -20.61 -33.18 30.48 25.97 34.27
#[11] -27.05 -28.07 -9.90 -14.61 -36.59 -39.88 -3.29 211.54 192.31 130.61
#[21] -33.64 -49.01 -55.16 -38.14 23.87 79.61 95.49 58.33 -42.12 -24.23
#[31] -50.66 35.44
Regarding change=1, the mpg values you showed do not match the row names. Your function can be changed as follows:
change <- function(DF, change) {
  numCols <- ncol(DF)
  for (i in 1:numCols) {
    # x: the column shifted down by `change` rows (the earlier values)
    x <- c(rep(NA, change), DF[-((nrow(DF) - (change - 1)):nrow(DF)), i])
    # y: the current values, with the first `change` rows set to NA
    y <- c(rep(NA, change), DF[-(seq_len(change)), i])
    pctChange <- round((y - x) * 100 / x, 2)
    # Append the result and name it after the source column
    DF$pct <- pctChange
    colnames(DF)[ncol(DF)] <- paste(colnames(DF[i]), "pctChangeby",
                                    change, sep = "")
  }
  return(DF)
}
change(mtcars[1],2)[1:6,]
# mpg mpgpctChangeby2
#Mazda RX4 21.0 NA
#Mazda RX4 Wag 21.0 NA
#Datsun 710 22.8 8.57
#Hornet 4 Drive 21.4 1.90
#Hornet Sportabout 18.7 -17.98
#Valiant 18.1 -15.42
change(mtcars[1],3)[1:6,]
change(mtcars[1],1)[1:6,]
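The shifting logic above can also be sketched more compactly with head(). This is an alternative formulation rather than the answerer's code: pct_change and add_pct_change are hypothetical names, and the helper assumes the offset n is at least 1 and smaller than the number of rows.

```r
# Percent change of each value relative to the value n positions earlier
pct_change <- function(v, n) {
  stopifnot(n >= 1, n < length(v))
  lagged <- c(rep(NA, n), head(v, -n))   # shift down by n, pad with NA
  round(100 * (v - lagged) / lagged, 2)
}

# Apply to every column and append the results with the same naming scheme
add_pct_change <- function(DF, n) {
  pct <- lapply(DF, pct_change, n = n)
  names(pct) <- paste0(names(DF), "pctChangeby", n)
  cbind(DF, as.data.frame(pct))
}

res <- add_pct_change(mtcars[1], 2)
head(res)
```

Working on whole vectors with lapply avoids renaming a temporary column inside the loop.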
