How to iterate over columns with sapply for Pearson coefficient - r

In the code below, [i] marks where I want to iterate the Pearson correlation test over the columns. How can I do this, and how can I store the results as a data frame assigned to a variable?
Code example:
*INSTEAD OF DOING THIS*
F.ReedBunting.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$ReedBunting,method='pearson')
F.Whitethroat.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Whitethroat,method='pearson')
F.Rook.pear<- cor.test(W_farmland_mean$Years,W_farmland_mean$Rook,method='pearson')
.
.
.
*HOW CAN IT BE DONE QUICKLY WITH THIS*
workspaceone <- sapply(W_farmland_mean, function(x){
cor.test(W_farmland_mean$Years, W_farmland_mean[, 1[i]], method = 'pearson')
})

I think you should try:
result_cor <- apply(W_farmland_mean,2,function(x){cor.test(W_farmland_mean$Years,x, method = 'pearson')$estimate})
It will extract the Pearson coefficient from the comparison of each column with the Years column of your dataset.
Example
With the mtcars dataset:
df <- mtcars[c(1:10),]
> df
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
And if we apply the function:
result_cor = apply(df,2, function(x){cor.test(x,df$mpg,method ='pearson')$estimate})
And you get the following output:
> result_cor
mpg cyl disp hp drat wt qsec
1.0000000 -0.8614165 -0.7739868 -0.8937223 0.5413585 -0.5991894 0.5494131
vs am gear carb
0.4796102 0.2919683 0.6646449 -0.3711956
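The question also asks for the result as a data frame. One way to get there, as a minimal sketch on the same mtcars subset: have sapply return both the estimate and the p-value for each column, then transpose. The helper name cor_one and the use of mpg as the reference column are assumptions for illustration; mpg stands in for Years in W_farmland_mean.

```r
df <- mtcars[1:10, ]

# Hypothetical helper: correlate one column against the reference column
# and keep two numbers from the cor.test result.
cor_one <- function(x, ref) {
  res <- cor.test(ref, x, method = "pearson")
  c(estimate = unname(res$estimate), p.value = res$p.value)
}

# Skip the reference column itself and iterate over the remaining columns.
other_cols <- setdiff(names(df), "mpg")
cor_mat <- sapply(df[other_cols], cor_one, ref = df$mpg)

# sapply returns a 2 x k matrix; transpose so each tested column is a row.
cor_df <- as.data.frame(t(cor_mat))
cor_df$variable <- rownames(cor_df)
rownames(cor_df) <- NULL
cor_df
```

Iterating over df[other_cols] with sapply keeps the data frame as a list of columns, whereas apply(df, 2, ...) first converts it to a matrix, which only behaves well when every column is numeric.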

Related

Curly curly - How to access the variable name [duplicate]

This question already has answers here:
In R, how to get an object's name after it is sent to a function?
(4 answers)
Closed 1 year ago.
I am trying to create a function which summarises a grouped dataset and then adds a column to identify which variable is being summarised (ID column).
I am not sure how to add the ID column using the curly curly approach.
my_fun <- function(dat, var_name){
dat %>%
mutate(id_column = names({{var_name}}))
}
my_fun(mtcars, cyl)
What I want is for the variable name, in this case cyl, to be recycled.
Just use deparse/substitute at the start
my_fun <- function(dat, var_name){
nm1 <- deparse(substitute(var_name))
dat %>%
mutate(id_column = nm1)
}
-testing
my_fun(mtcars, cyl)
mpg cyl disp hp drat wt qsec vs am gear carb id_column
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 cyl
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 cyl
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 cyl
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 cyl
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 cyl
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 cyl
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 cyl
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 cyl
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 cyl
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 cyl
...
In the tidyverse, it may also be done directly from a symbol: use ensym to capture the argument as a symbol, then either evaluate it with !! to get the column values or convert it to a string with as_string
my_fun <- function(dat, var_name){
var_name <- rlang::ensym(var_name)
dat %>%
mutate(id_column = rlang::as_string(var_name), val_column = !! var_name)
}
-testing
my_fun(head(mtcars), cyl)
mpg cyl disp hp drat wt qsec vs am gear carb id_column val_column
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 cyl 6
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 cyl 6
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 cyl 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 cyl 6
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 cyl 8
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 cyl 6
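Since the question mentions summarising a grouped dataset, here is a sketch that keeps {{ }} for the grouping step and gets the name separately with enquo and as_label. The summary itself (a row count n) is only an assumption, as the question does not say what is summarised, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

my_fun <- function(dat, var_name) {
  # enquo() captures the unevaluated argument; as_label() turns it into a string.
  nm <- rlang::as_label(rlang::enquo(var_name))
  dat %>%
    group_by({{ var_name }}) %>%
    summarise(n = n(), .groups = "drop") %>%  # illustrative summary only
    mutate(id_column = nm)
}

res <- my_fun(mtcars, cyl)
res
```

Each row of res is one group of cyl, and id_column repeats the string "cyl" on every row.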

add column with value depending on value in other column

Is there a smart way of doing this:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Now I want to add a column named X which could be either 1 or 0 depending on the name of the car. For example all cars starting with M should be 1 and the rest 0.
Best regards,
H
Many ways to do this :
mtcars$X <- +(startsWith(rownames(mtcars), 'M'))
You can also use grepl/str_detect :
mtcars$X <- as.integer(grepl('^M', rownames(mtcars)))
mtcars$X <- as.integer(stringr::str_detect(rownames(mtcars), '^M'))
The above two are similar to using ifelse :
mtcars$X <- ifelse(grepl('^M', rownames(mtcars)), 1, 0)
but they are more efficient than using ifelse.
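A quick check that these variants agree, as a small sketch in base R only (the copy named cars and the x_* names are just for illustration). The unary + works because it coerces a logical vector to integer.

```r
cars <- mtcars
starts_m <- startsWith(rownames(cars), "M")   # logical vector

x_plus   <- +starts_m                                   # TRUE/FALSE become 1L/0L
x_int    <- as.integer(grepl("^M", rownames(cars)))
x_ifelse <- ifelse(grepl("^M", rownames(cars)), 1, 0)   # numeric (double), not integer

identical(x_plus, x_int)
all(x_plus == x_ifelse)
```

The only difference in the result is the type: the first two give an integer column, while ifelse with 1 and 0 gives a double column.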

Using dplyr, how should I create a column of strings repeating a character based on the value of another column?

With mtcars for example, I'd like to create a new column carb_dots such that when carb = 4, carb_dots = "...."
Using dplyr, I've tried
library(dplyr)
mtcars2 <- mtcars %>% mutate(carb_dots = rep(".", carb))
This errors with
Error in mutate_impl(.data, dots) :
Evaluation error: invalid 'times' argument.
What should I do? Thanks for your suggestions.
With the addition of stringr, you can do:
library(dplyr)
library(stringr)
mtcars %>%
mutate(carb_dots = str_dup(".", carb))
mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 ..
9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ..
10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ....
We can use strrep
library(dplyr)
mtcars %>%
mutate(carb_dots = strrep(".", carb))
# mpg cyl disp hp drat wt qsec vs am gear carb carb_dots
#Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ....
#Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 ....
#Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 .
#Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 .
#Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 ..
#Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 .
#Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 ....
#...
If we need to use rep
mtcars %>%
rowwise %>%
mutate(carb_dots = paste(rep(".", carb), collapse=""))
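The reason the original attempt errors is that rep gets a length-1 x with a whole vector as times, while strrep is vectorised over times. A small base R sketch on a made-up carb vector (the values and variable names are only for illustration):

```r
carb <- c(4, 4, 1, 1, 2)

# rep(".", carb) fails: x has length 1 but times has length 5.
failed <- inherits(try(rep(".", carb), silent = TRUE), "try-error")

# strrep is vectorised over `times`, so it returns one string per element.
dots <- strrep(".", carb)

# The rep-based equivalent has to build each string separately.
dots_rep <- vapply(carb, function(n) paste(rep(".", n), collapse = ""), character(1))

identical(dots, dots_rep)
```

This is also why the rep version inside mutate needs rowwise: it makes carb a single value for each call.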

how to shuffle the data of subsample after splitting the data frame into smaller dataframes in R

I am splitting a large data frame into smaller data frames of 5000 records each. After performing the rbind operation on each subsample, I want to shuffle the rows of that subsample. When I try to shuffle the data, it neither throws an error nor shuffles the data. Can anyone help me reshuffle the data?
# splitting the dataframe into smaller dataframes
test_list <-split(New_data_zero, (seq(nrow(New_data_zero))-1) %/% 5000)
# performing the rbind to add data for all the data frames
for (i in 1: length(test_list)){
test_list[[i]] <- rbind(test_list[[i]],New_data)
}
# Trying to shuffle the each subsample but not performing the operation
for (i in 1: length(test_list)){
test_list[[i]] <- test_list[[i]][sample(1:nrow(test_list[[i]])),]
}
Try this
myfun <- function(df, numobs) {
sdf <- split(df, rep(1:ceiling(nrow(df)/numobs), each=numobs, length.out=nrow(df)))
lapply(sdf, function(x) x[sample(nrow(x)),])
}
set.seed(1)
myfun(mtcars, 5)
Output
$`1`
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
$`2`
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4
Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2
Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2
etc
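For the workflow in the question (split, rbind extra rows onto each chunk, then shuffle), the same idea can be sketched with lapply. The data frames big and extra are stand-ins built from mtcars, assumed here in place of New_data_zero and New_data, and chunk_size is 5 instead of 5000 so the result is easy to inspect.

```r
set.seed(42)

# Stand-ins (assumptions) for New_data_zero and New_data from the question.
big   <- mtcars[mtcars$am == 0, ]
extra <- mtcars[mtcars$am == 1, ]
chunk_size <- 5

# Split into chunks of chunk_size rows; the last chunk may be smaller.
chunks <- split(big, (seq_len(nrow(big)) - 1) %/% chunk_size)

# Append the extra rows to each chunk, then permute the rows of the result.
shuffled <- lapply(chunks, function(chunk) {
  combined <- rbind(chunk, extra)
  combined[sample(nrow(combined)), , drop = FALSE]
})

# Same rows as before the shuffle, only the order may differ.
before <- rbind(chunks[[1]], extra)
setequal(rownames(before), rownames(shuffled[[1]]))
```

Comparing rownames before and after is a simple way to confirm that a shuffle actually happened when the printed data looks unchanged at first glance.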

R - Create missingness in DataFrame for testing

I need to test some imputation evaluation software I'm creating and am struggling to get benchmark datasets.
Does anyone know of a way to delete a certain amount of data from a dataframe?
As an example of what I need:
You have a dataset and you want a random 20% of the rows to have a random number of variables in that row removed (i.e. set to NA)
Or: Something that can turn
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Into:
> head(mtcars,n=10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 NA 6 160.0 NA 3.90 2.620 NA 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 NA 108.0 93 NA NA 18.61 NA 1 NA 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
I have tried a couple of methods that manipulate the columns but these have some fundamental flaws in them which render them useless.
This is my first ever question on here. If I have missed anything out or done something wrong, please do let me know.
All the best
This should do it:
df_new <- as.data.frame(apply(mtcars,2,function(x){
x[sample(1:length(x),round(length(x)*0.2))] <- NA
return(x)
}))
apply() goes through the columns, and in each column sample() randomly selects 20% of the values to be set to NA.
New answer after comment:
This randomly adds NA to a random number of columns in 20% of all rows.
df <- mtcars
random_rows <- sample(1:nrow(df),round(nrow(df)*0.2))
for(i_row in random_rows){
df[i_row,sample(1:ncol(df),sample(1:ncol(df),1))] <- NA
}
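The loop above can be wrapped in a function so the fraction of affected rows is a parameter and the result is reproducible with a seed. This is only a sketch; the function name add_missing_rows and the argument prop are made up for illustration.

```r
# Hypothetical helper: set a random number of cells to NA in a random
# fraction of rows. Name and `prop` argument are illustrative.
add_missing_rows <- function(df, prop = 0.2) {
  n_rows <- round(nrow(df) * prop)
  rows <- sample(nrow(df), n_rows)
  for (r in rows) {
    n_cols <- sample(ncol(df), 1)      # how many cells to blank, between 1 and ncol
    cols <- sample(ncol(df), n_cols)   # which cells to blank
    df[r, cols] <- NA
  }
  df
}

set.seed(123)
df_na <- add_missing_rows(mtcars, prop = 0.2)

# Fraction of rows that now contain at least one NA.
mean(!complete.cases(df_na))
```

Because the number of rows is rounded, the realised fraction can differ slightly from prop on small data frames.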

Resources