Using the functions below, I calculate the variance of some 3D points:
centroid_3d_sq_dist <- function(point_matrix) {
  # Sum of squared distances of each point (row) from the centroid
  if (nrow(point_matrix) == 1) {
    return(0)
  }
  mean_point <- apply(point_matrix, 2, mean)
  point_sq_distances <- apply(
    point_matrix,
    1,
    function(row_point) {
      sum((row_point - mean_point)^2)
    }
  )
  sum_sq_distances <- sum(point_sq_distances)
  return(sum_sq_distances)
}

point_3d_variance <- function(point_matrix) {
  # Average squared distance from the centroid, with an (n - 1) denominator
  if (nrow(point_matrix) == 1) {
    return(0)
  }
  dist_var <- centroid_3d_sq_dist(point_matrix) /
    (nrow(point_matrix) - 1)
  return(dist_var)
}
The argument of each function is a matrix of (x, y, z) points, one point per row.
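As a quick sanity check (the 3-point matrix below is made up purely for illustration), the functions give the expected result:

pts <- matrix(c(1, 0, 0,
                0, 1, 0,
                0, 0, 1), nrow = 3, byrow = TRUE)
point_3d_variance(pts)
# The centroid is (1/3, 1/3, 1/3) and each point sits at squared distance 2/3
# from it, so the result is (3 * 2/3) / (3 - 1) = 1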
Now I have a dataset with two 3D points per row.
ID Trial Size PP PA FkA ciccioX ciccioY ciccioZ pinoX pinoY pinoZ
1 Gigi 1 40 39.6 1050. 31.5 521. 293. 10.6 516. 323. 6.41
2 Gigi 2 20.0 30.7 944. 9.35 525. 300. 12.6 520. 305. 7.09
3 Gigi 3 30 29.5 1056. 24.1 521. 298. 12.3 519. 321. 5.89
4 Gigi 5 60 53.0 1190. 53.0 680. 287. 64.4 699. 336. 68.6
5 Bibi 1 40 38.3 1038. 31.4 524. 289. 10.9 519. 319. 6.17
6 Bibi 2 60 64.7 1293. 47.8 516. 282. 10.4 519. 330. 6.32
7 Bibi 3 20.0 33.8 1092. 17.5 523. 300. 12.8 518. 315. 6.22
8 Bibi 4 30 35.0 1108. 26.4 525. 295. 11.7 517. 320. 5.78
9 Bibi 5 50 46.5 1199. 34.2 515. 289. 11.2 517. 323. 6.27
10 Bibi 6 30 28.7 1016. 17.1 528. 298. 12.7 524. 314. 6.36
The 3D points are:
ciccio: ciccioX ciccioY ciccioZ
pino: pinoX pinoY pinoZ
I want to calculate the variance of ciccio and the variance of pino grouped by ID and Size.
I tried to do:
data %>%
group_by(SubjectID, Size) %>%
summarize(as.data.frame(matrix(f4(dd[7:9],dd[10:12]), nr = 1)))
But it doesn't work.
Do you have any advice?
The dataset you have shown is too small for meaningful variances (nearly every ID/Size group contains only a single row), but you could use:
library(dplyr)
df %>%
group_by(ID, Size) %>%
summarise(var_ciccio = point_3d_variance(as.matrix(across(ciccioX:ciccioZ))),
var_pino = point_3d_variance(as.matrix(across(pinoX:pinoZ))),
.groups = "drop")
This returns
# A tibble: 9 x 4
ID     Size var_ciccio var_pino
<chr> <dbl> <dbl> <dbl>
1 Bibi 20 0 0
2 Bibi 30 9.5 42.7
3 Bibi 40 0 0
4 Bibi 50 0 0
5 Bibi 60 0 0
6 Gigi 20 0 0
7 Gigi 30 0 0
8 Gigi 40 0 0
9 Gigi 60 0 0
I am attempting to add a new column to the state sample data frame in R. I am hoping for this column to cluster the state IDs into broader categories (1-4). My code is close to what I am looking for, but I am not getting it quite right. I know I could enter each state ID line by line, but is there a quicker way? Thank you!
library(tidyverse)
#Add column to denote each state
States=state.x77
States=data.frame(States)
States <- tibble::rowid_to_column(States, "ID")
States
#Create new variable for state buckets
States <- States %>%
mutate(WAGE_BUCKET=case_when(ID <= c(1,12) ~ '1',
ID <= c(13,24) ~ '2',
ID <= c(25,37) ~ '3',
ID <= c(38,50) ~ '4',
TRUE ~ 'NA'))
View(States) #It is not grouping the states in the way I want/I am still getting some NA values but unsure why!
You can use cut or findInterval if all of your groups will be using contiguous ID values:
findInterval(States$ID, c(0, 12, 24, 37, 51))
# [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4
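Since cut is mentioned but never demonstrated, here is a brief sketch (using the 1-12 / 13-24 / 25-37 / 38-50 ranges the OP seems to intend) of how it could label the buckets directly:

# cut() returns a factor; with right = TRUE (the default) each interval is
# (lower, upper], so these breaks cover 1-12, 13-24, 25-37 and 38-50
States$WAGE_BUCKET <- cut(States$ID,
                          breaks = c(0, 12, 24, 37, 50),
                          labels = c("1", "2", "3", "4"))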
If you want to make it a bit more verbose, you can use dplyr::between in your case_when:
States %>%
mutate(
WAGE_BUCKET = case_when(
between(ID, 1, 12) ~ "1",
between(ID, 13, 24) ~ "2",
between(ID, 25, 37) ~ "3",
between(ID, 38, 50) ~ "4",
TRUE ~ NA_character_)
)
# ID Population Income Illiteracy Life Exp Murder HS Grad Frost Area WAGE_BUCKET
# 1 1 3615 3624 2.1 69.05 15.1 41.3 20 50708 1
# 2 2 365 6315 1.5 69.31 11.3 66.7 152 566432 1
# 3 3 2212 4530 1.8 70.55 7.8 58.1 15 113417 1
# 4 4 2110 3378 1.9 70.66 10.1 39.9 65 51945 1
# 5 5 21198 5114 1.1 71.71 10.3 62.6 20 156361 1
# 6 6 2541 4884 0.7 72.06 6.8 63.9 166 103766 1
# 7 7 3100 5348 1.1 72.48 3.1 56.0 139 4862 1
# 8 8 579 4809 0.9 70.06 6.2 54.6 103 1982 1
# 9 9 8277 4815 1.3 70.66 10.7 52.6 11 54090 1
# 10 10 4931 4091 2.0 68.54 13.9 40.6 60 58073 1
# 11 11 868 4963 1.9 73.60 6.2 61.9 0 6425 1
# 12 12 813 4119 0.6 71.87 5.3 59.5 126 82677 1
# 13 13 11197 5107 0.9 70.14 10.3 52.6 127 55748 2
# 14 14 5313 4458 0.7 70.88 7.1 52.9 122 36097 2
# 15 15 2861 4628 0.5 72.56 2.3 59.0 140 55941 2
# 16 16 2280 4669 0.6 72.58 4.5 59.9 114 81787 2
# 17 17 3387 3712 1.6 70.10 10.6 38.5 95 39650 2
# 18 18 3806 3545 2.8 68.76 13.2 42.2 12 44930 2
# 19 19 1058 3694 0.7 70.39 2.7 54.7 161 30920 2
# 20 20 4122 5299 0.9 70.22 8.5 52.3 101 9891 2
# 21 21 5814 4755 1.1 71.83 3.3 58.5 103 7826 2
# 22 22 9111 4751 0.9 70.63 11.1 52.8 125 56817 2
# 23 23 3921 4675 0.6 72.96 2.3 57.6 160 79289 2
# 24 24 2341 3098 2.4 68.09 12.5 41.0 50 47296 2
# 25 25 4767 4254 0.8 70.69 9.3 48.8 108 68995 3
# 26 26 746 4347 0.6 70.56 5.0 59.2 155 145587 3
# 27 27 1544 4508 0.6 72.60 2.9 59.3 139 76483 3
# 28 28 590 5149 0.5 69.03 11.5 65.2 188 109889 3
# 29 29 812 4281 0.7 71.23 3.3 57.6 174 9027 3
# 30 30 7333 5237 1.1 70.93 5.2 52.5 115 7521 3
# 31 31 1144 3601 2.2 70.32 9.7 55.2 120 121412 3
# 32 32 18076 4903 1.4 70.55 10.9 52.7 82 47831 3
# 33 33 5441 3875 1.8 69.21 11.1 38.5 80 48798 3
# 34 34 637 5087 0.8 72.78 1.4 50.3 186 69273 3
# 35 35 10735 4561 0.8 70.82 7.4 53.2 124 40975 3
# 36 36 2715 3983 1.1 71.42 6.4 51.6 82 68782 3
# 37 37 2284 4660 0.6 72.13 4.2 60.0 44 96184 3
# 38 38 11860 4449 1.0 70.43 6.1 50.2 126 44966 4
# 39 39 931 4558 1.3 71.90 2.4 46.4 127 1049 4
# 40 40 2816 3635 2.3 67.96 11.6 37.8 65 30225 4
# 41 41 681 4167 0.5 72.08 1.7 53.3 172 75955 4
# 42 42 4173 3821 1.7 70.11 11.0 41.8 70 41328 4
# 43 43 12237 4188 2.2 70.90 12.2 47.4 35 262134 4
# 44 44 1203 4022 0.6 72.90 4.5 67.3 137 82096 4
# 45 45 472 3907 0.6 71.64 5.5 57.1 168 9267 4
# 46 46 4981 4701 1.4 70.08 9.5 47.8 85 39780 4
# 47 47 3559 4864 0.6 71.72 4.3 63.5 32 66570 4
# 48 48 1799 3617 1.4 69.48 6.7 41.6 100 24070 4
# 49 49 4589 4468 0.7 72.48 3.0 54.5 149 54464 4
# 50 50 376 4566 0.6 70.29 6.9 62.9 173 97203 4
ID <= c(1, 12) compares against a vector of length > 1, so the values get recycled instead of being treated as a range; the comparison operators work elementwise. We could use between:
library(dplyr)
States <- States %>%
mutate(WAGE_BUCKET=case_when(between(ID, 1, 12) ~ '1',
between(ID, 13,24) ~ '2',
between(ID, 25,37) ~ '3',
between(ID, 38,50) ~ '4',
TRUE ~ NA_character_))
Or another option is to use & with >= and <=:
States %>%
  mutate(WAGE_BUCKET = case_when(ID >= 1 & ID <= 12 ~ '1',
                                 ID >= 13 & ID <= 24 ~ '2',
                                 ID >= 25 & ID <= 37 ~ '3',
                                 ID >= 38 & ID <= 50 ~ '4',
                                 TRUE ~ NA_character_))
Or maybe the OP meant to use %in%:
States %>%
mutate(WAGE_BUCKET=case_when(ID %in% c(1,12) ~ '1',
ID %in% c(13,24) ~ '2',
ID %in% c(25,37) ~ '3',
ID %in% c(38,50) ~ '4',
TRUE ~ NA_character_))
I am trying to reshape my data from wide to long format using the same code provided in an earlier link, but it doesn't work even after several attempts to modify names_pattern = "(.*)_(pre|post.*)".
My data sample is
data1<-read.table(text="
Serial_ID pre_EDV pre_ESV pre_LVEF post_EDV post_ESV post_LVEF
1 76.2 32.9 56.8 86.3 36.6 57.6
2 65.4 35.9 45.1 60.1 26.1 56.7
3 64.4 35.1 45.5 72.5 41.1 43.3
4 50 13.9 72.1 46.4 18.4 60.4
5 89.6 32 64.3 70.9 19.3 72.8
6 62 20.6 66.7 55.9 17.8 68.2
7 91.2 37.7 58.6 61.9 23.8 61.6
8 62 24 61.3 69.3 34.9 49.6
9 104.1 22.7 78.8 38.6 11.5 70.1
10 90.6 31.2 65.6 48 16.1 66.4", sep="", header=T)
I want to reshape my data to:
1. Put identical column headings below each other, e.g. post_EDV below pre_EDV.
2. Create a new column "Pre vs. post".
3. Fix the column headings (remove "pre_" and "post_" so they become "EDV" etc., as shown in the screenshot below).
This is the used code:
library(dplyr);library(tidyr);library(stringr)
out <- data %>% pivot_longer(cols = -Serial_ID,
names_to = c(".value", "prevspost"),
names_pattern = "(.*)_(pre|post.*)",
names_sep="_") #%>% as.data.frame
Also I tried names_prefix = c("pre_","post_") instead of names_pattern = "(.*)_(pre|post.*)", but it doesn't work.
Any advice will be greatly appreciated.
Edit: I recommend using @Dave2e's superior approach.
The reason your attempt didn't work is that the capture groups have to match the column names in order: in pre_EDV the pre/post part comes first, so the first group must be (pre|post) and names_to must list "prevspost" before ".value" (and names_sep should be dropped when names_pattern is used). You could try this:
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("prevspost",".value"),
names_pattern = "(pre|post)_(\\w+)") %>%
dplyr::arrange(desc(prevspost),Serial_ID)
# A tibble: 20 x 5
Serial_ID prevspost EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 2 pre 65.4 35.9 45.1
3 3 pre 64.4 35.1 45.5
4 4 pre 50 13.9 72.1
5 5 pre 89.6 32 64.3
6 6 pre 62 20.6 66.7
7 7 pre 91.2 37.7 58.6
8 8 pre 62 24 61.3
9 9 pre 104. 22.7 78.8
10 10 pre 90.6 31.2 65.6
11 1 post 86.3 36.6 57.6
12 2 post 60.1 26.1 56.7
13 3 post 72.5 41.1 43.3
14 4 post 46.4 18.4 60.4
15 5 post 70.9 19.3 72.8
16 6 post 55.9 17.8 68.2
17 7 post 61.9 23.8 61.6
18 8 post 69.3 34.9 49.6
19 9 post 38.6 11.5 70.1
20 10 post 48 16.1 66.4
Your initial approach was very close; it just needed some simplification. Use only names_sep or names_pattern, not both:
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("Pre vs. post", '.value'),
names_sep="_")
# A tibble: 20 x 5
Serial_ID `Pre vs. post` EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 1 post 86.3 36.6 57.6
3 2 pre 65.4 35.9 45.1
4 2 post 60.1 26.1 56.7
5 3 pre 64.4 35.1 45.5
6 3 post 72.5 41.1 43.3
7 4 pre 50 13.9 72.1
8 4 post 46.4 18.4 60.4
9 5 pre 89.6 32 64.3
10 5 post 70.9 19.3 72.8
11 6 pre 62 20.6 66.7
12 6 post 55.9 17.8 68.2
13 7 pre 91.2 37.7 58.6
14 7 post 61.9 23.8 61.6
15 8 pre 62 24 61.3
16 8 post 69.3 34.9 49.6
17 9 pre 104. 22.7 78.8
18 9 post 38.6 11.5 70.1
19 10 pre 90.6 31.2 65.6
20 10 post 48 16.1 66.4
try this:
library(dplyr);library(tidyr);library(stringr)
out <- data1 %>% pivot_longer(-Serial_ID,
names_to = c("measurement", "names"),
values_to = "values",
names_sep = "_")
out
# # A tibble: 60 x 4
# Serial_ID measurement names values
# <int> <chr> <chr> <dbl>
# 1 1 pre EDV 76.2
# 2 1 pre ESV 32.9
# 3 1 pre LVEF 56.8
# 4 1 post EDV 86.3
# 5 1 post ESV 36.6
# 6 1 post LVEF 57.6
# 7 2 pre EDV 65.4
# 8 2 pre ESV 35.9
# 9 2 pre LVEF 45.1
# 10 2 post EDV 60.1
# # ... with 50 more rows
Your code snippet passed the object "data" instead of "data1" into the pipe, which produced the error:
"Error: No tidyselect variables were registered".
Is there a way to check which value in a vector/column is nearest to a given value?
so for example I have column with number of days:
days: 50, 49, 59, 180, 170, 199, 200
I want to make a new column in the data frame that marks an X every time the days column has the value 183 or comes closest to 183.
It should look like this:
DAYS new column
0
12
12
14
133
140 X
0
12
14
15
178
183 X
0
15
30
72
172 X
Hope you can help me!
You're searching for local minima, essentially. Start off by normalizing your data to your target, i.e. 183, by taking the absolute difference, and search for the values closest to zero; those are your local minima. I added data with values greater than your target to demonstrate.
df <- data.frame(DAYS = c(0,12,12,14,133,140,0,12,14,15,178,183,184,190,0,15,30,72,172,172.5))
df$localmin <- abs(df$DAYS - 183)
df
> df
DAYS localmin
1 0.0 183.0
2 12.0 171.0
3 12.0 171.0
4 14.0 169.0
5 133.0 50.0
6 140.0 43.0
7 0.0 183.0
8 12.0 171.0
9 14.0 169.0
10 15.0 168.0
11 178.0 5.0
12 183.0 0.0
13 184.0 1.0
14 190.0 7.0
15 0.0 183.0
16 15.0 168.0
17 30.0 153.0
18 72.0 111.0
19 172.0 11.0
20 172.5 10.5
targets <- which(diff(sign(diff(c(df$localmin, 183)))) == 2) + 1L
df$targets <- 0
df$targets[targets] <- 1
df
> df
DAYS localmin targets
1 0.0 183.0 0
2 12.0 171.0 0
3 12.0 171.0 0
4 14.0 169.0 0
5 133.0 50.0 0
6 140.0 43.0 1
7 0.0 183.0 0
8 12.0 171.0 0
9 14.0 169.0 0
10 15.0 168.0 0
11 178.0 5.0 0
12 183.0 0.0 1
13 184.0 1.0 0
14 190.0 7.0 0
15 0.0 183.0 0
16 15.0 168.0 0
17 30.0 153.0 0
18 72.0 111.0 0
19 172.0 11.0 0
20 172.5 10.5 1
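As an aside, if you only need to flag the single row closest to 183 within each block of days (rather than every local minimum), a minimal base R sketch could look like the following; the block column is hypothetical and assumes each sequence restarts at 0:

# derive a block id from the resets to 0, then flag the per-block minimum
# of the absolute distance to 183
df$block <- cumsum(df$DAYS == 0)
df$nearest <- ave(abs(df$DAYS - 183), df$block,
                  FUN = function(x) as.integer(x == min(x)))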
I have a function that takes in a dataframe, a percentile threshold, and the name of a given column, and computes a new column flagging which values in that column are above the threshold (0 for <, 1 for >=). However, it won't allow me to do df$column_name inside the quantile function, because column_name is not actually a column name but a variable storing the column name, so df$column_name returns NULL. Is there any way to work around this and keep the code format somewhat similar to what it is currently? Or do I have to specify the numerical column index instead of the name? While I can do that, it is definitely not as convenient or comprehensible as just passing in the column name.
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df$column_name, c(threshold))
new_df <- df %>%
mutate(ifelse(column_name > threshold_value, 1, 0))
return(new_df)
}
Thank you so much for your help!
I modified your function as follows. Now the function can take a data frame, a threshold, and a column name, and it only needs base R.
# Modified function
func1 <- function(df, threshold, column_name) {
threshold_value <- quantile(df[[column_name]], threshold)
new_df <- df
new_df[["new_col"]] <- ifelse(df[[column_name]] > threshold_value, 1, 0)
return(new_df)
}
# Take the trees data frame as an example
head(trees)
# Girth Height Volume
# 1 8.3 70 10.3
# 2 8.6 65 10.3
# 3 8.8 63 10.2
# 4 10.5 72 16.4
# 5 10.7 81 18.8
# 6 10.8 83 19.7
# Apply the function
func1(trees, 0.5, "Volume")
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 0
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
If you still want to use dplyr, it is essential to learn how to deal with non-standard evaluation. Please see the programming vignette to learn more (https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html). The following code also works:
library(dplyr)
func2 <- function(df, threshold, column_name) {
col_en <- enquo(column_name)
threshold_value <- quantile(df %>% pull(!!col_en), threshold)
new_df <- df %>%
mutate(new_col := ifelse(!!col_en >= threshold_value, 1, 0))
return(new_df)
}
func2(trees, 0.5, Volume)
# Girth Height Volume new_col
# 1 8.3 70 10.3 0
# 2 8.6 65 10.3 0
# 3 8.8 63 10.2 0
# 4 10.5 72 16.4 0
# 5 10.7 81 18.8 0
# 6 10.8 83 19.7 0
# 7 11.0 66 15.6 0
# 8 11.0 75 18.2 0
# 9 11.1 80 22.6 0
# 10 11.2 75 19.9 0
# 11 11.3 79 24.2 1
# 12 11.4 76 21.0 0
# 13 11.4 76 21.4 0
# 14 11.7 69 21.3 0
# 15 12.0 75 19.1 0
# 16 12.9 74 22.2 0
# 17 12.9 85 33.8 1
# 18 13.3 86 27.4 1
# 19 13.7 71 25.7 1
# 20 13.8 64 24.9 1
# 21 14.0 78 34.5 1
# 22 14.2 80 31.7 1
# 23 14.5 74 36.3 1
# 24 16.0 72 38.3 1
# 25 16.3 77 42.6 1
# 26 17.3 81 55.4 1
# 27 17.5 82 55.7 1
# 28 17.9 80 58.3 1
# 29 18.0 80 51.5 1
# 30 18.0 80 51.0 1
# 31 20.6 87 77.0 1
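On reasonably recent versions of dplyr and rlang (a sketch assuming dplyr >= 1.0 is available), the embrace operator {{ }} expresses the same idea more compactly than enquo()/!!:

library(dplyr)

func3 <- function(df, threshold, column_name) {
  # {{ }} forwards the unquoted column name into pull() and mutate()
  threshold_value <- quantile(pull(df, {{ column_name }}), threshold)
  df %>%
    mutate(new_col = ifelse({{ column_name }} >= threshold_value, 1, 0))
}

func3(trees, 0.5, Volume)   # same result as func2 above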
I have a data set with about 600 animals with this structure:
anim <- c(1,1,1,1,1,2,2,2,2)
point <- c(1,2,3,4,6,3,4,5,6)
temp <- c(38.8,38.5,33.2,33.5,37.2,36.2,36.5,36.0,37.8)
mydf <- data.frame(anim,point,temp)
anim point temp
1 1 38.8
1 2 38.5
1 3 33.2
1 4 33.5
1 6 37.2
2 3 36.2
2 4 36.5
2 5 36.0
2 6 37.8
The variable "point" contains different measuring points (temperature) on an animal, and I need to have these measuring points (1,2,3,4,5,6) as new variables such that 1=bel, 2=ber, 3=le, 4=re, 5=ey, 6=cr. Note that some points may not have been measured, and in that case NA should be put in.
mynewdf should look like this:
anim bel ber le re ey cr
1 38.8 38.5 33.2 33.5 NA 37.2
2 NA NA 36.2 36.5 36.0 37.8
I hope that my question is clear enough and any help would be very much appreciated.
EDITED:
This is an extract from my actual data set:
head(irpig, n = 25)
dam anim point temp
1 1A0331 20584 1 37.9
2 1A0331 20584 2 37.7
3 1A0331 20584 3 34.3
4 1A0331 20584 4 35.8
5 1A0331 20584 6 37.6
6 1A0331 20585 2 38.7
7 1A0331 20585 4 36.4
8 1A0331 20585 6 38.0
9 1A0331 20586 1 39.0
10 1A0331 20586 2 39.8
11 1A0331 20586 3 37.9
12 1A0331 20586 4 38.0
13 1A0331 20586 6 38.5
14 1A0331 20587 1 39.3
15 1A0331 20587 2 38.9
16 1A0331 20587 3 39.4
17 1A0331 20587 4 38.6
18 1A0331 20587 6 39.4
19 1A0331 20588 1 39.6
20 1A0331 20588 2 39.2
21 1A0331 20588 3 38.9
22 1A0331 20588 4 38.0
23 1A0331 20588 6 39.6
24 1A0331 20589 1 38.1
25 1A0331 20589 2 38.7
Here is one solution. It uses the dcast function from the reshape2 library by Hadley Wickham:
mydf$point = as.factor(mydf$point)
levels(mydf$point) = c("bel", "ber", "le", "re", "ey", "cr")
library(reshape2)
dcast(mydf, anim ~ point)
anim bel ber le re ey cr
1 1 38.8 38.5 33.2 33.5 NA 37.2
2 2 NA NA 36.2 36.5 36 37.8
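For a more current alternative (a sketch assuming tidyr >= 1.0 is available), pivot_wider does the same reshape; relabelling with explicit levels inside mutate() also avoids depending on which point values happen to be present in the data:

library(dplyr)
library(tidyr)

mydf %>%
  mutate(point = factor(point, levels = 1:6,
                        labels = c("bel", "ber", "le", "re", "ey", "cr"))) %>%
  pivot_wider(names_from = point, values_from = temp)
# points that were never measured for an animal come out as NA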