adding new column to dataframe using formula

I have a dataframe and the head() looks like this:
CEMETERY SEX CONTEXT RaHD L RaHD R DIRECTIONAL ASYMMETRY
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 NA
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 NA
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 NA
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 NA
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 NA
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 NA
(RaHD L and RaHD R are bone measurements).
I have just created the 'DIRECTIONAL ASYMMETRY' column by doing:
MRaHDTABLE["DIRECTIONAL ASYMMETRY"]=NA
and I now need to fill that column. The formula for directional asymmetry is '%DA = (right - left) / (average of left and right) x 100',
so here it would be (RaHD R - RaHD L) / (average of RaHD R and RaHD L) x 100. I'm not sure how to apply this to my table. I tried:
MRaHDTABLE$'DIRECTIONAL ASYMMETRY'=(MRaHDTABLE$`RaHD R`-
MRaHDTABLE$`RaHDL`)/mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`)*100
but got the error:
Error in mean.default(MRaHDTABLE$`RaHD L`, MRaHDTABLE$`RaHD R`) :
'trim' must be numeric of length one

You are using the mean function incorrectly in your formula. If you look at the documentation (?mean), the function takes three main arguments: a numeric vector (x), the fraction of values to be trimmed (trim), and how to treat missing values (na.rm). Therefore, in your call mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`), the first term is interpreted as the input vector (x), and the second term is interpreted as the trim parameter. Because trim must be a single number and you passed a whole column, you get the error.
Try replacing
mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`)
with
rowMeans(name_of_df[ , c(4,5)])
where columns 4 and 5 are RaHD L and RaHD R. rowMeans() returns one average per row, which is what the formula needs.
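To make the difference concrete, here is a minimal sketch on a toy data frame. The column names left and right are stand-ins (an assumption, chosen to avoid names with spaces) for RaHD L and RaHD R; the values are the first three rows from the question.

```r
# Toy data: first three rows of the question, with simplified column names
df <- data.frame(left  = c(21.2, 23.9, 23.9),
                 right = c(21.6, 25.2, 23.5))

# mean() reads its second positional argument as `trim`, so this call fails;
# try() captures the error instead of stopping the script
res <- try(mean(df$left, df$right), silent = TRUE)
failed <- inherits(res, "try-error")

# rowMeans() averages across the selected columns, giving one value per row
avg <- rowMeans(df[, c("left", "right")])

# Directional asymmetry in percent: (right - left) / row average * 100
da <- (df$right - df$left) / avg * 100
print(da)
```

Selecting the columns by name rather than by position (4 and 5) keeps the code working if the column order changes.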

The OP has asked to implement the formula
(RaHD R - RaHD L) / (average of RaHD R and RaHD L) x 100
The answers posted so far try to make mean() work row-wise just to compute the average of two numbers, which is simply
average of RaHD R and RaHD L = (RaHD R + RaHD L) / 2
Dividing by this average and multiplying by 100 is the same as multiplying by 200 / (RaHD R + RaHD L). So, the answer could be as simple as:
MRaHDTABLE["DIRECTIONAL.ASYMMETRY"] <-
with(MRaHDTABLE, 200 * (RaHD.R - RaHD.L) / (RaHD.R + RaHD.L))
MRaHDTABLE
i X2 CEMETERY SEX CONTEXT RaHD.L RaHD.R DIRECTIONAL.ASYMMETRY
1 1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 1.8691589
2 2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 5.2953157
3 3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 -1.6877637
4 4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 -0.4474273
5 5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 -1.5625000
6 6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 -0.4474273
Note: The data look different from the OP's posted data. This is because the OP did not provide sample data in a reproducible way, e.g., by posting the result of dput(MRaHDTABLE). So, I tried to reproduce the data with as little effort as possible.
The with() function is used to save typing: inside it, columns can be referred to by name without repeating MRaHDTABLE$.
Data
MRaHDTABLE <- data.frame(readr::read_table(
" i CEMETERY SEX CONTEXT RaHD.L RaHD.R DIRECTIONAL.ASYMMETRY
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 NA
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 NA
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 NA
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 NA
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 NA
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 NA"
))
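The simplification used in this answer can be checked numerically. The sketch below uses the six left/right measurements from the question as plain vectors (the names left and right are illustrative) and compares the long form of the formula with the shortened 200 * (R - L) / (R + L) form.

```r
# Measurements taken from the question's RaHD L and RaHD R columns
left  <- c(21.2, 23.9, 23.9, 22.4, 25.8, 22.4)
right <- c(21.6, 25.2, 23.5, 22.3, 25.4, 22.3)

# Long form: difference divided by the pairwise average, times 100
da_long <- (right - left) / ((left + right) / 2) * 100

# Short form: the factor 100 / ((R + L) / 2) rewritten as 200 / (R + L)
da_short <- 200 * (right - left) / (right + left)

# Both forms agree up to floating-point tolerance
same <- isTRUE(all.equal(da_long, da_short))
print(same)
print(da_short)
```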

Related

reshape untidy data frame, spreading rows to columns names [duplicate]

This question already has answers here: Transpose a data frame.
I have searched the threads but can't find a solution I understand that solves the problem with my data frame.
My current data frame (df):
# A tibble: 8 x 29
`Athlete` Monday...2 Tuesday...3 Wednesday...4 Thursday...5 Friday...6 Saturday...7 Sunday...8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Date 29/06/2020 30/06/2020 43837.0 43868.0 43897.0 43928.0 43958.0
2 HR 47.0 54.0 51.0 56.0 59.0 NA NA
3 HRV 171.0 91.0 127.0 99.0 77.0 NA NA
4 Sleep Duration 9.11 7.12 8.59 7.15 8.32 NA NA
5 Sleep Efficien~ 92.0 94.0 89.0 90.0 90.0 NA NA
6 Recovery Score 98.0 66.0 96.0 72.0 46.0 NA NA
7 Life Stress NO NO NO NO NO NA NA
8 Sick NO NO NO NO NO NA NA
I have tried to use spread and pivot_wider, but I know they would require additional functions to get the desired output, which is beyond my level of understanding in R.
Do I need to u
Desired output:
Date HR HRV Sleep Duration Sleep Efficiency Recovery Score Life Stress Sick
29/06/2020 47.0 171.0 9.11
30/06/2020 54.0 91.0 7.12
43837.0 51.0 127.0 8.59
43868.0 56.0 99.0 7.15
43897.0 59.0 77.0 8.32
43928.0 NA NA NA
43958.0 NA NA NA
etc.
Thank you

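In base R you can do the following (df[[1]] extracts the first column as a plain vector, which also works when df is a tibble, and as.is = TRUE keeps text columns as character):
type.convert(setNames(data.frame(t(df[-1]), row.names = NULL), df[[1]]), as.is = TRUE)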
Date HR HRV Sleep Duration Sleep Efficien~ Recovery Score Life Stress Sick
1 29/06/2020 47 171 9.11 92 98 NO NO
2 30/06/2020 54 91 7.12 94 66 NO NO
3 43837.0 51 127 8.59 89 96 NO NO
4 43868.0 56 99 7.15 90 72 NO NO
5 43897.0 59 77 8.32 90 46 NO NO
6 43928 NA NA NA NA NA <NA> <NA>
7 43958 NA NA NA NA NA <NA> <NA>
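The one-liner above combines three steps: transpose, rename, and convert types. A minimal sketch that spells them out on a toy data frame follows; the column names Mon and Tue and the reduced set of rows are assumptions made to keep the example short.

```r
# Toy version of the question's layout: one row per variable, one column per day
df <- data.frame(
  Athlete = c("Date", "HR", "Sick"),
  Mon     = c("29/06/2020", "47.0", "NO"),
  Tue     = c("30/06/2020", "54.0", "NO"),
  stringsAsFactors = FALSE
)

# Step 1: drop the label column and transpose, so days become rows
transposed <- data.frame(t(df[-1]), row.names = NULL, stringsAsFactors = FALSE)

# Step 2: use the former first column as the new column names
wide <- setNames(transposed, df[[1]])

# Step 3: convert each column to its natural type, keeping text as character
wide <- type.convert(wide, as.is = TRUE)
str(wide)
```

Passing row.names = NULL explicitly discards the old column names (Mon, Tue) that would otherwise become row names after transposing.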

Removing data from a data frame [duplicate]

This question already has answers here: Remove groups with less than three unique observations; Counting unique / distinct values by group in a data frame.
I have a data frame that looks like this:
CEMETERY CONTEXT SEX BONE MEASUREMENT VALUE
1 Medieval-St. Mary Graces 6225 MALE HuE1 L 64.1
2 Medieval-St. Mary Graces 6225 MALE HuE1 R 62.7
3 Medieval-St. Mary Graces 6225 MALE HuHD L 50.1
4 Medieval-St. Mary Graces 6225 MALE HuHD R 51.3
5 Medieval-St. Mary Graces 6225 MALE HuL1 R 346.0
6 Medieval-St. Mary Graces 6272 FEMALE HuHD L 41.3
I need to remove any specimens (CONTEXTs) where there is a bone measurement for only the left (L) or only the right (R) side, instead of both (e.g. if a specimen has HuE1 L but not HuE1 R then I need to remove it). I'm not sure what the best way to do this is, as the data frame is too large to remove rows individually. To create this data frame I used the merge() function, so I also have data frames for each bone (left and right are in separate data frames), if that makes any difference to what I need to do?
EDIT:
I tried using data.table:
library(data.table)
setDT(df)
setkey(df, CONTEXT, BONE)
df[df[, .N, key(df)][N == 2, .(CONTEXT, BONE)]]
but that returns this:
CEMETERY CONTEXT SEX EXPANSION VALUE
1: Medieval-Spital Square 19 FEMALE HuE1 L 57.9
2: Medieval-Spital Square 19 FEMALE HuE1 R 58.8
3: Medieval-Spital Square 19 FEMALE HuHD R 44.6
4: Medieval-Spital Square 19 FEMALE HuL1 L 326.0
5: Medieval-Spital Square 19 FEMALE HuL1 R 332.0
474: Medieval-St. Mary Graces 16332 MALE RaHD L 25.4
475: Medieval-St. Mary Graces 16344 MALE HuHD R 48.8
476: Medieval-St. Mary Graces 20001 FEMALE HuHD L 40.2
477: Medieval-St. Mary Graces 20001 FEMALE HuHD R 39.8
478: Medieval-St. Mary Graces 20001 FEMALE RaHD R 20.8
so it hasn't actually removed bone measurements that only have left or right.
To clarify - the Ls and Rs are part of the column 'EXPANSION', not a separate column - would I first need to make that a column on its own/how would I go about doing this?

You can subset your dataset using data.table:
library(data.table)
setDT(df)
setkey(df, CONTEXT, BONE)
df[df[, .N, key(df)][N == 2, .(CONTEXT, BONE)]]
# CEMETERY CONTEXT SEX BONE MEASUREMENT VALUE
# 1: Medieval-St. Mary Graces 6225 MALE HuE1 L 64.1
# 2: Medieval-St. Mary Graces 6225 MALE HuE1 R 62.7
# 3: Medieval-St. Mary Graces 6225 MALE HuHD L 50.1
# 4: Medieval-St. Mary Graces 6225 MALE HuHD R 51.3
Explanation:
Turn your data into a data.table (setDT()).
Set the key (index) of your data (setkey()). We use setkey(df, CONTEXT, BONE) because we want to count by CONTEXT and BONE.
Count the number of rows per key (df[, .N, key(df)]).
Keep only the key combinations with 2 occurrences (N == 2) and join them back to the data.
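The question's edit notes that the side (L or R) is stored inside a single column together with the bone name, which is why counting by that column alone removes nothing. Below is a minimal sketch of the same counting idea in base R, swapping data.table for sub() and ave(). The column name EXPANSION comes from the question's edit; BONE and SIDE are derived here, and the assumption is that every value ends in a space followed by L or R.

```r
# Toy data modelled on the question, with bone and side in one column
df <- data.frame(
  CONTEXT   = c(6225, 6225, 6225, 6225, 6225, 6272),
  EXPANSION = c("HuE1 L", "HuE1 R", "HuHD L", "HuHD R", "HuL1 R", "HuHD L"),
  VALUE     = c(64.1, 62.7, 50.1, 51.3, 346.0, 41.3),
  stringsAsFactors = FALSE
)

# Split "HuE1 L" into the bone name and the side letter
df$BONE <- sub(" [LR]$", "", df$EXPANSION)
df$SIDE <- sub("^.* ", "", df$EXPANSION)

# For every row, count the distinct sides within its CONTEXT/BONE group
n_sides <- ave(seq_len(nrow(df)), df$CONTEXT, df$BONE,
               FUN = function(i) length(unique(df$SIDE[i])))

# Keep only measurements that have both a left and a right value
paired <- df[n_sides == 2, ]
print(paired)
```

Counting distinct sides rather than rows means a group with two left measurements and no right one is still dropped.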

Renaming vectors in a column

I have a dataframe which, summarised, looks like this:
CEMETERY SEX CONTEXT RaHD.L RaHD.R
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3
7 Medieval-St. Mary Graces MALE 12785 22.9 22.6
8 Medieval-St. Mary Graces MALE 13840 22.5 22.9
9 Medieval-Spital Square FEMALE 383 21.5 22.0
10 Medieval-Spital Square MALE 31 23.3 22.0
17 Post-Medieval-Chelsea Old Church FEMALE 19 20.0 20.6
18 Post-Medieval-Chelsea Old Church FEMALE 31 19.5 20.0
19 Post-Medieval-Chelsea Old Church FEMALE 39 19.6 19.2
41 Post-Medieval-St. Thomas Hospital FEMALE 60 21.8 22.6
43 Post-Medieval-St. Thomas Hospital MALE 83 22.4 23.0
I want to change the values in the CEMETERY column to simply 'Medieval' and 'Post-Medieval', instead of having the entire cemetery name, or alternatively create a new column stating 'Medieval' or 'Post-Medieval'.

We can use sub to capture the substring up to and including "Medieval" and then, in the replacement, use the backreference (\\1) to keep only the captured substring:
df1$CEMETERY <- sub("(.*(M|m)edieval).*", "\\1", df1$CEMETERY)
df1$CEMETERY
#[1] "Medieval" "Medieval" "Medieval" "Medieval"
#[5] "Medieval" "Medieval" "Medieval" "Medieval"
#[9] "Medieval" "Medieval" "Post-Medieval" "Post-Medieval"
#[13] "Post-Medieval" "Post-Medieval" "Post-Medieval"
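For readers who want to try the pattern without the full data frame, here is a small sketch on a character vector of four cemetery names taken from the question. Storing the result in a separate variable (period, a name chosen for illustration) corresponds to the question's alternative of creating a new column instead of overwriting CEMETERY.

```r
# Four cemetery names from the question
cemetery <- c("Medieval-St. Mary Graces",
              "Medieval-Spital Square",
              "Post-Medieval-Chelsea Old Church",
              "Post-Medieval-St. Thomas Hospital")

# Group 1 captures everything up to and including "Medieval" (or "medieval");
# the trailing .* matches the rest, and "\\1" keeps only the captured part
period <- sub("(.*(M|m)edieval).*", "\\1", cemetery)
print(period)
```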

In case the information on the location should be kept, there is an alternative approach which splits the CEMETERY column at the first hyphen after "Medieval" (which includes splitting after "Post-Medieval") and assigns the two parts to two columns PERIOD and CEMETERY:
library(data.table)
setDT(DF)[, c("PERIOD", "CEMETERY") := tstrsplit(CEMETERY, "(?<=Medieval)-", perl = TRUE)][]
CEMETERY SEX CONTEXT RaHD.L RaHD.R PERIOD
1: St. Mary Graces FEMALE 7172 21.2 21.6 Medieval
2: St. Mary Graces MALE 6225 23.9 25.2 Medieval
3: St. Mary Graces MALE 9987 23.9 23.5 Medieval
4: St. Mary Graces MALE 11475 22.4 22.3 Medieval
5: St. Mary Graces MALE 12356 25.8 25.4 Medieval
6: St. Mary Graces MALE 12525 22.4 22.3 Medieval
7: St. Mary Graces MALE 12785 22.9 22.6 Medieval
8: St. Mary Graces MALE 13840 22.5 22.9 Medieval
9: Spital Square FEMALE 383 21.5 22.0 Medieval
10: Spital Square MALE 31 23.3 22.0 Medieval
11: Chelsea Old Church FEMALE 19 20.0 20.6 Post-Medieval
12: Chelsea Old Church FEMALE 31 19.5 20.0 Post-Medieval
13: Chelsea Old Church FEMALE 39 19.6 19.2 Post-Medieval
14: St. Thomas Hospital FEMALE 60 21.8 22.6 Post-Medieval
15: St. Thomas Hospital MALE 83 22.4 23.0 Post-Medieval
The feature used in the regular expression to identify the correct hyphen to split on is called positive look-behind.
Data
DF <- readr::read_table(
" CEMETERY SEX CONTEXT RaHD.L RaHD.R
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3
7 Medieval-St. Mary Graces MALE 12785 22.9 22.6
8 Medieval-St. Mary Graces MALE 13840 22.5 22.9
9 Medieval-Spital Square FEMALE 383 21.5 22.0
10 Medieval-Spital Square MALE 31 23.3 22.0
17 Post-Medieval-Chelsea Old Church FEMALE 19 20.0 20.6
18 Post-Medieval-Chelsea Old Church FEMALE 31 19.5 20.0
19 Post-Medieval-Chelsea Old Church FEMALE 39 19.6 19.2
41 Post-Medieval-St. Thomas Hospital FEMALE 60 21.8 22.6
43 Post-Medieval-St. Thomas Hospital MALE 83 22.4 23.0"
)[, -1]
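The positive look-behind described in this answer can also be tried in base R. The sketch below swaps data.table's tstrsplit() for strsplit() with perl = TRUE; the variable names period and place are illustrative.

```r
# Two cemetery names from the question, one per period
cemetery <- c("Medieval-St. Mary Graces",
              "Post-Medieval-St. Thomas Hospital")

# "(?<=Medieval)-" matches a hyphen only if it directly follows "Medieval",
# so the hyphen inside "Post-Medieval" is not a split point
parts <- strsplit(cemetery, "(?<=Medieval)-", perl = TRUE)

# Each element of parts holds two pieces: the period and the place name
period <- vapply(parts, `[`, character(1), 1)
place  <- vapply(parts, `[`, character(1), 2)
print(data.frame(period, place))
```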

How best to index for max values in data frame?

The dataset in use here is genotype from the CRAN package MASS.
> names(genotype)
[1] "Litter" "Mother" "Wt"
> str(genotype)
'data.frame': 61 obs. of 3 variables:
$ Litter: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 1 1 1 1 1 ...
$ Mother: Factor w/ 4 levels "A","B","I","J": 1 1 1 1 1 2 2 2 3 3 ...
$ Wt : num 61.5 68.2 64 65 59.7 55 42 60.2 52.5 61.8 ...
This was the given question from a tutorial:
Exercise 6.7. Find the heaviest rats born to each mother in the genotype() data.
Using tapply, split by the factor genotype$Mother, gives:
> tapply(genotype$Wt, genotype$Mother, max)
A B I J
68.2 69.8 61.8 61.0
Also:
> out <- tapply(genotype$Wt, genotype[,1:2],max)
> out
Mother
Litter A B I J
A 68.2 60.2 61.8 61.0
B 60.3 64.7 59.0 51.3
I 68.0 69.8 61.3 54.5
J 59.0 59.5 61.4 54.0
The first tapply gives the weight of the heaviest rat for each mother, and the second (out) gives a table that lets me identify which litter type was heaviest for each mother. Is there another way to find which Litter has the highest weight for each mother, for instance if the two-dimensional table is very large?

We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(genotype)). Create the index using which.max and subset the rows of the dataset grouped by the 'Mother'.
library(data.table)#v1.9.5+
setDT(genotype)[, .SD[which.max(Wt)], by = Mother]
# Mother Litter Wt
#1: A A 68.2
#2: B I 69.8
#3: I A 61.8
#4: J A 61.0
If we are only interested in the max of 'Wt' by 'Mother':
setDT(genotype)[, list(Wt=max(Wt)), by = Mother]
# Mother Wt
#1: A 68.2
#2: B 69.8
#3: I 61.8
#4: J 61.0
Based on the last tapply code shown by the OP, if we need similar output, we can use dcast from the devel version of 'data.table':
dcast(setDT(genotype), Litter ~ Mother, value.var='Wt', max)
# Litter A B I J
#1: A 68.2 60.2 61.8 61.0
#2: B 60.3 64.7 59.0 51.3
#3: I 68.0 69.8 61.3 54.5
#4: J 59.0 59.5 61.4 54.0
data
library(MASS)
data(genotype)

From stats:
aggregate(. ~ Mother, data = genotype, max)
or, restricted to the Wt column (the . form above also applies max to the Litter factor):
aggregate(Wt ~ Mother, data = genotype, max)

Rolling Row Subtraction [duplicate]

This question already has answers here: Calculate difference between values in consecutive rows by group.
I am looking to perform a row subtraction, where I have a group of individuals and I want to subtract the more recent row from the row above it (like a rolling row subtraction). Does anyone know a simple way to do this?
The data would look something like this:
Name Day variable.1
1 Bob 1 43.4
2 Bob 2 32.0
3 Bob 3 18.1
4 Bob 4 41.2
5 Bob 5 85.2
6 Jeff 1 17.4
7 Jeff 2 55.6
8 Jeff 3 58.7
9 Jeff 4 40.6
10 Jeff 5 77.3
11 Carl 1 52.9
12 Carl 2 71.7
13 Carl 3 84.3
14 Carl 4 54.8
15 Carl 5 69.7
For example, for Bob, I would like it to come out as:
Name Day variable.1
1 Bob 1 NA
2 Bob 2 -11.4
3 Bob 3 -13.9
4 Bob 4 23.1
5 Bob 5 44
And then it would go to the next name and perform the same task.

You could try
library(data.table)#v1.9.5+
setDT(df1)[,variable.1:=c(NA,diff(variable.1)) , Name]
Or using shift from the devel version of data.table (as suggested by @Jan Gorecki):
setDT(df1)[, variable.1 := variable.1- shift(variable.1), Name]

You can use the base ave() function. For example, if your data is in a data.frame named dd:
dd$newcol <-ave(dd$variable.1, dd$Name, FUN=function(x) c(NA,diff(x)))
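The ave() approach can be run end to end on the first two groups from the question. In this sketch, dd and newcol follow the names used in the answer, and only Bob's and Jeff's rows are included to keep it short.

```r
# First ten rows of the question's data (Bob and Jeff)
dd <- data.frame(
  Name       = rep(c("Bob", "Jeff"), each = 5),
  Day        = rep(1:5, times = 2),
  variable.1 = c(43.4, 32.0, 18.1, 41.2, 85.2,
                 17.4, 55.6, 58.7, 40.6, 77.3)
)

# Within each Name, diff() gives row-to-row differences (one value fewer than
# the group has rows), so an NA is prepended for the first day
dd$newcol <- ave(dd$variable.1, dd$Name,
                 FUN = function(x) c(NA, diff(x)))
print(dd)
```

This relies on the rows already being ordered by Day within each Name; if they are not, sort the data first.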

You can also try:
library(dplyr)
df %>% group_by(Name) %>% mutate(diff = variable.1-lag(variable.1))
Source: local data frame [15 x 4]
Groups: Name
Name Day variable.1 diff
1 Bob 1 43.4 NA
2 Bob 2 32.0 -11.4
3 Bob 3 18.1 -13.9
4 Bob 4 41.2 23.1
5 Bob 5 85.2 44.0
6 Jeff 1 17.4 NA
7 Jeff 2 55.6 38.2
8 Jeff 3 58.7 3.1
9 Jeff 4 40.6 -18.1
10 Jeff 5 77.3 36.7
11 Carl 1 52.9 NA
12 Carl 2 71.7 18.8
13 Carl 3 84.3 12.6
14 Carl 4 54.8 -29.5
15 Carl 5 69.7 14.9
