Renaming vectors in a column - r

I have a dataframe which, summarised, looks like this:
CEMETERY SEX CONTEXT RaHD.L RaHD.R
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3
7 Medieval-St. Mary Graces MALE 12785 22.9 22.6
8 Medieval-St. Mary Graces MALE 13840 22.5 22.9
9 Medieval-Spital Square FEMALE 383 21.5 22.0
10 Medieval-Spital Square MALE 31 23.3 22.0
17 Post-Medieval-Chelsea Old Church FEMALE 19 20.0 20.6
18 Post-Medieval-Chelsea Old Church FEMALE 31 19.5 20.0
19 Post-Medieval-Chelsea Old Church FEMALE 39 19.6 19.2
41 Post-Medieval-St. Thomas Hospital FEMALE 60 21.8 22.6
43 Post-Medieval-St. Thomas Hospital MALE 83 22.4 23.0
I want to change the values in the CEMETERY column to simply 'Medieval' and 'Post-Medieval' instead of the entire cemetery name, or alternatively create a new column stating 'Medieval' or 'Post-Medieval'.

We can use sub to capture the substring up to and including "Medieval", and then use the backreference (\\1) in the replacement to keep only the captured part:
df1$CEMETERY <- sub("(.*(M|m)edieval).*", "\\1", df1$CEMETERY)
df1$CEMETERY
#[1] "Medieval" "Medieval" "Medieval" "Medieval"
#[5] "Medieval" "Medieval" "Medieval" "Medieval"
#[9] "Medieval" "Medieval" "Post-Medieval" "Post-Medieval"
#[13] "Post-Medieval" "Post-Medieval" "Post-Medieval"
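For the alternative mentioned in the question (a new column instead of overwriting CEMETERY), the same sub() call could be assigned to a different column. This is only a minimal sketch: the PERIOD column name and the three-row sample are assumptions made for illustration.

```r
# Hypothetical three-row stand-in for the asker's data
df1 <- data.frame(
  CEMETERY = c("Medieval-St. Mary Graces",
               "Medieval-Spital Square",
               "Post-Medieval-Chelsea Old Church"),
  stringsAsFactors = FALSE
)

# Capture everything up to and including "Medieval"/"medieval" and
# store it in a new column; CEMETERY itself is left unchanged
df1$PERIOD <- sub("(.*(M|m)edieval).*", "\\1", df1$CEMETERY)

df1
```

Because the first `.*` is greedy, the capture group also picks up the "Post-" prefix when it is present.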

In case the information on the location should be kept, there is an alternative approach which splits the CEMETERY column at the first hyphen after "Medieval" (which includes splitting after "Post-Medieval") and assigns the two parts to two columns PERIOD and CEMETERY:
library(data.table)
setDT(DF)[, c("PERIOD", "CEMETERY") := tstrsplit(CEMETERY, "(?<=Medieval)-", perl = TRUE)][]
CEMETERY SEX CONTEXT RaHD.L RaHD.R PERIOD
1: St. Mary Graces FEMALE 7172 21.2 21.6 Medieval
2: St. Mary Graces MALE 6225 23.9 25.2 Medieval
3: St. Mary Graces MALE 9987 23.9 23.5 Medieval
4: St. Mary Graces MALE 11475 22.4 22.3 Medieval
5: St. Mary Graces MALE 12356 25.8 25.4 Medieval
6: St. Mary Graces MALE 12525 22.4 22.3 Medieval
7: St. Mary Graces MALE 12785 22.9 22.6 Medieval
8: St. Mary Graces MALE 13840 22.5 22.9 Medieval
9: Spital Square FEMALE 383 21.5 22.0 Medieval
10: Spital Square MALE 31 23.3 22.0 Medieval
11: Chelsea Old Church FEMALE 19 20.0 20.6 Post-Medieval
12: Chelsea Old Church FEMALE 31 19.5 20.0 Post-Medieval
13: Chelsea Old Church FEMALE 39 19.6 19.2 Post-Medieval
14: St. Thomas Hospital FEMALE 60 21.8 22.6 Post-Medieval
15: St. Thomas Hospital MALE 83 22.4 23.0 Post-Medieval
The feature used in the regular expression to identify the correct hyphen to split on is called positive look-behind.
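The look-behind can also be tried without data.table. The following is a sketch in base R on two made-up strings; the variable names are assumptions for illustration.

```r
# Two sample labels in the same shape as the CEMETERY column
x <- c("Medieval-St. Mary Graces", "Post-Medieval-Chelsea Old Church")

# "(?<=Medieval)-" matches a hyphen only when "Medieval" comes directly
# before it, so the hyphen inside "Post-Medieval" is not a split point
parts <- strsplit(x, "(?<=Medieval)-", perl = TRUE)

# Pick the first and second piece of every split
period <- vapply(parts, `[`, character(1), 1)
place  <- vapply(parts, `[`, character(1), 2)

data.frame(PERIOD = period, CEMETERY = place)
```

perl = TRUE is needed because look-behind is not supported by R's default regular expression engine.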
Data
DF <- readr::read_table(
" CEMETERY SEX CONTEXT RaHD.L RaHD.R
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3
7 Medieval-St. Mary Graces MALE 12785 22.9 22.6
8 Medieval-St. Mary Graces MALE 13840 22.5 22.9
9 Medieval-Spital Square FEMALE 383 21.5 22.0
10 Medieval-Spital Square MALE 31 23.3 22.0
17 Post-Medieval-Chelsea Old Church FEMALE 19 20.0 20.6
18 Post-Medieval-Chelsea Old Church FEMALE 31 19.5 20.0
19 Post-Medieval-Chelsea Old Church FEMALE 39 19.6 19.2
41 Post-Medieval-St. Thomas Hospital FEMALE 60 21.8 22.6
43 Post-Medieval-St. Thomas Hospital MALE 83 22.4 23.0"
)[, -1]


Scrapy Xpath return empty list

It works if the XPath uses the contains function:
response.xpath('//table[contains(@class, "wikitable sortable")]')
However, it returns an empty list with the code below:
response.xpath('//table[@class="wikitable sortable jquery-tablesorter"]')
Is there an explanation for why it returns an empty list?
For more information, I'm trying to extract the territory rankings table from this site https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population as practice.
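The jquery-tablesorter class is most likely added by JavaScript in the browser, so it is not present in the raw HTML that Scrapy downloads; an exact match on the full class string therefore finds nothing, while contains() still matches. You can also extract the territory rankings table easily using only pandas, as follows: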
Code:
import pandas as pd
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population', attrs={'class': 'wikitable sortable'})
df = dfs[0]  # optionally save it with df.to_csv('d.csv')
print(df)
Output:
Rank State or territory ... % of the total U.S. pop.[d] % of Elec. Coll.
'20 '10 State or territory ... 2010 Ch.2010–2020 % of Elec. Coll.
0 1.0 1.0 California ... 11.91% –0.11% 10.04%
1 2.0 2.0 Texas ... 8.04% 0.66% 7.43%
2 3.0 4.0 Florida ... 6.01% 0.42% 5.58%
3 4.0 3.0 New York ... 6.19% –0.17% 5.20%
4 5.0 6.0 Pennsylvania ... 4.06% –0.18% 3.53%
5 6.0 5.0 Illinois ... 4.10% –0.28% 3.53%
6 7.0 7.0 Ohio ... 3.69% –0.17% 3.16%
7 8.0 9.0 Georgia ... 3.10% 0.10% 2.97%
8 9.0 10.0 North Carolina ... 3.05% 0.07% 2.97%
9 10.0 8.0 Michigan ... 3.16% –0.15% 2.79%
10 11.0 11.0 New Jersey ... 2.81% –0.04% 2.60%
11 12.0 12.0 Virginia ... 2.56% 0.02% 2.42%
12 13.0 13.0 Washington ... 2.15% 0.15% 2.23%
13 14.0 16.0 Arizona ... 2.04% 0.09% 2.04%
14 15.0 14.0 Massachusetts ... 2.09% 0.00% 2.04%
15 16.0 17.0 Tennessee ... 2.03% 0.03% 2.04%
16 17.0 15.0 Indiana ... 2.07% –0.05% 2.04%
17 18.0 19.0 Maryland ... 1.85% –0.00% 1.86%
18 19.0 18.0 Missouri ... 1.91% –0.08% 1.86%
19 20.0 20.0 Wisconsin ... 1.82% –0.06% 1.86%
20 21.0 22.0 Colorado ... 1.61% 0.12% 1.86%
21 22.0 21.0 Minnesota ... 1.70% 0.01% 1.86%
22 23.0 24.0 South Carolina ... 1.48% 0.05% 1.67%
23 24.0 23.0 Alabama ... 1.53% –0.03% 1.67%
24 25.0 25.0 Louisiana ... 1.45% –0.06% 1.49%
25 26.0 26.0 Kentucky ... 1.39% –0.04% 1.49%
26 27.0 27.0 Oregon ... 1.22% 0.04% 1.49%
27 28.0 28.0 Oklahoma ... 1.20% –0.02% 1.30%
28 29.0 30.0 Connecticut ... 1.14% –0.07% 1.30%
29 30.0 29.0 Puerto Rico ... 1.19% –0.21% —
30 31.0 35.0 Utah ... 0.88% 0.09% 1.12%
31 32.0 31.0 Iowa ... 0.97% –0.02% 1.12%
32 33.0 36.0 Nevada ... 0.86% 0.06% 1.12%
33 34.0 33.0 Arkansas ... 0.93% –0.03% 1.12%
34 35.0 32.0 Mississippi ... 0.95% –0.06% 1.12%
35 36.0 34.0 Kansas ... 0.91% –0.04% 1.12%
36 37.0 37.0 New Mexico ... 0.66% –0.03% 0.93%
37 38.0 39.0 Nebraska ... 0.58% 0.00% 0.93%
38 39.0 40.0 Idaho ... 0.50% 0.05% 0.74%
39 40.0 38.0 West Virginia ... 0.59% –0.06% 0.74%
40 41.0 41.0 Hawaii ... 0.43% 0.00% 0.74%
41 42.0 43.0 New Hampshire ... 0.42% –0.01% 0.74%
42 43.0 42.0 Maine ... 0.42% –0.02% 0.74%
43 44.0 44.0 Rhode Island ... 0.34% –0.01% 0.74%
44 45.0 45.0 Montana ... 0.32% 0.01% 0.74%
45 46.0 46.0 Delaware ... 0.29% 0.01% 0.56%
46 47.0 47.0 South Dakota ... 0.26% 0.00% 0.56%
47 48.0 49.0 North Dakota ... 0.21% 0.02% 0.56%
48 49.0 48.0 Alaska ... 0.23% –0.01% 0.56%
49 50.0 51.0 District of Columbia ... 0.19% 0.01% 0.56%
50 51.0 50.0 Vermont ... 0.20% –0.01% 0.56%
51 52.0 52.0 Wyoming ... 0.18% –0.01% 0.56%
52 53.0 53.0 Guam[8] ... 0.05% –0.00% —
53 54.0 54.0 U.S. Virgin Islands[9] ... 0.03% –0.00% —
54 55.0 55.0 American Samoa[10] ... 0.02% –0.00% —
55 56.0 56.0 Northern Mariana Islands[11] ... 0.02% –0.00% —
56 NaN NaN Contiguous United States ... 98.03% 0.23% 98.70%
57 NaN NaN The fifty states ... 98.50% 0.21% 99.44%
58 NaN NaN The fifty states and D.C. ... 98.69% 0.22% 100.00%
59 NaN NaN Total United States ... — — —
[60 rows x 16 columns]

Removing data from a data frame [duplicate]

I have a data frame that looks like this:
CEMETERY CONTEXT SEX BONE MEASUREMENT VALUE
1 Medieval-St. Mary Graces 6225 MALE HuE1 L 64.1
2 Medieval-St. Mary Graces 6225 MALE HuE1 R 62.7
3 Medieval-St. Mary Graces 6225 MALE HuHD L 50.1
4 Medieval-St. Mary Graces 6225 MALE HuHD R 51.3
5 Medieval-St. Mary Graces 6225 MALE HuL1 R 346.0
6 Medieval-St. Mary Graces 6272 FEMALE HuHD L 41.3
I need to remove any specimens (CONTEXTs) where there is a bone measurement for only the left (L) or only the right (R) side, instead of both (e.g. if a specimen has HuE1L but not HuE1R then I need to remove it). I'm not sure what the best way to do this is, as the data frame is too large to remove rows individually. To create this data frame I used the merge() function, so I also have data frames for each bone (left and right are in separate data frames), if that makes any difference to what I need to do.
EDIT:
I tried using data.table:
library(data.table)
setDT(df)
setkey(df, CONTEXT, BONE)
df[df[, .N, key(df)][N == 2, .(CONTEXT, BONE)]]
but that returns this:
CEMETERY CONTEXT SEX EXPANSION VALUE
1: Medieval-Spital Square 19 FEMALE HuE1 L 57.9
2: Medieval-Spital Square 19 FEMALE HuE1 R 58.8
3: Medieval-Spital Square 19 FEMALE HuHD R 44.6
4: Medieval-Spital Square 19 FEMALE HuL1 L 326.0
5: Medieval-Spital Square 19 FEMALE HuL1 R 332.0
474: Medieval-St. Mary Graces 16332 MALE RaHD L 25.4
475: Medieval-St. Mary Graces 16344 MALE HuHD R 48.8
476: Medieval-St. Mary Graces 20001 FEMALE HuHD L 40.2
477: Medieval-St. Mary Graces 20001 FEMALE HuHD R 39.8
478: Medieval-St. Mary Graces 20001 FEMALE RaHD R 20.8
so it hasn't actually removed bone measurements that only have left or right.
To clarify - the Ls and Rs are part of the column 'EXPANSION', not a separate column - would I first need to make that a column on its own/how would I go about doing this?
You can subset your dataset using data.table:
library(data.table)
setDT(df)
setkey(df, CONTEXT, BONE)
df[df[, .N, key(df)][N == 2, .(CONTEXT, BONE)]]
# CEMETERY CONTEXT SEX BONE MEASUREMENT VALUE
# 1: Medieval-St. Mary Graces 6225 MALE HuE1 L 64.1
# 2: Medieval-St. Mary Graces 6225 MALE HuE1 R 62.7
# 3: Medieval-St. Mary Graces 6225 MALE HuHD L 50.1
# 4: Medieval-St. Mary Graces 6225 MALE HuHD R 51.3
Explanation:
Turn your data into a data.table (setDT())
Set key (index) in your data (setkey()). Using setkey(df, CONTEXT, BONE) as we want to count by CONTEXT and BONE
Count number of rows by key (df[, .N, key(df)])
Subset data with 2 occurrences (N == 2)
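Counting rows with N == 2 assumes each (CONTEXT, BONE) pair has at most one L row and one R row. As a hedged alternative, the same filter can be sketched in base R by checking explicitly that both sides are present; the sample data below are retyped from the question, and the helper names has_left, has_right and paired are made up for illustration.

```r
# Sample rows retyped from the question (CEMETERY and SEX omitted)
df <- data.frame(
  CONTEXT     = c(6225, 6225, 6225, 6225, 6225, 6272),
  BONE        = c("HuE1", "HuE1", "HuHD", "HuHD", "HuL1", "HuHD"),
  MEASUREMENT = c("L", "R", "L", "R", "R", "L"),
  VALUE       = c(64.1, 62.7, 50.1, 51.3, 346.0, 41.3),
  stringsAsFactors = FALSE
)

# For every row, does its (CONTEXT, BONE) group contain an L / an R?
has_left  <- ave(df$MEASUREMENT == "L", df$CONTEXT, df$BONE, FUN = any)
has_right <- ave(df$MEASUREMENT == "R", df$CONTEXT, df$BONE, FUN = any)

# Keep only rows whose group has both sides
paired <- df[has_left & has_right, ]
paired
```

If the side is stored inside another column (as with the EXPANSION column in the edit to the question), it would first have to be split into its own column before a filter like this can work.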

adding new column to dataframe using formula

I have a dataframe and the head() looks like this:
CEMETERY SEX CONTEXT RaHD L RaHD R DIRECTIONAL ASYMMETRY
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 NA
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 NA
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 NA
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 NA
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 NA
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 NA
(RaHD L and RaHD R are bone measurements).
I have just created the 'DIRECTIONAL ASYMMETRY' column by doing:
MRaHDTABLE["DIRECTIONAL ASYMMETRY"]=NA
and I now need to input data into that column. The formula for directional asymmetry is '%DA = (right - left) / (average of left and right) x 100'
so would be (RaHD R - RaHD L) / (average of RaHD R and RaHD L) x 100. I'm not sure how to input this into my table as I tried:
MRaHDTABLE$'DIRECTIONAL ASYMMETRY'=(MRaHDTABLE$`RaHD R`-
MRaHDTABLE$`RaHDL`)/mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`)*100
but got the error:
Error in mean.default(MRaHDTABLE$`RaHD L`, MRaHDTABLE$`RaHD R`) :
'trim' must be numeric of length one
You are using the mean function incorrectly in your formula. If you look at the documentation (?mean), the function takes three arguments: a numeric vector (x), the fraction of values to be trimmed (trim), and how to treat missing values (na.rm). Therefore, in your specification mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`), the first term is interpreted as the input vector (x), and the second term is interpreted as the trim parameter.
Try replacing
mean(MRaHDTABLE$`RaHD L`,MRaHDTABLE$`RaHD R`)
with
rowMeans(name_of_df[ , c(4,5)])
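A small sketch of that replacement on three rows retyped from the question; the data frame name tbl and the short column names L and R are assumptions used only to keep the example compact.

```r
# Three left/right measurements from the question's table
tbl <- data.frame(L = c(21.2, 23.9, 23.9),
                  R = c(21.6, 25.2, 23.5))

# mean(tbl$L, tbl$R) would treat tbl$R as the 'trim' argument and fail.
# rowMeans() averages across the selected columns, one value per row.
row_avg <- rowMeans(tbl[, c("L", "R")])

# Directional asymmetry: (right - left) / (row average) x 100
tbl$DA <- (tbl$R - tbl$L) / row_avg * 100
tbl
```

Selecting the columns by name rather than by position (c(4,5)) avoids surprises if the column order changes.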
The OP has asked to implement the formula
(RaHD R - RaHD L) / (average of RaHD R and RaHD L) x 100
The answers posted so far are trying to make the mean() function work row-wise just to compute the average of two numbers which simply is
average of RaHD R and RaHD L = (RaHD R + RaHD L) / 2
So, the answer could be as simple as:
MRaHDTABLE["DIRECTIONAL.ASYMMETRY"] <-
with(MRaHDTABLE, 200 * (RaHD.R - RaHD.L) / (RaHD.R + RaHD.L))
MRaHDTABLE
i X2 CEMETERY SEX CONTEXT RaHD.L RaHD.R DIRECTIONAL.ASYMMETRY
1 1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 1.8691589
2 2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 5.2953157
3 3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 -1.6877637
4 4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 -0.4474273
5 5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 -1.5625000
6 6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 -0.4474273
Note: The data look different from the OP's posted data. This is because the OP did not provide sample data in a reproducible way, i.e., by posting the result of dput(MRaHDTABLE). So, I tried to reproduce the data with as little effort as possible.
The with() function is used to save typing.
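To check that the shortened formula is the same calculation, the two forms can be compared directly. This is only a sketch on two rows from the table above; the name tbl is an assumption.

```r
# Two rows of measurements, using the syntactic column names from above
tbl <- data.frame(RaHD.L = c(21.2, 23.9),
                  RaHD.R = c(21.6, 25.2))

# Formula as stated in the question: divide by the average of both sides
da_mean <- with(tbl, (RaHD.R - RaHD.L) / ((RaHD.R + RaHD.L) / 2) * 100)

# Shortened form: dividing by (sum / 2) is multiplying by 2 / sum
da_short <- with(tbl, 200 * (RaHD.R - RaHD.L) / (RaHD.R + RaHD.L))

all.equal(da_mean, da_short)
```

Inside with(), column names are looked up in tbl first, which is why tbl$ does not have to be repeated for every term.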
Data
MRaHDTABLE <- data.frame(readr::read_table(
" i CEMETERY SEX CONTEXT RaHD.L RaHD.R DIRECTIONAL.ASYMMETRY
1 Medieval-St. Mary Graces FEMALE 7172 21.2 21.6 NA
2 Medieval-St. Mary Graces MALE 6225 23.9 25.2 NA
3 Medieval-St. Mary Graces MALE 9987 23.9 23.5 NA
4 Medieval-St. Mary Graces MALE 11475 22.4 22.3 NA
5 Medieval-St. Mary Graces MALE 12356 25.8 25.4 NA
6 Medieval-St. Mary Graces MALE 12525 22.4 22.3 NA"
))

Can't remove a row from a matrix in R

I'm trying to remove an outlier from a data matrix. The original matrix is called Westdata and I want to remove row 51.
I've tried the following line of code but it doesn't remove the outlier and the new matrix is identical to the old one.
Westdata.Outlier<-Westdata[-51,]
Westdata.Outlier
State Region Pay Spend Area
20 Mont. MN 22.5 3.95 West
21 Wyo. MN 27.2 5.44 West
22 N.Mex. MN 22.6 3.40 West
23 Utah MN 22.3 2.30 West
24 Wash. PA 26.0 3.71 West
25 Calif. PA 29.1 3.61 West
26 Hawaii PA 25.8 3.77 West
46 Idaho MN 21.0 2.51 West
47 Colo. MN 25.9 4.04 West
48 Ariz. MN 26.6 2.83 West
49 Nev. MN 25.6 2.93 West
50 Oreg. PA 25.8 4.12 West
51 Alaska PA 41.5 8.35 West
Any suggestions?
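One possible explanation, assuming Westdata is a data frame whose row labels (20, 21, ..., 51) were carried over from a larger table: a negative numeric index refers to a row position, not to a row label. The subset printed above has only 13 rows, so there is no 51st position to drop and nothing is removed. A sketch with a made-up three-row stand-in:

```r
# Hypothetical stand-in: three rows whose labels come from a larger table
Westdata <- data.frame(
  State = c("Oreg.", "Alaska", "Idaho"),
  Pay   = c(25.8, 41.5, 21.0),
  row.names = c("50", "51", "46")
)

# Position 51 does not exist, so this returns every row unchanged
by_position <- Westdata[-51, ]

# Dropping by row label instead removes the row labelled "51"
by_name <- Westdata[rownames(Westdata) != "51", ]
by_name
```

Filtering on a column value, for example Westdata[Westdata$State != "Alaska", ], would be another option that does not depend on row labels at all.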

Sorting a dotchart with matrix input in R

How do you generate a grouped Cleveland dot plot (dot chart), where the data is sorted from highest to lowest in each subgroup, when your input is a matrix?
For example, R has a nice built-in example of a dotchart using groups with a matrix as input:
dotchart(VADeaths, main = "Death Rates in Virginia - 1940")
In this particular example, the data is already sorted in each category for each of the groups (Rural Male, Rural Female, etc.). However, if it wasn't, what are the R commands to generate a plot such that the data points in each subgroup are sorted from highest to lowest?
If you do not want to order your data by the row names, as @DWin suggested, but solely on the numeric data, you might try:
# get data
data <- VADeaths[sample(1:5), ]
# order rows by the first column's numeric values
data <- data[order(data[,1]),]
dotchart(data)
Note: this will sort the matrix by the first column only! It is not possible to sort every column independently, because each column would then need its own row names, which a matrix or table cannot have.
If you stick to your original question: I would suggest splitting up the data by the columns, plot the dotchart for each sorted column and pile up those in a layout.
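That suggestion could look roughly like the sketch below. It assumes a 2 x 2 panel layout and uses a shuffled copy of VADeaths; the names VA2 and sorted_cols are made up for illustration.

```r
# Shuffle the rows so that the columns are no longer sorted
set.seed(1)
VA2 <- VADeaths[sample(nrow(VADeaths)), ]

# Sort every column on its own; each result keeps its own row labels.
# dotchart() draws the first element at the bottom, so increasing
# order puts the highest value at the top of each panel.
sorted_cols <- lapply(setNames(nm = colnames(VA2)),
                      function(grp) sort(VA2[, grp]))

# One panel per column, arranged in a 2 x 2 layout
op <- par(mfrow = c(2, 2))
for (grp in names(sorted_cols)) {
  dotchart(sorted_cols[[grp]], main = grp)
}
par(op)  # restore the previous graphics settings
```

Each panel gets its own labels, so the age groups may appear in a different order from panel to panel if the columns rank them differently.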
This shows the creation of a matrix with arbitrary row order and how one can restore it to proper order.
> set.seed(123)
> VA2 <- VADeaths[sample(1:5), ]
> VA2
Rural Male Rural Female Urban Male Urban Female
55-59 18.1 11.7 24.3 13.6
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
60-64 26.9 20.3 37.0 19.3
50-54 11.7 8.7 15.4 8.4
> VA2[order(rownames(VA2)), ]
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
If you were faced with disordered colnames whose desired order is not lexical, you could just use a character vector in the proper order with "[":
> c2 <- c("Rural Male", "Rural Female", "Urban Male" , "Urban Female")
> VA3 <- VA2[ , sample(1:4)]
> VA3
Rural Male Rural Female Urban Male Urban Female
55-59 18.1 11.7 24.3 13.6
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
60-64 26.9 20.3 37.0 19.3
50-54 11.7 8.7 15.4 8.4
> VA3[ , c2]
Rural Male Rural Female Urban Male Urban Female
55-59 18.1 11.7 24.3 13.6
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
60-64 26.9 20.3 37.0 19.3
50-54 11.7 8.7 15.4 8.4
