How to refer to specific row in R by row name?

I have just loaded the built-in R data set 'emissions'.
I would like to remove the first row, 'UnitedStates', from the data set.
Apparently I can do it by position:
data2 <- data[-1,]
But what if I know the name of the row but not its position in the data set?
How can I remove it by name alone, knowing that the row is named 'UnitedStates'?
Here is what the data set looks like:
GDP perCapita CO2
UnitedStates 8083000 29647 6750
Japan 3080000 24409 1320
Germany 1740000 21197 1740
France 1320000 22381 550
UnitedKingdom 1242000 21010 675
Italy 1240000 21856 540
Russia 692000 4727 2000
Canada 658000 21221 700
Spain 642400 16401 370
Australia 394000 20976 480
Netherlands 343900 21755 240
Poland 280700 7270 400
Belgium 236300 23208 145
Sweden 176200 19773 75
I have only tried referring to rows by position. That works fine, but I guess in bigger data sets I will not want to scroll through the rows and count them...

You could filter your dataframe by row.names using the following code:
data2[!(row.names(data2) %in% "UnitedStates"),]
#> GDP perCapita CO2
#> Japan 3080000 24409 1320
#> Germany 1740000 21197 1740
#> France 1320000 22381 550
#> UnitedKingdom 1242000 21010 675
#> Italy 1240000 21856 540
#> Russia 692000 4727 2000
#> Canada 658000 21221 700
#> Spain 642400 16401 370
#> Australia 394000 20976 480
#> Netherlands 343900 21755 240
#> Poland 280700 7270 400
#> Belgium 236300 23208 145
#> Sweden 176200 19773 75
Created on 2022-12-26 with reprex v2.0.2
Make sure you spelled the row name right.
Data:
data2 <- read.table(text = ' GDP perCapita CO2
UnitedStates 8083000 29647 6750
Japan 3080000 24409 1320
Germany 1740000 21197 1740
France 1320000 22381 550
UnitedKingdom 1242000 21010 675
Italy 1240000 21856 540
Russia 692000 4727 2000
Canada 658000 21221 700
Spain 642400 16401 370
Australia 394000 20976 480
Netherlands 343900 21755 240
Poland 280700 7270 400
Belgium 236300 23208 145
Sweden 176200 19773 75', header = TRUE)
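To make the idea above concrete, here is a minimal sketch of selecting and dropping rows by name in base R. The object names emissions_small and drop_names are made up for illustration, and only three rows of the question's data are used.

```r
# Hypothetical three-row stand-in for the data in the question.
emissions_small <- data.frame(
  GDP = c(8083000, 3080000, 1740000),
  perCapita = c(29647, 24409, 21197),
  CO2 = c(6750, 1320, 1740),
  row.names = c("UnitedStates", "Japan", "Germany")
)

# Select a row by name: character indexing looks the name up for you.
emissions_small["Japan", ]

# Drop rows by name: keep every row whose name is NOT in the set.
# A name that does not occur in the data is simply ignored.
drop_names <- c("UnitedStates", "NotARealRow")
kept <- emissions_small[!(rownames(emissions_small) %in% drop_names), ]
rownames(kept)  # "Japan" "Germany"
```

Because %in% compares against a whole vector, the same line works for one name or many.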

Yet another approach, using setdiff (the %>% pipe requires magrittr or dplyr to be loaded):
library(magrittr)
setdiff(rownames(data2),
c('UnitedStates', 'SkipThis', 'OmitThatToo')
) %>%
data2[., ]

Using which:
mtcars[which(rownames(mtcars)!='Mazda RX4'),]
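One thing worth knowing when using which for this: the positive form above is safe, but the negated form -which(... == ...) behaves badly when the name is missing. The sketch below uses the built-in mtcars data; "No Such Car" is a deliberately absent, made-up name.

```r
rn <- rownames(mtcars)

# Positive indexing with which(... != ...) is safe when the name is absent:
# every row passes the test, so all rows are kept.
n_positive <- nrow(mtcars[which(rn != "No Such Car"), ])   # 32

# Negative indexing is the trap: which() returns integer(0) for an absent
# name, and -integer(0) is still integer(0), which selects zero rows.
n_negative <- nrow(mtcars[-which(rn == "No Such Car"), ])  # 0

# A plain logical mask avoids the problem and needs no which() at all.
n_mask <- nrow(mtcars[rn != "Mazda RX4", ])                # 31
```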

As has been said before, you can compare the row names directly (the name must match exactly, here without a space):
df[row.names(df) != "UnitedStates",]

Related

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have this dataframe, which I filter down to include only counties in the state of Washington and only the columns relevant to my question. I want to filter it further so that I have only 10 rows: the counties with the highest Black prison population in Washington State, regardless of year. The part I am struggling with is that counties must not be repeated, so each row should hold the highest Black prison population for one of the top 10 unique county names. Some counties also have null data for the Black prison population. You should be able to reproduce the filtered dataframe with this code:
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35

You can group_by county_name and then use slice_max to take the row with the maximum value of black_prison_pop in each group. Setting n = 1 returns one row per county, and setting with_ties = FALSE guarantees a single row even when there are ties.
You can then arrange by black_prison_pop in descending order and take the first 10 rows to get the overall top 10 across all counties.
library(dplyr)
incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA") %>%
group_by(county_name) %>%
slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
arrange(desc(black_prison_pop)) %>%
head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
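The same pattern can be checked on a small made-up data frame. The names toy, county and pop are hypothetical. Dropping NA values before slicing is an extra step not in the answer above, added here so that a county with only missing values cannot contribute a row.

```r
library(dplyr)

# Hypothetical toy data: three counties, one missing value.
toy <- data.frame(
  county = c("A", "A", "B", "B", "C"),
  year   = c(2001, 2002, 2001, 2002, 2001),
  pop    = c(10, 30, 20, NA, 5),
  stringsAsFactors = FALSE
)

top <- toy %>%
  filter(!is.na(pop)) %>%                       # ignore missing values
  group_by(county) %>%
  slice_max(pop, n = 1, with_ties = FALSE) %>%  # best year per county
  ungroup() %>%
  arrange(desc(pop)) %>%                        # rank counties by their best year
  head(2)                                       # top 2 here, top 10 in the question

top$county  # "A" "B"
```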

Selecting a column with a dot in R (nested object)

I'm new to R and I'm not sure how to rephrase the question, but basically, I have this dataset coming from the following code:
data_url <- 'https://prod-scores-api.ausopen.com/year/2021/stats'
dat <- jsonlite::fromJSON(data_url)
men_aces <- bind_rows(dat$statistics$rankings[[1]]$players[1])
men_aces_table <- dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>% select(full_name, nationality)
Which resulted in this data frame:
full_name nationality.uuid nationality.name nationality.code
1 Novak Djokovic 99da9b29-eade-4ac3-a7b0-b0b8c2192df7 Serbia SRB
2 Alexander Zverev 99d83e85-3173-4ccc-9d91-8368720f4a47 Germany GER
3 Milos Raonic 07779acb-6740-4b26-a664-f01c0b54b390 Canada CAN
4 Daniil Medvedev fa925d2d-337f-4074-a0bd-afddb38d66e1 Russia RUS
5 Nick Kyrgios 9b11f78c-47c1-43c4-97d0-ba3381eb9f07 Australia AUS
nationality is the nested object inside the player object; if you check the JSON URL, it contains the properties above (uuid, name, code). If I select the full_name property, I get its value (of type character) right back.
I'm not sure how to select name from that nested data frame (nationality) and rename it to country.
My expected outcome is:
full_name country
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
I would appreciate some help. Sorry I was unclear.

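Use purrr::pmap_chr. It iterates over the rows of the nested nationality data frame, and ..2 picks the second column (name):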
library(tidyverse)
dat$players %>%
inner_join(men_aces, by = c('uuid' = 'player_id')) %>%
select(full_name, nationality) %>%
mutate(nationality = pmap_chr(nationality, ~ ..2))
full_name nationality
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa
11 Aslan Karatsev Russia
12 Taylor Fritz United States of America
13 Matteo Berrettini Italy
14 Grigor Dimitrov Bulgaria
15 Feliciano Lopez Spain
16 Stefanos Tsitsipas Greece
17 Felix Auger-Aliassime Canada
18 Thanasi Kokkinakis Australia
19 Ugo Humbert France
20 Borna Coric Croatia

You could do:
bind_cols(full_name = dat$players$full_name, country = dat$players$nationality$name)
# A tibble: 169 x 2
full_name country
<chr> <chr>
1 Novak Djokovic Serbia
2 Alexander Zverev Germany
3 Milos Raonic Canada
4 Daniil Medvedev Russia
5 Nick Kyrgios Australia
6 Alexander Bublik Kazakhstan
7 Reilly Opelka United States of America
8 Jiri Vesely Czech Republic
9 Andrey Rublev Russia
10 Lloyd Harris South Africa

Just add this line at the end:
newdf <- data.frame(full_name = men_aces_table$full_name, country = men_aces_table$nationality$name)
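What makes these answers work is that nationality is itself a data frame stored as a single column, which is how jsonlite represents nested JSON objects. The sketch below rebuilds that shape by hand so it runs without the API; the players object and its two rows are made up for illustration.

```r
# Hypothetical stand-in for the API result: a data frame column inside a data frame.
players <- data.frame(
  full_name = c("Novak Djokovic", "Milos Raonic"),
  stringsAsFactors = FALSE
)
players$nationality <- data.frame(
  name = c("Serbia", "Canada"),
  code = c("SRB", "CAN"),
  stringsAsFactors = FALSE
)

# There are still only two top-level columns; printing shows
# nationality.name and nationality.code, but those are not real column names.
names(players)  # "full_name" "nationality"

# Chain $ to reach into the nested data frame and rename on the way out.
result <- data.frame(
  full_name = players$full_name,
  country   = players$nationality$name,
  stringsAsFactors = FALSE
)
```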

Adding conditional variables to dataframe

Say we have a dataframe that looks like this:
UNIT NUMBER Year City STATE
124 1996 Prague CZECH
121 2001 Sofie BULG
122 2003 Ostrava CZECH
147 1986 Kyjev UKRAINE
133 2005 Lvov UKRAINE
...
...
...
188 2001 Rome ITALY
Say I need to add another variable to the dataframe, called Capital city, that equals 1 if City is the capital city of STATE and 0 otherwise.
How would I add this variable?
The capital cities in the dataframe above are: Prague, Sofie, Kyjev
PS: I know I can do it 'by hand' in the dataframe above, but I need a universal solution for much bigger dataframes...

If you have many city names and some cities in different states share a name, join on both State and City:
library(dplyr)
df <- data.frame(
unit = c(124, 121, 122, 147, 133),
Year = c(1996,2001,2003,1986,2005),
City = c("Prague", "Sofie", "Ostrava", "Kyjev", "Lvov"),
State = c("CZECH", "BULG", "CZECH", "UKRAINE", "UKRAINE"))
capital <- data.frame(
City = c("Prague", "Sofie", "Kyjev"),
State = c("CZECH", "BULG", "UKRAINE"),
Capital = "YES"
)
left_join(df, capital, by = c("State" = "State", "City" = "City"))
Get:
> left_join(df, capital, by = c("State" = "State", "City" = "City"))
unit Year City State Capital
1 124 1996 Prague CZECH YES
2 121 2001 Sofie BULG YES
3 122 2003 Ostrava CZECH <NA>
4 147 1986 Kyjev UKRAINE YES
5 133 2005 Lvov UKRAINE <NA>
If all city names are unique, then:
cap_list <- c("Prague", "Sofie", "Kyjev")
df %>%
mutate(
yes = as.numeric(City %in% cap_list)
)
unit Year City State yes
1 124 1996 Prague CZECH 1
2 121 2001 Sofie BULG 1
3 122 2003 Ostrava CZECH 0
4 147 1986 Kyjev UKRAINE 1
5 133 2005 Lvov UKRAINE 0
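The question asked for a 1/0 variable, while the join above produces YES/NA. One possible way to combine the two ideas, sketched here as an assumption rather than as part of the original answer, is to store 1 in the lookup table and replace the unmatched NA values with 0 using dplyr::coalesce. The name flagged is made up.

```r
library(dplyr)

df <- data.frame(
  unit  = c(124, 121, 122, 147, 133),
  Year  = c(1996, 2001, 2003, 1986, 2005),
  City  = c("Prague", "Sofie", "Ostrava", "Kyjev", "Lvov"),
  State = c("CZECH", "BULG", "CZECH", "UKRAINE", "UKRAINE"),
  stringsAsFactors = FALSE
)

# Lookup table of capitals, flagged with 1 instead of "YES".
capital <- data.frame(
  City    = c("Prague", "Sofie", "Kyjev"),
  State   = c("CZECH", "BULG", "UKRAINE"),
  Capital = 1,
  stringsAsFactors = FALSE
)

flagged <- df %>%
  left_join(capital, by = c("State", "City")) %>%  # NA where there is no match
  mutate(Capital = coalesce(Capital, 0))           # turn those NA values into 0

flagged$Capital  # 1 1 0 1 0
```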

How to remove duplicate values in specific column without removing related row

I want to remove duplicate values in a specific column without deleting the rows that contain them, as in the example below:
Input
-----
Date Market Quantity
4/2/2018 Indonesia 1000
4/2/2018 Australia 500
4/2/2018 India 300
4/2/2018 USA 500
4/2/2018 Germany 200
5/2/2018 India 400
5/2/2018 Japan 400
5/2/2018 Russia 457
6/2/2018 Austria 260
6/2/2018 Swiss 700
6/2/2018 USA 1200
6/2/2018 Indonesia 400
output
------
Date Market Quantity
4/2/2018 Indonesia 1000
Australia 500
India 300
USA 500
Germany 200
5/2/2018 India 400
Japan 400
Russia 457
6/2/2018 Austria 260
Swiss 700
USA 1200
Indonesia 400
And if possible, how can I plot a bar/column graph of the same output (something like the sample given)?
Sample Graph

I would add this to comments but I don't have rights yet...
I don't think you actually want to change the data, but as a few mentioned in the comments there are easy ways to do that.
If you're just trying to show the multi-dimensional data in plotly and you're just not familiar with the library syntax try the code below...
df <- data.frame(Date = c('2018/04/02','2018/04/02','2018/04/02','2018/04/02','2018/04/02','2018/05/02','2018/05/02','2018/05/02','2018/06/02','2018/06/02','2018/06/02','2018/06/02'),
Market = c('Indonesia','Australia','India','USA','Germany','India','Japan','Russia','Austria','Swiss','USA','Indonesia'),
Quantity = c(1000,500,300,500,200,400,400,457,260,700,1200,400),
stringsAsFactors = FALSE)
# one panel per Date, one bar per Market within each panel
plotly::ggplotly(
ggplot2::ggplot(df, ggplot2::aes(x=Market, y=Quantity)) +
ggplot2::geom_col(ggplot2::aes(fill=Market))+
ggplot2::facet_grid(~Date, scales='free_x') +
ggthemes::theme_tufte()
)
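For the display part of the question, one simple base R option is to blank out repeated dates in a copy of the data, leaving the original untouched for plotting. This is a sketch under the assumption that the rows are already sorted by Date, because duplicated() flags any later repeat, not only consecutive ones. The names sales and display are made up, and only six rows of the question's data are used.

```r
# Hypothetical subset of the data from the question.
sales <- data.frame(
  Date = c("4/2/2018", "4/2/2018", "4/2/2018", "5/2/2018", "5/2/2018", "6/2/2018"),
  Market = c("Indonesia", "Australia", "India", "India", "Japan", "Austria"),
  Quantity = c(1000, 500, 300, 400, 400, 260),
  stringsAsFactors = FALSE
)

# Work on a copy so the real dates stay available for analysis and plots.
display <- sales
display$Date[duplicated(display$Date)] <- ""  # keep only the first of each date

print(display, row.names = FALSE)
```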

R t-test of mean vs observations for multiple factor levels

I have a dataset of some 39k rows of data, an excerpt is below:
'Country', 'Group', 'Item', 'Year' are categorical
'Production' and 'Waste' are numerical
'LF' is also numerical, but is the result of 'Waste'/'Production'
Region Country Group Item Year Production Waste LF
Europe Bulgaria Cereals Wheat 1961 2040 274 0.134313725
Europe Bulgaria Cereals Wheat 1962 2090 262 0.125358852
Europe Bulgaria Cereals Wheat 1963 1894 277 0.14625132
Europe Bulgaria Cereals Wheat 1964 2121 286 0.134842056
Europe Bulgaria Cereals Wheat 1965 2923 341 0.116660965
Europe Bulgaria Cereals Wheat 1966 3193 385 0.120576261
Europe Bulgaria Cereals Barley 1961 612 15 0.024509804
Europe Bulgaria Cereals Barley 1962 599 16 0.026711185
Europe Bulgaria Cereals Barley 1963 618 16 0.025889968
Europe Bulgaria Cereals Barley 1964 764 21 0.027486911
Europe Bulgaria Cereals Barley 1965 876 22 0.025114155
Europe Bulgaria Cereals Barley 1966 1064 24 0.022556391
I have used the following code to generate 991 different means by Country and Item:
df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')
The results of this function look ok.
I would like to test whether the respective means of LF in df2 differ from the underlying annual observations in df1 for each Country-Item combination (i.e. if FALSE, then LF is really just a static ratio; if TRUE, then 'Waste' is independent of 'Production').
How might this best be done? There seem to be 991 tests to conduct for this dataset alone and I don't know how to mix the apply and t.test functions in this manner.
Thanks!

t.test requires two groups to compare on a numeric/scale dependent output variable. Here, it seems to me that for each combination of country and item you want to compare all different year averages/means. In other words, you are trying to investigate if year is influencing the LF averages, for each combination of country and item.
The easiest way to do this is to create a linear model (LF ~ Year) for each combination of country and item and interpret the coefficient and p value of the variable year.
library(dplyr)
library(broom)
set.seed(115)
# example dataset
dt = data.frame(Country = rep("country1",12),
Item = c(rep("item1",6), rep("item2",6)),
Year = rep(1961:1966,2),
LF = runif(12,0,1))
# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))
# each years means by country and item
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))
# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
Hope this helps. Let me know if I'm missing something and I'll update my code.
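For reference, the same per-group regression can be sketched in base R with split and lapply, which may be easier to follow if you are mixing apply-style functions yourself. The p-value adjustment at the end is an addition that the answer above does not include: with roughly 991 tests, some correction for multiple testing is usually worth considering, and Benjamini-Hochberg ("BH") is only one possible choice. The names groups, slopes and p_adjusted are made up.

```r
set.seed(115)

# Same shape as the example dataset above.
dt <- data.frame(
  Country = rep("country1", 12),
  Item = rep(c("item1", "item2"), each = 6),
  Year = rep(1961:1966, 2),
  LF = runif(12),
  stringsAsFactors = FALSE
)

# One data frame per Country-Item combination.
groups <- split(dt, list(dt$Country, dt$Item), drop = TRUE)

# Fit LF ~ Year in each group and keep the coefficient row for Year
# (estimate, standard error, t value, p value).
slopes <- do.call(rbind, lapply(groups, function(d) {
  coef(summary(lm(LF ~ Year, data = d)))["Year", ]
}))
slopes <- as.data.frame(slopes)

# Adjust the p values for the number of tests performed.
slopes$p_adjusted <- p.adjust(slopes[["Pr(>|t|)"]], method = "BH")
slopes
```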
