How to remove a row in a dataframe based on certain conditions - r

I have some data that looks like this:
id ethnicity
1 white
2 south asian
2 other
3 other
4 white
4 south asian
As seen above, an id can have two ethnicity values. How would I go about removing the 'other' rows if that id already has an entry such as "white" or "south asian", while keeping the "white" or "south asian" entry?
I have noticed there are also ids that have a "south asian" entry along with a "white" entry.
My priority for keeping rows would be South Asian > White > Other.
So an expected output would be
id ethnicity
1 white
2 south asian
3 other
4 south asian

If the intention is to get the prioritized 'ethnicity' per 'id', convert the 'ethnicity' column to an ordered factor with the levels specified in order of preference, then group by 'id' and filter for the first level, in that order, that is actually present in the group:
library(dplyr)
df2 %>%
  mutate(ethnicity = ordered(ethnicity,
                             c("south asian", "white", "other"))) %>%
  group_by(id) %>%
  filter(ethnicity %in% first(levels(droplevels(ethnicity)))) %>%
  ungroup()
-output
# A tibble: 4 × 2
id ethnicity
<int> <ord>
1 1 white
2 2 south asian
3 3 other
4 4 south asian
data
df2 <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L), ethnicity = c("white",
"south asian", "other", "other", "white", "south asian")),
class = "data.frame", row.names = c(NA,
-6L))
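The same prioritisation can also be sketched in base R, without dplyr. This is only an illustration and not part of the original answer: the `priority` vector and the `result` name are assumptions introduced here. The idea is to rank each row by the position of its ethnicity in the priority vector, sort, and keep the first row per id.

```r
# Data from the answer above
df2 <- data.frame(
  id = c(1L, 2L, 2L, 3L, 4L, 4L),
  ethnicity = c("white", "south asian", "other", "other", "white", "south asian"),
  stringsAsFactors = FALSE
)

# Hypothetical helper vector: earlier entries have higher priority
priority <- c("south asian", "white", "other")

# Rank each row by the position of its ethnicity in `priority`
rank <- match(df2$ethnicity, priority)

# Sort by id, then by rank, and keep the first (best-ranked) row per id
sorted <- df2[order(df2$id, rank), ]
result <- sorted[!duplicated(sorted$id), ]
rownames(result) <- NULL
result
```

Values that do not appear in `priority` get a rank of NA, which order() places last, so such a row would only be kept when its id has no other entry.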

Related

Duplicating values in a data frame conditionally

I have a dataframe that in an entirely simplistic representation looks like this:
structure(list(Plant = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area = c("North", "North", "North", "North", "South", "South",
"South", "South", "South", "South", "South", "South"), dups = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-12L))
For any row of plant, I want to replace it with specific possible combinations of plant and area that are within another data frame. They are not ALL possible combinations, but just the ones that truly exist.
The possible combinations of the variables look like this:
structure(list(nam = c("rose", "rose", "rose", "rose", "cactus",
"cactus"), area = c("North", "South", "East", "West", "South",
"Northwest")), class = "data.frame", row.names = c(NA, -6L))
The final dataset should look like:
structure(list(Plant2 = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area2 = c("North", "South", "East", "West", "North", "South",
"East", "West", "South", "Northwest", "South", "Northwest")), class = "data.frame", row.names = c(NA,
-12L))
This is how I started. I created a variable for how many combinations were potentially possible and added it to the dataframe with a join. And then I got super stuck because, try as I might, I can't change the Area variables properly. I thought I could basically paste all the combinations of the variables with the same dups value, but I can't call the other dataframe from dplyr. This is a very simplified version of the data; there are many other combinations, so subsetting the data by hand is not really something I want to do.
dups<-combos %>% group_by(nam) %>% mutate(dups=n())
colnames(dups)<-c("Plant","Area","dups")
df<-left_join(df,dups)
df<-df %>% uncount(dups, .remove=FALSE)
The information you have provided is not enough to produce a final dataframe like that since each combination of Plant and dups in df can be mapped to multiple values in combos. For instance, each "rose" and "4" could be matched against the first four rows in combos. However, it seems that you simply want Area2 to repeat itself until the values fill up all possible entries for each group of Plant and dups. If so, you can try
library(dplyr)
combos <- combos %>% group_by(nam) %>% mutate(dups = n())
df %>%
  group_by(Plant, dups) %>%
  mutate(Area2 = rep(
    combos$area[combos$nam == Plant[[1L]] & combos$dups == dups[[1L]]],
    length.out = n()
  ))
Output
# A tibble: 12 x 4
# Groups: Plant, dups [2]
Plant Area dups Area2
<chr> <chr> <int> <chr>
1 rose North 4 North
2 rose North 4 South
3 rose North 4 East
4 rose North 4 West
5 rose South 4 North
6 rose South 4 South
7 rose South 4 East
8 rose South 4 West
9 cactus South 2 South
10 cactus South 2 Northwest
11 cactus South 2 South
12 cactus South 2 Northwest
You can use expand.grid to create a dataframe with all possible combinations (here combos is the data frame of existing combinations from the question):
expand.grid(Plant = unique(combos$nam), Area = unique(combos$area))
Plant Area
1 rose North
2 cactus North
3 rose South
4 cactus South
5 rose East
6 cactus East
7 rose West
8 cactus West
9 rose Northwest
10 cactus Northwest
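Note that expand.grid builds every pairing, including ones that do not exist in the data, whereas the question asked only for the combinations that truly exist. A small sketch, assuming the second data frame from the question is called `combos`, shows the difference; `all_pairs` and `existing` are names made up for this illustration.

```r
# Existing combinations, as given in the question
combos <- data.frame(
  nam = c("rose", "rose", "rose", "rose", "cactus", "cactus"),
  area = c("North", "South", "East", "West", "South", "Northwest"),
  stringsAsFactors = FALSE
)

# Every pairing of plant and area: 2 plants x 5 areas = 10 rows
all_pairs <- expand.grid(
  Plant = unique(combos$nam),
  Area = unique(combos$area),
  stringsAsFactors = FALSE
)
nrow(all_pairs)

# Keep only the pairings that actually occur in `combos`: 6 rows
existing <- merge(all_pairs, combos,
                  by.x = c("Plant", "Area"), by.y = c("nam", "area"))
nrow(existing)
```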
This snippet should do what you want, if I've understood correctly. Here, d1 and d2 are your first and second data frames. I don't think that computing dups as you have is necessary for this task, but maybe I've misunderstood your intention.
library("dplyr")
l <- split(d2$area, d2$nam)
d1 %>%
group_by(Plant) %>%
mutate(Area = rep_len(l[[Plant[1L]]], n())) %>%
ungroup() %>%
select(-dups)
# A tibble: 12 × 2
Plant Area
<chr> <chr>
1 rose North
2 rose South
3 rose East
4 rose West
5 rose North
6 rose South
7 rose East
8 rose West
9 cactus South
10 cactus Northwest
11 cactus South
12 cactus Northwest
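For readers who prefer to avoid dplyr, the same recycling idea can be sketched in base R with ave(). This is an assumption-laden illustration rather than the answerer's code: `lookup` and `Area2` are names chosen here, and d1 and d2 are rebuilt from the question's data.

```r
# First data frame from the question (the dups column is not needed here)
d1 <- data.frame(
  Plant = rep(c("rose", "cactus"), times = c(8, 4)),
  Area = c(rep("North", 4), rep("South", 8)),
  stringsAsFactors = FALSE
)

# Second data frame: the combinations that truly exist
d2 <- data.frame(
  nam = c("rose", "rose", "rose", "rose", "cactus", "cactus"),
  area = c("North", "South", "East", "West", "South", "Northwest"),
  stringsAsFactors = FALSE
)

# Named list: plant -> vector of its valid areas
lookup <- split(d2$area, d2$nam)

# Within each plant group, recycle that plant's areas to the group length
d1$Area2 <- ave(d1$Plant, d1$Plant,
                FUN = function(p) rep_len(lookup[[p[1]]], length(p)))
d1
```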

Mutate a new column based on multiple conditions in R

Assuming the following dataset:
df <- structure(list(id = 1:9, city = structure(c(1L, 7L, 2L, 6L, 4L,
9L, 3L, 8L, 5L), .Label = c("bj", "gz", "lz", "nj", "sh", "sz",
"tj", "wh", "xa"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
How could I create a new column direction based on these conditions:
if city is in ['bj', 'tj'], return north for direction; if in ['sz', 'nj', 'sh'], return east; if in ['xa', 'lz'], return west; if in ['wh'], return center; if in ['gz', 'sz'], return south.
The expected result will look like this:
My code:
df %>%
filter(city %in% c('bj', 'tj')) %>%
mutate(direction = 'north')
Out:
Use case_when:
library(dplyr)
df %>%
  mutate(direction = case_when(city %in% c('bj', 'tj') ~ 'north',
                               city %in% c('sz', 'nj', 'sh') ~ 'east',
                               city %in% c('xa', 'lz') ~ 'west',
                               city %in% c('wh') ~ 'center',
                               city %in% c('gz', 'sz') ~ 'south'))
# id city direction
#1 1 bj north
#2 2 tj north
#3 3 gz south
#4 4 sz east
#5 5 nj east
#6 6 xa west
#7 7 lz west
#8 8 wh center
#9 9 sh east
You can do it in an easy way using basic R data.frame manipulation:
df$direction <- ""
df[df$city %in% c('bj', 'tj'), "direction"] <- "north"
df[df$city %in% c('sz', 'nj', 'sh'),"direction"] <- "east"
df[df$city %in% c('xa', 'lz'), "direction"] <- "west"
df[df$city %in% c('wh'), "direction"] <- "center"
df[df$city %in% c('gz', 'sz'), "direction"] <- "south"
df
id city direction
1 1 bj north
2 2 tj north
3 3 gz south
4 4 sz south
5 5 nj east
6 6 xa west
7 7 lz west
8 8 wh center
9 9 sh east
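Notice that 'sz' ends up as south here but as east in the case_when answer. The city is listed under both east and south in the question, so the result depends on whether the first matching condition wins or the last assignment wins. A minimal sketch of the two behaviours, using a made-up three-city vector:

```r
city <- c("gz", "sz", "nj")

# First match wins: conditions are checked in order, as in case_when()
first_match <- ifelse(city %in% c("sz", "nj", "sh"), "east",
               ifelse(city %in% c("gz", "sz"), "south", NA_character_))

# Last assignment wins: each statement overwrites earlier results
last_wins <- rep("", length(city))
last_wins[city %in% c("sz", "nj", "sh")] <- "east"
last_wins[city %in% c("gz", "sz")] <- "south"

data.frame(city, first_match, last_wins)
```

If 'sz' is meant to be east, putting the south assignment before the east one in the sequential version would give the same answer as case_when.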
Using nested ifelse statements can do the job as well.
df$direction <- ifelse(df$city %in% c("bj", "tj"), yes = "north",
  ifelse(df$city %in% c("sz", "nj", "sh"), yes = "east",
    ifelse(df$city %in% c("xa", "lz"), yes = "west",
      ifelse(df$city %in% c("gz", "sz"), yes = "south", no = "center"))))
You can try stack to create a dictionary first and then match the cities, e.g.,
d <- stack(
list(
north = c("bj", "tj"),
east = c("sz", "nj", "sh"),
west = c("xa", "lz"),
center = "wh",
south = c("gz", "sz")
)
)
df <- transform(
df,
direction = d$ind[match(city,d$values)]
)
which gives
id city direction
1 1 bj north
2 2 tj north
3 3 gz south
4 4 sz east
5 5 nj east
6 6 xa west
7 7 lz west
8 8 wh center
9 9 sh east
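A closely related dictionary idea is a plain named character vector. The sketch below is an illustration only; `lookup` is a name introduced here, and 'sz' is mapped to east to match the output above. Because city is a factor in the question's data, it is converted with as.character() first; indexing with the factor itself would use its integer codes instead of its labels.

```r
# Cities in the same order as the question's data frame
city <- factor(c("bj", "tj", "gz", "sz", "nj", "xa", "lz", "wh", "sh"))

# Hypothetical dictionary: names are cities, values are directions
lookup <- c(bj = "north", tj = "north",
            sz = "east", nj = "east", sh = "east",
            xa = "west", lz = "west",
            wh = "center", gz = "south")

# Index by name, not by the factor's integer codes
direction <- unname(lookup[as.character(city)])
direction
```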
I have also tried another solution with mutate, but the error message is the same, about the wrong symbol.
Here I loaded the dplyr library first.
library(dplyr)
new_soccer_referee %>%
mutate(postion_new = case_when (position %in% c("Right Fullback", "Left Fullback", "Center Back", "Defensive Midfielder") ~ "Defense",
position %in% c("Right Midfielder", "Left Midfielder", "Center Midfielder") ~ "Midfield",
position %in% c("Attacking Midfielder", "Right Winger", "Left Winger", "Center Forward") ~ "Offense",
))

Dividing each row by the previous one in R

I have R dataframe:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
You can try aggregate like below
aggregate(value ~city,df, function(x) x[-1]/x[1])
which gives
city value
1 LA 3
2 NY 2
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
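You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows. Note that in dplyr 1.1.0 and later, summarise() warns when a group returns more than one row; reframe() is the suggested replacement there.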
library(dplyr)
df %>%
arrange(city, hour) %>%
group_by(city) %>%
summarise(value = value/lag(value)) %>%
na.omit()
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
In data.table we can do this via shift:
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)
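A base R sketch of the same lag-and-divide idea uses ave(), and it also works when a city has more than two rows. The `ratio` column and `result` name are assumptions made for this illustration.

```r
df <- data.frame(
  city = c("NY", "NY", "LA", "LA"),
  hour = c(0L, 12L, 0L, 12L),
  value = c(12L, 24L, 3L, 9L),
  stringsAsFactors = FALSE
)

# Make sure rows are in time order within each city
df <- df[order(df$city, df$hour), ]

# Within each city, divide every value by the one before it;
# the first row of each city has no predecessor, so it gets NA
df$ratio <- ave(as.numeric(df$value), df$city,
                FUN = function(v) c(NA, v[-1] / v[-length(v)]))

# Drop the rows without a predecessor and keep the requested columns
result <- df[!is.na(df$ratio), c("city", "ratio")]
rownames(result) <- NULL
result
```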

Reshaping a table in R while parsing information from column names and using it to collect information from specific columns

I have this badly organized data table given to me, in which there are hundreds of columns (subset is given below)
Names of columns are dot delimited where the first field holds information about a type of object (e.g. Item123, object_AB etc.) without any naming convention. There is no specific order for these columns as well.
Other columns share the type of object field and also have the name of some property for that object (e.g. color, manufacturer etc.).
Item123.type.value Item123.mass.value Item123.color.value object_AB.type.value object_AB.mass.value object_AB.color.value
Desk 11.2 blue Chair 2.3 orange
Desk 14.2 red Sofa 22 grey
Armchair 23.3 black Monitor 2.2 white
EDITED: Adding dput() structure:
structure(list(Item123.type.value = structure(c(2L, 2L, 1L),
levels = c("Armchair", "Desk"), class = "factor"), Item123.mass.value = structure(1:3,
levels = c("11.2", "14.2", "23.3"), class = "factor"), Item123.color.value = structure(c(2L,
3L, 1L), levels = c("black", "blue", "red"), class = "factor"),
object_AB.type.value = structure(c(1L, 3L, 2L), levels = c("Chair",
"Monitor", "Sofa"), class = "factor"), object_AB.mass.value = structure(c(2L,
3L, 1L), levels = c("2.2", "2.3", "22"), class = "factor"),
object_AB.color.value = structure(c(2L, 1L, 3L), levels = c("grey",
"orange", "white"), class = "factor")), row.names = c(NA_integer_,
-3L), class = "data.frame")
I need to convert the table into something like this (order of rows does not matter):
type name mass color
Item123 Desk 11.2 blue
Item123 Desk 14.2 red
object_AB Chair 2.3 orange
object_AB Sofa 22 grey
Item123 Armchair 23.3 black
object_AB Monitor 2.2 white
I would really appreciate any help I could get!!
You can use pivot_longer here specifying names_pattern to get data from the column names.
tidyr::pivot_longer(df,
cols = everything(),
names_to = c('name', '.value'),
names_pattern = '(\\w+)\\.(\\w+)\\.')
# A tibble: 6 x 4
# name type mass color
# <chr> <fct> <fct> <fct>
#1 Item123 Desk 11.2 blue
#2 object_AB Chair 2.3 orange
#3 Item123 Desk 14.2 red
#4 object_AB Sofa 22 grey
#5 Item123 Armchair 23.3 black
#6 object_AB Monitor 2.2 white
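To see what the names_pattern regular expression captures, here is a small base R sketch that applies the same pattern to a few of the column names. It only illustrates the regex and is not how tidyr is implemented; `cols`, `m` and `parsed` are names invented here.

```r
cols <- c("Item123.type.value", "Item123.mass.value", "object_AB.color.value")

# Same pattern as in names_pattern: two word groups separated by dots
m <- regmatches(cols, regexec("(\\w+)\\.(\\w+)\\.", cols, perl = TRUE))

# Each element holds the full match followed by the two captured groups
parsed <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(parsed) <- c("name", ".value")
parsed
```

The first group becomes the name column, and the second group, mapped to the special '.value' entry, supplies the names of the output columns (type, mass, color).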
I would suggest this approach, although it may be longer and more tedious. It uses as df the data you added. The code looks for specific patterns in your column names, reshapes each subset, and finally joins them all:
library(tidyverse)
#Code
df %>% select(contains('type')) %>%
mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,into = c(paste0('V',1:3)),sep = '\\.') %>%
select(-c(V2,V3)) %>%
rename(Value1=value) %>%
left_join(df %>% select(contains('mass')) %>%
mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,into = c(paste0('V',1:3)),sep = '\\.') %>%
select(-c(V2,V3)) %>%
rename(Value2=value)) %>%
left_join(df %>% select(contains('color')) %>%
mutate(id=1:n()) %>%
pivot_longer(-id) %>%
separate(name,into = c(paste0('V',1:3)),sep = '\\.') %>%
select(-c(V2,V3)) %>%
rename(Value3=value))
Output:
# A tibble: 6 x 5
id V1 Value1 Value2 Value3
<int> <chr> <chr> <dbl> <chr>
1 1 Item123 Desk 11.2 blue
2 1 object_AB Chair 2.3 orange
3 2 Item123 Desk 14.2 red
4 2 object_AB Sofa 22 grey
5 3 Item123 Armchair 23.3 black
6 3 object_AB Monitor 2.2 white

Create new variable in dataframe based on condition in one column, pulling from other column? (dplyr)

I have the following dataframe:
df <- structure(list(country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia",
"Congo - Kinshasa", "Ethiopia", "Ethiopia", "Ghana", "Botswana",
"Nigeria"), CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L,
1L), topCommodInCountry = c(TRUE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE), Main_Commod = c("Gold", "Copper",
"Nickel", "Gold", "Gold", "Gold", "Gold", "Gold", "Diamonds",
"Iron Ore")), row.names = c(NA, -10L), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), vars = "country", drop = TRUE, indices = list(
8L, 4L, 1L, c(2L, 3L, 5L, 6L), c(0L, 7L), 9L), group_sizes = c(1L,
1L, 1L, 4L, 2L, 1L), biggest_group_size = 4L, labels = structure(list(
country = c("Botswana", "Congo - Kinshasa", "Eritrea", "Ethiopia",
"Ghana", "Nigeria")), row.names = c(NA, -6L), class = "data.frame", vars = "country", drop = TRUE, .Names = "country"), .Names = c("country",
"CommodRank", "topCommodInCountry", "Main_Commod"))
df
country CommodRank topCommodInCountry Main_Commod
1 Ghana 1 TRUE Gold
2 Eritrea 2 FALSE Copper
3 Ethiopia 3 FALSE Nickel
4 Ethiopia 1 TRUE Gold
5 Congo - Kinshasa 3 FALSE Gold
6 Ethiopia 1 TRUE Gold
7 Ethiopia 1 TRUE Gold
8 Ghana 1 TRUE Gold
9 Botswana 1 TRUE Diamonds
10 Nigeria 1 TRUE Iron Ore
I am trying to add another column showing the top commodity (top CommodRank) for every country in this dataset, but I'm not sure how. I'm able to label 'topcommod' with the 'Main_Commod' where CommodRank == 1, but I want to copy this same value to cases where CommodRank != 1. Looking below, both Ethiopia values at rows 3 & 4 should read 'Gold'.
df %>% mutate(topcommod = ifelse(CommodRank == 1, Main_Commod, 'unknown'))
country CommodRank topCommodInCountry Main_Commod topcommod
1 Ghana 1 TRUE Gold Gold
2 Eritrea 2 FALSE Copper unknown
3 Ethiopia 3 FALSE Nickel unknown
4 Ethiopia 1 TRUE Gold Gold
5 Congo - Kinshasa 3 FALSE Gold unknown
6 Ethiopia 1 TRUE Gold Gold
7 Ethiopia 1 TRUE Gold Gold
8 Ghana 1 TRUE Gold Gold
9 Botswana 1 TRUE Diamonds Diamonds
10 Nigeria 1 TRUE Iron Ore Iron Ore
I'm ideally looking for a dplyr solution I can add to an existing long series of pipe %>% function calls, but any solution would help.
IIUC, there are multiple ways to do this, for example:
df %>% mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)])
# A tibble: 10 x 5
# Groups: country [6]
country CommodRank topCommodInCountry Main_Commod topCom
<chr> <int> <lgl> <chr> <chr>
1 Ghana 1 TRUE Gold Gold
2 Eritrea 2 FALSE Copper unknown
3 Ethiopia 3 FALSE Nickel Gold
4 Ethiopia 1 TRUE Gold Gold
5 Congo - Kinshasa 3 FALSE Gold unknown
6 Ethiopia 1 TRUE Gold Gold
7 Ethiopia 1 TRUE Gold Gold
8 Ghana 1 TRUE Gold Gold
9 Botswana 1 TRUE Diamonds Diamonds
10 Nigeria 1 TRUE Iron Ore Iron Ore
Regarding OP's question in comment how to handle ties of multiple top Commodities, you could do the following:
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else paste(unique(Main_Commod[topCommodInCountry]), collapse = "/"))
If there are multiple unique top commodities in a country, they will be pasted together into a single string, separated by /.
Another pattern with dplyr:
df %>% arrange(CommodRank) %>%
mutate(topCommod = Main_Commod[1])
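For comparison, a lookup-table sketch in base R reproduces the output of the first dplyr answer above, including "unknown" for countries with no rank-1 commodity. It is an illustration under assumptions: the data frame is rebuilt without the grouping attributes, and `top` is a helper name introduced here.

```r
df <- data.frame(
  country = c("Ghana", "Eritrea", "Ethiopia", "Ethiopia", "Congo - Kinshasa",
              "Ethiopia", "Ethiopia", "Ghana", "Botswana", "Nigeria"),
  CommodRank = c(1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 1L),
  Main_Commod = c("Gold", "Copper", "Nickel", "Gold", "Gold",
                  "Gold", "Gold", "Gold", "Diamonds", "Iron Ore"),
  stringsAsFactors = FALSE
)

# One row per country that has a rank-1 commodity
top <- df[df$CommodRank == 1, c("country", "Main_Commod")]
top <- top[!duplicated(top$country), ]

# Look up each row's country; countries without a rank-1 entry give NA
df$topcommod <- top$Main_Commod[match(df$country, top$country)]
df$topcommod[is.na(df$topcommod)] <- "unknown"
df
```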
This is not an answer, but while learning a great deal from @docendo discimus's answer, it took me a second to understand the "if negative" (!any(topCommodInCountry)), and I was wondering whether it's only me or whether it would take my computer a second more to do that too :)
Using the same dataset, I examined the idea of making the if else positive. First I checked that the two solutions give identical results:
identical(
#Negative
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)]),
#Positive
df %>%
mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)]
else "unknown"))
[1] TRUE
Next, I benchmarked the two:
require(rbenchmark)
benchmark("Negative" = {
df %>%
mutate(topCom = if(!any(topCommodInCountry)) "unknown"
else Main_Commod[which.max(topCommodInCountry)])
},
"Positive" = {
df %>%
mutate(topCom = if(any(topCommodInCountry)) Main_Commod[which.max(topCommodInCountry)]
else "unknown")
},
replications = 10000,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
The difference is not that big but I'm assuming that with a bigger dataset it will increase.
test replications elapsed relative user.self sys.self
1 Negative 10000 12.59 1.015 12.44 0
2 Positive 10000 12.41 1.000 12.30 0
