Duplicating values in a data frame conditionally - r

I have a dataframe that in an entirely simplistic representation looks like this:
structure(list(Plant = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area = c("North", "North", "North", "North", "South", "South",
"South", "South", "South", "South", "South", "South"), dups = c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-12L))
For any row of plant, I want to replace it with specific possible combinations of plant and area that are within another data frame. They are not ALL possible combinations, but just the ones that truly exist.
The possible combinations of the variables look like this:
structure(list(nam = c("rose", "rose", "rose", "rose", "cactus",
"cactus"), area = c("North", "South", "East", "West", "South",
"Northwest")), class = "data.frame", row.names = c(NA, -6L))
The final dataset should look like:
structure(list(Plant2 = c("rose", "rose", "rose", "rose", "rose",
"rose", "rose", "rose", "cactus", "cactus", "cactus", "cactus"
), Area2 = c("North", "South", "East", "West", "North", "South",
"East", "West", "South", "Northwest", "South", "Northwest")), class = "data.frame", row.names = c(NA,
-12L))
This is how I started. I created a variable for how many combinations were potentially possible and added it to the data frame with a join. Then I got stuck because, try as I might, I can't change the Area values properly. I thought I could paste in all the combinations of the variables with the same dups value, but I can't refer to the other data frame from within dplyr. This is a very simplified version of the data; there are many other combinations, so subsetting the data by hand is not really what I want to do.
dups<-combos %>% group_by(nam) %>% mutate(dups=n())
colnames(dups)<-c("Plant","Area","dups")
df<-left_join(df,dups)
df<-df %>% uncount(dups, .remove=FALSE)

The information you have provided is not enough to produce a final dataframe like that since each combination of Plant and dups in df can be mapped to multiple values in combos. For instance, each "rose" and "4" could be matched against the first four rows in combos. However, it seems that you simply want Area2 to repeat itself until the values fill up all possible entries for each group of Plant and dups. If so, you can try
library(dplyr)
combos <- combos %>% group_by(nam) %>% mutate(dups = n())
df %>%
group_by(Plant, dups) %>%
mutate(Area2 = rep(
combos$area[combos$nam == Plant[[1L]] & combos$dups == dups[[1L]]],
length.out = n()
))
Output
# A tibble: 12 x 4
# Groups: Plant, dups [2]
Plant Area dups Area2
<chr> <chr> <int> <chr>
1 rose North 4 North
2 rose North 4 South
3 rose North 4 East
4 rose North 4 West
5 rose South 4 North
6 rose South 4 South
7 rose South 4 East
8 rose South 4 West
9 cactus South 2 South
10 cactus South 2 Northwest
11 cactus South 2 South
12 cactus South 2 Northwest

You can use expand.grid to create a data frame with all possible combinations:
expand.grid(Plant = unique(combos$nam), Area = unique(combos$area))
Plant Area
1 rose North
2 cactus North
3 rose South
4 cactus South
5 rose East
6 cactus East
7 rose West
8 cactus West
9 rose Northwest
10 cactus Northwest

This snippet should do what you want, if I've understood correctly. Here, d1 and d2 are your first and second data frames. I don't think that computing dups as you have is necessary for this task, but maybe I've misunderstood your intention.
library("dplyr")
l <- split(d2$area, d2$nam)
d1 %>%
group_by(Plant) %>%
mutate(Area = rep_len(l[[Plant[1L]]], n())) %>%
ungroup() %>%
select(-dups)
# A tibble: 12 × 2
Plant Area
<chr> <chr>
1 rose North
2 rose South
3 rose East
4 rose West
5 rose North
6 rose South
7 rose East
8 rose West
9 cactus South
10 cactus Northwest
11 cactus South
12 cactus Northwest
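For comparison, the same recycling idea can be sketched in base R. This is only a minimal sketch: it assumes the first data frame is called d1 and the data frame of valid combinations d2 (as in the answer above), and the names lookup and Area2 are illustrative choices, not part of the original data.

```r
# Sample data copied from the question (d1 = rows to fix, d2 = valid combinations)
d1 <- data.frame(
  Plant = rep(c("rose", "cactus"), times = c(8, 4)),
  Area  = c(rep("North", 4), rep("South", 8)),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  nam  = c("rose", "rose", "rose", "rose", "cactus", "cactus"),
  area = c("North", "South", "East", "West", "South", "Northwest"),
  stringsAsFactors = FALSE
)

# One character vector of valid areas per plant
lookup <- split(d2$area, d2$nam)

# Within each Plant group, recycle that plant's valid areas to the group size
d1$Area2 <- ave(
  d1$Plant, d1$Plant,
  FUN = function(p) rep_len(lookup[[p[1]]], length(p))
)
d1
```

Because ave() writes the results back in the original row order, this does not require d1 to be sorted by Plant.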

Related

How to remove a row in a dataframe based on certain conditions

I have some data that looks like this:
id  ethnicity
1   white
2   south asian
2   other
3   other
4   white
4   south asian
As seen above, an id can have two ethnicity values. How would I go about removing the 'other' rows if that id already has an entry such as "white" or "south asian", while keeping the "white" or "south asian" entry?
I have noticed there are also ids that have a south asian entry along with a white entry.
My priority for keeping rows would be South Asian > White > Other.
So an expected output would be
id  ethnicity
1   white
2   south asian
3   other
4   south asian
If the intention is to get the prioritized 'ethnicity' per 'id', convert the column 'ethnicity' to ordered with levels specified in the order of preference, then do a group by 'id' and filter the first available level in that order
library(dplyr)
df2 %>%
mutate(ethnicity = ordered(ethnicity,
c( "south asian", "white", "other"))) %>%
group_by(id) %>%
filter(ethnicity %in% first(levels(droplevels(ethnicity)))) %>%
ungroup
-output
# A tibble: 4 × 2
id ethnicity
<int> <ord>
1 1 white
2 2 south asian
3 3 other
4 4 south asian
data
df2 <- structure(list(id = c(1L, 2L, 2L, 3L, 4L, 4L), ethnicity = c("white",
"south asian", "other", "other", "white", "south asian")),
class = "data.frame", row.names = c(NA,
-6L))
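The same "keep the highest-priority row per id" logic can also be sketched without dplyr. This is a hedged base R sketch; the names priority, prio_rank and result are illustrative, and it assumes every ethnicity value appears in the priority vector (otherwise match() returns NA for that row).

```r
df2 <- data.frame(
  id = c(1L, 2L, 2L, 3L, 4L, 4L),
  ethnicity = c("white", "south asian", "other", "other", "white", "south asian"),
  stringsAsFactors = FALSE
)

# Lower position in this vector means higher priority
priority <- c("south asian", "white", "other")

# Rank of each row's ethnicity, and the best (smallest) rank within each id
prio_rank <- match(df2$ethnicity, priority)
best_rank <- ave(prio_rank, df2$id, FUN = min)

# Keep only the rows that carry the best rank for their id
result <- df2[prio_rank == best_rank, ]
result
```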

Mutate a new column based on multiple conditions in R

Assuming the following dataset:
df <- structure(list(id = 1:9, city = structure(c(1L, 7L, 2L, 6L, 4L,
9L, 3L, 8L, 5L), .Label = c("bj", "gz", "lz", "nj", "sh", "sz",
"tj", "wh", "xa"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
How could I create a new column direction based on the following conditions:
if city is in list ['bj', 'tj'], then return north for direction; if in ['sz', 'nj', 'sh'], return east; if in ['xa', 'lz'], return west; if in ['wh'], return center; if in ['gz', 'sz'], return south.
The expected result would look like this:
My code:
df %>%
filter(city %in% c('bj', 'tj')) %>%
mutate(direction = 'north')
Out:
Use case_when :
library(dplyr)
df %>%
mutate(direction = case_when(city %in% c('bj', 'tj') ~ 'north',
city %in% c('sz', 'nj', 'sh') ~ 'east',
city %in% c('xa', 'lz') ~ 'west',
city %in% c('wh') ~ 'center',
city %in% c('gz', 'sz') ~ 'south',
))
# id city direction
#1 1 bj north
#2 2 tj north
#3 3 gz south
#4 4 sz east
#5 5 nj east
#6 6 xa west
#7 7 lz west
#8 8 wh center
#9 9 sh east
You can do it in an easy way using basic R data.frame manipulation:
df$direction <- ""
df[df$city %in% c('bj', 'tj'), "direction"] <- "north"
df[df$city %in% c('sz', 'nj', 'sh'),"direction"] <- "east"
df[df$city %in% c('xa', 'lz'), "direction"] <- "west"
df[df$city %in% c('wh'), "direction"] <- "center"
df[df$city %in% c('gz', 'sz'), "direction"] <- "south"
df
id city direction
1 1 bj north
2 2 tj north
3 3 gz south
4 4 sz south
5 5 nj east
6 6 xa west
7 7 lz west
8 8 wh center
9 9 sh east
Using nested ifelse statements can do the job as well.
df$direction=ifelse(df$city %in% c("bj","tj"), yes = "north",
ifelse(df$city %in% c('sz', 'nj', 'sh'), yes = "east",
ifelse(df$city %in% c("xa", "lz"), yes = "west",
ifelse(df$city %in% c("gz", "sz"), yes = "south", no = "center"))))
You can try stack to create a dictionary first and then match the cities, e.g.,
d <- stack(
list(
north = c("bj", "tj"),
east = c("sz", "nj", "sh"),
west = c("xa", "lz"),
center = "wh",
south = c("gz", "sz")
)
)
df <- transform(
df,
direction = d$ind[match(city,d$values)]
)
which gives
id city direction
1 1 bj north
2 2 tj north
3 3 gz south
4 4 sz east
5 5 nj east
6 6 xa west
7 7 lz west
8 8 wh center
9 9 sh east
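One detail worth checking in the approaches above: 'sz' is listed under both east and south, so the result for that city depends on whether the first or the last matching rule wins. The following base R sketch (variable names are illustrative) reproduces both conventions side by side so the difference is explicit.

```r
city <- c("bj", "tj", "gz", "sz", "nj", "xa", "lz", "wh", "sh")
groups <- list(
  north  = c("bj", "tj"),
  east   = c("sz", "nj", "sh"),
  west   = c("xa", "lz"),
  center = "wh",
  south  = c("gz", "sz")
)

# First matching rule wins (as with case_when(), nested ifelse(), or match())
d <- stack(groups)
first_match <- as.character(d$ind[match(city, d$values)])

# Last matching rule wins (as with the sequence of subset assignments)
last_match <- rep(NA_character_, length(city))
for (g in names(groups)) last_match[city %in% groups[[g]]] <- g

# Cities for which the two conventions disagree
city[first_match != last_match]
```

If 'sz' is meant to belong to only one direction, removing it from the other list makes all of the approaches agree.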
I have also tried another solution with mutate, but the error message is the same, about the wrong symbol.
Here I loaded the dplyr library first.
library(dplyr)
new_soccer_referee %>%
mutate(postion_new = case_when (position %in% c("Right Fullback", "Left Fullback", "Center Back", "Defensive Midfielder") ~ "Defense",
position %in% c("Right Midfielder", "Left Midfielder", "Center Midfielder") ~ "Midfield",
position %in% c("Attacking Midfielder", "Right Winger", "Left Winger", "Center Forward") ~ "Offense",
))

Grouping by Multiple variables and summarizing character frequencies

I am trying to group my dataset by multiple variables and build a frequency table of the number of times a character variable appears. Here is an example data set:
Location  State     County  Job       Pet
          Ohio      Miami   Data      Dog
Urban     Ohio      Miami   Business  Dog, Cat
Urban     Ohio      Miami   Data      Cat
Rural     Kentucky  Clark   Data      Cat, Fish
City      Indiana   Shelby  Business  Dog
Rural     Kentucky  Clark   Data      Dog, Fish
          Ohio      Miami   Data      Dog, Cat
Urban     Ohio      Miami   Business  Dog, Cat
Rural     Kentucky  Clark   Data      Fish
City      Indiana   Shelby  Business  Cat
I want my output to look like this:
Location  State     County  Job       Frequency  Pet:Cat  Pet:Dog  Pet:Fish
          Ohio      Miami   Data      2          1        2        0
Urban     Ohio      Miami   Business  2          2        2        0
Urban     Ohio      Miami   Data      1          1        0        0
Rural     Kentucky  Clark   Data      3          1        1        3
City      Indiana   Shelby  Business  2          1        1        0
I have tried different iterations of the following code, and I get close, but not quite right:
Output<-df%>%group_by(Location, State, County, Job)%>%
dplyr::summarise(
Frequency= dplyr::n(),
Pet:Cat = count(str_match(Pet, "Cat")),
Pet:Dog = count(str_match(Pet, "Dog")),
Pet:Fish = count(str_match(Pet, "Fish")),
)
Any help would be appreciated! Thank you in advance
Try this:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
separate_rows(Pet,sep=',') %>%
mutate(Pet=trimws(Pet)) %>%
group_by(Location,State,County,Job,Pet) %>%
summarise(N=n()) %>%
mutate(Pet=paste0('Pet:',Pet)) %>%
group_by(Location,State,County,Job,.drop = F) %>%
mutate(Freq=n()) %>%
pivot_wider(names_from = Pet,values_from=N,values_fill=0)
Output:
# A tibble: 5 x 8
# Groups: Location, State, County, Job [5]
Location State County Job Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
<chr> <chr> <chr> <chr> <int> <int> <int> <int>
1 "" Ohio Miami Data 2 1 2 0
2 "City" Indiana Shelby Business 2 1 1 0
3 "Rural" Kentucky Clark Data 3 1 1 3
4 "Urban" Ohio Miami Business 2 2 2 0
5 "Urban" Ohio Miami Data 1 1 0 0
Some data used:
#Data
df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City",
"Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio",
"Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky",
"Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby",
"Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data",
"Business", "Data", "Data", "Business", "Data", "Data", "Business",
"Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish",
"Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA,
-10L), class = "data.frame")
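Note that in the pipeline above, Freq = n() is computed after the data has been summarised to one row per group and pet, so it counts the distinct pets in each group; on this sample that happens to equal the number of original rows. If Frequency should be the number of original rows per group, one option is to count before splitting the pets. The following is a hedged base R sketch of that idea; the helper names keys, pet_list, ind, wide and out are illustrative.

```r
df <- data.frame(
  Location = c("", "Urban", "Urban", "Rural", "City", "Rural", "", "Urban", "Rural", "City"),
  State    = c("Ohio", "Ohio", "Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky", "Indiana"),
  County   = c("Miami", "Miami", "Miami", "Clark", "Shelby", "Clark", "Miami", "Miami", "Clark", "Shelby"),
  Job      = c("Data", "Business", "Data", "Data", "Business", "Data", "Data", "Business", "Data", "Business"),
  Pet      = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", "Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat"),
  stringsAsFactors = FALSE
)
keys <- c("Location", "State", "County", "Job")

# Split the comma-separated pets into one trimmed character vector per row
pet_list <- lapply(strsplit(df$Pet, ","), trimws)
pets <- sort(unique(unlist(pet_list)))

# Indicator matrix: one column per pet, counting mentions of that pet in each row
ind <- sapply(pets, function(p) vapply(pet_list, function(v) sum(v == p), numeric(1)))

# Frequency = 1 per original row, so summing it counts rows per group
wide <- data.frame(df[keys], Frequency = 1, ind)
out <- aggregate(wide[c("Frequency", pets)], by = wide[keys], FUN = sum)

# Rename the pet columns to the Pet:<name> form used in the question
names(out)[names(out) %in% pets] <- paste0("Pet:", pets)
out
```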

Filling in NAs in data in R by id

I am having a little problem "filling in gaps". It's not a missing data question, it's more about merging but it's not working great.
So, my data looks like this
id  name             region  Company
1   John Smith       West    Walmart
1   John Smith       West    Amazon
1   John Smith
1   John Smith       West    P&G
2   Jane Smith       South   Apple
2   Jane Smith
3   Richard Burkett
3   Richard Burkett  West    Walmart
And so on.
What I want to do is fill in those gaps in the region variable by their id. So, id 1, John Smith, on the third row, should have West in the third column. Jane Smith's region should be filled in "South" where it is missing.
I've tried creating a separate dataset and then merging it based on id but it creates duplicate rows and basically increases the N by something like 14 times (no idea why).
region1<-subset(df1, df1$region=="DC"| df1$region=="Midwest"|df1$region=="Northeast"|df1$region=="South"|df1$region=="West")
region<-region1[,c("id","region")]
df2<-merge(df1, region, by="id")
I've checked the structure of the variables. Id variable is interval and region is a factor. I think there should be a super simple way to do this but I'm just not getting it. Any ideas?
Thank you in advance.
Here's a base R solution. Suppose your data.frame is df
regions <- sapply(split(df$region, df$id), function(x) {
ind <- is.na(x);
x[ind] <- x[!ind][1];
x
})
df$region <- unlist(regions)
df
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West <NA>
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South <NA>
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West <NA>
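The split() and unlist() approach above relies on the rows already being ordered by id. A variant that writes the filled values back in the original row order can be sketched with ave(). This is a minimal sketch on a reduced copy of the sample data (only id and region); it assumes missing regions are stored as NA and that each id has at most one distinct region.

```r
df <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2, 3, 3),
  region = c("West", "West", NA, "West", "South", NA, NA, "West"),
  stringsAsFactors = FALSE
)

# Within each id, replace NA by the first non-missing region of that id
df$region <- ave(df$region, df$id, FUN = function(x) {
  x[is.na(x)] <- x[!is.na(x)][1]
  x
})
df
```

If an id has no region at all, its rows simply stay NA.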
I would use dplyr::arrange followed by tidyr::fill
library(dplyr)
library(tidyr)
data.frame(id=c(1,1,1,1,2,2,3,3),
name=c(rep("John Smith",4), rep("Jane Smith", 2), rep("Richard Burkett", 2)),
region=c("West", "West", NA, "West", "South",NA, "West", NA),
Company=c("Walmart","Amazon",NA,"P&G","Apple",NA,"Walmart",NA)) %>%
arrange(id, name) %>%
fill(region)
Results in:
id name region Company
1 1 John Smith West Walmart
2 1 John Smith West Amazon
3 1 John Smith West NA
4 1 John Smith West P&G
5 2 Jane Smith South Apple
6 2 Jane Smith South NA
7 3 Richard Burkett West Walmart
8 3 Richard Burkett West NA
A solution that should work is to group_by id and then fill. Ideally, to cover the OP's situation, the fill should run in both directions.
library(tidyverse)
df %>% group_by(id) %>%
fill(region) %>%
fill(region, .direction = "up")
# id name region Company
# <int> <chr> <chr> <chr>
#1 1 John Smith West Walmart
#2 1 John Smith West Amazon
#3 1 John Smith West <NA>
#4 1 John Smith West P&G
#5 2 Jane Smith South Apple
#6 2 Jane Smith South <NA>
#7 3 Richard Burkett West Walmart
#8 3 Richard Burkett West <NA>
Data
structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), name = c("John Smith",
"John Smith", "John Smith", "John Smith", "Jane Smith", "Jane Smith",
"Richard Burkett", "Richard Burkett"), region = c("West", "West",
NA, "West", "South", NA, "West", NA), Company = c("Walmart",
"Amazon", NA, "P&G", "Apple", NA, "Walmart", NA)), .Names = c("id",
"name", "region", "Company"), class = "data.frame", row.names = c(NA,
-8L))

R table with country names in cells

My aim is to create a table that summarizes the countries featured in my sample. This table should only have two rows, a first row with different columns for each region and a second row with country names that are located in the respective region.
To give you an example, this is what my data.frame XYZ looks like:
     wvs5red2.s003names   wvs5red2.regiondummies
21   "Hong Kong"          Asian Tigers
45   "South Korea"        Asian Tigers
49   "Taiwan"             Asian Tigers
66   "China"              East Asia & Pacific
80   "Indonesia"          East Asia & Pacific
86   "Malaysia"           East Asia & Pacific
My aim is to obtain a table that looks similar to this:
region      Asian Tigers                     East Asia & Pacific
countries   Hong Kong, South Korea, Taiwan   China, Indonesia, etc.
Do you have any idea how to obtain such a table? It took me hours searching for something similar.
Simplest way is tapply:
XYZ <- structure(list(
names = structure(c(2L, 5L, 6L, 1L, 3L, 4L), .Label = c("China", "Hong Kong", "Indonesia", "Malaysia", "South Korea", "Taiwan"), class = "factor"),
region = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Asian Tigers", "East Asia & Pacific"), class = "factor")),
.Names = c("names", "region"), row.names = c(NA, -6L), class = "data.frame")
tapply(XYZ$names, XYZ$region, paste, collapse=", ")
# Asian Tigers East Asia & Pacific
# "Hong Kong, South Korea, Taiwan" "China, Indonesia, Malaysia"
Recreate the data:
dat <- data.frame(
country = c("Hong Kong", "South Korea", "Taiwan", "China", "Indonesia", "Malaysia"),
region = c(rep("Asian Tigers", 3), rep("East Asia & Pacific", 3))
)
dat
country region
1 Hong Kong Asian Tigers
2 South Korea Asian Tigers
3 Taiwan Asian Tigers
4 China East Asia & Pacific
5 Indonesia East Asia & Pacific
6 Malaysia East Asia & Pacific
Use ddply in package plyr combined with paste to summarise the data:
library(plyr)
ddply(dat, .(region), function(x)paste(x$country, collapse= ","))
region V1
1 Asian Tigers Hong Kong,South Korea,Taiwan
2 East Asia & Pacific China,Indonesia,Malaysia
First create data:
> country<-c("Hong Kong","Taiwan","China","Indonesia")
> region<-rep(c("Asian Tigers","East Asia & Pacific"),each=2)
> df<-data.frame(country=country,region=region)
Then run through column region and gather all the countries. We can use tapply, but I will use dlply from package plyr, since it retains list names.
> ll<-dlply(df,~region,function(d)paste(d$country,collapse=","))
> ll
$`Asian Tigers`
[1] "Hong Kong,Taiwan"
$`East Asia & Pacific`
[1] "China,Indonesia"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
region
1 Asian Tigers
2 East Asia & Pacific
Now convert the list to the data.frame using do.call. Since we need nice names we need to pass argument check.names=FALSE:
> ll$check.names <- FALSE
> do.call("data.frame",ll)
Asian Tigers East Asia & Pacific
1 Hong Kong,Taiwan China,Indonesia
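The two steps above (collapse the countries per region, then turn the result into a one-row table) can also be sketched in base R without plyr. This is a hedged sketch; the names dat, collapsed and tab are illustrative.

```r
dat <- data.frame(
  country = c("Hong Kong", "South Korea", "Taiwan", "China", "Indonesia", "Malaysia"),
  region  = c(rep("Asian Tigers", 3), rep("East Asia & Pacific", 3)),
  stringsAsFactors = FALSE
)

# Named character vector: one comma-separated string of countries per region
collapsed <- vapply(
  split(dat$country, dat$region),
  paste, character(1), collapse = ", "
)

# One-row data frame; check.names = FALSE keeps the region names unchanged
tab <- do.call(data.frame, c(as.list(collapsed), list(check.names = FALSE)))
tab
```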
