Creating new R column with different number of rows to df - r

I want to add a new column to a df called alcohol
df$alcohol <- df %>%
filter(condition=='alcohol') %>%
select(drift)
However, as 'alcohol' was one of two conditions, this new column will have fewer values than the data frame has rows, so I receive the following error message:
Error: replacement has 36 rows, data has 72
Does anyone know how to get around this error message and add the new column with fewer values?

You can use an ifelse statement to mark the values that match your condition with a label of your choice, e.g., alcohol, and mark the remaining values as NA (or some other value):
DATA:
df <- data.frame(
drinks = c("apple juice", "coke", "whiskey", "milk", "water")
)
SOLUTION:
df$alcohol <- ifelse(df$drinks=="whiskey", "alcohol", NA) # unquoted NA is a real missing value, not the string "NA"
RESULT:
df
drinks alcohol
1 apple juice <NA>
2 coke <NA>
3 whiskey alcohol
4 milk <NA>
5 water <NA>
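Applied to the layout described in the question, the same idea might look like the sketch below. The toy data is an assumption: only the column names condition and drift and the value 'alcohol' come from the question, while the second condition ("placebo") and the numbers are made up for illustration.

```r
# Toy stand-in for the asker's data; "placebo" and the drift values are hypothetical
df <- data.frame(
  condition = rep(c("alcohol", "placebo"), each = 3),
  drift = c(1.2, 0.8, 1.5, 0.3, 0.4, 0.2)
)

# Keep drift where condition is "alcohol" and pad every other row with NA,
# so the new column has exactly as many rows as the data frame
df$alcohol <- ifelse(df$condition == "alcohol", df$drift, NA)

print(df)
```

Because the new column is padded with NA, its length matches nrow(df), which is what the original assignment was missing.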

Related

Add value in one column based on multiple key words in another column in r

I want to do the following things: if key words "GARAGE", "PARKING", "LOT" exist in column "Name" then I would add value "Parking&Garage" into column "Type".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neat way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a list of keywords to change, create a pattern dynamically and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))
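One caveat with all of the patterns above: a bare "LOT" also matches inside longer words. If that matters for your data, a hedged variation is to wrap the alternatives in word boundaries. The name "PILOT HOUSE" below is a made-up example, not part of the original dataset.

```r
keywords <- c("GARAGE", "PARKING", "LOT")

# Build "\\b(GARAGE|PARKING|LOT)\\b" so that only whole words match
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# "PILOT HOUSE" is a hypothetical name that contains "LOT" inside another word
names_vec <- c("GARAGE 1", "PARKING LOT", "PILOT HOUSE", "CITY HALL")

hits <- grepl(pattern, names_vec, perl = TRUE)
type <- ifelse(hits, "Parking&Garage", NA_character_)

print(data.frame(Name = names_vec, Type = type))
```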

How to remove dupes and replace column variables

I'm working with a data set named CCCrn on candidates in a local election with some duplicate values. Here's a sample:
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Democratic 4 5
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Republican 4 5
As you can see, this candidate cross listed and was on both parties' ballots. I'd like to remove one of the rows, and then edit the Party variable to say "Cross Listed".
Obviously unique and distinct haven't been much help. I tried
test <- CCCrn[!duplicated(CCCrn$Name), ] which succeeded in removing the duplicate candidates, but now I'm not sure how I would go back and edit the "Party" variable.
Create a flag for duplicate records, drop the duplicates, then update the party where the flag is set:
df <- df %>% mutate(dup = ifelse(duplicated(name)|duplicated(name, fromLast=TRUE),1,0))
df <- df[!duplicated(df$name),] ## remove duplicate
df <- df %>% mutate(party= ifelse(dup==1, "Cross Listed", party)) # update party
df <- df%>% select(-dup) ## remove flag
One way, using dplyr, would be to group_by all fields other than the party, and then summarise to "CrossListed" if the number of rows in the group is bigger than 1, i.e. if n()>1.
Something like this...
library(dplyr)
df2 <- df %>% group_by(across(-Party)) %>% # group by every column except Party (needs dplyr >= 1.0.0)
summarise(Party = ifelse(n() > 1, "CrossListed", first(Party)))
or an alternative to the last line would be to paste all the party names together so that you can see where they are cross-listed (which might be useful if there are lots of parties - less so if there are only two!)... summarise(Party = paste(sort(Party), collapse=", "))
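To make the grouping idea concrete, here is a minimal sketch on a toy data frame. The column names Name, Office and Party are assumptions (the sample rows in the question have no headers), "Jane Doe" is a made-up second candidate, and across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data: column names and the second candidate are hypothetical
cand <- data.frame(
  Name = c("Adam Hill", "Adam Hill", "Jane Doe"),
  Office = c("School Director", "School Director", "School Director"),
  Party = c("Democratic", "Republican", "Democratic")
)

collapsed <- cand %>%
  group_by(across(-Party)) %>%   # one group per candidate/office combination
  summarise(
    # more than one row in a group means the candidate appears under several parties
    Party = if (n() > 1) "Cross Listed" else first(Party),
    .groups = "drop"
  )

print(collapsed)
```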

How to identify observations with multiple matching patterns and create another variable in R?

I am trying to create a broad industry category from detailed categories in my data. Where am I going wrong in creating this with grepl in R?
My example data is as follows:
df <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"))
I am trying to create a variable called industry with 2 levels (e.g., agri, manufacturing) with the help of grepl or str_replace commands in R.
I have tried this:
newdf$industry <- ""
newdf[df$ind %>% grepl(c("^07|^08|^09", levels(df$ind), value = TRUE)), "industry"] <- "Agri"
But this gives me the following error:
argument 'pattern' has length > 1 and only the first element will be used
I want to get the following dataframe as my result:
newdf <- data.frame(county = c(01001, 01002, 02003, 04004, 08005, 01002, 02003, 04004),
ind = c("0700","0701","0780","0980","1000","1429","0840","1500"),
industry = c("Agri", "Agri", "Agri", "Agri", "Manufacturing", "Manufacturing", "Agri", "Manufacturing"))
So my question is this, how do I specify if variable 'ind' starts with 07,08 or 09, my industry variable will take the value 'agri', if 'ind' starts with 10, 14 or 15, industry will be 'manufacturing'? Needless to say, there is a huge list of industry codes that I am trying to crunch in 10 categories, so looking for a solution which will help me do it with pattern recognition.
Any help is appreciated! Thanks!
Try this:
library(dplyr) # for %>% and mutate()
library(stringr) # for str_detect()
newdf = df %>%
mutate(industry = ifelse(str_detect(string = ind,
pattern = '^07|^08|^09'),
'Agri',
'Manufacturing'))
This works, using ifelse() to add desired column to df data.frame
df$industry <- ifelse(grepl(paste0("^", c('07','08','09'), collapse = "|"), df$ind), "Agri", "Manufacturing")
> df
county ind industry
1 1001 0700 Agri
2 1002 0701 Agri
3 2003 0780 Agri
4 4004 0980 Agri
5 8005 1000 Manufacturing
6 1002 1429 Manufacturing
7 2003 0840 Agri
8 4004 1500 Manufacturing
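Both answers label everything that is not "Agri" as "Manufacturing", which only works for two categories. Since the question mentions around ten categories, a sketch with dplyr::case_when() may scale better: one pattern per category, with unmatched codes left as NA. Only the two prefix groups given in the question are used; further categories would be added as extra lines.

```r
library(dplyr)

df <- data.frame(
  county = c(1001, 1002, 2003, 4004, 8005, 1002, 2003, 4004),
  ind = c("0700", "0701", "0780", "0980", "1000", "1429", "0840", "1500")
)

newdf <- df %>%
  mutate(industry = case_when(
    grepl("^(07|08|09)", ind) ~ "Agri",
    grepl("^(10|14|15)", ind) ~ "Manufacturing",
    TRUE ~ NA_character_   # codes that match no pattern stay missing
  ))

print(newdf)
```

Leaving unmatched codes as NA makes it easy to spot industry codes that have not been assigned to any category yet.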

How to Use forcats::fct_collapse in a Function Across Different Dataframes with Different Factor Levels

library(tidyverse)
library(forcats)
I have two simple dataframes (code at bottom) and I want to create a new recoded variable by collapsing the "Animal" column. I usually do this with forcats::fct_collapse. However, I want to make a function to apply fct_collapse to many different dataframes that have the same variables, except that some might be missing one or two of the factor levels. For example, in this case, Df2 is missing "Rhino".
Is there a way I can change the code (using pkg:tidyverse) so that factor categories that are missing will be returned as NA? In this example I know it's "Rhino", but in my real data there may be other missing levels. I'm open to other options besides forcats::fct_collapse, but I would like to stay within the realm of tidyverse.
REC <- function(Df, Data){
Df %>%
mutate(NEW = fct_collapse(Data, One = c("Cat","Dog","Snake"),
Two = c("Elephant","Bird","Rhino")))
}
REC(Df1, Animal) # this works
REC(Df2, Animal) # this doesn't; it throws an error because of "Rhino"
Sample Data:
Animal <- c("Cat","Dog","Snake","Elephant","Bird","Rhino")
Code <- c(101,222,434,545,444,665)
Animal2 <- c("Cat","Dog","Snake","Elephant","Bird")
Code2 <- c(101,222,434,545,444)
Df1 <- data_frame(Code, Animal)
Df2 <- data_frame(Code2, Animal2) %>% rename(Animal = Animal2)
Here is one idea for you. I initially tried to have two arguments in my function: one for the data frame and one for the column containing the animal names. That attempt failed with the error message "Error in mutate_impl(.data, dots) : Column new must be length 5 (the number of rows) or one, not 6." So I decided not to pass the column name to the function and referred to Animal explicitly inside it. Then things worked. The idea is to create a factor whose levels also include the missing animal names, which is done with factor() and setdiff(). Once the factor has all animal names as levels, I use fct_collapse().
myfun <- function(mydf){
animals <- c("Cat", "Dog", "Snake", "Elephant", "Bird", "Rhino")
mydf %>%
mutate(new = factor(Animal, levels = c(unique(Animal), setdiff(animals, Animal))),
new = fct_collapse(new, One = c("Cat", "Dog", "Snake"),
Two = c("Elephant", "Bird", "Rhino"))) -> x
x}
> myfun(Df2)
# A tibble: 5 x 3
Code2 Animal new
<dbl> <chr> <fct>
1 101 Cat One
2 222 Dog One
3 434 Snake One
4 545 Elephant Two
5 444 Bird Two
> myfun(Df1)
# A tibble: 6 x 3
Code Animal new
<dbl> <chr> <fct>
1 101 Cat One
2 222 Dog One
3 434 Snake One
4 545 Elephant Two
5 444 Bird Two
6 665 Rhino Two
Memo:
The following function is the same except that I have two arguments. This is not working. If any revision is possible, please let me know.
myfun2 <- function(mydf, mycol){
animals <- c("Cat", "Dog", "Snake", "Elephant", "Bird", "Rhino")
mydf %>%
mutate(new = factor(mycol, levels = c(unique(mycol), setdiff(animals, mycol))),
new = fct_collapse(new, One = c("Cat", "Dog", "Snake"),
Two = c("Elephant", "Bird", "Rhino"))) -> x
x}
> myfun2(Df2, Animal)
Error in mutate_impl(.data, dots) :
Column `new` must be length 5 (the number of rows) or one, not 6
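A likely reason for that error is that mycol is evaluated outside the data frame, so it picks up the global Animal vector (length 6) instead of the column in Df2. One possible revision, sketched below, forwards the column with the embrace operator {{ }}; this assumes a dplyr version built on rlang 0.4.0 or later. The function name myfun3 is hypothetical, and the toy data frame is rebuilt with data.frame() so the block runs on its own.

```r
library(dplyr)
library(forcats)

myfun3 <- function(mydf, mycol) {
  animals <- c("Cat", "Dog", "Snake", "Elephant", "Bird", "Rhino")
  mydf %>%
    mutate(
      # {{ mycol }} refers to the column inside mydf, not a global object;
      # union() appends any animals that are absent from this data frame as extra levels
      new = factor({{ mycol }}, levels = union(as.character({{ mycol }}), animals)),
      new = fct_collapse(new,
                         One = c("Cat", "Dog", "Snake"),
                         Two = c("Elephant", "Bird", "Rhino"))
    )
}

# Stand-in for Df2 from the question (no "Rhino" row)
Df2 <- data.frame(
  Code = c(101, 222, 434, 545, 444),
  Animal = c("Cat", "Dog", "Snake", "Elephant", "Bird")
)

res <- myfun3(Df2, Animal)
print(res)
```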

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by numeric indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
## generating id for politician and party.affiliation
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
d$party.affiliation1<- as.numeric(d$party.affiliation)
The expected outcome should show the following: if a politician (showed through the column "id_pers") has changed their values in the column "party.affiliation1", a value 1 will be assigned in a new column called "switch", otherwise 0. The same procedure should be done with every politician in the dataset, so the expected outcome should be like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. However, Anrita changes her party affiliation in the 6th legislature, where she was a candidate for party '2'. Therefore, the new column "switch" should contain a '1' to reflect Anrita's change of party affiliation, and '0' to show that she did not change her party affiliation during the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
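The versions above depend on the numeric coding in party.affiliation1. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so as.numeric() on the character column may return only NA. A hedged alternative is to compare the character column party.affiliation with its lagged value directly. The toy data below is a reduced, made-up stand-in for d with two politicians and three legislatures.

```r
library(dplyr)

# Reduced, hypothetical stand-in for the data frame d
toy <- data.frame(
  id_pers = c(1, 1, 1, 2, 2, 2),
  legislature = c(1, 2, 3, 1, 2, 3),
  party.affiliation = c("Winner", "Winner", "Loser", "Loser", "Loser", "Loser")
)

result <- toy %>%
  arrange(id_pers, legislature) %>%   # lag() relies on chronological order within each person
  group_by(id_pers) %>%
  mutate(switch = as.numeric(party.affiliation != lag(party.affiliation))) %>%
  ungroup()

print(result)
```

As in the answer above, the first row of each politician is NA because there is no earlier affiliation to compare against.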
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
