Replace values into previous row with a condition - r

I want to find the rows where the ID column doesn't start with 00 and append that ID value to the end of the Description column in the previous row.
Then I want to move the rest of that row's values into the columns from Name onward in the previous row. How can I do that with R?
Here is source of dummy data: https://docs.google.com/spreadsheets/d/1SbmaM8hXck-z5nsNfDMbhwijvAGPkPPBgQ_eY4JAMC8/edit?usp=sharing
ID Year Description Name User Factor_1 Factor_2 Factor_3
0011 2016 blue colour AA James Xfac NA NA
is nice XXX XLM Yfac different Yfac NA NA
0024 2017 red colour DD Mark Zfac NA NA
is good YYY STM Lfac unique Zfac NA NA
What I want to have:
ID Year Description Name User Factor_1 Factor_2 Factor_3
0011 2016 blue colour is nice XXX XLM Yfac different Yfac
0024 2017 red colour is good YYY STM Lfac unique Zfac

There are two parts here: pasting the descriptions together, and moving the remaining values over, since you want "XXX" and "YYY" to end up in your "Name" column.
Also, in Vivek's answer the wrong lines are pasted by position onto ALL the "right" lines, which works in your example because they alternate, but not if you have a few right lines in a row and then a wrong one.
Working with booleans (TRUE/FALSE) sometimes works fine, but in this case I think you want an integer index, as that makes it easier to refer to "the previous line". That gives this code:
rmlines <- which(!substr(df$ID,1,2)=="00")
df$Description[rmlines-1] <- paste(df$Description[rmlines-1], df[rmlines,1], sep=" ")
df[rmlines-1, 4:8] <- df[rmlines, 2:6]
df <- df[-rmlines,]
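As a minimal sketch, here are the same steps run on the question's dummy data, assuming every column was read in as character. The data frame below is retyped from the question, and the length check is my own addition: with no wrong lines at all, `df[-rmlines, ]` would drop every row, because negating an empty index selects nothing.

```r
# Dummy data retyped from the question; all columns are character
df <- data.frame(
  ID          = c("0011", "is nice", "0024", "is good"),
  Year        = c("2016", "XXX", "2017", "YYY"),
  Description = c("blue colour", "XLM", "red colour", "STM"),
  Name        = c("AA", "Yfac", "DD", "Lfac"),
  User        = c("James", "different", "Mark", "unique"),
  Factor_1    = c("Xfac", "Yfac", "Zfac", "Zfac"),
  Factor_2    = NA_character_,
  Factor_3    = NA_character_,
  stringsAsFactors = FALSE
)

# Integer positions of the continuation ("wrong") lines
rmlines <- which(substr(df$ID, 1, 2) != "00")

# Guard (an addition, not in the answer): do nothing when there is nothing to merge
if (length(rmlines) > 0) {
  # Append the misplaced ID text to the previous row's Description
  df$Description[rmlines - 1] <- paste(df$Description[rmlines - 1], df$ID[rmlines])
  # Shift columns 2:6 of the wrong line into columns 4:8 of the previous row
  df[rmlines - 1, 4:8] <- df[rmlines, 2:6]
  # Drop the continuation lines
  df <- df[-rmlines, ]
}
df
```

The result keeps the original row names (1 and 3); `rownames(df) <- NULL` resets them if that matters.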
But there's one more problem to consider: what classes are your columns?
When I tried it out, I treated everything as a character, which means you can move columns around fine. In your data, some may be factors or something else, so you might want to change the classes. I think it's easiest to first change it all to character, and then change it (back) to the final class you want your columns to be.
# To change everything to character:
df <- as.data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
# And to assign the right classes, you need to decide case-by-case:
df$Year <- as.integer(df$Year)
df$Factor_1 <- as.factor(df$Factor_1) # Optionally provide levels
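The detour through character matters mostly for factors. As a small illustration (the vector `year_factor` is made up for this example), calling as.integer() directly on a factor returns the internal level codes rather than the values you see printed:

```r
# A factor whose labels look like numbers
year_factor <- factor(c("2016", "2017", "2016"))

# Direct conversion returns the level codes, not the years
codes <- as.integer(year_factor)

# Converting to character first recovers the printed values
years <- as.integer(as.character(year_factor))

codes
years
```

Here `codes` holds 1, 2, 1 while `years` holds 2016, 2017, 2016, which is why converting everything to character first is the safer route.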

Here's a solution with dplyr:
library(dplyr)
df %>%
  bind_cols(df %>% rename_all(function(x) paste0(x, "_dummy"))) %>%
  mutate(
    Description = ifelse(substr(lead(ID), 1, 2) != "00",
                         paste(Description, lead(ID)), Description),
    Name = lead(Year_dummy),
    User = lead(Description_dummy),
    Factor_1 = lead(Name_dummy),
    Factor_2 = lead(User_dummy),
    Factor_3 = lead(Factor_1_dummy)
  ) %>% select(-ends_with("dummy")) %>%
  filter(substr(ID, 1, 2) == "00")
Output:
ID Year Description Name User Factor_1 Factor_2 Factor_3
1 0011 2016 blue colour is nice XXX XLM Yfac different Yfac
2 0024 2017 red colour is good YYY STM Lfac unique Zfac
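For readers new to lead(): it shifts a vector up by one position, so each row can look at the value in the row below it, and the last position becomes NA. A rough base R stand-in for a shift of one, using a hypothetical helper name `lead1`, might look like this:

```r
# Hypothetical helper: base R stand-in for dplyr::lead(x) with a shift of 1
lead1 <- function(x) c(x[-1], NA)

id <- c("0011", "is nice", "0024", "is good")

# Each position now holds the ID of the *next* row
next_id <- lead1(id)

# TRUE where the next row is a continuation line; NA for the last row
next_is_continuation <- substr(next_id, 1, 2) != "00"

next_id
next_is_continuation
```

The trailing NA is harmless in the example because the last row is a continuation line that gets filtered out, but if the data ended on a regular row, the ifelse() on Description would return NA for it.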
In case you're dealing with a large number of columns, a combination of dplyr and base R could do it:
library(dplyr)
df_combo <- cbind(df, df)
df$Description <- ifelse(substr(lead(df$ID), 1, 2) != "00",
paste(df$Description, lead(df$ID)), df$Description)
for (i in (ncol(df) + 4):ncol(df_combo)) {
df_combo[[i]] <- lead(df_combo[[i - ncol(df) - 2]])
}
df_combo <- subset(df_combo, substr(ID, 1, 2) == "00")
df_descr <- subset(df, substr(ID, 1, 2) == "00")
df_final <- df_combo[, (ncol(df) + 1):ncol(df_combo)]
df_final$Description <- df_descr$Description
rm(df_descr, df_combo)
Output:
ID Year Description Name User Factor_1 Factor_2 Factor_3
1: 0011 2016 blue colour is nice XXX XLM Yfac different Yfac
2: 0024 2017 red colour is good YYY STM Lfac unique Zfac

Use -
bools <- !substr(df$ID,1,2)=="00"
values <- df[bools,1]
df <- df[!bools,]
df$Description <- paste(df[substr(df$ID,1,2)=="00","Description"],values,sep=" ")
df
Output
ID Year Description Name User Factor_1 Factor_2
1 0011 2016 blue colour is nice AA James Xfac NA
3 0024 2017 red colour is good DD Mark Zfac NA
Factor_3
1 NA
3 NA
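One thing to check before relying on this: paste() matches the kept rows and the extracted values by position, so it only lines up when every regular row is followed by exactly one continuation row. A small sketch with made-up data (two regular rows in a row, then one continuation) shows the single value being recycled onto every kept row:

```r
# Made-up data: "0012" is followed by a continuation line, "0011" and "0024" are not
df <- data.frame(
  ID          = c("0011", "0012", "is nice", "0024"),
  Description = c("blue colour", "green colour", "XLM", "red colour"),
  stringsAsFactors = FALSE
)

bools  <- substr(df$ID, 1, 2) != "00"
values <- df[bools, 1]            # just "is nice"
kept   <- df[!bools, ]

# One value, three rows: paste() recycles it across all of them
kept$Description <- paste(kept$Description, values, sep = " ")
kept
```

Only the "0012" row should have received "is nice" here, so for data that is not strictly alternating an index-based approach is the safer choice. Note also that this answer only handles the Description part and leaves the other columns unshifted, as its output shows.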

Related

Change cell contents if it contains a certain letter

I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A
You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first searches for N values; then, among those without an N, it searches for H values; then, among those with neither an H nor an N, it assigns M to those with more than one race and the original letter to those with only one race (assuming each race is represented by a single character).
A base R approach is below as well: no dependencies needed, but less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
library(dplyr)
df %>% mutate(eth2 = case_when(
  grepl("N", ethnicity) ~ "N",
  grepl("H", ethnicity) ~ "H",
  !grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
  TRUE ~ ethnicity
))
You could also do it "manually" in base r by indexing (note order matters given your priority):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note this is based on your comment about assigning superiority (that N anywhere supersedes those with both N and H, etc)
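If you need this in more than one place, the base R logic can be wrapped in a small function. This is only a sketch under the same assumptions as above (a character column, one character per race, N outranks H); the function name `recode_race` is made up for illustration.

```r
# Hypothetical helper: collapse a race/ethnicity code to a single letter
recode_race <- function(x) {
  out <- x                                  # single letters pass through unchanged
  out[nchar(x) > 1] <- "M"                  # several races: default to M
  out[grepl("H", x, fixed = TRUE)] <- "H"   # any H overrides M
  out[grepl("N", x, fixed = TRUE)] <- "N"   # any N overrides H and M
  out
}

recoded <- recode_race(c("HAB", "NHB", "AB", "ABW", "A"))
recoded
```

The assignments run from lowest to highest priority, so the last matching rule wins; that is the mirror image of case_when, where the first matching rule wins.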
We could use str_extract. When the number of characters in the column is greater than 1, extract 'N' and 'H' separately and coalesce the extracted elements along with 'M'; if neither matches we get 'M', otherwise we get the first match in the order the inputs are passed to coalesce. For the other case, i.e. when the number of characters is 1, return the column value. Thus 'N' supersedes 'H' no matter its position in the string.
library(dplyr)
library(stringr)
df1 %>%
mutate(output = case_when(nchar(`Race/Ethnicity`) > 1
~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
str_extract(`Race/Ethnicity`, 'H'), "M"),
TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))

How to rename observations with semi-consistent format?

I'm working with the following data frame:
Team Direction Side
Joe HB-L L
Eric HB-R R
Tim FB-L R
Mike HB L
I would like to eliminate the "HB" or "FB" preceding the "L" or "R" in the 'Direction' column. I would also like to eliminate the observations for which there is no "L" or "R" in the 'Direction' column. I would like it to look like this:
Team Direction Side
Joe L L
Eric R R
Tim L R
Then, I want to add a column that indicates if the 'Direction' and 'Side' columns are the same. If yes, I would like it to read 'NEAR', if not I would like it to read "FAR."
Team Direction Side Relation
Joe L L NEAR
Eric R R NEAR
Tim L R FAR
We can first filter the rows that have "-" in them, remove everything up to and including the "-", and then use if_else to assign 'NEAR' or 'FAR' to Relation.
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Direction, '-')) %>%
mutate(Direction = str_remove(Direction, '.*-'),
Relation = if_else(Direction == Side, 'NEAR', 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
We can do this in base R: subset the rows based on the presence of - in 'Direction', then use sub to remove the substring up to and including the -.
df1 <- subset(df1, grepl('-', Direction))
df1$Direction <- sub(".*-", "", df1$Direction)
-output
df1
# Team Direction Side
#1 Joe L L
#2 Eric R R
#3 Tim L R
Then, we can use == to create a logical condition and use it to pick between 'FAR' and 'NEAR'.
df1$Relation <- with(df1, c('FAR', 'NEAR')[(Direction == Side) + 1])
-output
df1
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
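The indexing trick above works because a logical vector is coerced to 0/1 when you add to it, so `(Direction == Side) + 1` gives position 1 for FALSE and position 2 for TRUE. A tiny sketch with a made-up logical vector `same` standing in for the comparison:

```r
same <- c(TRUE, TRUE, FALSE)        # stand-in for Direction == Side

positions <- same + 1               # TRUE -> 2, FALSE -> 1

# Index into the two labels by position
by_index <- c("FAR", "NEAR")[positions]

# ifelse() expresses the same choice more explicitly
by_ifelse <- ifelse(same, "NEAR", "FAR")

by_index
by_ifelse
```

Both give the same labels, so which one to use is mostly a matter of readability.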
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
filter(grepl('-', Direction)) %>%
mutate(Direction = str_replace(Direction, '.*-', ''),
Relation = case_when(Direction == Side ~ 'NEAR', TRUE ~ 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
Or using data.table
library(data.table)
setDT(df1)[grepl('-', Direction)][, Direction := trimws(Direction,
whitespace = '.*-')][, Relation := fifelse(Direction == Side, 'NEAR', 'FAR')][]
# Team Direction Side Relation
#1: Joe L L NEAR
#2: Eric R R NEAR
#3: Tim L R FAR
data
df1 <- structure(list(Team = c("Joe", "Eric", "Tim", "Mike"), Direction = c("HB-L",
"HB-R", "FB-L", "HB"), Side = c("L", "R", "R", "L")),
class = "data.frame", row.names = c(NA,
-4L))

How do I extract part of the data from a column and paste it into another column using R?

I want to extract part of the data from a column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
names country mobile
1 Sia London +1234567890 NULL
2 Ryan Paris +3579514862
3 J Sydney +0123458796 NULL
4 Ricky Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
names country mobile
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
You can use the tidyverse, whose tidyr package has a separate function.
Note that I use NA rather than NULL inside the mobile vector.
Also, I use the option stringsAsFactors = F when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names,country,mobile, stringsAsFactors = F)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
names country mobile
<chr> <chr> <chr>
1 Sia London +1234567890
2 Ryan Paris +3579514862
3 J Sydney +0123458796
4 Ricky Delhi +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
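If you would rather avoid the dependency, a base R sketch with a regular expression is also possible. The pattern below is my own assumption about the format (a trailing space, a plus sign, then digits), so adjust it if your numbers look different:

```r
data <- data.frame(
  names   = c("Sia", "Ryan", "J", "Ricky"),
  country = c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi"),
  mobile  = c(NA, "+3579514862", NA, "+5554848123"),
  stringsAsFactors = FALSE
)

# Assumed format: "<place> +<digits>" at the end of the string
pattern <- "^(.*) (\\+[0-9]+)$"
has_number <- grepl(pattern, data$country)

# Only fill mobile where it is still missing
fill <- has_number & is.na(data$mobile)
data$mobile[fill] <- sub(pattern, "\\2", data$country[fill])

# Strip the number from country wherever one was found
data$country[has_number] <- sub(pattern, "\\1", data$country[has_number])
data
```

Because the first group is greedy, a place name containing spaces stays intact, whereas splitting on a single space into two columns would cut it at the first space.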

Q-How to fill a new column in data.frame based on row values by two conditions in R

I am trying to figure out how to generate a new column in R that records whether a politician "i" remains in the same party or defects in a given legislature "l". Politicians and parties are identified by indexes. Here is an example of what my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
## genrating id for politician and party.affiliation
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
d$party.affiliation1<- as.numeric(d$party.affiliation)
The expected outcome should show the following: if a politician (showed through the column "id_pers") has changed their values in the column "party.affiliation1", a value 1 will be assigned in a new column called "switch", otherwise 0. The same procedure should be done with every politician in the dataset, so the expected outcome should be like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As an example, you can see in this data.frame that the first politician, "Anrita blabla", was a candidate of party '3' from the 1st to the 5th legislature. However, Anrita changes her party affiliation in the 6th legislature, where she was a candidate for party '2'. Therefore, the new column "switch" should contain a '1' to reflect Anrita's change of party affiliation, and '0' to show that she did not change her party affiliation during the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: To tell each politician's first entry apart, we can use the default= parameter of lag() together with a nested ifelse(); first entries then get the value 99.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
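The same "compare with the previous row inside each politician" idea can be sketched in base R. The toy data and the column name `party` below are made up for illustration, and the sketch assumes the rows can be sorted by politician and legislature:

```r
# Toy data: politician 1 switches in legislature 3, politician 2 never does
d <- data.frame(
  id_pers     = c(1, 1, 1, 2, 2, 2),
  legislature = c(1, 2, 3, 1, 2, 3),
  party       = c("Winner", "Winner", "New party", "Loser", "Loser", "Loser"),
  stringsAsFactors = FALSE
)

# Sort so that each politician's rows are adjacent and in time order
d <- d[order(d$id_pers, d$legislature), ]

n <- nrow(d)
prev_party <- c(NA, d$party[-n])     # party in the row above
prev_id    <- c(NA, d$id_pers[-n])   # politician in the row above

# A row is a politician's first entry if the row above belongs to someone else
first_row <- is.na(prev_id) | prev_id != d$id_pers

# NA for first entries, otherwise 1 if the party changed and 0 if not
d$switch <- ifelse(first_row, NA, as.integer(d$party != prev_party))
d
```

Rows where the current or previous party is NA also come out as NA, since the comparison itself is NA there.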
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
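A side note on the output above: party.affiliation1 is NA in every row. That is what as.numeric() gives for a character column, which is how data.frame() stores strings by default since R 4.0.0. Assuming that is what happened here, going through factor() first yields the integer codes the question describes (the vector `party` is just an illustration):

```r
party <- c("Winner", "Loser", "New party", NA)

# Character strings that are not numbers convert to NA (with a warning)
direct <- suppressWarnings(as.numeric(party))

# Factor levels are sorted alphabetically, then numbered from 1
codes <- as.numeric(factor(party))

direct
codes
```

With these labels "Loser" becomes 1, "New party" 2 and "Winner" 3. Any answer that subtracts lagged values of party.affiliation1 depends on these codes being present, while comparing the text column directly, as the data.table code does, does not.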
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]

How to search part of string that contain in a list of string, and return the matched one in R

The following data frame contains a "Campaign" column whose values hold information about season, name, and position; however, the order of these pieces is quite different in each row. Luckily, each piece comes from a fixed list, so we could create a vector to match against the string inside the "Campaign" column.
Date Campaign
1 Jan-15 Summer|Peter|Up
2 Feb-15 David|Winter|Down
3 Mar-15 Up|Peter|Spring
Here is what I want to do: create 3 columns, Name, Season and Position, each of which searches the string in the Campaign column and returns the matched value from the lists below.
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
So my desired result would be following
Temp
Date Campaign Name Season Position
1 15-Jan Summer|Peter|Up Peter Summer Up
2 15-Feb David|Winter|Down David Winter Down
3 15-Mar Up|Peter|Spring Peter Spring Up
Another way:
L <- strsplit(df$Campaign,split = '\\|')
df$Name <- sapply(L,intersect,Name)
df$Season <- sapply(L,intersect,Season)
df$Position <- sapply(L,intersect,Position)
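One caveat with sapply here: if some row has no match, or more than one, the result is a list rather than a character vector. A slightly more defensive sketch uses vapply with a small helper (the name `pick_one` is made up) that returns NA in those cases:

```r
df <- data.frame(
  Date     = c("Jan-15", "Feb-15", "Mar-15"),
  Campaign = c("Summer|Peter|Up", "David|Winter|Down", "Up|Peter|Spring"),
  stringsAsFactors = FALSE
)
Name     <- c("Peter", "David")
Season   <- c("Summer", "Spring", "Autumn", "Winter")
Position <- c("Up", "Down")

# Hypothetical helper: the single matching token, or NA if none or several match
pick_one <- function(parts, choices) {
  hit <- intersect(parts, choices)
  if (length(hit) == 1) hit else NA_character_
}

# Split on the literal "|" and match whole tokens only
parts <- strsplit(df$Campaign, "|", fixed = TRUE)
df$Name     <- vapply(parts, pick_one, character(1), choices = Name)
df$Season   <- vapply(parts, pick_one, character(1), choices = Season)
df$Position <- vapply(parts, pick_one, character(1), choices = Position)
df
```

Matching whole tokens also avoids partial hits that a pattern search could produce, for example a position "Up" matching inside a longer word.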
Do the following:
Date = c("Jan-15","Feb-15","Mar-15")
Campaign = c("Summer|Peter|Up","David|Winter|Down","Up|Peter|Spring")
df = data.frame(Date,Campaign)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
for(k in Name){
  df$Name[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Season){
  df$Season[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Position){
  df$Position[grepl(pattern = k, x = df$Campaign)] <- k
}
This gives:
> df
Date Campaign Name Season Position
1 Jan-15 Summer|Peter|Up Peter Summer Up
2 Feb-15 David|Winter|Down David Winter Down
3 Mar-15 Up|Peter|Spring Peter Spring Up
I had the same idea as Marat Talipov; here's a data.table option:
library(data.table)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
dat <- data.table(Date=c("Jan-15", "Feb-15", "Mar-15"),
Campaign=c("Summer|Peter|Up", "David|Winter|Down", "Up|Peter|Spring"))
Gives
> dat
Date Campaign
1: Jan-15 Summer|Peter|Up
2: Feb-15 David|Winter|Down
3: Mar-15 Up|Peter|Spring
Processing is then
dat[ , `:=`(Name = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Name),
Season = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Season),
Position = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Position))
]
Result:
> dat
Date Campaign Name Season Position
1: Jan-15 Summer|Peter|Up Peter Summer Up
2: Feb-15 David|Winter|Down David Winter Down
3: Mar-15 Up|Peter|Spring Peter Spring Up
Maybe there's some benefit if you're doing this to a lot of columns or need to modify in place (by reference).
I'm interested if anyone can show me how to update all three columns at once.
EDIT: Never mind, figured it out;
for (icol in c("Name", "Season", "Position"))
dat[, (icol):=sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, get(icol))]
