I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent different the fact that the survey question was asked on a particular month.
I wish to create a new set of variables that have month-invariant names; the value of these variables will correspond to the value of a month-variant question for the month observed.
Please see an example / fictitious data set:
require(data.table)
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))
In this survey, there are really only two questions: "q1" and "q2". Each of these questions is repeatedly asked for several months. However, the observation contains a valid response only if the month observed in the data matches up with the survey question for a particular month.
For example: "may.q1" is observed as "yes" for any observation in "May". I would like a new "Q1" variable to represent "may.q1", "jun.q1", and "jul.q1". The value of "Q1" will take on the value of "may.q1" when the month is "may", and the value of "Q1" will take on the value of "jun.q1" when the month is "jun".
If I were to try and do this by hand using data table, I would want something like:
mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = F]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))
I would want this repeated "by = month".
If I were to use the "plyr" package for a data frame, I would solve using the following approach:
require(plyr)
data <- data.frame(data)
mdata <- ddply(data, .(month), function(dfmo) {
dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
return(dfmo)
})
Any help using a data.table method would be greatly appreciated, as my data are large. Thank you.
A different way to illustrate :
data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]
month may.q1 may.q2
1: may yes econ
2: may yes econ
3: may yes econ
4: may yes econ
5: may yes econ
6: jun lunch foggy
7: jun lunch foggy
8: jun lunch foggy
9: jun lunch foggy
10: jun lunch foggy
11: jul oranges heavy rain
12: jul oranges heavy rain
13: jul oranges heavy rain
14: jul oranges heavy rain
15: jul oranges heavy rain
But note the column names come from the first group (can rename afterwards using setnames). And it may not be the most efficient if there are a great number of columns with only a few needed. In that case Arun's solution melting to long format should be faster.
Edit: Seems very inefficient on bigger data. Check out #MatthewDowle's answer for a really fast and neat solution.
Here's a solution using data.table.
dd <- melt.dt(data, id.var=c("month"))[month == gsub("\\..*$", "", ind)][,
ind := gsub("^.*\\.", "", ind)][, split(values, ind), by=list(month)]
The function melt.dt is a small function (still more improvements to be made) I wrote to melt a data.table similar to that of the melt function in plyr (copy/paste this function shown below before trying out the code above).
melt.dt <- function(DT, id.var) {
stopifnot(inherits(DT, "data.table"))
measure.var <- setdiff(names(DT), id.var)
ind <- rep.int(measure.var, rep.int(nrow(DT), length(measure.var)))
m1 <- lapply(c("list", id.var), as.name)
m2 <- as.call(lapply(c("factor", "ind"), as.name))
m3 <- as.call(lapply(c("c", measure.var), as.name))
quoted <- as.call(c(m1, ind = m2, values = m3))
DT[, eval(quoted)]
}
The idea: First melt the data.table with id.var = month column. Now, all your melted column names are of the form month.question. So, by removing ".question" from this melted column and equating with month column, we can remove all unnecessary entries. Once we did this, we don't need the "month." in the melted column "ind" anymore. So, we use gsub to remove "month." to retain just q1, q2 etc.. After this, we have to reshape (or cast) it. This is done by grouping by month and splitting the values column by ind (which has either q1 or q2. So, you'll get 2 columns for every month (which is then stitched together) to get your desired output.
What about something like this
data <- data.table(
may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
may.q2 = rep(c('econ', 'math', 'science'), each = 5),
jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
)
tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")
str(tmp)
## Classes ‘data.table’ and 'data.frame': 30 obs. of 5 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ may : chr "yes" "yes" "yes" "yes" ...
## $ jun : chr "breakfast" "breakfast" "breakfast" "breakfast" ...
## $ jul : chr "oranges" "oranges" "oranges" "oranges" ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
If you want to go further and melting this data again you can use the melt package
require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")
str(res)
## 'data.frame': 90 obs. of 3 variables:
## $ question: chr "q1" "q1" "q1" "q1" ...
## $ month : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
## $ response: chr "yes" "yes" "yes" "yes" ...
Related
I am trying to deidentify data using the duawranglr package in R presented in this example: https://cran.r-project.org/web/packages/duawranglr/vignettes/securing_data.html.
As an example, I created a data frame:
data <- data.frame(
Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2)
)
I am trying to create unique, hexadecimal strings without a crosswalk that correspond to the Name column, using the deid_dua function.
data <- deid_dua(data, id_col = "Name", new_id_name = "DID", write_crosswalk = TRUE, id_length = 12)
The error that I keep getting is:
Error in data.frame(old = old_ids, new = new_ids, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 5, 0
At first I thought the issue was with the name column being a factor. However, I receive the same error after converting it to character using the stringsAsFactors = FALSE statement in data.frame. I'm also not sure based on the CRAN example if I need these statements:
admin_file <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin_file)
df
Do they apply if you're not importing the data? The example doesn't explain very well what they are for.
Here's a much simpler solution:
# create a custom 8-digit random identifier string called ID:
library(stringi)
data$ID <- stri_rand_strings(nrow(data), 8)
# remove the name column to create a de-identified dataset
data_deidentified <- data[,-1]
Your data_deidentified dataframe will look something like this:
V1 V2 ID
1 16 3 V2Hziep8
2 20 7 vFeQW1OQ
3 34 5 E5vcWYfm
4 25 3 VLbHzU3H
5 26 2 acCbXiO1
And obviously retain the original data dataframe as your crosswalk. You can make the ID variable longer by changing the '8' value in that call.
Now if you have duplicate names in your data, you will need to do a few extra steps:
# note that I've modified the original dataframe to include two "Martin" values:
data <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin", "Martin"),
V1 = c(16, 20, 34, 25, 26, 28),
V2 = c(3, 7, 5, 3, 2, 5))
# get list of unique names and convert to dataframe
names <- data.frame('Name' = unique(data$Name))
# assign ID string to each unique name
names$ID <- stri_rand_strings(nrow(names), 8)
# now merge back into original df
data <- merge(data, names)
Your result is:
Name V1 V2 ID
1 Jan 25 3 e8da7lO4
2 Jane 20 7 pGeeklL1
3 Kate 16 3 5yYAtO9B
4 Martin 26 2 BwC6jPBh
5 Martin 28 5 BwC6jPBh
6 Rod 34 5 f3xvGbu2
I get an error if I don't set a crosswalk first, but this is fairly trivial:
library(duawranglr)
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
# You only have a single column to obscure, so you only need a one-cell data frame to set up
set_dua_cw(data.frame(secure = "Name"))
#> -- duawranglr note -------------------------------------------------------------------
#> DUA crosswalk has been set!
# Simultaneously secure the data and write the crosswalk
df <- deid_dua(df,
id_col = "Name",
new_id_name = "ID",
write_crosswalk = T,
id_length = 12,
crosswalk_filename = "cw.csv")
print(df)
#> ID V1 V2
#> 1 950dce035280 16 3
#> 2 6b95d061b59f 20 7
#> 3 00a5d8ab2a4c 34 5
#> 4 ea03e704d806 25 3
#> 5 3eba984ebcba 26 2
And you can see the contents of the crosswalk by reading the csv file's contents
read.csv("cw.csv")
#> Name ID
#> 1 Kate 950dce035280
#> 2 Jane 6b95d061b59f
#> 3 Rod 00a5d8ab2a4c
#> 4 Jan ea03e704d806
#> 5 Martin 3eba984ebcba
And if you want to get the names back in the future, you can do:
cw <- read.csv("cw.csv")
df$Name <- cw$Name[match(cw$ID, df$ID)]
I'm a little late, but as the package author, I'll try to clear up some confusion.
tl;dr
The answer #Allan Cameron gave worked for me, but if all you want to do is hash your IDs, then #mh765's solution is probably the best.
Longer explanation of duawranglr purpose
duawranglr assumes you have a restricted data frame and that you want to do two things so that you can share it:
Drop columns which contain restricted data elements (like DOB or
other identifying information)
Convert unique identifiers into another unique ID that can't be used to back into the original IDs (in case the original IDs are also restricted, like SSNs)
Since you aren't trying to do #1, then it makes sense to have a DUA crosswalk that only has one column with one element: the name of your ID column (per #Allan Cameron).
But let's say you have two potential levels of security and in the second, you can't include V1. Then your DUA crosswalk might look like this:
library(duawranglr)
## your data frame
df <- data.frame(Name = c("Kate", "Jane", "Rod", "Jan", "Martin"),
V1 = c(16, 20, 34, 25, 26),
V2 = c(3, 7, 5, 3, 2))
## create dua crosswalk
dua_cw <- data.frame(secure_level_i = c("Name",""),
secure_level_ii = c("Name", "V1"))
## show cw (level_i won't allow name; level_ii won't allow name or V1)
dua_cw
secure_level_i secure_level_ii
1 Name Name
2 V1
## set the dua cw
set_dua_cw(dua_cw)
-- duawranglr note -------------------------------------------------------------
DUA crosswalk has been set!
Now you can set the level of security. Let's say you set it at secure_level_i, meaning it's okay to keep V1 in the final data frame you share:
## set DUA level
set_dua_level("secure_level_i", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
Now you can use deid_dua() as you wanted to hash your IDs, in this case, names.
## deidentify data (don't need to set id_col since we set it in set_dua_level)
df <- deid_dua(df,
new_id_name = "DID",
write_crosswalk = TRUE,
id_length = 12,
crosswalk_filename = "cw.csv")
## show result
df
DID V1 V2
1 d164bb624da2 16 3
2 a8b33e3b0230 20 7
3 a1d287cbdde7 34 5
4 1c00ba576e1a 25 3
5 a870564b3365 26 2
## show crosswalk
read.csv("cw.csv")
Name DID
1 Kate d164bb624da2
2 Jane a8b33e3b0230
3 Rod a1d287cbdde7
4 Jan 1c00ba576e1a
5 Martin a870564b3365
## check restrictions to see if you can save data
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
If, however, you set_dua_level() to "secure_level_ii", then you won't pass the last check since you'll still have V1 in your data.
## set new more secure level
set_dua_level("secure_level_ii", deidentify_required = TRUE, id_column = "Name")
-- duawranglr note -------------------------------------------------------------
Unique IDs in [ Name ] must be deidentified; use -deid_dua()-.
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
The following variables are not allowed at the current data usage level
restriction [ secure_level_ii ] and MUST BE REMOVED before saving:
- V1
To pass under the new level, you'll need to drop V1 from your data frame.
## drop
df$V1 <- NULL
## check again
check_dua_restrictions(df)
-- duawranglr note -------------------------------------------------------------
Data set has passed check and may be saved.
As a final note, your id_col must contain unique IDs. The names work in the toy example because they are unique, but as others have noted, repeated names for different observations won't work with duawranglr.
I'm working with Lending Club data set and I'm trying to create a dummy variable for the target variable loan_status. So my main goal is for Charged Off to be 0 and Fully Paid to be 1 and all else would be 'NA'. The variable loan status has several values: Current, Fully Paid, Late, Grace Period, Delinquent, Charged off, and Does not qualify due to credit profile. I just want to focus on Charged Off and Fully Paid. I've tried numerous times but still no success. For example:
Creating a new target variable
loan_status1 <- if(loan_status== 'Fully Paid'){'Yes'} else if
(loan_status== 'Charged Off') {'No'} else 'NA'
Also I've tried this:
if(loan_status=='Fully Paid'){
0} else if (loan_status=='Charged Off') {
1} else (loan_status=='NA')
I would appreciate any guidance.
Basically you could try to run a for-loop over your data by executing this:
Don't set NAs as strings ('NA'), better set to data type NA
loan_status <- sample(rep(c('Fully Paid', 'Charged Off', "abc"), 100), 100, replace = FALSE)
for (i in seq_along(loan_status)){
if (loan_status[i] == 'Fully Paid'){
loan_status[i] <- as.integer(0)
} else if (loan_status[i] == 'Charged Off'){
loan_status[i] <- as.integer(1)
} else {
loan_status[i] == NA
}
}
Maybe you want to do this the easy way with the factor() function:
For instance you could do:
factor(loan_status, levels = c('Fully Paid', 'Charged Off'), labels = c(0, 1))
The OP requested a 1:1 replacement, i.e., only one data field involved, of selected values. Besides the nested ifelse approach, this could be done by using factors or join for larger data.
If more than two or three values need to be replaced, the "hard-coded" nested ifelse approach easily gets unhandy.
Factor case 1: Yes, No
# create some data
loan_status <- c("Fully Paid", "Charged Off", "Something", "Else")
# do the conversion
factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No"))
#[1] Yes No <NA> <NA>
#Levels: Yes No
Or,
as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("Yes", "No")))
#[1] "Yes" "No" NA NA
if the result is expected as character.
Factor case 2: 0L, 1L as integers
If the result is expected to be of type integer, the factor approach can still be used but needs additonal conversion.
as.integer(as.character(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1"))))
#[1] 0 1 NA NA
Note, that the conversion to character is essential here. Otherwise, the result would return the numbers of the factor levels:
as.integer(factor(loan_status, levels = c("Fully Paid", "Charged Off"), labels = c("0", "1")))
#[1] 1 2 NA NA
Join
In case of larger data and many items to be replaced using data.table join might be an alternative worth considering:
library(data.table)
# create translation table
translation_map <- data.table(
loan_status = c("Fully Paid", "Charged Off"),
target = c(0L, 1L))
# create some user data
DT <- data.table(id = LETTERS[1:4],
loan_status = c("Fully Paid", "Charged Off", "Something", "Else"))
DT
# id loan_status
#1: A Fully Paid
#2: B Charged Off
#3: C Something
#4: D Else
# right join
translation_map[DT, on = "loan_status"]
# loan_status target id
#1: Fully Paid 0 A
#2: Charged Off 1 B
#3: Something NA C
#4: Else NA D
By default (nomatch = NA), data.table does a right join, i.e, takes all rows of DT.
I am trying to figure out how to generate a new column in R that accounts for whether a politician "i" remains in the same party or defect for a given legislatures "l". These politicians and parties are recognized because of indexes. Here is an example of how my data originally looks like:
## example of data
names <- c("Jesus Martinez", "Anrita blabla", "Paco Pico", "Reiner Steingress", "Jesus Martinez Porras")
Parti.affiliation <- c("Winner","Winner","Winner", "Loser", NA)#NA, "New party", "Loser", "Winner", NA
Legislature <- c(rep(1, 5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5))
selection <- c(rep("majority", 15), rep("PR", 15))
sex<- c("Male", "Female", "Male", "Female", "Male")
Election<- c(rep(1955, 5), rep(1960, 5), rep(1965, 5), rep(1970,5), rep(1975,5), rep(1980,5))
d<- data.frame(names =factor(rep(names, 6)), party.affiliation = c(rep(Parti.affiliation,5), NA, "New party", "Loser", "Winner", NA), legislature = Legislature, selection = selection, gender =rep(sex, 6), Election.date = Election)
## genrating id for politician and party.affiliation
d$id_pers<- paste(d$names, sep="")
d <- arrange(d, id_pers)
d <- transform(d, id_pers = as.numeric(factor(id_pers)))
d$party.affiliation1<- as.numeric(d$party.affiliation)
The expected outcome should show the following: if a politician (showed through the column "id_pers") has changed their values in the column "party.affiliation1", a value 1 will be assigned in a new column called "switch", otherwise 0. The same procedure should be done with every politician in the dataset, so the expected outcome should be like this:
d["switch"]<- c(1, rep(0,4), NA, rep(0,6), rep(NA, 6),1, rep(0,5), rep (0,5),1) # 0= remains in the same party / 1= switch party affiliation.
As example, you can see in this data.frame that the first politician, called "Anrita blabla", was a candidate of the party '3' from the 1st to 5th legislature. However, we can observe that "Anrita" changes her party affiliation in the 6th legislature, so she was a candidate for the party '2'. Therefore, the new column "switch" should contain a value '1' to reflect this Anrita's change of party affiliation, and '0' to show that "Anrita" did not change her party affiliation for the first 5 legislatures.
I have tried several approaches to do that (e.g. loops). I have found this strategy the simplest one, but it does not work :(
## add a new column based on raw values
ind <- c(FALSE, party.affiliation1[-1L]!= party.affiliation1[-length(party.affiliation1)] & party.affiliation1!= 'Null')
d <- d %>% group_by(id_pers) %>% mutate(this = ifelse(ind, 1, 0))
I hope you find this explanation clear. Thanks in advance!!!
I think you could do:
library(tidyverse)
d%>%
group_by(id_pers)%>%
mutate(switch=as.numeric((party.affiliation1-lag(party.affiliation1)!=0)))
The first entry will be NA as we don't have information on whether their previous, if any, party affiliation was different.
Edit: We use the default= parameter of lag() with ifelse() nested to differentiate the first values.
df=d%>%
group_by(id_pers)%>%
mutate(switch=ifelse((party.affiliation1-lag(party.affiliation1,default=-99))>90,99,ifelse(party.affiliation1-lag(party.affiliation1)!=0,1,0)))
Another approach, using data.table:
library(data.table)
# Convert to data.table
d <- as.data.table(d)
# Order by election date
d <- d[order(Election.date)]
# Get the previous affiliation, for each id_pers
d[, previous_party_affiliation := shift(party.affiliation), by = id_pers]
# If the current affiliation is different from the previous one, set to 1
d[, switch := ifelse(party.affiliation != previous_party_affiliation, 1, 0)]
# Remove the column
d[, previous_party_affiliation := NULL]
As Haboryme has pointed out, the first entry of each person will be NA, due to the lack of information on previous elections. And the result would give this:
names party.affiliation legislature selection gender Election.date id_pers party.affiliation1 switch
1: Anrita blabla Winner 1 majority Female 1955 1 NA NA
2: Anrita blabla Winner 2 majority Female 1960 1 NA 0
3: Anrita blabla Winner 3 majority Female 1965 1 NA 0
4: Anrita blabla Winner 4 PR Female 1970 1 NA 0
5: Anrita blabla Winner 5 PR Female 1975 1 NA 0
6: Anrita blabla New party 6 PR Female 1980 1 NA 1
(...)
EDITED
In order to identify the first entry of the political affiliation and assign the value 99 to them, you can use this modified version:
# Note the "fill" parameter passed to the function shift
d[, previous_party_affiliation := shift(party.affiliation, fill = "First"), by = id_pers]
# Set 99 to the first occurrence
d[, switch := ifelse(party.affiliation != previous_party_affiliation, ifelse(previous_party_affiliation == "First", 99, 1), 0)]
I need some help to re-design the output of a function that comes through an R package.
My scope is to reshape a dataframe called output_IMFData in a way that look very similar to the shape of output_imfr. The codes of a MWE reproducing these dataframes are:
library(imfr)
output_imfr <- imf_data(database_id="IFS", indicator="IAD_BP6_USD", country = "", start = 2010, end = 2014, freq = "A", return_raw =FALSE, print_url = T, times = 3)
and for output_IMFData
library(IMFData)
databaseID <- "IFS"
startdate <- "2010"
enddate <- "2014"
checkquery <- FALSE
queryfilter <- list(CL_FREA = "A", CL_AREA_IFS = "", CL_INDICATOR_IFS = "IAD_BP6_USD")
output_IMFData <- CompactDataMethod(databaseID, queryfilter, startdate, enddate,
checkquery)
the output from output_IMFData looks like this:
But, I want to redesign this dataframe to look like the output of output_imfr:
Sadly, I am not that advanced user and could not find something that can help me. My basic problem in converting the shape of output_IMFData to the shape of the second ``panel-data-looking" dataframework is that I don't know how to handle the Obs in output_IMFData in a way that cannot lose the "correspondence" with the reference code #REF-AREA in output_IMFData. That is, in column #REF-AREA there are codes of country names and the column in Obs has their respective time series data. This is very cumbersome way of working with panel data, and therefore I want to reshape that dataframe to the much nicer form of output_imfr dataframe.
The data of interest are stored in a list in the column Obs. Here is a dplyr solution to split the data, crack open the list, then stitch things back together.
longData <-
output_IMFData %>%
split(1:nrow(.)) %>%
lapply(function(x){
data.frame(
iso2c = x[["#REF_AREA"]]
, x$Obs
)
}) %>%
bind_rows()
head(longData)
gives:
iso2c X.TIME_PERIOD X.OBS_VALUE X.OBS_STATUS
1 FJ 2010 47.2107721901621 <NA>
2 FJ 2011 48.28347 <NA>
3 FJ 2012 51.0823499999999 <NA>
4 FJ 2013 157.015648875072 <NA>
5 FJ 2014 186.623232882226 <NA>
6 AW 2010 616.664804469274 <NA>
Here's another approach:
NewDataFrame <- data.frame(iso2c=character(),
year=numeric(),
IAD_BP6_USD=character(),
stringsAsFactors=FALSE)
newrow = 1
for(i in 1:nrow(output_IMFData)) { # for each row of your cludgy df
for(j in 1:length(output_IMFData$Obs[[i]]$`#TIME_PERIOD`)) { # for each year
NewDataFrame[newrow,'iso2c']<-output_IMFData[i, '#REF_AREA']
NewDataFrame[newrow,'year']<-output_IMFData$Obs[[i]]$`#TIME_PERIOD`[j]
NewDataFrame[newrow,'IAD_BP6_USD']<-output_IMFData$Obs[[i]]$`#OBS_VALUE`[j]
newrow<-newrow + 1 # increment down a row
}
}
The following data frame contain a "Campaign" column, the value of column contain information about season, name, and position, however, the order of these information are quiet different in each row. Lucky, these information is a fixed list, so we could create a vector to match the string inside the "Campaign_name" column.
Date Campaign
1 Jan-15 Summer|Peter|Up
2 Feb-15 David|Winter|Down
3 Mar-15 Up|Peter|Spring
Here is what I want to do, I want to create 3 columns as Name, Season, Position. So these column can search the string inside the campaign column and return the matched value from the list below.
Name <- c("Peter, David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
So my desired result would be following
Temp
Date Campaign Name Season Position
1 15-Jan Summer|Peter|Up Peter Summer Up
2 15-Feb David|Winter|Down David Winter Down
3 15-Mar Up|Peter|Spring Peter Spring Up
Another way:
L <- strsplit(df$Campaign,split = '\\|')
df$Name <- sapply(L,intersect,Name)
df$Season <- sapply(L,intersect,Season)
df$Position <- sapply(L,intersect,Position)
Do the following:
Date = c("Jan-15","Feb-15","Mar-15")
Campaign = c("Summer|Peter|Up","David|Winter|Down","Up|Peter|Spring")
df = data.frame(Date,Campaign)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
for(k in Name){
df$Name[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Season){
df$Season[grepl(pattern = k, x = df$Campaign)] <- k
}
for(k in Position){
df$Position[grepl(pattern = k, x = df$Campaign)] <- k
}
This gives:
> df
Date Campaign Name Season Position
1 Jan-15 Summer|Peter|Up Peter Summer Up
2 Feb-15 David|Winter|Down David Winter Down
3 Mar-15 Up|Peter|Spring Peter Spring Up
I had the same idea as Marat Talipov; here's a data.table option:
library(data.table)
Name <- c("Peter", "David")
Season <- c("Summer","Spring","Autumn", "Winter")
Position <- c("Up","Down")
dat <- data.table(Date=c("Jan-15", "Feb-15", "Mar-15"),
Campaign=c("Summer|Peter|Up", "David|Winter|Down", "Up|Peter|Spring"))
Gives
> dat
Date Campaign
1: Jan-15 Summer|Peter|Up
2: Feb-15 David|Winter|Down
3: Mar-15 Up|Peter|Spring
Processing is then
dat[ , `:=`(Name = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Name),
Season = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Season),
Position = sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, Position))
]
Result:
> dat
Date Campaign Name Season Position
1: Jan-15 Summer|Peter|Up Peter Summer Up
2: Feb-15 David|Winter|Down David Winter Down
3: Mar-15 Up|Peter|Spring Peter Spring Up
Maybe there's some benefit if you're doing this to a lot of columns or need to modify in place (by reference).
I'm interested if anyone can show me how to update all three columns at once.
EDIT: Never mind, figured it out;
for (icol in c("Name", "Season", "Position"))
dat[, (icol):=sapply(strsplit(Campaign, "|", fixed=TRUE), intersect, get(icol))]