Suppose we have 2 questions in a survey, one is about how likely an individual is to recommend a company (let's say there's 2 companies for simplicity).
So, I have one data.frame with 2 columns for this question:
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")
And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.
So, I have another data.frame with 4 columns for this question:
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"Attribute1.Company1",
"Attribute2.Company1",
"Attribute1.Company2",
"Attribute2.Company2")
Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelyhood to recommend question, for all companies (company independent). Just to get an idea of what inertia lies between those people that are highly likely to recommend and attribute 1 for example.
So, I start off by binding the two questions together:
df <- cbind(df.recommend, df.attribute)
My problem is trying to figure out how to stack these data such that the columns look something like:
df.stacked <- data.frame(c(df$Company1,df$Company2),
c(df$Attribute1.Company1,df$Attribute1.Company2),
c(df$Attribute2.Company1,df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood","Attribute1","Attribute2")
This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.
Could you think of a way to stack them effectively, without having to type out all the c() statements?
Note: The column pattern for likelyhood is Co1,Co2,Co3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...
For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).
Here is what I believe to be a complete solution in about a dozen lines of code.
First, create some data
library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)
# Create data frame
# Important: note the addition of a respondent id column
df_comp <- data.frame(
RespID = 1:10,
Company1 = rep(1:5, 2),
Company2 = rep(1:5, 2)
)
df_attr <- data.frame(
RespID = 1:10,
Attribute1.Company1 = rep(0:1,5),
Attribute2.Company1 = rep(1:0,5),
Attribute1.Company2 = rep(0:1,5),
Attribute2.Company2 = rep(1:0,5)
)
Now start the data manipulation:
# Use melt to convert data from wide to tall
melt_comp <- melt(df_comp, id.vars="RespID")
melt_comp <- rename(melt_comp, c(variable="comp", value="likelihood"))
melt_attr <- melt(df_attr, id.vars="RespID")
# Use str_split to split attribute variables into attribute and company
# "." period needs to be escaped
# EDIT 2: reshape::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names=c("attr", "comp"))
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL
# Use cast to convert from tall to somewhat tall
cast_attr <- cast(melt_attr, RespID + comp ~ attr, mean)
# Combine data frames using join() in package plyr
df <- join(melt_comp, cast_attr)
head(df)
And the output:
RespID comp likelihood Attribute1 Attribute2
1 1 Company1 1 0 1
2 2 Company1 2 1 0
3 3 Company1 3 0 1
4 4 Company1 4 1 0
5 5 Company1 5 0 1
6 6 Company1 1 1 0
Something I quickly cooked up. Doesn't look the best and uses a for-loop but that shouldn't be a problem with only 24 values
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"At1.Co1",
"At2.Co1",
"At1.Co2",
"At2.Co2")
df.stacked <- data.frame(
likelihood <- unlist(df.recommend)
)
str <- strsplit(names(df.attribute),split="\\.")
atts <- unique(sapply(str,function(x)x[1]))
for (i in 1:length(atts))
{
df.stacked[,i+1] <- unlist(df.attribute[sapply(str,function(x)x[1]==atts[i])])
}
names(df.stacked) <- c("likelihood",paste("attribute",1:length(atts),sep=""))
EDIT: It assumes that companies are in the same order for each attribute
Related
I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c(depression_male, depression_female, depression_hsgrad, depression_collgrad))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.
Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" this will work well. Or maybe the standard name is always after an underscore, so you could use pattern = ".*_" to replace everything before the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] character
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1
I'm trying to do something that I thought would be pretty simple that has me stumped.
Say I have the following data frame:
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code))
> df
id code
1 bob_geldof blah
2 billy_bragg di
3 melvin_smith blink
And another like this:
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2))
> alternates
ID1 ID2
1 bob_geldof the_builder
2 melvin_smith kelvin
If the character string in df$id matches alternates$ID1, I'd like to replace it with alternates$ID2. If it doesn't match I'd like to just leave it as it is.
The final df should look like
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
This is obviously a silly example and my real dataset requires lots of replacements.
I've included the 'code' column to demonstrate that I'm working with a data frame and not just a character vector.
I’ve been using gsub to replace them individually but it's time consuming and the list keeps changing.
I looked into str_replace but it seems you can only specify one replacement value.
Any help would be much appreciated.
Cheers!
EDIT: Not all ids contain underscores, and I need to retain the bit that does match. E.g. bob_geldolf becomes bob_the_builder.
EDIT 2(!): Thanks for your suggestions everyone. I've got round the problem by merging the data frames (so that there are NAs where there's no change to be made), and creating new IDs using an ifelse statement. It's a bit clunky but it works!
When creating the dataframes use stringsAsFactors = FALSE so as to not deal with factors. Then, if the rows are ordered, just apply:
df <- as.data.frame(cbind(id,code),stringsAsFactors = FALSE)
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = FALSE)
df$id[c(TRUE,FALSE)]=paste(gsub("(.*)(_.*)","\\1",df$id[c(TRUE,FALSE)]),
alternates$ID2,sep="_")
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
If they are unordered, we can use dlyr:
df%>%rowwise()%>%mutate(id=if_else(length(which(alternates$ID1==id))>0,
paste(gsub("(.*)(_.*)","\\1",id),
alternates$ID2[which(alternates$ID1==id)],sep="_"),
id))
# A tibble: 3 x 2
id code
<chr> <chr>
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
We are using the same logic as before. Here we check the df by row. If its id matches any of alternatives$ID1 (checked by length()), we update it.
The following solution uses base-R and is streamlined a bit. Step 1: merge the main "df" and the "alternates" df together, using a left-join. Step 2: check where there the ID2 value is not missing (NA) and then assign those values to "id". This will keep your original id where available; and replace it with ID2 where those matching IDs are available
The solution:
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
With full original data frame definitions (using stringsAsFactors=F):
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code),stringsAsFactors = F)
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = F)
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
Results: (the full merge below, you can also do combined[,c("id","code")] for the streamlined results). Here, the non-matching "billy_bragg" is kept; and the others are replaced with the matched ID
> combined
id code ID2
1 billy_bragg di <NA>
2 the_builder blah the_builder
3 kelvin blink kelvin
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapplyor lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
Apologises for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subject whose first vist (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine, it makes a 'lookup' table where I can see which IDs have a their first visit within 180 days of their diagnosis, which I could then take my original test and go through and make an indicator variable, but I should be able to do this is one step I'd have thought.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added to data.frame function stringsAsFactors=FALSE:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)