How to show the rows that have the same variable? - r

I would like to show the departments that use the same vendor, based on the vendor code, in a very big dataset. I guess I will need a loop for that, but I am not really sure how to start.
For example, for each vendor code I want to see all the departments that use it, but only if it is used by 3 or more departments.
see the sample of data here

Here's a base R solution.
# get the repeated values
dat_tb <- table(dat$vendor_code)
# select for the condition and print from the whole data set
dat[ dat$vendor_code %in% names(dat_tb[ dat_tb > 2 ]), ]
vendor_code department
2 9966 dept2
3 9966 dept3
8 9966 dept8
9 9966 dept9
Data:
dat <- data.frame( vendor_code=rep(c(3344,9966,9966,3444,5566,3388),2),
department=paste0("dept",1:12))
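The table() approach above counts rows per vendor code. If the real data can list the same department more than once for a vendor, a variant that counts distinct departments may be closer to the "3 or more departments" condition. A minimal base R sketch, assuming the same column names as the sample data (n_dept, keep and result are illustrative names):

```r
dat <- data.frame(
  vendor_code = rep(c(3344, 9966, 9966, 3444, 5566, 3388), 2),
  department  = paste0("dept", 1:12)
)

# number of distinct departments per vendor code
n_dept <- tapply(dat$department, dat$vendor_code,
                 function(d) length(unique(d)))

# vendor codes used by 3 or more distinct departments
keep <- names(n_dept)[n_dept >= 3]

# all rows for those vendors, grouped by vendor code
result <- dat[as.character(dat$vendor_code) %in% keep, ]
result <- result[order(result$vendor_code), ]
result
```

On the sample data this selects the same four rows as the table() solution, because every row there has a different department.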

Related

Replace a character string based on a separate list/dataframe R

I'm trying to do something that I thought would be pretty simple that has me stumped.
Say I have the following data frame:
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code))
> df
id code
1 bob_geldof blah
2 billy_bragg di
3 melvin_smith blink
And another like this:
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2))
> alternates
ID1 ID2
1 bob_geldof the_builder
2 melvin_smith kelvin
If the character string in df$id matches alternates$ID1, I'd like to replace it with alternates$ID2. If it doesn't match I'd like to just leave it as it is.
The final df should look like
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
This is obviously a silly example and my real dataset requires lots of replacements.
I've included the 'code' column to demonstrate that I'm working with a data frame and not just a character vector.
I’ve been using gsub to replace them individually but it's time consuming and the list keeps changing.
I looked into str_replace but it seems you can only specify one replacement value.
Any help would be much appreciated.
Cheers!
EDIT: Not all ids contain underscores, and I need to retain the bit that does match. E.g. bob_geldof becomes bob_the_builder.
EDIT 2(!): Thanks for your suggestions everyone. I've got round the problem by merging the data frames (so that there are NAs where there's no change to be made), and creating new IDs using an ifelse statement. It's a bit clunky but it works!
When creating the dataframes use stringsAsFactors = FALSE so as to not deal with factors. Then, if the rows are ordered, just apply:
df <- as.data.frame(cbind(id,code),stringsAsFactors = FALSE)
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = FALSE)
df$id[c(TRUE,FALSE)]=paste(gsub("(.*)(_.*)","\\1",df$id[c(TRUE,FALSE)]),
alternates$ID2,sep="_")
> df
id code
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
If they are unordered, we can use dplyr:
df%>%rowwise()%>%mutate(id=if_else(length(which(alternates$ID1==id))>0,
paste(gsub("(.*)(_.*)","\\1",id),
alternates$ID2[which(alternates$ID1==id)],sep="_"),
id))
# A tibble: 3 x 2
id code
<chr> <chr>
1 bob_the_builder blah
2 billy_bragg di
3 melvin_kelvin blink
We are using the same logic as before. Here we check df row by row. If a row's id matches any value in alternates$ID1 (checked with length()), we update it.
The following solution uses base R and is streamlined a bit. Step 1: merge the main "df" and the "alternates" df together, using a left join. Step 2: check where the ID2 value is not missing (NA) and assign those values to "id". This keeps your original id where there is no match and replaces it with ID2 where a matching ID is available.
The solution:
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
With full original data frame definitions (using stringsAsFactors=F):
id <- c("bob_geldof", "billy_bragg", "melvin_smith")
code <- c("blah", "di", "blink")
df <- as.data.frame(cbind(id,code),stringsAsFactors = F)
ID1 <- c("bob_geldof", "melvin_smith")
ID2 <- c("the_builder", "kelvin")
alternates <- as.data.frame(cbind(ID1, ID2),stringsAsFactors = F)
combined <- merge(x=df,y=alternates,by.x="id",by.y="ID1",all.x=T)
combined$id[!is.na(combined$ID2)] <- combined$ID2[!is.na(combined$ID2)]
Results: (the full merge below, you can also do combined[,c("id","code")] for the streamlined results). Here, the non-matching "billy_bragg" is kept; and the others are replaced with the matched ID
> combined
id code ID2
1 billy_bragg di <NA>
2 the_builder blah the_builder
3 kelvin blink kelvin
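The merge above replaces the whole id and reorders the rows by id. To keep the original row order and retain the leading part of the id, as the question's EDIT asks, a lookup with match() is one option. This is a sketch only, assuming that "the bit to retain" is everything before the last underscore (the same assumption the gsub pattern in the earlier answer makes); idx, hit and prefix are illustrative names:

```r
df <- data.frame(
  id   = c("bob_geldof", "billy_bragg", "melvin_smith"),
  code = c("blah", "di", "blink"),
  stringsAsFactors = FALSE
)
alternates <- data.frame(
  ID1 = c("bob_geldof", "melvin_smith"),
  ID2 = c("the_builder", "kelvin"),
  stringsAsFactors = FALSE
)

# position of each id in the lookup table (NA where there is no match)
idx <- match(df$id, alternates$ID1)
hit <- !is.na(idx)

# keep the part before the last underscore, then append the replacement
prefix <- sub("_[^_]*$", "", df$id[hit])
df$id[hit] <- paste(prefix, alternates$ID2[idx[hit]], sep = "_")
df
```

Rows without a match are left untouched, and no loop or per-id gsub call is needed when the list of replacements changes.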

Generate fixed length random id by year as character

I would like to create a random id with a fixed length of 8.
Here is sample data:
x <- data.frame(id=c(1,1,1,2,2,3,3,3,3,4,4), year=c(2001,2001,2001,2010,2010,2002,2002,2002,2002,2005,2005),x=seq(0,0.1,0.01))
My attempt:
x$new.id <- ave(x$id, x$year, FUN = function(x) rnorm(x,90000000,100000))
The random generated new.id should have equal id's for given id and year
There must be simple solution, yet I cannot find one. Thanks.
EDIT: Or otherwise how to create new 8 digit id for given number of rows.
Desired output: the column new.id should be class character
id year new.id
1 1 2001 89957391
2 1 2001 89957391
3 1 2001 89957391
4 2 2010 90331214
5 2 2010 90331214
6 3 2002 89995435
7 3 2002 89995435
8 3 2002 89995435
9 3 2002 89995435
10 4 2005 90058279
11 4 2005 90058279
You were pretty close with your coding approach (using ave in that manner), though if you want to generate only one value per group, you should pass 1 as rnorm's n parameter.
The biggest problem as I see it here, is that you want to generate a random number of class integer (and then convert to character class) while rnorm returns double by definition.
So you could potentially do this (using round or floor or ceiling)
transform(x, new.id = ave(id,
year,
FUN = function(x) as.character(round(rnorm(1, 9e7, 1e5)))))
But it seems to me that a more appropriate way would be to use sample instead
indx <- 1e7:(1e8 - 1)
transform(x, new.id = ave(id, year, FUN = function(x) as.character(sample(indx, 1))))
Edit: Now that I think about it a little more, it is possible that for a large enough data set you will get duplicated new.ids, because you are calling the sample function independently for each group. It seems to me that the best solution would be to first create a data set with one new index per id, generated by a single sample call, and then merge it back into the data set. This operation is best done with the data.table package (because of its efficient joins and the ability to add only a single column while joining). Something like the following should work
library(data.table)
y <- data.table(id = unique(x$id),
new.id = as.character(sample(indx, length(unique(x$id)))))
setkey(setDT(x), id) ; setkey(y, id)
x[y, new.id := i.new.id]
This will update your original data set by reference (without the need for a <- assignment). You can convert back to a data.frame (if you wish) by simply doing setDF(x).
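The same idea (one sampling call for all groups, then a lookup) can also be sketched in base R. This is an illustration rather than the answer's own code: it assumes a group is an id/year combination, and key, groups and codes are illustrative names. sample.int draws without replacement, so the generated ids cannot collide.

```r
x <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
  year = c(2001, 2001, 2001, 2010, 2010, 2002, 2002, 2002, 2002, 2005, 2005),
  x    = seq(0, 0.1, 0.01)
)

# one key per id/year group
key    <- paste(x$id, x$year)
groups <- unique(key)

# one draw for all groups, without replacement:
# integers in 10000000..99999999, i.e. always 8 digits
codes <- sample.int(90000000L, length(groups)) + 9999999L

# look up each row's group and store the id as character
x$new.id <- as.character(codes)[match(key, groups)]
x
```

Keeping the codes as integers before as.character avoids scientific notation such as "1e+07" for round values.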

Average across some rows in R

I have not found a way to take an average across SOME columns in R when working with a data frame table. Basically, I want to take the average of the 3 controls (CTR_R1+CTR_R2+CTR_R3) and insert that value as another column right after CTR_R3 (see below). The same for the TRT.
Is there a way to take the average and insert it in a specific location?
GeneID|CTR_R1|CTR_R2|CTR_R3|CTR_AVG|TRT_R1| TRT_R2| TRT_R3|TRT_AVG|pValue
How about
df$CTR_AVG <- rowMeans(df[,2:4])
df$TRT_AVG <- rowMeans(df[,6:8])
This code should work for you, if your data.frame is named df:
df$CTR_AVG <- ( df$CTR_R1 + df$CTR_R2 + df$CTR_R3 ) / 3
That assumes the CTR_AVG column already exists, as shown in your question. If it does not, the code will put the column at the end of the data.frame. To move it to the right spot, you will need to select the columns in the correct order, like so:
df[ , c( 'GeneID', 'CTR_R1', 'CTR_R2', 'CTR_R3', 'CTR_AVG', 'TRT_R1', 'TRT_R2', 'TRT_R3', 'TRT_AVG', 'pValue' ) ]
The code below should work even if there are many CTR or TRT columns (i.e. 100s). But I am guessing @beginneR's solution is faster.
indx <- grep("^CTR", colnames(df1), value=TRUE)
indxT <- grep("^TRT", colnames(df1), value=TRUE)
df1[,c('CTR_Avg', 'TRT_Avg')] <- lapply(list(indx, indxT),
function(x) Reduce(`+`, df1[,x])/length(x))
or you can use rowMeans in the above step.
df2 <- df1[,c('GeneID', indx, 'CTR_Avg', indxT, 'TRT_Avg', 'pValue')]
head(df2,2)
# GeneID CTR_R1 CTR_R2 CTR_R3 CTR_Avg TRT_R1 TRT_R2 TRT_R3 TRT_Avg pValue
#1 1 6 2 10 6.000000 10 11 15 12 0.091
#2 2 5 12 8 8.333333 5 3 13 7 0.051
data
set.seed(24)
df1 <- as.data.frame(matrix(sample(1:20,20*6, replace=TRUE), ncol=6))
colnames(df1) <- c("CTR_R1", "CTR_R2", "CTR_R3", "TRT_R1", "TRT_R2", "TRT_R3")
df1 <- cbind(GeneID=1:20, df1,
pValue=sample(seq(0.001, 0.10, by=0.01), 20, replace=TRUE))
# make some dummy data
df=data.frame(CTR_R1=1:10,CTR_R2=1:10,CTR_R3=1:10,somethingelse=1:10)
# get a new column
df$CTR_AVG=apply(df[c("CTR_R1","CTR_R2","CTR_R3")],1,mean)
Thanks so much for your replies. I am sorry I did not phrase my original question better. I meant to ask how to write one script to take the average and place that value in the right place. I do not have in my table the column that says "CTR_AVG", nor the column "TRT_AVG".
I was wondering if I could do it more 'elegantly' than what I did below (which works too).
Many thanks.
#
names (edgeR_table)
"GeneID" "CTR_R1" "CTR_R2" "CTR_R3" "TRT_R1" "TRT_R2" "TRT_R3" "logFC" "logCPM" "LR" "PValue" "FDR"
#
edgeR_table$CTR_AVG <- rowMeans(edgeR_table[,2:4])
edgeR_table$TRT_AVG <- rowMeans(edgeR_table[,5:7])
edgeR_table <- edgeR_table[, c(1,2,3,4,13,5,6,7,14,8,9,10,11,12)]
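One way to avoid hard-coding the column positions is a small helper that finds the replicate columns by name and splices the average in right after the last one. This is a sketch under assumptions: add_group_mean is a hypothetical helper, the replicate columns are assumed to be named like PREFIX_R1, PREFIX_R2, ..., and at least one other column is assumed to follow them. The toy table stands in for edgeR_table.

```r
# hypothetical helper: insert <prefix>_AVG right after the last <prefix>_R* column
add_group_mean <- function(df, prefix) {
  cols <- grep(paste0("^", prefix, "_R"), names(df))  # replicate column positions
  avg  <- rowMeans(df[cols])
  last <- max(cols)
  out  <- cbind(df[seq_len(last)], avg, df[-seq_len(last)])
  names(out)[last + 1] <- paste0(prefix, "_AVG")
  out
}

# toy data standing in for the real table
tab <- data.frame(
  GeneID = 1:3,
  CTR_R1 = c(1, 2, 3), CTR_R2 = c(2, 3, 4), CTR_R3 = c(3, 4, 5),
  TRT_R1 = c(4, 5, 6), TRT_R2 = c(5, 6, 7), TRT_R3 = c(6, 7, 8),
  pValue = c(0.01, 0.02, 0.03)
)

tab <- add_group_mean(tab, "CTR")
tab <- add_group_mean(tab, "TRT")
names(tab)
```

Because the columns are located with grep, the same two calls keep working if more replicates or extra statistics columns are added.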

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (DOV) was less than 180 days after their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs have their first visit within 180 days of their diagnosis, which I could then use to go through my original test data and make an indicator variable. But I'd have thought I should be able to do this in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added stringsAsFactors=FALSE to the data.frame call:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
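Both answers can be condensed into a single ave() call on the day differences. The sketch below is one possible variant, not the answerers' code: it takes the earliest visit per ID with min(), so it does not depend on the rows being sorted, and within180 is an illustrative column name.

```r
test <- data.frame(
  ID  = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = as.Date(c(rep("2000-03-01", 3), rep("2002-05-01", 4),
                  rep("2006-09-01", 2))),
  DOV = as.Date(c("2000-03-05", "2000-06-05", "2000-09-05",
                  "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
                  "2006-10-03", "2007-02-05"))
)

# days between diagnosis and each visit
days <- as.numeric(test$DOV - test$DOD)

# per ID: 1 if the earliest visit is within 180 days, repeated on every row
test$within180 <- ave(days, test$ID,
                      FUN = function(d) as.numeric(min(d) < 180))
test$within180
```

ave() recycles the single value returned for each ID across all rows of that ID, so the result has one entry per row and can be stored directly as a column.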

R: Stacking Multiple Punch Question Data

Suppose we have 2 questions in a survey, one is about how likely an individual is to recommend a company (let's say there's 2 companies for simplicity).
So, I have one data.frame with 2 columns for this question:
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")
And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.
So, I have another data.frame with 4 columns for this question:
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"Attribute1.Company1",
"Attribute2.Company1",
"Attribute1.Company2",
"Attribute2.Company2")
Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelihood to recommend question, for all companies (company independent). Just to get an idea of what inertia lies between those people that are highly likely to recommend and attribute 1 for example.
So, I start off by binding the two questions together:
df <- cbind(df.recommend, df.attribute)
My problem is trying to figure out how to stack these data such that the columns look something like:
df.stacked <- data.frame(c(df$Company1,df$Company2),
c(df$Attribute1.Company1,df$Attribute1.Company2),
c(df$Attribute2.Company1,df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood","Attribute1","Attribute2")
This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.
Could you think of a way to stack them effectively, without having to type out all the c() statements?
Note: The column pattern for likelihood is Co1,Co2,Co3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...
For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).
Here is what I believe to be a complete solution in about a dozen lines of code.
First, create some data
library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)
# Create data frame
# Important: note the addition of a respondent id column
df_comp <- data.frame(
RespID = 1:10,
Company1 = rep(1:5, 2),
Company2 = rep(1:5, 2)
)
df_attr <- data.frame(
RespID = 1:10,
Attribute1.Company1 = rep(0:1,5),
Attribute2.Company1 = rep(1:0,5),
Attribute1.Company2 = rep(0:1,5),
Attribute2.Company2 = rep(1:0,5)
)
Now start the data manipulation:
# Use melt to convert data from wide to tall
melt_comp <- melt(df_comp, id.vars="RespID")
melt_comp <- rename(melt_comp, c(variable="comp", value="likelihood"))
melt_attr <- melt(df_attr, id.vars="RespID")
# Use str_split to split attribute variables into attribute and company
# "." period needs to be escaped
# EDIT 2: reshape2::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names=c("attr", "comp"))
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL
# Use dcast (the reshape2 counterpart of reshape's cast) to convert from tall to somewhat tall
cast_attr <- dcast(melt_attr, RespID + comp ~ attr, fun.aggregate = mean, value.var = "value")
# Combine data frames using join() in package plyr
df <- join(melt_comp, cast_attr)
head(df)
And the output:
RespID comp likelihood Attribute1 Attribute2
1 1 Company1 1 0 1
2 2 Company1 2 1 0
3 3 Company1 3 0 1
4 4 Company1 4 1 0
5 5 Company1 5 0 1
6 6 Company1 1 1 0
Something I quickly cooked up. Doesn't look the best and uses a for-loop but that shouldn't be a problem with only 24 values
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"At1.Co1",
"At2.Co1",
"At1.Co2",
"At2.Co2")
df.stacked <- data.frame(
likelihood = unlist(df.recommend)
)
str <- strsplit(names(df.attribute),split="\\.")
atts <- unique(sapply(str,function(x)x[1]))
for (i in 1:length(atts))
{
df.stacked[,i+1] <- unlist(df.attribute[sapply(str,function(x)x[1]==atts[i])])
}
names(df.stacked) <- c("likelihood",paste("attribute",1:length(atts),sep=""))
EDIT: It assumes that companies are in the same order for each attribute
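The order assumption can be dropped by selecting the attribute columns by name instead of by position. The following base R sketch assumes the naming pattern from the question (Attribute.Company, with the company names matching the likelihood column names); companies, attrs and stack_attr are illustrative names.

```r
df.recommend <- data.frame(Company1 = rep(1:5, 20), Company2 = rep(1:5, 20))
df.attribute <- data.frame(
  Attribute1.Company1 = rep(0:1, 50),
  Attribute2.Company1 = rep(1:0, 50),
  Attribute1.Company2 = rep(0:1, 50),
  Attribute2.Company2 = rep(1:0, 50)
)

companies <- names(df.recommend)
# attribute names are the part before the first period
attrs <- unique(sub("\\..*$", "", names(df.attribute)))

# stack one attribute across all companies, in the same order as 'companies'
stack_attr <- function(a) {
  unlist(df.attribute[paste(a, companies, sep = ".")], use.names = FALSE)
}

df.stacked <- data.frame(
  Likelihood = unlist(df.recommend, use.names = FALSE),
  sapply(attrs, stack_attr)
)
head(df.stacked)
```

With 34 companies and 24 attributes the same code applies unchanged, since both lists are derived from the column names.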

Resources