Using frequency of column value in dataframe to calculate new column value - r

So I have an example dataframe that holds the columns id, count and username with id and count being numbers and username being a string.
For every row of the dataframe I want to set a value of a new column called 'ratio', with ratio being defined as
count / number of rows where username == the username in this row
Example from the provided data:
In every row where the username is 'Tom' the ratio would be count/4 , because the user Tom is found four times in the data.
This is just a simplified version of my problem. A for-loop is not an option: my original dataframe has about 3.4 million rows, and my previous approach, which used for-loops to iterate over the unique values of e.g. 'username', takes forever.
dput of my dataframe:
structure(list(id = 1:20, count = c(140L, 89L, 17L, 114L, 129L,
86L, 21L, 50L, 197L, 160L, 8L, 14L, 78L, 208L, 155L, 55L, 63L,
20L, 189L, 79L), usernames = structure(c(4L, 3L, 5L, 5L, 2L,
3L, 1L, 1L, 3L, 1L, 3L, 2L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 3L), .Label = c("Jerry",
"Mark", "Phil", "Tina", "Tom"), class = "factor")), .Names = c("id",
"count", "usernames"), row.names = c(NA, 20L), class = "data.frame")
I hope I provided everything for you to understand and reproduce the problem, if something's missing don't hesitate to mention it in the comments.

There are several options. Here are three: one in base R, one with data.table, and one with "plyr". All assume we're starting with a data.frame named "mydf":
Base R
within(mydf, {
temp <- as.numeric(ave(as.character(usernames), usernames, FUN = length))
ratio <- count/temp
rm(temp)
})
data.table
library(data.table)
DT <- data.table(mydf)
DT[, ratio := count/.N, by = "usernames"]
DT
plyr
library(plyr)
ddply(mydf, .(usernames), transform,
ratio = count/length(usernames))

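You can use ave for this. Pass count (not the factor usernames) as the first argument, because ave writes its result back into a copy of that argument and group sizes cannot be stored in a factor:
transform(d, ratio = count / ave(count, usernames, FUN = length))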
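For what it's worth, the per-user count can also be looked up from table(), which keeps everything vectorised. This is only a sketch: it rebuilds the question's data as mydf (with the factor written out by name) so that it runs on its own, and n_per_user is a name made up for illustration.

```r
# The question's data, with the usernames factor spelled out
mydf <- data.frame(
  id = 1:20,
  count = c(140L, 89L, 17L, 114L, 129L, 86L, 21L, 50L, 197L, 160L,
            8L, 14L, 78L, 208L, 155L, 55L, 63L, 20L, 189L, 79L),
  usernames = factor(c("Tina", "Phil", "Tom", "Tom", "Mark", "Phil", "Jerry",
                       "Jerry", "Phil", "Jerry", "Phil", "Mark", "Tom", "Tom",
                       "Tina", "Tina", "Mark", "Mark", "Mark", "Phil"))
)

# Number of rows per username, computed once
n_per_user <- table(mydf$usernames)

# Look up each row's group size by name and divide
mydf$ratio <- mydf$count / as.vector(n_per_user[as.character(mydf$usernames)])

mydf[mydf$usernames == "Tom", ]
```

For Tom, who appears four times, every ratio is count/4, as the question asks.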

ggplot2 - How can I change facet label text using another dataframe as lookup table

I use ggplot 2.2.0 and R version 3.3.2 w64
According to http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/ I can specify a function to provide the facet labels.
I plot patient data of a study:
I have a dataframe with the Ids and the data, and I have a second dataframe containing some general information (age and gender)
patmeta <- data.frame(
"pat_id"=c(66, 103, 219, 64, 62, 111, 232),
"gender"=c("f","f","f", "m","f", "f", "f"),
"age"=c(56, 32, 73, 58,37,33,52))
I defined a global labeller function and a special one for my pat_id (pat_id_fac is the same as pat_id but as a factor, pat_id is numeric)
PatIdLabeller <- function(id) {
res <- sprintf("Pat %s (%i y, %s)", id,
subset(patmeta, pat_id == id)$age,
subset(patmeta, pat_id == id)$gender)
return(res)
}
globalLabeller <- labeller(
pat_id_fac = PatIdLabeller,
pat_id = PatIdLabeller,
.default = label_both
)
Testing the PatIdLabeller function gives the desired output (though I think using subset is not the most elegant way to do it), e.g.
> PatIdLabeller('103')
[1] "Pat 103 (32 y, f)"
But when I use it in ggplot, the IDs are correct, but age and gender are the same for all facets (those of the last row of patmeta), as you see in the picture.
A subset of my qdat is the following
structure(list(pat_id = c(103L, 103L, 103L, 64L, 64L, 64L, 66L,
66L, 66L, 219L, 219L, 219L, 62L, 62L, 62L, 111L, 111L, 111L,
232L, 232L, 232L), pat_id_fac = structure(c(4L, 4L, 4L, 2L, 2L,
2L, 3L, 3L, 3L, 6L, 6L, 6L, 1L, 1L, 1L, 5L, 5L, 5L, 7L, 7L, 7L
), .Label = c("62", "64", "66", "103", "111", "219", "232"),
class = c("ordered", "factor")),
Activity = structure(c(9L, 3L, 9L, 2L, 9L, 9L, 9L,
2L, 2L, 3L, 8L, 4L, 2L, 2L, 2L, 4L, 4L, 7L, 2L, 2L, 9L), .Label = c("",
"Anderes", "Essen", "Hausarbeit", "Hobbies", "Körperpflege",
"Liegen", "Medienkonsum", "Sozialer Kontakt"), class = "factor")),
.Names = c("pat_id", "pat_id_fac", "Activity"), row.names = c(1L, 2L, 3L,
128L, 129L, 130L, 199L, 200L, 201L, 217L, 218L, 219L, 343L, 344L, 345L,
397L, 398L, 399L, 451L, 452L, 453L), class = "data.frame")
g.bar.activities <-
ggplot(data=qdat, aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id_fac, labeller= globalLabeller)
From other questions and answers, I know I could define a character vector, but I am lazy and would like to do it more elegantly by reusing my patmeta, because the list of study participants will become quite long and evolve over time.
With smaller test data set
t <- data.frame("pat_id"=c(103, 103, 103, 219, 219, 219),
"Activity" = c("sleep", "sleep", "eat", "eat", "eat", "sleep"))
patmeta <- data.frame("pat_id"=c(103, 219),
"gender"=c("m","f"), "age"=c(32,52))
ggplot(data=t, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=globalLabeller)
I get exactly what I want. I don't see the difference.
It appears that the subsetting is not working properly, likely because the == is trying to act as a vector along the length of all of the id's being passed in. That is, it is checking each pat_id in patmeta to see if it matches the pat_id passed in. The differences in sorting are somehow leaving only that one pat_id matching.
You can see this in action if you try any of the following:
PatIdLabeller(c(103, 66))
gives character(0) and this warning:
In pat_id == id : longer object
length is not a multiple of shorter object length
because none of the rows return, and R is forced to repeat the elements in the ==
ggplot(data=head(qdat), aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id, labeller= globalLabeller)
gives a plot with duplicated age/gender again, and this warning
In pat_id == id : longer object length is not a
multiple of shorter object length
(ditto above).
Of note, even with your smaller data set, if you reverse the row order of your new patmeta (so that 219 is before 103), then run the code you get
Error in FUN(X[[i]], ...) : Unknown input
because the labeller is returning an empty character() (as above).
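To make the recycling problem concrete, here is a small base R sketch. It assumes the labeller receives the whole vector of facet values at once (as the experiments above suggest), and lookupPatLabel is a hypothetical helper name, not part of ggplot2. It contrasts == with match(), which looks up each id separately.

```r
patmeta <- data.frame(
  pat_id = c(66, 103, 219, 64, 62, 111, 232),
  gender = c("f", "f", "f", "m", "f", "f", "f"),
  age    = c(56, 32, 73, 58, 37, 33, 52),
  stringsAsFactors = FALSE
)

# Facet values arrive as a vector, not one id at a time
ids <- c("103", "64", "66")

# `==` pairs elements up by position (recycling the shorter vector),
# so no row of patmeta happens to line up with a matching id here
recycled <- suppressWarnings(patmeta$pat_id == ids)

# match() returns, for each id, the row of patmeta that holds it
lookupPatLabel <- function(ids) {
  rows <- match(as.numeric(ids), patmeta$pat_id)
  sprintf("Pat %s (%d y, %s)", ids,
          as.integer(patmeta$age[rows]), patmeta$gender[rows])
}

lookupPatLabel(ids)
```

Because match() preserves the order and length of ids, the labels stay aligned with the facets regardless of how patmeta is sorted.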
I don't have a lot of experience with labellers (this answer was a good chance to explore them), but this one should work by using left_join from dplyr, rather than trying to use ==.
myLabeller <- function(x){
lapply(x,function(y){
toLabel <-
data.frame(pat_id = y) %>%
left_join(patmeta)
paste0("Pat ", toLabel$pat_id
, " (", toLabel$age, "y, "
, toLabel$gender, ")")
})
}
Using it gives:
ggplot(data=qdat, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=myLabeller) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
An alternative option would be to skip the labeller step, and just generate the label you actually want to use directly. Here, just merge the meta data with the patient data (using left_join from dplyr), then generate a column using the format/style that you want (here, using mutate from dplyr and paste0).
forPlotting <-
qdat %>%
left_join(patmeta) %>%
mutate(forFacet = paste0("Pat ", pat_id
, " (", age, "y, "
, gender, ")"))
Then, use that data for plotting, and the new column for faceting.
ggplot(forPlotting, aes(x=Activity)) +
geom_bar() +
facet_wrap(~forFacet) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
gives
Note that the facets are now sorted alphabetically, but you could adjust that as needed by setting the column as a factor with explicitly sorted levels when you make it.
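As a rough sketch of that last point, the label column can be turned into a factor whose levels follow the numeric pat_id. The data frame below is a made-up miniature of forPlotting, just enough to show the ordering.

```r
# Miniature stand-in for forPlotting (illustrative values only)
forPlotting <- data.frame(
  pat_id   = c(103, 103, 64, 64, 219),
  forFacet = c("Pat 103 (32y, f)", "Pat 103 (32y, f)",
               "Pat 64 (58y, m)", "Pat 64 (58y, m)",
               "Pat 219 (73y, f)"),
  stringsAsFactors = FALSE
)

# Order the labels by numeric id, then use that order as the factor levels
ord <- order(forPlotting$pat_id)
forPlotting$forFacet <- factor(
  forPlotting$forFacet,
  levels = unique(forPlotting$forFacet[ord])
)

levels(forPlotting$forFacet)
```

facet_wrap follows the level order of a factor, so patient 64 would then come before 103 and 219 instead of after them.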

Subset list based on a condition in R

For a sample dataframe:
df <- structure(list(area = structure(c(1L, 4L, 3L, 8L, 5L, 7L, 6L,
2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9", "DEA", "DEB",
"DEC"), class = "factor"), to.delete = c(1L, 0L, 1L, 0L, 1L,
1L, 1L, 0L)), .Names = c("area", "to.delete"), class = "data.frame", row.names = c(NA,
-8L))
I want to create a list of the areas which have a '1' in the 'to.delete' column. I know how to subset the 1s out of this dataframe, however I want the list of areas as eventually I will use this list to extract these areas from the main master data file (df2, listed below).
df2 <- structure(list(id = 1:24, area = structure(c(1L, 1L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 5L, 7L, 7L, 7L, 6L, 6L,
2L, 2L, 2L, 2L), .Label = c("DE1", "DE3", "DE4", "DE5", "DE9",
"DEA", "DEB", "DEC"), class = "factor")), .Names = c("id", "area"
), class = "data.frame", row.names = c(NA, -24L))
I prefer to do this in two steps, so I can easily see which areas I have deleted (thanks to answers below for suggestions of using list).
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[df2$area %in% a,]
This however doesn't seem to work at the moment, so if anyone has any ideas, then that would be great.
df2 should then be left with only areas DE5, DEC and DE3.
Many thanks.
Here is another method using split to collect the areas into two lists:
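# get two lists of areas and give list items appropriate names
# (split() orders the groups as "0", "1", so to.delete == 0 is "keep" and to.delete == 1 is "drop")
keepDrop <- setNames(split(df$area, df$to.delete), c("keep", "drop"))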
# now perform dropping
df2.smaller <- df2[df2$area %in% keepDrop[["keep"]],]
We can use subset. Based on the description, the OP wants to subset the rows of a main data ('maindata') based on the 'area' that corresponds to 1 in 'to.delete' column. In that case, we extract the 'area' (df$area[df$to.delete ==1]) and with %in% we subset the 'maindata'.
subset(maindata, area %in% df$area[df$to.delete==1])
It's not too clear what you are asking.
This will create a list where each element is a different Area:
lapply(df$area[df$to.delete == 1], function(x) x)
If you want a list with just one element containing all the areas:
list(df$area[df$to.delete == 1])
Edit:
To answer the second part of your question:
a <- list(df$area[df$to.delete == 1])
df2.subset <- df2[!df2$area %in% a[[1]], ]
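The original attempt fails because a is a list whose single element is the whole vector of areas, so %in% has no individual areas to compare against. A plain character vector avoids the problem. The sketch below rebuilds df and df2 from the question so it runs on its own; to_drop is just an illustrative name.

```r
areas <- c("DE1", "DE5", "DE4", "DEC", "DE9", "DEB", "DEA", "DE3")

df <- data.frame(
  area      = factor(areas),
  to.delete = c(1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L)
)
df2 <- data.frame(
  id   = 1:24,
  area = factor(rep(areas, times = c(2, 3, 5, 4, 1, 3, 2, 4)))
)

# Step 1: the areas flagged for deletion, kept as a plain character vector
to_drop <- as.character(df$area[df$to.delete == 1])

# Step 2: keep only the rows whose area is not in that vector
df2.subset <- df2[!df2$area %in% to_drop, ]

# Optional: forget the factor levels that no longer occur
df2.subset <- droplevels(df2.subset)
```

This keeps the two-step workflow: to_drop can be inspected before it is applied, and df2.subset ends up with only DE5, DEC and DE3.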
Here's what you can try:
a <- as.list(subset(df,df$to.delete == 1))
> a
$area
[1] DE1 DE4 DE9 DEB DEA
Levels: DE1 DE3 DE4 DE5 DE9 DEA DEB DEC
$to.delete
[1] 1 1 1 1 1

how do you subset a data frame based on a variable name

my data frame called d:
dput(d)
structure(list(Hostname = structure(c(8L, 8L, 9L, 5L, 6L, 7L,
1L, 2L, 3L, 4L), .Label = c("db01", "db02", "farm01", "farm02",
"tom01", "tom02", "tom03", "web01", "web03"), class = "factor"),
Date = structure(c(6L, 10L, 5L, 3L, 2L, 1L, 8L, 9L, 7L, 4L
), .Label = c("10/5/2015 1:15", "10/5/2015 1:30", "10/5/2015 2:15",
"10/5/2015 4:30", "10/5/2015 8:30", "10/5/2015 8:45", "10/6/2015 8:15",
"10/6/2015 8:30", "9/11/2015 5:00", "9/11/2015 6:00"), class = "factor"),
Cpubusy = c(31L, 20L, 30L, 20L, 18L, 20L, 41L, 21L, 29L,
24L), UsedPercentMemory = c(99L, 98L, 95L, 99L, 99L, 99L,
99L, 98L, 63L, 99L)), .Names = c("Hostname", "Date", "Cpubusy",
"UsedPercentMemory"), class = "data.frame", row.names = c(NA,
-10L))
In a loop I need to go through this data frame based on the metrics variable; I need to create a subset data frame for summarization:
metrics<-as.vector(unique(colnames(d[,c(3:4)])))
for (m in metrics){
sub<-d[,c(1,m)]
}
I cannot use m in this subset line, any ideas how I could subset data frame based on a variable name?
In your subsetting call you are mixing a column index and a column name. c(1, m) coerces the index 1 to the string "1", and R then looks for a column named "1", which does not exist.
Either use column names:
for (m in metrics) {
sub <- d[, c(colnames(d)[1], m)]
}
Or indexes:
for (i in 3:4) {
sub <- d[, c(1, i)]
}
Having said that, for loops in R are usually for cases where dynamic assignments are needed or for calling functions with side effects or some other relatively unusual case. Creating a summary by slicing and dicing data in for loops is almost never the proper way to do it in R. If the usual functional tools are not enough there are fantastic packages like plyr, dplyr, etc that let you split-apply-combine your data in very convenient and idiomatic ways.
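As one possible loop-free version in base R, aggregate() can summarise every metric per host in a single call. This is a sketch only: the data frame repeats the question's values without the Date column, and the choice of mean as the summary is an assumption, since the question does not say which summary is wanted.

```r
d <- data.frame(
  Hostname = c("web01", "web01", "web03", "tom01", "tom02",
               "tom03", "db01", "db02", "farm01", "farm02"),
  Cpubusy = c(31L, 20L, 30L, 20L, 18L, 20L, 41L, 21L, 29L, 24L),
  UsedPercentMemory = c(99L, 98L, 95L, 99L, 99L, 99L, 99L, 98L, 63L, 99L)
)
metrics <- c("Cpubusy", "UsedPercentMemory")

# A column name stored in a variable works as long as all selectors are names
sub <- d[, c("Hostname", metrics[1])]

# Per-host mean of every metric at once (mean is an assumed summary)
host_means <- aggregate(d[metrics], by = list(Hostname = d$Hostname), FUN = mean)
host_means
```

host_means has one row per host and one column per metric, so no intermediate sub data frames are needed.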

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove the column 18 as follow:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Is there a mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example, ProduceCode: are you sure this should be part of your similarity?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.
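A minimal base R sketch of the kind of preprocessing meant here, on made-up data (ufo_demo and its values are purely illustrative). It checks two things before any distance-based method: that the numeric conversion produced real values rather than factor codes or NAs, and that the columns are on a comparable scale. Whether missing values are what causes the NA scores in the question is an assumption; the question does not show enough to confirm it.

```r
# Toy data: Weight was read in as a factor and contains one bad entry
ufo_demo <- data.frame(
  Weight       = c("900", "850", "483", "n/a"),
  InvoiceValue = c(637, 775, 2896, 48812),
  stringsAsFactors = TRUE
)

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(ufo_demo$Weight)

# Going through as.character() recovers the values; unparseable entries become NA
ufo_demo$Weight <- suppressWarnings(as.numeric(as.character(ufo_demo$Weight)))

# Count missing values per column and keep only complete rows
na_per_column <- colSums(is.na(ufo_demo))
complete <- ufo_demo[complete.cases(ufo_demo), ]

# Standardise each column to mean 0 and standard deviation 1
scaled <- scale(complete)
```

Standardising does not decide which columns belong in the distance at all (an identifier such as ProduceCode probably does not), but it stops a single large-valued column from dominating a Euclidean distance.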

melting multiple spans of variables

(still) new to r, and very confused as to how I should accomplish multiple melts of my data. Here is a subset:
df <- structure(list(Subject = c(101L, 101L, 101L, 102L, 102L, 102L
), Condition = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("apass",
"vpas"), class = "factor"), FreqCode = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("LessVerbal", "MoreVerbal"), class = "factor"),
Item = c(1L, 4L, 7L, 1L, 4L, 7L), Len = c(80L, 68L, 85L,
68L, 85L, 79L), R1_1.RT = c(237L, 203L, 207L, 336L, 487L,
340L), R1_2.RT = c(177L, 225L, 162L, 634L, 590L, 347L), R1_3.RT = c(200L,
226L, 212L, 707L, 653L, 379L), R1.RT = c(614L, 654L, 581L,
1677L, 1730L, 1066L), R1_1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "The", class = "factor"), R1_2 = structure(c(3L,
1L, 2L, 1L, 2L, 4L), .Label = c("antique", "course", "new",
"road"), class = "factor"), R1_3 = structure(c(4L, 1L, 2L,
1L, 2L, 3L), .Label = c("car", "materials", "surfaces", "technology"
), class = "factor"), R1 = structure(c(3L, 1L, 2L, 1L, 2L,
4L), .Label = c("The antique car", "The course materials",
"The new technology", "The road surfaces"), class = "factor")), .Names = c("Subject",
"Condition", "FreqCode", "Item", "Len", "R1_1.RT", "R1_2.RT",
"R1_3.RT", "R1.RT", "R1_1", "R1_2", "R1_3", "R1"), class = "data.frame", row.names =
c(NA,
-6L))
My goal is to get output that (in part) looks like this:
Region RT WordRegion Word
R1_1.RT 237 R1_1 the
...
R1_2.RT 177 R1_2 new
...
EDIT: The variable ending with ".RT" (e.g., R1_1.RT) are Region names and will be melted into a Region column. The variables ending in numbers (e.g., R1_1) correspond exactly to the Region names and their associated values. I want them to be melted alongside the Region names so that I can analyze them in relation to the Region column
In the first part of the code, I melt all of the values into a Region column and change the value to RT. This seems to work fine:
#long transform (with individual regions at end)
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
#change newly created column name to "RT" (note:you have to change the number in [] to match your data)
colnames(SmallMelt1)[11 ] <- "RT"
But I don't get how to simultaneously melt another span of variables such that they will line up vertically with the first span. I want to do something like this, after the first melt, but it does not work:
#Second Melt for region names (doesn't work)
SmallMelt2 = melt(SmallMelt1, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#Change name to Word
colnames(SmallMelt2)[9] <- "Word" #add col number for "value" here
Please let me know if you need any clarification. I hope someone can help... thanks in advance - DT
So, after consulting with someone off-list, I found the solution. My mistake was that I was trying to run the second step on the output of the first step. By running the two steps independently on the original data and then concatenating, I get the right result.
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
SmallMelt2 = melt(df, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
SmallMelt3=cbind(SmallMelt1,SmallMelt2[,11])
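The cbind lines up because both melts stack the columns in the order they are listed, so row i of one result describes the same subject and region as row i of the other. The same idea can be sketched in base R without melt. df_demo is a two-row excerpt of the data above with the word columns as plain strings, and the helper names are made up for illustration.

```r
df_demo <- data.frame(
  Subject = c(101L, 101L),
  R1_1.RT = c(237L, 203L), R1_2.RT = c(177L, 225L),
  R1_3.RT = c(200L, 226L), R1.RT   = c(614L, 654L),
  R1_1 = c("The", "The"), R1_2 = c("new", "antique"),
  R1_3 = c("technology", "car"),
  R1   = c("The new technology", "The antique car"),
  stringsAsFactors = FALSE
)

rt_cols   <- c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT")
word_cols <- sub("\\.RT$", "", rt_cols)  # matching word columns, same order
n <- nrow(df_demo)

# Stack both spans column by column, in the same order, so they stay aligned
long <- data.frame(
  Subject    = rep(df_demo$Subject, times = length(rt_cols)),
  Region     = rep(rt_cols, each = n),
  RT         = unlist(df_demo[rt_cols], use.names = FALSE),
  WordRegion = rep(word_cols, each = n),
  Word       = unlist(df_demo[word_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

Each row of long now pairs a reading time with the word it belongs to, which is the layout sketched in the question.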
