How to drop multiple columns which have categorical values in R?

I know how to drop columns by name, but I am not quite sure how to drop the columns that hold categorical values. It can be done manually by looking at which columns are categorical, but that is not intuitive to me in R code. How can I detect the columns which hold categorical values? Any way to make this happen?
Minimal data:
mydf=structure(list(taken_time = c(15L, 5L, 39L, -21L, 46L, 121L),
ap6xl = c(203.2893857, 4.858269406, 2, 14220, 218.2215352,
115.5227706), pct5 = c(732.074484, 25.67901235, 1.01, 120.0477168,
3621.328567, 79.30561111), crp4 = c(196115424.7, 1073624.455,
1.23, 1457496.474, 10343851.7, 81288042.73), age = c(52L,
74L, 52L, 67L, 82L, 67L), gender = structure(c(2L, 2L, 2L,
1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444,
57.59068053, 57.59068053, 57.59068053, 57.59068053, 9.893055556
), infection_flag = c(0L, 0L, 1L, 1L, 0L, 1L), temperature_value = c(98.9,
98.9, 98, 101.3, 99.5, 98.1), heartrate_value = c(106, 61,
78, 91, 120, 68), pH_result_time_rta = c(11, 85.50402145,
85.50402145, 85.50402145, 85.50402145, 85.50402145), gcst_value = c(15,
15, 15, 14.63769293, 15, 14.63769293)), row.names = c(NA,
6L), class = "data.frame")
Instead of manually typing the names of the columns which have categorical values, is there any way we can detect the categorical columns and drop them?
I am concerned about the case where the data frame might have more than 10 categorical columns; dropping them by hand is a pain, so I am curious whether it is possible in R. Any thoughts?
For example, I can do this for the above data frame by manually looking at which columns are categorical:
mydf <- mydf[!names(mydf) %in% c("gender", "infection_flag")]
Is there any way we can detect which columns are categorical and drop them for numerical calculation purposes? Any ideas?

An option with base R:
i1 <- sapply(mydf, is.numeric)
mydf[i1]
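The same idea also works with base R's Filter(), which keeps the columns for which a predicate returns TRUE; a quick sketch on a small stand-in data frame (not the full mydf from the question):

```r
# small stand-in data frame mixing numeric and categorical columns
mydf <- data.frame(age = c(52, 74), gender = factor(c("M", "F")),
                   score = c(1.2, 3.4))

# Filter() applies the predicate column-wise and keeps the TRUEs
num_only <- Filter(is.numeric, mydf)

# equivalent to indexing with the sapply() result
i1 <- sapply(mydf, is.numeric)
identical(num_only, mydf[i1])  # TRUE
```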

You can use dplyr and select all the numerical columns:
library(dplyr)
mydf %>% select_if(is.numeric)
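In dplyr 1.0.0 and later, select_if() is superseded by select() combined with the where() helper; same result, newer idiom (sketched on a small stand-in data frame):

```r
library(dplyr)

mydf <- data.frame(age = c(52, 74), gender = factor(c("M", "F")),
                   score = c(1.2, 3.4))

# where() turns the predicate into a column selector
mydf %>% select(where(is.numeric))
```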

Related

left_join duplicates even after troubleshooting

Sample data:
full<-structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AKS",
"AOK", "BTX", "GTX", "HKS", "JKS", "LOK", "MKS", "MOK", "PKS",
"SKS", "VTX"), class = "factor"), CT_NT = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("CT", "NT"), class = "factor"), Depth = c(5L,
10L, 15L, 5L, 10L, 15L), Site = c(1L, 1L, 1L, 1L, 1L, 1L), PW = c(22.8,
21.5, 18.2, 22.5, 20.5, 19.2), BD = c(1.1, 1.2, 1.1, 1.3, 1.3,
1.5)), .Names = c("Location", "CT_NT", "Depth", "Site", "PW",
"BD"), row.names = c(NA, 6L), class = "data.frame")
osu<-structure(list(Location = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("AKS",
"AOK", "BTX", "GTX", "HKS", "JKS", "LOK", "MKS", "MOK", "PKS",
"SKS", "VTX"), class = "factor"), CT_NT = structure(c(1L, 1L,
1L, 2L, 2L, 2L), .Label = c("CT", "NT"), class = "factor"), Depth = c(5L,
10L, 15L, 5L, 10L, 15L), pH = c(5.1, 5.4, 5.9, 5.2, 5.9, 6.2),
N = c(50, 31, 22, 35, 17, 8), P = c(122, 55, 34, 107, 23,
17), K = c(1301, 1202, 1078, 1196, 1028, 948), OM = c(2.3,
1.8, 1.5, 2.1, 1.4, 1.2), NH4 = c(19.3, 14.5, 11.6, 12.3,
8.6, 8.4), Sand = c(22.5, 25, 25, 25, 22.5, 18.8), Silt = c(56.3,
52.5, 50, 51.3, 52.5, 51.3), Clay = c(21.3, 22.5, 25, 23.8,
25, 30)), .Names = c("Location", "CT_NT", "Depth", "pH",
"N", "P", "K", "OM", "NH4", "Sand", "Silt", "Clay"), row.names = c(NA,
6L), class = "data.frame")
I am trying to join two datasets using left_join in dplyr. To my astonishment, I'm getting duplicate rows that are somehow not being identified as such. After reading all the other answers I could get my hands on here that seemed to address "join" issues (at least I'm not the only one who has them...?), I have tried:
- Checking the group types of the joining variables in the two datasets to ensure they match
- Checking that I don't have duplicates within f1 or f2
- Checking that the categorical columns I'm using to join are, in fact, the same length and have the same contents. They're EXACTLY the same, all the way down to the order I put them in
- Explicitly specifying to dplyr to use Location, CT_NT, and Depth to join
- Letting dplyr figure out the joining variables itself
- Joining in both orders
- Using inner_join -- I ended up with f1 only
I've used left_join before and not had this issue, and it was with a very similar dataset (the pilot data to this full study, in fact). I thought I understood what left_join was doing, but now I'm wondering if I don't actually. I'm trying to get better with using dplyr, but unfortunately it's a lot of me bashing away at things until something works and I can figure out why it worked so I can reproduce it again later as needed.
Given my inexperience, I'm sure the answer is going to be frustratingly straightforward and simple, to the annoyance of everyone involved. Such is the life of learning to code, I guess. Thank you in advance for dealing with a rookie's doofy questions!
Here's my code:
f1 <- full %>% # build the summary: pipe full to...
  group_by(Location, CT_NT, Depth, Site) %>% # group to work on CT or NT at each site
  summarise_at(5:6, funs(mean)) %>% # calculate site means
  ungroup()
f1$Depth<-as.factor(f1$Depth)
f1$Site<-NULL
osu$Texture_Class<-NULL#Take out the texture class column
f2 <- osu %>%
  group_by(Location, CT_NT, Depth) %>% # group because otherwise R tries to crash on the next line of code...
  arrange(Location, CT_NT, Depth) %>% # put everything in order like f1, just in case
  ungroup()
f2$Depth<-as.factor(f2$Depth)
full_summary<-left_join(f1,f2)
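One general way to diagnose this kind of row multiplication is to count rows per join-key combination in each table before joining: any key that occurs more than once on either side will fan out in left_join. A sketch with two small made-up tables (not the full/osu data above):

```r
library(dplyr)

# made-up tables sharing the keys Location and Depth
a <- data.frame(Location = c("AKS", "AKS"), Depth = c(5, 5), PW = c(22.8, 22.5))
b <- data.frame(Location = "AKS", Depth = 5, pH = 5.1)

# rows per key combination; any n > 1 means that key will be duplicated
a %>% count(Location, Depth) %>% filter(n > 1)

# here a has two rows for (AKS, 5), so the join keeps both
left_join(a, b, by = c("Location", "Depth"))
```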

ggplot2 - How can I change facet label text using another dataframe as lookup table

I use ggplot2 2.2.0 and R version 3.3.2 (Windows, 64-bit).
According to http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/ I can specify a function to provide the facet labels.
I plot patient data from a study: I have a data frame with the IDs and the data, and a second data frame containing some general information (age and gender).
patmeta <- data.frame(
"pat_id"=c(66, 103, 219, 64, 62, 111, 232),
"gender"=c("f","f","f", "m","f", "f", "f"),
"age"=c(56, 32, 73, 58,37,33,52))
I defined a global labeller function and a special one for my pat_id (pat_id_fac is the same as pat_id but as a factor; pat_id is numeric).
PatIdLabeller <- function(id) {
  res <- sprintf("Pat %s (%i y, %s)", id,
                 subset(patmeta, pat_id == id)$age,
                 subset(patmeta, pat_id == id)$gender)
  return(res)
}
globalLabeller <- labeller(
pat_id_fac = PatIdLabeller,
pat_id = PatIdLabeller,
.default = label_both
)
Testing the PatIdLabeller function gives the desired output (though I think using subset is not the most elegant way to do it), e.g.
> PatIdLabeller('103')
[1] "Pat 103 (32 y, f)"
But when using it in ggplot, the IDs are correct, yet age and gender are the same for all facets (the last row of patmeta), as you can see in the picture.
A subset of my qdat is the following
structure(list(pat_id = c(103L, 103L, 103L, 64L, 64L, 64L, 66L,
66L, 66L, 219L, 219L, 219L, 62L, 62L, 62L, 111L, 111L, 111L,
232L, 232L, 232L), pat_id_fac = structure(c(4L, 4L, 4L, 2L, 2L,
2L, 3L, 3L, 3L, 6L, 6L, 6L, 1L, 1L, 1L, 5L, 5L, 5L, 7L, 7L, 7L
), .Label = c("62", "64", "66", "103", "111", "219", "232"),
class = c("ordered", "factor")),
Activity = structure(c(9L, 3L, 9L, 2L, 9L, 9L, 9L,
2L, 2L, 3L, 8L, 4L, 2L, 2L, 2L, 4L, 4L, 7L, 2L, 2L, 9L), .Label = c("",
"Anderes", "Essen", "Hausarbeit", "Hobbies", "Körperpflege",
"Liegen", "Medienkonsum", "Sozialer Kontakt"), class = "factor")),
.Names = c("pat_id", "pat_id_fac", "Activity"), row.names = c(1L, 2L, 3L,
128L, 129L, 130L, 199L, 200L, 201L, 217L, 218L, 219L, 343L, 344L, 345L,
397L, 398L, 399L, 451L, 452L, 453L), class = "data.frame")
g.bar.activities <-
ggplot(data=qdat, aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id_fac, labeller= globalLabeller)
From other questions and answers, I know I could define a character vector, but I am lazy and would like to do it more elegantly by reusing my patmeta, because the list of study participants will become quite long and will evolve over time.
With smaller test data set
t <- data.frame("pat_id"=c(103, 103, 103, 219, 219, 219),
"Activity" = c("sleep", "sleep", "eat", "eat", "eat", "sleep"))
patmeta <- data.frame("pat_id"=c(103, 219),
"gender"=c("m","f"), "age"=c(32,52))
ggplot(data=t, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=globalLabeller)
I get exactly what I want. I don't see the difference.
It appears that the subsetting is not working properly, likely because == acts element-wise along the whole vector of ids being passed in. That is, it checks each pat_id in patmeta against the corresponding element of the ids passed in, recycling the shorter vector. The differences in sorting are somehow leaving only that one pat_id matching.
You can see this in action if you try any of the following:
PatIdLabeller(c(103, 66))
gives character(0) and this warning:
In pat_id == id : longer object
length is not a multiple of shorter object length
because no rows are returned, and R is forced to recycle the elements in the ==.
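The recycling can be reproduced in isolation; comparing vectors whose lengths are not multiples of each other recycles the shorter one and warns:

```r
pat_id <- c(66, 103, 219)  # a column of ids (made up for illustration)
id <- c(103, 66)           # the ids handed to the labeller

# element-wise with recycling: 66==103, 103==66, 219==103 -- all FALSE,
# plus a warning that the longer length is not a multiple of the shorter
pat_id == id
```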
ggplot(data=head(qdat), aes(x=Activity)) +
geom_bar() +
facet_wrap(~ pat_id, labeller= globalLabeller)
gives a plot with duplicated age/gender again, and this warning
In pat_id == id : longer object length is not a
multiple of shorter object length
(ditto above).
Of note, even with your smaller data set, if you reverse the row order of your new patmeta (so that 219 comes before 103) and then run the code, you get
Error in FUN(X[[i]], ...) : Unknown input
because the labeller is returning an empty character() (as above).
I don't have a lot of experience with labellers (this answer was a good chance to explore them), but this one should work by using left_join from dplyr, rather than trying to use ==.
myLabeller <- function(x){
  lapply(x, function(y){
    toLabel <-
      data.frame(pat_id = y) %>%
      left_join(patmeta)
    paste0("Pat ", toLabel$pat_id
           , " (", toLabel$age, "y, "
           , toLabel$gender, ")")
  })
}
and using it gives:
ggplot(data=qdat, aes(x=Activity)) + geom_bar() +
facet_wrap(~pat_id, labeller=myLabeller) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
An alternative option would be to skip the labeller step, and just generate the label you actually want to use directly. Here, just merge the meta data with the patient data (using left_join from dplyr), then generate a column using the format/style that you want (here, using mutate from dplyr and paste0).
forPlotting <-
  qdat %>%
  left_join(patmeta) %>%
  mutate(forFacet = paste0("Pat ", pat_id
                           , " (", age, "y, "
                           , gender, ")"))
Then, use that data for plotting, and the new column for faceting.
ggplot(forPlotting, aes(x=Activity)) +
geom_bar() +
facet_wrap(~forFacet) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
gives
Note that the facets are now sorted alphabetically, but you could adjust that as needed by setting the column as a factor with explicitly sorted levels when you make it.

group variables depending on defined circular area with center of circle having variable radius

I have a data table object:
> dput(head(trackdatacompvar))
structure(list(wellvid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("A4-009",
"B3-006", "B4-015", "C2-009", "C2-034", "C3-017", "C4-014", "C4-016",
"C4-026", "C4-036"), class = "factor"), TRACK_ID = c(0L, 0L,
0L, 0L, 0L, 0L), treatment = structure(c(2L, 2L, 2L, 2L, 2L,
2L), .Label = c("Si_induced", "Si_notinduced"), class = "factor"),
A = c(0L, 0L, 0L, 0L, 0L, 0L), X = c(50.216, 50.216, 50.091,
50.091, 50.216, 50.216), Y = c(295.609, 295.609, 295.477,
295.477, 295.609, 295.609), T = 0:5, V = c(0, 0, 0.181793839279557,
0, 0.181793839279557, 0), x_grpA = c(641.67, 641.67, 641.67,
641.67, 641.67, 641.67), y_grpA = c(625, 625, 625, 625, 625,
625), rad_grpA = c(50L, 50L, 50L, 50L, 50L, 50L), x_grpB = c(889.58,
889.58, 889.58, 889.58, 889.58, 889.58), y_grpB = c(377.08,
377.08, 377.08, 377.08, 377.08, 377.08), rad_grpB = c(20L,
20L, 20L, 20L, 20L, 20L)), .Names = c("wellvid", "TRACK_ID",
"treatment", "A", "X", "Y", "T", "V", "x_grpA", "y_grpA", "rad_grpA",
"x_grpB", "y_grpB", "rad_grpB"), sorted = "wellvid", class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000210788>)
I want to define 4 groups of data depending on circular areas. Groups A and B depend on the x,y origins of 2 beads (labelled x_grpA, y_grpA and x_grpB, y_grpB), group C is the outside area, and group D is the area where groups A and B overlap (though this area is sometimes absent). The 2 circular groups should sit inside a circular area with a radius of 115 µm. This 115 µm depends on the size of the bead, so I also have 2 radii in my data (rad_grpA and rad_grpB). To understand it visually, here are 2 pictures:
My original idea is to reuse the awesome script I was given before. So I tried defining the center of each data point and the corresponding extent of the whole area of group A as:
center_grpA <- c(trackdatacompvar$x_grpA, trackdatacompvar$y_grpA)
circle_grpA <- (trackdatacompvar$rad_grpA)*2 + 115
But after this I am lost.
In the end I want to add the grouping to my data frame as one variable.
Would appreciate any help! Thanks :)
We can use a little convenience function from a package of mine here:
check_if_in_circle <- function(points, x, y, r) {
(points[, 1] - x) ^ 2 + (points[, 2] - y) ^ 2 < r ^ 2
}
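For instance, applying it to a couple of made-up points (function repeated here so the snippet runs on its own):

```r
check_if_in_circle <- function(points, x, y, r) {
  (points[, 1] - x) ^ 2 + (points[, 2] - y) ^ 2 < r ^ 2
}

# one point inside and one outside a circle of radius 5 centred at (0, 0)
pts <- cbind(c(1, 10), c(1, 10))
check_if_in_circle(pts, x = 0, y = 0, r = 5)
# [1]  TRUE FALSE
```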
Now we check, for each point, whether it's in circle A and in circle B, and then use ifelse to figure out whether to assign A, B, C or D. I use within to avoid typing that long data name.
trackdatacompvar <- within(trackdatacompvar,
{
grpA <- check_if_in_circle(points = cbind(X, Y),
x_grpA, y_grpA, rad_grpA + 115)
grpB <- check_if_in_circle(points = cbind(X, Y),
x_grpB, y_grpB, rad_grpB + 115)
grp <- ifelse(grpA, ifelse(grpB, 'D', 'A'),
ifelse(grpB, 'B', 'C'))
} )
For the few rows you gave us, all are in group C.

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with column 18. Column 18 is a factor and contains data which can be used as an indicator of outliers.
My data frame is ufo, and I remove column 18 as follows:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Is there a mistake in this code?
Is there another way to find outliers for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example, ProduceCode: are you sure this should be part of your similarity?
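As one concrete preprocessing step (a sketch on made-up numbers, not the original ufo2 data): standardising each numeric column with scale() puts the columns on a comparable footing before any Euclidean-distance-based method is applied.

```r
# made-up columns on wildly different scales
d <- data.frame(Weight = c(900, 850, 483, 110000),
                InvoiceValue = c(637, 775, 2896, 48812))

# scale() centres each column and divides by its standard deviation
d_scaled <- as.data.frame(scale(d))

sapply(d_scaled, sd)  # each column now has sd 1
```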
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.

Smoothing in ggplot

I have this ggplot
ggplot(dt1, aes(x=pctOAC,y=NoP, fill=Age)) +
geom_bar(stat="identity",position=position_dodge()) +
geom_smooth(aes(x=pctOAC,y=NoP, colour=Age), se=F, method="loess",show_guide = FALSE,lwd=0.7) +
theme(legend.position=c(.2,0.8))
dt1 <- structure(list(Age = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("o80",
"u80"), class = "factor"), NoP = c(47L, 5L, 33L, 98L, 287L, 543L, 516L,
222L, 67L, 14L, 13L, 30L, 1L, 6L, 17L, 30L, 116L, 390L, 612L, 451L, 146L,
52L), pctOAC = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 0, 10, 20,
30, 40, 50, 60, 70, 80, 90, 100)), .Names = c("Age", "NoP", "pctOAC"),
row.names = c(NA, -22L), class = "data.frame")
I would like to have the smooth lines constrained to lie above zero, perhaps something similar to a kernel density. In fact, if I had the underlying data, I expect a kernel density is exactly what I would want, but I only have the aggregated data. Is there any way to do this? I tried using different method= arguments in geom_smooth, but the small dataset seems to prevent it. I wondered about using stat_function, but I don't have much clue about how to proceed with finding a suitable function to plot.
Another possibility is to use method="glm" with a spline curve and a log link. (I also tried method="gam", but its automatic complexity adjustment wanted to reduce the wiggliness too much.)
library(splines)
ggplot(dt1, aes(x=pctOAC,y=NoP, fill=Age)) +
geom_bar(stat="identity",position=position_dodge()) +
geom_smooth(aes(colour=Age), se=F,
method="glm",
formula=y~ns(x,8),
family=gaussian(link="log"),
show_guide = FALSE,lwd=0.7) +
theme(legend.position=c(.2,0.8))
How about geom_density()?
ggplot(dt1, aes(x=pctOAC,y=NoP, colour=Age, fill=Age)) +
geom_bar(stat="identity",position=position_dodge()) +
geom_density(stat="identity", fill=NA) +
theme(legend.position=c(.2,0.8))
