Using conditional selection to create a subset of data

I have a dataset called dietox which has missing values (NA) for the Feed variable. I need to use conditional selection to create a subset of the data for which the rows with missing values are deleted.
The code I tried was:
dietox[!is.NA[dietox$Feed, ]
... but I am not sure whether that is the right way to create a subset.
dput(head(dietox))
dietox <- structure(list(Weight = c(26.5, 27.59999, 36.5, 40.29999, 49.09998,
55.39999), Feed = c(NA, 5.200005, 17.6, 28.5, 45.200001, 56.900002 ),
Time = 1:6, Pig = c(4601L, 4601L, 4601L, 4601L, 4601L, 4601L ),
Evit = c(1L, 1L, 1L, 1L, 1L, 1L), Cu = c(1L, 1L, 1L, 1L, 1L, 1L),
Litter = c(1L, 1L, 1L, 1L, 1L, 1L)),
.Names = c("Weight", "Feed", "Time", "Pig", "Evit", "Cu", "Litter"),
row.names = c(NA, 6L), class = "data.frame")

You have the right idea, but is.na is a function, so it must be called with parentheses rather than square brackets. R is also case-sensitive, so it has to be spelled is.na, not is.NA.
dietox[!is.na(dietox$Feed), ]
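As a minimal sketch of how this behaves, the example below uses a small made-up data frame (the values are illustrative, not the real dietox data) and compares the bracket form with base R's subset(), which should give the same rows:

```r
# Small illustrative data frame; values are made up for this sketch
dietox <- data.frame(
  Weight = c(26.5, 27.6, 36.5),
  Feed   = c(NA, 5.2, 17.6),
  Time   = 1:3
)

# Keep only the rows where Feed is not missing
no_na <- dietox[!is.na(dietox$Feed), ]

# Equivalent selection with subset()
same_subset <- subset(dietox, !is.na(Feed))

nrow(no_na)  # 2 rows remain
```

Note that the result is only stored if it is assigned to a name, as with no_na above; the original dietox is left unchanged.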

Related

Arules in R - values that I exclude keep returning

I am applying the apriori algorithm in R with the database structured as follows (in dput()):
structure(list(Firm.s.global.reorganization = structure(c(1L,
2L, 1L, 2L, 2L), .Label = c("no", "yes"), class = "factor"),
Delivery.time = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor"), Automation.of.production.process = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
Poor.quality.of.offshored.production = structure(c(1L, 1L,
1L, 1L, 1L), .Label = c("no", "yes"), class = "factor"),
Made.in.effect = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor"), Proximity.to.customers = structure(c(1L,
1L, 1L, 1L, 1L), .Label = c("no", "yes"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
When I run my code I only want values to return that have a "yes" value, thus I use the following code:
rules7 <- apriori(data4, parameter = list(support = 0.05,confidence = 0.5, maxlen=5), appearance=list(rhs=c("Firm.s.global.reorganization=yes"),
lhs=c("Delivery.time=yes",
"Automation.of.production.process=yes",
"Poor.quality.of.offshored.production=yes",
"Made.in.effect=yes",
"Proximity.to.customers=yes",
"Implementation.of.strategies.based.on.product.process.innovation=yes",
"Untapped.production.capacity=yes",
"Know.how.in.the.home.country=yes",
"Change.in.total.costs.of.sourcing=yes",
"Logistics.costs=yes",
"Need.for.greater.organizational.flexibility=yes",
"Economic.crisis=yes",
"Improve.customer.service=yes",
"Labour.costs..gap.reduction=yes",
"Government.support.to.relocation=yes",
"Proximity.to.suppliers=yes",
"Loyalty.to.the.home.country=yes"),default="lhs"))
But the results I keep receiving include:
    lhs                                  rhs                                support    confidence coverage   lift     count
[1] {Made.in.effect=no,
     Untapped.production.capacity=no,
     Economic.crisis=yes}             => {Firm.s.global.reorganization=yes} 0.02521008 1.0000000  0.02521008 3.838710 6
even though I explicitly used "Made.in.effect=yes" in my code to avoid the "no's".
How can I make sure I only receive "yes" results on both lhs and rhs?
Thanks!

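I have already fixed it.
In case someone struggles with this in the future: with default="lhs", every item that is not explicitly listed in the appearance list (including the "=no" items) is still allowed on the lhs.
Change the default to:
default="none"))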

Which is one of the best format to store a large dataframe in R?

I have a dataframe with ~127000000 observations and 5 columns.
It looks like this:
structure(list(species = structure(1:7, .Label = c("Aa achalensis",
"Aa argyrolepis", "Aa aurantiaca", "Aa calceata", "Aa colombiana",
"Aa denticulata", "Aa erosa"), class = "factor"), establishment = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "dark", class = "factor"),
region = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "USA", class = "factor"),
height = c(0.348046463, 0.265755867, 0.382584479, 0.336147631,
0.333343259, 0.333343259, 0.382584479), resitance = c(-0.3906,
-0.6257, -0.2987, -0.423, -0.4307, -0.4307, -0.2987)), class = "data.frame", row.names = c(NA,
-7L))
I tried to save it as .csv but it didn't work. I also increased the memory limit with memory.limit(size=56000), but that didn't help either.
Is there a way that I can do this task? Another format? I am a beginner in R.
Thank you all very much!

combine multiple elements in a list with different indexes in r

I have a list and I need to add together elements with different indexes. I'm struggling because I want to create a loop at different indexes.
data(aSAH)
rocobj <- roc(aSAH$outcome, aSAH$s100b)
dat<-coords(rocobj, "all", ret=c("threshold","sensitivity", "specificity"), as.list=TRUE)
I want to create a function where I can look at all the sensitivity/1-specificity combos at all thresholds in a new data frame. I know threshold is found in dat[1,], sensitivity is found in dat[2,] and specificity is found in dat[3,]. So I tried:
for (i in length(dat)) {
print(dat[1,i]
print(dat[2,i]/(1-dat[3,i]))
}
I should end up with a data frame that has threshold and sensitivity/(1-specificity).
DATA
dput(head(aSAH))
structure(list(gos6 = structure(c(5L, 5L, 5L, 5L, 1L, 1L), .Label = c("1",
"2", "3", "4", "5"), class = c("ordered", "factor")), outcome = structure(c(1L,
1L, 1L, 1L, 2L, 2L), .Label = c("Good", "Poor"), class = "factor"),
gender = structure(c(2L, 2L, 2L, 2L, 2L, 1L), .Label = c("Male",
"Female"), class = "factor"), age = c(42L, 37L, 42L, 27L,
42L, 48L), wfns = structure(c(1L, 1L, 1L, 1L, 3L, 2L), .Label = c("1",
"2", "3", "4", "5"), class = c("ordered", "factor")), s100b = c(0.13,
0.14, 0.1, 0.04, 0.13, 0.1), ndka = c(3.01, 8.54, 8.09, 10.42,
17.4, 12.75)), .Names = c("gos6", "outcome", "gender", "age",
"wfns", "s100b", "ndka"), row.names = 29:34, class = "data.frame")
EDIT
One answer:
dat_transform <- as.data.frame(t(dat))
dat_transform <- dat_transform %>% mutate(new=sensitivity/(1-specificity))

You can use transform (here t is the data frame holding the threshold, sensitivity and specificity columns):
transform(t, res = sensitivity/(1-specificity))[c(1, 4)]
Or with dplyr :
library(dplyr)
t %>%
mutate(res = sensitivity/(1-specificity)) %>%
select(threshold, res)
Also note that t is a base R function that transposes a matrix or data frame, so it is better to use a different variable name for the data frame.
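A minimal sketch of the transform approach with a non-clashing name is shown below. The data frame roc_coords and its values are made up for illustration; in practice the columns would come from the coords() output:

```r
# Illustrative coordinates; the numbers are invented for this sketch
roc_coords <- data.frame(
  threshold   = c(0.1, 0.2, 0.3),
  sensitivity = c(0.9, 0.8, 0.6),
  specificity = c(0.4, 0.6, 0.8)
)

# Add sensitivity / (1 - specificity) and keep only threshold and the ratio
res <- transform(
  roc_coords,
  ratio = sensitivity / (1 - specificity)
)[c("threshold", "ratio")]

res
```

Selecting the columns by name rather than by position (c(1, 4)) keeps the code working if the column order changes.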

R - Create new data frame based on variable names

Given two dataframes:
d1=structure(list(y = c(0.04090403771224, 0.321216286364446, -1.00056198338576,
0.549872767053012, 0.746529361891068, -0.756394989312306, 0.432210041946058,
1.04202671889042, 0.846691694527378, -0.0372890199537169), x = c(-0.626453810742332,
0.183643324222082, -0.835628612410047, 1.59528080213779, 0.329507771815361,
-0.820468384118015, 0.487429052428485, 0.738324705129217, 0.575781351653492,
-0.305388387156356), tx = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L), x2 = c(-0.41499456329968, -0.394289953710349, -0.0593133967111857,
1.10002537198388, 0.763175748457544, -0.164523596253587, -0.253361680136508,
0.696963375404737, 0.556663198673657, -0.68875569454952)), .Names = c("y",
"x", "tx", "x2"), row.names = c(NA, -10L), class = "data.frame")
d2=dput(reg1.coefs)
structure(c(-0.752515009279226, 2.43055896098665, 0.833724197554561,
-1.79389056223944), .Names = c("(Intercept)", "x", "tx", "x:tx"
))
How can I create a third data frame that selects only the variables in d2 and creates additional variable(s) corresponding to the product of the variables joined by ':' in d2? In this example, the code should return a third variable x:tx corresponding to the product of x and tx in d1.
out=structure(list(y = c(0.04090403771224, 0.321216286364446, -1.00056198338576,
0.549872767053012, 0.746529361891068, -0.756394989312306, 0.432210041946058,
1.04202671889042, 0.846691694527378, -0.0372890199537169), x = c(-0.626453810742332,
0.183643324222082, -0.835628612410047, 1.59528080213779, 0.329507771815361,
-0.820468384118015, 0.487429052428485, 0.738324705129217, 0.575781351653492,
-0.305388387156356), tx = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L), x1x2 = c(-0.626453810742332, 0.183643324222082, -0.835628612410047,
1.59528080213779, 0.329507771815361, -0.820468384118015, 0, 0,
0.575781351653492, -0.305388387156356)), .Names = c("y", "x",
"tx", "x1x2"), row.names = c(NA, -10L), class = "data.frame")
This is obviously easy for one case, but I need the code to be general enough so that if d2 contains products of other variables (e.g., x:x2 or x:x2:tx) the correct out is produced.
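One possible approach, offered as a sketch and not taken from the original thread, is to split each coefficient name on ':' and multiply the matching columns together. The helper name build_terms and the assumption that the response column is called "y" are hypothetical; it also assumes every variable named in the coefficients exists as a numeric column in the data:

```r
# Hypothetical helper: build one column per coefficient name,
# multiplying columns together for names that contain ':'
build_terms <- function(data, coefs, response = "y") {
  # Drop the intercept; it has no matching column in the data
  terms <- setdiff(names(coefs), "(Intercept)")
  out <- data[response]
  for (term in terms) {
    # "x:x2:tx" becomes c("x", "x2", "tx"); a plain "x" stays as "x"
    parts <- strsplit(term, ":", fixed = TRUE)[[1]]
    # Element-wise product of all columns that make up this term
    out[[term]] <- Reduce(`*`, data[parts])
  }
  out
}

# Small illustrative inputs; values are made up for this sketch
d1 <- data.frame(y = c(0.1, 0.3, -1.0), x = c(-0.6, 0.2, -0.8),
                 tx = c(1L, 0L, 1L), x2 = c(-0.4, -0.4, 1.1))
d2 <- c("(Intercept)" = -0.75, x = 2.43, tx = 0.83, "x:tx" = -1.79)

out <- build_terms(d1, d2)
names(out)
```

Because the product is built with Reduce, the same code should handle two-way and higher-order terms such as x:x2:tx without changes.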

How do I understand the warnings from rbind?

If I have two data.frames with the same column names, I can use rbind to make a single data frame. However, if a column is a factor in one and an int in the other, I get a warning like this:
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = c(1L, 1L, 0L, 0L, 0L, 1L, 1L, :
  invalid factor level, NA generated
The following is a simplification of the problem:
t1 <- structure(list(test = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 2L), .Label = c("False", "True"), class = "factor")), .Names = "test", row.names = c(NA,
-10L), class = "data.frame")
t2 <- structure(list(test = c(1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L
)), .Names = "test", row.names = c(NA, -10L), class = "data.frame")
rbind(t1, t2)
With the single column, this is easy to understand, but when it is part of a dozen or more factors, it can be difficult. What is there about the warning message to tell me which column to look at? Barring that, what is a good technique to understand which column is in error?

You could knock up a simple little comparison script using class and mapply, to compare where the rbind will break down due to non-matching data types, e.g.:
one <- data.frame(a=1,b=factor(1))
two <- data.frame(b=2,a=2)
common <- intersect(names(one),names(two))
mapply(function(x,y) class(x)==class(y), one[common], two[common])
# a b
# TRUE FALSE

Based on thelatemail's answer, here is a function to compare two data.frames for rbinding:
mergeCompare <- function(one, two) {
cat("Distinct items: ", setdiff(names(one),names(two)), setdiff(names(two),names(one)), "\n")
print("Non-matching items:")
common <- intersect(names(one),names(two))
print (mapply(function(x,y) {class(x)!=class(y)}, one[common], two[common]))
}
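As a usage sketch, the same idea can return just the names of the mismatched columns. The helper name class_mismatches is hypothetical, and it uses identical() instead of == so that columns with more than one class (such as ordered factors) still yield a single TRUE or FALSE per column:

```r
# Hypothetical helper: names of shared columns whose classes differ
class_mismatches <- function(one, two) {
  common <- intersect(names(one), names(two))
  same <- vapply(
    common,
    function(nm) identical(class(one[[nm]]), class(two[[nm]])),
    logical(1)
  )
  common[!same]
}

# Illustrative inputs mirroring the factor-versus-integer problem
t1 <- data.frame(test = factor(c("False", "True")), id = 1:2)
t2 <- data.frame(test = c(1L, 0L), id = 3:4)

class_mismatches(t1, t2)  # "test"
```

Running this before rbind points at the columns that need converting, for example with as.character() or factor(), so the warning does not have to be decoded after the fact.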
