Function defining answer by a vector - r

I'm looking to learn how to write functions. I have data laid out as follows (for example):
Genus Species Wing Tail
A X 10.5 20.3
A Y 10.7 20.7
B XX 15.2 22.5
B XY 15.5 24
I calculate variance for a given trait using the equation:
sqrt(max(Wing) - min (Wing))
which I sum for all traits.
So I can write the following function to sum the variance for the total data set:
variance <- function(data){
  t <- sqrt(max(Tail)-min(Tail))
  w <- sqrt(max(Wing)-min(Wing))
  x <- sum(t,w)
  x
}
But I can't work out how to generate an output where this result depends on the Genus. So I'm looking to generate an output like:
Genus A Genus B
2.345 3.456

I am going to give a new name to your function because it's just wrong to call it "variance". I hope you can overlook that. We can work on a dataframe object:
dput(dfrm)
structure(list(Genus = structure(c(1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Species = structure(c(1L, 4L, 2L, 3L
), .Label = c("X", "XX", "XY", "Y"), class = "factor"), Wing = c(10.5,
10.7, 15.2, 15.5), Tail = c(20.3, 20.7, 22.5, 24)), .Names = c("Genus",
"Species", "Wing", "Tail"), class = "data.frame", row.names = c(NA,
-4L))
dev2 <- function(df){
  # Index the columns of the data frame that was passed in
  t <- sqrt(max(df[["Tail"]])-min(df[["Tail"]]))
  w <- sqrt(max(df[["Wing"]])-min(df[["Wing"]]))
  x <- sum(t,w)
  x
}
Now use it on the full dataframe with the split-lapply strategy, which splits the original dataframe by the Genus values and passes each piece to the dev2 function:
lapply( split(dfrm, list(dfrm$Genus)), FUN = dev2)
$A
[1] 1.079669
$B
[1] 1.772467
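To get the one-row layout asked for in the question, sapply can stand in for lapply, since it simplifies the per-genus results into a named numeric vector. This is a minimal sketch: dfrm is rebuilt with data.frame() rather than from the dput output, and the name by_genus is only illustrative.

```r
# Same data as the dput(dfrm) output, rebuilt by hand
dfrm <- data.frame(
  Genus   = c("A", "A", "B", "B"),
  Species = c("X", "Y", "XX", "XY"),
  Wing    = c(10.5, 10.7, 15.2, 15.5),
  Tail    = c(20.3, 20.7, 22.5, 24)
)

dev2 <- function(df) {
  t <- sqrt(max(df[["Tail"]]) - min(df[["Tail"]]))
  w <- sqrt(max(df[["Wing"]]) - min(df[["Wing"]]))
  sum(t, w)
}

# sapply collapses the list returned by split/lapply into a named vector
by_genus <- sapply(split(dfrm, dfrm$Genus), dev2)
print(round(by_genus, 6))
#        A        B
# 1.079669 1.772467
```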

Expand rows based on arbitrary N and M

Let's say I have a parameter V which is equal to 100 on row v. I have another parameter D on row d that is also equal to 100. I would like to create a dataframe with N rows above row v and M rows below row d, so that the (v-N)th row is equal to V - Nf and the (d+M)th row is equal to D + Mf, as in the expected output below. For this example, let's assume that f is equal to 5:
Input:
> dput(df)
structure(list(rw = structure(2:1, .Label = c("d", "v"), class = "factor"),
vals = c(100, 100)), class = "data.frame", row.names = c(NA,
-2L))
Expected output:
> dput(df)
structure(list(rw = structure(c(8L, 7L, 6L, 5L, 1L, 2L, 3L, 4L
), .Label = c("d", "d+1", "d+2", "d+M", "v", "v-1", "v-2", "v-N"
), class = "factor"), vals = c(85, 90, 95, 100, 100, 105, 110,
115)), class = "data.frame", row.names = c(NA, -8L))
How can I achieve this? Let me know if you have any questions.
Thanks!
We can use Map with seq to build one sequence per element of the 'vals' column. A second vector, c(-byval, byval), supplies the signed step through the by argument of seq, so the sequence runs backwards for the first row and forwards for the second. The length of each sequence is fixed at 4 ('n') through length.out. We then set the names of the resulting list from the 'rw' column and convert the list to a two-column data.frame with stack:
byval <- 5
n <- 4
stack(setNames(Map(function(x, y) sort(seq(x, length.out = n, by = y)),
                   df$vals, c(-byval, byval)), df$rw))[2:1]
If the two sequences need different lengths, pass length.out as another argument to Map:
n1 <- 6
n2 <- 5
stack(setNames(Map(function(x, y, z) sort(seq(x, length.out = z, by = y)),
                   df$vals, c(-byval, byval), c(n1, n2)), df$rw))[2:1]
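If the row labels from the expected output ("v-1", "d+1", and so on) are also wanted, one option is to build the rows above and below explicitly and bind them around the input. This is only a sketch under assumptions: N = M = 3 and f = 5 as in the example, labels are numbered rather than ending in the literal "v-N" and "d+M", and the names above, below and out are illustrative.

```r
df <- data.frame(rw = c("v", "d"), vals = c(100, 100),
                 stringsAsFactors = FALSE)
f <- 5   # step size
N <- 3   # rows to add above row v
M <- 3   # rows to add below row d

# Rows above v count down from V in steps of f
above <- data.frame(rw   = paste0("v-", N:1),
                    vals = df$vals[df$rw == "v"] - (N:1) * f,
                    stringsAsFactors = FALSE)
# Rows below d count up from D in steps of f
below <- data.frame(rw   = paste0("d+", 1:M),
                    vals = df$vals[df$rw == "d"] + (1:M) * f,
                    stringsAsFactors = FALSE)

out <- rbind(above, df, below)
out$vals
# 85 90 95 100 100 105 110 115
```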

How to convert list elements within a data frame to a data frame

I have a data frame in which some columns are lists, and I would like to convert all of those list columns to ordinary columns. The example data frame df is shown below. I want a data frame without any list columns.
structure(list(Study = structure(c(1L, 3L, 2L), .Label = c("new",
"y", "z"), class = "factor"), IQC = structure(list(result.1 = 4,
result.2 = 20, result.3 = 2.67), .Names = c("result.1", "result.2",
"result.3")), EQC = structure(list(result.1 = "1.12*", result.2 = "0.9*",
result.3 = 3.1), .Names = c("result.1", "result.2", "result.3"
)), CQCg = structure(list(result.1 = 307.65, result.2 = 307.65,
result.3 = 16.16), .Names = c("result.1", "result.2", "result.3"
)), CQCp = structure(list(result.1 = 22.27, result.2 = 20.93,
result.3 = 9.59), .Names = c("result.1", "result.2", "result.3"
)), AQCg = structure(list(result.1 = 10.8, result.2 = 8.99, result.3 = 8.37), .Names = c("result.1",
"result.2", "result.3")), AQCp = structure(list(result.1 = 3.81,
result.2 = "1.07*", result.3 = "0.2*"), .Names = c("result.1",
"result.2", "result.3")), Rank = c(1.42, 1.92, 2.67)), .Names = c("Study",
"IQC", "EQC", "CQCg", "CQCp", "AQCg", "AQCp", "Rank"), row.names = c(NA,
3L), class = "data.frame")
To remove the list class from any column in the data.frame, one could simply use
df[] <- unlist(df)
After this conversion, the entries displayed in the data.frame remain unchanged, except that the factor column Study now shows its integer codes:
#> df
# Study IQC EQC CQCg CQCp AQCg AQCp Rank
#1 1 4 1.12* 307.65 22.27 10.8 3.81 1.42
#2 3 20 0.9* 307.65 20.93 8.99 1.07* 1.92
#3 2 2.67 3.1 16.16 9.59 8.37 0.2* 2.67
But one can verify that the list columns have been converted into character columns:
#> sapply(df,class)
# Study IQC EQC CQCg CQCp AQCg AQCp Rank
#"character" "character" "character" "character" "character" "character" "character" "character"
PS: Hat tip to @docendodiscimus for pointing out a mistake in a previous version of this answer.
You can use unlist to do what you need. I hope this is helpful.
unlisted <- as.data.frame(t(apply(df, 1, unlist)))
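Both approaches above turn every column into character (or factor) data. If the numeric columns should stay numeric and Study should be left untouched, a possible alternative is to unlist only the list columns, one at a time. This is a minimal sketch on a cut-down version of the example data, assuming every list element has length one; is_list is an illustrative name.

```r
# Cut-down version of the example: two list columns, two ordinary ones
df <- data.frame(Study = c("new", "z", "y"),
                 Rank  = c(1.42, 1.92, 2.67),
                 stringsAsFactors = FALSE)
df$IQC <- list(4, 20, 2.67)          # all numeric
df$EQC <- list("1.12*", "0.9*", 3.1) # mixed character and numeric

# Find the list columns and flatten each one separately
is_list <- vapply(df, is.list, logical(1))
df[is_list] <- lapply(df[is_list], unlist)

sapply(df, class)
# Study is character, Rank and IQC are numeric, EQC is character
```

A column that mixes strings and numbers, such as EQC, still ends up as character, because unlist coerces its elements to a common type.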

Merging two dataframes using only columns from one dataframe and ignoring others in R

I am trying to merge two dataframes. I've been reading the different posts, but I couldn't find a way to obtain my desired output.
dfA:
Name Surname C
Ja Men T
Ale Bu T
Ge Men
dfB:
Name Surname C Ex
Ge Men T hello
Je Di T hello
Desired output:
Merge:
Name Surname C
Ja Men T
Ale Bu T
Ge Men T
Je Di T
That is, fill the columns in dfA with the available columns in dfB and ignore the columns from dfB that are not present in dfA.
I tried:
merge(dfA,dfB, by=c("Name", "Surname", "Caracter"), all.x = T)
I also tried other combinations of merge, and I tried using dplyr, but couldn't get a satisfactory result.
Any help would be appreciated.
Thanks in advance
Data:
dfA <- data.frame(
name=c("Ja", "Ale", "Ge"),
surname=c("Men", "Bu", "Men"),
C= c("T", "T", NA))
dfB <- data.frame(
name=c("Ge", "Je"),
surname=c("Men","Di"),
C= c("T","T"),
X = c("hello","hello"))
Using dput():
# based on dput(dfA)
dfA <- structure(list(name = structure(c(3L, 1L, 2L), .Label = c("Ale",
"Ge", "Ja"), class = "factor"), surname = structure(c(2L, 1L,
2L), .Label = c("Bu", "Men"), class = "factor"), C = structure(c(1L,
1L, NA), .Label = "T", class = "factor")), .Names = c("name",
"surname", "C"), row.names = c(NA, -3L), class = "data.frame")
# based on dput(dfB)
dfB <- structure(list(name = structure(1L, .Label = "Ge", class = "factor"),
surname = structure(1L, .Label = "Men", class = "factor"),
C = "T", X = structure(1L, .Label = "hello", class = "factor")),
.Names = c("name", "surname", "C", "X"),
row.names = c(NA, -1L), class = "data.frame")
Assuming that the input is as shown at the end of the question, we perform a left join of dfA with dfB. Note that coalesce returns its first non-null argument (NAs are regarded as SQL nulls):
library(sqldf)
sqldf("select A.Name, A.Surname, coalesce(A.C, B.C) C
from dfA A left join dfB B on A.Name = B.Name and A.Surname = B.Surname")
giving:
name surname C
1 Ja Men T
2 Ale Bu T
3 Ge Men T
We could use safe_full_join from my package safejoin, and resolve column conflicts using dplyr::coalesce:
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
library(dplyr)
safe_full_join(dfA, dfB[names(dfA)], by=c("name","surname"), conflict = coalesce, check="")
# name surname C
# 1 Ja Men T
# 2 Ale Bu T
# 3 Ge Men T
# 4 Je Di T
check = "" suppresses the warning that would otherwise be shown, since we're joining on factor columns with different levels.
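The same full-join-then-coalesce idea can also be sketched in base R, without extra packages. This is a hedged sketch rather than a drop-in replacement: it uses the two-row dfB from the Data section, reads the columns as character instead of factor, and the suffixes ".A" and ".B" and the names m and res are illustrative.

```r
dfA <- data.frame(name = c("Ja", "Ale", "Ge"),
                  surname = c("Men", "Bu", "Men"),
                  C = c("T", "T", NA),
                  stringsAsFactors = FALSE)
dfB <- data.frame(name = c("Ge", "Je"),
                  surname = c("Men", "Di"),
                  C = c("T", "T"),
                  X = c("hello", "hello"),
                  stringsAsFactors = FALSE)

# Full outer join on the key columns, keeping only dfA's columns from dfB
m <- merge(dfA, dfB[names(dfA)], by = c("name", "surname"),
           all = TRUE, suffixes = c(".A", ".B"))

# Coalesce: take dfA's value where present, otherwise dfB's
m$C <- ifelse(is.na(m$C.A), m$C.B, m$C.A)
res <- m[c("name", "surname", "C")]
res
```

Note that merge sorts the result by the key columns, so the row order differs from the desired output in the question even though the rows are the same.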

Passing current value of ddply split on to function

Here is some sample data for which I want to encode the gender of the names over time:
names_to_encode <- structure(list(names = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("names", "year"), row.names = c(NA, -6L), class = "data.frame")
Here is a minimal set of the Social Security data, limited to just those names from 1890 and 1990:
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
I've defined a function which subsets the Social Security data given a year or range of years. In other words, it calculates whether a name was male or female over a given time period by figuring out the proportion of male and female births with that name. Here is the function along with a helper function:
require(plyr)
require(dplyr)
select_ssa <- function(years) {
  # If we get only one year (1890) convert it to a range of years (1890-1890)
  if (length(years) == 1) years <- c(years, years)
  # Calculate the male and female proportions for the given range of years
  ssa_select <- ssa_demo %.%
    filter(year >= years[1], year <= years[2]) %.%
    group_by(name) %.%
    summarise(female = sum(female),
              male = sum(male)) %.%
    mutate(proportion_male = round((male / (male + female)), digits = 4),
           proportion_female = round((female / (male + female)), digits = 4)) %.%
    mutate(gender = sapply(proportion_female, male_or_female))
  return(ssa_select)
}
# Helper function to determine whether a name is male or female in a given year
male_or_female <- function(proportion_female) {
  if (proportion_female > 0.5) {
    return("female")
  } else if(proportion_female == 0.5000) {
    return("either")
  } else {
    return("male")
  }
}
Now what I want to do is use plyr, specifically ddply, to subset the data to be encoded by year, and merge each of those pieces with the value returned by the select_ssa function. This is the code I have.
ddply(names_to_encode, .(year), merge, y = select_ssa(year), by.x = "names", by.y = "name", all.x = TRUE)
When calling select_ssa(year), this command works just fine if I hard code a value like 1890 as the argument to the function. But when I try to pass it the current value for year that ddply is working with, I get an error message:
Error in filter_impl(.data, dots(...), environment()) :
(list) object cannot be coerced to type 'integer'
How can I pass the current value of year on to ddply?
I think you're making things too complicated by trying to do a join inside ddply. If I were to use dplyr I would probably do something more like this:
names_to_encode <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("jane", "john", "madison"), class = "factor"), year = c(1890, 1990, 1890, 1990, 1890, 2012)), .Names = c("name", "year"), row.names = c(NA, -6L), class = "data.frame")
ssa_demo <- structure(list(name = c("jane", "jane", "john", "john", "madison", "madison"), year = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L), female = c(372, 771, 56, 81, 0, 1407), male = c(0, 8, 8502, 29066, 14, 145)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("name", "year", "female", "male"))
names_to_encode$name <- as.character(names_to_encode$name)
names_to_encode$year <- as.integer(names_to_encode$year)
tmp <- left_join(ssa_demo,names_to_encode) %.%
  group_by(year,name) %.%
  summarise(female = sum(female),
            male = sum(male)) %.%
  mutate(proportion_male = round((male / (male + female)), digits = 4),
         proportion_female = round((female / (male + female)), digits = 4)) %.%
  mutate(gender = ifelse(proportion_female == 0.5,"either",
                         ifelse(proportion_female > 0.5,"female","male")))
Note that dplyr 0.1.1 is still a little finicky about the types of join columns, so I had to convert them. I think I saw some activity on GitHub suggesting that this was either fixed in the dev version or at least being worked on.
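On the original question of passing the current year along: in the ddply call, y = select_ssa(year) is an ordinary argument that is evaluated once in the calling environment, so it never sees the year of the piece being processed. The usual way around this is to give the split step a function of the piece and read the year from the piece inside it (with ddply, an anonymous function(piece) in place of merge). The base R sketch below shows the pattern with split and lapply; select_ssa_base is a hypothetical single-year stand-in for select_ssa, written without dplyr so the example runs on its own.

```r
ssa_demo <- data.frame(
  name   = c("jane", "jane", "john", "john", "madison", "madison"),
  year   = c(1890L, 1990L, 1890L, 1990L, 1890L, 1990L),
  female = c(372, 771, 56, 81, 0, 1407),
  male   = c(0, 8, 8502, 29066, 14, 145),
  stringsAsFactors = FALSE)
names_to_encode <- data.frame(
  names = c("john", "john", "jane", "jane", "madison", "madison"),
  year  = c(1890, 1990, 1890, 1990, 1890, 2012),
  stringsAsFactors = FALSE)

# Hypothetical single-year stand-in for select_ssa()
select_ssa_base <- function(yr) {
  sel <- ssa_demo[ssa_demo$year == yr, ]
  sel$proportion_female <- round(sel$female / (sel$female + sel$male), 4)
  sel$gender <- ifelse(sel$proportion_female == 0.5, "either",
                       ifelse(sel$proportion_female > 0.5, "female", "male"))
  sel[c("name", "proportion_female", "gender")]
}

# The function receives one piece per year and reads the year from it
pieces <- lapply(split(names_to_encode, names_to_encode$year), function(piece) {
  merge(piece, select_ssa_base(piece$year[1]),
        by.x = "names", by.y = "name", all.x = TRUE)
})
encoded <- do.call(rbind, pieces)
rownames(encoded) <- NULL
encoded
```

Years with no Social Security data, such as 2012 here, come through with NA for the gender because of all.x = TRUE.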

lattice::xyplot for multiple lines from quantile regression output

This is a data.frame whose third "column" is in fact a matrix:
pred.Alb <- structure(list(Age =
c(20, 30, 40, 50, 60, 70, 80, 20, 30, 40,
50, 60, 70, 80), Sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Male", "Female"),
class = "factor"),
pred = structure(c(4.34976914720261, 4.3165897157342, 4.2834102842658,
4.23952109360855, 4.15279286619591, 4.05535487959442, 3.95791689299294,
4.02417706540447, 4.05661037005163, 4.08904367469879, 4.0942071858864,
3.9902915232358, 3.85910606712565, 3.72792061101549, 4.37709246711838,
4.38914906337186, 4.40120565962535, 4.3964228776405, 4.32428258270227,
4.23530290952571, 4.14632323634915, 4.3, 4.3, 4.3, 4.28809523809524,
4.22857142857143, 4.15714285714286, 4.08571428571429, 4.59781730640631,
4.59910124381436, 4.60038518122242, 4.58132673532165, 4.48089875618564,
4.36012839374081, 4.23935803129598, 4.39298701298701, 4.39711229946524,
4.40123758594347, 4.39484310896076, 4.34636957813428, 4.28737628384687,
4.22838298955946), .Dim = c(14L, 3L), .Dimnames = list(c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14"), c("tau= 0.10", "tau= 0.25", "tau= 0.50")))),
.Names = c("Age", "Sex", "pred"), out.attrs =
structure(list(dim = structure(c(7L, 2L), .Names = c("Age", "Sex")),
dimnames = structure(list(Age = c("Age=20",
"Age=30", "Age=40", "Age=50", "Age=60", "Age=70", "Age=80"),
Sex = c("Sex=Male", "Sex=Female")),
.Names = c("Age", "Sex"))),
.Names = c("dim", "dimnames")), row.names = c(NA, -14L),
class = "data.frame")
It was created with this code:
require(rms) # also loads Hmisc
require(quantreg) # might also get loaded by rms
rqAlb10fit2 <- rq(BL_ALBUMIN ~ rcs(Age,3) *Sex , data=redBan,
tau= c(0.1, 0.25, 0.5) )
pred.Alb <- expand.grid(Age=seq(20,80,by=10), Sex=c("Male", "Female") )
pred.Alb$pred <- predict(rqAlb10fit2,
newdata=expand.grid(Age=seq(20,80,by=10), Sex=c("Male", "Female") ) )
I would like to have a line plot of the predictions by Sex and tau level. I can get a points plot with:
xyplot(pred~Age|Sex, data=pred.Alb, type="p")
When I add type="l", the lines slew back and forth connecting the various levels of tau.
I doubt that it matters, but running on Mac 10.7.5 with quantreg_4.96/rms_3.6-3/Hmisc_3.10-1. If you want to show me a ggplot solution with classic theme, I'm OK with that too, it's just that I am not very good with ggplot2 and Harrell's rms package is mated to lattice.
The problem appears to be that y loses its dimension attribute when it's passed into the panel function, becoming a simple vector. It still goes ahead and plots, recycling x to match y's length, which you can't see with type="p", but can with type="l".
Here is a custom panel function that accomplishes what you want by first converting y back to a matrix and then calling panel.xyplot separately on each of its columns:
panel.matplot <- function(x,y,...) {
  # Restore the matrix shape, then draw one line per column (one per tau)
  y <- matrix(y, nrow=length(x))
  apply(y, 2, function(Y) panel.xyplot(x,Y, ...))
}
xyplot(pred~Age|Sex, data=pred.Alb, type="l", panel=panel.matplot)
BTW: In cases like this, I often find it useful to poke around 'inside' the panel function call. A simple way to do this is to construct a dummy panel function containing a browser() call. Here, for example, is how I discovered the problem in this case:
xyplot(pred~Age|Sex, data=pred.Alb, type="l",
       panel = function(x,y,...) browser())
Browse[2]> x
# [1] 20 30 40 50 60 70 80
Browse[2]> y
# [1] 4.349769 4.316590 4.283410 4.239521 4.152793 4.055355 3.957917 4.377092
# [9] 4.389149 4.401206 4.396423 4.324283 4.235303 4.146323 4.597817 4.599101
# [17] 4.600385 4.581327 4.480899 4.360128 4.239358
... at which point the required fix is both (a) pretty obvious and (b) can be tested out from within the existing browser call.
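The reason matrix(y, nrow=length(x)) recovers the columns is that R stores matrices column by column, so dropping the dimensions leaves the columns laid end to end. This small sketch uses made-up numbers rather than the quantile regression predictions, and the names pred, y and y_mat are illustrative.

```r
x <- c(20, 30, 40)                    # stand-in for Age within one panel
pred <- cbind("tau= 0.10" = c(1, 2, 3),
              "tau= 0.50" = c(4, 5, 6))

# Dropping the dim attribute leaves the columns end to end
y <- as.vector(pred)
y
# 1 2 3 4 5 6

# Re-folding with nrow = length(x) gives back one column per tau
y_mat <- matrix(y, nrow = length(x))
identical(y_mat, unname(pred))
# TRUE
```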
You can do this by reshaping to long and using the groups argument to xyplot:
pred2 <- as.data.frame(pred.Alb$pred)
varying=names(pred2)
pred2$Age <- pred.Alb$Age
pred2$Sex <- pred.Alb$Sex
pred2.long <- reshape(pred2, direction='long', varying=varying, sep='= ')
xyplot(tau~Age|Sex, data=pred2.long, type="l", groups=time)
