How to invoke functions on subsets of random samples of data - r

I'm trying to perform a t.test for a specific subset of data. Say I have a data set of 116 birds, and want to find a random sample of 35 birds (non-unique) of the "Species" category. I then want to find the mean of the "Body.Mass" of these random species. Then, I want to invoke a t.test on this sample as representative of the whole data.
I first stored the data in the object "bird". I tried taking the random sample using sample(bird$Species, 35), which yielded 35 random species of bird. Now I can't seem to subset this random sample further to find the mean Body.Mass for each sampled species. I tried to subset using the tidyverse, which is the only approach I know for a problem like this.
library(dplyr)
bird = read.csv("NZBIRDS.csv")
dput(head(bird))
set.seed(20)
sambird = sample(bird$Species,35)
sambird
bmbird <- sambird %>% summarize(avg = mean(Body.Mass))
bmbird
structure(list(Species = c("Grebes", "Grebes", "Petrels", "Petrels",
"Petrels", "Petrels"), Name = c("P. cristatus", "P. rufopectus",
"P. gavia", "P. assimilis", "P. urinatrix", "P. georgicus"),
Extinct = c("No", "No", "Yes", "Yes", "Yes", "No"), Habitat = c("A",
"A", "A", "A", "A", "A"), Nest.Site = c("G", "G", "GC", "GC",
"GC", "GC"), Nest.Density = c("L", "L", "H", "H", "H", "H"
), Diet = c("F", "F", "F", "F", "F", "F"), Flight = c("Yes",
"Yes", "Yes", "Yes", "Yes", "Yes"), Body.Mass = c(1100L,
250L, 300L, 200L, 130L, 120L), Egg.Length = c(57, 43, 57,
54, 38, 39)), .Names = c("Species", "Name", "Extinct", "Habitat",
"Nest.Site", "Nest.Density", "Diet", "Flight", "Body.Mass", "Egg.Length"
), row.names = c(NA, 6L), class = "data.frame")
Error in UseMethod("summarise_") : no applicable method for 'summarise_' applied to an object of class "factor"

It's a bit unclear whether you want to sample from a list of the unique species in the data, or sample rows so that each "Species" type can appear multiple times in the data. If you want to sample from the unique species, you can do:
# Only one species is sampled here because the example data
# contains just two; the same code works when sampling more species.
random_species = sample(unique(bird$Species), 1, replace = FALSE)
bird %>%
  filter(Species %in% random_species) %>%
  group_by(Species) %>%
  summarize(avg = mean(Body.Mass))
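If you instead want to sample rows, so that a species can appear more than once, a minimal base R sketch could look like the one below. Note that sample(bird$Species, 35) returns only a vector of species labels, which is why summarize() fails on it; sampling row indices keeps Body.Mass attached to each sampled bird. The data frame copies the six example rows from the question, replace = TRUE is an assumption needed only because six rows cannot supply 35 distinct ones, and testing the sample against the mean of the full data is just one possible reading of "representative of the whole data".

```r
# Six example rows from the question's dput() output (two columns kept)
bird <- data.frame(
  Species   = c("Grebes", "Grebes", "Petrels", "Petrels", "Petrels", "Petrels"),
  Body.Mass = c(1100L, 250L, 300L, 200L, 130L, 120L),
  stringsAsFactors = FALSE
)

set.seed(20)

# Sample row indices rather than species labels, so every column is kept.
# Assumption: replace = TRUE, because the example has fewer than 35 rows.
rows    <- sample(nrow(bird), 35, replace = TRUE)
sambird <- bird[rows, ]

# Mean body mass per species within the random sample
avg_by_species <- aggregate(Body.Mass ~ Species, data = sambird, FUN = mean)

# One-sample t-test of the sample against the mean of the full data
res <- t.test(sambird$Body.Mass, mu = mean(bird$Body.Mass))
res$p.value
```

With the full 116-row data set, replace = FALSE would draw 35 distinct birds instead.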

Related

Calculate interrater reliability for multiple raters and trials

I have data for a large number of Trials (only three shown here) and ratings by subjects A, B, C, D, and E(many more in the actual data). In each Trial subjects were asked to determine whether event f or event n occurred:
df <- structure(list(Trial = 1:3, Trial_time = c("00:00:00.001", "00:00:00.002",
"00:00:00.003"), A = c("f", "n", "n"), B = c("f", "n", "f"),
C = c("f", "f", "n"), D = c("f", "f", "n"), E = c("f", "f",
"n")), row.names = c(NA, -3L), class = c("tbl_df", "tbl",
"data.frame"))
How can I establish an interrater reliability score for this kind of rating? Help is much appreciated!
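One common choice for several raters assigning nominal categories is Fleiss' kappa. The following is only a sketch of that statistic computed by hand in base R on the three example trials; it assumes every trial was rated by every rater and that there are no missing ratings. Whether Fleiss' kappa is the right measure for the real data is a judgment call, and the irr package provides kappam.fleiss() as a packaged alternative.

```r
# Example ratings from the question (Trial_time omitted)
df <- data.frame(
  Trial = 1:3,
  A = c("f", "n", "n"), B = c("f", "n", "f"),
  C = c("f", "f", "n"), D = c("f", "f", "n"), E = c("f", "f", "n"),
  stringsAsFactors = FALSE
)

ratings    <- as.matrix(df[c("A", "B", "C", "D", "E")])
categories <- sort(unique(as.vector(ratings)))
n_raters   <- ncol(ratings)

# counts[i, j] = number of raters who put trial i into category j
counts <- t(apply(ratings, 1, function(r) table(factor(r, levels = categories))))

# Overall proportion of ratings falling into each category
p_j <- colSums(counts) / sum(counts)

# Observed agreement per trial, then averaged over trials
P_i   <- (rowSums(counts^2) - n_raters) / (n_raters * (n_raters - 1))
P_bar <- mean(P_i)

# Agreement expected by chance
P_e <- sum(p_j^2)

kappa <- (P_bar - P_e) / (1 - P_e)
kappa
```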

Trying to group individual students by their ID# in order to analyze student gender distribution; each student has multiple rows, one for each class taken

I got the aggregate function to work and group individual students by their ID#, but it won't display their gender; it just says NA. I'm new at programming, so if you can explain where I'm going wrong, that would be greatly appreciated. I am trying to analyze student performance and, as a first step, want to see how many males vs. females there are.
here is a sample of the code I am trying:
library(dplyr)
library(tidyr)
rm(list=ls())
fake.data <- data.frame(ID = c(1,1,1,1,2,2,2,3,3,3,4,4,4),
Gender = c("M","M","M","M","F","F",'F',"F","F","F", "M", "M", "M"),
RolledGrade = c("A", "A-", "A", "B", "C", "B", "C", "B", "B", "A", "A", "A-", "A-"),
InstructorCode = c("IO", "ED", "IA", "SA", "BA", "BA", "SA", "IA", "EA", "IO", "ED", "ED", "ED"),
stringsAsFactors = FALSE)
grouped_by_gender<- aggregate(fake.data$Gender, fake.data["ID"], FUN = "mean")
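I am still having trouble understanding what you want as output. Note that Gender is a character column, so mean() cannot average it and returns NA, which is why aggregate shows NA.
Maybe this is what you are looking for, or at least a start to get you where you want:
library(dplyr)
grouped_by_gender <- fake.data %>%
  group_by(ID, Gender) %>%
  tally()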
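If the goal is simply the number of male versus female students, one way to get there is to reduce the data to one row per student before counting. The sketch below uses only base R and the ID and Gender columns from the question, and it assumes each ID always carries the same Gender value.

```r
# ID and Gender columns from the question's fake.data
fake.data <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  Gender = c("M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# One row per student: drop the repeated (ID, Gender) pairs
students <- unique(fake.data[c("ID", "Gender")])

# Count students, not class rows, per gender
gender_counts <- table(students$Gender)
gender_counts
```

Counting on fake.data directly would count class rows rather than students, which is why the deduplication step comes first.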

calculate duration in a complex table

I have a table as shown.
df <- data.frame("name" = c("jack", "william", "david", "john"),
"01-Jan-19" = c(NA,"A",NA,"A"),
"01-Feb-19" = c("A","A",NA,"A"),
"01-Mar-19" = c("A","A","A","A"),
"01-Apr-19" = c("A","A","A","A"),
"01-May-19" = c(NA,"A","A","A"),
"01-Jun-19" = c("A","SA","A","SA"),
"01-Jul-19" = c("A","SA","A","SA"),
"01-Aug-19" = c(NA,"SA","A","SA"),
"01-Sep-19" = c(NA,"SA","A","SA"),
"01-Oct-19" = c("SA","SA","A","SA"),
"01-Nov-19" = c("SA","SA",NA,"SA"),
"01-Dec-19" = c("SA","SA","SA",NA),
"01-Jan-20" = c("SA","M","A","M"),
"01-Feb-20" = c("M","M","M","M"))
Over a period of time, each person progresses through three position categories (from A to SA to M). My objective is to:
Calculate the average duration of the A (assistant) position and the SA (senior assistant) position, i.e. the duration between the date on which a category first appears and the date on which it last appears, regardless of missing data in between.
I converted the data to long format using the tidyr gather function:
df1 <- gather(df, "date", "position", 2:15)
Then I am not sure how best to proceed. What might be the best way to approach this?
We can reshape the data to long format and calculate the number of days between the first date on which the person was "SA" and the first date on which they were "A".
library(dplyr)
library(lubridate)  # provides dmy()
df %>%
  tidyr::pivot_longer(cols = -name, names_to = 'date', values_drop_na = TRUE) %>%
  mutate(date = dmy(date)) %>%
  group_by(name) %>%
  summarise(duration = date[match('SA', value)] - date[match('A', value)])
# name duration
# <fct> <drtn>
#1 david 275 days
#2 jack 242 days
#3 john 151 days
#4 william 151 days
If the mean value is needed, we can pull the duration column and take its mean by adding the following to the above chain:
%>% pull(duration) %>% mean
#Time difference of 204.75 days
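The question defines a position's duration as the time between its first and last appearance. A base R sketch of that reading follows; it is an illustration rather than part of the answer above. It rebuilds the example table with check.names = FALSE so the date headers survive as column names, and it assumes an English locale so that "%b" recognises month abbreviations such as "Jan".

```r
df <- data.frame(
  name = c("jack", "william", "david", "john"),
  "01-Jan-19" = c(NA, "A", NA, "A"),    "01-Feb-19" = c("A", "A", NA, "A"),
  "01-Mar-19" = c("A", "A", "A", "A"),  "01-Apr-19" = c("A", "A", "A", "A"),
  "01-May-19" = c(NA, "A", "A", "A"),   "01-Jun-19" = c("A", "SA", "A", "SA"),
  "01-Jul-19" = c("A", "SA", "A", "SA"), "01-Aug-19" = c(NA, "SA", "A", "SA"),
  "01-Sep-19" = c(NA, "SA", "A", "SA"), "01-Oct-19" = c("SA", "SA", "A", "SA"),
  "01-Nov-19" = c("SA", "SA", NA, "SA"), "01-Dec-19" = c("SA", "SA", "SA", NA),
  "01-Jan-20" = c("SA", "M", "A", "M"), "01-Feb-20" = c("M", "M", "M", "M"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Long format: one row per (name, date), stacked column by column
long <- data.frame(
  name     = rep(df$name, times = ncol(df) - 1),
  date     = as.Date(rep(names(df)[-1], each = nrow(df)), format = "%d-%b-%y"),
  position = unlist(df[-1], use.names = FALSE),
  stringsAsFactors = FALSE
)
long     <- long[!is.na(long$position), ]
long$day <- as.numeric(long$date)  # days since 1970-01-01

# Days between first and last appearance of each position, per person
spans <- aggregate(day ~ name + position, data = long,
                   FUN = function(d) diff(range(d)))
names(spans)[3] <- "days"

# Average span per position, restricted to A and SA
avg <- aggregate(days ~ position, data = spans, FUN = mean)
avg <- avg[avg$position %in% c("A", "SA"), ]
avg
```

Under this definition a person who returns to an earlier position later on (as david does in Jan 2020) gets a longer span for that position.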
data
df <- structure(list(name = c("jack", "william", "david", "john"),
`01-Jan-19` = c(NA, "A", NA, "A"), `01-Feb-19` = c("A", "A",
NA, "A"), `01-Mar-19` = c("A", "A", "A", "A"), `01-Apr-19` = c("A",
"A", "A", "A"), `01-May-19` = c(NA, "A", "A", "A"), `01-Jun-19` = c("A",
"SA", "A", "SA"), `01-Jul-19` = c("A", "SA", "A", "SA"),
`01-Aug-19` = c(NA, "SA", "A", "SA"), `01-Sep-19` = c(NA,
"SA", "A", "SA"), `01-Oct-19` = c("SA", "SA", "A", "SA"),
`01-Nov-19` = c("SA", "SA", NA, "SA"), `01-Dec-19` = c("SA",
"SA", "SA", NA), `01-Jan-20` = c("SA", "M", "A", "M"), `01-Feb-20` = c("M",
"M", "M", "M")), row.names = c(NA, -4L), class = "data.frame")

Mutate_if or mutate_at in dplyr with Dates

I have a data set that has over 100 columns, but for example let's suppose I have a data set that looks like
dput(tib)
structure(list(f_1 = c("A", "O", "AC", "AC", "AC", "O", "A", "AC", "O", "O"), f_2 = c("New", "New",
"New", "New", "Renewal", "Renewal", "New", "Renewal", "New",
"New"), first_dt = c("07-MAY-18", "25-JUL-16", "09-JUN-18", "22-APR-19",
"03-MAR-19", "10-OCT-16", "08-APR-19", "27-FEB-17", "02-MAY-16",
"26-MAY-15"), second_dt = c(NA, "27-JUN-16", NA, "18-APR-19",
"27-FEB-19", "06-OCT-16", "04-APR-19", "27-FEB-17", "25-APR-16",
NA), third_dt = c("04-APR-16", "21-JUL-16", "05-JUN-18", "18-APR-19",
"27-FEB-19", "06-OCT-16", "04-APR-19", "27-FEB-17", "25-APR-16",
"19-MAY-15"), fourth_dt = c("05-FEB-15", "25-JAN-16", "05-JUN-18",
"10-OCT-18", "08-JAN-19", "02-SEP-16", "24-OCT-18", "29-SEP-16",
"27-JAN-15", "14-MAY-15"), fifth_dt = structure(c(1459728000,
1469059200, 1528156800, 1555545600, 1551225600, 1475712000, 1554336000,
1488153600, 1461542400, 1431993600), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), sex = c("M", "M", "F", "F", "M", "F", "F",
"F", "F", "F")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Most of the date columns (those matching ends_with("dt")) are strings, but I want to convert them into dates. I tried mutate_at but received the following:
tib %>% mutate_at(vars(ends_with("dt")), funs(parse_date_time(.))) %>% glimpse()
Error in mutate_impl(.data, dots) :
Evaluation error: argument "orders" is missing, with no default.
Any thoughts on what caused this error? Should I use a different mutate function?
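As akrun noted, one of the columns is already in dttm format. The error itself arises because parse_date_time() was called without its required orders argument. Once the dttm column is dropped and orders is supplied, the following code works for me: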
tib %>%
select(-fifth_dt) %>%
mutate_at(vars(ends_with("dt")), parse_date_time, orders = "%d-%m-%y")
funs() is deprecated. A list of formula-style lambdas can be used in its place:
library(dplyr)
library(lubridate)  # provides parse_date_time()
tib %>%
  mutate_at(3:6, list(~ parse_date_time(., "%d-%m-%y")))
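For comparison, here is a package-free sketch that converts only the columns that both end in "dt" and are still character, so a column that is already a date-time is left alone instead of having to be dropped. It uses a cut-down, three-row version of tib, produces Date rather than POSIXct values, and assumes an English locale so that "%b" matches the month abbreviations.

```r
# Cut-down version of the question's data: two text date columns
# and one column that is already POSIXct
tib <- data.frame(
  f_1       = c("A", "O", "AC"),
  first_dt  = c("07-MAY-18", "25-JUL-16", "09-JUN-18"),
  second_dt = c(NA, "27-JUN-16", NA),
  fifth_dt  = as.POSIXct(c(1459728000, 1469059200, 1528156800),
                         origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Columns whose name ends in "dt" AND that are still stored as text
is_text_date <- grepl("dt$", names(tib)) & vapply(tib, is.character, logical(1))

# Convert just those columns; month names are matched case-insensitively
tib[is_text_date] <- lapply(tib[is_text_date], as.Date, format = "%d-%b-%y")

str(tib)
```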

Extracting data from a list of lists into its own `data.frame` with `purrr`

Representative sample data (list of lists):
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label",
"link", "score")), e = 49.1279871269422), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.934821052832427,
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina",
link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 9L, label = "Nebraska", link = "America/Scoresbysund",
score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id",
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f",
c = "P", d = list(structure(list(id = 8L, label = "Georgia",
link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 2L, label = "Washington", link = "America/Shiprock",
score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 6L, label = "North Dakota", link = "Universal",
score = 1.03168296038975), .Names = c("id", "label",
"link", "score")), structure(list(id = 1L, label = "New Hampshire",
link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id",
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania",
link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id",
"label", "link", "score"))), e = 132.1153538536), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x",
c = "O", d = structure(list(id = 3L, label = "Delaware",
link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id",
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.396243444741009,
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota",
link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 9L, label = "Nebraska", link = "America/Ojinaga",
score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id",
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a",
"b", "c", "d", "e")))
I have a list of lists, by virtue of a JSON data download.
The list has 176 elements, each with 33 nested elements some of which are also lists of varying length.
I am interested in analyzing the data contained in one particular nested list, which has a length of ~150 for each of the 176 elements. Each of its entries has either 4 or 5 elements. I am trying to extract this nested list and convert it into a data.frame so that I can perform some analysis.
In the representative sample data above, I am interested in the nested list d for each of the 5 elements of l. The desired data.frame would therefore look something like:
id label          link            score      externalId
 5 Utah           Asia/Anadyr     -0.2110459         NA
 8 South Carolina Pacific/Wallis   0.5265409  -6.743544
.
.
I've been attempting to use purrr, which appears to have a sensible and consistent flow for processing data in lists, but I am running into errors whose cause I can't fully understand. It could very well be that I don't properly understand the commands/logic of purrr or of lists (likely both). This is the code I've been attempting, along with the error it throws:
df <- map_df(l, "d", ~as.data.frame(.))
Error: incompatible sizes (5 != 4)
I believe this has to do with the differing lengths of d for each component, or perhaps the differing contained data (sometimes 4 elements sometimes 5) or perhaps the function I've used here is misspecified -- truthfully I'm not entirely sure.
I have worked around this by using a for loop, which I know is inefficient and hence my question here on SO.
This is the for loop I currently employ:
df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric())
for(i in seq_along(l)){
df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.))
df <- rbind(df, df_temp)
}
Some assistance preferably with purrr - alternatively some version of apply as this is still superior to my for-loop - would be greatly appreciated. Also if there's a resource for the above I'd like to understand rather than just find the right code.
You can do this in three steps, first pulling out d, then binding the rows within each element of d, and then binding everything into a single object.
I use bind_rows from dplyr for the within-list row binding. map_df does the final row binding.
library(purrr)
library(dplyr)
l %>%
map("d") %>%
map_df(bind_rows)
This is also equivalent:
map_df(l, ~bind_rows(.x[["d"]] ) )
The result looks like:
# A tibble: 12 x 5
      id label          link                      score externalId
   <int> <chr>          <chr>                     <dbl>      <dbl>
 1     5 Utah           Asia/Anadyr          -0.2110459         NA
 2     8 South Carolina Pacific/Wallis        0.5265409  -6.743544
 3     9 Nebraska       America/Scoresbysund  0.2508955  16.425747
 4     8 Georgia        America/Nome          0.5264941   7.915836
 5     2 Washington     America/Shiprock     -0.5551864  15.068666
 6     6 North Dakota   Universal             1.0316830         NA
 7     1 New Hampshire  America/Cordoba       1.2158206   9.727642
 8     1 Alaska         Asia/Istanbul        -0.2318326         NA
 9     4 Pennsylvania   Africa/Dar_es_Salaam  0.5902453         NA
10     3 Delaware       Asia/Samarkand        0.6955771  15.236482
11     4 North Dakota   America/Tortola       1.0306027  -7.216669
12     9 Nebraska       America/Ojinaga      -1.1139800  -8.451451
For more information on purrr, I recommend Grolemund and Wickham's "R for Data Science" http://r4ds.had.co.nz/
I think one issue you are facing is that some of the d items in l are lists of variables with one observation each, ready to be converted to data frames, while other d items are lists of such lists.
But I'm not that good at purrr myself. Here's how I would do it:
library(dplyr)  # for %>% and bind_rows()
d_list <- lapply(l, function(x) x$d)  ## work with the data you need
list_of_observations <- Filter(function(x) !is.null(names(x)), d_list)
list_of_lists <- Filter(function(x) is.null(names(x)), d_list)
another_list_of_observations <- unlist(list_of_lists, recursive = FALSE)
df <- lapply(c(list_of_observations, another_list_of_observations),
             as.data.frame) %>% bind_rows
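The distinction between "a single record" and "a list of records" can also be handled without any packages. The sketch below is one way to do it in base R, shown on a small hypothetical list shaped like the sample data (the values are shortened and are not the original ones): each d is first wrapped so that it is always a list of records, and missing fields such as externalId are filled with NA before row binding.

```r
# Hypothetical, shortened data with the same shape as the question's list
l <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# A named d is a single record; an unnamed d is already a list of records
as_records <- function(d) if (is.null(names(d))) d else list(d)

# Flatten to one list holding every record from every element of l
records <- unlist(lapply(l, function(x) as_records(x$d)), recursive = FALSE)

# Union of all field names, so shorter records can be padded with NA
cols <- unique(unlist(lapply(records, names)))

to_row <- function(r) {
  r[setdiff(cols, names(r))] <- NA
  as.data.frame(r[cols], stringsAsFactors = FALSE)
}

df <- do.call(rbind, lapply(records, to_row))
df
```

On large inputs, do.call(rbind, ...) is slower than bind_rows, so this is mainly useful for seeing what the purrr version does under the hood.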

Resources