subsample random rows of tibble

Suppose I have two data objects, df.A and df.B.
df.A <- structure(list(Species = structure(c(7L, 7L, 1L, 1L, 1L, 1L,
4L, 6L, 5L, 5L), .Label = c("Carcharhinus leucas", "Carcharhinus limbatus",
"Carcharhinus perezi", "Galeocerdo cuvier", "Ginglymostoma cirratum",
"Hypanus americanus", "Negaprion brevirostris", "Sphyrna mokarran"
), class = "factor"), Sex = structure(c(1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor")), row.names = c(NA,
10L), class = "data.frame")
> class(df.A)
[1] "data.frame"
df.B <- structure(list(Diel.phase = structure(c(2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L), .Label = c("Day", "Night"), class = "factor"),
Season = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L,
2L), .Label = c("Summer", "Winter"), class = "factor")), row.names = c(NA,
-10L), groups = structure(list(.rows = structure(list(1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")), class = c("rowwise_df", "tbl_df", "tbl",
"data.frame"))
> class(df.B)
[1] "rowwise_df" "tbl_df" "tbl" "data.frame"
Let's say I want to subsample 2 rows from each object. The code below works for df.A but not for df.B: for df.B, all rows are returned.
df.B %>% slice_sample(n=2)
Can someone explain this result? And how can I apply slice_sample to an object of class(df.B) without converting it back to a plain data.frame first?

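The grouping influences how the tibble is treated. df.B is a rowwise_df, so every row is its own group, and slice_sample(n=2) samples up to 2 rows per group. Each group has only one row, so every row comes back.
Remove the grouping first:
df.B %>% ungroup() %>% slice_sample(n=2)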
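
A minimal sketch of the difference, assuming dplyr 1.0.0 or later. The tibble row_wise is a hypothetical stand-in for df.B, not the asker's data:

```r
library(dplyr)

# Hypothetical stand-in for df.B: rowwise() makes every row its own group.
row_wise <- tibble(Diel.phase = rep(c("Day", "Night"), 5)) %>% rowwise()

# Sampling is done per group, and each group holds a single row,
# so this is the call that returned every row in the question.
per_group <- row_wise %>% slice_sample(n = 2)

# After ungroup(), the 2 rows are drawn from the whole table.
overall <- row_wise %>% ungroup() %>% slice_sample(n = 2)

nrow(per_group)
nrow(overall)
```

The result of the ungrouped call is an ordinary tibble, so no conversion to a plain data.frame is needed.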

Related

Percentages in the wrong position in ggplot2

I'm trying to plot a graph for a Likert test using ggplot2, and I would like the percentage values to appear on the graph. I've created a df with all the averages and percentages so I can write them on the graph. Everything seems to work, except that the values are placed as if they were upside down.
This is the code I'm using:
example <- structure(list(grupo = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("EJA",
"REG"), class = "factor"), nivel = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("CINCO", "DOZE", "NOVE"), class = "factor"), tipo = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 3L), .Label = c("COR", "PAD", "RES"), class = "factor"),
likert = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("0",
"1", "2", "3"), class = c("ordered", "factor")), cnt = c(3L,
1L, 3L, 5L, 3L, 1L, 3L, 6L, 2L, 1L, 10L, 5L, 5L, 9L, 11L,
6L, 4L, 10L, 10L, 10L), freq = c(0.25, 0.083, 0.25, 0.417,
0.231, 0.077, 0.231, 0.462, 0.154, 0.077, 0.769, 0.167, 0.167,
0.3, 0.367, 0.2, 0.133, 0.333, 0.333, 0.333), prop = c(25,
8.3, 25, 41.7, 23.1, 7.7, 23.1, 46.2, 15.4, 7.7, 76.9, 16.7,
16.7, 30, 36.7, 20, 13.3, 33.3, 33.3, 33.3), proptext = c("25",
"8.3", "25", "41.7", "23.1", "7.7", "23.1", "46.2", "15.4",
"7.7", "76.9", "16.7", "16.7", "30", "36.7", "20", "13.3",
"33.3", "33.3", "33.3")), row.names = c(NA, -20L), groups = structure(list(
grupo = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("EJA",
"REG"), class = "factor"), nivel = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("CINCO", "DOZE", "NOVE"), class = "factor"),
tipo = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("COR",
"PAD", "RES"), class = "factor"), .rows = structure(list(
1:4, 5:8, 9:11, 12:15, 16:19, 20L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
ggplot(example, aes(x = interaction(grupo, nivel, tipo), y = prop, fill = likert)) +
  geom_col() +
  # scale_y_continuous(labels = percent) +
  coord_flip() +
  ggtitle("Testing") +
  xlab("A, B, and C") +
  ylab("%") +
  geom_text(aes(label = proptext), size = 2, colour = "black")
Does anyone have an idea of how I could solve this?

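geom_col() stacks the bars by default, but geom_text() draws each label at its raw y value, so the labels need position_stack(vjust = .5) to land in the middle of their segments. geom_text may also need the x and y aesthetics: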
library(dplyr)
library(tidyr)
library(ggplot2)
example %>%
  unite(new, grupo, nivel, tipo, sep = ".") %>%
  ggplot(aes(x = new, fill = likert)) +
  geom_col(aes(y = prop)) +
  geom_text(aes(x = new, y = prop, label = proptext),
            position = position_stack(vjust = .5)) +
  coord_flip() +
  # scale_y_continuous(labels = percent) +
  ggtitle("Testing") +
  xlab("A, B, and C") +
  ylab("%")

How to reduce a data frame by grouping data?

:)
Is there an easy way to group a data set into a reduced data frame based on certain characteristics? I was thinking of writing an algorithm for this, but is there a function in R that can do it? I've tried using dplyr, but it didn't work very well...
P.S.: My data is in a matrix of more than 1 GB, so I need a more automatic process.
Example Data:
structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(3L, 4L, 5L, 6L, 1L, 2L), .Label = c("11:34",
"11:36", "8:50", "8:52", "8:54", "8:56"), class = "factor"),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), class = "data.frame", row.names = c(NA, -6L))

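You can use summarise from the dplyr package. Time has to be a date-time before it can be subtracted (it is a factor in the question's data):
library(dplyr)
your_data_frame %>%
  # Parse the "H:MM" labels so that max() and min() are defined
  mutate(Time = as.POSIXct(as.character(Time), format = "%H:%M")) %>%
  group_by(User, Event) %>%
  summarise(Duration = max(Time) - min(Time))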
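
As a self-contained check of the summarise approach, here is a sketch on the question's six rows. The name events, the character (rather than factor) Time column, and the fixed UTC time zone are assumptions made for the illustration:

```r
library(dplyr)

# The question's example data, with Time kept as plain "H:MM" strings.
events <- data.frame(
  Nun = 1:6,
  Event = c(1L, 1L, 1L, 1L, 2L, 2L),
  Time = c("8:50", "8:52", "8:54", "8:56", "11:34", "11:36"),
  User = c("U1", "U1", "U1", "U1", "U7", "U7")
)

durations <- events %>%
  # Parse the clock times; the date part defaults to today and cancels out.
  mutate(Time = as.POSIXct(Time, format = "%H:%M", tz = "UTC")) %>%
  group_by(User, Event) %>%
  # difftime() with explicit units keeps the result in minutes.
  summarise(Duration = difftime(max(Time), min(Time), units = "mins"),
            .groups = "drop")

durations
```

Each User/Event pair collapses to one row whose Duration is the span between its first and last timestamp.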

Here is the data.table way.
Example Data:
x<-structure(list(Nun = 1:6, Event = c(1L, 1L, 1L, 1L, 2L, 2L),
Time = structure(c(1508514600, 1508514720, 1508514840, 1508514960,
1508524440, 1508524560), class = c("POSIXct", "POSIXt"), tzone = ""),
User = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("U1",
"U7"), class = "factor")), .Names = c("Nun", "Event", "Time",
"User"), row.names = c(NA, -6L), class = "data.frame")
Code:
library(data.table)
setDT(x)
x[, list(Duration = max(Time) - min(Time)), by = list(Event, User)]
   Event User Duration
1:     1   U1   6 mins
2:     2   U7   2 mins

How can I use R to fill rows based on column?

I have the following table:
Code  Name     Class
1
2     Monday   day
5     green    color
9
6
1     red      color
1
2
9     Tuesday  day
6
5
The goal is to fill the Name and Class columns based on the Code column of a filled row. For example, the second row is filled and its code is 2. I would like to fill all the rows where Code = 2 with Name = Monday and Class = day.
I tried using fill() from tidyr, but that seems to require ordered data.
structure(list(Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L,
5L), Name = structure(c(1L, 3L, 2L, 1L, 1L, 4L, 1L, 1L, 5L, 1L,
1L), .Label = c("", "green", "Monday", "red", "Tuesday"), class = "factor"),
Class = structure(c(1L, 3L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L,
1L), .Label = c("", "color", "day"), class = "factor")), .Names = c("Code",
"Name", "Class"), class = "data.frame", row.names = c(NA, -11L
))

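library(dplyr)
# Join every row to the filled-in row that shares its Code, then keep
# Code plus the joined Name and Class (columns 1, 4 and 5).
final_df <- left_join(df, df[df$Name != '', ], by = 'Code')[, c(1, 4:5)]
colnames(final_df) <- colnames(df)
final_df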
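
An alternative sketch, not the answer's method: group by Code and copy the first non-empty value within each group. It assumes the columns are character rather than factor, and that a Code with several filled rows should take the first one. The name filled is hypothetical:

```r
library(dplyr)

# The question's table, with Name and Class as character columns.
df <- data.frame(
  Code = c(1L, 2L, 5L, 9L, 6L, 1L, 1L, 2L, 9L, 6L, 5L),
  Name = c("", "Monday", "green", "", "", "red", "", "", "Tuesday", "", ""),
  Class = c("", "day", "color", "", "", "color", "", "", "day", "", ""),
  stringsAsFactors = FALSE
)

filled <- df %>%
  group_by(Code) %>%
  mutate(
    # First non-empty value in the group; NA when the group has none.
    Name = first(Name[Name != ""]),
    Class = first(Class[Class != ""])
  ) %>%
  ungroup()

filled
```

Unlike the join, this keeps exactly one output row per input row even when a Code has more than one filled row. Codes with no filled row (6 here) end up as NA.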

Select observations in R based on maximum number listed in a column

I hope I've done this correctly! I have two data frames:
teachers = structure(list(Teacher = c(123L, 123L, 123L, 123L, 124L),
tStudents = c(3L, 3L, 4L, 3L, 4L), Term = c(1801L, 1802L, 1801L, 1803L, 1802L),
Course = structure(c(5L, 6L, 7L, 6L, 8L), .Label = c("ENGG",
"ENGG2", "LITT", "LITT2", "MATH", "MATH2", "PHYS", "SCIE"
), class = "factor")), .Names = c("Teacher", "tStudents", "Term", "Course"), row.names = c(NA, 5L), class = "data.frame")
enrols = structure(list(UniqueStudent = structure(c(3L, 2L, 1L, 5L, 4L),
.Label = c("1801-ENGG-N1-abcd1#abc.edu.au", "1801-MATH-C1-abcd1#abc.edu.au","1801-PHYS-L1-abcd1#abc.edu.au", "1802-MATH2-G1-abcd1#abc.edu.au", "1802-SCIE-K2-abcd1#abc.edu.au"), class = "factor"), Term = c(1801L,1801L, 1801L, 1802L, 1802L), Student.Email.Addresses = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "abcd1#abc.edu.au", class = "factor"), ID = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "s12344", class = "factor"),
Gender.Description = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "M", class = "factor"),
Age = c(12L, 12L, 12L, 12L, 12L), Program.Short.Description = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "LSC1", class = "factor"), Term.CC.CN = structure(c(3L,
2L, 1L, 5L, 4L), .Label = c("1801-ENGG-N1", "1801-MATH-C1",
"1801-PHYS-L1", "1802-MATH2-G1", "1802-SCIE-K2"), class = "factor"),
Course.Code = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("ENGG",
"MATH", "MATH2", "PHYS", "SCIE"), class = "factor"), Class.Number = structure(c(4L,
1L, 5L, 3L, 2L), .Label = c("C1", "G1", "K2", "L1", "N1"), class = "factor"),
Teacher = c(123L, 123L, 125L, 124L, 123L)), .Names = c("UniqueStudent", "Term", "Student.Email.Addresses", "ID", "Gender.Description", "Age", "Program.Short.Description", "Term.CC.CN", "Course.Code", "Class.Number", "Teacher"), row.names = c(NA, 5L), class = "data.frame")
teachers$tStudents lists the maximum number of students allowed to be allocated to a teacher per Term and Course. I've also pre-merged the Course enrolments in the "enrols" data to list the Teachers for each course.
So, what I need to do is create class lists from the enrols data using the teachers data, matched by c("Teacher", "Term", "Course"), where each class list holds at most the number of students given in teachers$tStudents. Ideally, I'd also like to select a representative distribution of students, so that the new class lists include both genders, different ages, and different values of Program.Short.Description.
I've tried merging in different ways in dplyr and can create full lists with all students but haven't been able to use the teachers$tStudents column to limit the number of observations to select. Is this possible?
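
One possible sketch of the capping step only, assuming enrols$Course.Code corresponds to teachers$Course. The data below is hypothetical toy data, not the question's, and the sketch does not attempt the representative distribution across gender, age, or program:

```r
library(dplyr)

# Hypothetical caps: at most tStudents students per Teacher/Term/Course.
teachers <- data.frame(
  Teacher = c(123L, 124L),
  Term = c(1801L, 1802L),
  Course = c("MATH", "SCIE"),
  tStudents = c(2L, 1L)
)

# Hypothetical enrolments; Course.Code is assumed to match teachers$Course.
enrols <- data.frame(
  ID = c("s1", "s2", "s3", "s4", "s5"),
  Teacher = c(123L, 123L, 123L, 124L, 124L),
  Term = c(1801L, 1801L, 1801L, 1802L, 1802L),
  Course.Code = c("MATH", "MATH", "MATH", "SCIE", "SCIE")
)

class_lists <- enrols %>%
  inner_join(teachers, by = c("Teacher", "Term", "Course.Code" = "Course")) %>%
  group_by(Teacher, Term, Course.Code) %>%
  # Shuffle the rows within each class, then keep the first tStudents of them.
  slice_sample(prop = 1) %>%
  filter(row_number() <= tStudents) %>%
  ungroup()

class_lists
```

Shuffling with slice_sample(prop = 1) before the filter makes the kept students a random subset rather than the first ones listed.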

Special characters in a column: mess in the table

I have a problem with special characters in a column of a table.
Here is an example of the data:
structure(list(shipType = structure(c(1L, 3L, 1L, 2L, 4L), .Label = c("CARGO",
"FISHING", "TOWING_LONG_WIDE", "UNKNOWN"), class = "factor"),
shipCargo = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "UNDEFINED", class = "factor"),
destination = structure(c(3L, 1L, 2L, 4L, 5L), .Label = c("\\KORSOR ;.,NA,.\\",
"LEHTMA", "RIGA", "TALLIN", "VYBORG"), class = "factor"),
eta = structure(c(1L, 2L, 5L, 3L, 4L), .Label = c("01/01 00:00 UTC",
"01/01 09:00 UTC", "24/12 16:00 UTC", "26/12 07:00 UTC",
"30/12 16:00 UTC"), class = "factor"), imo = structure(c(3L,
5L, 1L, 4L, 2L), .Label = c("7101891", "7406318", "9066045",
"9158185", "Russia"), class = "factor"), callsign = structure(c(5L,
1L, 2L, 3L, 4L), .Label = c("12", "UALB", "UBYK8", "UFPC",
"UICC"), class = "factor"), country = structure(c(2L, 1L,
2L, 2L, 2L), .Label = c("2014-12-29", "Russia"), class = "factor"),
month = c(12L, 1L, 12L, 12L, 12L), date = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "2014-12-29"), class = "factor"),
week = c(1L, NA, 1L, 1L, 1L), X = c(NA, NA, NA, NA, NA)), .Names = c("shipType",
"shipCargo", "destination", "eta", "imo", "callsign", "country",
"month", "date", "week", "X"), class = "data.frame", row.names = c(NA,
-5L))
As you can see in the second row, the columns after "destination" are shifted (imo holds "Russia" and country holds "2014-12-29") when the file is read with the following code:
data <- read.table(file, header=T, fill=T, sep=",")
I have tried different things, such as: exporting with quotes and without headers
data <- read.table(file, sep=",", fill=T, head=F, quote="")
and then removing the first line (the actual headers that are in the table...) and adding one more time these headers
data <- data[-1,]
colnames(data)<-c( "shipType", "shipCargo","destination","eta","imo","callsign", "country","month","date","week")
It looks better, but there are a lot of special characters, and editing them by hand would be time-consuming and error-prone (I have a lot of tables).
Is there a way to keep the columns from getting messed up when importing the file?
Thank you!
