Finding max of column by group with condition - r

I have a data frame like this:
For each Gill, I would like to find the maximum Time for which the Diametre is different from 0. I have tried the aggregate function and the dplyr package, but neither worked. A combination of for, if and aggregate would probably work, but I did not find how to do it.
I'm not sure of the best way to approach this. I'd appreciate any help.

After grouping by 'Gill', subset the 'Time' where 'Diametre' is not 0 and get the max (assuming 'Time' is of numeric class):
library(dplyr)
df1 %>%
  group_by(Gill) %>%
  summarise(Time = max(Time[Diametre != 0]))
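Note that if some Gill has only zero diameters, max() on an empty vector returns -Inf with a warning. A guarded variant (just a sketch, reusing the same column names) would be:
library(dplyr)
df1 %>%
  group_by(Gill) %>%
  summarise(Time = if (any(Diametre != 0)) max(Time[Diametre != 0]) else NA_real_)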

Here is how you can use aggregate:
> df <- data.frame(
    Gill = rep(1:11, each = 2),
    diameter = c(0, 0, 1, 0, 0, 0, 73.36, 80.08, 1, 25.2, 53.48, 61.21, 28.8,
                 28.66, 71.2, 80.25, 44.55, 53.50, 60.91, 0, 11, 74.22),
    time = 0.16
  )
> df
Gill diameter time
1 1 0.00 0.16
2 1 0.00 0.16
3 2 1.00 0.16
4 2 0.00 0.16
5 3 0.00 0.16
6 3 0.00 0.16
7 4 73.36 0.16
8 4 80.08 0.16
9 5 1.00 0.16
10 5 25.20 0.16
11 6 53.48 0.16
12 6 61.21 0.16
13 7 28.80 0.16
14 7 28.66 0.16
15 8 71.20 0.16
16 8 80.25 0.16
17 9 44.55 0.16
18 9 53.50 0.16
19 10 60.91 0.16
20 10 0.00 0.16
21 11 11.00 0.16
22 11 74.22 0.16
> # Remove diameter == 0 before aggregate
> dfnew <- df[df$diameter != 0, ]
> aggregate(dfnew$time, list(dfnew$Gill), max)
Group.1 x
1 2 0.16
2 4 0.16
3 5 0.16
4 6 0.16
5 7 0.16
6 8 0.16
7 9 0.16
8 10 0.16
9 11 0.16
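For reference, the same computation can be written with aggregate()'s formula interface, which names the output columns automatically (a sketch, reusing dfnew from above):
# Equivalent formula-interface call; output columns are named Gill and time
aggregate(time ~ Gill, data = dfnew, FUN = max)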

I would use a different approach than the elegant solution that akrun suggested; this method creates the MaxTime column that you show in your image.
# This will split your df into a list of data frames, one per Gill:
list.df <- split(df1, df1$Gill)
Then you can use lapply to find the maximum Time for each Gill and store it in a new column called MaxTime. Note that the result of lapply must be assigned back, and mutate() needs dplyr:
library(dplyr)
list.df <- lapply(list.df, function(x) mutate(x, MaxTime = max(x$Time[x$Diametre != 0])))
Then you can combine these split data frames back together using bind_rows():
df1 <- bind_rows(list.df)
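For comparison, the same MaxTime column can be built without splitting at all, using a grouped mutate() (a sketch, assuming the Gill, Time and Diametre columns used above):
library(dplyr)
df1 <- df1 %>%
  group_by(Gill) %>%
  mutate(MaxTime = max(Time[Diametre != 0])) %>%   # max Time with nonzero Diametre per Gill
  ungroup()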

Related

Create matrix from dataset in R

I want to create a matrix from my data. My data consists of two columns, date and my observations for each date. I want the matrix to have year as rows and days as columns, e.g.:
17 18 19 20 ... 31
1904 x11 x12 ...
1905
1906
.
.
.
2019
The days in this case is for December each year. I would like missing values to equal NA.
Here's a sample of my data:
> head(cdata)
# A tibble: 6 x 2
Datum Snödjup
<dttm> <dbl>
1 1904-12-01 00:00:00 0.02
2 1904-12-02 00:00:00 0.02
3 1904-12-03 00:00:00 0.01
4 1904-12-04 00:00:00 0.01
5 1904-12-12 00:00:00 0.02
6 1904-12-13 00:00:00 0.02
I figured that the first thing I need to do is split the date into year, month and day (YYYY-MM-DD format), so I did that, got rid of the date column (the one that says Datum), and also got rid of the irrelevant days, namely the ones < 17.
cd <- cdata %>%
  dplyr::mutate(year = lubridate::year(Datum),
                month = lubridate::month(Datum),
                day = lubridate::day(Datum))
cd <- select(cd, -c(Datum))
cu <- cd[which(cd$day > 16
               & cd$day < 32
               & cd$month == 12), ]
and now it looks like this:
> cu
# A tibble: 1,284 x 4
Snödjup year month day
<dbl> <dbl> <dbl> <int>
1 0.01 1904 12 26
2 0.01 1904 12 27
3 0.01 1904 12 28
4 0.12 1904 12 29
5 0.12 1904 12 30
6 0.15 1904 12 31
7 0.07 1906 12 17
8 0.05 1906 12 18
9 0.05 1906 12 19
10 0.04 1906 12 20
# … with 1,274 more rows
Now I need to fit my data into a matrix with missing values as NA. Is there any way to do this?
Base R approach, using by.
r <- `colnames<-`(do.call(rbind, by(dat, substr(dat$date, 1, 4), function(x) x[, 2])), 1:31)
r[,17:31]
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 1904 -0.28 -2.66 -2.44 1.32 -0.31 -1.78 -0.17 1.21 1.90 -0.43 -0.26 -1.76 0.46 -0.64 0.46
# 1905 1.44 -0.43 0.66 0.32 -0.78 1.58 0.64 0.09 0.28 0.68 0.09 -2.99 0.28 -0.37 0.19
# 1906 -0.89 -1.10 1.51 0.26 0.09 -0.12 -1.19 0.61 -0.22 -0.18 0.93 0.82 1.39 -0.48 0.65
Toy data
set.seed(42)
dat <- do.call(rbind, lapply(1904:1906, function(x)
data.frame(date=seq(ISOdate(x, 12, 1, 0), ISOdate(x, 12, 31, 0), "day" ),
value=round(rnorm(31), 2))))
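Note that the by() approach assumes every year contains all 31 December days, which holds for this toy data but not for the question's data. A variant that tolerates missing days (a sketch, assuming the same date and value columns) fills an NA matrix by name:
yrs  <- format(dat$date, "%Y")                       # year of each observation
days <- format(dat$date, "%d")                       # day of month, zero-padded
m <- matrix(NA_real_, nrow = length(unique(yrs)), ncol = 31,
            dimnames = list(sort(unique(yrs)), sprintf("%02d", 1:31)))
m[cbind(yrs, days)] <- dat$value                     # fill only observed cells
m[, 17:31]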
You can try:
library(dplyr)
library(tidyr)
cdata %>%
  mutate(year = lubridate::year(Datum),
         month = lubridate::month(Datum),
         day = lubridate::day(Datum)) %>%
  filter(month == 12, day >= 17) %>%
  complete(year, day = 17:31) %>%
  select(year, day, Snödjup) %>%
  pivot_wider(names_from = day, values_from = Snödjup)

R group values in column based on intervals and average the results for each interval

I have two tables
table 1:
Dates_only <- data.frame(ID=c('1118','1118','1118','1118','1118',
'1118','1118','1118','1119','1119',
'1119','1119','1119','1119','1119',
'1119','13PP','13PP','13PP','13PP',
'13PP','13PP','13PP','13PP'),
Quart_y=c('2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2'),
Quart=c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
and table 2:
Values <- data.frame(ID=c('1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP'),
Day=c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
Value=c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
3.4,5.4,3.2,3,5,2.9,2.9))
What I am trying to do is find a way to change the values in Values$Day according to Dates_only$Quart.
Specifically, Dates_only$Quart represents quantified quarters (2017Q3 = 0.25, 2017Q4 = 0.50, ..., 2018Q4 = 1.50, etc.), while Values$Day represents quantified days.
I want to reclassify Values$Day by quarter instead, for example:
for 0 <= Values$Day <= 0.25, Values$Day becomes 0.25; for 0.25 < Values$Day <= 0.50, it becomes 0.50; and so on.
What I tried is the method below, but it comes up with an error message:
unique_quarters <- unique(Dates_only$Quart)
unique_quarters <- append(unique_quarters, 0, after=0)
df3 <- transform(Dates_only,
Transf_Day=Values$Quart[findInterval(Values$Day, unique_quarters)])
The issue, I guess, is that findInterval(Values$Day, unique_quarters) returns
1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 5 4 5
while the unique values of Dates_only$Quart are
0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
Try this:
library(tidyverse)
as.tbl(Values) %>%
  mutate(Int = cut(Day, seq(0, 3, 0.25), include.lowest = TRUE)) %>%
  mutate(Int2 = factor(Int, labels = seq(0.25, 1.25, 0.25)))
# A tibble: 30 x 5
ID Day Value Int Int2
<fctr> <dbl> <dbl> <fctr> <fctr>
1 1118 0.00 7.6 [0,0.25] 0.25
2 1119 0.00 6.2 [0,0.25] 0.25
3 13PP 0.00 6.8 [0,0.25] 0.25
4 1118 0.14 7.1 [0,0.25] 0.25
5 1119 0.13 6.2 [0,0.25] 0.25
6 13PP 0.13 5.9 [0,0.25] 0.25
7 1118 0.20 6.8 [0,0.25] 0.25
8 1119 0.23 5.8 [0,0.25] 0.25
9 13PP 0.24 4.6 [0,0.25] 0.25
10 1118 0.27 6.5 (0.25,0.5] 0.5
# ... with 20 more rows
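For completeness, the findInterval() attempt from the question can also be repaired: the breakpoints need a leading 0 and left-open intervals, and the result should index the quarter values themselves. A sketch, assuming Values and Dates_only as defined above:
quarters <- unique(Dates_only$Quart)                        # 0.25, 0.50, ..., 2.00
idx <- findInterval(Values$Day, c(0, quarters), left.open = TRUE)
Values$Day_quarter <- quarters[pmax(idx, 1)]                # pmax maps Day == 0 to 0.25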

How to barplot in R using the first column as data labels

Below is my data, with headers.
Using R, I would like to barplot() this data using the value in the S column as the label.
S Value
10 0.00
20 0.00
30 0.00
40 0.01
50 0.71
60 4.97
70 13.22
80 22.95
90 32.93
100 42.93
I'm scouring the help files, but I can't seem to find an example of this seemingly simple task.
This will quickly solve your problem, but you will then have to add details to set up the layout of the graph:
Your example:
S <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
Value <- c(0.00, 0.00, 0.00, 0.01, 0.71, 4.97, 13.22, 22.95, 32.93, 42.93)
df <- data.frame(S = S, Value = Value)
df
S Value
1 10 0.00
2 20 0.00
3 30 0.00
4 40 0.01
5 50 0.71
6 60 4.97
7 70 13.22
8 80 22.95
9 90 32.93
10 100 42.93
barplot(df$Value, names.arg = df$S)
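From there, the layout details can be passed straight to barplot(); for instance (titles and colours here are purely illustrative):
barplot(df$Value, names.arg = df$S,
        main = "Value by S", xlab = "S", ylab = "Value",
        col = "steelblue", las = 1)   # las = 1 keeps axis labels horizontal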

R script to format datatable to exactly 2 decimal places

I have made a datatable "Event_Table" with 46 rows and 6 columns. At some point I export this to a text file and would like the output of some fields to be truncated to exactly 2 decimal places.
Event_Table[1:34,3:6]=round(Event_Table[1:34,3:6])
Event_Table[36:39,3:6]=format(round(Event_Table[36:39,3:6],2), nsmall=2)
Event_Table[41:46,3:6]=format(round(Event_Table[41:46,3:6],2), nsmall=2)
Lines 1 and 2 produce the desired result, but subsequently running line 3 throws an error:
Error in Math.data.frame(list(CO = c("0", "0", "0.786407766990291", "0", :
non-numeric variable in data frame: CONCONATotal
Why? If I remove line 2, then line 3 runs fine. So something about setting the formatting in one part of the table affects the entire table and prevents a second format command from being possible (even though the formatting is only applied to discrete parts of the table). Any ideas how to avoid this, or how to achieve what is required in a different way?
EDIT:
I should perhaps add that the following code is not quite sufficient:
Event_Table[36:46,3:6]=round(Event_Table[36:46,3:6], digits=2)
Trailing zeros are truncated; i.e., a value of 1 is displayed as "1", not as "1.00", the latter being what is required.
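To illustrate the difference (note that both formatting functions return character, not numeric):
round(1, 2)                      # prints as 1; numeric values carry no trailing zeros
format(round(1, 2), nsmall = 2)  # "1.00" (character)
sprintf("%.2f", 1)               # "1.00" (character)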
EDIT2:
Here is the table:
ChrSize Chr CO NCO NA Total
1 230218 1 4.00 1.00 0 5.00
2 813184 2 6.00 6.00 0 12.00
3 316620 3 2.00 3.00 0 5.00
4 1531933 4 13.00 20.00 0 33.00
5 576874 5 3.00 8.00 0 11.00
6 270161 6 4.00 2.00 0 6.00
7 1090940 7 11.00 5.00 0 16.00
8 562643 8 5.00 9.00 0 14.00
9 439888 9 6.00 3.00 0 9.00
10 745751 10 10.00 6.00 0 16.00
11 666816 11 3.00 7.00 0 10.00
12 1078177 12 11.00 13.00 1 25.00
13 924431 13 7.00 12.00 0 19.00
14 784333 14 5.00 6.00 1 12.00
15 1091291 15 6.00 17.00 0 23.00
16 948066 16 7.00 6.00 0 13.00
17 12071326 TOTAL 103.00 124.00 2 229.00
18 NA Event Lengths: NA NA NA NA
19 NA Min Len 0.00 22.00 0 0.00
20 NA Max Len 14745.00 12524.00 0 14745.00
21 NA Mean Len 2588.00 1826.00 0 2153.00
22 NA Median Len 1820.00 1029.00 0 1322.00
23 NA Chromatids: NA NA NA NA
24 NA 1_chrom 0.00 98.00 2 100.00
25 NA 2_chrom 81.00 22.00 0 103.00
26 NA 3_chrom 14.00 4.00 0 18.00
27 NA 4_chrom 8.00 0.00 0 8.00
28 NA Classe: NA NA NA NA
29 NA 1_1brin 0.00 55.00 0 55.00
30 NA 1_2brins 0.00 43.00 2 45.00
31 NA 2_nonsis 81.00 15.00 0 96.00
32 NA 2_sis 0.00 7.00 0 7.00
33 NA classe_3 14.00 4.00 0 18.00
34 NA classe_4 8.00 0.00 0 8.00
35 NA Fraction of Chromatids: NA NA NA NA
36 NA 1_chrom 0.00 0.79 1 0.44
37 NA 2_chrom 0.79 0.18 0 0.45
38 NA 3_chrom 0.14 0.03 0 0.08
39 NA 4_chrom 0.08 0.00 0 0.03
40 NA Fraction of each Classe: NA NA NA NA
41 NA 1_1brin 0.00 0.44 0 0.24
42 NA 1_2brins 0.00 0.35 1 0.20
43 NA 2_nonsis 0.79 0.12 0 0.42
44 NA 2_sis 0.00 0.06 0 0.03
45 NA classe_3 0.14 0.03 0 0.08
46 NA classe_4 0.08 0.00 0 0.03
I require rows 1-34 formatted without decimals.
And rows 36-46 formatted with precisely 2 decimal places for all values.
EDIT3: The initial data is read sequentially into tables called "data", then a derivative output table "Event_Table" is generated in which I am inserting summaries of various aspects of each "data" table (i.e. totals, means, medians etc). I then sequentially export the "Event_Tables" since these contain the required summary informations for each "data" table.
Here is the start of the code:
# FIRST SET WORKING DIRECTORY WHERE INPUT FILES ARE!
files = list.files(pattern="Events_") # import file names containing the "Events_" string into variable "files"
files1 = length(files) # Count number of files
files2 = read.table(text = files, sep = "_", as.is = TRUE) #Split file names by "_" separator and create table "files2"
for (j in 1:files1) {
  data <- read.table(files[j], header=TRUE) # Import data table number j
# Making derivative dataframes:
Event_Table <- data.frame(matrix(NA, nrow = 46, ncol = 6)) # Creates dataframe of arbitrary size full of NAs
names(Event_Table) <- c("ChrSize","Chr","CO","NCO","NA","Total") # Adds column names to dataframe
Event_Table ["Chr"] = c(1:16, "TOTAL","Event Lengths:","Min Len", "Max Len","Mean Len","Median Len","Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4","Fraction of Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Fraction of each Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4") # Inserts vector 1:16 (numbers 1 to 16) in column 1 of dataframe
Event_Table [1:16,"ChrSize"] = c(230218,813184,316620,1531933,576874,270161,1090940,562643,439888,745751,666816,1078177,924431,784333,1091291,948066)
Event_Table [17,"ChrSize"] =sum(Event_Table [1:16,"ChrSize"])
nE = nrow(data) # Total number of events
Event_Table [17,"Total"] = nrow(data)
Event_Table [19,"Total"] = min(data$len)
Event_Table [20,"Total"] = max(data$len)
Event_Table [21,"Total"] = mean(data$len)   # mean() needs a vector, not a one-column data frame
Event_Table [22,"Total"] = median(data$len)
#More stuff here, etc, then close j loop }
So the Event_Table is set up as a data.frame created from a matrix of NAs.
I then fill it manually with relevant info in relevant grid positions.
I then simply want to format the visual appearance of these fields.
If I am going about this all wrong, please suggest a better way to do it. Thanks!
Here is a proof of concept using 2 rather different data frames:
DF1 <- data.frame(x = rnorm(10), person = rep(LETTERS[1:2], 5))
DF2 <- data.frame(y = 1:10L, result = rep(LETTERS[3:4], 5), alt = rep(letters[3:4], 5))
write.table(DF1, file = "example.csv", sep = ",")
write.table(DF2, file = "example.csv", sep = ",", append = TRUE)
This issues a warning (about column names - no problem) and gives:
x person
1 0.796933543 A
2 1.495800567 B
3 0.359153458 A
4 2.105378598 B
5 0.175455314 A
6 -1.850171347 B
7 -0.87197177 A
8 2.682650638 B
9 1.040676847 A
10 -0.086197042 B
y result alt
1 1 C c
2 2 D d
3 3 C c
4 4 D d
5 5 C c
6 6 D d
7 7 C c
8 8 D d
9 9 C c
10 10 D d
From here you can control the formatting as desired. You may wish to suppress the column names or give more informative ones, and you probably don't want the row numbering either. See ?write.table for all the options.
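As to the "Why?": format() returns character, so after line 2 the affected columns are no longer numeric, and the round() in line 3 fails on them. One workaround is to finish all numeric work first and only then build a character copy for export (a sketch, assuming columns 3:6 of Event_Table start out numeric; the file name is illustrative):
out <- Event_Table
out[1:34, 3:6]  <- lapply(Event_Table[1:34, 3:6],
                          function(x) format(round(x, 0)))             # no decimals
out[36:46, 3:6] <- lapply(Event_Table[36:46, 3:6],
                          function(x) format(round(x, 2), nsmall = 2)) # exactly 2 decimals
write.table(out, "Event_Table.txt", sep = "\t", quote = FALSE, row.names = FALSE)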
It could be a problem similar to Error in Math.data.frame ... non-numeric variable in data frame. Maybe you have commas in your data. If that is not the case, could you show what is in your table?

Categorical Survey Analysis - data structure problems

I am trying to run a probability table for an entire survey. I want to then export these statistics into a csv where each column represents a single question. Each question in my original data is its own column, like so:
print(InternalSurveyPercent)
Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai
I'd like the results to look like this, but for each column.
InternalSurveyPercent$Q1
Q1
1 25%
2 25%
3 25%
4 0%
5 0%
9 25%
I use this function to generate the list (is lapply the right way to do this?)
InternalSurveyPercent = lapply(InternalSurvey, function(x) prop.table(table(x)))
Then I multiply by 100 because it makes graphing my data easier.
InternalSurveyPercent = sapply(InternalSurveyPercent, "*", 100)
I'm not really sure where to go from here. I'm very confused about how the data is being structured at this point.
str(InternalSurveyPercent)
List of 4
$ Q1: table [1:5(1d)] 25.00 25.00 25.00 0.00 0.00 25.00
..- attr(*, "dimnames")=List of 1
.. ..$ x: chr [1:5] "1" "2" "3" "4" ...
Why is it returning a list? Why not a data frame with 4 variables (columns)? Thoughts on where I am going wrong/getting lost?
Thank you!
Folks seem to have different interpretations of the desired output, so I suggest re-framing the question and the desired output with more clarity. Anyhow, here is a data.table solution based on my understanding of the question.
# the data
df <- read.table(text="Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai", header=T, as.is=T)
library(data.table)
# one liner to get the %
setDT(df)[,lapply(.SD, function(x) prop.table(table(x))*100)][]
# Q1 Q2 Q3 Q4
# 1: 25 75 75 25
# 2: 25 25 25 25
# 3: 25 75 75 25
# 4: 25 25 25 25
# If you prefer to stitch the result table onto the original, you could:
df2 <- setDT(df)[, lapply(.SD, function(x) prop.table(table(x))*100)]
df[, paste0("Q", 1:4, "%") := df2][]
# Q1 Q2 Q3 Q4 Q1% Q2% Q3% Q4%
# 1: 1 3 2 Mazda 25 75 75 25
# 2: 2 3 4 Ford 25 25 25 25
# 3: 3 5 2 Toyota 25 75 75 25
# 4: 9 3 2 Hyundai 25 25 25 25
This may be helpful. I am guessing that you have six options in Q1-Q3 (i.e., 1, 2, 3, 4, 5 and 9). But Q4 is a different kind of question, in that it does not share those options. Therefore, you will see ten rows in the outcome.
devtools::install_github("hadley/tidyr")
library(tidyr)
# I am following your idea with data provided by #LyzandeR
ana <- lapply(InternalSurvey, function(x) prop.table(table(x)))
bob <- data.frame(t(unnest(lapply(ana, as.data.frame.list))), stringsAsFactors = FALSE)
bob <- replace(bob, is.na(bob), 0)
colnames(bob) <- gsub("X", "Q", colnames(bob))
# Q1 Q2 Q3 Q4
#X1 0.25 0.00 0.00 0.00
#X2 0.25 0.00 0.75 0.00
#X3 0.25 0.75 0.00 0.00
#X9 0.25 0.00 0.00 0.00
#X5 0.00 0.25 0.00 0.00
#X4 0.00 0.00 0.25 0.00
#Ford 0.00 0.00 0.00 0.25
#Hyundai 0.00 0.00 0.00 0.25
#Mazda 0.00 0.00 0.00 0.25
#Toyota 0.00 0.00 0.00 0.25
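For a modern tidyverse route to the per-question percentage tables the question asks for, something like this sketch (assuming the same InternalSurvey data frame) keeps the answer labels alongside the percentages:
library(dplyr)
library(tidyr)
InternalSurvey %>%
  mutate(across(everything(), as.character)) %>%   # make all questions comparable types
  pivot_longer(everything(), names_to = "question", values_to = "answer") %>%
  count(question, answer) %>%
  group_by(question) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()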
