Create matrix from dataset in R

I want to create a matrix from my data. My data consists of two columns: the date and my observation for each date. I want the matrix to have years as rows and days as columns, e.g.:
       17   18   19   20  ...  31
1904  x11  x12  ...
1905
1906
 .
 .
 .
2019
The days in this case are for December each year. I would like missing values to equal NA.
Here's a sample of my data:
> head(cdata)
# A tibble: 6 x 2
  Datum               Snödjup
  <dttm>                <dbl>
1 1904-12-01 00:00:00    0.02
2 1904-12-02 00:00:00    0.02
3 1904-12-03 00:00:00    0.01
4 1904-12-04 00:00:00    0.01
5 1904-12-12 00:00:00    0.02
6 1904-12-13 00:00:00    0.02
I figured that the first thing I need to do is split the date into year, month and day (European formatting, YYYY-MM-DD), so I did that, got rid of the date column (the one called Datum), and also dropped the irrelevant days, namely the ones < 17.
cd <- cdata %>%
  dplyr::mutate(year = lubridate::year(Datum),
                month = lubridate::month(Datum),
                day = lubridate::day(Datum)) %>%
  dplyr::select(-Datum)

cu <- cd[which(cd$day > 16
               & cd$day < 32
               & cd$month == 12), ]
and now it looks like this:
> cu
# A tibble: 1,284 x 4
   Snödjup  year month   day
     <dbl> <dbl> <dbl> <int>
 1    0.01  1904    12    26
 2    0.01  1904    12    27
 3    0.01  1904    12    28
 4    0.12  1904    12    29
 5    0.12  1904    12    30
 6    0.15  1904    12    31
 7    0.07  1906    12    17
 8    0.05  1906    12    18
 9    0.05  1906    12    19
10    0.04  1906    12    20
# … with 1,274 more rows
Now I need to fit my data into a matrix with missing values as NA. Is there any way to do this?

A base R approach, using by. Note that rbind-ing the per-year vectors assumes each year has a complete run of 31 December observations, as in the toy data below.
r <- `colnames<-`(do.call(rbind, by(dat, substr(dat$date, 1, 4), function(x) x[, 2])), 1:31)
r[, 17:31]
# 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# 1904 -0.28 -2.66 -2.44 1.32 -0.31 -1.78 -0.17 1.21 1.90 -0.43 -0.26 -1.76 0.46 -0.64 0.46
# 1905 1.44 -0.43 0.66 0.32 -0.78 1.58 0.64 0.09 0.28 0.68 0.09 -2.99 0.28 -0.37 0.19
# 1906 -0.89 -1.10 1.51 0.26 0.09 -0.12 -1.19 0.61 -0.22 -0.18 0.93 0.82 1.39 -0.48 0.65
Toy data:
set.seed(42)
dat <- do.call(rbind, lapply(1904:1906, function(x)
  data.frame(date = seq(ISOdate(x, 12, 1, 0), ISOdate(x, 12, 31, 0), "day"),
             value = round(rnorm(31), 2))))
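The by() approach needs complete years; for the OP's cu, where days can be missing, a minimal sketch (assuming cu has the Snödjup, year and day columns shown above) is to fill a pre-allocated NA matrix by position, so gaps come out as NA:
yrs <- sort(unique(cu$year))
m <- matrix(NA_real_, nrow = length(yrs), ncol = 15,
            dimnames = list(yrs, 17:31))
# row = position of the year, column = day shifted so day 17 is column 1
m[cbind(match(cu$year, yrs), cu$day - 16)] <- cu$Snödjup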

You can try (adding a month filter so only December days are kept, per the question):
library(dplyr)
library(tidyr)
cdata %>%
  mutate(year = lubridate::year(Datum),
         month = lubridate::month(Datum),
         day = lubridate::day(Datum)) %>%
  filter(month == 12, day >= 17) %>%
  complete(day = 17:31) %>%
  select(year, day, Snödjup) %>%
  pivot_wider(names_from = day, values_from = Snödjup)
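Since the OP asked for a matrix rather than a tibble, a possible final step (a sketch, assuming the pivot_wider() result above was assigned to a hypothetical wide):
# drop the year column and use it as rownames instead
m <- as.matrix(wide[, setdiff(names(wide), "year")])
rownames(m) <- wide$year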

Related

How can I calculate the max for these data

Here is a part of my data:
dat <- read.table(text = "
Flower A1 A2 A3 TM   MN B1 B2 B3
F1     12  9 11 12 0.56 19  1 12
F2     11 16 13 13 0.65 22  4 12
F3     10 12 14 11 0.44 29  9 12
", header = TRUE)
I want to calculate a Max column from the column MN. For example, for the value 0.44, the max is max(0.44, 1 - 0.44) = 0.56.
I struggle to get it with a data frame.
Here is the outcome of interest:
Flower  A TM  B   MN  Max
F1     12 12 19 0.56 0.56
F2     11 13 22 0.65 0.65
F3     10 11 29 0.44 0.56
F1      9 12  1 0.56 0.56
F2     16 13  4 0.65 0.65
F3     12 11  9 0.44 0.56
F1     11 12 12 0.56 0.56
F2     13 13 12 0.65 0.65
F3     14 11 12 0.44 0.56
Try the code below:
transform(
  reshape(
    setNames(dat, gsub("(\\d+)", ".\\1", names(dat))),
    direction = "long",
    idvar = c("Flower", "TM", "MN"),
    varying = -c(1, 5, 6)
  ),
  Max = pmax(MN, 1 - MN)
)
which gives
Flower TM MN time A B Max
F1.12.0.56.1 F1 12 0.56 1 12 19 0.56
F2.13.0.65.1 F2 13 0.65 1 11 22 0.65
F3.11.0.44.1 F3 11 0.44 1 10 29 0.56
F1.12.0.56.2 F1 12 0.56 2 9 1 0.56
F2.13.0.65.2 F2 13 0.65 2 16 4 0.65
F3.11.0.44.2 F3 11 0.44 2 12 9 0.56
F1.12.0.56.3 F1 12 0.56 3 11 12 0.56
F2.13.0.65.3 F2 13 0.65 3 13 12 0.65
F3.11.0.44.3 F3 11 0.44 3 14 12 0.56
Using reshape and ave. (The FUN must compare MN against 1 - MN, otherwise F3 would get 0.44 instead of the expected 0.56.)
reshape(dat, varying = list(2:4, 7:9), direction = 'long', idvar = 'Flower') |>
  transform(Max = ave(MN, Flower, FUN = function(x) max(x, 1 - x)))
#      Flower TM   MN time A1 B1  Max
# F1.1     F1 12 0.56    1 12 19 0.56
# F2.1     F2 13 0.65    1 11 22 0.65
# F3.1     F3 11 0.44    1 10 29 0.56
# F1.2     F1 12 0.56    2  9  1 0.56
# F2.2     F2 13 0.65    2 16  4 0.65
# F3.2     F3 11 0.44    2 12  9 0.56
# F1.3     F1 12 0.56    3 11 12 0.56
# F2.3     F2 13 0.65    3 13 12 0.65
# F3.3     F3 11 0.44    3 14 12 0.56
Note: the base pipe |> requires R >= 4.1.
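For comparison, a tidyverse sketch of the same reshape-then-pmax logic (assuming the same dat as above; the .value placeholder splits the A*/B* columns back into two value columns):
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(cols = matches("^[AB][0-9]$"),
               names_to = c(".value", "time"),
               names_pattern = "([AB])([0-9])") %>%
  mutate(Max = pmax(MN, 1 - MN))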

How to output twice in an R pipe?

library(psych)
library(mokken)
library(magrittr)  # provides %>% and %T>%
bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %T>%
  summary %>%
  {.$Hi[.$Hi < 0]}
A1
-0.3873723
The script above works well: I get the final output, but I still want to review the output of summary().
How can I make summary() print its output too in this pipe?
If we want the summary as well, place it in a list:
library(psych)
library(mokken)
library(magrittr)
out <- bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %>%
  {list(summary(.), .$Hi[.$Hi < 0])}
out
#[[1]]
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
#A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
#A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
#A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#[[2]]
# A1
#-0.3873723
You can use %T>% print() to show the result of summary() but not return it.
bfi[1:3] %>%
  na.omit() %>%
  mokken::check.monotonicity() %T>%
  {print(summary(.))} %>%
  {.$Hi[.$Hi < 0]}
# ItemH #ac #vi #vi/#ac maxvi sum sum/#ac zmax #zsig crit
# A1 -0.39 75 54 0.72 0.52 9.79 0.1305 16.75 51 550
# A2 0.06 50 8 0.16 0.14 0.63 0.0126 4.76 7 128
# A3 0.09 30 6 0.20 0.12 0.45 0.0149 4.63 6 134
#
# A1
# -0.3873723
If you assign it to a variable, it doesn't store the result of summary().
out <- ...
out
# A1
# -0.3873723

Finding max of column by group with condition

I have a data frame like this (posted as an image in the original question): for each Gill, I would like to find the maximum Time for which the Diametre is different from 0. I have tried the aggregate function and the dplyr package, but neither worked. A combination of for, if and aggregate would probably work, but I did not find how to do it.
I'm not sure of the best way to approach this. I'd appreciate any help.
After grouping by 'Gill', subset the 'Time' where 'Diametre' is not 0 and get the max (assuming 'Time' is of numeric class):
library(dplyr)
df1 %>%
  group_by(Gill) %>%
  summarise(Time = max(Time[Diametre != 0]))
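One caveat: in the sample data below, some gills have only zero diameters, and max() on an empty vector returns -Inf with a warning. A guarded sketch of the same idea:
df1 %>%
  group_by(Gill) %>%
  summarise(Time = if (any(Diametre != 0)) max(Time[Diametre != 0]) else NA_real_)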
Here is how you can use aggregate:
> df <- data.frame(
+   Gill = rep(1:11, each = 2),
+   diameter = c(0, 0, 1, 0, 0, 0, 73.36, 80.08, 1, 25.2, 53.48, 61.21, 28.8,
+                28.66, 71.2, 80.25, 44.55, 53.50, 60.91, 0, 11, 74.22),
+   time = 0.16
+ )
> df
Gill diameter time
1 1 0.00 0.16
2 1 0.00 0.16
3 2 1.00 0.16
4 2 0.00 0.16
5 3 0.00 0.16
6 3 0.00 0.16
7 4 73.36 0.16
8 4 80.08 0.16
9 5 1.00 0.16
10 5 25.20 0.16
11 6 53.48 0.16
12 6 61.21 0.16
13 7 28.80 0.16
14 7 28.66 0.16
15 8 71.20 0.16
16 8 80.25 0.16
17 9 44.55 0.16
18 9 53.50 0.16
19 10 60.91 0.16
20 10 0.00 0.16
21 11 11.00 0.16
22 11 74.22 0.16
> # Remove diameter == 0 before aggregate
> dfnew <- df[df$diameter != 0, ]
> aggregate(dfnew$time, list(dfnew$Gill), max )
Group.1 x
1 2 0.16
2 4 0.16
3 5 0.16
4 6 0.16
5 7 0.16
6 8 0.16
7 9 0.16
8 10 0.16
9 11 0.16
I would use a different approach than the elegant solution that akrun suggested. I know how to use this method to create the column MaxTime that you show in your image.
# This will split your df into a list of data frames, one for each Gill
list.df <- split(df1, df1$Gill)
Then you can use lapply to find the maximum of Time for each Gill and make that value a new column called MaxTime (mutate() needs dplyr loaded; note the result must be assigned back to list.df).
library(dplyr)
list.df <- lapply(list.df, function(x) mutate(x, MaxTime = max(x$Time[x$Diametre != 0])))
Then you can combine these split data frames back together using bind_rows():
df1 <- bind_rows(list.df)
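For completeness, the same MaxTime column can be added without splitting at all, using base ave() (a sketch; as with max() above, gills whose diameters are all zero yield -Inf with a warning):
# blank out times where the diameter is zero, then take the groupwise max
df1$MaxTime <- ave(ifelse(df1$Diametre != 0, df1$Time, NA_real_),
                   df1$Gill,
                   FUN = function(x) max(x, na.rm = TRUE))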

Efficient way to connect information between two dataframes based on factors in R (or how to avoid loops in R)

I have two large data frames; one is called Dates_only and the other Values.
Dates_only:
ID Quart_y Quart
1 1118 2017Q3 0.25
2 1118 2017Q4 0.50
3 1118 2018Q1 0.75
4 1118 2018Q2 1.00
5 1118 2018Q3 1.25
6 1118 2018Q4 1.50
7 1118 2019Q1 1.75
8 1118 2019Q2 2.00
9 1119 2017Q3 0.25
10 1119 2017Q4 0.50
11 1119 2018Q1 0.75
12 1119 2018Q2 1.00
13 1119 2018Q3 1.25
14 1119 2018Q4 1.50
15 1119 2019Q1 1.75
16 1119 2019Q2 2.00
17 13PP 2017Q3 0.25
18 13PP 2017Q4 0.50
19 13PP 2018Q1 0.75
20 13PP 2018Q2 1.00
21 13PP 2018Q3 1.25
22 13PP 2018Q4 1.50
23 13PP 2019Q1 1.75
24 13PP 2019Q2 2.00
And the second dataset:
Values:
ID Day Value
1 1118 0 7.6
2 1119 0 6.2
3 13PP 0 6.8
4 1118 0.14 7.1
5 1119 0.13 6.2
6 13PP 0.13 5.9
7 1118 0.20 6.8
8 1119 0.23 5.8
9 13PP 0.24 4.6
10 1118 0.27 6.5
11 1119 0.28 5.4
12 13PP 0.32 4.2
13 1118 0.32 6.3
14 1119 0.32 4.8
15 13PP 0.44 4.0
16 1118 0.47 6.0
17 1119 0.49 4.3
18 13PP 0.49 3.8
19 1118 0.59 5.9
20 1119 0.64 4.0
21 13PP 0.61 3.6
22 1118 0.72 5.6
23 1119 0.71 3.8
24 13PP 0.73 3.4
25 1118 0.95 5.4
26 1119 0.86 3.2
27 13PP 0.78 3.0
28 1118 1.10 5.0
29 1119 0.93 2.9
30 13PP 1.15 2.9
What I want to do is create another column (a fourth) in Dates_only called Value_average, which will contain average scores extracted from the Values$Value column.
Specifically, as you can observe in Dates_only, Quart_y represents quarter/year, and Quart quantifies this with a number from 0.25 to 2.00.
So the pattern goes like this: Q3 - x.25, Q4 - x.50, Q1 - x.75, Q2 - x.00.
In the second data frame, Values, the Day scores place each observation on the same scale. The concept is that days with 0 <= Day <= 0.25 belong to 2017Q3, days with 0.25 < Day <= 0.50 belong to 2017Q4, days with 1.00 < Day <= 1.25 belong to 2018Q3, and so on.
I want, for each ID in Dates_only, to find the average of the Values$Value numbers that belong to the appropriate time frame:
For ID = 1118 and 2017Q3, the Values$Day elements with 0 <= Day <= 0.25 are (0, 0.14, 0.20) and the corresponding Values$Value are (7.6, 7.1, 6.8), so Dates_only$Value_average is (7.6 + 7.1 + 6.8) / 3 = 7.17. The next row will average values for days 0.25 < Day <= 0.50, and so on.
Dates_only:
ID Quart_y Quart Value_average
1 1118 2017Q3 0.25 7.17
2 1118 2017Q4 0.50 6.27
The code that I have used is:
Dates_only$Value_average <- 0
for (i in 1:length(Dates_only$ID)) {
  id <- as.character(Dates_only$ID[i])
  quart <- as.numeric(Dates_only$Quart[i])
  quart_prev <- quart - 0.25
  count_d <- 0
  sum_val <- 0
  for (k in 1:length(Values$ID)) {
    if (id == as.character(Values$ID[k])
        && quart >= as.numeric(Values$Day[k])
        && as.numeric(Values$Day[k]) > quart_prev) {
      sum_val <- as.numeric(Values$Value[k]) + sum_val
      count_d <- count_d + 1
    }
  }
  av_value <- sum_val / count_d
  Dates_only$Value_average[i] <- av_value
}
Is there a more efficient way to do this on very large datasets (over 300K observations)? I am pretty sure there is, but my novice R skills do not help a lot.
To replicate the two data frames:
Dates_only <- data.frame(ID = c('1118','1118','1118','1118','1118',
                                '1118','1118','1118','1119','1119',
                                '1119','1119','1119','1119','1119',
                                '1119','13PP','13PP','13PP','13PP',
                                '13PP','13PP','13PP','13PP'),
                         Quart_y = c('2017Q3','2017Q4','2018Q1','2018Q2',
                                     '2018Q3','2018Q4','2019Q1','2019Q2',
                                     '2017Q3','2017Q4','2018Q1','2018Q2',
                                     '2018Q3','2018Q4','2019Q1','2019Q2',
                                     '2017Q3','2017Q4','2018Q1','2018Q2',
                                     '2018Q3','2018Q4','2019Q1','2019Q2'),
                         Quart = c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
                                   0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
                                   0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
Values <- data.frame(ID = c('1118','1119','13PP','1118','1119','13PP',
                            '1118','1119','13PP','1118','1119','13PP',
                            '1118','1119','13PP','1118','1119','13PP',
                            '1118','1119','13PP','1118','1119','13PP',
                            '1118','1119','13PP','1118','1119','13PP'),
                     Day = c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
                             0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
                             0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
                     Value = c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
                               4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
                               3.4,5.4,3.2,3,5,2.9,2.9))
We can accomplish almost all of this using the dplyr package:
library(dplyr)
Values %>%
  mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
  mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
  full_join(., Dates_only, by = c("ID", "Quart")) %>%
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))
Which gives you:
ID Quart Quart_y Value_average
<fctr> <dbl> <fctr> <dbl>
1 1118 0.25 2017Q3 7.166667
2 1118 0.50 2017Q4 6.266667
3 1118 0.75 2018Q1 5.750000
4 1118 1.00 2018Q2 5.400000
5 1118 1.25 2018Q3 5.000000
6 1118 1.50 2018Q4 NaN
7 1118 1.75 2019Q1 NaN
8 1118 2.00 2019Q2 NaN
9 1119 0.25 2017Q3 6.066667
10 1119 0.50 2017Q4 4.833333
# ... with 14 more rows
See below for a breakdown of each line of code:
# Start with your `Values` data frame
Values %>%
  # Recode `Day` values of 0, as they would otherwise be excluded by
  # the rule 2017Q3: 0 < Day <= 0.25
  # 0.01 was picked arbitrarily to fit this rule
  mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
  # Now round all `Day` values up to the nearest 0.25
  mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
  # Now join the two data frames using a `full_join`
  # A left_join may also be used if you are uninterested in NAs
  full_join(., Dates_only, by = c("ID", "Quart")) %>%
  # Finally, designate groupings to calculate the mean values
  # for each ID for each quarter
  group_by(ID, Quart, Quart_y) %>%
  summarise(Value_average = mean(Value, na.rm = TRUE))
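For 300K+ rows, a data.table non-equi join may also be worth trying; here is a sketch under the same assumptions (the Dates_only/Values objects above, with the same Day == 0 recoding as the dplyr answer):
library(data.table)
DV <- as.data.table(Values)
DD <- as.data.table(Dates_only)
DV[Day == 0, Day := 0.01]                      # include Day == 0 in the first quarter
DD[, c("lo", "hi") := .(Quart - 0.25, Quart)]  # quarter boundaries: lo < Day <= hi
res <- DV[DD, on = .(ID, Day > lo, Day <= hi),
          .(Value_average = mean(Value, na.rm = TRUE)), by = .EACHI]
Dates_only$Value_average <- res$Value_average  # res rows follow Dates_only's order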

Why is the same category giving different frequencies in R

Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq), ]
# output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it gives different frequencies for the same level, whereas if I print the levels of that feature I get the following output:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: the code that was used to derive the first set of output:
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
low_list = Process_Table %>%
  filter(Percent < 5.50) %>%
  select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table should undergo another aggregation step. Add the following to your final step of data aggregation:
Process_Table <- Process_Table %>%
  group_by(Process) %>%
  summarize(Freq = sum(Freq), Percent = sum(Percent))
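A base R equivalent, if you prefer to avoid dplyr (a sketch using aggregate on the same table):
aggregate(cbind(Freq, Percent) ~ Process, data = Process_Table, FUN = sum)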
