Aggregate/sum and N/A values - r

I have a problem with the way aggregate() deals with NA values when computing sums.
I would like the sums per area.code from the following table:
test <- read.table(text = "
area.code A B C D
1 0 NA 0.00 NA NA
2 1 0.0 3.10 9.6 0.0
3 1 0.0 3.20 6.0 0.0
4 2 0.0 6.10 5.0 0.0
5 2 0.0 6.50 8.0 0.0
6 2 0.0 6.90 4.0 3.1
7 3 0.0 6.70 3.0 3.2
8 3 0.0 6.80 3.1 6.1
9 3 0.0 0.35 3.2 6.5
10 3 0.0 0.67 6.1 6.9
11 4 0.0 0.25 6.5 6.7
12 5 0.0 0.68 6.9 6.8
13 6 0.0 0.95 6.7 0.0
14 7 1.2 NA 6.8 0.0
")
So, seems pretty easy:
aggregate(.~area.code, test, sum)
area.code A B C D
1 1 0 6.30 15.6 0.0
2 2 0 19.50 17.0 3.1
3 3 0 14.52 15.4 22.7
4 4 0 0.25 6.5 6.7
5 5 0 0.68 6.9 6.8
6 6 0 0.95 6.7 0.0
Apparently not so simple: area code 7 (and area code 0, which also contains NAs) is completely omitted from the aggregate() output.
I would, however, like the NAs to be ignored completely or treated as zero values. Which na.* argument gives that option?
Replacing all NAs with 0 is an option if I just want the sum, but the mean then becomes really problematic (since it can no longer differentiate between 0 and NA).

If you are willing to consider an external package (data.table):
library(data.table)
setDT(test)
test[, lapply(.SD, sum), by = area.code]
area.code A B C D
1: 0 NA 0.00 NA NA
2: 1 0.0 6.30 15.6 0.0
3: 2 0.0 19.50 17.0 3.1
4: 3 0.0 14.52 15.4 22.7
5: 4 0.0 0.25 6.5 6.7
6: 5 0.0 0.68 6.9 6.8
7: 6 0.0 0.95 6.7 0.0
8: 7 1.2 NA 6.8 0.0
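If you would rather have the NAs ignored within each group (so all-NA cells come out as 0), a small variation of the same call is:
test[, lapply(.SD, sum, na.rm = TRUE), by = area.code]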

One option is to create a function that returns NA when all the values are NA and otherwise uses sum() with na.rm = TRUE. Along with that, use the na.action argument in aggregate(), because aggregate() removes a row if it contains at least one NA.
f1 <- function(x) if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
aggregate(.~area.code, test, f1, na.action = na.pass)
# area.code A B C D
#1 0 NA 0.00 NA NA
#2 1 0.0 6.30 15.6 0.0
#3 2 0.0 19.50 17.0 3.1
#4 3 0.0 14.52 15.4 22.7
#5 4 0.0 0.25 6.5 6.7
#6 5 0.0 0.68 6.9 6.8
#7 6 0.0 0.95 6.7 0.0
#8 7 1.2 NA 6.8 0.0
When there are only NA elements and we use sum with na.rm = TRUE, it returns 0
sum(c(NA, NA), na.rm = TRUE)
#[1] 0

Another solution is to use dplyr:
library(dplyr)
test %>%
  group_by(area.code) %>%
  summarise_all(sum, na.rm = TRUE)
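With the newer across() interface (summarise_all() is superseded), an equivalent sketch is shown below; note that na.rm = TRUE turns all-NA groups into 0 rather than NA.
test %>%
  group_by(area.code) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))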

Related

Time series in R: how to change the y-axis?

New R user here, working with meteorological data (the data frame is called "Stations"). I am trying to plot 3 time series with temperature on the y-axis and a regression line on each, but I run into a few problems and there are no error messages.
The loop doesn't seem to be working and I can't figure out why.
I didn't manage to change the x-axis tick values to show the years ("Année" in the data frame) instead of an index number.
The title is the same for the 3 plots; how do I change it so each plot has its own title?
The regression line is not shown on the graph.
Thanks in advance!
Here is my code:
for (i in c(6,8,10))
  plot(ts(Stations[,i]), col="dodgerblue4", xlab="Temps", ylab="Température", main="Genève")
for (i in c(6,8,10))
  abline(h=Stations[,i])
Nb.enr time Année Mois Jour T2m_GE pcp_GE T2m_PU pcp_PU T2m_NY
1 19810101 1981 1 1 1.3 0.3 2.8 0.0 2.3
2 19810102 1981 1 2 1.2 0.1 2.3 1.2 1.6
3 19810103 1981 1 3 4.1 21.8 4.9 5.2 3.8
4 19810104 1981 1 4 5.1 10.3 5.1 17.4 4.9
5 19810105 1981 1 5 0.9 0.0 1.0 0.1 0.8
6 19810106 1981 1 6 0.5 5.7 0.7 6.0 0.5
7 19810107 1981 1 7 -2.7 0.0 -2.1 0.1 -1.9
8 19810108 1981 1 8 -3.2 0.0 -4.1 0.0 -3.8
9 19810109 1981 1 9 -5.2 0.0 -3.5 0.0 -5.1
10 19810110 1981 1 10 -3.1 10.6 -0.9 6.0 -2.6
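A possible direction for the fixes, offered as a sketch rather than a definitive answer: build real dates from the YYYYMMDD time column for the x-axis, give each plot its own title, and draw the regression line with abline(lm(...)) immediately after each plot() call. The station names below are guesses from the column suffixes (_GE, _PU, _NY) and are only placeholders.
dates  <- as.Date(as.character(Stations$time), format = "%Y%m%d")
cols   <- c(6, 8, 10)                      # T2m_GE, T2m_PU, T2m_NY
titres <- c("Genève", "Pully", "Nyon")     # placeholder titles, adjust as needed
for (k in seq_along(cols)) {
  y <- Stations[, cols[k]]
  plot(dates, y, type = "l", col = "dodgerblue4",
       xlab = "Temps", ylab = "Température", main = titres[k])
  abline(lm(y ~ as.numeric(dates)), col = "red")   # regression line drawn on top of each plot
}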

Make a named table from a list of dataframes

Say I have a column with the ID of a product and a list of data frames with characteristics for each of them:
bundle dataframe
bundle
1 284993459
2 1048768805
3 511310430
4 1034630958
5 1235581326
d2 list
[[1]]
id value
1 35 0.2
2 1462 0.2
3 1109 0.2
4 220 0.2
5 211 0.1
[[2]]
list()
[[3]]
id value
1 394 0.5
2 1462 0.5
[[4]]
id value
1 926 0.3
2 1462 0.3
3 381 0.3
4 930 0.2
[[5]]
id value
1 926 0.5
2 1462 0.5
I need to create columns with all the characteristic IDs and their values for each product.
bundle = data.frame(bundle = c(284993459,1048768805,511310430,1034630958,1235581326))
d2<- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)),
data.frame(id = NULL, value = NULL),
data.frame(id = c(394,1462), value = c(0.5,0.5)),
data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
data.frame(id = c(926,1462), value = c(0.5,0.5)))
bundle 35 1462 1109 220 211 394 1462
1 284993459 0.2 0.2 0.2 0.2 0.1 0 0
2 1048768805 0 0 0 0 0 0 0
3 511310430 0 0 0 0 0 0.5 0.5
I can't figure out how to do this. I had an idea to unlist this list of data frames, but no good came of it, since I have more than 8000 product IDs:
for (i in seq(d2))
assign(paste0("df", i), d2[[i]])
If we take a different approach, then I have to join the transposed characteristics data frames so that the values are filled in row by row.
Here's a tidyverse solution. First we add a bundle column to all the data.frames and stitch them together using purrr::map2_dfr, then use tidyr::spread to reshape to wide format.
library(tidyverse)
res <- map2_dfr(bundle$bundle, d2, ~ mutate(.y, bundle = .x)) %>%
  spread(id, value)
res[is.na(res)] <- 0
# bundle 35 211 220 381 394 926 930 1109 1462
# 1 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
# 2 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
# 3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
# 4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
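Since spread() has been superseded by tidyr's pivot_wider(), an equivalent sketch with the packages already loaded above (column order may differ slightly) is:
res <- map2_dfr(bundle$bundle, d2, ~ mutate(.y, bundle = .x)) %>%
  pivot_wider(names_from = id, values_from = value, values_fill = 0)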
You can first add the bundle to each data.frame within the list, then pivot it using reshape2::dcast or data.table::dcast before updating the NAs to 0:
ans <- data.table::dcast(
do.call(rbind, Map(function(nm, DF) within(DF, bundle <- nm), bundle$bundle, d2)),
bundle ~ id)
ans[is.na(ans)] <- 0
ans
# bundle 35 211 220 381 394 926 930 1109 1462
#1 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
#2 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
#3 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
#4 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
edit: adding more explanations after OP's comment
1) function(nm, DF) within(DF, bundle <- nm) takes the input data.frame DF and adds a new column called bundle with values equal to nm.
2) Map applies a function to the corresponding elements of the given vectors (see ?Map). That means Map applies the above function to each bundle value and adds it as a column to the corresponding data.frame in d2.
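For illustration only (a toy call with made-up values, not part of the answer's data): Map() walks its inputs in parallel, handing the k-th value to the k-th data.frame.
Map(function(nm, DF) within(DF, bundle <- nm),
    c(10, 20),
    list(data.frame(id = 1:2, value = c(0.1, 0.2)),
         data.frame(id = 3, value = 0.3)))
# returns a list of two data.frames, each with a new bundle column (10 and 20)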
Another approach could be
library(data.table)
library(tidyverse)
df <- rbindlist(
  lapply(
    # if a list element has no rows, substitute a one-row NA data frame
    lapply(d2, function(x) if (nrow(x) == 0) data.frame(id = NA, value = NA) else x),
    # convert each data frame to wide format
    function(y) y %>% spread(id, value)),
  fill = TRUE) %>%                          # rbind all data frames into a single one
  select(-`<NA>`) %>%                       # drop the column created by the NA ids
  cbind.data.frame(bundle = bundle$bundle)
Output is:
35 211 220 1109 1462 394 381 926 930 bundle
1: 0.2 0.1 0.2 0.2 0.2 NA NA NA NA 284993459
2: NA NA NA NA NA NA NA NA NA 1048768805
3: NA NA NA NA 0.5 0.5 NA NA NA 511310430
4: NA NA NA NA 0.3 NA 0.3 0.3 0.2 1034630958
5: NA NA NA NA 0.5 NA NA 0.5 NA 1235581326
Sample data:
bundle <- data.frame(bundle = c(284993459,1048768805,511310430,1034630958,1235581326))
d2 <- list(data.frame(id = c(35,1462,1109,220,211), value = c(0.2, 0.2, 0.2,0.2,0.1)),
data.frame(id = NULL, value = NULL),
data.frame(id = c(394,1462), value = c(0.5,0.5)),
data.frame(id = c(926,1462,381,930), value = c(0.3,0.3,0.3,0.2)),
data.frame(id = c(926,1462), value = c(0.5,0.5)))
There are two possible approaches which differ only in the sequence of operations:
Reshape all dataframes in the list individually from long to wide format and rbind() matching columns.
rbind() all dataframes in long form and reshape to wide format afterwards.
Both approaches require including bundle somehow.
For the sake of completeness, here are different implementations of the second approach using data.table.
library(data.table)
library(magrittr)
d2 %>%
  # bind row-wise into large data.table, create id column
  rbindlist(idcol = "bid") %>%
  # right join to append bundle column
  setDT(bundle)[, bid := .I][., on = "bid"] %>%
  # reshape from long to wide format
  dcast(., bundle ~ id, fill = 0)
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Here, piping is used just to visualize the sequence of function calls. With data.table's chaining the statement becomes more concise:
library(data.table) # library(magrittr) not required
setDT(bundle)[, bid := .I][
  rbindlist(d2, idcol = "bid"), on = "bid"][, dcast(.SD, bundle ~ id, fill = 0)]
or
library(data.table) # library(magrittr) not required
dcast(setDT(bundle)[, bid := .I][
  rbindlist(d2, idcol = "bid"), on = "bid"], bundle ~ id, fill = 0)
Another variant is to rename the list elements before calling rbindlist() which will take the names for creating the idcol:
library(data.table)
library(magrittr)
d2 %>%
  # rename list elements
  setNames(bundle$bundle) %>%
  # bind row-wise into large data.table, create id column from element names
  rbindlist(idcol = "bundle") %>%
  # convert bundle from character to factor to maintain original order
  .[, bundle := forcats::fct_inorder(bundle)] %>%
  # reshape from long to wide format
  dcast(., bundle ~ id, fill = 0)
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Note that the variants presented so far have skipped the empty dataframe which belongs to bundle 1048768805 (likewise the answers by Moody_Mudskipper and chinsoon12).
In order to keep the empty dataframe in the final result, the order of the join has to be changed so that all rows of bundle will be kept:
library(data.table)
dcast(
rbindlist(d2, idcol = "bid")[setDT(bundle)[, bid := .I], on = "bid"],
bundle ~ id, fill = 0
)[, "NA" := NULL][]
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
3: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
4: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5
Or, if the exact order of bundle is to be maintained:
library(data.table)
dcast(
rbindlist(d2, idcol = "bid")[setDT(bundle)[, bid := .I], on = "bid"],
bid + bundle ~ id, fill = 0
)[, c("bid", "NA") := NULL][]
bundle 35 211 220 381 394 926 930 1109 1462
1: 284993459 0.2 0.1 0.2 0.0 0.0 0.0 0.0 0.2 0.2
2: 1048768805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3: 511310430 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5
4: 1034630958 0.0 0.0 0.0 0.3 0.0 0.3 0.2 0.0 0.3
5: 1235581326 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.5

How to filter data by condition so that the number of rows is the same for each group

This is my sample data:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 10 1.7 1.8 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 10 0.1 0.25 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
First, I will group my data set by c(date, label, exdate); within each group the variable 'type' contains both A and B. But I want the number of rows for type A and type B to be the same.
Filter conditions:
To make the number of rows the same, the distance between x and w should be the same or almost the same for any pair of type A and type B rows.
For example:
type x w
A 2 3.5
A 3 3.5
A 4 3.5
B 1 3.5
B 2 3.5
# The output after filter
type x w
A 2 3.5 (pair with type_B ; x = 1)
A 3 3.5 (pair with type_B ; x = 2)
B 1 3.5
B 2 3.5
So, for the sample data above, this is the result I hope for:
date label type exdate x y z w
1 10 A 2 15 0.25 0.35 13.49
1 10 A 2 12.5 1.30 1.45 13.49
1 10 B 2 12.5 0.3 0.4 13.49
1 10 B 2 17.5 1.8 0.3 13.49
1 11 A 3 15 0.75 0.8 13.49
1 11 A 3 12.5 1.8 1.9 13.49
1 11 A 3 17.5 0.2 0.35 13.49
1 11 B 3 15 2.15 2.3 13.49
1 11 B 3 12.5 0.8 0.85 13.49
1 11 B 3 17.5 4.1 4.3 13.49
2 11 A 4 10 3.7 4 13.49
2 11 A 4 15 1 1.1 13.49
2 11 A 4 12.5 2.05 2.2 13.49
2 11 A 4 17.5 0.4 0.55 13.49
2 11 B 4 10 0.3 0.4 13.49
2 11 B 4 15 2.45 2.6 13.49
2 11 B 4 12.5 1.05 1.15 13.49
2 11 B 4 17.5 4.3 4.6 13.49
How can I code this? Should I insert an if/else condition inside filter()?
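No answer is shown here, but one hedged reading of the rule is: within each (date, label, exdate) group, keep for each type only n rows, where n is the smaller of the two type counts, choosing the rows whose x values line up best. The sketch below (assuming the data frame is called df and that type only takes the values "A" and "B") reproduces the desired output above by keeping the n largest x values per type; a strict nearest-x pairing would need an explicit matching step, and the row order within groups will differ from the display above.
library(dplyr)
result <- df %>%
  group_by(date, label, exdate) %>%
  # n_keep: the smaller of the A/B row counts in this group
  mutate(n_keep = min(sum(type == "A"), sum(type == "B"))) %>%
  group_by(date, label, exdate, type) %>%
  # within each type, keep the n_keep rows with the largest x
  arrange(desc(x), .by_group = TRUE) %>%
  slice(seq_len(first(n_keep))) %>%
  ungroup() %>%
  select(-n_keep)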

Find all points on a plane

I am trying to get all points on a 2d plane in the range (0..10, 0..10) with a step of 0.5. I would like to store these values in a dataframe like this:
x y
1 1 1.5
2 0 0.5
3 4 2.0
I am considering using a loop to start from 0.0 for the x column and fill the y column such that I get something like this:
x y
1 0 0
2 0 0.5
3 0 1
and so on up to 10, then increment x by 0.5 and do the same, and so on. I would like to know a more efficient way of doing this in R.
Is this what you want?
expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
x y
1 0.0 0.0
2 0.5 0.0
3 1.0 0.0
4 1.5 0.0
5 2.0 0.0
6 2.5 0.0
7 3.0 0.0
8 3.5 0.0
9 4.0 0.0
10 4.5 0.0
11 5.0 0.0
12 5.5 0.0
13 6.0 0.0
14 6.5 0.0
15 7.0 0.0
16 7.5 0.0
17 8.0 0.0
18 8.5 0.0
19 9.0 0.0
20 9.5 0.0
21 10.0 0.0
22 0.0 0.5
23 0.5 0.5
24 1.0 0.5
25 1.5 0.5
26 2.0 0.5
27 2.5 0.5
28 3.0 0.5
29 3.5 0.5
30 4.0 0.5
...
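As a quick usage check, the grid has 21 values per axis, so 21 * 21 = 441 points in total:
grid <- expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
nrow(grid)
# [1] 441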

Naming variables according to rows in R

I have two data tables. Data table 1 has two variables and 561 observations, while data table 2 has 563 variables and 10,000 observations. I'm trying to figure out how I can use the observations of the code_name variable from data table 1 to rename the variables in data table 2.
What I have:
Data table 1
code code_name
11 rasf
04 iadf
27 pqwr
09 pklf
86 irmw
30 pwql
Data table 2
activity subject V1 V2 V3 V4 V5 V6
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
What I need:
activity subject rasf iadf pqwr pklf irmw pwql
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
What I did:
#Extracts all rows and just column two from the data table 1
new_data_table1 <- data_table1[,2]
#Set names on data table 2 to build the final data
final_data <- setnames(data_table2, names(data_table2), c("activity", "subject", new_data_table1))
The problem with my code is that when I extract all rows from data table 1 it gives a long list, showing vectors for the structure and labels of the data. Because of that, when I run my code I get this table:
activity subject 243 244 245 246 247 248
5 2 0.29 0.19 5.3 1.8 8.3 0.3
9 7 0.11 0.10 7.8 2.0 0.5 0.9
9 7 0.19 1.10 8.0 1.9 0.4 0.7
The new names for the variables are numbers because they are the structures and not the labels.
We can use the names() function to rename the variables according to the rows:
names(df1)[3:length(df1)] <- df$code_name
df1
activity subject rasf iadf pqwr pklf irmw pwql
1 5 2 0.29 0.19 5.3 1.8 8.3 0.3
2 9 7 0.11 0.10 7.8 2.0 0.5 0.9
3 9 7 0.19 1.10 8.0 1.9 0.4 0.7
data
df
code code_name
1 11 rasf
2 4 iadf
3 27 pqwr
4 9 pklf
5 86 irmw
6 30 pwql
df1
activity subject V1 V2 V3 V4 V5 V6
1 5 2 0.29 0.19 5.3 1.8 8.3 0.3
2 9 7 0.11 0.10 7.8 2.0 0.5 0.9
3 9 7 0.19 1.10 8.0 1.9 0.4 0.7
We can use grep to find the indices of the column names in the second dataset that start with "V" followed by numbers, and change them to the values of the second column of the first dataset:
names(df2)[grep("^V\\d+", names(df2))] <- as.character(df1[, 2])
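If the tables are data.tables (the original attempt used setnames()), a hedged equivalent of the same renaming, assuming data_table1 holds code/code_name and data_table2 still has its V1, V2, ... columns:
library(data.table)
# Extract the column as a plain character vector; on a data.table,
# data_table1[, 2] returns a one-column data.table rather than a vector.
new_names <- as.character(data_table1[[2]])
setnames(data_table2,
         old = grep("^V\\d+", names(data_table2), value = TRUE),
         new = new_names)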
