I have a list of nested data frames and I want to extract the observations for the earliest year. My problem is that the first year differs between data frames: it is either 1992 or 2005.
I want to store the results in a list. I tried which(), but a data frame that starts in 1992 also contains 2005, so observations are returned for both years, and I want them kept apart:
new_df<- which(df[[i]]==1992 | df[[i]]==2005)
I've also tried ifelse(), but I need to run lm() afterwards and it doesn't work. I also can't simply take the first rows, because the years are repeated.
My code looks like this:
df<- list(a<-data.frame(a_1<-(1992:2015),
a_2<-sample(1:24)),
b<-data.frame(b_1<-(1992:2015),
b_2<-sample(1:24)),
c<-data.frame(c_1<-(2005:2015),
c_2<-sample(1:11)),
d<-data.frame(d_1<-(2005:2015),
d_2<-sample(1:11)))
You can define a function that extracts the data from one data.frame and apply it to each element of the list.
Below I use map() from the purrr package, but you can also use lapply() or a for loop.
Please do not use <- to assign values inside a function call (here data.frame()), because it mangles the column names. Inside function calls, = is used to name arguments, so it is fine to use it there.
df<- list(a<-data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b<-data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c<-data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d<-data.frame(d_1 = (2005:2015),
d_2 = sample(1:11)))
extract_miny <- function(df){
miny <- min(df[,1])
res <- df[df[,1] == miny, 2]
names(res) <- miny
return(res)
}
purrr::map(df, extract_miny)  # base R equivalent: lapply(df, extract_miny)
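For readers who prefer base R, the same idea can be sketched with lapply() and with a for loop. This is only an illustration: the list `dfs` and the helper `first_year_rows` are made-up names, and the helper keeps whole rows (not just the second column) so the result can be passed on to later modelling steps.

```r
# Hypothetical small list of data frames; the first column holds the year.
dfs <- list(
  a = data.frame(year = 1992:1995, value = c(6L, 18L, 23L, 5L)),
  c = data.frame(year = 2005:2007, value = c(11L, 2L, 5L))
)

# Keep every row whose year equals the earliest year in that data frame.
first_year_rows <- function(d) d[d[[1]] == min(d[[1]]), , drop = FALSE]

# lapply() version: returns a list with one data frame per input element.
res_lapply <- lapply(dfs, first_year_rows)

# for-loop version: preallocate the list, then fill it element by element.
res_loop <- vector("list", length(dfs))
names(res_loop) <- names(dfs)
for (i in seq_along(dfs)) {
  res_loop[[i]] <- first_year_rows(dfs[[i]])
}
```

Both versions give the same list; lapply() is usually shorter, while the loop makes it easier to add extra steps per data frame.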
If the data is sorted as in the example, you can slice() the first row to get the information. Note the use of = rather than <- when creating the list of data frames.
library(tidyverse)
df <- list(
a = data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b = data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c = data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d = data.frame(d_1 = (2005:2015),
d_2 = sample(1:11))
)
df %>%
imap_dfr( ~ slice(.x, 1) %>%
set_names(c("year", "value")) %>%
mutate(dataframe = .y) %>%
as_tibble())
# A tibble: 4 x 3
year value dataframe
<int> <int> <chr>
1 1992 19 a
2 1992 2 b
3 2005 1 c
4 2005 5 d
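The difference between = and <- inside data.frame() can be checked directly. The sketch below uses two throwaway objects, `bad` and `good`, which are not part of the original code.

```r
# With `<-`, the argument is unnamed: R assigns a_1 in the calling environment
# and builds the column name from the deparsed expression instead.
bad <- data.frame(a_1 <- 1:3)

# With `=`, a_1 is the argument name and therefore the column name.
good <- data.frame(a_1 = 1:3)

names(bad)   # a mangled name derived from the text "a_1 <- 1:3"
names(good)  # "a_1"
```

The values in the column are the same in both cases; only the name differs, which is what breaks later code that refers to the column by name.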
You may subset with an anonymous function.
lapply(df, \(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value'))) |> do.call(what=rbind)
# year value
# 1 1992 6
# 2 1992 9
# 3 2005 11
# 4 2005 11
Or, maybe better, first create a variable that records which sample each value comes from.
Map(`[<-`, df, 'sample', value=letters[seq_along(df)]) |>
lapply(\(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value', 'sample'))) |>
do.call(what=rbind)
# year value sample
# 1 1992 6 a
# 2 1992 9 b
# 3 2005 11 c
# 4 2005 11 d
Data:
df <- list(structure(list(a_1.....1992.2015. = 1992:2015, a_2....sample.1.24. = c(6L,
18L, 23L, 5L, 7L, 14L, 4L, 10L, 19L, 17L, 15L, 1L, 11L, 22L,
13L, 8L, 20L, 16L, 2L, 3L, 24L, 21L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(b_1.....1992.2015. = 1992:2015, b_2....sample.1.24. = c(9L,
24L, 18L, 8L, 16L, 11L, 13L, 23L, 15L, 20L, 19L, 21L, 12L, 22L,
7L, 3L, 6L, 17L, 2L, 5L, 4L, 10L, 1L, 14L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(c_1.....2005.2015. = 2005:2015, c_2....sample.1.11. = c(11L,
2L, 5L, 10L, 9L, 6L, 1L, 7L, 3L, 8L, 4L)), class = "data.frame", row.names = c(NA,
-11L)), structure(list(d_1.....2005.2015. = 2005:2015, d_2....sample.1.11. = c(11L,
2L, 5L, 1L, 6L, 9L, 3L, 7L, 10L, 4L, 8L)), class = "data.frame", row.names = c(NA,
-11L)))
I have a table with two columns, A and B. I want to create a new table with two new columns added: X and Y. These two new columns should contain the data from column A, but only every second row: column X starts from the first value in column A, and column Y starts from the second value in column A.
So far, I have been doing this in Excel. Now I need it in R, ideally as a function, so that I can easily reuse the code. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
xy_fun <- function(data, A = A, X = X, Y = Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
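If dplyr is not available, a base R counterpart of such a function could look like the sketch below. The name `split_alternate` and its arguments are hypothetical, and the column is passed as a string rather than unquoted.

```r
# Hypothetical helper: copies `col` into two new columns. Odd rows are kept in
# the first new column, even rows in the second; other positions become NA.
split_alternate <- function(data, col, new_names = c("X", "Y")) {
  odd <- seq_len(nrow(data)) %% 2 == 1
  data[[new_names[1]]] <- ifelse(odd, data[[col]], NA)
  data[[new_names[2]]] <- ifelse(odd, NA, data[[col]])
  data
}

# Small made-up example using the first four rows of the question's data.
demo <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))
out <- split_alternate(demo, "A")
out
```

Passing the column name as a string avoids non-standard evaluation, at the cost of a slightly less convenient call.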
I like @harre's approach.
Another approach, in base R, is to use R's recycling of a shorter vector to the length of a longer one:
df$X <- df$A
df$Y <- df$A  # both new columns take their values from A
df$X[c(FALSE, TRUE)] <- NA
df$Y[c(TRUE, FALSE)] <- NA
df
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
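To see the recycling on its own, here is a minimal sketch on a plain vector (`x` and `y` are illustrative names only). A logical index shorter than the vector is repeated until it covers every position.

```r
x <- c(2L, 7L, 5L, 11L, 54L)

# c(FALSE, TRUE) is recycled to FALSE TRUE FALSE TRUE FALSE: even positions.
even_vals <- x[c(FALSE, TRUE)]

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE: odd positions.
odd_vals <- x[c(TRUE, FALSE)]

# The same index works on the left-hand side of an assignment.
y <- x
y[c(FALSE, TRUE)] <- NA
```

This also works when the vector length is odd, as here, because recycling simply stops at the last element.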
I need to overlay two different plots. They already use the same scale.
My code for each separate scatterplot looks like this:
ggscatter(chemicals, x = "columnB", y = "columnA",
color = "nombre",
palette = "jco",
ellipse = FALSE,
ellipse.type = "convex",
repel = TRUE,
max.overlaps = 10,
font.label = c(6, "plain", "red"))
ggscatter(rivers, x = "V3", y = "V2",
label = rivers$V1,
palette = "jco",
ellipse = FALSE,
ellipse.type = "convex",
repel = FALSE,
max.overlaps = 10,
font.label = c(6, "plain", "blue"))
The first data look like this...
chemicals <- structure(list(columnA = c(0.34526, -0.47491, 1.9717, -1.28922,
-1.3365, -1.06089, -1.35741, -1.03362, 1.33577, 0.26619, -1.33583,
0.56619, -0.84651, 0.52487, -0.44644, 0.33894, 1.33558, -1.36652,
-1.41608, 0.08864, -0.98665, -0.13102, 0.96633, -0.33869, -1.45537,
1.50434, -1.30283, -0.03662, -0.83985, -0.86605, 0.96659, -1.37216,
1.05501, 0.34936, -0.56608, -0.84148, 1.16633, 1.15391, -1.10533,
-0.04087, 1.36684, 0.39588, -0.4166, -0.7338, -1.33663, 1.24798,
0.26939, 0.57514, 0.21976, -0.62348, -1.3341, 0.6696, 1.71274,
0.0337, -1.33959, -0.33319, -0.21368, -0.25305, 0.56606, 0.56665
), columnB = c(0.46696, 0.15238, 0.28205, -1.01343, -0.45548, -0.58032,
-0.03174, -1.86618, 0.37332, 0.33668, 0.3668, 0.67415, -0.0393,
1.21716, 0.06624, 1.4333, 0.42663, 0.33143, 0.33529, -2.66816,
0.76601, 0.06666, 0.86633, 0.59532, -0.33115, -0.76641, 0.06633,
0.50038, -0.11718, 0.28718, -1.84348, -0.2598, -0.37834, 1.82102,
0.66669, 0.56604, -2.17667, -1.86617, 0.67087, -2.2598, -2.06249,
-0.25863, 1.26661, -1.76684, 0.06665, 0.80114, -1.33408, 0.23333,
0.21658, 0.39268, 0.50466, -0.09929, -0.09178, 1.07363, 1.15409,
-0.49409, 1.628, 0.26664, 0.62084, 0.50397)), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L,
24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,
35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,
46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L,
57L, 58L, 59L, 60L), class = "data.frame")
The second data looks like this...
rivers <- structure(list(V1 = structure(c(7L, 5L, 6L, 1L, 3L, 4L, 8L, 2L
), .Label = c("riverA", "riverB", "riverC", "riverD",
"riverE", "riverF", "riverG", "riverH"), class = "factor"),
V2 = structure(c(8L, 7L, 6L, 5L, 4L, 1L, 2L, 3L), .Label = c("-0.800",
"0.021", "0.220", "0.590", "0.999", "0.333", "0.700", "0.850"
), class = "factor"), V3 = structure(c(1L, 3L, 4L, 2L, 7L,
6L, 8L, 5L), .Label = c("-0.028", "-0.011", "-0.078", "-0.4",
"-0.952", "0.275", "0.630", "0.725"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
I need to put both of these scatter plots together in one plot.
I don't have ggpubr, but here is a demonstration using ggplot2:
library(dplyr)
library(ggplot2)
rivers %>%
mutate(source = "rivers", across(c(V3,V2), ~ as.numeric(as.character(.)))) %>%
select(source, columnA = V3, columnB = V2) %>%
bind_rows(mutate(chemicals, source = "chemicals")) %>%
ggplot(aes(columnA, columnB)) +
geom_point(aes(color = source))
I'm guessing this should be straight-forward to translate into ggpubr::ggscatter.
The premise of row-binding (via base rbind or dplyr::bind_rows or data.table::rbindlist) is that the number of rows matters not, it's the columns that matter. In the base case, there must be the same number of columns with the same names:
dat1 <- data.frame(a = 1, b = 2)
dat2 <- data.frame(a = 1:2, d = 3:4)
rbind(dat1, dat2)
# Error in match.names(clabs, names(xi)) :
# names do not match previous names
dat2b <- data.frame(a = 1:2, b = 3:4)
rbind(dat1, dat2b)
# a b
# 1 1 2
# 2 1 3
# 3 2 4
Both dplyr::bind_rows and data.table::rbindlist provide wiggle room around this, either by default (former) or with options (latter):
dat2 <- data.frame(a = 1:2, d = 3:4)
dplyr::bind_rows(dat1, dat2)
# a b d
# 1 1 2 NA
# 2 1 NA 3
# 3 2 NA 4
data.table::rbindlist(list(dat1, dat2), use.names = TRUE, fill = TRUE)
# a b d
# <num> <num> <int>
# 1: 1 2 NA
# 2: 1 NA 3
# 3: 2 NA 4
In this case, though, you need to normalize the names: rename the columns in either or both data frames so that they can be aligned and row-bound properly.
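A minimal base R sketch of that renaming step, reusing dat1 and dat2 from above. It assumes, purely for illustration, that column d in dat2 holds the same kind of measurement as column b in dat1; the `source` column is an added label so the origin of each row survives the bind.

```r
dat1 <- data.frame(a = 1, b = 2)
dat2 <- data.frame(a = 1:2, d = 3:4)

# Assumption: `d` in dat2 corresponds to `b` in dat1, so rename it to match.
names(dat2)[names(dat2) == "d"] <- "b"

# Tag each row with its origin before stacking.
dat1$source <- "dat1"
dat2$source <- "dat2"

combined <- rbind(dat1, dat2)
combined
```

Once the names agree, base rbind() works without error, and the `source` column can be mapped to colour in a plot.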
FYI, you don't actually have to rename or rbind them to do things the brute-force way in ggplot2, but doing it this way has consequences and limits several other options so it is generally discouraged:
ggplot() +
geom_point(aes(columnA, columnB), color = "red", data = chemicals) +
geom_point(aes(as.numeric(as.character(V3)), as.numeric(as.character(V2))), color = "blue", data = rivers)
... but this doesn't help you adapt the process to ggscatter, so it is doubly not useful. I'll keep it, but don't go down this last path.
I'm working with a dataset as follows:
structure(list(date = structure(1:24, .Label = c("2010Y1-01m",
"2010Y1-02m", "2010Y1-03m", "2010Y1-04m", "2010Y1-05m", "2010Y1-06m",
"2010Y1-07m", "2010Y1-08m", "2010Y1-09m", "2010Y1-10m", "2010Y1-11m",
"2010Y1-12m", "2011Y1-01m", "2011Y1-02m", "2011Y1-03m", "2011Y1-04m",
"2011Y1-05m", "2011Y1-06m", "2011Y1-07m", "2011Y1-08m", "2011Y1-09m",
"2011Y1-10m", "2011Y1-11m", "2011Y1-12m"), class = "factor"),
a = structure(c(1L, 18L, 19L, 20L, 22L, 23L, 2L, 4L, 5L,
7L, 8L, 10L, 1L, 21L, 3L, 6L, 9L, 11L, 12L, 13L, 14L, 15L,
16L, 17L), .Label = c("--", "10159.28", "10295.69", "10580.82",
"10995.65", "11245.84", "11327.23", "11621.99", "12046.63",
"12139.78", "12848.27", "13398.26", "13962.6", "14559.72",
"14982.58", "15518.64", "15949.87", "7363.45", "8237.71",
"8830.99", "9309.47", "9316.56", "9795.77"), class = "factor"),
b = structure(c(1L, 15L, 22L, 23L, 3L, 5L, 6L, 8L, 9L, 11L,
13L, 16L, 1L, 21L, 2L, 4L, 7L, 10L, 12L, 14L, 17L, 18L, 19L,
20L), .Label = c("--", "1058.18", "1455.6", "1539.01", "1867.07",
"2036.92", "2102.23", "2372.84", "2693.96", "2769.65", "2973.04",
"3146.88", "3227.23", "3604.71", "365.07", "3678.01", "4043.18",
"4438.55", "4860.76", "5360.94", "555.51", "653.19", "980.72"
), class = "factor")), class = "data.frame", row.names = c(NA,
-24L))
I'm trying to calculate the yearly percentage change for columns a and b. First I replace -- in a and b with NA, then I convert the date column. The code I have used:
df[df == "--"] <- NA
df$date <- as.Date(paste0(df$date, '-01'), '%YY1-%mm-%d')
df %>%
# mutate(date = lubridate::ymd(paste0(date, '-01'))) %>%
mutate(ratio_a = round((a / lag(a, 12) - 1)*100, 2),
ratio_b = round((b / lag(b, 12) - 1)*100, 2))
In the final result, ratio_a and ratio_b are all NAs.
But with the data below, which I edited in Excel by replacing -- with a blank, it works:
structure(list(date = structure(1:24, .Label = c("2010Y1-01m",
"2010Y1-02m", "2010Y1-03m", "2010Y1-04m", "2010Y1-05m", "2010Y1-06m",
"2010Y1-07m", "2010Y1-08m", "2010Y1-09m", "2010Y1-10m", "2010Y1-11m",
"2010Y1-12m", "2011Y1-01m", "2011Y1-02m", "2011Y1-03m", "2011Y1-04m",
"2011Y1-05m", "2011Y1-06m", "2011Y1-07m", "2011Y1-08m", "2011Y1-09m",
"2011Y1-10m", "2011Y1-11m", "2011Y1-12m"), class = "factor"),
a = c(NA, 7363.45, 8237.71, 8830.99, 9316.56, 9795.77, 10159.28,
10580.82, 10995.65, 11327.23, 11621.99, 12139.78, NA, 9309.47,
10295.69, 11245.84, 12046.63, 12848.27, 13398.26, 13962.6,
14559.72, 14982.58, 15518.64, 15949.87), b = c(NA, 365.07,
653.19, 980.72, 1455.6, 1867.07, 2036.92, 2372.84, 2693.96,
2973.04, 3227.23, 3678.01, NA, 555.51, 1058.18, 1539.01,
2102.23, 2769.65, 3146.88, 3604.71, 4043.18, 4438.55, 4860.76,
5360.94)), class = "data.frame", row.names = c(NA, -24L))
Could someone help me figure out why my code above gives NAs for the ratio columns? Thanks.
Your data contains factors; try converting them to numbers.
library(dplyr)
df[df == "--"] <- NA
df$date <- as.Date(paste0(df$date, '-01'), '%YY1-%mm-%d')
df %>%
type.convert(as.is = TRUE) %>%  # as.is = TRUE avoids the missing-argument warning in R >= 4.0
mutate(ratio_a = round((a / lag(a, 12) - 1)*100, 2),
ratio_b = round((b / lag(b, 12) - 1)*100, 2))
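Why the original code returned only NAs can be shown on a small factor. The vector `f` below is a made-up three-value excerpt of column a; arithmetic on a factor is not meaningful in R, so it yields NA with a warning.

```r
f <- factor(c("--", "7363.45", "8237.71"))

# Arithmetic on a factor gives NA for every element (R also warns about it).
halved <- suppressWarnings(f / 2)

# as.numeric() alone would return the internal level codes, not the labels,
# so convert to character first. "--" cannot be parsed and becomes NA.
vals <- suppressWarnings(as.numeric(as.character(f)))
vals
```

After this conversion, expressions such as a / lag(a, 12) operate on real numbers rather than factor codes.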
I'm new to R and, to be honest, don't know what to call what I'm looking for :)
I have a data set "ds" with 2 columns:
D | res
==========
Ds 20
Dx 23
Dp 1
Ds 12
Ds 23
Ds 54
Dn 65
Ds 122
Dx 11
Dx 154
Dx 18
Do 4
Df 17
Dp 5
Dp 107
Dp 8
Df 3
Dp 33
Dd 223
Dc 7
Dv 22
Du 34
Dh 22
Ds 12
Dy 78
Dd 128
I need to calculate the top 4 values of column "D" by the sum of "Res", so the desired result would look like:
D | Res
========
Dd 351
Dp 154
Ds 243
Dx 206
and by percentage:
D | % Of Total
==========
Dd 29.10%
Dp 12.77%
Ds 20.15%
Dx 17.08%
Thanks
We can use aggregate() to obtain the sum of each type of "D", and we can introduce a new column to account for the edit of the OP and include also the percentage.
In order to display the result in the desired form, we can apply the order() function to rearrange the rows according to the value of Res. The function rev() in this case ensures that the highest value is put on top, and head() with the parameter 4 displays the first four rows.
summarized <- aggregate(Res ~. , df1, sum)
summarized$Perc <- with(summarized, paste0(round(Res/sum(Res)*100,2),"%"))
head(summarized[rev(order(summarized$Res)),],4)
D Res Perc
2 Dd 351 29.1%
8 Ds 243 20.15%
11 Dx 206 17.08%
7 Dp 154 12.77%
data
df1 <- structure(list(D = structure(c(8L, 11L, 7L, 8L, 8L, 8L, 5L,
8L, 11L, 11L, 11L, 6L, 3L, 7L, 7L, 7L, 3L, 7L, 2L, 1L, 10L, 9L,
4L, 8L, 12L, 2L), .Label = c("Dc", "Dd", "Df", "Dh", "Dn", "Do",
"Dp", "Ds", "Du", "Dv", "Dx", "Dy"), class = "factor"), Res = c(20L,
23L, 1L, 12L, 23L, 54L, 65L, 122L, 11L, 154L, 18L, 4L, 17L, 5L,
107L, 8L, 3L, 33L, 223L, 7L, 22L, 34L, 22L, 12L, 78L, 128L)),
.Names = c("D", "Res"), class = "data.frame", row.names = c(NA, -26L))
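The sort-then-head step can also be written with order(decreasing = TRUE) instead of rev(order()). The sketch below uses a small made-up data frame, `sales`, rather than df1, and keeps the top two groups.

```r
# Hypothetical data: group labels in D, values to sum in Res.
sales <- data.frame(
  D   = c("Ds", "Dx", "Dp", "Ds", "Dx", "Dd"),
  Res = c(20L, 23L, 1L, 12L, 11L, 223L)
)

# Sum Res within each group, then add each group's share of the grand total.
totals <- aggregate(Res ~ D, sales, sum)
totals$Perc <- round(totals$Res / sum(totals$Res) * 100, 2)

# Largest sums first, then keep the first two rows.
top2 <- head(totals[order(totals$Res, decreasing = TRUE), ], 2)
top2
```

The two spellings agree when all sums are distinct; with ties, rev(order()) also reverses the order of the tied rows, while decreasing = TRUE keeps them in their original order.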
If you mean to sum Res per D and then select the top 4 sums (assuming you made mistakes calculating the sums for ds and dp) you could try:
library(dplyr)
df1 %>% mutate(per = Res/sum(Res)) %>% group_by(D) %>% summarise(Res = sum(Res), perc = sum(per)) %>% top_n(4, Res)
Source: local data frame [4 x 3]
D Res perc
(fctr) (int) (dbl)
1 Dd 351 0.2910448
2 Dp 154 0.1276949
3 Ds 243 0.2014925
4 Dx 206 0.1708126
Option using data.table
library(data.table)
out = setorder(setDT(data)[, .(tmp = sum(res)), by = D]
[, .(D, ptg = (tmp/sum(tmp))*100)], -ptg)[1:4,]
#> out
# D ptg
#1: Dd 29.10448
#2: Ds 20.14925
#3: Dx 17.08126
#4: Dp 12.76949
How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag%in%sample(levels(tag),2))
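A variant that also works when tag is a character column (levels() only applies to factors, and it includes unused levels) is to sample from unique(). The data frame `weights_demo` and the seed are made up for this sketch.

```r
# Hypothetical excerpt of the question's data with tag as a character column.
weights_demo <- data.frame(
  tag   = c("R007", "R007", "J102", "J102", "M942", "M054"),
  value = c(10L, 11L, 11L, 9L, 3L, 9L)
)

set.seed(1)  # arbitrary seed, only to make the draw repeatable

# Draw two distinct tags, then keep every row belonging to either of them.
picked <- sample(unique(weights_demo$tag), 2)
sampled_rows <- weights_demo[weights_demo$tag %in% picked, ]
sampled_rows
```

Because the sampling is over distinct tags, each tag is equally likely to be chosen regardless of how many rows it has.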
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility: you want 100 distinct tags at random, without weighting the probability by how often each tag occurs:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
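One difference between the %in% and grep() variants is worth checking: grep() matches substrings unless the pattern is anchored. The vector `tags` below is made up to show this, including two tags that do not occur in the question's data.

```r
tags <- c("R007", "R0071", "M942", "XM942", "J102")

# %in% compares whole values, so only exact matches are kept.
exact <- tags[tags %in% c("R007", "M942")]

# An unanchored pattern also matches longer tags that contain the text.
partial <- tags[grep("R007|M942", tags)]

# Anchoring with ^ and $ restores exact matching.
anchored <- tags[grep("^(R007|M942)$", tags)]
```

With the tags in the question all three give the same rows, but %in% is the safer default when tags could share prefixes or suffixes.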