I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y.
The X column should contain ratios of consecutive values from column A: the value in the first row divided by the value in the second row, the third row divided by the fourth row, and so on for all subsequent pairs. Each result is placed in the second row of its pair.
The Y column should contain the same pairwise ratios computed from column B: the first row divided by the second row, the third row divided by the fourth row, and so on.
So far I have done this in Excel, but now I need it in R, ideally as a function so that I can reuse the code easily. I have not done this in R before, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample results:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c("",
"0,285714286", "", "0,454545455", "", "4,5", "", "2,428571429",
"", "1,666666667"), Y = c("", "0,6", "", "0,047619048", "", "2,09375",
"", "0,791666667", "", "1,189189189")), class = "data.frame", row.names = c(NA,
-10L))
You could use dplyr's across and lag (combined with modulo for picking every second row):
library(dplyr)
df |> mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
If you want a character vector instead, change NA to "" in the ifelse call.
Output:
A B new_A new_B
1 2 3 NA NA
2 7 5 0.2857143 0.60000000
3 5 1 NA NA
4 11 21 0.4545455 0.04761905
5 54 67 NA NA
6 12 32 4.5000000 2.09375000
7 34 19 NA NA
8 14 24 2.4285714 0.79166667
9 10 44 NA NA
10 6 37 1.6666667 1.18918919
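If empty strings are preferred over NA, as in the sample result, one possible sketch is to convert the new columns to character in a second step. The 4-row `df` below is a shortened copy of the example data, and the `new_` prefix is the one used in the answer above; the conversion with `as.character` is only one of several ways to do this.

```r
library(dplyr)

# Shortened copy of the example data (first four rows only)
df <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))

res <- df |>
  # Ratio of the previous row to the current row, on even rows only
  mutate(across(c(A, B),
                ~ ifelse(row_number() %% 2 == 0, lag(.x) / .x, NA),
                .names = "new_{.col}")) |>
  # Turn the numeric results into character, with "" where the value is NA
  mutate(across(starts_with("new_"),
                ~ ifelse(is.na(.x), "", as.character(.x))))

res
```

Note that the new columns are then character vectors, so they can no longer be used directly in arithmetic.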
Function:
ab_fun <- function(data, vars) {
  data |>
    # {{ vars }} forwards the column selection given by the caller
    mutate(across({{ vars }}, ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
}
ab_fun(df, c(A, B))
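For comparison, the same calculation can be sketched in base R without dplyr. The function name `ratio_pairs` and its `cols` argument (a character vector of column names) are hypothetical and only serve this illustration.

```r
# Hypothetical helper: for each column in `cols`, divide the previous row
# by the current row and keep the result on even rows only.
ratio_pairs <- function(data, cols) {
  even <- seq_len(nrow(data)) %% 2 == 0
  for (col in cols) {
    x <- data[[col]]
    prev <- c(NA, x[-length(x)])          # values shifted down by one row
    data[[paste0("new_", col)]] <- ifelse(even, prev / x, NA)
  }
  data
}

df <- data.frame(A = c(2, 7, 5, 11), B = c(3, 5, 1, 21))
out <- ratio_pairs(df, c("A", "B"))
out
```

Passing column names as strings avoids tidy evaluation entirely, at the cost of the more flexible selection helpers that `across` offers.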
I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y. Both new columns should contain data from column A, but only every second row: column X takes the values starting from the first value in column A, and column Y takes the values starting from the second value in column A.
So far I have been doing this in Excel, but now I need it in R, preferably as a function so that I can easily reuse the code. I have not done this in R before, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample result:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c(2L,
NA, 5L, NA, 54L, NA, 34L, NA, 10L, NA), Y = c(NA, 7L, NA, 11L,
NA, 12L, NA, 14L, NA, 6L)), class = "data.frame", row.names = c(NA,
-10L))
It is not a super elegant solution, but it works:
exampleDF <- structure(list(A = c(2L, 7L, 5L, 11L, 54L,
12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L,
32L, 19L, 24L, 44L, 37L)),
class = "data.frame", row.names = c(NA, -10L))
index <- seq(from = 1, to = nrow(exampleDF), by = 2)
exampleDF$X <- NA
exampleDF$X[index] <- exampleDF$A[index]
exampleDF$Y <- exampleDF$A
exampleDF$Y[index] <- NA
You could also make use of the row numbers and the modulo operator:
A simple ifelse way:
library(dplyr)
df |>
mutate(X = ifelse(row_number() %% 2 == 1, A, NA),
Y = ifelse(row_number() %% 2 == 0, A, NA))
Or using pivoting:
library(dplyr)
library(tidyr)
df |>
mutate(name = ifelse(row_number() %% 2 == 1, "X", "Y"),
value = A) |>
pivot_wider()
A function using the first approach could look like:
xy_fun <- function(data, A = A, X = X, Y = Y) {
data |>
mutate({{X}} := ifelse(row_number() %% 2 == 1, {{A}}, NA),
{{Y}} := ifelse(row_number() %% 2 == 0, {{A}}, NA))
}
xy_fun(df, # Your data
A, # The col to take values from
X, # The column name of the first new column
Y # The column name of the second new column
)
Output:
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
Data stored as df:
df <- structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L, 6L),
B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)
),
class = "data.frame",
row.names = c(NA, -10L)
)
I like @harre's approach.
Another option is base R, using R's recycling of a shorter logical index vector along a longer vector:
df$X <- df$A
df$Y <- df$A                   # both new columns take their values from A
df$X[c(FALSE, TRUE)] <- NA     # blank out the even rows of X
df$Y[c(TRUE, FALSE)] <- NA     # blank out the odd rows of Y
df
A B X Y
1 2 3 2 NA
2 7 5 NA 7
3 5 1 5 NA
4 11 21 NA 11
5 54 67 54 NA
6 12 32 NA 12
7 34 19 34 NA
8 14 24 NA 14
9 10 44 10 NA
10 6 37 NA 6
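To make the recycling step more concrete, here is a minimal sketch on a plain vector; the vector `x` is made up for illustration. A logical index shorter than the vector is repeated until it covers every position, so `c(TRUE, FALSE)` selects positions 1, 3, 5, ... and `c(FALSE, TRUE)` selects positions 2, 4, ...

```r
x <- c(10, 20, 30, 40, 50)

odd  <- x[c(TRUE, FALSE)]   # index recycled to TRUE FALSE TRUE FALSE TRUE
even <- x[c(FALSE, TRUE)]   # index recycled to FALSE TRUE FALSE TRUE FALSE

odd
even
```

The same recycling applies on the left-hand side of an assignment, which is what lets a two-element index blank out every second row of a column.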
The following code returns the mean of delta1, restricted to the values (in months) that are greater than 6.
mean(df[df$delta1>6, "delta1"], na.rm=TRUE)
Now, how do I apply this to every column in the data frame?
df:
delta1 delta2 delta3
NA 2 3
4 NA 6
7 8 NA
10 NA 12
NA 14 15
16 NA 18
19 20 NA
The apply-family of functions is useful here:
sapply(df, function(x) mean(x[x>6], na.rm=TRUE))
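If the threshold needs to vary, the same idea can be wrapped in a small function. This is only a sketch: the name `mean_above` and its `threshold` argument are hypothetical, and `vapply` is used in place of `sapply` so that the result is guaranteed to be one number per column.

```r
df <- data.frame(
  delta1 = c(NA, 4, 7, 10, NA, 16, 19),
  delta2 = c(2, NA, 8, NA, 14, NA, 20),
  delta3 = c(3, 6, NA, 12, 15, 18, NA)
)

# Hypothetical helper: per-column mean of the values above `threshold`
mean_above <- function(data, threshold = 6) {
  vapply(data,
         function(x) mean(x[x > threshold], na.rm = TRUE),
         numeric(1))
}

res <- mean_above(df)
res
```

With the example data this should agree with the `colMeans` result of 13, 14 and 15 shown in the other answer.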
We can set the values in the data frame that are less than or equal to 6 to NA and then compute the column means with colMeans, ignoring the NA values. Note that this overwrites df.
df[df <= 6] <- NA
colMeans(df, na.rm = TRUE)
#delta1 delta2 delta3
# 13 14 15
data
df <- structure(list(delta1 = c(NA, 4L, 7L, 10L, NA, 16L, 19L), delta2 = c(2L,
NA, 8L, NA, 14L, NA, 20L), delta3 = c(3L, 6L, NA, 12L, 15L, 18L,
NA)), class = "data.frame", row.names = c(NA, -7L))
I have data like this:
df<-structure(list(X1 = c(37L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, NA,
11L, 12L), X2 = c(40L, NA, 35L, 35L, 35L, 34L, NA, 28L, 28L,
NA, 25L, 24L), X3 = c(60L, 44L, 49L, 41L, NA, NA, NA, 25L, 26L,
NA, NA, 22L), T1 = c(19L, 55L, 47L, 46L, 36L, 42L, 25L, NA, 33L,
42L, 50L, 22L), T2 = c(75L, NA, 32L, 44L, 27L, 31L, 17L, NA,
18L, 45L, 10L, 11L), T3 = c(5L, 6L, 7L, 8L, 9L, 10L, 11L, NA,
46L, 36L, 42L, NA), P1 = c(2L, 2L, 3L, 4L, 2L, 6L, 7L, 8L, 9L,
NA, 1L, 12L), P2 = c(40L, 44L, 4L, 2L, 1L, 1L, NA, 1L, 1L, 1L,
5L, 55L), P3 = c(1L, 44L, 49L, 3L, NA, NA, NA, 25L, 26L, NA,
NA, 66L)), class = "data.frame", row.names = c(NA, -12L))
I have three groups, called X, T and P, and each group has 3 columns.
I am trying to find out how many rows in each group overlap with another group and how many rows differ from it. (A row only counts for a group if it has at least 2 values in that group's columns.)
So I am looking for an output like this:
X 10 rows overlapping with T and 2 different
T has 10 overlapping with X and 2 different
X has 10 overlapping with P and 1 different
T has 10 overlapping with P and 3 different
This means that 10 rows of X1, X2 and X3 have at least 2 values and also have values in group T (T1, T2, T3). There is one row that is completely empty or has only 1 value in X but has values in the T group.
The same applies to the other combinations.
This question is still sort of ambiguous and narrow, but here is the general idea for tidying your data to the point where you can easily summarize over different groups and/or rows:
library(tidyverse)
df %>%
  as_tibble() %>%
  rowid_to_column() %>%
  gather(key, value, -rowid) %>%   # long format: one row per original cell
  separate(key, into = c('group', 'column'), sep = 1) %>%
  group_by(group)
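In current tidyr, `gather` is superseded by `pivot_longer`, which can reshape and split the column names in one step. The following is a sketch on a shortened, made-up subset of the data; the regular expression in `names_pattern` assumes every column name is one capital letter followed by digits.

```r
library(dplyr)
library(tidyr)

# Shortened subset of the question's data, for illustration only
df <- data.frame(X1 = c(37, 2, NA), X2 = c(40, NA, NA),
                 T1 = c(19, 55, 47), T2 = c(75, NA, 32))

long <- df |>
  mutate(rowid = row_number()) |>
  pivot_longer(-rowid,
               names_to = c("group", "column"),
               names_pattern = "([A-Z])([0-9]+)") |>  # "X1" -> "X" and "1"
  group_by(group)

long
```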
Extending along the lines of John Colby's answer, you can summarize how many rows are populated with 2 or more non-NA values in each letter's columns:
library(tidyverse)
df_summarized <- df %>%
rowid_to_column() %>%
gather(colname, value, -rowid) %>%
separate(colname, into = c("letter", "number"), sep = 1) %>%
count(rowid, letter, wt = !is.na(value), name = "num_values") %>%
mutate(populated = num_values >= 2)
> df_summarized
# A tibble: 36 x 4
rowid letter num_values populated
<int> <chr> <int> <lgl>
1 1 P 3 TRUE
2 1 T 3 TRUE
3 1 X 3 TRUE
4 2 P 3 TRUE
5 2 T 2 TRUE
6 2 X 2 TRUE
7 3 P 3 TRUE
8 3 T 3 TRUE
9 3 X 3 TRUE
10 4 P 3 TRUE
# ... with 26 more rows
And then use that to compare between letters. For instance, here I see that 9 rows have the same populated / not-populated status among X and T columns. Three rows (7, 8, and 10) differ in their populated status between those two letters.
> df_summarized %>%
+ select(-num_values) %>%
+ spread(letter, populated)
# A tibble: 12 x 4
rowid P T X
<int> <lgl> <lgl> <lgl>
1 1 TRUE TRUE TRUE
2 2 TRUE TRUE TRUE
3 3 TRUE TRUE TRUE
4 4 TRUE TRUE TRUE
5 5 TRUE TRUE TRUE
6 6 TRUE TRUE TRUE
7 7 FALSE TRUE FALSE # T but no X
8 8 TRUE FALSE TRUE # X but no T
9 9 TRUE TRUE TRUE
10 10 FALSE TRUE FALSE # T but no X
11 11 TRUE TRUE TRUE
12 12 TRUE TRUE TRUE
We could query the data like this to get the overlaps and non-overlaps:
df_summarized %>%
select(-num_values) %>%
spread(letter, populated) %>%
summarize(PT = sum(P==T),
PT_non = sum(P!=T),
TX = sum(T==X),
TX_non = sum(T!=X),
XP = sum(X==P),
XP_non = sum(X!=P))
# A tibble: 1 x 6
PT PT_non TX TX_non XP XP_non
<int> <int> <int> <int> <int> <int>
1 9 3 9 3 12 0
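As a cross-check, the populated flags can also be computed in base R with `rowSums`. This is a sketch on a made-up 4-row subset rather than the full data, and it assumes that each group's columns can be identified by the first letter of their names.

```r
# Made-up 4-row subset of the question's data
df <- data.frame(
  X1 = c(37, 2, 7, NA),  X2 = c(40, NA, NA, NA), X3 = c(60, 44, NA, NA),
  T1 = c(19, 55, 25, 42), T2 = c(75, NA, 17, 45), T3 = c(5, 6, 11, 36),
  P1 = c(2, 2, 7, NA),   P2 = c(40, 44, NA, 1),  P3 = c(1, 44, NA, NA)
)

# One logical column per group: TRUE if the row has >= 2 non-NA values
populated <- sapply(c("X", "T", "P"), function(g) {
  cols <- grep(paste0("^", g), names(df))
  rowSums(!is.na(df[cols])) >= 2
})

# Rows with the same / a different populated status in X and T
xt_same <- sum(populated[, "X"] == populated[, "T"])
xt_diff <- sum(populated[, "X"] != populated[, "T"])
```

Indexing the matrix by the string "T" also sidesteps any confusion between a column named T and the shorthand `T` for `TRUE`.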
Good evening,
I have the following data frame:
Sex A B C D E
M 1 20 45 42 12
F 2 10 32 23 43
M 39 32 2 23 43
M 24 43 2 44 12
F 11 3 4 4 11
How would I calculate a two-sample t-test for each numerical variable in the data frame above, grouped by the Sex variable, using the apply function? The result should be a matrix with five columns: F.mean (mean of the numerical variable for females), M.mean (mean of the numerical variable for males), t (the t-statistic), df (the degrees of freedom), and p (the p-value).
Thank you!!
Here is an option using apply with margin 2
out = apply(data[,-1], 2, function(x){
unlist(t.test(x[data$Sex == 'M'], x[data$Sex == 'F'])[c(1:3,5)],
recursive=FALSE)
})
#> out
# A B C D E
#statistic.t 1.2432059 3.35224633 -0.08318328 1.9649783 -0.2450115
#parameter.df 2.5766151 2.82875770 2.70763487 1.9931486 1.8474695
#p.value 0.3149294 0.04797862 0.93946696 0.1887914 0.8309453
#estimate.mean of x 21.3333333 31.66666667 16.33333333 36.3333333 22.3333333
#estimate.mean of y 6.5000000 6.50000000 18.00000000 13.5000000 27.0000000
data
data = structure(list(Sex = structure(c(2L, 1L, 2L, 2L, 1L), .Label = c("F",
"M"), class = "factor"), A = c(1L, 2L, 39L, 24L, 11L), B = c(20L,
10L, 32L, 43L, 3L), C = c(45L, 32L, 2L, 2L, 4L), D = c(42L, 23L,
23L, 44L, 4L), E = c(12L, 43L, 43L, 12L, 11L)), .Names = c("Sex",
"A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-5L))
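To get the layout asked for in the question (one row per variable, with columns F.mean, M.mean, t, df and p), the result can be built up by name and transposed. This is a sketch under one assumption that differs from the code above: the female group is passed to `t.test` first, so the sign of t is flipped relative to the output shown there.

```r
data <- data.frame(
  Sex = c("M", "F", "M", "M", "F"),
  A = c(1, 2, 39, 24, 11), B = c(20, 10, 32, 43, 3),
  C = c(45, 32, 2, 2, 4),  D = c(42, 23, 23, 44, 4),
  E = c(12, 43, 43, 12, 11)
)

res <- t(apply(data[, -1], 2, function(x) {
  # Welch two-sample t-test (the t.test default), females first
  tt <- t.test(x[data$Sex == "F"], x[data$Sex == "M"])
  c(F.mean = unname(tt$estimate[1]),
    M.mean = unname(tt$estimate[2]),
    t      = unname(tt$statistic),
    df     = unname(tt$parameter),
    p      = tt$p.value)
}))

res
```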
This should be possible with a combination of apply, t.test and aggregate, I think. First turn the row names into a names column. Then you can subset with aggregate and run t.test through apply.
I want to subset a data frame of 3 columns based on one column that has duplicate IDs, so that I only keep the rows with unique values.
structure(list(ID = 1:4, x = c(46L, 47L, 47L, 47L), y = c(5L,
6L, 7L, 7L)), .Names = c("ID", "x", "y"), row.names = c(1L, 6L,
11L, 16L), class = "data.frame")
Using the data frame method of duplicated should work:
dat[!duplicated(dat),] # (equivalent here to dat[!duplicated(dat$ID),] )
ID x y
1 1 46 5
6 2 47 6
11 3 47 7
16 4 47 7
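The question is ambiguous about which columns define a duplicate. If uniqueness is meant on `x` and `y` rather than on the whole row or on `ID` (an assumption, not something the question states), `duplicated` can be applied to just those columns, as in this sketch:

```r
dat <- structure(list(ID = 1:4, x = c(46L, 47L, 47L, 47L),
                      y = c(5L, 6L, 7L, 7L)),
                 row.names = c(1L, 6L, 11L, 16L), class = "data.frame")

# Keep the first row of every distinct (x, y) combination
uniq_xy <- dat[!duplicated(dat[c("x", "y")]), ]
uniq_xy
```

Here the last row repeats the `x` and `y` values of the third row, so it is dropped even though its `ID` is different.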