subsetting columns based on columns containing duplicates - r

I want to subset 3 columns based on one of the columns, which has duplicate IDs, so that I only get the 3 columns with the unique values.
dat <- structure(list(ID = 1:4, x = c(46L, 47L, 47L, 47L), y = c(5L,
6L, 7L, 7L)), .Names = c("ID", "x", "y"), row.names = c(1L, 6L,
11L, 16L), class = "data.frame")

Using the data frame method of duplicated should work:
dat[!duplicated(dat),] # (equivalent here to dat[!duplicated(dat$ID),] )
ID x y
1 1 46 5
6 2 47 6
11 3 47 7
16 4 47 7
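Note that the example data has no repeated ID, so nothing is dropped in the output above. As a minimal sketch of how the two calls differ once an ID does repeat, the following uses a made-up data frame dat2 (an assumption for illustration, not data from the question):

```r
# Hypothetical data: ID 2 appears twice, but with different y values
dat2 <- data.frame(ID = c(1L, 2L, 2L, 3L),
                   x  = c(46L, 47L, 47L, 48L),
                   y  = c(5L, 6L, 9L, 7L))

# Whole-row check: no row is identical in every column, so all 4 rows stay
by_row <- dat2[!duplicated(dat2), ]

# ID-only check: keeps the first row seen for each ID, so 3 rows stay
by_id <- dat2[!duplicated(dat2$ID), ]
```

If the goal is one row per ID, the ID-only form is probably the one to use.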

Related

R get new table

I have a table with two columns A and B. I want to create a new table with two new columns added: X and Y.
The X column is to contain the values from the A column, but with the division performed. Values from the first row (from column A) divided by the values from the second row in column A and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
The Y column is to contain the values from the B column, but with the division performed. Values from the first row (from column B) divided by the values from the second row in column B and so for all subsequent rows, e.g. the third row divided by the fourth row etc.
So far I used Excel for this. But now I need it in R if possible in the form of a function so that I can reuse this code easily. I haven't done this in R yet, so I am asking for help.
Example data:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L)), class = "data.frame", row.names = c(NA,
-10L))
Sample results:
structure(list(A = c(2L, 7L, 5L, 11L, 54L, 12L, 34L, 14L, 10L,
6L), B = c(3L, 5L, 1L, 21L, 67L, 32L, 19L, 24L, 44L, 37L), X = c("",
"0,285714286", "", "0,454545455", "", "4,5", "", "2,428571429",
"", "1,666666667"), Y = c("", "0,6", "", "0,047619048", "", "2,09375",
"", "0,791666667", "", "1,189189189")), class = "data.frame", row.names = c(NA,
-10L))
You could use dplyr's across and lag (combined with modulo for picking every second row):
library(dplyr)
df |> mutate(across(c(A, B), ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
If you want a character vector change NA to "".
Output:
A B new_A new_B
1 2 3 NA NA
2 7 5 0.2857143 0.60000000
3 5 1 NA NA
4 11 21 0.4545455 0.04761905
5 54 67 NA NA
6 12 32 4.5000000 2.09375000
7 34 19 NA NA
8 14 24 2.4285714 0.79166667
9 10 44 NA NA
10 6 37 1.6666667 1.18918919
Function:
ab_fun <- function(data, vars) {
  data |>
    # {{ vars }} forwards the caller's column selection to across()
    mutate(across({{ vars }}, ~ ifelse(row_number() %% 2 == 0, lag(.) / ., NA), .names = "new_{.col}"))
}
ab_fun(df, c(A, B))
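For comparison, the same pairwise division can be sketched in base R without dplyr. The helper name pair_ratio is hypothetical, and the sketch assumes that each even row should receive the value of the row before it divided by its own value, with NA elsewhere:

```r
# Hypothetical helper: divide each odd-positioned value by the value after it
pair_ratio <- function(v) {
  out <- rep(NA_real_, length(v))
  even <- seq_len(length(v) %/% 2) * 2   # positions 2, 4, 6, ...
  out[even] <- v[even - 1] / v[even]     # previous row / current row
  out
}

# First four rows of the example data from the question
df_small <- data.frame(A = c(2L, 7L, 5L, 11L), B = c(3L, 5L, 1L, 21L))

# Apply the helper to both columns and store the results as X and Y
df_small[c("X", "Y")] <- lapply(df_small[c("A", "B")], pair_ratio)
```

A trailing unpaired row (odd number of rows) is simply left as NA.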

Choose rows in which the absolute value of a subtraction is at least a specified value

Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
And I want to select the rows in which the absolute value of the difference between columns X1 and X2 is greater than or equal to 2.
For example, the row 4 value is 4-1, which is 3, so it should be selected.
The row 9 value is 10-20, which is -10. The absolute value is 10, so it should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9.
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
does give me rows where the result is more than 2, but it only works for my second column and does not select the rows properly.
We can take the difference between the 'X1' and 'X2' columns and use it in a logical expression inside subset to select the rows:
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
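The same filter can also be written with plain logical indexing. This is only a sketch of an alternative, but it shows where the attempt in the question went wrong: the condition belongs before the comma (rows), not after it (columns), and it should compare X1 with X2 rather than subtract a column from the whole data frame.

```r
dataset <- data.frame(ID = 1:9,
                      X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L),
                      X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L))

# One TRUE/FALSE per row: is |X1 - X2| at least 2?
keep <- abs(dataset$X1 - dataset$X2) >= 2

# Logical vector goes in the row position, columns are left empty
selected <- dataset[keep, ]
```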

How do I replace the fourth row of values in a data frame with a corresponding vector in R?

I have a set of values
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 3 7 15
I would like to replace row 4 with a vector
c(4,8,12,16)
I would like to insert the vector in row 4 and replace the original values. I tried this script:
df[[4]]<- vector_name
I expect the result
col1|col2|col3|col4
5 10 15 20
2 4 6 8
3 6 9 12
4 8 12 16
We can use replace
replace(df1, cbind(nrow(df1), seq_along(df1)), v1)
data
df1 <- structure(list(col1 = c(5L, 2L, 3L, 4L), col2 = c(10L, 4L, 6L,
3L), col3 = c(15L, 6L, 9L, 7L), col4 = c(20L, 8L, 12L, 15L)),
class = "data.frame", row.names = c(NA,
-4L))
v1 <- c(4, 8, 12, 16)
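As a simpler alternative sketch, the row can be assigned directly with row indexing. The attempt in the question, df[[4]] <- vector_name, selects the fourth column rather than the fourth row, which is why it does not give the expected result.

```r
df1 <- data.frame(col1 = c(5L, 2L, 3L, 4L),
                  col2 = c(10L, 4L, 6L, 3L),
                  col3 = c(15L, 6L, 9L, 7L),
                  col4 = c(20L, 8L, 12L, 15L))
v1 <- c(4, 8, 12, 16)

# Index before the comma picks the row; the empty column slot means "all columns"
df1[4, ] <- v1
```

This modifies df1 in place, whereas replace returns a modified copy.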

Subsetting a dataset into multiples subsets in R

I have a data that looks something like this:
structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L,
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID",
"Value"), class = "data.frame", row.names = c(NA, -9L))
I would like to divide this dataset into multiple datasets on the basis of the ID values, i.e. one dataset that contains only ID = a, another that contains only ID = b, and so on.
How do I do this subsetting automatically in R? I understand that if the number of distinct values in ID is small, we could just do it manually, but in case there are a lot of values under ID, there has to be a smarter way of doing this.
You can use the split function.
df <- structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L,
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID",
"Value"), class = "data.frame", row.names = c(NA, -9L))
> df
ID Value
1 a 10
2 b 13
3 b 12
4 c 43
5 d 23
6 e 66
7 f 78
8 f 42
9 f 19
listed_df <- split(df, df$ID)
> listed_df
$a
ID Value
1 a 10
$b
ID Value
2 b 13
3 b 12
$c
ID Value
4 c 43
$d
ID Value
5 d 23
$e
ID Value
6 e 66
$f
ID Value
7 f 78
8 f 42
9 f 19
To access one of these, just index the list with $:
sum(listed_df$f$Value)
You can also lapply a function across each of the data frames within the list. If you wanted to sum up each Value, for example, you could do:
lapply(listed_df, function(x) sum(x$Value))
You can also do this just by grouping the original data frame by ID and then performing summarise operations on it from there.
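One way to sketch that grouped summary without splitting first is base R's aggregate. This is an assumed choice for illustration; a dplyr group_by and summarise pipeline would be another option.

```r
df <- data.frame(ID = c("a", "b", "b", "c", "d", "e", "f", "f", "f"),
                 Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L))

# Sum Value within each ID; the result has one row per distinct ID
sums <- aggregate(Value ~ ID, data = df, FUN = sum)
```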
This should be pretty easy.
exampleb <- subset(df, ID == 'b')
exampleb
ID Value
2 b 13
3 b 12
Also, take a look at these links.
https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/
https://www.statmethods.net/management/subset.html

R Programming: Calculate two-sample t-test for a data frame with formatting

Good evening,
I have the following data frame:
Sex A B C D E
M 1 20 45 42 12
F 2 10 32 23 43
M 39 32 2 23 43
M 24 43 2 44 12
F 11 3 4 4 11
How would I calculate the two-sample t-test for each numerical variable in the data frame listed above, grouped by the Sex variable, using the apply function? The result should be a matrix that contains five columns: F.mean (mean of the numerical variable for Female), M.mean (mean of the numerical variable for Male), t (for t-statistics), df (for degrees of freedom), and p (for p-value).
Thank you!!
Here is an option using apply with margin 2
out = apply(data[,-1], 2, function(x){
  unlist(t.test(x[data$Sex == 'M'], x[data$Sex == 'F'])[c(1:3,5)],
         recursive=FALSE)
})
#> out
# A B C D E
#statistic.t 1.2432059 3.35224633 -0.08318328 1.9649783 -0.2450115
#parameter.df 2.5766151 2.82875770 2.70763487 1.9931486 1.8474695
#p.value 0.3149294 0.04797862 0.93946696 0.1887914 0.8309453
#estimate.mean of x 21.3333333 31.66666667 16.33333333 36.3333333 22.3333333
#estimate.mean of y 6.5000000 6.50000000 18.00000000 13.5000000 27.0000000
data
data = structure(list(Sex = structure(c(2L, 1L, 2L, 2L, 1L), .Label = c("F",
"M"), class = "factor"), A = c(1L, 2L, 39L, 24L, 11L), B = c(20L,
10L, 32L, 43L, 3L), C = c(45L, 32L, 2L, 2L, 4L), D = c(42L, 23L,
23L, 44L, 4L), E = c(12L, 43L, 43L, 12L, 11L)), .Names = c("Sex",
"A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-5L))
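The result above has one column per variable. To get closer to the layout asked for in the question (one row per variable with columns F.mean, M.mean, t, df and p), the same apply call can be sketched with named return values and a transpose. The sign of t follows the M minus F order used above, which is an assumption about what is wanted.

```r
data <- data.frame(Sex = factor(c("M", "F", "M", "M", "F")),
                   A = c(1L, 2L, 39L, 24L, 11L),
                   B = c(20L, 10L, 32L, 43L, 3L),
                   C = c(45L, 32L, 2L, 2L, 4L),
                   D = c(42L, 23L, 23L, 44L, 4L),
                   E = c(12L, 43L, 43L, 12L, 11L))

res <- t(apply(data[, -1], 2, function(x) {
  # Welch two-sample t-test, males as first sample and females as second
  tt <- t.test(x[data$Sex == "M"], x[data$Sex == "F"])
  c(F.mean = unname(tt$estimate[2]),   # second estimate is the female mean
    M.mean = unname(tt$estimate[1]),   # first estimate is the male mean
    t      = unname(tt$statistic),
    df     = unname(tt$parameter),
    p      = tt$p.value)
}))
```

Here res is a numeric matrix with the variables A to E as row names.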
It should be a combination of apply, t.test and aggregate, I think. But first turn the row names into a named column. Then you can do the subsetting with aggregate and then apply t.test.
