Subsetting a dataset into multiple subsets in R - r

I have data that looks something like this:
structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L,
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID",
"Value"), class = "data.frame", row.names = c(NA, -9L))
I would like to divide this dataset into multiple datasets on the basis of the ID values, i.e. one dataset that contains only ID = a, another that contains only ID = b, and so on.
How do I do this subsetting automatically in R? I understand that if ID has only a few distinct values, we could just do it manually, but when there are a lot of values under ID, there has to be a smarter way of doing this.

You can use the split function.
df <- structure(list(ID = structure(c(1L, 2L, 2L, 3L, 4L, 5L, 6L, 6L,
6L), .Label = c("a", "b", "c", "d", "e", "f"), class = "factor"),
Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)), .Names = c("ID",
"Value"), class = "data.frame", row.names = c(NA, -9L))
> df
ID Value
1 a 10
2 b 13
3 b 12
4 c 43
5 d 23
6 e 66
7 f 78
8 f 42
9 f 19
listed_df <- split(df, df$ID)
> listed_df
$a
ID Value
1 a 10
$b
ID Value
2 b 13
3 b 12
$c
ID Value
4 c 43
$d
ID Value
5 d 23
$e
ID Value
6 e 66
$f
ID Value
7 f 78
8 f 42
9 f 19
To access one of these, just index the list with $.
sum(listed_df$f$Value)
You can also lapply a function across each of the data frames within the list. If you wanted to sum up each Value, for example, you could do:
lapply(listed_df, function(x) sum(x$Value))
You can also do this just by grouping the original data frame by ID and then performing summarise operations on it from there.
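As a minimal sketch of that grouping idea in base R, assuming the goal is simply the sum of Value per ID (aggregate() is just one option here, not necessarily what the answer had in mind):

```r
# Same data as in the question, rebuilt with data.frame() for brevity
df <- data.frame(
  ID = factor(c("a", "b", "b", "c", "d", "e", "f", "f", "f")),
  Value = c(10L, 13L, 12L, 43L, 23L, 66L, 78L, 42L, 19L)
)

# One row per ID, with Value summed within each group
sums <- aggregate(Value ~ ID, data = df, FUN = sum)

# The split()/sapply() route gives the same totals as a named vector
sums_by_split <- sapply(split(df$Value, df$ID), sum)
```

aggregate() returns a data frame, which is handy if the result feeds into further data frame operations; the sapply() version returns a named vector instead.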

This should be pretty easy.
exampleb <- subset(df, ID == 'b')
exampleb
ID Value
2 b 13
3 b 12
Also, take a look at these links.
https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/
https://www.statmethods.net/management/subset.html

Related

Creating a new data frame from two existing data frame based on values from two columns [duplicate]

Input Data Frame
DF 1 (example - nrow = 10)
Col A | Col B | Col C
a 1 2
a 3 4
b 5 6
c 9 10
DF 2 (example - nrow = 20)
Col A | Col B | Col E
a 1 22
a 31 41
a 3 63
b 5 6
b 11 13
c 9 20
I want to create a third data set which contains each additional row found in Data Frame 2, i.e. each row whose Col A and Col B combination does not appear in Data Frame 1.
Output File (nrow = 20-10 = 10)
Col A | Col B | Col E
a 31 41
b 11 13
library(dplyr)
anti_join(df2, df1, by = c("ColA", "ColB"))
# ColA ColB ColE
# 1 a 31 41
# 2 b 11 13
Data:
df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1L, 3L,
5L, 9L), ColC = c(2L, 4L, 6L, 10L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColB = c(1L,
31L, 3L, 5L, 11L, 9L), ColE = c(22L, 41L, 63L, 6L, 13L, 20L)), class = "data.frame", row.names = c(NA,
-6L))
We can use
library(data.table)
setDT(df2)[!df1, on = .(ColA, ColB)]
# ColA ColB ColE
#1: a 31 41
#2: b 11 13
data
df1 <- structure(list(ColA = c("a", "a", "b", "c"), ColB = c(1L, 3L,
5L, 9L), ColC = c(2L, 4L, 6L, 10L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(ColA = c("a", "a", "a", "b", "b", "c"), ColB = c(1L,
31L, 3L, 5L, 11L, 9L), ColE = c(22L, 41L, 63L, 6L, 13L, 20L)), class = "data.frame", row.names = c(NA,
-6L))
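If neither dplyr nor data.table is available, the same anti-join can be sketched in base R. This is only an illustration: the helper names key1, key2 and only_in_df2 are made up here, and the "|" separator assumes that character never occurs in the key columns.

```r
df1 <- data.frame(ColA = c("a", "a", "b", "c"),
                  ColB = c(1L, 3L, 5L, 9L),
                  ColC = c(2L, 4L, 6L, 10L))
df2 <- data.frame(ColA = c("a", "a", "a", "b", "b", "c"),
                  ColB = c(1L, 31L, 3L, 5L, 11L, 9L),
                  ColE = c(22L, 41L, 63L, 6L, 13L, 20L))

# Build one composite key per row from the two join columns
key1 <- paste(df1$ColA, df1$ColB, sep = "|")
key2 <- paste(df2$ColA, df2$ColB, sep = "|")

# Keep only the rows of df2 whose key never appears in df1
only_in_df2 <- df2[!key2 %in% key1, ]
```

Note that the original row names of df2 (2 and 5) are kept; reset them with rownames(only_in_df2) <- NULL if that matters.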

r - Find corresponding value from multiple columns according to pmin in multiple columns

My df is something like this:
Item P P1 P2 P3 D1 D2 D3 pmin num NP
A 10 8 11 20 2 1 10 1 D2 11
B 10 8 11 20 2 1 10 1 D2 11
C 10 8 11 20 2 1 10 1 D2 11
D 50 40 35 70 10 15 20 10 D1 40
E 20 15 22 30 5 2 10 2 D2 22
As shown in my df above, I first calculated D1 and D2. 'pmin' is the parallel minimum of D1 and D2, and 'num' gives the column name (D1 or D2) corresponding to my pmin.
Now what I want is a new column called 'NP' that gives me the corresponding value in P1 or P2 according to the pmin (by looking across the row). For example, if 'num' says D2, looking across the row, I return the value from P2; if 'num' says D1, I return the value from P1.
I'm not sure if I explained it clearly, but here's how I computed 'pmin' and 'num':
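# select D1 and D2 by name; in the table shown above, positions 5:6 would be P3 and D1
df$pmin = do.call(pmin, df[, c("D1", "D2")])
df$num = apply(df[, c("D1", "D2")], 1, function(x) names(x)[which.min(x)])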
Also in my real dataset, I have P1 through P4 and D1 through D4.
I tried something like
ifelse( num == 'D1', P1, P2)
but it doesn't work for more than two columns (P1 to P4, and so on).
Thanks in advance!!
btw does anyone know how to use
case_when()
from library(dplyr) to get 'NP'?
We can use row/column indexing to extract the elements of the 'P1', 'P2', 'P3' columns that correspond to the 'D1', 'D2', 'D3' value named in 'num':
m1 <- cbind(seq_len(nrow(df)), match(df$num, c("D1", "D2", "D3")))
df$NP <- df[c("P1", "P2", "P3")][m1]
df$NP
#[1] 11 11 11 40 22
data
df <- structure(list(Item = c("A", "B", "C", "D", "E"), P = c(10L,
10L, 10L, 50L, 20L), P1 = c(8L, 8L, 8L, 40L, 15L), P2 = c(11L,
11L, 11L, 35L, 22L), P3 = c(20L, 20L, 20L, 70L, 30L), D1 = c(2L,
2L, 2L, 10L, 5L), D2 = c(1L, 1L, 1L, 15L, 2L), D3 = c(10L, 10L,
10L, 20L, 10L), pmin = c(1L, 1L, 1L, 10L, 2L), num = c("D2",
"D2", "D2", "D1", "D2"), NP = c(11L, 11L, 11L, 40L, 22L)),
class = "data.frame", row.names = c(NA,
-5L))
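For the real data with P1 through P4 and D1 through D4, the same matrix-indexing idea can be extended without hard-coding each column. A rough sketch, assuming the P and D columns are listed in matching order; p_cols, d_cols and the use of max.col() are choices made for this illustration, not part of the answer above:

```r
df <- data.frame(
  Item = c("A", "B", "C", "D", "E"),
  P1 = c(8, 8, 8, 40, 15),
  P2 = c(11, 11, 11, 35, 22),
  P3 = c(20, 20, 20, 70, 30),
  D1 = c(2, 2, 2, 10, 5),
  D2 = c(1, 1, 1, 15, 2),
  D3 = c(10, 10, 10, 20, 10)
)

# Extend both vectors (e.g. with "P4" and "D4") for the real data
p_cols <- c("P1", "P2", "P3")
d_cols <- c("D1", "D2", "D3")

d_mat <- as.matrix(df[d_cols])

# Column position of the row-wise minimum; negating turns max.col() into a row-wise which.min()
idx <- max.col(-d_mat, ties.method = "first")

# Two-column (row, column) matrix used to pick one cell per row
pick <- cbind(seq_len(nrow(df)), idx)

df$pmin <- d_mat[pick]
df$num  <- d_cols[idx]
df$NP   <- as.matrix(df[p_cols])[pick]
```

Because everything is driven by idx, no ifelse() or case_when() chain is needed, however many P/D pairs there are.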

R Programming: Calculate two-sample t-test for a data frame with formatting

Good evening,
I have the following data frame:
Sex A B C D E
M 1 20 45 42 12
F 2 10 32 23 43
M 39 32 2 23 43
M 24 43 2 44 12
F 11 3 4 4 11
How would I calculate the two-sample t-test for each numerical variable in the data frame above, grouped by the Sex variable, using the apply function? The result should be a matrix that contains five columns: F.mean (mean of the numerical variable for Female), M.mean (mean of the numerical variable for Male), t (for the t-statistic), df (for degrees of freedom), and p (for p-value).
Thank you!!
Here is an option using apply with margin 2
out = apply(data[,-1], 2, function(x){
unlist(t.test(x[data$Sex == 'M'], x[data$Sex == 'F'])[c(1:3,5)],
recursive=FALSE)
})
#> out
# A B C D E
#statistic.t 1.2432059 3.35224633 -0.08318328 1.9649783 -0.2450115
#parameter.df 2.5766151 2.82875770 2.70763487 1.9931486 1.8474695
#p.value 0.3149294 0.04797862 0.93946696 0.1887914 0.8309453
#estimate.mean of x 21.3333333 31.66666667 16.33333333 36.3333333 22.3333333
#estimate.mean of y 6.5000000 6.50000000 18.00000000 13.5000000 27.0000000
data
data = structure(list(Sex = structure(c(2L, 1L, 2L, 2L, 1L), .Label = c("F",
"M"), class = "factor"), A = c(1L, 2L, 39L, 24L, 11L), B = c(20L,
10L, 32L, 43L, 3L), C = c(45L, 32L, 2L, 2L, 4L), D = c(42L, 23L,
23L, 44L, 4L), E = c(12L, 43L, 43L, 12L, 11L)), .Names = c("Sex",
"A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA,
-5L))
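To get the exact layout asked for in the question (one row per variable, with columns F.mean, M.mean, t, df and p), the result can be named and transposed. A sketch under the assumption that the Female group is passed first, so the sign of t is flipped relative to the output above; res is just an illustrative name:

```r
data <- data.frame(
  Sex = c("M", "F", "M", "M", "F"),
  A = c(1, 2, 39, 24, 11),
  B = c(20, 10, 32, 43, 3),
  C = c(45, 32, 2, 2, 4),
  D = c(42, 23, 23, 44, 4),
  E = c(12, 43, 43, 12, 11)
)

# apply() over columns returns one column per variable, so transpose at the end
res <- t(apply(data[, -1], 2, function(x) {
  tt <- t.test(x[data$Sex == "F"], x[data$Sex == "M"])  # Welch test by default
  c(F.mean = unname(tt$estimate[1]),
    M.mean = unname(tt$estimate[2]),
    t      = unname(tt$statistic),
    df     = unname(tt$parameter),
    p      = tt$p.value)
}))
```

res is then a 5 x 5 numeric matrix with row names A to E.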
It should be a combination of apply, t.test and aggregate, I think. But first turn the row names into a names column. Then you can do the subsetting with aggregate and apply t.test to the result.

How to intersect values from two data frames with R

I would like to create a new column for a data frame with values from the intersection of a row and a column.
I have a data.frame called "time":
q 1 2 3 4 5
a 1 13 43 5 3
b 2 21 12 3353 34
c 3 21 312 123 343
d 4 123 213 123 35
e 4556 11 123 12 3
And another table, called "event":
q dt
a 1
b 3
c 4
d 2
e 1
I want to add another column called inter to the second table, filled with the values found at the intersection of the row given by q and the column given by dt in the first data.frame. So the result would be this:
q dt inter
a 1 1
b 3 12
c 4 123
d 2 123
e 1 4556
I have tried to use merge(event, time, by.x = "q", by.y = "dt"), but it generates an error saying that they aren't the same id. I have also tried to transpose the time data.frame to cross-reference the values, but without success.
library(reshape2)
merge(event, melt(time, id.vars = "q"),
by.x=c('q','dt'), by.y=c('q','variable'), all.x = TRUE)
Output:
q dt value
1 a 1 1
2 b 3 12
3 c 4 123
4 d 2 123
5 e 1 4556
Notes
We use the function melt from the package reshape2 to convert the data frame time from wide to long format. We then merge (left outer join) the data frames event and the melted time on two columns (q and dt in event, q and variable in the melted time).
Data:
time <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), `1` = c(1L, 2L, 3L, 4L, 4556L), `2` = c(13L,
21L, 21L, 123L, 11L), `3` = c(43L, 12L, 312L, 213L, 123L), `4` = c(5L,
3353L, 123L, 123L, 12L), `5` = c(3L, 34L, 343L, 35L, 3L)), .Names = c("q",
"1", "2", "3", "4", "5"), class = "data.frame", row.names = c(NA,
-5L))
event <- structure(list(q = structure(1:5, .Label = c("a", "b", "c", "d",
"e"), class = "factor"), dt = c(1L, 3L, 4L, 2L, 1L)), .Names = c("q",
"dt"), class = "data.frame", row.names = c(NA, -5L))
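The same lookup can also be sketched in base R with matrix indexing, without reshaping. This is an illustration only: the data frames are renamed time_df and event_df here (to avoid masking base R's time() function), and it assumes every q and dt in event has a matching row and column in time.

```r
time_df <- data.frame(
  q   = c("a", "b", "c", "d", "e"),
  `1` = c(1L, 2L, 3L, 4L, 4556L),
  `2` = c(13L, 21L, 21L, 123L, 11L),
  `3` = c(43L, 12L, 312L, 213L, 123L),
  `4` = c(5L, 3353L, 123L, 123L, 12L),
  `5` = c(3L, 34L, 343L, 35L, 3L),
  check.names = FALSE  # keep the numeric column names as they are
)
event_df <- data.frame(q = c("a", "b", "c", "d", "e"),
                       dt = c(1L, 3L, 4L, 2L, 1L))

values <- as.matrix(time_df[-1])  # drop q, keep only the value columns

# Row position from q, column position from dt (matched against the column names)
row_idx <- match(event_df$q, time_df$q)
col_idx <- match(as.character(event_df$dt), colnames(values))

event_df$inter <- values[cbind(row_idx, col_idx)]
```

Matching on names rather than positions means the result does not depend on the row order of the two tables.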
This may be a little clunky but it works:
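# merge once, outside the loop, so that each row of time gets its dt
xx=merge(time,event,by='q')
dt=xx$dt
inter=c()
for (i in 1:nrow(xx)) {
# column 1 is q, so the value for dt[i] sits in column dt[i]+1
z=xx[i,dt[i]+1]
inter=c(inter,z)
}
final=cbind(xx[,1],dt,inter)
colnames(final)=c('q','dt','inter')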
Hope it helps.
Output:
q dt inter
[1,] 1 1 1
[2,] 2 3 12
[3,] 3 4 123
[4,] 4 2 123
[5,] 5 1 4556

subsetting columns based on columns containing duplicates

I want to subset 3 columns based on one of the columns, which has duplicate IDs, so that I only get the rows of the 3 columns that have unique values.
structure(list(ID = 1:4, x = c(46L, 47L, 47L, 47L), y = c(5L,
6L, 7L, 7L)), .Names = c("ID", "x", "y"), row.names = c(1L, 6L,
11L, 16L), class = "data.frame")
Using the data frame method of duplicated should work:
dat[!duplicated(dat),] # (equivalent here to dat[!duplicated(dat$ID),] )
ID x y
1 1 46 5
6 2 47 6
11 3 47 7
16 4 47 7
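In the sample data no row is a complete duplicate, which is why all four rows come back. If the duplicates are meant to be judged on some of the columns only, duplicated() can be applied to just those columns. A small sketch, where the choice of x and y as the deciding columns is an assumption about what the question intends:

```r
dat <- data.frame(ID = 1:4,
                  x = c(46L, 47L, 47L, 47L),
                  y = c(5L, 6L, 7L, 7L),
                  row.names = c(1L, 6L, 11L, 16L))

# Keep the first row for each distinct (x, y) combination, with all columns
unique_xy <- dat[!duplicated(dat[c("x", "y")]), ]

# If only the distinct (x, y) pairs are needed, unique() is enough
xy_pairs <- unique(dat[c("x", "y")])
```

Here the row with ID 4 is dropped because its x and y repeat those of ID 3.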
