Create dataset based on condition - r

I have a dataset new with variables a, b and c:
a b c
hdjfh 434 876
sdfdsf 34 98
gfdsdfdsf 534 672
rsdfdsf 65 87
gsdfdsf 67 54
vbvnn 98 09
gkhjgfk 100 768
rknfg 78 3546
I want to create two datasets: new1 should contain the records that satisfy the condition b > 110 or c > 110, and new2 should contain the records that do not satisfy it.

If you want to assign the two data sets to new variables, you can do this:
df <- data.frame(a=c('hdjfh','sdfdsf','gfdsdfdsf','rsdfdsf','gsdfdsf','vbvnn','gkhjgfk','rknfg'),b=c(434L,34L,534L,65L,67L,98L,100L,78L),c=c(876L,98L,672L,87L,54L,9L,768L,3546L),stringsAsFactors=FALSE);
cond <- df$b>110|df$c>110;
new1 <- df[cond,];
new2 <- df[!cond,];
new1;
## a b c
## 1 hdjfh 434 876
## 3 gfdsdfdsf 534 672
## 7 gkhjgfk 100 768
## 8 rknfg 78 3546
new2;
## a b c
## 2 sdfdsf 34 98
## 4 rsdfdsf 65 87
## 5 gsdfdsf 67 54
## 6 vbvnn 98 9
Another option is to use split() to get a list:
split(df,df$b>110|df$c>110);
## $`FALSE`
## a b c
## 2 sdfdsf 34 98
## 4 rsdfdsf 65 87
## 5 gsdfdsf 67 54
## 6 vbvnn 98 9
##
## $`TRUE`
## a b c
## 1 hdjfh 434 876
## 3 gfdsdfdsf 534 672
## 7 gkhjgfk 100 768
## 8 rknfg 78 3546
##
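If you go the split() route, the two pieces can be pulled out of the resulting list by name. This is a minimal sketch that reuses the question's data; the name parts is only an illustration.

```r
df <- data.frame(
  a = c("hdjfh", "sdfdsf", "gfdsdfdsf", "rsdfdsf", "gsdfdsf", "vbvnn", "gkhjgfk", "rknfg"),
  b = c(434L, 34L, 534L, 65L, 67L, 98L, 100L, 78L),
  c = c(876L, 98L, 672L, 87L, 54L, 9L, 768L, 3546L),
  stringsAsFactors = FALSE
)

# split() names the list elements after the values of the condition,
# so the elements are called "FALSE" and "TRUE"
parts <- split(df, df$b > 110 | df$c > 110)

new1 <- parts[["TRUE"]]   # rows where b > 110 or c > 110
new2 <- parts[["FALSE"]]  # all remaining rows
```

Keeping the list can be handy when both pieces get the same treatment afterwards, for example with lapply(parts, nrow).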

Related

R - Reducing a matrix

I have a square matrix that is like:
A <- c("111","111","111","112","112","113")
B <- c(100,10,20,NA,NA,10)
C <- c(10,20,40,NA,10,20)
D <- c(10,20,NA,NA,40,200)
E <- c(20,20,40,10,10,20)
F <- c(NA,NA,40,100,10,20)
G <- c(10,20,NA,30,10,20)
df <- data.frame(A,B,C,D,E,F,G)
names(df) <- c("Codes","111","111","111","112","112","113")
# Codes 111 111 111 112 112 113
# 1 111 100 10 10 20 NA 10
# 2 111 10 20 20 20 NA 20
# 3 111 20 40 NA 40 40 NA
# 4 112 NA NA NA 10 100 30
# 5 112 NA 10 40 10 10 10
# 6 113 10 20 200 20 20 20
I want to reduce it so that observations with the same row and column names are summed up.
So I want to end up with:
# Codes 111 112 113
# 1 111 230 120 30
# 2 112 50 130 40
# 3 113 230 40 20
I tried to first combine the rows with the same "Codes" number, but I was having a lot of trouble.
In tidyverse
library(tidyverse)
df %>%
pivot_longer(-Codes, values_drop_na = TRUE) %>%
group_by(Codes, name) %>%
summarise(value = sum(value), .groups = 'drop')%>%
pivot_wider()
# A tibble: 3 x 4
Codes `111` `112` `113`
<chr> <dbl> <dbl> <dbl>
1 111 230 120 30
2 112 50 130 40
3 113 230 40 20
One way in base R:
tapply(unlist(df[-1]), list(df[,1][row(df[-1])], names(df)[-1][col(df[-1])]), sum, na.rm = TRUE)
111 112 113
111 230 120 30
112 50 130 40
113 230 40 20
Note that this can be simplified, as pointed out by @thelatemail, to
grp <- expand.grid(df$Codes, names(df)[-1])
tapply(unlist(df[-1]), grp, FUN=sum, na.rm=TRUE)
You can also use xtabs:
xtabs(vals~., na.omit(cbind(grp, vals = unlist(df[-1]))))
Var2
Var1 111 112 113
111 230 120 30
112 50 130 40
113 230 40 20
When dealing with actual matrices, especially large ones, expressing the operation as (sparse) linear algebra should be the most efficient approach.
library(Matrix) ## for sparse matrix operations
idx <- c("111","111","111","112","112","113")
mat <- matrix(c(100,10,20,NA,NA,10,
10,20,40,NA,10,20,
10,20,NA,NA,40,200,
20,20,40,10,10,20,
NA,NA,40,100,10,20,
10,20,NA,30,10,20),
nrow=length(idx),
byrow=FALSE, dimnames=list(idx, idx)) ## each vector above is one column, as in the question
## convert NA's to zero
mat[is.na(mat)] <- 0
## examine matrix
mat
## 111 111 111 112 112 113
## 111 100 10 10 20 0 10
## 111 10 20 20 20 0 20
## 111 20 40 0 40 40 0
## 112 0 0 0 10 100 30
## 112 0 10 40 10 10 10
## 113 10 20 200 20 20 20
## indicator matrix
## converts between "code" and "idx" spaces
M_code_idx <- fac2sparse(idx)
## project to "code_code" space
M_code_idx %*% mat %*% t(M_code_idx)
## 3 x 3 Matrix of class "dgeMatrix"
## 111 112 113
## 111 230 120 30
## 112 50 130 40
## 113 230 40 20
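For a dense matrix the same reduction can also be done without any extra package. The sketch below uses base R's rowsum() twice, once over the rows and once over the columns; the names m, by_row and reduced are illustrative, and the data is laid out with rows as Codes, as in the question.

```r
codes <- c("111", "111", "111", "112", "112", "113")

# One vector per column, matching the columns of the question's data frame
m <- cbind(c(100, 10, 20, NA, NA, 10),
           c(10, 20, 40, NA, 10, 20),
           c(10, 20, NA, NA, 40, 200),
           c(20, 20, 40, 10, 10, 20),
           c(NA, NA, 40, 100, 10, 20),
           c(10, 20, NA, 30, 10, 20))
dimnames(m) <- list(codes, codes)

# Step 1: sum rows that share a code (6 x 6 -> 3 x 6)
by_row <- rowsum(m, group = rownames(m), na.rm = TRUE)

# Step 2: sum columns that share a code by transposing, summing rows,
# and transposing back (3 x 6 -> 3 x 3)
reduced <- t(rowsum(t(by_row), group = colnames(by_row), na.rm = TRUE))
reduced
```

With na.rm = TRUE the NA cells are skipped, so no separate step to zero them out is needed here.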

How to turn four vectors of differing lengths into a long format dataframe?

I am very new to R programming and have been provided the following data to implement a non-parametric test on. My issue lies in being able to turn this data (in R) into a long format data frame, so I may then conduct a histo/box plot. We aren't allowed to simply turn data into csv then read in, it has to be done in R.
A:1361,1466,1319,1426,1437,1541,1474,1386,1510,1373,1463,1305,1571,1224,1372
B:1581,1515,1606,1518,1395,1584,1671,1573,1454,1674,1459,1647
C:1482,1570,1575,1634,1542,1651,1189,1678,1391,1525
D:2084,1566,1990,1996,2052,1436,1808,1679,1981,2014,1759,1842,1603,1670,1845,2016,1621,2050,1690,1933
I've turned these into vectors but keep getting error messages when I try to turn them into a data frame (the vectors have different lengths). Any pointers would be a big help; I've been trying to troubleshoot for hours and my prof is no help.
Thanks
You can use stack to put into one long format.
I'll assume you are starting with a vector of strings,
vec <- c("A:1361,1466,1319,1426,1437,1541,1474,1386,1510,1373,1463,1305,1571,1224,1372", "B:1581,1515,1606,1518,1395,1584,1671,1573,1454,1674,1459,1647", "C:1482,1570,1575,1634,1542,1651,1189,1678,1391,1525", "D:2084,1566,1990,1996,2052,1436,1808,1679,1981,2014,1759,1842,1603,1670,1845,2016,1621,2050,1690,1933")
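We can split each string at the colon and the commas, which gives a list whose elements start with the label,
vecspl <- strsplit(vec, "[:,]")
str(setNames(sapply(vecspl, `[`, -1), sapply(vecspl, `[[`, 1)))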
# List of 4
# $ A: chr [1:15] "1361" "1466" "1319" "1426" ...
# $ B: chr [1:12] "1581" "1515" "1606" "1518" ...
# $ C: chr [1:10] "1482" "1570" "1575" "1634" ...
# $ D: chr [1:20] "2084" "1566" "1990" "1996" ...
From here, we can stack(.) it:
stack(setNames(sapply(vecspl, `[`, -1), sapply(vecspl, `[[`, 1)))
# values ind
# 1 1361 A
# 2 1466 A
# 3 1319 A
# 4 1426 A
# 5 1437 A
# 6 1541 A
# 7 1474 A
# 8 1386 A
# 9 1510 A
# 10 1373 A
# 11 1463 A
# 12 1305 A
# 13 1571 A
# 14 1224 A
# 15 1372 A
# 16 1581 B
# 17 1515 B
# 18 1606 B
# 19 1518 B
# 20 1395 B
# 21 1584 B
# 22 1671 B
# 23 1573 B
# 24 1454 B
# 25 1674 B
# 26 1459 B
# 27 1647 B
# 28 1482 C
# 29 1570 C
# 30 1575 C
# 31 1634 C
# 32 1542 C
# 33 1651 C
# 34 1189 C
# 35 1678 C
# 36 1391 C
# 37 1525 C
# 38 2084 D
# 39 1566 D
# 40 1990 D
# 41 1996 D
# 42 2052 D
# 43 1436 D
# 44 1808 D
# 45 1679 D
# 46 1981 D
# 47 2014 D
# 48 1759 D
# 49 1842 D
# 50 1603 D
# 51 1670 D
# 52 1845 D
# 53 2016 D
# 54 1621 D
# 55 2050 D
# 56 1690 D
# 57 1933 D
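If the four groups already exist as numeric vectors, as the question describes, stack() can be applied to a named list of them directly, which also keeps the values numeric. A minimal sketch, assuming the vectors are called A, B, C and D (the name long is illustrative):

```r
A <- c(1361, 1466, 1319, 1426, 1437, 1541, 1474, 1386, 1510, 1373, 1463, 1305, 1571, 1224, 1372)
B <- c(1581, 1515, 1606, 1518, 1395, 1584, 1671, 1573, 1454, 1674, 1459, 1647)
C <- c(1482, 1570, 1575, 1634, 1542, 1651, 1189, 1678, 1391, 1525)
D <- c(2084, 1566, 1990, 1996, 2052, 1436, 1808, 1679, 1981, 2014, 1759, 1842, 1603, 1670, 1845,
       2016, 1621, 2050, 1690, 1933)

# A list may hold vectors of different lengths, unlike a data frame;
# stack() turns it into two columns: values and ind (the group label)
long <- stack(list(A = A, B = B, C = C, D = D))
str(long)
```

From here, boxplot(values ~ ind, data = long) draws one box per group, and hist(long$values) gives a histogram of all observations.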
You could scan the information, strip the labels with gsub, strsplit at the commas, use the leading letters as names, and then stack. From there you can convert the values to numeric with type.convert, aggregate the sums and barplot the result. Note that the native pipe with the _ placeholder requires R 4.2 or later.
x <- scan(text='A:1361,1466,1319,1426,1437,1541,1474,1386,1510,1373,1463,1305,1571,1224,1372
B:1581,1515,1606,1518,1395,1584,1671,1573,1454,1674,1459,1647
C:1482,1570,1575,1634,1542,1651,1189,1678,1391,1525
D:2084,1566,1990,1996,2052,1436,1808,1679,1981,2014,1759,1842,1603,1670,1845,2016,1621,2050,1690,1933',
what='character', quiet=TRUE)
x |>
gsub(pattern='\\w:', replacement='') |>
strsplit(',') |>
setNames(substr(x, 1, 1)) |>
stack() |>
type.convert(as.is=TRUE) |>
aggregate(values ~ ind, data=_, sum) |>
barplot(values ~ ind, data=_, col=seq_len(length(x)) + 1)

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB > 100) %>%
filter(avg == max(avg, na.rm = TRUE))
The first filter keeps only the rows where AB is greater than 100, and the second keeps the row whose avg equals the maximum of the remaining values (na.rm = TRUE ignores the NaN entries). If you only want the maximum value itself, you can do something like this:
data %>%
filter(AB > 100) %>%
summarise(max = max(avg, na.rm = TRUE))
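The same thing can be written in base R with the question's own column names. This sketch uses the AB and average values of the six players shown in the question, plus two made-up rows (labelled hypothetical) to exercise the AB cutoff and the NaN handling.

```r
players <- data.frame(
  nameLast = c("Rollins", "Wilson", "Suzuki", "Samuel", "Cash", "Pierre",
               "Hypothetical1", "Hypothetical2"),   # last two rows are invented
  AB       = c(716, 705, 704, 701, 699, 699, 50, 0),
  average  = c(0.2960894, 0.3262411, 0.3721591, 0.2724679, 0.3047210, 0.2918455,
               0.4000000, NaN)
)

# Keep only players with more than 100 at-bats
eligible <- players[players$AB > 100, ]

# Maximum batting average, skipping NaN/NA entries
best_avg <- max(eligible$average, na.rm = TRUE)

# Whole row of the player with that average; which.max() ignores NaN/NA
best_row <- eligible[which.max(eligible$average), ]
```

Filtering first matters: the invented player with 50 at-bats has the highest average overall but is excluded by the cutoff.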

Specific Join of two Dataframes

I have two data frames: df1 and df2:
> df1
ID Gender age cd evnt scr test_dt
1 C0004 MALE 22 1 1 82 7/3/2014
2 C0004 MALE 22 1 2 76 7/3/2014
3 C0005 MALE 22 1 3 1514 7/3/2014
4 C0005 MALE 23 2 1 81 11/3/2014
5 C0006 MALE 23 2 2 75 11/3/2014
6 C0006 MALE 23 2 3 878 11/3/2014
and,
> df2
ID hgt wt phys_dt
1 C0004 70 147 6/29/2015
2 C0004 70 157 6/27/2016
3 C0005 67 175 6/27/2016
4 C0005 65 171 7/2/2014
5 C0006 69 160 6/29/2015
6 C0006 64 143 7/2/2014
I want to join df1 and df2 in a way that yields the following data frame, call it df3:
> df3
ID Gender age cd evnt scr hgt wt
1 C0004 MALE 22 1 1 82 70 147
2 C0004 MALE 22 1 2 76 70 157
3 C0005 MALE 22 1 3 1514 67 175
4 C0005 MALE 23 2 1 81 65 171
5 C0006 MALE 23 2 2 75 69 160
6 C0006 MALE 23 2 3 878 64 143
I'm trying to add df2$hgt and df2$wt to the proper ID row. The tricky part is that I want to join hgt and wt to the ID row whose dates (df1$test_dt and df2$phys_dt) most closely align. I was thinking I could first sort the two data frames by ID then by their respective dates then try and join? I'm not quite sure how to approach this. Thanks.
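If you want to merge by matching df1$ID and df2$ID alone, the following should do it (left_join() comes from dplyr):
library(dplyr)
df3 <- left_join(df1, df2, by = "ID")
Because each ID appears more than once in both data frames, this returns every combination of rows per ID. If the date should match as well as the ID, you could try:
df3 <- left_join(df1, df2, by = c("ID" = "ID", "test_dt" = "phys_dt"))
Note that this second join only matches rows whose dates are identical; it does not pick the closest date.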
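For the closest-date requirement in the question, one possible base R approach is to pair every df1 row with all df2 rows of the same ID and then keep the pairing with the smallest date gap. This is only a sketch: it assumes the dates are in month/day/year format, it uses a subset of the df1 columns, and row_id, cand and gap are helper names introduced here. With a strict nearest-date rule, some rows may differ from the df3 shown in the question.

```r
df1 <- data.frame(
  ID      = c("C0004", "C0004", "C0005", "C0005", "C0006", "C0006"),
  evnt    = c(1, 2, 3, 1, 2, 3),
  scr     = c(82, 76, 1514, 81, 75, 878),
  test_dt = c("7/3/2014", "7/3/2014", "7/3/2014", "11/3/2014", "11/3/2014", "11/3/2014")
)
df2 <- data.frame(
  ID      = c("C0004", "C0004", "C0005", "C0005", "C0006", "C0006"),
  hgt     = c(70, 70, 67, 65, 69, 64),
  wt      = c(147, 157, 175, 171, 160, 143),
  phys_dt = c("6/29/2015", "6/27/2016", "6/27/2016", "7/2/2014", "6/29/2015", "7/2/2014")
)

# Assumption: dates are month/day/year
df1$test_dt <- as.Date(df1$test_dt, format = "%m/%d/%Y")
df2$phys_dt <- as.Date(df2$phys_dt, format = "%m/%d/%Y")

# Remember the original df1 row so one match per row can be kept
df1$row_id <- seq_len(nrow(df1))

# All df1/df2 pairings within the same ID
cand <- merge(df1, df2, by = "ID")

# Absolute gap in days between test date and physical date
cand$gap <- abs(as.numeric(cand$test_dt - cand$phys_dt))

# Sort so the closest date comes first per df1 row, then keep that first pairing
cand <- cand[order(cand$row_id, cand$gap), ]
df3 <- cand[!duplicated(cand$row_id), ]
df3[c("ID", "evnt", "scr", "hgt", "wt")]
```

The full pairing grows with the number of rows per ID, so for large data a rolling join (for example in data.table) may be a better fit.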

Shift up rows in R

This is a simple example of what my data looks like.
Suppose I got the following data
>x
Year a b c
1962 1 2 3
1963 4 5 6
. . . .
. . . .
2001 7 8 9
I need to form a time series of x with 7 columns containing the following variables:
Year a lag(a) b lag(b) c lag(c)
What I did is the following:
> x<-ts(x) # converting x to a time series
> x<-cbind(x,x[,-1]) # adding the same variables to the time series without repeating the year column
> x
Year a b c a b c
1962 1 2 3 1 2 3
1963 4 5 6 4 5 6
. . . . . . .
. . . . . . .
2001 7 8 9 7 8 9
I need to shift the last three columns up so they give the lags of a, b and c. Then I will rearrange them.
Here's an approach using dplyr
df <- data.frame(
a=1:10,
b=21:30,
c=31:40)
library(dplyr)
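# mutate_each() and funs() are deprecated in current dplyr; across() is the replacement
df %>% mutate(across(everything(), ~ lead(.x, 1))) %>% cbind(df, .)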
# a b c a b c
#1 1 21 31 2 22 32
#2 2 22 32 3 23 33
#3 3 23 33 4 24 34
#4 4 24 34 5 25 35
#5 5 25 35 6 26 36
#6 6 26 36 7 27 37
#7 7 27 37 8 28 38
#8 8 28 38 9 29 39
#9 9 29 39 10 30 40
#10 10 30 40 NA NA NA
You can change the names afterwards using colnames(df) <- c("a", "b", ...)
As @nrussel noted in his answer, what you described is a leading variable. If you want a lagging variable, you can change lead to lag in my answer.
X <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts <- ts(X)
Xts[1:(nrow(Xts)-1),c(4,5,6)] <- Xts[2:nrow(Xts),c(4,5,6)]
Xts[nrow(Xts),c(4,5,6)] <- c(NA,NA,NA)
> head(Xts)
a b c laga lagb lagc
[1,] 1 2 3 2 4 6
[2,] 2 4 6 3 6 9
[3,] 3 6 9 4 8 12
[4,] 4 8 12 5 10 15
[5,] 5 10 15 6 12 18
[6,] 6 12 18 7 14 21
##
> tail(Xts)
a b c laga lagb lagc
[95,] 95 190 285 96 192 288
[96,] 96 192 288 97 194 291
[97,] 97 194 291 98 196 294
[98,] 98 196 294 99 198 297
[99,] 99 198 297 100 200 300
[100,] 100 200 300 NA NA NA
I'm not sure whether by "shift up" you literally mean moving the rows up one place as above (that produces leading values, not lagging ones), but here's the other direction ("true" lagged values):
X2 <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts2 <- ts(X2)
Xts2[2:nrow(Xts2),c(4,5,6)] <- Xts2[1:(nrow(Xts2)-1),c(4,5,6)]
Xts2[1,c(4,5,6)] <- c(NA,NA,NA)
##
> head(Xts2)
a b c laga lagb lagc
[1,] 1 2 3 NA NA NA
[2,] 2 4 6 1 2 3
[3,] 3 6 9 2 4 6
[4,] 4 8 12 3 6 9
[5,] 5 10 15 4 8 12
[6,] 6 12 18 5 10 15
##
> tail(Xts2)
a b c laga lagb lagc
[95,] 95 190 285 94 188 282
[96,] 96 192 288 95 190 285
[97,] 97 194 291 96 192 288
[98,] 98 196 294 97 194 291
[99,] 99 198 297 98 196 294
[100,] 100 200 300 99 198 297
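To get the seven-column layout asked for in the question (Year a lag(a) b lag(b) c lag(c)) in one go, the lagged columns can be added under their own names and then interleaved with the originals. A minimal base R sketch on made-up data; lag1, vars and out are names introduced here for illustration.

```r
# Hypothetical stand-in for the question's data
x <- data.frame(Year = 1962:1966, a = 1:5, b = 6:10, c = 11:15)

# Lag by one row: drop the last value and pad the front with NA
lag1 <- function(v) c(NA, head(v, -1))

vars <- c("a", "b", "c")

# Add lag_a, lag_b, lag_c as new columns
x[paste0("lag_", vars)] <- lapply(x[vars], lag1)

# Interleave originals and lags: a, lag_a, b, lag_b, c, lag_c
out <- x[c("Year", as.vector(rbind(vars, paste0("lag_", vars))))]
out
```

For leading values instead, a helper such as function(v) c(tail(v, -1), NA) shifts the column up by one row; ts(out) converts the result to a time series if needed.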
