Reproduce character pattern as numeric pattern - r

I would like to expand the following data frame
d <- data.frame(a = c(rep("A",5),rep("B",5),rep("C",3),rep("D",2)))
> d
a
1 A
2 A
3 A
4 A
5 A
6 B
7 B
8 B
9 B
10 B
11 C
12 C
13 C
14 D
15 D
so that there is a column b looking like:
> d
a b
1 A 1
2 A 1
3 A 1
4 A 1
5 A 1
6 B 2
7 B 2
8 B 2
9 B 2
10 B 2
11 C 3
12 C 3
13 C 3
14 D 4
15 D 4
Not really sure how to realise that.

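Use match, which returns the position of each value among the unique values in order of first appearance:
d$b <- match(d$a, unique(d$a))

Or convert to a factor whose levels follow the order of first appearance and take its integer codes:
d$b <- as.integer(factor(d$a, levels = unique(d$a)))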
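
To illustrate why both answers pass unique(d$a) explicitly, here is a minimal sketch on a toy vector that is deliberately not in alphabetical order. The vector `a` and the result names are made up for this illustration and are not part of the question.

```r
# Hypothetical toy vector, deliberately not sorted alphabetically
a <- c("B", "B", "A", "C", "A")

# Position of each value among the unique values (order of first appearance)
by_match <- match(a, unique(a))

# Same idea via a factor whose levels follow first appearance
by_factor <- as.integer(factor(a, levels = unique(a)))

# Default factor levels are sorted, so the codes follow alphabetical order instead
by_alpha <- as.integer(factor(a))

by_match   # 1 1 2 3 2
by_factor  # 1 1 2 3 2
by_alpha   # 2 2 1 3 1
```

For the data in the question the two orderings coincide, because the values already appear in alphabetical order.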

Related

I want to delete the IDs that have no information in the remaining columns

Here is a representation of my dataset:
Number<-c(1:10)
AA<-c(head(LETTERS,4), rep(NA,6))
BB<-c(head(letters,6), rep(NA,4))
CC<-c(1:6, rep(NA,4))
DD<-c(10:14, rep(NA,5))
EE<-c(3:8, rep(NA,4))
FF<-c(6:1, rep(NA,4))
mydata<-data.frame(Number,AA,BB,CC,DD,EE,FF)
I want to delete all the IDs (Number) that have no information in the remaining columns, automatically. I want to tell the function that if there is a value in Number but there is only NA in all the remaining columns, delete the row.
I must have the dataframe below:
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Another possible base R solution:
mydata[rowSums(is.na(mydata[,-1])) != ncol(mydata[,-1]), ]
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
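
The one-liner above can be broken into steps. This is a minimal sketch on a smaller, made-up data frame (`toy`, `na_count`, `all_empty` and `kept` are names chosen for this illustration), assuming the ID is in the first column as in the question.

```r
# Hypothetical miniature of the question's data: row 2 has an ID but nothing else
toy <- data.frame(Number = 1:4,
                  AA = c("A", NA, NA, "D"),
                  CC = c(1, NA, 3, NA))

# Count the NA values per row, ignoring the ID column
na_count <- rowSums(is.na(toy[, -1]))

# A row is empty when every non-ID column is NA
all_empty <- na_count == ncol(toy) - 1

# Keep the rows that carry at least one value
kept <- toy[!all_empty, ]
kept$Number  # 1 3 4
```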
Or we could use apply:
mydata[!apply(mydata[,-1], 1, function(x) all(is.na(x))),]
A possible solution, using janitor::remove_empty:
library(dplyr)
library(janitor)
inner_join(mydata, remove_empty(mydata[-1], which = "rows"))
#> Joining, by = c("AA", "BB", "CC", "DD", "EE", "FF")
#> Number AA BB CC DD EE FF
#> 1 1 A a 1 10 3 6
#> 2 2 B b 2 11 4 5
#> 3 3 C c 3 12 5 4
#> 4 4 D d 4 13 6 3
#> 5 5 <NA> e 5 14 7 2
#> 6 6 <NA> f 6 NA 8 1
We can use if_any/if_all
library(dplyr)
mydata %>%
filter(if_any(-Number, complete.cases))
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
or
mydata %>%
filter(!if_all(-Number, is.na))
Or with base R
subset(mydata, rowSums(!is.na(mydata[-1])) >0 )
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
If you need to drop columns (rather than rows) that contain only NA, try this, where df is your data frame:
df <- df[, colSums(is.na(df)) < nrow(df)]
This makes a copy of your data though. If you have a large dataset then you can use:
Filter(function(x) !all(is.na(x)), df)
If you want to use a data.table, which is usually a pretty solid go-to, you can use:
library(data.table)
DT <- as.data.table(df)
DT[, which(unlist(lapply(DT, function(x) !all(is.na(x))))), with = FALSE]
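
Note that colSums and Filter work column by column, so these snippets remove columns that are entirely NA and leave the rows alone. A minimal sketch on a made-up data frame (all names here are hypothetical):

```r
# Hypothetical data frame with one column that is NA throughout
toy <- data.frame(id = 1:3,
                  empty = c(NA, NA, NA),
                  partly = c(1, NA, 3))

# Keep columns whose NA count is below the number of rows
by_colsums <- toy[, colSums(is.na(toy)) < nrow(toy)]

# Same selection with Filter, which visits each column of the data frame
by_filter <- Filter(function(col) !all(is.na(col)), toy)

names(by_colsums)  # "id" "partly"
names(by_filter)   # "id" "partly"
```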

How do I transform this matrix format to a matrix for repeated measures analysis in R? [duplicate]

Sorry "this matrix format" is very vague in my question (suggestions to improve my question?). I have a data frame like this:
x <- data.frame(ID = c('A','B','C','D'), SCORE_YR1 = c(2,2,1,0),
SCORE_YR2 = c(2,3,3,1), SCORE_YR3 = c(0,2,2,5))
x
ID SCORE_YR1 SCORE_YR2 SCORE_YR3
1 A 2 2 0
2 B 2 3 2
3 C 1 3 2
4 D 0 1 5
I would like to transform the matrix format to look like this
y <- data.frame(ID = rep(c('A','B','C','D'),3), YEAR = rep(1:3,each=4),
SCORE = c(x$SCORE_YR1,x$SCORE_YR2,x$SCORE_YR3))
y
ID YEAR SCORE
1 A 1 2
2 B 1 2
3 C 1 1
4 D 1 0
5 A 2 2
6 B 2 3
7 C 2 3
8 D 2 1
9 A 3 0
10 B 3 2
11 C 3 2
12 D 3 5
Is there a function that can easily transform the dataframe like this?
Thanks
You can use melt from the reshape2 package:
library(reshape2)
x <- melt(x, id.vars = "ID")
Change column names to what you have above:
names(x)[2:3] <- c("YEAR","SCORE")
At this point the data frame looks like this:
> x
ID YEAR SCORE
1 A SCORE_YR1 2
2 B SCORE_YR1 2
3 C SCORE_YR1 1
4 D SCORE_YR1 0
5 A SCORE_YR2 2
6 B SCORE_YR2 3
7 C SCORE_YR2 3
8 D SCORE_YR2 1
9 A SCORE_YR3 0
10 B SCORE_YR3 2
11 C SCORE_YR3 2
12 D SCORE_YR3 5
melt stores YEAR as a factor whose levels follow the original column order, so calling as.numeric on it returns the level codes 1, 2, 3:
x$YEAR <- as.numeric(x$YEAR)
> x
ID YEAR SCORE
1 A 1 2
2 B 1 2
3 C 1 1
4 D 1 0
5 A 2 2
6 B 2 3
7 C 2 3
8 D 2 1
9 A 3 0
10 B 3 2
11 C 3 2
12 D 3 5
The problem is that you have data in a "wide" format and you want to convert it to "long". melt is usually great for these situations.
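
One caveat with as.numeric on the YEAR factor: it returns level codes, which only match the year when the columns are consecutive years starting at 1. The sketch below uses made-up column names with gaps (years 1, 3 and 10) to show the difference, and parses the number out of the name with sub() as one possible alternative.

```r
# Hypothetical factor of column names with non-consecutive years
year_names <- factor(c("SCORE_YR1", "SCORE_YR3", "SCORE_YR3", "SCORE_YR10"),
                     levels = c("SCORE_YR1", "SCORE_YR3", "SCORE_YR10"))

# Level codes: positions of the levels, not the years themselves
codes <- as.numeric(year_names)

# Strip the prefix from the label and convert what is left
parsed <- as.integer(sub("^SCORE_YR", "", as.character(year_names)))

codes   # 1 2 2 3
parsed  # 1 3 3 10
```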
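With dplyr and tidyr, you can do the following (note that extract_numeric is deprecated in current tidyr in favour of readr::parse_number):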
library(dplyr); library(tidyr)
x %>%
gather(YEAR, SCORE, -ID) %>%
mutate(YEAR = extract_numeric(YEAR))
# ID YEAR SCORE
#1 A 1 2
#2 B 1 2
#3 C 1 1
#4 D 1 0
#5 A 2 2
#6 B 2 3
#7 C 2 3
#8 D 2 1
#9 A 3 0
#10 B 3 2
#11 C 3 2
#12 D 3 5
Or use reshape function from base R:
reshape(x, varying = 2:4, sep = "_YR", dir = "long", timevar = "YEAR")[1:3]
# ID YEAR SCORE
#1.1 A 1 2
#2.1 B 1 2
#3.1 C 1 1
#4.1 D 1 0
#1.2 A 2 2
#2.2 B 2 3
#3.2 C 2 3
#4.2 D 2 1
#1.3 A 3 0
#2.3 B 3 2
#3.3 C 3 2
#4.3 D 3 5
A base solution that would give you something that could easily be reworked to what you need would involve using stack. The data.frame function will do the rep()-ing for you via R's recycling rules:
y <- data.frame(x$ID, stack(x[-1]))
y
#-------------
x.ID values ind
1 A 2 SCORE_YR1
2 B 2 SCORE_YR1
3 C 1 SCORE_YR1
4 D 0 SCORE_YR1
5 A 2 SCORE_YR2
6 B 3 SCORE_YR2
7 C 3 SCORE_YR2
8 D 1 SCORE_YR2
9 A 0 SCORE_YR3
10 B 2 SCORE_YR3
11 C 2 SCORE_YR3
12 D 5 SCORE_YR3
This would convert the factor ind column to a numeric vector:
> y$ind <- seq_along(unique(y$ind))[y$ind]
> y
x.ID values ind
1 A 2 1
2 B 2 1
3 C 1 1
4 D 0 1
5 A 2 2
6 B 3 2
7 C 3 2
8 D 1 2
9 A 0 3
10 B 2 3
11 C 2 3
12 D 5 3
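
To finish the rework into the layout asked for in the question, the stacked result can be renamed and reordered. A minimal base R sketch, assuming the column names keep the SCORE_YR prefix from the question (`long` is a name chosen for this illustration):

```r
x <- data.frame(ID = c("A", "B", "C", "D"),
                SCORE_YR1 = c(2, 2, 1, 0),
                SCORE_YR2 = c(2, 3, 3, 1),
                SCORE_YR3 = c(0, 2, 2, 5))

# stack() gives columns 'values' and 'ind'; ID is recycled to 12 rows
long <- data.frame(ID = x$ID, stack(x[-1]))

# Take the year from the original column name
long$YEAR <- as.integer(sub("^SCORE_YR", "", as.character(long$ind)))

# Reorder and rename to ID, YEAR, SCORE
y <- long[c("ID", "YEAR", "values")]
names(y)[3] <- "SCORE"
```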

Multiplication of different subsets with different data in R

I have a large dataset, which I split up into subsets. For each subset, I have to do the same calculations but with different numbers. Example:
Main Table
x a b c d
A 1 2 4 5
A 4 5 1 7
A 3 5 6 2
B 4 5 2 9
B 3 5 2 8
C 4 2 5 2
C 1 9 6 9
C 1 2 3 4
C 6 3 6 2
Additional Table for A
a b c d
A 5 1 6 1
Additional Table for B
a b c d
B 1 5 2 6
Additional Table for C
a b c d
C 8 2 4 1
I need to multiply all rows A in the Main Table with the values from Additional Table for A, all rows B in the Main Table with the values from B and all rows C in the Main Table with the values from C. It is completely fine to merge the additional tables into a combined one, if this makes the solution easier.
I thought about a for-loop but I am not able to put the different multipliers (from the Additional Tables) into the code. Since there is a large number of subgroups, coding each multiplication manually should be avoided. How do I do these multiplications?
If we start with the addition table as addDf and main table as df:
addDf
x a b c d
A A 5 1 6 1
B B 1 5 2 6
C C 8 2 4 1
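We can use merge to repeat the matching row of addDf for every row of df, and then multiply the two data frames element by element. merge returns rows sorted by x, so this assumes df is already sorted by x:
df[-1] <- merge(addDf, df[1], by = "x")[-1] * df[order(df$x), -1]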
df
x a b c d
1 A 5 2 24 5
2 A 20 5 6 7
3 A 15 5 36 2
4 B 4 25 4 54
5 B 3 25 4 48
6 C 32 4 20 2
7 C 8 18 24 9
8 C 8 4 12 4
9 C 48 6 24 2
Note: Borrowed a little syntactic sugar from @akrun, namely the df[-1] assignment.
We can use Map after splitting the main data 'df' (assuming that all of the datasets are data.frames).
df[-1] <- unsplit(Map(function(x,y) x*y[col(x)],
split(df[-1], df$x),
list(unlist(dfA), unlist(dfB), unlist(dfC))), df$x)
df
# x a b c d
#1 A 5 2 24 5
#2 A 20 5 6 7
#3 A 15 5 36 2
#4 B 4 25 4 54
#5 B 3 25 4 48
#6 C 32 4 20 2
#7 C 8 18 24 9
#8 C 8 4 12 4
#9 C 48 6 24 2
Or we can use a faster option with data.table
library(data.table)
setnames(setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
.(a= a*i.a, b= b*i.b, c = c*i.c, d= d*i.d), on = c('rn' = 'x'), by = .EACHI], 1, 'x')[]
# x a b c d
#1: A 5 2 24 5
#2: A 20 5 6 7
#3: A 15 5 36 2
#4: B 4 25 4 54
#5: B 3 25 4 48
#6: C 32 4 20 2
#7: C 8 18 24 9
#8: C 8 4 12 4
#9: C 48 6 24 2
The above would be difficult if there are many columns. In that case, we could use mget to retrieve the columns and do the * on the i. columns with Map
setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
Map(`*`, mget(names(df)[-1]), mget(paste0("i.", names(df)[-1]))) ,
on = c('rn' = 'x'), by = .EACHI]
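
As a further option that is not taken from the answers above, the multiplier row for each main-table row can be looked up with match, which does not depend on the row order of df. This is a sketch assuming the additional tables have been combined into one data frame addDf with an x column, as in the first answer; `idx` and `result` are names chosen for this illustration.

```r
# Main table from the question
df <- data.frame(x = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                 a = c(1, 4, 3, 4, 3, 4, 1, 1, 6),
                 b = c(2, 5, 5, 5, 5, 2, 9, 2, 3),
                 c = c(4, 1, 6, 2, 2, 5, 6, 3, 6),
                 d = c(5, 7, 2, 9, 8, 2, 9, 4, 2))

# Combined additional tables, one row of multipliers per group
addDf <- data.frame(x = c("A", "B", "C"),
                    a = c(5, 1, 8),
                    b = c(1, 5, 2),
                    c = c(6, 2, 4),
                    d = c(1, 6, 1))

# For every row of df, the row of addDf that holds its multipliers
idx <- match(df$x, addDf$x)

# Element-wise product of two equally sized data frames
result <- df
result[-1] <- df[-1] * addDf[idx, -1]
```

This relies on both tables listing the value columns in the same order.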

Join two dataframe

I have to collect values from one data frame and place them in another. I have tried to use the merge function, but that messes up the order of the second data frame.
This is what my data looks like.
> df<-as.data.frame(cbind(letters[1:4],1:4))
> df
V1 V2
1 a 1
2 b 2
3 c 3
4 d 4
> dflist <- data.frame("home"= sample(df[,1],15, replace = TRUE))
>
> dflist$away <-sample(df[,1],15, replace = TRUE)
> dflist
home away
1 a b
2 a a
3 d c
4 d a
5 c c
6 a c
7 b d
8 b b
9 a b
10 b d
11 b a
12 a a
13 a c
14 c b
15 d a
Desired result should look like this.
home away value1 value2
1 a b 1 2
2 a a 1 1
3 d c 4 3
4 d a 4 1
5 c c 3 3
.
The resulting table will lose its order if I use merge here.
You could try this:
dflist[c("value1", "value2")] <- t(apply(dflist, 1, function(x)
c(df[match(x[1], df$V1),2], df[match(x[2], df$V1),2])))
dflist
home away value1 value2
1 a b 1 2
2 a a 1 1
3 d c 4 3
4 d a 4 1
5 c c 3 3
6 a c 1 3
7 b d 2 4
8 b b 2 2
9 a b 1 2
10 b d 2 4
11 b a 2 1
12 a a 1 1
13 a c 1 3
14 c b 3 2
15 d a 4 1
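
Since match is vectorised, the same lookup can also be written without apply, one column at a time. A minimal sketch using the first five rows shown in the question (here df is built with data.frame directly, which is an assumption of this example and keeps V2 as integers):

```r
# Lookup table: one value per letter
df <- data.frame(V1 = letters[1:4], V2 = 1:4)

# First five fixtures from the question
dflist <- data.frame(home = c("a", "a", "d", "d", "c"),
                     away = c("b", "a", "c", "a", "c"))

# match() gives the row of df for each entry, so the row order of dflist is kept
dflist$value1 <- df$V2[match(dflist$home, df$V1)]
dflist$value2 <- df$V2[match(dflist$away, df$V1)]
```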

Counting how many times an element occurs in the column of a data.frame

Let's say I have a data.frame with a factor.
d = data.frame(f = c("a","a","a","b","b","b","b","d","d"))
f
1 a
2 a
3 a
4 b
5 b
6 b
7 b
8 d
9 d
And I want to add a column telling me how many times an element occurs.
Like this
f n
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
How would I do this?
You can also use some plyr functions, join and ddply:
library(plyr)
d <- data.frame(f = c("a","a","a","b","b","b","b","d","d"))
d2 <- join(d, ddply(d, .(f), 'nrow'))
d2
f nrow
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
You can use table like this:
d$n <- table(d$f)[d$f]
# f n
#1 a 3
#2 a 3
#3 a 3
#4 b 4
#5 b 4
#6 b 4
#7 b 4
#8 d 2
#9 d 2
You can use ave and length:
> d$n <- as.numeric(ave(as.character(d$f), d$f, FUN = length))
> d
f n
1 a 3
2 a 3
3 a 3
4 b 4
5 b 4
6 b 4
7 b 4
8 d 2
9 d 2
With the "data.table" package, you might do something like:
library(data.table)
D <- data.table(d)
D[, n := as.numeric(.N), by = f]
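
For comparison, here is a base R sketch of the table and ave approaches that produces a plain integer column. Indexing the table by as.character(d$f) and calling ave on seq_along(d$f) are choices made for this illustration, not something the answers above prescribe.

```r
d <- data.frame(f = c("a", "a", "a", "b", "b", "b", "b", "d", "d"))

# Count each value once, then look the count up by name for every row
counts <- table(d$f)
n_table <- as.vector(counts[as.character(d$f)])

# ave() applies length() within each group and returns one value per row
n_ave <- ave(seq_along(d$f), d$f, FUN = length)

d$n <- n_ave
d$n  # 3 3 3 4 4 4 4 2 2
```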
