R counting strings variables in each row of a dataframe - r

I have a dataframe that looks something like this, where each row represents a sample and contains repeats of the same strings:
> df
V1 V2 V3 V4 V5
1 a a d d b
2 c a b d a
3 d b a a b
4 d d a b c
5 c a d c c
I want to create a new dataframe whose headers are the distinct strings in the previous dataframe (a, b, c, d) and whose rows contain the number of occurrences of each string in the corresponding row of the original dataframe. Using the example above, this would look like
> df2
a b c d
1 2 1 0 2
2 2 1 1 1
3 2 2 0 1
4 1 1 1 2
5 1 0 3 1
In my actual dataset, there are hundreds of variables, and thousands of samples, so it'd be ideal if I could automatically pull out the names from the original dataframe, and alphabetize them into the headers for the new dataframe.

You may try
library(qdapTools)
mtabulate(as.data.frame(t(df)))
Or
mtabulate(split(as.matrix(df), row(df)))
Or using base R
Un1 <- sort(unique(unlist(df)))
t(apply(df ,1, function(x) table(factor(x, levels=Un1))))
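For reference, here is a self-contained sketch of the base R approach on the question's example data. The object names lv, counts, m and tab are illustrative and not from the original answer; the second variant, table(row(m), m), is an alternative that avoids the per-row apply call and may scale better to thousands of rows.

```r
# Rebuild the example data frame from the question (character columns).
df <- data.frame(
  V1 = c("a", "c", "d", "d", "c"),
  V2 = c("a", "a", "b", "d", "a"),
  V3 = c("d", "b", "a", "a", "d"),
  V4 = c("d", "d", "a", "b", "c"),
  V5 = c("b", "a", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# All distinct strings, sorted alphabetically; these become the headers.
lv <- sort(unique(unlist(df)))

# Count every level within each row. apply() puts the levels in rows,
# so the result is transposed to get one row per sample.
counts <- t(apply(df, 1, function(x) table(factor(x, levels = lv))))
df2 <- as.data.frame(counts)

# Alternative without apply(): cross-tabulate row index against cell value.
m <- as.matrix(df)
tab <- table(row(m), m)

df2
```

Because the factor levels are fixed up front, strings that never occur in a given row still get a zero count instead of being dropped.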

You can stack the columns and then use table (here the data frame is named mydf):
table(cbind(id = 1:nrow(mydf),
stack(lapply(mydf, as.character)))[c("id", "values")])
# values
# id a b c d
# 1 2 1 0 2
# 2 2 1 1 1
# 3 2 2 0 1
# 4 1 1 1 2
# 5 1 0 3 1

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the column from v1 through v3. In the first row of the result, I match id = 1 and column v1, since the value of v equals 1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The example dataframes are simplified versions: I have more records, and the columns go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the names of the data objects are d1 and d2. I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
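The one-liner above relies on each id being equal to its row number in d1. A minimal sketch that relaxes this assumption by looking up the row position with match() follows; the helper names idx and lookup are illustrative, and d1 and d2 are rebuilt here from the question's data.

```r
# d1 is the wide "inputs" table, d2 the long table of (id, v) pairs.
d1 <- data.frame(id = 1:5,
                 v1 = c("A", "B", "T", "A", "F"),
                 v2 = c("A", "D", "T", "F", "F"),
                 v3 = c("C", "F", "A", "C", "F"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(id = c(1, 1, 1, 2, 3), v = c(1, 2, 3, 2, 1))

# Two-column index matrix: row position of each id in d1, column number v.
idx <- cbind(match(d2$id, d1$id), d2$v)

# Drop the id column so that column k of the matrix is vk.
lookup <- as.matrix(d1[-1])

# Indexing a matrix with a two-column matrix picks one cell per index row.
d2$key <- lookup[idx]
d2
```

With match(), the rows of d1 can be in any order and the ids do not have to be consecutive integers.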
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
key <- append(df[df2$id[i],(df2$v[i] + 1L)] , key)
}
df2$key <- rev(key)
df2
#> id v key
#> 1 1 1 A
#> 2 1 2 A
#> 3 1 3 C
#> 4 2 2 D
#> 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

How can I replace values in an R matrix matching on rowname, columnname and value

I have a really big matrix and I'd like to replace the values in it using a lookup table.
I have a table of values (which looks a bit like this):
Origin Destination Distance Final
1 1 1 A
1 1 2 B
1 1 3 E
1 2 2 F
1 3 1 B
1 3 2 C
2 1 1 B
2 2 1 A
2 3 3 C
3 1 1 A
3 1 2 D
3 2 1 B
3 3 2 A
...
and I have a matrix, which looks something like this:
x 1 1 3 1 2 1 ...
1 1 3 2 1 2 1
1 2 2 1 2 2 1
3 2 1 2 1 1 2
1 3 1 2 1 2 1
2 1 1 3 1 1 1
1 2 2 1 3 1 1
...
I'm trying to match my matrix rownames with the Origin column, the matrix Colnames with the Destination Column and the matrix values with the Distance Column and then replace that value with the Final Column.
The Matrix is 4000 by 4000.
The Table is 27 by 4
So when I'm done it should look like:
x 1 1 3 1 2 1 ...
1 A E C A F A
1 B B B B F A
3 D A A A B A
1 E A C A F A
2 B B C B A B
1 B B B E A A
...
I'm currently using a little loop, which looks like this;
for (i in 1:nrow(CategoryTable)){
Origin <- CategoryTable[i,"O"]
Dest <- CategoryTable[i,"D"]
Distance <- CategoryTable[i,"Dist"]
Final <- CategoryTable[i,"Final"]
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
}
Based on this question (Replace all values in a matrix <0.1 with 0) I can replace all the things matching a specific value or the things matching a column or row. But I can't match all at once.
The active ingredient of the current attempt is:
CategoryGrid[CategoryGrid == Distance][CategoryGrid[row.names(CategoryGrid) %in% Origin,colnames(CategoryGrid) %in% Dest]] <-CategoryTable[i,"Final"]
So I was trying to match the rows and columns and then pass that as a boolean vector to the value match, and then do the RHS assignation.
However, what I actually get is:
Error in CategoryGrid[row.names(CategoryGrid) %in% Origin, colnames(CategoryGrid) %in% :
incorrect number of dimensions
How would you go about achieving this?
Replacing the active ingredient with the following did the trick.
CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest] <- apply(CategoryGrid[row.names(CategoryGrid) == Origin , colnames(CategoryGrid) == Dest], MARGIN=c(1,2),function(x) ifelse(x == Distance, Final, x))
Specify the conditions for the rows and columns, then use apply with a small function that handles the third variable (the distance).
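As an alternative to looping over the lookup table, every cell can be looked up in a single vectorised step by building a text key per cell. The following is only a sketch on hypothetical miniature data: the "|" separator and the names cell_key, table_key and Result are assumptions, and it presumes the distances are whole numbers so that they print identically in the grid and in the table.

```r
# Hypothetical miniature lookup table (column names as in the question's table).
CategoryTable <- data.frame(
  Origin      = c(1, 1, 1, 2, 2),
  Destination = c(1, 1, 2, 1, 2),
  Distance    = c(1, 2, 2, 1, 1),
  Final       = c("A", "B", "F", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical 3 x 3 grid; row and column names may repeat, as in the question.
CategoryGrid <- matrix(c(1, 2, 1,
                         2, 2, 1,
                         1, 1, 1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("1", "1", "2"), c("1", "2", "1")))

# One key per cell, in column-major order: "origin|destination|distance".
cell_key <- paste(rownames(CategoryGrid)[as.vector(row(CategoryGrid))],
                  colnames(CategoryGrid)[as.vector(col(CategoryGrid))],
                  as.vector(CategoryGrid), sep = "|")
table_key <- with(CategoryTable,
                  paste(Origin, Destination, Distance, sep = "|"))

# Look up all cells at once; cells without a matching table row become NA.
Result <- matrix(CategoryTable$Final[match(cell_key, table_key)],
                 nrow = nrow(CategoryGrid),
                 dimnames = dimnames(CategoryGrid))
Result
```

For a 4000 by 4000 grid this builds 16 million keys, so it trades memory for the removal of the loop.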

Sort Data in the Table

For example, now I get the table
A B C
A 0 4 1
B 2 1 3
C 5 9 6
I would like to order the columns and rows by my own defined order, to achieve
B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = T)
rownames(df) <- LETTERS[1:3]
A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows using a character vector of their names:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
B A C
B 1 2 3
A 4 0 1
C 9 5 6
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
# note that row names are added as own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order: match() gives the row position of each id in the defined order
dt[match(defined_order, id)]
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6
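As a side check in base R (the second order below is hypothetical): the row positions for a custom order come from match(), which looks up where each wanted id sits. order() applied to the order vector itself returns the same positions only when the permutation is its own inverse, as with B, A, C.

```r
ids <- c("A", "B", "C")        # row ids as stored in the table

swap <- c("B", "A", "C")       # the order used in this question
pos_match_swap <- match(swap, ids)   # 2 1 3
pos_order_swap <- order(swap)        # 2 1 3, the same only by coincidence

cycle <- c("C", "A", "B")      # hypothetical second order
pos_match_cycle <- match(cycle, ids) # 3 1 2
pos_order_cycle <- order(cycle)      # 2 3 1

ids[pos_match_cycle]  # "C" "A" "B", the requested order
ids[pos_order_cycle]  # "B" "C" "A", not the requested order
```

Indexing by match(defined_order, id) therefore works for any custom order, not just for a swap of two items.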

vectorise rows of a dataframe, apply vector function, return to original dataframe r

Given the following df:
a=c('a','b','c')
b=c(1,2,5)
c=c(2,3,4)
d=c(2,1,6)
df=data.frame(a,b,c,d)
a b c d
1 a 1 2 2
2 b 2 3 1
3 c 5 4 6
I'd like to apply a function that normally takes a vector (and returns a vector) like cummax row by row to the columns in position b to d.
Then, I'd like to have the output back in the df, either as a vector in a new column of the df, or replacing the original data.
I'd like to avoid writing it as a for loop that would iterate every row, pull out the content of the cells into a vector, do its thing and put it back.
Is there a more efficient way? I've given the apply family functions a go, but I'm struggling to first get a good way to vectorise content of columns by row and get the right output.
the final output could look something like that (imagining I've applied a cummax() function).
a b c d
1 a 1 2 2
2 b 2 3 3
3 c 5 5 6
or
a b c d output
1 a 1 2 2 (1,2,2)
2 b 2 3 1 (2,3,3)
3 c 5 4 6 (5,5,6)
where output is a vector.
Seems this would just be a simple apply problem that you want to cbind to df:
> cbind(df, apply(df[ , 4:2] # work with columns in reverse order
, 1, # do it row-by-row
cummax) )
a b c d 1 2 3
d a 1 2 2 2 1 6
c b 2 3 1 2 3 6
b c 5 4 6 2 3 6
Ouch. Bitten by failing to notice that this would be returned in a column oriented matrix and need to transpose that result; Such a newbie mistake. But it does show the value of having a question with a reproducible dataset I suppose.
> cbind(df, t(apply(df[ , 4:2] , 1, cummax) ) )
a b c d d c b
1 a 1 2 2 2 2 2
2 b 2 3 1 1 3 3
3 c 5 4 6 6 6 6
To destructively assign the result to df you would just use:
df <- # .... that code.
This does the concatenation with commas (and as a result no longer needs to be transposed):
> cbind(df, output=apply(df[ , 4:2] , 1, function(x) paste( cummax(x), collapse=",") ) )
a b c d output
1 a 1 2 2 2,2,2
2 b 2 3 1 1,3,3
3 c 5 4 6 6,6,6
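The outputs above run cummax over the columns in the order d, c, b. To reproduce the output sketched in the question, the same idea can be applied to the columns in their original order b, c, d. This is a minimal variant of the answer's code; the names cols and res are illustrative.

```r
df <- data.frame(a = c("a", "b", "c"),
                 b = c(1, 2, 5),
                 c = c(2, 3, 4),
                 d = c(2, 1, 6))

cols <- c("b", "c", "d")   # columns to process, left to right

# Variant 1: overwrite the numeric columns with their row-wise cummax.
# apply() returns one column per input row, hence the transpose.
res <- df
res[cols] <- t(apply(df[cols], 1, cummax))

# Variant 2: keep the result as a comma-separated string per row.
res$output <- apply(df[cols], 1,
                    function(x) paste(cummax(x), collapse = ","))
res
```

Selecting only the numeric columns before calling apply matters: including the character column a would coerce the whole matrix to character.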

order/sort data frame with respect to a character reference list

Consider these two df examples
df1=data.frame(names=c('a','b','c'),value=1:3)
df2=data.frame(names=c('c','a','b'),value=1:3)
so that
> df1
names value
1 a 1
2 b 2
3 c 3
> df2
names value
1 c 1
2 a 2
3 b 3
Now, I would like to sort the df1 to the same order as the names column in df2, to obtain
names value
c 3
a 1
b 2
How can I achieve this?
Try:
df1[match(df2$names,df1$names),]
> df1[match(df2$names,df1$names),]
names value
3 c 3
1 a 1
2 b 2
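An equivalent way to express the same reordering, shown here only as a sketch, is to sort df1 by a factor whose levels follow df2$names. The names by_match and by_factor are illustrative.

```r
df1 <- data.frame(names = c("a", "b", "c"), value = 1:3)
df2 <- data.frame(names = c("c", "a", "b"), value = 1:3)

# Position of each reference name within df1.
by_match <- df1[match(df2$names, df1$names), ]

# Sort df1 by the rank of its names in the reference list.
by_factor <- df1[order(factor(df1$names, levels = df2$names)), ]

by_match
by_factor
```

The two differ when the name sets do not coincide: match() yields an NA row for a reference name missing from df1, whereas the factor version keeps every row of df1 and sorts names absent from the reference list to the end.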
