Frequency of characters in strings as columns in a data frame using R

I have a data frame initial in the following format:
> head(initial)
Strings
1 A,A,B,C
2 A,B,C
3 A,A,A,A,A,B
4 A,A,B,C
5 A,B,C
6 A,A,A,A,A,B
and the data frame I want is final:
> head(final)
Strings A B C
1 A,A,B,C 2 1 1
2 A,B,C 1 1 1
3 A,A,A,A,A,B 5 1 0
4 A,A,B,C 2 1 1
5 A,B,C 1 1 1
6 A,A,A,A,A,B 5 1 0
To generate the data frames with a large number of rows, the following code can be used:
initial<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100))
final<-data.frame(Strings=rep(c("A,A,B,C","A,B,C","A,A,A,A,A,B"),100),A=rep(c(2,1,5),100),B=rep(c(1,1,1),100),C=rep(c(1,1,0),100))
What is the fastest way I can achieve this? Any help will be greatly appreciated.

We can use base R methods for this task. We split the 'Strings' column with strsplit(...), set the names of the resulting list to the row sequence, stack it into a data.frame with key/value columns, get the frequencies with table, convert to a data.frame, and cbind with the original dataset.
cbind(df1, as.data.frame.matrix(
  table(
    stack(
      setNames(strsplit(as.character(df1$Strings), ','), 1:nrow(df1))
    )[2:1]
  )
))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
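If the chain of calls above is hard to follow, here is a minimal sketch of the intermediate objects it builds, run against the initial data from the question (the values/ind column names are what stack() produces):
s    <- strsplit(as.character(initial$Strings), ",")   # list of character vectors, one per row
s    <- setNames(s, seq_len(nrow(initial)))            # name each element by its row number
long <- stack(s)                                       # two columns: values and ind (the row id)
head(long, 4)
#   values ind
# 1      A   1
# 2      A   1
# 3      B   1
# 4      C   1
# table(long[2:1]) then cross-tabulates row id against value; the rows of that
# table follow the factor levels of 'ind', which is why the Update below
# re-levels 'ind' before calling table().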
Or we can use mtabulate from qdapTools after splitting the column.
library(qdapTools)
cbind(df1, mtabulate(strsplit(as.character(df1$Strings), ',')))
# Strings A B C D
#1 A,B,C,D 1 1 1 1
#2 A,B,B,D,D,D 1 2 0 3
#3 A,A,A,A,B,C,D,D 4 1 1 2
Update
For the new dataset 'initial', the second method works as-is. To use the first method and keep the rows in their original order, convert 'ind' to a factor with levels set to its unique elements.
df1 <- stack(setNames(strsplit(as.character(initial$Strings), ','),
seq_len(nrow(initial))))
df1$ind <- factor(df1$ind, levels=unique(df1$ind))
cbind(initial, as.data.frame.matrix(table(df1[2:1])))
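For completeness, a tidyverse sketch of the same transformation, assuming dplyr and tidyr are installed (this is not part of the original answer): split into long format, count per original row, and spread back to wide.
library(dplyr)
library(tidyr)

final2 <- initial %>%
  mutate(row = row_number(), chars = as.character(Strings)) %>%
  separate_rows(chars, sep = ",") %>%        # one row per character
  count(row, Strings, chars) %>%             # frequency of each character per original row
  pivot_wider(names_from = chars, values_from = n, values_fill = 0) %>%
  arrange(row) %>%
  select(-row)                               # columns: Strings, A, B, C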

Related

Filtering of dataframe columns displaying counter-intuitive behavior (R)

Take as an example the dataframe below. I need to change the dataframe by keeping only the columns that are in the filter objects.
test <- data.frame(A = c(1,6,1,2,3) , B = c(1,2,1,1,2), C = c(1,7,6,4,1), D = c(1,1,1,1,1))
filter <- c("A", "B", "C", "D")
filter2 <- c("A","B","D")
To do that I'm using this piece of code:
`%ni%` <- Negate(`%in%`)
test <- test[,-which(names(test) %ni% filter2)]
If I use the filter2 object I get what is expected:
A B D
1 1 1 1
2 6 2 1
3 1 1 1
4 2 1 1
5 3 2 1
However, if I use the filter object, I get a dataframe with zero columns:
data frame with 0 columns and 5 rows
I expected to get an untouched dataframe, since filter had all of test's columns in it. Why does this happen, and how can I write more reliable code that does not produce empty dataframes in these situations?
Use ! instead of -
test[,!(names(test) %ni% filter2)]
test[,!(names(test) %ni% filter)]
Wrapping with which and negating with - only works when which returns a vector of length greater than 0:
> which(names(test) %ni% filter2)
[1] 3
> which(names(test) %ni% filter)
integer(0)
Applying - leaves the integer(0) case unchanged:
> -which(names(test) %ni% filter)
integer(0)
> -which(names(test) %ni% filter2)
[1] -3
thus,
> test[integer(0)]
data frame with 0 columns and 5 rows
I think you can simplify the column selection by subsetting the dataframe with a character vector of column names.
test[filter]
# A B C D
#1 1 1 1 1
#2 6 2 7 1
#3 1 1 6 1
#4 2 1 4 1
#5 3 2 1 1
test[filter2]
# A B D
#1 1 1 1
#2 6 2 1
#3 1 1 1
#4 2 1 1
#5 3 2 1
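If the filter can contain names that do not exist in the data frame at all, a defensive base R sketch is to intersect first (intersect() silently drops the missing names); the commented dplyr line is an equivalent, assuming dplyr is available:
test[, intersect(names(test), filter2), drop = FALSE]   # keep matching columns, stay a data.frame
# dplyr::select(test, dplyr::any_of(filter2))            # any_of() ignores names that are absent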

How to convert a list of vectors with different lengths to a data frame in R

I have a list containing three vectors of different lengths, each with unique elements.
data <- list(ARG=letters[1:8],BRZ=c("a","b","c","f","h","g","l","m","n"),US=c("u","b","c","e","h","f","q","a","n","t"))
I would like to convert this list to a data frame by merging the vectors together. The expected result is shown below (or similar output). Thank you for your help.
ID  ARG  BRZ  US
a     1    1   1
b     1    1   1
c     1    1   1
d     1
e     1        1
f     1    1   1
g     1    1
h     1    1   1
l          1
m          1
n          1   1
q              1
t              1
u              1
We use mtabulate and transpose the output
library(qdapTools)
t(mtabulate(data))
Or, using base R, stack the list into a data.frame with two columns and apply table:
table(stack(data))
This assumes there are no duplicates within each vector. If there are duplicates, we may need to coerce a logical comparison back to binary:
+(table(stack(data)) >0)
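Putting the base R version together with the data from the question, a minimal sketch that also turns the row names into an explicit ID column (as in the expected output):
data <- list(ARG = letters[1:8],
             BRZ = c("a","b","c","f","h","g","l","m","n"),
             US  = c("u","b","c","e","h","f","q","a","n","t"))

tab <- table(stack(data))          # rows are the elements, columns are the list names
res <- as.data.frame.matrix(tab)   # plain data.frame; row names hold the IDs
res <- cbind(ID = rownames(res), res)
head(res, 3)
#   ID ARG BRZ US
# a  a   1   1  1
# b  b   1   1  1
# c  c   1   1  1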

Dropping common rows in two different dataframes

I am a beginner using R. I have two different dataframes, shown below as df-1 and df-2. I want to combine the two dataframes and drop the common rows (in other words, remove the common rows and keep only the rows with unique IDs).
What I want to produce is df-3.
A merge is not appropriate because I don't need common rows.
df-1
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 B67 200302034466 1 20031204 3 1
3 C15 200302034455 1 20031223 3 1
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-2
ID NUMBER FORM DATE CD AD
1 A15 200302033666 1 20031219 3 7
2 K99 200402034466 1 20041204 2 3
3 Z75 200502034455 2 20021222 1 6
4 D67 200303918556 1 20030319 3 1
5 E48 200303918575 1 20030304 3 1
6 F80 200303918588 1 20030325 3 1
7 G63 200303918595 1 20030317 3 1
df-3
ID NUMBER FORM DATE CD AD
1 B67 200302034466 1 20031204 3 1
2 C15 200302034455 1 20031223 3 1
3 K99 200402034466 1 20041204 2 3
4 Z75 200502034455 2 20021222 1 6
Use rbind to combine df1 and df2 and then select the unique rows:
df3 <- unique(rbind(df1,df2))
Can you just use unique on df3 to keep only unique rows? Or, in one line,
df3 <- unique(merge(df1, df2))
Also, avoid using brackets when naming variables - df(1) looks like "apply function df to 1"
If I'm interpreting your question correctly, you want a dataframe with records that are present in only one of the original dataframes.
With dplyr:
library(dplyr)
df1_anti <- anti_join(df1, df2)
df2_anti <- anti_join(df2, df1)
df3 <- bind_rows(df1_anti, df2_anti)
df1_anti contains rows present in df1 but not in df2.
df2_anti contains rows present in df2 but not in df1.
df3 is the union of the two anti-join results.
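A base R sketch of the same idea, assuming each row is unique within its own data frame: duplicated() in both directions flags every row that has a twin, so dropping those rows leaves the rows present in exactly one input.
combined <- rbind(df1, df2)
dups     <- duplicated(combined) | duplicated(combined, fromLast = TRUE)
df3      <- combined[!dups, ]      # rows that occur in only one of df1, df2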

Duplicating data frame rows by freq value in same data frame [duplicate]

This question already has answers here: Repeat each row of data.frame the number of times specified in a column.
I have a data frame with names by type and their frequencies. I'd like to expand this data frame so that the names are repeated according to their name-type frequency.
For example, this:
> df = data.frame(name=c('a','b','c'),type=c(0,1,2),freq=c(2,3,2))
name type freq
1 a 0 2
2 b 1 3
3 c 2 2
would become this:
> df_exp
name type
1 a 0
2 a 0
3 b 1
4 b 1
5 b 1
6 c 2
7 c 2
Appreciate any suggestions on an easy way to do this.
You can just use rep to "expand" your data.frame rows:
df[rep(sequence(nrow(df)), df$freq), c("name", "type")]
# name type
# 1 a 0
# 1.1 a 0
# 2 b 1
# 2.1 b 1
# 2.2 b 1
# 3 c 2
# 3.1 c 2
And there's a function expandRows in the splitstackshape package that does exactly this. It also has the option to accept a vector specifying how many times to replicate each row, for example:
expandRows(df, "freq")

Order/sort a data frame with respect to a character reference list

Consider these two df examples
df1=data.frame(names=c('a','b','c'),value=1:3)
df2=data.frame(names=c('c','a','b'),value=1:3)
so that
> df1
names value
1 a 1
2 b 2
3 c 3
> df2
names value
1 c 1
2 a 2
3 b 3
Now, I would like to sort df1 into the same order as the names column in df2, to obtain:
names value
c 3
a 1
b 2
How can I achieve this?
Try:
df1[match(df2$names,df1$names),]
> df1[match(df2$names,df1$names),]
names value
3 c 3
1 a 1
2 b 2
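The same reordering can be sketched with a factor whose levels encode the desired order (equivalent to the match() call above):
ord <- factor(as.character(df1$names), levels = as.character(df2$names))
df1[order(ord), ]
#   names value
# 3     c     3
# 1     a     1
# 2     b     2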
