Get frequency of occurrence of a character value in R dataframe - r

I have a dataframe like this:
col1 col2 col3
abc def xzy
lmk qwe abc
def lmk xzy
xzy abc qwe
The three columns hold character datatype values.
Across the 3 columns I have 5 unique values: abc, def, xzy, lmk and qwe.
What I need is a count of number of times each of these values appears in the whole dataframe.
abc 3
qwe 2
def 2
xzy 3
lmk 2
All the count() and aggregate functions only work column-wise, and when I unlist the dataframe, it doesn't seem to work either.
Any suggestions for functions that I can use?
Many thanks in advance.

You can do something like this (assuming that your data frame is called df1):
data <- c(df1$col1, df1$col2, df1$col3)
table(data)
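This gives you the desired counts. Make sure the columns in df1 are character vectors, not factors: in older versions of R, c() on factors combines their integer codes rather than their labels.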
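As a minimal sketch of the same idea without naming each column, the example below rebuilds the question's data (the name df1 and the use of stringsAsFactors = FALSE are assumptions) and counts every value across the whole data frame with unlist() and table():

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps the columns as character
df1 <- data.frame(
  col1 = c("abc", "lmk", "def", "xzy"),
  col2 = c("def", "qwe", "lmk", "abc"),
  col3 = c("xzy", "abc", "xzy", "qwe"),
  stringsAsFactors = FALSE
)

# unlist() flattens all columns into one vector; table() counts each distinct value
counts <- table(unlist(df1))
print(counts)

# Optional: the same counts as a two-column data frame
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("value", "n")
```

This scales to any number of columns, since unlist() does not need the column names.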

You can use an aggregate function with 'union all' - make sure you use 'union all', not just 'union'.
Let us assume that your table is called tmpcol with columns col1, col2, and col3...
select value, sum(cnt)
from (
select col1 as value, count(*) as cnt
from tmpcol
group by col1
union all
select col2 as value, count(*) as cnt
from tmpcol
group by col2
union all
select col3 as value, count(*) as cnt
from tmpcol
group by col3
) dataset
group by value;

Related

Adding a new column to one dataframe using a column from a second consisting of vectors with strings in R

I'm trying to add a new column to a dataframe. The data will be filled from column A in a second dataframe and depends on column B. The values in column B are vectors. The vectors in my real data are far too long for an if_else statement, so I'm looking for something more along the lines of stringr or grepl recognizing character strings, but cycling through the rows of the dataframe like a for loop.
df1 <- data.frame(col1 = c('siteA', 'siteB', 'siteC', 'siteD'),
col2 = c('ecoA', 'ecoB', 'ecoC', 'ecoD'))
df2 <- data.frame(colA = c('type1', 'type2'),
colB = c("c('ecoA','ecoC')", "c('ecoB','ecoD')"))
I've tried merge, mutate with if statements, joins (dplyr), and case_when, but these either didn't fill all of the rows or were far too long and complicated for the data set I have.
This is the end result I'm hoping for:
|col1 | col2 |colA |
|-----|------|-----|
|siteA| ecoA |type1|
|siteB| ecoB |type2|
|siteC| ecoC |type1|
|siteD| ecoD |type2|
You could parse each colB character string to turn it into a vector of character strings, unnest it so that each value gets its own row, and then join:
df2$colB <- lapply(df2$colB,function(x) eval(parse(text=x)))
df2 <- tidyr::unnest(df2,cols=colB)
dplyr::inner_join(df1,df2,by = c(col2="colB"))
col1 col2 colA
1 siteA ecoA type1
2 siteB ecoB type2
3 siteC ecoC type1
4 siteD ecoD type2
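If evaluating strings with eval(parse()) is undesirable, a base R sketch along the following lines may work instead. It assumes every value in colB is wrapped in single quotes, as in the example; the names tokens and lookup are made up for illustration.

```r
df1 <- data.frame(col1 = c("siteA", "siteB", "siteC", "siteD"),
                  col2 = c("ecoA", "ecoB", "ecoC", "ecoD"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(colA = c("type1", "type2"),
                  colB = c("c('ecoA','ecoC')", "c('ecoB','ecoD')"),
                  stringsAsFactors = FALSE)

# Extract every single-quoted token from each colB string, then strip the quotes
tokens <- regmatches(df2$colB, gregexpr("'[^']*'", df2$colB))
tokens <- lapply(tokens, function(x) gsub("'", "", x, fixed = TRUE))

# One row per (colA, value) pair
lookup <- data.frame(colA = rep(df2$colA, lengths(tokens)),
                     colB = unlist(tokens),
                     stringsAsFactors = FALSE)

# match() keeps the row order of df1 and yields NA for values with no type
df1$colA <- lookup$colA[match(df1$col2, lookup$colB)]
print(df1)
```

Using match() rather than a join keeps every row of df1, so unmatched values show up as NA instead of disappearing.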

Check if a string is a subset in another string in R

I've this data below that includes ID, and Code (chr type)
ID <- c(1,1,1,2,2,3,3,3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000",
"1000000000", "1000000000", "0100000000", "0010000000", "0010000001", "0010000001")
df <- data.frame(ID, Code)
I need to remove records (within each ID) based on the Code value pattern. That is:
For each ID, we look at the values of Code and remove the ones that are a subset of another row.
For example, for ID=1, row #2 is a subset of row #1, so we remove row #2. But row #3 is NOT a subset of row #1 or #2, so we keep it.
For ID=2, row #5 is a subset of row #4, so we remove it.
For ID=3, they are all different, so we keep them all.
For ID=4, since the Code for both records is the same, we keep only the first one.
Here is the expected final view of the results:
It's not that pretty, but a bit of checking of every combination with a join will do it.
Convert to a data.table
library(data.table)
setDT(df)
Make a row counter, and identify all the 1 locations in each string and save to a list.
df[, rn := .I]
df[, ones := gregexpr("1", df$Code)]
Join each group to itself, and compare the lists where the row numbers don't match. Then keep the row numbers where the lists are subsets, and drop these rows from the original data. In the case of duplicates, only remove the first occurrence of the duplicate.
df[
funion(
df[df, on=c("ID","rn>rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"],
df[df, on=c("ID","rn<rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"]
),
on=c("ID","Code"),
mult="first",
drop := 1
]
df[is.na(drop), -c("rn","ones","drop")]
# ID Code
#1: 1 0011100000
#2: 1 1001100000
#3: 2 1100000000
#4: 3 1000000000
#5: 3 0100000000
#6: 3 0010000000
#7: 4 0010000001
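For comparison, here is a plain base R sketch of the same rule using explicit loops. It is not the data.table approach above, and the helper is_subset() is a hypothetical name. A row is dropped when its "1" positions are all contained in a different code of the same ID, or when it exactly duplicates an earlier row.

```r
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000",
          "1000000000", "1000000000", "0100000000", "0010000000",
          "0010000001", "0010000001")
df <- data.frame(ID, Code, stringsAsFactors = FALSE)

# TRUE when every "1" position in code a is also a "1" in code b
is_subset <- function(a, b) {
  a_bits <- strsplit(a, "")[[1]] == "1"
  b_bits <- strsplit(b, "")[[1]] == "1"
  all(b_bits[a_bits])
}

keep <- logical(nrow(df))
for (i in seq_len(nrow(df))) {
  others <- setdiff(which(df$ID == df$ID[i]), i)
  dropped <- FALSE
  for (j in others) {
    same <- df$Code[i] == df$Code[j]
    # Drop row i if it is a subset of a different code,
    # or an exact duplicate of an earlier row in the same ID
    if ((!same && is_subset(df$Code[i], df$Code[j])) || (same && j < i)) {
      dropped <- TRUE
    }
  }
  keep[i] <- !dropped
}

result <- df[keep, ]
print(result)
```

The loop compares every pair of rows within an ID, so it is quadratic in group size; that is fine for small groups but the join-based version is likely to scale better.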

How to concatenate columns for one row without enumerating them?

Given:
let table1 = datatable (col1:string, col2:string)
["abc","def"]
;
let table2 = datatable (col3:string, col4:string)
["ghi","jkl"]
where table1 | union table2 gives:
col1 col2 col3 col4
abc def
ghi jkl
How can I get this instead?
col1 col2 col3 col4
abc def ghi jkl
Assume there might be many more columns, so any solution requiring enumerating all of them doesn't work.
Related question:
Is there a way to combine data from two tables in Kusto?
One solution I know of is:
table1 | extend pivot=""
| join kind=innerunique (table2 | extend pivot = "") on pivot
| project-away pivot*

dplyr or base R: how to select (or delete) rows that have identical values in column 1 and column 2 while keeping column 3 values

I have a data.frame and want to use {dplyr} or {base} R.
How can I select (or delete) the rows that have identical values in column 1 and column 2, while keeping the values of column 3?
I have no idea how to do this (maybe with the distinct function?).
test <- data.frame(column1 = c("paris","moscou", "rennes"),
column2 = c("paris", "lima", "rennes"),
column3 =c(12,56,78))
> print (test)
column1 column2 column3
1 paris paris 12
2 moscou lima 56
3 rennes rennes 78
Example of the rows I want to select:
line 1: paris paris
line 3: rennes rennes
library(dplyr)
test2 <- test %>%
filter(column1 == column2)
print (test2)
Error: level sets of factors are different
We can use subset from base R
subset(test, as.character(column1) == as.character(column2))
In dplyr, use filter to retrieve specific rows and select to retrieve specific columns.
The columns here are factors with different level sets (the default for data.frame() before R 4.0), so convert them with as.character before comparing:
library(dplyr)
test %>%
filter(as.character(column1) == as.character(column2))
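Another option, sketched below under the assumption that the data frame can be rebuilt, is to create it with stringsAsFactors = FALSE so the columns are plain character vectors and can be compared directly. The names same_rows and different_rows are illustrative only.

```r
# Character columns instead of factors, so == compares the strings directly
test <- data.frame(column1 = c("paris", "moscou", "rennes"),
                   column2 = c("paris", "lima", "rennes"),
                   column3 = c(12, 56, 78),
                   stringsAsFactors = FALSE)

# Rows where column1 and column2 are identical (column3 is kept)
same_rows <- test[test$column1 == test$column2, ]

# Rows where they differ, i.e. the identical rows "deleted"
different_rows <- test[test$column1 != test$column2, ]

print(same_rows)
print(different_rows)
```

With character columns, the original dplyr call filter(column1 == column2) should also work without the level-set error.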

R Using a for() loop to fill one dataframe with another

I have two dataframes and I wish to insert the values of one dataframe into another (let's call them DF1 and DF2).
DF1 consists of 2 columns 1 and 2. Column 1 (col1) contains characters a to z and col2 has values associated with each character (from a to z)
DF2 is a dataframe with 3 columns. The first two consist of every combination of DF1$col1 so: aa ab ac ad etc; where the first letter is in col1 and the second letter is in col2
I want to create a simple mathematical model utilizing the values in DF1$col2 to see the outcomes of every possible combination of objects in DF1$col1
The first step I wanted to do is to transfer values from DF1$col2 to DF2$col3 (values in DF2$col3 should be associated with the values in DF2$col1), but that's where I'm stuck. I currently have
for(j in 1:length(DF2$col1))
{
## this part is to use the characters in DF2$col1 as an input
## to yield the output for DF2$col3--
input=c(DF2$col1)[j]
## This is supposed to use the values found in DF1$col2 to fill in DF2$col3
g=DF1[(DF1$col2==input),"pred"]
## This is so that the values will fill in DF2$col3--
DF2$col3=g
}
When I run this, DF2$col3 will be filled up with the same value for a specific character from DF1 (e.g. DF2$col3 will have all the rows filled with the value associated with character "a" from DF1)
What exactly am I doing wrong?
Thanks a bunch for your time
You should really use merge for this, as @Aaron suggested in his comment above. If you insist on writing your own loop, the problem is in your last line: you assign the value of g to the whole col3 column on every iteration. Use the j index there as well:
for (j in seq_along(DF2$col1))
{
# look up the row of DF1 that matches the j-th key and fill only row j
DF2$col3[j] = DF1[which(DF1$col2 == DF2$col1[j]), "pred"]
}
If this does not work, then please also post some sample data so that we can help in more detail (I can only guess what "pred" refers to).
It sounds like what you are trying to do is a simple join, that is, match DF1$col1 to DF2$col1 and copy the corresponding value from DF1$col2 into DF2$col3. Try this:
DF1 <- data.frame(col1=letters, col2=1:26, stringsAsFactors=FALSE)
DF2 <- expand.grid(col1=letters, col2=letters, stringsAsFactors=FALSE)
DF2$col3 <- DF1$col2[match(DF2$col1, DF1$col1)]
This uses the function match(), which, as the documentation states, "returns a vector of the positions of (first) matches of its first argument in its second." The values you have in DF1$col1 are unique, so there will not be any problem with this method.
As a side note, in R it is usually better to vectorize your work rather than using explicit loops.
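For completeness, the merge-based route mentioned in this thread could look roughly like the sketch below. The data frames are rebuilt as in the match() example, and lookup is a hypothetical helper that renames DF1$col2 to col3 so it does not clash with DF2$col2.

```r
DF1 <- data.frame(col1 = letters, col2 = 1:26, stringsAsFactors = FALSE)
DF2 <- expand.grid(col1 = letters, col2 = letters, stringsAsFactors = FALSE)

# Rename the value column so the merged result has col1, col2, col3
lookup <- setNames(DF1, c("col1", "col3"))

# Join on col1; every row of DF2 picks up the value for its first letter
merged <- merge(DF2, lookup, by = "col1", sort = FALSE)

head(merged, n = 3)
```

Note that merge() does not promise to preserve the row order of DF2, which is one reason the match() version can be preferable when the order matters.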
Not sure I fully understood your question, but you can try this:
df1 <- data.frame(col1=letters[1:26], col2=sample(1:100, 26))
df2 <- with(df1, expand.grid(col1=col1, col2=col1))
df2$col3 <- df1$col2
The last command uses recycling (it could be written as rep(df1$col2, 26) as well).
The results are shown below:
> head(df1, n=3)
col1 col2
1 a 68
2 b 73
3 c 45
> tail(df1, n=3)
col1 col2
24 x 22
25 y 4
26 z 17
> head(df2, n=3)
col1 col2 col3
1 a a 68
2 b a 73
3 c a 45
> tail(df2, n=3)
col1 col2 col3
674 x z 22
675 y z 4
676 z z 17
