Count certain words in dataframe/matrix in R [duplicate] - r

This question already has answers here:
Extracting a series of integers using a loop
(4 answers)
Closed 5 years ago.
I have this data, and what I want to do is count the occurrences (frequencies) of ONE, TWO and THREE in each column,
e.g. 2 ONEs in the A column, 2 TWOs in the B column, 1 ONE in the C column, etc.
What function can I use to count certain words in R?
And how can I make a histogram out of these counts?
ABC <- read.csv("c:/Data/dataset.csv")
A B C
1 TWO ONE THREE
2 ONE ONE TWO
3 THREE TWO THREE
4 ONE TWO ONE
5 TWO THREE TWO

We can use mtabulate from the qdapTools package to count the unique elements in each column of the dataset:
library(qdapTools)
t(mtabulate(ABC))
# A B C
#ONE 2 2 1
#THREE 1 1 2
#TWO 2 2 2
Or we can use table, after unlisting the dataset and repeating the column names of 'ABC' to match. Note that table is called only once here.
tbl <- table(unlist(ABC),names(ABC)[col(ABC)])
tbl
# A B C
# ONE 2 2 1
# THREE 1 1 2
# TWO 2 2 2
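A slightly faster option would be to use vapply with tabulate. The rows come back unnamed, in the order of the sorted factor levels (ONE, THREE, TWO), and this assumes every column contains all three words: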
vapply(ABC, function(x) tabulate(factor(x)), numeric(3))
If we need a barplot
barplot(tbl, beside=TRUE, legend=TRUE)

df <- data.frame(A=c('TWO','ONE','THREE','ONE','TWO'), B=c('ONE','ONE','TWO','TWO','THREE'), C=c('THREE','TWO','THREE','ONE','TWO'), stringsAsFactors=FALSE)
sapply(df, table)
## A B C
## ONE 2 2 1
## THREE 1 1 2
## TWO 2 2 2
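If a word never appears in one of the columns, table returns results of different lengths and sapply can no longer build a matrix. A minimal sketch that avoids this by fixing the factor levels up front; the vector name `words` and the colSums shortcut are illustrative choices, not part of the answers above:

```r
# Same data as in the question
df <- data.frame(
  A = c("TWO", "ONE", "THREE", "ONE", "TWO"),
  B = c("ONE", "ONE", "TWO", "TWO", "THREE"),
  C = c("THREE", "TWO", "THREE", "ONE", "TWO"),
  stringsAsFactors = FALSE
)

# Words to count; fixing the levels makes every column report all of them,
# with 0 for a word that does not occur in that column
words <- c("ONE", "TWO", "THREE")
counts <- sapply(df, function(x) table(factor(x, levels = words)))
counts

# Counting a single word per column
one_counts <- colSums(df == "ONE")
one_counts
```

The rows of `counts` follow the order given in `words` rather than alphabetical order, which can be handy when passing the matrix to barplot.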

If I want to sort a column by size in RStudio, how do I make sure that the associated values of the rows sort with the column?

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. Now I need to sort one column by size, but I want the remaining columns to sort along with it, so that one column is sorted by increasing values and the other columns still contain the values of the right persons (so that one row still contains data from one and the same person).
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
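These are the column names of my data.frame, and I want to sort it by the column called "avg".
First of all, please always provide us with a reproducible example such as the one below. Indexing the rows of a data frame with order() moves whole rows, so all other columns stay aligned with the sorted column. The example below sorts by decreasing avg; drop the minus sign for increasing order.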
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1
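Because every column in the example above holds the same values, it is hard to see that the rows really travel together. A small sketch with made-up values (all numbers here are hypothetical) that sorts by increasing avg, as asked in the question:

```r
# Hypothetical data: each row belongs to one person
BAPlotDET <- data.frame(
  fsskiddet = c(10, 20, 30),
  fspiddet  = c(11, 21, 31),
  avg       = c(2.5, 0.5, 1.5),
  diff      = c(-1, 1, 0),
  absdiff   = c(1, 1, 0)
)

# order() returns the row positions that put avg in increasing order;
# using them as a row index reorders every column at once
sorted <- BAPlotDET[order(BAPlotDET$avg), ]
sorted
```

The row names of `sorted` keep the original positions (2, 3, 1), which is a quick way to check that each person's values stayed in one row.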

Compare lists in dataframes based on personal code, shorten one list if longer

I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare each dyad_id (starting with dyad_id 1) in both data frames and then shorten the longer one by deleting its last row(s) across all columns.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
example for final dataset
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))
Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5
Maybe we can try using table to calculate the frequencies of the ids in both data frames, assuming the same ids occur in both. Take the smaller of the two counts for each id using pmin, and repeat the names based on that frequency.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
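The pmin counts can also be used to trim the two data frames themselves rather than only building the id vector. A rough sketch, assuming both data frames contain the same set of ids; the names `n_keep`, `turn_A`, `turn_B`, `df_A_trim` and `df_B_trim` are made up for illustration:

```r
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A <- data.frame(dyad_id_A, fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B, fw_quantiles_b)

# Number of turns to keep per dyad: the smaller of the two counts
n_keep <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))

# Running turn number within each dyad
turn_A <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN = seq_along)
turn_B <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN = seq_along)

# Keep only turns up to the per-dyad limit, which drops the trailing rows
df_A_trim <- df_A[turn_A <= as.vector(n_keep[as.character(df_A$dyad_id_A)]), ]
df_B_trim <- df_B[turn_B <= as.vector(n_keep[as.character(df_B$dyad_id_B)]), ]
```

After trimming, both data frames have the same number of rows per dyad, so they could be placed side by side with cbind if that suits the analysis.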

Go through a column and collect a running total in new column [duplicate]

This question already has answers here:
Creation of a specific vector without loop or recursion in R
(2 answers)
Split data.frame by value
(2 answers)
Closed 4 years ago.
I have a dataframe whose rows represent people. For a given family, the first row has the value 1 in column A, and all following rows contain members of the same family until another row has the value 1 in column A. Then, a new family starts.
I would like to assign IDs to all families in my dataset. In other words, I would like to take:
A
1
2
3
1
3
3
1
4
And turn it into:
A family_id
1 1
2 1
3 1
1 2
3 2
3 2
1 3
4 3
I'm playing with a dataframe of 3 million rows, so a simple for-loop solution I came up with falls short of necessary efficiency. Also, the family_id need not be sequential.
I'll take a dplyr solution.
data:
df <- data.frame(A = c(1:3,1,3,3,1,4))
code:
df$family_id <- cumsum(c(-1,diff(df$A)) < 0)
result:
# A family_id
#1 1 1
#2 2 1
#3 3 1
#4 1 2
#5 3 2
#6 3 2
#7 1 3
#8 4 3
Please note:
This solution starts a new group whenever a number is smaller than the previous one.
If it is 100% certain that a new group always begins with a 1, then Ronak's solution is perfect.
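The other answer referred to here is not shown on this page. One common way to write the "every family starts with a 1" rule is a cumulative sum over A == 1; the sketch below, with a second made-up vector `A2`, shows where it differs from the diff-based code:

```r
df <- data.frame(A = c(1:3, 1, 3, 3, 1, 4))

# New family whenever A equals 1
df$family_id <- cumsum(df$A == 1)
df$family_id

# Hypothetical input where a value drops without restarting at 1
A2 <- c(1, 3, 2, 1, 4)
by_one  <- cumsum(A2 == 1)                # groups start only at a 1
by_drop <- cumsum(c(-1, diff(A2)) < 0)    # groups start at any decrease
by_one
by_drop
```

On the question's data both approaches agree; on `A2` the diff-based version opens an extra group at the step from 3 to 2.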

Adding group column to data frame [duplicate]

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to (programmatically) add a column that takes the value of count whenever count<3, and 3 otherwise, so I'll get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but it doesn't work on arrays. How can I do it?
In this particular case you can just use pmin (as I stated in the comments above):
dx$group <- pmin(dx$count, 3)
In general, your if/else construction does not work on vectors because if expects a single TRUE or FALSE, but you can use the function ifelse. It takes three arguments: first the condition, then the result where the condition is met, and finally the result where it is not met. For your example you would write the following:
dx$group <- ifelse(dx$count < 3, dx$count, 3)
Note that in your example the pmin solution is better; I mention the ifelse solution only for completeness.
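A short check that the two approaches give the same column on the question's data; the column names `group_pmin` and `group_ifelse` are only there to keep the results side by side:

```r
dx <- data.frame(id = letters[1:4], count = 1:4)

# Element-wise minimum of count and the cap 3
dx$group_pmin <- pmin(dx$count, 3)

# Vectorised condition: keep count where it is below 3, otherwise use 3
dx$group_ifelse <- ifelse(dx$count < 3, dx$count, 3)

dx
same_result <- all(dx$group_pmin == dx$group_ifelse)
```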
