Subdividing and counting values in particular columns under certain conditions in R

I am new to R and data analysis. I have a data set similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2, and so on. For each action there are multiple questions, subquestions, and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn, and questionn.
id  country  questionn  subquestion  value  actionn
06  NIE      1          1            1      1
05  NIG      1          1            1      1
07  TAN      1          1            1      1
08  BEN      1          1            1      1
03  TOG      1          1            2      1
45  MOZ      1          1            2      1
40  ZIM      1          1            1      1
56  COD      1          1            1      1
87  BFA      1          1            1      1
09  IVC      1          1            2      1
08  SOA      1          1            2      1
02  MAL      1          1            2      1
78  MAI      1          1            2      1
35  GUB      1          1            2      1
87  RWA      1          1            2      1
41  ETH      1          1            1      1
06  NIE      1          2            2      1
05  NIG      1          2            1      1
87  BFA      1          2            1      2
I have tried to create subsets of the data frame and count everything for each country one at a time, but it is going to take forever, and I was wondering whether there is a general way to do it.
For the first question I have done this:
df1 <- df %>% group_by(country) %>% summarise(countries = country)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting each row that has questionn = 1, subquestion = 1, value = 1, and actionn = 1, then selecting and counting how many rows per country have questionn = 1, subquestion = 2, value = 1, actionn = 1, and so on. Value refers to whether the answer to the question is 1 = yes or 2 = no.
I would be grateful for any help, thank you so much :)

For the first question you can try to do something like this:
df %>%
  filter(subquestion != 2) %>%
  group_by(country) %>%
  summarise(num_actions = n())
This will return the number of actions per country, removing rows where the subquestion column is 2. Note that the n() in the summarise call counts the number of observations in each group (in this case, each country).
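If instead you only need the number of distinct countries (which is what the unique()/count() attempt in the question seems to be aiming at), a one-line sketch with dplyr:
n_distinct(df$country)  # counts the unique values of the country column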
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular combination you want to count (how many subquestions 1 or 2 with value 1 there are for each country, actionn, and questionn):
df %>%
  mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
  group_by(country_question_code) %>%
  summarize(num_subquestion = n())
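Since the question also restricts the count to value 1 and subquestions 1 or 2, a minimal sketch of a variant that filters first (assuming dplyr 1.0+ for the .groups argument):
df %>%
  filter(value == 1, subquestion %in% c(1, 2)) %>%  # keep only "yes" answers to subquestion 1 or 2
  group_by(country, actionn, questionn) %>%
  summarise(num_subquestions = n(), .groups = "drop")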

For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2, or more).
For just the total count:
df %>%
  group_by(country) %>%
  summarise(Count_actions = sum(actionn, na.rm = TRUE))  # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm = TRUE). This may not be what you want, but sometimes the simple solution is the best (it just counts the frequency of each country).
Or:
df %>% group_by(country, actionn) %>% summarise(count_actions = n())
will give a country-wise count for each type (say 1, 2, or more actions).
The data.table version: dt[, .(.N), by = .(country, actionn)]
For question 2: after filtering the data as required, group by each of the columns in the "for each" part of your question. Here, filter subquestion 1 or 2 with value 1, for each country, questionn, and actionn:
df %>%
  filter(subquestion <= 2 & value == 1) %>%
  group_by(country, questionn, actionn) %>%
  summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works. I am also learning R and applying it to similar data.
I have not tested this and have made certain assumptions about your data (numerical and clean). (Also, written on mobile while traveling! Cheers!!)

Related

Creating a token count by date and co-occurrence term proportion by date using quanteda

Hello guys, I have a quite massive data set that contains reviews of utility services from customers all over the UK. This is a small sample of what the data looks like:
df <- data.frame(text = c("The investors and their supporters shall invest and do something mostly invest",
                          " Shall we tell the investors to invest ?",
                          "Investors shall invest.",
                          "Investors may sometimes invest",
                          "spend what Investor Do"),
                 date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022"))
What I want is to be able to count the frequency of terms/words/tokens by date.
For instance, the word invest appears 6 times in total, and for the date 10/12/2022 its count is 4. I want to use the quanteda library (since it is so powerful) to compute the counts and plot them over date.
I also want to plot the association or co-occurrence of the words investor and invest over date.
For instance, in this example we have 5 reviews, and in 4 of the 5 the words invest and investor are both present; I'd like to plot that percentage over date as well, if that is possible. What amazing options does the quanteda library have that can perform this task? Would it also be possible to find, let's say, a minimum percentage of the 0.25 most frequent words that appear when "invest" appears?
To achieve the first point I started with the following code:
df %>%
  corpus(text_field = "text") %>%
  dfm() %>%
  textstat_frequency(10)
which gives:
      feature frequency rank docfreq group
1      invest         6    1       5   all
2   investors         4    2       4   all
3       shall         3    3       3   all
4         the         2    4       2   all
5         and         2    4       1   all
6          do         2    4       2   all
7       their         1    7       1   all
8  supporters         1    7       1   all
9   something         1    7       1   all
10         we         1    7       1   all
Warning message:
'dfm.corpus()' is deprecated. Use 'tokens()' first.
But how would I go about plotting the frequency of these words over the date column? I read in the documentation that one can group, but I have had no luck in doing so.
And for the second question, I don't know for sure which function of the quanteda library to use, but I am trying to mirror tm::findAssocs() from the tm library.
I am super attentive to your answers, guys; I will be upvoting and picking an answer as soon as they come. THANKS A TRILLION for your help, it really means the world to me.
Answer to your first question:
The dates are put into the docvars part of your corpus. These can be used within textstat_frequency() via the groups argument.
dat <- data.frame(text = c("The investors and their supporters shall invest and do something mostly invest",
                           " Shall we tell the investors to invest ?",
                           "Investors shall invest.",
                           "Investors may sometimes invest",
                           "spend what Investor Do"),
                  date = c("10/12/2022", "10/12/2022", "10/12/2022", "11/12/2022", "12/12/2022"))

library(dplyr)
library(quanteda)
library(quanteda.textstats)

dat %>%
  corpus(text_field = "text") %>%
  tokens() %>%
  dfm() %>%
  textstat_frequency(groups = date)
      feature frequency rank docfreq      group
1      invest         4    1       3 10/12/2022
2   investors         3    2       3 10/12/2022
3       shall         3    2       3 10/12/2022
4         the         2    4       2 10/12/2022
5         and         2    4       1 10/12/2022
6       their         1    6       1 10/12/2022
7  supporters         1    6       1 10/12/2022
8          do         1    6       1 10/12/2022
9   something         1    6       1 10/12/2022
10     mostly         1    6       1 10/12/2022
11         we         1    6       1 10/12/2022
12       tell         1    6       1 10/12/2022
13         to         1    6       1 10/12/2022
14          ?         1    6       1 10/12/2022
15          .         1    6       1 10/12/2022
16  investors         1    1       1 11/12/2022
17     invest         1    1       1 11/12/2022
18        may         1    1       1 11/12/2022
19  sometimes         1    1       1 11/12/2022
20         do         1    1       1 12/12/2022
21      spend         1    1       1 12/12/2022
22       what         1    1       1 12/12/2022
23   investor         1    1       1 12/12/2022
You now have access to the frequency per day.
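Since textstat_frequency() returns a plain data frame, plotting the counts over the date column is a regular ggplot2 job. A minimal sketch (the object name freqs and the chosen terms are illustrative; assumes ggplot2 is installed):
library(ggplot2)

freqs <- dat %>%
  corpus(text_field = "text") %>%
  tokens() %>%
  dfm() %>%
  textstat_frequency(groups = date)

freqs %>%
  filter(feature %in% c("invest", "investors")) %>%  # pick the terms of interest
  ggplot(aes(x = group, y = frequency, fill = feature)) +
  geom_col(position = "dodge") +
  labs(x = "date", y = "frequency")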
As for question 2, I think you can use textstat_simil(). Something like below. It does give somewhat different answers than tm::findAssocs(), usually more features, so I'm not completely sure this is the correct answer. Maybe someone from the quanteda team can confirm or deny this.
my_dfm <- dat %>%
  corpus(text_field = "text") %>%
  tokens() %>%
  dfm()

textstat_simil(my_dfm,
               my_dfm[, c("investor")],
               method = "correlation",
               margin = "features",
               min_simil = 0.7)
textstat_simil object; method = "correlation"
           investor
the               .
investors         .
and               .
their             .
supporters        .
shall             .
invest            .
do                .
something         .
mostly            .
we                .
tell              .
to                .
?                 .
.                 .
may               .
sometimes         .
spend             1
what              1
investor          1
You can save the outcome of textstat_simil() as a data.frame or a list if you want, with as.data.frame() or as.list().
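For example, a quick sketch of the data.frame route (the min_simil threshold is dropped here so all feature pairs come back; simil_df is just an illustrative name):
simil_df <- as.data.frame(textstat_simil(my_dfm,
                                         my_dfm[, "investor"],
                                         method = "correlation",
                                         margin = "features"))
head(simil_df)  # expected columns: feature1, feature2, correlation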

Special case of merge in R

I'm trying to solve this. For example:
Warehouse
  id      amount size
1 cymbals      5   24
2 snares       3   10
3 tom1         2   19

Incoming
  id     amount size
1 snares      2   15

Resulting
  id      amount size
1 cymbals      5   24
2 snares       5   15
3 tom1         2   19
I am a newbie in R, so I was looking for the most elegant/legible way of getting the 'resulting' table (I am not concerned with performance). The logic would be: take every incoming item and look it up in the warehouse; if it exists, add the amounts and replace the size with the new size; if it doesn't exist, add it.
With dplyr, we can bind the two data frames together, group them by id, and calculate the sum of amount while taking the last value of size, so if the value is present in incoming it will be taken from there, or else the size value from the warehouse data frame is kept.
library(dplyr)

bind_rows(warehouse, incoming) %>%
  group_by(id) %>%
  summarise(amount = sum(amount),
            size = last(size))
#  id      amount  size
#  <chr>    <int> <int>
#1 cymbals      5    24
#2 snares       5    15
#3 tom1         2    19
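For a self-contained run, the two inputs from the question can be sketched as follows (column types assumed):
warehouse <- data.frame(id = c("cymbals", "snares", "tom1"),
                        amount = c(5L, 3L, 2L),
                        size = c(24L, 10L, 19L))

incoming <- data.frame(id = "snares",
                       amount = 2L,
                       size = 15L)
Note that the argument order in bind_rows(warehouse, incoming) matters: because the incoming rows are appended after the warehouse rows, last(size) picks the incoming size whenever an item appears in both tables.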

Working with repeated values in rows

I am working with a df of 46216 observations where the units are homes and people, and each home may have any number of members, like:
[screenshot of the data omitted; the structure matches the example data frame in the answer below]
and this goes on for almost another 18000 homes.
What I need to do is to get the mean of education years for every home, for which I guess I will need a variable that gives the number of people in each home.
What I tried to do is:
num_peopl <- by(df$person_number, df$home, max)
which, for each home, takes the highest person number as the total number of people who live there. But when I try to cbind this with the df I get:
"arguments imply differing number of rows: 46216, 17931"
It is as if it puts the number of persons in only one row and leaves the others empty.
How can I do this? Is there a function?
I think aggregate() and a join may be what you're looking for. aggregate() does the same thing that you did, but puts the result into a data frame, which I at least am more familiar with.
Then I used dplyr's left_join(), joining the home numbers together:
library(tidyverse)

df <- data.frame(home_number = c(1, 1, 1, 2, 2, 3),
                 person_number = c(1, 2, 3, 1, 2, 1),
                 age = c(20, 21, 1, 54, 50, 30),
                 sex = c("m", "f", "f", "m", "f", "f"),
                 salary = c(1000, 890, NA, 900, 500, 1200),
                 years_education = c(12, 10, 0, 8, 7, 14))

df2 <- aggregate(df$person_number, by = list(df$home_number), max)

df_final <- df %>%
  left_join(df2, by = c("home_number" = "Group.1"))
  home_number person_number age sex salary years_education x
1           1             1  20   m   1000              12 3
2           1             2  21   f    890              10 3
3           1             3   1   f     NA               0 3
4           2             1  54   m    900               8 2
5           2             2  50   f    500               7 2
6           3             1  30   f   1200              14 1
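Since the stated goal is the mean of education years per home, here is a short dplyr sketch building on the same example data (column names taken from the answer above):
df %>%
  group_by(home_number) %>%
  summarise(num_people = max(person_number),                       # people per home
            mean_education = mean(years_education, na.rm = TRUE))  # mean education years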

Updating a table with the rolling average of previous rows in R?

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, as well as data regarding the user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date     Revenue Time(s) UserID X Y Z
1       1/1/2017 $10     120     1      3 2 2
1       1/1/2017 $15     150     2      2 1 2
2       2/1/2017 $50     60      1      1 5 1
2       2/1/2017 $45     100     4      3 5 2
3       3/1/2017 $25     75      1      2 3 1
3       3/1/2017 $20     210     2      5 5 1
3       3/1/2017 $25     120     3      1 0 4
3       3/1/2017 $15     100     4      3 1 1
4       4/1/2017 $75     25      4      0 2 1
My goal is to build a model that can, given a specific user's performance history (in the example, attributes X, Y, and Z), predict the revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up until the previous event. An example of the thought process: a user up until an event has averages of 2, 3.5, and 1.5 in attributes X, Y, and Z respectively, and the revenue and time outcomes of that event were $25 and 75; I will now use this as an input for my training.
Once again for clarity, here is an example of the output I would expect from applying this logic to the original table:
EventID Date     Revenue Time(s) UserID X Y   Z
1       1/1/2017 $10     120     1      0 0   0
1       1/1/2017 $15     150     2      0 0   0
2       2/1/2017 $50     60      1      3 2   2
2       2/1/2017 $45     100     4      0 0   0
3       3/1/2017 $25     75      1      2 3.5 1.5
3       3/1/2017 $20     210     2      2 1   2
3       3/1/2017 $25     120     3      0 0   0
3       3/1/2017 $15     100     4      3 5   2
4       4/1/2017 $75     25      4      3 3   1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of their first appearance. In rows 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of their attributes and then shifting the data in the attribute columns down one row. My questions are:
Is there a way to perform such a shift filtered by UserID when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table, always placing the result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-liner title, so I'm open to suggestions from any R experts who might think of a better way of describing it.
We can achieve your desired output using the dplyr package.
library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), ~ ifelse(is.na(.), 0, .)) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it by user, and then apply the desired transformations (the dplyr functions cummean and lag, plus an ifelse that replaces NA with 0).
Once this is done, we rearrange the data back to its original order and ungroup it.
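On current dplyr versions (1.0+), where mutate_at() is superseded, the same logic can be sketched with across() (a sketch, assuming the same tablinka column names):
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate(across(c(X, Y, Z), cummean),                         # running mean per user
         across(c(X, Y, Z), lag),                             # shift down one appearance
         across(c(X, Y, Z), ~ ifelse(is.na(.x), 0, .x))) %>%  # first appearance becomes 0
  arrange(EventID, UserID) %>%
  ungroup()
mutate() evaluates its arguments in order, so each across() call sees the columns already updated by the previous one.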

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100           1             0
100           2             0
100           3             0
496           2             6
511           2            10
846           1             4
846           2             6
846           3             1
338           1             6
338           2             0
Now I require the output as:
KId sales_month quantity_sold result
100           1             0      1
100           2             0      1
100           3             0      1
496           2             6      1
511           2            10      1
846           1             4      1
846           2             6      1
846           3             1      0
338           1             6      1
338           2             0      1
Here, the calculation has to go as follows: if the quantity sold in the month of March (3) is less than 60% of the combined quantity sold in January (1) and February (2), then the result should be 1; otherwise it should display 0. I need a solution to perform this.
Thanks in advance.
If I understand well, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
library(dplyr)

resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold, 1),
         monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold / previous2MonthsQty >= 0.6, 0, 1)) %>%
  select(KId, sales_month, quantity_sold, result)
The result is as below (screenshot of the output omitted):
Adding
select(KId, sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NAs in the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add a year column and adjust the group_by() arguments appropriately.
For more information on the dplyr package, follow this link.
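For a self-contained run, the input table from the question can be sketched as follows (the object name data matches the code above):
data <- data.frame(KId = c(100, 100, 100, 496, 511, 846, 846, 846, 338, 338),
                   sales_month = c(1, 2, 3, 2, 2, 1, 2, 3, 1, 2),
                   quantity_sold = c(0, 0, 0, 6, 10, 4, 6, 1, 6, 0))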
