SQL averaging over column values for an id in the table, grouped by values in the same column for a different id in the table - sqlite

I have searched SO but can't find a case that matches mine. The same column in my table needs to be used for both averaging and grouping, but the catch is that the question ID is different for each of the two. Let me explain:
This is my 'Answers' table:
Question ID 1 is "What is your age" and Question ID 2 is "What is your gender". I found the top 7 genders; these are: 'Male', 'Female', 'male', 'female', '-1', 'Nonbinary', 'non-binary'
AnswerText contains the answers for each question. I want to get the average age for each gender category in the list.
I have done this:
SELECT AVG(AnswerText), Gender
FROM
(
SELECT AnswerText as Gender
FROM Answer
WHERE Answer.QuestionID = 2
AND Gender IN ('Male', 'Female', 'male', 'female', '-1', 'Nonbinary', 'non-binary')
)
WHERE Answer.QuestionID = 1
GROUP BY Gender
This throws the error 'no such column: AnswerText'.
I am using SQLite. How do I achieve this? I can provide the table if required.
Any guidance is appreciated. Thanks

SELECT AVG(AnswerText), Gender
FROM
(
SELECT AnswerText as Gender -- Try removing "as Gender"; the alias is not necessary
FROM Answer
WHERE Answer.QuestionID = 2
AND Gender IN ('Male', 'Female', 'male', 'female', '-1', 'Nonbinary', 'non-binary') -- Remove the AND clause; I think it is redundant
)
WHERE Answer.QuestionID = 1
GROUP BY Gender -- This should group by all the different genders, or you can group them later
This is just from a conceptual point of view. I don't have the software available to help more.
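For what it's worth, the error in the question comes from scoping: the outer query can only see the columns the subquery returns, and the subquery returns just Gender, so AnswerText and QuestionID do not exist at the outer level. One possible way around this is to join the table to itself so that each respondent's age row is paired with their gender row. The sketch below assumes the table has a respondent identifier column, here called UserID. That column name is hypothetical, since the table layout is not shown in the question, and the sample rows are made up purely for illustration. It runs the query against an in-memory SQLite database from Python:

```python
import sqlite3

# In-memory database with made-up sample rows.
# ASSUMPTION: a UserID column links one respondent's answers together;
# the real table may use a different name for it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Answer (UserID INTEGER, QuestionID INTEGER, AnswerText TEXT);
INSERT INTO Answer VALUES
  (1, 1, '30'), (1, 2, 'Male'),
  (2, 1, '40'), (2, 2, 'Male'),
  (3, 1, '25'), (3, 2, 'Female'),
  (4, 1, '50'), (4, 2, 'Other');
""")

# Self-join: alias "a" holds the age answers (QuestionID = 1),
# alias "g" holds the gender answers (QuestionID = 2) of the same respondent.
query = """
SELECT g.AnswerText AS Gender,
       AVG(CAST(a.AnswerText AS REAL)) AS AvgAge
FROM Answer AS a
JOIN Answer AS g ON g.UserID = a.UserID
WHERE a.QuestionID = 1
  AND g.QuestionID = 2
  AND g.AnswerText IN ('Male', 'Female', 'male', 'female',
                       '-1', 'Nonbinary', 'non-binary')
GROUP BY g.AnswerText
ORDER BY g.AnswerText
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('Female', 25.0), ('Male', 35.0)]
```

The respondent who answered 'Other' is filtered out by the IN list. The CAST is there because AnswerText holds the ages as text; if some age answers are not numeric, they would need to be filtered out first.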

Related

How to replace id's with same value in a large data set

I have a large data set, 513 observations in total. One of the columns in the data set is an ID.
However, in my data set one ID is repeated (e.g. Id1 = 1756, Id2 = 1756).
How do I correct this and separate the two IDs without touching the other rows, columns, and values of the data set? I have tried using this:
Initial_df[nrow(Initial_df$sofifa_id) +1, 1:17] = c('1757', '230204',
'71', '9000', '26', '182', '65', 'KV Kortrijk', 'Right', '85', '66',
'65', '71', '54', '57', '52', '69', '1' )
Thank you so much for your time.

Is there a way to map or match people's names to religions in R?

I'm working on a paper on electoral politics and tried using this dataset to calculate the share of the electorate that each religion represents. I created a Christian variable and an if() statement, and tried to increase the number of Christians by one whenever a Christian name pops up, but was unable to do so. I would appreciate it if you could help me with this.
library(dplyr)
library(ggplot2)
Christian=0
if(Sample...Sheet1$V2=="James"){
Christian=Christian+1
}
PS
The Output
Warning message:
In if (Sample...Sheet1$V2 == "James") { :
the condition has length > 1 and only the first element will be used
Notwithstanding my comment about the fundamental non-validity of this approach, here’s how you would solve this general problem in R:
Generate a lookup table of the different names and categories — this table is independent of your input data:
religion_lookup = tribble(
~ Name, ~ Religion,
'James', 'Christian',
'Christopher', 'Christian',
'Ahmet', 'Muslim',
'Mohammed', 'Muslim',
'Miriam', 'Jewish',
'Tarjinder', 'Sikh'
)
Match your input data against the lookup table (I’m using an input table data with a column Name instead of your Sample...Sheet1$V2):
matched = match(data$Name, religion_lookup$Name)
religion = religion_lookup$Religion[matched]
Count the results:
table(religion)
religion
Christian Jewish Muslim Sikh
2 5 3 1
Note the lack of ifs and loops in the above.
Christian <- sum( Sample...Sheet1$V2=="James" )
There you go; you don't need the if block.

How do I merge specific rows into a new column?

My current dataset has an education variable which has 18 categories ranging from 'no qualifications' to 'Postgraduate'
I want to create a new education variable that will consist of only 5 categories (e.g. no qualifications - primary school - secondary school - bachelors - postgrad). I would like to merge some of the 18 categories together to form one category in my new variable (e.g. categories 3, 4, 5 into secondary school).
You could just recode the categories with recode() from dplyr. It goes like:
library(dplyr)
name_of_dataset[[number_of_old_column]] %>% recode('1st grade' = 'primary school', '2nd grade' = 'primary school')
and so on. Old name goes first, new name goes second. You can put the new data in a new column with mutate() from dplyr. It goes like:
name_of_dataset %>% mutate(name_of_new_column = recode(.[[number_of_old_column]], '1st grade' = 'primary school', '2nd grade' = 'primary school'))

How to apply operation to each row after looking up info from another data frame in R

I am new to R and was wondering how to do the following:
I have a data frame called 'wage' which has features like
First.Name Last.Name Hourly.Pay
Lara Davis 39.29
John Childers 35.12
Lara Grace 40.16
In 'wage' the first name can be non-unique. I have another data frame called 'wage_gender' which has features like
name gender ProbMale ProbFemale
Lara Female 0.0088 0.9912
John Male 0.992 0.008
The 'name' values in wage_gender are all unique and should correspond to the First.Name in 'wage'. The two data frames are not of the same size. Also, some names in wage may not be in wage_gender; in that case the gender should be set to NA.
I want to add a 'gender' feature to the 'wage' data frame by looking up the genders from 'wage_gender'. However, I can't seem to get it to work. Here is what I have:
f = function(r, gen)
r$gender = gen[which(gen$name == r$First.Name),]$gender
apply(wage, 1, f, gen=wage_gender)
Basically, I expect apply to use 'f' over each row, look for the name in 'wage_gender', and assign the appropriate gender, but it throws an error: Error in r$First.Name : $ operator is invalid for atomic vectors. I am not sure what I am doing wrong.
A different way to do this is to add the names as row.names in wage_gender and then just use that as a lookup table.
row.names(wage_gender) = wage_gender$name
wage_gender[wage$First.Name, "gender"]
[1] "Female" "Male" "Female"
That will also give NA if the name is not in wage_gender
Just rename the column 'name' to 'First.Name' in 'wage_gender':
names(wage_gender)[i] <- "First.Name" # where i is the index of the column currently named 'name'
You can also rename it like this (it's more elegant, but longer):
names(wage_gender)[names(wage_gender) == "name"] <- "First.Name"
And then merge the two data frames, keeping every row of 'wage' so that names missing from 'wage_gender' get NA:
new.df <- merge(wage, wage_gender, by = "First.Name", all.x = TRUE)

Compare Different Columns in a dataframe in R

I have a data frame which looks like this:
State Rank1 Rank2 Rank3 Rank4
1 37.20% 32.88% 20.92% 7.02%
2 44.01% 30.15% 22.68% 1.54%
3 49.72% 48.86% 47.61% 46.50%
4 60.40% 30.35% 26.34% 49.78%
The data set contains data from last year's election in a particular geography. Column A contains the state code, and columns B:E contain the vote shares of the top 4 parties in each state.
My task is to divide the states into 4 categories based on certain criteria:
Unipolar - The winner faced no difficulty in getting a majority. In most cases like this, the winner crosses the half-way mark, i.e. gets more than 50% of the vote share (VS is used in the dataset).
Bi-Polar - The electorate has shown significant indecision in deciding who should get the majority. The runner-up has got substantial votes and there is considerable distance between the runner-up and the 3rd contestant.
Multi-polar - Typically multiple contestants have got substantial votes, and even a minor swing in the votes would have resulted in a different outcome.
Divided-Unipolar - The winner has got a clear mandate, but the electorate has shown indecision in deciding whom to vote for next. More than one contender has got almost the same share of votes.
How do I do this in R, given that the vote shares can be very close to each other? Thanks in advance.
Cheers!
Use a data.frame for your votes:
votes = data.frame(State = c(1:4), Rank1 = c(37.20, 44.01, 49.72, 60.40),
Rank2 = c(32.88, 30.15, 48.86, 30.35),
Rank3 = c(20.92, 22.68, 47.61, 26.34),
Rank4 = c(7.02, 1.54, 46.50, 49.78))
Then load the dplyr library and use its case_when function.
For example:
library(dplyr)
votes$type = case_when(
votes$Rank1 > unipolar_limit | votes$Rank1 - votes$Rank2 > unipolar_limit2 ~ "Unipolar"
, votes$Rank2 - votes$Rank3 > bipolar_limit ~ "Bipolar"
, votes$Rank1 - votes$Rank3 < multi_limit ~ "Multi-polar"
, votes$Rank1 > 50 & votes$Rank2 - votes$Rank4 < divided_limit ~ "Divided-Unipolar"
)
Something like that, where unipolar_limit, bipolar_limit and so on are thresholds you define yourself, with whatever limits or conditions you like.
