Counting rows with unique ID in RStudio with length(unique)

I have a dataset with people's IDs (some people (IDs) have multiple rows) and risk scores (coded as a number between 1 and 7, with some NAs), and I wanted to count the number of people in each risk group without counting the same person twice.
When I create a subset containing only one row per person, I obtain a certain number of people in each group. However, when I use this function (for each risk group):
length(unique(data$person_id[data$RISK == 1]))
it seems like I get one extra person in each risk group (so 7 extra people in total).
Does someone have an explanation for this? Do I have to do -1 each time I use this function?
Thanks in advance!
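For what it's worth, a likely explanation, sketched below with made-up data (only the column names person_id and RISK come from the question): NA values in RISK. In R, logical subsetting with NA keeps an NA element, and unique() then counts that NA as one extra "person" per group:
# Toy data: two distinct people in risk group 1, plus one row with RISK = NA
data <- data.frame(person_id = c(10, 10, 11, 12),
                   RISK      = c(1, 1, 1, NA))

data$person_id[data$RISK == 1]
# [1] 10 10 11 NA   <- the NA row leaks into the subset

length(unique(data$person_id[data$RISK == 1]))
# [1] 3             <- 2 real people + 1 spurious NA

# Wrapping the condition in which() drops the NAs and gives the expected count:
length(unique(data$person_id[which(data$RISK == 1)]))
# [1] 2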

Related

How to assign numerical values in one column to text in another column

Basically I'm trying to sort through a large dataset where the first 3 numbers correspond to different texts. Before I can filter it, I'm trying to assign values to the different strings.
Crops 101
Fishing 102
Livestock 103
Movies, TV, & Stage 201
In the larger dataset there are hundreds of numbers such as 1018347, where the first three digits correspond to crops, together with the number of times that value appeared; the digits after specify what type of crop. For the purpose of my work I need to sort through the entire thing by the first three digits and sum the number of times each one occurred. I'm fairly new to R and wasn't able to find a sufficient answer, so any help would be appreciated.
Not sure if I am getting your question correctly, but it seems you are looking for a way to first create a new variable based on the first three digits and afterwards summarize the results as a count per category.
What could work is:
library(dplyr)

data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%
  group_by(first_part) %>%
  summarize(occurrences = n())
The code above counts the number of times each "first_part" (the first three digits) occurs. The same can be done for the second part, or for both together.
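As a quick self-contained check of that pipeline (the column name variable comes from the answer; the sample codes are invented):
library(dplyr)

# Invented codes: the first three digits are the category
data <- data.frame(variable = c("1018347", "1012233", "1020001", "2015550"))

data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%
  group_by(first_part) %>%
  summarize(occurrences = n())
# first_part occurrences
# 101                  2
# 102                  1
# 201                  1
Note that substr() also works if the codes are stored as numbers, since it coerces them to character first.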

Is there an R function where I can get the names within a specific column in my dataset

Edit: using the aid from one of the users, I was able to use "table(ArrestData$CHARGE)"; yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges; is there code for this? Additionally, I am looking at a particular council district (another variable, titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in the rows of a specific column of data. I would like to see which names are cumulatively being used.
After I find this, I would like to know how many times each name appears, whether as a count or a percentage. After that, I would like to see how many times each name appears under the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. For 2017-2018, I am attempting to see which charges, and how many of each, were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count the occurrences of each distinct value, you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit is aimed at answering your second question, following on from my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, you just have to order the column Freq in descending order and keep the first five rows:
head(df[order(-df$Freq), ], 5)
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, call it count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe, group_variable is the variable for your groups and count_variable is the variable you want to count. Now you just have to order the object so that, within each group, the most frequent values of count_variable come first:
x_or %>%
  arrange(group_variable, desc(freq))
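Applied back to the original question (assuming the data frame is called ArrestData, as in the edit above), a compact dplyr sketch for the top five charges in Council District 5 could look like:
library(dplyr)

ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%   # restrict to the district of interest
  count(CHARGE, sort = TRUE) %>%       # frequency of each charge, largest first
  slice_head(n = 5)                    # keep the top five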

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector with the + operator
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
As an example, let's use this data from a national survey with a questionnaire.
If you download the .csv file to your working directory:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it helps you provide meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are in columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
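Mapped back to the question's own column names (item.1 through item.20; the items chosen for the subscore are just an illustration), the same rowSums idea gives both scores directly:
# Total score across all 20 items (0-60 given answers of 0-3)
data$total.score <- rowSums(data[ , paste0("item.", 1:20)])

# Subscore over a selection of items, e.g. items 1, 3 and 5
data$subscore <- rowSums(data[ , paste0("item.", c(1, 3, 5))])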

Aggregate and then count?

I have basic knowledge of R, but I'm not sure how to go about the following:
I have a large data frame of data from 25 participants. Each participant has a different total number of sound files (e.g., #1 might have 105 and #2 might have 98). For each participant, we've coded whether or not they're in a specific location (i.e., if they're in their apartment, 1; if not, 0).
I want to get the count of how many times Apartment=1 for each participant. This is where I'm confused: my Excel sheet is organized such that all of the participants are stacked on top of each other. So the first 105 rows are for Participant #1, and the next 98 rows are for Participant #2.
Do I need to aggregate my data by participant number, and then get the count?
I've never written a for loop, but is that another possible solution?
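No loop is needed here: grouping handles the stacked layout directly. A minimal sketch (the column names participant and Apartment are assumptions):
# Base R: aggregate sums the 0/1 Apartment flag within each participant
aggregate(Apartment ~ participant, data = df, FUN = sum)

# The dplyr equivalent:
library(dplyr)
df %>%
  group_by(participant) %>%
  summarize(times_in_apartment = sum(Apartment == 1, na.rm = TRUE))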

Complex dataframe selecting and sorting by quintile

I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in initial analysis and numerical summary that there are significant issues with outlier observations within some of the specific grouping criteria. I used group_by and looked at the quintile results within those groups. I would like to eliminate the low and high individual-observation outliers relative to the group-based quintiles, to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for the root cause.
How do I manipulate the dataframe so that individual observations are compared to the group-based quintile results and the resulting split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse to create a new variable that identifies which records are outliers and which are not. This way you keep the data, but you can use this new variable to filter them out whenever you want.
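A sketch of that flag-then-split approach with dplyr (group_col and value are placeholders for one of the five descriptive columns and a measurement column):
library(dplyr)

flagged <- orig_df %>%
  group_by(group_col) %>%
  mutate(lo = quantile(value, 0.20, na.rm = TRUE),   # group-level quintile cutoffs
         hi = quantile(value, 0.80, na.rm = TRUE),
         outlier = ifelse(value < lo | value > hi, TRUE, FALSE)) %>%
  ungroup()

ideal_df   <- filter(flagged, !outlier)   # trimmed data for the tree/cluster models
outlier_df <- filter(flagged, outlier)    # kept separately for root-cause analysis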
