Combining/aggregating data in R

I feel like this is a really simple question, and I've looked a lot of places to try to find an answer to it, but everything seems to be looking to do a lot more than what I want--
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories that are London and notLondon. How would I do this? I'd then want to be able to run a lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!

There are multiple parts to your question so let's take it step by step.
1.
For the first part this is a great use of forcats::fct_collapse() (forcats is loaded with the tidyverse). See example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, T) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
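Applied to the setup in the question, the same idea might look like the sketch below. The data is simulated and the column names region and region2 are assumptions for illustration (only fom comes from the question). It uses forcats::fct_other() rather than fct_collapse(), since it keeps the named level and lumps everything else.

```r
library(forcats)

# Simulated stand-in for the real data (hypothetical columns).
set.seed(1)
dat <- data.frame(
  region = factor(sample(c("Buckinghamshire", "Sussex", "London"), 30, replace = TRUE)),
  fom    = rnorm(30)
)

# Keep "London" and lump every other level into "notLondon".
dat$region2 <- fct_other(dat$region, keep = "London", other_level = "notLondon")
table(dat$region2)

# The two-level factor can then be used directly in a model formula.
fit <- lm(fom ~ region2, data = dat)
summary(fit)
```

The original region column is left untouched, so both the detailed and the lumped factor remain available.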
2.
For the second part, you will have to clarify and share some data to get more help.
3.
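Here you can use dplyr::summarize() but you need to share some data to get an answer with your specific case.
However something like:
df %>% group_by(birth_year) %>% summarize(n = n_distinct(participant_id))
will give you the number of distinct participants with that birth year, assuming a column such as participant_id (a hypothetical name) identifies each participant. A plain summarize(n = n()) would count rows, i.e. observations, which is the 265 you are already seeing.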
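To make the difference between counting observations and counting participants concrete, here is a small sketch on made-up data. The participant_id column and its values are assumptions; substitute whatever identifies a participant in the real dataset.

```r
library(dplyr)

# Toy data: three participants with repeated observations (hypothetical values).
obs <- data.frame(
  participant_id = c("p1", "p1", "p1", "p2", "p2", "p3"),
  birth_year     = c(1996, 1996, 1996, 1996, 1996, 1990)
)

per_year <- obs %>%
  group_by(birth_year) %>%
  summarize(
    n_obs          = n(),                        # rows per birth year
    n_participants = n_distinct(participant_id)  # unique people per birth year
  )
per_year
```

Here 1996 has five observations but only two participants, which is the distinction the question is after.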

Related

Merge two rows in columns based on condition

I am new here, as well as to R, and I couldn't find any past queries that answered my following question, so I apologise if this has already been brought up before.
I am trying to merge the ID columns from two different datasets into one, but some of the IDs in the rows have been coded differently. I need to replace all the "LNZ_" IDs with the "LNZ.", however, I cannot figure out how I would go about doing this.
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
Sex=c("M","F","F","M","F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
Type=c("S","S","B","B","B"),
AGE=c(56,75,66,64,64))
The above is similar to the datasets that I have, only more scaled down. I hope this is somewhat clear, and any help would be appreciated.
Thanks!

The issue with merging is that your ID columns have different formatting for some of the entries which are supposed to be matched. Therefore you need to modify those values to match before performing the merge. In the examples you gave, the difference is between a period separator (.) and an underscore (_). If your real data has more complex issues, you may need to use different code to clean up those values.
However, once that is resolved, you can perform your merge easily. Here I've used the {tidyverse} packages to accomplish both steps in one pipe chain.
library(tidyverse)
df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"), Sex=c("M","F","F","M","F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"), Type=c("S","S","B","B","B"), AGE=c(56,75,66,64,64))
df_2 %>%
  mutate(ID_2 = str_replace(ID_2, "\\.", "_")) %>% # turn the "." separator into "_" so the IDs match df_1
  left_join(df_1, by = c("ID_2" = "ID_1"))
#> ID_2 Type AGE Sex
#> 1 LNZ_00001 S 56 M
#> 2 LNZ_00002 S 75 F
#> 3 LNZ_00003 B 66 F
#> 4 DFG00001 B 64 M
#> 5 CWD00001 B 64 F
Created on 2022-07-17 by the reprex package (v2.0.1)
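If the real data mixes separators in both tables, one option is to normalise both ID columns with the same helper and check for leftovers before joining. This is only a sketch: normalise_id() is a hypothetical helper, and the rule that "." and "_" are interchangeable is an assumption taken from the example data.

```r
library(dplyr)
library(stringr)

df_1 <- data.frame(ID_1 = c("LNZ_00001", "LNZ_00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Sex = c("M", "F", "F", "M", "F"))
df_2 <- data.frame(ID_2 = c("LNZ.00001", "LNZ.00002", "LNZ_00003", "DFG00001", "CWD00001"),
                   Type = c("S", "S", "B", "B", "B"),
                   AGE = c(56, 75, 66, 64, 64))

# Hypothetical helper: treat "." and "_" as the same separator.
normalise_id <- function(id) str_replace_all(id, "[._]", "_")

df_1n <- df_1 %>% mutate(ID = normalise_id(ID_1))
df_2n <- df_2 %>% mutate(ID = normalise_id(ID_2))

# Rows of df_2 that still have no partner in df_1 after cleaning.
unmatched <- anti_join(df_2n, df_1n, by = "ID")

# Join on the cleaned key once nothing is left unmatched.
merged <- inner_join(df_1n, df_2n, by = "ID")
```

An empty unmatched table is a quick sign that the cleaning rule covers every ID.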

How to correct misspelling in column and collapse values into correct row in R

I'm rather new to R and struggling with data tidying. I have a problem that I can't find an answer to, but maybe I'm searching with the wrong terms.
I have a table (df_samples) in the following format:
species  gender  group  sample1  sample2  sample n
penguin  m       i.     20       21       n
penguin  f       i.     NA       18       n
lion     m       ii.    5        4        n
lion     f       ii.    2        9        n
penguin  f       ii.    22       NA       n
tiger    m       ii.    7        6        n
tiger    f       ii.    6        8        n
Now, the problem here is the penguin with group ii. which is wrong and should be i. In my table there are several hundred different species and samples. I have this problem with several rows, where species have the wrong group.
I was able to find the specific rows with the problems using the following code:
n_occur <- data.frame(table(df_samples$species))
df_samples_2 <- df_samples[df_samples$species %in% n_occur$Var1[n_occur$Freq > 2],]
This gives me the problematic rows, and I can view them in their own data frame. There I am able to identify the rows with the mistakes and could correct them. But I am stuck on two problems.
First I don't know how to index the problematic value to change it directly in my original data frame.
Second I have no idea how to bring the data stored in the row with the mistake to the "correct" row.
I am sure, there are answers on the web, but I am really struggling to express my problem in a way, which allows me to find them.
I would be grateful if somebody is able to help, either by pointing out how to search or by solving the problem.

There are a few ways to do this.
Assume each species belongs to a single group
If every species belongs to exactly one group, you can use a named vector that maps each species to its group and use it to overwrite the current group column.
Note that this will replace ALL group values within the same species.
base R
correct_group <- c("penguin" = "i.", "tiger" = "ii.", "lion" = "ii.")
df$group <- correct_group[match(df$species, names(correct_group))]
dplyr
library(dplyr)
df %>% mutate(group = correct_group[match(species, names(correct_group))])
If you are doing it by hand:
We can also do it one by one if the species do not belong to the same group (only if you have a few records to change).
First identify the row index where species is "penguin" and group is "ii.". This is the record that you would like to change. Then simply replace the group value with "i.".
base R
df[which(df$species == "penguin" & df$group == "ii."), "group"] <- "i."
dplyr
library(dplyr)
df %>% mutate(group = ifelse(species == "penguin" & group == "ii.", "i.", group))
Output
All of the above methods produce the same output.
species gender group sample1 sample2 sample.n
1 penguin m i. 20 21 n
2 penguin f i. NA 18 n
3 lion m ii. 5 4 n
4 lion f ii. 2 9 n
5 penguin f i. 22 NA n
6 tiger m ii. 7 6 n
7 tiger f ii. 6 8 n
Remember for the dplyr methods, you have to "save" the df back to it (df <- df %>% dplyr::method), otherwise, it will only output the results to the console without actually changing anything.

Using your process you can try the following steps.
Add a unique ID to the rows so that they can be filtered later (rowid_to_column() is from the tibble package), then recreate df_samples_2 from the updated df_samples so that it carries the rowid column.
df_samples <- df_samples %>%
  rowid_to_column()
Remove the problem rows from df_samples based on the rowid in df_samples_2
df_samples <- df_samples[-df_samples_2$rowid, ]
Update df_samples_2 as per your requirements, with row-by-row mutates based on rowid.
Merge the corrected rows back into df_samples
df_samples <- bind_rows(df_samples, df_samples_2)
Also, if your end goal and data are as described above, you could try this on your original df_samples. Note that it assumes the correct group is the one that sorts first within each species.
df_samples <- df_samples %>%
  group_by(species) %>% # this will create internal groups
  arrange(species, group) %>% # will ensure i. comes before ii.
  mutate(group = first(group)) %>% # copy the first group to every row of the species; lag() would only fix one wrong row per species
  ungroup()
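For the second part of the question, moving the values from the miscoded row into the "correct" row, one possible sketch is to fix the group first and then collapse rows that share species, gender and group, taking the first non-missing value in each sample column. The data below is retyped from the question (without the placeholder "sample n" column). Taking the first non-NA value is an assumption, and it would silently drop a second value if both rows had one.

```r
library(dplyr)

df_samples <- data.frame(
  species = c("penguin", "penguin", "lion", "lion", "penguin", "tiger", "tiger"),
  gender  = c("m", "f", "m", "f", "f", "m", "f"),
  group   = c("i.", "i.", "ii.", "ii.", "ii.", "ii.", "ii."),
  sample1 = c(20, NA, 5, 2, 22, 7, 6),
  sample2 = c(21, 18, 4, 9, NA, 6, 8)
)

merged <- df_samples %>%
  # Correct the miscoded group for penguins.
  mutate(group = if_else(species == "penguin" & group == "ii.", "i.", group)) %>%
  # Rows that are now duplicates on the key columns are collapsed into one.
  group_by(species, gender, group) %>%
  summarise(across(starts_with("sample"), ~ .x[!is.na(.x)][1]), .groups = "drop")

merged
```

After this step the female penguin occupies a single row with sample1 = 22 and sample2 = 18.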

Conditional sampling by group based on sample mean

I am trying to use R to make a bunch of different trivia quizzes. I have a large dataset (quiz_df) containing numerous questions divided into categories and difficulties looking like this:
ID Category Difficulty
1 1 Sports 3
2 2 Science 7
3 3 Low culture 4
4 4 High culture 2
5 5 Geography 8
6 6 Lifestyle 3
7 7 Society 3
8 8 History 5
9 9 Sports 2
10 10 Science 8
... ... ... ...
1000 1000 Science 3
Now I want to randomly sample 3 questions from each of the 8 categories, so that the mean difficulty is 4 (or the sum being 4*24 = 96).
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3)
This creates a random quiz set with 3 questions in each category, but does not take into consideration the difficulty. I am aware of the weight-option in sample_n:
library(dplyr)
set.seed(100)
quiz1 <- quiz_df %>% group_by(Category) %>% sample_n(3, weight = Difficulty)
But this does not solve the issue. Ideally, I would like to add an option like: sum = 96, or mean = 4.
Does anyone have any clues?

This is a brute-force solution:
library(dplyr)
# Generating sample data.
set.seed(1986)
n = 1000
quiz_df = data.frame(id = 1:n,
                     Category = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
                     Difficulty = sample(1:10, size = n , replace = TRUE))
# Solution: resample until condition is met.
repeat {
  temp.draw = quiz_df %>% group_by(Category) %>% slice_sample(n = 3) # From documentation, sample_n() is outdated!
  temp.mean = mean(temp.draw$Difficulty)
  if (temp.mean == 4) # Check if the draw satisfies your condition.
  {
    final.draw = temp.draw
    break
  }
}
final.draw
mean(final.draw$Difficulty)
First of all, as you are new to SO, let me tell you that you should always include some example data in your questions: not just the structure, but something we can run on our machines. This time I simply simulated some data, with only three categories. My solution runs in less than two seconds; with the whole data set the code may need more time.
The idea is to resample until we get a draw, three questions per category (24 questions with your 8 categories), whose mean Difficulty equals 4. Clearly, this is not an elegant solution, but it may be a first step.
I am going to try again to find a better solution. I guess the problem is that the draws are not independent; I will look deeper into that.
PS: from the documentation I see that sample_n() has been superseded by slice_sample(), so I suggest relying on the latter.
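Since a repeat loop never stops if no draw can satisfy the condition, a slightly more defensive variant of the same brute-force idea caps the number of attempts. This is only a sketch: draw_quiz() and its arguments are hypothetical names, the data is simulated as above, and the check compares the sum of difficulties rather than the mean.

```r
library(dplyr)

set.seed(1986)
n <- 1000
quiz_df <- data.frame(
  ID         = 1:n,
  Category   = sample(c("Sports", "Science", "Society"), size = n, replace = TRUE),
  Difficulty = sample(1:10, size = n, replace = TRUE)
)

# Hypothetical helper: resample until the target is hit or max_tries is exhausted.
draw_quiz <- function(data, per_category = 3, target_mean = 4, max_tries = 10000) {
  target_sum <- target_mean * per_category * n_distinct(data$Category)
  for (i in seq_len(max_tries)) {
    draw <- data %>%
      group_by(Category) %>%
      slice_sample(n = per_category) %>%
      ungroup()
    if (sum(draw$Difficulty) == target_sum) return(draw)
  }
  stop("No draw matched the target within max_tries attempts.")
}

quiz <- draw_quiz(quiz_df)
mean(quiz$Difficulty)
```

Calling draw_quiz() repeatedly, without resetting the seed, would give different quizzes that all meet the same target.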

How can I count and compare data over multiple columns in R?

I am working with a dataframe which contains text data which has been categorised and coded. Each numerical value from 1-12 represent a type of word.
I want to count the frequencies of occurrence each number (1 to 12) over 6 columns (pre1 to pre6) so I know how many types of words have been used. Could anyone please advise on how to do this?
My df is structured as such:

Would something like that work for you?
library(dplyr)
df <- data.frame(pre1 = c(sample(1:12, 10)),
pre2 = c(sample(1:12, 10)),
pre3 = c(sample(1:12, 10)),
pre4 = c(sample(1:12, 10)),
pre5 = c(sample(1:12, 10)),
pre6 = c(sample(1:12, 10)))
count <- count(df, pre1, pre2, pre3, pre4, pre5, pre6)

One solution is this:
library(tidyverse)
mtcars %>%
select(cyl, am, gear) %>% # select the columns of interest
gather(column, number) %>% # reshape
count(column, number) # get counts of numbers for each column
# # A tibble: 8 x 3
# column number n
# <chr> <dbl> <int>
# 1 am 0 19
# 2 am 1 13
# 3 cyl 4 11
# 4 cyl 6 7
# 5 cyl 8 14
# 6 gear 3 15
# 7 gear 4 12
# 8 gear 5 5
In your case column will take the values pre1, pre2, etc., number will take the values 1 - 12, and n will be the count of a specific number in a specific column.

It is not entirely clear from the question whether you want frequency tables for all of these columns together or for each column separately. In any further questions you should also make clear whether those numbers are coded as numerics, as characters or as factors (the result of str(pCat) is a good way to show that). For this particular question, it fortunately does not matter.
The answers I have already given in the comments are
table(unlist(pCat[,4:9]))
and
table(pCat$pre3)
as an extension for the latter, I shall also point to the comment by ANG , which boils down to
lapply(pCat[,4:9], table)
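A runnable sketch of these base R calls on simulated data follows. The data frame df and its pre1 to pre6 columns stand in for columns 4 to 9 of pCat, which was not shared. Wrapping the values in factor(..., levels = 1:12) is an optional addition so that codes that never occur still show up with a count of 0.

```r
# Simulated stand-in for the six coded columns (values 1 to 12).
set.seed(42)
pre_cols <- paste0("pre", 1:6)
df <- as.data.frame(replicate(6, sample(1:12, 10, replace = TRUE)))
names(df) <- pre_cols

# Frequencies over all six columns together.
overall <- table(factor(unlist(df[pre_cols]), levels = 1:12))
overall

# One frequency table per column.
per_col <- lapply(df[pre_cols], function(x) table(factor(x, levels = 1:12)))
per_col$pre3
```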
These are straightforward solutions with base R, without any unnecessary packages. The answers by JonGrub and AntoniosK are based on the tidyverse. There is no obvious need to import dplyr or tidyverse for this problem, but I guess the authors load those packages anyway whenever they use R, so it does not really impose any cost on them. Other great packages to base good answers on are data.table and sqldf. Those are good packages, and many people use them for things that could be done in base R. The packages promise to be clearer or faster, or to reuse knowledge you might already have. Nothing is wrong with that. However, I take your question as an indication that you are still learning R, and I would advise learning R first, before you become distracted by special packages and DSLs.
People have been using base R for decades and they will continue to do so. Learning base R will not distract you from R, and the knowledge will continue to be worthwhile for decades. Whether the same can be said for the tidyverse or data.table, time will tell (although sqldf is probably also a solid investment in the future, maybe more so than R).

Use grepl on each element of a dataframe column to find a string in a different data frame

I need to find if the elements on a data frame column are present in another data frame column in order to retrieve a count and a total.
Example
Data frame 1
Details<-data.frame(FirstName=c("Carlos SM","Carlos JOH","Carlos WIL","Carlos JON","Carlos BR","Peter D","Peter MILL","Peter WILS","Peter MOO","Homer T"),Points=c("3","4","7","6","4","9","1","2","1","9"))
Data frame 2
Results <- data.frame(Person=c("Carlos","Homer","Peter"))
The ideal output will add two columns to the data frame called Results one for the count of the times each string is found in the Details data frame and the other for a total sum of points. Like so
FirstName Appearances Total Points
Carlos 5 24
Peter 4 13
Homer 2 13

This should do the trick
Results$Appearances=sapply(Results$Person,function(x) sum(grepl(x,Details$FirstName)))
Results$`Total Points`=sapply(Results$Person,function(x) sum(grepl(x,Details$FirstName)*as.numeric(Details$Points)))
Results
Person Appearances Total Points
1 Carlos 5 22
2 Homer 1 7
3 Peter 4 11
Also, it seems like the numbers in your expected output are a little bit off. It is really confusing. Was it just your mistake, or did you want some unobvious way of character matching that would produce that kind of result?
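One thing to watch with a Points column that holds numbers as strings: in R versions before 4.0, data.frame() converts strings to factors by default, and as.numeric() on a factor returns the internal level codes rather than the printed values. The short sketch below builds the factor explicitly, so it behaves the same in any R version. The values are Carlos's five entries from the question.

```r
# Carlos's points, stored as a factor the way older R versions would store them.
points <- factor(c("3", "4", "7", "6", "4"))

codes  <- as.numeric(points)                # level codes: 1 2 4 3 2
values <- as.numeric(as.character(points))  # actual values: 3 4 7 6 4

sum(codes)   # 12, silently wrong
sum(values)  # 24, the intended total
```

Going through as.character() first is harmless when the column is already character, so it is a safe habit for data of unknown origin.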

Using tidyr and dplyr:
library(tidyr)
library(dplyr)
Details %>% separate(FirstName, c("Person", "last"), " ") %>%
  group_by(Person) %>%
  summarise(Appearances = n(),
            `Total Points` = sum(as.numeric(as.character(Points)))) %>% # Points is stored as text, so convert before summing
  left_join(Results, ., by = "Person")
