Combining 2 columns in R prioritizing one of them - r

I know nothing about R. I have a data.frame with 2 columns that both record the sex of the animals, but one of them has some corrections and the other doesn't.
My desired data.frame would be like this:
id sex father mother birth.date farm
0 1 john ray 05/06/94 1
1 1 doug ana 18/02/93 NA
2 2 bryan kim 21/03/00 3
But I got this data.frame by using merge on 2 other data.frames:
id sex.x father mother birth.date sex.y farm
0 2 john ray 05/06/94 1 1
1 1 doug ana 18/02/93 NA NA
2 2 bryan kim 21/03/00 2 3
data.frame 1 or Animals (Has the wrong sex for some animals)
id sex father mother birth.date
0 2 john ray 05/06/94
1 1 doug ana 18/02/93
2 2 bryan kim 21/03/00
data.frame 2 or Farm (Has the correct sex):
id farm sex
0 1 1
2 3 2
The code I used was: Animals_Farm <- merge(Animals, Farm, by="id", all.x=TRUE)
I need to combine the 2 sex columns into one, prioritizing sex.y. How do I do that?

If I understand your example correctly, you have a situation similar to the one shown below, which is based on the example from the merge function.
> (authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 3), "yes")))
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia yes
> (books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...", "LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
deceased = c("yes", rep("no", 6))))
name title deceased
1 Tukey Exploratory Data Analysis yes
2 Venables Modern Applied Statistics ... no
3 Tierney LISP-STAT no
4 Ripley Spatial Statistics no
5 Ripley Stochastic Simulation no
6 McNeil Interactive Data Analysis no
7 R Core An Introduction to R no
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
surname nationality deceased.x title deceased.y
1 McNeil Australia yes Interactive Data Analysis no
2 Ripley UK no Spatial Statistics no
3 Ripley UK no Stochastic Simulation no
4 Tierney US no LISP-STAT no
5 Tukey US yes Exploratory Data Analysis yes
6 Venables Australia no Modern Applied Statistics ... no
Here authors might represent your first data frame and books your second, and deceased might be the value that is present in both data frames but only up to date in one of them (authors).
The easiest way to only include the correct value of deceased would be to simply exclude the incorrect one from the merge.
> (m2 <- merge(authors, books[names(books) != "deceased"],
by.x = "surname", by.y = "name"))
surname nationality deceased title
1 McNeil Australia yes Interactive Data Analysis
2 Ripley UK no Spatial Statistics
3 Ripley UK no Stochastic Simulation
4 Tierney US no LISP-STAT
5 Tukey US yes Exploratory Data Analysis
6 Venables Australia no Modern Applied Statistics ...
The line of code books[names(books) != "deceased"] simply subsets the dataframe books to remove the deceased column leaving only the correct deceased column from authors in the final merge.
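If dropping the column before the merge is not an option, for example because animal 1 has no row in Farm and would otherwise lose its sex value, the two columns can also be combined after the merge. The following is a minimal base R sketch using the data from the question. It assumes that "prioritizing sex.y" means: use sex.y where it is present and fall back to sex.x where it is NA.

```r
# Data from the question
Animals <- data.frame(
  id = c(0, 1, 2),
  sex = c(2, 1, 2),
  father = c("john", "doug", "bryan"),
  mother = c("ray", "ana", "kim"),
  birth.date = c("05/06/94", "18/02/93", "21/03/00"),
  stringsAsFactors = FALSE
)
Farm <- data.frame(id = c(0, 2), farm = c(1, 3), sex = c(1, 2))

# Left join: keeps every animal, producing sex.x (Animals) and sex.y (Farm)
Animals_Farm <- merge(Animals, Farm, by = "id", all.x = TRUE)

# Take sex.y where available, otherwise fall back to sex.x
Animals_Farm$sex <- ifelse(is.na(Animals_Farm$sex.y),
                           Animals_Farm$sex.x,
                           Animals_Farm$sex.y)

# Drop the two source columns
Animals_Farm$sex.x <- NULL
Animals_Farm$sex.y <- NULL
Animals_Farm
```

With dplyr loaded, coalesce(sex.y, sex.x) expresses the same fallback rule.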

Related

summarize multiple binary variables in a single column

In a survey I conducted, I asked about the education level of the participants. The results are spread over several columns as binary variables. I would appreciate efficient ways to combine the results into a single variable. The tables below show the current and desired data format.
Current format:
ID   high school   college   PhD
1    high school   -1        -1
2    -1            college   -1
3    -1            -1        PhD
4    high school   -1        -1
Desired format:
ID   Educational background
1    high school
2    college
3    PhD
4    high school
To answer your specific question using the tidyverse, creating a test dataset with the code at the end of this post:
library(tidyverse)
df %>%
mutate(
across(-ID, function(x) ifelse(x == "-1", NA, x)),
EducationalBackground=coalesce(high_school, college, PhD)
)
ID high_school college PhD EducationalBackground
1 1 high_school <NA> <NA> high_school
2 2 <NA> college <NA> college
3 3 <NA> <NA> PhD PhD
4 4 high_school <NA> <NA> high_school
The code works by converting the text values of "-1" in your columns, which I take to be missing value flags, to true missing values. Then I use coalesce to find the first non-missing value in the three columns that contain survey data and place it in the new summary column. This assumes that there will be one and only one non-missing value in each row of the data frame.
That said, my preference would be to adapt your workflow at an earlier stage so that the problem never arises. But you haven't given any details of that, so I can't make any suggestions about how to do it.
Test data
df <- read.table(textConnection("ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1"), header=TRUE)
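Because the coalesce approach relies on each row having exactly one non-missing answer, it may be worth checking that assumption before combining the columns. The following is a small base R sketch of such a check on the same test data. The names answers, n_answers and bad_rows are only illustrative.

```r
df <- read.table(text = "ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1", header = TRUE, stringsAsFactors = FALSE)

# Treat "-1" as a missing value flag in the three survey columns
answers <- as.matrix(df[, c("high_school", "college", "PhD")])
answers[answers == "-1"] <- NA

# Count the non-missing answers per respondent
n_answers <- rowSums(!is.na(answers))

# IDs of respondents with zero or more than one answer (empty if the assumption holds)
bad_rows <- df$ID[n_answers != 1]
bad_rows
```

If bad_rows is not empty, coalesce would silently pick the first non-missing column for those rows (or return NA), so they probably need a closer look.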

Display which and how many variables correspond to conditions

I have a dataset of passenger names and their status (suppose there are 10 categories), like this:
Passenger   Status
Peter       Captain
Mary        Mrs.
Claudette   Mrs.
Marius      Doc.
Holmes      Mr.
...         ...
etc.
In R, how can I display how many passengers are characterised by a specific Status, and who they are?
I had in mind a table that represents a situation like "n passengers in the "Mrs." category and their names are Claudette, Mary, etc."
(I don't need the whole string message, only the number and their names)
How can I do it?
Simply using dplyr
library(dplyr)
dummy <- read.table(text = "Passenger Status
Peter Captain
Mary Mrs.
Claudette Mrs.
Marius Doc.
Holmes Mr.", header = TRUE)
dummy %>%
  group_by(Status) %>%
  # one row per Status: the count and a comma-separated list of names
  summarise(n = n(),
            names = paste0(Passenger, collapse = ", ")) %>%
  mutate(res = paste0(n, ' passengers into the ', Status, " category and their names are ", names))
Status n names res
<chr> <int> <chr> <chr>
1 Captain 1 Peter 1 passengers into the Captain category and their names are Peter
2 Doc. 1 Marius 1 passengers into the Doc. category and their names are Marius
3 Mr. 1 Holmes 1 passengers into the Mr. category and their names are Holmes
4 Mrs. 2 Mary, Claudette 2 passengers into the Mrs. category and their names are Mary, Claudette
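For comparison, the same count and list of names can probably be obtained without dplyr. This is a base R sketch on the same dummy data, using table for the counts and tapply for the names. The result name res is only illustrative.

```r
dummy <- data.frame(
  Passenger = c("Peter", "Mary", "Claudette", "Marius", "Holmes"),
  Status = c("Captain", "Mrs.", "Mrs.", "Doc.", "Mr."),
  stringsAsFactors = FALSE
)

# Number of passengers per Status
counts <- table(dummy$Status)

# Comma-separated passenger names per Status (same ordering as table())
names_by_status <- tapply(dummy$Passenger, dummy$Status, paste, collapse = ", ")

res <- data.frame(
  Status = names(counts),
  n = as.vector(counts),
  names = as.vector(names_by_status),
  stringsAsFactors = FALSE
)
res
```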

picking a value from one column filters another column

I am in the midst of a big data project where I want to filter one column based on a value selected in another.
For example, I want to showcase House 1 and, as a result, compare House 1 with the other URBAN houses only, not with all houses.
table <- data.frame(
house=paste("House", 1:15),
category = c("Urban", "Rural", "Suburban")
)
table
# house category
# 1 House 1 Urban
# 2 House 2 Rural
# 3 House 3 Suburban
# 4 House 4 Urban
# 5 House 5 Rural
# 6 House 6 Suburban
# 7 House 7 Urban
# 8 House 8 Rural
# 9 House 9 Suburban
# 10 House 10 Urban
# 11 House 11 Rural
# 12 House 12 Suburban
# 13 House 13 Urban
# 14 House 14 Rural
# 15 House 15 Suburban
I tried to give this a go, but it is not working for me...
table %>%
filter(house == house1) %>%
filter(category == table$house)
I want the output to look like this...
# house category
# 1 House 1 Urban
# 2 House 4 Urban
# 3 House 7 Urban
# 4 House 10 Urban
# 5 House 13 Urban
Any suggestions are truly appreciated.
Maybe you can try the base R code below
subset(df,category == category[house=="House1"])
or dplyr option
df %>%
filter(category == category[house == "House1"])
which gives
house category
1 House1 Urban
3 House3 Urban
5 House5 Urban
12 House12 Urban
13 House13 Urban
14 House14 Urban
dummy data
df <- structure(list(house = c("House1", "House2", "House3", "House4",
"House5", "House6", "House7", "House8", "House9", "House10",
"House11", "House12", "House13", "House14", "House15"), category = c("Urban",
"Suburban", "Urban", "Rural", "Urban", "Suburban", "Suburban",
"Rural", "Rural", "Suburban", "Suburban", "Urban", "Urban", "Urban",
"Rural")), class = "data.frame", row.names = c(NA, -15L))
Using dplyr you might do something like this
table %>%
filter(category == filter(table, house=="House 1") %>% pull(category))
Basically just a sub-query to find the category of House 1.
Try
table %>%
filter(category == "Urban")
Remember you need to put quotes " " around string values in your filter statements.
You can also use match which will return index of first match and you can get corresponding category from it.
subset(table, category == category[match('House 1', house)])
house category
1 House 1 Urban
4 House 4 Urban
7 House 7 Urban
10 House 10 Urban
13 House 13 Urban
Same code with filter if you want to use dplyr :
dplyr::filter(table, category == category[match('House 1', house)])
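The answers above all follow the same two steps: look up the category of the chosen house, then keep the rows with that category. A possible way to wrap this in a reusable base R function is sketched below. The function name same_category and the data frame name houses are hypothetical, and %in% is used instead of == so that the comparison still behaves sensibly if the chosen house is missing or appears more than once.

```r
houses <- data.frame(
  house = paste("House", 1:15),
  category = rep(c("Urban", "Rural", "Suburban"), 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep all rows that share a category with `target`
same_category <- function(data, target) {
  target_category <- unique(data$category[data$house == target])
  data[data$category %in% target_category, ]
}

res <- same_category(houses, "House 1")
res
```

If the target house does not exist, target_category is empty and the function returns a data frame with zero rows rather than an error.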

How to group similar strings together in a database in R

I have a tibble of just 1 column called 'title'.
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
I wish to find similar records and group them together as such (all the while keeping their indices):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
I'm using the below function to find strings that are close enough to each other (cutoff = 0.75)
compareJW <- function(string1, string2, cutoff)
{
require(RecordLinkage)
jarowinkler(string1, string2) > cutoff
}
I've implemented the loop below to 'send' similar records together in a new dataframe but it's not working properly, I've tried a few variations but nothing is working yet.
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
# print(dat$title[i])
for(j in 1:nrow(dat))
{
print(dat$title[i])
print(dat$title[j])
# score <- jarowinkler(dat$title[i], dat$title[j])
if(dat$title[i] != dat$title[j]
&&
compareJW(dat$title[i], dat$title[j], 0.75))
{
print("if")
# newDB <- rbind(newDB,
# dat$title[i],
# dat$title[j])
}
else
{
print("else")
# newDB <- rbind(newDB, dat$title[i])
}
}
}
(I've inserted prints in the loop 'to see what's happening')
REPRODUCIBLE DAT:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Any suggestions please?
EDIT: I'd also like a new index column called 'group' as below:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing
I'm afraid I've never tried RecordLinkage, but if you're just using the Jaro-Winkler distance it should also be fairly easy to cluster similar strings with the stringdist package. Using your dput above:
library(tidyverse)
library(stringdist)
map_dfr(dat$title, ~ {
i <- which(stringdist(., dat$title, "jw") < 0.40)
tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
distinct(index, .keep_all = TRUE) %>%
mutate(group = as.integer(group))
Explanation:
map_dfr iterates over each string in dat$title, extracts the indices of the closest matches computed by stringdist (constrained by 0.40, i.e. your "threshold"), creates a tibble with the indices and matches, then stacks these tibbles with a group variable corresponding to the integer position (and row number) of the original string. distinct then drops any cluster duplicates based on repeats of index.
Output:
# A tibble: 13 x 3
group index title
<int> <int> <chr>
1 1 1 lymphoedema clinic
2 1 8 lymphoedema early intervenstion
3 1 9 lymphoedema expo
4 1 10 lymphoedema for breast care nurses
5 2 2 zostavax shingles vaccine
6 2 5 zostavax recall toolkit
7 2 11 xray meeting and case studies
8 3 3 xray operator
9 3 6 xray meetint
10 3 12 xray online examination
11 3 13 xray operator in service paediatric extremities
12 4 4 workplace mental health wellbeing workshop
13 4 7 workplace mental health and wellbeing
An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on similar words, rather than characters as above.
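As a much simpler word-based illustration (not the tidytext/widyr approach, and not a similarity measure at all), the titles can be grouped by their first word. This base R sketch assumes that the first word is enough to identify a group, which happens to hold for this sample data but will not hold in general.

```r
title <- c("lymphoedema clinic", "zostavax shingles vaccine",
           "xray operator", "workplace mental health wellbeing workshop",
           "zostavax recall toolkit", "xray meetint",
           "workplace mental health and wellbeing",
           "lymphoedema early intervenstion", "lymphoedema expo",
           "lymphoedema for breast care nurses",
           "xray meeting and case studies", "xray online examination",
           "xray operator in service paediatric extremities")
dat <- data.frame(index = seq_along(title), title = title,
                  stringsAsFactors = FALSE)

# Everything before the first space is treated as the grouping key
first_word <- sub(" .*", "", dat$title)

# Number the groups in order of first appearance
dat$group <- match(first_word, unique(first_word))

# Sort by group while keeping the original index
dat <- dat[order(dat$group, dat$index), c("index", "group", "title")]
dat
```

On this data the grouping matches the layout requested in the question, including "xray meeting and case studies" landing in the xray group.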

R. How to add sum row in data frame

I know this question is very elementary, but I'm having trouble adding an extra row that shows a summary of the rows.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is add a 5th row that contains: name = "Total", nationality = "NA", income = total of all rows. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In the real case, my data.frame has more than a thousand rows, and I need an efficient way to add the total row.
Can someone please advise? Thank you very much!
We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500
using index.
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
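# x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))  # c() coerces every value to character, so income would become character
UPDATE: using list, which keeps the type of each column
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))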
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
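The difference between the c() and list() versions can be checked directly. This is a small sketch on a reduced two-column version of the data, where the names with_c and with_list are only illustrative.

```r
x <- data.frame(name = c("James", "Kyle"),
                income = c(5000, 4000),
                stringsAsFactors = FALSE)

# c() builds a single character vector, so the numeric column is coerced
with_c <- x
with_c[nrow(with_c) + 1, ] <- c("Total", sum(x$income))

# list() keeps one element per column, each with its own type
with_list <- x
with_list[nrow(with_list) + 1, ] <- list("Total", sum(x$income))

class(with_c$income)     # "character"
class(with_list$income)  # "numeric"
```

A character income column would break later calls such as sum() or mean(), which is why the list() form is the safer choice here.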
If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I do agree with Jaap, though, that you may not want to add this row to the end. If you need to load the data and use it for other analysis, the extra row will cause unnecessary trouble. However, you can use the following code to remove the added row before other analysis:
newdata = newdata[newdata$name != 'Total', ]
