Comparing answers and solutions in R (i.e., comparing two table) - r

I have two tables, 1. answers table and 2. solution table.
The answer table is a list of Name+Answer.
name=c("Jenns","Amy","Jake","Alison","Tommy","Jason","Alex","Vivian")
guess_answer=c("sdgf23894011","lp98ung67543","pwerugji22im","21loop98un89","9580ik8584sf","awe25f6ty788","k0o2jgpo146i","rgyhuj87630l")
answer=data.frame(cbind(name,guess_answer))
> answer
name guess_answer
1 Jenns sdgf23894011
2 Amy lp98ung67543
3 Jake pwerugji22im
4 Alison 21loop98un89
5 Tommy 9580ik8584sf
6 Jason awe25f6ty788
7 Alex k0o2jgpo146i
8 Vivian rgyhuj87630l
The solution table is lists of country with a corresponding (digit+alphabet).
corresponding_number=c("2341rg4524gr","9580ik7584sf","pp0or9rjg7n2","g0o2jgpo146i","lp98ung67543","pwerugji22im","lokibh678901")
country=c("US","UK","CN","AU","JP","KR", "NP")
counry_name=c("United State","United Kingdom","China","Australia","Japan","Korea","North Pole")
solution = cbind(country, corresponding_number,counry_name)
solution = data.frame(solution)
> solution
country corresponding_number counry_name
1 US 2341rg4524gr United State
2 UK 9580ik7584sf United Kingdom
3 CN pp0or9rjg7n2 China
4 AU g0o2jgpo146i Australia
5 JP lp98ung67543 Japan
6 KR pwerugji22im Korea
7 NP lokibh678901 North Pole
I would like to compare the answer table to the solution table, in which if the guess_number is the exact same or 1 digit/alphabet different, it is consider as correct. Then I want to create a table with the country, corresponding_number, and the counry_name.
For example:
> newtable
name corresponding_number country_name
[1,] "xxx" "sdgf23894011" "xxx"
[2,] "JP" "lp98ung67543" "Japan"
[3,] "KR" "pwerugji22im" "Korea"
[4,] "xxx" "21loop98un89" "xxx"
[5,] "UK" "9580ik8584sf" "United Kingdom"
[6,] "xxx" "awe25f6ty788" "xxx"
[7,] "AU" "k0o2jgpo146i" "Australia"
[8,] "xxx" "rgyhuj87630l" "xxx"
The name needs to be: either replaced by "xxx" if answer is wrong, or the "country abbreviations" if answer is wrong.
whether the answer is correct or wrong is based on the guess_answer; the guess_answer is correct if it is a)exactly the same as one of the corresponding_number, or b)1 digit/alphabet different.
the guess_answer will not change but the colname will become "corresponding_number"
Include a third columns showing the full country name, if the guess_answer is wrong, the responding full country name will be "xxx" as well .
edit: first condition.

Here one option is stringdist_left_join, after the join and mutate to replace the NA elements with 'xxx'
library(fuzzyjoin)
library(dplyr)
stringdist_left_join(answer, solution,
by = c("guess_answer" = "corresponding_number"))%>%
mutate(corresponding_number = case_when(is.na(corresponding_number)
~ guess_answer, TRUE ~ corresponding_number),
name = case_when(is.na(country) ~ 'xxx', TRUE ~ country),
counry_name = replace(counry_name, is.na(counry_name), 'xxx')) %>%
select(name, corresponding_number = guess_answer, counry_name)
# name corresponding_number counry_name
#1 xxx sdgf23894011 xxx
#2 JP lp98ung67543 Japan
#3 KR pwerugji22im Korea
#4 xxx 21loop98un89 xxx
#5 UK 9580ik8584sf United Kingdom
#6 xxx awe25f6ty788 xxx
#7 AU k0o2jgpo146i Australia
#8 xxx rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,
counry_name, stringsAsFactors = FALSE)

In base R, we can use adist.
#Calculate distance between guess_answer and corresponding_number
mat <- adist(answer$guess_answer, solution$corresponding_number)
#assign default value to result column
answer$country_name <- 'xxx'
#select values with distance of less than or equal to 1
mat1 <- which(mat <= 1, arr.ind = TRUE)
#Order them by row
ord <- order(mat1[, 1])
#Assign values to the column
answer$country_name[mat1[ord, 1]] <- solution$counry_name[mat1[ord, 2]]
answer
# name guess_answer country_name
#1 Jenns sdgf23894011 xxx
#2 Amy lp98ung67543 Japan
#3 Jake pwerugji22im Korea
#4 Alison 21loop98un89 xxx
#5 Tommy 9580ik8584sf United Kingdom
#6 Jason awe25f6ty788 xxx
#7 Alex k0o2jgpo146i Australia
#8 Vivian rgyhuj87630l xxx
data
answer <- data.frame(name,guess_answer, stringsAsFactors = FALSE)
solution <- data.frame(country, corresponding_number,counry_name,
stringsAsFactors = FALSE)

Related

New Column Based on Conditions

To set the scene, I have a set of data where two columns of the data have been mixed up. To give a simple example:
df1 <- data.frame(Name = c("Bob", "John", "Mark", "Will"), City=c("Apple", "Paris", "Orange", "Berlin"), Fruit=c("London", "Pear", "Madrid", "Orange"))
df2 <- data.frame(Cities = c("Paris", "London", "Berlin", "Madrid", "Moscow", "Warsaw"))
As a result, we have two small data sets:
> df1
Name City Fruit
1 Bob Apple London
2 John Paris Pear
3 Mark Orange Madrid
4 Will Berlin Orange
> df2
Cities
1 Paris
2 London
3 Berlin
4 Madrid
5 Moscow
6 Warsaw
My aim is to create a new column where the cities are in the correct place using df2. I am a bit new to R so I don't know how this would work.
I don't really know where to even start with this sort of a problem. My full dataset is much larger and it would be good to have an efficient method of unpicking this issue!
If the 'City' values are only different. We may loop over the rows, create a logical vector based on the matching values with 'Cities' from 'df2', and concatenate with the rest of the values by getting the matched values second in the order
df1[] <- t(apply(df1, 1, function(x)
{
i1 <- x %in% df2$Cities
i2 <- !i1
x1 <- x[i2]
c(x1[1], x[i1], x1[2])}))
-output
> df1
Name City Fruit
1 Bob London Apple
2 John Paris Pear
3 Mark Madrid Orange
4 Will Berlin Orange
using dplyr package this is a solution, where it looks up the two City and Fruit values in df1, and takes the one that exists in the df2 cities list.
if none of the two are a city name, an empty string is returned, you can replace that with anything you prefer.
library(dplyr)
df1$corrected_City <- case_when(df1$City %in% df2$Cities ~ df1$City,
df1$Fruit%in% df2$Cities ~ df1$Fruit,
TRUE ~ "")
output, a new column created as you wanted with the city name on that row.
> df1
Name City Fruit corrected_City
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin
Another way is:
library(dplyr)
library(tidyr)
df1 %>%
mutate(across(1:3, ~case_when(. %in% df2$Cities ~ .), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ')
Name City Fruit New_Col
1 Bob Apple London London
2 John Paris Pear Paris
3 Mark Orange Madrid Madrid
4 Will Berlin Orange Berlin

Switching values to labels in a new column

I got a column of labelled values. Let's call it country.
When I run:
attr(dat[["Country"]], "labels")
I get the next table:
USA Germany France UK Spain India Saudi Arabia
1 2 3 4 5 6 7
Now I got a new column of int values that are not labelled. Let's call it newCountry. I would like to change those int values to the label of the original Country column. In other words, I would like to go from this in an efficient way...
3
2
2
1
5
4
to this...
France
Germany
Germany
USA
Spain
UK
The problem is that the data frame has a column, Country, with the attribute "labels" set. In its turn, this attribute, which is just a vector, has the attribute "names" set. So the steps to get the "names" of the "labels" are:
Get the "labels" of column Country;
Get the "names" of the vector of labels;
Extract the names corresponding to a vector of indices, the vector i.
First read in the posted data.
nms <- scan(text = "USA Germany France UK Spain India 'Saudi Arabia'",
what = character())
i <- scan(text = "3 2 2 1 5 4")
Now create a data set example.
labs <- setNames(1:7, nms)
dat <- data.frame(Country = sample(letters, 7))
attr(dat[["Country"]], "labels") <- labs
And extract what the question asks for, following the steps above.
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
Or a one-liner:
names(attr(dat[["Country"]], "labels"))[i]
#[1] "France" "Germany" "Germany" "USA" "Spain" "UK"
To see that this does not depend on the values of the labels, create a second example.
labs2 <- setNames(101:107, nms)
attr(dat[["Country"]], "labels") <- labs2
And though the "labels" are different, the same instructions work:
attr(dat[["Country"]], "labels")
# USA Germany France UK Spain India Saudi Arabia
# 101 102 103 104 105 106 107
labsCountry <- attr(dat[["Country"]], "labels")
names(labsCountry)[i]

R - conditional pattern matching using grepl

I have two data frames, like so:
name <- c("joe", "kim", "kerry", "david")
name2 <- c("kim", "david", "joe", "kerry")
school <- c("cambridge", "south carolina", "vermont binghamton", "delaware")
school2 <- c("south carolina", "delaware", "cambridge magdalene", "vermont")
df1 <- data.frame(name, school)
df2 <- data.frame(name2, school2)
What I would like to do is the following:
Search df2$name2 for a match in df1$name.
If a match is found, compare df2$school2 to df1$school from the matching row.
If no match is found for df2$school2 in df1$school, return FALSE in column df2$perfect.match
So for example, since "joe" in df2 matches "joe" in df1, there's a match. However, since the values for "school" in both aren't the same, the would be a column in df2 with a value of FALSE in the third row. Same for 4th row in df2.
I have tried using grep and grepl. I figure grepl would be best, since it returns a logical value. What I tried was:
df2$perfect.match <- ifelse(grepl(paste(df2$name2, collapse = "|"),
df1$name, fixed = F) & grepl(paste(df2$school2, collapse = "|"), df1$school, fixed = F), "", "FALSE")
however, all I get is this:
name2 school2 perfect.match
1 kim south carolina FALSE
2 david delaware
3 joe cambridge magdalene
4 kerry vermont
When my desired result is:
df2
name2 school2 perfect.match
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
If possible, something speedy would be best. The real dataframe are quite large. Thanks.
UPDATE:
I would like to also be able to force the rows that are false to have the same value for df2$school as their corresponding name match in df1$school Like so:
name2 school2
1 kim south carolina
2 david delaware
3 joe cambridge
4 kerry vermont binghamton
You can just do...
df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)
df2
name2 school2 perfect.match
1 kim south carolina TRUE
2 david delaware TRUE
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE
We can use match and %in%. grepl wouldn't be right here since this is exact matching and not pattern matching.
df2$perfect_match <- df2$school2 %in% df1$school[match(df2$name2, df1$name)]
df2
# name2 school2 perfect_match
#1 kim south carolina TRUE
#2 david delaware TRUE
#3 joe cambridge magdalene FALSE
#4 kerry vermont FALSE
Slightly faster than pasting the columns together:
matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school
microbenchmark::microbenchmark(
v1 = {matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school
},
v2 = {df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)}
)
Using dplyr, you can do:
dfX <- df1 %>%
bind_rows(.,df2) %>%
group_by(name) %>%
distinct(school) %>%
count(name, name = "perfect.matched") %>%
left_join(df2,.,by = 'name') %>%
mutate(., perfect.matched = ifelse(perfect.matched ==1,"","FALSE"))
And to get the following output:
> dfX
name school perfect.matched
1 kim south carolina
2 david delaware
3 joe cambridge magdalene FALSE
4 kerry vermont FALSE

R. How to add sum row in data frame

I know this question is very elementary, but I'm having a trouble adding an extra row to show summary of the row.
Let's say I'm creating a data.frame using the code below:
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income)
The code above creates the data.frame below:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
What I'm trying to do is to add a 5th row and contains: name = "total", nationality = "NA", age = total of all rows. My desired output looks like this:
name nationality income
1 James American 5000
2 Kyle British 4000
3 Chris American 4500
4 Mike Japanese 3000
5 Total NA 16500
In a real case, my data.frame has more than a thousand rows, and I need efficient way to add the total row.
Can some one please advice? Thank you very much!
We can use rbind
rbind(x, data.frame(name='Total', nationality=NA, income = sum(x$income)))
# name nationality income
#1 James American 5000
#2 Kyle British 4000
#3 Chris American 4500
#4 Mike Japanese 3000
#5 Total <NA> 16500
using index.
name <- c("James","Kyle","Chris","Mike")
nationality <- c("American","British","American","Japanese")
income <- c(5000,4000,4500,3000)
x <- data.frame(name,nationality,income, stringsAsFactors=FALSE)
x[nrow(x)+1, ] <- c('Total', NA, sum(x$income))
UPDATE: using list
x[nrow(x)+1, ] <- list('Total', NA, sum(x$income))
x
# name nationality income
# 1 James American 5000
# 2 Kyle British 4000
# 3 Chris American 4500
# 4 Mike Japanese 3000
# 5 Total <NA> 16500
sapply(x, class)
# name nationality income
# "character" "character" "numeric"
If you want the exact row as you put in your post, then the following should work:
newdata = rbind(x, data.frame(name='Total', nationality='NA', income = sum(x$income)))
I though agree with Jaap that you may not want this row to add to the end. In case you need to load the data and use it for other analysis, this will add to unnecessary trouble. However, you may also use the following code to remove the added row before other analysis:
newdata = newdata[-newdata$name=='Total',]

Get count of group-level observations with multiple individual observations from dataframe in R

How do I get a dataframe like this:
soccer_player country position
"sam" USA left defender
"jon" USA right defender
"sam" USA left midfielder
"jon" USA offender
"bob" England goalie
"julie" England central midfielder
"jane" England goalie
To look like this (country with the counts of unique players per country):
country player_count
USA 2
England 3
The obvious complication is that there are multiple observations per player, so I cannot simply do table(df$country) to get the number of observations per country.
I have been playing with the table() and merge() functions but have not had any luck.
Here's one way:
as.data.frame(table(unique(d[-3])$country))
# Var1 Freq
# 1 England 3
# 2 USA 2
Drop the third column, remove any duplicate Country-Name pairs, then count the occurrences of each country.
The new features of dplyr v 3.0 provide a compact solution:
Data:
dd <- read.csv(text='
soccer_player,country,position
"sam",USA,left defender
"jon",USA,right defender
"sam",USA,left midfielder
"jon",USA,offender
"bob",England,goalie
"julie",England,central midfielder
"jane",England,goalie')
Code:
library(dplyr)
dd %>% distinct(soccer_player,country) %>%
count(country)
Without using any packages you can do:
List = by(df, df$country, function(x) length(unique(x$soccer_player)))
DataFrame = do.call(rbind, lapply(names(List), function(x)
data.frame(country=x, player_count=List[[x]])))
# country player_count
#1 England 2
#2 USA 2
It's easier with something like data.table:
dt = data.table(df)
dt[,list(player_count = length(unique(soccer_player))),by=country]
Here is an sqldf solution:
library(sqldf)
sqldf("select country, count(distinct soccer_player) player_count
from df
group by country")
## country player_count
## 1 England 2
## 2 USA 2
and here is a base R solution:
as.data.frame(xtabs(~ country, unique(df[1:2])), responseName = "player_count")
## country player_count
## 1 England 2
## 2 USA 2
One more base R option, using aggregate:
> aggregate(soccer_player ~ country, dd, FUN = function(x) length(unique(x)))
# country soccer_player
#1 England 3
#2 USA 2

Resources